Hive UDFs:
These are regular user-defined functions that operate on a single row and produce a single output value per row, like most of the built-in mathematical and string functions.
Ex:
SELECT LOWER(str) FROM table_name;
SELECT CONCAT(column1, column2) AS x FROM table_name;
There are two ways of writing UDFs:
- Simple – extend UDF class
- Generic – extend GenericUDF class
In this post, we will create a UDF that concatenates strings and implement it both ways: by extending the simple UDF class and by extending the GenericUDF class.
1. Simple – extend <org.apache.hadoop.hive.ql.exec.UDF> class
To write a Simple UDF, the following two steps are necessary:
1. Extend the org.apache.hadoop.hive.ql.exec.UDF class
2. Write an “evaluate” method.
package org.puneetha.hive.udf.udfconcat;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/***
 * @author Puneetha
 */
@Description(name = "udf_concat",
        value = "_FUNC_(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN",
        extended = "Returns NULL if any argument is NULL.\n"
                + "Example:\n"
                + " > SELECT _FUNC_('abc', 'def') FROM src LIMIT 1;\n"
                + " 'abcdef'\n"
                + " > SELECT _FUNC_(1, 2) FROM src LIMIT 1;\n"
                + " '12'")
public class UDFConcat extends UDF {

    private Text text = new Text();

    public UDFConcat() {
    }

    public Text evaluate(Text... args) {
        text.clear();
        for (Text arg : args) {
            if (arg == null) {
                return null;
            }
            text.append(arg.getBytes(), 0, arg.getLength());
        }
        return text;
    }

    public Text evaluate(IntWritable... args) {
        text.clear();
        for (IntWritable arg : args) {
            if (arg == null) {
                return null;
            }
            String str = arg.toString();
            text.append(str.getBytes(), 0, str.length());
        }
        return text;
    }
}
Test Case:
package org.puneetha.hive.udf.udfconcat;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

/***
 * @author Puneetha
 */
public class UDFConcatTest {

    /* Test 2 arguments as input - for string input */
    @DataProvider(name = "dataProvider1")
    public static String[][] inputData1() {
        String[][] testStrSet = {
                // Success
                {"hello", "world", "helloworld"}
                , {"welcome", " to the program", "welcome to the program"}
                , {"", "", ""}
        };
        return testStrSet;
    }

    @Test(dataProvider = "dataProvider1")
    public void testEvaluate1(String param1, String param2, String expectedResultStr) throws Exception {
        try {
            UDFConcat udfConcat = new UDFConcat();
            Assert.assertEquals(new Text(expectedResultStr),
                    udfConcat.evaluate(new Text(param1), new Text(param2)));
        } catch (Exception e) {
            e.printStackTrace();
            Assert.fail();
        }
    }

    /* Test 3 arguments as input - for string input */
    @DataProvider(name = "dataProvider2")
    public static String[][] inputData2() {
        String[][] testStrSet = {
                // Success
                {"how", " are", " you", "how are you"}
                , {"its", " nice", " out there!", "its nice out there!"}
                , {"", "", "", ""}
        };
        return testStrSet;
    }

    @Test(dataProvider = "dataProvider2")
    public void testEvaluate2(String param1, String param2, String param3, String expectedResultStr) throws Exception {
        try {
            UDFConcat udfConcat = new UDFConcat();
            Assert.assertEquals(new Text(expectedResultStr),
                    udfConcat.evaluate(new Text(param1), new Text(param2), new Text(param3)));
        } catch (Exception e) {
            e.printStackTrace();
            Assert.fail();
        }
    }

    /* Test 2 arguments as input - for integer input */
    @DataProvider(name = "dataProvider3")
    public static Integer[][] inputData3() {
        Integer[][] testIntSet = {
                // Success
                {1, 2, 12}
        };
        return testIntSet;
    }

    @Test(dataProvider = "dataProvider3")
    public void testEvaluate3(int param1, int param2, int expectedResultStr) throws Exception {
        try {
            UDFConcat udfConcat = new UDFConcat();
            Assert.assertEquals(new Text(String.valueOf(expectedResultStr)),
                    udfConcat.evaluate(new IntWritable(param1), new IntWritable(param2)));
        } catch (Exception e) {
            e.printStackTrace();
            Assert.fail();
        }
    }
}
2. Generic – extend <org.apache.hadoop.hive.ql.udf.generic.GenericUDF> class
To write a Generic UDF, the following steps are necessary (a minimal sketch follows the list):
- Extend the org.apache.hadoop.hive.ql.udf.generic.GenericUDF class
- Write the “initialize” method. This will be called once and only once per GenericUDF instance.
- Write an “evaluate” method.
- Override the "getDisplayString" method. It returns the string shown for the function in EXPLAIN output.
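The same concat UDF written the Generic way could look like the following minimal sketch (this is not code from the original post; the class name GenericUDFConcat and the display name generic_udf_concat are placeholders). It follows the steps above: initialize validates the argument types and fixes the return type, evaluate does the concatenation, and getDisplayString provides the text shown by EXPLAIN.

package org.puneetha.hive.udf.udfconcat;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

@Description(name = "generic_udf_concat",
        value = "_FUNC_(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN",
        extended = "Returns NULL if any argument is NULL.")
public class GenericUDFConcat extends GenericUDF {

    // One converter per argument, set up once in initialize()
    private transient ObjectInspectorConverters.Converter[] converters;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1) {
            throw new UDFArgumentException("_FUNC_ expects at least one argument");
        }
        converters = new ObjectInspectorConverters.Converter[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            if (arguments[i].getCategory() != ObjectInspector.Category.PRIMITIVE) {
                throw new UDFArgumentException("_FUNC_ only accepts primitive arguments");
            }
            // Convert each argument (string, int, ...) to a writable string
            converters[i] = ObjectInspectorConverters.getConverter(arguments[i],
                    PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        }
        // The function always returns a string
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Text result = new Text();
        for (int i = 0; i < arguments.length; i++) {
            Object value = arguments[i].get();
            if (value == null) {
                return null; // NULL in, NULL out
            }
            Text converted = (Text) converters[i].convert(value);
            result.append(converted.getBytes(), 0, converted.getLength());
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        // String shown for this function call in EXPLAIN output
        StringBuilder sb = new StringBuilder("generic_udf_concat(");
        for (int i = 0; i < children.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(children[i]);
        }
        return sb.append(")").toString();
    }
}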
pom.xml (this file is common for both the Simple and the Generic UDF):
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>custom</groupId>
  <artifactId>org.puneetha.hive.udf</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>hive_udf</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.finalname>hive_udf_v1</project.finalname>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <cdh.version>cdh5.5.2</cdh.version>
    <hadoop.version>2.6.0-${cdh.version}</hadoop.version>
    <hive.version>1.1.0-${cdh.version}</hive.version>
    <pig.version>0.12.0-${cdh.version}</pig.version>
    <log4j.version>1.2.17</log4j.version>
    <maven_jar_plugin.version>2.5</maven_jar_plugin.version>
    <codehaus.version>1.2.1</codehaus.version>
    <mockito.version>1.10.19</mockito.version>
    <testng.version>6.9.10</testng.version>
    <java.home>C:\Program Files\Java\jdk1.8.0_121</java.home>
  </properties>

  <dependencies>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>${log4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
      <version>${mockito.version}</version>
    </dependency>
    <dependency>
      <groupId>org.testng</groupId>
      <artifactId>testng</artifactId>
      <version>${testng.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>${pig.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-clean-plugin</artifactId>
      <version>${maven_jar_plugin.version}</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.8.0_121</version>
      <scope>system</scope>
      <systemPath>${java.home}/lib/tools.jar</systemPath>
    </dependency>
  </dependencies>

  <build>
    <finalName>${project.finalname}</finalName>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>${codehaus.version}</version>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>${maven_jar_plugin.version}</version>
      </plugin>
    </plugins>
  </build>

  <repositories>
    <repository>
      <id>cloudera-repo</id>
      <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
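Once the project is built (for example with mvn clean package; with the finalName above the jar should end up under target/ as hive_udf_v1.jar, though the exact name and path depend on your build), the UDF can be registered and called from Hive roughly as follows. The jar path and function names below are placeholders:

ADD JAR /path/to/hive_udf_v1.jar;
CREATE TEMPORARY FUNCTION udf_concat AS 'org.puneetha.hive.udf.udfconcat.UDFConcat';
SELECT udf_concat(column1, column2) AS x FROM table_name;
SELECT udf_concat('abc', 'def') FROM table_name;  -- returns 'abcdef'
-- the Generic version is registered the same way, using its own class name:
-- CREATE TEMPORARY FUNCTION generic_udf_concat AS 'org.puneetha.hive.udf.udfconcat.GenericUDFConcat';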