Category Archives: Hive

Hive – Timezone problem

Timezone problem: any function that triggers a MapReduce job causes this problem, because it takes the local timezone of the machine where the mapper/reducer runs. In our case, let's say our servers are in the German timezone, i.e. CET.
-- With original settings
SET;
+-------------------------+--+
| set |
+-------------------------+--+
| |
+-------------------------+--+
-- Original… Read More »
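The effect described above is not specific to Hive: any process that formats a timestamp with its local timezone gives a different result depending on the machine it runs on. A minimal shell sketch, assuming GNU date and a hypothetical helper named render_in_tz, shows the same instant rendered under UTC and under a fixed CET offset. The POSIX TZ strings "UTC0" and "CET-1" are used here only so that no timezone database lookup is needed.

```shell
#!/bin/sh
# Hypothetical helper: render one instant (epoch seconds) in a given timezone.
# $1 is a POSIX TZ string, $2 is the epoch value. Assumes GNU date (-d @epoch).
render_in_tz() {
    TZ="$1" date -d "@$2" '+%Y-%m-%d %H:%M:%S'
}

epoch=0
# "UTC0" is UTC; "CET-1" is a fixed UTC+1 offset without any DST rule.
echo "UTC: $(render_in_tz UTC0 "$epoch")"    # UTC: 1970-01-01 00:00:00
echo "CET: $(render_in_tz CET-1 "$epoch")"   # CET: 1970-01-01 01:00:00
```

The two lines differ by one hour even though the input instant is identical, which is the kind of shift to look for when a query result changes depending on where the mapper/reducer ran.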

Hive – testing queries with dummy data

If your query looks like “SELECT * FROM TABLE1;” and you want to test it with your own dummy dataset in place of “TABLE1”, this technique comes in very handy, especially if you have multiple subqueries using a base table.
-- Creating single dummy row:
SELECT * FROM (
-- This is our dummy row, which is a replacement of… Read More »

Hive – Optimization

To set user timezone:
Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
SET io.sort.mb=800;
Note: io.sort.mb should be 10 * io.sort.factor
Memory
-- Shuffle memory
SET mapreduce.reduce.shuffle.memory.limit.percent=0.65;
-- Map memory
SET;
-- Reduce memory
SET… Read More »
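The note about io.sort.mb and io.sort.factor amounts to simple arithmetic. As a rough sketch, assuming a hypothetical helper named sort_mb_for_factor and an example factor of 80 (picked only because it reproduces the 800 MB value used in the excerpt), the two settings could be generated together so they stay consistent:

```shell
#!/bin/sh
# Hypothetical helper: apply the excerpt's rule of thumb,
# io.sort.mb = 10 * io.sort.factor.
sort_mb_for_factor() {
    echo $(( $1 * 10 ))
}

factor=80   # example value, not a recommendation
echo "SET io.sort.factor=${factor};"
echo "SET io.sort.mb=$(sort_mb_for_factor "$factor");"
```

The printed SET statements can be pasted at the top of a query file; whether the ratio suits a given cluster is something to verify against the available task memory.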

Hive – Best Practices

Testing with dummy data – check here.
Beeline doesn't honor tabs. If you are using an editor, you can replace tabs with spaces to maintain the structure and still use Beeline effectively.
Ex: CREATE TABLE IF NOT EXISTS default.test1 (idINT,name STRING); -- this will fail
Hive will throw an error saying “Error: Error while compiling… Read More »
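One way to apply the tab advice without touching the editor is to convert tabs to spaces just before the query reaches Beeline. This is a minimal sketch, assuming a hypothetical helper named tabs_to_spaces; the statement is fed in with printf so the example runs without a Hive installation, and the beeline call is left as a comment with a placeholder JDBC URL.

```shell
#!/bin/sh
# Hypothetical helper: replace every tab on stdin with a single space.
tabs_to_spaces() {
    tr '\t' ' '
}

# A statement that uses tabs between column names and types.
printf 'CREATE TABLE IF NOT EXISTS default.test1 (id\tINT,name\tSTRING);\n' \
    | tabs_to_spaces

# In practice the cleaned file would then be passed to Beeline, for example:
#   tabs_to_spaces < query_with_tabs.hql > query_no_tabs.hql
#   beeline -u "$JDBC_URL" -f query_no_tabs.hql   # JDBC_URL is a placeholder
```

Each tab becomes exactly one space, so column alignment may shift; the expand utility is an alternative when alignment should be preserved.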

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(
    at java.lang.String.(
    at java.lang.String.substring(
    at
    at
    at
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(
    at org.apache.hadoop.mapred.Task.updateResourceCounters(
    at org.apache.hadoop.mapred.Task.updateCounters(
    at org.apache.hadoop.mapred.Task.access(
    at org.apache.hadoop.mapred.Task$
    at

Hive UDFs – Simple and Generic UDFs

Hive UDFs: these are regular user-defined functions that operate row-wise and output one result per row, like most built-in mathematical and string functions.
Ex:
SELECT LOWER(str) FROM table_name;
SELECT CONCAT(column1,column2) AS x FROM table_name;
There are 2 ways of writing UDFs:
Simple – extend the UDF class
Generic – extend the GenericUDF class
In… Read More »

Hive Beeline cheatsheet

Beeline Shell Commands

Command                  Description                                  Example
!help                    Print a summary of command usage
!quit                    Exits the Beeline client.
!history                 Display the command history
!run <sql_query_file>    Run SQL query from file                      !run /user/dummy_local_user/myquery1.sql
set                      Prints a list of configuration variables that are overridden by the user or Hive.
set -v                   Prints all Hadoop and Hive configuration… Read More »

Hive UDF with testNG test case – concatenate two strings

Hive UDF class

package org.puneetha.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import;
import org.apache.log4j.Logger;
import org.apache.hadoop.hive.ql.exec.Description;

/***
 *
 *
 * @author Puneetha
 *
 */
@Description(name = "udf_concat"
    , value = "_FUNC_(STRING, STRING) - RETURN_TYPE(STRING)\n"
        + "Description: Concatenate two strings, separated by spaces"
    , extended = "Example:\n"
        + " > SELECT udf_concat('hello','world') FROM src;\n"
        +… Read More »

Hive Commands

Run a Hive one-shot command in the background:
$ nohup hive -f sample.hql > output1.out 2>&1 &
$ nohup hive --database "default" -e "select * from tablename;" > output1.out 2>&1 &
Replace the delimiter in Hive output with the character you wish (in this example I am replacing it with a comma):
hive --database "database_name" -f… Read More »
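The delimiter replacement can be sketched as a small filter on the command's output. This is not necessarily the command used in the truncated excerpt; it is a minimal illustration, assuming the Hive CLI separates columns with tabs and using a hypothetical helper named to_csv. The sample rows are produced with printf so the example runs without Hive.

```shell
#!/bin/sh
# Hypothetical helper: turn tab-separated output into comma-separated output.
to_csv() {
    tr '\t' ','
}

# Stand-in for the rows a hive -e / hive -f call would print.
printf '1\talice\n2\tbob\n' | to_csv

# With a real query the filter would sit at the end of the pipeline, e.g.:
#   hive --database "default" -e "select * from tablename;" | to_csv > output1.csv
```

This simple substitution does not quote fields, so values that already contain commas would need extra handling before the result can be treated as proper CSV.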