Category Archives: Hive

Hive Macros examples

Hive Macros examples Data Type check – Check if a given column is a number DROP TEMPORARY MACRO IF EXISTS isNumber; CREATE TEMPORARY MACRO isNumber (input INT) CASE WHEN CAST(input AS INT) IS NULL THEN ‘NO’ else ‘YES’ END ; — Usage: SELECT isNumber(100), isNumber(“123”), isNumber(“12sd”); — Output +——+——+——+ | _c0 | _c1 | _c2… Read More »

Hive useful query stuff

Hive useful query stuff Generate random float numbers SELECT ARRAY( CAST(ROUND(RAND()* 100, 2) AS FLOAT) , CAST(ROUND(RAND()* 1000, 2) AS FLOAT) , CAST(ROUND(RAND()* 10000, 2) AS FLOAT) , CAST(ROUND(RAND()* -100, 2) AS FLOAT) , CAST(ROUND(RAND()* -1000, 2) AS FLOAT) , CAST(ROUND(RAND()* -10000, 2) AS FLOAT) )[CAST((FLOOR(RAND()*2)) AS INT)] ; — Output Everytime you run above… Read More »

Hive – Timezone problem

Timezone problem – Any function which triggers mapreduce job, causes this problem, since it takes the local timezone of machine where it runs the mapper/reducer In our case, lets say our servers are in German timezone i.e. CET — With original settings SET; +————————-+–+ | set | +————————-+–+ | | +————————-+–+ — Original… Read More »

Hive – testing queries with dummy data

If your query looks like “SELECT * FROM TABLE1;” You want to test the input from “TABLE1” with your dummy dataset. If you have a multiple subqueries using a base table. This comes very handy. — Creating single dummy row: SELECT * FROM ( — This is our dummy row, which is a replacement of… Read More »

Hive – Optimization

To set user timezone: Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks. SET io.sort.mb=800; Note: io.sort.mb should be 10 * io.sort.factor Memory — Shuffle memory SET mapreduce.reduce.shuffle.memory.limit.percent=0.65; — Map memory SET; — Reduce memory SET… Read More »

Hive – Best Practices

Testing with Dummy data – Check here Beeline doesnt honor tabs, if you are using any editors, you can replace tabs with space to maintain the structure and still use beeline effectively. Ex: CREATE TABLE IF NOT EXISTS default.test1 (idINT,name STRING); — this will fail Hive will throw an error saying “Error: Error while compiling… Read More »

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOfRange( at java.lang.String.( at java.lang.String.substring( at at at at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList( at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree( at org.apache.hadoop.mapred.Task.updateResourceCounters( at org.apache.hadoop.mapred.Task.updateCounters( at org.apache.hadoop.mapred.Task.access( at org.apache.hadoop.mapred.Task$ at

Hive UDFs – Simple and Generic UDFs

Hive UDFs: These are regular user-defined functions that operate row-wise and output one result for one row, such as most built-in mathematics and string functions. Ex: SELECT LOWER(str) FROM table_name; SELECT CONCAT(column1,column2) AS x FROM table_name; There are 2 ways of writing the UDFs Simple – extend UDF class Generic – extend GenericUDF class In… Read More »

Hive Beeline cheatsheet

Beeline Shell Commands Command Description Example !help Print a summary of command usage !quit Exits the Beeline client. !history Display the command history !table <sql_query_file> Run SQL query from file !run /user/dummy_local_user/myquery1.sql set Prints a list of configuration variables that are overridden by the user or Hive. set -v Prints all Hadoop and Hive configuration… Read More »