Hive – Best Practices

Test with dummy data.
Beeline doesn't honor tabs. If you are using an editor, replace tabs with spaces to maintain the structure and still use Beeline effectively. Ex: CREATE TABLE IF NOT EXISTS default.test1 (idINT,name STRING); this will fail. Hive will throw an error saying “Error: Error while compiling…
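The tab-to-space replacement suggested above can be sketched in a few lines of Python. This is only an illustration: the helper name tabs_to_spaces is hypothetical, and the example assumes the failing statement originally had a tab between the column name and its type, which the excerpt does not show explicitly.

```python
def tabs_to_spaces(sql_text: str, width: int = 1) -> str:
    """Replace every tab with `width` spaces so tokens stay separated."""
    return sql_text.replace("\t", " " * width)


# Assumption: the editor inserted a tab between `id` and `INT`.
statement = "CREATE TABLE IF NOT EXISTS default.test1 (id\tINT,name STRING);"
cleaned = tabs_to_spaces(statement)
print(cleaned)
```

The cleaned text can then be pasted into Beeline or saved to a script file; a wider `width` keeps column alignment closer to what the editor showed.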

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.substring(String.java:1933)
    at java.io.File.getName(File.java:456)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:243)
    at java.io.File.isDirectory(File.java:849)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(ProcfsBasedProcessTree.java:511)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:210)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:894)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:1045)
    at org.apache.hadoop.mapred.Task.access(Task.java:82)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:782)
    at java.lang.Thread.run(Thread.java:745)

Tracking YARN logs

Create a script to get YARN logs:

    $ vim hadoop_logs.sh

    #!/bin/bash
    # Positional arguments: application ID, then (optionally) container ID and node address
    APPLICATION_ID=$1
    CONTAINER_ID=$2
    NODE_ADDRESS=$3

    if [ $# -eq 1 ]; then
        # Fetch all logs for the application
        yarn logs -applicationId "${APPLICATION_ID}"
    elif [ $# -eq 3 ]; then
        # Fetch logs for a single container on a given node
        yarn logs -applicationId "${APPLICATION_ID}" -containerId "${CONTAINER_ID}" -nodeAddress "${NODE_ADDRESS}"
    else
        echo "you must specify 1 or 3 arguments"
    fi

Create a symlink:

    $ ln…

Search for a pattern in HDFS files – python script

Problem: Search for a pattern in HDFS files and return the names of the files that contain it. For example, below are our input files:

    $ vim log1.out
    [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
    [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
    [Wed Oct 11…
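One possible shape for such a script is sketched below; it is not the post's own solution, which is cut off in this excerpt. The function names files_matching and read_hdfs_file are hypothetical, the second sample file log2.out and its content are made up for illustration, and reading from HDFS assumes the hdfs command-line client is on the PATH.

```python
import re
import subprocess
from typing import Iterable, List, Tuple


def files_matching(pattern: str, files: Iterable[Tuple[str, Iterable[str]]]) -> List[str]:
    """Return the names of files that have at least one line matching `pattern`."""
    regex = re.compile(pattern)
    return [name for name, lines in files if any(regex.search(line) for line in lines)]


def read_hdfs_file(path: str) -> List[str]:
    """Read one HDFS file via `hdfs dfs -cat` (assumes the hdfs CLI is installed)."""
    result = subprocess.run(
        ["hdfs", "dfs", "-cat", path], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Local demonstration with in-memory lines instead of a real cluster.
sample = [
    ("log1.out", ["[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] "
                  "client denied by server configuration: /export/home/live/ap/htdocs/test"]),
    ("log2.out", ["[Wed Oct 11 14:32:52 2000] [notice] hypothetical line"]),
]
matches = files_matching(r"\[error\]", sample)
print(matches)
```

Against a real cluster, the same search could be run as files_matching(pattern, ((p, read_hdfs_file(p)) for p in paths)), where paths is a list of HDFS file paths you supply. Reading each file fully into memory keeps the sketch short but may not suit very large files.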

Spark quick commands – Scala

Save a file to HDFS with a custom delimiter in Spark:

    import spark.sql

    val df = sql(""" select * from test_db.test_table1 """)

    // Write pipe-delimited CSV, partitioned by year and month, replacing any existing output
    df.write
      .format("csv")
      .partitionBy("year", "month")
      .mode("overwrite")
      .option("delimiter", "|")
      .save("/user/cloudera/project/workspace/test/test_table1")

Hive UDFs – Simple and Generic UDFs

Hive UDFs: These are regular user-defined functions that operate row-wise and output one result per row, like most built-in mathematical and string functions. Ex:

    SELECT LOWER(str) FROM table_name;
    SELECT CONCAT(column1,column2) AS x FROM table_name;

There are 2 ways of writing UDFs:
Simple – extend the UDF class
Generic – extend the GenericUDF class

In…

Hive Beeline cheatsheet

Beeline Shell Commands

Command                  Description                                Example
!help                    Print a summary of command usage
!quit                    Exits the Beeline client.
!history                 Display the command history
!run <sql_query_file>    Run SQL query from file                    !run /user/dummy_local_user/myquery1.sql
set                      Prints a list of configuration variables that are overridden by the user or Hive.
set -v                   Prints all Hadoop and Hive configuration…

PIG UDF with testNG test case – concatenate two strings

PIG UDF class

    package org.puneetha.pig.udf;

    import java.io.IOException;

    import org.apache.log4j.Logger;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    /**
     * @author Puneetha
     */
    public final class ConcatStrPig extends EvalFunc<String> {
        private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());

        @Override
        public String exec(final Tuple input) throws IOException {
            logger.debug("Tuple=" + input.toString());
            String separator = " ";
            StringBuilder result = new…

Category: Pig