Hive Macros examples

Data Type check – Check if a given column is a number

DROP TEMPORARY MACRO IF EXISTS isNumber;
CREATE TEMPORARY MACRO isNumber (input INT) CASE WHEN CAST(input AS INT) IS NULL THEN 'NO' ELSE 'YES' END;

-- Usage:
SELECT isNumber(100), isNumber("123"), isNumber("12sd");

-- Output
+------+------+------+
| _c0 | _c1 | _c2…
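The same kind of check can be tried outside Hive. Below is a minimal Python sketch, assuming "is a number" means "parses as an integer". The name `is_number` is hypothetical, and Python's `int()` may disagree with Hive's `CAST` on edge cases, so treat it as an illustration rather than an exact equivalent of the macro.

```python
def is_number(value):
    """Return 'YES' if value parses as an integer, else 'NO'."""
    try:
        int(value)
    except (TypeError, ValueError):
        # None and non-numeric strings end up here.
        return "NO"
    return "YES"


print(is_number(100), is_number("123"), is_number("12sd"))  # YES YES NO
```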

Hive useful query stuff

Generate random float numbers

SELECT ARRAY(
    CAST(ROUND(RAND() * 100, 2) AS FLOAT)
  , CAST(ROUND(RAND() * 1000, 2) AS FLOAT)
  , CAST(ROUND(RAND() * 10000, 2) AS FLOAT)
  , CAST(ROUND(RAND() * -100, 2) AS FLOAT)
  , CAST(ROUND(RAND() * -1000, 2) AS FLOAT)
  , CAST(ROUND(RAND() * -10000, 2) AS FLOAT)
)[CAST((FLOOR(RAND() * 2)) AS INT)];

-- Output
Every time you run above…
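The query builds an array of candidates and then picks one by a random index. Note that `FLOOR(RAND()*2)` can only evaluate to 0 or 1, so as written only the first two candidates are ever returned. A rough Python sketch of the same idea follows; `SCALES`, `random_float`, and the `index_range` parameter are hypothetical names introduced for illustration.

```python
import math
import random

# Multipliers taken from the query above.
SCALES = [100, 1000, 10000, -100, -1000, -10000]


def random_float(index_range=2, rng=random):
    """Build the candidate list, then pick one element by a random index.

    index_range=2 mirrors FLOOR(RAND()*2) in the query, which yields
    index 0 or 1 only. Passing len(SCALES) draws from all six candidates.
    """
    candidates = [round(rng.random() * scale, 2) for scale in SCALES]
    index = math.floor(rng.random() * index_range)
    return candidates[index]


print(random_float())                         # one of the first two candidates
print(random_float(index_range=len(SCALES)))  # any of the six candidates
```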

Getting started with Big data (Hadoop, Spark)!

A few questions may be running through your mind right now: How and where do I start in the Big Data space? What are Big Data, Hadoop, Spark, Hive, HBase, etc.? What should I learn first? Which path should I take: Data Scientist, Data Engineer, or Architect?

PIG – general stuff

Adding jar
REGISTER /local/path/to/myjar_name.jar

Set queue name
Specify below in the pig script
SET mapreduce.job.queuename 'my_queuename';
(or) specify while running the PIG script
$ pig -Dmapreduce.job.queuename=my_queuename -f my_script.pig

Set job name
Specify below in the pig script
SET mapreduce.job.name 'Testing HCatalog';
(or) specify while running the PIG script
$ pig -Dmapreduce.job.name="Testing HCatalog" -f my_script.pig
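If the Pig script is launched from another program, the command line above can be assembled as an argument list. This is only a sketch: `build_pig_command` is a hypothetical helper, and it does nothing more than place the `-D` properties shown above before `-f`.

```python
import shlex


def build_pig_command(script, queue=None, job_name=None):
    """Return the pig argument list with optional -D properties before -f."""
    args = ["pig"]
    if queue is not None:
        args.append(f"-Dmapreduce.job.queuename={queue}")
    if job_name is not None:
        args.append(f"-Dmapreduce.job.name={job_name}")
    args += ["-f", script]
    return args


cmd = build_pig_command(
    "my_script.pig", queue="my_queuename", job_name="Testing HCatalog"
)
# shlex.join quotes the job name so the printed line is safe to paste in a shell.
print(shlex.join(cmd))
```

On a machine where `pig` is on the PATH, the list could be handed to `subprocess.run(cmd)`; passing a list avoids manual quoting of the job name.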

Hive – Timezone problem

Timezone problem – Any function that triggers a MapReduce job causes this problem, because it uses the local timezone of the machine where the mapper/reducer runs. In our case, let's say our servers are in the German timezone, i.e. CET.

-- With original settings
SET system:user.country;
+-------------------------+--+
|           set           |
+-------------------------+--+
| system:user.country=GB  |
+-------------------------+--+

-- Original…
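To see why the machine's local timezone matters, here is a small Python sketch (not Hive code) that formats one and the same instant in two timezones. It assumes a fixed +01:00 offset as a stand-in for CET and ignores daylight saving; `format_epoch` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CET = timezone(timedelta(hours=1), "CET")  # fixed offset; ignores daylight saving


def format_epoch(epoch_seconds, tz):
    """Format one instant as wall-clock time in the given timezone."""
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


print(format_epoch(0, UTC))  # 1970-01-01 00:00:00
print(format_epoch(0, CET))  # 1970-01-01 01:00:00
```

The input value is identical in both calls; only the timezone used for formatting differs, which is the effect described above when a job runs on servers set to CET.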

Search for a file in HDFS using Solr Find tool

HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected actions to them. By default, it prints the list of matching HDFS file paths to stdout, one path per line. Search…
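The walk-match-act pattern described above can be sketched in Python against the local file system. This is an analogy only: it does not touch HDFS and does not use HdfsFindTool, and `find_files` is a hypothetical name.

```python
import fnmatch
import os


def find_files(roots, name_pattern):
    """Walk each root directory and yield paths whose file name matches the glob."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if fnmatch.fnmatch(filename, name_pattern):
                    yield os.path.join(dirpath, filename)


if __name__ == "__main__":
    # Default action: print matching paths to stdout, one path per line.
    for path in find_files(["."], "*.txt"):
        print(path)
```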

Solr Installation and create new collection – standalone

Note: I am running this on Windows.

Download Solr
Download Solr from here. I have downloaded solr-7.0.1: http://mirrors.whoishostingthis.com/apache/lucene/solr/7.0.1/solr-7.0.1.zip
For this example, we will extract it to the folder C:\Users\Public\hadoop_ecosystem\solr-7.0.1

Start Solr
Open a command prompt and type the commands below:
> c:
> cd C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin
C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin> solr start -p 8983

Output:
Waiting up to 30 to see…

PySpark – dev set up – Eclipse – Windows

For our example purposes, we will set-up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon

Prerequisites
Python 3.5
JRE 8
JDK 1.8
Eclipse plugins: PyDev

Steps to set up:
Download from here: https://spark.apache.org/downloads.html
1. Choose a Spark release: 2.1.0
2. Choose a package type: Pre-built for Apache Hadoop 2.6
3. Download below…