PySpark – dev set up – Eclipse – Windows

For this example, we will set up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon.
Prerequisites:
Python 3.5
JRE 8
JDK 1.8
Eclipse plugins: PyDev
Steps to set up:
Download from here:
1. Choose a Spark release: 2.1.0
2. Choose a package type: Pre-built for Apache Hadoop 2.6
3. Download below…

PySpark – getting started – useful stuff

Example to create a dataframe:
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    def create_dataframe():
        """ Example to create dataframe """
        headers = ("id", "name")
        data = [(1, "puneetha"), (2, "bhoomika")]
        df = spark.createDataFrame(data, headers)
        # Assumed reconstruction of a garbled line: show rows without truncating values
        df.show(truncate=False)
        # Output:
        # |id |name    |
        # +---+--------+
        # …

Sqoop queries – examples

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Open source Apache project that exchanges data between a database and HDFS
Can import all tables, single tables or even partial tables with free form SQL queries into HDFS
Data can be imported in…

Hive – testing queries with dummy data

If your query looks like "SELECT * FROM TABLE1;" and you want to test it with your own dummy dataset in place of "TABLE1", this comes in very handy, especially if you have multiple subqueries using a base table.
-- Creating single dummy row:
SELECT * FROM (
-- This is our dummy row, which is a replacement of…
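As a rough illustration of the idea, the sketch below swaps the base table for an inline subquery of dummy rows. It uses Python's built-in sqlite3 module rather than Hive so that it runs anywhere. The query, the column names and the row values are all made up for illustration, and Hive syntax may differ in details.

```python
import sqlite3

# The query under test, written against a placeholder for the base table.
# (Hypothetical query; the real one would be whatever you run against TABLE1.)
QUERY_TEMPLATE = "SELECT id, name FROM {source} AS t WHERE id > 1"

# Inline subquery of dummy rows that stands in for TABLE1 (values are made up).
DUMMY_TABLE1 = """(
    SELECT 1 AS id, 'alice' AS name
    UNION ALL
    SELECT 2 AS id, 'bob' AS name
)"""

conn = sqlite3.connect(":memory:")
# Substitute the dummy subquery for the table name and run the query.
rows = conn.execute(QUERY_TEMPLATE.format(source=DUMMY_TABLE1)).fetchall()
conn.close()

print(rows)
```

Because the dummy rows live inside the query itself, no test table has to be created or cleaned up afterwards.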

Hive – Optimization

To set user timezone:
Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
SET io.sort.mb=800;
Note: io.sort.mb should be 10 * io.sort.factor
Memory
-- Shuffle memory
SET mapreduce.reduce.shuffle.memory.limit.percent=0.65;
-- Map memory
SET;
-- Reduce memory
SET…

Hive – Best Practices

Testing with dummy data – Check here
Beeline doesn't honor tabs. If you are using an editor, you can replace tabs with spaces to maintain the structure and still use Beeline effectively.
Ex:
CREATE TABLE IF NOT EXISTS default.test1 (idINT,name STRING); -- this will fail
Hive will throw an error saying "Error: Error while compiling…

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(
    at java.lang.String.(
    at java.lang.String.substring(
    at
    at
    at
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(
    at org.apache.hadoop.mapred.Task.updateResourceCounters(
    at org.apache.hadoop.mapred.Task.updateCounters(
    at org.apache.hadoop.mapred.Task.access(
    at org.apache.hadoop.mapred.Task$
    at

Tracking YARN logs

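Create a script to get YARN logs:
$ vim
#!/bin/bash
# Positional arguments (assumed; the assignments were empty in the source)
APPLICATION_ID=$1
CONTAINER_ID=$2
NODE_ADDRESS=$3
if [ $# -eq 1 ]; then
    # Only the application ID was given: fetch all logs for the application
    yarn logs -applicationId ${APPLICATION_ID}
elif [ $# -eq 3 ]; then
    # Narrow the logs down to one container on one node
    yarn logs -applicationId ${APPLICATION_ID} -containerId ${CONTAINER_ID} -nodeAddress ${NODE_ADDRESS}
else
    echo "you must specify 1 or 3 arguments"
fi
Create a symlink:
$ ln…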
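The argument handling in the script above can also be sketched in Python, which makes the one-or-three-argument rule easy to check without a cluster. The function name build_yarn_logs_command and the sample application ID are hypothetical; only the yarn logs flags are taken from the script.

```python
from typing import List, Sequence


def build_yarn_logs_command(args: Sequence[str]) -> List[str]:
    """Build the `yarn logs` command line for 1 or 3 arguments."""
    if len(args) == 1:
        # Only the application ID: logs for the whole application.
        return ["yarn", "logs", "-applicationId", args[0]]
    if len(args) == 3:
        # Application ID, container ID and node address: logs for one container.
        application_id, container_id, node_address = args
        return [
            "yarn", "logs",
            "-applicationId", application_id,
            "-containerId", container_id,
            "-nodeAddress", node_address,
        ]
    raise ValueError("you must specify 1 or 3 arguments")


# Hypothetical application ID, used only to show the resulting command.
print(" ".join(build_yarn_logs_command(["application_0000000000000_0001"])))
```

The returned list can be passed to subprocess.run on a machine where the yarn client is installed.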

Search for a pattern in HDFS files – python script

Problem: Search for a pattern in HDFS files and return the names of the files that contain this pattern.
For example, below are our input files:
$ vim log1.out
[Wed Oct 11 14:32:52 2000] [error] [client] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [error] [client] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11…
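A minimal sketch of the matching step is shown below. It is not the script from the post: it assumes the file contents are already available as lines in memory, keyed by file name, whereas with HDFS the lines would have to be read from the cluster first (for instance from the output of hdfs dfs -cat). The function name files_matching and the second sample file are made up for illustration.

```python
import re
from typing import Dict, Iterable, List


def files_matching(pattern: str, files: Dict[str, Iterable[str]]) -> List[str]:
    """Return the names of files that have at least one line matching pattern."""
    regex = re.compile(pattern)
    matches = []
    for name, lines in files.items():
        # any() stops at the first matching line, so the rest is not scanned.
        if any(regex.search(line) for line in lines):
            matches.append(name)
    return matches


# Sample data modelled on the log lines above; log2.out is hypothetical.
sample_files = {
    "log1.out": [
        "[Wed Oct 11 14:32:52 2000] [error] [client] client denied by "
        "server configuration: /export/home/live/ap/htdocs/test",
    ],
    "log2.out": [
        "[Wed Oct 11 14:32:53 2000] [notice] server started",
    ],
}

print(files_matching(r"\[error\]", sample_files))  # prints ['log1.out']
```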