Featured Article

Configure Hadoop Security with Cloudera Manager 5 or later – using Kerberos

If you are using Cloudera Manager version 5 or less, check out the other blog here. Kerberos is a network authentication protocol created by MIT that uses symmetric-key cryptography to authenticate users to network services, which means passwords are never actually sent over the network. Rather than authenticating each user to each network service separately as… Read More »

Getting started with Big data (Hadoop, Spark)!

A few questions may be running through your mind right now: How and where do I start in the Big data space? What is Big data, Hadoop, Spark, Hive, HBase, etc.? What should I learn first? Which path should I take: Data Science, Data Engineer, or Architect?

PIG – general stuff

Adding jar
REGISTER /local/path/to/myjar_name.jar

Set queue name
Specify below in the pig script
SET mapreduce.job.queuename 'my_queuename';
(or) specify while running the PIG script
$ pig -Dmapreduce.job.queuename=my_queuename -f my_script.pig

Set job name
Specify below in the pig script
SET mapreduce.job.name 'Testing HCatalog';
(or) specify while running the PIG script
$ pig -Dmapreduce.job.name="Testing HCatalog" -f my_script.pig
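The command-line variant above can also be assembled programmatically, which avoids quoting mistakes when a value such as the job name contains spaces. The following is a minimal Python sketch, not part of Pig itself: the helper name `build_pig_command` is hypothetical, and it only builds the argument list (it does not run Pig).

```python
import shlex


def build_pig_command(script_path, properties):
    """Build the argument list for `pig -Dkey=value ... -f script`.

    `properties` maps property names (for example mapreduce.job.name)
    to their values. This helper is illustrative only.
    """
    args = ["pig"]
    for key, value in properties.items():
        # Each property becomes a single -Dkey=value argument.
        args.append(f"-D{key}={value}")
    args += ["-f", script_path]
    return args


command = build_pig_command(
    "my_script.pig",
    {
        "mapreduce.job.queuename": "my_queuename",
        "mapreduce.job.name": "Testing HCatalog",
    },
)

# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a POSIX shell as-is.
print(shlex.join(command))
```

Passing the list form to something like `subprocess.run(command)` would sidestep shell quoting entirely, assuming `pig` is on the PATH.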


Hive – Timezone problem

Timezone problem: any function that triggers a MapReduce job causes this problem, since it takes the local timezone of the machine where the mapper/reducer runs. In our case, let's say our servers are in the German timezone, i.e. CET.

-- With original settings
SET system:user.country;
+-------------------------+--+
|           set           |
+-------------------------+--+
| system:user.country=GB  |
+-------------------------+--+
-- Original… Read More »
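To illustrate why the machine's local timezone matters, here is a small Python sketch (not Hive code) that formats the same instant in UTC and in CET. It assumes a fixed UTC+1 offset for CET and ignores daylight saving; the epoch value is an arbitrary example chosen so that the calendar date changes between the two zones.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: CET modelled as a fixed UTC+1 offset (no daylight saving).
CET = timezone(timedelta(hours=1), "CET")


def format_epoch(epoch_seconds, tz):
    """Render an epoch timestamp as text in the given timezone."""
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


# 84600 seconds after the epoch is 23:30 UTC on 1970-01-01.
EPOCH_SECONDS = 84600

utc_text = format_epoch(EPOCH_SECONDS, UTC)
cet_text = format_epoch(EPOCH_SECONDS, CET)

print("UTC:", utc_text)  # 1970-01-01 23:30:00
print("CET:", cet_text)  # 1970-01-02 00:30:00
```

The same instant produces a different date string depending on the zone, which is the kind of discrepancy that can appear when a conversion runs on a worker node rather than on the client.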

Search for a file in HDFS using Solr Find tool

HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected actions to them. By default, it prints the list of matching HDFS file paths to stdout, one path per line. Search… Read More »
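The walk, match, and print behaviour described above can be sketched on a local filesystem in Python. This is only an analogue for illustration: it does not call HdfsFindTool or touch HDFS, and the function names `find_files` and `demo` are hypothetical.

```python
import fnmatch
import os
import tempfile


def find_files(roots, name_pattern):
    """Walk each root directory and collect files whose name matches the pattern."""
    matches = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if fnmatch.fnmatch(filename, name_pattern):
                    matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def demo():
    """Create a throwaway directory tree and search it for *.txt files."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "logs", "2017"))
        sample_files = (
            "a.txt",
            os.path.join("logs", "b.log"),
            os.path.join("logs", "2017", "c.txt"),
        )
        for relative in sample_files:
            with open(os.path.join(root, relative), "w") as handle:
                handle.write("sample\n")
        found = find_files([root], "*.txt")
        return [os.path.relpath(path, root) for path in found]


if __name__ == "__main__":
    # Default action, as with find: print one matching path per line.
    for path in demo():
        print(path)
```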

Solr Installation and create new collection – standalone

Note: I am running this in Windows.

Download Solr
Download Solr from here
I have downloaded solr-7.0.1: http://mirrors.whoishostingthis.com/apache/lucene/solr/7.0.1/solr-7.0.1.zip
For this example, we will extract it to the folder C:\Users\Public\hadoop_ecosystem\solr-7.0.1

Start Solr
Open command prompt and type below commands
> c:
> cd C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin
C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin> solr start -p 8983
Output:
Waiting up to 30 to see… Read More »

PySpark – dev set up – Eclipse – Windows

For our example purposes, we will set up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon

Prerequisites
Python 3.5
JRE 8
JDK 1.8
Eclipse plugins: PyDev

Steps to set up:
Download from here: https://spark.apache.org/downloads.html
1. Choose a Spark release: 2.1.0
2. Choose a package type: Pre-built for Apache Hadoop 2.6
3. Download below… Read More »

Pyspark – getting started – useful stuff

Example to create dataframe

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def create_dataframe():
    """
    Example to create dataframe
    """
    headers = ("id", "name")
    data = [
        (1, "puneetha"),
        (2, "bhoomika"),
    ]
    df = spark.createDataFrame(data, headers)
    df.show(1, False)
    # Output:
    # |id |name    |
    # +---+--------+
    #… Read More »

sqoop queries – examples

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Open source Apache project that exchanges data between a database and HDFS
Can import all tables, single tables or even partial tables with free form SQL queries into HDFS
Data can be imported in… Read More »

Hive – testing queries with dummy data

If your query looks like "SELECT * FROM TABLE1;" and you want to test it with a dummy dataset in place of "TABLE1", this technique comes in very handy, especially when multiple subqueries use the same base table.

-- Creating single dummy row:
SELECT * FROM (
-- This is our dummy row, which is a replacement of… Read More »
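The general idea of swapping a base table for an inline subquery of dummy rows can be tried without a Hive cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is SQLite rather than Hive, and the query, column names, and row values are all made up for illustration.

```python
import sqlite3

# The query under test; {source} is where TABLE1 would normally appear.
QUERY_TEMPLATE = "SELECT id, name FROM {source} WHERE id > 1 ORDER BY id"

# Inline dummy rows standing in for TABLE1 (hypothetical columns and values).
DUMMY_TABLE1 = """(
    SELECT 1 AS id, 'first' AS name
    UNION ALL
    SELECT 2, 'second'
    UNION ALL
    SELECT 3, 'third'
) AS TABLE1"""


def run_with_dummy_rows():
    """Run the query against the inline dummy rows instead of a real table."""
    conn = sqlite3.connect(":memory:")
    try:
        sql = QUERY_TEMPLATE.format(source=DUMMY_TABLE1)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_with_dummy_rows():
        print(row)
```

Because the dummy subquery keeps the alias TABLE1, the rest of the query does not need to change when switching back to the real table.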

Hive – Optimization

To set user timezone:

Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
SET io.sort.mb=800;
Note: io.sort.mb should be 10 * io.sort.factor

Memory
-- Shuffle memory
SET mapreduce.reduce.shuffle.memory.limit.percent=0.65;
-- Map memory
SET mapreduce.map.java.opts=-Xmx8192m;
-- Reduce memory
SET… Read More »
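The note that io.sort.mb should be 10 * io.sort.factor can be turned into a tiny helper that derives one value from the other and prints the matching SET statements. This is a Python sketch of the post's rule of thumb only; the function names are hypothetical and the multiplier of 10 is taken from the note above, not from Hadoop documentation.

```python
def sort_mb_for_factor(sort_factor, mb_per_stream=10):
    """Apply the rule of thumb io.sort.mb = 10 * io.sort.factor."""
    if sort_factor <= 0:
        raise ValueError("sort_factor must be positive")
    return sort_factor * mb_per_stream


def render_set_statements(sort_factor):
    """Return the SET statements for a given io.sort.factor."""
    return [
        f"SET io.sort.factor={sort_factor};",
        f"SET io.sort.mb={sort_mb_for_factor(sort_factor)};",
    ]


if __name__ == "__main__":
    # A factor of 80 gives the io.sort.mb=800 value used in the example above.
    for statement in render_set_statements(80):
        print(statement)
```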