Tag Archives: hadoop

Custom partitioner in MapReduce – using new Hadoop API 2

This is an example of a custom partitioner for the classic wordcount program. Driver class: we partition keys by their first letter, so we will have 27 partitions: one for each of the 26 letters plus one for all other characters. Below are the additions to the Driver class. job.setNumReduceTasks(27); job.setPartitionerClass(WordcountPartitioner.class); package org.puneetha.customPartitioner; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import… Read More »
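For reference, here is a minimal sketch of what such a first-letter partitioner could look like. This is an illustrative reconstruction, not necessarily the post's exact code; only the class and package names are taken from the excerpt above.

package org.puneetha.customPartitioner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/*
 * Sketch of a first-letter partitioner: keys starting with a-z go to
 * partitions 0-25, everything else goes to partition 26.
 */
public class WordcountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString().toLowerCase();
        if (!word.isEmpty()) {
            char first = word.charAt(0);
            if (first >= 'a' && first <= 'z') {
                return first - 'a'; // partitions 0-25 for a-z
            }
        }
        return 26; // all other characters go to the last partition
    }
}

Note that getPartition() must return a value smaller than the number of reduce tasks set in the driver, which is why the driver and the partitioner have to agree on 27.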

Pattern matching for files within a MapReduce program – given HDFS path – using new API 2

Driver Class: package org.puneetha.patternMatching; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.util.GenericOptionsParser; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class WordcountDriver extends Configured implements Tool { public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); /* * … Other Driver class code …… Read More »
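As a rough illustration of the idea, here is a sketch of how the driver might pick input files by regex before job submission. The helper class InputPathMatcher and the method name are hypothetical; only the imports come from the excerpt above.

package org.puneetha.patternMatching;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathMatcher {

    /*
     * Adds every file under inputDir whose name matches the given regex
     * as an input path of the job. Intended to be called from the
     * driver's run() method before the job is submitted.
     */
    public static void addMatchingInputs(Job job, Path inputDir, String regex)
            throws Exception {
        FileSystem fs = FileSystem.get(job.getConfiguration());
        Pattern pattern = Pattern.compile(regex);
        for (FileStatus status : fs.listStatus(inputDir)) {
            Matcher matcher = pattern.matcher(status.getPath().getName());
            if (status.isFile() && matcher.matches()) {
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }
    }
}

From run() this could be called as, for example, addMatchingInputs(job, new Path(args[0]), "^part-.*\\.txt$"), where both the path and the regex are placeholders.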

Rename reducer output part file – using MapReduce code (with new Hadoop API 2)

Below is the code to rename the reducer output part files from "part-*" to "customName-*". I am using the classic wordcount example (you can check out the basic implementation here). Driver class additions: LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); – to avoid the creation of empty default part files. MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, IntWritable.class); – for adding a new name… Read More »
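To show how the named output registered above is typically consumed, here is a minimal sketch of the reducer side. This is assumed, not the post's exact code; the package and class names are placeholders, while the named output "text" and the base name "customName" follow the excerpt.

package org.puneetha.customName; // placeholder package name

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/*
 * Sketch of a wordcount reducer that writes through MultipleOutputs so the
 * output files are named "customName-r-00000" etc. instead of "part-r-00000".
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // "text" is the named output registered in the driver;
        // "customName" becomes the base name of the output part files.
        multipleOutputs.write("text", key, new IntWritable(sum), "customName");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}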

Wordcount MapReduce program – using the new Hadoop API 2

Below is the classic wordcount example, using the new API. If you are using Maven, you can use the pom.xml given here. Change it according to the Hadoop distribution/version you are using. Input text: $vim input.txt cat dog apple cat horse orange apple $hadoop fs -mkdir -p /user/dummyuser/wordcount/input $hadoop fs -put input.txt /user/dummyuser/wordcount/input/ Driver Class: package… Read More »
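For context, here is a minimal sketch of the mapper for this wordcount using the new org.apache.hadoop.mapreduce API. The package and class names are placeholders; the driver and reducer are in the full post.

package org.puneetha.wordcount; // placeholder package name

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Sketch of the classic wordcount mapper using the new (mapreduce) API:
 * emits (word, 1) for every token in the input line.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}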

Build Cloudera parcels offline repo – without internet connection

We create a directory where we want to save our parcels. # mkdir -p /share/cdh_repo # cd /share/cdh_repo (Note: this can be any directory) 1) Download the CDH parcel to the repo directory: #pwd /share/cdh_repo #wget http://archive-primary.cloudera.com/cdh5/parcels/latest/CDH-5.1.0-1.cdh5.1.0.p0.53-el6.parcel 2) Download the manifest.json file from the URL: #pwd /share/cdh_repo #wget http://archive-primary.cloudera.com/cdh5/parcels/latest/manifest.json 3) Copy the hash code: #pwd /share/cdh_repo Open… Read More »

Hadoop / HDFS Commands

A few useful Hadoop commands. Uncompress a gz file from HDFS to HDFS – Hadoop: $hadoop fs -text /hdfs_path/compressed_file.gz | hadoop fs -put - /hdfs_path/uncompressed-file.txt To uncompress while copying from local to HDFS directly: $gunzip -c filename.txt.gz | hadoop fs -put - /user/dc-user/filename.txt Hadoop commands for reporting purposes: $hdfs fsck /hdfs_path $hdfs fsck /hdfs_path -files -locations $hadoop… Read More »
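For completeness, a hypothetical Java equivalent of the first pipeline above, using the FileSystem API and CompressionCodecFactory to stream-decompress a .gz file from HDFS to HDFS. The class name HdfsGunzip and the argument handling are assumptions, not something from the post.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

/*
 * Hypothetical Java equivalent of "hadoop fs -text file.gz | hadoop fs -put -":
 * stream-decompress an HDFS .gz file into an uncompressed HDFS file.
 */
public class HdfsGunzip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path(args[0]); // e.g. /hdfs_path/compressed_file.gz
        Path dst = new Path(args[1]); // e.g. /hdfs_path/uncompressed-file.txt

        // Pick the codec from the file extension (.gz -> GzipCodec).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(src);
        if (codec == null) {
            throw new IllegalArgumentException("No compression codec found for " + src);
        }

        try (InputStream in = codec.createInputStream(fs.open(src));
                OutputStream out = fs.create(dst)) {
            IOUtils.copyBytes(in, out, conf);
        }
    }
}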

Apache Hadoop Cluster Setup script – Ubuntu

We will see, step by step, how to write an installation script for a "Single-node Hadoop Cluster Setup" on Ubuntu. Assumptions: This script is written for Ubuntu; you can adapt it for other operating systems, as the skeleton remains the same. The user has the tar file on their system (a file check is done inside the script). There is … Read More »