Problem: Search for a pattern in HDFS files and return the filenames that contain it.
For example, below are our input files:
$ vim log1.out
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [info] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

$ vim log2.out
[Wed Oct 11 14:32:52 2000] [warn] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [info] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

$ vim log3.out
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [warn] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Upload these files to HDFS:
$ hadoop fs -mkdir -p /user/cloudera/input1
$ hadoop fs -put log1.out log2.out log3.out /user/cloudera/input1/
Now our files are in the HDFS location /user/cloudera/input1/.
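You can verify that the files landed with a quick listing:

$ hadoop fs -ls /user/cloudera/input1/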
We want to find the files that contain the pattern ‘error’ and return their filenames.
Expected output in this case is: log1.out, log3.out
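Before touching the cluster, you can sanity-check the expected result on the local copies of the files; the streaming job below is essentially a distributed version of grep -l:

$ grep -l error log*.out
log1.out
log3.out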
Create the below script with the name ‘stream_py.py’:
$ vim stream_py.py
#!/usr/bin/env python
import os
import re
import sys

# This is the pattern we are searching for in the file
search_term = "error"

# This captures the HDFS filename of the input split being processed.
# Newer Hadoop versions expose it as 'mapreduce_map_input_file';
# older ones use 'map_input_file'.
try:
    input_file = os.environ['mapreduce_map_input_file']
except KeyError:
    input_file = os.environ['map_input_file']

# Print the filename for every input line that matches the pattern
for line in sys.stdin:
    if re.search(search_term, line):
        print(input_file)
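It is worth smoke-testing the mapper before submitting the job. Streaming pipes each input split to the script over stdin and passes the split's filename through an environment variable, so you can simulate one split locally (the HDFS path below is just the one from this example):

$ chmod +x stream_py.py
$ cat log1.out | mapreduce_map_input_file=hdfs:///user/cloudera/input1/log1.out ./stream_py.py

The filename is printed once per matching line, so log1.out appears twice here; the same duplication shows up in the job output below.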
Run the below command to remove the output directory; the job will fail if it already exists:
$ hadoop fs -rm -r -skipTrash /user/cloudera/output1/
Run the Hadoop streaming job (-Dmapred.reduce.tasks=0 makes it map-only, so the mapper output is written straight to the output directory):
$ export HADOOP_STREAMING_JAR="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-mr1.jar"
$ hadoop jar ${HADOOP_STREAMING_JAR} \
    -Dmapred.reduce.tasks=0 \
    -input 'hdfs:///user/cloudera/input1/' \
    -output 'hdfs:///user/cloudera/output1/' \
    -mapper /home/cloudera/stream_py.py \
    -file /home/cloudera/stream_py.py \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat
Note: hadoop-streaming-mr1.jar may be in a different location on your cluster. To find it, you can use the locate command as below:
$ locate hadoop-streaming-mr1.jar
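If you want each filename to appear only once in the output, one option is to add a single reducer that collapses duplicates. The shuffle sorts the mapper output before it reaches the reducer, so identical filenames arrive adjacent and plain uniq can serve as the reducer. This is a sketch, assuming /usr/bin/uniq exists on the task nodes:

$ hadoop jar ${HADOOP_STREAMING_JAR} \
    -Dmapred.reduce.tasks=1 \
    -input 'hdfs:///user/cloudera/input1/' \
    -output 'hdfs:///user/cloudera/output1/' \
    -mapper /home/cloudera/stream_py.py \
    -reducer /usr/bin/uniq \
    -file /home/cloudera/stream_py.py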
Check the output:
$ hadoop fs -text hdfs:///user/cloudera/output1/* | more

Output:
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log3.out
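log1.out is listed twice because it contains two ‘error’ lines and the map-only job has no reducer to de-duplicate them. If you did not add a reducer, you can collapse the duplicates on the client side:

$ hadoop fs -text hdfs:///user/cloudera/output1/* | sort -u
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log3.out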