Search for a pattern in HDFS files – Python script

May 31, 2017
Problem: Search for a pattern in HDFS files and return the names of the files that contain it.

For example, below are our input files:


$vim log1.out
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [info] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

$vim log2.out
[Wed Oct 11 14:32:52 2000] [warn] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [info] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

$vim log3.out
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
[Wed Oct 11 14:32:52 2000] [warn] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

Upload these files to HDFS:


$hadoop fs -mkdir -p /user/cloudera/input1
$hadoop fs -put log1.out log2.out log3.out /user/cloudera/input1/
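
You can verify the upload by listing the directory:

$hadoop fs -ls /user/cloudera/input1/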

Our files are now in the HDFS location /user/cloudera/input1/.
We want to find the files that contain the pattern ‘error’ and return their names.
The expected output in this case is: log1.out, log3.out

Create the script below and save it as ‘stream_py.py’:


$vim stream_py.py
#!/usr/bin/env python
import os
import re
import sys

# The pattern we are searching for in each input line
search_term = "error"

# Hadoop Streaming exposes the path of the file behind the current input
# split through an environment variable: 'mapreduce_map_input_file' on
# MRv2/YARN, or 'map_input_file' on older MRv1 clusters.
try:
    input_file = os.environ['mapreduce_map_input_file']
except KeyError:
    input_file = os.environ['map_input_file']

# Emit the filename once for every line that matches the pattern
for line in sys.stdin:
    if re.search(search_term, line):
        print(input_file)
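
Before submitting the job, you can sanity-check the mapper locally: make the script executable, set the input-file environment variable by hand, and pipe one of the logs through it (the value assigned here is just a stand-in for the path Hadoop would provide). Hadoop Streaming also needs the script to be executable when it runs it on the cluster.

$chmod +x stream_py.py
$cat log1.out | mapreduce_map_input_file=log1.out ./stream_py.py
log1.out
log1.out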

Run the command below to remove the output directory if it already exists (the job fails when the output path is already present):

$hadoop fs -rm -r -skipTrash /user/cloudera/output1/

Run the Hadoop streaming job:


$export HADOOP_STREAMING_JAR="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-mr1.jar";
$hadoop jar ${HADOOP_STREAMING_JAR} \
-Dmapred.reduce.tasks=0 \
-input 'hdfs:///user/cloudera/input1/' \
-output 'hdfs:///user/cloudera/output1/'  \
-mapper /home/cloudera/stream_py.py \
-file  /home/cloudera/stream_py.py \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat  
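
Setting -Dmapred.reduce.tasks=0 makes this a map-only job, so each mapper's output is written directly to the output directory. On a YARN (MRv2) cluster the streaming jar location and the property name differ; below is a sketch of the equivalent invocation, assuming the usual CDH jar path /usr/lib/hadoop-mapreduce/hadoop-streaming.jar (the -inputformat and -outputformat flags are omitted because TextInputFormat and TextOutputFormat are the defaults):

$export HADOOP_STREAMING_JAR="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"
$hadoop jar ${HADOOP_STREAMING_JAR} \
-Dmapreduce.job.reduces=0 \
-input 'hdfs:///user/cloudera/input1/' \
-output 'hdfs:///user/cloudera/output1/' \
-mapper /home/cloudera/stream_py.py \
-file /home/cloudera/stream_py.py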

Note: hadoop-streaming-mr1.jar could be in a different location on your cluster. To find it, you can use the locate command as below:


$locate hadoop-streaming-mr1.jar

Check the output:

$hadoop fs -text  hdfs:///user/cloudera/output1/* | more
Output:
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out	
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out	
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log3.out	
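
A filename appears once for every matching line (log1.out has two ‘error’ lines), and TextOutputFormat appends a trailing tab. To reduce the result to one line per file, deduplicate the output, for example:

$hadoop fs -text hdfs:///user/cloudera/output1/* | cut -f1 | sort -u
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log1.out
hdfs://quickstart.cloudera:8020/user/cloudera/input1/log3.out

Alternatively, run the job with a single reducer (-Dmapred.reduce.tasks=1 -reducer /usr/bin/uniq); the keys arrive at the reducer sorted, so uniq collapses the duplicates.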
