Spark – Hadoop – Programs in Scala

September 12, 2014
Let's say our input file contains the following data:
$ cat sample1.txt
1,aa
2,bb
3,cc
Upload this file to HDFS.

$ hadoop fs -put sample1.txt /user/puneetha/
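To verify the upload, you can read the file back from HDFS (same path we just uploaded to):

$ hadoop fs -cat /user/puneetha/sample1.txt
1,aa
2,bb
3,cc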

Program – 1:
Read the text file from HDFS.


$ spark-shell
scala> val myfile = sc.textFile("sample1.txt")
14/09/12 16:23:19 INFO MemoryStore: ensureFreeSpace(126588) called with curMem=0, maxMem=308713881
14/09/12 16:23:19 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 123.6 KB, free 294.3 MB)
myfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
(OR) 
scala> val myfile = sc.textFile("hdfs:///user/puneetha/sample1.txt")

scala> myfile.count()
res2: Long = 3
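Since each line of sample1.txt is a comma-separated pair, a natural follow-up is to split every line into its fields. Here is a minimal sketch (the names pairs and parts are just illustrative, and your resN numbers may differ):

scala> val pairs = myfile.map(line => { val parts = line.split(","); (parts(0), parts(1)) })
scala> pairs.collect()
res1: Array[(String, String)] = Array((1,aa), (2,bb), (3,cc))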

Note:
1. Assuming our user directory in HDFS is /user/puneetha, when we specify just a filename like sc.textFile("sample1.txt") without an HDFS path, Spark assumes the path is /user/puneetha and tries to pick up sample1.txt from there.
2. Since we are using textFile, each line corresponds to a single element in the RDD, so the count is 3, i.e. the number of lines in sample1.txt. The collect() call below shows these elements.
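To see point 2 in action, collect() brings the RDD's elements back to the driver as a local array; for our three-line file the result should look like this (again, the resN number may differ):

scala> myfile.collect()
res2: Array[String] = Array(1,aa, 2,bb, 3,cc)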

Comment below if you find this blog useful.

