Let's say our input file contains the following data:
$ cat sample1.txt
1,aa
2,bb
3,cc
Upload this file to hdfs.
$ hadoop fs -put sample1.txt /user/puneetha/
Program – 1:
Read the text file from HDFS.
$ spark-shell

scala> val myfile = sc.textFile("sample1.txt")
14/09/12 16:23:19 INFO MemoryStore: ensureFreeSpace(126588) called with curMem=0, maxMem=308713881
14/09/12 16:23:19 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 123.6 KB, free 294.3 MB)
myfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

(OR)

scala> val myfile = sc.textFile("hdfs:///user/puneetha/sample1.txt")

scala> myfile.count()
res2: Long = 3
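To eyeball the contents before counting, we can print every element. This extra step is not part of the original transcript, so the output shown is what the shell would be expected to print; collect() pulls the whole RDD to the driver, which is only safe for tiny files like this one.

scala> myfile.collect().foreach(println)
1,aa
2,bb
3,cc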
Note:
1. Assuming our user directory in HDFS is /user/puneetha: when we specify just the filename, as in sc.textFile("sample1.txt"), without an HDFS path, Spark assumes the path is /user/puneetha and tries to pick up sample1.txt from there.
2. Since we are using textFile, each line corresponds to a single element in the RDD, so the count is 3, i.e. the number of lines in sample1.txt. (See the sketch after these notes for how to work with the individual lines.)
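Since each RDD element is one comma-separated line, a natural next step is to split the lines into fields. This is a minimal sketch added for illustration; the variable names and the result shown are my assumption (expected REPL output), not from the original post:

scala> val fields = myfile.map(line => line.split(","))   // each "id,value" line becomes Array(id, value)

scala> fields.map(f => (f(0).toInt, f(1))).collect()      // parse the id as Int, keep the value as String
res3: Array[(Int, String)] = Array((1,aa), (2,bb), (3,cc))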
Comment below if you find this blog useful.