Apache Spark: SparkFiles.get("fileName.txt") - Unable to retrieve the file contents from SparkContext

I added a file to SparkContext using SparkContext.addFile("hdfs://host:54310/spark/fileName.txt"). I verified its presence using org.apache.spark.SparkFiles.get("fileName.txt"), which showed an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.
Now I want to read that file from the absolute path above using SparkContext. I tried
sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
but it treats the path returned by SparkFiles.get() as an HDFS path, which is incorrect.
I searched extensively for any helpful reads on this, but had no luck.
Is there anything wrong with this approach? Any help is really appreciated.
Here is the code and the outcome:
scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")
scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
... 49 elided
Can you add a code sample and error logs here?
– vaquar khan
Jun 30 at 14:47
@vaquarkhan Added.
– Marco99
Jun 30 at 15:51
1 Answer
Refer to the local file using the "file://" scheme. Without a scheme, sc.textFile resolves the path against the default filesystem (HDFS in your setup), which is why you get the InvalidInputException. With the prefix, the file is read from the local filesystem:
sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)
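Note that SparkFiles.get resolves the path on the machine where it is called, so the line above works when the driver and executors see the file at that same path (e.g. in local mode). If you only need the contents on the driver, a plain local read avoids sc.textFile entirely; a minimal sketch, assuming the same fileName.txt added above:
import scala.io.Source

// Read the driver-local copy that addFile downloaded; close the handle afterwards.
val src = Source.fromFile(org.apache.spark.SparkFiles.get("fileName.txt"))
try src.getLines().foreach(println) finally src.close()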
Could you please explain how that file is related to the local file system? I added the file to SparkContext.
– Marco99
Jun 30 at 15:50
When you add a file using SparkContext, Spark ships the file to the driver and the workers. The SparkFiles.get call gives you the absolute path of the file, which is local to the worker or driver; it is not uploaded to HDFS. (See the sketch after this thread for reading it from within a task.)
– Sudev Ambadi
Jun 30 at 16:33
That was helpful, thank you!!
– Marco99
Jun 30 at 16:49
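Following up on the comment above: since each executor keeps its own local copy of the added file, tasks can resolve and read it themselves. A minimal sketch, assuming the file fits in memory and reusing the fileName.txt from the question:
import org.apache.spark.SparkFiles
import scala.io.Source

// Each task resolves the path on the executor it runs on and reads that local copy.
val lines = sc.parallelize(Seq(0), 1).flatMap { _ =>
  val src = Source.fromFile(SparkFiles.get("fileName.txt"))
  try src.getLines().toList finally src.close()
}
lines.collect().foreach(println)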