Apache Spark: SparkFiles.get("fileName.txt") - Unable to retrieve the file contents from SparkContext

I added a file to SparkContext using SparkContext.addFile("hdfs://host:54310/spark/fileName.txt"). I verified its presence using org.apache.spark.SparkFiles.get("fileName.txt"), which showed an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.
Now I want to read that file from the absolute path above using SparkContext. I tried
sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
but it treats the path returned by SparkFiles.get() as an HDFS path, which is incorrect.
I searched extensively for any helpful reads on this, but had no luck.
Is there anything wrong with this approach? Any help is really appreciated.
Here is the code and the outcome:
scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")
scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
... 49 elided
Can you add a code sample and error logs here?
– vaquar khan
Jun 30 at 14:47
@vaquarkhan Added.
– Marco99
Jun 30 at 15:51
1 Answer
Refer to the local file using the "file://" scheme. Without a scheme, sc.textFile resolves the path against the default filesystem (HDFS in your setup), which is why you get the InvalidInputException. With the prefix, the file is read from the local filesystem:
sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)
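Note that SparkFiles.get resolves the path on the machine where it is called, so the line above works when the driver and executors see the file at that same path (e.g. in local mode). If you only need the contents on the driver, a plain local read avoids sc.textFile entirely; a minimal sketch, assuming the same fileName.txt added above:
import scala.io.Source

// Read the driver-local copy that addFile downloaded; close the handle afterwards.
val src = Source.fromFile(org.apache.spark.SparkFiles.get("fileName.txt"))
try src.getLines().foreach(println) finally src.close()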
Could you please explain how that file is related to the local file system? I added the file to SparkContext.
– Marco99
Jun 30 at 15:50
When you add a file using SparkContext, Spark ships the file to the driver and the workers. The SparkFiles.get call gives you the absolute path of the file, which is local to the worker or driver; it is not uploaded to HDFS. (See the sketch after this thread for reading it from within a task.)
– Sudev Ambadi
Jun 30 at 16:33
That was helpful, thank you!!
– Marco99
Jun 30 at 16:49
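Following up on the comment above: since each executor keeps its own local copy of the added file, tasks can resolve and read it themselves. A minimal sketch, assuming the file fits in memory and reusing the fileName.txt from the question:
import org.apache.spark.SparkFiles
import scala.io.Source

// Each task resolves the path on the executor it runs on and reads that local copy.
val lines = sc.parallelize(Seq(0), 1).flatMap { _ =>
  val src = Source.fromFile(SparkFiles.get("fileName.txt"))
  try src.getLines().toList finally src.close()
}
lines.collect().foreach(println)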