I know I can read a local file in Scala like so:
import scala.io.Source
val filename = "laba01/ml-100k/u.data"
for (line <- Source.fromFile(filename).getLines) {
  println(line)
}
This code works fine and prints out the lines from the text file. I run it in JupyterHub with Apache Toree.
I know I can read from HDFS on this server, because when I run the following code in another cell:
import sys.process._
"hdfs dfs -ls /labs/laba01/ml-100k/u.data"!
it works fine too, and I can see this output:
-rw-r--r-- 3 hdfs hdfs 1979173 2020-04-20 17:56 /labs/laba01/ml-100k/u.data
lastException: Throwable = null
warning: there was one feature warning; re-run with -feature for details
0
Now I want to read this same file kept in HDFS by running this:
import scala.io.Source
val filename = "hdfs:/labs/laba01/ml-100k/u.data"
for (line <- Source.fromFile(filename).getLines) {
  println(line)
}
but I get this output instead of the file's lines printed out:
lastException = null
Name: java.io.FileNotFoundException
Message: hdfs:/labs/laba01/ml-100k/u.data (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
So how do I read this text file from HDFS?
scala.io will not be able to find any file in HDFS; it's not meant for that. If I'm not wrong, it can only read files on your local filesystem (file:///). You need to use hadoop-common.jar to read the data from HDFS. You can find a code example here: https://stackoverflow.com/a/41616512/7857701
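A minimal sketch of that approach, assuming the Toree kernel already has the Hadoop client jars on its classpath and that the cluster's core-site.xml/hdfs-site.xml are visible, so FileSystem.get can resolve the default namenode (otherwise pass the hdfs:// URI of your namenode explicitly):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Picks up fs.defaultFS from the cluster configuration on the classpath
val conf = new Configuration()
val fs = FileSystem.get(conf)

// Open the HDFS file as an InputStream and read it line by line
val stream = fs.open(new Path("/labs/laba01/ml-100k/u.data"))
try {
  for (line <- Source.fromInputStream(stream).getLines)
    println(line)
} finally {
  stream.close()
}

If the default filesystem is not configured for the kernel, FileSystem.get(new java.net.URI("hdfs://<namenode-host>:<port>"), conf) does the same thing with the namenode given explicitly.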