I am only trying to read a text file into a pyspark RDD, and I am noticing huge differences between `sqlContext.read.load` and `sqlContext.read.text`.
```python
s3_single_file_inpath = 's3a://bucket-name/file_name'

indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv',
                              header='true', inferSchema='false', sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
```
The `sqlContext.read.load` command above fails with

```
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
```

but the second one succeeds. Why?
Now, I am confused by this because all of the resources I see online say to use `sqlContext.read.load`, including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html. It is not clear to me which of these to use when. Is there a clear distinction between them?
The difference is:

- `text` is a built-in input format in Spark 1.6.
- `com.databricks.spark.csv` is a third-party package in Spark 1.6.

To use the third-party spark-csv package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example by providing the `--packages` argument with the `spark-submit` / `pyspark` commands.
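In a notebook, where you don't control the `pyspark` command line directly, one common way to get the same effect is to set `PYSPARK_SUBMIT_ARGS` before the SparkContext is created. A minimal sketch — the package coordinates (Scala 2.10, spark-csv 1.5.0) are an assumption, so check spark-packages.org for the version matching your cluster:

```python
import os

# Must run before the SparkContext starts; the coordinates below are
# illustrative -- adjust the Scala and spark-csv versions to your setup.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell"
)
```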
Beyond that, `sqlContext.read.formatName(...)` is syntactic sugar for `sqlContext.read.format("formatName")` and `sqlContext.read.load(..., format=formatName)`.
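That equivalence can be sketched with a toy stand-in for `DataFrameReader` — plain Python, not real Spark code; the class and its return values are illustrative assumptions, but the dispatch pattern mirrors how the shortcut methods reduce to `format(...).load(...)`:

```python
class ToyReader:
    """Toy stand-in for DataFrameReader: shortcut methods like .text()
    just set the format and delegate to load()."""

    def __init__(self):
        self._format = "parquet"  # Spark's default source

    def format(self, source):
        self._format = source
        return self  # chainable, like the real reader

    def load(self, path, format=None, **options):
        fmt = format or self._format
        # A real reader would return a DataFrame; a tuple stands in here.
        return (fmt, path, options)

    def text(self, path):
        # Sugar: identical to self.format("text").load(path)
        return self.format("text").load(path)


# All three spellings resolve to the same call.
r1 = ToyReader().text("s3a://bucket/file")
r2 = ToyReader().format("text").load("s3a://bucket/file")
r3 = ToyReader().load("s3a://bucket/file", format="text")
assert r1 == r2 == r3
```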