I have a CSV file that I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields contain \n within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following straightforward code. I set parserLib to univocity because, according to what I read on the internet, it solves the multiple-newline problem, but that does not seem to be the case for me.
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");
How do I replace newlines within fields that start with quotes? Is there an easier way?
According to SPARK-14194 (resolved as a duplicate) fields with new line characters are not supported and will never be.
That's however Spark 2.0, and you use the
spark-csv
module. In the referenced SPARK-19610 it was fixed with the pull request:
In other words, use the
wholeFile
option in Spark 2.x (as you can see in CSVDataSource). As to spark-csv, this comment might be of some help (highlighting mine):
In spark-csv's Features you can find the following:
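To make the Spark 2.x route concrete, here is a minimal sketch using the built-in CSV reader. Note that the option introduced in SPARK-19610 as wholeFile was later renamed multiLine before the Spark 2.2.0 release, so check which name your version expects; the file path and app name are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultilineCsvRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("multiline-csv")   // placeholder app name
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                // Parse quoted fields that span multiple lines.
                // On early 2.2 snapshots the option was called "wholeFile".
                .option("multiLine", "true")
                .csv("data.csv");           // placeholder path

        df.show();
        spark.stop();
    }
}
```

With multiLine enabled, the quoted field "blablablabla\nblablablablablalbal" from the sample data is read as a single cell instead of being split across two records. The trade-off is that the file can no longer be split by line boundaries, so each file is read as a whole by one task.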