CSV comma delimiter split in Spark RDD, but without splitting commas within double quotes

I have a CSV file with data as below

id,name,comp_name

1,raj,"rajeswari,motors"

2,shiva,amber kings

My requirement is to read this file into a Spark RDD and then split each line on the comma delimiter. But the code below splits on every comma: val splitdata = data.map(_.split(","))

I do not want to split on commas within double quotes, but I do NOT want to use a regex. Is there a simple, efficient method to achieve this?

My second requirement is to read the same CSV file into a Spark DataFrame and show it, but I need to see the double quotes in the result. The output should look like:

id name comp_name

1 raj "rajeswari,motors"

2 shiva amber kings

Double quotes are not shown normally, but is there any way to do it?

I am using Spark 2.4 / Scala 2.11 / Eclipse IDE.

There are 1 best solutions below

I would suggest using a DataFrame instead of an RDD:

val df = spark.read.option("header", "true").csv("csv/file/path")
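
For the second requirement in the question (seeing the double quotes in the output), here is a minimal sketch, not a definitive solution. The CSV reader strips the quotes while parsing, so this sketch adds them back for any comp_name value that contains a comma. It assumes a local SparkSession, and the object name ShowQuotedColumn is made up for illustration. It cannot tell whether a value without a comma was quoted in the original file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, when}

// Hypothetical driver object; the name and the local master are assumptions.
object ShowQuotedColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShowQuotedColumn")
      .master("local[*]")
      .getOrCreate()

    // Same read as above: the reader removes the surrounding quotes.
    val df = spark.read.option("header", "true").csv("csv/file/path")

    // Re-wrap values that contain a comma in double quotes for display only.
    val quoted = df.withColumn(
      "comp_name",
      when(col("comp_name").contains(","),
        concat(lit("\""), col("comp_name"), lit("\"")))
        .otherwise(col("comp_name"))
    )

    // truncate = false so long values are not cut off.
    quoted.show(false)

    spark.stop()
  }
}
```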

For the RDD split there is no direct way with String.split alone; you have to use a regex like the one below to ignore any "," enclosed between double quotes:

val raw = sc.textFile("file:///tmp/stackoverflow_q_72457003.csv")
raw.map(_.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")(2)).foreach(println)

You'd get output like this (the header value comp_name is printed as well, because the header line is not skipped):

"rajeswari,motors"

amber kings

Refer to this post for an explanation of the expression: Splitting on comma outside quotes
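
Since the question asks to avoid a regex, here is a hedged sketch of a hand-written alternative: scan each line one character at a time and track whether the current position is inside double quotes. The names QuoteAwareSplit and splitOutsideQuotes are hypothetical, not part of Spark or the Scala library. The sketch keeps the quote characters in the output and does not handle quoted fields that span multiple lines.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper; not a Spark or standard-library API.
object QuoteAwareSplit {
  // Splits one CSV line on commas that are outside double quotes.
  // Quote characters are kept in the returned fields.
  def splitOutsideQuotes(line: String): Array[String] = {
    val fields = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false

    for (c <- line) {
      if (c == '"') {
        // Toggle the quoted state and keep the quote itself.
        inQuotes = !inQuotes
        current.append(c)
      } else if (c == ',' && !inQuotes) {
        // An unquoted comma ends the current field.
        fields += current.toString
        current.clear()
      } else {
        current.append(c)
      }
    }
    // Flush the last field (there is no trailing comma to trigger it).
    fields += current.toString
    fields.toArray
  }
}
```

With the RDD from the question, this could be used as data.map(QuoteAwareSplit.splitOutsideQuotes). Unlike String.split, it keeps trailing empty fields.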
