I have a very large CSV file that I want to read with PySpark, but I am not able to parse it correctly.
A sample of the CSV:
"keyvalue","rto","state","maker_model","veh_type","veh_class"
"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"
"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO
", "Private Vehicle", "Car"
"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"
I want to read it like this:
+---------------+-----+---------+--------------+------------------+---------+
| keyvalue| rto| state| maker_model| veh_type|veh_class|
+---------------+-----+---------+--------------+------------------+---------+
|hnjsnjncjssssmj| "OD"| "ODISHA"| "BAJAJ AUTO"| "Private Vehicle"| "Car"|
|hnjsnjncjssssjj| "OD"| "ODISHA"| "BAJAJ AUTO"| "Private Vehicle"| "Car"|
|hnjsnjncjssssmm| "GO"| "GOA"| "TATA MOTORS"| "Private Vehicle"| "Bus"|
but PySpark cannot recognise the second row properly — it contains a newline inside a quoted field — and breaks it like this:
+--------------------+------+---------+--------------+------------------+---------+
| keyvalue| rto| state| maker_model| veh_type|veh_class|
+--------------------+------+---------+--------------+------------------+---------+
| hnjsnjncjssssmj| "OD"| "ODISHA"| "BAJAJ AUTO"| "Private Vehicle"| "Car"|
| hnjsnjncjssssjj| "OD"| "ODISHA"| "BAJAJ AUTO| null| null|
|", "Private Vehicle"| "Car"| null| null| null| null|
| hnjsnjncjssssmm| "GO"| "GOA"| "TATA MOTORS"| "Private Vehicle"| "Bus"|
+--------------------+------+---------+--------------+------------------+---------+
I have tried various options in Spark's CSV reader, but so far nothing has worked. Please guide me.
If the file isn't too large, you could use a regex (or simple quote counting) to fix the broken lines and then read the fixed file into Spark.
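For example, here is a minimal sketch, assuming the only defect is embedded newlines inside quoted fields (as in your sample): join physical lines until each logical row has a balanced (even) number of double quotes. The `fix_broken_rows` helper name is mine, not from any library.

```python
def fix_broken_rows(text):
    """Join physical lines until each logical CSV row has balanced quotes.

    Assumes the file's only defect is a newline inside a quoted field,
    as in the sample: a row with an odd number of double quotes must
    continue on the next physical line.
    """
    rows, buf = [], ""
    for line in text.splitlines():
        buf += line
        if buf.count('"') % 2 == 0:  # quotes balanced -> row is complete
            rows.append(buf)
            buf = ""
        # otherwise an open quoted field spills onto the next line
    if buf:                          # trailing unterminated row, keep as-is
        rows.append(buf)
    return "\n".join(rows)


broken = '''"keyvalue","rto","state","maker_model","veh_type","veh_class"
"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"
"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO
", "Private Vehicle", "Car"
"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"'''

fixed = fix_broken_rows(broken)
# each logical row now sits on one physical line, ready for spark.read.csv
```

After writing `fixed` back out to a file, `spark.read.csv(path, header=True)` should see all three rows. Also worth trying before any preprocessing: recent Spark versions have a `multiLine=True` option on the CSV reader (e.g. `spark.read.csv(path, header=True, multiLine=True)`), which parses newlines inside quoted fields directly and may solve this on its own.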