Pandas csv parser not working properly when it encounters `"`

266 Views Asked by At

Problem statement:

Initially what I had

I have a CSV file with the below records:-

data.csv:-

 id,age,name
 3500300026,23,"rahul"
 3500300163,45,"sunita"
 3500320786,12,"patrick"
 3500321074,41,"Viper"
 3500321107,54,"Dawn Breaker"

When I tried to run script.py on this with encoding 'ISO-8859-1', it's running fine

# script.py
import pandas as pd
test_data2=pd.read_csv('data.csv', sep=',', encoding='ISO-8859-1')
print(test_data2)

Result1


Now what I have:-

But when I got a feed of the same file with " at the front of every record, the parser behaved awkwardly. After the data change, new records looks like below:-

id,age,name
"3500300026,23,"rahul"
"3500300163,45,"sunita"
"3500320786,12,"patrick"
"3500321074,41,"Viper"
"3500321107,54,"Dawn Breaker"

And after running the same script (script.py) for this new data file, I am getting the below result

Result2

Character " comes under ISO-8859-1 Character Set only so this can't be an issue anyway. It should be the parser, can't really get it why isn't the parser only focusing on , which I specifically passed as a separator to read_csv().

References: ISO-8859-1 Character set

I am curious to know the reason why pandas was not able to parse it properly or does it has any special connection with ".

1

There are 1 best solutions below

2
On

You can tell pandas that you don't want double quotes to be treated specially by adding an argument to read_csv:

  test_data2=pd.read_csv('data.csv', quoting=csv.QUOTE_NONE)

to read_csv(). The output will be:

In [11]: df
Out[11]: 
            id  age            name
0  "3500300026   23         "rahul"
1  "3500300163   45        "sunita"
2  "3500320786   12       "patrick"
3  "3500321074   41         "Viper"
4  "3500321107   54  "Dawn Breaker"

parsing only on the comma.