Problem statement:
What I had initially:
I have a CSV file with the below records:-
data.csv:-
id,age,name
3500300026,23,"rahul"
3500300163,45,"sunita"
3500320786,12,"patrick"
3500321074,41,"Viper"
3500321107,54,"Dawn Breaker"
When I run script.py on this file with encoding 'ISO-8859-1', it works fine:
# script.py
import pandas as pd

test_data2 = pd.read_csv('data.csv', sep=',', encoding='ISO-8859-1')
print(test_data2)
What I have now:
But when I received a feed of the same file with a " at the front of every record, the parser behaved oddly. After the data change, the new records look like this:
id,age,name
"3500300026,23,"rahul"
"3500300163,45,"sunita"
"3500320786,12,"patrick"
"3500321074,41,"Viper"
"3500321107,54,"Dawn Breaker"
After running the same script (script.py) on this new data file, I get the result below:
The character " is part of the ISO-8859-1 character set, so the encoding can't be the issue. It must be the parser, but I can't understand why the parser isn't splitting only on the comma, which I explicitly passed as the separator to read_csv().
References: ISO-8859-1 Character set
I am curious to know why pandas was not able to parse it properly, and whether it treats " in some special way.
You can tell pandas that you don't want double quotes to be treated specially by adding an argument to read_csv(). The output will then be parsed only on the comma.
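One read_csv() argument that has this effect is quoting=csv.QUOTE_NONE (the same as quoting=3); whether it is the exact argument meant above is an assumption. By default read_csv() uses " as its quote character, so a comma that follows an opening quote is read as part of the field rather than as a separator. A minimal sketch, with the malformed records inlined through io.StringIO instead of read from data.csv (so the encoding argument is left out), and with an optional cleanup step that is not part of the original answer:

```python
import csv
import io

import pandas as pd

# Inline copy of the malformed feed from the question (stands in for data.csv).
raw = '''id,age,name
"3500300026,23,"rahul"
"3500300163,45,"sunita"
"3500320786,12,"patrick"
"3500321074,41,"Viper"
"3500321107,54,"Dawn Breaker"
'''

# Assumption: quoting=csv.QUOTE_NONE is the argument meant above.
# It makes the parser treat " as an ordinary character and split only on sep.
df = pd.read_csv(io.StringIO(raw), sep=',', quoting=csv.QUOTE_NONE)

# The quote characters are now plain data, so strip them afterwards.
df['id'] = df['id'].str.lstrip('"').astype('int64')
df['name'] = df['name'].str.strip('"')

print(df)
```

Note that with quoting disabled, a comma inside a genuinely quoted value (for example "Breaker, Dawn") would also be split, so this only suits feeds where fields never contain the separator.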