Why do tuples become strings after saving to csv and reloading the dataframe (pandas)?

import pandas as pd
from random import random
from collections import namedtuple

Smoker    = namedtuple("Smoker", ["Female","Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female","Male"])

DF = dict() 
DF["A"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF["B"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF = pd.DataFrame(DF, index=["t="+str(t+1) for t in range(3)])

I have this dataframe, in which each cell is a tuple of two namedtuples. After I saved it to a CSV file and reloaded it, the printed output looked the same, but each cell had become a string. Why does this happen? What should I do to get the same dataframe back every time?

DF.to_csv("results.csv", index_label=False)
df = pd.read_csv('results.csv', index_col=0)

print(df)

for a,b in zip(df.A,df.B):
    print(type(a),type(b))

There are 2 best solutions below

Best answer

I believe that is expected behaviour. CSV is a text-based format, so when you save an object-dtype column to CSV, each value is written as its string representation. The tuple (1, 2) is written as "(1, 2)".

Now, when you read the CSV file back, the natural and safe way to interpret "(1, 2)" is as the string '(1, 2)', because pandas doesn't have an engine to parse tuple-valued columns.
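The round trip described above can be reproduced with a small in-memory sketch. The column name A, the sample values, and the use of io.StringIO in place of a file on disk are assumptions made for illustration.

```python
import io

import pandas as pd

# Illustrative data: one object-dtype column whose cells are plain tuples.
df = pd.DataFrame({"A": [(1, 2), (3, 4)]})

# Write to an in-memory buffer instead of a file, then read it back.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
back = pd.read_csv(buf, index_col=0)

original_cell = df["A"].iloc[0]
reloaded_cell = back["A"].iloc[0]

print(type(original_cell))  # <class 'tuple'>
print(type(reloaded_cell))  # <class 'str'>
```

The reloaded cell prints like a tuple, which is why the two dataframes look the same on screen even though the cell types differ.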

TL;DR: that's normal and expected behaviour. If you want to save and load object-dtype data unchanged, you should use a binary format such as pickle, via DataFrame.to_pickle and pd.read_pickle.
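A minimal sketch of a pickle round trip, assuming a temporary file path and plain tuples as illustrative data (neither comes from the question). Namedtuples should round-trip the same way, provided their class definitions are importable when the file is loaded.

```python
import os
import tempfile

import pandas as pd

# Illustrative data: cells are plain tuples of floats.
df = pd.DataFrame({"A": [(0.1, 0.2), (0.3, 0.4)]}, index=["t=1", "t=2"])

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(type(restored["A"].iloc[0]))  # <class 'tuple'>
```

Pickle files are Python-specific and should only be loaded from trusted sources, so CSV with a converter may still be the better choice when the file has to be shared.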

Another answer

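One approach to get tuples back when reading the CSV is to pass converters to read_csv. Note that ast.literal_eval only accepts Python literals: it handles cells such as "(1, 2)", but raises ValueError on namedtuple text such as "Smoker(Female=0.1, Male=0.2)".

Example: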

import ast

df = pd.read_csv('results.csv', index_col=0, converters={"A": ast.literal_eval, 
                                                         "B": ast.literal_eval})