import pandas as pd
from random import random
from collections import namedtuple
Smoker = namedtuple("Smoker", ["Female","Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female","Male"])
DF = dict()
DF["A"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF["B"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF = pd.DataFrame(DF, index=["t="+str(t+1) for t in range(3)])
I have this dataframe, each of whose cells is a tuple of two namedtuples. After I saved it to a CSV file and reloaded it, the printed output looked the same, but each cell had become a string. Why does this happen? What should I do to get the same dataframe back every time?
DF.to_csv("results.csv", index_label=False)
df = pd.read_csv('results.csv', index_col=0)
print(df)
for a,b in zip(df.A,df.B):
    print(type(a),type(b))
I believe that is expected behaviour. Since CSV is a text-based format, when you save an object-dtype column with to_csv, the natural thing to do is write the string representation of each cell. So tuple((1,2)) becomes "(1, 2)". When you read the CSV file back, the natural and safe way to interpret "(1, 2)" is as the string '(1, 2)', because pandas doesn't have an engine to parse tuple-valued columns.

TL;DR: that's normal and expected behaviour. If you want to save and load data with object dtype unchanged, use a binary format instead, for example the DataFrame.to_pickle method and the pd.read_pickle function.
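
As a minimal sketch of the pickle round trip, the snippet below builds a one-row frame with fixed values (instead of random() so the result is easy to check), writes it to a temporary file, and reads it back. The file name results.pkl and the variable names are only illustrative. It assumes the Smoker and Nonsmoker classes are defined in the module that loads the file, since pickle needs to find them by name.

```python
import os
import tempfile
from collections import namedtuple

import pandas as pd

Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])

# Small frame whose cells are tuples of namedtuples (object dtype).
original = pd.DataFrame(
    {
        "A": [(Smoker(0.1, 0.2), Nonsmoker(0.3, 0.4))],
        "B": [(Smoker(0.5, 0.6), Nonsmoker(0.7, 0.8))],
    },
    index=["t=1"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.pkl")  # illustrative file name
    original.to_pickle(path)          # binary format, keeps Python objects
    restored = pd.read_pickle(path)   # cells come back as tuples, not strings

cell = restored.loc["t=1", "A"]
print(type(cell), type(cell[0]))
```

Pickle files are tied to Python and should only be loaded from sources you trust, so this suits saving your own intermediate results rather than exchanging data with others.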