Pickle not able to save dataframes


I am trying to use pickle to save a few large DataFrames that I generate from other datasets. Dumping raises no errors, but when I try to load these datasets back, pickle exits with an EOFError. Below is the code I run to save the datasets:

import pickle
import pandas as pd
from scipy.stats.mstats import mode

trainingSetCustomers = pd.DataFrame({
    'visitFrequency': trainingSet.size(),
    'totalAmountSpent': trainingSet['amountSpent'].sum(),
    'totalProducts': trainingSet['productCount'].sum(),
    'firstVisit': trainingSet['visitDate'].min(),
    'lastVisit': trainingSet['visitDate'].max(),
    'visitType': trainingSet['visitType'].apply(f),
    'country': trainingSet['country'].apply(f),
    'isReferred': trainingSet['isReferred'].sum()
}).reset_index()
p2 = pickle.Pickler(open("trainingSetCustomers.p", "wb"))
p2.clear_memo()
p2.dump(trainingSetCustomers)
print "Training Set saved"

trainingResultSetCustomers = pd.DataFrame({
    'futureVisitFrequency': trainingResultSet.size(),
    'futureTotalAmountSpent': trainingResultSet['amountSpent'].sum(),
    'futureTotalProducts': trainingResultSet['productCount'].sum(),
    'firstVisit': trainingResultSet['visitDate'].min(),
    'lastVisit': trainingResultSet['visitDate'].max(),
    'visitType': trainingResultSet['visitType'].apply(f),
    'country': trainingResultSet['country'].apply(f),
    'isReferred': trainingResultSet['isReferred'].sum()
}).reset_index()
p3 = pickle.Pickler(open("trainingResultSetCustomers.p", "wb"))
p3.clear_memo()
p3.dump(trainingResultSetCustomers)
print "trainingresult set saved"
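One detail worth noting about the code above: the file objects passed to pickle.Pickler are never closed, so Python's buffered writes may never be flushed to disk. The file on disk then ends mid-stream, and unpickling it raises exactly the EOFError described below. A minimal sketch of the safer pattern, using a plain dict as a stand-in for the real DataFrame (which isn't shown here):

```python
import pickle

# Stand-in for the real DataFrame built from the grouped object
data = {'visitFrequency': [3, 1, 7]}

# The with-statement guarantees the file is flushed and closed,
# even if dump() raises, so the pickle stream on disk is complete.
with open("trainingSetCustomers.p", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("trainingSetCustomers.p", "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True if the round trip succeeded
```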

This runs without any errors and prints the messages. But when I run the following code:

trainingResultSetCustomers = pickle.load( open( "trainingResultSetCustomers.p", "rb" ) )

It gives me an EOFError. I need to store four of these test sets, and I am really confused as to why this is happening. I am running the code in an IPython notebook over ssh, if that makes any difference. Also, if I try this with only 5 rows it works perfectly.

Data structure: As can be seen from the code, this DataFrame is generated from the properties of a grouped object.

This is the error I get :

EOFError                                  Traceback (most recent call last)
<ipython-input-10-86d38895c564> in <module>()
      5 p = pickle.Pickler(o) #finaldatasetYear1AndYear2 #trainingset groupedCustomersWithDates dfOrdersNew groupedCustomersNew
      6 p.clear_memo()
----> 7 trainingset = pickle.load(o)
      8 o.close()
      9 print "done"

/usr/lib/python2.7/pickle.pyc in load(file)
   1376 
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379 
   1380 def loads(str):

/usr/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_eof(self)
    878 
    879     def load_eof(self):
--> 880         raise EOFError
    881     dispatch[''] = load_eof
    882 
BEST ANSWER

In the absence of test code and version numbers, the only thing I can see is that you are using pandas.DataFrame objects. These often need special handling that is built into pandas' own pickling methods. I believe pandas provides DataFrame.to_pickle (and, in older versions, a save method), which pickles a DataFrame for you, plus pandas.read_pickle to read it back. See: How to store data frame using PANDAS, Python, and the links within.
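A sketch of that suggestion using the current pandas spelling (DataFrame.to_pickle and pandas.read_pickle; column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'visitFrequency': [3, 1],
                   'totalAmountSpent': [49.9, 12.5]})

# to_pickle handles opening, writing, flushing, and closing the file itself,
# so there is no file handle to forget to close.
df.to_pickle("customers.p")

restored = pd.read_pickle("customers.p")
print(restored.equals(df))  # True
```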

And, depending on how large a DataFrame you are trying to pickle and the versions of your dependencies, you could be running into a 64-bit pickling bug. See: Pickling a DataFrame.

Also, if you are sending serialized data through ssh, you might want to check that you aren't running into some sort of ssh packet limitation. If you are just executing the code over ssh, this shouldn't be an issue.