I am trying to use pickle to save a few large datasets that I generate from other datasets. The dump runs without any error, but when I try to load these datasets back, pickle exits with an EOFError. Below is the code I run to save the datasets:
import pickle
import pandas as pd
from scipy.stats.mstats import mode

# trainingSet is a pandas groupby object; build one row of features per customer
trainingSetCustomers = pd.DataFrame({
    'visitFrequency': trainingSet.size(),
    'totalAmountSpent': trainingSet['amountSpent'].sum(),
    'totalProducts': trainingSet['productCount'].sum(),
    'firstVisit': trainingSet['visitDate'].min(),
    'lastVisit': trainingSet['visitDate'].max(),
    'visitType': trainingSet['visitType'].apply(f),
    'country': trainingSet['country'].apply(f),
    'isReferred': trainingSet['isReferred'].sum(),
}).reset_index()

p2 = pickle.Pickler(open("trainingSetCustomers.p", "wb"))
p2.clear_memo()
p2.dump(trainingSetCustomers)
print "Training Set saved" #Done
trainingResultSetCustomers = pd.DataFrame({
    'futureVisitFrequency': trainingResultSet.size(),
    'futureTotalAmountSpent': trainingResultSet['amountSpent'].sum(),
    'futureTotalProducts': trainingResultSet['productCount'].sum(),
    'firstVisit': trainingResultSet['visitDate'].min(),
    'lastVisit': trainingResultSet['visitDate'].max(),
    'visitType': trainingResultSet['visitType'].apply(f),
    'country': trainingResultSet['country'].apply(f),
    'isReferred': trainingResultSet['isReferred'].sum(),
}).reset_index()

p3 = pickle.Pickler(open("trainingResultSetCustomers.p", "wb"))
p3.clear_memo()
p3.dump(trainingResultSetCustomers)
print "trainingresult set saved" #Done
This runs without any errors and prints both messages. But when I run the following code:
trainingResultSetCustomers = pickle.load(open("trainingResultSetCustomers.p", "rb"))
it gives me an EOFError. I need to store four of these kinds of data sets, and I am really confused as to why this is happening. I am running the code in an IPython notebook over ssh, if that makes any difference. Also, if I try this with only 5 rows, it works perfectly.
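For reference, a minimal, self-contained version of the same dump/load round trip, with the file handle closed before loading. The file objects opened inline in the code above are never closed, so buffered pickle bytes may still be sitting in memory when the load runs (an assumption about the failure, not a confirmed diagnosis; the small frame below is a hypothetical stand-in for the real data):

import pickle
import pandas as pd

# hypothetical stand-in for one of the generated data sets
df = pd.DataFrame({'visitFrequency': [3, 1], 'totalAmountSpent': [10.5, 2.0]})

out = open("test.p", "wb")
p = pickle.Pickler(out)
p.dump(df)
out.close()  # flushes the buffer; without it the .p file can be empty or truncated

loaded = pickle.load(open("test.p", "rb"))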
Data structure: as can be seen from the code, this DataFrame is generated from the properties of a grouped object.
This is the error I get:
EOFError Traceback (most recent call last)
<ipython-input-10-86d38895c564> in <module>()
5 p = pickle.Pickler(o) #finaldatasetYear1AndYear2 #trainingset groupedCustomersWithDates dfOrdersNew groupedCustomersNew
6 p.clear_memo()
----> 7 trainingset = pickle.load(o)
8 o.close()
9 print "done"
/usr/lib/python2.7/pickle.pyc in load(file)
1376
1377 def load(file):
-> 1378 return Unpickler(file).load()
1379
1380 def loads(str):
/usr/lib/python2.7/pickle.pyc in load(self)
856 while 1:
857 key = read(1)
--> 858 dispatch[key](self)
859 except _Stop, stopinst:
860 return stopinst.value
/usr/lib/python2.7/pickle.pyc in load_eof(self)
878
879 def load_eof(self):
--> 880 raise EOFError
881 dispatch[''] = load_eof
882
In the absence of some test code and version numbers, the only thing I can see is that you are using pandas.DataFrame objects. These often need some special handling that is built into pandas' own pickling methods. I believe pandas gives both a to_pickle and a save method, which provide pickling for a DataFrame. See: How to store data frame using PANDAS, Python and the links within.
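For instance, a minimal sketch of that route (the small frame is a hypothetical stand-in for your grouped data, and it assumes a pandas version that provides to_pickle/read_pickle):

import pandas as pd

# hypothetical stand-in for trainingSetCustomers
df = pd.DataFrame({'visitFrequency': [3, 1], 'totalAmountSpent': [10.5, 2.0]})

df.to_pickle("trainingSetCustomers.p")          # pandas opens, writes, and closes the file itself
df2 = pd.read_pickle("trainingSetCustomers.p")  # no manual Pickler/Unpickler bookkeeping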
And, depending on how large a DataFrame you are trying to pickle, and the versions of your dependencies, you could be hitting up against a 64-bit pickling bug. See: Pickling a DataFrame.

Also, if you are sending serialized data through ssh, you might want to check that you aren't running into some sort of ssh packet limitation. If you are just executing the code through ssh, then this should not be an issue.