I'm having some performance issues converting a DataFrame to CSV.
import numpy as np
import pandas as pd
from time import time
t = time(); _ = pd.DataFrame(np.random.sample((10000, 10))).to_csv(encoding=None); print time() - t
0.159129142761
t = time(); _ = pd.DataFrame(np.random.sample((10000, 10))).to_csv(encoding='utf8'); print time() - t
1.16141009331
t = time(); _ = pd.DataFrame(np.random.sample((10000, 10))).to_csv(encoding='ascii'); print time() - t
1.13821101189
Why does specifying an encoding so drastically affect this method's performance? In my particular case I'd rather use the default value (None), but since the DataFrame I need to convert contains some special characters (Chinese), I can't use the default encoding despite its superior performance.
Apparently the default encoding is "ascii", yet when it is passed explicitly it performs exactly the same as utf8, which is the one I need to handle non-English characters.
Any idea how I can get around this problem without giving up the speed?
I'm using pandas 0.16.0 and Python 2.7.9.
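For what it's worth, the only workaround I can think of is to let to_csv() take the fast default path and do the byte encoding myself afterwards. A minimal sketch of the idea (assuming to_csv() returns a string when no buffer is given; out.csv is just a placeholder filename):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.sample((10000, 10)))

# Skip the encoding argument so to_csv() stays on the fast default path...
csv_str = df.to_csv()

# ...then encode the returned string myself when writing it out.
# (For a purely numeric frame the string is ASCII anyway, so this
# encode step should be essentially free.)
with open('out.csv', 'wb') as f:
    f.write(csv_str.encode('utf-8'))
I don't know whether this still holds up once the frame actually contains Chinese text, though, which is part of what I'm asking.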
EDIT:
I've upgraded to pandas 0.16.2 as per rth's suggestion, and I get better timings:
import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.sample((10000, 10)))
%timeit x.copy().to_csv(encoding='ascii')
%timeit x.copy().to_csv()
%timeit x.copy().to_csv(encoding='utf8')
10 loops, best of 3: 160 ms per loop
10 loops, best of 3: 73.7 ms per loop
10 loops, best of 3: 158 ms per loop
Specifying an encoding is still roughly twice as slow as using the default. Clearly better than the previous scenario under 0.16.0, but still a tangible difference.
I'm still keen to understand whether it's a bug and how I can improve it... in my case it's the difference between 10 minutes and 20 minutes!
My guess is that the conversion to CSV outputs a string in the native encoding and then converts it to the requested encoding, which results in unnecessary overhead when both are the same. See this particular line in the source code: if the encoding is not None, a unicode formatter is used even for ascii.
If you need unicode, though, it makes sense that it would be a bit slower with Python 2.7 than plain ascii.
Still, in my case, using Python 2.7.9-r2 64 bit and pandas 0.16.1-r1, I get merely a factor-of-2 difference between these options, not the factor of 10 that you get, so this could potentially be improved for encoding='ascii'.
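If you want to see where the time actually goes, a quick profile should show whether the unicode formatter dominates. A minimal sketch using the standard-library profiler (nothing pandas-specific assumed here):
import cProfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.sample((10000, 10)))

# Run the slow variant under the profiler, sorted by cumulative time,
# to see which formatting/encoding calls account for the extra cost.
cProfile.run("df.to_csv(encoding='utf8')", sort='cumulative')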