Pandas to_csv slower with encoding?

I'm having some performance issues transforming a dataframe to csv.

import numpy as np
import pandas as pd
from time import time

t =time();_=pd.DataFrame(np.random.sample((10000,10))).to_csv(encoding=None); print time()-t
0.159129142761
t =time();_=pd.DataFrame(np.random.sample((10000,10))).to_csv(encoding='utf8'); print time()-t
1.16141009331
t =time();_=pd.DataFrame(np.random.sample((10000,10))).to_csv(encoding='ascii'); print time()-t
1.13821101189

Why does specifying an encoding affect the performance of this method so drastically? In my particular case, I'd rather use the default value (None), but since the dataframe I need to convert contains some special characters (Chinese), I cannot use the default encoding, which has superior performance.

Apparently, the default encoding is "ascii", but when I select it explicitly it has exactly the same performance as utf8, which is the one I need in order to handle non-English characters.

Any idea how I can improve the speed and get around this problem?

I'm using pandas 0.16.0 and Python 2.7.9.

EDIT:

I've upgraded to pandas 0.16.2, as per rth's suggestion, and I get better timings:

import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.sample((10000,10)))
%timeit x.copy().to_csv(encoding='ascii')
%timeit x.copy().to_csv()
%timeit x.copy().to_csv(encoding='utf8')
10 loops, best of 3: 160 ms per loop
10 loops, best of 3: 73.7 ms per loop
10 loops, best of 3: 158 ms per loop

Specifying an encoding is still about twice as slow as using the default. That is clearly better than with version 0.16.0, but it is still a tangible difference.

I'm still keen to understand whether it's a bug and how I can improve it... in my case it's the difference between 10 minutes and 20 minutes!

BEST ANSWER:

My guess is that the conversion to csv outputs a string in the native encoding and then converts it to the requested encoding, which results in unnecessary overhead if both are the same. See this particular line in the source code, where, if the encoding is not None, a unicode formatter is used even for ascii.

If you need unicode, though, it makes sense that it would be a bit slower than plain ascii with Python 2.7.
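One possible workaround, sketched here under the assumption of Python 3 and a recent pandas (not the Python 2.7 / pandas 0.16 setup from the question), is to let to_csv() build the CSV as text and then encode the whole text in a single call. The helper name to_csv_bytes and the "label" column are made up for illustration. Whether this is actually faster on a given setup would need to be measured.

```python
import numpy as np
import pandas as pd


def to_csv_bytes(frame, encoding="utf-8"):
    """Render the frame as CSV text, then encode the whole text in one call.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # With no path argument, to_csv() returns the CSV as a str (Python 3).
    text = frame.to_csv()
    return text.encode(encoding)


# Same shape of random data as in the question, plus a non-ASCII column.
frame = pd.DataFrame(np.random.sample((1000, 10)))
frame["label"] = "中文"

payload = to_csv_bytes(frame)
print(type(payload).__name__, len(payload))
```

The resulting bytes can then be written to a file opened in binary mode, so the encoding step happens once per frame rather than inside the formatter.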

Still, in my case, using Python 2.7.9-r2 64 bit and pandas 0.16.1-r1, I get a difference of only a factor of 2 between these options, not the factor of 10 that you get:

In [1]: x = pd.DataFrame(np.random.sample((10000,10)))
   ...: 
   ...: %timeit x.copy().to_csv(encoding='ascii')
   ...: %timeit x.copy().to_csv()
   ...: %timeit x.copy().to_csv(encoding='utf8')
10 loops, best of 3: 109 ms per loop
10 loops, best of 3: 56.8 ms per loop
10 loops, best of 3: 108 ms per loop

So this could potentially be improved for encoding='ascii'.
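To repeat the comparison outside IPython, a script along the following lines might help. It is only a sketch: it assumes Python 3 and a recent pandas, and it writes to a temporary file so that the encoding argument is actually applied, which means the numbers include file I/O and are not directly comparable to the %timeit figures above. The function name time_to_csv is made up for illustration.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd


def time_to_csv(frame, encoding, repeat=3, number=3):
    """Return the best observed time in seconds for one to_csv() call.

    Hypothetical helper: writes to a temporary file that is removed afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        runs = timeit.repeat(
            lambda: frame.to_csv(path, encoding=encoding),
            repeat=repeat,
            number=number,
        )
    finally:
        os.remove(path)
    # Each entry in runs covers `number` calls, so divide to get a per-call time.
    return min(runs) / number


frame = pd.DataFrame(np.random.sample((10000, 10)))
timings = {enc: time_to_csv(frame, enc) for enc in (None, "ascii", "utf8")}

for enc, seconds in timings.items():
    print("encoding=%r: %.1f ms per call" % (enc, seconds * 1000))
```

Taking the minimum over several repeats mirrors the "best of 3" that %timeit reports.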