I am reading two types of CSV files that are very similar. They are about the same length, around 20,000 lines. Each line represents parameters recorded each second, so the first column is the timestamp.
- In the first file, the pattern is the following: 2018-09-24 15:38
- In the second file, the pattern is the following: 2018-09-24 03:38:06 PM
In both cases, the command is the same:
import pandas as pd

data = pd.read_csv(file)
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
I checked the execution time of both lines:
- pd.read_csv takes about the same time in both cases
- pd.to_datetime takes ~3 to 4 seconds longer for the second file
The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?
pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed. Here's an example:
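(A reconstruction of that example with illustrative column names and a plain timeit harness; the exact timings will vary by machine and pandas version.)

import pandas as pd
from timeit import timeit

# Two columns with the same moments written in the two patterns from the question
df = pd.DataFrame({'date1': ['2018-09-24 15:38'] * 20000,
                   'date2': ['2018-09-24 03:38:06 PM'] * 20000})

# Let pandas infer each format on its own, timing one pass over 20,000 rows
print(timeit(lambda: pd.to_datetime(df['date1']), number=1))
print(timeit(lambda: pd.to_datetime(df['date2']), number=1))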
For me, that second call was roughly 700x slower. Now specify the format explicitly:
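(Continuing the same snippet; the format strings below are derived from the two patterns in the question.)

# Same conversions, but with the format passed explicitly:
# %H:%M is the 24-hour pattern, %I:%M:%S %p the 12-hour pattern with AM/PM
print(timeit(lambda: pd.to_datetime(df['date1'], format='%Y-%m-%d %H:%M'), number=1))
print(timeit(lambda: pd.to_datetime(df['date2'], format='%Y-%m-%d %I:%M:%S %p'), number=1))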
pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.

Edit: as of pd.__version__ == '1.0.5', the automatic parsing seems to have gotten much better for what used to be extremely slowly parsed formats, likely due to the implementation of this performance improvement in pd.__version__ == '0.25.0'.
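Applied to the two files from the question (placeholder file paths; the Timestamp column name is taken from the question), the fix would look like this:

import pandas as pd

# First file: 24-hour timestamps such as 2018-09-24 15:38
data = pd.read_csv('first_file.csv')  # placeholder path
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%Y-%m-%d %H:%M')

# Second file: 12-hour timestamps with AM/PM such as 2018-09-24 03:38:06 PM
data2 = pd.read_csv('second_file.csv')  # placeholder path
data2['Timestamp'] = pd.to_datetime(data2['Timestamp'], format='%Y-%m-%d %I:%M:%S %p')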