Python pandas different separators for header and data

63 Views Asked by At

I wrote a nice script that parses into a pandas dataframe some monolitic output for easy plotting and manipulation.

I can read different sections of the lines and turn them into a bigger dataframe almost always

I can combine the following two:

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00
...
...

and

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00

...
...

by making them a dataframe with some crafty one liner pd.read_csv(filename, encoding='latin1', sep='\s+', skip_blank_lines=True, skiprows=lambda x: x not in rows_to_keep) once they are dataframes I can concatenate them.

HOWEVER, my problem arises when the headers suddenly also have one nasty space like bellow (notice the differce between 224Pu and 222 U). This last sections completely screws up my read_cvs.

  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08
...
...

Is there any way where I can apply different "sep" settings for header and data?

Thank you in advance.

I tried different versions of sep, including sep=\s{2,} and it reads correctly the headers but now obviously it misses the data.

I do not see how adding more separators would help me as they are both spaces, or maybe I don't know enough about regexp.

3

There are 3 best solutions below

0
Corralien On BEST ANSWER

You can handle headers and data separately:

import re

filename = 'data.csv'
with open(filename) as fp:
    while True:
        row = fp.readline().strip()
        if row:
            headers = re.split(r'\s{2,}', row)
            break
    df = pd.read_csv(fp, sep='\s+', header=None, names=headers, ...)

First case:

>>> df
   E*/MeV         224Pu     225Pu     226Pu         227Pu  228Pu  229Pu  230Pu
0    48.0  1.920000e-12  0.000001  0.000005  3.540000e-07    0.0    0.0    0.0

Second case:

>>> df
   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09
0
ndclt On

Here's my try:

from io import StringIO

import pandas as pd

to_load = """  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08"""

df = pd.read_csv(StringIO(to_load), sep="\s+", skiprows=1, header=None)
columns = pd.Index(pd.read_csv(StringIO(to_load), sep="\s{2,}", nrows=1, header=None).loc[0,:])
df.columns = columns
print(df)

prints:

0  E*/MeV         222 U     223 U  ...         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  ...  6.710000e-07  9.760000e-07  9.770000e-09

I thanks to skirows and nrows argument, I can read only the header column with one seperator and before the content with another.

2
mozway On

In his particular case, you could use spaces that are followed by a digit as separator:

df = pd.read_csv('tricky.csv', sep=r'\s+(?=\d)', engine='python')

Output:

   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09