pd.csv_read Problem with comma & dot when import because of the 2nd line, no idea WHY?

176 Views Asked by At

I have a .txt file (23820 rows × 300 columns). It is '\t' seperated and the decimal is ','

When reading it in with csv_read, nearly every column in my file should be a float but it totally messes it up. I don't get float data (which has a dot as decimal) but string like '25,73234' This leads to my problem when trying to convert it. See the error message:

ValueError                                Traceback (most recent call last)
<ipython-input-3-87b4fe6976d1> in <module>
----> 1 DF['T Ans v F_deg_C'] = [float(x) for x in DF['T Ans v F_deg_C']]

<ipython-input-3-87b4fe6976d1> in <listcomp>(.0)
----> 1 DF['T Ans v F_deg_C'] = [float(x) for x in DF['T Ans v F_deg_C']]

ValueError: could not convert string to float: '25,79243'

Here is a screenshot of my DF after import:

DF Example:

I just could skip the first 2 rows and the data is fine BUT I want to put 1rd and 2nd row elementwise together as header of my DF. I know I could seperately read the headers in and then the data. I also could change the data after importing with replacing ',' to '.' before change string to float in each column but still I want to know what is going on there.

Now here is the question I can't answer myself. I figured out it is the 2nd line who causes my problem. When searching for answers I often read about misinterpreted '\n' or Regex so I printed the 2nd line including every piece of format with:

with open(data_path) as f:
    my_list = list(f)
my_list[1]

'yymmdd\thh:mm:ss\trpm\tNm\tkW\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tbar\tbar\tmbar\tmbar\tmbar\tkg/h\t%\tkg\tg/h\tkg/h\tl/min\tppm\tppm\tppm\tppm\tppm\tppm\t%\tppm\t%\tppm\tppm\tppm\t%\tppm\t%\tppm\tppm\tppm\t%\tppm\t%\tNm\tkW\tkW\tppm\tppm\tppm\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\t%\t%\t%\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tkg/h\tg/kg\t°C\t-\t-\t-\t-\t-\tg/kWh\tg/kWh\tbar\t-\t-\t-\t-\tkPa\tmbar\tmbar\tmg/l\tmg/l\t-\tkWh\t°C\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\t?\t?\tcbm\t1\t\t?\t?\t\t\t\t\t\t?\t?\t?\t?\t?\t?\t?\t?\t?\t?\t?\t\t?\t?\t?\t?\t1\t1\t1\t1\t1\tppm\t\t1\tmg/Asp\t1\t\tbar\tbar\tbar\tbar\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t%\t°C\tbar\t°C\t1\t°C\t°C\t°C\t\tmbar\tmbar\tppm\t°C\tbar\t°C\tbar\tbar\tppm\t°C\t\t°C\t1/min\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\tkg/s\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1,0\t1,0\t1,0\t1,0\t1,0\t1,0\t1,0\tkW\t1,0\t1,0\t1,0\tppm\t1,0\t1,0\tg/s\t1,0\t1,0\t1,0\tppm\tppm\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'

This 2nd row represents the units of the headers. Can somebody explain to me why this line causes my problem?

1

There are 1 best solutions below

2
On BEST ANSWER

How about adding header=[0,1] to the function call? This specifies the first two lines in the file as the header.

In your case: pd.read_csv(data_path, delimiter='\t', decimal=',', header=[0,1])