How to use Python Polars read_csv when column length increases after row 1?


I have an example CSV with 1 column in row 1 and 2 columns in the other rows. The parser in Polars read_csv only recognizes 1 column. How do I force it to read more columns? I cannot simply use skiprows because sometimes more than the first row is a single column. I know Pandas can get around this with the names parameter, but I need to use Polars for speed. Any help would be appreciated.

CSV contents:

Data
Date,A
Time,B

Code:

import polars as pl
dumpdf = pl.read_csv('example.csv', has_header=False)
print(dumpdf)

Current and desired output

There are 2 best solutions below

Perhaps there are better ways, but the idea of a pre-processing step was something like:

import tempfile
import polars as pl

notcsv = tempfile.NamedTemporaryFile()
notcsv.write(b"""
Data
More
Data
Date,A
Time,B
Other,"foo
bar"
""".strip()
)
notcsv.seek(0)

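def my_read_csv(filename):
    with open(filename, "rb") as f:
        lines = b""
        for line in f:
            if b"," in line:
                # First multi-column line: parse it and the rest of the file as-is.
                rest = pl.read_csv(line + f.read(), has_header=False)
                if not lines:
                    # No leading single-column lines, so nothing to stack on top.
                    return rest
                # Parse the padded single-column lines and stack them on top.
                return pl.concat([pl.read_csv(lines, has_header=False), rest])
            # Single-column line: append an empty second field.
            lines += line.rstrip(b"\r\n") + b",\n"

>>> my_read_csv(notcsv.name)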
shape: (6, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ More     ┆ null     │
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
│ Other    ┆ foo      │
│          ┆ bar      │
└──────────┴──────────┘ 
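
If the number of columns is not known in advance, the same pre-processing idea could be sketched with the standard library `csv` module: parse everything once, find the widest row, and pad the shorter rows before handing the text to Polars. This is only a rough sketch; `pad_ragged_csv` and its `fill` parameter are names I made up for illustration, and blank lines are dropped as an assumption.

```python
import csv
import io


def pad_ragged_csv(text: str, fill: str = "") -> str:
    """Return CSV text in which every row has as many fields as the widest row.

    Hypothetical helper, not part of Polars.
    """
    # newline="" leaves line endings untouched so the csv module can handle
    # quoted fields that contain embedded newlines.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]
    width = max((len(row) for row in rows), default=0)

    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Pad short rows on the right with the fill value.
        writer.writerow(row + [fill] * (width - len(row)))
    return out.getvalue()
```

The padded text can then be passed on as bytes, e.g. `pl.read_csv(pad_ragged_csv(open('example.csv', newline='').read()).encode(), has_header=False)`. Since the padding runs in pure Python it will likely be slower on large files, but it copes with quoted commas and with more than two columns.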

You can set the separator parameter to the null character and then split the single resultant column yourself like this...

(
    pl.read_csv('./sostream/example.csv', has_header=False, separator=chr(0))
    .select(
        a=pl.col('column_1')
        .str.split(',')
        .list.to_struct(
            n_field_strategy='max_width',
            fields=lambda x: f"column_{x+1}"
        )
    )
    .unnest('a')
)

shape: (3, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
└──────────┴──────────┘