Parsing strings with numbers and SI prefixes in polars

60 Views Asked by At

Say I have this dataframe:

>>> import polars
>>> df = polars.DataFrame(dict(j=['1.2', '1.2k', '1.2M', '-1.2B']))
>>> df
shape: (4, 1)
┌───────┐
│ j     │
│ ---   │
│ str   │
╞═══════╡
│ 1.2   │
│ 1.2k  │
│ 1.2M  │
│ -1.2B │
└───────┘

How would I go about parsing the above to get:

>>> df = polars.DataFrame(dict(j=[1.2, 1_200, 1_200_000, -1_200_000_000]))
>>> df
shape: (4, 1)
┌───────────┐
│ j         │
│ ---       │
│ f64       │
╞═══════════╡
│ 1.2       │
│ 1200.0    │
│ 1.2e6     │
│ -1.2000e9 │
└───────────┘
>>>
2

There are 2 best solutions below

0
Roman Pekar On BEST ANSWER

You can use str.extract() and str.strip_chars() to split the parts and then get the resulting number by using Expr.replace() + Expr.pow():

df.with_columns(
    pl.col('j').str.strip_chars('KMB').cast(pl.Float32) *
    pl.lit(10).pow(
        pl.col('j').str.extract(r'(K|M|B)').replace(['K','M','B'],[3,6,9]).fill_null(0)
    )
)

┌─────────────┐
│ j           │
│ ---         │
│ f64         │
╞═════════════╡
│ 1.2         │
│ 1200.000048 │
│ 1.2000e6    │
│ -1.2000e9   │
└─────────────┘
0
Hericks On

Another solution similar to that of @RomanPekar would be to replace k, M, and B with the corresponding SI multiplier directly.

df.with_columns(
    pl.col('j').str.strip_chars('kMB').cast(pl.Float32) *
    (
        pl.col('j')
        .str.extract(r'(k|M|B)')
        .replace({"k": 1e3, "M": 1e6, "B": 1e9}, default=1.0, return_dtype=pl.Float32)
    )
)