merge groups of columns in a polars dataframe to single columns

114 Views Asked by At

I have a polars dataframe with columns a_0, a_1, a_2, b_0, b_1, b_2. I want to convert it to a longer and thinner dataframe (3 x rows, but just 2 columns a and b), so that a contains a_0[0], a_1[0], a_2[0], a_0[1], a_1[1], a_2[1],... and the same for b. How can I do that?

3

There are 3 best solutions below

3
Roman Pekar On BEST ANSWER

You can use concat_list() to join the columns you want together and then use explode() to convert them into rows.

Let's take simple data frame as an example:

df = pl.DataFrame(
    data=[[x for x in range(6)]],
    schema=[f"a_{i}" for i in range(3)] + [f"b_{i}" for i in range(3)]
)

┌─────┬─────┬─────┬─────┬─────┬─────┐
│ a_0 ┆ a_1 ┆ a_2 ┆ b_0 ┆ b_1 ┆ b_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╪═════╡
│ 0   ┆ 1   ┆ 2   ┆ 3   ┆ 4   ┆ 5   │
└─────┴─────┴─────┴─────┴─────┴─────┘

Now, you can reshape it. First, concat the columns into lists and rename the columns for the final result:

import polars.selectors as cs

df.select(
    pl.concat_list(cs.starts_with(x)).alias(x) for x in ['a','b']
)

┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [0, 1, 2] ┆ [3, 4, 5] │
└───────────┴───────────┘

No, explode lists into rows:

df.select(
    pl.concat_list(cs.starts_with(x)).alias(x) for x in ['a','b']
).explode(pl.all())

┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 3   │
│ 1   ┆ 4   │
│ 2   ┆ 5   │
└─────┴─────┘
1
Hericks On

You can unpivot the dataframe from wide to long separately for each column using pl.DataFrame.melt and subsequently concatenate the results using pl.concat.

import polars.selectors as cs

pl.concat(
    [
        df.melt(value_vars=cs.starts_with(x), value_name=x).drop("variable")
        for x in ["a", "b"]
    ],
    how="horizontal"
)

Note that I've used column selectors to select all column starting with "a" / "b".

Output.

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ a   │
│ 1   ┆ b   │
│ 2   ┆ c   │
└─────┴─────┘
2
jqurious On

The pattern I've been using for this is:

  • Use a bare .melt() to go from wide to long
  • Clean the "names" to produce the new column names
  • Enumerate each "group" to give an index/row_number
  • Re-pivot
df = pl.DataFrame(dict(a_0=1, a_1=2, a_2=3, b_0=4, b_1=5, b_2=6))

(df.melt()
   .with_columns(pl.col("variable").str.replace("_.*", ""))
   .with_columns(index = pl.int_range(pl.len()).over("variable"))
)
shape: (6, 3)
┌──────────┬───────┬───────┐
│ variable ┆ value ┆ index │
│ ---      ┆ ---   ┆ ---   │
│ str      ┆ i64   ┆ i64   │
╞══════════╪═══════╪═══════╡
│ a        ┆ 1     ┆ 0     │
│ a        ┆ 2     ┆ 1     │
│ a        ┆ 3     ┆ 2     │
│ b        ┆ 4     ┆ 0     │
│ b        ┆ 5     ┆ 1     │
│ b        ┆ 6     ┆ 2     │
└──────────┴───────┴───────┘

Which we then .pivot()

(df.melt()
   .with_columns(pl.col("variable").str.replace("_.*", ""))
   .with_columns(index = pl.int_range(pl.len()).over("variable"))
   .pivot(index="index", columns="variable", values="value")
)
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 4   │
│ 1     ┆ 2   ┆ 5   │
│ 2     ┆ 3   ┆ 6   │
└───────┴─────┴─────┘