Is there a way to convert the list-of-struct output of value_counts in Polars to JSON?


I'm new to Polars. I'm trying to count the words in a list-of-strings column and get the result for each row as a dict in a Polars DataFrame.

Basically, I have this input dataframe:

import polars as pl

df_test = pl.DataFrame({'a': [['the', 'dog', 'is', 'good', 'a'], ['toto', 'tata', 'I']]})
shape: (2, 1)
┌───────────────────────┐
│ a                     │
│ ---                   │
│ list[str]             │
╞═══════════════════════╡
│ ["the", "dog", … "a"] │
│ ["toto", "tata", "I"] │
└───────────────────────┘

With this command:

df_test.with_columns(
    pl.col('a').arr.eval(pl.element().value_counts())
)

I get this output:

shape: (2, 1)
┌───────────────────────────────────┐
│ a                                 │
│ ---                               │
│ list[struct[2]]                   │
╞═══════════════════════════════════╡
│ [{"is",1}, {"dog",1}, … {"good",… │
│ [{"tata",1}, {"I",1}, {"toto",1}… │
└───────────────────────────────────┘

Is there a way to get the result of value_counts like this?

[Image of the desired output]

Thanks in advance.

There is 1 best solution below

Wayoshi On

You can cast the value_counts output to a string, clean up each string (strip the surrounding {} and replace the , with : ), join the list into one string with arr.join, then add back the outer {} with pl.format:

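# Written against the older `.arr` namespace; newer Polars releases
# rename it to `.list`, and `str.strip` to `str.strip_chars`.
df_test.with_columns(
    a=pl.format(
        "{{}}",  # "{}" is the placeholder; the outer braces are literal
        pl.col("a")
        .arr.eval(
            pl.element()
            .value_counts()  # one struct (word, count) per distinct word
            .cast(str)  # e.g. {"is",1}
            .str.strip("{}")  # drop the surrounding braces
            .str.replace(",", ": ")  # first comma becomes the key/value separator
        )
        .arr.join(", "),  # one string per row
    )
)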
shape: (2, 1)
┌──────────────────────────────────────────────────┐
│ a                                                │
│ ---                                              │
│ str                                              │
╞══════════════════════════════════════════════════╡
│ {"good": 1, "dog": 1, "a": 1, "is": 1, "the": 1} │
│ {"tata": 1, "toto": 1, "I": 1}                   │
└──────────────────────────────────────────────────┘
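To make the string steps above easier to follow, here is a minimal plain-Python trace of the same transformation on one row. It is only a sketch: the helper name struct_strings_to_json_like is hypothetical, and the input strings are assumed to look like the struct-to-string casts shown in the question's output.

```python
import json


def struct_strings_to_json_like(items):
    """Mimic the per-element steps: strip braces, turn the first comma into ': '."""
    parts = [s.strip("{}").replace(",", ": ", 1) for s in items]
    # Join the elements and add back the outer braces, as pl.format does above.
    return "{" + ", ".join(parts) + "}"


# Assumed string form of one row after value_counts().cast(str)
row = ['{"tata",1}', '{"I",1}', '{"toto",1}']
result = struct_strings_to_json_like(row)
parsed = json.loads(result)  # the result is valid JSON for simple words
print(result)
```

Only the first comma is replaced here, which matches how str.replace behaves in Polars (it replaces the first match, not every match).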

This works as long as no word contains a comma, which for a word count seems like a safe assumption. I think there's potentially a more general answer using pl.format within arr.eval, but I got errors when trying to use struct expressions as arguments to pl.format in that context.
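To illustrate the comma caveat, and one way around it outside the expression engine, here is a small standard-library sketch. It is not the approach from the answer: the helper word_counts_json is hypothetical, and it builds the JSON with collections.Counter and json.dumps instead of string replacement, so commas and quotes inside words are escaped correctly.

```python
import json
from collections import Counter


def word_counts_json(words):
    """Count the words and serialize the counts as a JSON object string."""
    return json.dumps(Counter(words))


# The string-replacement steps break on a word that contains a comma:
naive = '{"a,b",2}'.strip("{}").replace(",", ": ", 1)

# Serializing the counts directly does not have that problem:
safe = word_counts_json(["a,b", "a,b", "c"])
round_trip = json.loads(safe)
print(naive)
print(safe)
```

In Polars, a function like this could presumably be applied row by row with a Python callback (map_elements in recent versions, apply in older ones). That is an assumption to check against the installed version, and it will be slower than a pure-expression solution.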