Polars more concise way to replace empty list with null

Question

Polars more concise way to replace empty list with null

84 Views Asked by teejay At 25 March 2024 at 18:37

Question overview

I am extracting data from a newline-delimited JSON file by applying a series of transformations. One of my transformations results in a list of values; for cases where the list is empty, I want the value to be null rather than an empty list. I have code that works, but it seems very convoluted and I'm wondering if there's a simpler way to do this that I am missing.

More detail

In each JSON object of my ndjson file, one of the data elements I'm interested in is an array of telecom nested JSON objects.

{
   ... some data
    "telecom":
        [
            {
                "rank": 1,
                "system": "phone",
                "use": "work",
                "value": "(123) 456-7890"
            },
            {
                "rank": 2,
                "system": "fax",
                "use": "work",
                "value": "(123) 456-7891"
            }
        ]
    ... some other data
}

As part of a larger data extraction operation, I am doing:

df.select(
   expr_first(),
   expr_extract_phone(),
   expr_others()  
)

where expr_first(), expr_extract_phone() and expr_others() return Polars expressions that perform some transformation on various fields of my dataset.

For expr_extract_phone() I want to get a list of phone numbers from telecom as follows:

for each nested object in the telecom array, extract value where system=="phone"
collect all the individual phone numbers in a list
if the list is empty, the value of the column should be null rather than []

I have been able to cobble together something that works:

def expr_extract_phone() -> pl.Expr:
    return pl.col('telecom').list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone').struct['value']
        ).list.unique().alias('phone_numbers').map_batches(lambda col:
            pl.LazyFrame(col).select(
                pl.when(pl.col('phone_numbers').list.len() > 0)\
                .then(pl.col('phone_numbers'))
            ).collect().get_column('phone_numbers'),
            return_dtype=pl.List(pl.String),
            is_elementwise=True
        )

Getting the list of phone numbers seems straightforward enough, however the entire map_batches portion to replace an empty list [] with a null value seems very convoluted. Is there a simpler way to accomplish what I'm trying to do?

For strings this SO post seems to provide a nice clean way to handle but I can't seem to find an equivalent for a list.

Original Q&A

There are 1 best solutions below

**Hericks** · Accepted Answer · 2024-03-25T19:06:35.060000

For a fixed column of type pl.List(...), you could simply use an pl.when().then() construct as follows.

import polars as pl

df = pl.DataFrame({
    "a": [[1, 2], [3], [], [4, 5], [], [6]]
})

shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3]       │
│ []        │
│ [4, 5]    │
│ []        │
│ [6]       │
└───────────┘

df.with_columns(
    pl.when(pl.col("a").list.len() > 0).then(pl.col("a"))
)

shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3]       │
│ null      │
│ [4, 5]    │
│ null      │
│ [6]       │
└───────────┘

The only complication in your concrete example stems from the fact that you don't have a fixed column given but need to apply the construct to an expression. This can easily be done, by saving the pl.Expr in a intermediate variable.

def expr_extract_phone() -> pl.Expr:
    phone_numbers_expr = (
        pl.col('telecom')
        .list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone')
            .struct['value']
        )
        .list.unique()
        .alias('phone_numbers')
    )
    # reuse expression from before
    return pl.when(phone_numbers_expr.list.len() > 0).then(phone_numbers_expr)

Warning. While polars usually caches repeated sub-expressions, it currently doesn't in cases when the repeated expression is of the form pl.Expr.list.eval. Especially, phone_numbers_expr may be (partly) recomputed.

Polars more concise way to replace empty list with null

There are 1 best solutions below

Related Questions in PYTHON-POLARS

Trending Questions

Popular # Hahtags

Popular Questions