Question overview
I am extracting data from a newline-delimited JSON file by applying a series of transformations. One of my transformations results in a list of values; for cases where the list is empty, I want the value to be null rather than an empty list. I have code that works, but it seems very convoluted and I'm wondering if there's a simpler way to do this that I am missing.
More detail
In each JSON object of my ndjson file, one of the data elements I'm interested in is an array of telecom nested JSON objects.
{
... some data
"telecom":
[
{
"rank": 1,
"system": "phone",
"use": "work",
"value": "(123) 456-7890"
},
{
"rank": 2,
"system": "fax",
"use": "work",
"value": "(123) 456-7891"
}
]
... some other data
}
As part of a larger data extraction operation, I am doing:
df.select(
expr_first(),
expr_extract_phone(),
expr_others()
)
where expr_first(), expr_extract_phone() and expr_others() return Polars expressions that perform some transformation on various fields of my dataset.
For expr_extract_phone() I want to get a list of phone numbers from telecom as follows:
- for each nested object in the
telecomarray, extractvaluewheresystem=="phone" - collect all the individual phone numbers in a list
- if the list is empty, the value of the column should be
nullrather than[]
I have been able to cobble together something that works:
def expr_extract_phone() -> pl.Expr:
return pl.col('telecom').list.eval(
pl.element().filter(pl.element().struct['system'] == 'phone').struct['value']
).list.unique().alias('phone_numbers').map_batches(lambda col:
pl.LazyFrame(col).select(
pl.when(pl.col('phone_numbers').list.len() > 0)\
.then(pl.col('phone_numbers'))
).collect().get_column('phone_numbers'),
return_dtype=pl.List(pl.String),
is_elementwise=True
)
Getting the list of phone numbers seems straightforward enough, however the entire map_batches portion to replace an empty list [] with a null value seems very convoluted. Is there a simpler way to accomplish what I'm trying to do?
For strings this SO post seems to provide a nice clean way to handle but I can't seem to find an equivalent for a list.
For a fixed column of type
pl.List(...), you could simply use anpl.when().then()construct as follows.The only complication in your concrete example stems from the fact that you don't have a fixed column given but need to apply the construct to an expression. This can easily be done, by saving the
pl.Exprin a intermediate variable.Warning. While polars usually caches repeated sub-expressions, it currently doesn't in cases when the repeated expression is of the form
pl.Expr.list.eval. Especially,phone_numbers_exprmay be (partly) recomputed.