Given the following Rust function:
pub fn test_dataframe() -> PolarsResult<DataFrame> {
    let df = df! {
        "lo" => [...],
        "hi" => [...],
        "register" => [...],
    }?;
    df.lazy().select([
        col("lo")
            .map_many(
                decode_to_str,
                &[col("hi")],
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- Here!
            .alias("categorical_1"),
        col("register")
            .map(
                register_names_from_register_ids,
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- And here!
            .alias("categorical_2"),
    ]).collect()
}
I have a couple of options for how to instantiate the DataType enum in the calls to .cast().
- Calling it with DataType::Categorical(None), as exemplified here.
- Calling it with DataType::Categorical(Some(Arc::new(RevMapping::default()))), as exemplified here.
- Instantiating the RevMapping first and cloning it to the various Categoricals. I couldn't find an example of this online; please see the code below.
// ...
let revmap = Arc::new(RevMapping::default());
// ...
.cast(DataType::Categorical(Some(revmap.clone())))
// ...
.cast(DataType::Categorical(Some(revmap.clone())))
// ...
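To make the third option concrete, here is a minimal, std-only sketch of the sharing I have in mind. StandInRevMap and StandInDataType are hypothetical stand-ins made up for illustration, not the Polars types. The sketch only shows that Arc::clone hands out another handle to the same allocation instead of copying the mapping; it says nothing about how Polars itself treats a shared RevMapping.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a category mapping; NOT the Polars `RevMapping`.
#[derive(Debug, Default)]
struct StandInRevMap {
    categories: Vec<String>,
}

// Hypothetical stand-in for a data type that may carry a shared mapping.
#[derive(Debug, Clone)]
enum StandInDataType {
    Categorical(Option<Arc<StandInRevMap>>),
}

/// Builds two column types that point at one mapping via `Arc::clone`.
fn shared_types() -> (Arc<StandInRevMap>, StandInDataType, StandInDataType) {
    let revmap = Arc::new(StandInRevMap::default());
    let a = StandInDataType::Categorical(Some(Arc::clone(&revmap)));
    let b = StandInDataType::Categorical(Some(Arc::clone(&revmap)));
    (revmap, a, b)
}

/// Returns true only when both types hold handles to the same allocation.
fn same_mapping(a: &StandInDataType, b: &StandInDataType) -> bool {
    match (a, b) {
        (StandInDataType::Categorical(Some(x)), StandInDataType::Categorical(Some(y))) => {
            Arc::ptr_eq(x, y)
        }
        _ => false,
    }
}

fn main() {
    let (revmap, a, b) = shared_types();
    println!("same mapping: {}", same_mapping(&a, &b));
    println!("live handles: {}", Arc::strong_count(&revmap));
    println!("categories stored: {}", revmap.categories.len());
}
```

In this toy model, cloning the Arc only bumps a reference count, so the two column types cost one mapping allocation between them, whereas calling Arc::new twice would create two independent mappings.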
My question is: what's the recommended way of doing it? I was surprised to see the Categorical(None) option in the official docs, since I can't see any disadvantage in reusing the cache between different columns, but I may well be wrong. I couldn't find a recommendation for any of the three options in the official docs.
I tried comparing the estimated size of the dataframe across all three options using df.estimated_size(), but the reported size was the same whether I passed a RevMapping or left it as None.