Given the following Rust function:
pub fn test_dataframe() -> PolarsResult<DataFrame> {
    let df = df! {
        "lo" => [...],
        "hi" => [...],
        "register" => [...],
    }?;
    df.lazy().select([
        col("lo")
            .map_many(
                decode_to_str,
                &[col("hi")],
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- Here!
            .alias("categorical_1"),
        col("register")
            .map(
                register_names_from_register_ids,
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- And here!
            .alias("categorical_2"),
    ]).collect()
}
I have a couple of options for how to instantiate the DataType enum in the calls to .cast():
- Calling it with DataType::Categorical(None), as exemplified here.
- Calling it with DataType::Categorical(Some(Arc::new(RevMapping::default()))), as exemplified here.
- Instantiating the RevMapping first and cloning it to the various Categoricals. I couldn't find an example of this online; please see the code below.
// ...
let revmap = Arc::new(RevMapping::default());
// ...
.cast(DataType::Categorical(Some(revmap.clone())))
// ...
.cast(DataType::Categorical(Some(revmap.clone())))
// ...
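To make the difference between the second and third options concrete, here is a minimal std-only sketch of what cloning an Arc does. StandInRevMap, shared_handles, and separate_handles are hypothetical names invented for this illustration; StandInRevMap is not Polars' RevMapping. The sketch only shows that revmap.clone() hands out another handle to the same allocation, while calling Arc::new(...) each time creates independent allocations. Whether Polars treats a shared handle as a shared cache is exactly what I'm unsure about, and this sketch does not answer that.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a reverse mapping (category id -> label).
/// This is NOT Polars' `RevMapping`; it only illustrates `Arc` semantics.
#[derive(Debug, Default)]
pub struct StandInRevMap {
    pub id_to_label: HashMap<u32, String>,
}

/// Third option: build the mapping once, then hand out clones of the handle.
pub fn shared_handles(n: usize) -> Vec<Arc<StandInRevMap>> {
    let revmap = Arc::new(StandInRevMap::default());
    // `Arc::clone` only bumps a reference count; the map itself is not copied.
    (0..n).map(|_| Arc::clone(&revmap)).collect()
}

/// Second option: every column gets a fresh, independent mapping.
pub fn separate_handles(n: usize) -> Vec<Arc<StandInRevMap>> {
    (0..n).map(|_| Arc::new(StandInRevMap::default())).collect()
}
```

With shared_handles(2), Arc::ptr_eq on the two handles is true because both point at one allocation; with separate_handles(2) it is false.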
My question is: what's the recommended way of doing it? I was surprised to see the Categorical(None) option in the official docs, as I can't see any disadvantage in re-using the cache between different columns, but I may well be wrong. I couldn't find any specific recommendation for one or the other in the official docs.
I tried getting the estimated size of the dataframe with all three options using df.estimated_size(), but the size wasn't affected by swapping out the RevMapping or leaving it as None.