How do I cast to Categorical in Polars in Rust?


Given the following Rust function:

pub fn test_dataframe() -> PolarsResult<DataFrame> {
    let df = df! {
        "lo" => [...],
        "hi" => [...],
        "register" => [...],
    }?;
    df.lazy().select([
        col("lo")
            .map_many(
                decode_to_str,
                &[col("hi")],
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- Here!
            .alias("categorical_1"),
        col("register")
            .map(
                register_names_from_register_ids,
                GetOutput::from_type(DataType::Utf8),
            )
            .cast(DataType::Categorical(None)) // <-- And here!
            .alias("categorical_2"),
    ]).collect()
}

I see three options for how to instantiate the DataType enum in the calls to .cast():

  1. Calling it with DataType::Categorical(None) as exemplified here.
  2. Calling it with DataType::Categorical(Some(Arc::new(RevMapping::default()))) as exemplified here.
  3. Instantiating the RevMapping first and cloning it to the various Categoricals. I couldn't find an example of this online; please see the code below.
// ...
let revmap = Arc::new(RevMapping::default());
// ...
    .cast(DataType::Categorical(Some(revmap.clone())))
// ...
    .cast(DataType::Categorical(Some(revmap.clone())))
// ...
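To make clear what option 3 does at the Rust level, here is a minimal standard-library sketch. `StandInRevMap` and both helper functions are hypothetical names made up for illustration; this is not Polars' `RevMapping`. It only shows that cloning an `Arc` copies the handle rather than the data behind it, and says nothing about whether `.cast()` actually reuses the mapping it is given.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a category mapping; NOT Polars' `RevMapping`.
#[derive(Debug, Default)]
pub struct StandInRevMap {
    pub categories: HashMap<u32, String>,
}

/// Clones the handle twice, as in option 3, and reports whether both
/// clones point at the same allocation as the original.
pub fn clones_share_allocation(revmap: &Arc<StandInRevMap>) -> bool {
    let first = Arc::clone(revmap);
    let second = Arc::clone(revmap);
    Arc::ptr_eq(&first, &second) && Arc::ptr_eq(revmap, &first)
}

/// Returns the strong reference count while `n` extra clones are alive.
/// The clones are dropped again when the function returns.
pub fn strong_count_after_clones(revmap: &Arc<StandInRevMap>, n: usize) -> usize {
    let _handles: Vec<Arc<StandInRevMap>> =
        (0..n).map(|_| Arc::clone(revmap)).collect();
    Arc::strong_count(revmap)
}
```

So each `revmap.clone()` in option 3 only bumps a reference count. Any difference between the three options would have to come from what the cast does with the mapping, not from the cost of cloning it.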

My question is: which of these is the recommended approach? I was surprised to see the Categorical(None) option in the official docs, since I can't see any disadvantage in reusing the cache across columns, but I may well be wrong. I couldn't find any recommendation favoring one option over the others in the official docs.

I tried getting the estimated size of the dataframe with all three options using df.estimated_size(), but the size wasn't affected by my swapping out the RevMapping or leaving it None.
