Impute nulls and add a new calculated column with Rust DataFusion?


Suppose I have a JSON data file named test_file.json with the following content:

{"a": 1, "b": "hi", "c": 3}
{"a": 5, "b": null, "c": 7}

Here is how I read the file with the DataFrame API of DataFusion:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let file_path = "datalayers/landing/test_file.json";

    // The context is never mutated here, so `mut` is not needed.
    let ctx = SessionContext::new();
    let df = ctx.read_json(file_path, NdJsonReadOptions::default()).await?;
    df.show().await?;
    Ok(())
}

I would like to do the following operation:

  • Impute the null value in the b column with an empty string "", using either something like fillna or a CASE WHEN expression
  • Create a new calculated column by combining columns a and b, e.g. col("a") + col("b")

I went through the API documentation but could not find a function like Spark's with_column for adding a new column, nor a way to impute the null values.

To add two columns, I can use the column expression col("a").add(col("c")).alias("d"), but I would like to know whether something like with_column can be used to add a new column.


Best answer

DataFusion's DataFrame does not currently have a with_column method, but I think it would be good to add one. I filed an issue for this: https://github.com/apache/arrow-datafusion/issues/2844

Until that is added, you can call select (https://docs.rs/datafusion/9.0.0/datafusion/dataframe/struct.DataFrame.html#method.select) to select the existing columns together with the new expression:

// `.add` comes from the std::ops::Add trait, so `use std::ops::Add;` must be in scope.
let df = df.select(vec![col("a"), col("b"), col("c"), col("a").add(col("c")).alias("d")])?;