How to pass a dataframe produced in a C++ program to Apache DataFusion in a Rust program


I need to generate a dataframe from a single-host C++ program (though it will eventually become an MPI program). I also need to process this dataframe with the Apache DataFusion DataFrame API in Rust.

Since the dataframe's underlying in-memory structure is Apache Arrow, which has the same memory layout across languages, I expect there is a way to do this inter-process communication (IPC) with zero copy (no network transfer, and no persisting to disk to be picked up later).

How can a Rust program invoke this C++ program to generate a dataframe and then let DataFusion pick it up? The following example works, but it would require the C++ program to persist its result to a CSV or Parquet file first.

let ctx = SessionContext::new();
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
let df = df.filter(col("a").lt_eq(col("b")))?
           .aggregate(vec![col("a")], vec![min(col("b"))])?
           .limit(0, Some(100))?;
// Print results
df.show().await?;

There is 1 best solution below


I think at a high level, you want the C++ program to generate an Arrow "IPC stream", read the record batches from that stream in Rust with StreamReader (https://docs.rs/arrow-ipc/latest/arrow_ipc/reader/struct.StreamReader.html), and then create a MemTable from those batches.

At the moment, as far as I know, DataFusion has no pre-built provider for this.

Related discussion on https://lists.apache.org/[email protected]