Create dataframe from C# List - Spark for .NET


I am new to .NET for Spark and need to append a C# list to a Delta table. I assume I first need to create a Spark DataFrame to do this. In the sample code below, how would I go about appending "names" to the DataFrame "df"?

It seems that, now that Mobius (https://github.com/Microsoft/Mobius) has been deprecated, RDDs are no longer available in the new version (https://github.com/dotnet/spark).

using System.Collections.Generic;
using Microsoft.Spark.Sql;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Json("people.json");
            df.Show();

            var names = new List<string> { "john", "20" };

        }
    }
}

The example file people.json looks like the following:

{"name":"Michael"}
{"name":"Andy", "age":"30"}
{"name":"Justin", "age":"19"}

There are 2 best solutions below

6
Amit On

You need to create another DataFrame from the list and union it with the original DataFrame. Once that is done, you can write it to external storage. You can look for the corresponding C# APIs based on the pseudocode below:

var names = new List<string> { "john", "20" };
// Create a DataFrame from this list.
// In Scala you can do this with spark.createDataFrame.
var newdf = spark.createDataFrame(names, yourschemaclass);
// Union it with the original df.
var joineddf = df.union(newdf);
// Write to external storage if you want.
joineddf.write();
0
Ed Elliott On

You can now create a dataframe in .NET for Apache Spark (you couldn't when this question was written).

To do it, you pass in a list of GenericRow objects, each of which takes an array of objects holding one value per column. You also need to define the schema:


using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace CreateDataFrame
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            
            var df = spark.Read().Json("people.json");
            df.Show();

            var names = new List<string> { "john", "20" };

            var newNamesDataFrame = spark.CreateDataFrame(
                new List<GenericRow>{new GenericRow(names.ToArray())},
                    new StructType(
                    new List<StructField>()
                    {
                        new StructField("name", new StringType()),
                        new StructField("age", new StringType())
                    }));
            
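            // Json() infers its columns in alphabetical order (age, name), so
            // match columns by name rather than by position.
            newNamesDataFrame.UnionByName(df).Show();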
        }
    }
}

Now that you have the DataFrame, you can write it by calling Write() on it, which returns a DataFrameWriter: df.Write().Format("delta").Save("/Path/ToFile")
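
Since the question is about appending to an existing Delta table, here is a minimal sketch of that last step. It assumes the Delta Lake package is available to your Spark session and that the table at the target path already has matching "name" and "age" string columns; "/Path/ToFile" is a placeholder, and the AppendToDelta namespace is just for illustration. With append mode you should not need to union with the existing data first, only to write the new rows:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace AppendToDelta
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();

            // One row to append; values follow the schema's column order.
            var names = new List<string> { "john", "20" };

            var schema = new StructType(new List<StructField>
            {
                new StructField("name", new StringType()),
                new StructField("age", new StringType())
            });

            var newRows = spark.CreateDataFrame(
                new List<GenericRow> { new GenericRow(names.ToArray()) },
                schema);

            // Assumption: "/Path/ToFile" is a placeholder for the location of
            // an existing Delta table with a compatible schema.
            newRows.Write()
                .Format("delta")
                .Mode("append") // add rows instead of failing when the table exists
                .Save("/Path/ToFile");
        }
    }
}
```

Without the Mode("append") call, the writer uses its default save mode, which reports an error if data already exists at the path.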