Spark DataFrame API to select multiple columns, map them to a fixed schema, and union all


I have a CSV source file with this schema defined.

["Name", "Address", "TaxId", "SS Number", "Mobile Number", "Gender", "LastVisited"]

From this CSV, these are the operations I need to do:

  1. Select a subset of columns, one at a time, and map all of them to this fixed schema:

    ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

For example, in the first iteration I will select only a subset of the columns:

[Col("Name"), Col("Address"), Col("Mobile Number"), Col("TaxId"), Lit("TaxIdentifier")]

  2. In the next iteration, I need to select a different subset but map it to the same fixed schema:

    [Col("Name"), Col("Address"), Col("Mobile Number"), Col("SS Number"), Lit("SocialSecurityNumber")]

I can do all of this by running a for loop, selecting out the columns, and doing a UnionAll at the end. But is there a better way to let Spark handle this?
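For illustration, the loop-and-union pattern described above can be sketched in plain Python over a list of rows (no Spark required; the column names come from the question's CSV schema, and the sample values are made up):

```python
# Sample rows using the question's source schema (values are invented).
rows = [
    {"Name": "A", "Address": "X", "TaxId": "T1", "SS Number": "S1", "Mobile Number": "111"},
    {"Name": "B", "Address": "Y", "TaxId": "T2", "SS Number": "S2", "Mobile Number": "222"},
]

# Each mapping: (source column feeding UniqueID, literal ref code), as in the question.
mappings = [("TaxId", "TaxIdentifier"), ("SS Number", "SocialSecurityNumber")]

def project(row, source_col, ref_code):
    # Map one source row onto the fixed target schema.
    return {
        "Name": row["Name"],
        "Address": row["Address"],
        "Mobile": row["Mobile Number"],
        "UniqueID": row[source_col],
        "UniqueIdentifierRefCode": ref_code,
    }

# "Union all" of every per-mapping projection.
unioned = [project(r, col, ref) for col, ref in mappings for r in rows]
print(len(unioned))  # 2 mappings x 2 rows = 4
```

Each iteration of the loop is just a projection onto the fixed schema plus a literal tag, and the union concatenates the results.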


Answer from Ed Elliott:

You don't need a loop; you can use a union for this. If you filter the DataFrame down to the rows you want, you can then Union the results. In the filters below I used IsNotNull(), but you can use whatever predicate you like (if you aren't sure of the filter syntax, give us more details and I'll help).

// Rows with a TaxId become the first branch.
var taxId = dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"));

// Rows with a social security number become the second branch.
// Note the column is called "SS Number" in the source schema.
var ssId = dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("SS Number"));

// Union matches columns by position, so both branches must share the same schema.
var unionedDataFrame = taxId.Union(ssId);
unionedDataFrame.Show();

Once you have your final DataFrame, you can either select the columns you actually want or drop the ones you don't:

unionedDataFrame.Drop("TaxId").Show();

or

unionedDataFrame.Select("Name", "UniqueId").Show();

Logically within Spark this is exactly the same as:

dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"))
    .Union(
        dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
            .WithColumn("UniqueId", Functions.Col("SS Number"))
    ).Show();
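If more identifier columns need folding in later, the per-column filter-and-union step generalizes naturally to a reduce over a list of source columns. A plain-Python sketch of that shape (an illustration of the pattern, not the .NET Spark API; the sample rows are invented):

```python
from functools import reduce

# Invented sample rows; None plays the role of a null column value.
rows = [
    {"Name": "A", "TaxId": "T1", "SS Number": None},
    {"Name": "B", "TaxId": None, "SS Number": "S2"},
]

def branch(source_col):
    # Mirrors Filter(Col(source_col).IsNotNull()).WithColumn("UniqueId", Col(source_col)).
    return [dict(r, UniqueId=r[source_col]) for r in rows if r[source_col] is not None]

# Union the per-column branches, as the chained .Union(...) calls do above.
unioned = reduce(lambda a, b: a + b, (branch(c) for c in ["TaxId", "SS Number"]))
```

Adding a third identifier column then means adding one entry to the list rather than another hand-written Union.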

One more thing to note: every method call returns a new DataFrame, so the result of dataFrame.Filter() is a separate DataFrame from dataFrame. Because of lazy evaluation, though, Spark only builds and executes the full plan when you run an action such as Show().
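That lazy-evaluation behaviour can be illustrated with Python generators: each transformation only describes work, and nothing runs until an "action" consumes the pipeline. (This is an analogy, not Spark code.)

```python
executed = []

def source():
    # Stand-in for reading the source data; records which rows actually flow.
    for x in [1, 2, 3]:
        executed.append(x)
        yield x

# Building the pipeline (analogous to chaining Filter/WithColumn) runs nothing yet.
pipeline = (x * 10 for x in source() if x != 2)
assert executed == []  # no rows have been processed so far

# The "action" (analogous to Show) triggers evaluation of the whole pipeline.
result = list(pipeline)
assert result == [10, 30]
assert executed == [1, 2, 3]
```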