Spark DataFrame API to select multiple columns, map them to a fixed schema, and union all


I have a CSV source file with this schema defined.

["Name", "Address", "TaxId", "SS Number", "Mobile Number", "Gender", "LastVisited"]

From this CSV, these are the operations I need to do:

  1. Select a subset of columns, one at a time, and map all of them to this fixed schema:

    ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

For example, in the first iteration I will select only a subset of the columns:

[Col("Name"), Col("Address"), Col("Mobile Number"), Col("TaxId"), Lit("TaxIdentifier")]

  2. In the next iteration, I need to select a different subset but map it to the same fixed schema:

    [Col("Name"), Col("Address"), Col("Mobile Number"), Col("SS Number"), Lit("SocialSecurityNumber")]

I can do all of this by running a for loop, selecting out the columns, and doing a UnionAll at the end. But is there a better way to let Spark handle this?
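For illustration, the loop-and-union pattern described above can be sketched in plain Python over a list of rows (no Spark required; the column names come from the question's CSV schema, and the sample values are made up):

```python
# Sample rows using the question's source schema (values are invented).
rows = [
    {"Name": "A", "Address": "X", "TaxId": "T1", "SS Number": "S1", "Mobile Number": "111"},
    {"Name": "B", "Address": "Y", "TaxId": "T2", "SS Number": "S2", "Mobile Number": "222"},
]

# Each mapping: (source column feeding UniqueID, literal ref code), as in the question.
mappings = [("TaxId", "TaxIdentifier"), ("SS Number", "SocialSecurityNumber")]

def project(row, source_col, ref_code):
    # Map one source row onto the fixed target schema.
    return {
        "Name": row["Name"],
        "Address": row["Address"],
        "Mobile": row["Mobile Number"],
        "UniqueID": row[source_col],
        "UniqueIdentifierRefCode": ref_code,
    }

# "Union all" of every per-mapping projection.
unioned = [project(r, col, ref) for col, ref in mappings for r in rows]
print(len(unioned))  # 2 mappings x 2 rows = 4
```

Each iteration of the loop is just a projection onto the fixed schema plus a literal tag, and the union concatenates the results.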


Answer from Ed Elliott:

You don't need a loop; you can use a union for this. If you filter the DataFrame down to the rows you want, you can then Union the results. In the filters below I used IsNotNull(), but you can use whatever predicate you like (if you aren't sure of the filter syntax, give us more details and I'll help).

// Rows with a TaxId become the first branch.
var taxId = dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"));

// Rows with a social security number become the second branch.
// Note the column is called "SS Number" in the source schema.
var ssId = dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("SS Number"));

// Union matches columns by position, so both branches must share the same schema.
var unionedDataFrame = taxId.Union(ssId);
unionedDataFrame.Show();

Once you have your final DataFrame, you can either select the columns you actually want or drop the ones you don't:

unionedDataFrame.Drop("TaxId").Show();

or

unionedDataFrame.Select("Name", "UniqueId").Show();

Logically within Spark this is exactly the same as:

dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"))
    .Union(
        dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
            .WithColumn("UniqueId", Functions.Col("SS Number"))
    ).Show();
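If more identifier columns need folding in later, the per-column filter-and-union step generalizes naturally to a reduce over a list of source columns. A plain-Python sketch of that shape (an illustration of the pattern, not the .NET Spark API; the sample rows are invented):

```python
from functools import reduce

# Invented sample rows; None plays the role of a null column value.
rows = [
    {"Name": "A", "TaxId": "T1", "SS Number": None},
    {"Name": "B", "TaxId": None, "SS Number": "S2"},
]

def branch(source_col):
    # Mirrors Filter(Col(source_col).IsNotNull()).WithColumn("UniqueId", Col(source_col)).
    return [dict(r, UniqueId=r[source_col]) for r in rows if r[source_col] is not None]

# Union the per-column branches, as the chained .Union(...) calls do above.
unioned = reduce(lambda a, b: a + b, (branch(c) for c in ["TaxId", "SS Number"]))
```

Adding a third identifier column then means adding one entry to the list rather than another hand-written Union.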

One more thing to note: every method call returns a new DataFrame, so the result of dataFrame.Filter() is a separate DataFrame from dataFrame. Because of lazy evaluation, though, Spark only builds and executes the full plan when you run an action such as Show().
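That lazy-evaluation behaviour can be illustrated with Python generators: each transformation only describes work, and nothing runs until an "action" consumes the pipeline. (This is an analogy, not Spark code.)

```python
executed = []

def source():
    # Stand-in for reading the source data; records which rows actually flow.
    for x in [1, 2, 3]:
        executed.append(x)
        yield x

# Building the pipeline (analogous to chaining Filter/WithColumn) runs nothing yet.
pipeline = (x * 10 for x in source() if x != 2)
assert executed == []  # no rows have been processed so far

# The "action" (analogous to Show) triggers evaluation of the whole pipeline.
result = list(pipeline)
assert result == [10, 30]
assert executed == [1, 2, 3]
```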