I have a CSV source file with this schema defined:
["Name", "Address", "TaxId", "SS Number", "Mobile Number", "Gender", "LastVisited"]
From this CSV, these are the operations I need to do:
Select a subset of columns, one at a time, and map all of them to this fixed schema:
["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]
So, for example, in the first iteration I will be selecting only a subset of the columns:
[Col("Name"), Col("Address"), Col("Mobile Number"), Col("TaxId"), Lit("TaxIdentifier")]
In the next iteration I need to select a different subset, but map them to the same fixed schema:
[Col("Name"), Col("Address"), Col("Mobile Number"), Col("SS Number"), Lit("SocialSecurityNumber")]
I can do all of this by running a for loop, selecting out the columns, and doing a UnionAll at the end, roughly like the sketch below. But is there a better way to let Spark handle this?
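For reference, this is roughly what that loop looks like (a sketch assuming .NET for Apache Spark, since I'm using Col/Lit; the file path and header option are placeholders):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().GetOrCreate();
var source = spark.Read().Option("header", true).Csv("people.csv"); // placeholder path

// Each entry maps one source column into the UniqueID slot of the fixed schema.
var idColumns = new (string SourceColumn, string RefCode)[]
{
    ("TaxId", "TaxIdentifier"),
    ("SS Number", "SocialSecurityNumber"),
};

DataFrame result = null;
foreach (var (sourceColumn, refCode) in idColumns)
{
    var mapped = source.Select(
        Col("Name"),
        Col("Address"),
        Col("Mobile Number").Alias("Mobile"),
        Col(sourceColumn).Alias("UniqueID"),
        Lit(refCode).Alias("UniqueIdentifierRefCode"));

    // Union the per-iteration results together.
    result = result == null ? mapped : result.Union(mapped);
}
```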
You don't need a loop, but you can use a union for it: filter the DataFrame down to just the rows you want, then Union the filtered DataFrames together.
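Something like this sketch (with .NET for Apache Spark; the session setup, file path, and header option are assumptions, not from your question):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().GetOrCreate();
var dataFrame = spark.Read().Option("header", true).Csv("people.csv"); // placeholder path

// Rows with a TaxId become the "TaxIdentifier" records.
var taxRows = dataFrame
    .Filter(Col("TaxId").IsNotNull())
    .WithColumn("UniqueID", Col("TaxId"))
    .WithColumn("UniqueIdentifierRefCode", Lit("TaxIdentifier"));

// Rows with an SS Number become the "SocialSecurityNumber" records.
var ssnRows = dataFrame
    .Filter(Col("SS Number").IsNotNull())
    .WithColumn("UniqueID", Col("SS Number"))
    .WithColumn("UniqueIdentifierRefCode", Lit("SocialSecurityNumber"));

var combined = taxRows.Union(ssnRows);
```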
In the filter I used IsNotNull(), but you can use whatever filter you like (if you aren't sure of the filter syntax, give us more details and I'll help). Once you have your final DataFrame you can either select the columns you actually want:
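For example (the Alias on Mobile Number is just to line it up with the Mobile column in your fixed schema):

```csharp
var result = combined.Select(
    Col("Name"),
    Col("Address"),
    Col("Mobile Number").Alias("Mobile"),
    Col("UniqueID"),
    Col("UniqueIdentifierRefCode"));
```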
or drop the columns you don't want:
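For example, keeping everything except the original identifier columns (the WithColumnRenamed is only there to match your fixed schema):

```csharp
var result = combined
    .Drop("TaxId", "SS Number", "Gender", "LastVisited")
    .WithColumnRenamed("Mobile Number", "Mobile");
```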
Logically within Spark this is exactly the same as:
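i.e. something like this fully chained sketch, which builds the same logical plan as the step-by-step version above:

```csharp
var result = dataFrame
    .Filter(Col("TaxId").IsNotNull())
    .WithColumn("UniqueID", Col("TaxId"))
    .WithColumn("UniqueIdentifierRefCode", Lit("TaxIdentifier"))
    .Union(dataFrame
        .Filter(Col("SS Number").IsNotNull())
        .WithColumn("UniqueID", Col("SS Number"))
        .WithColumn("UniqueIdentifierRefCode", Lit("SocialSecurityNumber")))
    .Drop("TaxId", "SS Number", "Gender", "LastVisited")
    .WithColumnRenamed("Mobile Number", "Mobile");
```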
Also note that when you call a method you get a new DataFrame, so the result of dataFrame.Filter() is a separate DataFrame from dataFrame. The other important thing is that, because of lazy evaluation, Spark only creates the plan when it executes the query.
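To see this in action (Explain just prints the plan; the output path below is an assumption):

```csharp
// Filter returns a new DataFrame; dataFrame itself is unchanged.
var filtered = dataFrame.Filter(Col("TaxId").IsNotNull());
long totalCount = dataFrame.Count();   // still every row from the CSV
long filteredCount = filtered.Count(); // only rows where TaxId is not null

// The union above is still just a plan until you run an action on it.
result.Explain();                                 // print the plan Spark will execute
result.Write().Mode("overwrite").Csv("output");   // placeholder output path
```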