I'm starting with a large zip file of csvs, which I unzipped in Palantir Foundry.
I now have a dataset which consists of multiple csvs (one for each year), where the csvs have almost the same schema but with some differences. How do I apply a schema to each of the csvs individually, or normalize the schema between them?
If your files are unzipped and simply sitting as `.csv`s inside your dataset, you can use Spark's native `spark_session.read.csv` method, similar to my answer over here.
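The linked answer parses each raw file individually and then stacks the parsed results with a `union_many` helper. Those transform helpers only exist inside a Foundry transforms repository, so as a rough illustration of the stacking behavior, here is a plain-Python sketch (the file contents, column names, and helper names below are all made-up assumptions, not Foundry's actual API):

```python
import csv
import io

# Hypothetical stand-ins for two yearly files with slightly different schemas:
# the 2022 file calls the "name" column "full_name".
CSV_2021 = "id,name,amount\n1,alice,10\n2,bob,20\n"
CSV_2022 = "id,full_name,amount\n3,carol,30\n"

def parse_csv(raw_text):
    """Parse one CSV file into a list of row dicts, inferring the header."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def union_many(*row_lists):
    """Stack parsed files: the output schema is the union of every file's
    columns, and each row is padded with None for columns its file lacks."""
    all_columns = []
    for rows in row_lists:
        for row in rows:
            for col in row:
                if col not in all_columns:
                    all_columns.append(col)
    unioned = []
    for rows in row_lists:
        for row in rows:
            unioned.append({col: row.get(col) for col in all_columns})
    return unioned

parsed = union_many(parse_csv(CSV_2021), parse_csv(CSV_2022))
```

The resulting rows share one superset schema (`id`, `name`, `amount`, `full_name`), with `None` wherever a column was absent from a given file, which is exactly the null-heavy stacking described below.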
Note that the `union_many` verb will stack your schemas on top of each other, so if you have many files with different schemas, many rows will be mostly null, since each row's columns only exist in one file.

If you know the common fields for each schema, and know that only one column changes names between files, you can change the logic to rename columns in `parsed_df` to harmonize the schemas. It'll depend on how strictly you want to enforce requirements on your schemas.

I would also include a testing method, same as in the other response, so that you can quickly verify the correct parsing behavior.
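One way to do that rename step, sketched in plain Python rather than Foundry's actual API: map each known variant column name to a canonical name before unioning. The variant names and mapping below are hypothetical.

```python
# Hypothetical mapping from known variant column names to canonical names.
RENAMES = {"full_name": "name", "customer_name": "name"}

def harmonize(rows, renames=RENAMES):
    """Rename known variant columns in one parsed file so that every
    file shares the same schema before the union step."""
    return [
        {renames.get(col, col): val for col, val in row.items()}
        for row in rows
    ]

# Example: a parsed 2022 row that uses the variant column "full_name".
parsed_df = [{"id": "3", "full_name": "carol", "amount": "30"}]
harmonized = harmonize(parsed_df)
```

Applying this to each file's parsed rows before stacking keeps the unioned output free of duplicate near-identical columns; the trade-off is that you must maintain the rename map as new yearly files appear.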