Background
We use AWS Glue 4.0 for ETL processing jobs.
Each Glue job (PySpark) reads from and writes to AWS Glue tables. These tables are defined using CloudFormation templates and store data as Parquet files in S3. The tables are partitioned, generally on two columns.
Our business analysts use AWS Athena to query data in these tables.
Here is an excerpt from the StorageDescriptor field:
```yaml
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Location: <S3 location>
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
  Parameters:
    classification: Parquet
  SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
```
Problem
I would like to change a column data type from string to array<string>.
This is easy enough in the CloudFormation template -- just modify it and reapply -- but I'm concerned about the existing Parquet files, which still store the column as a plain string.
Question
Is there a painless way to migrate existing Parquet files to the new schema?
You can create a new Glue job that reads the existing Parquet files, converts the column from string to array<string>, and writes the data back to S3 under the new schema.
This seems like one of the most painless ways, if not the most painless way, to migrate them.
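For illustration, here is a minimal sketch of such a one-off migration job. The column name `tags`, the S3 prefixes, and the partition columns `year` and `month` are all hypothetical placeholders; substitute your actual names. Writing to a separate prefix avoids reading and overwriting the same files in a single job.

```python
# Minimal sketch of a one-off migration Glue job (PySpark).
# Assumptions (hypothetical, adjust to your setup):
#   - the column being migrated is called "tags"
#   - the table is partitioned on "year" and "month"
#   - source/target S3 prefixes as shown below
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

source_path = "s3://my-bucket/my-table/"           # hypothetical existing data
target_path = "s3://my-bucket/my-table-migrated/"  # hypothetical new prefix

# Read the existing Parquet files directly from S3.
df = spark.read.parquet(source_path)

# Wrap the existing string value in a single-element array; keep NULLs as NULL.
df = df.withColumn(
    "tags",
    F.when(F.col("tags").isNotNull(), F.array(F.col("tags")))
     .otherwise(F.lit(None).cast("array<string>")),
)

# Write back as Parquet, preserving the two-column partitioning.
(
    df.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet(target_path)
)

job.commit()
```

Once the new files are verified, you can point the table's `Location` at the new prefix (or copy the files back to the original one) and apply the `array<string>` column change in the CloudFormation template so Athena reads the data with the new type.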