Is there a painless way to migrate existing Parquet files to a new schema? I wish to update an AWS Glue table column data type


Background

We use AWS Glue 4.0 for ETL processing jobs.

Each Glue job (PySpark) reads from and writes to AWS Glue tables. These tables are defined using CloudFormation templates and store data as Parquet files in S3. The tables are partitioned, generally on two columns.

Our business analysts use AWS Athena to query data in these tables.

Here is an excerpt from the StorageDescriptor field:

InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Location: <S3 location>
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
    Parameters:
        classification: Parquet
    SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

Problem

I would like to change a column data type from string to array<string>.

This is easy enough in the CloudFormation template -- just modify it and reapply -- but I'm concerned about the existing Parquet files.
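For reference, the template change itself is just the column's Type under the table's StorageDescriptor. A sketch, using a hypothetical column name:

StorageDescriptor:
  Columns:
    - Name: tags            # hypothetical column name
      Type: array<string>   # previously: string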

Question

Is there a painless way to migrate existing Parquet files to the new schema?

1 Answer

Answer by Vikas Sharma:

You can create a new Glue job that reads the existing Parquet files, converts the column to the new data type, and writes the data back to S3 with the new schema.
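Here is a minimal sketch of such a job. The bucket, prefixes, the column name (tags), and the partition column names are placeholders, and it assumes each existing string value simply becomes a one-element array (you might instead split on a delimiter with F.split). Writing to a new prefix lets you validate the rewritten files before pointing the updated table at them.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Hypothetical names -- replace with your own bucket, prefixes, and columns.
SOURCE_PATH = "s3://my-bucket/my-table/"            # existing Parquet files (old schema)
TARGET_PATH = "s3://my-bucket/my-table-migrated/"   # rewritten files (new schema)
COLUMN = "tags"                                     # column moving from string to array<string>
PARTITION_COLUMNS = ["partition_col_1", "partition_col_2"]

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the existing files with the old schema (COLUMN is a plain string).
df = spark.read.parquet(SOURCE_PATH)

# Wrap each non-null string in a one-element array; nulls stay null.
df = df.withColumn(
    COLUMN,
    F.when(F.col(COLUMN).isNull(), F.lit(None).cast("array<string>"))
     .otherwise(F.array(F.col(COLUMN))),
)

# Write back as Parquet, preserving the existing partitioning.
df.write.mode("overwrite").partitionBy(*PARTITION_COLUMNS).parquet(TARGET_PATH)

job.commit()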

This seems like one of the most painless ways, if not the most painless way, to migrate them.