Background
We use AWS Glue 4.0 for ETL processing jobs.
Each Glue job (PySpark) reads from and writes to AWS Glue tables. These tables are defined using CloudFormation templates and store data as Parquet files in S3. The tables are partitioned, generally on two columns.
Our business analysts use AWS Athena to query data in these tables.
Here is an excerpt from the StorageDescriptor field:
```yaml
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Location: <S3 location>
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
  Parameters:
    classification: Parquet
  SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
```
Problem
I would like to change a column data type from string to array<string>.
This is easy enough in the CloudFormation template -- just modify it and reapply -- but I'm concerned about the existing Parquet files, which still store the column as a plain string.
Question
Is there a painless way to migrate existing Parquet files to the new schema?
You can create a new Glue job that reads the existing Parquet files, converts the column from string to array<string>, and writes the data back to S3 under the new schema.
This seems like one of the most painless ways, if not the most painless way, to migrate them.
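For illustration, here is a minimal sketch of such a one-off migration job. The column name `tags`, the S3 prefixes, and the partition columns `year` and `month` are all hypothetical placeholders; substitute your actual names. Writing to a separate prefix avoids reading and overwriting the same files in a single job.

```python
# Minimal sketch of a one-off migration Glue job (PySpark).
# Assumptions (hypothetical, adjust to your setup):
#   - the column being migrated is called "tags"
#   - the table is partitioned on "year" and "month"
#   - source/target S3 prefixes as shown below
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

source_path = "s3://my-bucket/my-table/"           # hypothetical existing data
target_path = "s3://my-bucket/my-table-migrated/"  # hypothetical new prefix

# Read the existing Parquet files directly from S3.
df = spark.read.parquet(source_path)

# Wrap the existing string value in a single-element array; keep NULLs as NULL.
df = df.withColumn(
    "tags",
    F.when(F.col("tags").isNotNull(), F.array(F.col("tags")))
     .otherwise(F.lit(None).cast("array<string>")),
)

# Write back as Parquet, preserving the two-column partitioning.
(
    df.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet(target_path)
)

job.commit()
```

Once the new files are verified, you can point the table's `Location` at the new prefix (or copy the files back to the original one) and apply the `array<string>` column change in the CloudFormation template so Athena reads the data with the new type.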