I have hundreds of Parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, which comes in this format:
COL_1: string
-- field metadata --
PARQUET:field_id: '34'
COL_2: int32
-- field metadata --
PARQUET:field_id: '35'
I just want:
COL_1 string
COL_2 int32
To convert between Parquet and Arrow, some metadata is added to the schema under the PARQUET key. You can remove the metadata easily:
This will print:
COL_1: string
COL_2: int32
Bear in mind that if you start writing your own metadata, you'll want to remove only the metadata under the PARQUET key.