Extracting column name and datatype from parquet file with python


I have hundreds of Parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, which comes in this format:

COL_1: string
   -- field metadata --
   PARQUET:field_id: '34'
COL_2: int32
   -- field metadata --
   PARQUET:field_id: '35'

I just want:

COL_1 string
COL_2 int32

There is one answer below.


To convert between Parquet and Arrow, some metadata is added to each field of the schema under keys prefixed with PARQUET: (for example, PARQUET:field_id).

You can remove that metadata easily:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example table and write it to a Parquet file.
table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array(['foo', 'bar'])],
    schema=pa.schema({'COL1': pa.int32(), 'COL2': pa.string()})
)
pq.write_table(table, '/tmp/table.pq')
parquet_file = pq.ParquetFile('/tmp/table.pq')

# Rebuild the Arrow schema from fields with their metadata stripped.
schema = pa.schema(
    [f.remove_metadata() for f in parquet_file.schema_arrow])
print(schema)

This will print:

COL1: int32
COL2: string

Bear in mind that remove_metadata() drops every metadata entry on a field. If you start writing your own metadata, you'll want to remove only the entries under the PARQUET key.