Extracting column name and datatype from parquet file with python

3.7k Views Asked by Vorcry At 29 July 2025 at 04:20

I have hundreds of Parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, but it comes in this format:

COL_1: string
   -- field metadata --
   PARQUET:field_id: '34'
COL_2: int32
   -- field metadata --
   PARQUET:field_id: '35'

I just want:

COL_1 string
COL_2 int32

There is 1 best solution below

0x26res On 13 October 2020 at 12:34

To convert between Parquet and Arrow (in either direction), some metadata is added to the fields of the schema, under the PARQUET key (for example PARQUET:field_id).

You can remove that metadata easily:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1,2]), pa.array(['foo', 'bar'])],
    schema=pa.schema({'COL1': pa.int32(), 'COL2': pa.string()})
)
pq.write_table(table, '/tmp/table.pq')
parquet_file = pq.ParquetFile('/tmp/table.pq')

# Rebuild the schema from copies of the fields with their metadata stripped.
schema = pa.schema(
    [f.remove_metadata() for f in parquet_file.schema_arrow])
print(schema)

This will print:

COL1: int32
COL2: string

Bear in mind that if you start writing your own metadata, you will want to remove only the metadata under the PARQUET key and keep the rest.
