Parquet schema / data type for entirely null object DataFrame columns


I'm writing a DataFrame to binary Parquet format with one or more entirely null object columns.
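
For reference, a minimal sketch of how such a file might be produced (the file name and column names here are just placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# DataFrame with one entirely null object column
df = pd.DataFrame({
    "col_a": [1, 2, 3],
    "col_null": pd.Series([None, None, None], dtype="object"),
})

pq.write_table(pa.Table.from_pandas(df), "example.parquet")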

If I then load the parquet dataset with use_legacy_dataset=False:

parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False, **kwargs)
type(parquet_dataset)
pyarrow.parquet._ParquetDatasetV2

It returns a _ParquetDatasetV2 instance, and when I check the schema:

type(parquet_dataset.schema) 
pyarrow.lib.Schema

If I load the same file but with use_legacy_dataset=True:

parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True, **kwargs)

The schema for the file is an instance of ParquetSchema:

type(parquet_dataset2.schema)
pyarrow._parquet.ParquetSchema

This is as I would expect, and I'm aware that I can get the Arrow schema like this:

arrow_schema = parquet_dataset2.schema.to_arrow_schema()
type(arrow_schema)
pyarrow.lib.Schema

i.e. the same type as the schema I get with use_legacy_dataset=False.

For an instance of ParquetSchema, I can get details of any column, e.g.:

parquet_dataset2.schema[13]

<ParquetColumnSchema>
  name: col13
  path: col13
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT96
  logical_type: None
  converted_type (legacy): NONE

Here the "physical_type" for this column is INT96.

parquet.schema[13].physical_type
'INT32'
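
The same details can be read for every column of the ParquetSchema; a small sketch, using the dataset loaded above:

# Print the name and physical type of every column in the Parquet-level schema
for i in range(len(parquet_dataset2.schema)):
    column = parquet_dataset2.schema[i]
    print(column.name, column.physical_type)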

For an instance of pyarrow.lib.Schema, however, if I get the data type for the same column:

parquet_dataset.schema.field("col13").type
DataType(null)

i.e. there is no information about what the data type is supposed to be.

This information is available in the Parquet file, but how do I access it?

Is there a way to convert an instance of pyarrow.lib.Schema to pyarrow._parquet.ParquetSchema?
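
One workaround sketch I can think of (assuming every file in the dataset shares the same schema, and that the files attribute is available on the dataset object) is to open one of the underlying files directly with pq.ParquetFile and read the Parquet-level schema from there:

# Workaround sketch: inspect the Parquet schema of one underlying file
# (assumes all files in the dataset have the same schema)
first_file = parquet_dataset.files[0]
parquet_file = pq.ParquetFile(first_file)
print(parquet_file.schema[13].physical_type)

But this bypasses the dataset API, so I'd still like to know whether there is a supported way to get at the ParquetSchema (or the physical types) from the pyarrow.lib.Schema side.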
