I'm writing a DataFrame to binary Parquet format, in which one or more object columns are entirely null.
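A minimal sketch of the setup (the column names and values here are hypothetical, just to make the problem reproducible):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example: "col13" is an entirely null object column.
df = pd.DataFrame({
    "col0": [1, 2, 3],
    "col13": pd.Series([None, None, None], dtype=object),
})
root_path = "null_cols_example"
pq.write_to_dataset(pa.Table.from_pandas(df), root_path)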
If I then load the parquet dataset with use_legacy_dataset=False:
parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False, **kwargs)
type(parquet_dataset)
pyarrow.parquet._ParquetDatasetV2
This returns a _ParquetDatasetV2 instance, and when I check the schema:
type(parquet_dataset.schema)
pyarrow.lib.Schema
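For the hypothetical example above, printing that schema would show the all-null column mapped to the Arrow null type, something like:
print(parquet_dataset.schema)
col0: int64
col13: null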
If I load the same file but with use_legacy_dataset=True:
parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True, **kwargs)
The schema for the file is an instance of ParquetSchema:
type(parquet_dataset2.schema)
pyarrow._parquet.ParquetSchema
This is what I would expect, and I'm aware that I can get the "arrow schema" like this:
arrow_schema = parquet_dataset2.schema.to_arrow_schema()
type(arrow_schema)
pyarrow.lib.Schema
i.e. the same type as when I use use_legacy_dataset=False.
For an instance of ParquetSchema, I can get the details of any column, e.g.:
parquet_dataset2.schema[13]
<ParquetColumnSchema>
name: col13
path: col13
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
Here the "physical_type" for this column is INT96.
parquet.schema[13].physical_type
'INT32'
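And, assuming the names property works the way I expect, I can walk the whole schema the same way:
# Hypothetical: list every column's parquet physical type via the legacy schema.
for i, name in enumerate(parquet_dataset2.schema.names):
    print(name, parquet_dataset2.schema[i].physical_type)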
For an instance of pyarrow.lib.Schema, if I get the "data type" for the same column:
parquet_dataset.schema.field("col13").type
DataType(null)
i.e. there is no information about what the data type is supposed to be. This information is available in the Parquet file, but how do I access it?
Is there a way to convert an instance of pyarrow.lib.Schema to pyarrow._parquet.ParquetSchema?
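For what it's worth, I know I can get a ParquetSchema for a single file by reading its footer metadata directly (the file path below is hypothetical):
import pyarrow.parquet as pq

# Read the footer metadata of one file in the dataset (hypothetical path).
file_metadata = pq.read_metadata("null_cols_example/part-0.parquet")
type(file_metadata.schema)
pyarrow._parquet.ParquetSchema
But that is per file; I haven't found an equivalent that goes from the dataset-level pyarrow.lib.Schema back to a ParquetSchema.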