I followed the question "pyarrow data types for columns that have lists of dictionaries?" to create an Arrow table that includes a column of MapType.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
pq.write_table(table, './test_map.parquet')
The above code runs without errors on my development machine:
PyArrow Version = 1.0.1
Pandas Version = 1.1.2
It generates the test_map.parquet file successfully.
Then I used parquet-tools (1.11.1) to read the file, but got the following output:
col1:
.key_value:
.key_value:
col2 = foo
col1:
.key_value:
.key_value:
col2 = bar
The keys and values are missing. Could you help me with this?
We submitted a JIRA issue to Apache Arrow on Sep 30, 2020: https://issues.apache.org/jira/browse/ARROW-10140
The issue was resolved in PyArrow 2.0.0, which was released on Oct 20, 2020.
So if you hit the same issue when using the map type, please upgrade PyArrow to 2.0.0 or later.
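If you want to confirm the fix after upgrading without relying on parquet-tools, one option is to write the file and read it back with PyArrow itself. The following is only a minimal sketch, assuming PyArrow 2.0.0 or later; the temporary directory and the variable names (rows, restored, restored_rows) are illustrative and not part of the original code, and the table is built directly from Arrow arrays rather than through pandas.

import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Same data as in the question, built directly as an Arrow map column.
rows = [
    [('id', 'something'), ('value2', 'else')],
    [('id', 'something2'), ('value', 'else2')],
]
map_type = pa.map_(pa.string(), pa.string())
table = pa.table({
    'col1': pa.array(rows, type=map_type),
    'col2': pa.array(['foo', 'bar']),
})

# Write to a temporary location (illustrative path) and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_map.parquet')
    pq.write_table(table, path)
    restored = pq.read_table(path)

# Convert each row of map entries to a dict so it is easy to inspect.
restored_rows = [dict(entries) for entries in restored.column('col1').to_pylist()]
print(restored.schema.field('col1').type)
print(restored_rows)

If the keys and values show up in restored_rows, the map column survived the round trip. This checks the PyArrow reader only, so it complements rather than replaces a check with parquet-tools.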