When I use the read_parquet method to read a Parquet file, it raises the error "Column 8 named hostIp expected length 548 but got length 549". hostIp is one of the columns in REQUIRED_COLUMNS.
import pandas as pd
REQUIRED_COLUMNS = [...]
path = ...
data = pd.read_parquet(path, columns=REQUIRED_COLUMNS)
When I iterate over the columns in REQUIRED_COLUMNS and call read_parquet for each one, it succeeds.
for col in REQUIRED_COLUMNS:
    columns = [col]
    data = pd.read_parquet(path, columns=columns)
I also checked that the number of rows is 548 for each column in the above process.
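The per-column row count check described above could be scripted roughly as follows. This is only a sketch: the helper name column_lengths and its reader parameter are hypothetical, introduced here so the check can be tried without a real file; by default it uses pd.read_parquet exactly as in the loop above.

```python
import pandas as pd


def column_lengths(path, columns, reader=pd.read_parquet):
    """Read each column on its own and return {column name: row count}.

    `reader` is a hypothetical hook (not part of pandas) that defaults to
    pd.read_parquet; it only needs to accept (path, columns=[...]).
    """
    lengths = {}
    for col in columns:
        # Same single-column call as in the loop above.
        lengths[col] = len(reader(path, columns=[col]))
    return lengths
```

Calling column_lengths(path, REQUIRED_COLUMNS) returns one row count per column; in the case described here, every value was 548.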
The error you are getting occurs because the hostIp column in your Parquet file comes back with 549 rows, but the read_parquet() method expects it to have 548 rows.
The code you have provided iterates over the REQUIRED_COLUMNS list and calls read_parquet() for each column individually. This works because each single-column read returns 548 rows. However, when you call read_parquet() with the whole REQUIRED_COLUMNS list as the columns argument, it tries to read all of the columns in the list together, including hostIp, which then yields 549 rows instead of 548. This is why you are getting the error.
To solve this problem, you can either:
- Change the read_parquet() call so that only the first 548 rows of the hostIp column are read.
- Remove the hostIp column from the REQUIRED_COLUMNS list.
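A minimal sketch of both workarounds follows. As far as I know, read_parquet() has no argument that caps the number of rows for a single column, so the first option is approximated here by reading hostIp on its own (which the question reports as working) and truncating it after the read. The names PROBLEM_COLUMN, without_column, attach_truncated and read_required are hypothetical helpers, not part of pandas. This works around the symptom only; it does not explain why the combined read yields an extra row.

```python
import pandas as pd

PROBLEM_COLUMN = "hostIp"  # column named in the error message


def without_column(columns, name=PROBLEM_COLUMN):
    """Second option: return the column list with the problem column removed."""
    return [col for col in columns if col != name]


def attach_truncated(data, extra, name=PROBLEM_COLUMN):
    """First option: add `name` from `extra`, keeping only the first len(data) rows."""
    result = data.copy()
    # Positional assignment: surplus trailing rows are dropped, the index is ignored.
    result[name] = extra[name].iloc[: len(data)].to_numpy()
    return result


def read_required(path, required_columns, name=PROBLEM_COLUMN):
    """Read the other columns together and the problem column separately."""
    # Assumption: both reads succeed, as reported for the single-column loop.
    data = pd.read_parquet(path, columns=without_column(required_columns, name))
    extra = pd.read_parquet(path, columns=[name])
    return attach_truncated(data, extra, name)
```

With these helpers, data = read_required(path, REQUIRED_COLUMNS) would replace the single failing call. The hostIp column ends up last in the result; data[REQUIRED_COLUMNS] restores the original order if that matters. If you do not need hostIp at all, pd.read_parquet(path, columns=without_column(REQUIRED_COLUMNS)) is enough.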