I am trying to read parquet files through pandas, but a few columns do not exist in some of the files. I would like to know how to ignore the column existence check in the read_parquet function.
def column_data(self):
    """Drop columns that carry almost no information (5 or fewer unique values).

    Returns:
        list: the names of the remaining columns
    """
    # get the data for this platform/equipment on the first date in the range
    self._df_list_data = access_data.read_data(self.platform_id, self.equipment_id, self.date_range_df[0].date())
    # iterate over a copy of the column list so columns can be dropped safely
    for col in list(self._df_list_data.columns):
        if self._df_list_data[col].nunique() <= 5:  # too few distinct values
            self._df_list_data.drop(col, axis=1, inplace=True)
    return list(self._df_list_data.columns.values)
I read all the columns from the parquet file and then store them in a list.
if columns is None:
    # no column list given: read every column in the file
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), engine='fastparquet')
else:
    # read only the requested columns; this fails if a column is missing from the file
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), columns=columns, engine='fastparquet')
return data_list
In the first case I read the data and store the column names via column_data(). In the second case I read the parquet file for a given date, but for a few dates some of the columns do not exist, so pandas cannot match the requested column list.
How can I ignore the column existence check in pandas' read_parquet function?
As far as I know there is no option to ignore the column existence check in pandas. What you could do to work around it is wrap your code in a try/except block and handle the missing-columns error there. One option would be to fill the missing columns with default values:
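A minimal sketch, assuming blob_data and columns are defined as in your snippet; note that the exact exception type raised for a missing column can vary between parquet engines, so the except clause is kept broad:

import io
import pandas as pd

raw = blob_data.readall()  # read the blob once so it can be re-used on retry
try:
    data_list = pd.read_parquet(io.BytesIO(raw), columns=columns, engine='fastparquet')
except (KeyError, ValueError):
    # one or more requested columns are missing from this file:
    # read everything, then add the missing columns with a default value
    data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
    for col in columns:
        if col not in data_list.columns:
            data_list[col] = None  # pick whatever default suits your data
    data_list = data_list[columns]  # restore the requested column order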
You could also just drop the columns that cause the problem:
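For example, something along these lines, again assuming blob_data and columns from your snippet; this reads the whole file first and then keeps only the requested columns that are actually present:

import io
import pandas as pd

raw = blob_data.readall()
data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
if columns is not None:
    # intersect the requested columns with what the file actually contains
    present = [col for col in columns if col in data_list.columns]
    data_list = data_list[present]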