I am writing a script to validate XML files (the code for each validation rule lives in the validation_rules module), and its execution time was getting too long, so I decided to use the multiprocessing module to process the DataFrame in parallel. However, when I run my code, I get the following error:
_pickle.PicklingError: Can't pickle <class 'pandas.core.frame.Pandas'>: attribute lookup Pandas on pandas.core.frame failed
This is the code I've tried:
from multiprocessing import Pool, cpu_count

import pandas as pd

import validation_rules


def validate_xml(df_results):
    # Create a Pool of processes
    pool = Pool(cpu_count())
    # Map the function to the rows in parallel
    result_list = pool.map(apply_validation_function (df_results.itertuples(index=False), df_results))
    # Combine the results into a dataframe
    df_results[['status', 'comments']] = pd.DataFrame(result_list)
    return df_results


def apply_validation_function(row, df_results):
    function_name = str(row['function'])
    if isinstance(function_name, str) and function_name != 'nan':
        try:
            # Look up the rule by name in the validation_rules module
            function = getattr(validation_rules, function_name)
            result = function(df_results, row.name)
            return pd.Series({'status': result[0], 'comments': result[1]})
        except Exception as e:
            return pd.Series({'status': 'Error', 'comments': f'Error: {e}'})
    else:
        return pd.Series({'status': '', 'comments': ''})
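My current understanding of the error, which may be incomplete: itertuples() builds its row class (named Pandas) on the fly with collections.namedtuple, so pickle cannot find that class by name in pandas.core.frame when a row has to be sent to a worker process. The check below is a small sketch with toy data; can_pickle is a helper name I made up for illustration, and the column values are placeholders.

```python
import pickle

import pandas as pd


def can_pickle(obj):
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


# Toy data standing in for df_results
df = pd.DataFrame({"line_idx": [0], "function": ["some_rule"]})
row = next(df.itertuples(index=False))

print(type(row).__name__)         # name of the dynamically created row class
print(can_pickle(row))            # the namedtuple row itself
print(can_pickle(row._asdict()))  # the same row as a plain dict
print(can_pickle(tuple(row)))     # the same row as a plain tuple
```

If the namedtuple row fails while the dict and tuple versions succeed, that would suggest converting rows to plain Python containers before handing them to the pool.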
Before the validation, df_results has the following columns:
- line_idx
- line_text
- rule_id
- rule_tag_id
- function (the name of the function in the module validation_rules)
After the validation, it has the previous columns plus:
- status
- comments
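One direction that seems workable is sketched below, under several assumptions. pool.map takes the function and the iterable as two separate arguments, so the sketch sends plain (index, function_name) tuples rather than itertuples() rows, and loads the DataFrame into each worker once through the Pool initializer. The names _rule_not_empty, _init_worker, the processes parameter, and the stand-in validation_rules namespace are hypothetical and exist only so the sketch runs on its own; the real rules keep the function(df_results, index) signature from my code above.

```python
from multiprocessing import Pool, cpu_count
import types

import pandas as pd


def _rule_not_empty(df, idx):
    """Hypothetical rule: pass when line_text is non-empty."""
    text = df.at[idx, "line_text"]
    return ("OK", "") if text else ("Failed", "line_text is empty")


# Hypothetical stand-in so the sketch is self-contained; the real
# script would `import validation_rules` instead.
validation_rules = types.SimpleNamespace(not_empty=_rule_not_empty)

_df = None  # per-process reference to the DataFrame, set by _init_worker


def _init_worker(df):
    """Store the DataFrame once per worker instead of once per row."""
    global _df
    _df = df


def apply_validation_function(task):
    """Run one rule; `task` is a plain (index, function_name) tuple."""
    idx, function_name = task
    # Missing rule names (NaN or empty) yield empty status/comments
    if not isinstance(function_name, str) or not function_name:
        return ("", "")
    try:
        function = getattr(validation_rules, function_name)
        status, comments = function(_df, idx)
        return (status, comments)
    except Exception as e:
        return ("Error", f"Error: {e}")


def validate_xml(df_results, processes=None):
    """Return a copy with status/comments; processes=1 skips the Pool."""
    tasks = list(df_results["function"].items())
    if processes == 1:
        _init_worker(df_results)
        results = [apply_validation_function(t) for t in tasks]
    else:
        with Pool(processes or cpu_count(), initializer=_init_worker,
                  initargs=(df_results,)) as pool:
            results = pool.map(apply_validation_function, tasks)
    out = df_results.copy()
    out[["status", "comments"]] = pd.DataFrame(
        results, columns=["status", "comments"], index=df_results.index
    )
    return out
```

On platforms that start workers with spawn (Windows, macOS), the call to validate_xml(df_results) would need to sit under an if __name__ == "__main__": guard. Passing processes=1 runs everything in the current process, which may help when debugging a single rule.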