I am writing a script to validate XML files (the code for each validation rule lives in the validation_rules module), and execution was taking too long, so I decided to use the multiprocessing module to process the DataFrame in parallel. However, when I run my code, I get the following error:

_pickle.PicklingError: Can't pickle <class 'pandas.core.frame.Pandas'>: attribute lookup Pandas on pandas.core.frame failed

This is the code I've tried:

from multiprocessing import Pool, cpu_count

import pandas as pd

import validation_rules

def validate_xml(df_results):
    # Create a Pool of processes
    pool = Pool(cpu_count())
    # Map the function to the rows in parallel
    result_list = pool.map(apply_validation_function (df_results.itertuples(index=False), df_results))
    # Combine the results into a dataframe
    df_results[['status', 'comments']] = pd.DataFrame(result_list)
    return df_results

def apply_validation_function(row, df_results):
    function_name = str(row['function'])
    if isinstance(function_name, str) and function_name != 'nan':
        try:
            function = getattr(validation_rules, function_name)
            result = function(df_results, row.name)
            return pd.Series({'status': result[0], 'comments': result[1]})
        except Exception as e:
            return pd.Series({'status': 'Error', 'comments': f'Error: {e}'})
    else:
        return pd.Series({'status': '', 'comments': ''})

Before the validation, df_results has the following columns:

  • line_idx
  • line_text
  • rule_id
  • rule_tag_id
  • function (the name of the function in the module validation_rules)

After the validation, it has the same columns plus:

  • status
  • comments
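
A note on the likely cause, plus a minimal sketch of one possible workaround (not a verified fix for the real project). itertuples() yields namedtuples whose class, Pandas, is created on the fly, so pickle cannot look it up when multiprocessing sends the rows to the workers. Also, pool.map expects the function object and an iterable as separate arguments, not the result of calling the function. The sketch below sends only plain (index, function name) pairs to the workers and binds the DataFrame with functools.partial. The rule check_not_empty and the RULES registry are hypothetical stand-ins for the validation_rules module, which is assumed to hold functions with the signature function(df_results, idx) returning (status, comments).

```python
from functools import partial
from multiprocessing import Pool, cpu_count

import pandas as pd


# Hypothetical stand-in for one rule from the validation_rules module.
def check_not_empty(df_results, idx):
    text = df_results.at[idx, "line_text"]
    if str(text).strip():
        return "OK", ""
    return "Failed", "line_text is empty"


# Stand-in registry; the real code would use getattr(validation_rules, name).
RULES = {"check_not_empty": check_not_empty}


def apply_validation_function(item, df_results):
    """Validate one row; item is a plain (index, function name) tuple."""
    idx, function_name = item
    # A missing function name is NaN (a float), so it fails the str check.
    if isinstance(function_name, str) and function_name:
        try:
            function = RULES[function_name]
            status, comments = function(df_results, idx)
            return status, comments
        except Exception as e:
            return "Error", f"Error: {e}"
    return "", ""


def validate_xml(df_results, processes=None):
    """Run the rules in parallel and return a copy with the result columns."""
    # Plain tuples pickle cleanly, unlike the namedtuples from itertuples().
    items = list(zip(df_results.index, df_results["function"]))
    # Bind the DataFrame so pool.map only has to iterate over the items.
    worker = partial(apply_validation_function, df_results=df_results)
    with Pool(processes or cpu_count()) as pool:
        result_list = pool.map(worker, items)
    out = df_results.copy()
    out["status"] = [status for status, _ in result_list]
    out["comments"] = [comments for _, comments in result_list]
    return out
```

On platforms that start workers with spawn (Windows, and macOS by default), validate_xml should be called from under an if __name__ == "__main__": guard. The DataFrame is pickled along with the tasks, so for a very large df_results the transfer cost may cancel out part of the speed-up.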