Pandera checks for multiple columns


I am writing a Python script to validate columns in a Pandas DataFrame using the Pandera library. I am fine validating one column at a time, but I could not find out how to validate values in one column while taking into account values from another column in the DataFrame.

Specifically, I want to check whether certain values exist in one column, and if they do, also check whether the value in another column for that row contains certain text.

import numpy as np
import pandas as pd
import pandera as pa

dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
                          'column_B': ['1000', np.nan, '2000', np.nan]  # np.NaN was removed in NumPy 2.0
                          })

schema = pa.DataFrameSchema(
    columns={
        'column_A': pa.Column(pa.String),
        'column_B': pa.Column(pa.String, nullable=True,
                              checks=pa.Check(
                                  lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna())))
    }
)

schema.validate(dataframe)

This was my attempt at solving the issue, but Pandera is not aware that I want to examine multiple series while validating column_B.


Corralien (accepted answer)

If you want to run a check at the DataFrame level (look for "define a wide DataFrame-level check" in the Pandera documentation), you need to define your test outside of any column:

import numpy as np
import pandas as pd
import pandera as pa

dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
                          'column_B': ['1000', np.nan, '2000', np.nan]
                          })

# define your DataFrame-level check
check_AB = pa.Check(
    lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna()),
    name='check_AB'
)

schema = pa.DataFrameSchema(
    columns={
        'column_A': pa.Column(pa.String),
        'column_B': pa.Column(pa.String, nullable=True)
    },
    checks=check_AB  # <- dataframe wide check
)

Output:

>>> schema.validate(dataframe)
...
<Check check_AB>
failure cases:
     column  index failure_case
0  column_A      1  BBB company
1  column_A      3  CCC company
Timeless

IIUC, you need to adjust the boolean mask and isolate the checks to avoid a KeyError:

schema = pa.DataFrameSchema(
    columns={
        "column_A": pa.Column(pa.String),
        "column_B": pa.Column(pa.String, nullable=True),
    },
    checks=[
        pa.Check(
            lambda df: (
                df["column_A"].str.contains("ABC")
                & (b_notna := df["column_B"].notna())
            ).where(b_notna, True),
            ignore_na=False,
        )
    ],
)

Output:

print(schema.validate(dataframe))

      column_A column_B
0  ABC company     1000
1  BBB company      NaN
2  ABC company     2000
3  CCC company      NaN

Now, if we change the first value of column_A to "A_BC company", the check fails:

# failure cases:

#      column  index  failure_case
# 0  column_A      0  A_BC company
# 1  column_B      0          1000