Can I use a pandera DataFrameModel to validate a property involving multiple dataframes?


I have the feeling that pandera's checks are primarily designed to run on a single dataframe. Assume I have code that loads data into two dataframes, df1 and df2. For example, after loading df1, I want to load df2 and validate that all values in the column df2.col2 are members of df1.col1. This seems to require defining the check dynamically at runtime, because the allowed values are only known once df1 has been loaded. Is this possible with pandera?

EDIT There is a solution if we convert the DataFrameModel of df2 to a schema:

import pandera as pa
import pandas as pd

class MyModel(pa.DataFrameModel):
    col2: pa.typing.Series[int] = pa.Field(ge=0)

df1 = pd.DataFrame({'col1':[1,2,2]})

col1_values = df1.col1.unique()

UpdatedSchema = MyModel.to_schema().update_column('col2',
                checks=[pa.Check.isin(col1_values)])

#the validation works
df2 = pd.DataFrame({'col2':[1,2,1]})
df2 = UpdatedSchema(df2)

#the validation fails
df2 = pd.DataFrame({'col2':[1,2,3]})
df2 = UpdatedSchema(df2) #raises pandera.errors.SchemaError because 3 is not in df1.col1

But I wonder whether there is a way to do the same without having to convert the model to a DataFrameSchema?
