Creating a pd.DataFrame with the Hypothesis library


I want to write some Hypothesis-based tests that run against a randomly generated DataFrame. I tried to build the DataFrame with the following function:

import pandas as pd
from hypothesis import strategies as st

@st.composite
def create_hypothesis_df(draw):
    num_rows = draw(st.integers(min_value=1, max_value=10))  # Adjust the number of rows as needed
    data = [
        (
            draw(st.text(min_size=0, max_size=)),
            '1750',
            draw(st.datetimes()),
            draw(st.datetimes()),
            draw(st.floats(min_value=1, max_value=1000)),
            draw(st.floats(min_value=1, max_value=1000)),
            draw(st.floats(min_value=1, max_value=1000)),
            draw(st.text(min_size=0, max_size=100)),
            draw(st.text(min_size=0, max_size=100)),
        ) for _ in range(num_rows)
    ]
    columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"]
    return pd.DataFrame(data, columns=columns)

However, this always returns a DataFrame whose rows are: "", 1750, 2001-01-01, 2001-01-01, 1.000, 1.000, etc.

So basically it uses just the minimum value.

I need non-trivial values because I do some calculations in a transform function, which I want to test with something like: assert (result_df['new_column'] < input_df['col5']).all()

There is 1 answer below


Your example is not actually executable, due to the SyntaxError from st.text(min_size=0, max_size=). When I fix that, I get varied examples as expected - tested with

from hypothesis import given

@given(create_hypothesis_df())
def test(df):
    print(df)
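
For reference, here is a minimal, self-contained sketch of that check. It trims the question's strategy to two columns and fills the missing max_size with 10, which is an assumed value (the question leaves it blank). The `seen` set and the test name are hypothetical helpers added only to show that the drawn values differ between examples.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st


@st.composite
def create_hypothesis_df(draw):
    # Trimmed version of the question's strategy: two columns instead of nine.
    num_rows = draw(st.integers(min_value=1, max_value=10))
    data = [
        (
            draw(st.text(min_size=0, max_size=10)),  # max_size=10 is an assumption
            draw(st.floats(min_value=1, max_value=1000)),
        )
        for _ in range(num_rows)
    ]
    return pd.DataFrame(data, columns=["col1", "col5"])


# Hypothetical helper: collects every col5 value drawn across examples.
seen = set()


@settings(max_examples=50, deadline=None)
@given(create_hypothesis_df())
def test_values_vary(df):
    assert 1 <= len(df) <= 10
    assert df["col5"].between(1, 1000).all()
    seen.update(df["col5"].tolist())


# A @given-decorated function can be called directly, outside pytest.
test_values_vary()
```

If `seen` ends up holding more than one value after the run, the strategy is not stuck on the minimum.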

That said, I'd personally reach for Hypothesis' native support for Pandas, which would look like:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

def create_hypothesis_df():
    return data_frames(
        [
            column("col1", st.text(min_size=0, max_size=10)),
            column("col2", st.just("1750")),
            column("col3", st.datetimes()),
            column("col4", st.datetimes()),
            column("col5", st.floats(min_value=1, max_value=1000)),
            column("col6", st.floats(min_value=1, max_value=1000)),
            column("col7", st.floats(min_value=1, max_value=1000)),
            column("col8", st.text(min_size=0, max_size=100)),
            column("col9", st.text(min_size=0, max_size=100)),
        ],
        index=range_indexes(min_size=1, max_size=10),
    )

The idioms look pretty similar in this case, but this version is much easier to extend to sparse data, specific column dtypes, or other constraints; and is usually faster as data size grows.
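
As a rough sketch of both points, the snippet below pins a dtype on one column and runs the kind of property check described in the question. The `transform` function is a hypothetical stand-in (the real one is not shown in the question), and only two of the nine columns are kept for brevity.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the question's transform function:
    # halving col5 keeps new_column strictly below col5 whenever col5 >= 1.
    out = df.copy()
    out["new_column"] = out["col5"] / 2
    return out


input_dfs = data_frames(
    [
        column("col2", st.just("1750")),
        # dtype=float pins the column to float64 explicitly.
        column("col5", st.floats(min_value=1, max_value=1000), dtype=float),
    ],
    index=range_indexes(min_size=1, max_size=10),
)


@settings(max_examples=50, deadline=None)
@given(input_dfs)
def test_new_column_below_col5(input_df):
    result_df = transform(input_df)
    assert (result_df["new_column"] < input_df["col5"]).all()


# Calling the test directly runs all generated examples.
test_new_column_below_col5()
```

Swapping in the real transform function only requires replacing the body of `transform`; the strategy and the assertion stay the same.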