Perform Excel MAXIFS in Pandas with multiple conditions


Issue

I'm trying to use Pandas to create a new column that holds, for each id/date row, the maximum value over the next 2 days for the same id.

Below are the input data and the output I expect to get. I came up with a naive solution using iterrows (see below), but I would like to:

  1. find a more optimized/pythonic way to do it
  2. find a more generic way, so that the conditions can be passed as method parameters to perform other MAXIFS-style computations

n.b.1 This question is a generalization of Perform Excel MAXIFS in Pandas

n.b.2 It is equivalent to applying this Excel MAXIFS formula to each row: MAXIFS(C:C;A:A;A2;B:B;">"&B2;B:B;"<="&B2+2) where: A = id; B = date; C = value
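
To make the three criteria concrete, here is a minimal sketch that evaluates them for a single row (index 2: id "b", 2023-01-03). It rebuilds the input data so that it runs on its own. The choice of row and the use of pd.Timedelta are illustrative assumptions, not part of the original formula.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7],
})

row = df.loc[2]  # the "current" row: id "b", 2023-01-03

# One boolean mask per MAXIFS criterion
same_id = df["id"] == row["id"]                               # A:A;A2
after = df["date"] > row["date"]                              # B:B;">"&B2
within_2d = df["date"] <= row["date"] + pd.Timedelta(days=2)  # B:B;"<="&B2+2

# Max of column C over the rows that satisfy all criteria
result = df.loc[same_id & after & within_2d, "value"].max()
print(result)  # 24 (the "b" rows on 2023-01-04 and 2023-01-05 hold 20 and 24)
```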

Input data

from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11).tolist(),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7]
})

df
#    id       date  value
# 0   a 2023-01-01      3
# 1   a 2023-01-02     10
# 2   b 2023-01-03      2
# 3   b 2023-01-04     20
# 4   b 2023-01-05     24
# 5   b 2023-01-06      9
# 6   a 2023-01-07     21
# 7   b 2023-01-08      7
# 8   a 2023-01-09     25
# 9   b 2023-01-10     12
# 10  b 2023-01-11      7

Expected output

expected_col = [10, np.nan, 24, 24, 9, 7, 25, 12, np.nan, 7, np.nan]
df_expected = pd.concat([df, pd.DataFrame({"next_2d_max": expected_col})], axis=1)

df_expected
#    id       date  value  next_2d_max
# 0   a 2023-01-01      3         10.0
# 1   a 2023-01-02     10          NaN
# 2   b 2023-01-03      2         24.0
# 3   b 2023-01-04     20         24.0
# 4   b 2023-01-05     24          9.0
# 5   b 2023-01-06      9          7.0
# 6   a 2023-01-07     21         25.0
# 7   b 2023-01-08      7         12.0
# 8   a 2023-01-09     25          NaN
# 9   b 2023-01-10     12          7.0
# 10  b 2023-01-11      7          NaN

Naive solution

I came up with a solution that iterates over df row by row to find each local max. I assume this will not scale well to a DataFrame with millions of rows. Can you help me figure out a better solution?

import pandas as pd
from datetime import timedelta


def get_local_max(df, row):
    local_max = df[
        (df["id"] == row["id"])
        & (df["date"] > row["date"])
        & (df["date"] <= row["date"] + timedelta(days=2))
    ]["value"].max()
    return local_max


def get_all_max(df):
    for index, row in df.iterrows():
        yield get_local_max(df, row)

# Assign by position so the result does not depend on df's index labels
df["next_2d_max"] = list(get_all_max(df))
pd.testing.assert_frame_equal(df, df_expected)
Answer

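You can try a forward-looking rolling window per id: reindex each group to daily frequency with asfreq so that missing days become NaN, take the rolling max over a forward window, then shift by one row so that the current day is excluded: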

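# https://stackoverflow.com/a/74359384/10035985
# A forward window of size 2 covers the current row and the next one;
# after the shift(-1) in fn it covers the next 2 days, excluding the current day.
# (window_size=3 would look 3 days ahead, which only happens to match on this data.)
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)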


def fn(x):
    return (
        x.set_index("date")
        .asfreq("1D")
        .rolling(indexer, min_periods=1)["value"]
        .max()
        .shift(-1)
    )


out = pd.merge(
    df, df.groupby("id").apply(fn), left_on=["id", "date"], right_index=True
).rename(columns={"value_x": "value", "value_y": "next_2d_max"})
print(out)

Prints:

   id       date  value  next_2d_max
0   a 2023-01-01      3         10.0
1   a 2023-01-02     10          NaN
2   b 2023-01-03      2         24.0
3   b 2023-01-04     20         24.0
4   b 2023-01-05     24          9.0
5   b 2023-01-06      9          7.0
6   a 2023-01-07     21         25.0
7   b 2023-01-08      7         12.0
8   a 2023-01-09     25          NaN
9   b 2023-01-10     12          7.0
10  b 2023-01-11      7          NaN
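
If you also want the conditions passed as parameters (point 2 of the question), one possible sketch is a self-merge on the key column followed by a filter on the date difference. The function name next_days_max, its parameters, and the helper columns _row and _other_date are hypothetical names chosen for illustration. Note that the merge pairs every row with every other row sharing the same id, so memory use grows quadratically with group size; this is an assumption-laden alternative, not a benchmarked replacement for the rolling approach.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def next_days_max(df, key="id", date="date", value="value", days=2):
    """Max of `value` over rows with the same `key` whose `date` falls
    in (current date, current date + `days`]. Hypothetical helper."""
    # Tag each row with its position so results can be mapped back
    left = df[[key, date]].assign(_row=np.arange(len(df)))
    right = df[[key, date, value]].rename(columns={date: "_other_date"})

    # Pair every row with all rows sharing the same key
    pairs = left.merge(right, on=key)

    # Keep only pairs where the other row is strictly later, within `days`
    delta = pairs["_other_date"] - pairs[date]
    in_window = (delta > pd.Timedelta(0)) & (delta <= pd.Timedelta(days=days))

    # Max per original row; rows without any match become NaN
    best = pairs.loc[in_window].groupby("_row")[value].max()
    best = best.reindex(np.arange(len(df)))
    return pd.Series(best.to_numpy(), index=df.index, name=f"next_{days}d_max")


df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7],
})
df["next_2d_max"] = next_days_max(df, days=2)
print(df)
```

Changing days (or the column names) gives other windows without touching the function body; other criteria would need extra masks alongside in_window.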