Fill pandas column forward iteratively, but without using iteration?


I have a pandas DataFrame with a column "Valid" whose value is set based on other elements in the frame (not shown). Additionally, I have a column "ExtendsValid" that extends validity one row further with the following rule:

If a valid row is directly followed by a row with ExtendsValid = 1, that row is also valid, even if the underlying valid condition doesn't apply. Validity keeps propagating forward as long as ExtendsValid is 1.

I have illustrated the desired result in the column "FinalValid" (it doesn't have to be a new column; filling "Valid" forward would also work). Note that rows 8 and 9 in the example also become valid. Also note that row 13 does NOT result in FinalValid, because a preceding valid row is required; that preceding row can be valid either directly or through extension.

So far if I had a problem like that I did a cumbersome multi-step process:

  1. Create a new column that is true when "Valid" or "ExtendsValid" is true
  2. Create a new column marking the start point of each "sub-series" (a consecutive run of ones)
  3. Number each sub-series
  4. Fill forward (fillna) within each sub-series using groupby
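For reference, the steps above can be sketched roughly as follows (the data is copied from the example table below; using `transform("first")` in place of the fillna step, so the details are illustrative rather than my exact code):

```python
import pandas as pd

df = pd.DataFrame({
    "Valid":        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ExtendsValid": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# 1. Mark rows where either flag is set
either = df["Valid"] | df["ExtendsValid"]
# 2./3. Number each consecutive run: a new group starts whenever
#       `either` changes value compared to the previous row
group = either.ne(either.shift()).cumsum()
# 4. A run of ones counts only if it starts on a Valid row
starts_valid = df["Valid"].groupby(group).transform("first")
df["FinalValid"] = (either & starts_valid).astype(int)
```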

I can provide sample code, but I am really looking for a completely different, more efficient approach, which of course must also be non-iterating.

Any ideas would be welcome.

 #  Valid  ExtendsValid  FinalValid
 1      0             0           0
 2      1             0           1
 3      0             0           0
 4      0             0           0
 5      1             0           1
 6      0             0           0
 7      1             0           1
 8      0             1           1
 9      0             1           1
10      0             0           0
11      1             0           1
12      0             0           0
13      0             1           0
14      0             0           0

There are 2 answers below.

BEST ANSWER

IIUC, you want to forward-fill the 1s only where there is an uninterrupted run of 1s that starts on Valid and then continues on ExtendsValid.

For this you can use a groupby.cummin:

df['FinalValid'] = (
    (df['Valid'] | df['ExtendsValid'])
    .groupby(df['Valid'].cumsum())
    .cummin()
)

Output:

NB. I slightly modified the input on row 3 to better illustrate the logic.

     #  Valid  ExtendsValid  FinalValid
0    1      0             0           0
1    2      1             0           1
2    3      0             0           0
3    4      0             1           0
4    5      1             0           1
5    6      0             0           0
6    7      1             0           1
7    8      0             1           1
8    9      0             1           1
9   10      0             0           0
10  11      1             0           1
11  12      0             0           0
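For completeness, applying the same one-liner to the unmodified input from the question reproduces the desired FinalValid column:

```python
import pandas as pd

df = pd.DataFrame({
    "Valid":        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ExtendsValid": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# Each group starts at a Valid row; within a group, cummin keeps
# the run of 1s alive until the first row that is neither Valid
# nor ExtendsValid, and stays 0 afterwards.
df["FinalValid"] = (
    (df["Valid"] | df["ExtendsValid"])
    .groupby(df["Valid"].cumsum())
    .cummin()
)
```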

Try:

state, data = 0, []
for v, e in zip(df.Valid, df.ExtendsValid):
    # A row is valid if Valid is set, or if it extends a valid run
    state = int(v or (e and state))
    data.append(state)

df["FinalValid_new"] = data
print(df)

Prints:

     #  Valid  ExtendsValid  FinalValid  FinalValid_new
0    1      0             0           0               0
1    2      1             0           1               1
2    3      0             0           0               0
3    4      0             0           0               0
4    5      1             0           1               1
5    6      0             0           0               0
6    7      1             0           1               1
7    8      0             1           1               1
8    9      0             1           1               1
9   10      0             0           0               0
10  11      1             0           1               1
11  12      0             0           0               0
12  13      0             1           0               0
13  14      0             0           0               0

Note: For more performance I recommend looking at the numba package.


EDIT: Example using numba:

import numpy as np
from numba import njit


@njit
def compute_valid(valid, extends_valid, out):
    state = 0
    for i in range(len(valid)):
        v, e = valid[i], extends_valid[i]
        # A row is valid if Valid is set, or if it extends a valid run
        state = 1 if v != 0 else (1 if (e != 0 and state == 1) else 0)
        out[i] = state


# Write into a fresh array rather than a view returned by
# Series.to_numpy(), which may be a (read-only) copy.
out = np.zeros(len(df), dtype=np.int64)
compute_valid(df.Valid.to_numpy(), df.ExtendsValid.to_numpy(), out)
df["FinalValid_new"] = out
print(df)