Identifying row numbers where value is stable before and after the value in the column hits a specified value


I have a pandas dataframe like so:

import pandas as pd

data = {
    'ID': ['A'] * 42 + ['B'] * 10,
    'column_1': [0, 0, 0, 0, 0.1, 1, 1.5, 2, 3, 4, 4.5, 5, 4.9, 3, 2, 1.8, 1, 0, 0, 1,
                 3, 0, 1.3, 2, 3, 4.3, 4.8, 5, 4.2, 3.5, 3, 2.6, 2, 1.9, 1, 0, 0, 0, 0, 0,
                 0.1, 0.2, 0.3, 1, 2, 3, 5, 4, 2, 0.5, 0, 0],
    'column_2': [13, 25, 96, 59, 5, 92, 82, 141, 50, 85, 84, 113, 119, 128, 8, 133, 82, 10, 15, 62,
                 11, 68, 18, 24, 37, 55, 83, 48, 13, 81, 43, 36, 56, 43, 36, 46, 45, 127, 55, 67,
                 113, 98, 78, 78, 57, 131, 121, 126, 142, 51, 64, 95],
}
df = pd.DataFrame(data)


    ID  column_1  column_2
0   A       0.0        13
1   A       0.0        25
2   A       0.0        96
3   A       0.0        59
4   A       0.1         5
5   A       1.0        92
6   A       1.5        82
7   A       2.0       141
8   A       3.0        50
9   A       4.0        85
10  A       4.5        84
11  A       5.0       113
12  A       4.9       119
13  A       3.0       128
14  A       2.0         8
15  A       1.8       133
16  A       1.0        82
17  A       0.0        10
18  A       0.0        15
19  A       1.0        62
20  A       3.0        11
21  A       0.0        68
22  A       1.3        18
23  A       2.0        24
24  A       3.0        37
25  A       4.3        55
26  A       4.8        83
27  A       5.0        48
28  A       4.2        13
29  A       3.5        81
30  A       3.0        43
31  A       2.6        36
32  A       2.0        56
33  A       1.9        43
34  A       1.0        36
35  A       0.0        46
36  A       0.0        45
37  A       0.0       127
38  A       0.0        55
39  A       0.0        67
40  A       0.1       113
41  A       0.2        98
42  B       0.3        78
43  B       1.0        78
44  B       2.0        57
45  B       3.0       131
46  B       5.0       121
47  B       4.0       126
48  B       2.0       142
49  B       0.5        51
50  B       0.0        64
51  B       0.0        95

Tracing back from when the value in column_1 hits 5, I want to find the value in column_2 just before column_1 increased from 0 and just after it came back down to 0. So, in the data frame above, those column_2 values would be 5 and 10 for the first run and 18 and 46 for the second (and 78 and 64 for ID B). I want to perform some arithmetic on them, so I would like to add two columns, Before and After, holding those values, grouped by the ID column. The expected output would be:

    ID  column_1  column_2  Before  After
0   A       0.0        13       0      0
1   A       0.0        25       0      0
2   A       0.0        96       0      0
3   A       0.0        59       0      0
4   A       0.1         5       0      0
5   A       1.0        92       0      0
6   A       1.5        82       0      0
7   A       2.0       141       0      0
8   A       3.0        50       0      0
9   A       4.0        85       0      0
10  A       4.5        84       0      0
11  A       5.0       113       5     10
12  A       4.9       119       0      0
13  A       3.0       128       0      0
14  A       2.0         8       0      0
15  A       1.8       133       0      0
16  A       1.0        82       0      0
17  A       0.0        10       0      0
18  A       0.0        15       0      0
19  A       1.0        62       0      0
20  A       3.0        11       0      0
21  A       0.0        68       0      0
22  A       1.3        18       0      0
23  A       2.0        24       0      0
24  A       3.0        37       0      0
25  A       4.3        55       0      0
26  A       4.8        83       0      0
27  A       5.0        48      18     46
28  A       4.2        13       0      0
29  A       3.5        81       0      0
30  A       3.0        43       0      0
31  A       2.6        36       0      0
32  A       2.0        56       0      0
33  A       1.9        43       0      0
34  A       1.0        36       0      0
35  A       0.0        46       0      0
36  A       0.0        45       0      0
37  A       0.0       127       0      0
38  A       0.0        55       0      0
39  A       0.0        67       0      0
40  A       0.1       113       0      0
41  A       0.2        98       0      0
42  B       0.3        78       0      0
43  B       1.0        78       0      0
44  B       2.0        57       0      0
45  B       3.0       131       0      0
46  B       5.0       121      78     64
47  B       4.0       126       0      0
48  B       2.0       142       0      0
49  B       0.5        51       0      0
50  B       0.0        64       0      0
51  B       0.0        95       0      0

For a given ID, if column_1 starts with a non-zero value, Before should take the first value of column_2 for that group.

The rest of the rows in Before and After can be filled with null or zeroes. Is there an elegant way to achieve this?

There are 2 answers below.

Accepted answer, by fsimonjetz:

With where and shift you can generate series in which only the endpoints of the subsequences are filled and the rest is padded with NaN. Combined with ffill (or bfill, respectively), locating the rows where column_1 equals 5 gives pretty much the desired result:

col1 = df.column_1
col2 = df.column_2

# keep column_2 where column_1 just rose from 0, then carry that value forward to the 5s
df.loc[col1.eq(5), "Before"] = col2.where(col1.gt(0) & col1.shift().eq(0)).ffill()
# keep column_2 where column_1 just fell back to 0, then carry that value backward to the 5s
df.loc[col1.eq(5), "After"] = col2.where(col1.eq(0) & col1.shift().gt(0)).bfill()

Result:

   ID  column_1  column_2  Before  After
0   A       0.0        13     NaN    NaN
1   A       0.0        25     NaN    NaN
2   A       0.0        96     NaN    NaN
3   A       0.0        59     NaN    NaN
4   A       0.1         5     NaN    NaN
5   A       1.0        92     NaN    NaN
6   A       1.5        82     NaN    NaN
7   A       2.0       141     NaN    NaN
8   A       3.0        50     NaN    NaN
9   A       4.0        85     NaN    NaN
10  A       4.5        84     NaN    NaN
11  A       5.0       113     5.0   10.0
12  A       4.9       119     NaN    NaN
13  A       3.0       128     NaN    NaN
14  A       2.0         8     NaN    NaN
15  A       1.8       133     NaN    NaN
16  A       1.0        82     NaN    NaN
17  A       0.0        10     NaN    NaN
18  A       0.0        15     NaN    NaN
19  A       1.0        62     NaN    NaN
20  A       3.0        11     NaN    NaN
21  A       0.0        68     NaN    NaN
22  A       1.3        18     NaN    NaN
23  A       2.0        24     NaN    NaN
24  A       3.0        37     NaN    NaN
25  A       4.3        55     NaN    NaN
26  A       4.8        83     NaN    NaN
27  A       5.0        48    18.0   46.0
28  A       4.2        13     NaN    NaN
29  A       3.5        81     NaN    NaN
30  A       3.0        43     NaN    NaN
31  A       2.6        36     NaN    NaN
32  A       2.0        56     NaN    NaN
33  A       1.9        43     NaN    NaN
34  A       1.0        36     NaN    NaN
35  A       0.0        46     NaN    NaN
36  A       0.0        45     NaN    NaN
37  A       0.0       127     NaN    NaN
38  A       0.0        55     NaN    NaN
39  A       0.0        67     NaN    NaN
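
To see what the masking step does in isolation, here is a toy sketch with made-up values (not the question's data): where plus shift leaves only the rise/fall endpoints of a run and pads everything else with NaN, so ffill/bfill can carry those endpoint values to the row where column_1 equals 5:

```python
import pandas as pd

col1 = pd.Series([0, 0, 1, 3, 5, 2, 0, 0])
col2 = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# col2 only where col1 just rose from 0 (row 2), NaN elsewhere
rise = col2.where(col1.gt(0) & col1.shift().eq(0))
# col2 only where col1 just fell back to 0 (row 6), NaN elsewhere
fall = col2.where(col1.eq(0) & col1.shift().gt(0))

# ffill carries 30 forward, bfill carries 70 backward, to row 4 where col1 == 5
print(rise.ffill().iloc[4], fall.bfill().iloc[4])  # 30.0 70.0
```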

With groupby()

def set_flags(frame):
    col1 = frame.column_1
    col2 = frame.column_2
    m = col1.eq(5)

    # fill_value=0 makes a non-zero start of a group count as a rise from 0
    frame.loc[m, "Before"] = col2.where(col1.gt(0) & col1.shift(fill_value=0).eq(0)).ffill()
    frame.loc[m, "After"] = col2.where(col1.eq(0) & col1.shift().gt(0)).bfill()

    return frame

df[["Before", "After"]] = 0  # initializing the columns with 0 keeps the int type
df = df.groupby("ID", group_keys=False).apply(set_flags)

Result with the new data:

   ID  column_1  column_2  Before  After
0   A       0.0        13       0      0
1   A       0.0        25       0      0
2   A       0.0        96       0      0
3   A       0.0        59       0      0
4   A       0.1         5       0      0
5   A       1.0        92       0      0
6   A       1.5        82       0      0
7   A       2.0       141       0      0
8   A       3.0        50       0      0
9   A       4.0        85       0      0
10  A       4.5        84       0      0
11  A       5.0       113       5     10
12  A       4.9       119       0      0
13  A       3.0       128       0      0
14  A       2.0         8       0      0
15  A       1.8       133       0      0
16  A       1.0        82       0      0
17  A       0.0        10       0      0
18  A       0.0        15       0      0
19  A       1.0        62       0      0
20  A       3.0        11       0      0
21  A       0.0        68       0      0
22  A       1.3        18       0      0
23  A       2.0        24       0      0
24  A       3.0        37       0      0
25  A       4.3        55       0      0
26  A       4.8        83       0      0
27  A       5.0        48      18     46
28  A       4.2        13       0      0
29  A       3.5        81       0      0
30  A       3.0        43       0      0
31  A       2.6        36       0      0
32  A       2.0        56       0      0
33  A       1.9        43       0      0
34  A       1.0        36       0      0
35  A       0.0        46       0      0
36  A       0.0        45       0      0
37  A       0.0       127       0      0
38  A       0.0        55       0      0
39  A       0.0        67       0      0
40  A       0.1       113       0      0
41  A       0.2        98       0      0
42  B       0.3        78       0      0
43  B       1.0        78       0      0
44  B       2.0        57       0      0
45  B       3.0       131       0      0
46  B       5.0       121      78     64
47  B       4.0       126       0      0
48  B       2.0       142       0      0
49  B       0.5        51       0      0
50  B       0.0        64       0      0
51  B       0.0        95       0      0
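
As a quick end-to-end check, the grouped approach can be exercised on a shortened, hypothetical subset of the sample data (values taken from the question, most in-between rows dropped):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":       ["A"] * 8 + ["B"] * 6,
    "column_1": [0, 0, 0.1, 2, 5, 2, 0, 0, 0.3, 2, 5, 2, 0, 0],
    "column_2": [13, 25, 5, 141, 113, 8, 10, 15, 78, 57, 121, 142, 64, 95],
})

def set_flags(frame):
    col1, col2 = frame.column_1, frame.column_2
    m = col1.eq(5)
    # fill_value=0 lets group B, which starts at 0.3, count as a rise from 0
    frame.loc[m, "Before"] = col2.where(col1.gt(0) & col1.shift(fill_value=0).eq(0)).ffill()
    frame.loc[m, "After"] = col2.where(col1.eq(0) & col1.shift().gt(0)).bfill()
    return frame

df[["Before", "After"]] = 0
df = df.groupby("ID", group_keys=False).apply(set_flags)

# the rows where column_1 == 5 now carry the endpoint values
print(df.loc[df.column_1.eq(5), ["ID", "Before", "After"]])
```

Group A's 5-row picks up Before=5 and After=10, and group B's picks up Before=78 (the group's first column_2 value, since column_1 starts non-zero) and After=64, matching the expected output.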
Answer by Michael Cao:
  1. Identify the indices of the 0's and 5's, then reset the index while preserving the original index. The original index becomes a column named 'index' by default.
  2. Use the new index to grab the 0's immediately before and immediately after each 5.
  3. After grabbing those zeros, cross-reference back to the original index (adding 1 for before) to grab the relevant column_2 values.

zeros_fives = df.loc[df['column_1'].isin([0, 5])].reset_index()

zero_before_five = zeros_fives.loc[zeros_fives['column_1'] == 5].index - 1
zero_after_five = zeros_fives.loc[zeros_fives['column_1'] == 5].index + 1

before = zeros_fives.loc[zero_before_five, 'index'] + 1
after = zeros_fives.loc[zero_after_five, 'index']

df.loc[df['column_1'] == 5, 'before'] = df.loc[before, 'column_2'].values
df.loc[df['column_1'] == 5, 'after'] = df.loc[after, 'column_2'].values
df.fillna(0, inplace=True)
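
A minimal check of this index-based approach on a shortened, hypothetical single-ID slice of the sample data (note the +1/-1 index arithmetic assumes every 5 has a 0 somewhere before and after it within the frame):

```python
import pandas as pd

df = pd.DataFrame({
    "column_1": [0, 0, 0.1, 2, 5, 2, 0, 0],
    "column_2": [13, 25, 5, 141, 113, 8, 10, 15],
})

# keep only the 0 and 5 rows; the old row labels move into an 'index' column
zeros_fives = df.loc[df["column_1"].isin([0, 5])].reset_index()

# positions of the last 0 before and the first 0 after each 5
zero_before_five = zeros_fives.loc[zeros_fives["column_1"] == 5].index - 1
zero_after_five = zeros_fives.loc[zeros_fives["column_1"] == 5].index + 1

# map back to original row labels (+1 steps from the last 0 to the first non-zero row)
before = zeros_fives.loc[zero_before_five, "index"] + 1
after = zeros_fives.loc[zero_after_five, "index"]

df.loc[df["column_1"] == 5, "before"] = df.loc[before, "column_2"].values
df.loc[df["column_1"] == 5, "after"] = df.loc[after, "column_2"].values
df = df.fillna(0)

print(df.loc[df["column_1"] == 5, ["before", "after"]])  # before=5.0, after=10.0
```

The 5 at row 4 picks up before=5 (row 2, the first non-zero row of the run) and after=10 (row 6, the first zero after the run), as in the question's expected output.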