filter for specific sequences involving multiple columns and surrounding rows


I have data that looks like this:

[screenshot of the dataframe: time, open, high, low, close plus the 'major_check' and 'minor_check' columns]

It's standard financial price data (open, high, low, close).

In addition, I run some calculations. 'major_check' occasionally returns 1 or 2 (which 'minor_check' will then also return). 'minor_check' also returns 1 or 2, but more frequently. The rest is filled with 0 or NaN.

I'd like to test for specific patterns:

  • Whenever there is a 2 in 'major_check', I want to see if I can find a 21212 pattern in 'minor_check', with 21 preceding the central 2 and 12 following it.
  • If there is a 1 in 'major_check', I'd like to find a 12121 pattern in 'minor_check', centred on that 1 in the same way.

I highlighted a 21212 pattern in the screenshot to give a better idea of what I am looking for.

Once a 21212 or 12121 pattern is found, I'll check whether specific rules on the open/high/low/close values of the 5 rows constituting the pattern are met.

Of course, one could naively iterate through the dataframe, but this doesn't sound like the Pythonic way to do it.

I didn't manage to find a good way to do this, since a 21212 pattern can have some 0s (or NaNs) inside it.
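To make that concrete, the kind of row-by-row loop I mean is sketched below (using the column names above, and skipping the 0/NaN rows by hand). It would do the job, but it is exactly the sort of manual iteration I was hoping pandas could replace:

# rough sketch: walk only over the rows where 'minor_check' is a signal (1 or 2)
signal_idx = df.index[df['minor_check'].isin([1, 2])]

for i in range(2, len(signal_idx) - 2):
    idx = signal_idx[i]
    window = df.loc[signal_idx[i - 2 : i + 3], 'minor_check'].tolist()
    if df.at[idx, 'major_check'] == 2 and window == [2, 1, 2, 1, 2]:
        print(f"21212 centred on row {idx}")
    elif df.at[idx, 'major_check'] == 1 and window == [1, 2, 1, 2, 1]:
        print(f"12121 centred on row {idx}")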


There are 3 best solutions below

OCa On BEST ANSWER

As this answer by Timeless looked surprisingly complex, here is a simpler one.

Method:

  • Temporarily remove the rows that carry no test result (effectively skipping NaN and None),
  • Search for patterns
    • either with numpy.where and pandas.shift to check for patterns row-wise (faster),
    • or, preferably, with pandas.rolling (more compact, but still readable).
  • Finally, report the findings back into the original df.

You haven't specified how to flag the findings. Here they get marked as True in two new columns, "hit1" and "hit2", one for each pattern, appended to the original dataframe for whatever use you plan for them.

Input data
There is no text input data in your post, so for now I came up with my own. It is designed to produce one hit for each pattern:

  • "minor test" 12121 + "major test" 1 at index 14
  • "minor test" 21212 + "major test" 2 at index 12
import pandas as pd
import numpy as np

# Start dataframe
df = pd.DataFrame({'minor': [None,2,   1,   None,None,2,   None,2,   None,None,None,
                             1,   2,   None,1,   None,None,2,   None,1,   2,   None,None],
                   'major': [None,None,None,None,None,2,   None,None,None,None,None,
                             None,2,   None,1,   None,None,None,   None,None,None,None,None]})
df

    minor  major
0     NaN    NaN
1     2.0    NaN
2     1.0    NaN
3     NaN    NaN
4     NaN    NaN
5     2.0    2.0
6     NaN    NaN
7     2.0    NaN
8     NaN    NaN
9     NaN    NaN
10    NaN    NaN
11    1.0    NaN
12    2.0    2.0
13    NaN    NaN
14    1.0    1.0
15    NaN    NaN
16    NaN    NaN
17    2.0    NaN
18    NaN    NaN
19    1.0    NaN
20    2.0    NaN
21    NaN    NaN
22    NaN    NaN

Locate hits
Skip rows without test results

# Remove rows without minor test result
df1 = df.dropna(axis=0,subset='minor').copy()
# No reset_index because we'll use it to report back to df.

# Patterns of 'minor test' 
minor_pat1 = [1,2,1,2,1]
minor_pat2 = [2,1,2,1,2]

Pattern search:

  • alternative 1: .shift() and np.where
# Deploy shift columns, looking 2 values backwards and 2 forward
for i in range(-2,3):              # i in [-2,-1, 0, 1, 2]
    df1[i] = df1['minor'].shift(i) # create a column named i

# Test for both patterns and major test value
df1['hit1'] = np.where(df1['major']==1, # case 12121
                       df1[list(range(-2,3))].eq(minor_pat1).all(axis=1),
                       False)
df1['hit2'] = np.where(df1['major']==2, # case 21212
                       df1[list(range(-2,3))].eq(minor_pat2).all(axis=1),
                       False)
df1 # Temporary dataframe with findings located

    minor  major   -2   -1    0    1    2   hit1   hit2
1     2.0    NaN  2.0  1.0  2.0  NaN  NaN  False  False
2     1.0    NaN  2.0  2.0  1.0  2.0  NaN  False  False
5     2.0    2.0  1.0  2.0  2.0  1.0  2.0  False  False
7     2.0    NaN  2.0  1.0  2.0  2.0  1.0  False  False
11    1.0    NaN  1.0  2.0  1.0  2.0  2.0  False  False
12    2.0    2.0  2.0  1.0  2.0  1.0  2.0  False   True
14    1.0    1.0  1.0  2.0  1.0  2.0  1.0   True  False
17    2.0    NaN  2.0  1.0  2.0  1.0  2.0  False  False
19    1.0    NaN  NaN  2.0  1.0  2.0  1.0  False  False
20    2.0    NaN  NaN  NaN  2.0  1.0  2.0  False  False
  • alternative 2: one-liner with rolling, preferred:

.rolling() was designed for exactly that purpose.
Too bad they haven't implemented .rolling().eq() yet (list of window functions).
This is why we must resort to applying .eq() from inside a lambda function.

df1['hit1'] = (df1['major']==1) & (df1['minor']
                                   .rolling(window=5, center=True)
                                   .apply(lambda x : x.eq(minor_pat1).all()))

df1['hit2'] = (df1['major']==2) & (df1['minor']
                                   .rolling(window=5, center=True)
                                   .apply(lambda x : x.eq(minor_pat2).all()))

    minor  major   hit1   hit2
1     2.0    NaN  False  False
2     1.0    NaN  False  False
5     2.0    2.0  False  False
7     2.0    NaN  False  False
11    1.0    NaN  False  False
12    2.0    2.0  False   True
14    1.0    1.0   True  False
17    2.0    NaN  False  False
19    1.0    NaN  False  False
20    2.0    NaN  False  False

Finally report back to original df

df.loc[df1.index,'hit1'] = df1['hit1']
df.loc[df1.index,'hit2'] = df1['hit2']
df

    minor  major   hit1   hit2
0     NaN    NaN    NaN    NaN
1     2.0    NaN  False  False
2     1.0    NaN  False  False
3     NaN    NaN    NaN    NaN
4     NaN    NaN    NaN    NaN
5     2.0    2.0  False  False
6     NaN    NaN    NaN    NaN
7     2.0    NaN  False  False
8     NaN    NaN    NaN    NaN
9     NaN    NaN    NaN    NaN
10    NaN    NaN    NaN    NaN
11    1.0    NaN  False  False
12    2.0    2.0  False   True
13    NaN    NaN    NaN    NaN
14    1.0    1.0   True  False
15    NaN    NaN    NaN    NaN
16    NaN    NaN    NaN    NaN
17    2.0    NaN  False  False
18    NaN    NaN    NaN    NaN
19    1.0    NaN  False  False
20    2.0    NaN  False  False
21    NaN    NaN    NaN    NaN
22    NaN    NaN    NaN    NaN
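From there, the follow-up check mentioned in the question (rules on open/high/low/close over the 5 rows constituting a pattern) can reuse df1, whose index points straight back at the original rows. A minimal sketch, assuming your real df carries the OHLC columns (the toy one above doesn't) and using a made-up placeholder rule:

for col in ['hit1', 'hit2']:
    hits = df1[df1[col].fillna(False).astype(bool)]           # rows flagged as pattern centres
    for centre in hits.index:
        pos = df1.index.get_loc(centre)                       # position among the signal rows
        pattern_rows = df.loc[df1.index[pos - 2 : pos + 3]]   # the 5 rows forming the pattern
        # placeholder rule: close above open on all five rows
        if (pattern_rows['close'] > pattern_rows['open']).all():
            print(f"{col}: rule met for the pattern centred on row {centre}")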
Floriancitt On

As you're specifically focusing on obtaining the patterns in the minor check, I would start by isolating a DataFrame where the minor_check column contains valid values.

Below, we first filter the DataFrame with a boolean mask built using the .isin() method.

filtered_df = df[df['minor_check'].isin([1, 2])].reset_index(drop=True)

This will return a DataFrame only containing the rows relevant to your question.

Next, let's iterate through this DataFrame and test each relevant index for these conditions, making sure we stay within the DataFrame's bounds.

I have provided some pre-filtered sample data for demonstration purposes.

import pandas as pd

data = {
    'major_check': [0, 0, 0, 1, 1, 2, 1, 1, 2, 0, 0],
    'minor_check': [0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
}
filtered_df = pd.DataFrame(data)


def check_sequence(index, variant):
    # Make sure a full 5-row window fits inside filtered_df
    if index - 2 >= 0 and index + 2 < len(filtered_df):
        sequence = filtered_df.iloc[index - 2 : index + 3]['minor_check'].tolist()
        
        if variant == 1:
            expected_sequence = [1, 2, 1, 2, 1]
        elif variant == 2:
            expected_sequence = [2, 1, 2, 1, 2]
        
        if sequence == expected_sequence:
            return True
    return False

# Iterating through major_check
for index, row in filtered_df.iterrows():
    if row['major_check'] == 1:
        if check_sequence(index, 1):
            print(f"Found sequence 1 at index {index}")
    elif row['major_check'] == 2:
        if check_sequence(index, 2):
            print(f"Found sequence 2 at index {index}")

With this sample data, the result is:

Found sequence 1 at index 3
Found sequence 1 at index 7
Found sequence 2 at index 8

This identifies the patterns as expected.
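One caveat: because of the reset_index(drop=True) above, the printed indices refer to positions inside filtered_df rather than to rows of your original DataFrame. If you need to get back to those original rows (for instance, to apply your open/high/low/close rules), one option, sketched below, is to keep the original index and track the position separately:

# sketch: keep the original index (no reset_index) so a hit can be traced back to df
filtered_df = df[df['minor_check'].isin([1, 2])]

for pos, (orig_index, row) in enumerate(filtered_df.iterrows()):
    variant = row['major_check']
    if variant in (1, 2) and check_sequence(pos, variant):
        # the five original rows that make up the pattern
        pattern_rows = df.loc[filtered_df.index[pos - 2 : pos + 3]]
        # ...apply the open/high/low/close rules to pattern_rows here...
        print(f"Found sequence {int(variant)} at original index {orig_index}")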

Timeless On

A proposition using a hacky rolling:

def patfinder(df, pat):
    def fn(ser):
        # compare each 5-value window element-wise with the pattern (0s are masked so they never match)
        d = dict(index=ser.index, dtype="int")
        return ser.mask(ser.eq(0)).eq(pd.Series([*pat], **d)).all()

    # "m" is True at the centre of every matching window, "g" numbers the matches
    groups = ((center:=df["minor_check"].dropna().astype("int")
               .rolling(5, center=True).apply(fn).eq(1))
               .to_frame("m").join(center.mask(~center)
                        .cumsum().rename("g")))

    # spread each flagged centre over its full 5-row window, then realign to df's index
    window = (groups.mask(~center)
               .ffill(limit=2).bfill(limit=2).fillna(False)
               .reindex(df.index, fill_value=False))

    # a hit also requires the matching major_check value on the centre row
    return center & df["major_check"].eq(int(pat[2])), window

Test (on this df):

pat = "21212" # or "12121"

found, window = df.pipe(patfinder, pat)

if not df.loc[found].empty:
    print(df.loc[window["m"]].join(window["g"]))
    # print(df.loc[window["m"]]) # without counter-groups
    
          time  open  high  low  close  minor_check  major_check    g
20  02/03/2022  1.02  1.02 1.02   1.02         2.00          NaN 1.00
23  07/03/2022  1.01  1.02 1.00   1.01         1.00          NaN 1.00
25  09/03/2022  1.03  1.03 1.02   1.02         2.00         2.00 1.00
28  14/03/2022  1.03  1.03 1.03   1.03         1.00          NaN 1.00
31  17/03/2022  1.04  1.04 1.03   1.03         2.00          NaN 1.00

[5 rows x 8 columns]
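To check both patterns in one pass, a small loop over them, reusing patfinder exactly as defined above, is enough:

for pat in ("21212", "12121"):
    found, window = df.pipe(patfinder, pat)
    if not df.loc[found].empty:
        print(f"pattern {pat}:")
        print(df.loc[window["m"]].join(window["g"]))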