Filter values within a percentage range based on the last non-filtered value


I have a DataFrame with a column "Trade" that contains some positive values; the rest are NaN.

Trade
 100
 NaN
 NaN
 101
 NaN
 102
 NaN
 98
 107
 NaN
 101
 NaN
 98
 NaN
 NaN
 94

Among the non-NaN values, I need a vectorized solution to remove the values that fall within the 95%-105% range of the last value that hasn't been removed. For example, 101, 102, and 98 are dropped because they lie within 95-105 of 100, while 107 falls outside that band, so it is kept and becomes the new reference. A snippet to reproduce the input is included after the expected output below. The final result should look like this:

Trade
 100
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 107
 NaN
 101
 NaN
 NaN
 NaN
 NaN
 94
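
For reference, here is one way to reproduce the input column above (using np.nan for the blank rows; any NaN representation behaves the same way for this problem):

import pandas as pd
import numpy as np

# Build the example column from the question; np.nan marks the blank rows
df = pd.DataFrame({
    'Trade': [100, np.nan, np.nan, 101, np.nan, 102, np.nan, 98,
              107, np.nan, 101, np.nan, 98, np.nan, np.nan, 94]
})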

There are 2 best solutions below

Cameron Riddell

As others have pointed out, since each kept row depends on which rows were kept before it, there won't be a pure "single-pass" vectorized solution. However, you can still take a "shrinking" window approach to this problem that minimizes the number of iterative steps required.

import pandas as pd
from pandas import NA

df = pd.DataFrame(
    {'trade': [100, NA, NA, 101, NA, 102, NA, 98, 107, NA, 101, NA, 98, NA, NA, 94]}
).astype({'trade': 'Int32'})

tmp = df.dropna()
valid = [0]
while valid[-1] < tmp.index[-1]:
    chunk = tmp.loc[valid[-1]:, 'trade'] # get window of all unprocessed data
    target = chunk.iat[0]
    valid.append(                        # find the first boundary
        chunk.between(target * .95, target * 1.05).idxmin() 
    )

print(
    f'{valid = }',  # [0, 8, 10, 15] (the while loop ran len(valid) - 1 times)
    df.assign(      # mask over values not in `valid`
        cleaned=lambda d: d['trade'].where(d.index.isin(valid)),
    ),
    sep='\n\n',
)
# valid = [0, 8, 10, 15]
#
#     trade  cleaned
# 0     100      100
# 1    <NA>     <NA>
# 2    <NA>     <NA>
# 3     101     <NA>
# 4    <NA>     <NA>
# 5     102     <NA>
# 6    <NA>     <NA>
# 7      98     <NA>
# 8     107      107
# 9    <NA>     <NA>
# 10    101      101
# 11   <NA>     <NA>
# 12     98     <NA>
# 13   <NA>     <NA>
# 14   <NA>     <NA>
# 15     94       94

The reason I refer to this as a "shrinking" window approach is that each tick of the while-loop works on a smaller chunk of the DataFrame, until we have consumed all of it. This lets us leverage as many DataFrame/Series methods as possible, so almost none of the data processing happens at the Python level.
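
The step that keeps this fast is idxmin() on the boolean mask: because False sorts before True, it returns the index label of the first value that falls outside the band, i.e. the next boundary. A tiny standalone illustration (the labels here are made up for the example):

import pandas as pd

# Boolean mask indexed by the original row labels; idxmin() gives the label
# of the first False entry, i.e. the first out-of-range value
mask = pd.Series([True, True, True, False, True], index=[0, 3, 5, 7, 8])
print(mask.idxmin())  # -> 7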

Gus

This is what I did:

import pandas as pd

stop = 0.05

# Prepare: df is the DataFrame from the question
trade = df['Trade'].fillna(0)   # avoid fillna(..., inplace=True) on a column view
trade = trade[trade != 0]

# Filter: zero out values within +/- stop of the previous non-NaN value
trade.loc[(trade > trade.shift(1) * (1 - stop)) &
          (trade < trade.shift(1) * (1 + stop))] = 0

# Recompose DataFrame: merge the filtered series back on the index
df = df.merge(trade.rename('T1'), left_index=True,
              right_index=True, how='left')
df['Trade'] = df['T1']
df = df.drop(columns=['T1'])

In the end, the actual filtering took only one line of code, and I'm sure the rest can be improved. I thought I would have had to loop the filter a couple of times, but in the end it wasn't needed. No lambdas, no loops, and no ifs.
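
If the removed rows should show up as NaN rather than 0, as in the expected output in the question, one option is a final clean-up line, assuming there are no legitimate zero trades:

# Optional extra step (not strictly needed if 0 is an acceptable marker):
# mask() replaces the positions where the condition is True with NaN
df['Trade'] = df['Trade'].mask(df['Trade'] == 0)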