I have a pandas dataframe with a column that marks interesting points of data in another column (e.g. the locations of peaks and troughs). I often need to do some computation on the values between each marker. Is there a neat way to slice the dataframe using the markers as end points so that I can run a function on each slice? The dataframe would look like this, with the desired slices marked:
   numbers   markers
0  0.632009  None
1  0.733576  None      # Slice 1 (0,1,2)
2  0.585944  x         _________
3  0.212374  None
4  0.491948  None
5  0.324899  None      # Slice 2 (3,4,5,6)
6  0.389103  y         _________
7  0.638451  None
8  0.123557  None      # Slice 3 (7,8,9)
9  0.588472  x         _________
My current approach is to build an array of the indices where the markers occur, iterate over that array to slice the dataframe, and append each slice to a list. I end up with a list of numpy arrays that I can then apply a function to:
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': np.random.rand(10),
                   'markers': [None, None, 'x', None, None, None, 'y', None, None, 'x']})
index_array = df[df.markers.isin(['x', 'y'])].index  # the indices of the 'x'/'y' marker rows
slice_list = []
prev_i = 0  # the first slice of the dataframe starts from index 0
for i in index_array:
    new_slice = df.numbers[prev_i:i + 1].values  # i + 1 to include the end marker in the slice
    slice_list.append(new_slice)
    prev_i = i + 1  # exclude the start marker from the next slice
for j in slice_list:
    myfunction(j)
This works, but I was wondering whether there is a more idiomatic approach using fancy indexing, grouping, pivoting, or something else that I am missing. I've looked at groupby, but that doesn't work because grouping on the markers column returns only the rows where the markers are, and multi-indexes and pivot tables require unique labels. I wouldn't bother asking, except pandas has a tool for just about everything, so my expectations are probably unreasonably high.
I am not tied to ending up with a list of arrays, that was just the solution I found. I am very open to suggestions on changing the way that I structure my data from the very start if that makes things easier.
You can do this using a variant of the compare-cumsum-groupby pattern. The idea is to build a Boolean series that is True on each marker row, shift it down by one row so that each marker stays in the group it closes, and take the cumulative sum. Every row then carries an integer group label, and grouping on those labels with groupby gives exactly the slices you want, without building the list of arrays by hand.
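A minimal sketch of that pattern, using the same example data as above (the seed and the use of `sum` as the per-slice function are my additions for illustration):

    import numpy as np
    import pandas as pd

    np.random.seed(0)  # only so the example is reproducible
    df = pd.DataFrame({'numbers': np.random.rand(10),
                       'markers': [None, None, 'x', None, None, None, 'y', None, None, 'x']})

    # True on each marker row; shift down one row (filling the top with
    # False) so the marker stays in its own group; cumsum labels the groups:
    # 0 0 0 1 1 1 1 2 2 2
    clusters = df['markers'].notnull().shift(fill_value=False).cumsum()

    # Apply any reduction to each slice, e.g. the sum of each group:
    result = df['numbers'].groupby(clusters).sum()

Because `groupby` returns a pandas object rather than a list of arrays, you also get the group labels back alongside each result, and you can swap `sum` for `agg`, `apply`, or iteration over the groups.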