Slicing rows of a pandas dataframe between markers


I have a pandas dataframe with a column that marks interesting points of data in another column (e.g. the locations of peaks and troughs). I often need to do some computation on the values between each marker. Is there a neat way to slice the dataframe using the markers as end points so that I can run a function on each slice? The dataframe would look like this, with the desired slices marked:

   numbers     markers
0  0.632009    None
1  0.733576    None    # Slice 1 (0,1,2)
2  0.585944       x    _________
3  0.212374    None
4  0.491948    None    
5  0.324899    None    # Slice 2 (3,4,5,6)
6  0.389103       y    _________
7  0.638451    None
8  0.123557    None    # Slice 3 (7,8,9)
9  0.588472       x    _________

My current approach is to create an array of the indices where the markers occur, iterating over this array using the values to slice the dataframe, and then appending these slices to a list. I end up with a list of numpy arrays that I can then apply a function to:

import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': np.random.rand(10),
                   'markers': [None, None, 'x', None, None, None, 'y', None, None, 'x']})

index_array = df[df.markers.isin(['x', 'y'])].index  # Index of the rows where a marker occurs
slice_list = []

prev_i = 0  # the first slice of the dataframe starts at index 0
for i in index_array:
    new_slice = df.numbers[prev_i:i + 1].values  # i + 1 so the slice includes the end marker
    slice_list.append(new_slice)
    prev_i = i + 1  # the next slice starts just after this marker

for j in slice_list:
    myfunction(j)

This works, but is there a more idiomatic approach using fancy indexing, grouping, pivoting, or something else I am missing? I've looked at groupby, but grouping on the markers column directly doesn't work: it only returns the rows where the markers are, and multi-indexes and pivot tables require unique labels. I wouldn't bother asking, except pandas has a tool for just about everything, so my expectations are probably unreasonably high.

I am not tied to ending up with a list of arrays, that was just the solution I found. I am very open to suggestions on changing the way that I structure my data from the very start if that makes things easier.

BEST ANSWER

You can do this using a variant of the compare-cumsum-groupby pattern. Starting from

>>> df["markers"].isin(["x","y"])
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
9     True
Name: markers, dtype: bool

We can shift this mask down one row, so that each marker is counted with the slice it closes rather than the one that follows, and then take the cumulative sum to get a group id for every row:

>>> df["markers"].isin(["x","y"]).shift().fillna(False).cumsum()
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    2
8    2
9    2
Name: markers, dtype: int64

After which groupby works as you want:

>>> group_id = df["markers"].isin(["x","y"]).shift().fillna(False).cumsum()
>>> for k,g in df.groupby(group_id):
...     print(k)
...     print(g)
...     
0
    numbers markers
0  0.632009    None
1  0.733576    None
2  0.585944       x
1
    numbers markers
3  0.212374    None
4  0.491948    None
5  0.324899    None
6  0.389103       y
2
    numbers markers
7  0.638451    None
8  0.123557    None
9  0.588472       x
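
From here you can run your function on each slice directly with apply, or rebuild the list of numpy arrays from your original approach. A minimal sketch, where np.ptp (the peak-to-trough range) is just a stand-in for your myfunction; any per-slice computation works the same way:

import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': np.random.rand(10),
                   'markers': [None, None, 'x', None, None, None, 'y', None, None, 'x']})
group_id = df["markers"].isin(["x", "y"]).shift(fill_value=False).cumsum()

# Apply a function to the numbers in each slice in one step
# (np.ptp stands in for myfunction here).
per_slice = df.groupby(group_id)["numbers"].apply(np.ptp)

# Or recover the list of numpy arrays from the original approach.
slice_list = [g.values for _, g in df.groupby(group_id)["numbers"]]

This keeps each marker row inside the slice it terminates, exactly matching the slices in the question, without any explicit iteration over indices.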