Selecting rows by offsetting

98 Views Asked by At

I have this data frame, lets call it my_df. It looks like this:

my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10))
names(my_df) <- c("num","let")

head(my_df)
          num let
1  0.01202600   a
2  1.09025768   a
3 -0.08656178   a
4 -0.04847073   a
5 -0.63750258   a
6  0.58846135   a

What I want to do is select all of the rows when my_df$let == "b" as well as the five rows before the first row when my_df$let == "b", and the five rows after the last row when my_df == "b". So basically my_df[6:25,].

The data I'm actually working with is hundreds of thousands of lines long and I don't know what rows is what, and besides that each set of data doesn't match up row wise and I can't take the time to go through each set of data individually. I've been using a subset to select the data I want, but I don't know how to select the additional rows outside of the subset (1000 rows before and after).

Here's my subset for what I'm doing:

#The following lines seperate pXX_NoNegative into individual field sections
p04_HighWeeds <- subset(p04_NoNegative, subset = p04_NoNegative$GS_Field == "High Weeds")

I want to select all of the rows that the above code selects, but I also want 100 rows before that, and 1000 rows after that.

If you need any additional information that may help you please ask.

3

There are 3 best solutions below

0
On BEST ANSWER

Here's another idea using dplyr:

library(dplyr)
my_df %>% filter(lead(let == "b", 5) | lag(let == "b", 5))

Or as per @akrun suggestion using the devel version of data.table:

setDT(my_df)[shift(let == "b", 5) | shift(let == "b", type = "lead", 5)]

Which gives:

#           num let
#1   0.36723709   a
#2   0.24743170   a
#3  -0.33339924   a
#4  -0.57024317   a
#5   0.03390278   a
#6  -0.43495096   b
#7  -0.85107347   b
#8   0.53048931   b
#9  -0.26739611   b
#10 -0.96029355   b
#11 -0.71737408   b
#12  0.34324685   b
#13  0.12319646   b
#14  0.75207703   b
#15  0.18134006   b
#16 -0.02230777   c
#17  0.42646106   c
#18 -0.11055478   c
#19  0.06013187   c
#20  0.50782158   c
0
On

Normally splitting a data frame into a list of data frames based on some categorization is straightforward -- you would use split(my_df, my_df$let) in your case. However with the added complication that you want some number of rows before or after I would operate over the set of unique categorizations, selecting the rows you want in each case:

before <- 5
after <- 5
ret <- setNames(lapply(unique(my_df$let), function(x) {
  positions <- which(my_df$let == x)
  start.pos <- max(1, min(positions)-before)
  end.pos <- min(nrow(my_df), max(positions)+after)
  my_df[start.pos:end.pos,]
}), unique(my_df$let))

You can grab the observations for any category you want out of the returned list:

ret$b  # Also works: ret[["b"]]
#             num let
# 6  -0.197901427   a
# 7   0.194607192   a
# 8  -0.107318203   a
# 9  -0.365313233   a
# 10 -0.188926562   a
# 11  0.636272295   b
# 12 -0.058791973   b
# 13 -0.231029510   b
# 14  0.519441716   b
# 15  0.239510912   b
# 16  0.107025658   b
# 17 -0.446644081   b
# 18  0.145052077   b
# 19 -0.426090749   b
# 20 -0.356062993   b
# 21 -0.155012203   c
# 22 -0.007968255   c
# 23 -0.504253089   c
# 24  0.081624303   c
# 25 -0.657008233   c
0
On

I recently answered a nearly identical question: Select n rows after specific number. Adapting the single-segment solution to your data:

set.seed(1); my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10));
names(my_df) <- c("num","let");
brange <- range(which(my_df$let=='b'));
my_df$offb <- c((1-brange[1]):-1,rep(0,diff(brange)+1),1:(nrow(my_df)-brange[2]));
my_df;
##             num let offb
## 1  -0.313226905   a  -10
## 2   0.091821662   a   -9
## 3  -0.417814306   a   -8
## 4   0.797640401   a   -7
## 5   0.164753886   a   -6
## 6  -0.410234192   a   -5
## 7   0.243714526   a   -4
## 8   0.369162353   a   -3
## 9   0.287890676   a   -2
## 10 -0.152694194   a   -1
## 11  0.755890584   b    0
## 12  0.194921618   b    0
## 13 -0.310620290   b    0
## 14 -1.107349944   b    0
## 15  0.562465459   b    0
## 16 -0.022466805   b    0
## 17 -0.008095132   b    0
## 18  0.471918105   b    0
## 19  0.410610598   b    0
## 20  0.296950661   b    0
## 21  0.459488686   c    1
## 22  0.391068150   c    2
## 23  0.037282492   c    3
## 24 -0.994675848   c    4
## 25  0.309912874   c    5
## 26 -0.028064370   c    6
## 27 -0.077897753   c    7
## 28 -0.735376192   c    8
## 29 -0.239075028   c    9
## 30  0.208970780   c   10
subset(my_df,offb>=-5&offb<=5);
##             num let offb
## 6  -0.410234192   a   -5
## 7   0.243714526   a   -4
## 8   0.369162353   a   -3
## 9   0.287890676   a   -2
## 10 -0.152694194   a   -1
## 11  0.755890584   b    0
## 12  0.194921618   b    0
## 13 -0.310620290   b    0
## 14 -1.107349944   b    0
## 15  0.562465459   b    0
## 16 -0.022466805   b    0
## 17 -0.008095132   b    0
## 18  0.471918105   b    0
## 19  0.410610598   b    0
## 20  0.296950661   b    0
## 21  0.459488686   c    1
## 22  0.391068150   c    2
## 23  0.037282492   c    3
## 24 -0.994675848   c    4
## 25  0.309912874   c    5