Removing rows containing specific dates in R


Disclaimer: I am going to come out of this looking silly.

I have a data frame with a date column of class POSIXct. I am trying to remove the rows for specific dates (public holidays). I tried this:

modelset.nonholiday <- modelset[!modelset$date == as.POSIXct("2013-12-31") |
                                !modelset$date == as.POSIXct("2013-07-04") |
                                !modelset$date == as.POSIXct("2014-07-04") |
                                !modelset$date == as.POSIXct("2013-11-28") |
                                !modelset$date == as.POSIXct("2013-11-29") |
                                !modelset$date == as.POSIXct("2013-12-24") |
                                !modelset$date == as.POSIXct("2013-12-25") |
                                !modelset$date == as.POSIXct("2014-02-14") |
                                !modelset$date == as.POSIXct("2014-04-20") |
                                !modelset$date == as.POSIXct("2014-05-26"), ]

The above didn't work. It returns the data frame removing only the first. So I tried:

modelset[!modelset$date %in% c("2013-12-31", "2013-07-04", "2014-07-04",
             "2013-11-28", "2013-11-29", "2013-12-24", "2013-12-25", "2014-02-14", 
             "2014-04-20", "2014-05-26"), ]

This didn't work either. I also tried:

`%notin%` <- function(x,y) !(x %in% y) 

modelset[modelset$date %notin% as.POSIXct(c("2013-12-31", "2013-07-04", "2014-07-04",
                 "2013-11-28", "2013-11-29", "2013-12-24", "2013-12-25", "2014-02-14",
                 "2014-04-20", "2014-05-26")), ]

I've referred to "Remove Rows From Data Frame where a Row match a String", "R remove rows containing a certain value", and "Standard way to remove multiple elements from a dataframe", but I can't seem to find what I am doing wrong.

> head(modelset)
    date spot.volume.loc spot.volume.nat nat.imp.a loc.imp.a nat.imp.m loc.imp.m branded.leads esi.leads
1 2013-07-01            2988             215     13931    4155.3      5770    1853.7           331       363
2 2013-07-02            3200             218     12589    4651.3      5374    2207.8           293       428
3 2013-07-03            3066             203     10305    3921.0      4754    1759.2           273       325
4 2013-07-04            3153              83      2353    4135.6       999    1912.2           172       184
5 2013-07-05            2959              59      1553    3573.4       815    1662.3           193       246
6 2013-07-06             667              53      2219     456.7       889     214.8           161       203
tv.leads callin.leads total.leads total.imp.a total.imp.m       day week quarter on.off
1      195           41         930     18086.3      7623.7    Monday   26      Q3   1.25
2      192           50         963     17240.3      7581.8   Tuesday   26      Q3   1.00
3      149           38         785     14226.0      6513.2 Wednesday   26      Q3   1.00
4       34            0         390      6488.6      2911.2  Thursday   26      Q3   1.00
5       50           18         507      5126.4      2477.3    Friday   26      Q3   0.75
6       14            9         387      2675.7      1103.8  Saturday   26      Q3   0.50

There are 2 best solutions below

Best answer

Here is an answer that uses dplyr together with your %notin% approach:

library(dplyr)

dates <- 
  as.POSIXct(c("2013-12-31", "2013-07-04", "2014-07-04", "2013-11-28", "2013-11-29", 
               "2013-12-24", "2013-12-25", "2014-02-14", "2014-04-20", "2014-05-26"))

`%notin%` <- function(x,y) !(x %in% y) 

modelset %>%
  filter(date %notin% dates)
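
As a side note on why the first attempt in the question cannot work as written: chaining negated comparisons with | keeps a row whenever its date differs from at least one holiday, and no date can equal all of them at once. The negations need to be joined with &, which is what %in% does for you. Below is a minimal base R sketch of the difference; the data frame toy and its values are made up for illustration and only stand in for modelset, and the time zone is fixed to UTC as an assumption so the example behaves the same everywhere.

```r
# Hypothetical three-row stand-in for modelset (values are illustrative only)
toy <- data.frame(
  date  = as.POSIXct(c("2013-07-03", "2013-07-04", "2013-07-05"), tz = "UTC"),
  leads = c(785, 390, 507)
)
holidays <- as.POSIXct(c("2013-07-04", "2013-12-25"), tz = "UTC")

# OR of negations: TRUE for every row, because no date equals both holidays
keep_or <- !(toy$date == holidays[1]) | !(toy$date == holidays[2])

# AND of negations: TRUE only when the date matches none of the holidays
keep_and <- !(toy$date == holidays[1]) & !(toy$date == holidays[2])

# Same mask as the AND form, written with %in%
keep_in <- !(toy$date %in% holidays)

# Rows that are not holidays
toy[keep_in, ]
```

With ten holidays the & chain gets long, so the %in% form is usually the easier one to read and maintain.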

Use the which() function like so:

dat <- as.POSIXct(c("2013-12-31", "2013-07-04", "2014-07-04",
                    "2013-11-28", "2013-11-29", "2013-12-24", "2013-12-25", "2014-02-14",
                    "2014-04-20", "2014-05-26"))

# %in% tests membership; != would recycle the shorter vector element by element
dat[which(!(dat %in% as.POSIXct(c("2013-12-31", "2014-07-04"))))]

In your case, I believe it would be:

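# Convert the strings to POSIXct so both sides have the same class,
# and keep the trailing comma so the index selects rows, not columns
modelset <- modelset[which(!modelset$date %in% as.POSIXct(c("2013-12-31", "2013-07-04", "2014-07-04",
         "2013-11-28", "2013-11-29", "2013-12-24", "2013-12-25", "2014-02-14",
         "2014-04-20", "2014-05-26"))), ]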

which() returns the positions at which its logical argument is TRUE. Used as the row index inside the brackets, those positions select exactly the rows to keep.
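
To make that concrete, here is a small sketch, again on a made-up stand-in for modelset (the name toy and its values are hypothetical). It also shows one way to compare against plain date strings: formatting the POSIXct column as "YYYY-MM-DD" text first, so both sides of %in% are character. That conversion is my assumption about a safe way to match; converting the strings with as.POSIXct works too, as long as both sides use the same time zone.

```r
# Hypothetical three-row stand-in for modelset (values are illustrative only)
toy <- data.frame(
  date  = as.POSIXct(c("2013-07-03", "2013-07-04", "2013-07-05"), tz = "UTC"),
  leads = c(785, 390, 507)
)
holidays <- c("2013-07-04", "2013-12-25")

# Compare like with like: turn the POSIXct column into "YYYY-MM-DD" strings
is_holiday <- format(toy$date, "%Y-%m-%d") %in% holidays

# which() turns the logical mask into integer row positions (here 1 and 3)
by_position <- toy[which(!is_holiday), ]

# The logical mask can also be used directly as the row index
by_logical <- toy[!is_holiday, ]
```

Both forms pick the same rows here, so which() is optional in this case. It mainly matters when the mask can contain NA, since which() drops NA positions while a logical index returns NA rows for them.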