Validate time series index


I am using a dataset that is grouped with the group_by function from the dplyr package. Each group has its own time index, which is supposed to consist of a sequence of 12 consecutive months. The sequence can run from January to December, or it can start, for example, in June of one year and end in May of the next.

Here is the dataset example:

     ID       DATE
      8 2017-01-31
      8 2017-02-28
      8 2017-03-31
      8 2017-04-30
      8 2017-05-31
      8 2017-06-30
      8 2017-07-31
      8 2017-08-31
      8 2017-09-30
      8 2017-10-31
      8 2017-11-30
      8 2017-12-31
     32 2017-01-31
     32 2017-02-28
     32 2017-03-31
     32 2017-04-30
     32 2017-05-31
     32 2017-06-30
     32 2017-07-31
     32 2017-08-31
     32 2017-09-30
     32 2017-10-31
     32 2017-11-30
     32 2017-12-31
     45 2016-09-30
     45 2016-10-31
     45 2016-11-30
     45 2016-12-31
     45 2017-01-31
     45 2017-02-28
     45 2017-03-31
     45 2017-04-30
     45 2017-05-31
     45 2017-06-30
     45 2017-07-31
     45 2017-08-31

The problem is that the dataset is too large for me to check visually whether there are any "jumps", in other words whether the dates are consistent. Is there a simple way to do this in R, perhaps with some modification or combination of functions from the tibbletime package?

Any help will be appreciated.

Thank you in advance.


There are 2 best solutions below


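You can use the summarise function from dplyr to return a logical value indicating whether any two consecutive dates within an ID are more than 31 days apart. Because month-end dates fall on different days, first construct a temporary date from only the year and month, with "-01" attached as a placeholder day. Consecutive months are then 28 to 31 days apart, while a skipped month produces a difference of at least 59 days: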

library(dplyr)
library(lubridate)

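df %>%
  arrange(ID, DATE) %>%   # diff() below assumes dates are sorted within each ID
  group_by(ID) %>%
  # DATE2: first day of each date's month; DATE_diff: days since the previous row
  mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
         DATE_diff = c(0, diff(DATE2))) %>%
  summarise(Valid = !any(DATE_diff > 31))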

Result:

# A tibble: 3 x 2
     ID Valid
  <int> <lgl>
1     8  TRUE
2    32  TRUE
3    45  TRUE
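
Note that the check above only looks for gaps; it does not confirm that each ID has exactly 12 rows. If you need both, one option is to number the months and require consecutive values. Below is a minimal base R sketch of that idea. The helper name is_consecutive_months is hypothetical (it is not part of dplyr or lubridate), and ID 99 with its missing April is made-up data for illustration:

```r
# Hypothetical helper: TRUE when `dates` cover exactly `n_months`
# consecutive calendar months, one observation per month.
is_consecutive_months <- function(dates, n_months = 12L) {
  dates <- sort(as.Date(dates))
  lt <- as.POSIXlt(dates)
  # Running month number; consecutive months differ by exactly 1
  month_index <- (lt$year + 1900L) * 12L + lt$mon
  length(dates) == n_months && all(diff(month_index) == 1L)
}

# Made-up example: ID 8 is complete, ID 99 is missing April 2017
month_ends <- seq(as.Date("2017-02-01"), by = "month", length.out = 12) - 1
example <- data.frame(
  ID   = rep(c(8L, 99L), times = c(12L, 11L)),
  DATE = c(month_ends, month_ends[-4])
)

# One logical value per ID, named by ID
valid <- vapply(split(example$DATE, example$ID),
                is_consecutive_months, logical(1))
print(valid)
```

The same helper could be called inside summarise() on grouped data, for example summarise(Valid = is_consecutive_months(DATE)).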

You can also visually check if there are any gaps by plotting your dates for each ID:

library(ggplot2)

df %>%
  mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
         ID = as.factor(ID)) %>%
  ggplot(aes(x = DATE, y = ID, group = ID)) + 
  geom_point(aes(color = ID)) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b-%Y") +
  labs(title = "Time Line by ID")



Here's how I would typically approach this problem using data.table. The cut.Date() and seq.Date() functions from base R are the meat of the logic, so you can use the same approach with dplyr if desired.

library(data.table)

## Convert to data.table
setDT(df)

## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]

## Order by ID and Date
setkey(df,ID,DATE)

## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]

## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
                              by = "months",
                              length.out = .N), by = .(ID)]

## Create a summary table to test whether each ID has exactly 12 observations
## and every actual month equals its expected month
Test <- df[, .(Valid = .N == 12L && all(Month == ExpectedMonth)), by = .(ID)]

print(Test)
#    ID Valid
# 1:  8  TRUE
# 2: 32  TRUE
# 3: 45  TRUE

## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]

## The final output:
head(df)
#    ID       DATE      Month ExpectedMonth Valid
# 1:  8 2017-01-31 2017-01-01    2017-01-01  TRUE
# 2:  8 2017-02-28 2017-02-01    2017-02-01  TRUE
# 3:  8 2017-03-31 2017-03-01    2017-03-01  TRUE
# 4:  8 2017-04-30 2017-04-01    2017-04-01  TRUE
# 5:  8 2017-05-31 2017-05-01    2017-05-01  TRUE
# 6:  8 2017-06-30 2017-06-01    2017-06-01  TRUE

You could also do things a little more compactly, if you really wanted to, by using a self-join and skipping the creation of Test:

setDT(df)

df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), by = .(ID)]
## Join the per-ID summary straight back onto df, naming the join column explicitly
df[df[, .(Valid = .N == 12L && all(Month == ExpectedMonth)), by = .(ID)], Valid := i.Valid, on = "ID"]
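
As mentioned, the same expected-month comparison carries over to dplyr. Here is a minimal sketch, assuming a data frame with ID and DATE columns like the one in the question. The example data is made up (ID 99 is missing April 2017), and format() is used in place of cut.Date() to get the first day of each month:

```r
library(dplyr)

# Made-up example: ID 8 is complete, ID 99 is missing April 2017
month_ends <- seq(as.Date("2017-02-01"), by = "month", length.out = 12) - 1
example <- tibble(
  ID   = rep(c(8L, 99L), times = c(12L, 11L)),
  DATE = c(month_ends, month_ends[-4])
)

result <- example %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(
    # First day of the month each observation falls in
    Month = as.Date(format(DATE, "%Y-%m-01")),
    # The months we would see if the group had no gaps
    ExpectedMonth = seq(min(Month), by = "month", length.out = n())
  ) %>%
  # Valid only if there are 12 rows and every month is where it should be
  summarise(Valid = n() == 12L && all(Month == ExpectedMonth))

print(result)
```

Dropping the summarise() step keeps Month and ExpectedMonth side by side, which makes it easy to see where a group first goes off track.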