Lag or lead to assess user paths in R

I have the following dataset:

User Session_ID Page Path_Number
123A 12345 home 1
123A 12345 services 2
123A 12345 pricing 3
123A 12345 about 4
123A 12345 services 5
123A 12345 home 6
123B 34567 home 1
123B 34567 services 2
123B 34567 about 3
123B 34567 multimedia 4
123C 56789 home 1
123C 56789 about 2
123C 56789 pricing 3
123C 56789 about 4
123C 56789 services 5
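
To make the example reproducible, the table above can be built as a data frame along these lines. This is only a sketch: the column types (integer `Session_ID` and `Path_Number`, character `User` and `Page`) are an assumption, since the table does not state them.

```r
# Recreate the example data; values are copied from the table above.
# Column types are an assumption (integers for IDs and step numbers).
dataset <- data.frame(
  User = rep(c("123A", "123B", "123C"), times = c(6, 4, 5)),
  Session_ID = rep(c(12345L, 34567L, 56789L), times = c(6, 4, 5)),
  Page = c("home", "services", "pricing", "about", "services", "home",
           "home", "services", "about", "multimedia",
           "home", "about", "pricing", "about", "services"),
  Path_Number = c(1:6, 1:4, 1:5),
  stringsAsFactors = FALSE
)
```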

There are three users, each with a unique session ID. Page is the page visited, and Path_Number is the order in which the pages were visited within the session.

The question that I am trying to answer is: How many people first go to the 'services' page and then go to the 'about' page?

I am using the following code to assess which user and session have both 'services' and 'about' in the path:

    dataset %>% group_by(Session_ID, User) %>% 
      summarize(services_and_about = ('services' %in% Page) & ('about' %in% Page)) %>%
      filter(services_and_about)

The result would be users 123A, 123B, and 123C.

However, I would like to also know which users visit the 'services' page BEFORE the 'about' page (only users 123A and 123B). I know I should use a lag or lead function here, but I am not sure how.

Thanks a lot for helping!

There are 2 best solutions below

Try this:

library(dplyr)
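dataset %>%
  group_by(Session_ID, User) %>%
  summarize(
    # TRUE when both pages occur and the first "services" visit
    # comes before the last "about" visit
    services_before_about = all(c("services", "about") %in% Page) &&
      min(Path_Number[Page == "services"]) < max(Path_Number[Page == "about"]),
    .groups = "drop")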
# # A tibble: 3 x 3
#   Session_ID User  services_before_about
#        <int> <chr> <lgl>                
# 1      12345 123A  TRUE                 
# 2      34567 123B  TRUE                 
# 3      56789 123C  FALSE                
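
Since the question asks how many people follow this path, the per-session flags can be reduced to a count of distinct users. A minimal sketch, assuming the result has the columns shown in the output above; the data frame `res` and the name `n_users` are made up for illustration:

```r
library(dplyr)

# Hypothetical per-session result, same columns as the output above.
res <- data.frame(
  Session_ID = c(12345L, 34567L, 56789L),
  User = c("123A", "123B", "123C"),
  services_before_about = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep qualifying sessions, then count each user once
# even if several of their sessions qualify.
n_users <- res %>%
  filter(services_before_about) %>%
  summarize(n = n_distinct(User)) %>%
  pull(n)

n_users
# 2
```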

Assuming that other values of Page are irrelevant for this question, you can first filter the data to services and about only, then use a lag function to see whether there is any occurrence of about that is preceded by services:

dataset %>% 
  filter(Page %in% c("services", "about")) %>%
  group_by(Session_ID, User) %>%
  mutate(previous_page = dplyr::lag(Page)) %>%
  # %in% returns FALSE (not NA) for the first row of each group, so the sum below stays defined
  mutate(about_after_service = ifelse(
    Page == "about" & previous_page %in% "services", 1, 0)) %>%
  group_by(User) %>% 
  summarize(about_after_service_n = sum(about_after_service)) %>%
  filter(about_after_service_n > 0)
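
To see what the lag step does, here is a small sketch on a single vector. The `pages` vector is a made-up session, already filtered to services and about; it is not taken from the question's data. Note that the first element has no predecessor, so a comparison with `==` yields `NA` there, while `%in%` yields `FALSE`.

```r
library(dplyr)

# Hypothetical session after filtering to "services" and "about".
pages <- c("about", "services", "about")

# lag() shifts the vector down by one; the first element has no predecessor.
previous_page <- dplyr::lag(pages)
previous_page
# NA "about" "services"

# With ==, the first comparison is NA because the previous page is missing.
flag_eq <- pages == "about" & previous_page == "services"
flag_eq
# NA FALSE TRUE

# %in% never returns NA, so the flag is a clean TRUE/FALSE.
flag_in <- pages == "about" & previous_page %in% "services"
flag_in
# FALSE FALSE TRUE
```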