User paths in R

The following is a sample of the dataset I am working on. I am trying to assess which users create a request on the contact form and are successful. So, the button click that tells me that the user has begun a request is "createrequestButtonClick" and the button click that denotes a successfully sent request is "SendButtonClick".

The problem is that the path to "SendButtonClick" is not fixed: it could come 4 or 6 steps after "createrequestButtonClick". Also, a user can create multiple requests in one session and may or may not send each of them.

Through R code, how can I assess whether a "createrequestButtonClick" precedes a "SendButtonClick" or vice versa? If there isn't a "SendButtonClick" after a "createrequestButtonClick", it means that the user initiated a request, but did not submit it successfully (and this needs to be flagged).

structure(list(session_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), 
User_ID = c("123", "123", "123", "123", "123", "123", "123", "123", "123", "123", "345", "345", "345", "345", "345", "345", "345", "345", "345", "345", "345"), 
Page = c("home", "contact", "createrequestButtonClick", "requestform", "requestform", "FormValueChange", "FormContactSelection", "FormValueChange", "SendButtonClick", "home", "home", "contact", "createrequestButtonClick", "requestform", "FormValueChange", "SendButtonClick", "contact", "createrequestButtonClick", "requestform", "FormValueChange", "SendButtonClick"), 
Path_ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), 
Path_Length = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L)), 
row.names = c(NA, -21L), 
class = c("tbl_df", "tbl", "data.frame"))

There are 2 best solutions below


You can use cumsum() to create identifiers for all created requests. Then check if the send button was clicked in each request with any().
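To make the cumsum() step concrete, here is a minimal base R sketch on a made-up click vector. The `page` values are illustrative only and are not taken from the dataset in the question.

```r
# Toy click stream for a single session (illustrative values only)
page <- c("home", "createrequestButtonClick", "requestform", "SendButtonClick",
          "contact", "createrequestButtonClick", "requestform")

# TRUE counts as 1, so the running total goes up by one at each new request.
# Clicks before the first request get id 0.
request_id <- cumsum(page == "createrequestButtonClick")
request_id
#> [1] 0 1 1 1 1 2 2

# One logical per block: was there a send click inside that block?
sent <- tapply(page == "SendButtonClick", request_id, any)
sent
```

Here request 1 contains a send click and request 2 does not, so request 2 is the one that would need to be flagged.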


library(dplyr)

# `paths` is assumed to hold the data frame from the question
paths %>% 
  group_by(session_id) %>%
  mutate(request_id = cumsum(Page == "createrequestButtonClick")) %>% 
  filter(request_id > 0) %>%
  group_by(request_id, .add = TRUE) %>% 
  summarise(request_was_succesful = any(Page == "SendButtonClick")) %>%
  summarise(session_was_succesful = all(request_was_succesful))
#> # A tibble: 2 × 2
#>   session_id session_was_succesful
#>        <dbl> <lgl>                
#> 1          1 TRUE                 
#> 2          2 TRUE
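The pipeline above collapses everything to one value per session. Since the question asks for unsubmitted requests to be flagged, one possible variation is to stop after the first summarise() and keep one row per request. The sketch below uses a small made-up data frame (`clicks` and `request_flags` are hypothetical names) in which session 2 starts two requests but sends only the first.

```r
library(dplyr)

# Hypothetical toy data, not the dataset from the question
clicks <- tibble(
  session_id = c(1, 1, 1, 2, 2, 2, 2, 2),
  Page = c("createrequestButtonClick", "requestform", "SendButtonClick",
           "createrequestButtonClick", "SendButtonClick",
           "createrequestButtonClick", "requestform", "home")
)

request_flags <- clicks %>%
  group_by(session_id) %>%
  # number the requests within each session
  mutate(request_id = cumsum(Page == "createrequestButtonClick")) %>%
  # drop clicks that happen before the first request
  filter(request_id > 0) %>%
  group_by(request_id, .add = TRUE) %>%
  summarise(sent = any(Page == "SendButtonClick"), .groups = "drop")

# Rows with sent == FALSE are requests that were started but not submitted
filter(request_flags, !sent)
```

This assumes the rows are already in click order within each session; if they are not, sorting by Path_ID first would be needed.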

A couple of simplified examples:

sessions <- rbind(
  data.frame(session_id = 1, action = c("create", "send")),
  data.frame(session_id = 2, action = c("create", "change", "send")),
  data.frame(session_id = 3, action = c("create", "send", "create", "send")),
  data.frame(session_id = 4, action = c("create")),
  data.frame(session_id = 5, action = c("create", "create", "send")),
  data.frame(session_id = 6, action = c("send", "create"))
)

sessions
#>    session_id action
#> 1           1 create
#> 2           1   send
#> 3           2 create
#> 4           2 change
#> 5           2   send
#> 6           3 create
#> 7           3   send
#> 8           3 create
#> 9           3   send
#> 10          4 create
#> 11          5 create
#> 12          5 create
#> 13          5   send
#> 14          6   send
#> 15          6 create

And the corresponding classifications:

sessions %>% 
  group_by(session_id) %>%
  mutate(request_id = cumsum(action == "create")) %>% 
  filter(request_id > 0) %>%
  group_by(request_id, .add = TRUE) %>% 
  summarise(request_was_succesful = any(action == "send")) %>%
  summarise(session_was_succesful = all(request_was_succesful))
#> # A tibble: 6 × 2
#>   session_id session_was_succesful
#>        <dbl> <lgl>                
#> 1          1 TRUE                 
#> 2          2 TRUE                 
#> 3          3 TRUE                 
#> 4          4 FALSE                
#> 5          5 FALSE                
#> 6          6 FALSE

Assuming that createrequestButtonClick occurred before SendButtonClick for a given User_ID and session_id whenever the Path_ID of the SendButtonClick row exceeds the Path_ID of the createrequestButtonClick row, we can do the following:

  1. Find the min/max Path_ID value for each value of Page and User_ID within each session_id.
  2. Test whether the minimum Path_ID for createrequestButtonClick is less than the Path_ID of each SendButtonClick row. If TRUE, then that SendButtonClick was preceded by a createrequestButtonClick.
  3. If the test is TRUE for a row, that row corresponds to a success.

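library(dplyr)
library(tidyr)  # for pivot_wider()

# `df` is assumed to hold the data frame from the question.
# Only successful if SendButtonClick happens after createrequestButtonClick
page_sub <- df %>%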
  filter(Page %in% c("createrequestButtonClick", "SendButtonClick"))

summary_df <- page_sub %>%
  group_by(session_id, User_ID, Page) %>%
  summarize(max_path = max(Path_ID),
            min_path = min(Path_ID)) %>%
  ungroup() %>%
  pivot_wider(names_from = Page,
              values_from = c(max_path, min_path))

# If min(createrequestButtonClick) < any(SendButtonClick), then success for
# that user during that session.  We'll need to add the minimums back to the
# data and then we can test.
joined <- page_sub %>% 
  filter(Page == "SendButtonClick") %>%
  left_join(., summary_df, by = c("session_id", "User_ID")) %>%
  mutate(success = if_else(min_path_createrequestButtonClick < Path_ID, 1, 0))

joined %>% select(session_id, User_ID, success)
#> # A tibble: 3 x 3
#>   session_id User_ID success
#>        <dbl> <chr>     <dbl>
#> 1          1 123           1
#> 2          2 345           1
#> 3          2 345           1

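# If you had multiple sessions per person, you could then check per person.
# Note: success_sessions sums the SendButtonClick rows flagged above, so a
# session with two sent requests (user 345) is counted twice.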
joined %>%
  group_by(User_ID) %>%
  summarise(success_sessions = sum(success),
            success_ever = if_else(success_sessions > 0, 1, 0))
#> # A tibble: 2 x 3
#>   User_ID success_sessions success_ever
#>   <chr>              <dbl>        <dbl>
#> 1 123                    1            1
#> 2 345                    2            1
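Because the join above starts from the SendButtonClick rows, a session in which a request was created but never sent produces no row at all rather than a 0. A minimal sketch of one way to surface those sessions is shown below. It uses made-up data, and `df_toy`, `session_flags` and `flag_unsubmitted` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy data: user "999" starts a request but never sends it
df_toy <- tibble(
  session_id = c(1, 1, 1, 2, 2),
  User_ID    = c("123", "123", "123", "999", "999"),
  Page       = c("createrequestButtonClick", "requestform", "SendButtonClick",
                 "createrequestButtonClick", "requestform"),
  Path_ID    = c(1L, 2L, 3L, 1L, 2L)
)

session_flags <- df_toy %>%
  # make sure clicks are in path order before using cumsum()
  arrange(session_id, User_ID, Path_ID) %>%
  group_by(session_id, User_ID) %>%
  summarise(
    created = any(Page == "createrequestButtonClick"),
    # a send only counts if at least one create click came before it
    sent_after_create = any(Page == "SendButtonClick" &
                              cumsum(Page == "createrequestButtonClick") > 0),
    .groups = "drop"
  ) %>%
  mutate(flag_unsubmitted = created & !sent_after_create)

session_flags
```

This works at the session level only: a session with one sent and one abandoned request would not be flagged, so a per-request id would still be needed for that case.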