How to create user paths from clickstream data

573 Views Asked by At

I have some clickstream data I'd like to attribution analyze in a particular way, but I need to get into a specific format for users that convert and those that don't.

Reprex data:

df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 
                             2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001", 
                                                                                                   "1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set", 
                                                                                                   "Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"), 
                 Date_time = structure(c(1540103940, 1540104060, 1540104240, 
                                         1540318080, 1540318680, 1540318859, 1540314360, 1540413060, 
                                         1540413240, 1540538460, 1540538640, 1540629660, 1540755060, 
                                         1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                 Source = c("Facebook", "Facebook", "Facebook", "Google", 
                            "Email", "Google", "Email", "Referral", "Referral", "Facebook", 
                            "Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0, 
                                                                                                  0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df", 
                                                                                                                                                        "tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
                                                                                                                                                          cols = list(User_ID = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                            "collector")), Session_ID = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                    "collector")), Date_time = structure(list(format = ""), class = c("collector_datetime", 
                                                                                                                                                                                                                                                                                                                                      "collector")), Source = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                                                                                                                                          "collector")), Conversion = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                  "collector"))), default = structure(list(), class = c("collector_guess", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "collector")), skip = 1), class = "col_spec"))

Then set classes:

df <- df %>% 
  mutate(User_ID    = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time  = as.POSIXct(Date_time)
         )

I'd like to get all user visit paths to purchase, or total paths for ones that do not lead to purchase.

The format for the new column path would be for example: Facebook > Facebook > Facebook > Email > Email for user 2001 which I know how to achieve using mutate(path = paste0(source, collapse = " > "))

The complications are:

  • The majority of session IDs are not set, meaning they're missing
  • Some users may convert more than once
  • Some users may convert and return but not convert

Each row would be either:

  • A conversion by user ID - most converted users only convert once, but some may convert multiple times in which case there would be a row per conversion. The path column would reflect the journey to conversion - for a user's second or subsequent conversion only the path subsequent to the previous conversion would be shown.
  • Or a non-converted user-journey with their total path in the above format

For the above reprex the result would look like below:

# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
    <dbl> <chr>      <dttm>                   <dbl> <chr>                                         
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral         
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct                             
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct     

... where:

  • user 2001 converted twice and the paths are represented separately;
  • user 2002 converted and then came back later but did not convert and so the converted and non converted paths are represented as separate rows.
  • User 2003 never converted and so this path is represented.
1

There are 1 best solutions below

1
On BEST ANSWER

Here's an approach using dplyr:

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%

  # For each user, split between know and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1; 
  #   new path if prior event was a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%

  # Label path journey by combining everything so far
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep last step in each path
  filter(row_number() == n()) %>%
  ungroup() %>%

  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)

I get slightly different results, but I think they correspond to the sample data as provided:

> df2
# A tibble: 5 x 5
  User_ID Session_ID Date_time      

     Conversion Path                                          
  <fct>   <fct>      <dttm>                   <dbl> <chr>                                         
1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral                           
3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct           
5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct