Parsing URL in R to extract specific data

365 Views Asked by At

I have a data file which has a URL column in it. It looks something like this "https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"

I want these data in seperate columns with their respective values as shown below

utm_source utm_medium utm_campaign utm_term utm_content user_agent
abc          yts      123campaign  camp%123   brand         xyz

Using dput for URL results in

c("https://www.google.com/ | query_string=null | ip_address=123.113.64.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36","https://www.google.com/ | query_string=gclid=Lxi6sNo-A17RohDAcQgvD_fw4 | ip_address=167.11.116.237 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36","http://m.facebook.com/ | query_string=utm_source=fb&utm_medium=ctw&utm_campaign=abcPant_rem&utm_content=PantShirt | ip_address=106.193.181.252 | user_agent=Mozilla/5.0 (Linux; Android 10; SM-G975F Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/81.0.4044.138 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/218.0.0.32.158;]")
2

There are 2 best solutions below

12
Allan Cameron On

Only one of the entries in URL contains multiple fields in the query string, and the first one doesn't contain any. You can't really make a data frame from the example in the question, but you can make a list of named vectors containing the fields in the query string like this:

queries <- sapply(strsplit(sapply(strsplit(URL, "query_string="), 
                           `[`, 2), " \\|"), `[`, 1)

lapply(strsplit(queries, "\\&|="), function(x) 
  setNames(x[seq(length(x)/2) * 2], x[seq(length(x)/2) * 2 - 1]))
#> [[1]]
#> null 
#>   NA 
#> 
#> [[2]]
#>                       gclid 
#> "Lxi6sNo-A17RohDAcQgvD_fw4" 
#> 
#> [[3]]
#>    utm_source    utm_medium  utm_campaign   utm_content 
#>          "fb"         "ctw" "abcPant_rem"   "PantShirt" 
5
Paul On

Here's a regex solution using the URL provided.

url <- "https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"

str_match_all extracts patterns.

  • \\w+: Match one or more word characters
  • (...): Capture group
  • (?:...)?: Match group zero or one times, but don't capture the group. This is used to handle the query_string= part of the URL.
stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")
#> [[1]]
#>      [,1]                          [,2]           [,3]         
#> [1,] "query_string=utm_source=abc" "utm_source"   "abc"        
#> [2,] "utm_medium=yts"              "utm_medium"   "yts"        
#> [3,] "utm_campaign=123campaign"    "utm_campaign" "123campaign"
#> [4,] "utm_term=camp"               "utm_term"     "camp"       
#> [5,] "utm_content=brand"           "utm_content"  "brand"      
#> [6,] "gclid=abcdefg"               "gclid"        "abcdefg"    
#> [7,] "user_agent=xyz"              "user_agent"   "xyz"      

str_match_all returns a list of matrices where the first column is the complete match followed by each of the captured groups. Keep only the captured groups.

stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")[[1]][,2:3]
#>      [,1]           [,2]         
#> [1,] "utm_source"   "abc"        
#> [2,] "utm_medium"   "yts"        
#> [3,] "utm_campaign" "123campaign"
#> [4,] "utm_term"     "camp"       
#> [5,] "utm_content"  "brand"      
#> [6,] "gclid"        "abcdefg"    
#> [7,] "user_agent"   "xyz"