separate_wider_regex with lookahead

122 Views Asked by TemplateRex At 25 June 2023 at 10:37

I have a data frame of sporting events. Each event name can contain any number of words and spaces, and may be followed by an optional year that can be formatted in a few different ways.

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))

How can I use tidyr::separate_wider_regex to split event_optional_year into two columns, event and year? I want event to be stripped of the optional year, and year to equal NA, 12, 2016 and 2020/221, respectively.

I tried fiddling with positive lookahead in the regex:

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )

but this gives the following result:

  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21

Question: which regex gives me the desired result?

margusl On 25 June 2023 at 11:07 BEST ANSWER

Unnamed patterns in separate_wider_regex() simplify this a bit: they are matched but do not produce a column. event = ".*" is greedy and matches everything up to the last occurrence of "\\s+(?=\\d)", i.e. one or more whitespace characters followed by a digit (assuming that the year part starts with a digit). This handles spaces in event but assumes there are none in year.
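
The assumption about year can be made concrete with a small sketch. It uses base R's regexec() with perl = TRUE rather than tidyr, on two made-up inputs, and the helper name split_event_year is hypothetical. The combined patterns only approximate how the pieces fit together; they are not a description of tidyr's internals.

```r
# Hypothetical inputs: a year range containing spaces, and an event name
# that itself contains digits.
x <- c("Olympics 2020 / 2021", "Euro 2020 Qualifiers 2019")

greedy <- "^(.*)\\s+(?=\\d)(.*)$"   # split at the LAST whitespace before a digit
lazy   <- "^(.*?)\\s+(?=\\d)(.*)$"  # split at the FIRST whitespace before a digit

# Hypothetical helper: return capture groups 1 and 2 as an event/year matrix.
split_event_year <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  out <- t(vapply(m, function(g) g[2:3], character(2)))
  colnames(out) <- c("event", "year")
  out
}

split_event_year(greedy, x)
split_event_year(lazy, x)
```

With the greedy pattern the first input splits into "Olympics 2020 /" and "2021", because the split happens at the last whitespace that precedes a digit. Making event lazy moves the split to the first such whitespace, which keeps "2020 / 2021" together but cuts "Euro 2020 Qualifiers 2019" right after "Euro". Neither variant suits every input, so the choice depends on the data.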

library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships", 
                               "Summer Olympics 12", 
                               "Olympics 2016", 
                               "Olympics 2020/221")) %>% 
  separate_wider_regex(event_optional_year, 
                       c(event = ".*", "\\s+(?=\\d)", year = ".*$") , 
                       too_few = "align_start")
#> # A tibble: 4 × 2
#>   event               year    
#>   <chr>               <chr>   
#> 1 World Championships <NA>    
#> 2 Summer Olympics     12      
#> 3 Olympics            2016    
#> 4 Olympics            2020/221

Created on 2023-06-25 with reprex v2.0.2
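
As for why the attempt in the question splits inside the year: its lookahead wraps an optional group, (?=(?:...)?), so it succeeds at every position and does not stop the greedy .* from running into the digits. The year pattern \\d.*\\d$ is then satisfied by just the last two digits. Below is a minimal sketch in base R (regexec() with perl = TRUE), assuming the named pieces act like capture groups in one anchored regex. It illustrates the matching logic only, not tidyr's internals, and the helper name groups is hypothetical.

```r
x <- c("Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")

# Pattern pieces from the question, joined into one anchored regex
# (an assumption made for illustration).
attempt <- "^(.*(?=(?:\\d.*\\d$)?))(\\d.*\\d$)$"

# Pattern pieces from this answer, joined the same way.
fixed <- "^(.*)\\s+(?=\\d)(.*$)$"

# Hypothetical helper: one row per input, columns = capture groups 1 and 2.
groups <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  t(vapply(m, function(g) g[2:3], character(2)))
}

groups(attempt, x)  # greedy event leaves only the last two digits for year
groups(fixed, x)    # whitespace + lookahead pins the split before the year
```

Dropping the lookahead from the attempt entirely would give the same split, which is a quick way to confirm that it constrains nothing.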
