I have a dataframe with sporting events (with no assumptions about the number of spaces or words) with an optional year, that can be formatted in a few different ways.
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))
How can I use tidyr::separate_wider_regex to split event_optional_year into two columns event and year? I want event in this case to be stripped of the optional year, and year equal to NA, 12, 2016 and 2020/2021, respectively.
I tried fiddling with positive lookahead in the regex:
tibble::tibble(event_optional_year = c("Olympics", "Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
tidyr::separate_wider_regex(
"event_optional_year",
c(
event = ".*(?=(?:\\d.*\\d$)?)",
year = "\\d.*\\d$"
),
too_few = "align_start"
)
but this gives as result:
event year
<chr> <chr>
1 "World Championships" NA
2 "Summer Olympics " 12
3 "Olympics 20" 16
4 "Olympics 2020/2" 21
Question: which regex does give me the desired result?
Unnamed patterns in
separate_wider_regex()simplify this situation a bit.event = ".*"is greedy and matches everything before"\\s+(?=\\d)"-- any number of whitespace that is followed by a digit (assuming that year-part starts with a digit). This handles spaces in event but assumes there are none in year.Created on 2023-06-25 with reprex v2.0.2