I am trying to replace names with a "Z" in a column of free-text responses, using the ONS baby names dataset (freely available). However, some names are also potentially important words or acronyms that I don't want to remove, such as "My", "He" and "Ta". I have removed these words from my list of names and am using regex with word boundaries to replace only the names I want to, yet for some reason it still replaces "Ta", which I do not want! Is "Ta" some sort of regex pattern in and of itself? Does anyone know why this happens or how to fix it? Any help much appreciated! Regex is not my strong suit.
# Download ONS baby names data (1996-2021) and save in the working folder
# Data source: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesinenglandandwalesfrom1996
filepath <- "insert your file path here"
library(tidyverse)
library(readxl)
library(textclean)
library(janitor)
#library(qdap)
#---Remove first names-----#####
# Read in ONS baby names data (1996-2021) and create a list of names
excel_sheets(paste0(filepath, "babynames1996to2021.xlsx"))
boynames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"),
                       "1", skip = 7) %>%
  select(Name)
girlnames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"),
                        "2", skip = 7) %>%
  select(Name)
#remove names which we don't want to replace from the text
firstnames <- bind_rows(boynames, girlnames) %>%
  mutate(no_char = nchar(Name)) %>%
  filter(no_char > 1) %>% #removes single letter names
  select(word = Name) %>%
  filter(word != "My") %>% #removes the name "My"
  filter(word != "He") %>% #removes the name "He"
  filter(word != "The") %>% #removes the name "The"
  filter(word != "His") %>% #removes the name "His"
  filter(word != "A") %>% #removes the name "A"
  filter(word != "Now") %>% #removes the name "Now"
  filter(word != "To") %>% #removes the name "To"
  filter(word != "Ta") #removes the name "Ta"
#use \\b to set word boundaries to find exact match of entire name
firstnames$word2 <- paste0("\\b",firstnames$word,"\\b")
#test text
text <- "Some text with Zoha, Zohal, and Zuzia in it."
text2 <- "Some text with A-Jay, A.J. and Aaban in it!"
text3 <- "Some text with Ta, My, and He in it"
#text as a column in a tibble (akin to our real data)
test <- tibble(comment=c(text,text2,text3))
for (i in seq_along(firstnames$word2)) {
  test$comment <- gsub(firstnames$word2[i], "Z", test$comment)
}
test
#this removes Ta, which it shouldn't!
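Here is what you can do and why:

- Escape the names before building the patterns. Some contain regex metacharacters, e.g. the `.` in A.J., and an unescaped `.` matches any character.
- You cannot rely on `\b` word boundaries, as they just won't work in some cases: a name that ends in a non-word character, such as A.J., only satisfies a trailing `\b` when a word character follows it. You need to use adaptive word boundaries, see my previous post.
- Since you have names with overlapping parts (Abdul, Abdul-, Abdul-Ahad, etc.), you cannot rely on the order of the names in the input sheet; you should sort the names by length in descending order.

Here is a snippet with just some example names and my example comment test string: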
See the R demo online.
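Mind the `perl=TRUE` argument in `comment <- gsub(firstnames$word2[i], "Z", comment, perl=TRUE)`. The adaptive boundaries are built from lookarounds, which R's default regex engine does not support.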
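
A minimal sketch of this approach, which is not the demo referred to above: it escapes each name, wraps it in lookaround-based adaptive boundaries, and processes the names longest first. The sample names, the `escape_regex` helper and the variable names are assumptions made for illustration; with the real data the names would come from `firstnames$word`.

```r
# Hypothetical sample of names; the real list would come from firstnames$word
names_to_mask <- c("Abdul", "Abdul-Ahad", "A-Jay", "A.J.", "Aaban",
                   "Zoha", "Zohal", "Zuzia")

# Hypothetical helper: escape regex metacharacters so that, for example,
# the "." in "A.J." is matched literally
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", x, perl = TRUE)
}

# Longest names first, so "Abdul-Ahad" is tried before "Abdul"
names_sorted <- names_to_mask[order(nchar(names_to_mask), decreasing = TRUE)]

# Adaptive boundaries: a word boundary is only required on a side where
# the name itself starts or ends with a word character
patterns <- paste0("(?!\\B\\w)", escape_regex(names_sorted), "(?<!\\w\\B)")

comment <- c(
  "Some text with Zoha, Zohal, and Zuzia in it.",
  "Some text with A-Jay, A.J. and Aaban in it!",
  "Some text with Ta, My, and He in it"
)

# perl = TRUE is needed because the boundaries use lookarounds
for (p in patterns) {
  comment <- gsub(p, "Z", comment, perl = TRUE)
}

print(comment)
```

In this sketch "Ta", "My" and "He" are left alone because none of the sample patterns can match them. A loop is kept to mirror the question's code; joining the escaped names into one alternation would be another option.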