R gsub is removing words I didn't ask it to when using a column of words I wish to replace

I am trying to replace names with a "Z" in a column of free-text responses, using the ONS baby names dataset (freely available) as my list of names. However, some names are also potentially important words or acronyms that I don't want to remove, such as "My", "He" and "Ta". I have filtered these words out of my list of names and am using regex with word boundaries so that only the names I want are replaced, yet for some reason "Ta" still gets replaced, which I do not want! Is "Ta" being matched by some other regex pattern in the list? Does anyone know why this happens or how to fix it? Any help much appreciated! Regex is not my strong suit.

# Download ONS baby names data (1996-2021) and save in the working folder
# Data source: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesinenglandandwalesfrom1996
filepath <- "insert your file path here"

library(tidyverse)
library(readxl)
library(textclean)
library(janitor)
#library(qdap)

#---Remove first names-----#####
# Read in ONS baby names data (1996-2021) and create a list of names

excel_sheets(paste0(filepath, "babynames1996to2021.xlsx"))                    
boynames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"), 
                       "1", skip = 7) %>% 
  select(Name)

girlnames <- read_excel(paste0(filepath, "babynames1996to2021.xlsx"), 
                        "2", skip = 7) %>% 
  select(Name)

#remove names which we don't want to replace from the text
firstnames <- bind_rows(boynames, girlnames) %>% 
  mutate(no_char = nchar(Name)) %>%
  filter(no_char > 1) %>% #removes single letter names
  select(word = Name) %>%
  filter(word != "My") %>% #removes the name "My"
  filter(word != "He") %>% #removes the name "He"
  filter(word != "The") %>% #removes the name "The"
  filter(word != "His") %>% #removes the name "His"            
  filter(word != "A") %>% #removes the name "A" 
  filter(word != "Now") %>% #removes the name "Now" 
  filter(word != "To") %>% #removes the name "To"
  filter(word != "Ta") #removes the name "Ta"

#use \\b to set word boundaries to find exact match of entire name
firstnames$word2 <- paste0("\\b",firstnames$word,"\\b")

#test text
text <-  "Some text with Zoha, Zohal, and Zuzia in it."
text2 <- "Some text with A-Jay, A.J. and Aaban in it!"
text3 <- "Some text with Ta, My, and He in it"

#text as a column in a tibble (akin to our real data)
test <- tibble(comment=c(text,text2,text3))

for(i in 1:length(firstnames$word2)){
  test$comment <- gsub(firstnames$word2[i], "Z", test$comment)
}

test

#this removes Ta, which it shouldn't!

Wiktor Stribiżew On BEST ANSWER

Here is what you can do and why:

  • The list of boys' and girls' names contains entries with special regex metacharacters (for example, the "." in "T."), which need to be escaped in order to be treated as literal characters. Since you are building the regex patterns dynamically, you can use the regex escaping function I wrote some time ago (regex.escape in the snippet below).
  • As you plan to match whole words only, but the words themselves can start or end with special characters, you CANNOT rely on \b word boundaries: \b only matches between a word character and a non-word character, so it fails at the edge of a name such as "Abdul-" or "T.". You need to use adaptive word boundaries instead; again, see my previous post.
  • Since there are names that share the same prefix (Abdul, Abdul-, Abdul-Ahad, etc.), you cannot rely on the order of the names in the input sheet; sort the names by length in descending order so that the longest names are replaced first.
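
To make the three points above concrete, here is a small base R sketch of each failure in isolation. The test strings "Ask T. about it" and "Met Abdul-Ahad today" are made up for illustration; the first one is the test text from the question.

```r
txt <- "Some text with Ta, My, and He in it"

# 1. Unescaped, the "." in the name "T." matches any character, so "Ta" is hit
wildcard_hit <- gsub("\\bT.\\b", "Z", txt)
print(wildcard_hit)
# [1] "Some text with Z, My, and He in it"

# 2. With the dot escaped, the trailing \b sits between "." and " ",
#    two non-word characters, so there is no word boundary and nothing matches
boundary_miss <- gsub("\\bT\\.\\b", "Z", "Ask T. about it")
print(boundary_miss)
# [1] "Ask T. about it"

# 3. A shorter name applied first eats the prefix of a longer one
prefix_clash <- gsub("\\bAbdul\\b", "Z", "Met Abdul-Ahad today")
print(prefix_clash)
# [1] "Met Z-Ahad today"
```

The first case is the one the question runs into; the other two only show up once the names are escaped, which is why escaping alone is not enough.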

Here is a snippet with just some example names and my example comment test string:

## Example name vector
somenames <- c("Abdul", "Abdul-", "Abdul-Ahad", "Abduallah", "T.")

## Sorting function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Here, I create a sample data frame with the `word` column with sorted names
firstnames <- data.frame(word=sort.by.length.desc(somenames), stringsAsFactors = FALSE)

## This is the regex escaping function
regex.escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
## Now, build the regex patterns for each name using adaptive word boundaries
firstnames$word2 <- paste0("(?!\\B\\w)",regex.escape(firstnames$word),"(?!\\B\\w)")

## Test text
comment <- "Some text with Abdul Abdul- Abduallah Ta, My, and He in it"

## Replacements
for(i in 1:length(firstnames$word2)){
  comment <- gsub(firstnames$word2[i], "Z", comment, perl=TRUE)
}

comment
## => [1] "Some text with Z Z Z Ta, My, and He in it"

See the R demo online.

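Mind the perl=TRUE argument in comment <- gsub(firstnames$word2[i], "Z", comment, perl=TRUE): the lookaheads used for the adaptive word boundaries are only supported by the PCRE engine, not by R's default regex engine.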
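
If it helps, the same steps can be wrapped in a helper and run against the three test strings from the question. This is only a sketch: replace_names and the short sample_names vector are hypothetical names made up here (the real list would come from the ONS sheet), and a plain data.frame stands in for the tibble.

```r
## Regex escaping function (same as in the snippet above)
regex.escape <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}

## Hypothetical helper: sort by length, escape, add adaptive boundaries, replace
replace_names <- function(text, names, replacement = "Z") {
  names <- names[order(-nchar(names))]            # longest names first
  patterns <- paste0("(?!\\B\\w)", regex.escape(names), "(?!\\B\\w)")
  for (p in patterns) {
    text <- gsub(p, replacement, text, perl = TRUE)  # gsub is vectorised over text
  }
  text
}

## Made-up sample of names, standing in for the full ONS list
sample_names <- c("Zoha", "Zohal", "Zuzia", "A-Jay", "A.J.", "Aaban", "T.")

test <- data.frame(
  comment = c("Some text with Zoha, Zohal, and Zuzia in it.",
              "Some text with A-Jay, A.J. and Aaban in it!",
              "Some text with Ta, My, and He in it"),
  stringsAsFactors = FALSE
)

test$comment <- replace_names(test$comment, sample_names)
print(test$comment)
## [1] "Some text with Z, Z, and Z in it."
## [2] "Some text with Z, Z and Z in it!"
## [3] "Some text with Ta, My, and He in it"
```

With tens of thousands of names the loop runs one gsub() per name over the whole column, so it may be slow on large data; that trade-off is the same as in the original loop.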

Nir Graham On

Short answer: add fixed = TRUE to the end of your gsub() call.

Long answer: some of your words contain punctuation, which is interpreted as regex syntax. For example, look at firstnames$word[14757].

It is T., and as a regex the "." matches any character, so it matches Ta (unless fixed = TRUE is used).
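
A quick way to see which names are affected, and what fixed = TRUE does to them, is sketched below. The nms vector is a made-up sample, not the ONS list. One thing to keep in mind: fixed = TRUE treats the whole pattern as literal text, including any \\b that has been pasted on, so it would have to be applied to the bare name, and then it also matches inside longer words.

```r
# Made-up sample of names (not the ONS list)
nms <- c("Tom", "T.", "A.J.", "A-Jay", "Al")

# Names containing punctuation, which regex may interpret specially
punct_names <- nms[grepl("[[:punct:]]", nms)]
print(punct_names)
# [1] "T."    "A.J."  "A-Jay"

# fixed = TRUE on the bare name: "." is literal, so "Ta" is left alone
literal_dot <- gsub("T.", "Z", "Ta and T.", fixed = TRUE)
print(literal_dot)
# [1] "Ta and Z"

# fixed = TRUE with \\b still pasted on: the pattern is literal, nothing matches
literal_boundary <- gsub("\\bT.\\b", "Z", "Ta and T.", fixed = TRUE)
print(literal_boundary)
# [1] "Ta and T."

# Without boundaries, a short name is also replaced inside longer words
inside_word <- gsub("Al", "Z", "Al said Also", fixed = TRUE)
print(inside_word)
# [1] "Z said Zso"
```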