Identify similar names in same row, then choose Mode

179 Views Asked by At

My data includes a Name column. Some names are written in upto eight different ways. I tried grouping them with the following code:

groups <- list()
i <- 1
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

head(groups)
groups

Next, I want to add a new column that returns the, for example, most commonly used notation of a name for each row. The result should look like:

      A            B
1. John Snow    John Snow
2. Peter Wright Peter Wright
3. john snow    John Snow
4. John snow    John Snow
5. Peter wright Peter Wright
6. J. Snow      John Snow
7. John Snow    John Snow
etc.

How can I get there?

1

There are 1 best solutions below

3
G5W On

This answer is heavily based on a previous question/answer which put strings into groups. This answer merely adds finding the mode for each group and assigning the right mode to the original strings.

## The data
Names = c("John Snow", "Peter Wright",  "john snow",
    "John snow", "Peter wright", "J. Snow", "John Snow")

## Grouping like in the previous question
groups <- list()
i <- 1
x = Names
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.25)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

## Find the mode for each group
Modes = sapply(groups, function(x) names(which.max(table(x))))

## Assign the correct mode to each string
StandardName = rep("", length(Names))
for(i in seq_along(groups)) {
    StandardName[Names %in% groups[[i]]] = Modes[i]
}

StandardName
[1] "John Snow"    "Peter wright" "John Snow"    "John Snow"    "Peter wright"
[6] "John Snow"    "John Snow" 

You will likely need to experiment with the right value of the max.distance argument to agrep.

If you want to add the answer to the data.frame, just add

df$StandardName = StandardName

To write the result so that it is accessible from Excel, use

write.csv(df, "MyData.csv")