My data includes a Name column. Some names are written in up to eight different ways. I tried grouping them with the following code:
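# x is the character vector of names; note that the loop consumes it
groups <- list()
i <- 1
while (length(x) > 0) {
  # approximate (fuzzy) matches of the first remaining name
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  # drop the matched names so they are not grouped twice
  x <- x[-id]
  i <- i + 1
}
head(groups)
groups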
Next, I want to add a new column that returns, for each row, a single standard notation of the name (for example, the most commonly used one). The result should look like:
    A              B
1.  John Snow      John Snow
2.  Peter Wright   Peter Wright
3.  john snow      John Snow
4.  John snow      John Snow
5.  Peter wright   Peter Wright
6.  J. Snow        John Snow
7.  John Snow      John Snow
etc.
How can I get there?
This answer is heavily based on a previous question/answer that put strings into groups. It merely adds finding the mode of each group and assigning that mode back to the original strings.
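A minimal sketch of that approach, not necessarily the code from the linked answer. The sample data and the names Names, group_id, mode_of and Standard are assumptions made for illustration. Group membership is tracked by position so the original row order is preserved.

```r
# Hypothetical sample data standing in for the Name column
Names <- c("John Snow", "Peter Wright", "john snow", "John snow",
           "Peter wright", "J. Snow", "John Snow", "Peter Wright")

# Step 1: greedy fuzzy grouping, as in the question's loop, but recording
# a group number per position instead of deleting elements from the vector
group_id <- rep(NA_integer_, length(Names))
g <- 1L
while (anyNA(group_id)) {
  remaining <- which(is.na(group_id))
  seed <- Names[remaining[1]]
  hits <- agrep(seed, Names[remaining], ignore.case = TRUE, max.distance = 0.1)
  hits <- union(1L, hits)  # always include the seed so the loop makes progress
  group_id[remaining[hits]] <- g
  g <- g + 1L
}

# Step 2: the mode (most frequent spelling) within each group
mode_of <- function(v) names(which.max(table(v)))
modes <- vapply(split(Names, group_id), mode_of, character(1))

# Step 3: map every original string to the mode of its group
Standard <- unname(modes[as.character(group_id)])
data.frame(A = Names, B = Standard)
```

Ties between equally frequent spellings are broken by the sort order of table(), which can depend on the locale. Abbreviated forms such as "J. Snow" may end up in a group of their own when max.distance is small.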
You will likely need to experiment with the right value of the max.distance argument to agrep.
If you want to add the answer to the data.frame, just add
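A minimal sketch of that assignment, assuming the standardized names are already available in a vector with one entry per row. The names df, B and Standard are hypothetical and chosen to match the question's desired output.

```r
# Hypothetical data frame with the original names in column A
df <- data.frame(A = c("John Snow", "john snow", "Peter Wright"),
                 stringsAsFactors = FALSE)

# Assumed result of the grouping step: one standardized name per row
Standard <- c("John Snow", "John Snow", "Peter Wright")

# Add the standardized names as a new column B
df$B <- Standard
df
```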
To write the result so that it is accessible from Excel, use
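One common option is write.csv, since Excel opens CSV files directly. This is a sketch under that assumption. The data frame df and the file name are hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```r
# Hypothetical result: original names in A, standardized names in B
df <- data.frame(A = c("John Snow", "john snow", "Peter Wright"),
                 B = c("John Snow", "John Snow", "Peter Wright"),
                 stringsAsFactors = FALSE)

# Hypothetical output path; replace with the location you want
out_file <- file.path(tempdir(), "names_standardized.csv")

# row.names = FALSE avoids an extra unnamed index column in the file
write.csv(df, out_file, row.names = FALSE)
```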