is there a R function to group data when they have less than X% of difference on their X, Y and Z values?

286 Views Asked by At

Good evening to you all,

I'm trying to group data according to their values on 3 columns together. I would like to group rows that have X, Y and Z coordinates that are less than 12% different from each other. In other words, for each observation, I would like to check whether their X, Y and Z values are less than 12% different from another observation in the table. My data frame looks more or less like this:

ID X Y Z
1 33 3 1
2 27 42 1
3 32 3 58
4 33 3 1
5 25 2 22
6 37 119 3

IDs 1 and 4 should be part of the same group as there is less than 12% of difference between their X, Y and Z coordinates. I was thinking of looping through the whole dataframe (looping through x to find similar rows, then through Y to find similar rows and finally through Z and then check if some results would appear through X, Y and Z, but it is awfully complicated and seems to not be working very well. I tried as well to group them by hand in such a way:

df %>% group_by(IDscrit) %>% summarize(n())
df$group <- ifelse(df$X <2 & df$Y <2 & df$Z <2, "group1", 
       ifelse(df$X <5 & df$X >2 & df$Y <5 &df$Y >2 & df$Z <5 & df$Z >2, "group2", NA))

But for very obvious reasons (not handy and I am excluding a lot of values of being grouped arbitrarily although they could be close to each other, like the for example if my table was displaying these values:

ID X Y Z
14 4 3 1
26 6 3 2

they would not be grouped, although they should), therefore this cannot work either... I would be super grateful for any advice, really any... It is way above my abilities in coding and I don't even know by which part I should pick this problem anymore.

2

There are 2 best solutions below

0
On

The following gives

max {abs(X_a - X_b), abs(Y_a - Y_b), abs(Z_a - Z_b)}

df <- tribble(
~ID, ~X, ~Y, ~Z,    
1, 33, 3, 1, 
2, 27, 42, 1, 
3, 32, 3, 58, 
4, 33, 3, 1, 
5, 25, 2, 22, 
6, 37, 119, 3)

dist(df[,-1],method = "maximum")

Output:

    1   2   3   4   5
2  39                
3  57  57            
4   0  39  57        
5  21  40  36  21    
6 116  77 116 116 117

Now, you can form the groups you want, I guess.

1
On

Using DF defined reproducibly int he Note at the end we define is.near which takes two numeric vectors each representing X, Y and Z and returns TRUE if they are within 12% of each other and FALSE otherwise. (Change is.near as needed if you need a different definition of nearness.) Using that we form an adjacency matrix A from the rows, convert that to a graph g and extract the id's of the connected components.

library(igraph)

is.near <- function(r, s) {
  max(abs(r - s)) < 0.12 * min(abs(r), abs(s))
}

nr <- nrow(DF)
s <- split(as.matrix(DF[-1]), 1:nr)
A <- outer(s, s, Vectorize(is.near)) - diag(nr)
g <- graph_from_adjacency_matrix(A, "undirected")
transform(DF, group = components(g)$membership)
##   ID  X   Y  Z group
## 1  1 33   3  1     1
## 2  2 27  42  1     2
## 3  3 32   3 58     3
## 4  4 33   3  1     1
## 5  5 25   2 22     4
## 6  6 37 119  3     5

plot(g)

screenshot

Note

DF <- structure(list(ID = 1:6, X = c(33L, 27L, 32L, 33L, 25L, 37L), 
    Y = c(3L, 42L, 3L, 3L, 2L, 119L), Z = c(1L, 1L, 58L, 1L, 
    22L, 3L)), row.names = c(NA, -6L), class = "data.frame")