Identifying non-overlapping values between factors in a dataframe in R


I would like to identify all non-overlapping values between groups (factor levels) in a dataframe. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in sepal length and sepal width. In both petal length and petal width, setosa overlaps with neither versicolor nor virginica.

What I want can easily be checked manually, for example by comparing per-species ranges or by plotting the values:

tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
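
As a side note, the four tapply calls above can probably be collapsed into a single call. This is only a sketch of the same manual check using base R's aggregate; the name `ranges` is made up for illustration:

```r
# Per-species min and max of every numeric column in one call.
# Each measurement becomes a two-column matrix (min, max) in the result.
ranges <- aggregate(. ~ Species, data = iris, FUN = range)
print(ranges)
```

It still has to be read by eye, so it does not solve the problem for large datasets.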

But it's impractical to do this manually for large datasets, so I'd like to write a function that identifies non-overlapping values between factors in dataframes like iris. The output could be a list of matrices with TRUE or FALSE (indicating non-overlap and overlap, respectively), one for each variable in the dataset. For example, the output for iris would be a list of 4 matrices:

$1.Sepal.Length
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$2.Sepal.Width
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$3.Petal.Length
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

$4.Petal.Width
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

I accept suggestions of different outputs, as long as they identify all non-overlapping values.

DPH On

This is one possible solution using the tidyverse:

library(dplyr)

# build custom function
my_fun <- function(x){
    # build tibble from input data (column with metric) and Species vector from iris
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals)) 

    ret <- myDf %>%
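        # cross join the summary with itself to get every pair of species
        # (cross_join() needs dplyr >= 1.1.0; it replaces the deprecated full_join(by = character()))
        dplyr::cross_join(myDf, suffix = c("_1", "_2")) %>% 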
        # convert operation to row wise
        dplyr::rowwise() %>% 
        # same species -> NA; otherwise TRUE when the two ranges do NOT overlap
        # (negated, because overlapping ranges should give FALSE)
        dplyr::mutate(res = ifelse(
            Species_1 == Species_2, NA,
            !(dplyr::between(mini_1, mini_2, maxi_2) | dplyr::between(maxi_1, mini_2, maxi_2) |
              dplyr::between(mini_2, mini_1, maxi_1) | dplyr::between(maxi_2, mini_1, maxi_1))
        )) %>%
        # drop the row-wise grouping again before reshaping
        dplyr::ungroup() %>%
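        # make tibble wide to get the wanted layout; id_cols keeps Species_1 as the row key
        # and drops the min/max helper columns (an unnamed column selection errors in current tidyr)
        tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%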
        # need it to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL
    return(ret)
}

purrr::map(iris[, 1:4], ~my_fun(.x))

$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
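
For data other than iris, a base R variant that takes the grouping factor as an argument may be easier to reuse. The following is only a sketch under the same assumption as above, namely that "non-overlapping" means the per-group ranges [min, max] are disjoint; `non_overlap` and its arguments `x` and `g` are hypothetical names, not part of any package:

```r
# Hypothetical helper: pairwise non-overlap of per-group ranges
# for one numeric vector `x` split by the grouping factor `g`.
non_overlap <- function(x, g) {
    groups <- split(x, droplevels(as.factor(g)))
    lo <- vapply(groups, min, numeric(1), na.rm = TRUE)  # per-group minimum
    hi <- vapply(groups, max, numeric(1), na.rm = TRUE)  # per-group maximum
    # two closed ranges are disjoint when one ends before the other starts
    m <- outer(hi, lo, "<")   # m[i, j] is TRUE when max of group i < min of group j
    m <- m | t(m)             # also cover the case max of group j < min of group i
    diag(m) <- NA             # a group compared with itself
    m
}

# apply to every numeric column, using Species as the grouping factor
num_cols <- vapply(iris, is.numeric, logical(1))
res <- lapply(iris[num_cols], non_overlap, g = iris$Species)
res$Petal.Length
```

Because the comparison is done once on the vectors of minima and maxima, no join or row-wise step is needed, and the function does not refer to iris internally.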