How to identify and plot partial order on dummy variables in R (ideally tidyverse)

112 Views Asked by At

I have a system of n dummy variables/set indicators. I want to automatically identify any nesting. Take a simple n = 3 case.

expand.grid(0:1, 0:1, 1)

Var1 Var2 Var3

0 0 1

1 0 1

0 1 1

1 1 1

E.g. Here Var1 and Var2 are not nested ("incomparable"), but both are nested (subsets of) in Var3. This is obviously a partially ordered set (ordered by the relation =<) so I would like to automatically identify the poset and plot its hasse diagram automatically. I see there are some packages for hasse diagrams but they seem like to not do this common and simple job very naturally. Is there an easy way?

The following begins to do the job but is hardly elegant (thanks to help from @danlooo below):

find_subset_poset <- \(data){
  
  is_nested <- \(df, col1, col2){
    
    all(pull(df, ensym(col1)) <= pull(df, ensym(col2)))
    
  }
  
  
  get_all_pairs = \(data){
    data %>% 
      names() %>% 
      combinat::combn(2,) %>%
      t() %>%
      as_tibble(.name_repair = "unique") %>% 
      set_names(c("col1", "col2")) 
  }
  
  all_pairs = get_all_pairs(data)
  is_eq = \(df, col1, col2){
    
    all(pull(df, ensym(col1)) == pull(df, ensym(col2)))
  }
  
  equal_indicator_cols = all_pairs %>% filter(pmap_lgl(., is_eq, df = data))
  
  # UNITE EQUAL COLUMNS
  while(nrow(equal_indicator_cols) > 0){
    sames = unlist(equal_indicator_cols[1, ])
    message("The following indicator columns are equal: ", sames)
    y = c(unlist(data[sames[1]]))
    
    # REMOVE DUPLICATE SETS
    data[sames] <- NULL
    nm <- sym(str_c(sames, collapse = ","))
    
    # ADD A SINGLE REPLACEMENT
    data = data %>% mutate(!!nm := y)
    
    # RECURSIVELY LOOK FOR FURTHER MATCHES
    all_pairs = get_all_pairs(data) 
    equal_indicator_cols = all_pairs %>% filter(pmap_lgl(., is_eq, df = data))
  }
  
  all_pairs %>%
    pmap(\(col1, col2) count(data, !!ensym(col1), !!ensym(col2))) %>% 
    map(print)
  
  all_ordered_pairs_of_non_equal_indicators = 
    all_pairs  %>% 
    rowwise() %>% 
    group_split() %>% 
    map(combinat::permn) %>% 
    map_depth(2, ~set_names(.x, c("col1", "col2"))) %>%
    flatten() %>% 
    reduce(rbind) 
  
  all_ordered_pairs_of_non_equal_indicators %>% 
    filter(pmap_lgl(., is_nested, df = data)) %>% 
    select(col1, col2) %>% 
    as.matrix() %>% 
    POSetR::poset(elements = names(data)) %>% 
    plot()
  
}
1

There are 1 best solutions below

8
On

Let data be a table of indicator variables w.l.o.g. e.g. col1 and col2 describing a poset with implicit ordering <=. col1 is nested in col2 if and only if they are not crossed. col1 and col2 are crossed if and only if all possible pairs of the unique values of col 1 and col2 (here 0 and 1) are present as rows in data.

library(hasseDiagram)
#> Loading required package: Rgraphviz
#> Loading required package: graph
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: grid
library(tidyverse)

is_nested <- function(data, col1, col2) {
  # must not nested in itself
  if (col1 == col2) {
    return(FALSE)
  }

  orig_nrow <- data %>% nrow()
  nested_nrow <- data %>%
    count(!!sym(col1), !!sym(col2)) %>%
    nrow()

  # not crossed
  return(orig_nrow != nested_nrow)
}

data <- expand.grid(0:1, 0:1, 1)
vars <- colnames(data)

vars %>%
  combn(2) %>%
  t() %>%
  as_tibble() %>%
  mutate(
    is_nested = V1 %>% map2_lgl(V2, ~ is_nested(data, .x, .y)),
    V1 = V1 %>% factor(levels = vars),
    V2 = V2 %>% factor(levels = vars),
  ) %>%
  # must have all pairs e.g. (a,b) and (b,a) for the adjecency matrix
  # but matrix must not be symmmetric e.g. (a,b) = TRUE and (b,a) = FALSE
  complete(V1, V2, fill = list(is_nested = FALSE)) %>%
  pivot_wider(names_from = V2, values_from = is_nested) %>%
  column_to_rownames("V1") %>%
  as.matrix() %>%
  hasse(vars)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Created on 2022-03-16 by the reprex package (v2.0.0)