Computing graph efficiency metrics of a very large network

576 Views Asked by At

What are strategies for calculating the nodal and global efficiency of very large graphs in R?

I'm attempting to calculate the global efficiency of a very large igraph with brainGraph::efficiency(my_graph, type = "global").

library(igraph); library(brainGraph)  
g <- readRDS("path_to_my_graph.rds")  

> ecount(g); vcount(g) # count of edges and vertices
[1] 715758
[1] 290190

It reliably crashes R every time. Global efficiency is the mean of all nodal efficiencies, so I've attempted to calculate it that way with no success. The weights on each edge of my graph are all 1, so I've omitted the weights, but R still crashes.

# all of these cause R to crash
efficiency(g, type = "global")
efficiency(g, type = "nodal")
efficiency(g, type = "global", weights = NA)
efficiency(g, type = "nodal",  weights = NA)

My graph (~37MB) is available here on GoogleDrive as an .rds file for those who want data to test.

1

There are 1 best solutions below

0
On BEST ANSWER

R crashes because brainGraph::efficiency() attempts to calculate an enormous and dense distance matrix, which overwhelms the memory of my machine (32 GB). But I found a solution that chunks out the operation and runs in parallel.

Global efficiency is the mean of all nodal efficiencies in the graph. The nodal efficiency of a vertex i is:

enter image description here

We can sequentially calculate the nodal efficiency for each vertex in the graph, which splits the distance matrix calculation into smaller manageable bits. Because the efficiency of each vertex is independent, we can parallelize the operation so it doesn't take forever.

library(igraph)  
library(doParallel)

# nodal efficiency function
get_eff <- function(i){return((1/(length(V(g)) - 1))*sum(1/distances(g, V(g)[i])[-i]))}

no_cores <- detectCores() - 1 
cl       <- makeCluster(no_cores)
registerDoParallel(cl)  

result <- foreach(i = seq_along(V(g)), .combine = c, .packages = "igraph") %dopar% get_eff(i)

stopCluster(cl)
rm(cl)

global_eff <- mean(result)

Moreover, we can plot the distribution of nodal efficiencies, as well as the global (mean) efficiency, which gives us a better understanding of the network.

library(ggplot2)
data.frame(x = result) %>% 
  ggplot(aes(x)) + 
  geom_histogram() + 
  geom_vline(xintercept = mean(result), col = "red") # global efficiency
  theme_minimal() +
  labs(title = "Nodal efficiences", x = "", y = "Count")

enter image description here