Algorithm design: Find CCs containing specific nodes from a large edge list

95 Views Asked by At

Given: Large edge list, with about 90-100 million edges, 2 million nodes. These edges compose a giant graph with thousands of connected components.

I also have a list of nodes of interest. I want to extract the edges of only the CCs containing these nodes.

For example here's a tiny example: Consider a small graph with 3 CCs inside it.

edge_list = [(1,2), (3,2), (1,1), (4,5), (5,6), (4,6), (7,8), (8,9), (9,7)] #three CCs
## Below commands for visualization sake
G = nx.Graph() 
G.add_edges_from(edges)
nx.draw(G, with_labels=True)

And I have nodes of interest: NOI = [1,5]

I want a function that does the following:

CC_list = find_ccs(edge_list, NOI)
print(CC_list)
#output: [[(1,2), (3,2), (1,1)],
#          [(4,5), (5,6), (4,6)]]

Notice that only the edges of the two CCs that contain the NOIs are returned. I want to do this on a massive scale.

I'm open to any solution using Networkx, StellarGraph, or, most preferably PySpark. I can read the edge list using pyspark and once I have the filtered edge list, I can convert it to a CC using whatever library.

NOTE: The entire graph itself is too large to build fully using networkx or stellargraph. Thus, I have to take in an edge list and only extract the ccs I want. The above example with networkx is just for visualization purposes.

1

There are 1 best solutions below

3
On BEST ANSWER

Pick a node of interest. Run BFS from it to find the connected component that contains it. Each time you add a node to the CC, check if it's a node-of-interest and remove it from the set to check if it is. Repeat until you've found all the CCs containing nodes of interest.

Running time is O(V+E), where V & E are the nodes and edges of CCs that contain nodes of interest, not the larger graph.