I am trying to implement the above community detection algorithm in Java, and while I have access to C++ code, and the original paper - I can't make it work at all. My major issue is that I don't understand the purpose of the code - i.e. how the algorithm works. In practical terms, my code gets stuck in what seems to be an infinite loop at mergeBestQ
, the list heap
seems to be getting larger on each iteration (as I would expect from the code), but the value of topQ
is always returning the same value.
The graph I am testing this on is quite large (300,000 nodes, 650,000 edges). The original code I am using for my implementation is from the SNAP library (https://github.com/snap-stanford/snap/blob/master/snap-core/cmty.cpp). What would be great is if someone could explain to me the intuition of the algorithm, it seems to be initially setting each node to be in it's own community, then recording the modularity value (whatever that is) of each pair of connected nodes in the graph, then finding the pair of nodes which have the highest modularity and moving them to the same community. In addition, if someone could provide some mid level pseudo code, that would be great. Here is my implementation thus far, I have tried to keep it in one file for the sake of brevity, however CommunityGraph and CommunityNode are elsewhere (should not be required). Graph maintain a list of all nodes and each node maintains a list of its connections to other nodes. When running it never gets past the line while(this.mergeBestQ()){}
UPDATE - found several bugs in my code after a thorough review. The code now completes VERY quickly, but doesnt fully implement the algorithm, for example of the 300,000 nodes in the graph, it states there are approximately 299,000 communities (i.e. roughly 1 node per community). I have listed the updated code below. /// Clauset-Newman-Moore community detection method. /// At every step two communities that contribute maximum positive value to global modularity are merged. /// See: Finding community structure in very large networks, A. Clauset, M.E.J. Newman, C. Moore, 2004 public class CNMMCommunityMetric implements CommunityMetric{ private static class DoubleIntInt implements Comparable{ public double val1; public int val2; public int val3; DoubleIntInt(double val1, int val2, int val3){ this.val1 = val1; this.val2 = val2; this.val3 = val3; }
@Override
public int compareTo(DoubleIntInt o) {
//int this_sum = this.val2 + this.val3;
//int oth_sum = o.val2 + o.val3;
if(this.equals(o)){
return 0;
}
else if(val1 < o.val1 || (val1 == o.val1 && val2 < o.val2) || (val1 == o.val1 && val2 == o.val2 && val3 < o.val3)){
return 1;
}
else{
return -1;
}
//return this.val1 < o.val1 ? 1 : (this.val1 > o.val1 ? -1 : this_sum - oth_sum);
}
@Override
public boolean equals(Object o){
return this.val2 == ((DoubleIntInt)o).val2 && this.val3 == ((DoubleIntInt)o).val3;
}
@Override
public int hashCode() {
int hash = 3;
hash = 79 * hash + this.val2;
hash = 79 * hash + this.val3;
return hash;
}
}
private static class CommunityData {
double DegFrac;
TIntDoubleHashMap nodeToQ = new TIntDoubleHashMap();
int maxQId;
CommunityData(){
maxQId = -1;
}
CommunityData(double nodeDegFrac, int outDeg){
DegFrac = nodeDegFrac;
maxQId = -1;
}
void addQ(int NId, double Q) {
nodeToQ.put(NId, Q);
if (maxQId == -1 || nodeToQ.get(maxQId) < Q) {
maxQId = NId;
}
}
void updateMaxQ() {
maxQId=-1;
int[] nodeIDs = nodeToQ.keys();
double maxQ = nodeToQ.get(maxQId);
for(int i = 0; i < nodeIDs.length; i++){
int id = nodeIDs[i];
if(maxQId == -1 || maxQ < nodeToQ.get(id)){
maxQId = id;
maxQ = nodeToQ.get(maxQId);
}
}
}
void delLink(int K) {
int NId=getMxQNId();
nodeToQ.remove(K);
if (NId == K) {
updateMaxQ();
}
}
int getMxQNId() {
return maxQId;
}
double getMxQ() {
return nodeToQ.get(maxQId);
}
};
private TIntObjectHashMap<CommunityData> communityData = new TIntObjectHashMap<CommunityData>();
private TreeSet<DoubleIntInt> heap = new TreeSet<DoubleIntInt>();
private HashMap<DoubleIntInt,DoubleIntInt> set = new HashMap<DoubleIntInt,DoubleIntInt>();
private double Q = 0.0;
private UnionFind uf = new UnionFind();
@Override
public double getCommunities(CommunityGraph graph) {
init(graph);
//CNMMCommunityMetric metric = new CNMMCommunityMetric();
//metric.getCommunities(graph);
// maximize modularity
while (this.mergeBestQ(graph)) {
}
// reconstruct communities
HashMap<Integer, ArrayList<Integer>> IdCmtyH = new HashMap<Integer, ArrayList<Integer>>();
Iterator<CommunityNode> ns = graph.getNodes();
int community = 0;
TIntIntHashMap communities = new TIntIntHashMap();
while(ns.hasNext()){
CommunityNode n = ns.next();
int r = uf.find(n);
if(!communities.contains(r)){
communities.put(r, community++);
}
n.setCommunity(communities.get(r));
}
System.exit(0);
return this.Q;
}
private void init(Graph graph) {
double M = 0.5/graph.getEdgesList().size();
Iterator<Node> ns = graph.getNodes();
while(ns.hasNext()){
Node n = ns.next();
uf.add(n);
int edges = n.getEdgesList().size();
if(edges == 0){
continue;
}
CommunityData dat = new CommunityData(M * edges, edges);
communityData.put(n.getId(), dat);
Iterator<Edge> es = n.getConnections();
while(es.hasNext()){
Edge e = es.next();
Node dest = e.getStart() == n ? e.getEnd() : e.getStart();
double dstMod = 2 * M * (1.0 - edges * dest.getEdgesList().size() * M);//(1 / (2 * M)) - ((n.getEdgesList().size() * dest.getEdgesList().size()) / ((2 * M) * (2 * M)));// * (1.0 - edges * dest.getEdgesList().size() * M);
dat.addQ(dest.getId(), dstMod);
}
Q += -1.0 * (edges*M) * (edges*M);
if(n.getId() < dat.getMxQNId()){
addToHeap(createEdge(dat.getMxQ(), n.getId(), dat.getMxQNId()));
}
}
}
void addToHeap(DoubleIntInt o){
heap.add(o);
}
DoubleIntInt createEdge(double val1, int val2, int val3){
DoubleIntInt n = new DoubleIntInt(val1, val2, val3);
if(set.containsKey(n)){
DoubleIntInt n1 = set.get(n);
heap.remove(n1);
if(n1.val1 < val1){
n1.val1 = val1;
}
n = n1;
}
else{
set.put(n, n);
}
return n;
}
void removeFromHeap(Collection<DoubleIntInt> col, DoubleIntInt o){
//set.remove(o);
col.remove(o);
}
DoubleIntInt findMxQEdge() {
while (true) {
if (heap.isEmpty()) {
break;
}
DoubleIntInt topQ = heap.first();
removeFromHeap(heap, topQ);
//heap.remove(topQ);
if (!communityData.containsKey(topQ.val2) || ! communityData.containsKey(topQ.val3)) {
continue;
}
if (topQ.val1 != communityData.get(topQ.val2).getMxQ() && topQ.val1 != communityData.get(topQ.val3).getMxQ()) {
continue;
}
return topQ;
}
return new DoubleIntInt(-1.0, -1, -1);
}
boolean mergeBestQ(Graph graph) {
DoubleIntInt topQ = findMxQEdge();
if (topQ.val1 <= 0.0) {
return false;
}
// joint communities
int i = topQ.val3;
int j = topQ.val2;
uf.union(i, j);
Q += topQ.val1;
CommunityData datJ = communityData.get(j);
CommunityData datI = communityData.get(i);
datI.delLink(j);
datJ.delLink(i);
int[] datJData = datJ.nodeToQ.keys();
for(int _k = 0; _k < datJData.length; _k++){
int k = datJData[_k];
CommunityData datK = communityData.get(k);
double newQ = datJ.nodeToQ.get(k);
//if(datJ.nodeToQ.containsKey(i)){
// newQ = datJ.nodeToQ.get(i);
//}
if (datI.nodeToQ.containsKey(k)) {
newQ = newQ + datI.nodeToQ.get(k);
datK.delLink(i);
} // K connected to I and J
else {
newQ = newQ - 2 * datI.DegFrac * datK.DegFrac;
} // K connected to J not I
datJ.addQ(k, newQ);
datK.addQ(j, newQ);
addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
}
int[] datIData = datI.nodeToQ.keys();
for(int _k = 0; _k < datIData.length; _k++){
int k = datIData[_k];
if (!datJ.nodeToQ.containsKey(k)) { // K connected to I not J
CommunityData datK = communityData.get(k);
double newQ = datI.nodeToQ.get(k) - 2 * datJ.DegFrac * datK.DegFrac;
datJ.addQ(k, newQ);
datK.delLink(i);
datK.addQ(j, newQ);
addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
}
}
datJ.DegFrac += datI.DegFrac;
if (datJ.nodeToQ.isEmpty()) {
communityData.remove(j);
} // isolated community (done)
communityData.remove(i);
return true;
}
}
UPDATE:the currently listed code is fairly quick, and has half the memory usage compared to the "quickest" solution, while only being ~5% slower. the difference is in the use of hashmap + treest vs priority queue, and ensuring only a single object for a given i, j pair exists at any time.
So here's the original paper, a neat lil' six pages, only two of which are about the design & implementation. Here's a cliffnotes:
Q
, of the partition to be the ratio of the number of edges within each community to the number of edges between each community, minus the ratio you'd expect from a completely random partition.i
andj
of a partition, they then definedeltaQ_ij
to be how much the modularity of the partition would change if communitiesi
andj
were merged. So ifdeltaQ_ij > 0
, mergingi
andj
will improve the modularity of the partition.deltaQ_ij
for every pair of communities. Whichever two communitiesi, j
have the largestdeltaQ_ij
, merge those two. Repeat.deltaQ_ij
all turn negative, but in the paper the authors let the algorithm run until there's only one community left.That's pretty much it for understanding the algorithm. The details are in how to compute
deltaQ_ij
quickly and store the information efficiently.Edit: Data structure time!
So first off, I think the implementation you're referencing does things a different way to the paper. I'm not quite sure how, because the code is impenetrable, but it seems to use union-find and hashsets in place of the author's binary trees and multiple heaps. Not a clue why they do it a different way. You might want to email the guy who wrote it and ask.
Anyway, the algorithm in the paper needs several things from the format
deltaQ
is stored in:dQ
quickly.deltaQ_ik
anddeltaQ_ki
for a fixedi
quickly.deltaQ_kj
anddeltaQ_jk
for a fixedj
quickly.The solution the authors come up to for this is as follows:
i
, each non-zerodeltaQ_ik
is stored in a balanced binary tree, indexed byk
(so elements can be found easily), and in a heap (so the maximum for that community can be found easily).deltaQ_ik
from each communityi
's heap is then stored in another heap, so that the overall maximums can be found easily.When community
i
is merged with communityj
, several things happen to the binary trees:i
th community is added to thej
th community's binary tree. If an element with the same indexk
already exists, you sum the old and new values.j
th community's binary tree to reflect the fact that thej
th community has just increased in size.k
, we update anydeltaQ_kj
.i
is thrown away.And similarly, several things must happen to the heaps:
i
is thrown away.j
is rebuilt from scratch using the elements from the community's balanced binary tree.k
's heap, the position of entrydeltaQ_kj
is updated.i
in the overall heap is thrown away (causing bubbling) and the entries for communityj
and each communityk
connected toi
orj
are updated.Strangely, when two communities are merged there's no reference in the paper as to removing
deltaQ_ki
values from thek
th community's heap or tree. I think this might be dealt with by the setting ofa_i = 0
, but I don't understand the algorithm well enough to be sure.Edit: Trying to decipher the implementation you linked. Their primary datastructures are
CmtyIdUF
, a union-find structure that keeps track of which nodes are in which community (something that's neglected in the paper, but seems necessary unless you want to reconstruct community membership from a trace of the merge or something),MxQHeap
, a heap to keep track of whichdeltaQ_ij
is largest overall. Strangely, when they update the value of aTFltIntIntTr
in the heap, they don't ask the heap to re-heapify itself. This is worrying. Does it do it automatically or something?CmtyQH
, a hashmap that maps a community IDi
to a structureTCmtyDat
which holds what looks heap of thedeltaQ_ik
for that community. I think. Strangely though, theUpdateMaxQ
of theTCmtyDat
structure takes linear time, obviating any need for a heap. What's more, theUpdateMaxQ
method only appears to be called when an element of the heap is deleted. It should definitely also be getting called when the value of any element in the heap is updated.