I know NetworkX has functions for computing the sizes of a graph's connected components, and that you can add attributes to a node. In Axelrod's model for dissemination of culture, an interesting measurement is the size of the largest connected component whose nodes share several attributes. Is there a way of doing that in NetworkX? For example, let's say we have a population represented as a network, where each node has a hair color and a skin color attribute. How can I get the size of the largest component in which every node has the same hair and skin color? Thank you


For general data analysis, it's best to use pandas. Use a graph library like networkx or graph-tool to determine the connected components, and then load that info into a DataFrame that you can analyze. In this case, the pandas groupby and nunique (number of unique elements) features will be useful.

Here's a self-contained example using graph-tool and the baseball network that the code downloads. You could also compute the connected components via networkx.

import numpy as np
import pandas as pd
import graph_tool.all as gt

# Download an example graph
g = gt.collection.ns["baseball", 'user-provider']

# Extract the player names
names = g.vertex_properties['name'].get_2d_array([0])[0]

# Extract connected component ID for each node
cc, cc_sizes = gt.label_components(g)

# Load into a DataFrame
players = pd.DataFrame({
    'id': np.arange(g.num_vertices()),
    'name': names,
    'cc': cc.a
})

# Create some random attributes
players['hair'] = np.random.choice(['purple', 'pink'], size=len(players))
players['skin'] = np.random.choice(['green', 'blue'], size=len(players))

# For the sake of this example, manipulate the data so
# that some groups are homogenous with respect to some attributes.
players.loc[players['cc'] == 2, 'hair'] = 'purple'
players.loc[players['cc'] == 2, 'skin'] = 'blue'

players.loc[players['cc'] == 4, 'hair'] = 'pink'
players.loc[players['cc'] == 4, 'skin'] = 'green'

# Now determine how many unique hair and skin colors we have in each group.
group_stats = players.groupby('cc').agg({
    'hair': 'nunique',
    'skin': ['nunique', 'size']
})

# Simplify the column names
group_stats.columns = ['hair_colors', 'skin_colors', 'player_count']

# Select homogenous groups, i.e. groups for which only 1 unique
# hair color is present and 1 unique skin color is present
homogenous = group_stats.query('hair_colors == 1 and skin_colors == 1')

# Sort from large groups to small groups
homogenous = homogenous.sort_values('player_count', ascending=False)
print(homogenous)

That prints the following:

    hair_colors  skin_colors  player_count
4             1            1             4
2             1            1             3
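
Since the question asks about NetworkX specifically, here is a minimal sketch of the same idea with networkx in place of graph-tool. The toy graph and the attribute values are made up for illustration, and the named-aggregation form of `agg` is just one way to get the three columns; treat it as a starting point rather than a definitive implementation.

```python
import networkx as nx
import pandas as pd

# Toy graph (hypothetical data) with two components: {0, 1, 2} and {3, 4}
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (3, 4)])

# Hypothetical node attributes
hair = {0: 'purple', 1: 'purple', 2: 'purple', 3: 'pink', 4: 'purple'}
skin = {0: 'blue', 1: 'blue', 2: 'blue', 3: 'green', 4: 'green'}
nx.set_node_attributes(G, hair, 'hair')
nx.set_node_attributes(G, skin, 'skin')

# One row per node, labelled with the index of its connected component
rows = [
    {'id': node, 'cc': cc_id,
     'hair': G.nodes[node]['hair'], 'skin': G.nodes[node]['skin']}
    for cc_id, component in enumerate(nx.connected_components(G))
    for node in component
]
players = pd.DataFrame(rows)

# Count unique hair colors, unique skin colors and nodes per component
group_stats = players.groupby('cc').agg(
    hair_colors=('hair', 'nunique'),
    skin_colors=('skin', 'nunique'),
    player_count=('id', 'size'),
)

# Keep components where every node has the same hair and skin color
homogenous = group_stats.query('hair_colors == 1 and skin_colors == 1')

# Size of the largest homogenous component (0 if there is none)
largest = int(homogenous['player_count'].max()) if len(homogenous) else 0
print(largest)
```

Note that this, like the graph-tool example, only counts whole connected components that happen to be uniform. If you instead want clusters of like nodes inside a larger component, one option would be to remove the edges whose endpoints differ in hair or skin color before computing the components.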