Converting a pandas dataframe to snap.py


I have a large dataset (several million rows) that I want to use for graph analysis. After data preparation and cleaning, the data now lives in a pandas dataframe.

For the graph analysis I am using the Stanford Network Analysis Project (SNAP). The reason I am using SNAP, even though other frameworks such as networkx or GraphLab are also available, is that SNAP can handle very large graphs.

But SNAP uses different data structures from the ones we are used to with pandas: it uses Vectors, Hash Tables, and Pairs.

https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html
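For example, these containers look roughly like this from Python (a minimal sketch based on the tutorial linked above):

import snap

v = snap.TIntV()          # a SNAP vector of ints
v.Add(1)
v.Add(7)

h = snap.TIntStrH()       # a SNAP hash table mapping int -> str
h[5] = "five"

p = snap.TIntPr(1, 2)     # a SNAP pair of ints
print(v.Len(), h[5], p.GetVal1(), p.GetVal2())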

I find it difficult to convert from the dataframe to any of these. What I currently do is convert the dataframe to a text file first, save it to disk, and then read it back into SNAP using snap.LoadEdgeListStr (rough sketch below):

https://snap.stanford.edu/snappy/doc/reference/LoadEdgeListStr1.html?highlight=loadedgeliststr
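Roughly, my current round trip looks like this (the column names s and t are just placeholders for the source and target columns of my dataframe, and edges.txt is an arbitrary file name):

import pandas as pd
import snap

df = pd.DataFrame({'s': [0, 0, 1], 't': [1, 2, 0]})

# Write the edge list to disk as a tab-separated text file...
df.to_csv("edges.txt", sep="\t", columns=['s', 't'], header=False, index=False)

# ...then read it back with SNAP (column 0 = source, column 1 = destination).
G = snap.LoadEdgeListStr(snap.PNGraph, "edges.txt", 0, 1)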

Is there a way to convert directly between the two formats, so that I don't have to repeat this process every time?

Best Answer

If you wish to convert a pandas dataframe to a SNAP graph in-memory, you could create a new graph and fill it with nodes and edges as follows:

import pandas as pd
import snap

# Create a sample pandas dataframe:
data = {
    's': [0, 0, 1],
    't': [1, 2, 0]
}
df = pd.DataFrame(data)

# Create SNAP directed graph:
G1 = snap.TNGraph.New()
# Add nodes:
nodes = set(df['s'].tolist() + df['t'].tolist())
for node in nodes:
    G1.AddNode(int(node))
# Add edges:
for index, row in df.iterrows():
    G1.AddEdge(int(row['s']), int(row['t']))
# Print result:
G1.Dump()

If you still wish to save / load your graphs after creating them for the first time, consider saving them in binary format instead of text files (using the Save() and Load() methods). That should be much more efficient.
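For example, reusing G1 from the snippet above, a minimal binary save / load round trip would look roughly like this (the file name mygraph.graph is arbitrary):

import snap

# Save the graph in SNAP's binary format.
FOut = snap.TFOut("mygraph.graph")
G1.Save(FOut)
FOut.Flush()

# Load it back later without rebuilding it from the dataframe.
FIn = snap.TFIn("mygraph.graph")
G2 = snap.TNGraph.Load(FIn)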

SNAP also provides Tables:

Tables in SNAP are designed to provide fast performance at scale, and to effortlessly handle datasets containing hundreds of millions of rows. They can be saved and loaded to disk in a binary format using the provided methods.

These provide a convenient API for transforming tables into graphs; however, I don't think I would use them in place of a pandas dataframe.
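If you do want to experiment with them, loading a tab-separated edge file into a TTable and converting it to a graph looks roughly like this (a sketch based on my reading of the Tables tutorial; the column names and file name are placeholders, and I haven't benchmarked this against the approach above):

import snap

context = snap.TTableContext()

# Declare the schema of the edge file: two string columns, source and destination.
schema = snap.Schema()
schema.Add(snap.TStrTAttrPr("src", snap.atStr))
schema.Add(snap.TStrTAttrPr("dst", snap.atStr))

# Load a tab-separated file with no header line into a TTable.
table = snap.TTable.LoadSS(schema, "edges.txt", context, "\t", snap.TBool(False))

# Convert the table into a directed graph (snap.aaFirst is the attribute aggregation policy).
graph = snap.ToGraph(snap.PNGraph, table, "src", "dst", snap.aaFirst)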