I have a large dataset (several million rows) that I want to use for graph analysis. After data preparation and cleaning, the data is now in a pandas dataframe.
For the graph analysis, I am using the Stanford Network Analysis Project (SNAP). The reason I am using SNAP, even though other frameworks such as networkx or GraphLab are available, is that SNAP can handle very large graphs.
But SNAP uses different data structures from the ones we are used to in pandas: Vectors, Hashtables, and Pairs.
https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html
I am having difficulty converting from the dataframe format to any of these. What I currently do is convert the dataframe to a text file, save it to the hard disk, and read it back in SNAP using snap.LoadEdgeListStr
https://snap.stanford.edu/snappy/doc/reference/LoadEdgeListStr1.html?highlight=loadedgeliststr
Is there a way to convert directly between the two formats, so I don't need to repeat this process every time?
If you wish to convert a pandas dataframe to a SNAP graph in-memory, you could create a new graph and fill it with nodes and edges as follows:
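A minimal sketch of that idea, under a few assumptions: the dataframe holds one edge per row in two columns (named `src` and `dst` here, which are hypothetical names), node ids are integers, and an undirected graph (`snap.TUNGraph`) is what you want. If your ids are strings, as the use of LoadEdgeListStr suggests, you would need to map them to integers first.

```python
def fill_graph(graph, df, src_col="src", dst_col="dst"):
    """Add every (src, dst) row of df to graph as an edge, creating nodes as needed.

    `graph` is any object exposing SNAP-style AddNode / AddEdge methods;
    `src_col` and `dst_col` are hypothetical column names.
    """
    seen = set()  # node ids already added, so each id is added only once
    for src, dst in zip(df[src_col], df[dst_col]):
        # Cast to plain int: pandas yields numpy integers, which the
        # SNAP bindings may not accept.
        src, dst = int(src), int(dst)
        for node in (src, dst):
            if node not in seen:
                graph.AddNode(node)
                seen.add(node)
        graph.AddEdge(src, dst)
    return graph


def dataframe_to_snap(df, src_col="src", dst_col="dst"):
    """Build a new undirected SNAP graph from an edge-list dataframe."""
    import snap  # Snap.py, imported lazily so fill_graph can be used without it

    return fill_graph(snap.TUNGraph.New(), df, src_col, dst_col)
```

With Snap.py installed, `G = dataframe_to_snap(df)` should give a graph you can inspect with `G.GetNodes()` and `G.GetEdges()`. For a directed graph, pass `snap.TNGraph.New()` to `fill_graph` instead. A Python-level loop over several million rows will not be fast, so this is worth timing against your current text-file approach.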
If you still wish to save / load your graphs after creating them for the first time, consider saving them in binary format instead of using text files (using the save() and load() functions). That should be much more efficient.
SNAP also provides Tables.
These offer a convenient API for transforming tables into graphs; however, I don't think I would use them in place of a pandas dataframe.
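Coming back to the binary save / load suggestion, here is a rough sketch of how it might look with Snap.py's stream classes (`snap.TFOut` and `snap.TFIn`). The helper names and the file path are hypothetical, and the loader assumes the graph was saved as an undirected `snap.TUNGraph`; use the matching class if yours is directed.

```python
def save_graph_binary(graph, path):
    """Write a SNAP graph to `path` in SNAP's binary format."""
    import snap  # Snap.py, imported lazily

    fout = snap.TFOut(path)  # binary output stream
    graph.Save(fout)
    fout.Flush()  # make sure everything is written to disk


def load_graph_binary(path):
    """Read back a graph written by save_graph_binary (assumes snap.TUNGraph)."""
    import snap  # Snap.py, imported lazily

    fin = snap.TFIn(path)  # binary input stream
    return snap.TUNGraph.Load(fin)
```

You would build the graph from the dataframe once, call `save_graph_binary(G, "edges.graph")`, and on later runs call `load_graph_binary("edges.graph")` instead of re-parsing a text edge list.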