I'm working on code that generates a k-NN graph map. It runs and produces the HTML and JS files for visualizing the map, but I'm having difficulty extracting detailed information about the structure of the generated graph.
While I can see the graph and its connections visually, I'm looking for a way to save this information for further analysis, such as which nodes are connected to each other and the total number of edges and nodes in the graph.
My map is constructed using Faerun and relies on four variables: the coordinates of the points in the plane (x, y), and 's' and 't', which store the indices of the 'from' and 'to' nodes in the MST, respectively. I assumed that 's' and 't' (the topology) were the information I was looking for, since they describe how the points are connected to each other. However, I don't understand what the 's' and 't' values represent.
Initially, I assumed that 's' and 't' were indices of points in the graph, meaning that if a point A has coordinates (Xa, Ya) and 's' = 2 and 't' = 15, then A would be connected to the points at index positions 2 and 15 in my data. However, when I examine the coordinates of these points, I notice that they are too far apart to be considered valid connections for A.
For example (x, y, s, t values from the code, in an Excel sheet):
In this example, the point with index 22 (x=0.136130452, y=-0.082266837) is supposed to be connected to the point with index 13 (x=-0.317409933, y=-0.280039787), as indicated by the value 's = 13'. Judging by their coordinates, that does not look right.
How can I correctly interpret what the 's' and 't' values represent in my graph, and/or how can I obtain accurate information about the real connections and nodes in the map?
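To put a number on "too far apart", the distance between these two points can be computed directly from the coordinates above. This is only a quick check in plain Python (3.8+ for `math.dist`); the variable names are mine and are not part of tmap or Faerun.

```python
import math

# Coordinates copied from the example above (layout units, as returned in x and y)
point_22 = (0.136130452, -0.082266837)
point_13 = (-0.317409933, -0.280039787)

# Euclidean distance between the two points in the 2D layout
distance = math.dist(point_22, point_13)
print(f"distance between index 22 and index 13: {distance:.3f}")
```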
Code that generates the graph:
import pandas as pd
import tmap
from faerun import Faerun
from mhfp.encoder import MHFPEncoder
from rdkit.Chem import AllChem
df = pd.read_csv('HMDB-smiles-short.csv')
print(df.shape)
# The number of permutations used by the MinHashing algorithm
perm = 512
# Initializing the MHFP encoder with 512 permutations
enc = MHFPEncoder(perm)
# Create MHFP fingerprints from SMILES
# The fingerprint vectors have to be of the tmap.VectorUint data type
fingerprints = [tmap.VectorUint(enc.encode(s)) for s in df["smiles"]]
# Initialize the LSH Forest
lf = tmap.LSHForest(perm)
# Add the Fingerprints to the LSH Forest and index
lf.batch_add(fingerprints)
lf.index()
# Get the coordinates (x, y) and the tree topology (s, t)
x, y, s, t, _ = tmap.layout_from_lsh_forest(lf)
# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
    "ESOL_Basic",
    {
        "x": x,
        "y": y,
        "c": list(df.logSolubility.values),
        "labels": df["smiles"],
    },
    point_scale=5,
    colormap=["rainbow"],
    has_legend=True,
    legend_title=["ESOL (mol/L)"],
    categorical=[False],
    shader="smoothCircle",
)
faerun.add_tree("ESOL_Basic_tree", {"from": s, "to": t}, point_helper="ESOL_Basic")
# Choose the "smiles" template to display structure on hover
faerun.plot('ESOL_Basic', template="smiles", notebook_height=750)
X,Y,S,T values used for making the graph
I have used different datasets, and I have checked that the index in the original file corresponds to the index of the points (a molecule with index = 0 in the original .csv corresponds to the molecule with index = 0 in the generated map).
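One reading I have not been able to confirm is that 's' and 't' are parallel edge arrays rather than per-point columns: edge i would join node s[i] and node t[i], and both arrays would have one entry per edge instead of one entry per point. Under that assumption, a sketch like the following would give the edge list, the neighbours of each node, and the counts I am after. The helper name `summarize_tree` and the toy arrays are hypothetical; in my script 's' and 't' would come from `tmap.layout_from_lsh_forest`.

```python
from collections import defaultdict


def summarize_tree(s, t, n_points):
    """Build an edge list and a neighbour map from parallel edge arrays.

    Assumption (not confirmed): edge i joins node s[i] and node t[i].
    """
    if len(s) != len(t):
        raise ValueError("s and t must have the same length")

    # One (from, to) pair per edge
    edges = [(int(a), int(b)) for a, b in zip(s, t)]

    # Treat the tree as undirected: record each edge in both directions
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    return {
        "edges": edges,
        "neighbours": {node: sorted(neighbours[node]) for node in range(n_points)},
        "n_nodes": n_points,
        "n_edges": len(edges),
    }


# Toy data standing in for the real s and t (hypothetical values)
s = [0, 1, 1, 3]
t = [1, 2, 3, 4]
summary = summarize_tree(s, t, n_points=5)
print(summary["n_nodes"], summary["n_edges"])
print(summary["neighbours"][1])
```

With the real data I would expect to call it as `summarize_tree(s, t, n_points=len(x))` and save the result with something like `pd.DataFrame(summary["edges"], columns=["from", "to"]).to_csv(...)`. If this reading is right, placing 's' and 't' next to 'x' and 'y' in one sheet pairs edge i with point i, which might explain the mismatch I see.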