How to predict links between present and a new node with PyG/GraphSage autoencoder?

906 Views Asked by At

I am a newbee in the field of GNN and want to use PyTorch Geometric (PyG) to train a Graph Neural Network (GNN) to predict links (edges) between nodes in a graph using an autoencoder (with a modified version of the PyG link prediction example with two SAGEConv layers (I used this tutorial). I would like to add a new node to the graph and predict which existing nodes have the highest probability of having an edge with the new node. Specifically, I have the following questions:

  1. How can I add a new node (feature tensor) to an existing graph that has no edges?
  2. How can I predict the nodes with the highest probability of having an edge with the new node?

The reason I chose the SAGEConv layers is that (if I understood it correctly) I can predict links to nodes that have not been present during training because of the inductive capabilities of GraphSage.

So far I defined and trained following model and training function:

class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, dropout=0.1):
        super().__init__()
        self.dropout = dropout
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)
        

    def encode(self, x, edge_index):
        #x = self.conv1(x, edge_index).relu()
        #return self.conv2(x, edge_index)
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout)

        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout)

        return x
    
    def decode(self, z, edge_label_index):
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(
            dim=-1
        )  # product of a pair of nodes on each edge

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()

def train_link_predictor(
    model, train_data, val_data, optimizer, criterion, n_epochs=100
):

    for epoch in range(1, n_epochs + 1):

        model.train()
        optimizer.zero_grad()
        
        # create node embeddings (aggregating neighbor nodes with GraphSage)
        z = model.forward(train_data.x, train_data.edge_index)
        
        # sampling training negatives for every training epoch
        neg_edge_index = negative_sampling(
            edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
            num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

        edge_label_index = torch.cat(
            [train_data.edge_label_index, neg_edge_index],
            dim=-1,
        )
        
        # edge labels contain 1 for positive edges and 0 for negative edges
        edge_label = torch.cat([
            train_data.edge_label,
            train_data.edge_label.new_zeros(neg_edge_index.size(1))
        ], dim=0)
        
        # the decoder makes a prediction based on the node embeddings by calculating pairwise dot-product 
        out = model.decode(z, edge_label_index).view(-1)
        
        # the loss is calculated by minimizing the difference between predictions and labeled values for pos/neg edges
        loss = criterion(out, edge_label)        
        loss.backward()
        optimizer.step()

        val_auc = eval_link_predictor(model, val_data)
        writer.add_scalar("Loss/train", loss, epoch)
        writer.add_scalar("AUC/train", val_auc, epoch)
        if epoch % 10 == 0:
            print(f"Epoch: {epoch:03d}, Train Loss: {loss:.3f}, Val AUC: {val_auc:.3f}")
            

    return model
    

I thought about adding a new node feature (tensor) to list of the nodes (data.x) and add an entry with no edge to the adjacency matrix (data.edge_index) but have no idea if it is the best and useful solution.

1

There are 1 best solutions below

0
On

This is mostly like sharing ideas rather than an actual answer since I'm also curious about doing inductive prediction based on a trained GNN model on a large graph. First I recommend reading this discussion on PyG github as I found it relevant to this question. Especially subgraph is a really useful tool from PyG to separate nodes for tests. You may need to add the connection between the test node and all other nodes in the graph in edge_label_index to see if there is any connection between the new node and the other nodes.

Moreover, my understanding from the GraphSAGE paper is that one still needs some connection between the test node and other nodes to do message passing during inference. You don't need these edges during training (they should not be there) but during the test, you should add them to the edge_index of the test subgraph.

I hope we see some other answers to this question as it is also kind of my question.