I'm writing my own implementation of the Graphormer architecture. Since this architecture needs an edge-based bias added to the output of the key-query multiplication in the self-attention mechanism, I am adding that bias by hand and doing the matrix multiplication of the data with the attention weights outside the attention mechanism:
import torch as th
from torch import nn
# Variable initialization
B, T, C, H = 2, 3, 4, 2
self_attn = nn.MultiheadAttention(C, H, batch_first = True)
# Tensors
x = th.randn(B, T, C)
attn_bias = th.ones((B, T, T))
# Self-attention mechanism
_, attn_wei = self_attn(query=x, key=x, value=x)
# Adding attention bias
if attn_bias is not None:
    attn_wei = attn_wei + attn_bias
x = attn_wei @ x # TODO use value(x) instead of x
print(x)
This works, but to use the full potential of self-attention, the last matrix multiplication should be something like x = attn_wei @ value(x). However, I am not able to get the value projection from the self_attn object, even though it should have something like that inside of it.
How could I do this?
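For reference, here is a minimal sketch of what I have in mind. It assumes the default configuration, where kdim and vdim equal embed_dim and bias=True, so that nn.MultiheadAttention stores the query, key and value projections packed together in in_proj_weight and in_proj_bias (query rows first, then key, then value). The names w_v, b_v, v and out are only illustrative, and I am not sure slicing the packed weights like this is the intended way to do it:

```python
import torch as th
from torch import nn
import torch.nn.functional as F

B, T, C, H = 2, 3, 4, 2
self_attn = nn.MultiheadAttention(C, H, batch_first=True)
x = th.randn(B, T, C)
attn_bias = th.ones((B, T, T))

# Assumption: packed projection of shape (3*C, C);
# rows [0:C] are query, [C:2C] are key, [2C:3C] are value.
w_v = self_attn.in_proj_weight[2 * C:]
b_v = self_attn.in_proj_bias[2 * C:]
v = F.linear(x, w_v, b_v)  # value(x), shape (B, T, C)

# Attention weights are averaged over heads by default, shape (B, T, T)
_, attn_wei = self_attn(query=x, key=x, value=x)
if attn_bias is not None:
    attn_wei = attn_wei + attn_bias

out = attn_wei @ v  # shape (B, T, C)
print(out.shape)
```

Two caveats with this sketch: the weights returned by the module are averaged over the heads unless average_attn_weights=False is passed, and the module's output projection (self_attn.out_proj) is not applied here, so out is not the same as the module's own attention output. Also, as far as I understand the documentation, a float attn_mask passed to the forward call is added to the attention scores before the softmax, which might be closer to what Graphormer does than adding the bias afterwards.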