I would like to test my own formulation of the attention mechanism for a transformer. To that end, I'm looking for an existing pre-trained transformer that is easy to read through and doesn't require too large a dataset. I just want to take that model and replace the code of the attention mechanism with my own.
I have already tried using https://github.com/hkproj/pytorch-transformer, but for some reason I get an error after the first epoch with my code. Each epoch takes about 30 minutes, so I'm struggling to debug and see where I've gone wrong.
I've also tried looking at the Hugging Face GitHub repository, but there are so many options that I'm lost. I'm very new to deep learning, and this is my first time trying to code a model.
Thanks

If you only want to test a change to the attention mechanism, you might want something simpler than a full encoder-decoder transformer.
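Whichever model you pick, swapping in your own attention usually means writing an `nn.Module` with the same input/output shapes as the original attention block. Here's a minimal sketch of a standard multi-head scaled dot-product attention module (names like `MyAttention` are my own, and I'm assuming the host model passes a single input tensor plus an optional mask); you would replace the marked line with your formulation:

```python
import math
import torch
import torch.nn as nn

class MyAttention(nn.Module):
    """Standard multi-head scaled dot-product attention.
    Swap the softmax line (or the whole score computation) for your variant."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused q, k, v projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1)  # <-- your formulation goes here
        out = weights @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)
```

As long as the replacement keeps the `(batch, seq_len, d_model)` in/out contract, the rest of the model doesn't need to change.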
There are three main kinds of transformer models: encoder-decoder, encoder-only, and decoder-only.
The original transformer (implemented at the link you shared) is a full encoder-decoder. In the original Vaswani et al. paper it was trained on a translation task, which makes sense: the encoder encodes the information in the input sentence, then the decoder generates a sentence in the target language.
An example of an encoder-only model is BERT, typically trained on a masked language modeling (MLM) task.
What we generally refer to as "decoder-only" models are GPT-style models. In my opinion these are the simplest to train, since they're typically trained to predict the next token on arbitrary text. Some simple implementations of decoder-only models include:

- Andrej Karpathy's minGPT
- nanoGPT, Karpathy's leaner rewrite of minGPT
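To make the next-token objective concrete, here's a toy illustration (the token values are made up): the target sequence is just the input shifted left by one position, so the model learns to predict each token from the ones before it.

```python
# Toy example of next-token prediction data (token IDs are arbitrary).
tokens = [10, 4, 7, 7, 2]   # e.g. an encoded 5-character string
inputs = tokens[:-1]        # what the model sees
targets = tokens[1:]        # what it must predict at each position
# At position i, the model predicts targets[i] given inputs[: i + 1].
```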
A common pattern is to use a trivial tokenizer that splits text into characters, use a simple single-document training corpus like the complete works of Shakespeare, and work with a small variant of the model with fewer layers/dimensions. I find I can typically train a minimal GPT-style model this way on a consumer-grade laptop (a MacBook) and get it producing meaningful words within an hour or two.
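The character-level tokenizer mentioned above is only a few lines; here's a sketch using a short string in place of the Shakespeare text:

```python
# Character-level tokenizer: the vocabulary is just the set of characters
# appearing in the training text (a short stand-in string here).
text = "to be or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```

With a vocabulary this small (a few dozen symbols for real text), the embedding and output layers stay tiny, which is a big part of why these toy models train so quickly.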