I've been reading a lot about transformers and self-attention, and I've seen that BERT and GPT-2 are newer variants that use only an encoder transformer (BERT) or only a decoder transformer (GPT-2). I've been trying to build a decoder-only model myself for next-sequence prediction, but I'm confused by one thing. I'm using PyTorch and have looked at the Seq2Seq tutorial and then at the Transformer Decoder block, which is made up of Transformer Decoder layers. My confusion comes from the memory argument that these also need to be passed. The documentation says memory is the output of the last layer of the encoder, which makes sense for a Seq2Seq model, but I want to make a decoder-only model. So my question is: what do you pass a decoder-only model like GPT-2 for memory if you do not have an encoder?
What memory does Transformer Decoder Only use?
2.9k Views Asked by bellerb At
After further investigation, I believe I can now answer this myself. A decoder-only transformer doesn't actually use any memory, because it has no encoder-decoder attention (cross-attention) like an encoder-decoder transformer does. A decoder-only transformer looks a lot like an encoder-only transformer, except that it uses a masked self-attention layer instead of a plain self-attention layer. To get this behaviour, you can pass a square subsequent mask (the upper triangle is masked out) so that the model cannot look ahead, which gives you a decoder-only model like the one found in GPT-2/GPT-3.
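As a rough illustration of the idea above, here is a minimal sketch that stacks PyTorch's `nn.TransformerEncoderLayer` (which has self-attention but no cross-attention, so it takes no `memory`) and passes it a square subsequent mask. The names `TinyDecoderOnly` and `causal_mask`, the learned position embedding, and all the sizes are my own assumptions for the example; this is not GPT-2's actual architecture, just one way to get causal, decoder-only behaviour out of the built-in layers.

```python
import torch
import torch.nn as nn


def causal_mask(size: int) -> torch.Tensor:
    """Square subsequent mask: -inf above the diagonal, 0 elsewhere,
    so position i can only attend to positions <= i."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


class TinyDecoderOnly(nn.Module):
    """Hypothetical decoder-only language model built from encoder layers."""

    def __init__(self, vocab_size=100, d_model=32, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # assumed: learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,
            dropout=0.1,
            batch_first=True,  # inputs are (batch, seq, feature)
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        # No memory argument anywhere: the mask alone makes the stack causal.
        mask = causal_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)  # (batch, seq, vocab) next-token logits


if __name__ == "__main__":
    model = TinyDecoderOnly().eval()
    tokens = torch.randint(0, 100, (2, 10))
    with torch.no_grad():
        logits = model(tokens)
    print(logits.shape)
```

A quick way to convince yourself the mask is working is to change the last token of the input and check that the logits at all earlier positions stay the same. For training, you would shift the targets by one position and apply cross-entropy to these logits.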