How to learn a language model?

I'm trying to train an LSTM language model on the Penn Treebank (PTB) corpus. My first thought was to simply train on every bigram in the corpus so that the model could predict the next word given the previous one, but then it would never be able to predict the next word from multiple preceding words. So what exactly does it mean to train a language model?

In my current implementation, the batch size is 20 and the vocabulary size is 10,000, so at each step I have 20 output vectors of 10,000 entries (the predicted distribution over the vocabulary), and the loss is computed by comparing them to 20 ground-truth vectors of 10,000 entries, where the index of the actual next word is 1 and every other entry is 0. Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the usual range of, say, around 100.
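To make this concrete, here is a minimal sketch of what I mean for a single step (simplified and not my actual code; it assumes TensorFlow 1.x and the names are only illustrative):

```python
import tensorflow as tf  # TensorFlow 1.x assumed

batch_size, vocab_size = 20, 10000

# One step of prediction: 20 rows of logits over the 10,000-word vocabulary.
logits = tf.placeholder(tf.float32, [batch_size, vocab_size])

# Ground truth as word indices; tf.one_hot turns them into the 20 one-hot
# rows of 10,000 entries described above.
targets = tf.placeholder(tf.int32, [batch_size])
onehot_targets = tf.one_hot(targets, vocab_size)

# Average cross-entropy between the predicted distribution and the one-hot truth.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=onehot_targets, logits=logits))

# Perplexity per word is the exponential of this average cross-entropy.
perplexity = tf.exp(loss)
```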
I don't think you need to train on every bigram in the corpus. Just use a sequence-to-sequence model, and when you predict the next word given the previous words, simply choose the one with the highest probability.

Yes, this is done per step of decoding.
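A minimal sketch of that greedy choice (purely illustrative names and a toy vocabulary; it assumes the model already gives you a probability distribution over the vocabulary for the current step):

```python
import numpy as np

def predict_next_word(probs, id_to_word):
    """Greedy decoding: pick the vocabulary entry with the highest probability.

    probs: array of shape [vocab_size], the model's distribution for one step.
    id_to_word: list (or dict) mapping word ids back to word strings.
    """
    next_id = int(np.argmax(probs))
    return id_to_word[next_id]

# Toy example with a 4-word vocabulary (made-up probabilities).
vocab = ["the", "cat", "sat", "mat"]
print(predict_next_word(np.array([0.1, 0.6, 0.2, 0.1]), vocab))  # -> "cat"
```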
You can first read some open-source code as a reference, for instance word-rnn-tensorflow and char-rnn-tensorflow. The number you are looking at is most likely the average negative log-probability per word: at initialization it is about -log(1/10000), which is around 9 per word (meaning the model is not trained at all and picks words completely at random), and it decreases as the model is tuned, so 2 is reasonable. I think the 100 in your statement probably refers to the value per sentence rather than per word.
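To make the numbers concrete, this is just the arithmetic behind the statement above (natural log, 10,000-word vocabulary):

```python
import numpy as np

vocab_size = 10000

# An untrained model assigns every word probability 1/10000, so the average
# negative log-probability per word is -log(1/10000) ~= 9.21 nats.
loss_random = -np.log(1.0 / vocab_size)
print(loss_random)          # ~9.21

# Perplexity is the exponential of that per-word value; for the random model
# it equals the vocabulary size.
print(np.exp(loss_random))  # ~10000.0

# A trained per-word value of 2 corresponds to a perplexity of about 7.4.
print(np.exp(2.0))          # ~7.39
```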
For example, if tf.contrib.seq2seq.sequence_loss is used to compute this value, the result will be less than 10 if you leave both average_across_timesteps and average_across_batch at their default of True. But if you set average_across_timesteps to False and the average sequence length is about 10, the per-sequence total will be about 100.
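For reference, here is a small sketch of how those two flags change the number you see (TensorFlow 1.x; the shapes and the sequence length of 10 are illustrative assumptions):

```python
import tensorflow as tf  # TensorFlow 1.x

batch_size, num_steps, vocab_size = 20, 10, 10000

logits = tf.placeholder(tf.float32, [batch_size, num_steps, vocab_size])
targets = tf.placeholder(tf.int32, [batch_size, num_steps])
weights = tf.ones([batch_size, num_steps])  # no padding: every target counts

# Defaults (both True): a single scalar, the average cross-entropy per target
# word, roughly 9.2 at random initialization and lower as training proceeds.
per_word_loss = tf.contrib.seq2seq.sequence_loss(
    logits, targets, weights,
    average_across_timesteps=True,
    average_across_batch=True)
perplexity = tf.exp(per_word_loss)

# average_across_timesteps=False: the loss is no longer divided by the number
# of time steps, so summing it over the ~10 steps gives a per-sequence total
# roughly 10x larger, which is where a value near 100 can come from.
per_step_loss = tf.contrib.seq2seq.sequence_loss(
    logits, targets, weights,
    average_across_timesteps=False,
    average_across_batch=True)
per_sequence_loss = tf.reduce_sum(per_step_loss)
```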