How does doc2vec perform when trained on datasets of different sizes? I can't find any mention of dataset size for the original corpus, so I am wondering what minimum size is required to get good performance out of doc2vec.
What is the minimum dataset size needed for good performance with doc2vec?
Asked by pete the dude
A bunch of things have been called 'doc2vec', but the name most often refers to the 'Paragraph Vector' technique from Le and Mikolov.
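The original 'Paragraph Vector' paper describes evaluating it on three datasets:
- Stanford Sentiment Treebank: 11,855 movie-review sentences, further broken into labeled sub-phrases of a few words each
- IMDB dataset: 100,000 movie reviews, often a few hundred words each
- Search-result snippets: paragraphs taken from the top 10 results for each of the 1,000,000 most popular queries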
The first two are publicly available, so you can also review their total sizes in words, typical document sizes, and vocabularies. (Note, though, that no one has been able to fully reproduce that paper's sentiment-classification results on either of those first two datasets, which implies some missing information or an error in the reporting. It's possible to get close on the IMDB dataset.)
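A follow-up paper applied the algorithm to discovering topical relationships in two datasets:
- Wikipedia: 4,490,000 article body texts
- arXiv: 886,000 academic papers, with text extracted from PDFs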
So the corpora used in those two early papers ranged from tens of thousands to millions of documents, with document sizes from phrases of a few words to articles thousands of words long. (But those works did not necessarily mix wildly differently sized documents.)
In general, word2vec/paragraph-vector techniques benefit from a lot of data and a variety of word contexts. I wouldn't expect good results without at least tens of thousands of documents. Documents longer than a few words each work much better. Results may be harder to interpret if documents of wildly different sizes or kinds are mixed in the same training run, such as tweets and books.
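As a rough way to check a corpus against these rules of thumb before training, something like the following sketch might help. It uses only the Python standard library and assumes the documents are already tokenized into lists of words. The function name `corpus_stats` and the thresholds `min_docs` and `min_tokens` are illustrative assumptions, not values taken from either paper.

```python
from collections import Counter
from statistics import median


def corpus_stats(docs, min_docs=10_000, min_tokens=5):
    """Summarize a tokenized corpus (a list of token lists).

    min_docs and min_tokens are illustrative thresholds (assumptions),
    loosely mirroring "tens of thousands of documents" and
    "longer than a few words each".
    """
    lengths = [len(d) for d in docs]
    vocab = Counter(tok for d in docs for tok in d)
    return {
        "n_docs": len(docs),
        "n_tokens": sum(lengths),
        "vocab_size": len(vocab),
        "median_len": median(lengths) if lengths else 0,
        "max_len": max(lengths, default=0),
        # Documents that are probably too short to train well on.
        "short_docs": sum(1 for n in lengths if n < min_tokens),
        "enough_docs": len(docs) >= min_docs,
    }


# Tiny toy corpus, far too small for real training; for illustration only.
docs = [
    "the movie was great".split(),
    "terrible plot and worse acting overall".split(),
    "loved it".split(),
]
stats = corpus_stats(docs)
print(stats)
```

A large gap between `median_len` and `max_len` is one hint that very different kinds of documents are being mixed.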
But you really have to evaluate it with your own corpus and goals, because what works with some data, for some purposes, may not generalize to very different projects.
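One way to run that kind of evaluation is to train on progressively larger subsets of your own corpus and watch how a task-specific score changes. The sketch below is a model-agnostic harness under stated assumptions: `learning_curve`, `train_fn`, and `score_fn` are hypothetical names, and the toy "model" is just a vocabulary set standing in for a real doc2vec training call, so that the example runs on its own.

```python
import random


def learning_curve(docs, sizes, train_fn, score_fn, seed=0):
    """Train on nested random subsets of docs and score each model.

    train_fn(subset) -> model and score_fn(model) -> float are supplied
    by the caller (hypothetical hooks, e.g. wrappers around a doc2vec
    implementation and a metric that matters for your project).
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    results = []
    for n in sizes:
        if n > len(shuffled):
            break  # not enough data for this subset size
        results.append((n, score_fn(train_fn(shuffled[:n]))))
    return results


# Toy stand-ins: the "model" is the set of tokens seen in training,
# scored by how many held-out tokens it covers.
held_out = "a quick brown fox".split()
toy_docs = [["a"], ["quick"], ["brown"], ["fox"]]


def train(subset):
    return {tok for d in subset for tok in d}


def score(vocab):
    return sum(t in vocab for t in held_out) / len(held_out)


curve = learning_curve(toy_docs, [1, 2, 4, 8], train, score)
print(curve)
```

If the score is still climbing at the largest subset you can afford, more data is likely to help; if it flattens early, the corpus size is probably not the bottleneck for that task.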