I am planning to create a classification model. Instead of using traditional models, I decided to try a different technique: create word embeddings, cluster them using k-means, then use the mean of each cluster for comparison with the input(s). I chose fastText because it supports subwords. I also have a large amount of unsupervised text data. Should I train the fastText model on my own data, or can I go with a pre-trained model? If I should train it, what are the benefits? Can someone explain, please?
Pre-training or using the existing model of FastText?
25 views · Asked by Ram Deepak Prabhakar
There is 1 answer below.
You should try them both and see which scores better on whatever repeatable quality evaluation you'll be using to make your other tuning choices.
There's a fair chance, but no guarantee, that with enough of your own domain text data, your own trained model will better capture the words/subwords of your domain.
But there's no firm rule of thumb for how much is "enough", either for most projects or, more importantly, for your specific project, nor for how different the text and word meanings in your domain may be from the more generic word meanings in others' pretrained models. So you have to test them against each other, which shouldn't be hard: run once with your best-trained model (or several variants of it), then with one or more external pretrained models from others, compare the results, and choose the best.
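That side-by-side test can be sketched as a small harness, assuming each candidate model is wrapped as a text-to-vector function (the function names and the toy embedding below are illustrative, not from any library):

```python
import numpy as np

def centroid_accuracy(embed, train, test):
    """Score a text->vector function by nearest-centroid accuracy.

    embed: callable mapping a text to a 1-D numpy vector (wrap your own
           FastText model or a pretrained one the same way)
    train: list of (text, label) pairs used to build one centroid per class
    test:  held-out (text, label) pairs to score against
    """
    by_label = {}
    for text, label in train:
        by_label.setdefault(label, []).append(embed(text))
    labels = sorted(by_label)
    centroids = np.stack([np.mean(by_label[l], axis=0) for l in labels])
    # L2-normalise so "nearest centroid" means highest cosine similarity
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    correct = 0
    for text, label in test:
        v = embed(text)
        v = v / np.linalg.norm(v)
        correct += labels[int(np.argmax(centroids @ v))] == label
    return correct / len(test)

# Toy stand-in embedding so the harness runs end to end; in practice you
# would call this once per candidate model and keep whichever scores higher.
def toy_embed(text):
    return np.array([text.count("a") + 1.0, text.count("b") + 1.0])

train = [("aa", "A"), ("aaa", "A"), ("bb", "B"), ("bbb", "B")]
test = [("aaaa", "A"), ("bbbb", "B")]
score = centroid_accuracy(toy_embed, train, test)
```

The same `test` split and the same scoring function must be reused for every candidate model, otherwise the comparison isn't repeatable.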
Note that your "new technique" sounds like a fairly common naive-but-intuitively-attractive classification approach: compute one "average" vector to represent each known class, compute a vector for each candidate text, and predict the class whose vector is nearest (or report the relative distances as ranked possibilities).
It is likely to perform poorly compared to traditional approaches, even very quick and simple ones, because collapsing each class to a single summary point discards much of the available training data. Real categories often have diverse, irregular shapes in the training data, and the usual techniques that can learn that "lumpiness", rather than reducing each class to a single centroid, will do better.
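As one example of the kind of quick and simple traditional baseline worth comparing against, a TF-IDF bag-of-words pipeline with logistic regression takes only a few lines in scikit-learn (the tiny dataset here is purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Purely illustrative toy data; substitute your own labeled texts
texts = ["cheap pills buy now", "meeting at noon today",
         "win money fast now", "lunch tomorrow at noon"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF features fed into a regularised linear classifier
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
train_acc = baseline.score(texts, labels)
```

If your centroid-based scheme can't beat a baseline like this on a held-out split, the extra machinery isn't paying for itself.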
(If you are in fact computing distances to multiple unlabeled clusters, more numerous than your final labels, and then using those distances as input to a typical learned classifier, it may perform better than the one-center-per-class approach described above, because it retains more of the learnable "shapes" and decision boundaries in the original data. But again, traditional classifiers with adequate feature choices/enrichment are likely to subsume and exceed any value from that style of model.)
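That richer variant can be sketched as follows, with random 2-D points standing in for document embeddings (all names and data are illustrative, and the class shapes are deliberately chosen so no single centroid summarises a class well):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for document embeddings: each class is a mixture of two
# separated blobs, so a single per-class centroid would sit in between.
blobs = {"pos": [(-4, 0), (4, 0)], "neg": [(0, -4), (0, 4)]}
X, y = [], []
for label, centers in blobs.items():
    for c in centers:
        X.append(rng.normal(c, 0.5, size=(50, 2)))
        y += [label] * 50
X, y = np.vstack(X), np.array(y)

# Step 1: unlabeled k-means with more clusters than final labels
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
features = km.transform(X)  # distance from each point to each cluster centre

# Step 2: feed the cluster distances into a conventional classifier
clf = LogisticRegression().fit(features, y)
acc = clf.score(features, y)
```

The classifier learns which combinations of cluster distances correspond to each label, so multi-blob classes remain separable, whereas the one-centroid-per-class scheme would misplace points between a class's blobs.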