Find list of Out Of Vocabulary (OOV) words from my domain specific pdf while using FastText model

135 Views Asked by Srijita Saha Roy At 26 July 2021 at 07:25

How can I find the list of Out Of Vocabulary (OOV) words from my domain-specific PDF while using a FastText model? I need to fine-tune FastText with my domain-specific words.


There is 1 best solution below

gojomo On 27 July 2021 at 17:02

A FastText model will already be able to generate vectors for OOV words.

So there's not necessarily any need either to list the specific OOV words in your PDF or to 'fine tune' a FastText model.

You just ask it for vectors, and it gives them back. The vectors for full in-vocabulary words, which were trained from relevant training material, will likely be best. Vectors synthesized for OOV words from word-fragments (character n-grams) shared with the training material will just be rough guesses: better than nothing, but not great.
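To make the word-fragment idea concrete, here is a minimal sketch of how a word breaks into character n-grams. It assumes FastText's usual conventions: '<' and '>' as word-boundary markers, and n-gram lengths 3 to 6. The function names are hypothetical and not part of any library. A real model also hashes the n-grams into buckets and combines their trained vectors, which this sketch does not do.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    min_n/max_n are assumed defaults; a given model may use other values.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(oov_word, known_word, min_n=3, max_n=6):
    """Return the sorted n-grams that two words have in common."""
    a = set(char_ngrams(oov_word, min_n, max_n))
    b = set(char_ngrams(known_word, min_n, max_n))
    return sorted(a & b)


# The 3-grams of "where" are: <wh, whe, her, ere, re>
print(char_ngrams("where", 3, 3))
```

The more fragments an OOV word shares with words seen in training, the more trained signal its synthesized vector can draw on. A word built from fragments the model never saw gets little that is meaningful.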

(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)

If you think you need to do more, you should expand your question with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.

I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.

Anyone seeking such fine-tuning should have a clear idea of:

what their incremental data might be able to add to an existing model
what process/code they will be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives
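As a starting point for the first item, one way to gauge what new data might add is to count which of its tokens are missing from the model's known vocabulary. The sketch below is only that, and it rests on assumptions: the PDF text has already been extracted to a plain string, the tokenizer is a crude regular expression, and `vocabulary` is any set-like container of known words. With Gensim 4.x that could be `model.wv.key_to_index`, and with the official fasttext Python package `set(model.words)`; neither is needed to run the sketch. The function name and the sample text are made up for illustration.

```python
import re
from collections import Counter


def oov_counts(text, vocabulary, lowercase=False):
    """Count tokens in `text` that are absent from `vocabulary`.

    `vocabulary` is any container supporting `in` (set, dict, ...).
    Set lowercase=True only if the model was trained on lowercased text.
    """
    if lowercase:
        text = text.lower()
    # Crude tokenizer (an assumption): a letter followed by letters,
    # digits, underscores, or hyphens. Ideally, mirror the preprocessing
    # used when the model was trained.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9_-]*", text)
    return Counter(tok for tok in tokens if tok not in vocabulary)


# Made-up sample text and vocabulary, for illustration only.
sample_text = "The patient received pembrolizumab; pembrolizumab was well tolerated."
known_words = {"the", "patient", "received", "was", "well", "tolerated"}

print(oov_counts(sample_text, known_words, lowercase=True).most_common())
```

Frequent OOV words with many varied usage contexts are the ones where new training data could plausibly help. Words that appear only once or twice are unlikely to get good vectors from any process.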
