Boosting a word's weight during TF-IDF (topic modeling)


Here is the case. Let's say we have a dataset containing messages from a chat and we want to do topic modeling on it (a few topics, for example).

Let us assume that topic A might be (and should be) represented by a few words, but I know (say, from some external source) that all messages containing the word word_to_boost should be assigned to topic A. All preprocessing is done and the bag of words is built. Is there any way to "boost" the word word_to_boost somehow, to suggest to the model that it should put all messages containing that word into topic A? If so, is that recommended?

I assumed this might be done around the TF-IDF weights, but maybe there is a different approach?
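To make the idea concrete, here is roughly what I mean by "boosting": multiplying the TF-IDF score of one chosen term by a constant factor. This is only a toy sketch (the corpus, term names, and boost factor are made up, and this is not a standard feature of any library):

```python
import math

# Toy corpus: each document is a list of tokens (assumed already preprocessed).
docs = [
    ["refund", "order", "late"],
    ["password", "reset", "help"],
    ["order", "shipped", "today"],
]

def tfidf(docs, boost=None):
    """Compute TF-IDF vectors for `docs`.

    `boost` maps a term to a multiplier, i.e. a hypothetical manual
    "boosting" factor applied on top of the usual tf * idf score.
    """
    boost = boost or {}
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {t: sum(t in d for d in docs) for t in vocab}
    vectors = []
    for d in docs:
        vec = {}
        for t in vocab:
            tf = d.count(t) / len(d)
            idf = math.log(n / df[t]) + 1.0
            vec[t] = tf * idf * boost.get(t, 1.0)
        vectors.append(vec)
    return vectors

plain = tfidf(docs)
boosted = tfidf(docs, boost={"refund": 5.0})
# "refund" now carries 5x its normal weight in the documents that contain it.
```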

Thanks in advance!

1 Answer


There's a good amount of confusion here:

  • Topic modeling is unsupervised; it can be seen as a kind of clustering task. So by definition there are no predefined topics, and of course one can't pre-assign particular words to a topic/cluster.
  • If the task involves predefined "topics", then it's text classification: a model is trained on annotated data.
  • In text classification, if a word is a really good indicator of the class, then the model will make good use of it by itself. The whole point of ML methods is to let the model learn from data; otherwise one could build a rule-based system instead.
  • TF-IDF is a common weighting scheme in text classification, but again it would be a terrible idea to modify the weights manually: why learn from data at all, then?
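To illustrate the third point: given annotated data, even a very simple learned classifier picks up a perfectly discriminative word on its own, with no manual boosting. A minimal sketch with toy data and a tiny multinomial Naive Bayes (all message contents here are made up, and the word word_to_boost appears only in topic-A messages):

```python
import math
from collections import Counter

# Toy labeled training data: every topic-A message contains "word_to_boost".
train = [
    (["hello", "word_to_boost", "please"], "A"),
    (["word_to_boost", "thanks"], "A"),
    (["hello", "thanks", "bye"], "B"),
    (["please", "bye"], "B"),
]

def fit(train):
    """Multinomial Naive Bayes with add-one smoothing (labels hardcoded)."""
    counts = {"A": Counter(), "B": Counter()}
    for tokens, label in train:
        counts[label].update(tokens)
    vocab = set(counts["A"]) | set(counts["B"])
    loglik = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)
        loglik[label] = {t: math.log((c[t] + 1) / total) for t in vocab}
    return loglik

def predict(loglik, tokens):
    # Sum log-likelihoods per class; tokens unseen in training are ignored.
    scores = {label: sum(ll.get(t, 0.0) for t in tokens)
              for label, ll in loglik.items()}
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, ["word_to_boost"]))  # "A": the indicator word dominates
print(predict(model, ["bye"]))            # "B": no indicator word present
```

Because word_to_boost occurs only in class A, its learned likelihood is much higher under A than under B, which is exactly the "boost" the model derives from the data itself.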