I have some very long documents. They share a fairly standard set of overall topics, but each document emphasises those topics differently, and within those topics each has different subtopics.
I would like to determine: 1. the importance/probability of each topic within each document (i.e. document 1 put more emphasis on topic 3 than document 2 did), and 2. the subtopics of each topic and their probabilities.
I have mostly seen BERTopic and Top2Vec used for short texts like tweets.
Would they be an appropriate strategy for very long documents, or is there a better one?
You'll have to try them (and other classic methods like LDA) with your own documents, against your own goals, to evaluate their applicability. No external authority with only a vague idea of what's available and important to your project can give an a priori assessment of what will work or be practical/optimal.
And once you've tried various techniques, observed where they work or don't, and have a clearer idea of what you hoped for but found lacking, you'll be able to ask more detailed questions that can generate better insight.
Most topic-modeling options will report a relative score for each topic, per document. So yes, you'll have a sense of which documents are relatively more associated with certain topics.
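For instance, here's a minimal sketch using gensim's LDA, where `get_document_topics` gives the full per-document topic distribution. The toy corpus and `num_topics` are placeholders to swap for your own data and tuning:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized corpus -- substitute your own documents
docs = [
    "the cat sat on the mat".split(),
    "stocks rose as the markets rallied".split(),
    "the dog chased the cat around the garden".split(),
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(tokens) for tokens in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# One (topic_id, probability) pair per topic, per document
for i, bow in enumerate(corpus):
    print(f"doc {i}:", lda.get_document_topics(bow, minimum_probability=0.0))
```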
Many methods don't create hierarchical "sub-topics" under higher-level topics by default, so if that's a requirement, it may take extra effort/steps.
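BERTopic, which you mention, is one option that does offer a hierarchy after fitting. A minimal sketch, assuming `docs` is your own corpus (its clustering step generally wants at least a few hundred texts, so short sub-documents help here too):

```python
from bertopic import BERTopic

docs = [...]  # your own (sub)document strings; BERTopic's clustering
              # generally needs at least a few hundred texts

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Score every document against every topic (your point 1)
topic_distr, _ = topic_model.approximate_distribution(docs)

# Merge similar topics bottom-up into a hierarchy (toward your point 2)
hier_topics = topic_model.hierarchical_topics(docs)
print(topic_model.get_topic_tree(hier_topics))  # plain-text tree view
```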
If your documents are especially long, you may find it useful to split them into subdocuments, so the topic analysis is more sensitive to the full diversity of each document and can point to the specific places where topics reside. Such splits would ideally match the documents' own sections/chapters, but even a purely mechanical split may help you detect and characterize finer shifts in topic than a single whole-document analysis would reveal.
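A mechanical split can be as simple as packing paragraphs into roughly fixed-size chunks; here's a sketch (the 300-word target is an arbitrary assumption to tune against your models' input limits):

```python
def split_into_chunks(text, max_words=300):
    """Pack blank-line-separated paragraphs into chunks of ~max_words words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Each chunk then becomes its own "document" for the topic model, and a
# (doc_id, chunk_index) mapping lets you point back to where topics occur.
```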