Cluster Documents and Classify New Ones

I am working on a project where I need to assign a set of documents to categories. I believe this is a topic modeling problem. One approach is to use LLMs. However, I have limited data and resources, so fine-tuning a deep model is not feasible for me. I am mainly looking for a lightweight LLM or a classic machine learning method. One of the main challenges is that I do not know the number of topics (algorithms like LDA need the number of topics in advance). Moreover, the topics change each year. So I am looking for a solution that works in this scenario:

Cluster all documents from the previous year into as many categories as needed, and generate a good title for each category. Then, during the upcoming year, classify new documents into one of those categories.
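To make the scenario concrete, here is a minimal sketch of the workflow in plain Python. It is not a recommendation of a specific method: it assumes a bag-of-words vector as a stand-in for real embeddings, a simple "leader" clustering with a cosine-similarity threshold so that the number of clusters is not fixed in advance, and the highest-weighted terms as a crude cluster title. The function names, the threshold of 0.3, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """L2-normalised term counts (a stand-in for real document embeddings)."""
    counts = Counter(w for w in text.lower().split() if len(w) > 3)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(docs, threshold=0.3):
    """Leader clustering: open a new cluster whenever no centroid is close enough."""
    centroids, members = [], []
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            # Add the document to the running (unnormalised) centroid sum.
            for w, v in vec.items():
                centroids[best][w] = centroids[best].get(w, 0.0) + v
            members[best].append(i)
        else:
            centroids.append(dict(vec))
            members.append([i])
    return centroids, members


def title(centroid, n=2):
    """Crude title: the n highest-weighted terms of the centroid."""
    top = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return " / ".join(w for w, _ in top)


def classify(doc, centroids, threshold=0.3):
    """Index of the closest cluster, or None if nothing is similar enough."""
    vec = vectorize(doc)
    sims = [cosine(vec, c) for c in centroids]
    if not sims:
        return None
    best = max(range(len(sims)), key=sims.__getitem__)
    return best if sims[best] >= threshold else None


# Hypothetical documents from the previous year.
last_year = [
    "football match score goal football",
    "football league goal season",
    "election vote parliament election",
    "parliament vote government election",
]
centroids, members = cluster(last_year)
titles = [title(c) for c in centroids]

# New documents arriving during the upcoming year.
sports_label = classify("goal scored in football final", centroids)
unknown_label = classify("recipe for chocolate cake", centroids)
```

Returning `None` for a document that matches no existing cluster is one possible way to notice that a new topic has appeared during the year.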

I tried several solutions: LDA, clustering embedding features, and a document-based LLM. I have a small dataset to check the performance of each algorithm. LDA does not classify the documents properly. For clustering, I first extract embeddings using BERT and then cluster them with DBSCAN. Unfortunately, it cannot cluster them and treats them as noise. For the last solution, I use one of the gpt4all models and give it a simple document that contains the topics and a brief description of each topic. I then feed it a new document, and it generates promising results. However, I cannot provide this document each year.
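One possible reason for DBSCAN labelling every point as noise is an `eps` radius that is too small for the distance scale of the embeddings. Whether that applies to my data is an assumption I have not verified. The sketch below illustrates the k-distance heuristic for inspecting that scale, using plain Python, cosine distance, and made-up 2-D vectors in place of BERT embeddings. It is not scikit-learn's implementation, and the helper names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def k_distances(vectors, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, v in enumerate(vectors):
        dists = sorted(
            cosine_distance(v, u) for j, u in enumerate(vectors) if j != i
        )
        result.append(dists[k - 1])
    return sorted(result)


def noise_fraction(vectors, eps, min_samples):
    """Share of points that are neither core points nor within eps of one."""
    n = len(vectors)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_samples}
    noise = [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbours[i])
    ]
    return len(noise) / n


# Made-up vectors forming two tight groups (stand-ins for embeddings).
vectors = [
    (1.0, 0.0), (1.0, 0.05), (1.0, 0.1),
    (0.0, 1.0), (0.05, 1.0), (0.1, 1.0),
]

kd = k_distances(vectors, k=2)
too_small = noise_fraction(vectors, eps=1e-4, min_samples=3)
reasonable = noise_fraction(vectors, eps=0.01, min_samples=3)
```

In this toy data, an `eps` far below the largest k-distance leaves every point as noise, while an `eps` just above it recovers both groups. Real BERT embeddings are high-dimensional, so the useful range of `eps` would have to be read from the k-distances of the actual data.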

I would appreciate it if you could share your thoughts and ideas.
