I'm new to NLP. I have an unlabelled dataset of about 10,000 rows, and I have tried both text clustering and an LDA model to extract a few keywords for each cluster/topic.
Below is a sample of the unlabelled dataset. Some rows contain field labels, all written in one string; others have no field labels, and their contents may not follow the same field order. For example, row 2 should read: issue: change of bedroom design, style: Japanese style, comments: want built in wardrobe, can start commencing work anytime after Christmas, how-did-you-know-us: influencer.
no description
1 issue: need help for decoration for the whole house, style: vintage all furniture must be high quality, other comments: nil, how-did-you-know-us: nil
2 Japanese style want built in wardrobe Change of bedroom design can start commencing work anytime after Christmas influencer
3 Issue: home decor style: vintage, other comments: budget up to $100k how-did-you-know-us: friend
4 Issue : demolition of shop, style: - other comments : anytime before 23 october, how-did-you-know-us: online
5 Home decor with lots of space planning, artistic, client is a musician and loves photography, friend
In terms of data cleaning, I have only removed punctuation and converted everything to lowercase. However, I now wonder whether I should first have split each entry into multiple rows, before removing punctuation and special characters, since some entries are quite clearly labelled (I am still working out how to code this). For example:
no Description
1 issue: need help for decoration for the whole house
1 style: vintage all furniture must be high quality
1 other comments: nil
1 how-did-you-know-us: nil
2 Japanese style want built in wardrobe Change of bedroom design can start commencing work anytime after Christmas influencer
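A minimal sketch of that splitting step, using only Python's standard `re` module. The field list is an assumption read off the five sample rows, and `split_fields` is a hypothetical helper name. It has to run on the raw strings, before punctuation is removed, because it relies on the colons. Rows without any recognised field name are kept whole.

```python
import re

# Field names seen in the sample rows (an assumption; extend as new ones turn up).
# "other comments" is listed before "comments" so the longer name is tried first.
FIELDS = ["issue", "style", "other comments", "comments", "how-did-you-know-us"]

# A field name, optional spaces, then a colon; matched case-insensitively.
FIELD_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FIELDS) + r")\s*:",
    flags=re.IGNORECASE,
)


def split_fields(row_no, text):
    """Return (row_no, field, value) tuples; field is None for unlabelled text."""
    matches = list(FIELD_RE.finditer(text))
    if not matches:
        return [(row_no, None, text.strip())]

    rows = []
    # Keep any text that appears before the first recognised field name.
    leading = text[: matches[0].start()].strip(" ,")
    if leading:
        rows.append((row_no, None, leading))

    for i, match in enumerate(matches):
        # The value runs up to the next field name, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[match.end():end].strip(" ,")
        rows.append((row_no, match.group(1).lower(), value))
    return rows


for row in split_fields(3, "Issue: home decor style: vintage, other comments: "
                           "budget up to $100k how-did-you-know-us: friend"):
    print(row)
```

Unlabelled rows such as no. 2 come back as a single `(row_no, None, text)` tuple, so they still need a separate approach.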
Are there any examples of, or ways to, fit an LDA model with a train-test split and at the same time label the data, given that my dataset is unlabelled? Or are there other methods I could apply to unlabelled data?
For both models, I simply passed in the whole dataset and built a model.
For text clustering, I was able to get the clusters and visualise them, but I'm unsure whether the clusters can be labelled and then used for a train-test split. I'm also now exploring hierarchical clustering.
For the LDA model, I have managed to fit a model and used pyLDAvis to check whether any topics overlap. Again, I'm not sure what the next step is.
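One possible next step for LDA, sketched under the assumption that scikit-learn is used (the post does not say which library built the model): fit on a training split, check held-out perplexity on the test split, then take each document's most probable topic as its label. The topic count of 2 and the helper name `assign_topics` are illustrative assumptions, and the five documents are the cleaned sample rows standing in for the full dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# The five sample rows, lowercased with punctuation removed.
# In practice this would be the full list of about 10,000 strings.
docs = [
    "issue need help for decoration for the whole house style vintage "
    "all furniture must be high quality other comments nil",
    "japanese style want built in wardrobe change of bedroom design "
    "can start commencing work anytime after christmas influencer",
    "issue home decor style vintage other comments budget up to 100k friend",
    "issue demolition of shop style other comments anytime before 23 october online",
    "home decor with lots of space planning artistic client is a musician "
    "and loves photography friend",
]

N_TOPICS = 2  # assumption for this sketch; tune on the real data

train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=0)

# Learn the vocabulary on the training split only, then reuse it for the test split.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda.fit(X_train)

# Held-out perplexity: lower is better when comparing different topic counts.
test_perplexity = lda.perplexity(X_test)


def assign_topics(texts):
    """Return the index of the most probable topic for each text."""
    doc_topic = lda.transform(vectorizer.transform(texts))
    return doc_topic.argmax(axis=1)


train_topics = assign_topics(train_docs)
new_topic = assign_topics(["vintage decor for the living room referred by friend"])[0]
print(test_perplexity, train_topics, new_topic)
```

The topic indices carry no meaning on their own; they only become labels once someone reads the top keywords of each topic and names it.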
My idea for both models is to "label" my data using the fitted model, so that whenever new data comes in, again in one-string format, the model can tell which category it falls under.
I assume that for either method I have to turn the unsupervised learning model into a supervised one, but I'm not sure how to do that.
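A rough sketch of one way this unsupervised-to-supervised idea could look, assuming scikit-learn: cluster first, treat the cluster ids as pseudo-labels, then train and test an ordinary classifier on them. K-means, the Naive Bayes classifier, the cluster count of 2 and the names in `CLUSTER_NAMES` are all illustrative choices, not something the data dictates.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The five sample rows, lowercased with punctuation removed (stand-in for all rows).
docs = [
    "issue need help for decoration for the whole house style vintage "
    "all furniture must be high quality other comments nil",
    "japanese style want built in wardrobe change of bedroom design "
    "can start commencing work anytime after christmas influencer",
    "issue home decor style vintage other comments budget up to 100k friend",
    "issue demolition of shop style other comments anytime before 23 october online",
    "home decor with lots of space planning artistic client is a musician "
    "and loves photography friend",
]

# Step 1 (unsupervised): cluster the TF-IDF vectors.
tfidf = TfidfVectorizer(stop_words="english")
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(tfidf.fit_transform(docs))

# Step 2 (manual): name each cluster after reading its rows and keywords.
# These names are hypothetical placeholders.
CLUSTER_NAMES = {0: "category_a", 1: "category_b"}
labels = [CLUSTER_NAMES[int(c)] for c in cluster_ids]

# Step 3 (supervised): train and test a classifier on the pseudo-labels.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, random_state=0
)
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
classifier.fit(train_docs, train_labels)

# This score is agreement with the cluster assignments, not true accuracy.
agreement = classifier.score(test_docs, test_labels)

# New one-string data can now be categorised directly.
new_label = classifier.predict(["japanese style bedroom with built in wardrobe"])[0]
print(agreement, new_label)
```

Because the classifier only learns to reproduce the clusters, the test score says how consistently the clusters can be recovered, not whether they are the right categories. Hand-checking a sample of rows per cluster is still needed before trusting the labels.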