I want to cluster 160,000 documents of variable lengths.
Problem: the spaCy LM model "en_core_web_lg" doesn't contain all the words present in my documents. Building n-grams therefore includes out-of-vocabulary words, which distorts the vector of the n-gram.
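For reference, this is how spaCy reports whether a token is covered (the second example token is made up):

import spacy

nlp = spacy.load('en_core_web_lg')

# OOV words have no vector, so they would contribute a zero vector
# to any n-gram average.
print(nlp.vocab.has_vector('company'))   # True
print(nlp.vocab.has_vector('xyzzyco'))   # False: made-up, not in the vocabulary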
Solution I tried: I overrode the _word_ngrams method of TfidfVectorizer to handle this.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the model once at module level; loading it inside _word_ngrams
# would reload it for every document and dominate the fit time.
nlp = spacy.load('en_core_web_lg')

class NewTfidfVectorizer(TfidfVectorizer):
    def _word_ngrams(self, tokens, stop_words=None):
        """Turn tokens into a sequence of n-grams after stop-word filtering,
        dropping words that are missing from the LM model's vocabulary."""
        tokens = super()._word_ngrams(tokens, stop_words)
        # Keep only unigrams that have a vector in the spaCy vocab.
        in_vocab = {t for t in tokens if ' ' not in t and nlp.vocab.has_vector(t)}
        new_tokens = set(in_vocab)
        # Rebuild each higher-order n-gram from its surviving words only;
        # set membership keeps the lookup O(1).
        for token in tokens:
            new_words = [w for w in token.split(' ') if w in in_vocab]
            if new_words:
                new_tokens.add(' '.join(new_words))
        # Note: the set deduplicates, so each term is counted at most
        # once per document.
        return sorted(new_tokens)
Now the problem: it takes a significant amount of time just to fit this:
NGRAM_RANGE = (1, 3)
tfidf_vectorizer = NewTfidfVectorizer(analyzer='word', norm=None, ngram_range=NGRAM_RANGE, stop_words='english', use_idf=True, smooth_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(docs) # sparse matrix
vocab = tfidf_vectorizer.get_feature_names_out()
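To see where the fit time actually goes, the standard-library profiler is enough (a measurement sketch):

import cProfile

cProfile.run('tfidf_vectorizer.fit_transform(docs)', sort='cumtime')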
Step 1: preprocess the data: remove stopwords, HTML and XML tags, and apply stemming. Stopwords are removed with NLTK's stopword list plus one personal list of stopwords.
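A minimal sketch of that pass (assuming NLTK's English stopwords are downloaded via nltk.download('stopwords'); the personal list here is a placeholder):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english')) | {'placeholder1', 'placeholder2'}  # personal list is hypothetical
stemmer = PorterStemmer()
TAG_RE = re.compile(r'<[^>]+>')  # crude HTML/XML tag stripper

def preprocess(text):
    text = TAG_RE.sub(' ', text)
    words = re.findall(r"[a-z']+", text.lower())
    return ' '.join(stemmer.stem(w) for w in words if w not in STOPWORDS)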
Step 2: TF-IDF fitting, using ngram_range=(1, 3).
Step 3: normalize the TF-IDF matrix (weighted_tfidf); this step can be ignored.
Step 4: create a dict in this format: {"doc1": [("word1", 2.45), ("word2", 3.93454)], "doc2": [("word5", 1.395), ("word9", 4.2455)]}
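A sketch of building that mapping from the sparse matrix and the vocab array above (the "docN" keys are illustrative):

tfidf_coo = tfidf_matrix.tocoo()
doc_terms = {}
for i, j, w in zip(tfidf_coo.row, tfidf_coo.col, tfidf_coo.data):
    doc_terms.setdefault(f"doc{i + 1}", []).append((vocab[j], w))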
Step 5: take the weighted average of the word vectors based on the normalized TF-IDF matrix: (vector1*2 + vector2*3 + vector3*1) / (2 + 3 + 1).
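In code, one document's weighted average might look like this (a sketch; it assumes every word in a kept term has a vector, which the filtering in the custom vectorizer guarantees):

import numpy as np

def term_vector(term):
    # An n-gram's vector: the mean of its words' vectors.
    return np.mean([nlp.vocab.get_vector(w) for w in term.split(' ')], axis=0)

def doc_vector(term_weights):
    # sum(w_i * v_i) / sum(w_i), e.g. (v1*2 + v2*3 + v3*1) / (2 + 3 + 1)
    vecs = np.stack([term_vector(t) for t, _ in term_weights])
    weights = np.array([w for _, w in term_weights])
    return np.average(vecs, axis=0, weights=weights)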
Step 6: Clustering using HDBSCAN
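Step 6 as a minimal sketch using the hdbscan package (min_cluster_size is a placeholder to tune):

import hdbscan
import numpy as np

doc_matrix = np.stack([doc_vector(tw) for tw in doc_terms.values()])
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(doc_matrix)  # label -1 marks noise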
A few other problems:
Step 5 is very compute-intensive even though I use a sparse matrix for the TF-IDF at every point (see the first sketch after this list).
The data contains people and company names, which do tend to affect cluster formation (see the second sketch below).
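On the first point: the per-document Python loop can be collapsed into one sparse-dense matrix product (a sketch, assuming the (n_terms, 300) term-vector matrix fits in memory; term_vector is the helper from Step 5):

import numpy as np

V = np.stack([term_vector(t) for t in vocab])       # (n_terms, 300)
row_sums = np.asarray(tfidf_matrix.sum(axis=1))     # (n_docs, 1)
doc_matrix = (tfidf_matrix @ V) / np.maximum(row_sums, 1e-12)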
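On the second point: one option is to drop named entities during preprocessing using spaCy's NER (a sketch; which entity labels to drop is an assumption to adjust):

def drop_names(text):
    doc = nlp(text)
    return ' '.join(t.text for t in doc if t.ent_type_ not in {'PERSON', 'ORG'})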