Sentence Embedding Clustering

Asked by Anthony X. on 21 October 2020 at 06:30 · 2.3k views

I am working on a small project in which I need to remove irrelevant information (ads, for instance) from HTML content that I extracted from websites. Since I am a beginner in NLP, I came up with a simple approach after doing some research.

The websites are mainly in Chinese, and I stored each sentence (split on commas) in a list. I used HanLP to segment each sentence into words, which gives something like this:

[['萨哈夫', '说', '，', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大', '规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'], 
 ['上海', '华安', '工业', '（', '集团', '）', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。']]

I found a pretrained Chinese word embedding database and used it to look up an embedding for each word in my lists. My approach is then to compute each sentence embedding as the element-wise average of the word embeddings in that sentence. I now have a list containing the sentence embedding vector of every sentence I parsed.

sentence: ['各国', '必须', '“', '大', '规模', '”', '支出', '》', '的', '报道', '称']
sentence embedding: [0.08130878633396192, -0.07660450288941237, 0.008989107615145093, 0.07014013996178453, 0.028158639980988068, 0.01821030060422014, 0.017793822186914356, 0.04148909364911643, 0.019383941353722053, 0.03080177273262631, -0.025636445207055658, -0.019274188523096116, 0.0007501963356679136, 0.00476544528183612, -0.024648051539605313, -0.011124626140702854, -0.0009071269834583455, -0.08850407109341839, 0.016131568784740837, -0.025241035714068195, -0.041586867829954084, -0.0068722023954085835, -0.010853541125966744, 0.03994347004812549, 0.04977656596086242, 0.029051605612039566, -0.031031965550606732, 0.05125975541093133, 0.02666312647687102, 0.0376262941096105, -0.00833959155716002, 0.035523645325817844, -0.0026961421932686458, 0.04742895790629766, -0.07069634984840047, -0.054931600324132225, 0.0727336619218642, 0.0434290729039772, -0.09277284060689536, -0.020194332538680596, 0.0011523241092535582, 0.035080605863847515, 0.13034072890877724, 0.06350403482263739, -0.04108352984555743, 0.03208382343026725, -0.08344872626052662, -0.14081071757457472, -0.010535095733675089, -0.04253014939075166, -0.06409504175694151, 0.04499104322696274, -0.1153958263722333, 0.011868207969448784, 0.032386500388383865, -0.0036963022192305125, 0.01861521213802255, 0.05440248447385701, 0.026148285970769146, 0.011136160687204789, 0.04259885661303997, 0.09219381585717201, 0.06065366725141013, -0.015763109010136264, -0.0030524068596688185, 0.0031816939061338253, -0.01272551697382534, 0.02884035756472837, -0.002176688645373691, -0.04119681418788704, -0.08371328799562021, 0.007803680078888481, 0.0917377421124415, 0.027042210250246255, -0.0168504383076321, -0.0005781924013387073, 0.0075592477594248276, 0.07226487367667934, 0.005541681396690282, 0.001809495755217292, 0.011297995647923513, 0.10331092673269185, 0.0034428672357039018, 0.07364177612841806, 0.03861967177892273, -0.051503680434755304, -0.025596174390309236, 0.014137779785828157, -0.08445698734034192, -0.07401955000717532, 
0.05168289600194178, -0.019313615386966954, 0.007136409255591306, -0.042960755484686655, 0.01830706542188471, -0.001172357662157579, -0.008949846103364094, -0.02356141348454085, -0.05277112944432619, 0.006653293967247009, -0.00572453092106364, 0.049479073389771984, -0.03399876727913083, 0.029434629207984966, -0.06990156170319427, 0.0924786920659244, 0.015472117049450224, -0.10265431468459693, -0.023421658562834968, 0.004523425542918796, -0.008990391665561632, -0.06445665437389504, 0.03898039324717088, -0.025552247142927212, 0.03958867977119305, -0.03243451675569469, -0.03848901360338046, -0.061713250523263756, -0.00904815017499707, -0.03730008362750099, 0.02715366007760167, -0.08498009599067947, -0.00397337388924577, -0.0003402943098494275, 0.008005982349542055, 0.05871503853069788, -0.013795949010686441, 0.007956360128115524, -0.024331797295334665, 0.03842244771393863, -0.04393653944134712, 0.02677931230176579, 0.07715398648923094, -0.048624055216681554, -0.11324723844882101, -0.08751555024222894, -0.02469049582511864, -0.08767948790707371, -0.021930147846102376, 0.011519658294591036, -0.08155732788145542, -0.10763703049583868, -0.07967398501932621, -0.03249315629628571, 0.02701333300633864, -0.015305672687563028, 0.002375963249836456, 0.012275356545367024, -0.02917095824060115, 0.02626959386874329, -0.0158629031767222, -0.05546591058373451, -0.023678493686020374, -0.048296650278974666, -0.06167154920033433, 0.004435380412773652, 0.07418209609617903, 0.03524015434297987, 0.063185997529548, -0.05814945189790292, 0.13036084697920491, -0.03370768073099581, 0.03256692289671099, 0.06808869439092549, 0.0563600350340659, 5.7854774323376745e-05, -0.0793171048333699, 0.03862177783792669, 0.007196083004766313, 0.013824320821599527, 0.02798982642707415, -0.00918149473992261, -0.00839392692697319, 0.040496235374699936, -0.007375971498814496, -0.03586547057652338, -0.03411220566538924, -0.025101724758066914, -0.005714270286262035, 0.07351569867354225, -0.024216756182299418, 
0.0066968070935796604, -0.032809603959321976, 0.05006068360737779, 0.0504626590250568, 0.04525104385208, -0.027629732069644062, 0.10429493219337681, -0.021474285961382768, 0.018212029964409092, 0.07260083373297345, 0.026920156976716084, 0.043199389770796355, -0.03641596379351209, 0.0661080302670598, 0.09141866947439584, 0.0157452768815512, -0.04552285996297459, -0.03509725736115466, 0.02857604629190808]
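
For reference, the averaging step described above can be sketched as follows. This is only an illustration under assumptions: `mean_sentence_embedding` and `toy_vectors` are hypothetical names, the 3-dimensional vectors are made-up values rather than real pretrained embeddings, and skipping tokens that are missing from the lookup table is one possible choice, not necessarily what the original code did.

```python
import numpy as np

def mean_sentence_embedding(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in word_vectors.

    Tokens missing from the lookup table are skipped; a sentence with no
    known tokens gets an all-zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional lookup table (hypothetical values, not real embeddings).
toy_vectors = {
    "各国": np.array([1.0, 0.0, 2.0]),
    "必须": np.array([3.0, 2.0, 0.0]),
}

# The quotation mark has no vector here, so it is ignored in the average.
embedding = mean_sentence_embedding(["各国", "必须", "“"], toy_vectors, dim=3)
print(embedding)
```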

My next step is to cluster these sentence embedding vectors and identify the sentences whose content is clearly irrelevant compared to the others.

Does my approach even make sense? If it does, what tools can I use to cluster my sentence embeddings? I have seen approaches such as K-means or computing L2 distances, but I am not sure how to implement them.

Thanks!

**ATIF ADIB** · Answer 1 · 2020-10-21T06:56:00.387000

The approach makes sense if you are trying to get rid of sentences that do not contribute to the downstream analysis, but an element-wise average may not be the best way to construct the sentence embeddings. A better way would be to take the individual word embeddings and combine them using tf-idf weights.

import numpy as np

sentence = [w1, w2, w3]
word_vectors = [v1, v2, v3]  # each v is a NumPy array of shape (N,), where N is the embedding size

term_frequency_of_word = [t1, t2, t3]
inverse_doc_freq = [idf1, idf2, idf3]

# tf-idf weight of each word in the sentence
word_weights = [tf * idf for tf, idf in zip(term_frequency_of_word, inverse_doc_freq)]

sentence_vector = np.zeros(N)

# weighted sum of the word vectors
for weight, vector in zip(word_weights, word_vectors):
    scaled_vector = vector * weight
    sentence_vector += scaled_vector

By applying tf-idf scaling, your sentence embedding moves towards the embedding of the most important word(s) in the sentence, which might help clustering algorithms filter out unwanted sentences.

Here is a quick tutorial on TF-IDF: http://www.tfidf.com
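
The weighting described in this answer could be made concrete roughly as below. Treat it as a sketch under assumptions rather than the answerer's exact method: `idf_table` and `tfidf_sentence_embedding` are hypothetical helper names, each sentence is treated as one "document" for the idf counts, the idf uses a smoothed formula, the toy vectors are made up, and dividing by the total weight is an added choice so that sentence length does not scale the result.

```python
import math
from collections import Counter

import numpy as np

def idf_table(tokenized_sentences):
    """Smoothed inverse document frequency, one 'document' per sentence."""
    n_docs = len(tokenized_sentences)
    doc_freq = Counter(tok for sent in tokenized_sentences for tok in set(sent))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def tfidf_sentence_embedding(tokens, word_vectors, idf, dim):
    """Weighted average of word vectors, using tf-idf as the weights."""
    counts = Counter(t for t in tokens if t in word_vectors and t in idf)
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok, count in counts.items():
        weight = (count / len(tokens)) * idf[tok]  # tf * idf
        total += weight * word_vectors[tok]
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total

# Toy corpus: "的" occurs in every sentence, so its idf is the lowest.
corpus = [["的", "武器"], ["的", "博物馆"]]
toy_vectors = {  # hypothetical 2-dimensional embeddings
    "的": np.array([1.0, 0.0]),
    "武器": np.array([0.0, 1.0]),
    "博物馆": np.array([0.0, 1.0]),
}
idf = idf_table(corpus)
weighted = tfidf_sentence_embedding(corpus[0], toy_vectors, idf, dim=2)
print(weighted)
```

In this toy case the rarer word "武器" pulls the sentence vector towards its own embedding, which is the effect the answer describes.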

**roddar92** · Answer 2 · 2020-10-21T13:25:12.663000

For clustering you can try k-means, but that algorithm only uses the Euclidean metric. If you want another distance (e.g. cosine distance), k-medoids, which is also an EM-style algorithm, is suitable as well. In Python, you can find `KMeans` in the scikit-learn library. To try `KMedoids`, you should install the scikit-learn-extra library (https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html) or this one: https://github.com/letiantian/kmedoids
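
A minimal sketch of the k-means route with scikit-learn follows, assuming the sentence embeddings are already stacked into a 2-D NumPy array. The toy `embeddings` array, the choice of two clusters, and the distance-to-centre score for spotting outliers are illustrative assumptions, not part of the answer above. The rows are L2-normalized first because, for unit vectors, squared Euclidean distance equals 2 − 2·cos, so Euclidean k-means then ranks pairs the same way cosine distance does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Toy stand-in for the sentence embeddings: two tight groups in 2-D.
embeddings = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],   # group A
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0],   # group B
])

# Scale every row to unit length before clustering.
unit = normalize(embeddings)

# n_clusters is a guess that has to be tuned for real data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(unit)

# Distance of each sentence to its own cluster centre;
# unusually large values are candidates for irrelevant sentences.
dist_to_centre = np.linalg.norm(unit - kmeans.cluster_centers_[labels], axis=1)
print(labels)
print(dist_to_centre.round(3))
```

On real data, small clusters or sentences far from every centre would be the ones to inspect by hand before discarding anything.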
