I am working on a small project in which I need to eliminate irrelevant information (ads for instance) from the html content I extracted from the websites. Since I am a beginner in NLP, I came up with a simple approach after doing some research.
The language used in the websites is mainly Chinese and I stored each sentence (separated by comma) into a list. I used a model called HanLP to do semantic parsing on my sentences. Something like this:
[['萨哈夫', '说', ',', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大', '规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'],
['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。']]
I found a pretrained Chinese word embedding database to get the word embeddings in my list. Then my approach is to get the sentence embedding by calculating the element-wise average in that sentence. Now I got a list with sentence embedding vector of each individual sentence I parsed.
sentence: ['各国', '必须', '“', '大', '规模', '”', '支出', '》', '的', '报道', '称']
sentence embedding: [0.08130878633396192, -0.07660450288941237, 0.008989107615145093, 0.07014013996178453, 0.028158639980988068, 0.01821030060422014, 0.017793822186914356, 0.04148909364911643, 0.019383941353722053, 0.03080177273262631, -0.025636445207055658, -0.019274188523096116, 0.0007501963356679136, 0.00476544528183612, -0.024648051539605313, -0.011124626140702854, -0.0009071269834583455, -0.08850407109341839, 0.016131568784740837, -0.025241035714068195, -0.041586867829954084, -0.0068722023954085835, -0.010853541125966744, 0.03994347004812549, 0.04977656596086242, 0.029051605612039566, -0.031031965550606732, 0.05125975541093133, 0.02666312647687102, 0.0376262941096105, -0.00833959155716002, 0.035523645325817844, -0.0026961421932686458, 0.04742895790629766, -0.07069634984840047, -0.054931600324132225, 0.0727336619218642, 0.0434290729039772, -0.09277284060689536, -0.020194332538680596, 0.0011523241092535582, 0.035080605863847515, 0.13034072890877724, 0.06350403482263739, -0.04108352984555743, 0.03208382343026725, -0.08344872626052662, -0.14081071757457472, -0.010535095733675089, -0.04253014939075166, -0.06409504175694151, 0.04499104322696274, -0.1153958263722333, 0.011868207969448784, 0.032386500388383865, -0.0036963022192305125, 0.01861521213802255, 0.05440248447385701, 0.026148285970769146, 0.011136160687204789, 0.04259885661303997, 0.09219381585717201, 0.06065366725141013, -0.015763109010136264, -0.0030524068596688185, 0.0031816939061338253, -0.01272551697382534, 0.02884035756472837, -0.002176688645373691, -0.04119681418788704, -0.08371328799562021, 0.007803680078888481, 0.0917377421124415, 0.027042210250246255, -0.0168504383076321, -0.0005781924013387073, 0.0075592477594248276, 0.07226487367667934, 0.005541681396690282, 0.001809495755217292, 0.011297995647923513, 0.10331092673269185, 0.0034428672357039018, 0.07364177612841806, 0.03861967177892273, -0.051503680434755304, -0.025596174390309236, 0.014137779785828157, -0.08445698734034192, -0.07401955000717532, 0.05168289600194178, -0.019313615386966954, 0.007136409255591306, -0.042960755484686655, 0.01830706542188471, -0.001172357662157579, -0.008949846103364094, -0.02356141348454085, -0.05277112944432619, 0.006653293967247009, -0.00572453092106364, 0.049479073389771984, -0.03399876727913083, 0.029434629207984966, -0.06990156170319427, 0.0924786920659244, 0.015472117049450224, -0.10265431468459693, -0.023421658562834968, 0.004523425542918796, -0.008990391665561632, -0.06445665437389504, 0.03898039324717088, -0.025552247142927212, 0.03958867977119305, -0.03243451675569469, -0.03848901360338046, -0.061713250523263756, -0.00904815017499707, -0.03730008362750099, 0.02715366007760167, -0.08498009599067947, -0.00397337388924577, -0.0003402943098494275, 0.008005982349542055, 0.05871503853069788, -0.013795949010686441, 0.007956360128115524, -0.024331797295334665, 0.03842244771393863, -0.04393653944134712, 0.02677931230176579, 0.07715398648923094, -0.048624055216681554, -0.11324723844882101, -0.08751555024222894, -0.02469049582511864, -0.08767948790707371, -0.021930147846102376, 0.011519658294591036, -0.08155732788145542, -0.10763703049583868, -0.07967398501932621, -0.03249315629628571, 0.02701333300633864, -0.015305672687563028, 0.002375963249836456, 0.012275356545367024, -0.02917095824060115, 0.02626959386874329, -0.0158629031767222, -0.05546591058373451, -0.023678493686020374, -0.048296650278974666, -0.06167154920033433, 0.004435380412773652, 0.07418209609617903, 0.03524015434297987, 0.063185997529548, -0.05814945189790292, 0.13036084697920491, -0.03370768073099581, 0.03256692289671099, 0.06808869439092549, 0.0563600350340659, 5.7854774323376745e-05, -0.0793171048333699, 0.03862177783792669, 0.007196083004766313, 0.013824320821599527, 0.02798982642707415, -0.00918149473992261, -0.00839392692697319, 0.040496235374699936, -0.007375971498814496, -0.03586547057652338, -0.03411220566538924, -0.025101724758066914, -0.005714270286262035, 0.07351569867354225, -0.024216756182299418, 0.0066968070935796604, -0.032809603959321976, 0.05006068360737779, 0.0504626590250568, 0.04525104385208, -0.027629732069644062, 0.10429493219337681, -0.021474285961382768, 0.018212029964409092, 0.07260083373297345, 0.026920156976716084, 0.043199389770796355, -0.03641596379351209, 0.0661080302670598, 0.09141866947439584, 0.0157452768815512, -0.04552285996297459, -0.03509725736115466, 0.02857604629190808]
My next step is to cluster these sentence embedding vectors and find out sentences that clearly have irrelevant content compared to the others.
Does my approach even make sense? If it does, what tools can I use to cluster my sentence embedding values? I saw there are approaches such as K-means or calculate L2 distances but I am not sure how to implement.
Thanks!
The approach makes sense, if you are trying to get rid of sentences which do not contribute to the downstream analysis but element-wise average may not be the best way to construct the sentence embeddings. A better way to construct sentence embeddings would be to take the individual word embeddings and then combine them using tf-idf.
By applying tf-idf scaling your sentence embedding will move towards the embedding of the most important word(s) in the sentence which might help you apply clustering algorithms to filter out unwanted sentences.
Here is a quick tutorial on TF-IDF: http://www.tfidf.com