Keep one entry of duplicate articles with SOLR deduplication


I am using Solr deduplication with the following settings in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">description</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

and in schema.xml

<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />

My objective is to find documents with duplicate descriptions (I use TextProfileSignature to catch near-duplicates), keep one entry, and remove the other duplicates.

For example:

doc1 description: Websol – Candidate should be good in communication and computer skills must be willing to relocate We have good vacancies for Back Office in international call centers

doc2 description: Websol – Candidate should be good in communication and computer skills must be willing to relocate We have good vacancies for Back Office in international call centers...

Of these two docs, only one should be deleted, not both. With Solr dedupe, however, both entries get deleted.

Let me know if I am missing anything in the settings, or if I need to take a different approach to achieve this.


There are 1 best solutions below


It could be that you are running into a known issue.
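
One possible workaround, offered as an assumption rather than a confirmed fix for the behaviour described in the question: keep the signature computation but set overwriteDupes to false, so that Solr only stores the signature and does not delete anything at index time. Duplicates can then be hidden at query time by collapsing on the signature field, for example by adding fq={!collapse field=signature} to the query, which returns one document per signature value. A minimal sketch of the adjusted chain, reusing the names from the question:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- Assumption: false means the signature is still computed and stored,
         but existing documents with the same signature are left untouched -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">description</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this setup both doc1 and doc2 stay in the index and only the search results are deduplicated. If the duplicates have to be physically removed, that would need a separate cleanup step.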