Solr. DIH. Delete Discretionary Hyphen (soft-hyphen) in PDF


I have a problem with PDF files.
I'm using Solr 8.11.1 and build an index from PDF files using DIH. Everything works well, except that the PDFs contain discretionary hyphens (soft hyphens). The PDFs were created in InDesign, and discretionary hyphens were inserted into some of the long words. For example, the word *uncopyrightable* is divided like this: *un-co-py-righ-tab-le* (each hyphen marks the position of a discretionary hyphen). The word is not necessarily wrapped onto another line.
Because of this, I get several separate words in the index (*un*, *co*, *py*, *righ*, *tab*, *le*) instead of the single word *uncopyrightable*. The same happens with many other words, so I can't find them in the index.
In tika-data-config I tried to replace the character (Unicode U+00AD, written as `\u00AD`) with an empty string:

    <entity name="pdf" processor="TikaEntityProcessor"
            url="${file.fileAbsolutePath}" format="text"
            transformer="TemplateTransformer,RegexTransformer">
      <field column="text" regex="\u00AD" replaceWith="" sourceColName="text"/>
    </entity>

But this had no effect.
Then I tried this:

    <field column="text" regex="un co py righ tab le" replaceWith="777" sourceColName="text"/>

And I got 777 in the index.
So it turns out that the discretionary hyphen is already converted into a space before the text is processed in tika-data-config.
How can this problem be solved?


For information: if I open the PDF file in Adobe Reader and copy and paste the text into Word, no spaces appear. If I open it in PDF-XChange Viewer and paste the text into Word, spaces do appear. If I open it in Microsoft Edge, icons showing a question mark inside a diamond appear in those positions.
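
To pin down which character actually ends up in the text in each case (a plain space, U+00AD, or possibly the U+FFFD replacement character, which would match the question mark in a diamond), a small check like the one below could be run on text copied from a viewer or from the index. It is only a diagnostic sketch in Python and not part of the Solr setup; the function name `describe_chars` and the sample strings are made up for illustration.

```python
import unicodedata


def describe_chars(text):
    """List the code point and Unicode name of every non-alphanumeric character in text."""
    report = []
    for ch in text:
        # Letters and digits are skipped; spaces, soft hyphens and
        # replacement characters are not alphanumeric, so they are reported.
        if not ch.isalnum():
            code = "U+%04X" % ord(ch)
            name = unicodedata.name(ch, "UNKNOWN")
            report.append((code, name))
    return report


# Hypothetical samples: the same word fragment with a soft hyphen and with a space.
samples = {
    "soft hyphen kept": "un\u00adco\u00adpy",
    "converted to space": "un co py",
}

for label, sample in samples.items():
    print(label, describe_chars(sample))
```

If the extracted text reports only `SPACE` entries where the discretionary hyphens were, that would confirm that the character is lost before any regex in tika-data-config can see it.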


I have no way to fix the PDFs themselves. Besides, there are a lot of them.
