How to convert Tesseract output (hOCR) into a plain txt file with FOP? (My stylesheet generates empty output)

The resulting output: a txt file with empty lines.

The expected output: a txt file with words of "Привет Мир! Это я, обычный неработающий текст или рыба" text.

What am I doing wrong? I also tried nested xsl:for-each instructions, and they behave the same way.

Best answer

I see 2 problems in your attempt:

  1. Your instruction:

    <xsl:for-each select="//div [@class='ocr_page'] /div [@class='ocr_carea'] / p [@class='ocr_par'] / span[@class='ocr_line'] / span [@class='ocrx_word']">
    

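    selects nothing, because your input XML puts all its elements in a namespace (hOCR is XHTML, so the elements are normally in the http://www.w3.org/1999/xhtml namespace). To solve this, bind that namespace to a prefix in your stylesheet and use the prefix on every element name in your XPath expressions.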

  2. Once you have it working, this instruction will put you in the context of a span with class ocrx_word. From this context, your next instruction:

     <xsl:value-of select="normalize-space(span [@class='ocrx_word'])" disable-output-escaping="yes"/>
    

    also selects nothing, because span is not a child of itself. It should be:

    <xsl:value-of select="normalize-space(.)"/>
    

    and I doubt you want to disable output escaping in a stylesheet producing an XML result.
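
Putting both fixes together, a minimal sketch might look like the stylesheet below. It rests on a few assumptions that are not in your question: that your hOCR file declares the XHTML namespace `http://www.w3.org/1999/xhtml` (check the `xmlns` attribute on its root element), that you are happy with the arbitrary prefix `x`, and that you want the XSLT processor to write plain text directly (`method="text"`) rather than produce XSL-FO for a later step.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">

  <!-- Assumption: plain-text result, so no escaping tricks are needed -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- Every element name carries the x: prefix bound to the input namespace -->
    <xsl:for-each select="//x:span[@class='ocr_line']">
      <!-- Emit the words of the current line, separated by single spaces -->
      <xsl:for-each select="x:span[@class='ocrx_word']">
        <!-- The context node is already the word span, so select "." -->
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:if test="position() != last()">
          <xsl:text> </xsl:text>
        </xsl:if>
      </xsl:for-each>
      <!-- End each OCR line with a newline -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If your file uses a different namespace URI, change the `xmlns:x` declaration to match it exactly; the prefix itself can be anything.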