ocrmypdf puts some recognized words on separate lines


Hi, I was looking for an application that can batch-OCR PDF documents, has a hot-folder function, and works on Linux. I found ocrmypdf and I run it as a Docker container. Everything seems to work fine, but some recognized words are treated as if they were on separate lines. Is there any way to correct this? Is there an additional option that fixes the problem?

Below is my Docker Compose file:

version: "3.3"
services:
  ocrmypdf:
    restart: always
    container_name: ocrmypdf
    image: myimage:1
    volumes:
      - "/home/administrator/Desktop/ehh/input/:/input"
      - "/home/administrator/Desktop/ehh/output/:/output"
      - "/home/administrator/Desktop/ehh/processed/:/processed"

    environment:
      - OCR_OUTPUT_DIRECTORY_YEAR_MONTH=0
      - OCR_ON_SUCCESS_ARCHIVE=1
      - 'OCR_JSON_SETTINGS={"force-ocr": true, "language": "pol", "deskew": true,"optimize": "2" }'
    user: "root:root"
    entrypoint: python3
    command:
      - watcher.py
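
A minimal sketch for checking how the OCR_JSON_SETTINGS value above decodes. It assumes the watcher script parses this variable with Python's json.loads, which is an assumption and not confirmed here. It only prints each key with its decoded value and type, for example that "optimize" arrives as the string "2" rather than an integer. Whether that has any bearing on the line-break problem is not established.

```python
import json

# Value copied verbatim from the compose file above.
raw = '{"force-ocr": true, "language": "pol", "deskew": true,"optimize": "2" }'

# Assumption: the watcher decodes the variable the same way.
settings = json.loads(raw)

# Print each option with its decoded value and Python type.
for key, value in settings.items():
    print(f"{key!r}: {value!r} ({type(value).__name__})")
```

Running this outside the container is enough to rule out a malformed settings string before looking at the OCR options themselves.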

I used the sample document below, which is in Polish:

[image: sample document]

and the result of selecting the text and copy-pasting it looks like this:

spólnota

mieszkaniowa

"Radość"

I tried the options mentioned in the documentation, but none of them had any effect.
