Hi, I was looking for an application to batch-OCR PDF documents that has a hot-folder function and works on Linux. I found ocrmypdf and I run it as a Docker container. Everything seems to work fine, but some recognized words are treated as if they were on separate lines. Is there any way to correct this? Is there an additional option that fixes this problem?
Below is my docker compose file:
version: "3.3"
services:
  ocrmypdf:
    restart: always
    container_name: ocrmypdf
    image: myimage:1
    volumes:
      - "/home/administrator/Desktop/ehh/input/:/input"
      - "/home/administrator/Desktop/ehh/output/:/output"
      - "/home/administrator/Desktop/ehh/processed/:/processed"
    environment:
      - OCR_OUTPUT_DIRECTORY_YEAR_MONTH=0
      - OCR_ON_SUCCESS_ARCHIVE=1
      - 'OCR_JSON_SETTINGS={"force-ocr": true, "language": "pol", "deskew": true,"optimize": "2" }'
    user: "root:root"
    entrypoint: python3
    command:
      - watcher.py
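As a sanity check on the compose file, the value passed in OCR_JSON_SETTINGS can be parsed with Python's json module to see how each option is interpreted. This is only a minimal sketch: the names `raw` and `settings` are my own, and it does not reproduce what watcher.py does with the values afterwards.

```python
import json

# The exact string from the OCR_JSON_SETTINGS entry in the compose file.
raw = '{"force-ocr": true, "language": "pol", "deskew": true,"optimize": "2" }'

# json.loads raises json.JSONDecodeError if the string is not valid JSON.
settings = json.loads(raw)

# Show every option together with the Python type it was parsed into.
for key, value in settings.items():
    print(f"{key!r}: {value!r} ({type(value).__name__})")
```

Parsed this way, "force-ocr" and "deskew" come out as booleans, while "optimize" is the string "2" rather than a number.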
I used the sample document below, which is in Polish,
and the result of selecting the text and copy-pasting it looks like this:
spólnota
mieszkaniowa
"Radość"
I tried the options mentioned in the documentation, but none of them had any effect.