FScrawler: perform OCR selectively only on PDF files that do not have text


I'm using FScrawler (2.7) to load text from PDFs into Elasticsearch (7.6.X). Most of the PDF files have a text layer, but some contain only images of scanned text and need to be OCRed. Is there a way to configure FScrawler so that it performs OCR only on the PDF files that contain images of scanned text, and not on files that already have text?

So far I can configure it either to skip OCR for all files (case 1) or to run OCR on all files (case 2). In the first case, FScrawler loads all files with text very quickly but skips the files that contain only images of scanned text. In the second case, it takes a really long time because it OCRs every file, including those that already have text.

The OCR settings for FScrawler are documented here: https://fscrawler.readthedocs.io/en/latest/user/ocr.html

Config for Case 1:

name: "Case 1"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false
    pdf_strategy: 'no_ocr'

Config for Case 2:

name: "Case 2"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    pdf_strategy: 'ocr_and_text'

P.S. I could sort the PDFs into scanned and text-based files by other means and run a separate FScrawler job for each pile, but before doing that I want to check whether FScrawler's native features offer an easier way.
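For reference, the fallback described in the P.S. could look roughly like the sketch below. It is not an FScrawler feature, only a pre-sorting step under some assumptions: the third-party `pypdf` package is used to read the text layer, the 20-character threshold `min_chars` is an arbitrary guess, and the function names are made up for illustration. A PDF that mixes text pages with scanned pages would land in the "has text" pile, so this is a rough heuristic.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def extract_text_with_pypdf(path: Path) -> str:
    """Return the text layer of a PDF (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the rest works without pypdf

    reader = PdfReader(str(path))
    return "".join(page.extract_text() or "" for page in reader.pages)


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a PDF as scanned if its text layer is (almost) empty.

    min_chars is a hypothetical threshold; tune it for your documents.
    """
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


def partition_pdfs(
    paths: Iterable[Path],
    extract_text: Callable[[Path], str] = extract_text_with_pypdf,
    min_chars: int = 20,
) -> Tuple[List[Path], List[Path]]:
    """Split paths into (files with text, files that probably need OCR)."""
    with_text: List[Path] = []
    scanned: List[Path] = []
    for path in paths:
        try:
            text = extract_text(path)
        except Exception:
            # If the text layer cannot be read, send the file to the OCR pile.
            text = ""
        if needs_ocr(text, min_chars):
            scanned.append(path)
        else:
            with_text.append(path)
    return with_text, scanned


if __name__ == "__main__":
    pdfs = sorted(Path("/path/to/data/dir").rglob("*.pdf"))
    with_text, scanned = partition_pdfs(pdfs)
    print(f"{len(with_text)} files with text, {len(scanned)} files needing OCR")
```

The two resulting lists could then be copied or linked into separate directories, with the Case 1 config pointing its `fs.url` at the text pile and the Case 2 config pointing at the scanned pile.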
