Using PDF2Image in Code Repository on Palantir Foundry

I am trying to use the library pdf2image in a Code Repository on Palantir Foundry and getting the error

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

when using the function convert_from_bytes.

Does anyone know how to reference the poppler path and get rid of this error?

Thanks!

Here is the code:

from pdf2image import convert_from_bytes
import pytesseract


def extract_pdf_text(input_bytes, language='eng', dpi=200):
    # Render each PDF page to a PIL image (requires poppler's binaries on PATH).
    pages = convert_from_bytes(input_bytes, dpi=dpi)
    # OCR every page and concatenate the text in page order.
    return ''.join(pytesseract.image_to_string(page, lang=language) for page in pages)

And the meta.yaml for reference:

# If you need to modify the runtime requirements for your package,
# update the 'requirements.run' section in this file

package:
  name: "{{ PACKAGE_NAME }}"
  version: "{{ PACKAGE_VERSION }}"

source:
  path: ../src

requirements:
  # Tools required to build the package. These packages are run on the build system and include
  # things such as revision control systems (Git, SVN) make tools (GNU make, Autotool, CMake) and
  # compilers (real cross, pseudo-cross, or native when not cross-compiling), and any source pre-processors.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#build
  build:
    - python 3.8.*
    - setuptools

  # Packages required to run the package. These are the dependencies that are installed automatically
  # whenever the package is installed.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#run
  run:
    - python 3.8.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-expectations
    - transforms-verbs
    - pytesseract
    - pdfplumber
    - googletrans
    - regex
    - pdf2image
    - langdetect
    - pandas
    - numpy
    - selenium
    - requests
    - pypdf2
    - poppler

build:
  script: python setup.py install --single-version-externally-managed --record=record.txt

There is 1 solution below

I found the problem by inspecting the CI checks. They were failing before poppler was pulled in. After I cleaned up meta.yaml and the checks succeeded, everything worked fine.
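If the checks pass and the error still shows up, it may help to confirm where the poppler binaries actually are and pass that directory to pdf2image explicitly through its poppler_path argument. The following is only a minimal sketch: find_poppler_dir and convert_with_poppler are hypothetical helper names, and the fallback locations under sys.prefix are an assumption about how a conda environment is laid out, not something confirmed for Foundry.

```python
import os
import shutil
import sys


def find_poppler_dir(extra_dirs=()):
    """Return the directory containing poppler's `pdfinfo`, or None if not found."""
    # pdf2image looks for `pdfinfo` on PATH by default, so try that first.
    found = shutil.which("pdfinfo")
    if found:
        return os.path.dirname(found)
    # Assumption: conda usually puts binaries in <prefix>/bin
    # (or <prefix>/Library/bin on Windows).
    candidates = list(extra_dirs) + [
        os.path.join(sys.prefix, "bin"),
        os.path.join(sys.prefix, "Library", "bin"),
    ]
    for directory in candidates:
        if shutil.which("pdfinfo", path=directory):
            return directory
    return None


def convert_with_poppler(input_bytes, dpi=200):
    """Convert PDF bytes to images, pointing pdf2image at the located poppler."""
    # Imported here so the lookup helper above works even without pdf2image.
    from pdf2image import convert_from_bytes

    poppler_dir = find_poppler_dir()
    if poppler_dir is None:
        raise RuntimeError("pdfinfo not found; poppler is probably not installed")
    return convert_from_bytes(input_bytes, dpi=dpi, poppler_path=poppler_dir)
```

If find_poppler_dir() returns None inside the transform, poppler most likely never made it into the environment, which points back to the failing checks rather than to a PATH problem.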