How to use PyMuPDF with Cloud Functions

I am currently working on a Firebase project and have to write a Cloud Function that extracts text from a given PDF. The project is written in TypeScript, but we have to use the Python library PyMuPDF for the conversion. When I try to deploy the function, I get this error:

      fitz/fitz_wrap.c:2754:10: fatal error: fitz.h: No such file or directory
       2754 | #include <fitz.h>
            |          ^~~~~~~~
      compilation terminated.
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for PyMuPDF
  Running setup.py clean for PyMuPDF
Failed to build PyMuPDF
ERROR: Could not build wheels for PyMuPDF, which is required to install pyproject.toml-based projects; Error ID: 656dd225

After some research I found this comment, which says that PyMuPDF uses "extension modules written in C or C++": https://github.com/pymupdf/PyMuPDF/issues/430#issuecomment-576231408. Is there a way to still use the library?
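
The `fitz.h` message in the log comes from pip compiling PyMuPDF's C extension from the source distribution, which it falls back to when it finds no prebuilt wheel for the interpreter and platform of the build environment. Below is a minimal sketch for printing the interpreter tag the build runs under, so it can be compared with the wheel files published for the pinned version. `interpreter_summary` is a hypothetical helper name, and the `cp` prefix assumes CPython.

```python
import platform
import sys


def interpreter_summary() -> str:
    """Describe the running interpreter roughly the way wheel filenames do.

    Assumption: CPython, so the tag is 'cp' + major + minor (e.g. cp39).
    """
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return f"{py_tag} on {platform.system().lower()} {platform.machine()}"


if __name__ == "__main__":
    # Run this in the same runtime the function is deployed to.
    print(interpreter_summary())
```

If no wheel for that tag exists for the pinned PyMuPDF version, pip will attempt the source build shown in the error above.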

requirements.txt

Flask==2.1.1
PyMuPDF==1.18.19
functions-framework==2.3.0

The part of the cloud function that uses the library

from typing import Literal

import fitz  # PyMuPDF


def binary_to_text(binary: bytes, filetype: Literal["pdf", "png", "jpg", "jpeg"]) -> str:
    try:
        if filetype == "pdf":
            # Open the PDF from memory rather than from a file path
            doc = fitz.open(stream=binary, filetype=filetype)
            # Concatenate the plain text of every page
            return "".join(page.get_text() for page in doc)
        elif filetype in ["png", "jpg", "jpeg"]:
            # We may need to use a different library for extracting text from images
            # For example, pytesseract for OCR
            return "Text extraction from images is not supported yet."
        else:
            return "Unsupported file type (upload pdf, png, jpg, jpeg instead)."
    except Exception as e:
        # Return a message string so the declared return type (str) holds
        return f"Text extraction failed: {e}"

I tried to find a solution but wasn't successful, so maybe someone else has had the same issue.

1 Answer

How to use PyMuPDF's built-in Tesseract API to extract text from an image:

import fitz
import pathlib

def image_to_text(binary_of_image):
    pix = fitz.Pixmap(binary_of_image)  # auto detection of img type
    pdfdata = pix.pdfocr_tobytes(language="eng+spa")  # English & Spanish
    doc = fitz.open("pdf", pdfdata)  # open the 1-page OCR result as an in-memory PDF
    page = doc[0]
    return page.get_text()  # return OCR-ed text

img_binary = pathlib.Path("some-image.file").read_bytes()
text = image_to_text(img_binary)
print(text)
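
As a rough sketch of how this OCR route could plug into the question's `binary_to_text`, the dispatcher below maps file types to extractor functions. The names `extract_pdf_text`, `extract_image_text`, `DEFAULT_EXTRACTORS` and the `extractors` parameter are hypothetical, introduced only for illustration. It assumes Tesseract language data is installed where the function runs, and, to my knowledge, `pdfocr_tobytes` is only available from PyMuPDF 1.19.0 onward, so the question's 1.18.19 pin would need to be raised. Unlike the original, it raises `ValueError` for unsupported types instead of returning a message string.

```python
from typing import Callable, Dict, Optional


def extract_pdf_text(binary: bytes) -> str:
    """Extract the text of every page from PDF bytes."""
    import fitz  # lazy import: the module loads even without PyMuPDF

    doc = fitz.open(stream=binary, filetype="pdf")
    try:
        return "".join(page.get_text() for page in doc)
    finally:
        doc.close()


def extract_image_text(binary: bytes) -> str:
    """OCR image bytes via Pixmap -> one-page PDF, as in the answer above."""
    import fitz

    pix = fitz.Pixmap(binary)  # image type is auto-detected
    pdfdata = pix.pdfocr_tobytes(language="eng")  # needs Tesseract data
    doc = fitz.open("pdf", pdfdata)
    try:
        return doc[0].get_text()
    finally:
        doc.close()


# Hypothetical routing table: file type -> extractor function.
DEFAULT_EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "pdf": extract_pdf_text,
    "png": extract_image_text,
    "jpg": extract_image_text,
    "jpeg": extract_image_text,
}


def binary_to_text(
    binary: bytes,
    filetype: str,
    extractors: Optional[Dict[str, Callable[[bytes], str]]] = None,
) -> str:
    """Route the bytes to the extractor registered for `filetype`."""
    table = DEFAULT_EXTRACTORS if extractors is None else extractors
    try:
        extractor = table[filetype.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {filetype!r}") from None
    return extractor(binary)
```

Passing a custom `extractors` table makes the routing testable without PyMuPDF installed; in the deployed function the default table would be used, for example `binary_to_text(pdf_bytes, "pdf")`.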

Note: I am a maintainer and the original creator of PyMuPDF.