I am working on a Firebase project and have to write a cloud function that extracts text from a given PDF. The project is written in TypeScript, but for the conversion we have to use the Python library PyMuPDF. When I try to deploy the function, I get the following error:
fitz/fitz_wrap.c:2754:10: fatal error: fitz.h: No such file or directory
 2754 | #include <fitz.h>
      |          ^~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for PyMuPDF
Running setup.py clean for PyMuPDF
Failed to build PyMuPDF
ERROR: Could not build wheels for PyMuPDF, which is required to install pyproject.toml-based projects; Error ID: 656dd225
After some research I found this comment, which says that PyMuPDF uses "extension modules written in C or C++": https://github.com/pymupdf/PyMuPDF/issues/430#issuecomment-576231408 Is there still a way to use the library?
requirements.txt
Flask==2.1.1
PyMuPDF==1.18.19
functions-framework==2.3.0
The part of the cloud function that uses the library:
from typing import Literal

import fitz  # PyMuPDF


def binary_to_text(binary: bytes, filetype: Literal["pdf", "txt", "png", "jpg", "jpeg"]) -> str:
    try:
        if filetype in ("pdf", "txt"):
            # Open the document from memory; filetype tells PyMuPDF how to read the bytes.
            doc = fitz.open(stream=binary, filetype=filetype)
            text = ""
            for page in doc:
                text += page.get_text()
            return text
        elif filetype in ("png", "jpg", "jpeg"):
            # We may need to use a different library for extracting text from images,
            # for example pytesseract for OCR.
            return "Text extraction from images is not supported yet."
        else:
            return "Unsupported file type (upload pdf, png, jpg, jpeg instead)."
    except Exception as e:
        # Return the message rather than the exception object to match the str return type.
        return str(e)
I tried to find a solution but wasn't successful. Has anyone else run into the same issue?
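A possible explanation, offered as an assumption rather than a confirmed diagnosis: the log shows pip compiling PyMuPDF from source, which pip only does when it finds no prebuilt wheel matching the runtime's Python version and platform. A source build needs the MuPDF headers (hence the missing fitz.h), and these are not present in the build environment. The sketch below prints the CPython tag of the running interpreter so it can be compared with the wheel files listed on PyPI for the pinned release. The helper names and the tag set are hypothetical; the tag set in particular is a placeholder that should be replaced with what the PyPI file list actually shows.

```python
import sys
from typing import Iterable


def interpreter_tag() -> str:
    """Return the CPython wheel tag of the running interpreter, e.g. 'cp39'.

    Assumes CPython; other implementations use different tag prefixes.
    """
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def needs_source_build(available_tags: Iterable[str]) -> bool:
    """True if no wheel tag in available_tags matches this interpreter."""
    return interpreter_tag() not in set(available_tags)


if __name__ == "__main__":
    # Placeholder (assumption): replace with the cpXY tags that appear in the
    # wheel file names on PyPI for the pinned PyMuPDF version.
    assumed_wheel_tags = {"cp36", "cp37", "cp38", "cp39"}
    print("interpreter tag:", interpreter_tag())
    print("pip would build from source:", needs_source_build(assumed_wheel_tags))
```

If the interpreter tag is missing from the published wheels, pinning a PyMuPDF release that ships a wheel for the runtime, or choosing a runtime the pinned release supports, should let pip install a binary wheel and skip compilation.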
How to use PyMuPDF's built-in Tesseract API to extract text from an image:
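A minimal sketch of that approach, not the answer's original code: it assumes PyMuPDF 1.19.0 or later (newer than the 1.18.19 pinned in the question), a Tesseract installation whose language data is reachable (for example via the TESSDATA_PREFIX environment variable), and a hypothetical helper name image_to_text. The image is run through Tesseract into a one-page PDF with a text layer, and the text is then read back like any other PDF page.

```python
def image_to_text(image: bytes, language: str = "eng") -> str:
    """OCR an in-memory image (png, jpg, ...) and return the recognized text.

    Hypothetical helper; requires PyMuPDF >= 1.19.0 and Tesseract language data.
    """
    # Imported inside the function so the module still loads where PyMuPDF is absent.
    import fitz  # PyMuPDF

    # Decode the image bytes into a Pixmap.
    pix = fitz.Pixmap(image)

    # Let Tesseract produce a 1-page PDF (as bytes) carrying an OCR text layer.
    pdf_bytes = pix.pdfocr_tobytes(language=language)

    # Open that PDF from memory and extract the text of its single page.
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        return doc[0].get_text()
    finally:
        doc.close()
```

In the question's binary_to_text, the image branch could then return image_to_text(binary) instead of the "not supported yet" message, provided Tesseract is available in the deployment environment.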
Note: I am a maintainer and the original creator of PyMuPDF.