'print(text_body)' produces unexpected blank output from copied PyPDF User Guide code


Former title: "TypeError: PageObject.extract_text() got an unexpected keyword argument 'visitor_text'"

I am trying to follow the example in the pypdf documentation on using a visitor to extract text.

Environment

environment.yml file contents, slightly changed:

name: C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env
channels:
  - conda-forge
  - defaults
dependencies:
  - asttokens=2.0.5=pyhd3eb1b0_0
  - beautifulsoup4=4.12.2=py312haa95532_0
  - brotli-python=1.0.9=py312hd77b12b_7
  - bzip2=1.0.8=he774522_0
  - ca-certificates=2023.12.12=haa95532_0
  - certifi=2023.11.17=py312haa95532_0
  - cffi=1.16.0=py312h2bbff1b_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - colorama=0.4.6=py312haa95532_0
  - comm=0.1.2=py312haa95532_0
  - cryptography=41.0.7=py312h89fc84f_0
  - debugpy=1.6.7=py312hd77b12b_0
  - decorator=5.1.1=pyhd3eb1b0_0
  - defusedxml=0.7.1=pyhd3eb1b0_0
  - executing=0.8.3=pyhd3eb1b0_0
  - expat=2.5.0=hd77b12b_0
  - fpdf=1.7.2=pyhd8ed1ab_0
  - fpdf2=2.5.6=pyhd8ed1ab_0
  - freetype=2.12.1=ha860e81_0
  - giflib=5.2.1=h8cc25b3_3
  - idna=3.4=py312haa95532_0
  - ipykernel=6.29.0=pyha63f2e9_0
  - ipython=8.20.0=py312haa95532_0
  - jedi=0.18.1=py312haa95532_1
  - jpeg=9e=h2bbff1b_1
  - jupyter_client=8.6.0=py312haa95532_0
  - jupyter_core=5.5.0=py312haa95532_0
  - lerc=3.0=hd77b12b_0
  - libdeflate=1.17=h2bbff1b_1
  - libffi=3.4.4=hd77b12b_0
  - libpng=1.6.39=h8cc25b3_0
  - libsodium=1.0.18=h62dcd97_0
  - libtiff=4.5.1=hd77b12b_0
  - libwebp=1.3.2=hbc33d0d_0
  - libwebp-base=1.3.2=h2bbff1b_0
  - lz4-c=1.9.4=h2bbff1b_0
  - matplotlib-inline=0.1.6=py312haa95532_0
  - nest-asyncio=1.5.6=py312haa95532_0
  - openjpeg=2.4.0=h4fc8c34_0
  - openssl=3.0.12=h2bbff1b_0
  - packaging=23.1=py312haa95532_0
  - parso=0.8.3=pyhd3eb1b0_0
  - pdfminer=20191125=pyhd8ed1ab_1
  - pdfminer.six=20231228=pyhd8ed1ab_0
  - pillow=10.0.1=py312h045eedc_0
  - pip=23.3.1=py312haa95532_0
  - platformdirs=3.10.0=py312haa95532_0
  - prompt-toolkit=3.0.43=py312haa95532_0
  - prompt_toolkit=3.0.43=hd3eb1b0_0
  - psutil=5.9.0=py312h2bbff1b_0
  - pure_eval=0.2.2=pyhd3eb1b0_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pycryptodome=3.15.0=py312h2bbff1b_0
  - pygments=2.15.1=py312haa95532_1
  - pyopenssl=23.2.0=py312haa95532_0
  - pypdf=4.0.0=pyhd8ed1ab_0
  - pypdf2=2.10.5=py312haa95532_0
  - pysocks=1.7.1=py312haa95532_0
  - python=3.12.0=h1d929f7_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - pywin32=305=py312h2bbff1b_0
  - pyzmq=25.1.2=py312hd77b12b_0
  - requests=2.31.0=py312haa95532_0
  - setuptools=68.2.2=py312haa95532_0
  - six=1.16.0=pyhd3eb1b0_1
  - soupsieve=2.5=py312haa95532_0
  - sqlite=3.41.2=h2bbff1b_0
  - stack_data=0.2.0=pyhd3eb1b0_0
  - svg.path=6.3=pyhd8ed1ab_0
  - tk=8.6.12=h2bbff1b_0
  - tornado=6.3.3=py312h2bbff1b_0
  - traitlets=5.7.1=py312haa95532_0
  - typing_extensions=4.9.0=py312haa95532_1
  - tzdata=2023d=h04d1e81_0
  - urllib3=1.26.18=py312haa95532_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - wheel=0.41.2=py312haa95532_0
  - win_inet_pton=1.1.0=py312haa95532_0
  - xz=5.4.5=h8cc25b3_0
  - zeromq=4.3.5=hd77b12b_0
  - zlib=1.2.13=h8cc25b3_0
  - zstd=1.5.5=hd43e919_0
prefix: C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env

VS Code Integrated Terminal

$ C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env\python.exe -m platform
Windows-11-10.0.22621-SP0

$ C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env\python.exe -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

filepath = r"C:\Users\[my_username]\src\repos\read-pdf\data\raw\GeoBase_NHNC1_Data_Model_UML_EN.pdf"

reader = PdfReader(filepath)
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Sharing here the PDF file that I used.

Traceback

There is no traceback and no visible output. The only trace of output is a single newline: if I copy the cell output from VS Code's Jupyter extension and paste it into Notepad, I get one blank line.

A previous version of this question produced a traceback. That is no longer the case.

I have tried changing page number by editing the hard-coded numeric value on line 6:

page = reader.pages[3]

The tutorial does not say whether any output is expected, but page 4 (index 3) is the table of contents.

eternal_white On

It's a documentation issue


The example in the documentation should have used the text matrix tm instead of the current transformation matrix cm.

Just change y = cm[5] to y = tm[5] and the code will work.

Here's the same code but modified:

from pypdf import PdfReader

filepath = r"C:\Users\[my_username]\src\repos\read-pdf\data\raw\GeoBase_NHNC1_Data_Model_UML_EN.pdf"

reader = PdfReader(filepath)
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
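
If you want to see for yourself why the cm version prints nothing, here is a rough sketch of a visitor that records the vertical offset from both matrices for every text fragment. The helper names make_recording_visitor and in_body are my own, not part of pypdf, and the matrices in the demo are made-up values that only illustrate the case where cm carries no vertical translation. I have not verified that this is exactly what happens in the PDF from the question.

```python
def make_recording_visitor(records):
    """Return a visitor that stores (text, cm_y, tm_y) for each text fragment.

    The returned function has the five-argument signature that
    page.extract_text(visitor_text=...) calls in pypdf.
    """
    def visitor(text, cm, tm, font_dict, font_size):
        # Index 5 of a 6-element PDF matrix is the vertical translation.
        records.append((text, cm[5], tm[5]))
    return visitor


def in_body(y, y_min=50, y_max=720):
    """Same range check as the documentation example (hypothetical helper)."""
    return y_min < y < y_max


# Demo with made-up matrices (assumption: cm has no vertical translation).
records = []
visitor = make_recording_visitor(records)
visitor("Heading", [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 72, 700], None, 12)
visitor("Footer", [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 72, 30], None, 8)

kept_with_cm = [text for text, cm_y, tm_y in records if in_body(cm_y)]
kept_with_tm = [text for text, cm_y, tm_y in records if in_body(tm_y)]
print(kept_with_cm)  # [] because cm_y is 0 for every fragment here
print(kept_with_tm)  # ['Heading']
```

To run it against a real file, pass the visitor to page.extract_text(visitor_text=make_recording_visitor(records)) and inspect records afterwards. If the cm values all fall outside the 50 to 720 range while the tm values vary by line, that would explain the blank output.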