How to use Python to find the page numbers where a certain font is used in a PDF

How can I use Python to find the page numbers where a certain font is used in a PDF?

I tried the PyPDF2 library, but it did not produce the expected output. For example, where the Arial font is used, I want to print those page numbers.

Here is a minimal example:

import PyPDF2

pdf_file_path = "input.pdf"
target_font = "Arial"

pdf = PyPDF2.PdfReader(open(pdf_file_path, "rb"))

# Iterate through the pages of the PDF
for page_number in range(len(pdf.pages)):
    page = pdf.pages[page_number]
    fonts = page['/Resources']['/Font']

    # Check if the target font is used on the page
    if any(target_font.lower() in font.lower() for font in fonts.keys()):
        print("Font", target_font, "is used on page", page_number + 1)

There are 2 best solutions below

Jorj McKie On BEST ANSWER

A solution using PyMuPDF:

import fitz  # PyMuPDF

target_font = "arial"

doc = fitz.open("input.pdf")
for page in doc:
    fontlist = page.get_fonts()
    for xref, ext, ftype, fontname, _, _ in fontlist:
        if target_font in fontname.lower():  # match may not be exact with subset fonts
            print(f"Page {page.number + 1} uses {target_font}")  # page.number is 0-based
            break
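
The comment about subset fonts refers to names such as PMGGAE+ArialMT, where a tag of six uppercase letters and a plus sign precedes the base font name. Below is a minimal sketch of normalizing such names before comparing. The helper names strip_subset_prefix and font_matches are hypothetical and not part of PyMuPDF.

```python
import re

# Subset fonts carry a tag of six uppercase letters plus "+",
# e.g. "PMGGAE+ArialMT".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(fontname: str) -> str:
    """Return the font name without a leading subset tag, if present."""
    return _SUBSET_TAG.sub("", fontname)


def font_matches(fontname: str, target: str) -> bool:
    """Case-insensitive substring match on the normalized font name."""
    return target.lower() in strip_subset_prefix(fontname).lower()
```

With these helpers, the condition in the loop above could become font_matches(fontname, target_font). A substring match is still loose ("arial" also matches "ArialNarrow"), so compare the normalized names for equality if you need an exact match.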
K J On

Many Python setups will have the Poppler utilities available, which you can call via the shell; simpler still is to use the shell directly. Here, on Windows, the command below shows that this file uses Arial on pages 3 and 6. You can capture the output via redirection, or by other means at your disposal, as required.

Here is a Windows command line:

for /L %L in (1 1 6) do @pdffonts -f %L -l %L my1.pdf|find /i "arial"&if not errorlevel 1 echo Arial found on Page %L

Result for my1.pdf

PMGGAE+ArialMT                       CID TrueType      Identity-H       yes yes yes    306  0
Arial found on Page 3
PMGGAE+ArialMT                       CID TrueType      Identity-H       yes yes yes    306  0
Arial found on Page 6

You can run this from a batch file to loop over filenames, but Python is well suited to that itself. One limitation of the single-line version is that you need to know the number of pages in the file before running the query, which requires a variable (easy to set on a separate line, or by querying the page count from Python).

>type found.txt
Arial found on Page 3 of my1.pdf
Arial found on Page 6 of my1.pdf
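
The same loop can be driven from Python, which also removes the need to know the page count in advance. This is only a sketch, assuming the Poppler tools pdffonts and pdfinfo are on the PATH. The function names page_count, font_names and pages_using_font are hypothetical, and the parsing assumes the usual pdffonts layout of two header lines followed by one font per line.

```python
import re
import subprocess


def page_count(pdf_path: str) -> int:
    """Read the page count from the 'Pages:' line printed by pdfinfo."""
    out = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"^Pages:\s+(\d+)", out, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def font_names(pdffonts_output: str) -> list[str]:
    """Extract the name column, skipping the two header lines.

    Assumption: font names contain no spaces, so the first
    whitespace-separated token on each line is the name.
    """
    lines = pdffonts_output.splitlines()[2:]
    return [line.split()[0] for line in lines if line.strip()]


def pages_using_font(pdf_path: str, target: str) -> list[int]:
    """Return the 1-based page numbers whose font list mentions target."""
    pages = []
    for n in range(1, page_count(pdf_path) + 1):
        # -f and -l restrict pdffonts to a single page, as in the command above.
        out = subprocess.run(
            ["pdffonts", "-f", str(n), "-l", str(n), pdf_path],
            capture_output=True, text=True, check=True,
        ).stdout
        if any(target.lower() in name.lower() for name in font_names(out)):
            pages.append(n)
    return pages
```

Calling pages_using_font("my1.pdf", "arial") would then return the list of matching pages, which you can print or write to a file as needed.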