How to extract font names using PyMuPDF without subsets?


We are using PyMuPDF's Page.get_fonts() function to extract font names from a PDF, but the names come back with subset prefixes. We tried the fitz.TOOLS.set_subset_fontnames() setting in our code. It works for the font names returned by get_text(), but it has no effect on get_fonts().

Here is my sample code:

import fitz
fitz.TOOLS.set_subset_fontnames(False)

file_path = "sample.pdf"
pdf_document = fitz.open(file_path)
for page in pdf_document:
    extracted_fonts = page.get_fonts(full=True)
    print(extracted_fonts)

Here is the output I am getting:

[
  (140, 'ttf', 'TrueType', 'XEAAAC+Arial Bold', 'F3', 'WinAnsiEncoding', 0), 
  (138, 'ttf', 'TrueType', 'XEAAAB+Times New Roman', 'F2', 'WinAnsiEncoding', 0),
  (137, 'ttf', 'TrueType', 'XEAAAA+Arial', 'F1', 'WinAnsiEncoding', 0)
]

I want the font names without the subset prefix, for example "Arial Bold" instead of "XEAAAC+Arial Bold".

jepozdemir On

You can split the font name by the '+' character and then select the last part, which represents the actual font name without the subset prefix:

import fitz

fitz.TOOLS.set_subset_fontnames(False)

file_path = "sample.pdf"
pdf_document = fitz.open(file_path)

for page in pdf_document:
    # With full=True each entry is (xref, ext, type, basefont, name, encoding, referencer).
    extracted_fonts = page.get_fonts(full=True)
    cleaned_fonts = [
        (xref, ext, font_type, basefont.split('+')[-1], ref_name, encoding, referencer)
        for xref, ext, font_type, basefont, ref_name, encoding, referencer in extracted_fonts
    ]
    print(cleaned_fonts)
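
Splitting on '+' also shortens any font name that contains a '+' for some other reason. A slightly stricter variant is sketched below, on the assumption that a subset prefix is always six uppercase letters followed by '+', as in the output shown in the question. The helper names strip_subset_tag and clean_font_entries are made up for this example and are not part of PyMuPDF, and "C++Sans" is a hypothetical font name used only to show the difference. The sample tuples stand in for the result of page.get_fonts(full=True), so the snippet runs without a PDF.

```python
import re

# Assumption: a subset tag is exactly six uppercase letters followed by '+',
# e.g. "XEAAAC+". Anything else is treated as part of the real font name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(basefont: str) -> str:
    """Return the font name with a leading subset tag removed, if present."""
    return _SUBSET_TAG.sub("", basefont)


def clean_font_entries(entries):
    """Strip the subset tag from the basefont field (index 3) of each tuple."""
    return [
        entry[:3] + (strip_subset_tag(entry[3]),) + entry[4:]
        for entry in entries
    ]


# Stand-in for page.get_fonts(full=True); the second name is hypothetical.
sample = [
    (140, "ttf", "TrueType", "XEAAAC+Arial Bold", "F3", "WinAnsiEncoding", 0),
    (137, "ttf", "TrueType", "C++Sans", "F1", "WinAnsiEncoding", 0),
]

cleaned = clean_font_entries(sample)
print(cleaned)
```

In real code you would pass page.get_fonts(full=True) to clean_font_entries() in place of the sample list. The rest of each tuple is left untouched, so the xref and reference name can still be used to look the font up later.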