How to get PDF file metadata 'Page Size' using Python?


I am trying to use the PyPDF2 module in Python 3, but I can't find a way to display the 'Page Size' property. I would like to know what the dimensions of the sheet of paper were before it was scanned to a PDF file.

Something like this:

import PyPDF2
pdf = PyPDF2.PdfFileReader("sample.pdf")
print(pdf.getNumPages())

But I'm looking for another Python function instead of, for example, getNumPages()...

This command below prints some kind of metadata but without page size:

pdf_info=pdf.getDocumentInfo()
print(pdf_info)

There are 3 best solutions below

BEST ANSWER

This code should help you:

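import PyPDF2

# PdfFileReader, getPage and getWidth/getHeight belong to the legacy camelCase PyPDF2 API
pdf = PyPDF2.PdfFileReader("a.pdf")
p = pdf.getPage(0)  # pages are zero-indexed, so 0 is the first page

w_in_user_space_units = p.mediaBox.getWidth()
h_in_user_space_units = p.mediaBox.getHeight()

# 1 user space unit is 1/72 inch by default
# 1/72 inch ~ 0.353 millimeters

mm_per_unit = 25.4 / 72
w = float(w_in_user_space_units) * mm_per_unit
h = float(h_in_user_space_units) * mm_per_unit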

Getting the "sheet of paper dimensions were before scanning to PDF file" is not really possible, since a scanner is set to an output media size without the scanned media being known.

Take, for example:

  • A letter sheet of paper placed on an A4 scanner bed, or vice versa. The trace of the paper edge may or may not be visible in the output. The scanner simply works blind to the "source media", and a document of mixed rotations may need post-processing to rescale some pages or rotate them upright.

  • Another example is using a mobile phone to scan a docket. The source can be any size, but the user's software determines the stored media size and rotation when the page is saved: A5, A4, A3, whatever, portrait or landscape.

Thus all you can ask of a PDF is the stored page size and display resolution, which often vary between pages, and it cannot confirm the source rotation.

For a simple list of stored page sizes, there are several command-line utilities that can list page variations.

Shell out to a one-line command tool such as xpdf or poppler pdfinfo, which can read all the different types of PDF, and then parse that output. The output is similar for both, with many lines, but for your need:

xpdf\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4) (rotated 0 degrees)
and
poppler\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4)

When scanning, it is common to get size variation across the pages:

Page    2 size: 595 x 842 pts (A4) (rotated 0 degrees)
Page    3 size: 595.32 x 841.92 pts (A4) (rotated 0 degrees)
Page    4 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    5 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    6 size: 595.2 x 841.9 pts (A4) (rotated 0 degrees)
Page    7 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    8 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    9 size: 595.2 x 841.44 pts (rotated 0 degrees)
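
The "shell out and parse" approach above might look roughly like this in Python. It is only a sketch: it assumes a pdfinfo executable (poppler or xpdf) is on the PATH and that its -f/-l options select the page range, and the helper names parse_page_sizes and pdfinfo_page_sizes, as well as the regular expression, are illustrative and written against the sample lines shown above rather than a documented output format.

```python
import re
import subprocess

# Matches lines such as
#   "Page size: 594.96 x 841.92 pts (A4)"
#   "Page    2 size: 595 x 842 pts (A4) (rotated 0 degrees)"
PAGE_SIZE_RE = re.compile(
    r"^Page\s+(?:(\d+)\s+)?size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts",
    re.MULTILINE,
)


def parse_page_sizes(pdfinfo_output):
    """Return a list of (page_number_or_None, width_pt, height_pt) tuples."""
    sizes = []
    for number, width, height in PAGE_SIZE_RE.findall(pdfinfo_output):
        page_number = int(number) if number else None
        sizes.append((page_number, float(width), float(height)))
    return sizes


def pdfinfo_page_sizes(path, first=1, last=1):
    """Run pdfinfo (assumed to be installed and on PATH) and parse its output."""
    result = subprocess.run(
        ["pdfinfo", "-f", str(first), "-l", str(last), path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_sizes(result.stdout)
```

Calling pdfinfo_page_sizes("scan.pdf", first=1, last=9) (with "scan.pdf" as a placeholder file name) should then give one tuple per reported page, with the sizes still in points.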

Here's a more up-to-date flavor using pypdf:

from pypdf import PdfReader

pdf = PdfReader("a.pdf")
page = pdf.pages[0]  # pages are zero-indexed, so 0 is the first page

cm_per_inch = 2.54
points_per_inch = 72  # 1 user space unit is 1/72 inch by default

width_in_user_space_units = page.mediabox.width
height_in_user_space_units = page.mediabox.height

width_in_cm = float(width_in_user_space_units) / points_per_inch * cm_per_inch
height_in_cm = float(height_in_user_space_units) / points_per_inch * cm_per_inch
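
Since scanned pages often vary slightly in size, it can help to loop over every page and match each one to a named paper size within a tolerance. The following is a rough sketch rather than a definitive implementation: the names PAPER_SIZES_MM, guess_paper_name and describe_pages, the short list of sizes and the 2 mm tolerance are all assumptions made for illustration, and it assumes the default user unit of 1/72 inch.

```python
MM_PER_POINT = 25.4 / 72  # 1 PDF point (default user space unit) is 1/72 inch

# A few common sheet sizes in millimetres, as (short side, long side).
PAPER_SIZES_MM = {
    "A5": (148.0, 210.0),
    "A4": (210.0, 297.0),
    "A3": (297.0, 420.0),
    "Letter": (215.9, 279.4),
}


def points_to_mm(points):
    """Convert a length in PDF points to millimetres."""
    return float(points) * MM_PER_POINT


def guess_paper_name(width_pt, height_pt, tolerance_mm=2.0):
    """Return (name, orientation) for the first listed size within the
    tolerance, or None if nothing matches."""
    width_mm = points_to_mm(width_pt)
    height_mm = points_to_mm(height_pt)
    orientation = "portrait" if width_mm <= height_mm else "landscape"
    short_side, long_side = sorted((width_mm, height_mm))
    for name, (paper_short, paper_long) in PAPER_SIZES_MM.items():
        if (abs(short_side - paper_short) <= tolerance_mm
                and abs(long_side - paper_long) <= tolerance_mm):
            return name, orientation
    return None


def describe_pages(path):
    """Return (page number, width_pt, height_pt, rotation, guess) per page."""
    from pypdf import PdfReader  # imported lazily; needs pypdf installed

    rows = []
    for number, page in enumerate(PdfReader(path).pages, start=1):
        box = page.mediabox
        rows.append((number, float(box.width), float(box.height),
                     page.rotation, guess_paper_name(box.width, box.height)))
    return rows
```

Note that this reports only the stored page size and the stored rotation value; it cannot recover the size of the sheet that was originally placed on the scanner.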