I am working on identifying forgery/tampering in bank statement PDF documents. Info metadata and XMP metadata are not always present in the PDFs that I have, so I am not able to create any generalized rule to identify tampered PDFs. I am using Python libraries such as PyMuPDF, PDFMiner, PyPDF2, etc.

I have 2 questions:

  1. Is there any concrete way to identify whether a PDF has been tampered with (using Python or any other open-source technology)?
  2. If the PDF has been tampered with, which part of it was changed (using Python or any other open-source technology)?

Attaching 2 PDFs for reference -

original :- "sbi statment_out2.pdf" link - https://drive.google.com/file/d/1DoWAKYcCudRO-Cwjbgf7RjiJUsF3DD3s/view?usp=sharing

Tampered using Sejda online editor :- "sbi statment_out2_Sejda_edited.pdf" link - https://drive.google.com/file/d/1J4eRy9tO3jN8AqEWNrKXtn40G6vdH5G3/view?usp=sharing

In the tampered PDF, I have edited '2,412.00' under the 'Credit' column to '12.00'.

Kindly let me know if there is any open-source solution, preferably in Python.

Thanks.

There are 2 best solutions below

2
On

The canonical way to ensure that a PDF has not been tampered with is to accept only PDFs that carry a digital signature by the originator and to validate that signature, as Frank has already pointed out with a link to an Adobe forum.

Variations thereof could be

  • that the PDF producer shared the hash value of the PDF via a different, secure channel for you to verify, or
  • that the producer of the PDF encrypted the PDF with a private key known only to them, and you decrypt it using the matching key, which would probably be public.

Such cryptographic methods are reasonably secure if implemented correctly.
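The hash variant is the easiest one to check on the receiving side. A minimal sketch, assuming the producer published a SHA-256 hex digest of the exact file bytes over a trusted channel (the choice of SHA-256 and the function names are assumptions for illustration, not something the example files provide):

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """True if the file's digest equals the hash the producer published."""
    # Normalise case and surrounding whitespace before comparing.
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

Any change to the file, even re-saving it without visible edits, changes the digest, so this only tells you that the bytes differ from what was published, not where.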


Unfortunately these secure methods require that the producer of the PDF cooperates accordingly when publishing the PDFs.

If the producer does not cooperate and simply publishes PDFs without such a cryptographic protection, you can still compare internal details of PDFs which should be created similarly. If such internal details differ considerably, either someone amateurishly tampered with the PDF or the PDF producer updated or switched the PDF production software.

In case of your example files there are numerous differences in such details, e.g.

  • The original claims to conform to PDF-1.4, the manipulated copy to PDF-1.5.
  • The original uses cross reference tables for the PDF objects, the copy cross reference streams.
  • The original and the copy have different Producer entries, 'iText 2.0.4 (by lowagie.com)' compared to 'SAMBox 2.2.12'.
  • The original has a ModDate entry on the claimed date of the document, the copy has one long thereafter.
  • The ID parts in the original PDF differ from each other, the ID parts in the copy coincide.
  • The copy has a typical page content stream structure for independently added content, the original does not.
  • The copy has one text object without text (the remainder of the original text object drawing the later removed number), the original does not.
  • The original only uses grayscale colors for the numbers in the table, the copy also uses RGB colors.
  • ...
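Two of the content-level traces in this list (RGB color operators, and a text object that draws no text) can be looked for once you have the decoded page content stream as bytes, for instance from PyMuPDF's `Page.read_contents()`. The following is only a rough sketch: it scans with regular expressions instead of properly tokenizing the stream, so operator-like sequences inside string literals can produce false hits, and the function name and result keys are made up for illustration.

```python
import re

# Operators are treated as whitespace-delimited tokens (an approximation).
RGB_OPS = re.compile(rb"(?<!\S)(?:rg|RG)(?!\S)")
GRAY_OPS = re.compile(rb"(?<!\S)(?:g|G)(?!\S)")
TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S)(.*?)(?<!\S)ET(?!\S)", re.DOTALL)
# Text-showing operators may directly follow a closing ), ] or >.
SHOW_TEXT = re.compile(rb"(?<![^\s\)\]>])(?:Tj|TJ|'|\")(?!\S)")


def content_stream_traits(stream: bytes) -> dict:
    """Count color operators and text objects in a decoded content stream."""
    text_objects = TEXT_OBJECT.findall(stream)
    empty = sum(1 for body in text_objects if not SHOW_TEXT.search(body))
    return {
        "rgb_color_ops": len(RGB_OPS.findall(stream)),
        "gray_color_ops": len(GRAY_OPS.findall(stream)),
        "text_objects": len(text_objects),
        "empty_text_objects": empty,
    }
```

A statement whose sibling documents never show RGB operators or empty text objects, but which suddenly does, deserves a closer look. The counts by themselves prove nothing.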

Surely you can use Python PDF libraries to check for such details and determine divergences.
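As an illustration of the file-level checks (header version, cross reference style, Producer, ID parts), here is a sketch that reads the raw bytes with the standard library only. It is deliberately crude and rests on several assumptions: the Producer is an unencrypted literal string with at most one level of nested parentheses, the ID parts are hex strings, and the relevant dictionaries are not hidden inside compressed object streams. Where those assumptions fail the corresponding value is simply `None`, and a real PDF library would be the better tool. The function names are hypothetical.

```python
import re


def raw_pdf_traits(data: bytes) -> dict:
    """Extract a few structural traits from the raw bytes of a PDF file."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # Literal string, allowing escapes and one level of balanced parentheses.
    producer = re.search(
        rb"/Producer\s*\(((?:\\.|[^\\()]|\([^()]*\))*)\)", data
    )
    ids = re.search(
        rb"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>\s*\]", data
    )
    return {
        "version": header.group(1).decode() if header else None,
        "uses_xref_stream": re.search(rb"/Type\s*/XRef\b", data) is not None,
        # A line starting with "xref"; "startxref" does not count.
        "uses_xref_table": re.search(rb"(?:\A|[\r\n])xref\s", data) is not None,
        "producer": producer.group(1).decode("latin-1") if producer else None,
        "id_parts_equal": (
            ids.group(1).lower() == ids.group(2).lower() if ids else None
        ),
    }


def differing_traits(a: bytes, b: bytes) -> dict:
    """Map each trait that differs between two files to its (a, b) values."""
    ta, tb = raw_pdf_traits(a), raw_pdf_traits(b)
    return {key: (ta[key], tb[key]) for key in ta if ta[key] != tb[key]}
```

Typical use would be to build the traits of a statement known to be genuine from the same bank and period, then call `differing_traits(open(known_good, "rb").read(), open(suspect, "rb").read())` and review whatever comes back.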

But beware, this way you will only catch dilettante forgers. Forgers who know their business will leave hardly any such traces in their outputs...

0
On

Adobe says that there is no way of detecting whether a PDF has been modified unless it is signed.

https://community.adobe.com/t5/acrobat-reader/how-to-detect-a-modified-pdf-file/td-p/3546278