Non-Deterministic behavior in PDF library when accessing Django model in between

63 Views Asked by At

A Django 4.2 on Python 3.10 application misbehaves sometimes, in a place that should be stateless. If I access the database (postgresql 14 via pcygopg 3.1.18) while working with pypdf 3.17.4, the output document is broken.. in roughly 1 out of 8 attempts. How can I determine the cause of the erratic behavior?

The resulting PDF has missing content, so comparing output size is sufficient to determine the bug was triggered:

# call as: python3 manage.py minrepro input.pdf
import argparse, io
from django.core.management.base import BaseCommand
from pypdf import PdfReader, PdfWriter
from djangoapp.models import DjangoModel


def main(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=200, height=200)

    for i in range(8):
        # Note: accessing the database *during* PDF merging is relevant!
        # without the next line, the problem cannot be reproduced
        for c in range(31): a = DjangoModel.objects.first()

        fin.seek(0)
        for pagein in PdfReader(fin, strict=True).pages:
            pageout.merge_page(pagein)

    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument(dest="pdf", type=argparse.FileType("rb") )
    def handle(self, *args, **options):
        for i in range(30):
            if i == 0: first_size = main(options["pdf"])
            current_size = main(options["pdf"])
            if not first_size == current_size:
                print(f"presumed stateless call was not {i=}, {first_size=} != {current_size=}")

  • My fist theory was pypdf failing to generate truly unique names for (post-merge) duplicate identifiers, and the DB access merely giving the opportunity (delay) for that to become noticeable. I looked for any time or random imports in the relevant libraries, no obvious bugs there. Also, nothing like time.sleep(0.02) could not make my reproducer work, only database access.
  • This is CPython with insert-ordered dicts, most code should behave the same on identical input. Also called python with -W module::ResourceWarning -W module::DeprecationWarning -W module::PendingDeprecationWarning to be more likely notified of any mistakes on my part, nothing.
  • Have reproduced the behavior in a single-threaded management command, it is therefore not related to the any http/web server issue.
1

There are 1 best solutions below

1
user On

Root cause identified: Dropping and recreating PdfReader in the loop objects has a low probability for pypdf.PdfWriter.merge to recycle into already used identifier mappings.

The trigger appears to be garbage collection in my loop, with psycopg3 indeed just providing the opportunity for that to happen, not specifically causing it.

The relevant documentation for resetting the association between source and destination PDF object has lead me to the previously missed detail that pypdf assumes, but never checks that PdfReader objects used during merging will never recycle python id().

Two objects with non-overlapping lifetimes may have the same id() value. -- Python id(object)

That has happened in my code, as easily visible by printing id(page.pdf).

Keeping a reference to the PdfReader object until the last input is merged (or simply until the output is written) also stops my reproducer from working, like this:

store_for_pypdf = []
for loop:
    ...
    reader = PdfReader(fin, strict=True)
    store_for_pypdf.append(reader)
    ... # use reader
...
pdfout.write(fout)
del store_for_pypdf