How to delete all images from a PDF without corrupting it using CAM::PDF?

The script below is able to remove all images from a PDF file using CAM::PDF. The output, however, is corrupt. PDF readers are nonetheless able to open it, but they complain about errors. For instance, mupdf says:

error: no XObject subtype specified
error: cannot draw xobject/image
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

Now, the CAM::PDF page on CPAN (here) lists the deleteObject() method under "Deeper utilities", presumably meaning that it is not intended for public use. Moreover, it warns that:

This function does NOT take care of dependencies on this object.

My question is: what is the right way to remove objects from a PDF file using CAM::PDF? If the issue has to do with dependencies, how can I remove an object while taking care of its dependencies?

For how to remove images from a PDF using other tools, see a related question here.

use CAM::PDF;
my $pdf = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      $pdf->deleteObject ( $objnum );
    }
  }
}

$pdf->cleanoutput ( '-' );

There are 2 best solutions below

This uses CAM::PDF, but takes a slightly different approach. Rather than attempting to delete the images, which is pretty hard, it replaces each image with a transparent image.

First, note that we can use ImageMagick to generate a PDF that contains nothing but a transparent image:

% convert  -size 200x100 xc:none transparent.pdf

If we view the generated PDF in a text editor, we can find the main image object:

8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
...

The important thing to note here is that we have generated a transparent image as object number 8.

It then becomes a matter of importing this object and using it to replace each of the real images in the PDF, effectively blanking them.

use warnings; use strict;
use CAM::PDF;
my $pdf = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

my $trans_pdf = CAM::PDF->new("transparent.pdf") || die "$CAM::PDF::errstr\n";
my $trans_objnum = 8; # object number of transparent image

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    ) {
        $pdf->replaceObject ( $objnum, $trans_pdf, $trans_objnum, 1 );
    }
  }
}

$pdf->cleanoutput ( '-' );

The script now replaces each image in the PDF with the imported transparent image object (object number 8 from transparent.pdf).
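
The hard-coded object number 8 only holds for this particular transparent.pdf; a file produced by another tool or version may number its objects differently. Below is a minimal sketch of looking the number up instead, using the same test as the loop above. Note that image_object_numbers is a hypothetical helper, not part of CAM::PDF, and the sketch assumes that the first image XObject found is the one you want. If the file contains more than one (a soft-mask image, say), check which is which.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF: return the object numbers of
# all image XObjects in $pdf, in ascending order.
sub image_object_numbers {
    my ($pdf) = @_;
    my @found;
    foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        my $obj = $pdf->dereference($objnum) or next;
        next unless $obj->{value}->{type} eq 'dictionary';
        my $dict = $obj->{value}->{value};
        next unless defined $dict->{Type} and defined $dict->{Subtype};
        next unless $pdf->getValue( $dict->{Type} )    eq 'XObject'
                and $pdf->getValue( $dict->{Subtype} ) eq 'Image';
        push @found, $objnum;
    }
    return @found;
}

# Usage: perl find_image.pl transparent.pdf
# (find_image.pl is just an assumed name for this script.)
if (@ARGV) {
    require CAM::PDF;
    my $file      = shift @ARGV;
    my $trans_pdf = CAM::PDF->new($file) or die "cannot open $file\n";
    my ($trans_objnum) = image_object_numbers($trans_pdf)
        or die "no image XObject found in $file\n";
    print "transparent image is object $trans_objnum\n";
}
```

In the script above, the result would take the place of the literal 8 assigned to $trans_objnum.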

Another approach, which really deletes the images, is:

  1. find and delete image XObjects in resource lists,
  2. keep an array with names of deleted resources,
  3. substitute same-length whitespace for the corresponding Do operators in each page content,
  4. clean up and print.

Note that dwarring's approach is safer, though, as it doesn't have to call $doc->cleanse at the end. The CAM::PDF documentation (here) describes the cleanse method as follows:

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

I don't know how much of a problem using cleanse can be.

use CAM::PDF;
my $doc = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

# delete image XObjects among resources
# but keep their names

my @names;

foreach my $objnum ( sort { $a <=> $b } keys %{ $doc->{xref} } ) {
  my $obj = $doc->dereference( $objnum );
  next unless $obj->{value}->{type} eq 'dictionary';

  my $n = $obj->{value}->{value};

  my $resources = $doc->getValue ( $n->{Resources}       ) or next;
  my $resource  = $doc->getValue ( $resources->{XObject} ) or next;

  foreach my $name ( sort keys %$resource ) {
    my $im = $doc->getValue ( $resource->{$name} ) or next;

    next unless defined $im->{Type}
            and defined $im->{Subtype}
            and $doc->getValue ( $im->{Type}    ) eq 'XObject'
            and $doc->getValue ( $im->{Subtype} ) eq 'Image';

    delete $resource->{$name};
    push @names, $name;
  }
}


# delete the corresponding Do operators

if ( @names ) {
  foreach my $p ( 1 .. $doc->numPages ) {
    my $content = $doc->getPageContent ( $p );
    my $s;
    foreach my $name ( @names ) {
      ++$s if $content =~ s{( / \Q$name\E \s+ Do \b )} { ' ' x length $1 }xeg;
    }
    $doc->setPageContent ( $p, $content ) if $s;
  }
}

$doc->cleanse;
$doc->cleanoutput;
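
The Do-removal step can be tried in isolation, without a PDF at hand. Below is a minimal sketch that applies the same substitution to a made-up content stream. Here blank_do_operators is a hypothetical helper name, and the sample string is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite every "/Name Do" operator for the given
# names with spaces of the same length, so the stream length is unchanged.
# Returns the new content and the number of operators blanked.
sub blank_do_operators {
    my ( $content, @names ) = @_;
    my $count = 0;
    foreach my $name (@names) {
        $count += ( $content =~ s{( / \Q$name\E \s+ Do \b )}{ ' ' x length $1 }xeg ) || 0;
    }
    return ( $content, $count );
}

# Invented content stream: two images (Im0, Im10) and a form XObject (Fm0).
my $sample = "q 200 0 0 100 0 0 cm /Im0 Do Q\n"
           . "q 50 0 0 50 0 0 cm /Im10 Do Q\n"
           . "/Fm0 Do\n";

my ( $blanked, $count ) = blank_do_operators( $sample, 'Im0' );

print $blanked;
print "operators blanked: $count\n";
```

Because the pattern requires whitespace right after the name, blanking Im0 leaves /Im10 Do alone. One caveat: resource names are scoped to each page's resource dictionary, so the same name can refer to a different XObject on another page. Both this sketch and the script above ignore that and treat the collected names as global.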