How to find blank pages in a PDF file

I cannot reliably detect blank pages in a PDF file. I have searched the internet but could not find a good solution.

Using iTextSharp, I tried checking the page size and the XObjects, but neither gives an exact result.

I tried

if (xobjects == null || string.IsNullOrEmpty(extractedText) || contentBytes.Length < 20)
    // treat the page as blank
else
    // treat the page as not blank

But most of the time it returns the wrong answer.

The code is below. I am using the iTextSharp library.

For XObjects

PdfDictionary xobjects = resourceDic.GetAsDict(PdfName.XOBJECT);
// Here resourceDic is of type PdfDictionary.
// I know that if xobjects is null then the page is blank, but sometimes a blank page returns a non-null xobjects.

For the content stream

RandomAccessFileOrArray f = reader.SafeFile;
// Here reader = new PdfReader(filename);

byte[] contentBytes = reader.GetPageContent(pageNum, f);
// I measured the size of contentBytes, but sometimes it is more than 20 bytes for a blank page.

For the text content

String extractedText = PdfTextExtractor.GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy());
// Sometimes a blank page yields text more than 20 characters long.

There are 3 best solutions below

7
On

I suspect you have already tried .Trim() on your strings, so I won't suggest that on its own.

What are the actual contents of the 20+ character strings on the blank pages? I suspect they are just newline characters (like what happens when people press Enter 10+ times to get to a new page rather than inserting a page break), in which case:

// string.Replace is an instance method, so chain the calls on the extracted text.
String extractedText = PdfTextExtractor
    .GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy())
    .Replace(Environment.NewLine, "")
    .Replace("\n", "")
    .Trim();

Let us know what the output contains after this.

Another possibility is that the text consists of non-breaking spaces and other characters that look like spaces but are not removed by the replacements above; you would have to find and replace each of these manually. At that point I would instead suggest a regex match for [0-9a-zA-Z] and using the result to decide whether the page is blank.
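A minimal sketch of that regex check, assuming the text has already been extracted into a string. The class and method names (BlankTextCheck, LooksBlank) are hypothetical and not part of iTextSharp; only the standard .NET Regex class is used.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlankTextCheck
{
    // Matches any single ASCII letter or digit.
    private static readonly Regex Alphanumeric = new Regex("[0-9a-zA-Z]");

    // Hypothetical helper: true when the text has no ASCII letter or digit at all.
    public static bool LooksBlank(string extractedText)
    {
        return string.IsNullOrEmpty(extractedText) || !Alphanumeric.IsMatch(extractedText);
    }

    static void Main()
    {
        // Newlines, tabs and a non-breaking space only: treated as blank.
        Console.WriteLine(LooksBlank("\n\n \u00A0\t\r\n")); // True
        // Contains letters and a digit: not blank.
        Console.WriteLine(LooksBlank("\n Page 2 \n"));       // False
    }
}
```

Note that this only looks at extracted text, so a page holding nothing but an image would still be reported as blank, and so would a page written entirely in a non-Latin script. Switching the pattern to [\p{L}\p{N}] would cover letters and digits in other scripts.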

0
On

There is a wrapper library for C# and VB.NET around the MuPDF C library. You could use it to render pages to bitmaps (in different formats such as TIFF, JPEG, or PNG) and check the size of each bitmap.

You should determine, by rendering a page with the least content you still want to count as non-blank, the bitmap size below which you will consider a page blank.

3
On

A very simple way to discover empty pages is this: use a Ghostscript command line that calls the bbox device.

Ghostscript's bbox device calculates the coordinates of the minimum rectangle (the 'bounding box') that encloses all points of the page where a pixel would be rendered:

gs \
  -o /dev/null \
  -sDEVICE=bbox \
   input.pdf

On Windows:

gswin32c.exe ^
  -o nul ^
  -sDEVICE=bbox ^
   input.pdf

Result:

GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 6.
Page 1
%%BoundingBox: 27 281 548 804
%%HiResBoundingBox: 27.000000 281.000000 547.332031 804.000000
Page 2
%%BoundingBox: 0 0 0 0
%%HiResBoundingBox: 0.000000 0.000000 0.000000 0.000000
Page 3
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 4
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 5
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 6
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000

As you can see, page 2 of my input document was empty.
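To use this from C#, the bbox output can be scanned for pages whose bounding box is all zeros. The following is only a sketch: it assumes the Ghostscript output has already been captured into a string (how it is captured is outside this example), that the whole document was processed starting at page 1, and that there is exactly one %%BoundingBox line per page in page order, as in the listing above. The names BboxOutputParser and FindBlankPages are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class BboxOutputParser
{
    private const string Prefix = "%%BoundingBox:";

    // Hypothetical helper: returns the 1-based numbers of pages whose
    // %%BoundingBox line reads "0 0 0 0". Pages are counted by the order
    // in which %%BoundingBox lines appear (assumption: one per page).
    public static List<int> FindBlankPages(string gsOutput)
    {
        var blankPages = new List<int>();
        int pageNumber = 0;
        foreach (string rawLine in gsOutput.Split('\n'))
        {
            string line = rawLine.Trim(); // also strips a trailing '\r'
            if (!line.StartsWith(Prefix, StringComparison.Ordinal))
            {
                continue; // skips banner, "Page N" and %%HiResBoundingBox lines
            }
            pageNumber++;
            string box = line.Substring(Prefix.Length).Trim();
            if (box == "0 0 0 0")
            {
                blankPages.Add(pageNumber);
            }
        }
        return blankPages;
    }

    static void Main()
    {
        // First three pages of the output shown above.
        string sample =
            "Page 1\n" +
            "%%BoundingBox: 27 281 548 804\n" +
            "%%HiResBoundingBox: 27.000000 281.000000 547.332031 804.000000\n" +
            "Page 2\n" +
            "%%BoundingBox: 0 0 0 0\n" +
            "%%HiResBoundingBox: 0.000000 0.000000 0.000000 0.000000\n" +
            "Page 3\n" +
            "%%BoundingBox: 27 302 568 814\n" +
            "%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000\n";

        Console.WriteLine(string.Join(", ", FindBlankPages(sample))); // 2
    }
}
```

Counting %%BoundingBox lines rather than reading the "Page N" lines keeps the sketch working even if only the bounding-box lines end up in the captured text.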