I am currently using PDFBox 1.8 to analyze PDF documents. Below is a heavily stripped-down example of what I am doing.
    import java.util.List;
    import java.io.IOException;
    import javax.swing.JFileChooser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;

    public class Main
    {
        private static PDDocument reader;

        public static void main(String[] args)
        {
            JFileChooser chooser = new JFileChooser();
            int result = chooser.showOpenDialog(null);
            if(result == JFileChooser.APPROVE_OPTION)
            {
                try
                {
                    reader = PDDocument.load(chooser.getSelectedFile());
                    for(int pagenum = 1; pagenum <= reader.getNumberOfPages(); pagenum++)
                    {
                        System.out.println("===== Page:" + pagenum + " ======");
                        System.out.println(extract(pagenum));
                    }
                }
                catch(Exception e) { e.printStackTrace(); }
            }
        }

        public static String extract(int pagenum) throws IOException
        {
            List allPages = reader.getDocumentCatalog().getAllPages();
            PDPage page = (PDPage) allPages.get(pagenum-1);
            PDStream contents = page.getContents();
            CustomPDFTextStripper stripper = new CustomPDFTextStripper();
            if (contents != null)
            {
                stripper.processStream(page, page.findResources(), page.getContents().getStream());
            }
            return stripper.getContents();
        }
    }
and
    import org.apache.pdfbox.util.PDFTextStripper;
    import java.io.IOException;
    import org.apache.pdfbox.util.TextPosition;

    public class CustomPDFTextStripper extends PDFTextStripper
    {
        private final StringBuilder builder;
        private float lastBase;

        public CustomPDFTextStripper() throws IOException
        {
            super.setSortByPosition(true);
            builder = new StringBuilder();
            lastBase = Float.MAX_VALUE;
        }

        public String getContents() { return builder.toString(); }

        @Override
        protected void processTextPosition(TextPosition textPos)
        {
            float ascent = textPos.getY();
            if(ascent > lastBase)
                builder.append("\n");
            lastBase = textPos.getY() + textPos.getHeight();
            builder.append(textPos.getCharacter());
            // I want to be able to do stuff here and
            // I need to read spaces and newline characters
        }
    }
I can't seem to find an equivalent solution in the PDFBox 2.0 snapshot (I know it is unstable and has not been released yet). I tried something like:
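    CustomPDFTextStripper stripper = new CustomPDFTextStripper();
    StringWriter dummy = new StringWriter();
    // setStartPage/setEndPage select the page range (1-based page numbers);
    // setPageStart/setPageEnd only set the text written at page boundaries
    stripper.setStartPage(pagenum);
    stripper.setEndPage(pagenum);
    stripper.writeText(reader, dummy);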
but it does not process spaces or give accurate textPos data in the processTextPosition method.
Any ideas on how to get the same TextPosition data in 2.0 as in 1.8?
========== EDIT 26JUN2015 8:00 PM CST ===========
OK, I have had some time to look at it and found the problem: getWidthOfSpace() returns dramatically different results in 1.8 and 2.0.
In 1.8 it is around 2.49, while character widths are around 5.
In 2.0 it is around 27.5, while character widths are still around 5.
Obviously 27.5 is wrong in 2.0.
Just run the following test and you will see:
    @Override
    protected void processTextPosition(TextPosition textPos)
    {
        float spaceWidth = textPos.getWidthOfSpace();
        float width = textPos.getWidth();
        System.out.println(textPos.getCharacter() + " - Width of Space=" + spaceWidth + " - width=" + width);
        builder.append(textPos.getCharacter());
    }
(Of course, use getUnicode() instead of getCharacter() in 2.0.)
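To illustrate why an inflated space width would make spaces disappear, here is a simplified sketch. It does not use PDFBox at all: the Glyph class is a hypothetical stand-in for the few TextPosition values involved, and the "gap larger than half the space width" rule is an assumed heuristic for illustration, not necessarily the one PDFBox applies.

```java
import java.util.Arrays;
import java.util.List;

public class SpaceGapDemo {
    /** Hypothetical stand-in for the TextPosition values used in this sketch. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;
        final float widthOfSpace;

        Glyph(String text, float x, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** Joins glyphs, adding a space when the gap to the previous glyph exceeds half the reported space width. */
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > 0.5f * g.widthOfSpace) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    /** Two words, characters 5 units wide, with a 2.5 unit gap between the words. */
    static List<Glyph> sample(float widthOfSpace) {
        return Arrays.asList(
            new Glyph("H", 0f, 5f, widthOfSpace),
            new Glyph("i", 5f, 5f, widthOfSpace),
            new Glyph("y", 12.5f, 5f, widthOfSpace),
            new Glyph("o", 17.5f, 5f, widthOfSpace));
    }

    public static void main(String[] args) {
        System.out.println(join(sample(2.49f)));  // space width as reported by 1.8
        System.out.println(join(sample(27.5f)));  // space width as reported by 2.0
    }
}
```

With the 1.8 value the 2.5 unit gap counts as a word break; with the 2.0 value the threshold is far larger than any gap on the line, so under this assumed rule no space would ever be inserted.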
===== EDIT 27JUN2015 8:00 PM CST ======
Here is a link to the PDF used in the test: Hello World
There indeed is an error in the current calculation of the width of space.
PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector) currently (it's a SNAPSHOT, the situation may differ this evening) calculates the width like this: (PDFTextStreamEngine.java in revision 1688116)
but the textRenderingMatrix has been calculated in PDFStreamEngine.showText(byte[]) using: (PDFStreamEngine.java in revision 1688116)
Thus, both the font size and the horizontal scaling are multiplied twice into the space width. Furthermore, the current transformation matrix is both fully multiplied into textRenderingMatrix and partially used as ctm.getScalingFactorX(); this can lead to most interesting combined results. Most likely it should suffice to remove these values as explicit factors from the spaceWidthDisplay calculation in PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector).
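The effect of counting a factor twice can be sketched with plain arithmetic. This is not the PDFBox code; the method names are hypothetical, the text matrix and current transformation matrix are assumed to be the identity, and the font size of 11 and the text space width of 0.2264 are assumptions picked only because they roughly reproduce the figures reported in the question.

```java
public class SpaceWidthDoubleCount {
    /** Space width with font size and horizontal scaling applied once. */
    static float appliedOnce(float spaceWidthText, float fontSize, float horizontalScaling) {
        return spaceWidthText * fontSize * horizontalScaling;
    }

    /**
     * Space width where the explicit factors are kept and the result is scaled again
     * by a matrix x-scale that already contains fontSize * horizontalScaling
     * (identity text matrix and CTM assumed).
     */
    static float appliedTwice(float spaceWidthText, float fontSize, float horizontalScaling) {
        float matrixScaleX = fontSize * horizontalScaling;
        return spaceWidthText * fontSize * horizontalScaling * matrixScaleX;
    }

    public static void main(String[] args) {
        float spaceWidthText = 0.2264f;  // assumed width of the space glyph in text space
        float fontSize = 11f;            // assumed font size
        float horizontalScaling = 1f;    // assumed: no horizontal scaling

        System.out.println("once:  " + appliedOnce(spaceWidthText, fontSize, horizontalScaling));
        System.out.println("twice: " + appliedTwice(spaceWidthText, fontSize, horizontalScaling));
    }
}
```

Under these assumptions the single application gives about 2.49 and the double application about 27.4, the same order of discrepancy as the 2.49 versus 27.5 observed in the question.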
In version 1.8.9 the text space width is calculated like this in PDFStreamEngine.processEncodedText(byte[]):
This can give rise to funny results, too, for interesting current transformation and text matrices, but the factors of interest above were not multiplied twice into the result.