How to read from PDF using Selenium webdriver and Java

I am trying to read the contents of a PDF file using Java-Selenium. Below is my code. getWebDriver is a custom method in the framework. It returns the webdriver.

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();

String output = new PDFTextStripper().getText(parser.getPDDocument());

The line that constructs the PDFParser gives a compile-time error if I don't cast the stream to the RandomAccessRead type.

And when I cast it, I get this runtime error:

java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

I need help with getting rid of these errors.

There is 1 solution below

First off, unless you want to intervene in the PDF loading process, there is no need to use the PDFParser class explicitly. You can instead use the static PDDocument.load method:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDDocument document = PDDocument.load(fileToParse);

String output = new PDFTextStripper().getText(document);
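
The snippet above assumes that the URL returned by getCurrentUrl() serves raw PDF bytes to a plain Java connection. That may not hold, for example when the download depends on the browser session's cookies, in which case the stream would contain an HTML page instead. As a rough sketch, the first bytes can be checked for the %PDF- marker before parsing. The names PdfHeaderCheck and looksLikePdf are made up for illustration and are not part of PDFBox or Selenium.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfHeaderCheck {
    // A PDF file normally starts with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Hypothetical helper: peeks at the start of the stream and rewinds it,
     * so the stream can still be handed to a PDF parser afterwards.
     * The stream must support mark/reset (BufferedInputStream does).
     */
    public static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        byte[] head = new byte[MAGIC.length];
        int total = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return total == MAGIC.length && Arrays.equals(head, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(sample));
        System.out.println(looksLikePdf(in));
    }
}
```

Calling looksLikePdf(fileToParse) before PDDocument.load(fileToParse) leaves the stream positioned at its start, because the check relies on mark/reset rather than consuming the bytes.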

If you do want to intervene in the loading process, you have to create a RandomAccessRead instance from your BufferedInputStream. You cannot simply cast it, because the two types are not related.

You can do that like this:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
ScratchFile scratchFile = new ScratchFile(memUsageSetting);
PDFParser parser;
try
{
    RandomAccessRead source = scratchFile.createBuffer(fileToParse);
    parser = new PDFParser(source);
    parser.parse();
}
catch (IOException ioe)
{
    IOUtils.closeQuietly(scratchFile);
    throw ioe;
}

String output = new PDFTextStripper().getText(parser.getPDDocument());

(This essentially is copied and pasted from the source of PDDocument.load.)
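
To illustrate why the cast in the question compiles but fails at run time, here is a small sketch that uses only the JDK. RandomAccessLike is a hypothetical stand-in for PDFBox's RandomAccessRead interface, and UnrelatedCastDemo, castFails and wrap are likewise invented names. Java permits a cast from a non-final class to any interface, because some subclass could implement that interface, so the mismatch only surfaces as a ClassCastException when the code runs. The fix is to build an object that really implements the interface, which is the role scratchFile.createBuffer plays above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class UnrelatedCastDemo {
    /** Hypothetical stand-in for an interface such as RandomAccessRead. */
    public interface RandomAccessLike {
        int read();
    }

    /** Returns true if casting the stream to the unrelated interface throws. */
    public static boolean castFails(InputStream in) {
        try {
            // Compiles, because InputStream is not final and the target is an interface.
            Object ignored = (RandomAccessLike) in;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    /** Adapter: creates a real RandomAccessLike backed by the given bytes. */
    public static RandomAccessLike wrap(byte[] data) {
        ByteArrayInputStream buffer = new ByteArrayInputStream(data);
        return buffer::read; // ByteArrayInputStream.read() declares no checked exception
    }

    public static void main(String[] args) {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[] {37}));
        System.out.println("cast fails: " + castFails(in));
        System.out.println("first byte via adapter: " + wrap(new byte[] {37}).read());
    }
}
```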