How to use PDFMiner in Python to predictably read values


I've been using PDFMiner to read values off of graphs, and so far it's been working great!

However, there is one area where the data is read correctly but in an unpredictable order: all of the graphs' values come out right, but in a completely different order than they appear on the page.

This would not be a problem in itself: as long as I know that, say, the last graph will always be read first, I can structure my program around that. But PDFMiner seems almost totally unpredictable in the order it reads this data, and I can find no discernible pattern.

This is most probably because I am quite unfamiliar with PDFMiner, so I am not entirely sure how it works. It would be really helpful if someone could point me in the right direction.

Here is my data

And here is the conversion code I'm using:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

print "Getting readable PDF"

# Set up a converter that writes the extracted text into an in-memory buffer.
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

password = ""
maxpages = 0      # 0 means no page limit
caching = True
pagenos = set()   # an empty set means all pages

fp = open("graphExtraction.pdf", 'rb')
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                              caching=caching, check_extractable=True):
    interpreter.process_page(page)
fp.close()
device.close()

# All of the extracted text, in the order PDFMiner emitted it.
values = retstr.getvalue()
retstr.close()

There is 1 best solution below

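Use the bounding box information to follow the flow of your document and work out which text comes first. TextConverter only gives you the text in the order PDFMiner emits it; the layout objects also carry each text box's position on the page.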
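
As a rough sketch of that idea (not a definitive implementation): PDFMiner's PDFPageAggregator returns a layout object per page, and each LTTextBox in it has a bbox of (x0, y0, x1, y1). The helper names sort_reading_order and extract_boxes below are hypothetical, and the ordering rule (top to bottom, then left to right) is an assumption about how the graphs are laid out, so adjust the sort key to match your page.

```python
def sort_reading_order(boxes):
    """Sort (x0, y0, x1, y1, text) tuples top-to-bottom, then left-to-right.

    PDF coordinates start at the bottom-left corner of the page, so a
    larger y1 means the box sits higher up.
    """
    return sorted(boxes, key=lambda b: (-b[3], b[0]))


def extract_boxes(path):
    """Return one position-sorted list of (x0, y0, x1, y1, text) per page."""
    # Imported here so the sorting helper can be used without pdfminer.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pages = []
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            layout = device.get_result()  # layout of the page just processed
            boxes = [tuple(obj.bbox) + (obj.get_text().strip(),)
                     for obj in layout if isinstance(obj, LTTextBox)]
            pages.append(sort_reading_order(boxes))
    device.close()
    return pages
```

Calling extract_boxes("graphExtraction.pdf") should then give the text boxes in a position-based order that does not depend on the order they were written into the file. If boxes in the same row differ slightly in height, rounding y1 to a tolerance before sorting may be needed.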