I've been using different Python packages to parse PDFs, but I'm wondering if it's possible to measure the margins of a particular line in the document. Ideally the measurement would be in pixels, CSS-style, if possible.
It doesn't need to be that precise; I just need to figure out whether a line is left-aligned, centered, or right-aligned based on its margins, measured from left to right.
Example:
# margin <= x
left-aligned
# margin >= y && margin <= z
center-aligned
# margin >= z
right-aligned
Obviously this is just an example, but the margins will not vary much; the PDFs I'm parsing will likely have (in CSS terms):
margin-left: 0
margin-left: x
margin-left: y
The actual values of x and y are unimportant; the important thing is that they'll be consistent.
Sorry if this is confusing; the main thing I'm asking for is clarification or help in figuring out the left margin for every line in a PDF.
Disclaimer: I am the author of borb, the library used in this answer.
You can use SimpleLineOfTextExtraction in borb, which returns the lines of text in a PDF. You can check out this class here.
Each line has a content box (and a layout box), which can give you information about the location of that particular line of text.
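As a rough sketch of how such a box maps to the CSS-style measurement from the question: PDF coordinates are in points (1/72 inch by default), while a CSS pixel is 1/96 inch, so a left margin in pixels is the box's x coordinate scaled by 96/72. The `Box` tuple and the function names below are hypothetical stand-ins for illustration, not borb classes; you would fill them in from whatever coordinates the library reports.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a line's bounding box, in PDF points."""
    x: float       # distance of the left edge from the left side of the page
    y: float       # distance of the bottom edge from the bottom of the page
    width: float
    height: float


def pt_to_css_px(points: float) -> float:
    """Convert PDF points (1/72 inch) to CSS pixels (1/96 inch)."""
    return points * 96.0 / 72.0


def left_margin_px(box: Box) -> float:
    """Left margin of a line, in CSS pixels."""
    return pt_to_css_px(box.x)


def right_margin_px(box: Box, page_width_pt: float) -> float:
    """Right margin of a line, in CSS pixels, given the page width in points."""
    return pt_to_css_px(page_width_pt - (box.x + box.width))
```

For example, a line starting one inch (72 points) from the left edge of the page has a left margin of 96 CSS pixels. This assumes the page uses the default unit size and is not rotated or scaled.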
You can use this to determine whether a line is left/right/middle aligned by comparing it to lines above/below it.
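A minimal sketch of that comparison, independent of any particular library: it takes each line as a plain `(x, width)` pair, treats the leftmost and rightmost edges across all lines as the text block, and labels each line by which of its edges (or its center) lines up with the block. The function name, the tuple layout, and the `tolerance` value are assumptions made for illustration.

```python
from typing import List, Tuple


def classify_alignment(
    lines: List[Tuple[float, float]], tolerance: float = 2.0
) -> List[str]:
    """Label each (x, width) line as 'left', 'right', 'center' or 'unknown'.

    All values must be in the same unit (e.g. PDF points). A line that spans
    the full block width matches both edges and is reported as 'left'.
    """
    if not lines:
        return []

    # The text block is bounded by the extreme edges seen across all lines.
    block_left = min(x for x, _ in lines)
    block_right = max(x + w for x, w in lines)
    block_center = (block_left + block_right) / 2.0

    labels = []
    for x, w in lines:
        if abs(x - block_left) <= tolerance:
            labels.append("left")
        elif abs((x + w) - block_right) <= tolerance:
            labels.append("right")
        elif abs((x + w / 2.0) - block_center) <= tolerance:
            labels.append("center")
        else:
            labels.append("unknown")  # e.g. an indented line
    return labels
```

Comparing against the extents of the surrounding lines rather than the page edges means the page margins themselves don't need to be known, which fits the case where the margin values are consistent but not known in advance.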
You can find an example of how to use this class here.
Essentially you open a document using the PDF.loads method, passing along an EventListener.