Can an unstructured (untagged) PDF be tagged using any tools or libraries? The only ways I have found to tag a PDF are Adobe Acrobat and the Auto-Tag APIs (not something I am looking forward to, and the results are not so great imo).
I already know the bounding boxes and semantics of the elements (i.e. paragraphs, lists, headings, tables).
So, is there a way to manipulate PDF trees/objects, preferably in Python or JavaScript?
Any thoughts on the topic are appreciated!
The PDF spec talks about "StructTreeRoot" for Tagged PDFs. Building these objects by hand would be nerve-racking, so is there any high-level library to manipulate them?
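As a rough illustration of what that involves, here is a minimal sketch in plain Python that turns known bounding boxes and semantics into dictionaries shaped like `/StructElem` entries. The `Element` class, the role names, and the single-column reading order are assumptions made for the example, not part of any library. A real `/Pg` entry is an indirect reference to a page object rather than a page index, and lists and tables need nested children (`LI`/`LBody`, `TR`/`TD`) that are left out here.

```python
from dataclasses import dataclass

# Standard structure types for the semantics named in the question.
ROLE_TO_TAG = {"heading1": "H1", "paragraph": "P", "list": "L", "table": "Table"}


@dataclass
class Element:
    role: str    # semantic label already known, e.g. "paragraph" (hypothetical names)
    page: int    # zero-based page index
    bbox: tuple  # (x0, y0, x1, y1) in PDF points, origin at bottom-left


def reading_order(elements):
    """Sort by page, then top-to-bottom, then left-to-right.

    Assumes a simple single-column layout.
    """
    return sorted(elements, key=lambda e: (e.page, -e.bbox[3], e.bbox[0]))


def struct_elems(elements):
    """Return one dict per element, mimicking the shape of a /StructElem."""
    out = []
    next_mcid = {}  # marked-content IDs are numbered per page
    for e in reading_order(elements):
        mcid = next_mcid.get(e.page, 0)
        next_mcid[e.page] = mcid + 1
        out.append({
            "Type": "StructElem",
            "S": ROLE_TO_TAG[e.role],  # structure type
            "Pg": e.page,              # simplification: page index, not a reference
            "K": mcid,                 # kid: the marked-content ID on that page
        })
    return out
```

Each `K` value only means something once the matching content on the page has been marked with the same MCID, which is the part a high-level library would need to handle.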
At this time there is a good overview at https://commonlook.com/auto-tagging-pdfs/
Tagging a PDF, with all that entails, needs to be done by the PDF writer. Here is this page as tagged by MS Edge; you can also use Chromium/Foxit/Skia (e.g. Chrome or Chromium Portable).
Consider how impractical it may be to do this retrospectively, a word, a sentence, or even a paragraph at a time, as PDF does not inherently have such constructs. Things like H1 are discarded by the print-output generator as superfluous bloat for a printer.
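To make that concrete: in a tagged page, each run of drawing operators is wrapped in a marked-content sequence (`BDC` ... `EMC`) carrying an MCID that a structure element points back to. A minimal sketch, assuming the operators for one element have already been isolated (the function name and the sample operator string are illustrative, not from any library):

```python
def wrap_marked_content(tag: str, mcid: int, operators: str) -> str:
    """Wrap content-stream operators in a marked-content sequence.

    tag is a structure type such as "H1" or "P"; mcid is the marked-content
    ID, unique within the page, that a structure element refers to.
    """
    return f"/{tag} <</MCID {mcid}>> BDC\n{operators.strip()}\nEMC"


# An untagged heading as it might appear in a page content stream.
untagged = "BT /F1 24 Tf 72 700 Td (TAGGING PDFS) Tj ET"
print(wrap_marked_content("H1", 0, untagged))
```

The hard part for an existing file is the assumption itself: finding which operators in the content stream draw the text inside a given bounding box, since one paragraph may be spread across many text-showing operators.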
OK, the prime reason for tagging is the reader with disabilities, so with a tagged PDF let's see how it fares. Here we are dealing with only one simple page without images or tables (the two most common reasons for checking tags).
So, programmatically, how will an iterative application driven by Python resolve the residual requirements that are missing?
Language: as a human I know the language is English (that should have been obvious to a browser that can read aloud).
Title: this is missing, but again it should be obvious. Is "TAGGING PDFS" suitable as a working title for approval by another person? Let's temporarily ignore the major errors, namely that the tagging and the tab order are wrong. A human with eyes and a brain can analyse why and fix those as they progress through the human aspects of every page. Asking "can a human read and navigate this logically?" will itself resolve the tag order and, at the same time, check whether the page has suitable contrast for visually impaired readers.
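A minimal sketch of how a script might at least flag those residual items, assuming the document catalog and Info dictionary have already been read into plain Python dicts by some PDF library (the function name and the dict layout are illustrative assumptions):

```python
def residual_issues(catalog: dict, info: dict) -> list:
    """List basic tagging prerequisites that are missing.

    catalog and info are plain-dict stand-ins for the PDF document
    catalog and Info dictionary (an assumption of this sketch).
    """
    issues = []
    if not catalog.get("Lang"):
        issues.append("document language (/Lang) is not set")
    if not info.get("Title"):
        issues.append("document title (/Title) is not set")
    if not (catalog.get("MarkInfo") or {}).get("Marked"):
        issues.append("/MarkInfo /Marked is not true")
    if "StructTreeRoot" not in catalog:
        issues.append("no /StructTreeRoot in the catalog")
    return issues


# Example: a tagged page that still lacks a language and a title.
for issue in residual_issues(
    {"MarkInfo": {"Marked": True}, "StructTreeRoot": {}}, {}
):
    print(issue)
```

A script can report that these entries are absent, but choosing the right values, and judging whether the tag order makes sense, still needs a person.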
So tagging a PDF is best done when a human carries out their retrospective review of the page, and that is best done using "pre-flight"/"post-flight" GUI applications such as Acrobat.