I have tagged a PDF using PDFBox.
How I tagged it: Instead of extracting the text and tagging it, I add MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P << /MCID 0 >> BDC ... EMC) and then add that marked content to the structure tree in the document catalog.
What is working: Almost everything works as in a completely tagged PDF. It also passes the PAC3 accessibility checker.
// Adding tags
tokens.add(++ind, type_check(t_ype, page));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++;
tokens.add(++ind, currentMarkedContentDictionary);
tokens.add(++ind, Operator.getOperator("BDC"));
// Adding marked content to root structure
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
What is not working: After tagging, one feature is missing from the tag structure. The option called "Find Tag from Selection" does not work. When I select some text and choose "Find Tag from Selection", it jumps to the last tag in the root structure. Please find the PDF at the link below.
https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing
Parent tree:
https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing
Extra document with tagging and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing
Please help me to solve this problem.
New problem: While JAWS is reading my tagged document in Adobe Acrobat DC, I press Ctrl+Shift+5 on a Windows machine. This opens a dialog with a drop-down offering "Read based on tagged structure" or "Top left to bottom right", and below it two radio buttons, "Read current page" and "Read all pages".
I selected "Read based on tagged structure" and "Read current page". Now JAWS does not read the tag structure. But if I use the same document with "Read entire document", it is read perfectly.
Link to doc:
https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing
Any help?
A nesting issue
You're doing this incorrectly. See for example the start of the page content stream in your document:
Focusing on the beginning and end of text objects and marked content, we see that you have
BT ... BDC ... ET ... BT ... EMC
According to the specification, though:
(ISO 32000-1 section 14.6 "Marked Content")
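The nesting requirement can be checked mechanically before saving a modified content stream. The following is a minimal sketch in plain Java, not PDFBox API: it assumes the operator names of the stream have already been extracted into a list of strings, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class NestingCheck {
    /**
     * Returns true if every BT...ET pair and every BMC/BDC...EMC pair
     * in the operator sequence is properly nested.
     */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET":
                    // ET must close the most recently opened BT
                    if (open.isEmpty() || !open.pop().equals("BT")) {
                        return false;
                    }
                    break;
                case "EMC":
                    // EMC must close the most recently opened BMC or BDC
                    if (open.isEmpty() || open.pop().equals("BT")) {
                        return false;
                    }
                    break;
                default:
                    break; // other operators do not affect nesting
            }
        }
        return open.isEmpty();
    }
}
```

With this check, the sequence BT BDC ET BT EMC from the document is rejected, while BDC BT ... ET EMC and BT BDC ... EMC ET are accepted.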
This issue was fixed in the second shared PDF, res1.pdf.
Missing ParentTree and StructParents
The problem your question focuses on is that "Find Tag from Selection" does not work.
Finding a tag from a selection essentially means that you have the MCID of some content stream instruction and you search for the structure element in the structure tree that references that marked content ID.
How PDF processors are expected to do this is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2): the StructParents entry of the page is the key into the ParentTree number tree of the structure tree root, and the value found there is an array that maps each MCID to its parent structure element.
Your PDF does not have that ParentTree at all, and your page does not contain a StructParents entry to look up in a parent tree. Thus, the prescribed way to get from marked content to the structure tree cannot be followed.
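The lookup a PDF processor performs can be modeled as follows. This is a simplified sketch in plain Java rather than PDFBox code: the parent tree is represented as a map from StructParents keys to lists of structure element labels, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;

class ParentTreeLookup {
    /**
     * Finds the parent structure element of a marked-content sequence.
     *
     * @param parentTree    model of the ParentTree: StructParents key -> array indexed by MCID
     * @param structParents the StructParents value of the page, or null if the page has none
     * @param mcid          the MCID of the selected marked-content sequence
     * @return the label of the parent structure element, or null if it cannot be found
     */
    static String findParent(Map<Integer, List<String>> parentTree,
                             Integer structParents, int mcid) {
        if (parentTree == null || structParents == null) {
            return null; // no ParentTree or no StructParents: the lookup cannot start
        }
        List<String> parents = parentTree.get(structParents);
        if (parents == null || mcid < 0 || mcid >= parents.size()) {
            return null; // no array for this page, or MCID outside the array
        }
        return parents.get(mcid); // the MCID is the zero-based index into the array
    }
}
```

If either the ParentTree or the StructParents entry is missing, the method returns null, which corresponds to a viewer being unable to find the tag for a selection.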
A ParentTree was added in the third shared PDF, new.pdf.
Incorrect ParentTree entries
While in new.pdf you have a ParentTree, its contents are clearly incorrect:
The ParentTree is a number tree, i.e. integers are mapped to something here, so there obviously must not be multiple entries for the same integer key.
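A quick sanity check for this kind of error is to collect the integer keys of the number tree (every other element of its Nums array) and look for repeats. The sketch below is plain Java with hypothetical names and assumes the keys have already been read into a list.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class NumberTreeKeys {
    /** Returns the keys that occur more than once, in ascending order. */
    static Set<Integer> duplicates(List<Integer> keys) {
        Set<Integer> seen = new HashSet<>();
        Set<Integer> repeated = new TreeSet<>();
        for (Integer key : keys) {
            if (!seen.add(key)) {
                repeated.add(key); // add() returned false, so the key was seen before
            }
        }
        return repeated;
    }
}
```

An empty result means every key is unique, which is a necessary condition for a valid number tree.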
Furthermore, looking inside one of those values:
one sees that you claim that the following StructElem is the value for all marked content IDs:
Inspecting this StructElem further, one sees that it represents the final paragraph on the final page.
Thus, your observation that "Find Tag from Selection" always jumps to the last tag is what one can expect, if one expects any reasonable behavior at all, that is, with a ParentTree structure broken so badly.
Actually there was not only this new.pdf but also res.pdf and tagged without altext.pdf with ParentTrees, but all these ParentTrees were broken like the tree of new.pdf.
You might want to start inspecting the structures you create when analyzing an unwanted behavior.
Another issue with parent tree entries
The previously described issue in parent trees has meanwhile been resolved: different pages now have different struct parents, and the parent tree arrays now reference the struct elements for distinct MCIDs.
For some documents a different error occurs now, though, e.g. "res29_08_19.pdf". Here the parent tree starts like this:
In particular the first entry in the array is for MCID 3, the second for MCID 4, ...
This is invalid, according to the specification
(ISO 32000-1 section 14.7.4.4 "Finding Structure Elements from Content Items")
Thus, the first entry must be for MCID 0, the second for MCID 1, ...
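One way to guarantee this index rule is to build each parent tree array from an MCID-to-element map, so that the position in the array is always the MCID itself. The following is a sketch in plain Java with hypothetical names; structure elements are represented by labels. A null entry marks an MCID that has no structure element, a situation that is better avoided altogether.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ParentArrayBuilder {
    /**
     * Builds the parent tree array for one page so that
     * array index i holds the parent of the content with MCID i.
     */
    static List<String> build(Map<Integer, String> parentByMcid) {
        int size = 0;
        for (int mcid : parentByMcid.keySet()) {
            if (mcid < 0) {
                throw new IllegalArgumentException("negative MCID: " + mcid);
            }
            size = Math.max(size, mcid + 1);
        }
        // Pre-fill with null so that gaps in the MCID sequence keep their slot
        List<String> array = new ArrayList<>(Collections.nCopies(size, (String) null));
        for (Map.Entry<Integer, String> entry : parentByMcid.entrySet()) {
            array.set(entry.getKey(), entry.getValue());
        }
        return array;
    }
}
```

For a page whose first tagged MCID is 3, this yields an array whose slots 0 to 2 are null and whose slot 3 holds the first parent, instead of an array that starts with the entry for MCID 3.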
You objected in a comment
But as a corollary of the above: Do not give MCIDs to marked content sequences you don't have a structure element for! MCIDs are for going back and forth between the structure hierarchy and the content streams. If you mark a piece of content without having a structure element for it, don't give it an MCID.
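In code, this advice amounts to advancing the MCID counter only for sequences that actually get a structure element. Below is a minimal sketch in plain Java with hypothetical names; each boolean states whether the corresponding marked-content sequence on a page has a structure element.

```java
import java.util.ArrayList;
import java.util.List;

class McidAssigner {
    /**
     * Assigns MCIDs 0, 1, 2, ... to the sequences that have a structure element.
     * Sequences without one get null, meaning "mark without an MCID".
     */
    static List<Integer> assign(List<Boolean> hasStructElem) {
        List<Integer> mcids = new ArrayList<>();
        int next = 0;
        for (boolean tagged : hasStructElem) {
            if (tagged) {
                mcids.add(next);
                next++;
            } else {
                mcids.add(null);
            }
        }
        return mcids;
    }
}
```

Because the counter never advances for untagged content, the MCIDs on a page run from 0 without gaps, and the parent tree array can start with the entry for MCID 0.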
Yet another issue with parent tree entries
You again report problems with your newest file, mathpdf.pdf. And indeed, there are issues; Adobe Acrobat Preflight reports a five-page list of inconsistent parent tree mappings like this:
In contrast to the previous issues, the cause does not become clear by looking at the parent tree alone; one also has to look at the structure hierarchy.
Doing so, though, one peculiarity immediately catches the eye: In your parent tree you do not reference the actual parent structure element of the MCID. Instead, you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (without actually being one of its kids) and also claims to have the MCID in question as a kid.
For example let's look at the MCID 0 on the first page. In the structure hierarchy you have:
In the parent tree you have:
You should have simply referenced object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array for page one instead of that in-between object 62 which claims to have that object 238 as parent and MCID 0 as kid.
The reported inconsistency may be due to the fact that the node referenced from the parent tree (object 62) claims to be a P paragraph with a parent node (object 238) which is a Span. That is not allowed: a paragraph may contain a span, but it cannot be contained in one.
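The mismatch described here can be detected with a two-way check: the element referenced from the parent tree must list the MCID among its kids, and its claimed parent must in turn list it as a kid. The following sketch uses a deliberately simplified, hypothetical model of structure elements in plain Java. It is not PDFBox code and does not reproduce the actual Preflight test.

```java
import java.util.ArrayList;
import java.util.List;

class StructElem {
    final String type;                            // structure type, e.g. "P" or "Span"
    StructElem parent;                            // models the P entry
    final List<Object> kids = new ArrayList<>();  // models the K entry: StructElem or Integer MCID

    StructElem(String type) {
        this.type = type;
    }
}

class ParentTreeCheck {
    /**
     * Returns true if the element referenced from the parent tree for the given MCID
     * really has that MCID as a kid and is really a kid of its claimed parent.
     */
    static boolean isConsistent(StructElem referenced, int mcid) {
        boolean ownsMcid = referenced.kids.contains(Integer.valueOf(mcid));
        boolean linkedFromParent = referenced.parent == null
                || referenced.parent.kids.contains(referenced);
        return ownsMcid && linkedFromParent;
    }
}
```

An in-between node like object 62 fails the second condition because its claimed parent does not list it as a kid, whereas the actual parent, object 238, passes both.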