I want to highlight the bounding boxes (BBox) of a particular tag when the user selects that tag in the structure tree. I am already able to get the bounding boxes when the tag has attributes (an /A entry).
However, I found that in some PDFs, even though a tag has no attributes (/A), Adobe Acrobat DC is still able to highlight the content (the bounding boxes) when you select that tag. How can I get the bounding boxes in this case? The code I tried for getting the attribute-based bounding boxes is:
    String inputPdfFile = "D:/Documents/pdfs/res.pdf";
    PDDocument old_document = PDDocument.load(new File(inputPdfFile));
    PDStructureTreeRoot treeRoot = old_document.getDocumentCatalog().getStructureTreeRoot();
    for (Object kid : treeRoot.getKids()) {
        for (Object kid2 : ((PDStructureElement) kid).getKids()) {
            PDStructureElement kid2c = (PDStructureElement) kid2;
            for (Object kid3 : kid2c.getKids()) {
                if (kid3 instanceof PDStructureElement) {
                    PDStructureElement kid3c = (PDStructureElement) kid3;
                    System.out.println(kid3c.getAttributes());
                }
            }
        }
    }
The PDF is available at https://drive.google.com/file/d/1_-tuWuReaTvrDsqQwldTnPYrMHSpXIWp/view?usp=sharing
Can anyone please help?
To determine the actual bounding boxes of the text of some marked content (in contrast to those given in some structure element layout attributes), you can use the PDFBox PDFMarkedContentExtractor and combine its results with the PDF structure tree data.

The following code does so and creates an output PDF in which the determined bounding boxes are enclosed in colored rectangles:
(from the VisualizeMarkedContent method visualize)

It uses the following helper method for recursively mapping the PDMarkedContent objects by their MCID:

(VisualizeMarkedContent helper method)
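As a rough illustration of this kind of recursive mapping, here is a minimal sketch that does not depend on PDFBox. The MarkedNode class is a hypothetical stand-in for a marked-content object (it is not the PDFBox PDMarkedContent class), and the names McidIndex and collect are likewise only assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a marked-content node; NOT the PDFBox PDMarkedContent class. */
final class MarkedNode {
    final Integer mcid;                               // null when the sequence carries no MCID
    final List<Object> contents = new ArrayList<>();  // nested nodes or other content items

    MarkedNode(Integer mcid) {
        this.mcid = mcid;
    }
}

final class McidIndex {
    /** Registers every node that has an MCID, descending into nested nodes. */
    static void collect(MarkedNode node, Map<Integer, MarkedNode> byMcid) {
        if (node.mcid != null) {
            byMcid.put(node.mcid, node);
        }
        for (Object item : node.contents) {
            // Only nested marked-content nodes are followed; other items are skipped.
            if (item instanceof MarkedNode) {
                collect((MarkedNode) item, byMcid);
            }
        }
    }

    public static void main(String[] args) {
        MarkedNode outer = new MarkedNode(0);
        MarkedNode inner = new MarkedNode(1);
        outer.contents.add("some text item");
        outer.contents.add(inner);

        Map<Integer, MarkedNode> byMcid = new HashMap<>();
        collect(outer, byMcid);
        System.out.println(byMcid.keySet());
    }
}
```

With such a map per page, a structure element that refers to an MCID can be resolved to its marked content in one lookup instead of searching the page content again.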
The method showStructure recursively determines the bounding box of structure elements and draws a rectangle for each element. Since a structure element can contain content across pages, we have to work with a mapping of pages to bounding boxes in its boxes variable...

(VisualizeMarkedContent method)
The method showContent determines the bounding box of text associated with a given MCID, recursing if need be.

(VisualizeMarkedContent method)
The previous two methods showStructure and showContent make use of the following helpers to build the (page-wise) union of bounding boxes:

(VisualizeMarkedContent helper methods)
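A page-wise union of this kind could look roughly like the sketch below. It uses only java.awt.geom and a generic page key P, so it runs without PDFBox; in real code the key would presumably be the page object. The class and method names (PageBoxes, union, merge) are assumptions made for this example.

```java
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.Map;

final class PageBoxes {
    /** Union of two boxes where either may be null, meaning "nothing found yet". */
    static Rectangle2D union(Rectangle2D a, Rectangle2D b) {
        if (a == null) {
            return b;
        }
        if (b == null) {
            return a;
        }
        return a.createUnion(b);
    }

    /** Merges the boxes of 'addition' into 'target', page by page. */
    static <P> void merge(Map<P, Rectangle2D> target, Map<P, Rectangle2D> addition) {
        for (Map.Entry<P, Rectangle2D> entry : addition.entrySet()) {
            if (entry.getValue() == null) {
                continue; // no box on this page, nothing to merge
            }
            target.merge(entry.getKey(), entry.getValue(), PageBoxes::union);
        }
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> element = new HashMap<>();
        Map<String, Rectangle2D> child = new HashMap<>();
        element.put("page 1", new Rectangle2D.Double(0, 0, 10, 10));
        child.put("page 1", new Rectangle2D.Double(5, 5, 10, 10));
        child.put("page 2", new Rectangle2D.Double(20, 20, 5, 5));

        merge(element, child);
        System.out.println(element);
    }
}
```

Keeping one rectangle per page means an element that continues on the next page gets two separate rectangles instead of one box spanning unrelated coordinates.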
Finally, the method calculateGlyphBounds has been borrowed from the PDFBox example DrawPrintTextLocations to calculate the individual glyph bounding boxes:

(VisualizeMarkedContent method)
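The core geometric idea behind such a glyph bounds calculation can be sketched with plain java.awt.geom classes. This is a simplified illustration under stated assumptions, not the PDFBox example code: it assumes the glyph width and the font bounding box are given in 1/1000 text space units and that the text matrix is already available as an AffineTransform. It ignores font-specific matrices, page rotation and coordinate flipping. The names GlyphBounds and glyphBounds are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class GlyphBounds {
    /**
     * Maps a glyph box given in glyph space (1/1000 units, an assumption of this sketch)
     * through the text matrix and returns its axis-aligned bounds.
     */
    static Rectangle2D glyphBounds(AffineTransform textMatrix,
                                   double glyphWidth,
                                   double bboxLowerY,
                                   double bboxHeight) {
        // Work on a copy so the caller's matrix is left unchanged.
        AffineTransform at = new AffineTransform(textMatrix);
        // Glyph space to text space: points are scaled first, then the text matrix applies.
        at.scale(1 / 1000.0, 1 / 1000.0);

        Rectangle2D glyphBox = new Rectangle2D.Double(0, bboxLowerY, glyphWidth, bboxHeight);
        return at.createTransformedShape(glyphBox).getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical values: 12 pt font placed at (100, 200).
        AffineTransform textMatrix = new AffineTransform(12, 0, 0, 12, 100, 200);
        System.out.println(glyphBounds(textMatrix, 500, -200, 900));
    }
}
```

The per-glyph rectangles obtained this way are what gets united into the bounding box of a marked-content sequence.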
The result for your example document: