How to avoid tagging empty <TR>/<TD> cells in a PDF using iText 5

Question


683 Views Asked by Naga Suresh Babu P At 12 December 2019 at 06:33

I am using iText 5 to generate a PDF from HTML input. As part of PDF accessibility, I am calling PdfWriter.setTagged().

However, all HTML elements end up tagged, both the empty and the non-empty ones. Can you please help me avoid tagging the empty HTML elements?


There are 2 best solutions below

**André Lemos** · Answer 1 · 2019-12-12T11:10:16.200000

I suppose one way around this would be to walk the structure tree (StructTreeRoot) of the output PDF document, find the tags you are looking for that have no kids, and remove them from their parent. I no longer use iText 5, as it has been deprecated (only security fixes are issued), but with iText 7 you could do something like:

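import java.io.IOException;

import com.itextpdf.kernel.pdf.PdfArray;
import com.itextpdf.kernel.pdf.PdfDictionary;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;

private void removeEmptyTag() throws IOException {
    final PdfDocument pdfDoc = new PdfDocument(new PdfReader(ORIG),
            new PdfWriter(DEST));
    PdfDictionary catalog = pdfDoc.getCatalog().getPdfObject();
    // Get the structure tree root dictionary
    PdfDictionary structTreeRoot = catalog.getAsDictionary(PdfName.StructTreeRoot);
    manipulate(structTreeRoot);

    pdfDoc.close();
}

// Returns true when the element is a TD structure element without kids,
// i.e. when the caller should remove it from its parent.
public boolean manipulate(PdfDictionary element) {

    if (element == null)
        return false;

    if (PdfName.TD.equals(element.get(PdfName.S))) {
        if (!element.containsKey(PdfName.K)) {
            return true;
        }
    }

    PdfArray kids = element.getAsArray(PdfName.K);
    if (kids == null) return false;
    // Iterate backwards so that removing an entry does not skip the next one
    for (int i = kids.size() - 1; i >= 0; i--) {
        if (manipulate(kids.getAsDictionary(i))) {
            kids.remove(i);
        }
    }

    return false;
}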

It's not the most elegant thing, but to try it out I used pdfHTML to convert an HTML file that had an empty td:

<tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
</tr>
<tr>
    <td>Jill</td>
    <td>Smith</td>
    <td></td>
</tr>
<tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
</tr>

I then ran the code above over the resulting PDF to remove the empty tags (or rather, tags without children). Maybe there is a way to do this directly with XMLWorker (I am assuming this is what you are using to convert the HTML), or a better post-processing alternative to my suggestion.
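The post-processing idea (drop TD elements that have no kids) can be tried out without a PDF at hand. The sketch below is a plain-Java model, not iText code: StructNode, pruneEmptyCells and sampleRow are hypothetical names introduced only for illustration, with role standing in for the /S entry and kids for the /K entry. It visits the kids from last to first, so removing an entry never shifts a sibling that has not been visited yet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a structure element: a role (the /S entry)
// plus child elements (the /K entry). Not an iText class.
class StructNode {
    final String role;
    final List<StructNode> kids = new ArrayList<>();

    StructNode(String role, StructNode... children) {
        this.role = role;
        for (StructNode child : children) {
            kids.add(child);
        }
    }
}

class StructTreePruneSketch {

    // Removes every "TD" node that has no kids and returns how many were removed.
    static int pruneEmptyCells(StructNode node) {
        int removed = 0;
        // Walk backwards: removing index i never shifts an unvisited sibling.
        for (int i = node.kids.size() - 1; i >= 0; i--) {
            StructNode kid = node.kids.get(i);
            if ("TD".equals(kid.role) && kid.kids.isEmpty()) {
                node.kids.remove(i);
                removed++;
            } else {
                removed += pruneEmptyCells(kid);
            }
        }
        return removed;
    }

    // Models the second table row above: two filled cells and one empty cell.
    // "P" is an assumed placeholder for whatever content a filled cell holds.
    static StructNode sampleRow() {
        return new StructNode("TR",
                new StructNode("TD", new StructNode("P")),
                new StructNode("TD", new StructNode("P")),
                new StructNode("TD"));
    }

    public static void main(String[] args) {
        StructNode table = new StructNode("Table", sampleRow());
        int removed = pruneEmptyCells(table);
        System.out.println("removed=" + removed
                + ", cells left in row=" + table.kids.get(0).kids.size());
    }
}
```

Running main prints removed=1, cells left in row=2. A real structure tree is more varied than this model (for instance, /K may hold a single object rather than an array), so treat it as an illustration of the traversal only.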

**André Lemos** · Answer 2 · 2019-12-12T11:58:26.637000

You can do it directly with pdfHTML (basically the solution for HTML to PDF conversion in iText 7).

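import java.io.FileInputStream;

import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.html2pdf.attach.ITagWorker;
import com.itextpdf.html2pdf.attach.ProcessorContext;
import com.itextpdf.html2pdf.attach.impl.DefaultTagWorkerFactory;
import com.itextpdf.html2pdf.attach.impl.tags.SpanTagWorker;
import com.itextpdf.html2pdf.attach.impl.tags.TdTagWorker;
import com.itextpdf.html2pdf.html.TagConstants;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
// The package of IElementNode depends on the pdfHTML version; this one is assumed for 2.1 and later
import com.itextpdf.styledxmlparser.node.IElementNode;

ConverterProperties props = new ConverterProperties();
props.setTagWorkerFactory(new DefaultTagWorkerFactory() {
    @Override
    public ITagWorker getCustomTagWorker(
            IElementNode tag, ProcessorContext context) {
        if (tag.name().equals(TagConstants.TD)) {
            if (!tag.childNodes().isEmpty()) {
                return new TdTagWorker(tag, context);
            } else {
                // Empty cell: hand it to the span worker instead of the td worker
                return new SpanTagWorker(tag, context);
            }
        }

        // Any other tag: fall back to the default worker
        return null;
    }
});

PdfDocument doc = new PdfDocument(new PdfWriter(DEST));
doc.setTagged();

HtmlConverter.convertToPdf(new FileInputStream(ORIG), doc, props);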

In the code above, setTagWorkerFactory lets you define custom behavior for your tags, as detailed in the documentation. In this specific case, I'm simply mapping empty TD tags to a Span element, which achieves the desired behavior (the superfluous TD tag disappears).

(To be completely honest, this relies on the inability of the TR worker to handle the SPAN tag, so it just skips it. I'll update the answer if I come up with a more elegant solution.)
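The factory override follows a common hook pattern: the custom method is asked first, and a null result means "use the default worker". The sketch below models that dispatch in plain Java, assuming this is how the fallback works. TagWorkerFactorySketch, customWorker, workerFor and emptyCellsAsSpans are hypothetical names, worker names are plain strings, and nothing here is an iText class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a tag-worker factory (not iText code).
class TagWorkerFactorySketch {

    // Default mapping from tag name to the worker chosen for it.
    private final Map<String, String> defaults = new HashMap<>();

    TagWorkerFactorySketch() {
        defaults.put("td", "TdWorker");
        defaults.put("tr", "TrWorker");
        defaults.put("span", "SpanWorker");
    }

    // Extension hook: return a worker name to override, or null to defer.
    protected String customWorker(String tag, boolean hasChildren) {
        return null;
    }

    // Asks the hook first and falls back to the default mapping on null.
    final String workerFor(String tag, boolean hasChildren) {
        String custom = customWorker(tag, hasChildren);
        return custom != null ? custom : defaults.get(tag);
    }

    // Factory whose hook swaps the worker for empty td elements only.
    static TagWorkerFactorySketch emptyCellsAsSpans() {
        return new TagWorkerFactorySketch() {
            @Override
            protected String customWorker(String tag, boolean hasChildren) {
                if ("td".equals(tag) && !hasChildren) {
                    return "SpanWorker";
                }
                return null;
            }
        };
    }

    public static void main(String[] args) {
        TagWorkerFactorySketch factory = emptyCellsAsSpans();
        System.out.println(factory.workerFor("td", true));
        System.out.println(factory.workerFor("td", false));
        System.out.println(factory.workerFor("tr", true));
    }
}
```

Only the empty td is rerouted here; filled cells and every other tag keep their default worker, which is why the override can stay this small.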
