Extracting Identity-H encoded PDF text and replacing it using PDFBox in java

1.5k Views Asked by At

I'm struggling with reading the PDF document which is encoded with Identity-H( TrueType(CID) ). When I get the token values for Tj, I'm able to find unreadable string(random symbols). I need to any suggestion on how I can work this out as I need to find certain strings from the PDF and replace them.

        public void doIt( String inputFile, String outputFile, String strToFind, String message)
    throws IOException, COSVisitorException
{
    // the document
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load( inputFile );
        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ )
        {
            PDPage page = (PDPage)pages.get( i );
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
            parser.parse();
            List tokens = parser.getTokens();
            for( int j=0; j<tokens.size(); j++ )
            {
                Object next = tokens.get( j );
                if( next instanceof PDFOperator )
                {
                    PDFOperator op = (PDFOperator)next;
                    //Tj and TJ are the two operators that display
                    //strings in a PDF
                    if( op.getOperation().equals( "Tj" ) )
                    {
                        //Tj takes one operator and that is the string
                        //to display so lets update that operator
                        COSString previous = (COSString)tokens.get( j-1 );
                        String string = previous.getString();
                        string = string.replaceFirst( strToFind, message );
                        previous.reset();
                        previous.append( string.getBytes("ISO-8859-1") );
                    }
                    else if( op.getOperation().equals( "TJ" ) )
                    {
                        COSArray previous = (COSArray)tokens.get( j-1 );
                        for( int k=0; k<previous.size(); k++ )
                        {
                            Object arrElement = previous.getObject( k );
                            if( arrElement instanceof COSString )
                            {
                                COSString cosString = (COSString)arrElement;
                                String string = cosString.getString();
                                string = string.replaceFirst( strToFind, message );
                                cosString.reset();
                                cosString.append( string.getBytes("ISO-8859-1") );
                            }
                        }
                    }
                }
            }
            //now that the tokens are updated we will replace the
            //page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens( tokens );
            page.setContents( updatedStream );
        }
        doc.save( outputFile );
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}

This is an example provide in https://svn.apache.org/repos/asf/pdfbox/tags/1.5.0/pdfbox/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java

My program looks the same. I'm currently using PDFBox 1.7. Any suggesting how I can get the intended text or convert the encoding value from Identity-H to something that can be recognized by PDFBox. Or any suggestions for other API's that will allow me to use Identity-H encoded PDF for extracting the text and re-framing it again.

Thanks in advance for the help.

0

There are 0 best solutions below