Remove special characters from text/PDF with Apache Tika

I am parsing a PDF file to extract its text with Apache Tika.

//Create a body content handler
BodyContentHandler handler = new BodyContentHandler();

//Metadata
Metadata metadata = new Metadata();

//Input file path
FileInputStream inputstream = new FileInputStream(new File(faInputFileName));

//Parser context. It is used to parse InputStream
ParseContext pcontext = new ParseContext();

try
{       
    //parsing the document using PDF parser from Tika.
    PDFParser pdfparser = new PDFParser();

    //Do the parsing by calling the parse function of pdfparser
    pdfparser.parse(inputstream, handler, metadata,pcontext);

}catch(Exception e)
{
    //Report what went wrong instead of silently swallowing the exception
    System.out.println("Exception caught: " + e.getMessage());
}
String extractedText = handler.toString();

The above code works and the text from the PDF is extracted.

There are some special characters in the PDF file (like @/&/£ or the trademark sign, etc.). How can I remove those special characters during or after the extraction process?

There are 1 best solutions below

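PDF uses Unicode code points, so you may well have strings that contain surrogate pairs, combining forms (e.g. for diacritics), etc. You may wish to preserve these as their closest ASCII equivalent, e.g. normalise é to e. NFD normalisation helps here: it splits é into a plain e followed by a separate combining accent, so the base letter survives when the non-ASCII characters are later filtered out. If so, you can do something like this: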

import java.text.Normalizer;

String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);
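To see what the normalisation step does on its own, here is a minimal sketch. The class name `NfdDemo` and the sample string are made up for illustration; only `java.text.Normalizer` comes from the answer.

```java
import java.text.Normalizer;

class NfdDemo {
    // Decompose precomposed characters into base letter + combining marks.
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // "café" written with a single precomposed é (U+00E9)
        String composed = "caf\u00E9";
        String decomposed = decompose(composed);

        // The decomposed form is one char longer: 'e' followed by U+0301
        System.out.println(composed.length());
        System.out.println(decomposed.length());
        System.out.println(decomposed.charAt(3) == 'e');
        System.out.println(decomposed.charAt(4) == '\u0301');
    }
}
```

Nothing is removed at this stage; the text only changes shape so that a later filter can drop the accent without dropping the letter.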

If you are simply after ASCII text, then once the string is normalised you could filter it using a regular expression, as per this answer:

extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");
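One thing worth noting: `@` and `&` are themselves ASCII, so the pattern above keeps them and only drops characters such as £ or ™. The sketch below shows this on a made-up sample string. The class `AsciiFilterDemo` and the stricter `toAlphanumeric` variant are assumptions added for illustration, not part of the original answer.

```java
import java.text.Normalizer;

class AsciiFilterDemo {
    // Normalise, then drop everything outside 7-bit ASCII (the approach above).
    static String toAscii(String text) {
        String normalised = Normalizer.normalize(text, Normalizer.Form.NFD);
        return normalised.replaceAll("[^\\p{ASCII}]", "");
    }

    // Hypothetical stricter variant: additionally drop ASCII punctuation,
    // keeping only ASCII letters, digits and whitespace.
    static String toAlphanumeric(String text) {
        return toAscii(text).replaceAll("[^\\p{Alnum}\\s]", "");
    }

    public static void main(String[] args) {
        // Contains é, the trademark sign, the pound sign, '@' and '&'
        String sample = "Caf\u00E9\u2122 costs \u00A35 @ A&B";
        System.out.println(toAscii(sample));
        System.out.println(toAlphanumeric(sample));
    }
}
```

Removed characters are not replaced with anything, so gaps can appear as doubled spaces. If that matters, collapsing whitespace afterwards (for example with `replaceAll("\\s+", " ")`) is one option.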

However, since regular expressions can be slow (particularly on large strings), you may want to avoid the regex and filter the characters in a simple loop instead (as per this answer):

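public static String flattenToAscii(String string) {
    // Normalise first: the NFD form can be longer than the input string
    String normalized = Normalizer.normalize(string, Normalizer.Form.NFD);
    char[] out = new char[normalized.length()];
    int j = 0;
    for (int i = 0, n = normalized.length(); i < n; ++i) {
        char c = normalized.charAt(i);
        // Keep only 7-bit ASCII characters
        if (c <= '\u007F') out[j++] = c;
    }
    // Copy only the j characters written, not the unused tail of the buffer
    return new String(out, 0, j);
}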
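As a quick sanity check, the loop-based filter should produce exactly the same string as the regex approach. The sketch below compares the two on a made-up sample. The class `FlattenCheck`, the method name `flatten`, and the use of a `StringBuilder` (so that no output buffer has to be sized by hand) are assumptions for illustration.

```java
import java.text.Normalizer;

class FlattenCheck {
    // Loop-based ASCII filter; StringBuilder grows as needed, so there is
    // no fixed-size buffer and no unused tail to trim.
    static String flatten(String text) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            if (c <= '\u007F') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "naïve résumé © 2020" written with escapes
        String sample = "na\u00EFve r\u00E9sum\u00E9 \u00A9 2020";
        String viaRegex = Normalizer.normalize(sample, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");

        // Both approaches should agree character for character
        System.out.println(flatten(sample).equals(viaRegex));
        System.out.println(flatten(sample));
    }
}
```

Whether the loop is actually faster than the regex depends on the input size and the JVM, so it is worth measuring on your own documents before committing to either.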