Parsing Text and Images from PDF


I am experimenting with parsing a PDF for some home automation ideas.

I am trying to see what data I can get from a PDF. The PDF I am testing with is here: http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx

So I have used the following code to parse the PDF:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    // Downloads the document at the given URL and extracts its text and basic metadata with Tika.
    public Map<String, Object> processRecord(String url) {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // try-with-resources closes the stream even if parsing fails
                try (InputStream input = entity.getContent()) {
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    // Flatten line breaks and tabs into single spaces
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            // Release the HTTP connection resources
            httpclient.getConnectionManager().shutdown();
        }
        return map;
    }

    public static void main(String[] args) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx");
        System.out.println(extractedMap.get("text"));
    }
}

This returns all the text from the PDF perfectly. Taking this further, is it possible to also get some sort of description of the images in the PDF? For example, beside each date there is a coloured image; is there a way to get that information too?
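To make concrete the kind of "description" I mean, here is a minimal sketch of what I would like to end up with. It assumes the images could somehow be obtained from the PDF as java.awt.image.BufferedImage objects, which is the part I do not know how to do. The class name ImageColourDescriber, its two methods, and the hue thresholds are all made up for illustration and are not part of Tika or any other library.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

// Hypothetical helper: turns an already-extracted image into a rough colour name.
public class ImageColourDescriber {

    // Averages the red, green and blue channels over every pixel of the image.
    public static Color averageColour(BufferedImage image) {
        long r = 0, g = 0, b = 0;
        int w = image.getWidth();
        int h = image.getHeight();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = image.getRGB(x, y);
                r += (rgb >> 16) & 0xFF;
                g += (rgb >> 8) & 0xFF;
                b += rgb & 0xFF;
            }
        }
        long n = (long) w * h;
        return new Color((int) (r / n), (int) (g / n), (int) (b / n));
    }

    // Maps a colour to a coarse name via hue/saturation/brightness.
    // The thresholds are arbitrary guesses, not calibrated values.
    public static String describe(Color c) {
        float[] hsb = Color.RGBtoHSB(c.getRed(), c.getGreen(), c.getBlue(), null);
        if (hsb[2] < 0.2f) return "black";
        if (hsb[1] < 0.15f) return hsb[2] > 0.85f ? "white" : "grey";
        float hue = hsb[0] * 360f;
        if (hue < 20 || hue >= 340) return "red";
        if (hue < 45) return "orange/brown";
        if (hue < 70) return "yellow";
        if (hue < 170) return "green";
        if (hue < 260) return "blue";
        return "purple";
    }

    public static void main(String[] args) {
        // Stand-in for an image taken from the PDF: a 4x4 solid blue square.
        BufferedImage sample = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                sample.setRGB(x, y, Color.BLUE.getRGB());
            }
        }
        System.out.println(describe(averageColour(sample))); // prints "blue"
    }
}
```

If something like this could be run against each image next to a date, the result could be stored in the map alongside the extracted text.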
