How to extract title, body and images from HTML with the Apache Tika parser


I want to extract the title, the HTML body (as plain text) and the image URLs from an HTML page. Is it possible to use the Apache Tika Server to achieve this?

Best answer

Using the Apache Tika Server as-is, you cannot get both the plain text of the body and all the img tag src URLs in a single step.

You have a few choices available to you:

  1. Ask the Tika Server for the plain text of the file. Then ask it a second time for the normalised HTML, and filter that client-side for img tags.
  2. Ask the Tika Server for the normalised HTML form only, then pull out the img tag URLs and the plain text locally, likely with your own XHTML parser.
  3. Call the Tika Java code directly, with a custom ContentHandler, without using the Server.
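For options 1 and 2, the requests to the server might look like the sketch below. It assumes a Tika Server running locally on port 9998 and that the `/tika` endpoint accepts a PUT of the raw document and picks its output form from the `Accept` header (`text/plain` for plain text, `text/html` for the normalised HTML). The class and method names are made up for illustration, and it uses `java.net.http`, so it needs Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class TikaServerClient {
    // Assumption: a Tika Server is listening locally on port 9998.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    // Build a PUT request carrying the raw HTML; the Accept header
    // selects which form of the parsed document the server returns.
    static HttpRequest buildRequest(String html, String accept) {
        return HttpRequest.newBuilder(TIKA_ENDPOINT)
                .header("Accept", accept)
                .header("Content-Type", "text/html")
                .PUT(HttpRequest.BodyPublishers.ofString(html, StandardCharsets.UTF_8))
                .build();
    }

    // Send the request and return the response body as a string.
    static String fetch(String html, String accept) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(buildRequest(html, accept),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)).body();
    }
}
```

Option 1 would call `fetch(html, "text/plain")` and then `fetch(html, "text/html")`. Option 2 would make only the second call and derive both the text and the img URLs from its result.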

For option #3, you'd want to largely follow the "fetch the body of the XHTML document" example, but throw away most of the tag information. You'd only care about img tags as tags; for everything else you'd only pass through the inner characters.
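A handler along those lines could be sketched as follows. It is a plain SAX `DefaultHandler` that records the `src` of every img element, the text of the title element, and the character data inside body, ignoring all other tags. The class name `ImgAndTextHandler` and the helper `parseXhtml` are hypothetical. To keep the sketch self-contained, it is driven here by the JDK's own SAX parser over an XHTML string, which is roughly what option 2 would do locally. With Tika on the classpath, the same handler could instead be passed to `AutoDetectParser.parse(stream, handler, new Metadata(), new ParseContext())`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ImgAndTextHandler extends DefaultHandler {
    final List<String> imageUrls = new ArrayList<>();
    final StringBuilder title = new StringBuilder();
    final StringBuilder bodyText = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to the qualified name if the parser is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("title")) {
            inTitle = true;
        } else if (name.equals("body")) {
            inBody = true;
        } else if (name.equals("img")) {
            // img is the only tag kept as a tag: record its src URL.
            String src = atts.getValue("", "src");
            if (src == null) src = atts.getValue("src");
            if (src != null) imageUrls.add(src);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("title")) inTitle = false;
        else if (name.equals("body")) inBody = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Pass through only the inner characters; all other markup is dropped.
        if (inTitle) title.append(ch, start, length);
        else if (inBody) bodyText.append(ch, start, length);
    }

    // Hypothetical helper: run the handler over an XHTML string using the JDK SAX parser.
    static ImgAndTextHandler parseXhtml(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ImgAndTextHandler handler = new ImgAndTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

After parsing, `handler.title`, `handler.bodyText` and `handler.imageUrls` hold the three pieces the question asks for. Text from adjacent block elements is concatenated as-is, so a real handler would probably insert whitespace or newlines at element boundaries.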