How to extract title, body and images from HTML with the Apache Tika parser

1.7k Views Asked by bertyuan At 23 December 2014 at 15:59

I want to extract the title, the HTML body (as plain text), and the image URLs from an HTML page. Is it possible to use the Apache Tika Server to achieve this?

Gagravarr On 19 July 2015 at 20:50 BEST ANSWER

Using the Apache Tika Server as-is, you cannot get both the plain text of the body and all the img tag src URLs in a single step.

You have a few choices available to you:

1. Ask the Tika Server for the plain text of the file. Then ask it a second time for the normalised HTML, and filter that client-side for img tags.
2. Ask the Tika Server for the normalised HTML form only, then pull out both the img tag URLs and the plain text locally, likely with your own XHTML parser.
3. Call the Tika Java code directly with a custom ContentHandler, without using the Server.
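
For the first two choices, the requests to the server could look roughly like the sketch below. It assumes a Tika Server listening on its default port 9998 on localhost and uses the `/tika` endpoint, where the `Accept` header selects plain text or XHTML output. The class name `TikaServerClient` and its helper methods are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClient {
    // Assumption: a Tika Server running locally on its default port.
    static final String TIKA_URL = "http://localhost:9998/tika";

    // Build a PUT request carrying the HTML; the Accept header picks the output form.
    static HttpRequest buildRequest(byte[] html, String accept) {
        return HttpRequest.newBuilder(URI.create(TIKA_URL))
                .header("Accept", accept)
                .header("Content-Type", "text/html") // hint about the input type
                .PUT(HttpRequest.BodyPublishers.ofByteArray(html))
                .build();
    }

    // Send the request and return the response body, failing on non-200 answers.
    static String fetch(HttpClient client, byte[] html, String accept) throws Exception {
        HttpResponse<String> response =
                client.send(buildRequest(html, accept), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Tika Server returned " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: TikaServerClient <file.html>");
            return;
        }
        byte[] html = Files.readAllBytes(Path.of(args[0]));
        HttpClient client = HttpClient.newHttpClient();
        String plainText = fetch(client, html, "text/plain"); // first request: body text
        String xhtml = fetch(client, html, "text/html");      // second request: normalised XHTML
        System.out.println(plainText);
        System.out.println(xhtml);
    }
}
```

With the second choice you would send only the `text/html` request and derive the plain text from the XHTML yourself.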

For option #3, you'd want to largely follow the Tika example that fetches the body of the XHTML document, but throw away most of the tag information. You'd only care about img tags as tags; for everything else, you'd only pass through the inner characters.
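
A minimal sketch of such a handler follows. It is not the Tika example itself: the class `ImgAndTextHandler` and its fields are hypothetical, and it relies only on the standard SAX API. The demo feeds it an XHTML string through the JDK's SAX parser, which is roughly what option #2 would do with the server's normalised HTML. With Tika on the classpath, the same handler could instead be passed to `Parser.parse(stream, handler, metadata, context)`, assuming the HTML parser emits `title`, `body` and `img` elements in its XHTML output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ImgAndTextHandler extends DefaultHandler {
    public final List<String> imageUrls = new ArrayList<>();
    public final StringBuilder title = new StringBuilder();
    public final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    // localName is empty when the parser is not namespace-aware, so fall back to qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (n.equals("img")) {
            String src = atts.getValue("src"); // the only tag data we keep
            if (src != null) imageUrls.add(src);
        } else if (n.equals("title")) {
            inTitle = true;
        } else if (n.equals("body")) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (n.equals("title")) inTitle = false;
        else if (n.equals("body")) inBody = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Pass through inner characters only; every other tag is ignored.
        if (inTitle) title.append(ch, start, length);
        else if (inBody) body.append(ch, start, length);
    }

    // Demo helper: run the handler over an XHTML string with the JDK SAX parser.
    public static ImgAndTextHandler parseXhtml(String xhtml) throws Exception {
        ImgAndTextHandler handler = new ImgAndTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
        // With Tika (assumed on the classpath), something like:
        // new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    }

    public static void main(String[] args) throws Exception {
        ImgAndTextHandler h = parseXhtml("<html><head><title>Demo</title></head>"
                + "<body><p>Hello <img src=\"a.png\" alt=\"x\"/>world</p></body></html>");
        System.out.println(h.title);     // Demo
        System.out.println(h.body);      // Hello world
        System.out.println(h.imageUrls); // [a.png]
    }
}
```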
