I am trying to write a summary of the content of a web page. For that, I need to strip all the irrelevant text and data out of the page so that only the main content is left.
I have used boilerpipe, but the text extraction is not good. The results are here, where you can see a lot of irrelevant text.
I also tried jsoup to strip away irrelevant data by removing headers, footers, external links, etc., but again the results are not up to the mark.
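import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Jsoup.connect() needs a full URL including the protocol
Document doc = Jsoup.connect("http://www.anyurl.com").get();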
doc.head().remove();
doc.getElementsByTag("header").remove();
doc.getElementsByTag("footer").remove();
doc.getElementsByTag("form").remove();
doc.getElementsByTag("table").remove();
doc.getElementsByTag("meta").remove();
doc.getElementsByTag("img").remove();
doc.getElementsByTag("a").remove();
doc.getElementsByTag("br").remove();
doc.getElementsByClass("tags").remove();
doc.getElementsByClass("copyright").remove();
doc.getElementsByClass("widget").remove();
// Remove elements whose class attribute contains the given substring
doc.select("div[class*=foot]").remove();
doc.select("div[class*=tag]").remove();
doc.select("div[class*=Loading]").remove();
doc.select("div[class*=Widget]").remove();
doc.select("div[class*=Head]").remove();
doc.select("div[class*=menu]").remove();
doc.select("p[class*=link]").remove();
Elements paragraphs = doc.select("p");
Elements divs = doc.select("div");
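// Note: divs.text() already includes the text of any nested <p> and <div>
// elements, so this concatenation repeats the same content several times.
String formattedOutput = paragraphs.text() + " " + divs.text();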
Can anyone suggest how to get this done? Is there any Java library other than boilerpipe that does this for you?
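Instead of deleting tags and class names one by one, one common heuristic is to score each block of text and keep only the blocks that look like prose: enough words, and not mostly link text. The following is only a minimal sketch of that idea in plain Java, not how boilerpipe or jsoup work internally. The `ContentFilter` class, the `Block` type, and both thresholds are hypothetical names and guessed values for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentFilter {

    /** A chunk of page text plus the number of its words that sit inside links (hypothetical type). */
    public static final class Block {
        final String text;
        final int linkWords;

        public Block(String text, int linkWords) {
            this.text = text;
            this.linkWords = linkWords;
        }

        /** Number of whitespace-separated words in the block. */
        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of words that are link text; empty blocks count as all links. */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 1.0 : (double) linkWords / words;
        }
    }

    // Illustrative thresholds only; they would need tuning on real pages.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.3;

    /** Keeps blocks that are long enough and not dominated by link text. */
    public static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY) {
                kept.add(b.text.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Home About Contact Login", 4)); // navigation bar
        blocks.add(new Block("The city council voted on Tuesday to approve a new budget "
                + "that increases funding for public transport and road repairs.", 0));
        blocks.add(new Block("Copyright 2014 Example Inc.", 0)); // footer
        for (String s : keepContent(blocks)) {
            System.out.println(s);
        }
    }
}
```

With jsoup, the blocks could presumably be built from the `p` and leaf `div` elements, taking `element.text()` as the block text and counting the words in `element.select("a").text()` as the link words. Scoring leaf elements only would also avoid the repeated text that comes from concatenating nested elements.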
I don't know about Java, but there are tools you can use to extract the main content from a webpage.