I want to extract the text "Catholic Blended Margaritas", which appears in the part of the HTML page pasted below.
I used the following XPath expression:
String xpath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";
I passed it to HtmlCleaner; the relevant part of my code is:
//use the cleaner to "clean" the HTML and return it as a TagNode object i.e. HTML page root node
TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));
// query XPath
Object[] nodes = rootNode.evaluateXPath(xpath);
But the above expression returns zero nodes.
I have pasted part of the HTML below. In fact, I want the text of all such nodes on the page; the fragment below shows only one of them. For reference, the page is at: http://www.foodfood.com/category/recipes/by-course/beverages/
Part of the HTML from that page is as follows:
<div class="recipeBox ">
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
<div class="pic">
<img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/> </div>
<div class="detailBox">
<h3>Catholic Blended Margaritas</h3>
<p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
<div class="timer">5 Mins</div>
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
</div>
</a>
</div>
Please note that the text "Catholic Blended Margaritas" (which I want) is nested inside two <div> tags, which is what is giving me trouble.
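I see 2 issues with
//div[@class='recipeBox']/div[@class='detailBox']/h3/text()
for your sample page:
- <div class="recipeBox "> has a trailing space in its class attribute, so @class='recipeBox' does not match it
- the <div class="detailBox"> is not a direct child of the recipeBox div; it sits inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link
So I suggest you try with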
//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()
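
To see the difference between the two expressions without hitting the live site, here is a minimal sketch. Note the assumptions: it uses the JDK's built-in javax.xml.xpath engine rather than HtmlCleaner, and it runs against a simplified, well-formed copy of the fragment from the question. The class name XPathClassDemo and the helper titles are made up for illustration. HtmlCleaner's own XPath support is more limited, so treat this as a check of the expressions themselves, not of HtmlCleaner's behavior.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathClassDemo {
    // Simplified, well-formed copy of the markup from the question.
    // Note the trailing space in class="recipeBox " and the <a> wrapper.
    static final String SAMPLE =
        "<div class=\"recipeBox \">"
      + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
      + "<div class=\"detailBox\"><h3>Catholic Blended Margaritas</h3></div>"
      + "</a></div>";

    // Evaluates the expression against SAMPLE and returns the matched text nodes' values.
    static List<String> titles(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SAMPLE)),
                XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Original expression: exact class match and direct-child step.
        System.out.println(titles(
            "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()"));
        // Suggested expression: whitespace-normalized class and descendant step.
        System.out.println(titles(
            "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()"));
    }
}
```

On this sample the first expression matches nothing and the second returns the heading text. Fixing only one of the two issues is not enough: both the normalize-space() predicate and the // descendant step are needed.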