I want to extract the text "Catholic Blended Margaritas", which appears in the HTML fragment pasted below.

I used the following XPath expression:

xPath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

I passed it to HtmlCleaner; the relevant part of my code is:

    // Use the cleaner to "clean" the HTML and return the page's root node as a TagNode
    TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));

    // Evaluate the XPath expression (the xPath string defined above) against the root node
    Object[] nodes = rootNode.evaluateXPath(xPath);

But the above expression returns zero nodes.

The relevant part of the HTML is pasted below. In fact, I want the text of every such node on the page; I have pasted only one of them. The page, for reference: http://www.foodfood.com/category/recipes/by-course/beverages/

Part of the HTML from the above link:

<div class="recipeBox ">
        <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
            <div class="pic">
                <img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/>             </div>
            <div class="detailBox">
                <h3>Catholic Blended Margaritas</h3>
                <p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
                <div class="timer">5 Mins</div>
                <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
            </div>
        </a>
    </div>

Please note that the text "Catholic Blended Margaritas" (which I want) is nested inside two <div> tags, which is what is giving me trouble.

There are 1 best solutions below

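I see 2 issues with //div[@class='recipeBox']/div[@class='detailBox']/h3/text() for your sample page:

  • the trailing space in the "class" attribute of <div class="recipeBox ">, so the value is "recipeBox " and @class='recipeBox' does not match it
  • nesting of your target elements inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link, so <div class="detailBox"> is not a direct child of the recipeBox <div> and a single / step does not reach it

So I suggest you try //div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text(). Here normalize-space() strips the trailing space from the class value, and // matches the detailBox <div> at any depth below recipeBox.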
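
To see both issues in isolation, here is a minimal sketch that runs the original and the suggested expression against a trimmed copy of the markup from the question. It uses the JDK's own javax.xml.xpath engine rather than HtmlCleaner, so it only shows how the expressions behave under standard XPath 1.0; whether HtmlCleaner's evaluateXPath accepts normalize-space() is not confirmed here. The class name XPathDemo, the SAMPLE string, and the evaluate helper are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {
    // Trimmed, well-formed copy of the markup from the question (assumption:
    // the real page would first have to be cleaned into a DOM document).
    public static final String SAMPLE =
        "<html><body>"
      + "<div class=\"recipeBox \">"
      + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
      + "<div class=\"detailBox\"><h3>Catholic Blended Margaritas</h3></div>"
      + "</a></div>"
      + "</body></html>";

    // Hypothetical helper: parses the markup and returns the string value
    // of every node selected by the XPath expression.
    public static List<String> evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String original =
            "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";
        String suggested =
            "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()";

        // The class value is "recipeBox " (trailing space), so nothing matches.
        System.out.println(evaluate(SAMPLE, original));   // prints []
        // Trimmed class comparison plus the descendant step finds the heading.
        System.out.println(evaluate(SAMPLE, suggested));  // prints [Catholic Blended Margaritas]
    }
}
```

Against the full page, the suggested expression should select one text node per recipe box, which is what the question asks for.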