How to access inner HTML content with the CSS engine in the extractor plugin when filtering


I have configured Apache Nutch and Solr with the extractor plugin to filter HTML content. How can I access the content of an inner div using the CSS engine or the XPath engine? Thanks in advance.


Just use the "text" function. For instance, if your HTML looks like this:

<div class="target">
    Hello <span>World!</span>
</div>

Then your extract-to rule would look similar to this:

<extract-to field="my-field">
   <text>
       <expr value=".target"/>
   </text>
</extract-to>
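
To make it easier to see what a rule like this selects, here is a minimal Python sketch that mimics "text of the elements matching .target" on the sample HTML above. It does not use the extractor plugin at all: the names ClassTextExtractor, text_of and SAMPLE_HTML are made up for illustration, and the whitespace collapsing at the end is an assumption, so the plugin may trim or join the text differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

SAMPLE_HTML = """
<div class="target">
    Hello <span>World!</span>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collects all text nested inside elements that carry a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # raw text pieces found inside the match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a matching element, keep counting for its children.
        if self.depth > 0 or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_of(html, class_name):
    """Return the text inside elements with class_name, whitespace collapsed."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    # Assumption: runs of whitespace are collapsed to single spaces.
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(text_of(SAMPLE_HTML, "target"))
```

The point of the sketch is that the text of the nested span is included along with the div's own text, which is what you would want in my-field for the inner content.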