Web-Harvest -- remove unusual characters

I'm trying to scrape a page that has some spaces after the anchors:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

I can't find a way to specify this text in a pattern: either I trigger a processor error, or the pattern fails to match the string. Everything after the </a> causes the html-to-xml conversion to fail, since the XML is not well formed when those characters are included. So I need to remove everything after the </a> (note that elsewhere in the document the </a> is followed by a div tag or something else).

My code:

<xpath expression="/">
    <regexp replace="true">
        <regexp-pattern>(nbsp;)</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <template></template>
        </regexp-result>
    </regexp>
</xpath>

I think my problem is with the regexp-pattern. I've tried:

    &nbsp;
    \&nbsp;
    \s+\|\s+

among other things. I even tried putting the expression in a CDATA element, but I couldn't get that to work either.

Any thoughts?

Answer:

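To match &nbsp; in regexp-pattern, you can try \u00A0. That is the regex escape for U+00A0, the no-break space character that the &nbsp; entity stands for.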
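
As a rough sketch of how that could look in the configuration from the question (untested, and assuming html-to-xml hands the no-break space to the regexp processor as the actual U+00A0 character rather than as an entity), only the pattern changes. The stray "nbsp;" and the duplicate "meta " entries are dropped from prunetags here on the assumption that they were leftovers from earlier attempts:

```xml
<xpath expression="/">
    <regexp replace="true">
        <!-- \u00A0 is the Java regex escape for U+00A0, the no-break space
             that &nbsp; stands for. This pattern matches a literal pipe
             surrounded by runs of no-break spaces. -->
        <regexp-pattern>\u00A0+\|\u00A0+</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,p,base,br,link,img,image,input,option">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <!-- An empty template replaces each match with nothing. -->
            <template></template>
        </regexp-result>
    </regexp>
</xpath>
```

Two things may explain why the earlier attempts failed. In Java regular expressions, \s does not match U+00A0 by default, so \s+\|\s+ would not see the no-break spaces. And a bare &nbsp; is not a predefined entity in XML, so writing it directly in the configuration makes the configuration file itself ill-formed; if the text really does reach the pattern as the literal entity, the ampersand would need to be written as &amp;nbsp; in the config.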