Web harvest -- remove unusual characters

710 Views Asked by user991945 At 23 May 2025 at 16:56

I'm trying to scrape a page that has some spaces after the anchors:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

I can't seem to find a way to specify the text: I either trigger a processor error or fail to detect the string itself. Everything AFTER the </a> causes the html-to-xml conversion to fail, since the XML is not well formed when the characters are included. So I need to remove everything AFTER the </a> (note that elsewhere in the doc there are other places where a div tag or something else follows the </a>).

My code:

<xpath expression="/">
    <regexp replace="true">
        <regexp-pattern>(nbsp;)</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <template></template>
        </regexp-result>
    </regexp>
</xpath>

I think my problem is with the regexp-pattern. I've tried:



    &nbsp;
    \& nbsp;  (without the space in between -- SO doesn't display that correctly)
    \s+\|\s+

among other things. I even tried to put the expression in a CDATA element, but I can't get this to work either.

Any thoughts?


There is 1 best solution below

Alexander On 08 December 2012 at 22:21

For &nbsp; in regexp-pattern, you can try \u00A0 (the Unicode escape for the non-breaking space character).
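To see why \s+\|\s+ did not match while \u00A0 can, here is a minimal Java sketch. It assumes Web-Harvest hands the pattern to java.util.regex (it is a Java tool, but I have not checked its source), and the class and method names (NbspDemo, stripNbsp) are made up for illustration. It also covers the case where the text still contains the undecoded entity &nbsp; rather than the decoded character; which of the two you actually have depends on where in the pipeline the regexp runs.

```java
import java.util.regex.Pattern;

class NbspDemo {
    // Regex escape for the decoded non-breaking space character (U+00A0).
    private static final Pattern NBSP_CHAR = Pattern.compile("\\u00A0");
    // Literal entity text, in case the input has not been decoded yet.
    private static final Pattern NBSP_ENTITY = Pattern.compile("&nbsp;");

    // Removes both the decoded character and the undecoded entity.
    static String stripNbsp(String input) {
        String withoutChars = NBSP_CHAR.matcher(input).replaceAll("");
        return NBSP_ENTITY.matcher(withoutChars).replaceAll("");
    }

    // True if Java's default \s class finds any match in the input.
    static boolean defaultWhitespaceMatches(String input) {
        return Pattern.compile("\\s").matcher(input).find();
    }

    public static void main(String[] args) {
        // The snippet from the question, once with decoded characters
        // and once with the raw entity text.
        String decoded = "</a>\u00A0\u00A0|\u00A0\u00A0";
        String raw = "</a>&nbsp;&nbsp;|&nbsp;&nbsp;";

        // By default \s covers only space, tab, newline, vertical tab,
        // form feed and carriage return, so U+00A0 is not matched.
        System.out.println(defaultWhitespaceMatches(decoded));
        System.out.println(stripNbsp(decoded));
        System.out.println(stripNbsp(raw));
    }
}
```

If the regexp is applied to the raw HTTP response (before html-to-xml), the entity form is the one to target; inside the XML config the ampersand would presumably need escaping as &amp;nbsp; in regexp-pattern.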
