jericho-html - text extraction and incorrect text length

634 Views Asked by user592704 At 03 August 2013 at 01:32

Today I tried to use the jericho-html-3.2 library to extract text from some simple HTML, and I ran into a strange problem with the text length.

If I have HTML like this:

Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>

...my RichTextArea getText().length() returns 42, which is the correct length. But when I try to extract the text from this HTML with code like this:

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

...text.length() returns 44.

So I don't understand why text of length 42 turns into text of length 44. How can I fix it?

Thanks


There are 2 best solutions below

user592704 On 05 August 2013 at 01:04 BEST ANSWER

I dug into it further, and I think the wrong text length comes from the HTML line breaks: for some reason the Jericho HTML parser seems to replace the <br> line breaks with spaces or something similar.
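A minimal sketch of that guess, using only the JDK and not Jericho itself. It assumes the extractor turns each <br> into whitespace, collapses runs of whitespace into a single space, and trims the ends; that behaviour is an assumption here, and the class and method names are made up for illustration. Under that model the two <br><br> pairs between the three phrases each become one space, which would account for the two extra characters:

```java
public class BrSpacingModel {
    // Assumed model of the extractor (not Jericho's actual code):
    // every <br> becomes a space, whitespace runs collapse, ends are trimmed.
    static String brAsSpace(String html) {
        return html.replace("<br>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";

        // Tags simply dropped: three 14-character phrases back to back.
        System.out.println(html.replace("<br>", "").length()); // 42

        // Tags modelled as whitespace: one space survives between phrases.
        System.out.println(brAsSpace(html).length());          // 44
    }
}
```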

For now, I can't say for sure which other tags it replaces, or with which characters, but in my case I worked around it by removing the tags with a regular expression before parsing:

    // Strip the <br> tags before parsing so they cannot turn into extra characters
    html = html.replaceAll("<br>", "");

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

...so now it really returns the original text length of 42 :)
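The pattern "<br>" only matches that exact spelling. If the HTML may also contain <br/>, <br /> or <BR>, a slightly broader pattern could be used. This is just a sketch; the class and method names are hypothetical, and it only pre-processes the string before it is handed to Jericho:

```java
import java.util.regex.Pattern;

public class BrStripper {
    // Matches <br>, <br/>, <br /> and <BR>, including tags with attributes.
    private static final Pattern BR =
            Pattern.compile("<br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes every line-break tag without leaving anything in its place.
    static String stripBr(String html) {
        return BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><BR/>Hello World :(<br /><br>Hello World ;)<br>";
        System.out.println(stripBr(html).length()); // 42
    }
}
```

Note that removing the tags outright glues the text on both sides together, so it only suits cases like this one, where the line breaks should not count toward the length at all.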

I hope this tip saves someone's day.

Thank you all for the help.

Abhijith Nagaraja On 04 August 2013 at 05:55

It really is 44. You need to count each <br> tag as one character, each space as one character, and each smiley as one character.

H(1)e(2)l(3)l(4)o(5) (6)W(7)o(8)r(9)l(10)d(11) (12):)(13)<br>(14)<br>(15)H(16)e(17)l(18)l(19)o(20) (21)W(22)o(23)r(24)l(25)d(26) (27):((28)<br>(29)<br>(30)H(31)e(32)l(33)l(34)o(35) (36)W(37)o(38)r(39)l(40)d(41) (42);)(43)<br>(44)
