jericho-html - text extraction and incorrect text length


Today I tried to use the jericho-html-3.2 library to extract text from simple HTML, and I ran into a strange text length problem.

If I have HTML like this:

Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>

...my RichTextArea's getText().length() returns 42, which is the correct length. But when I extract the text from this HTML with code like this:

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

... text.length() returns 44.

So I don't understand why text of length 42 turns into text of length 44, or how to fix it.

Thanks


There are 2 best solutions below

0
On BEST ANSWER

I dug deeper, and I think the wrong text length comes from the HTML line breaks: the Jericho HTML parser apparently replaces the <br> tags with spaces. One space between each pair of "Hello World" segments would account for the two extra characters.
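To illustrate where the two extra characters could come from, here is a minimal sketch that uses only the Java standard library. It is not Jericho's implementation; the helper `extractLikeSuspected` is a hypothetical name, and it simply assumes that each <br> is treated as whitespace, that runs of whitespace collapse to a single space, and that leading and trailing whitespace is dropped.

```java
class BrCollapseDemo {
    // Hypothetical helper: approximates the suspected behavior with plain regexes.
    // This is an assumption for illustration, not Jericho's actual code.
    static String extractLikeSuspected(String html) {
        return html
                .replaceAll("(?i)<br\\s*/?>", " ") // treat each <br> as whitespace
                .replaceAll("\\s+", " ")           // collapse whitespace runs to one space
                .trim();                           // drop leading/trailing whitespace
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";
        String text = extractLikeSuspected(html);
        System.out.println(text);          // Hello World :) Hello World :( Hello World ;)
        System.out.println(text.length()); // 44 = 3 * 14 characters + 2 separating spaces
    }
}
```

Under these assumptions the result matches the reported length: three 14-character segments plus two separating spaces.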

For now I can't say for sure which other tags it replaces, or with which characters, but in my case I used a workaround based on a regular expression:

    html = html.replaceAll("<br>", "");

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

... so now it returns the original text length of 42 :)
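The pattern in the workaround only matches the exact lowercase <br>. If the HTML might also contain spellings such as <BR>, <br/> or <br />, a slightly more tolerant variant could look like the sketch below. The helper name `stripLineBreaks` is hypothetical and not part of Jericho, and the pattern deliberately does not handle <br> tags that carry attributes.

```java
class BrStripDemo {
    // Hypothetical helper: removes <br> tags in their common spellings,
    // case-insensitively, with an optional self-closing slash.
    static String stripLineBreaks(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", "");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><BR/>Hello World :(<br /><br>Hello World ;)<br>";
        String stripped = stripLineBreaks(html);
        System.out.println(stripped);          // Hello World :)Hello World :(Hello World ;)
        System.out.println(stripped.length()); // 42
    }
}
```

The stripped string can then be passed to new Source(...) exactly as in the snippet above.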

I hope this tip saves someone's day.

Thank you all for your help.

1
On

It is 44 only. You need to count each <br> tag as one character, each space as one character, and each smiley as one character.

H(1)e(2)l(3)l(4)o(5) (6)W(7)o(8)r(9)l(10)d(11) (12):)(13)<br>(14)<br>(15)H(16)e(17)l(18)l(19)o(20) (21)W(22)o(23)r(24)l(25)d(26) (27):((28)<br>(29)<br>(30)H(31)e(32)l(33)l(34)o(35) (36)W(37)o(38)r(39)l(40)d(41) (42);)(43)<br>(44)