Java: Split a sentence using an unknown character?


I know many people have asked about splitting sentences, but my question is slightly different. My String data contains a character I cannot identify (it looks like a tab character), and I am trying to use it as the delimiter for splitting.

The source text is below (try selecting the blank portions to see the effect):

The President   Profile of the President
Swearing in of the President
Assets of the President
Speeches    Speeches
Foreign Visits
Press Releases
Gallery Photo Gallery
Video Gallery
Rashtrapati Bhavan  Panoramic View

I thought the blank portion might be a tab character, but I was wrong: matching against a tab had no effect.

Then I opened the string in Notepad++ and enabled the option to show all characters. There I found this character; please refer to the image below.

enter image description here

In the above diagram, one can clearly see an arrow symbol ("----->") in orange. Which character is this? Its width is not fixed, so how can I split the sentences on it? Has anybody faced this problem?


There are 3 best solutions below

0
On BEST ANSWER

I found the answer by accident. The spaces (or arrows) shown in the picture above are the &nbsp; HTML entity, i.e. the non-breaking space (U+00A0). That is why I was unable to split the sentence. The output shown above came from the Tika parser, which I used to fetch an HTML URL and extract the page data before breaking it into sentences.
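As a minimal sketch of what that means in practice, assuming the delimiter really is U+00A0: by default `\s` in java.util.regex matches only ASCII whitespace, so neither `\t` nor `\s` will match a non-breaking space, but naming the code point explicitly does. The class name, method name, and sample string here are hypothetical, modelled on the question's text.

```java
import java.util.Arrays;

public class NbspSplit {
    // Splits on one or more consecutive non-breaking spaces (U+00A0).
    // The regex engine itself resolves the escape inside the pattern string.
    static String[] splitOnNbsp(String text) {
        return text.split("\\u00A0+");
    }

    public static void main(String[] args) {
        // Hypothetical sample: three non-breaking spaces between the two fields.
        String line = "The President\u00A0\u00A0\u00A0Profile of the President";
        System.out.println(Arrays.toString(splitOnNbsp(line)));
        // prints [The President, Profile of the President]
    }
}
```

The `+` quantifier treats a run of non-breaking spaces as a single delimiter, which is one way to cope with the variable width mentioned in the question.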

0
On

In such cases I usually open the file in a hex editor and check the exact character code. However, if you want to split on any unknown character, you can use a negated character class, [^...]. Here's an example that splits the string on any character that is not a word character (letter, digit, or underscore) or a space:

String[] fields = inputStr.split("[^\\w ]");
0
On

You probably want to convert a portion of your text to Unicode escapes in order to observe the code points.

Once you've figured out which code point corresponds to the whitespace character(s) you're looking for, you can use it in your split invocation as part of the pattern, with the \uhhhh idiom.
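One possible way to do that conversion, as a rough sketch: walk the string and print every character outside printable ASCII as a four-digit hex escape. The class and method names are made up for illustration.

```java
public class CodePointDump {
    // Returns the text with every control or non-ASCII char replaced by
    // a Java-style Unicode escape, so invisible characters become visible.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                // Four uppercase hex digits, zero-padded.
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample containing a non-breaking space.
        System.out.println(escapeNonAscii("Gallery\u00A0Photo Gallery"));
    }
}
```

A tab would show up as `\u0009` and a non-breaking space as `\u00A0`, which makes the two easy to tell apart even though they can look alike in an editor.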

Quoting the docs:

\xhh The character with hexadecimal value 0xhh

\uhhhh The character with hexadecimal value 0xhhhh

\x{h...h} The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT <= 0xh...h <= Character.MAX_CODE_POINT)
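To illustrate the three forms quoted above, here is a small sketch that compiles each one for the code point U+00A0 (chosen only as an example value) and splits the same input with it. The class name and sample text are hypothetical.

```java
import java.util.regex.Pattern;

public class EscapeForms {
    // Three spellings of the same single-character pattern (U+00A0).
    static final Pattern HEX2 = Pattern.compile("\\xA0");
    static final Pattern UNICODE4 = Pattern.compile("\\u00A0");
    static final Pattern BRACED = Pattern.compile("\\x{A0}");

    public static void main(String[] args) {
        // Hypothetical sample with one non-breaking space as the delimiter.
        String text = "Gallery\u00A0Photo Gallery";
        for (Pattern p : new Pattern[] {HEX2, UNICODE4, BRACED}) {
            System.out.println(String.join(" | ", p.split(text)));
        }
    }
}
```

Each pattern yields the same two fields, so the choice among the three is mostly a matter of how many hex digits the code point needs.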