Split rich text into a list, including the rich text tags


I'm trying to find a way to split this properly, but so far I keep running into issues using string.Split, string.Substring, string.IndexOf, string.Replace and so on.

Here is a sample string that needs to be split into a list:

We are <b><i>very</i></b><b>a</b>mused!\nThank you.

The resulting list should contain these items, in this order:

0: We
1: are
2: <b>
3: <i>
4: very
5: </i>
6: </b>
7: <b>
8: a
9: </b>
10: mused!
11: \n
12: Thank
13: you.

So what I am trying to do is this:

splitStart = baseString.Value.Split(' ');
foreach (string part in splitStart)
{
    if (part.Contains("<"))
    {
        // get the parts <b>  <i>  <size>  <color>  </b>  </i>  </size> </color> \n
        textlist.Add(part); // add each part to list
    }
    else
    {
        textlist.Add(part);
        Debug.Log(part);
    }
}

I tried things like

Contains("<n>")
Replace("<n>", "") and then add "<n>" to the array

but that can break the sequence.

Edit: I forgot to say that this is for C#.


There is 1 best solution below


I think you need to pre-process the text with an HTML parser such as Jsoup, or with a tree-structure algorithm.

One option is to handle this case with the Jsoup library.

1. Java version

First, prepare a list to hold the pieces extracted from the HTML.

final List<String> wordList = new ArrayList<String>();

Then traverse the HTML content using Jsoup's NodeVisitor interface.

doc.body().traverse(
            new NodeVisitor(){

                @Override
                public void head(Node arg0, int arg1) {
                    if(arg1 == 1)
                    {
                        String value = arg0.outerHtml();
                        if(!wordList.contains(value))
                            wordList.add(arg0.outerHtml());
                    }
                }

                @Override
                public void tail(Node arg0, int arg1) {

                }
            }
        );

Finally, the complete code is as follows.

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

public class HtmlTest {

    public static String parseHtml( String str ) {
        org.jsoup.nodes.Document doc = Jsoup.parse(str);

        final List<String> wordList = new ArrayList<String>();

        doc.body().traverse(
            new NodeVisitor(){

                @Override
                public void head(Node arg0, int arg1) {
                    if(arg1 == 1)
                    {
                        //String value = Jsoup.parse(arg0.outerHtml()).text();
                        String value = arg0.outerHtml();
                        if(!wordList.contains(value))
                            wordList.add(arg0.outerHtml());

                    }

                }

                @Override
                public void tail(Node arg0, int arg1) {

                }
            }
        );


        for(String word: wordList)
        {
            System.out.println(word);
        }

        return "";
    }

    public static void main(String[] args)
    {
        System.out.println(parseHtml( "We are <b><i>very</i></b><b>a</b>mused!\nThank you." ));
    }
}

The output should look like this:

We are 
<b><i>very</i></b>
<b>a</b>
mused! Thank you.

2. C# version

The C# version of the source code is a little different, but the process is the same (with a small change needed).

This is my NodeVisitor version of the code.

First, parse the HTML content.

Document doc = NSoupClient.Parse(str);

Second, select the 'body' tag and traverse its content, which is the original sentence.

doc.Select("body").Traverse(new TestNodeVisitor(wordList));

The complete code is as follows.

using NSoup;
using NSoup.Nodes;
using NSoup.Select;
using System;
using System.Collections.Generic;
using System.IO;
namespace NSoupTest
{

    class Program
    {

        private class TestNodeVisitor : NodeVisitor
        {
            List<String> wordList;

            public TestNodeVisitor(List<String> wordList)
            {
                this.wordList = wordList;
            }

            public void Head(Node node, int depth)
            {
                if(depth == 1)
                {
                    String value = node.OuterHtml();

                    if(!wordList.Contains(value))
                        wordList.Add(value);
                }

            }

            public void Tail(Node node, int depth)
            {

            }
        }


        public static String parseHtml( String str ) {
            Document doc = NSoupClient.Parse(str);


            List<String> wordList = new List<String>();

            doc.Select("body").Traverse(new TestNodeVisitor(wordList));


            foreach (String word in wordList)
            {

                Console.WriteLine(word);
            }

            return "";
        }

        static void Main(string[] args)
        {
            try
            {
                parseHtml("We are <b><i>very</i></b><b>a</b>mused!\nThank you.");
            }
            catch (FileNotFoundException fe) {
                Console.WriteLine(fe.Message);
            }

        }
    }
}

The output should also be:

We are
<b><i>very</i></b>
<b>a</b>
mused! Thank you.

You can find the NSoup library I used (actually an unofficial version, 0.8.0) on the site.

The official NSoup site is here, but that version has no visitor interface.

From there, you can use your own method to complete the code.
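As a rough sketch of what such a method could look like, here is a plain regular-expression tokenizer in Java that splits the sample string into tags, words and line breaks. This is not part of Jsoup or NSoup; the class name RichTextTokenizer and the method tokenize are made up for illustration, and it assumes the text only contains simple tags such as <b> or </i> and that "\n" is a real newline character. The same pattern should carry over to C#'s Regex class with little change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or NSoup.
class RichTextTokenizer {

    // A token is one of, tried in this order:
    //   1. a tag such as <b> or </i>
    //   2. a single newline character
    //   3. a run of characters that are neither whitespace nor '<'
    // Spaces between tokens are skipped, as is a stray '<' with no matching '>'.
    private static final Pattern TOKEN = Pattern.compile("<[^<>]+>|\\n|[^\\s<]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group()); // keep tokens in their original order
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens =
            tokenize("We are <b><i>very</i></b><b>a</b>mused!\nThank you.");
        for (int i = 0; i < tokens.size(); i++) {
            // Show the newline token as \n so it stays visible in the output.
            System.out.println(i + ": " + tokens.get(i).replace("\n", "\\n"));
        }
    }
}
```

For the sample string this gives 14 items, from "We" at index 0 to "you." at index 13, with each opening and closing tag as its own entry. Unlike the contains check in the visitor code, nothing is de-duplicated here, so a repeated <b> stays in the list each time it appears.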

Note that this is just one option for reaching your goal.

Regards,