How to split a string by multiple separators - and know which separator matched

With String.split it is easy to split a string by multiple separators. You just need to define a regular expression that matches all the separators you want to use. For example

"1.22-3".split("[.-]")

results in an array with the elements "1", "22", and "3". So far so good.

Now, however, I also need to know which of the separators was found between the segments. Is there a straightforward way to achieve this?
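To make the problem concrete, here is a small sketch (the class and method names are hypothetical, chosen only for illustration). Two inputs that differ only in their separators produce identical results from String.split, so the separator information cannot be recovered afterwards.

```java
import java.util.Arrays;
import java.util.List;

class SplitDropsSeparators {

    // Splits on '.' or '-' and returns only the segments;
    // the separators that matched are discarded by String.split.
    static List<String> segments(String source) {
        return Arrays.asList(source.split("[.-]"));
    }

    public static void main(String[] args) {
        System.out.println(segments("1.22-3")); // prints [1, 22, 3]
        System.out.println(segments("1-22.3")); // prints [1, 22, 3] as well
    }
}
```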

I looked at String.split, its legacy predecessor StringTokenizer, and other supposedly more modern libraries (e.g. StrTokenizer from Apache Commons), but none of them gives me access to the matched separator.

There are 2 best solutions below

2
On BEST ANSWER

It’s quite simple if you retrace what String.split(regex) does and record the information which String.split ignores:

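import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = "1.22-3";
Matcher m = Pattern.compile("[.-]").matcher(source);
ArrayList<String> elements = new ArrayList<>();
ArrayList<String> separators = new ArrayList<>();
int pos;
for (pos = 0; m.find(); pos = m.end()) {
    elements.add(source.substring(pos, m.start())); // text before this separator
    separators.add(m.group());                      // the separator that actually matched
}
elements.add(source.substring(pos)); // remainder after the last separator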

At the end of this code, separators.get(x) yields the separator between elements.get(x) and elements.get(x+1). It follows that separators always contains exactly one item fewer than elements.
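As a quick check of that relationship, the following sketch wraps the same loop in a small class (the class name SplitWithSeparators and the rejoin method are hypothetical additions, not part of the code above). Interleaving the two lists again reproduces the original input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitWithSeparators {

    final List<String> elements = new ArrayList<>();
    final List<String> separators = new ArrayList<>();

    SplitWithSeparators(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        int pos = 0;
        while (m.find()) {
            elements.add(source.substring(pos, m.start())); // text before the separator
            separators.add(m.group());                      // the separator itself
            pos = m.end();
        }
        elements.add(source.substring(pos)); // text after the last separator
    }

    // Re-interleaves elements and separators: separators.get(i) sits
    // between elements.get(i) and elements.get(i + 1).
    String rejoin() {
        StringBuilder sb = new StringBuilder(elements.get(0));
        for (int i = 0; i < separators.size(); i++) {
            sb.append(separators.get(i)).append(elements.get(i + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SplitWithSeparators s = new SplitWithSeparators("1.22-3", "[.-]");
        System.out.println(s.elements);   // prints [1, 22, 3]
        System.out.println(s.separators); // prints [., -]
        System.out.println(s.rejoin());   // prints 1.22-3
    }
}
```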

If you want to have elements and separators in one list, just change the code so that both are added to the same list instead of two separate ones. The items are already added in order of occurrence.
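A minimal sketch of that single-list variant might look as follows (the class and method names are hypothetical). With this loop, elements end up at even indices and separators at odd indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitKeepingSeparators {

    // Returns elements and separators in one list, in order of occurrence.
    static List<String> tokens(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (m.find()) {
            tokens.add(source.substring(pos, m.start())); // element (even index)
            tokens.add(m.group());                        // separator (odd index)
            pos = m.end();
        }
        tokens.add(source.substring(pos)); // trailing element
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokens("1.22-3", "[.-]")); // prints [1, ., 22, -, 3]
    }
}
```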

5
On

I think I was looking at the wrong algorithm for what I was trying to achieve. Instead of using methods to split by separators, the following two-step approach was more successful:

  • First, I implemented a lexer (aka tokenizer, scanner) that splits the string into tokens which include the separators, i.e. it splits "1.22-3" into the tokens "1", ".", "22", "-", "3".

  • Then, I implemented a parser which interprets this token stream, i.e. distinguishes segments from their separators.
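The parser step is not shown in this answer. As a rough sketch of what it could look like, assuming the token list produced by the lexer step and a known set of separator strings, the following code pairs each segment with the separator that preceded it. The names TokenStreamParser, Segment, and separatorBefore are hypothetical and only serve this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenStreamParser {

    // A segment together with the separator that preceded it (null for the first segment).
    static final class Segment {
        final String separatorBefore;
        final String text;

        Segment(String separatorBefore, String text) {
            this.separatorBefore = separatorBefore;
            this.text = text;
        }

        @Override
        public String toString() {
            return text + " (after " + separatorBefore + ")";
        }
    }

    // Walks the token stream and attaches the most recent separator to the next segment.
    // Simplification: if several separators occur in a row, only the last one is kept.
    static List<Segment> parse(List<String> tokens, Set<String> separators) {
        List<Segment> segments = new ArrayList<>();
        String pending = null; // separator seen since the last segment, if any
        for (String token : tokens) {
            if (separators.contains(token)) {
                pending = token;
            } else {
                segments.add(new Segment(pending, token));
                pending = null;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("1", ".", "22", "-", "3");
        Set<String> separators = new HashSet<>(Arrays.asList(".", "-"));
        for (Segment segment : parse(tokens, separators)) {
            System.out.println(segment);
        }
        // prints:
        // 1 (after null)
        // 22 (after .)
        // 3 (after -)
    }
}
```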


Possible implementation of the lexer:

import java.util.ArrayList;
import java.util.List;

public final class FixedStringTokenScanner {

    /**
     * Splits the given input into tokens. Each token is either one of the given constant string
     * tokens or a string consisting of the other characters between the constant tokens.
     *
     * @param input
     *            The string to split.
     * @param fixedStringTokens
     *            A list of strings to be recognized as separate tokens.
     * @return A list of strings, which when concatenated would result in the input string.
     *         Occurrences of the fixed string tokens in the input string are returned as separate
     *         list entries. These entries are reference-equal to the respective fixedStringTokens
     *         entry. Characters which did not match any of the fixed string tokens are concatenated
     *         and returned as list entries at the respective positions in the list. The list does
     *         not contain empty or <code>null</code> entries.
     */
    public static List<String> splitToFixedStringTokensAndOtherTokens(final String input, final String... fixedStringTokens) {
        return new FixedStringTokenScannerRun(input, fixedStringTokens).splitToFixedStringAndOtherTokens();
    }

    private static class FixedStringTokenScannerRun {

        private final String input;
        private final String[] fixedStringTokens;

        private int scanIx = 0;
        private final StringBuilder otherContent = new StringBuilder();
        private final List<String> result = new ArrayList<>();

        public FixedStringTokenScannerRun(final String input, final String[] fixedStringTokens) {
            this.input = input;
            this.fixedStringTokens = fixedStringTokens;
        }

        List<String> splitToFixedStringAndOtherTokens() {
            while (scanIx < input.length()) {
                scanIx += matchFixedStringOrAppendToOther();
            }
            storeOtherTokenIfNotEmpty();
            return result;
        }

        /**
         * @return the number of matched characters.
         */
        private int matchFixedStringOrAppendToOther() {
            for (String fixedString : fixedStringTokens) {
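                // Skip empty tokens: they would match with length 0 and stall the scan loop.
                if (!fixedString.isEmpty() && input.regionMatches(scanIx, fixedString, 0, fixedString.length())) {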
                    storeOtherTokenIfNotEmpty();
                    result.add(fixedString); // add string instance so that identity comparison works
                    return fixedString.length();
                }
            }
            appendCharacterToOther();
            return 1;
        }

        private void appendCharacterToOther() {
            otherContent.append(input.charAt(scanIx));
        }

        private void storeOtherTokenIfNotEmpty() {
            if (otherContent.length() > 0) {
                result.add(otherContent.toString());
                otherContent.setLength(0);
            }
        }
    }
}