Get the list of objects containing text matching a pattern


I'm currently working with the Apache POI API and I'm trying to edit a Word document (*.docx) with it. A document is composed of paragraphs (XWPFParagraph objects), and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (the split depends on the text properties, but it sometimes looks random). My document can contain specific tags which I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).

So for example, if I process a paragraph containing the text "Some text with a tag <#SOMETAG#>", I could get something like this:

XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>
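
Since getText() returns the whole paragraph text regardless of how the runs are split, tags can be detected at that level before touching any run. Here is a minimal sketch of that idea, working on a plain String that stands in for paragraph.getText(). The class TagFinder and the method findTagNames are hypothetical names used for illustration, not part of Apache POI, and the sketch assumes tag names never contain the closing delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Apache POI): finds tag names in plain text.
final class TagFinder {

    // Returns the name of every tag found in text, e.g. "SOMETAG" for "<#SOMETAG#>".
    // open and close are literal delimiters; Pattern.quote keeps characters
    // such as '{' from being interpreted as regex syntax.
    static List<String> findTagNames(String text, String open, String close) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(open) + "(.+?)" + Pattern.quote(close));
        Matcher matcher = pattern.matcher(text);
        List<String> names = new ArrayList<>();
        while (matcher.find()) {
            names.add(matcher.group(1)); // text between the two delimiters
        }
        return names;
    }
}
```

Calling TagFinder.findTagNames(paragraph.getText(), "<#", "#>") would then list the tags to handle in that paragraph, and the same call works with {{{ and }}} as delimiters.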

But if I want to edit the text of that paragraph, I need to process the runs, and the number of runs is not fixed. So if I print the content of the runs with this code:

System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
    System.out.println(run.text());
}

Sometimes it can be like this:

// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>

And other times like this:

// Output:
// Number of runs: 4
// Some text with a tag 
// <#
// SOMETAG
// #>

What I need is to get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to write a first version of that algorithm, but it only works if the beginning of the tag (<#) and the end of the tag (#>) are not themselves split across runs. Here's what I've already done.

So what I would like is an algorithm capable of handling that problem and, if possible, one that works with any given tag delimiters (not necessarily <# and #>, so I could replace them with something like {{{ and }}}).
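
To make the goal concrete, here is one possible sketch of finding which runs a tag spans. It works on a plain List<String> standing in for the values returned by run.text(), so it does not depend on POI. The class TagSpanLocator and the method locate are hypothetical names, and the approach (join the texts, find the tag, map character offsets back to run indexes) is only one way to do it.

```java
import java.util.List;

// Hypothetical helper (not part of Apache POI): maps a tag to the runs it covers.
final class TagSpanLocator {

    // Returns {firstRunIndex, lastRunIndex} for the first occurrence of tag,
    // or null if the joined run texts do not contain it.
    static int[] locate(List<String> runTexts, String tag) {
        if (tag.isEmpty()) {
            throw new IllegalArgumentException("tag must not be empty");
        }
        StringBuilder joined = new StringBuilder();
        for (String text : runTexts) {
            joined.append(text);
        }
        int tagStart = joined.indexOf(tag);
        if (tagStart < 0) {
            return null;
        }
        int tagEnd = tagStart + tag.length() - 1; // offset of the tag's last character

        int first = -1;
        int last = -1;
        int offset = 0; // offset of the current run's first character
        for (int i = 0; i < runTexts.size(); i++) {
            int runEnd = offset + runTexts.get(i).length(); // exclusive
            if (first < 0 && tagStart < runEnd) {
                first = i; // run holding the first character of the tag
            }
            if (tagEnd < runEnd) {
                last = i; // run holding the last character of the tag
                break;
            }
            offset = runEnd;
        }
        return new int[] {first, last};
    }
}
```

Because the tag is passed as a literal string, the delimiters do not matter: "<#SOMETAG#>" and "{{{SOMETAG}}}" are handled the same way, even when the delimiters themselves are split across runs.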

Sorry if my English isn't perfect, don't hesitate to ask me to clarify any point you want.

Best answer

In the end I found the answer myself by completely rethinking my original algorithm (I commented the code in case it helps someone in the same situation I was in):

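import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Precondition: before calling this method, I make sure that
// paragraph.getText().contains(surroundedTag) == true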
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
    List<Integer> runsToRemove = new LinkedList<Integer>();
    StringBuilder tmpText = new StringBuilder();
    int runCursor = 0;

    // Process the runs in normal order until the accumulated text contains surroundedTag
    while (!tmpText.toString().contains(surroundedTag)) {
        tmpText.append(paragraph.getRuns().get(runCursor).text());
        runsToRemove.add(runCursor);
        runCursor++;
    }

    tmpText = new StringBuilder();
    // Processing back (in reverse order) to only keep the runs I need to edit/remove
    while (!tmpText.toString().contains(surroundedTag)) {
        runCursor--;
        tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
    }

    // Edit the first run of the tag
    XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
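    // String.replace treats the tag as literal text; replaceAll would parse it as a regex
    // and fail on delimiters such as {{{
    runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);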

    // Forget the runs I don't need to remove (everything up to and including the edited run)
    while (runCursor >= 0) {
        runsToRemove.remove(0);
        runCursor--;
    }

    // Remove the unused runs
    Collections.reverse(runsToRemove);
    for (Integer runToRemove : runsToRemove) {
        paragraph.removeRun(runToRemove);
    }
}

So now I process the runs of the paragraph in order until the accumulated text contains my surrounded tag, then I walk back through the runs to skip the leading ones that I don't need to edit.
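
The forward-then-backward scan described above can also be tried without a Word document. The sketch below applies the same idea to a plain List<String> standing in for the run texts and returns the resulting list instead of modifying a paragraph. RunTagReplacer and replaceTag are hypothetical names, and the sketch assumes a non-empty tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Apache POI): same scan, applied to plain strings.
final class RunTagReplacer {

    // Returns a new list in which the runs covering the first occurrence of tag
    // are merged into one entry, with tag replaced by replacement.
    static List<String> replaceTag(List<String> runTexts, String tag, String replacement) {
        // Forward pass: find the run at which the tag becomes complete
        StringBuilder forward = new StringBuilder();
        int end = -1;
        for (int i = 0; i < runTexts.size() && end < 0; i++) {
            forward.append(runTexts.get(i));
            if (forward.indexOf(tag) >= 0) {
                end = i;
            }
        }
        if (end < 0) {
            return new ArrayList<>(runTexts); // tag absent: nothing to change
        }

        // Backward pass: walk back from that run until the tag is fully covered
        StringBuilder backward = new StringBuilder();
        int start = end + 1;
        while (backward.indexOf(tag) < 0) {
            start--;
            backward.insert(0, runTexts.get(start));
        }

        // Keep the leading runs, merge the tag runs, keep the trailing runs
        List<String> result = new ArrayList<>(runTexts.subList(0, start));
        result.add(backward.toString().replace(tag, replacement));
        result.addAll(runTexts.subList(end + 1, runTexts.size()));
        return result;
    }
}
```

With the four runs from the question and the replacement "value", this yields two entries: "Some text with a tag " and "value". Unlike the method above, it returns the input unchanged when the tag is absent rather than relying on the caller to check first.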