Dealing with Java control characters in an Apache Commons Text CharSequenceTranslator


I'm working on an application that lets a user compose a message and submit it to a third party (modo communicate), where other users can view it. The catch is that modo interprets the text as Markdown/HTML, so we need to escape the outgoing text so that characters the user typed are not unexpectedly rendered as formatting.

Since there doesn't seem to be an existing library for markdown escaping, I ended up creating a custom org.apache.commons.text.translate.CharSequenceTranslator. The translator code looks like:

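import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.JavaUnicodeEscaper;
import org.apache.commons.text.translate.LookupTranslator;

public static final CharSequenceTranslator ESCAPE_MARKDOWN;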
static {
    final Map<CharSequence, CharSequence> escapeMarkdownMap = new HashMap<>();
    escapeMarkdownMap.put("`", "\\`");
    escapeMarkdownMap.put("\\", "\\\\");
    escapeMarkdownMap.put("*", "\\*");
    escapeMarkdownMap.put("_", "\\_");
    escapeMarkdownMap.put("{", "\\{");
    escapeMarkdownMap.put("}", "\\}");
    escapeMarkdownMap.put("[", "\\[");
    escapeMarkdownMap.put("]", "\\]");
    escapeMarkdownMap.put("<", "\\<");
    escapeMarkdownMap.put(">", "\\>");
    escapeMarkdownMap.put("(", "\\(");
    escapeMarkdownMap.put(")", "\\)");
    escapeMarkdownMap.put("#", "\\#");
    escapeMarkdownMap.put("+", "\\+");
    escapeMarkdownMap.put("-", "\\-");
    escapeMarkdownMap.put(".", "\\.");
    escapeMarkdownMap.put("!", "\\!");
    escapeMarkdownMap.put("|", "\\|");
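    ESCAPE_MARKDOWN = new AggregateTranslator(
            // Backslash-escapes the Markdown punctuation listed above.
            new LookupTranslator(Collections.unmodifiableMap(escapeMarkdownMap)),
            // Rewrites CR, LF, tab, etc. as the two-character sequences \r, \n, \t.
            new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE),
            // Rewrites anything outside 0x20..0x7E as a Java Unicode escape.
            JavaUnicodeEscaper.outsideOf(32, 0x7e)
    );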
}

I call the translator with:

@Test
void translateMarkdown()
{
    String expected = "\\*\\*\\\\abc123\\`\\*\\_\\{\\}\\[\\]\\<\\>\\(\\)\\#\\+\\-\\.\\!\\|\\*\\*\r\na new line";
    String actual = Message.ESCAPE_MARKDOWN.translate("**\\abc123`*_{}[]<>()#+-.!|**\r\na new line");
    assertThat(actual).isEqualTo(expected);
}

and get the output:

org.opentest4j.AssertionFailedError:  
expected:    
  "\*\*\\abc123\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\*\*   
  a new line"  
but was:    
  "\*\*\\abc123\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\*\*\r\na new line"

If I remove the JAVA_CTRL_CHARS_ESCAPE translator, I get this output instead:

org.opentest4j.AssertionFailedError:  
expected:    
  "\*\*\\abc123\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\*\*   
  a new line"  
but was:    
  "\*\*\\abc123\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\*\*\u000D\u000Aa new line"
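One way to read the two failing outputs is that the aggregate hands each character to the first translator willing to consume it. The following is a plain-Java model of that behavior, not the Commons Text classes themselves: `TranslatorChainModel`, `ctrlEscape`, `unicodeEscape`, and the two boolean flags are hypothetical stand-ins, and the five control-character mappings are an assumption about what `JAVA_CTRL_CHARS_ESCAPE` contains.

```java
final class TranslatorChainModel {

    // Stand-in for the JAVA_CTRL_CHARS_ESCAPE lookup (assumed mappings).
    // Returns the replacement text, or null when this stage declines the char.
    static String ctrlEscape(char c) {
        switch (c) {
            case '\b': return "\\b";
            case '\n': return "\\n";
            case '\t': return "\\t";
            case '\f': return "\\f";
            case '\r': return "\\r";
            default:   return null;
        }
    }

    // Stand-in for JavaUnicodeEscaper.outsideOf(32, 0x7e): anything outside
    // the printable ASCII range becomes a backslash-u escape with four hex digits.
    static String unicodeEscape(char c) {
        return (c < 32 || c > 0x7e) ? String.format("\\u%04X", (int) c) : null;
    }

    // Each character goes to the first enabled stage that accepts it;
    // a character that no stage accepts is copied through unchanged.
    static String translate(String input, boolean useCtrl, boolean useUnicode) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            String replacement = useCtrl ? ctrlEscape(c) : null;
            if (replacement == null && useUnicode) {
                replacement = unicodeEscape(c);
            }
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }
}
```

In this model, a CR LF pair comes out as the four characters `\r\n` when both stages are enabled, as `\u000D\u000A` when only the Unicode stage is enabled, and unchanged when neither is, which lines up with the results reported in the question.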

If I also remove the JavaUnicodeEscaper, the test passes. However, I'm concerned that removing it could cause other side effects.

What is the correct way to escape Markdown while preserving Java control characters?
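To make the desired behavior concrete, here is a minimal dependency-free sketch of it in plain Java. It is not a Commons Text translator: `MarkdownEscapeSketch` and `escape` are hypothetical names, and the choice to pass CR, LF, and tab through untouched while Unicode-escaping everything else outside 0x20..0x7E is an assumption about what "preserving control characters" should mean here. Within Commons Text, one possibility (untested, so treat it as an assumption) would be to put a `LookupTranslator` that maps `"\r"` and `"\n"` to themselves ahead of the `JavaUnicodeEscaper` in the `AggregateTranslator`.

```java
final class MarkdownEscapeSketch {

    // The same punctuation the question's lookup map backslash-escapes.
    private static final String MARKDOWN_SPECIALS = "`\\*_{}[]<>()#+-.!|";

    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (MARKDOWN_SPECIALS.indexOf(cp) >= 0) {
                // Markdown punctuation: prefix with a backslash.
                out.append('\\').appendCodePoint(cp);
            } else if (cp == '\r' || cp == '\n' || cp == '\t') {
                // Assumption: whitespace controls are kept as real characters.
                out.appendCodePoint(cp);
            } else if (cp < 32 || cp > 0x7e) {
                // Everything else outside printable ASCII: one backslash-u
                // escape per UTF-16 code unit (two for supplementary chars).
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Fed the test input from the question, this sketch produces the `expected` string from that test, with the CR LF pair left as real line-break characters.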
