JavaCC generates code that fails while parsing UTF-8 strings

128 Views Asked by sveri At 09 February 2023 at 12:26

I have an older project, where a JavaCC grammar was used to generate classes to parse a custom language.

Now, several years later, I have to adapt the grammar to add functionality (just a minor change). This works, but when running all tests, I see that I have a problem parsing UTF-8 characters. I don't really have an idea what is causing this. I reverted my change to the grammar and regenerated the classes, but the problem remains. As soon as I run javacc with the grammar and run my tests, the one with the UTF-8 characters fails.

This is the call I am using:

java -cp javacc-7.0.10.jar javacc -GRAMMAR_ENCODING=UTF-8 functionsGrammar.jj

I tried it with all major javacc versions from 4.x to 7.0.10; they all have the same problem. I also tried this with different Java versions (6, 7, 8, 11), but that did not make any difference either.

Below you can find the relevant parts of the grammar:

options
{
  JDK_VERSION = "1.6";

  LOOKAHEAD = 2;
  FORCE_LA_CHECK = true;

  static = false;
}

TOKEN:
{
...
|< STRING : < QUOTES > (~["\"", "\\"])* ("\\"~[] (~["\"", "\\"])*)* < QUOTES > >
...}

TOKEN:
{
...
| < LIST :
    < LCURLY_BRACE > < SPACES >
    ( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
    ( < COMMA > < SPACES >
      ( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
    )*
...}

It fails for the string: "美丽的树" but works when changed to "slkdfj" for example.

I wonder if there are any options for JavaCC that I am missing? Or other java / javacc version combinations that might work?
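One thing worth ruling out before blaming the generated lexer is how the test input is decoded before it reaches the parser. The sketch below is only an illustration under assumptions: it does not use the generated classes at all, and the class name `DecodingCheck` and helper `decode` are made up for this example. It shows that the same UTF-8 bytes produce a different character sequence when read with a non-UTF-8 charset, which is what can happen when a stream is wrapped without an explicit encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodingCheck {

    // Hypothetical helper: reads all chars from the bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // The failing string from the question, quotes included, written with
        // escapes so the source file's own encoding cannot affect the result.
        String input = "\"\u7f8e\u4e3d\u7684\u6811\"";
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = decode(utf8, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);

        // Round trip with the matching charset preserves the text.
        System.out.println("UTF-8 round trip intact: " + decodedAsUtf8.equals(input));
        // With a single-byte charset, every byte becomes its own char.
        System.out.println("chars when read as ISO-8859-1: " + decodedAsLatin1.length());
    }
}
```

If the tests hand the parser an `InputStream`, it may be worth passing a `Reader` built with an explicit `StandardCharsets.UTF_8` instead, assuming the generated parser exposes a `Reader` constructor. Whether this is the actual cause here cannot be determined from the question alone.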


There is 1 best solution below

Jonathan Revusky On 25 February 2023 at 16:38

Legacy JavaCC most certainly does not support the full, current Unicode standard, i.e. supplementary characters outside the 16-bit Basic Multilingual Plane (BMP). This is explained here. Granted, it may well be that the OP does not (at least currently) need more than the BMP characters. However, JavaCC 21 supports full Unicode. Besides that, JavaCC 21 fixes a plethora of bugs in legacy JavaCC that have gone unaddressed for over 20 years. I think that this article is illuminating in that regard.
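To make the BMP distinction concrete, here is a small sketch (the class name `CodePointCheck` and method `isBmpOnly` are invented for this illustration, not part of JavaCC). It checks whether a string contains only characters that fit in a single 16-bit Java `char`. The string from the question passes that check, so the supplementary-character limitation by itself would not obviously explain that particular failure.

```java
public class CodePointCheck {

    // True when no code point in the string needs a surrogate pair,
    // i.e. the number of code points equals the number of 16-bit chars.
    static boolean isBmpOnly(String s) {
        return s.codePointCount(0, s.length()) == s.length();
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes.
        String trees = "\u7f8e\u4e3d\u7684\u6811";
        // A supplementary character (U+1F333) for comparison.
        String supplementary = new String(Character.toChars(0x1F333));

        System.out.println("question string is BMP-only: " + isBmpOnly(trees));
        System.out.println("U+1F333 is BMP-only: " + isBmpOnly(supplementary));
        System.out.println("U+1F333 length in chars: " + supplementary.length());
    }
}
```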
