I have an older project in which a JavaCC grammar was used to generate the classes that parse a custom language.
Now, several years later, I have to adapt the grammar to add functionality (just a minor change). This works, but when running all tests, I see a problem parsing UTF-8 characters, and I have no idea what is causing it. I reverted my change to the grammar and regenerated the classes, but the problem remains. As soon as I run javacc on the grammar and run my tests, the test with the UTF-8 characters fails.
This is the call I am using:
java -cp javacc-7.0.10.jar javacc -GRAMMAR_ENCODING=UTF-8 functionsGrammar.jj
I tried it with all major JavaCC versions from 4.x to 7.0.10, and they all have the same problem. I also tried different Java versions (6, 7, 8, 11), but that did not make any difference either.
Below you can find the relevant parts of the grammar:
options
{
  JDK_VERSION = "1.6";
  LOOKAHEAD = 2;
  FORCE_LA_CHECK = true;
  static = false;
}
TOKEN:
{
...
|< STRING : < QUOTES > (~["\"", "\\"])* ("\\"~[] (~["\"", "\\"])*)* < QUOTES > >
...}
TOKEN:
{
...
| < LIST :
< LCURLY_BRACE > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
( < COMMA > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
)*
...}
Parsing fails for the string "美丽的树", but works when the string is changed to "slkdfj", for example.
Are there any JavaCC options that I am missing? Or are there other Java / JavaCC version combinations that might work?
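One thing that could be ruled out first is a decoding problem before the text ever reaches the generated parser. The following is only a minimal sketch and is not taken from the project: `DecodingCheck` and `decode` are hypothetical names, and it assumes the test input arrives as UTF-8 bytes. It shows that the same bytes come out intact through a Reader with an explicit UTF-8 charset, and garbled through a Reader that uses a single-byte charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingCheck {
    // Hypothetical helper: decodes the bytes through a Reader with the given
    // charset, similar to what a Reader-based parser input would see.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The quoted string from the question, written with Unicode escapes so
        // that the source file encoding does not matter.
        String input = "\"\u7F8E\u4E3D\u7684\u6811\"";
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the text round-trips unchanged.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(input));      // true
        // Single-byte charset: every UTF-8 byte becomes its own char.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(input)); // false
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());      // 14
    }
}
```

If the text is already wrong at this stage in the failing test, the grammar itself is probably not at fault, and the place where the input Reader or stream is created would be the first thing to look at.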
Legacy JavaCC most certainly does not support the full, current Unicode standard, i.e. characters outside the 16-bit range (supplementary code points above U+FFFF). This is explained here. Granted, it may well be that the OP does not (at least currently) need more than the 16-bit characters of the BMP (Basic Multilingual Plane). However, JavaCC 21 supports full Unicode. Besides that, JavaCC 21 fixes a plethora of existing bugs in legacy JavaCC that have not been addressed in over 20 years. I think that this article is illuminating in that regard.
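To check which side of the 16-bit boundary a given input falls on, something like the following sketch could be used. It is plain Java and independent of JavaCC; `BmpCheck` and `isBmpOnly` are hypothetical names introduced here for illustration. It tests whether every code point of a string fits in a single Java `char`.

```java
class BmpCheck {
    // True when every code point of s lies in the Basic Multilingual Plane,
    // i.e. each one is represented by exactly one 16-bit Java char.
    static boolean isBmpOnly(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        // The string from the question, written with Unicode escapes.
        String fromQuestion = "\u7F8E\u4E3D\u7684\u6811";
        // U+1F333 lies outside the BMP and needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F333));

        System.out.println(isBmpOnly(fromQuestion));  // true
        System.out.println(isBmpOnly(supplementary)); // false
        System.out.println(supplementary.length());   // 2
    }
}
```

For the string in the question this prints `true`, so the supplementary-character limitation would only matter for input such as the second example.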