In C++, a famous parsing ambiguity happens with code like
x<T> a;
Is it if T
is a type, it is what it looks like (a declaration of a variable a
of type x<T>
, otherwise it is (x < T) > a
(<>
are comparison operators, not angle brackets).
In fact, we could make a change to make this become unambiguous: we can make <
and >
nonassociative. So x < T > a
, without brackets, would not be a valid sentence anyway even if x
, T
and a
were all variable names.
How could one resolve this conflict in Menhir? At first glance it seems we just can't. Even with the aforementioned modification, we need to lookahead an indeterminate number of tokens before we see another closing >
, and conclude that it was a template instantiation, or otherwise, to conclude that it was an expression. Is there any way in Menhir to implement such an arbitrary lookahead?
Different languages (including the ones listed in your title) actually have very different rules for templates/generics (like what type of arguments there can be, where templates/generics can appear, when they are allowed to have an explicit argument list and what the syntax for template/type arguments on generic methods is), which strongly affect the options you have for parsing. In no language that I know is it true that the meaning of
x<T> a;
depends on whetherT
is a type.So let's go through the languages C++, Java, Rust and C#:
In all four of those languages both types and functions/methods can be templates/generic. So we'll not only have to worry about an ambiguity with variable declarations, but also function/method calls: is
f<T>(x)
a function/method call with an explicit template/type argument or is it two relational operators with the last operand parenthesized? In all four languages template/generic functions/methods can be called without template/type when those can be inferred, but that inference isn't always possible, so just disallowing explicit template/type arguments for function/method calls is not an option.Even if a language does not allow relational operators to be chained, we could get an ambiguity in expressions like this:
f(a<b, c, d>(e))
. Is this callingf
with the three argumentsa<b
,c
andd>e
or with the single argumenta<b, c, d>(e)
calling a function/method nameda
with the type/template argumentsb,c,d
?Now beyond this common foundation, most everything else is different between these languages:
Rust
In Rust the syntax for a variable declaration is
let variableName: type = expr;
, sox<T> a;
couldn't possibly be a variable declaration because that doesn't match the syntax at all. In addition it's also not a valid expression statement (anymore) because comparison operators can't be chained (anymore).So there's no ambiguity here or even a parsing difficulty. But what about function calls? For function calls, Rust avoided the ambiguity by simply choosing a different syntax to provide type arguments: instead of
f<T>(x)
the syntax isf::<T>(x)
. Since type arguments for function calls are optional when they can be inferred, this ugliness is thankfully not necessary very often.So in summary:
let a: x<T> = ...;
is a variable declaration,f(a<b, c, d>(e));
callsf
with three arguments andf(a::<b, c, d>(e));
callsa
with three type arguments. Parsing is easy because all of these are sufficiently different to be distinguished with just one token of lookahead.Java
In Java
x<T> a;
is in fact a valid variable declaration, but it is not a valid expression statement. The reason for that is that Java's grammar has a dedicated non-terminal for expressions that can appear as an expression statement and applications of relational operators (or any other non-assignment operators) are not matched by that non-terminal. Assignments are, but the left side of assignment expressions is similarly restricted. In fact, an identifier can only be the start of an expression statement if the next token is either a=
,.
,[
or(
. So an identifier followed by a<
can only be the start of a variable declaration, meaning we only need one token of lookahead to parse this.Note that when accessing static members of a generic class, you can and must refer to the class without type arguments (i.e.
FooClass.bar();
instead ofFooClass<T>.bar()
), so even in that case the class name would be followed by a.
, not a<
.But what about generic method calls? Something like
y = f<T>(x);
could still run into the ambiguity because relational operators are of course allowed on the right side of=
. Here Java chooses a similar solution as Rust by simply changing the syntax for generic method calls. Instead ofobject.f<T>(x)
the syntax isobject.<T>f(x)
where theobject.
part is non-optional even if the object isthis
. So to call a generic method with an explicit type argument on the current object, you'd have to writethis.<T>f(x);
, but like in Rust the type argument can often be inferred, allowing you to just writef(x);
.So in summary
x<T> a;
is a variable declaration and there can't be expression statements that start with relational operations; in general expressionsthis.<T>f(x)
is a generic method call andf<T>(x);
is a comparison (well, a type error, actually). Again, parsing is easy.C#
C# has the same restrictions on expression statements as Java does, so variable declarations aren't a problem, but unlike the previous two languages, it does allow
f<T>(x)
as the syntax for function calls. In order to avoid ambiguities, relational operators need to be parenthesized when used in a way that could also be valid call of a generic function. So the expressionf<T>(x)
is a method call and you'd need to add parenthesesf<(T>(x))
or(f<T)>(x)
to make it a comparison (though actually those would be type errors because you can't compare booleans with<
or>
, but the parser doesn't care about that) and similarlyf(a<b, c, d>(e))
calls a generic method nameda
with the type argumentsb,c,d
whereasf((a<b), c, (d<e))
would involve two comparisons (and you can in fact leave out one of the two pairs of parentheses).This leads to a nicer syntax for method calls with explicit type arguments than in the previous two languages, but parsing becomes kind of tricky. Considering that in the above example
f(a<b, c, d>(e))
we can actually place an arbitrary number of arguments befored>(e)
anda<b
is a perfectly valid comparison if not followed by d>(e), we actually need an arbitrary amount of lookahead, backtracking or non-determinism to parse this.So in summary
x<T> a;
is a variable declaration, there is no expression statement that starts with a comparison,f<T>(x)
is a method call expression and(f<T)>(x)
orf<(T>(x))
would be (ill-typed) comparisons. It is impossible to parse C# with menhir.C++
In C++
a < b;
is a valid (albeit useless) expression statement, the syntax for template function calls with explicit template arguments isf<T>(x)
anda<b>c
can be a perfectly valid (even well-typed) comparison. So statements likea<b>c;
and expressions likea<b>(c)
are actually ambiguous without additional information. Further, template arguments in C++ don't have to be types. That is,Foo<42> x;
or evenFoo<c> x;
wherec
is defined asconst int x = 42;
, for example, could be perfectly valid instantiations of theFoo
template ifFoo
is defined to take an integer as a template argument. So that's a bummer.To resolve this ambiguity, the C++ grammar refers to the rule
template-name
instead ofidentifier
in places where the name of a template is expected. So if we treated these as distinct entities, there'd be no ambiguity here. But of coursetemplate-name
is defined simply astemplate-name: identifier
in the grammar, so that seems pretty useless, ... except that the standard also says thattemplate-name
should only be matched when the given identifier names a template in the current scope. Similarly it says that identifiers should only be interpreted as variable names when they don't refer to a template (or type name).Note that, unlike the previous three languages, C++ requires all types and templates to be declared before they can be used. So when we see the statement
a<b>c;
, we know that it can only be a template instantiation if we've previously parsed a declaration for a template nameda
and it is currently in scope.So, if we keep track of scopes while parsing, we can simply use if-statements to check whether the name
a
refers to a previously parsed template or not in a hand-written parser. In parser generators that allow semantic predicates, we can do the same thing. Doing this does not even require any lookahead or backtracking.But what about parser generators like yacc or menhir that don't support semantic predicates? For these we can use something known as the lexer hack, meaning we make the lexer generate different tokens for type names, template names and ordinary identifiers. Then we have a nicely unambiguous grammar that we can feed our parser generator. Of course the trick is getting the lexer to actually do that. In order to accomplish that, we need to keep track of which templates and types are currently in scope using a symbol table and then access that symbol table from the lexer. We'll also need to tell the lexer when we're reading the name of a definition, like the
x
inint x;
, because then we want to generate a regular identifier even if a template namedx
is currently in scope (the definitionint x;
would shadow the template until the variable goes out of scope).This same approach is used to resolve the casting ambiguity (is
(T)(x)
a cast ofx
to typeT
or a function call of a function namedT
?) in C and C++.So in summary,
foo<T> a;
andfoo<T>(x)
are template instantiations if and only iffoo
is a template. Parsing's a bitch, but possible without arbitrary lookahead or backtracking and even using menhir when applying the lexer hack.