How do I get a &-conjoined regular expression to be considered as long a match as its constituents?

Question

160 Views Asked by darch At 02 February 2024 at 10:53

You can specify that a regular expression should only match if all its constituents match by using a conjunction, && or &:

[4] >  'ready' ~~ / r..dy & .ea..  /
｢ready｣
[5] >  'roody' ~~ / r..dy & .ea..  /
Nil
[6] >  'peaty' ~~ / r..dy & .ea..  /
Nil

When choosing which side of an alternation matches, Raku chooses the "better" side. Two rules matter here: an alternative with a longer declarative prefix is better than one with a shorter declarative prefix, and alternatives that were declared earlier are better than their siblings.
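As a minimal sketch of those two rules as I understand them (the patterns and variable names are illustrative only, not taken from my parser):

```raku
# | uses longest-token matching: the longer declarative prefix wins,
# even though it is written second.
my $longest = ~('foobar' ~~ / foo | foobar /);
say $longest;      # foobar

# || is procedural: it takes the first alternative that matches.
my $first-match = ~('foobar' ~~ / foo || foobar /);
say $first-match;  # foo

# Equal declarative prefixes: the alternative declared earlier wins.
my $winner;
'foo' ~~ / foo { $winner = 'earlier' } | foo { $winner = 'later' } /;
say $winner;       # earlier
```

The code blocks in the last pattern come after the literal, so both alternatives still have the same three-character declarative prefix.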

The documentation indicates that part of the reason to use & instead of && is that & is considered declarative. If I understand longest token matching correctly, that's what I want.

However, something surprising is happening if I use a conjunction as one branch of an alternation: it is never chosen if its alternatives aren't conjoined. It seems to be considered shorter than matches that seem to me to be of equal length.

This is annoying because I'm writing a parser in which it would be natural to say "if the text matches both these rules, it is in fact considered an example of the conjoint rule". The grammar consistently prefers the constituent rules over the conjoint rule.

The following REPL examples are all variations on a pattern in which we have an alternation with a conjoined alternative and a non-conjoined alternative and we want to know why the non-conjoined alternative is being chosen:

'froody' ~~ / froody & froody | froody /

(The REPL interactions have diagnostic code and extra brackets to make sure I'm not running into precedence problems.)

Here, the two sides of the alternation look to me like they should be considered the same length, so I expect it to choose the left branch. It chooses the right branch.

[7] >  'froody' ~~ / [ [ froody & froody ] { say 'left' } ] | [ froody { say 'right' } ] /
right

If I reverse the order, it still chooses the non-conjoined branch.

[7] >  'froody' ~~ / [ froody { say 'left' } | [ [ froody & froody ] { say 'right' } ] ] /
left

If I artificially shorten the declarative portion of the non-conjoined branch by prepending {}, it chooses the left branch...

[7] >  'froody' ~~ / [ [ froody & froody ] { say 'left' } ] | [ {}froody { say 'right' } ] /
left

... and also if we flip them. This suggests that the conjoined branch is considered to have a declarative length of 0.

[7] >  'froody' ~~ / [ {}froody { say 'left' } | [ [ froody & froody ] { say 'right' } ] ] /
left

So: How do I get an alternation with a conjunction to be considered to contain as long a match as its non-conjoined alternative without dumb hacks? Is this an unreasonable thing to want? Is & not supposed to be able to do this?

**raiph** · Accepted Answer · 2024-02-03T02:40:14.543000

Disclaimer: This answer is quite plausibly wrong. That said, it's carefully researched, and I think it's at least half right.

TL;DR Some regex constructs terminate a pattern's LDP (Longest Declarative Prefix). & and [...] both do so at the start of the sub-expressions they produce. Others don't, including && (and assertions like <froody>), so use those instead.

Example, and discussion

I begin with a variant of your first example. This code...

say 'foo' ~~ /  foo & foo {print 'L '}  |  foo {print 'R '}  /

...displays R ｢foo｣. In other words, this has exactly the same behavior as your first example, choosing the RHS branch instead of the desired/expected left branch.

Of critical importance, I did not introduce [...] subgrouping, and so avoided quietly introducing double trouble. (In my tests, & and [...], as well as (...), all terminate the LDP at their start.)

Now we can change the & to && ...

say 'foo' ~~ /  foo && foo {print 'L '}  |  foo {print 'R '}  /

...and get the desired result: L ｢foo｣.
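The TL;DR also mentions assertions such as <froody>. Here is a minimal sketch for trying that variant yourself. The named regex froody and the $branch variable are names introduced purely for illustration, and I'm not asserting here which branch it picks on any given Rakudo version; run it and see.

```raku
# Hypothetical named regex, introduced only for this sketch.
my regex froody { froody }

# Records which branch of the alternation ran.
my $branch;

# Conjoin two assertions with && on the left; plain literal on the right.
my $m = 'froody' ~~ / <froody> && <froody> { $branch = 'L' } | froody { $branch = 'R' } /;

say $m.Str;    # the matched text
say $branch;   # 'L' if the conjoined branch was chosen, 'R' otherwise
```

Recording the branch in a variable rather than printing from inside the regex makes it easier to compare several variants in one script.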

"declarative"

"declarative" generically means "expresses the logic of a computation without describing its control flow". There are two specific meanings of it within Raku's current regex feature set that are relevant here:

- What almost all uses of the word "declarative" refer to, namely the "longest declarative prefixes" related to | alternations.
- What just one use of the word "declarative" refers to, namely the (lack of a specified) order in which the LHS and RHS of an & expression are to be processed.

From the docs

Speculative design doc S05 uses the word "declarative" a lot:

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokenizer for a grammar because [Raku] will write one for you. See the section below on "Longest-token matching".

...

As with the disjunctions | and ||, conjunctions come in both & and && forms. The & form is considered declarative rather than procedural; it allows the compiler and/or the run-time system to decide which parts to evaluate first, and it is erroneous to assume either order happens consistently.

Likewise, in the Raku doc, specifically the regex doc, we find:

Briefly, what | does is this ... select the branch which has the longest declarative prefix. ... For more details, see the LTM strategy.

...

& (unlike &&) is considered declarative, and notionally all the segments can be evaluated in parallel, or in any order the compiler chooses.

Again, in all cases the term "declarative" is appropriate. But what it means in the phrase "longest declarative prefix" has absolutely nothing to do with what it means in "The & form is considered declarative".

The LHS of `&&` can specify a declarative prefix; why not `&`?

As explained in the previous section, the word "declarative" in relation to & should not be read to imply that it necessarily derives a declarative prefix (or pair of them?) out of its LHS (and RHS?). Furthermore, & clearly has a zero length prefix in current Rakudo. And there's nothing I've found in the speculation docs and IRC discussions in 2005 onward to suggest any intent for & to contribute to declaring a declarative prefix in the LTM sense.

But why not sweep all this confusion away by making & do at least as well as && in playing nicely in the LTM game?

My current thinking is that & would then no longer be declarative in the sense of always leaving it up to the compiler to decide the order in which to attempt matching its LHS and RHS. In fact it could never leave that decision to the compiler. Raku(do) can't know whether an & expression will end up appearing dynamically in the context of an LTM alternation. And it couldn't use whatever order it prefers when it sees an & lexically but try the LHS first when it encounters one dynamically, because then refactoring could alter behavior.

So & would have to always do exactly the same thing as &&. But if that's the case, why have it available at all? One reason for having it would be because it's nice to have the option of declaring an & pair declaratively, i.e. giving the compiler the freedom to decide which order to match them in. But in that case the LTM prefix that & implies has to always be zero length.
