Re: Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Our regex vs. POSIX on "longest match"
Date
Msg-id CA+TgmoYiYQyJ-2e-erykmn-z128V8zpGrPxe8XKU4+sCA41nVQ@mail.gmail.com
Whole thread Raw
In response to Re: Our regex vs. POSIX on "longest match"  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Our regex vs. POSIX on "longest match"
Re: Our regex vs. POSIX on "longest match"
List pgsql-hackers
On Sun, Mar 4, 2012 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
>> I stand on position that mixing greedy and non-greedy operators should
>> be possible, and that it should work according to POLA - i.e. greedines
>> of given atom shouldn't be influenced by other atoms.
>
> [ shrug... ]  That sounds good, but it's pretty much vacuous as far as
> defining a principled alternative behavior goes.  It's easy to
> demonstrate cases where atoms *must* be influenced by other ones.
> A trivial example is
>        (.*)(.*)
> It doesn't matter whether the second atom is greedy or not: it's not
> going to get to eat anything because the first one does instead.
> IOW this is just the same as
>        (.*)(.*?)
> --- they are both overall-greedy.

But I don't think this makes Brendan's example not a bug.  In this
case, we can't eat anything because there's nothing to eat.  In his
example, there's something to eat, and it's sitting in a place where
it logically seems to line up with a greedy quantifier, and yet the
quantifier doesn't eat it.  Now in fairness, I've seen my fair share
of apparently-buggy behavior from Perl's regex engine over the years
as well.

I think the right way to imagine this is as though the regular
expression were being matched to the source text in left-to-right
fashion.  For example, suppose the RE is [ab]+?[bc]+ and the source
text is aabbcc.  Clearly there's a match at position 1, but the match
could be anywhere between 3 and 6 characters in length, depending on
how greedy we think the RE should be, and the exact source text that
gets matched to each atom is up for grabs.  Enumerating all the
possibilities where each atom matches a string that is consistent with
its definition, leaving greediness aside, we get this:

[aa,b]
[aa,bb]
[aa,bbc]
[aa,bbcc]
[aab,b]
[aab,bc]
[aab,bcc]
[aabb,c]
[aabb,cc]

To distinguish among these, we look first at [ab]+? and decide that,
since it is non-greedy, the right overall answer must be one of the
ones that assigns to [ab]+? a match of minimal length within the space
of possible matches.  That narrows the field to just the first four
choices listed above.  Then we look at [bc]+ and, since it is greedy,
give it a match of maximal length within the space of possible
matches, leading to [ab]+? = aa and [bc]+ = bbcc.  Had the RE been
[ab]+?[bc]+?, the same algorithm would assign [ab]+? = aa and [bc]+? =
b; had it been [ab]+[bc]+? the same algorithm would assign [ab]+ =
aabb and [bc]+? = c.  Testing, I find that this matches what Perl does
in all of those cases.

Things get a bit more complicated when you throw alternation into the
picture, but I think it's basically right to view it as a greedy
operation.  Anything else seems rather unprincipled.  It may seem
surprising that a+?|b+? matched against aaa will match the first
branch to aaa rather than just a, but there's no real reason to
suppose that the ? which applies to the plus should somehow bleed up
and affect the interpretation of the alternation.  The RE might be
something like a+b+?|b+?a+ where the sub-REs have different greediness
interpretations and there's no obvious principled way to decide which
one should apply to the parent; or even something like cat|catfish
where the sub-REs don't contain any greedy/non-greedy operators at all
and yet we still have to assign some greediness to the alternation. I
think it's right to view every RE construct as greedy unless it's got
an explicit "not-greedy" flag attached to it; after all, that's the
traditional behavior of REs from time immemorial.  Someone could
invent a non-greedy form of alternation if they were so inclined.
This is different from what Perl does, but I think Perl's behavior
here is batty: given a+|a+b+ and the string aaabbb, it picks the first
branch and matches only aaa.  My whole being recoils: that surely
looks like a non-greedy interpretation of a regex with only greedy
quantifiers.  It turns out that if you write a+b+|a+ then it matches
the whole string; apparently, it selects the syntactically first
branch that can match, regardless of the length of the match, which
strikes me as nearly pure evil.  PostgreSQL - correctly, IMHO -
matches the longest substring available regardless of the syntactic
ordering; AIUI, the original charter for REs was to always match the
longest possible substring; non-greedy quantifiers were added later,
and should not be taken as a license to change the semantics of REs
that don't use them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Brendan Jurd
Date:
Subject: Re: Our regex vs. POSIX on "longest match"
Next
From: Michael Meskes
Date:
Subject: Re: ECPG FETCH readahead