Re: Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Our regex vs. POSIX on "longest match"
Date
Msg-id 16069.1330882462@sss.pgh.pa.us
Whole thread Raw
In response to Re: Our regex vs. POSIX on "longest match"  (Brendan Jurd <direvus@gmail.com>)
Responses Re: Our regex vs. POSIX on "longest match"  (hubert depesz lubaczewski <depesz@depesz.com>)
Re: Our regex vs. POSIX on "longest match"  (Brendan Jurd <direvus@gmail.com>)
List pgsql-hackers
Brendan Jurd <direvus@gmail.com> writes:
> On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Brendan Jurd <direvus@gmail.com> writes:
>>> I'll admit that this is a pretty obscure point, but we do appear to be
>>> in direct violation of POSIX here.

>> How so?  POSIX doesn't contain any non-greedy constructs.  If you use
>> only the POSIX-compatible greedy constructs, the behavior is compliant.

> While it's true that POSIX doesn't contemplate non-greed, after
> reading the spec I would have expected an expression *as a whole* to
> still prefer the longest match, insofar as that is possible while
> respecting non-greedy particles.

It's not apparent to me that a construct that is outside the spec
shouldn't be expected to change the overall behavior.  In particular,
what if the RE contains only non-greedy quantifiers --- would you still
expect longest match overall?  Henry certainly didn't.

> I just ran some quick experiments in
> Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'.  All returned
> 'y1234', which to my mind is the most obvious answer, and one which
> still makes sense in light of what POSIX has to say.  Whereas Postgres
> (and presumably Tcl) return 'y1'.

Well, that's just an arbitrary example.  The cases I remember people
complaining about in practice were the other way round: greedy
quantifier followed by non-greedy, and they were unhappy that the
non-greediness was effectively not respected (because the overall RE was
taken as greedy).  So you can't fix the issue by pointing to POSIX and
saying "overall greedy is always the right thing".

Another thing I've seen people complain about is that an alternation
(anything with top-level "|") is always taken as greedy overall.
This seems a bit surprising to me --- I'd have expected a rule like
"inherits its greediness from the leftmost branch that has one",
though I'm not sure whether that would really be much of an improvement
in practice.  (It would at least fix the problem that a leading "(a|b)"
determines greediness of the containing RE whereas the logically
equivalent "[ab]" does not.)  Again, pointing to POSIX and saying
"overall greedy is the right thing" doesn't help.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: RFC: Making TRUNCATE more "MVCC-safe"
Next
From: Tom Lane
Date:
Subject: Re: Patch: improve selectivity estimation for IN/NOT IN