Re: Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

From Brendan Jurd
Subject Re: Our regex vs. POSIX on "longest match"
Date
Msg-id CADxJZo39+Ms61LgXNm0Tmi3uALNO8t4c5OmcujK+BCFUeHqjxQ@mail.gmail.com
Whole thread Raw
In response to Re: Our regex vs. POSIX on "longest match"  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 5 March 2012 04:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brendan Jurd <direvus@gmail.com> writes:
>> On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Brendan Jurd <direvus@gmail.com> writes:
>> While it's true that POSIX doesn't contemplate non-greed, after
>> reading the spec I would have expected an expression *as a whole* to
>> still prefer the longest match, insofar as that is possible while
>> respecting non-greedy particles.
>
> It's not apparent to me that a construct that is outside the spec
> shouldn't be expected to change the overall behavior.  In particular,
> what if the RE contains only non-greedy quantifiers --- would you still
> expect longest match overall?  Henry certainly didn't.

Well, no, but then that wouldn't be "prefer the longest match, insofar
as that is possible whilerespecting non-greedy particles".  If all the quantifiers in an
expression are non-greedy, then it is trivially true that the only way
to respect the author's intention is to return the shortest match.

> Well, that's just an arbitrary example.  The cases I remember people
> complaining about in practice were the other way round: greedy
> quantifier followed by non-greedy, and they were unhappy that the
> non-greediness was effectively not respected (because the overall RE was
> taken as greedy).

I am unhappy about the reverse example too, and for the same reasons.

If we look at 'xy1234' ~ 'y*\d+?', Postgres gives us 'y1234'
(disregarding the non-greediness of the latter quantifier), and Python
gives us 'y1' (respecting both quantifiers).

So in Postgres, no matter how you mix the greediness up, some of your
quantifiers will not be respected.

> So you can't fix the issue by pointing to POSIX and
> saying "overall greedy is always the right thing".

"... insofar as that is possible while respecting non-greedy particles".

I will take Henry's word for it that this problem is harder than it
looks, but in any case, I think we may not presume to know better than
the author of the regex about the greediness of his quantifiers.

> Another thing I've seen people complain about is that an alternation
> (anything with top-level "|") is always taken as greedy overall.

Yeah, that is quirky.

Cheers,
BJ


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)
Next
From: Robert Haas
Date:
Subject: Re: Our regex vs. POSIX on "longest match"