Re: Regexp matching: bug or operator error? - Mailing list pgsql-general

From Tom Lane
Subject Re: Regexp matching: bug or operator error?
Msg-id 22644.1101340967@sss.pgh.pa.us
In response to Re: Regexp matching: bug or operator error?  (Ken Tanzer <ktanzer@desc.org>)
Responses Re: Regexp matching: bug or operator error?
List pgsql-general
Ken Tanzer <ktanzer@desc.org> writes:
> Thanks for the quick responses yesterday.  At a minimum, it seems like
> this behavior does not match what is described in the Postgres
> documentation (more detail below).

After looking at this more, I think that it is actually behaving as
Spencer designed it to.  The key point is this bit from the fine print
in section 9.6.3.5:

    A branch has the same preference as the first quantified atom in it
    which has a preference.

("branch" being any regexp with no outer-level | operator)

What this apparently means is that if the first quantified atom in the
RE is non-greedy, then the matching will be done in such a way that the
whole RE matches the shortest possible string; that is, the whole RE is
non-greedy.  It's still possible for individual items within the RE to
be greedy or non-greedy, but that only affects how much of the shortest
possible total match they are allowed to eat relative to each other.
All the examples I've looked at seem to work "properly" when seen in
this light.
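
To make that concrete, here is a minimal pair of queries. It is the same
pair the PostgreSQL documentation uses to illustrate its regular expression
matching rules. The two-argument substring() call form is an assumption about
what your server accepts; the only difference between the two patterns is
whether the first quantifier is greedy.

```sql
-- First quantified atom is Y* (greedy), so the RE as a whole is greedy:
-- it matches the longest string starting at the Y, which is Y123.
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
-- Result: 123

-- First quantified atom is Y*? (non-greedy), so the RE as a whole is
-- non-greedy: it matches the shortest string starting at the Y, which is Y1.
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
-- Result: 1
```

Note that [0-9]{1,3} is itself greedy in both queries; in the second one it
simply has no say in how long the overall match is.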

I can see that this behavior could have some usefulness, and if need be
you can always override it by writing (...){1,1} around the whole RE.
So at this point I'm disinclined to vary from the Tcl semantics.
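
A sketch of that override, again following the examples in the PostgreSQL
documentation. It assumes a server recent enough to have regexp_match()
and standard_conforming_strings turned on (so a single backslash suffices).
The wrapper is written with non-capturing parentheses, (?:...){1,1}, so
that the numbering of the inner capture groups is not disturbed.

```sql
-- Leading .*? makes the whole RE non-greedy: the shortest overall match
-- wins, so \d+ gets one digit and the trailing .* gets nothing.
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
-- Result: {abc,0,""}

-- Wrapping the RE in a greedy {1,1} makes the whole RE greedy again,
-- while the inner .*? still prefers to eat as little as it can.
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
-- Result: {abc,01234,xyz}
```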

This does leave us with a documentation problem though, because this
behavior is surely not obvious from what it says in 9.6.3.5.  If you've
got any thoughts about a better explanation, I'm all ears.

> Here's the actual regex we're working on--any help
> reformulating this would be great!

> select substring('Searching for log 5376, referenced in this text'
>             FROM
>         '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
> )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');

I don't see that you need either the leading (?:.*?) or the trailing
(.*?) here, and if you dropped them then the first quantifier would be
the "s?" which is greedy so the curious case goes away.  I suppose the
idea of adding (?:.*?) was to ensure that "log" will be matched to the
first possible place where it could match --- but that is true anyway,
per the first sentence of 9.6.3.5.
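
As a rough sketch of what that would look like (not a tested drop-in
replacement): the query below is yours with the leading (?:.*?) and the
trailing (.*?) removed. It also leaves out the last alternative in each
separator group, since that part did not survive the mail intact, and it
uses single backslashes on the assumption that standard_conforming_strings
is on.

```sql
-- The first quantified atom is now "s?", which is greedy, so the whole RE
-- is greedy and [0-9]{1,7} takes as many digits as it can.
SELECT substring('Searching for log 5376, referenced in this text'
            FROM
        '(?i)logs?(?:\s|\n|<br>|<br />)(?:entry|no|number|#)?(?:\s|\n|<br>|<br />)?([0-9]{1,7})');
-- Result: 5376
```

The match still starts at the first place "log" can match, and substring()
returns only the parenthesized digits.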

            regards, tom lane
