Thread: Regexp matching: bug or operator error?

Regexp matching: bug or operator error?

From
Ken Tanzer
Date:
Using Postgres V. 7.4.1, the following query:

    SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?');

Returns '1'.  I would expect it to return '12345'.  Is this a bug, or am
I missing something?  Thanks.

Attachment

Re: Regexp matching: bug or operator error?

From
Thomas Hallgren
Date:
Ken Tanzer wrote:
> Using Postgres V. 7.4.1, the following query:
>
>    SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?');
>
> Returns '1'.  I would expect it to return '12345'.  Is this a bug, or am
> I missing something?  Thanks.
>
The regexp {1,5} is satisfied with the minimum of 1 digit. It looks
ahead and finds your '.*'. That in turn consumes all but the last character.

Perhaps what you want is '[^0-9]+([0-9]{1,5})[^0-9]+'

Translates to "at least one non digit followed by 1-5 digits and then at
least 1 non digit".

Regards,
Thomas Hallgren

Re: Regexp matching: bug or operator error?

From
Tom Lane
Date:
Ken Tanzer <ktanzer@desc.org> writes:
> Using Postgres V. 7.4.1, the following query:

>     SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?');

> Returns '1'.  I would expect it to return '12345'.  Is this a bug, or am
> I missing something?  Thanks.

Hmm ... playing with it, it seems that there may indeed be a bug here
... it's acting like the "non greedy" flag from the .*? is being applied
to the {m,n} as well.  In other words the behavior would be correct for

     SELECT substring('X12345X' FROM '.*?([0-9]{1,5}?).*?');

However, aren't you doing this the hard way?  Why not just

    SELECT substring('X12345X' FROM '([0-9]{1,5})');

            regards, tom lane

Re: Regexp matching: bug or operator error?

From
Tom Lane
Date:
Checking in the Tcl bug tracker reveals that this is an open issue
for them as well:

http://sourceforge.net/tracker/index.php?func=detail&aid=219219&group_id=10894&atid=110894
http://sourceforge.net/tracker/index.php?func=detail&aid=219358&group_id=10894&atid=110894

The first entry has Henry Spencer claiming that it is operating as
designed, but the second one seems to cast doubt on that claim.
In any case I tend to agree that the notation implies that greediness
should be an independent property of each quantifier.

However, if Henry can't or doesn't want to fix it, I'm not sure that
I care to wade in ;-)

            regards, tom lane

Re: Regexp matching: bug or operator error?

From
"gnari"
Date:
From: "Ken Tanzer" <ktanzer@desc.org>


> Using Postgres V. 7.4.1, the following query:
>
>     SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?');
>
> Returns '1'.  I would expect it to return '12345'.  Is this a bug, or am
> I missing something?  Thanks.

the results of mixing greedy and non-greedy repetitions can be difficult to
predict.

try SELECT substring('X12345X' FROM '\\D*(\\d{1,5}).*?');

gnari



Re: Regexp matching: bug or operator error?

From
Ken Tanzer
Date:
Thanks for the quick responses yesterday.  At a minimum, it seems like
this behavior does not match what is described in the Postgres
documentation (more detail below).  But I still have a hard time
understanding the results of these two queries:

    select SUBSTRING( 'X444X','.*?([0-9]{1,3}).*?');
    This is the original query I submitted, with the puzzling non-greedy
match.  It returns '4'.

Adding start and end characters to the query, like so:

    select SUBSTRING( 'X444X','^.*?([0-9]{1,3}).*?$');
    returns '444'.

If the "whole RE" was being set non-greedy by the first ".*?", then
shouldn't the subsequent "([0-9]{1,3})" also match non-greedily,
returning a '4', with the last ".*?" then capturing the balance of
"44X"?  Either way, I'm not sure why the start and end characters would
effect the rest of the match.

In terms of the Postgres documentation, it definitely seems at odds with
the observed behavior.  Here's my attempts to explain why:

a)    select SUBSTRING( 'X444X','[0-9]{1,3}');

returns '444'.  This suggests that a "default" for the {m,n} syntax is
greedy.

b)  Table 9-13 of the docs describes {m,n} syntax, then lists {m,n}? as
a "non-greedy" version of the same.  That, and the fact that there
doesn't seem to be a specific "greedy" modifier, would both also imply
that {m,n} should be greedy.

Section 9.6.3.5: "A quantified atom with other normal quantifiers
(including {m,n} with m equal to n) prefers longest match"  I can't find
anything else in this section that would say otherwise.  I specifically
can't find anything that says the whole expression becomes greedy or not.

If the regex code isn't going to change, it seems that changing the
documentation would be very helpful to avoid confusion.  Of course,
that's easy for me to say, since I wouldn't have to do the work! For
that matter, I'd be willing to try editing the documentation, but I'd
have to understand what the actual behavior is before I could try to
describe it! :)  Either way, thanks for the great DB program!

Ken Tanzer

p.s., The suggested regex rewrites some people responded with were
appreciated, but the regex I used was just a simplified example for this
posting.  Here's the actual regex we're working on--any help
reformulating this would be great!

select substring('Searching for log 5376, referenced in this text'
            FROM
        '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
)(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');

We were able to get this to work by adding start and end characters,
like so, but it doesn't seem like it should be necessary:

select substring('Searching for log 5376, referenced in this text'
        FROM
    '(?i)^(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
)(?:entry|no|number|#)?(?:\\s|\\n\<br>|<br /> )?([0-9]{1,7})(.*?)$');




Tom Lane wrote:

>Checking in the Tcl bug tracker reveals that this is an open issue
>for them as well:
>
>http://sourceforge.net/tracker/index.php?func=detail&aid=219219&group_id=10894&atid=110894
>http://sourceforge.net/tracker/index.php?func=detail&aid=219358&group_id=10894&atid=110894
>
>The first entry has Henry Spencer claiming that it is operating as
>designed, but the second one seems to cast doubt on that claim.
>In any case I tend to agree that the notation implies that greediness
>should be an independent property of each quantifier.
>
>However, if Henry can't or doesn't want to fix it, I'm not sure that
>I care to wade in ;-)
>
>            regards, tom lane
>
>

Attachment

Re: Regexp matching: bug or operator error?

From
Tom Lane
Date:
Ken Tanzer <ktanzer@desc.org> writes:
> Thanks for the quick responses yesterday.  At a minimum, it seems like
> this behavior does not match what is described in the Postgres
> documentation (more detail below).

After looking at this more, I think that it is actually behaving as
Spencer designed it to.  The key point is this bit from the fine print
in section 9.6.3.5:

    A branch has the same preference as the first quantified atom in it
    which has a preference.

("branch" being any regexp with no outer-level | operator)

What this apparently means is that if the RE begins with a non-greedy
quantifier, then the matching will be done in such a way that the whole
RE matches the shortest possible string --- that is, the whole RE is
non-greedy.  It's still possible for individual items within the RE to
be greedy or non-greedy, but that only affects how much of the shortest
possible total match they are allowed to eat relative to each other.
All the examples I've looked at seem to work "properly" when seen in
this light.

I can see that this behavior could have some usefulness, and if need be
you can always override it by writing (...){1,1} around the whole RE.
So at this point I'm disinclined to vary from the Tcl semantics.

This does leave us with a documentation problem though, because this
behavior is surely not obvious from what it says in 9.6.3.5.  If you've
got any thoughts about a better explanation, I'm all ears.

> Here's the actual regex we're working on--any help
> reformulating this would be great!

> select substring('Searching for log 5376, referenced in this text'
>             FROM
>         '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
> )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');

I don't see that you need either the leading (?:.*?) or the trailing
(.*?) here, and if you dropped them then the first quantifier would be
the "s?" which is greedy so the curious case goes away.  I suppose the
idea of adding (?:.*?) was to ensure that "log" will be matched to the
first possible place where it could match --- but that is true anyway,
per the first sentence of 9.6.3.5.

            regards, tom lane

Re: Regexp matching: bug or operator error?

From
Kenneth Tanzer
Date:
OK.  I've been trying to get my mind around this, and think about ways
to improve the documentation (more about that below).  I'm pretty sure
that I can see the general concept now, and am almost convinced that it
really does work as described.  I guess I really don't like the whole RE
inheriting it's preference from the first preference it encounters.  It
seems like it would be better to use a switch preceeding the whole
expression, as is done with (?i) for case-insensitivity.  Designating
(?g) for greedy and (?G) for non-greedy seems like it would be less
inviting of human error.  But that's mostly a quibble, as long as it
works consistently as described!

I've still got a few problems, though.  Per 9.6.3.5, "An RE consisting
of two or more branches connected by the | operator prefers longest match."

Apply that to this example:

    SELECT substring('abc' FROM '.*?|.*?');

which returns a greedy 'abc'.  In this case, the whole RE is then
supposed to be greedy (because of the "|").  So I suppose that _might_
be said to work as described, even though the | in this case overrides
the expressed preference of both of its components (in which case I'm
just griping about not liking the way this was implemented. :) )

But what about these two queries:

    SELECT substring('a' FROM 'a?|a?');

This returns a greedy 'a', similar to the example above.  But then why does

    SELECT substring('ba' FROM 'a?|a?');

return a non-greedy empty string?  Using the logic from the previous
examples, this should do a greedy match and return 'a' as well.  If this
isn't a bug, please explain why!

With regard to the documentation, after re-reading it many times I'd
have to say the information is all there, but it's hard to absorb.  I
think the main problem is that the term "preference" is used to discuss
greedy/non-greediness, as well as the words greedy & non-greedy.  This
makes it easier for humans to fail to connect the dots.  People are more
likely to be familiar with greedy & non-greedy (especially those who've
used other regex's before), whereas preference isn't as clear.  Perhaps
the term "greediness preference" could be used in place of "preference",
or "preference" could be dropped altogether.

In general, I think "prefers shortest match" could be replaced with "is
non-greedy", and "prefers longest match" could be replaced with "is
greedy".  A little bit more context might be helpful as well, as in
example c) below.

As an example, here's a couple of different possibilities for the second
sentence of the section:

a) If the RE could match more than one substring starting at that point,
its choice is determined by its greediness /preference/: either the
longest substring (greedy), or the shortest (non-greedy).

b) If the RE could match more than one substring starting at that point,
the match can be either greedy (matching the longest substring) or
non-greedy (matching the shortest substring).  Whether an RE is greedy
or not is determined by the following rules...

c)  Like individual components of an RE, the entire RE can be either
greedy (matching the longest substring) or non-greedy (matching the
shortest substring).

Do you think an edit along these lines would be helpful?  If so, I'd be
willing to take a shot at re-writing that section.  Let me know.  Thanks.

Ken Tanzer




Tom Lane wrote:

>Ken Tanzer <ktanzer@desc.org> writes:
>
>
>>Thanks for the quick responses yesterday.  At a minimum, it seems like
>>this behavior does not match what is described in the Postgres
>>documentation (more detail below).
>>
>>
>
>After looking at this more, I think that it is actually behaving as
>Spencer designed it to.  The key point is this bit from the fine print
>in section 9.6.3.5:
>
>    A branch has the same preference as the first quantified atom in it
>    which has a preference.
>
>("branch" being any regexp with no outer-level | operator)
>
>What this apparently means is that if the RE begins with a non-greedy
>quantifier, then the matching will be done in such a way that the whole
>RE matches the shortest possible string --- that is, the whole RE is
>non-greedy.  It's still possible for individual items within the RE to
>be greedy or non-greedy, but that only affects how much of the shortest
>possible total match they are allowed to eat relative to each other.
>All the examples I've looked at seem to work "properly" when seen in
>this light.
>
>I can see that this behavior could have some usefulness, and if need be
>you can always override it by writing (...){1,1} around the whole RE.
>So at this point I'm disinclined to vary from the Tcl semantics.
>
>This does leave us with a documentation problem though, because this
>behavior is surely not obvious from what it says in 9.6.3.5.  If you've
>got any thoughts about a better explanation, I'm all ears.
>
>
>
>>Here's the actual regex we're working on--any help
>>reformulating this would be great!
>>
>>
>
>
>
>>select substring('Searching for log 5376, referenced in this text'
>>            FROM
>>        '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />|
>>)(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');
>>
>>
>
>I don't see that you need either the leading (?:.*?) or the trailing
>(.*?) here, and if you dropped them then the first quantifier would be
>the "s?" which is greedy so the curious case goes away.  I suppose the
>idea of adding (?:.*?) was to ensure that "log" will be matched to the
>first possible place where it could match --- but that is true anyway,
>per the first sentence of 9.6.3.5.
>
>            regards, tom lane
>
>


Re: Regexp matching: bug or operator error?

From
Tom Lane
Date:
Kenneth Tanzer <ktanzer@desc.org> writes:
> But what about these two queries:
>     SELECT substring('a' FROM 'a?|a?');
> This returns a greedy 'a', similar to the example above.  But then why does
>     SELECT substring('ba' FROM 'a?|a?');
> return a non-greedy empty string?

You're ignoring the first rule of matching: when there is more than one
possible match, the match starting earliest in the string is chosen.
The longer-or-shorter business only applies when there are multiple
legal ways to form a match starting at the same place.  In this case
'a?' can form a legal match at the beginning of the string (ie, match
to no characters) and so the fact that a longer match is available later
in the string doesn't enter into it.

> With regard to the documentation, after re-reading it many times I'd
> have to say the information is all there, but it's hard to absorb.

I'd agree.  This section was taken nearly verbatim from Henry Spencer's
man page for the regexp package, and with all due respect to Henry,
it's definitely written in geek reference-page-speak.  Maybe a few
examples would help.

On the other hand, I don't want to try to turn the section into a regexp
tutorial --- there are entire books written about regexps (I quite like
the O'Reilly one, btw).  So there's a bulk-vs-friendliness tradeoff to
be made.

> I think the main problem is that the term "preference" is used to
> discuss greedy/non-greediness, as well as the words greedy &
> non-greedy.

Good point.  It would help to use only one term.

> As an example, here's a couple of different possibilities for the second
> sentence of the section:

I like this one:

> b) If the RE could match more than one substring starting at that point,
> the match can be either greedy (matching the longest substring) or
> non-greedy (matching the shortest substring).  Whether an RE is greedy
> or not is determined by the following rules...

Given that intro, there's no need to use the word "preference" at all.
Or almost --- what term will you use for "RE with no preference"?
Perhaps you can avoid the question by pointing out that greediness
only matters for quantifiers, since unquantified REs can only match
fixed-length strings.

The point you make here:

> c)  Like individual components of an RE, the entire RE can be either
> greedy (matching the longest substring) or non-greedy (matching the
> shortest substring).

is also important, but probably needs to be a completely separate
paragraph containing its own example.

> Do you think an edit along these lines would be helpful?  If so, I'd be
> willing to take a shot at re-writing that section.  Let me know.  Thanks.

Fire away.  Please send whatever you come up with to the pgsql-docs
list.

            regards, tom lane

Re: Regexp matching: bug or operator error?

From
Tom Lane
Date:
Kenneth Tanzer <ktanzer@desc.org> writes:
> OK.  I've been trying to get my mind around this, and think about ways
> to improve the documentation (more about that below).

I finally got around to revising the regex docs on the basis of this
discussion: see patch at
http://developer.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/func.sgml
or the finished text at
http://candle.pha.pa.us/main/writings/pgsql/sgml/functions-matching.html#POSIX-MATCHING-RULES

Let me know if you have suggestions for further improvements.

            regards, tom lane