Home > mailing lists

Re: BUG #16133: Regexp quantifier issues - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: BUG #16133: Regexp quantifier issues
Date	November 22, 2019 23:34:47
Msg-id	27092.1574454887@sss.pgh.pa.us Whole thread Raw
In response to	BUG #16133: Regexp quantifier issues (PG Bug reporting form <noreply@postgresql.org>)
List	pgsql-bugs

Tree view

PG Bug reporting form <noreply@postgresql.org> writes:
> SELECT regexp_match('aaa', '(a*)*');
>  regexp_match 
> --------------
>  {aaa}
> (1 row)

> SELECT regexp_match('aaa', '(a*)+');
>  regexp_match 
> --------------
>  {""}
> (1 row)

> What seems to be happening here is that in the + case, the engine is doing
> one more match, matching (a*) against an empty string at the end of the
> input, unlike the * case where the last match of (a*) is against the whole
> string. This seems to violate the rules for determining where subexpression
> captures line up. (And certainly there is no justification for the + vs. *
> quantifier to make any difference here.)

[ shrug... ]  This excites me not in the least.  The engine is entirely
within its rights to match "a*" to an empty substring.

I do see a documentation issue, which is that we don't specify what
happens when capturing parens are within a repetition operator ---
which of the matches gets captured?

Digging around in the code, it looks like one might choose to cast
blame on regcomp.c, around line 1170:

         * If there's no backrefs involved, we can turn x{m,n} into
         * x{m-1,n-1}x, with capturing parens in only the second x.  This is
         * valid because we only care about capturing matches from the final
         * iteration of the quantifier.  It's a win because we can implement
         * the backref-free left side as a plain DFA node, since we don't
         * really care where its submatches are.

That is, we're effectively converting "(a*)+" to "a*(a*)*", whereupon
it's obvious why you're getting the result you do.

This comment only dates back to 173e29aa5, but apparently the behavior
of capturing the last instance is old; at the time I wrote
   
    Cases where a back-reference is part of a larger subexpression that
    is quantified have never worked in Spencer's regex engine, because
    he used a compile-time transformation that neglected the need to
    check the back-reference match in iterations before the last one.
    (That was okay for capturing parens, and we still do it if the
    regex has *only* capturing parens ... but it's not okay for backrefs.)

Perhaps capturing in the first iteration would provide less surprising
semantics, but I'm not sure that anyone would love us for making such a
change, even if it's feasible code- and performance-wise.

            regards, tom lane

pgsql-bugs by date:

From: PG Bug reporting form
Date: 22 November 2019, 23:08:55
Subject: BUG #16133: Regexp quantifier issues

From: Sandeep Thakkar
Date: 23 November 2019, 14:18:48
Subject: Re: BUG #16117: Server cannot be installed

Re: BUG #16133: Regexp quantifier issues - Mailing list pgsql-bugs

Previous

Next