PG Bug reporting form <noreply@postgresql.org> writes:
> SELECT regexp_match('aaa', '(a*)*');
> regexp_match
> --------------
> {aaa}
> (1 row)
> SELECT regexp_match('aaa', '(a*)+');
> regexp_match
> --------------
> {""}
> (1 row)
> What seems to be happening here is that in the + case, the engine is doing
> one more match, matching (a*) against an empty string at the end of the
> input, unlike the * case where the last match of (a*) is against the whole
> string. This seems to violate the rules for determining where subexpression
> captures line up. (And certainly there is no justification for the + vs. *
> quantifier to make any difference here.)
[ shrug... ] This excites me not in the least. The engine is entirely
within its rights to match "a*" to an empty substring.
I do see a documentation issue, which is that we don't specify what
happens when capturing parens are within a repetition operator ---
which of the matches gets captured?
Digging around in the code, it looks like one might choose to cast
blame on regcomp.c, around line 1170:
* If there's no backrefs involved, we can turn x{m,n} into
* x{m-1,n-1}x, with capturing parens in only the second x. This is
* valid because we only care about capturing matches from the final
* iteration of the quantifier. It's a win because we can implement
* the backref-free left side as a plain DFA node, since we don't
* really care where its submatches are.
That is, we're effectively converting "(a*)+" to "a*(a*)*", whereupon
it's obvious why you're getting the result you do.
This comment only dates back to 173e29aa5, but apparently the behavior
of capturing the last instance is old; at the time I wrote
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)
Perhaps capturing in the first iteration would provide less surprising
semantics, but I'm not sure that anyone would love us for making such a
change, even if it's feasible code- and performance-wise.
regards, tom lane