Thread: Regex pattern with shorter back reference does NOT work as expected
Hi Tom,
Following example does not work as expected:
-- Should return TRUE but returning FALSE
SELECT 'Programmer' ~ '(\w).*?\1' as t;
-- Should return P, a and er i.e. 3 rows but returning just one row with
-- value Programmer
SELECT REGEXP_SPLIT_TO_TABLE('Programmer','(\w).*?\1');
Initially I thought that back-reference is not supported and thus we are
getting those result. But while trying few cases related to back-reference I
see that it is giving an error "invalid back-reference number", it means we
do have support for back-reference. So I tried few more scenarios. And I
observed that if we have input string as 'rogrammer' we are getting perfect
results i.e. when very first character is back-referenced. But failing when
first character is not part of back-reference.
This is happening only for shorter pattern matching. Longer match '(\w).*\1'
works well.
Clearly, above example has two matching pattern 'rogr' and 'mm'.
So I started debugging it to get a root cause for this. It is too complex to
understand what exactly is happening here. But while debugging I got this
chunk in regexec.c:cfindloop() function from where we are returning with
REG_NOMATCH
{
/* no point in trying again */
*coldp = cold;
return REG_NOMATCH;
}
It was starting at 'P' and ending in above block. It was strange that why it
is not continuing with next character i.e. from 'r'. So I replaced above
chunk with break statement so that it will continue from next character.
This trick worked well.
Since I have very little idea at this code area, I myself unsure that it is
indeed a correct fix. And thus thought of mailing on hackers.
I have attached patch which does above changes along with few tests in
regex.sql
Your valuable insights please...
ThanksFollowing example does not work as expected:
-- Should return TRUE but returning FALSE
SELECT 'Programmer' ~ '(\w).*?\1' as t;
-- Should return P, a and er i.e. 3 rows but returning just one row with
-- value Programmer
SELECT REGEXP_SPLIT_TO_TABLE('Programmer','(\w).*?\1');
Initially I thought that back-reference is not supported and thus we are
getting those result. But while trying few cases related to back-reference I
see that it is giving an error "invalid back-reference number", it means we
do have support for back-reference. So I tried few more scenarios. And I
observed that if we have input string as 'rogrammer' we are getting perfect
results i.e. when very first character is back-referenced. But failing when
first character is not part of back-reference.
This is happening only for shorter pattern matching. Longer match '(\w).*\1'
works well.
Clearly, above example has two matching pattern 'rogr' and 'mm'.
So I started debugging it to get a root cause for this. It is too complex to
understand what exactly is happening here. But while debugging I got this
chunk in regexec.c:cfindloop() function from where we are returning with
REG_NOMATCH
{
/* no point in trying again */
*coldp = cold;
return REG_NOMATCH;
}
It was starting at 'P' and ending in above block. It was strange that why it
is not continuing with next character i.e. from 'r'. So I replaced above
chunk with break statement so that it will continue from next character.
This trick worked well.
Since I have very little idea at this code area, I myself unsure that it is
indeed a correct fix. And thus thought of mailing on hackers.
I have attached patch which does above changes along with few tests in
regex.sql
Your valuable insights please...
--
Jeevan B Chalke
Jeevan B Chalke
Attachment
Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes: > Following example does not work as expected: > -- Should return TRUE but returning FALSE > SELECT 'Programmer' ~ '(\w).*?\1' as t; This is clearly broken, but I'm uncomfortable with the proposed patch. As written, it changes behavior for both the shortest-match-preferred and longest-match-preferred cases; but you've offered no evidence that the longest-match case is broken. Maybe it is --- it's sure not obvious why it's okay to abandon the search early in this case. But I think we'd have been likely to hear about it before now if there were a matching failure in that path, since longest-preferred is so much more widely used. I think we should either convince ourselves that the longest-preferred case is also broken (preferably with a test case), or understand why it isn't. Such understanding would probably also teach us how to fix the shortest-preferred case in a way that doesn't give up early search exit. regards, tom lane
I wrote: > Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes: >> Following example does not work as expected: >> >> -- Should return TRUE but returning FALSE >> SELECT 'Programmer' ~ '(\w).*?\1' as t; > This is clearly broken, but I'm uncomfortable with the proposed patch. > As written, it changes behavior for both the shortest-match-preferred > and longest-match-preferred cases; but you've offered no evidence that > the longest-match case is broken. Maybe it is --- it's sure not > obvious why it's okay to abandon the search early in this case. But I > think we'd have been likely to hear about it before now if there were > a matching failure in that path, since longest-preferred is so much > more widely used. After reflecting on this for awhile, I think that the longest-preferred case is indeed also wrong in theory, but it's unreachable, which explains the lack of bug reports. To get to the "no point in trying again" code, we have to have a success of the DFA match followed by a failure of the cdissect match, which essentially means that the string would match if we didn't have to constrain some backref to exactly match the string matched by its referent. Now, in the longest-preferred case, the supposed early exit test is "end == begin", ie the tentative match was of zero length overall. I can't envision any situation in which a backref constraint could fail given that, because both the referent pattern piece and the backref piece would have to be matching zero-length substrings as well. There could be anchor constraints, lookahead constraints, and so forth in play, but those would all have been checked by the DFA, so they're not going to result in cdissect failures. Hence, the "end == begin" test will simply never succeed. On the other hand, the check made in the shortest-preferred case is going to succeed if the tentative match was of maximal not minimal length, and that's certainly a possible case. So I think your patch is right, although I'd be inclined to refactor the code to have just one test on "shorter", like this: /* go around and try again, if possible */ if (shorter) { if(end == estop) break; estart = end + 1; } else { if (end == begin) break; estop = end - 1; } so as to make it clearer that we're just defending against selecting an impossible new estart or estop location. (Although I argued above that the "end == begin" test can't succeed, I wouldn't care to remove it entirely here.) regards, tom lane
Hi Tom,
--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Phone: +91 20 30589500
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
On Sat, Jul 13, 2013 at 10:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:After reflecting on this for awhile, I think that the longest-preferred
> Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:
>> Following example does not work as expected:
>>
>> -- Should return TRUE but returning FALSE
>> SELECT 'Programmer' ~ '(\w).*?\1' as t;
> This is clearly broken, but I'm uncomfortable with the proposed patch.
> As written, it changes behavior for both the shortest-match-preferred
> and longest-match-preferred cases; but you've offered no evidence that
> the longest-match case is broken. Maybe it is --- it's sure not
> obvious why it's okay to abandon the search early in this case. But I
> think we'd have been likely to hear about it before now if there were
> a matching failure in that path, since longest-preferred is so much
> more widely used.
case is indeed also wrong in theory, but it's unreachable, which
explains the lack of bug reports. To get to the "no point in trying
again" code, we have to have a success of the DFA match followed by a
failure of the cdissect match, which essentially means that the string
would match if we didn't have to constrain some backref to exactly match
the string matched by its referent. Now, in the longest-preferred case,
the supposed early exit test is "end == begin", ie the tentative match
was of zero length overall. I can't envision any situation in which a
backref constraint could fail given that, because both the referent
pattern piece and the backref piece would have to be matching
zero-length substrings as well. There could be anchor constraints,
lookahead constraints, and so forth in play, but those would all have
been checked by the DFA, so they're not going to result in cdissect
failures. Hence, the "end == begin" test will simply never succeed.
Thanks for the explanation.
For last couple of days I was trying hard to find a test-case which triggers
"end == begin" but didn't find one.
For last couple of days I was trying hard to find a test-case which triggers
"end == begin" but didn't find one.
This explanation proves that it will never reach that. So I give up now.
On the other hand, the check made in the shortest-preferred case is
going to succeed if the tentative match was of maximal not minimal
length, and that's certainly a possible case.
So I think your patch is right, although I'd be inclined to refactor
the code to have just one test on "shorter", like this:
/* go around and try again, if possible */
if (shorter)
{
if (end == estop)
break;
estart = end + 1;
}
else
{
if (end == begin)
break;
estop = end - 1;
}
so as to make it clearer that we're just defending against selecting an
impossible new estart or estop location. (Although I argued above that
the "end == begin" test can't succeed, I wouldn't care to remove it
entirely here.)
OK. Looks good to me.
Thanks
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Phone: +91 20 30589500
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes: >>> Following example does not work as expected: >>> >>> -- Should return TRUE but returning FALSE >>> SELECT 'Programmer' ~ '(\w).*?\1' as t; For the archives' sake --- I've filed a report about this with the Tcl crew. They seem to have moved their bugtracker recently; it's no longer at sourceforge but in their own ticket system. This bug is at https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0 regards, tom lane