Re: Another regexp performance improvement: skip useless paren-captures - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Another regexp performance improvement: skip useless paren-captures |
Date | |
Msg-id | 3756810.1628562033@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Another regexp performance improvement: skip useless paren-captures (Mark Dilger <mark.dilger@enterprisedb.com>) |
Responses |
Re: Another regexp performance improvement: skip useless paren-captures
Re: Another regexp performance improvement: skip useless paren-captures |
List | pgsql-hackers |
Mark Dilger <mark.dilger@enterprisedb.com> writes: > I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the userfacing differences. If we had a good reason for raising an error for these back-references, maybe that'd be fine, butit seems to just be an implementation detail. I thought about this some more, and I'm coming around to the idea that throwing an error is the wrong thing. As a contrary example, consider (.)|(\1\1) We don't throw an error for this, and neither does Perl, even though the capturing parens can never be defined in the branch where the backrefs are. So it seems hard to argue that this is okay but the other thing isn't. Another interesting example is (.){0}(\1){0} I think that the correct interpretation is that this is a valid regexp matching an empty string (i.e., zero repetitions of each part), even though neither capture group will be defined. That's different from (.){0}(\1) which can never match. So I took another look at the code, and it doesn't seem that hard to make it act this way. The attached passes regression, but I've not beat on it with random strings. regards, tom lane diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index ae3a7b6a38..d9840171a3 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -1089,11 +1089,23 @@ parseqatom(struct vars *v, /* annoying special case: {0} or {0,0} cancels everything */ if (m == 0 && n == 0) { - if (atom != NULL) - freesubre(v, atom); - if (atomtype == '(') - v->subs[subno] = NULL; - delsub(v->nfa, lp, rp); + /* + * If we had capturing subexpression(s) within the atom, we don't want + * to destroy them, because it's legal (if useless) to back-ref them + * later. Hence, just unlink the atom from lp/rp and then ignore it. + */ + if (atom != NULL && (atom->flags & CAP)) + { + delsub(v->nfa, lp, atom->begin); + delsub(v->nfa, atom->end, rp); + } + else + { + /* Otherwise, we can clean up any subre infrastructure we made */ + if (atom != NULL) + freesubre(v, atom); + delsub(v->nfa, lp, rp); + } EMPTYARC(lp, rp); return top; } diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 5a6cdf47c2..96c0e6fa59 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -3506,6 +3506,21 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP'); {yy,NULL,NULL,NULL} (2 rows) +-- expectMatch 21.39 NPQR {((.)){0}(\2){0}} xyz {} {} {} {} +select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR'); + test_regex +------------------------------------------------------------ + {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX,REG_UEMPTYMATCH} + {"",NULL,NULL,NULL} +(2 rows) + +-- expectNomatch 21.40 PQR {((.)){0}(\2)} xxx +select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR'); + test_regex +-------------------------------------------- + {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX} +(1 row) + -- doing 22 "multicharacter collating elements" -- # again ugh -- MCCEs are not implemented in Postgres, so we skip all these tests diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 3419564203..29806e9417 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -1020,6 +1020,10 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); select * from test_regex('((.))(\2)', 'xyy', 'RP'); -- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} select * from test_regex('((.))(\2)', 'xyy', 'oRP'); +-- expectMatch 21.39 NPQR {((.)){0}(\2){0}} xyz {} {} {} {} +select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR'); +-- expectNomatch 21.40 PQR {((.)){0}(\2)} xxx +select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR'); -- doing 22 "multicharacter collating elements" -- # again ugh
pgsql-hackers by date: