Thread: Another regexp performance improvement: skip useless paren-captures
Here's a little finger exercise that improves a case that's bothered me for awhile. In a POSIX regexp, parentheses cause capturing by default; you have to write the very non-obvious "(?:...)" if you don't want the matching substring to be reported by the regexp engine. That'd be fine if capturing were cheap, but with our engine it is not particularly cheap. In many situations, the initial DFA check is sufficient to tell whether there is an overall match, but it does not tell where any subexpression match boundaries are. To identify exactly which substring is deemed to match a parenthesized subexpression, we have to recursively break down the match, which takes at the very least a few more DFA invocations; and with an uncooperative regex, it can easily result in O(N^2) behavior where there was none at the DFA stage. Therefore, we really ought to expend some effort to not capture subexpressions if the sub-match data is not actually needed, which in many invocations we know that it isn't. Spencer's original code has a REG_NOSUB option that looks like it ought to be good for this ... but on closer inspection it's basically useless, because it turns *all* parens into non-capturing ones. That breaks back-references, so unless you know that the regexp contains no back-refs, you can't use it. The attached proposed patch redefines REG_NOSUB as being a regexp- compile-time promise that the caller doesn't care about sub-match locations, but not a promise that no backrefs exist. (If the caller passes a match-locations array at execution anyway, it will just get back -1 values, as if no sub-match had been identified.) If that flag is passed, we run through the completed sub-regexp tree and remove the "capture" markers on any subREs that are not actually referenced by some backref. This typically causes some parent subREs to no longer be deemed "messy", so that their separate child subREs can be thrown away entirely, saving memory space as well as runtime. (I'd originally thought that a much more complex patch would be needed to do this, because I assumed that re-optimizing the subRE tree would be much more complicated than this. However, as far as I can see this is sufficient; this change doesn't expose any cases where additional tree restructuring would be helpful.) Testing with Joel's handy little corpus of web regexps, there's a useful improvement of the speed of ~ operators (a/k/a regexp_like()). I see the total time to apply regexp_like() to all 4474520 entries dropping from 10:17 to 5:46. Interesting statistics include regexp=# select max(duration),avg(duration) from headresults; max | avg -----------------+----------------- 00:00:00.939389 | 00:00:00.000138 (1 row) regexp=# select max(duration),avg(duration) from patchresults; max | avg -----------------+----------------- 00:00:00.918549 | 00:00:00.000077 (1 row) The lower percentiles don't move much, but upper ones do: regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from headresults; percentile_cont ------------------------------------------------------------------- {00:00:00.000027,00:00:00.000059,00:00:00.000067,00:00:00.000108} (1 row) regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from patchresults; percentile_cont ------------------------------------------------------------------- {00:00:00.000025,00:00:00.000042,00:00:00.000048,00:00:00.000065} (1 row) This isn't terribly surprising, because regexps that were already really cheap probably have no capturing parens to dispense with. Of course, there's no benefit with functions that do need sub-match data, such as regexp_match. But the added overhead in such cases should be quite negligible. The only downside I can see is that if you use the "same" regexp in both submatches-needed and non-submatches-needed contexts, you'll end up with two separate compiled regexp cache entries. That doesn't seem like a big problem though. regards, tom lane diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index bf1dea6352..64816dd370 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation, * Stage 1: Compile the regexp into a NFA, using the regexp library. */ #ifdef IGNORECASE - RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB | REG_ICASE, collation); #else - RE_compile(®ex, text_re, REG_ADVANCED, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB, collation); #endif /* diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c index 7673dab76f..45727ffa01 100644 --- a/src/backend/regex/regc_lex.c +++ b/src/backend/regex/regc_lex.c @@ -528,10 +528,7 @@ next(struct vars *v) } assert(NOTREACHED); } - if (v->cflags & REG_NOSUB) - RETV('(', 0); /* all parens non-capturing */ - else - RETV('(', 1); + RETV('(', 1); break; case CHR(')'): if (LASTTYPE('(')) diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 9f71177d31..38f709cb60 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state static void freesubre(struct vars *, struct subre *); static void freesubreandsiblings(struct vars *, struct subre *); static void freesrnode(struct vars *, struct subre *); -static void optst(struct vars *, struct subre *); +static void removecaptures(struct vars *, struct subre *); static int numst(struct subre *, int); static void markst(struct subre *); static void cleanst(struct vars *); @@ -431,7 +431,8 @@ pg_regcomp(regex_t *re, dumpst(v->tree, debug, 1); } #endif - optst(v, v->tree); + if (v->cflags & REG_NOSUB) + removecaptures(v, v->tree); v->ntree = numst(v->tree, 1); markst(v->tree); cleanst(v); @@ -1004,14 +1005,16 @@ parseqatom(struct vars *v, break; case BACKREF: /* the Feature From The Black Lagoon */ INSIST(type != LACON, REG_ESUBREG); - INSIST(v->nextvalue < v->nsubs, REG_ESUBREG); - INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG); + subno = v->nextvalue; + assert(subno > 0); + INSIST(subno < v->nsubs, REG_ESUBREG); + NOERR(); + INSIST(v->subs[subno] != NULL, REG_ESUBREG); NOERR(); - assert(v->nextvalue > 0); atom = subre(v, 'b', BACKR, lp, rp); NOERR(); - subno = v->nextvalue; atom->backno = subno; + v->subs[subno]->flags |= BRUSE; EMPTYARC(lp, rp); /* temporarily, so there's something */ NEXT(); break; @@ -2122,19 +2125,54 @@ freesrnode(struct vars *v, /* might be NULL */ } /* - * optst - optimize a subRE subtree + * removecaptures - remove unnecessary capture subREs + * + * If the caller said that it doesn't care about subexpression match data, + * we may delete the "capture" markers on subREs that are not referenced + * by any backrefs, and then simplify anything that's become non-messy. + * Call this only if REG_NOSUB flag is set. */ static void -optst(struct vars *v, - struct subre *t) +removecaptures(struct vars *v, + struct subre *t) { + struct subre *t2; + + assert(t != NULL); + + /* + * If this isn't itself a backref target, clear capno and tentatively + * clear CAP flag. + */ + if (!(t->flags & BRUSE)) + { + t->capno = 0; + t->flags &= ~CAP; + } + + /* Now recurse to children */ + for (t2 = t->child; t2 != NULL; t2 = t2->sibling) + { + removecaptures(v, t2); + /* Propagate child CAP flag back up, if it's still set */ + if (t2->flags & CAP) + t->flags |= CAP; + } + /* - * DGP (2007-11-13): I assume it was the programmer's intent to eventually - * come back and add code to optimize subRE trees, but the routine coded - * just spends effort traversing the tree and doing nothing. We can do - * nothing with less effort. + * If t now contains neither captures nor backrefs, there's no longer any + * need to care where its sub-match boundaries are, so we can reduce it to + * a simple DFA node. (Note in particular that MIXED child greediness is + * not a hindrance here, so we don't use the MESSY() macro.) */ - return; + if ((t->flags & (CAP | BACKR)) == 0) + { + if (t->child) + freesubreandsiblings(v, t->child); + t->child = NULL; + t->op = '='; + t->flags &= ~MIXED; + } } /* @@ -2482,6 +2520,8 @@ stdump(struct subre *t, fprintf(f, " hascapture"); if (t->flags & BACKR) fprintf(f, " hasbackref"); + if (t->flags & BRUSE) + fprintf(f, " isreferenced"); if (!(t->flags & INUSE)) fprintf(f, " UNUSED"); if (t->latype != (char) -1) diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index 5b9a087820..e75b59e9f0 100644 --- a/src/backend/regex/regexec.c +++ b/src/backend/regex/regexec.c @@ -213,12 +213,10 @@ pg_regexec(regex_t *re, return REG_NOMATCH; backref = (v->g->info & REG_UBACKREF) ? 1 : 0; v->eflags = flags; - if (v->g->cflags & REG_NOSUB) - nmatch = 0; /* override client */ v->nmatch = nmatch; - if (backref) + if (backref && nmatch <= v->g->nsub) { - /* need work area */ + /* need larger work area */ if (v->g->nsub + 1 <= LOCALMAT) v->pmatch = mat; else @@ -229,7 +227,13 @@ pg_regexec(regex_t *re, v->nmatch = v->g->nsub + 1; } else + { + /* we can store results directly in caller's array */ v->pmatch = pmatch; + /* ensure any extra entries in caller's array are filled with -1 */ + if (nmatch > 0) + zapallsubs(pmatch, nmatch); + } v->details = details; v->start = (chr *) string; v->search_start = (chr *) string + search_start; @@ -290,12 +294,20 @@ pg_regexec(regex_t *re, else st = find(v, &v->g->tree->cnfa, &v->g->cmap); - /* copy (portion of) match vector over if necessary */ - if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0) + /* on success, ensure caller's match vector is filled correctly */ + if (st == REG_OKAY && nmatch > 0) { - zapallsubs(pmatch, nmatch); - n = (nmatch < v->nmatch) ? nmatch : v->nmatch; - memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t)); + if (v->g->cflags & REG_NOSUB) + { + /* ensure we don't return partial results */ + zapallsubs(pmatch, nmatch); + } + else if (v->pmatch != pmatch) + { + /* copy portion of match vector over from (larger) work area */ + assert(nmatch <= v->nmatch); + memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t)); + } } /* clean up */ diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y index de3d97931e..bd5d4488a0 100644 --- a/src/backend/utils/adt/jsonpath_gram.y +++ b/src/backend/utils/adt/jsonpath_gram.y @@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags) errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented"))); } + /* + * We'll never need sub-match details at execution. While + * RE_compile_and_execute would set this flag anyway, force it on here to + * ensure that the regex cache entries created by makeItemLikeRegex are + * useful. + */ + cflags |= REG_NOSUB; + return cflags; } diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 484d4265fd..268cee1cbe 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len, { regex_t *re; + /* Use REG_NOSUB if caller does not want sub-match details */ + if (nmatch < 2) + cflags |= REG_NOSUB; + /* Compile RE */ re = RE_compile_and_cache(text_re, cflags, collation); @@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, int orig_len; pg_wchar *wide_str; int wide_len; + int cflags; regex_t *cpattern; regmatch_t *pmatch; int pmatch_len; @@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len); /* set up the compiled pattern */ - cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation); + cflags = re_flags->cflags; + if (!use_subpatterns) + cflags |= REG_NOSUB; + cpattern = RE_compile_and_cache(pattern, cflags, collation); /* do we want to remember subpatterns? */ if (use_subpatterns && cpattern->re_nsub > 0) @@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation, if (case_insensitive) cflags |= REG_ICASE; - re = RE_compile_and_cache(text_re, cflags, collation); + re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation); /* Examine it to see if there's a fixed prefix */ re_result = pg_regprefix(re, &str, &slen); diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index 2b48a19fb7..0455ae8069 100644 --- a/src/include/regex/regex.h +++ b/src/include/regex/regex.h @@ -106,7 +106,7 @@ typedef struct #define REG_QUOTE 000004 /* no special characters, none */ #define REG_NOSPEC REG_QUOTE /* historical synonym */ #define REG_ICASE 000010 /* ignore case */ -#define REG_NOSUB 000020 /* don't care about subexpressions */ +#define REG_NOSUB 000020 /* caller doesn't need subexpr match data */ #define REG_EXPANDED 000040 /* expanded format, white space & comments */ #define REG_NLSTOP 000100 /* \n doesn't match . or [^ ] */ #define REG_NLANCH 000200 /* ^ matches after \n, $ before */ diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 8db631c83b..91a52840c4 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -477,13 +477,14 @@ struct subre #define MIXED 04 /* mixed preference below */ #define CAP 010 /* capturing parens here or below */ #define BACKR 020 /* back reference here or below */ +#define BRUSE 040 /* is referenced by a back reference */ #define INUSE 0100 /* in use in final tree */ -#define NOPROP 03 /* bits which may not propagate up */ +#define UPPROP (MIXED|CAP|BACKR) /* flags which should propagate up */ #define LMIX(f) ((f)<<2) /* LONGER -> MIXED */ #define SMIX(f) ((f)<<1) /* SHORTER -> MIXED */ -#define UP(f) (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED)) +#define UP(f) (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED)) #define MESSY(f) ((f)&(MIXED|CAP|BACKR)) -#define PREF(f) ((f)&NOPROP) +#define PREF(f) ((f)&(LONGER|SHORTER)) #define PREF2(f1, f2) ((PREF(f1) != 0) ? PREF(f1) : PREF(f2)) #define COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2)) char latype; /* LATYPE code, if lookaround constraint */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 01d50ec1e3..dff5a16f57 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -146,14 +146,21 @@ select * from test_regex('(a)e', 'ae', '-'); {ae,a} (2 rows) --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); - test_regex ------------- - {0} - {NULL} +-- expectMatch 4.2 oPR (.)\1e aae +select * from test_regex('(.)\1e', 'aae', 'oPR'); + test_regex +-------------------------------- + {1,REG_UBACKREF,REG_UNONPOSIX} + {aae,NULL} (2 rows) +-- expectNomatch 4.2b oPR (.)\1e abe +select * from test_regex('(.)\1e', 'abe', 'oPR'); + test_regex +-------------------------------- + {1,REG_UBACKREF,REG_UNONPOSIX} +(1 row) + -- expectMatch 4.3 b {\(a\)b} ab ab a select * from test_regex('\(a\)b', 'ab', 'b'); test_regex @@ -2658,6 +2665,14 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); {"abc abc",abc} (2 rows) +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); + substring +----------- + f +(1 row) + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal -- expectMatch 15.1 MP "a\\010b" "a\bb" "a\bb" diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 7f5bc6e418..cfffd13686 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -63,8 +63,10 @@ select * from test_regex('ab', 'ab', 'b'); -- expectMatch 4.1 - (a)e ae ae a select * from test_regex('(a)e', 'ae', '-'); --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); +-- expectMatch 4.2 oPR (.)\1e aae +select * from test_regex('(.)\1e', 'aae', 'oPR'); +-- expectNomatch 4.2b oPR (.)\1e abe +select * from test_regex('(.)\1e', 'abe', 'oPR'); -- expectMatch 4.3 b {\(a\)b} ab ab a select * from test_regex('\(a\)b', 'ab', 'b'); -- expectMatch 4.4 - a((b)c) abc abc bc b @@ -775,6 +777,10 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP'); select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP'); select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal
On 8/4/21 6:15 PM, Tom Lane wrote: > Here's a little finger exercise that improves a case that's bothered me > for awhile. In a POSIX regexp, parentheses cause capturing by default; > you have to write the very non-obvious "(?:...)" if you don't want the > matching substring to be reported by the regexp engine. It's not obscure to perl programmers :-) > That'd be fine > if capturing were cheap, but with our engine it is not particularly > cheap. In many situations, the initial DFA check is sufficient to > tell whether there is an overall match, but it does not tell where any > subexpression match boundaries are. To identify exactly which substring > is deemed to match a parenthesized subexpression, we have to recursively > break down the match, which takes at the very least a few more DFA > invocations; and with an uncooperative regex, it can easily result in > O(N^2) behavior where there was none at the DFA stage. > > Therefore, we really ought to expend some effort to not capture > subexpressions if the sub-match data is not actually needed, which in > many invocations we know that it isn't. Spencer's original code has > a REG_NOSUB option that looks like it ought to be good for this ... but > on closer inspection it's basically useless, because it turns *all* > parens into non-capturing ones. That breaks back-references, so unless > you know that the regexp contains no back-refs, you can't use it. In perl you can use the 'n' modifier for this effect (since 5.22) I would expect to know if a back-ref were present. I'm a bit worried about how you'll keep track of back-ref numbering since back-refs only count capturing groups, and you're silently turning a capturing group into a non-capturing group. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I'm a bit worried about how you'll keep track of back-ref numbering > since back-refs only count capturing groups, and you're silently turning > a capturing group into a non-capturing group. They're already numbered at this point, and we aren't changing the numbers of the capturing groups that remain live. There will be unused entries in the regmatch_t array at runtime (corresponding to the zapped groups), but that doesn't cost anything worth mentioning. Now that you mention it, I am not sure whether there are any regression test cases that specifically cover still being able to match \2 when the first capture group went away. Probably should add more cases... regards, tom lane
On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 8/4/21 6:15 PM, Tom Lane wrote: > > Here's a little finger exercise that improves a case that's bothered me > > for awhile. In a POSIX regexp, parentheses cause capturing by default; > > you have to write the very non-obvious "(?:...)" if you don't want the > > matching substring to be reported by the regexp engine. > It's not obscure to perl programmers :-) Well, I consider myself a pretty fair perl programmer, and I know there's a way to do that, but I never do it, and I would have had to look up the exact syntax. So +1 from me for anything automatic that avoids paying the overhead in some cases. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Well, I consider myself a pretty fair perl programmer, and I know > there's a way to do that, but I never do it, and I would have had to > look up the exact syntax. So +1 from me for anything automatic that > avoids paying the overhead in some cases. That's my feeling about it too --- I never really think of this point when writing a regexp. It seems like something the engine ought to handle gracefully, so this patch is an attempt to do so. regards, tom lane
On 8/5/21 10:39 AM, Robert Haas wrote: > On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> On 8/4/21 6:15 PM, Tom Lane wrote: >>> Here's a little finger exercise that improves a case that's bothered me >>> for awhile. In a POSIX regexp, parentheses cause capturing by default; >>> you have to write the very non-obvious "(?:...)" if you don't want the >>> matching substring to be reported by the regexp engine. >> It's not obscure to perl programmers :-) > Well, I consider myself a pretty fair perl programmer, I also consider you one :-) Perhaps I should have said "many perl programmers". > and I know > there's a way to do that, but I never do it, and I would have had to > look up the exact syntax. So +1 from me for anything automatic that > avoids paying the overhead in some cases. Yeah, I'm not arguing against the idea. I also have to look it up, mainly because there is such a huge amount of stuff that can follow "(?", do "perldoc perlre" happens a lot when I'm doing that sort of work. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
> On Aug 5, 2021, at 7:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Probably should add more cases... The patch triggers an assertion that master does not: +select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))'; +server closed the connection unexpectedly + This probably means the server terminated abnormally + before or while processing the request. +connection to server was lost The relevant portion of the stack trace: frame #3: 0x00000001043bcf6d postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>, fileName=<unavailable>,lineNumber=<unavailable>) at assert.c:69:2 [opt] frame #4: 0x000000010410168b postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd055b0, begin=0x00007f863d821528,end=0x00007f863d82152c) at regexec.c:767:4 [opt] frame #5: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821524,end=<unavailable>) at regexec.c:835:10 [opt] frame #6: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd05430, begin=0x00007f863d821524,end=0x00007f863d82152c) at regexec.c:752 [opt] frame #7: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520,end=<unavailable>) at regexec.c:835:10 [opt] frame #8: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd050f0, begin=0x00007f863d821520,end=0x00007f863d82152c) at regexec.c:752 [opt] frame #9: 0x0000000104101282 postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520,end=<unavailable>) at regexec.c:832:9 [opt] frame #10: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd04ff0, begin=0x00007f863d821520,end=0x00007f863d821530) at regexec.c:752 [opt] frame #11: 0x00000001040ff508 postgres`pg_regexec [inlined] cfindloop(v=<unavailable>, cnfa=<unavailable>, cm=<unavailable>,d=0x00007ffeebdd6d68, s=0x00007ffeebdd2b48, coldp=<unavailable>) at regexec.c:600:10 [opt] frame #12: 0x00000001040ff36b postgres`pg_regexec [inlined] cfind(v=0x000000010459c5f8, cnfa=<unavailable>, cm=<unavailable>)at regexec.c:515 [opt] frame #13: 0x00000001040ff315 postgres`pg_regexec(re=<unavailable>, string=<unavailable>, len=140732855577960, search_start=<unavailable>,details=<unavailable>, nmatch=0, pmatch=0x0000000000000000, flags=0) at regexec.c:293 [opt] frame #14: 0x0000000104244d61 postgres`RE_wchar_execute(re=<unavailable>, data=<unavailable>, data_len=<unavailable>,start_search=<unavailable>, nmatch=<unavailable>, pmatch=<unavailable>) at regexp.c:274:19 [opt] frame #15: 0x0000000104242c80 postgres`textregexne [inlined] RE_execute(dat=<unavailable>, dat_len=31, nmatch=0, pmatch=0x0000000000000000)at regexp.c:322:10 [opt] frame #16: 0x0000000104242c50 postgres`textregexne [inlined] RE_compile_and_execute(text_re=<unavailable>, dat=<unavailable>,dat_len=31, cflags=19, collation=<unavailable>, nmatch=0, pmatch=<unavailable>) at regexp.c:357 [opt] — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > The patch triggers an assertion that master does not: > +select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))'; On looking into this, it's pretty simple: regexec.c has an assertion that a pure-capture subre node ought to be doing some capturing. case '(': /* no-op capture node */ assert(t->child != NULL); assert(t->capno > 0); That's fine as of HEAD, but with the proposed patch, we may notice that the node isn't actually referenced by any backref, and remove its capture marker, allowing this assertion to fire. Nothing's really wrong though. There seem to be three things we could do about that: 1. Extend removecaptures() so that it can actually remove no-op capture nodes if it's removed their capture markings. This would substantially complicate that function, and I judge that it's not worth the trouble. We'll only have such nodes in cases of capturing parentheses immediately surrounding capturing parentheses, which doesn't seem like a case worth expending sweat for. 2. Just drop the "t->capno > 0" assertion in regexec.c. 3. Weaken said assertion, perhaps by also checking the BRUSE flag bit. Not sure that I see any point to #3, so I just dropped the assertion in the attached. I've also rebased over the bug fixes from the other thread, and added a couple more test cases. regards, tom lane diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index bf1dea6352..64816dd370 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation, * Stage 1: Compile the regexp into a NFA, using the regexp library. */ #ifdef IGNORECASE - RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB | REG_ICASE, collation); #else - RE_compile(®ex, text_re, REG_ADVANCED, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB, collation); #endif /* diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c index 7673dab76f..45727ffa01 100644 --- a/src/backend/regex/regc_lex.c +++ b/src/backend/regex/regc_lex.c @@ -528,10 +528,7 @@ next(struct vars *v) } assert(NOTREACHED); } - if (v->cflags & REG_NOSUB) - RETV('(', 0); /* all parens non-capturing */ - else - RETV('(', 1); + RETV('(', 1); break; case CHR(')'): if (LASTTYPE('(')) diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 60a220c57a..ae3a7b6a38 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state static void freesubre(struct vars *, struct subre *); static void freesubreandsiblings(struct vars *, struct subre *); static void freesrnode(struct vars *, struct subre *); -static void optst(struct vars *, struct subre *); +static void removecaptures(struct vars *, struct subre *); static int numst(struct subre *, int); static void markst(struct subre *); static void cleanst(struct vars *); @@ -431,7 +431,8 @@ pg_regcomp(regex_t *re, dumpst(v->tree, debug, 1); } #endif - optst(v, v->tree); + if (v->cflags & REG_NOSUB) + removecaptures(v, v->tree); v->ntree = numst(v->tree, 1); markst(v->tree); cleanst(v); @@ -1013,14 +1014,16 @@ parseqatom(struct vars *v, break; case BACKREF: /* the Feature From The Black Lagoon */ INSIST(type != LACON, REG_ESUBREG); - INSIST(v->nextvalue < v->nsubs, REG_ESUBREG); - INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG); + subno = v->nextvalue; + assert(subno > 0); + INSIST(subno < v->nsubs, REG_ESUBREG); + NOERRN(); + INSIST(v->subs[subno] != NULL, REG_ESUBREG); NOERRN(); - assert(v->nextvalue > 0); atom = subre(v, 'b', BACKR, lp, rp); NOERRN(); - subno = v->nextvalue; atom->backno = subno; + v->subs[subno]->flags |= BRUSE; EMPTYARC(lp, rp); /* temporarily, so there's something */ NEXT(); break; @@ -2141,19 +2144,54 @@ freesrnode(struct vars *v, /* might be NULL */ } /* - * optst - optimize a subRE subtree + * removecaptures - remove unnecessary capture subREs + * + * If the caller said that it doesn't care about subexpression match data, + * we may delete the "capture" markers on subREs that are not referenced + * by any backrefs, and then simplify anything that's become non-messy. + * Call this only if REG_NOSUB flag is set. */ static void -optst(struct vars *v, - struct subre *t) +removecaptures(struct vars *v, + struct subre *t) { + struct subre *t2; + + assert(t != NULL); + + /* + * If this isn't itself a backref target, clear capno and tentatively + * clear CAP flag. + */ + if (!(t->flags & BRUSE)) + { + t->capno = 0; + t->flags &= ~CAP; + } + + /* Now recurse to children */ + for (t2 = t->child; t2 != NULL; t2 = t2->sibling) + { + removecaptures(v, t2); + /* Propagate child CAP flag back up, if it's still set */ + if (t2->flags & CAP) + t->flags |= CAP; + } + /* - * DGP (2007-11-13): I assume it was the programmer's intent to eventually - * come back and add code to optimize subRE trees, but the routine coded - * just spends effort traversing the tree and doing nothing. We can do - * nothing with less effort. + * If t now contains neither captures nor backrefs, there's no longer any + * need to care where its sub-match boundaries are, so we can reduce it to + * a simple DFA node. (Note in particular that MIXED child greediness is + * not a hindrance here, so we don't use the MESSY() macro.) */ - return; + if ((t->flags & (CAP | BACKR)) == 0) + { + if (t->child) + freesubreandsiblings(v, t->child); + t->child = NULL; + t->op = '='; + t->flags &= ~MIXED; + } } /* @@ -2501,6 +2539,8 @@ stdump(struct subre *t, fprintf(f, " hascapture"); if (t->flags & BACKR) fprintf(f, " hasbackref"); + if (t->flags & BRUSE) + fprintf(f, " isreferenced"); if (!(t->flags & INUSE)) fprintf(f, " UNUSED"); if (t->latype != (char) -1) diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index 5b9a087820..3800e74ef7 100644 --- a/src/backend/regex/regexec.c +++ b/src/backend/regex/regexec.c @@ -213,12 +213,10 @@ pg_regexec(regex_t *re, return REG_NOMATCH; backref = (v->g->info & REG_UBACKREF) ? 1 : 0; v->eflags = flags; - if (v->g->cflags & REG_NOSUB) - nmatch = 0; /* override client */ v->nmatch = nmatch; - if (backref) + if (backref && nmatch <= v->g->nsub) { - /* need work area */ + /* need larger work area */ if (v->g->nsub + 1 <= LOCALMAT) v->pmatch = mat; else @@ -229,7 +227,13 @@ pg_regexec(regex_t *re, v->nmatch = v->g->nsub + 1; } else + { + /* we can store results directly in caller's array */ v->pmatch = pmatch; + /* ensure any extra entries in caller's array are filled with -1 */ + if (nmatch > 0) + zapallsubs(pmatch, nmatch); + } v->details = details; v->start = (chr *) string; v->search_start = (chr *) string + search_start; @@ -290,12 +294,20 @@ pg_regexec(regex_t *re, else st = find(v, &v->g->tree->cnfa, &v->g->cmap); - /* copy (portion of) match vector over if necessary */ - if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0) + /* on success, ensure caller's match vector is filled correctly */ + if (st == REG_OKAY && nmatch > 0) { - zapallsubs(pmatch, nmatch); - n = (nmatch < v->nmatch) ? nmatch : v->nmatch; - memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t)); + if (v->g->cflags & REG_NOSUB) + { + /* ensure we don't return partial results */ + zapallsubs(pmatch, nmatch); + } + else if (v->pmatch != pmatch) + { + /* copy portion of match vector over from (larger) work area */ + assert(nmatch <= v->nmatch); + memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t)); + } } /* clean up */ @@ -752,7 +764,6 @@ cdissect(struct vars *v, break; case '(': /* no-op capture node */ assert(t->child != NULL); - assert(t->capno > 0); er = cdissect(v, t->child, begin, end); break; default: diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y index de3d97931e..bd5d4488a0 100644 --- a/src/backend/utils/adt/jsonpath_gram.y +++ b/src/backend/utils/adt/jsonpath_gram.y @@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags) errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented"))); } + /* + * We'll never need sub-match details at execution. While + * RE_compile_and_execute would set this flag anyway, force it on here to + * ensure that the regex cache entries created by makeItemLikeRegex are + * useful. + */ + cflags |= REG_NOSUB; + return cflags; } diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 484d4265fd..268cee1cbe 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len, { regex_t *re; + /* Use REG_NOSUB if caller does not want sub-match details */ + if (nmatch < 2) + cflags |= REG_NOSUB; + /* Compile RE */ re = RE_compile_and_cache(text_re, cflags, collation); @@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, int orig_len; pg_wchar *wide_str; int wide_len; + int cflags; regex_t *cpattern; regmatch_t *pmatch; int pmatch_len; @@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len); /* set up the compiled pattern */ - cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation); + cflags = re_flags->cflags; + if (!use_subpatterns) + cflags |= REG_NOSUB; + cpattern = RE_compile_and_cache(pattern, cflags, collation); /* do we want to remember subpatterns? */ if (use_subpatterns && cpattern->re_nsub > 0) @@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation, if (case_insensitive) cflags |= REG_ICASE; - re = RE_compile_and_cache(text_re, cflags, collation); + re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation); /* Examine it to see if there's a fixed prefix */ re_result = pg_regprefix(re, &str, &slen); diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index 2b48a19fb7..0455ae8069 100644 --- a/src/include/regex/regex.h +++ b/src/include/regex/regex.h @@ -106,7 +106,7 @@ typedef struct #define REG_QUOTE 000004 /* no special characters, none */ #define REG_NOSPEC REG_QUOTE /* historical synonym */ #define REG_ICASE 000010 /* ignore case */ -#define REG_NOSUB 000020 /* don't care about subexpressions */ +#define REG_NOSUB 000020 /* caller doesn't need subexpr match data */ #define REG_EXPANDED 000040 /* expanded format, white space & comments */ #define REG_NLSTOP 000100 /* \n doesn't match . or [^ ] */ #define REG_NLANCH 000200 /* ^ matches after \n, $ before */ diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 8db631c83b..91a52840c4 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -477,13 +477,14 @@ struct subre #define MIXED 04 /* mixed preference below */ #define CAP 010 /* capturing parens here or below */ #define BACKR 020 /* back reference here or below */ +#define BRUSE 040 /* is referenced by a back reference */ #define INUSE 0100 /* in use in final tree */ -#define NOPROP 03 /* bits which may not propagate up */ +#define UPPROP (MIXED|CAP|BACKR) /* flags which should propagate up */ #define LMIX(f) ((f)<<2) /* LONGER -> MIXED */ #define SMIX(f) ((f)<<1) /* SHORTER -> MIXED */ -#define UP(f) (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED)) +#define UP(f) (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED)) #define MESSY(f) ((f)&(MIXED|CAP|BACKR)) -#define PREF(f) ((f)&NOPROP) +#define PREF(f) ((f)&(LONGER|SHORTER)) #define PREF2(f1, f2) ((PREF(f1) != 0) ? PREF(f1) : PREF(f2)) #define COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2)) char latype; /* LATYPE code, if lookaround constraint */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 44da7d2019..6f47ddde40 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -146,12 +146,12 @@ select * from test_regex('(a)e', 'ae', '-'); {ae,a} (2 rows) --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); - test_regex ------------- - {0} - {NULL} +-- expectMatch 4.2 oPR (.)\1e abeaae aae {} +select * from test_regex('(.)\1e', 'abeaae', 'oPR'); + test_regex +-------------------------------- + {1,REG_UBACKREF,REG_UNONPOSIX} + {aae,NULL} (2 rows) -- expectMatch 4.3 b {\(a\)b} ab ab a @@ -2658,6 +2658,14 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); {"abc abc",abc} (2 rows) +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); + substring +----------- + f +(1 row) + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal -- expectMatch 15.1 MP "a\\010b" "a\bb" "a\bb" @@ -3476,6 +3484,22 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); {x,x,x,NULL} (2 rows) +-- expectMatch 21.37 RP ((.))(\2) xyy yy y y y +select * from test_regex('((.))(\2)', 'xyy', 'RP'); + test_regex +-------------------------------- + {3,REG_UBACKREF,REG_UNONPOSIX} + {yy,y,y,y} +(2 rows) + +-- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} +select * from test_regex('((.))(\2)', 'xyy', 'oRP'); + test_regex +-------------------------------- + {3,REG_UBACKREF,REG_UNONPOSIX} + {yy,NULL,NULL,NULL} +(2 rows) + -- doing 22 "multicharacter collating elements" -- # again ugh -- MCCEs are not implemented in Postgres, so we skip all these tests diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 9224fdfdd3..2861708932 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -63,8 +63,8 @@ select * from test_regex('ab', 'ab', 'b'); -- expectMatch 4.1 - (a)e ae ae a select * from test_regex('(a)e', 'ae', '-'); --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); +-- expectMatch 4.2 oPR (.)\1e abeaae aae {} +select * from test_regex('(.)\1e', 'abeaae', 'oPR'); -- expectMatch 4.3 b {\(a\)b} ab ab a select * from test_regex('\(a\)b', 'ab', 'b'); -- expectMatch 4.4 - a((b)c) abc abc bc b @@ -775,6 +775,10 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP'); select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP'); select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal @@ -1011,6 +1015,10 @@ select * from test_regex('(a*)*', 'bc', 'N'); select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M'); -- expectMatch 21.36 RPQ ((.))(\2){0} xy x x x {} select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); +-- expectMatch 21.37 RP ((.))(\2) xyy yy y y y +select * from test_regex('((.))(\2)', 'xyy', 'RP'); +-- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} +select * from test_regex('((.))(\2)', 'xyy', 'oRP'); -- doing 22 "multicharacter collating elements" -- # again ugh
> On Aug 8, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I've also rebased over the bug fixes from the other thread, > and added a couple more test cases. > > regards, tom lane Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2): select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw'); regexp_split_to_array ----------------------- - {"",zkodphfbfbfb} + {uuuzkodphfbfbfb} (1 row) The string starts with three "u" characters. The first of them is doubly-matched, meaning \1 and \2 refer to the first "u"character. The (\1\2) that follows matches the next two "u" characters. When the extra "useless" capture group is skipped,apparently this doesn't work anymore. I haven't looked at your patch, so I'm not sure why, but I'm guessing that\2 doesn't refer to anything. That analysis is consistent with the next change: select regexp_split_to_array('snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe', '((((?:.))))\3'); - regexp_split_to_array ---------------------------------------------------------------------- - {snfwbvx,snzqabixqbixqiumpgxde,xvnsemjxgqoqknrqe,mcqmfslfspskqpqxe} + regexp_split_to_array +------------------------------------------------------------------------ + {snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe} (1 row) The pattern matches any double character. I would expect it to match the "ee", the "mm" and the "ss" in the text. Withthe patched code, it matches nothing. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2): > select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw'); > regexp_split_to_array > ----------------------- > - {"",zkodphfbfbfb} > + {uuuzkodphfbfbfb} > (1 row) Ugh. The regex engine is finding the match correctly, but it's failing to tell the caller where it is :-(. I was a little too cute in optimizing the regmatch_t result-vector copying in pg_regexec, and forgot to ensure that the overall match position would be reported. Thanks for the testing! regards, tom lane diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index bf1dea6352..64816dd370 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation, * Stage 1: Compile the regexp into a NFA, using the regexp library. */ #ifdef IGNORECASE - RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB | REG_ICASE, collation); #else - RE_compile(®ex, text_re, REG_ADVANCED, collation); + RE_compile(®ex, text_re, + REG_ADVANCED | REG_NOSUB, collation); #endif /* diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c index 7673dab76f..45727ffa01 100644 --- a/src/backend/regex/regc_lex.c +++ b/src/backend/regex/regc_lex.c @@ -528,10 +528,7 @@ next(struct vars *v) } assert(NOTREACHED); } - if (v->cflags & REG_NOSUB) - RETV('(', 0); /* all parens non-capturing */ - else - RETV('(', 1); + RETV('(', 1); break; case CHR(')'): if (LASTTYPE('(')) diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 60a220c57a..ae3a7b6a38 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state static void freesubre(struct vars *, struct subre *); static void freesubreandsiblings(struct vars *, struct subre *); static void freesrnode(struct vars *, struct subre *); -static void optst(struct vars *, struct subre *); +static void removecaptures(struct vars *, struct subre *); static int numst(struct subre *, int); static void markst(struct subre *); static void cleanst(struct vars *); @@ -431,7 +431,8 @@ pg_regcomp(regex_t *re, dumpst(v->tree, debug, 1); } #endif - optst(v, v->tree); + if (v->cflags & REG_NOSUB) + removecaptures(v, v->tree); v->ntree = numst(v->tree, 1); markst(v->tree); cleanst(v); @@ -1013,14 +1014,16 @@ parseqatom(struct vars *v, break; case BACKREF: /* the Feature From The Black Lagoon */ INSIST(type != LACON, REG_ESUBREG); - INSIST(v->nextvalue < v->nsubs, REG_ESUBREG); - INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG); + subno = v->nextvalue; + assert(subno > 0); + INSIST(subno < v->nsubs, REG_ESUBREG); + NOERRN(); + INSIST(v->subs[subno] != NULL, REG_ESUBREG); NOERRN(); - assert(v->nextvalue > 0); atom = subre(v, 'b', BACKR, lp, rp); NOERRN(); - subno = v->nextvalue; atom->backno = subno; + v->subs[subno]->flags |= BRUSE; EMPTYARC(lp, rp); /* temporarily, so there's something */ NEXT(); break; @@ -2141,19 +2144,54 @@ freesrnode(struct vars *v, /* might be NULL */ } /* - * optst - optimize a subRE subtree + * removecaptures - remove unnecessary capture subREs + * + * If the caller said that it doesn't care about subexpression match data, + * we may delete the "capture" markers on subREs that are not referenced + * by any backrefs, and then simplify anything that's become non-messy. + * Call this only if REG_NOSUB flag is set. */ static void -optst(struct vars *v, - struct subre *t) +removecaptures(struct vars *v, + struct subre *t) { + struct subre *t2; + + assert(t != NULL); + + /* + * If this isn't itself a backref target, clear capno and tentatively + * clear CAP flag. + */ + if (!(t->flags & BRUSE)) + { + t->capno = 0; + t->flags &= ~CAP; + } + + /* Now recurse to children */ + for (t2 = t->child; t2 != NULL; t2 = t2->sibling) + { + removecaptures(v, t2); + /* Propagate child CAP flag back up, if it's still set */ + if (t2->flags & CAP) + t->flags |= CAP; + } + /* - * DGP (2007-11-13): I assume it was the programmer's intent to eventually - * come back and add code to optimize subRE trees, but the routine coded - * just spends effort traversing the tree and doing nothing. We can do - * nothing with less effort. + * If t now contains neither captures nor backrefs, there's no longer any + * need to care where its sub-match boundaries are, so we can reduce it to + * a simple DFA node. (Note in particular that MIXED child greediness is + * not a hindrance here, so we don't use the MESSY() macro.) */ - return; + if ((t->flags & (CAP | BACKR)) == 0) + { + if (t->child) + freesubreandsiblings(v, t->child); + t->child = NULL; + t->op = '='; + t->flags &= ~MIXED; + } } /* @@ -2501,6 +2539,8 @@ stdump(struct subre *t, fprintf(f, " hascapture"); if (t->flags & BACKR) fprintf(f, " hasbackref"); + if (t->flags & BRUSE) + fprintf(f, " isreferenced"); if (!(t->flags & INUSE)) fprintf(f, " UNUSED"); if (t->latype != (char) -1) diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index 5b9a087820..c08a224618 100644 --- a/src/backend/regex/regexec.c +++ b/src/backend/regex/regexec.c @@ -213,12 +213,10 @@ pg_regexec(regex_t *re, return REG_NOMATCH; backref = (v->g->info & REG_UBACKREF) ? 1 : 0; v->eflags = flags; - if (v->g->cflags & REG_NOSUB) - nmatch = 0; /* override client */ v->nmatch = nmatch; - if (backref) + if (backref && nmatch <= v->g->nsub) { - /* need work area */ + /* need larger work area */ if (v->g->nsub + 1 <= LOCALMAT) v->pmatch = mat; else @@ -229,7 +227,13 @@ pg_regexec(regex_t *re, v->nmatch = v->g->nsub + 1; } else + { + /* we can store results directly in caller's array */ v->pmatch = pmatch; + /* ensure any extra entries in caller's array are filled with -1 */ + if (nmatch > 0) + zapallsubs(pmatch, nmatch); + } v->details = details; v->start = (chr *) string; v->search_start = (chr *) string + search_start; @@ -290,12 +294,20 @@ pg_regexec(regex_t *re, else st = find(v, &v->g->tree->cnfa, &v->g->cmap); - /* copy (portion of) match vector over if necessary */ - if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0) + /* on success, ensure caller's match vector is filled correctly */ + if (st == REG_OKAY && nmatch > 0) { - zapallsubs(pmatch, nmatch); - n = (nmatch < v->nmatch) ? nmatch : v->nmatch; - memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t)); + if (v->pmatch != pmatch) + { + /* copy portion of match vector over from (larger) work area */ + assert(nmatch <= v->nmatch); + memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t)); + } + if (v->g->cflags & REG_NOSUB) + { + /* don't expose possibly-partial sub-match results to caller */ + zapallsubs(pmatch, nmatch); + } } /* clean up */ @@ -752,7 +764,6 @@ cdissect(struct vars *v, break; case '(': /* no-op capture node */ assert(t->child != NULL); - assert(t->capno > 0); er = cdissect(v, t->child, begin, end); break; default: diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y index de3d97931e..bd5d4488a0 100644 --- a/src/backend/utils/adt/jsonpath_gram.y +++ b/src/backend/utils/adt/jsonpath_gram.y @@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags) errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented"))); } + /* + * We'll never need sub-match details at execution. While + * RE_compile_and_execute would set this flag anyway, force it on here to + * ensure that the regex cache entries created by makeItemLikeRegex are + * useful. + */ + cflags |= REG_NOSUB; + return cflags; } diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 484d4265fd..268cee1cbe 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len, { regex_t *re; + /* Use REG_NOSUB if caller does not want sub-match details */ + if (nmatch < 2) + cflags |= REG_NOSUB; + /* Compile RE */ re = RE_compile_and_cache(text_re, cflags, collation); @@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, int orig_len; pg_wchar *wide_str; int wide_len; + int cflags; regex_t *cpattern; regmatch_t *pmatch; int pmatch_len; @@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags, wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len); /* set up the compiled pattern */ - cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation); + cflags = re_flags->cflags; + if (!use_subpatterns) + cflags |= REG_NOSUB; + cpattern = RE_compile_and_cache(pattern, cflags, collation); /* do we want to remember subpatterns? */ if (use_subpatterns && cpattern->re_nsub > 0) @@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation, if (case_insensitive) cflags |= REG_ICASE; - re = RE_compile_and_cache(text_re, cflags, collation); + re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation); /* Examine it to see if there's a fixed prefix */ re_result = pg_regprefix(re, &str, &slen); diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index 2b48a19fb7..0455ae8069 100644 --- a/src/include/regex/regex.h +++ b/src/include/regex/regex.h @@ -106,7 +106,7 @@ typedef struct #define REG_QUOTE 000004 /* no special characters, none */ #define REG_NOSPEC REG_QUOTE /* historical synonym */ #define REG_ICASE 000010 /* ignore case */ -#define REG_NOSUB 000020 /* don't care about subexpressions */ +#define REG_NOSUB 000020 /* caller doesn't need subexpr match data */ #define REG_EXPANDED 000040 /* expanded format, white space & comments */ #define REG_NLSTOP 000100 /* \n doesn't match . or [^ ] */ #define REG_NLANCH 000200 /* ^ matches after \n, $ before */ diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 8db631c83b..91a52840c4 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -477,13 +477,14 @@ struct subre #define MIXED 04 /* mixed preference below */ #define CAP 010 /* capturing parens here or below */ #define BACKR 020 /* back reference here or below */ +#define BRUSE 040 /* is referenced by a back reference */ #define INUSE 0100 /* in use in final tree */ -#define NOPROP 03 /* bits which may not propagate up */ +#define UPPROP (MIXED|CAP|BACKR) /* flags which should propagate up */ #define LMIX(f) ((f)<<2) /* LONGER -> MIXED */ #define SMIX(f) ((f)<<1) /* SHORTER -> MIXED */ -#define UP(f) (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED)) +#define UP(f) (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED)) #define MESSY(f) ((f)&(MIXED|CAP|BACKR)) -#define PREF(f) ((f)&NOPROP) +#define PREF(f) ((f)&(LONGER|SHORTER)) #define PREF2(f1, f2) ((PREF(f1) != 0) ? PREF(f1) : PREF(f2)) #define COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2)) char latype; /* LATYPE code, if lookaround constraint */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 44da7d2019..5a6cdf47c2 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -146,12 +146,12 @@ select * from test_regex('(a)e', 'ae', '-'); {ae,a} (2 rows) --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); - test_regex ------------- - {0} - {NULL} +-- expectMatch 4.2 oPR (.)\1e abeaae aae {} +select * from test_regex('(.)\1e', 'abeaae', 'oPR'); + test_regex +-------------------------------- + {1,REG_UBACKREF,REG_UNONPOSIX} + {aae,NULL} (2 rows) -- expectMatch 4.3 b {\(a\)b} ab ab a @@ -2658,6 +2658,20 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); {"abc abc",abc} (2 rows) +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); + substring +----------- + f +(1 row) + +select regexp_split_to_array('abcxxxdefyyyghi', '((.))(\1\2)'); + regexp_split_to_array +----------------------- + {abc,def,ghi} +(1 row) + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal -- expectMatch 15.1 MP "a\\010b" "a\bb" "a\bb" @@ -3476,6 +3490,22 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); {x,x,x,NULL} (2 rows) +-- expectMatch 21.37 RP ((.))(\2) xyy yy y y y +select * from test_regex('((.))(\2)', 'xyy', 'RP'); + test_regex +-------------------------------- + {3,REG_UBACKREF,REG_UNONPOSIX} + {yy,y,y,y} +(2 rows) + +-- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} +select * from test_regex('((.))(\2)', 'xyy', 'oRP'); + test_regex +-------------------------------- + {3,REG_UBACKREF,REG_UNONPOSIX} + {yy,NULL,NULL,NULL} +(2 rows) + -- doing 22 "multicharacter collating elements" -- # again ugh -- MCCEs are not implemented in Postgres, so we skip all these tests diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 9224fdfdd3..3419564203 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -63,8 +63,8 @@ select * from test_regex('ab', 'ab', 'b'); -- expectMatch 4.1 - (a)e ae ae a select * from test_regex('(a)e', 'ae', '-'); --- expectMatch 4.2 o (a)e ae -select * from test_regex('(a)e', 'ae', 'o'); +-- expectMatch 4.2 oPR (.)\1e abeaae aae {} +select * from test_regex('(.)\1e', 'abeaae', 'oPR'); -- expectMatch 4.3 b {\(a\)b} ab ab a select * from test_regex('\(a\)b', 'ab', 'b'); -- expectMatch 4.4 - a((b)c) abc abc bc b @@ -775,6 +775,11 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP'); select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP'); select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP'); +-- exercise oversize-regmatch_t-array paths in regexec() +-- (that case is not reachable via test_regex, sadly) +select substring('fffoooooooooooooooooooooooooooooooo', '^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)'); +select regexp_split_to_array('abcxxxdefyyyghi', '((.))(\1\2)'); + -- doing 15 "octal escapes vs back references" -- # initial zero is always octal @@ -1011,6 +1016,10 @@ select * from test_regex('(a*)*', 'bc', 'N'); select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M'); -- expectMatch 21.36 RPQ ((.))(\2){0} xy x x x {} select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); +-- expectMatch 21.37 RP ((.))(\2) xyy yy y y y +select * from test_regex('((.))(\2)', 'xyy', 'RP'); +-- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} +select * from test_regex('((.))(\2)', 'xyy', 'oRP'); -- doing 22 "multicharacter collating elements" -- # again ugh
> On Aug 8, 2021, at 1:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ugh. The regex engine is finding the match correctly, but it's failing to > tell the caller where it is :-(. I was a little too cute in optimizing > the regmatch_t result-vector copying in pg_regexec, and forgot to ensure > that the overall match position would be reported. > > Thanks for the testing! Sure! Thanks for improving the regular expression engine! I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differencesin results vs. master. This is a test suite of nearly 1.5 million separate regular expressions. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differencesin results vs. master. This is a test suite of nearly 1.5 million separate regular expressions. Cool, thanks. I also tried your millions-of-random-regexps script and didn't find any difference between the results from HEAD and those from the v3 patch. regards, tom lane
> On Aug 8, 2021, at 3:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Cool, thanks. I also tried your millions-of-random-regexps script > and didn't find any difference between the results from HEAD and > those from the v3 patch. The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd likeme to focus on. Thanks again for working on this. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd likeme to focus on. Pushed, but while re-reading it before commit I noticed that there's some more fairly low-hanging fruit in regexp_replace(). As I had it in that patch, it never used REG_NOSUB because of the possibility that the replacement string uses "\N". However, we're already pre-checking the replacement string to see if it has backslashes at all, so while we're at it we can check for \N to discover if we actually need any subexpression match data or not. We do need to refactor a little to postpone calling pg_regcomp until after we know that, but I think that makes replace_text_regexp's API less ugly not more so. While I was at it, I changed the search-for-backslash loops to use memchr rather than handwritten looping. Their use of pg_mblen was pretty unnecessary given we only need to find backslashes, and we can assume the backend encoding is ASCII-safe. Using a bunch of random cases generated by your little perl script, I see maybe 10-15% speedup on test cases that don't use \N in the replacement string, while it's about a wash on cases that do. (If I'd been using a multibyte encoding, maybe the memchr change would have made a difference, but I didn't try that.) regards, tom lane diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 268cee1cbe..3b7a607437 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -630,11 +630,10 @@ textregexreplace_noopt(PG_FUNCTION_ARGS) text *s = PG_GETARG_TEXT_PP(0); text *p = PG_GETARG_TEXT_PP(1); text *r = PG_GETARG_TEXT_PP(2); - regex_t *re; - - re = RE_compile_and_cache(p, REG_ADVANCED, PG_GET_COLLATION()); - PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0, 1)); + PG_RETURN_TEXT_P(replace_text_regexp(s, p, r, + REG_ADVANCED, PG_GET_COLLATION(), + 0, 1)); } /* @@ -648,7 +647,6 @@ textregexreplace(PG_FUNCTION_ARGS) text *p = PG_GETARG_TEXT_PP(1); text *r = PG_GETARG_TEXT_PP(2); text *opt = PG_GETARG_TEXT_PP(3); - regex_t *re; pg_re_flags flags; /* @@ -672,10 +670,9 @@ textregexreplace(PG_FUNCTION_ARGS) parse_re_flags(&flags, opt); - re = RE_compile_and_cache(p, flags.cflags, PG_GET_COLLATION()); - - PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0, - flags.glob ? 0 : 1)); + PG_RETURN_TEXT_P(replace_text_regexp(s, p, r, + flags.cflags, PG_GET_COLLATION(), + 0, flags.glob ? 0 : 1)); } /* @@ -694,7 +691,6 @@ textregexreplace_extended(PG_FUNCTION_ARGS) int n = 1; text *flags = PG_GETARG_TEXT_PP_IF_EXISTS(5); pg_re_flags re_flags; - regex_t *re; /* Collect optional parameters */ if (PG_NARGS() > 3) @@ -723,11 +719,10 @@ textregexreplace_extended(PG_FUNCTION_ARGS) if (PG_NARGS() <= 4) n = re_flags.glob ? 0 : 1; - /* Compile the regular expression */ - re = RE_compile_and_cache(p, re_flags.cflags, PG_GET_COLLATION()); - /* Do the replacement(s) */ - PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, start - 1, n)); + PG_RETURN_TEXT_P(replace_text_regexp(s, p, r, + re_flags.cflags, PG_GET_COLLATION(), + start - 1, n)); } /* This is separate to keep the opr_sanity regression test from complaining */ diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 348b5566de..acb8741734 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -4359,34 +4359,36 @@ replace_text(PG_FUNCTION_ARGS) } /* - * check_replace_text_has_escape_char + * check_replace_text_has_escape * - * check whether replace_text contains escape char. + * Returns 0 if text contains no backslashes that need processing. + * Returns 1 if text contains backslashes, but not regexp submatch specifiers. + * Returns 2 if text contains regexp submatch specifiers (\1 .. \9). */ -static bool -check_replace_text_has_escape_char(const text *replace_text) +static int +check_replace_text_has_escape(const text *replace_text) { + int result = 0; const char *p = VARDATA_ANY(replace_text); const char *p_end = p + VARSIZE_ANY_EXHDR(replace_text); - if (pg_database_encoding_max_length() == 1) - { - for (; p < p_end; p++) - { - if (*p == '\\') - return true; - } - } - else + while (p < p_end) { - for (; p < p_end; p += pg_mblen(p)) + /* Find next escape char, if any. */ + p = memchr(p, '\\', p_end - p); + if (p == NULL) + break; + p++; + /* Note: a backslash at the end doesn't require extra processing. */ + if (p < p_end) { - if (*p == '\\') - return true; + if (*p >= '1' && *p <= '9') + return 2; /* Found a submatch specifier, so done */ + result = 1; /* Found some other sequence, keep looking */ + p++; } } - - return false; + return result; } /* @@ -4403,25 +4405,17 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text, { const char *p = VARDATA_ANY(replace_text); const char *p_end = p + VARSIZE_ANY_EXHDR(replace_text); - int eml = pg_database_encoding_max_length(); - for (;;) + while (p < p_end) { const char *chunk_start = p; int so; int eo; - /* Find next escape char. */ - if (eml == 1) - { - for (; p < p_end && *p != '\\'; p++) - /* nothing */ ; - } - else - { - for (; p < p_end && *p != '\\'; p += pg_mblen(p)) - /* nothing */ ; - } + /* Find next escape char, if any. */ + p = memchr(p, '\\', p_end - p); + if (p == NULL) + p = p_end; /* Copy the text we just scanned over, if any. */ if (p > chunk_start) @@ -4473,7 +4467,7 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text, continue; } - if (so != -1 && eo != -1) + if (so >= 0 && eo >= 0) { /* * Copy the text that is back reference of regexp. Note so and eo @@ -4491,36 +4485,37 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text, } } -#define REGEXP_REPLACE_BACKREF_CNT 10 - /* * replace_text_regexp * - * replace substring(s) in src_text that match regexp with replace_text. + * replace substring(s) in src_text that match pattern with replace_text. + * The replace_text can contain backslash markers to substitute + * (parts of) the matched text. * + * cflags: regexp compile flags. + * collation: collation to use. * search_start: the character (not byte) offset in src_text at which to * begin searching. * n: if 0, replace all matches; if > 0, replace only the N'th match. - * - * Note: to avoid having to include regex.h in builtins.h, we declare - * the regexp argument as void *, but really it's regex_t *. */ text * -replace_text_regexp(text *src_text, void *regexp, +replace_text_regexp(text *src_text, text *pattern_text, text *replace_text, + int cflags, Oid collation, int search_start, int n) { text *ret_text; - regex_t *re = (regex_t *) regexp; + regex_t *re; int src_text_len = VARSIZE_ANY_EXHDR(src_text); int nmatches = 0; StringInfoData buf; - regmatch_t pmatch[REGEXP_REPLACE_BACKREF_CNT]; + regmatch_t pmatch[10]; /* main match, plus \1 to \9 */ + int nmatch = lengthof(pmatch); pg_wchar *data; size_t data_len; int data_pos; char *start_ptr; - bool have_escape; + int escape_status; initStringInfo(&buf); @@ -4528,8 +4523,19 @@ replace_text_regexp(text *src_text, void *regexp, data = (pg_wchar *) palloc((src_text_len + 1) * sizeof(pg_wchar)); data_len = pg_mb2wchar_with_len(VARDATA_ANY(src_text), data, src_text_len); - /* Check whether replace_text has escape char. */ - have_escape = check_replace_text_has_escape_char(replace_text); + /* Check whether replace_text has escapes, especially regexp submatches. */ + escape_status = check_replace_text_has_escape(replace_text); + + /* If no regexp submatches, we can use REG_NOSUB. */ + if (escape_status < 2) + { + cflags |= REG_NOSUB; + /* Also tell pg_regexec we only want the whole-match location. */ + nmatch = 1; + } + + /* Prepare the regexp. */ + re = RE_compile_and_cache(pattern_text, cflags, collation); /* start_ptr points to the data_pos'th character of src_text */ start_ptr = (char *) VARDATA_ANY(src_text); @@ -4546,7 +4552,7 @@ replace_text_regexp(text *src_text, void *regexp, data_len, search_start, NULL, /* no details */ - REGEXP_REPLACE_BACKREF_CNT, + nmatch, pmatch, 0); @@ -4602,10 +4608,9 @@ replace_text_regexp(text *src_text, void *regexp, } /* - * Copy the replace_text. Process back references when the - * replace_text has escape characters. + * Copy the replace_text, processing escapes if any are present. */ - if (have_escape) + if (escape_status > 0) appendStringInfoRegexpSubstr(&buf, replace_text, pmatch, start_ptr, data_pos); else diff --git a/src/include/utils/varlena.h b/src/include/utils/varlena.h index 6645e2af13..cd12252ed9 100644 --- a/src/include/utils/varlena.h +++ b/src/include/utils/varlena.h @@ -33,8 +33,9 @@ extern bool SplitDirectoriesString(char *rawstring, char separator, List **namelist); extern bool SplitGUCList(char *rawstring, char separator, List **namelist); -extern text *replace_text_regexp(text *src_text, void *regexp, +extern text *replace_text_regexp(text *src_text, text *pattern_text, text *replace_text, + int cflags, Oid collation, int search_start, int n); #endif
Tom, I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c),and also on the fixed version "Make regexp engine's backref-related compilationstate more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a). Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass. The backtrace is still complaining about regc_nfa.c:1265: +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]'); +server closed the connection unexpectedly + This probably means the server terminated abnormally + before or while processing the request. +connection to server was lost — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c),and also on the fixed version "Make regexp engine's backref-related compilationstate more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a). > Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass. > The backtrace is still complaining about regc_nfa.c:1265: > +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]'); > +server closed the connection unexpectedly Hmmm ... yeah, I see it too. This points up something I'd wondered about before, which is whether the code that "cancels everything" after detecting {0} is really OK. It throws away the outer subre *and children* without worrying about what might be inside, and here we see that that's not good enough --- there's still a v->subs pointer to the first capturing paren set, which we just deleted, so that the \1 later on messes up. I'm not sure why the back branches are managing not to crash, but that might just be a memory management artifact. regards, tom lane
I wrote: > Hmmm ... yeah, I see it too. This points up something I'd wondered > about before, which is whether the code that "cancels everything" > after detecting {0} is really OK. It throws away the outer subre > *and children* without worrying about what might be inside, and > here we see that that's not good enough --- there's still a v->subs > pointer to the first capturing paren set, which we just deleted, > so that the \1 later on messes up. I'm not sure why the back > branches are managing not to crash, but that might just be a memory > management artifact. ... yeah, it is. For me, this variant hits the assertion in all branches: regression=# select regexp_split_to_array('', '((.)){0}(\2){0}'); server closed the connection unexpectedly So that's a pre-existing (and very long-standing) bug. I'm not sure if it has any serious impact in non-assert builds though. Failure to clean out some disconnected arcs probably has no real effect on the regex's behavior later. regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes: > +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]'); > +server closed the connection unexpectedly Here's a quick draft patch for this. Basically it moves the responsibility for clearing v->subs[] pointers into the freesubre() recursion, so that it will happen for contained capturing parens not only the top level. There is a potentially interesting definitional question: what exactly ought this regexp do? ((.)){0}\2 Because the capturing paren sets are zero-quantified, they will never be matched to any characters, so the backref can never have any defined referent. I suspect that study of the POSIX spec would lead to the conclusion that this is a legal regexp but it will never match anything. Implementing that would be tedious though, and what's more it seems very unlikely that the user wanted any such behavior. So I think throwing an error is an appropriate response. The existing code will throw such an error for ((.)){0}\1 so I guess Spencer did think about this to some extent -- he just forgot about the possibility of nested parens. This patch should work OK in HEAD and v14, but it will need a bit of fooling-about for older branches I think, given that they fill v->subs[] a little differently. regards, tom lane diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index ae3a7b6a38..ae1ff670f9 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -550,8 +550,6 @@ freev(struct vars *v, { if (v->re != NULL) rfree(v->re); - if (v->subs != v->sub10) - FREE(v->subs); if (v->nfa != NULL) freenfa(v->nfa); if (v->tree != NULL) @@ -562,6 +560,8 @@ freev(struct vars *v, freecvec(v->cv); if (v->cv2 != NULL) freecvec(v->cv2); + if (v->subs != v->sub10) + FREE(v->subs); if (v->lacons != NULL) freelacons(v->lacons, v->nlacons); ERR(err); /* nop if err==0 */ @@ -1091,8 +1091,8 @@ parseqatom(struct vars *v, { if (atom != NULL) freesubre(v, atom); - if (atomtype == '(') - v->subs[subno] = NULL; + /* freesubre() will clean up any v->subs[] pointers into the atom */ + assert(atomtype != '(' || v->subs[subno] == NULL); delsub(v->nfa, lp, rp); EMPTYARC(lp, rp); return top; @@ -2127,9 +2127,24 @@ freesrnode(struct vars *v, /* might be NULL */ if (sr == NULL) return; + /* + * If the node is referenced in v->subs[], clear that to avoid leaving a + * dangling pointer. This should only ever matter when parseqatom() is + * throwing away a {0,0}-quantified atom that contains capturing parens. + * Doing this will cause later backrefs to such parens to fail, which is + * maybe not strictly POSIX but seems more helpful than allowing a backref + * that can never have a defined referent. + */ + if (sr->capno > 0 && v != NULL) + { + assert(v->subs[sr->capno] == sr); + v->subs[sr->capno] = NULL; + } + if (!NULLCNFA(sr->cnfa)) freecnfa(&sr->cnfa); sr->flags = 0; /* in particular, not INUSE */ + sr->capno = 0; sr->child = sr->sibling = NULL; sr->begin = sr->end = NULL; diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 5a6cdf47c2..850e0c8ef5 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -3506,6 +3506,10 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP'); {yy,NULL,NULL,NULL} (2 rows) +-- # throwing an error for this is arguably not per POSIX +-- expectError 21.39 - {((.)){0}(\2){0}} ESUBREG +select * from test_regex('((.)){0}(\2){0}', '', '-'); +ERROR: invalid regular expression: invalid backreference number -- doing 22 "multicharacter collating elements" -- # again ugh -- MCCEs are not implemented in Postgres, so we skip all these tests diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 3419564203..378f22b67b 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -1020,6 +1020,9 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); select * from test_regex('((.))(\2)', 'xyy', 'RP'); -- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} select * from test_regex('((.))(\2)', 'xyy', 'oRP'); +-- # throwing an error for this is arguably not per POSIX +-- expectError 21.39 - {((.)){0}(\2){0}} ESUBREG +select * from test_regex('((.)){0}(\2){0}', '', '-'); -- doing 22 "multicharacter collating elements" -- # again ugh
> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Pushed, but while re-reading it before commit I noticed that there's > some more fairly low-hanging fruit in regexp_replace(). As I had it > in that patch, it never used REG_NOSUB because of the possibility > that the replacement string uses "\N". However, we're already > pre-checking the replacement string to see if it has backslashes > at all, so while we're at it we can check for \N to discover if we > actually need any subexpression match data or not. We do need to > refactor a little to postpone calling pg_regcomp until after we > know that, but I think that makes replace_text_regexp's API less > ugly not more so. > > While I was at it, I changed the search-for-backslash loops to > use memchr rather than handwritten looping. Their use of > pg_mblen was pretty unnecessary given we only need to find > backslashes, and we can assume the backend encoding is ASCII-safe. > > Using a bunch of random cases generated by your little perl > script, I see maybe 10-15% speedup on test cases that don't > use \N in the replacement string, while it's about a wash > on cases that do. (If I'd been using a multibyte encoding, > maybe the memchr change would have made a difference, but > I didn't try that.) I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't seemto break it. There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > There is a potentially interesting definitional question: > what exactly ought this regexp do? > > ((.)){0}\2 > > Because the capturing paren sets are zero-quantified, they will > never be matched to any characters, so the backref can never > have any defined referent. Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we shouldbe clear what that is: #!/usr/bin/perl use strict; use warnings; our $match; if ('foo' =~ m/((.)(??{ die; })){0}(..)/) { print "captured 1 $1\n" if defined $1; print "captured 2 $2\n" if defined $2; print "captured 3 $3\n" if defined $3; print "captured 4 $4\n" if defined $4; print "match = $match\n" if defined $match; } This will print "captured 3 fo", proving that although the regular expression is parsed with the (..) bound to the thirdcapture group, the first two capture groups never run. If you don't believe that, change the {0} to {1} and observethat the script dies. > So I think throwing an > error is an appropriate response. The existing code will > throw such an error for > > ((.)){0}\1 > > so I guess Spencer did think about this to some extent -- he > just forgot about the possibility of nested parens. Ugg. That means our code throws an error where perl does not, pretty well negating my point above. If we're already throwingan error for this type of thing, I agree we should be consistent about it. My personal preference would have beento do the same thing as perl, but it seems that ship has already sailed. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 9, 2021, at 5:14 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > our $match; > if ('foo' =~ m/((.)(??{ die; })){0}(..)/) I left in a stray variable. A prior version of this script was assigning to $match where it now has die. Sorry for anyconfusion. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Pushed, but while re-reading it before commit I noticed that there's >> some more fairly low-hanging fruit in regexp_replace(). > I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't seemto break it. There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage. Pushed that bit, thanks for testing! I plan to not do anything about the (()){0} bug until after the release window, since that will need to be back-patched. That bug's gotta be twenty years old, so while I kinda wish we'd found it a few days earlier, waiting another 3 months won't make much difference. regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There is a potentially interesting definitional question: >> what exactly ought this regexp do? >> ((.)){0}\2 >> Because the capturing paren sets are zero-quantified, they will >> never be matched to any characters, so the backref can never >> have any defined referent. > Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we shouldbe clear what that is: > if ('foo' =~ m/((.)(??{ die; })){0}(..)/) > { > print "captured 1 $1\n" if defined $1; > print "captured 2 $2\n" if defined $2; > print "captured 3 $3\n" if defined $3; > print "captured 4 $4\n" if defined $4; > print "match = $match\n" if defined $match; > } Hm. I'm not sure that this example proves anything about Perl's handling of the situation, since you didn't use a backref. I tried both if ('foo' =~ m/((.)){0}\1/) if ('foo' =~ m/((.)){0}\2/) and while neither throws an error, they don't succeed either. So AFAICS Perl is acting in the way I'm attributing to POSIX. But maybe we should actually read POSIX ... >> ... I guess Spencer did think about this to some extent -- he >> just forgot about the possibility of nested parens. > Ugg. That means our code throws an error where perl does not, pretty > well negating my point above. If we're already throwing an error for > this type of thing, I agree we should be consistent about it. My > personal preference would have been to do the same thing as perl, but it > seems that ship has already sailed. Removing an error case is usually an easier sell than adding one. However, the fact that the simplest case (viz, '(.){0}\1') has always thrown an error and nobody has complained in twenty-ish years suggests that nobody much cares. regards, tom lane
> On Aug 9, 2021, at 6:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Hm. I'm not sure that this example proves anything about Perl's handling > of the situation, since you didn't use a backref. Well, this doesn't die either: if ('foo' =~ m/((??{ die; })(.)(??{ die $1; })){0}((??{ die "got here"; })|\2)/) { print "matched\n"; } The point is that the regex engine never walks the part of the pattern that {0} qualifies. I thought it was more clear inthe prior example, because that example proves that the engine does get as far as capturing. This example also does that,and with a backref, because it dies with "got here". — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 9, 2021, at 6:17 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Well, this doesn't die either: Meaning it doesn't die in the part of the pattern qualified by {0} either. It does die in the other part. Sorry again forthe confusion. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > This patch should work OK in HEAD and v14, but it will need > a bit of fooling-about for older branches I think, given that > they fill v->subs[] a little differently. Note that I tested your patch *before* master, so the changes look backwards. I tested this fix and it seems good to me. Some patterns that used to work (by chance?) now raise an error, such as: select 'bpgouiwcquu' ~ '(((([e])*?)){0,0}?(\3))'; -ERROR: invalid regular expression: invalid backreference number + ?column? +---------- + f +(1 row) I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the user facingdifferences. If we had a good reason for raising an error for these back-references, maybe that'd be fine, but itseems to just be an implementation detail. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the userfacing differences. If we had a good reason for raising an error for these back-references, maybe that'd be fine, butit seems to just be an implementation detail. I thought about this some more, and I'm coming around to the idea that throwing an error is the wrong thing. As a contrary example, consider (.)|(\1\1) We don't throw an error for this, and neither does Perl, even though the capturing parens can never be defined in the branch where the backrefs are. So it seems hard to argue that this is okay but the other thing isn't. Another interesting example is (.){0}(\1){0} I think that the correct interpretation is that this is a valid regexp matching an empty string (i.e., zero repetitions of each part), even though neither capture group will be defined. That's different from (.){0}(\1) which can never match. So I took another look at the code, and it doesn't seem that hard to make it act this way. The attached passes regression, but I've not beat on it with random strings. regards, tom lane diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index ae3a7b6a38..d9840171a3 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -1089,11 +1089,23 @@ parseqatom(struct vars *v, /* annoying special case: {0} or {0,0} cancels everything */ if (m == 0 && n == 0) { - if (atom != NULL) - freesubre(v, atom); - if (atomtype == '(') - v->subs[subno] = NULL; - delsub(v->nfa, lp, rp); + /* + * If we had capturing subexpression(s) within the atom, we don't want + * to destroy them, because it's legal (if useless) to back-ref them + * later. Hence, just unlink the atom from lp/rp and then ignore it. + */ + if (atom != NULL && (atom->flags & CAP)) + { + delsub(v->nfa, lp, atom->begin); + delsub(v->nfa, atom->end, rp); + } + else + { + /* Otherwise, we can clean up any subre infrastructure we made */ + if (atom != NULL) + freesubre(v, atom); + delsub(v->nfa, lp, rp); + } EMPTYARC(lp, rp); return top; } diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 5a6cdf47c2..96c0e6fa59 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -3506,6 +3506,21 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP'); {yy,NULL,NULL,NULL} (2 rows) +-- expectMatch 21.39 NPQR {((.)){0}(\2){0}} xyz {} {} {} {} +select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR'); + test_regex +------------------------------------------------------------ + {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX,REG_UEMPTYMATCH} + {"",NULL,NULL,NULL} +(2 rows) + +-- expectNomatch 21.40 PQR {((.)){0}(\2)} xxx +select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR'); + test_regex +-------------------------------------------- + {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX} +(1 row) + -- doing 22 "multicharacter collating elements" -- # again ugh -- MCCEs are not implemented in Postgres, so we skip all these tests diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 3419564203..29806e9417 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -1020,6 +1020,10 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ'); select * from test_regex('((.))(\2)', 'xyy', 'RP'); -- expectMatch 21.38 oRP ((.))(\2) xyy yy {} {} {} select * from test_regex('((.))(\2)', 'xyy', 'oRP'); +-- expectMatch 21.39 NPQR {((.)){0}(\2){0}} xyz {} {} {} {} +select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR'); +-- expectNomatch 21.40 PQR {((.)){0}(\2)} xxx +select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR'); -- doing 22 "multicharacter collating elements" -- # again ugh
I wrote: > So AFAICS Perl is acting in the way I'm attributing to POSIX. > But maybe we should actually read POSIX ... I went to look at the POSIX spec, and was reminded that it lacks backrefs altogether. (POSIX specifies the "BRE" and "ERE" regex flavors as described in our docs, but not "ARE".) So there's no help to be had there. The fact that Perl doesn't throw an error is probably the most useful precedent available. regards, tom lane
> On Aug 9, 2021, at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > So I took another look at the code, and it doesn't seem that hard > to make it act this way. The attached passes regression, but > I've not beat on it with random strings. I expect to get back around to testing this in a day or so. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 9, 2021, at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > So I took another look at the code, and it doesn't seem that hard > to make it act this way. The attached passes regression, but > I've not beat on it with random strings. > alternate-fix-zero-quantified-nested-parens.patch I've beaten on this with random patterns and it seems to hold up just fine. I have also reviewed the diffs and, for thepatterns where the output changes, everything looks correct. I can't find anything wrong with this patch. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > I've beaten on this with random patterns and it seems to hold up just fine. I have also reviewed the diffs and, for thepatterns where the output changes, everything looks correct. I can't find anything wrong with this patch. Thanks for testing! I'll push it in a bit. regards, tom lane