Thread: Another regexp performance improvement: skip useless paren-captures

Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Here's a little finger exercise that improves a case that's bothered me
for awhile.  In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine.  That'd be fine
if capturing were cheap, but with our engine it is not particularly
cheap.  In many situations, the initial DFA check is sufficient to
tell whether there is an overall match, but it does not tell where any
subexpression match boundaries are.  To identify exactly which substring
is deemed to match a parenthesized subexpression, we have to recursively
break down the match, which takes at the very least a few more DFA
invocations; and with an uncooperative regex, it can easily result in
O(N^2) behavior where there was none at the DFA stage.

Therefore, we really ought to expend some effort to not capture
subexpressions if the sub-match data is not actually needed, which in
many invocations we know that it isn't.  Spencer's original code has
a REG_NOSUB option that looks like it ought to be good for this ... but
on closer inspection it's basically useless, because it turns *all*
parens into non-capturing ones.  That breaks back-references, so unless
you know that the regexp contains no back-refs, you can't use it.

The attached proposed patch redefines REG_NOSUB as being a regexp-
compile-time promise that the caller doesn't care about sub-match
locations, but not a promise that no backrefs exist.  (If the
caller passes a match-locations array at execution anyway, it will
just get back -1 values, as if no sub-match had been identified.)
If that flag is passed, we run through the completed sub-regexp
tree and remove the "capture" markers on any subREs that are
not actually referenced by some backref.  This typically causes
some parent subREs to no longer be deemed "messy", so that their
separate child subREs can be thrown away entirely, saving memory
space as well as runtime.

(I'd originally thought that a much more complex patch would be
needed to do this, because I assumed that re-optimizing the subRE
tree would be much more complicated than this.  However, as far
as I can see this is sufficient; this change doesn't expose any
cases where additional tree restructuring would be helpful.)

Testing with Joel's handy little corpus of web regexps, there's a
useful improvement of the speed of ~ operators (a/k/a regexp_like()).
I see the total time to apply regexp_like() to all 4474520 entries
dropping from 10:17 to 5:46.  Interesting statistics include

regexp=# select max(duration),avg(duration) from headresults;
       max       |       avg
-----------------+-----------------
 00:00:00.939389 | 00:00:00.000138
(1 row)

regexp=# select max(duration),avg(duration) from patchresults;
       max       |       avg
-----------------+-----------------
 00:00:00.918549 | 00:00:00.000077
(1 row)

The lower percentiles don't move much, but upper ones do:

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from headresults;
                          percentile_cont
-------------------------------------------------------------------
 {00:00:00.000027,00:00:00.000059,00:00:00.000067,00:00:00.000108}
(1 row)

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from patchresults;
                          percentile_cont
-------------------------------------------------------------------
 {00:00:00.000025,00:00:00.000042,00:00:00.000048,00:00:00.000065}
(1 row)

This isn't terribly surprising, because regexps that were already
really cheap probably have no capturing parens to dispense with.

Of course, there's no benefit with functions that do need sub-match
data, such as regexp_match.  But the added overhead in such cases
should be quite negligible.  The only downside I can see is that
if you use the "same" regexp in both submatches-needed and
non-submatches-needed contexts, you'll end up with two separate
compiled regexp cache entries.  That doesn't seem like a big
problem though.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index bf1dea6352..64816dd370 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation,
      * Stage 1: Compile the regexp into a NFA, using the regexp library.
      */
 #ifdef IGNORECASE
-    RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB | REG_ICASE, collation);
 #else
-    RE_compile(®ex, text_re, REG_ADVANCED, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB, collation);
 #endif

     /*
diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c
index 7673dab76f..45727ffa01 100644
--- a/src/backend/regex/regc_lex.c
+++ b/src/backend/regex/regc_lex.c
@@ -528,10 +528,7 @@ next(struct vars *v)
                 }
                 assert(NOTREACHED);
             }
-            if (v->cflags & REG_NOSUB)
-                RETV('(', 0);    /* all parens non-capturing */
-            else
-                RETV('(', 1);
+            RETV('(', 1);
             break;
         case CHR(')'):
             if (LASTTYPE('('))
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 9f71177d31..38f709cb60 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state
 static void freesubre(struct vars *, struct subre *);
 static void freesubreandsiblings(struct vars *, struct subre *);
 static void freesrnode(struct vars *, struct subre *);
-static void optst(struct vars *, struct subre *);
+static void removecaptures(struct vars *, struct subre *);
 static int    numst(struct subre *, int);
 static void markst(struct subre *);
 static void cleanst(struct vars *);
@@ -431,7 +431,8 @@ pg_regcomp(regex_t *re,
         dumpst(v->tree, debug, 1);
     }
 #endif
-    optst(v, v->tree);
+    if (v->cflags & REG_NOSUB)
+        removecaptures(v, v->tree);
     v->ntree = numst(v->tree, 1);
     markst(v->tree);
     cleanst(v);
@@ -1004,14 +1005,16 @@ parseqatom(struct vars *v,
             break;
         case BACKREF:            /* the Feature From The Black Lagoon */
             INSIST(type != LACON, REG_ESUBREG);
-            INSIST(v->nextvalue < v->nsubs, REG_ESUBREG);
-            INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG);
+            subno = v->nextvalue;
+            assert(subno > 0);
+            INSIST(subno < v->nsubs, REG_ESUBREG);
+            NOERR();
+            INSIST(v->subs[subno] != NULL, REG_ESUBREG);
             NOERR();
-            assert(v->nextvalue > 0);
             atom = subre(v, 'b', BACKR, lp, rp);
             NOERR();
-            subno = v->nextvalue;
             atom->backno = subno;
+            v->subs[subno]->flags |= BRUSE;
             EMPTYARC(lp, rp);    /* temporarily, so there's something */
             NEXT();
             break;
@@ -2122,19 +2125,54 @@ freesrnode(struct vars *v,        /* might be NULL */
 }

 /*
- * optst - optimize a subRE subtree
+ * removecaptures - remove unnecessary capture subREs
+ *
+ * If the caller said that it doesn't care about subexpression match data,
+ * we may delete the "capture" markers on subREs that are not referenced
+ * by any backrefs, and then simplify anything that's become non-messy.
+ * Call this only if REG_NOSUB flag is set.
  */
 static void
-optst(struct vars *v,
-      struct subre *t)
+removecaptures(struct vars *v,
+               struct subre *t)
 {
+    struct subre *t2;
+
+    assert(t != NULL);
+
+    /*
+     * If this isn't itself a backref target, clear capno and tentatively
+     * clear CAP flag.
+     */
+    if (!(t->flags & BRUSE))
+    {
+        t->capno = 0;
+        t->flags &= ~CAP;
+    }
+
+    /* Now recurse to children */
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+    {
+        removecaptures(v, t2);
+        /* Propagate child CAP flag back up, if it's still set */
+        if (t2->flags & CAP)
+            t->flags |= CAP;
+    }
+
     /*
-     * DGP (2007-11-13): I assume it was the programmer's intent to eventually
-     * come back and add code to optimize subRE trees, but the routine coded
-     * just spends effort traversing the tree and doing nothing. We can do
-     * nothing with less effort.
+     * If t now contains neither captures nor backrefs, there's no longer any
+     * need to care where its sub-match boundaries are, so we can reduce it to
+     * a simple DFA node.  (Note in particular that MIXED child greediness is
+     * not a hindrance here, so we don't use the MESSY() macro.)
      */
-    return;
+    if ((t->flags & (CAP | BACKR)) == 0)
+    {
+        if (t->child)
+            freesubreandsiblings(v, t->child);
+        t->child = NULL;
+        t->op = '=';
+        t->flags &= ~MIXED;
+    }
 }

 /*
@@ -2482,6 +2520,8 @@ stdump(struct subre *t,
         fprintf(f, " hascapture");
     if (t->flags & BACKR)
         fprintf(f, " hasbackref");
+    if (t->flags & BRUSE)
+        fprintf(f, " isreferenced");
     if (!(t->flags & INUSE))
         fprintf(f, " UNUSED");
     if (t->latype != (char) -1)
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index 5b9a087820..e75b59e9f0 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -213,12 +213,10 @@ pg_regexec(regex_t *re,
         return REG_NOMATCH;
     backref = (v->g->info & REG_UBACKREF) ? 1 : 0;
     v->eflags = flags;
-    if (v->g->cflags & REG_NOSUB)
-        nmatch = 0;                /* override client */
     v->nmatch = nmatch;
-    if (backref)
+    if (backref && nmatch <= v->g->nsub)
     {
-        /* need work area */
+        /* need larger work area */
         if (v->g->nsub + 1 <= LOCALMAT)
             v->pmatch = mat;
         else
@@ -229,7 +227,13 @@ pg_regexec(regex_t *re,
         v->nmatch = v->g->nsub + 1;
     }
     else
+    {
+        /* we can store results directly in caller's array */
         v->pmatch = pmatch;
+        /* ensure any extra entries in caller's array are filled with -1 */
+        if (nmatch > 0)
+            zapallsubs(pmatch, nmatch);
+    }
     v->details = details;
     v->start = (chr *) string;
     v->search_start = (chr *) string + search_start;
@@ -290,12 +294,20 @@ pg_regexec(regex_t *re,
     else
         st = find(v, &v->g->tree->cnfa, &v->g->cmap);

-    /* copy (portion of) match vector over if necessary */
-    if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0)
+    /* on success, ensure caller's match vector is filled correctly */
+    if (st == REG_OKAY && nmatch > 0)
     {
-        zapallsubs(pmatch, nmatch);
-        n = (nmatch < v->nmatch) ? nmatch : v->nmatch;
-        memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t));
+        if (v->g->cflags & REG_NOSUB)
+        {
+            /* ensure we don't return partial results */
+            zapallsubs(pmatch, nmatch);
+        }
+        else if (v->pmatch != pmatch)
+        {
+            /* copy portion of match vector over from (larger) work area */
+            assert(nmatch <= v->nmatch);
+            memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t));
+        }
     }

     /* clean up */
diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y
index de3d97931e..bd5d4488a0 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags)
                      errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented")));
     }

+    /*
+     * We'll never need sub-match details at execution.  While
+     * RE_compile_and_execute would set this flag anyway, force it on here to
+     * ensure that the regex cache entries created by makeItemLikeRegex are
+     * useful.
+     */
+    cflags |= REG_NOSUB;
+
     return cflags;
 }

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 484d4265fd..268cee1cbe 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len,
 {
     regex_t    *re;

+    /* Use REG_NOSUB if caller does not want sub-match details */
+    if (nmatch < 2)
+        cflags |= REG_NOSUB;
+
     /* Compile RE */
     re = RE_compile_and_cache(text_re, cflags, collation);

@@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     int            orig_len;
     pg_wchar   *wide_str;
     int            wide_len;
+    int            cflags;
     regex_t    *cpattern;
     regmatch_t *pmatch;
     int            pmatch_len;
@@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);

     /* set up the compiled pattern */
-    cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation);
+    cflags = re_flags->cflags;
+    if (!use_subpatterns)
+        cflags |= REG_NOSUB;
+    cpattern = RE_compile_and_cache(pattern, cflags, collation);

     /* do we want to remember subpatterns? */
     if (use_subpatterns && cpattern->re_nsub > 0)
@@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation,
     if (case_insensitive)
         cflags |= REG_ICASE;

-    re = RE_compile_and_cache(text_re, cflags, collation);
+    re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation);

     /* Examine it to see if there's a fixed prefix */
     re_result = pg_regprefix(re, &str, &slen);
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 2b48a19fb7..0455ae8069 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -106,7 +106,7 @@ typedef struct
 #define REG_QUOTE    000004        /* no special characters, none */
 #define REG_NOSPEC    REG_QUOTE    /* historical synonym */
 #define REG_ICASE    000010        /* ignore case */
-#define REG_NOSUB    000020        /* don't care about subexpressions */
+#define REG_NOSUB    000020        /* caller doesn't need subexpr match data */
 #define REG_EXPANDED    000040    /* expanded format, white space & comments */
 #define REG_NLSTOP    000100        /* \n doesn't match . or [^ ] */
 #define REG_NLANCH    000200        /* ^ matches after \n, $ before */
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 8db631c83b..91a52840c4 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -477,13 +477,14 @@ struct subre
 #define  MIXED     04                /* mixed preference below */
 #define  CAP     010            /* capturing parens here or below */
 #define  BACKR     020            /* back reference here or below */
+#define  BRUSE     040            /* is referenced by a back reference */
 #define  INUSE     0100            /* in use in final tree */
-#define  NOPROP  03                /* bits which may not propagate up */
+#define  UPPROP  (MIXED|CAP|BACKR)    /* flags which should propagate up */
 #define  LMIX(f) ((f)<<2)        /* LONGER -> MIXED */
 #define  SMIX(f) ((f)<<1)        /* SHORTER -> MIXED */
-#define  UP(f)     (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED))
+#define  UP(f)     (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED))
 #define  MESSY(f)     ((f)&(MIXED|CAP|BACKR))
-#define  PREF(f) ((f)&NOPROP)
+#define  PREF(f)     ((f)&(LONGER|SHORTER))
 #define  PREF2(f1, f2)     ((PREF(f1) != 0) ? PREF(f1) : PREF(f2))
 #define  COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2))
     char        latype;            /* LATYPE code, if lookaround constraint */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 01d50ec1e3..dff5a16f57 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -146,14 +146,21 @@ select * from test_regex('(a)e', 'ae', '-');
  {ae,a}
 (2 rows)

--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
- test_regex
-------------
- {0}
- {NULL}
+-- expectMatch    4.2  oPR    (.)\1e        aae
+select * from test_regex('(.)\1e', 'aae', 'oPR');
+           test_regex
+--------------------------------
+ {1,REG_UBACKREF,REG_UNONPOSIX}
+ {aae,NULL}
 (2 rows)

+-- expectNomatch    4.2b  oPR    (.)\1e        abe
+select * from test_regex('(.)\1e', 'abe', 'oPR');
+           test_regex
+--------------------------------
+ {1,REG_UBACKREF,REG_UNONPOSIX}
+(1 row)
+
 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
 select * from test_regex('\(a\)b', 'ab', 'b');
  test_regex
@@ -2658,6 +2665,14 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');
  {"abc abc",abc}
 (2 rows)

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+ substring
+-----------
+ f
+(1 row)
+
 -- doing 15 "octal escapes vs back references"
 -- # initial zero is always octal
 -- expectMatch    15.1  MP    "a\\010b"    "a\bb"    "a\bb"
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 7f5bc6e418..cfffd13686 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -63,8 +63,10 @@ select * from test_regex('ab', 'ab', 'b');

 -- expectMatch    4.1  -        (a)e        ae    ae    a
 select * from test_regex('(a)e', 'ae', '-');
--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
+-- expectMatch    4.2  oPR    (.)\1e        aae
+select * from test_regex('(.)\1e', 'aae', 'oPR');
+-- expectNomatch    4.2b  oPR    (.)\1e        abe
+select * from test_regex('(.)\1e', 'abe', 'oPR');
 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
 select * from test_regex('\(a\)b', 'ab', 'b');
 -- expectMatch    4.4  -        a((b)c)        abc    abc    bc    b
@@ -775,6 +777,10 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP');
 select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP');
 select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+
 -- doing 15 "octal escapes vs back references"

 -- # initial zero is always octal

Re: Another regexp performance improvement: skip useless paren-captures

From
Andrew Dunstan
Date:
On 8/4/21 6:15 PM, Tom Lane wrote:
> Here's a little finger exercise that improves a case that's bothered me
> for awhile.  In a POSIX regexp, parentheses cause capturing by default;
> you have to write the very non-obvious "(?:...)" if you don't want the
> matching substring to be reported by the regexp engine.  


It's not obscure to perl programmers :-)


> That'd be fine
> if capturing were cheap, but with our engine it is not particularly
> cheap.  In many situations, the initial DFA check is sufficient to
> tell whether there is an overall match, but it does not tell where any
> subexpression match boundaries are.  To identify exactly which substring
> is deemed to match a parenthesized subexpression, we have to recursively
> break down the match, which takes at the very least a few more DFA
> invocations; and with an uncooperative regex, it can easily result in
> O(N^2) behavior where there was none at the DFA stage.
>
> Therefore, we really ought to expend some effort to not capture
> subexpressions if the sub-match data is not actually needed, which in
> many invocations we know that it isn't.  Spencer's original code has
> a REG_NOSUB option that looks like it ought to be good for this ... but
> on closer inspection it's basically useless, because it turns *all*
> parens into non-capturing ones.  That breaks back-references, so unless
> you know that the regexp contains no back-refs, you can't use it.


In perl you can use the 'n' modifier for this effect (since 5.22)


I would expect to know if a back-ref were present.


I'm a bit worried about how you'll keep track of back-ref numbering
since back-refs only count capturing groups, and you're silently turning
a capturing group into a non-capturing group.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I'm a bit worried about how you'll keep track of back-ref numbering
> since back-refs only count capturing groups, and you're silently turning
> a capturing group into a non-capturing group.

They're already numbered at this point, and we aren't changing the numbers
of the capturing groups that remain live.  There will be unused entries in
the regmatch_t array at runtime (corresponding to the zapped groups), but
that doesn't cost anything worth mentioning.

Now that you mention it, I am not sure whether there are any regression
test cases that specifically cover still being able to match \2 when
the first capture group went away.  Probably should add more cases...

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Robert Haas
Date:
On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 8/4/21 6:15 PM, Tom Lane wrote:
> > Here's a little finger exercise that improves a case that's bothered me
> > for awhile.  In a POSIX regexp, parentheses cause capturing by default;
> > you have to write the very non-obvious "(?:...)" if you don't want the
> > matching substring to be reported by the regexp engine.
> It's not obscure to perl programmers :-)

Well, I consider myself a pretty fair perl programmer, and I know
there's a way to do that, but I never do it, and I would have had to
look up the exact syntax. So +1 from me for anything automatic that
avoids paying the overhead in some cases.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Well, I consider myself a pretty fair perl programmer, and I know
> there's a way to do that, but I never do it, and I would have had to
> look up the exact syntax. So +1 from me for anything automatic that
> avoids paying the overhead in some cases.

That's my feeling about it too --- I never really think of this
point when writing a regexp.  It seems like something the engine
ought to handle gracefully, so this patch is an attempt to do so.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Andrew Dunstan
Date:
On 8/5/21 10:39 AM, Robert Haas wrote:
> On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 8/4/21 6:15 PM, Tom Lane wrote:
>>> Here's a little finger exercise that improves a case that's bothered me
>>> for awhile.  In a POSIX regexp, parentheses cause capturing by default;
>>> you have to write the very non-obvious "(?:...)" if you don't want the
>>> matching substring to be reported by the regexp engine.
>> It's not obscure to perl programmers :-)
> Well, I consider myself a pretty fair perl programmer, 


I also consider you one :-)


Perhaps I should have said "many perl programmers".


> and I know
> there's a way to do that, but I never do it, and I would have had to
> look up the exact syntax. So +1 from me for anything automatic that
> avoids paying the overhead in some cases.


Yeah, I'm not arguing against the idea. I also have to look it up,
mainly because there is such a huge amount of stuff that can follow
"(?", do "perldoc perlre" happens a lot when I'm doing that sort of work.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 5, 2021, at 7:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Probably should add more cases...

The patch triggers an assertion that master does not:

+select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';
+server closed the connection unexpectedly
+   This probably means the server terminated abnormally
+   before or while processing the request.
+connection to server was lost

The relevant portion of the stack trace:

    frame #3: 0x00000001043bcf6d postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>,
fileName=<unavailable>,lineNumber=<unavailable>) at assert.c:69:2 [opt] 
    frame #4: 0x000000010410168b postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd055b0,
begin=0x00007f863d821528,end=0x00007f863d82152c) at regexec.c:767:4 [opt] 
    frame #5: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>,
begin=0x00007f863d821524,end=<unavailable>) at regexec.c:835:10 [opt] 
    frame #6: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd05430,
begin=0x00007f863d821524,end=0x00007f863d82152c) at regexec.c:752 [opt] 
    frame #7: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>,
begin=0x00007f863d821520,end=<unavailable>) at regexec.c:835:10 [opt] 
    frame #8: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd050f0,
begin=0x00007f863d821520,end=0x00007f863d82152c) at regexec.c:752 [opt] 
    frame #9: 0x0000000104101282 postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>,
begin=0x00007f863d821520,end=<unavailable>) at regexec.c:832:9 [opt] 
    frame #10: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd04ff0,
begin=0x00007f863d821520,end=0x00007f863d821530) at regexec.c:752 [opt] 
    frame #11: 0x00000001040ff508 postgres`pg_regexec [inlined] cfindloop(v=<unavailable>, cnfa=<unavailable>,
cm=<unavailable>,d=0x00007ffeebdd6d68, s=0x00007ffeebdd2b48, coldp=<unavailable>) at regexec.c:600:10 [opt] 
    frame #12: 0x00000001040ff36b postgres`pg_regexec [inlined] cfind(v=0x000000010459c5f8, cnfa=<unavailable>,
cm=<unavailable>)at regexec.c:515 [opt] 
    frame #13: 0x00000001040ff315 postgres`pg_regexec(re=<unavailable>, string=<unavailable>, len=140732855577960,
search_start=<unavailable>,details=<unavailable>, nmatch=0, pmatch=0x0000000000000000, flags=0) at regexec.c:293 [opt] 
    frame #14: 0x0000000104244d61 postgres`RE_wchar_execute(re=<unavailable>, data=<unavailable>,
data_len=<unavailable>,start_search=<unavailable>, nmatch=<unavailable>, pmatch=<unavailable>) at regexp.c:274:19 [opt] 
    frame #15: 0x0000000104242c80 postgres`textregexne [inlined] RE_execute(dat=<unavailable>, dat_len=31, nmatch=0,
pmatch=0x0000000000000000)at regexp.c:322:10 [opt] 
    frame #16: 0x0000000104242c50 postgres`textregexne [inlined] RE_compile_and_execute(text_re=<unavailable>,
dat=<unavailable>,dat_len=31, cflags=19, collation=<unavailable>, nmatch=0, pmatch=<unavailable>) at regexp.c:357 [opt] 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> The patch triggers an assertion that master does not:

> +select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';

On looking into this, it's pretty simple: regexec.c has an assertion
that a pure-capture subre node ought to be doing some capturing.

        case '(':                /* no-op capture node */
            assert(t->child != NULL);
            assert(t->capno > 0);

That's fine as of HEAD, but with the proposed patch, we may notice
that the node isn't actually referenced by any backref, and remove
its capture marker, allowing this assertion to fire.  Nothing's
really wrong though.

There seem to be three things we could do about that:

1. Extend removecaptures() so that it can actually remove no-op
capture nodes if it's removed their capture markings.  This would
substantially complicate that function, and I judge that it's not
worth the trouble.  We'll only have such nodes in cases of
capturing parentheses immediately surrounding capturing parentheses,
which doesn't seem like a case worth expending sweat for.

2. Just drop the "t->capno > 0" assertion in regexec.c.

3. Weaken said assertion, perhaps by also checking the BRUSE flag bit.

Not sure that I see any point to #3, so I just dropped the
assertion in the attached.

I've also rebased over the bug fixes from the other thread,
and added a couple more test cases.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index bf1dea6352..64816dd370 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation,
      * Stage 1: Compile the regexp into a NFA, using the regexp library.
      */
 #ifdef IGNORECASE
-    RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB | REG_ICASE, collation);
 #else
-    RE_compile(®ex, text_re, REG_ADVANCED, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB, collation);
 #endif

     /*
diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c
index 7673dab76f..45727ffa01 100644
--- a/src/backend/regex/regc_lex.c
+++ b/src/backend/regex/regc_lex.c
@@ -528,10 +528,7 @@ next(struct vars *v)
                 }
                 assert(NOTREACHED);
             }
-            if (v->cflags & REG_NOSUB)
-                RETV('(', 0);    /* all parens non-capturing */
-            else
-                RETV('(', 1);
+            RETV('(', 1);
             break;
         case CHR(')'):
             if (LASTTYPE('('))
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 60a220c57a..ae3a7b6a38 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state
 static void freesubre(struct vars *, struct subre *);
 static void freesubreandsiblings(struct vars *, struct subre *);
 static void freesrnode(struct vars *, struct subre *);
-static void optst(struct vars *, struct subre *);
+static void removecaptures(struct vars *, struct subre *);
 static int    numst(struct subre *, int);
 static void markst(struct subre *);
 static void cleanst(struct vars *);
@@ -431,7 +431,8 @@ pg_regcomp(regex_t *re,
         dumpst(v->tree, debug, 1);
     }
 #endif
-    optst(v, v->tree);
+    if (v->cflags & REG_NOSUB)
+        removecaptures(v, v->tree);
     v->ntree = numst(v->tree, 1);
     markst(v->tree);
     cleanst(v);
@@ -1013,14 +1014,16 @@ parseqatom(struct vars *v,
             break;
         case BACKREF:            /* the Feature From The Black Lagoon */
             INSIST(type != LACON, REG_ESUBREG);
-            INSIST(v->nextvalue < v->nsubs, REG_ESUBREG);
-            INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG);
+            subno = v->nextvalue;
+            assert(subno > 0);
+            INSIST(subno < v->nsubs, REG_ESUBREG);
+            NOERRN();
+            INSIST(v->subs[subno] != NULL, REG_ESUBREG);
             NOERRN();
-            assert(v->nextvalue > 0);
             atom = subre(v, 'b', BACKR, lp, rp);
             NOERRN();
-            subno = v->nextvalue;
             atom->backno = subno;
+            v->subs[subno]->flags |= BRUSE;
             EMPTYARC(lp, rp);    /* temporarily, so there's something */
             NEXT();
             break;
@@ -2141,19 +2144,54 @@ freesrnode(struct vars *v,        /* might be NULL */
 }

 /*
- * optst - optimize a subRE subtree
+ * removecaptures - remove unnecessary capture subREs
+ *
+ * If the caller said that it doesn't care about subexpression match data,
+ * we may delete the "capture" markers on subREs that are not referenced
+ * by any backrefs, and then simplify anything that's become non-messy.
+ * Call this only if REG_NOSUB flag is set.
  */
 static void
-optst(struct vars *v,
-      struct subre *t)
+removecaptures(struct vars *v,
+               struct subre *t)
 {
+    struct subre *t2;
+
+    assert(t != NULL);
+
+    /*
+     * If this isn't itself a backref target, clear capno and tentatively
+     * clear CAP flag.
+     */
+    if (!(t->flags & BRUSE))
+    {
+        t->capno = 0;
+        t->flags &= ~CAP;
+    }
+
+    /* Now recurse to children */
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+    {
+        removecaptures(v, t2);
+        /* Propagate child CAP flag back up, if it's still set */
+        if (t2->flags & CAP)
+            t->flags |= CAP;
+    }
+
     /*
-     * DGP (2007-11-13): I assume it was the programmer's intent to eventually
-     * come back and add code to optimize subRE trees, but the routine coded
-     * just spends effort traversing the tree and doing nothing. We can do
-     * nothing with less effort.
+     * If t now contains neither captures nor backrefs, there's no longer any
+     * need to care where its sub-match boundaries are, so we can reduce it to
+     * a simple DFA node.  (Note in particular that MIXED child greediness is
+     * not a hindrance here, so we don't use the MESSY() macro.)
      */
-    return;
+    if ((t->flags & (CAP | BACKR)) == 0)
+    {
+        if (t->child)
+            freesubreandsiblings(v, t->child);
+        t->child = NULL;
+        t->op = '=';
+        t->flags &= ~MIXED;
+    }
 }

 /*
@@ -2501,6 +2539,8 @@ stdump(struct subre *t,
         fprintf(f, " hascapture");
     if (t->flags & BACKR)
         fprintf(f, " hasbackref");
+    if (t->flags & BRUSE)
+        fprintf(f, " isreferenced");
     if (!(t->flags & INUSE))
         fprintf(f, " UNUSED");
     if (t->latype != (char) -1)
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index 5b9a087820..3800e74ef7 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -213,12 +213,10 @@ pg_regexec(regex_t *re,
         return REG_NOMATCH;
     backref = (v->g->info & REG_UBACKREF) ? 1 : 0;
     v->eflags = flags;
-    if (v->g->cflags & REG_NOSUB)
-        nmatch = 0;                /* override client */
     v->nmatch = nmatch;
-    if (backref)
+    if (backref && nmatch <= v->g->nsub)
     {
-        /* need work area */
+        /* need larger work area */
         if (v->g->nsub + 1 <= LOCALMAT)
             v->pmatch = mat;
         else
@@ -229,7 +227,13 @@ pg_regexec(regex_t *re,
         v->nmatch = v->g->nsub + 1;
     }
     else
+    {
+        /* we can store results directly in caller's array */
         v->pmatch = pmatch;
+        /* ensure any extra entries in caller's array are filled with -1 */
+        if (nmatch > 0)
+            zapallsubs(pmatch, nmatch);
+    }
     v->details = details;
     v->start = (chr *) string;
     v->search_start = (chr *) string + search_start;
@@ -290,12 +294,20 @@ pg_regexec(regex_t *re,
     else
         st = find(v, &v->g->tree->cnfa, &v->g->cmap);

-    /* copy (portion of) match vector over if necessary */
-    if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0)
+    /* on success, ensure caller's match vector is filled correctly */
+    if (st == REG_OKAY && nmatch > 0)
     {
-        zapallsubs(pmatch, nmatch);
-        n = (nmatch < v->nmatch) ? nmatch : v->nmatch;
-        memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t));
+        if (v->g->cflags & REG_NOSUB)
+        {
+            /* ensure we don't return partial results */
+            zapallsubs(pmatch, nmatch);
+        }
+        else if (v->pmatch != pmatch)
+        {
+            /* copy portion of match vector over from (larger) work area */
+            assert(nmatch <= v->nmatch);
+            memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t));
+        }
     }

     /* clean up */
@@ -752,7 +764,6 @@ cdissect(struct vars *v,
             break;
         case '(':                /* no-op capture node */
             assert(t->child != NULL);
-            assert(t->capno > 0);
             er = cdissect(v, t->child, begin, end);
             break;
         default:
diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y
index de3d97931e..bd5d4488a0 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags)
                      errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented")));
     }

+    /*
+     * We'll never need sub-match details at execution.  While
+     * RE_compile_and_execute would set this flag anyway, force it on here to
+     * ensure that the regex cache entries created by makeItemLikeRegex are
+     * useful.
+     */
+    cflags |= REG_NOSUB;
+
     return cflags;
 }

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 484d4265fd..268cee1cbe 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len,
 {
     regex_t    *re;

+    /* Use REG_NOSUB if caller does not want sub-match details */
+    if (nmatch < 2)
+        cflags |= REG_NOSUB;
+
     /* Compile RE */
     re = RE_compile_and_cache(text_re, cflags, collation);

@@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     int            orig_len;
     pg_wchar   *wide_str;
     int            wide_len;
+    int            cflags;
     regex_t    *cpattern;
     regmatch_t *pmatch;
     int            pmatch_len;
@@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);

     /* set up the compiled pattern */
-    cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation);
+    cflags = re_flags->cflags;
+    if (!use_subpatterns)
+        cflags |= REG_NOSUB;
+    cpattern = RE_compile_and_cache(pattern, cflags, collation);

     /* do we want to remember subpatterns? */
     if (use_subpatterns && cpattern->re_nsub > 0)
@@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation,
     if (case_insensitive)
         cflags |= REG_ICASE;

-    re = RE_compile_and_cache(text_re, cflags, collation);
+    re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation);

     /* Examine it to see if there's a fixed prefix */
     re_result = pg_regprefix(re, &str, &slen);
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 2b48a19fb7..0455ae8069 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -106,7 +106,7 @@ typedef struct
 #define REG_QUOTE    000004        /* no special characters, none */
 #define REG_NOSPEC    REG_QUOTE    /* historical synonym */
 #define REG_ICASE    000010        /* ignore case */
-#define REG_NOSUB    000020        /* don't care about subexpressions */
+#define REG_NOSUB    000020        /* caller doesn't need subexpr match data */
 #define REG_EXPANDED    000040    /* expanded format, white space & comments */
 #define REG_NLSTOP    000100        /* \n doesn't match . or [^ ] */
 #define REG_NLANCH    000200        /* ^ matches after \n, $ before */
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 8db631c83b..91a52840c4 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -477,13 +477,14 @@ struct subre
 #define  MIXED     04                /* mixed preference below */
 #define  CAP     010            /* capturing parens here or below */
 #define  BACKR     020            /* back reference here or below */
+#define  BRUSE     040            /* is referenced by a back reference */
 #define  INUSE     0100            /* in use in final tree */
-#define  NOPROP  03                /* bits which may not propagate up */
+#define  UPPROP  (MIXED|CAP|BACKR)    /* flags which should propagate up */
 #define  LMIX(f) ((f)<<2)        /* LONGER -> MIXED */
 #define  SMIX(f) ((f)<<1)        /* SHORTER -> MIXED */
-#define  UP(f)     (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED))
+#define  UP(f)     (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED))
 #define  MESSY(f)     ((f)&(MIXED|CAP|BACKR))
-#define  PREF(f) ((f)&NOPROP)
+#define  PREF(f)     ((f)&(LONGER|SHORTER))
 #define  PREF2(f1, f2)     ((PREF(f1) != 0) ? PREF(f1) : PREF(f2))
 #define  COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2))
     char        latype;            /* LATYPE code, if lookaround constraint */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 44da7d2019..6f47ddde40 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -146,12 +146,12 @@ select * from test_regex('(a)e', 'ae', '-');
  {ae,a}
 (2 rows)

--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
- test_regex
-------------
- {0}
- {NULL}
+-- expectMatch    4.2  oPR    (.)\1e        abeaae    aae    {}
+select * from test_regex('(.)\1e', 'abeaae', 'oPR');
+           test_regex
+--------------------------------
+ {1,REG_UBACKREF,REG_UNONPOSIX}
+ {aae,NULL}
 (2 rows)

 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
@@ -2658,6 +2658,14 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');
  {"abc abc",abc}
 (2 rows)

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+ substring
+-----------
+ f
+(1 row)
+
 -- doing 15 "octal escapes vs back references"
 -- # initial zero is always octal
 -- expectMatch    15.1  MP    "a\\010b"    "a\bb"    "a\bb"
@@ -3476,6 +3484,22 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
  {x,x,x,NULL}
 (2 rows)

+-- expectMatch    21.37 RP    ((.))(\2)    xyy    yy    y    y    y
+select * from test_regex('((.))(\2)', 'xyy', 'RP');
+           test_regex
+--------------------------------
+ {3,REG_UBACKREF,REG_UNONPOSIX}
+ {yy,y,y,y}
+(2 rows)
+
+-- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
+select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+           test_regex
+--------------------------------
+ {3,REG_UBACKREF,REG_UNONPOSIX}
+ {yy,NULL,NULL,NULL}
+(2 rows)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 9224fdfdd3..2861708932 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -63,8 +63,8 @@ select * from test_regex('ab', 'ab', 'b');

 -- expectMatch    4.1  -        (a)e        ae    ae    a
 select * from test_regex('(a)e', 'ae', '-');
--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
+-- expectMatch    4.2  oPR    (.)\1e        abeaae    aae    {}
+select * from test_regex('(.)\1e', 'abeaae', 'oPR');
 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
 select * from test_regex('\(a\)b', 'ab', 'b');
 -- expectMatch    4.4  -        a((b)c)        abc    abc    bc    b
@@ -775,6 +775,10 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP');
 select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP');
 select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+
 -- doing 15 "octal escapes vs back references"

 -- # initial zero is always octal
@@ -1011,6 +1015,10 @@ select * from test_regex('(a*)*', 'bc', 'N');
 select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M');
 -- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
 select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
+-- expectMatch    21.37 RP    ((.))(\2)    xyy    yy    y    y    y
+select * from test_regex('((.))(\2)', 'xyy', 'RP');
+-- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
+select * from test_regex('((.))(\2)', 'xyy', 'oRP');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 8, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I've also rebased over the bug fixes from the other thread,
> and added a couple more test cases.
>
>             regards, tom lane

Hmm.  This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):

 select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
  regexp_split_to_array
 -----------------------
- {"",zkodphfbfbfb}
+ {uuuzkodphfbfbfb}
 (1 row)

The string starts with three "u" characters.  The first of them is doubly-matched, meaning \1 and \2 refer to the first
"u"character.  The (\1\2) that follows matches the next two "u" characters.  When the extra "useless" capture group is
skipped,apparently this doesn't work anymore. I haven't looked at your patch, so I'm not sure why, but I'm guessing
that\2 doesn't refer to anything. 

That analysis is consistent with the next change:

 select regexp_split_to_array('snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe', '((((?:.))))\3');
-                        regexp_split_to_array
----------------------------------------------------------------------
- {snfwbvx,snzqabixqbixqiumpgxde,xvnsemjxgqoqknrqe,mcqmfslfspskqpqxe}
+                         regexp_split_to_array
+------------------------------------------------------------------------
+ {snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe}
 (1 row)

The pattern matches any double character.  I would expect it to match the "ee", the "mm" and the "ss" in the text.
Withthe patched code, it matches nothing. 


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Hmm.  This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):

>  select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
>   regexp_split_to_array
>  -----------------------
> - {"",zkodphfbfbfb}
> + {uuuzkodphfbfbfb}
>  (1 row)

Ugh.  The regex engine is finding the match correctly, but it's failing to
tell the caller where it is :-(.  I was a little too cute in optimizing
the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
that the overall match position would be reported.

Thanks for the testing!

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index bf1dea6352..64816dd370 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -541,9 +541,11 @@ createTrgmNFA(text *text_re, Oid collation,
      * Stage 1: Compile the regexp into a NFA, using the regexp library.
      */
 #ifdef IGNORECASE
-    RE_compile(®ex, text_re, REG_ADVANCED | REG_ICASE, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB | REG_ICASE, collation);
 #else
-    RE_compile(®ex, text_re, REG_ADVANCED, collation);
+    RE_compile(®ex, text_re,
+               REG_ADVANCED | REG_NOSUB, collation);
 #endif

     /*
diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c
index 7673dab76f..45727ffa01 100644
--- a/src/backend/regex/regc_lex.c
+++ b/src/backend/regex/regc_lex.c
@@ -528,10 +528,7 @@ next(struct vars *v)
                 }
                 assert(NOTREACHED);
             }
-            if (v->cflags & REG_NOSUB)
-                RETV('(', 0);    /* all parens non-capturing */
-            else
-                RETV('(', 1);
+            RETV('(', 1);
             break;
         case CHR(')'):
             if (LASTTYPE('('))
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 60a220c57a..ae3a7b6a38 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -65,7 +65,7 @@ static struct subre *subre(struct vars *, int, int, struct state *, struct state
 static void freesubre(struct vars *, struct subre *);
 static void freesubreandsiblings(struct vars *, struct subre *);
 static void freesrnode(struct vars *, struct subre *);
-static void optst(struct vars *, struct subre *);
+static void removecaptures(struct vars *, struct subre *);
 static int    numst(struct subre *, int);
 static void markst(struct subre *);
 static void cleanst(struct vars *);
@@ -431,7 +431,8 @@ pg_regcomp(regex_t *re,
         dumpst(v->tree, debug, 1);
     }
 #endif
-    optst(v, v->tree);
+    if (v->cflags & REG_NOSUB)
+        removecaptures(v, v->tree);
     v->ntree = numst(v->tree, 1);
     markst(v->tree);
     cleanst(v);
@@ -1013,14 +1014,16 @@ parseqatom(struct vars *v,
             break;
         case BACKREF:            /* the Feature From The Black Lagoon */
             INSIST(type != LACON, REG_ESUBREG);
-            INSIST(v->nextvalue < v->nsubs, REG_ESUBREG);
-            INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG);
+            subno = v->nextvalue;
+            assert(subno > 0);
+            INSIST(subno < v->nsubs, REG_ESUBREG);
+            NOERRN();
+            INSIST(v->subs[subno] != NULL, REG_ESUBREG);
             NOERRN();
-            assert(v->nextvalue > 0);
             atom = subre(v, 'b', BACKR, lp, rp);
             NOERRN();
-            subno = v->nextvalue;
             atom->backno = subno;
+            v->subs[subno]->flags |= BRUSE;
             EMPTYARC(lp, rp);    /* temporarily, so there's something */
             NEXT();
             break;
@@ -2141,19 +2144,54 @@ freesrnode(struct vars *v,        /* might be NULL */
 }

 /*
- * optst - optimize a subRE subtree
+ * removecaptures - remove unnecessary capture subREs
+ *
+ * If the caller said that it doesn't care about subexpression match data,
+ * we may delete the "capture" markers on subREs that are not referenced
+ * by any backrefs, and then simplify anything that's become non-messy.
+ * Call this only if REG_NOSUB flag is set.
  */
 static void
-optst(struct vars *v,
-      struct subre *t)
+removecaptures(struct vars *v,
+               struct subre *t)
 {
+    struct subre *t2;
+
+    assert(t != NULL);
+
+    /*
+     * If this isn't itself a backref target, clear capno and tentatively
+     * clear CAP flag.
+     */
+    if (!(t->flags & BRUSE))
+    {
+        t->capno = 0;
+        t->flags &= ~CAP;
+    }
+
+    /* Now recurse to children */
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+    {
+        removecaptures(v, t2);
+        /* Propagate child CAP flag back up, if it's still set */
+        if (t2->flags & CAP)
+            t->flags |= CAP;
+    }
+
     /*
-     * DGP (2007-11-13): I assume it was the programmer's intent to eventually
-     * come back and add code to optimize subRE trees, but the routine coded
-     * just spends effort traversing the tree and doing nothing. We can do
-     * nothing with less effort.
+     * If t now contains neither captures nor backrefs, there's no longer any
+     * need to care where its sub-match boundaries are, so we can reduce it to
+     * a simple DFA node.  (Note in particular that MIXED child greediness is
+     * not a hindrance here, so we don't use the MESSY() macro.)
      */
-    return;
+    if ((t->flags & (CAP | BACKR)) == 0)
+    {
+        if (t->child)
+            freesubreandsiblings(v, t->child);
+        t->child = NULL;
+        t->op = '=';
+        t->flags &= ~MIXED;
+    }
 }

 /*
@@ -2501,6 +2539,8 @@ stdump(struct subre *t,
         fprintf(f, " hascapture");
     if (t->flags & BACKR)
         fprintf(f, " hasbackref");
+    if (t->flags & BRUSE)
+        fprintf(f, " isreferenced");
     if (!(t->flags & INUSE))
         fprintf(f, " UNUSED");
     if (t->latype != (char) -1)
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index 5b9a087820..c08a224618 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -213,12 +213,10 @@ pg_regexec(regex_t *re,
         return REG_NOMATCH;
     backref = (v->g->info & REG_UBACKREF) ? 1 : 0;
     v->eflags = flags;
-    if (v->g->cflags & REG_NOSUB)
-        nmatch = 0;                /* override client */
     v->nmatch = nmatch;
-    if (backref)
+    if (backref && nmatch <= v->g->nsub)
     {
-        /* need work area */
+        /* need larger work area */
         if (v->g->nsub + 1 <= LOCALMAT)
             v->pmatch = mat;
         else
@@ -229,7 +227,13 @@ pg_regexec(regex_t *re,
         v->nmatch = v->g->nsub + 1;
     }
     else
+    {
+        /* we can store results directly in caller's array */
         v->pmatch = pmatch;
+        /* ensure any extra entries in caller's array are filled with -1 */
+        if (nmatch > 0)
+            zapallsubs(pmatch, nmatch);
+    }
     v->details = details;
     v->start = (chr *) string;
     v->search_start = (chr *) string + search_start;
@@ -290,12 +294,20 @@ pg_regexec(regex_t *re,
     else
         st = find(v, &v->g->tree->cnfa, &v->g->cmap);

-    /* copy (portion of) match vector over if necessary */
-    if (st == REG_OKAY && v->pmatch != pmatch && nmatch > 0)
+    /* on success, ensure caller's match vector is filled correctly */
+    if (st == REG_OKAY && nmatch > 0)
     {
-        zapallsubs(pmatch, nmatch);
-        n = (nmatch < v->nmatch) ? nmatch : v->nmatch;
-        memcpy(VS(pmatch), VS(v->pmatch), n * sizeof(regmatch_t));
+        if (v->pmatch != pmatch)
+        {
+            /* copy portion of match vector over from (larger) work area */
+            assert(nmatch <= v->nmatch);
+            memcpy(VS(pmatch), VS(v->pmatch), nmatch * sizeof(regmatch_t));
+        }
+        if (v->g->cflags & REG_NOSUB)
+        {
+            /* don't expose possibly-partial sub-match results to caller */
+            zapallsubs(pmatch, nmatch);
+        }
     }

     /* clean up */
@@ -752,7 +764,6 @@ cdissect(struct vars *v,
             break;
         case '(':                /* no-op capture node */
             assert(t->child != NULL);
-            assert(t->capno > 0);
             er = cdissect(v, t->child, begin, end);
             break;
         default:
diff --git a/src/backend/utils/adt/jsonpath_gram.y b/src/backend/utils/adt/jsonpath_gram.y
index de3d97931e..bd5d4488a0 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -583,6 +583,14 @@ jspConvertRegexFlags(uint32 xflags)
                      errmsg("XQuery \"x\" flag (expanded regular expressions) is not implemented")));
     }

+    /*
+     * We'll never need sub-match details at execution.  While
+     * RE_compile_and_execute would set this flag anyway, force it on here to
+     * ensure that the regex cache entries created by makeItemLikeRegex are
+     * useful.
+     */
+    cflags |= REG_NOSUB;
+
     return cflags;
 }

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 484d4265fd..268cee1cbe 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -347,6 +347,10 @@ RE_compile_and_execute(text *text_re, char *dat, int dat_len,
 {
     regex_t    *re;

+    /* Use REG_NOSUB if caller does not want sub-match details */
+    if (nmatch < 2)
+        cflags |= REG_NOSUB;
+
     /* Compile RE */
     re = RE_compile_and_cache(text_re, cflags, collation);

@@ -1412,6 +1416,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     int            orig_len;
     pg_wchar   *wide_str;
     int            wide_len;
+    int            cflags;
     regex_t    *cpattern;
     regmatch_t *pmatch;
     int            pmatch_len;
@@ -1430,7 +1435,10 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);

     /* set up the compiled pattern */
-    cpattern = RE_compile_and_cache(pattern, re_flags->cflags, collation);
+    cflags = re_flags->cflags;
+    if (!use_subpatterns)
+        cflags |= REG_NOSUB;
+    cpattern = RE_compile_and_cache(pattern, cflags, collation);

     /* do we want to remember subpatterns? */
     if (use_subpatterns && cpattern->re_nsub > 0)
@@ -1952,7 +1960,7 @@ regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation,
     if (case_insensitive)
         cflags |= REG_ICASE;

-    re = RE_compile_and_cache(text_re, cflags, collation);
+    re = RE_compile_and_cache(text_re, cflags | REG_NOSUB, collation);

     /* Examine it to see if there's a fixed prefix */
     re_result = pg_regprefix(re, &str, &slen);
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 2b48a19fb7..0455ae8069 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -106,7 +106,7 @@ typedef struct
 #define REG_QUOTE    000004        /* no special characters, none */
 #define REG_NOSPEC    REG_QUOTE    /* historical synonym */
 #define REG_ICASE    000010        /* ignore case */
-#define REG_NOSUB    000020        /* don't care about subexpressions */
+#define REG_NOSUB    000020        /* caller doesn't need subexpr match data */
 #define REG_EXPANDED    000040    /* expanded format, white space & comments */
 #define REG_NLSTOP    000100        /* \n doesn't match . or [^ ] */
 #define REG_NLANCH    000200        /* ^ matches after \n, $ before */
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 8db631c83b..91a52840c4 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -477,13 +477,14 @@ struct subre
 #define  MIXED     04                /* mixed preference below */
 #define  CAP     010            /* capturing parens here or below */
 #define  BACKR     020            /* back reference here or below */
+#define  BRUSE     040            /* is referenced by a back reference */
 #define  INUSE     0100            /* in use in final tree */
-#define  NOPROP  03                /* bits which may not propagate up */
+#define  UPPROP  (MIXED|CAP|BACKR)    /* flags which should propagate up */
 #define  LMIX(f) ((f)<<2)        /* LONGER -> MIXED */
 #define  SMIX(f) ((f)<<1)        /* SHORTER -> MIXED */
-#define  UP(f)     (((f)&~NOPROP) | (LMIX(f) & SMIX(f) & MIXED))
+#define  UP(f)     (((f)&UPPROP) | (LMIX(f) & SMIX(f) & MIXED))
 #define  MESSY(f)     ((f)&(MIXED|CAP|BACKR))
-#define  PREF(f) ((f)&NOPROP)
+#define  PREF(f)     ((f)&(LONGER|SHORTER))
 #define  PREF2(f1, f2)     ((PREF(f1) != 0) ? PREF(f1) : PREF(f2))
 #define  COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2))
     char        latype;            /* LATYPE code, if lookaround constraint */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 44da7d2019..5a6cdf47c2 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -146,12 +146,12 @@ select * from test_regex('(a)e', 'ae', '-');
  {ae,a}
 (2 rows)

--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
- test_regex
-------------
- {0}
- {NULL}
+-- expectMatch    4.2  oPR    (.)\1e        abeaae    aae    {}
+select * from test_regex('(.)\1e', 'abeaae', 'oPR');
+           test_regex
+--------------------------------
+ {1,REG_UBACKREF,REG_UNONPOSIX}
+ {aae,NULL}
 (2 rows)

 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
@@ -2658,6 +2658,20 @@ select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');
  {"abc abc",abc}
 (2 rows)

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+ substring
+-----------
+ f
+(1 row)
+
+select regexp_split_to_array('abcxxxdefyyyghi', '((.))(\1\2)');
+ regexp_split_to_array
+-----------------------
+ {abc,def,ghi}
+(1 row)
+
 -- doing 15 "octal escapes vs back references"
 -- # initial zero is always octal
 -- expectMatch    15.1  MP    "a\\010b"    "a\bb"    "a\bb"
@@ -3476,6 +3490,22 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
  {x,x,x,NULL}
 (2 rows)

+-- expectMatch    21.37 RP    ((.))(\2)    xyy    yy    y    y    y
+select * from test_regex('((.))(\2)', 'xyy', 'RP');
+           test_regex
+--------------------------------
+ {3,REG_UBACKREF,REG_UNONPOSIX}
+ {yy,y,y,y}
+(2 rows)
+
+-- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
+select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+           test_regex
+--------------------------------
+ {3,REG_UBACKREF,REG_UNONPOSIX}
+ {yy,NULL,NULL,NULL}
+(2 rows)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 9224fdfdd3..3419564203 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -63,8 +63,8 @@ select * from test_regex('ab', 'ab', 'b');

 -- expectMatch    4.1  -        (a)e        ae    ae    a
 select * from test_regex('(a)e', 'ae', '-');
--- expectMatch    4.2  o        (a)e        ae
-select * from test_regex('(a)e', 'ae', 'o');
+-- expectMatch    4.2  oPR    (.)\1e        abeaae    aae    {}
+select * from test_regex('(.)\1e', 'abeaae', 'oPR');
 -- expectMatch    4.3  b        {\(a\)b}    ab    ab    a
 select * from test_regex('\(a\)b', 'ab', 'b');
 -- expectMatch    4.4  -        a((b)c)        abc    abc    bc    b
@@ -775,6 +775,11 @@ select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP');
 select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP');
 select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');

+-- exercise oversize-regmatch_t-array paths in regexec()
+-- (that case is not reachable via test_regex, sadly)
+select substring('fffoooooooooooooooooooooooooooooooo',
'^(.)\1(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)');
+select regexp_split_to_array('abcxxxdefyyyghi', '((.))(\1\2)');
+
 -- doing 15 "octal escapes vs back references"

 -- # initial zero is always octal
@@ -1011,6 +1016,10 @@ select * from test_regex('(a*)*', 'bc', 'N');
 select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M');
 -- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
 select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
+-- expectMatch    21.37 RP    ((.))(\2)    xyy    yy    y    y    y
+select * from test_regex('((.))(\2)', 'xyy', 'RP');
+-- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
+select * from test_regex('((.))(\2)', 'xyy', 'oRP');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 8, 2021, at 1:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ugh.  The regex engine is finding the match correctly, but it's failing to
> tell the caller where it is :-(.  I was a little too cute in optimizing
> the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
> that the overall match position would be reported.
>
> Thanks for the testing!

Sure!  Thanks for improving the regular expression engine!

I have applied your latest patch and do not see any problems with it.  All my tests pass with no asserts and with no
differencesin results vs. master.  This is a test suite of nearly 1.5 million separate regular expressions. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I have applied your latest patch and do not see any problems with it.  All my tests pass with no asserts and with no
differencesin results vs. master.  This is a test suite of nearly 1.5 million separate regular expressions. 

Cool, thanks.  I also tried your millions-of-random-regexps script
and didn't find any difference between the results from HEAD and
those from the v3 patch.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 8, 2021, at 3:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Cool, thanks.  I also tried your millions-of-random-regexps script
> and didn't find any difference between the results from HEAD and
> those from the v3 patch.

The patch looks ready to commit.  I don't expect to test it any further unless you have something in particular you'd
likeme to focus on. 

Thanks again for working on this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> The patch looks ready to commit.  I don't expect to test it any further unless you have something in particular you'd
likeme to focus on. 

Pushed, but while re-reading it before commit I noticed that there's
some more fairly low-hanging fruit in regexp_replace().  As I had it
in that patch, it never used REG_NOSUB because of the possibility
that the replacement string uses "\N".  However, we're already
pre-checking the replacement string to see if it has backslashes
at all, so while we're at it we can check for \N to discover if we
actually need any subexpression match data or not.  We do need to
refactor a little to postpone calling pg_regcomp until after we
know that, but I think that makes replace_text_regexp's API less
ugly not more so.

While I was at it, I changed the search-for-backslash loops to
use memchr rather than handwritten looping.  Their use of
pg_mblen was pretty unnecessary given we only need to find
backslashes, and we can assume the backend encoding is ASCII-safe.

Using a bunch of random cases generated by your little perl
script, I see maybe 10-15% speedup on test cases that don't
use \N in the replacement string, while it's about a wash
on cases that do.  (If I'd been using a multibyte encoding,
maybe the memchr change would have made a difference, but
I didn't try that.)

            regards, tom lane

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 268cee1cbe..3b7a607437 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -630,11 +630,10 @@ textregexreplace_noopt(PG_FUNCTION_ARGS)
     text       *s = PG_GETARG_TEXT_PP(0);
     text       *p = PG_GETARG_TEXT_PP(1);
     text       *r = PG_GETARG_TEXT_PP(2);
-    regex_t    *re;
-
-    re = RE_compile_and_cache(p, REG_ADVANCED, PG_GET_COLLATION());

-    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0, 1));
+    PG_RETURN_TEXT_P(replace_text_regexp(s, p, r,
+                                         REG_ADVANCED, PG_GET_COLLATION(),
+                                         0, 1));
 }

 /*
@@ -648,7 +647,6 @@ textregexreplace(PG_FUNCTION_ARGS)
     text       *p = PG_GETARG_TEXT_PP(1);
     text       *r = PG_GETARG_TEXT_PP(2);
     text       *opt = PG_GETARG_TEXT_PP(3);
-    regex_t    *re;
     pg_re_flags flags;

     /*
@@ -672,10 +670,9 @@ textregexreplace(PG_FUNCTION_ARGS)

     parse_re_flags(&flags, opt);

-    re = RE_compile_and_cache(p, flags.cflags, PG_GET_COLLATION());
-
-    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0,
-                                         flags.glob ? 0 : 1));
+    PG_RETURN_TEXT_P(replace_text_regexp(s, p, r,
+                                         flags.cflags, PG_GET_COLLATION(),
+                                         0, flags.glob ? 0 : 1));
 }

 /*
@@ -694,7 +691,6 @@ textregexreplace_extended(PG_FUNCTION_ARGS)
     int            n = 1;
     text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(5);
     pg_re_flags re_flags;
-    regex_t    *re;

     /* Collect optional parameters */
     if (PG_NARGS() > 3)
@@ -723,11 +719,10 @@ textregexreplace_extended(PG_FUNCTION_ARGS)
     if (PG_NARGS() <= 4)
         n = re_flags.glob ? 0 : 1;

-    /* Compile the regular expression */
-    re = RE_compile_and_cache(p, re_flags.cflags, PG_GET_COLLATION());
-
     /* Do the replacement(s) */
-    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, start - 1, n));
+    PG_RETURN_TEXT_P(replace_text_regexp(s, p, r,
+                                         re_flags.cflags, PG_GET_COLLATION(),
+                                         start - 1, n));
 }

 /* This is separate to keep the opr_sanity regression test from complaining */
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 348b5566de..acb8741734 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -4359,34 +4359,36 @@ replace_text(PG_FUNCTION_ARGS)
 }

 /*
- * check_replace_text_has_escape_char
+ * check_replace_text_has_escape
  *
- * check whether replace_text contains escape char.
+ * Returns 0 if text contains no backslashes that need processing.
+ * Returns 1 if text contains backslashes, but not regexp submatch specifiers.
+ * Returns 2 if text contains regexp submatch specifiers (\1 .. \9).
  */
-static bool
-check_replace_text_has_escape_char(const text *replace_text)
+static int
+check_replace_text_has_escape(const text *replace_text)
 {
+    int            result = 0;
     const char *p = VARDATA_ANY(replace_text);
     const char *p_end = p + VARSIZE_ANY_EXHDR(replace_text);

-    if (pg_database_encoding_max_length() == 1)
-    {
-        for (; p < p_end; p++)
-        {
-            if (*p == '\\')
-                return true;
-        }
-    }
-    else
+    while (p < p_end)
     {
-        for (; p < p_end; p += pg_mblen(p))
+        /* Find next escape char, if any. */
+        p = memchr(p, '\\', p_end - p);
+        if (p == NULL)
+            break;
+        p++;
+        /* Note: a backslash at the end doesn't require extra processing. */
+        if (p < p_end)
         {
-            if (*p == '\\')
-                return true;
+            if (*p >= '1' && *p <= '9')
+                return 2;        /* Found a submatch specifier, so done */
+            result = 1;            /* Found some other sequence, keep looking */
+            p++;
         }
     }
-
-    return false;
+    return result;
 }

 /*
@@ -4403,25 +4405,17 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text,
 {
     const char *p = VARDATA_ANY(replace_text);
     const char *p_end = p + VARSIZE_ANY_EXHDR(replace_text);
-    int            eml = pg_database_encoding_max_length();

-    for (;;)
+    while (p < p_end)
     {
         const char *chunk_start = p;
         int            so;
         int            eo;

-        /* Find next escape char. */
-        if (eml == 1)
-        {
-            for (; p < p_end && *p != '\\'; p++)
-                 /* nothing */ ;
-        }
-        else
-        {
-            for (; p < p_end && *p != '\\'; p += pg_mblen(p))
-                 /* nothing */ ;
-        }
+        /* Find next escape char, if any. */
+        p = memchr(p, '\\', p_end - p);
+        if (p == NULL)
+            p = p_end;

         /* Copy the text we just scanned over, if any. */
         if (p > chunk_start)
@@ -4473,7 +4467,7 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text,
             continue;
         }

-        if (so != -1 && eo != -1)
+        if (so >= 0 && eo >= 0)
         {
             /*
              * Copy the text that is back reference of regexp.  Note so and eo
@@ -4491,36 +4485,37 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text,
     }
 }

-#define REGEXP_REPLACE_BACKREF_CNT        10
-
 /*
  * replace_text_regexp
  *
- * replace substring(s) in src_text that match regexp with replace_text.
+ * replace substring(s) in src_text that match pattern with replace_text.
+ * The replace_text can contain backslash markers to substitute
+ * (parts of) the matched text.
  *
+ * cflags: regexp compile flags.
+ * collation: collation to use.
  * search_start: the character (not byte) offset in src_text at which to
  * begin searching.
  * n: if 0, replace all matches; if > 0, replace only the N'th match.
- *
- * Note: to avoid having to include regex.h in builtins.h, we declare
- * the regexp argument as void *, but really it's regex_t *.
  */
 text *
-replace_text_regexp(text *src_text, void *regexp,
+replace_text_regexp(text *src_text, text *pattern_text,
                     text *replace_text,
+                    int cflags, Oid collation,
                     int search_start, int n)
 {
     text       *ret_text;
-    regex_t    *re = (regex_t *) regexp;
+    regex_t    *re;
     int            src_text_len = VARSIZE_ANY_EXHDR(src_text);
     int            nmatches = 0;
     StringInfoData buf;
-    regmatch_t    pmatch[REGEXP_REPLACE_BACKREF_CNT];
+    regmatch_t    pmatch[10];        /* main match, plus \1 to \9 */
+    int            nmatch = lengthof(pmatch);
     pg_wchar   *data;
     size_t        data_len;
     int            data_pos;
     char       *start_ptr;
-    bool        have_escape;
+    int            escape_status;

     initStringInfo(&buf);

@@ -4528,8 +4523,19 @@ replace_text_regexp(text *src_text, void *regexp,
     data = (pg_wchar *) palloc((src_text_len + 1) * sizeof(pg_wchar));
     data_len = pg_mb2wchar_with_len(VARDATA_ANY(src_text), data, src_text_len);

-    /* Check whether replace_text has escape char. */
-    have_escape = check_replace_text_has_escape_char(replace_text);
+    /* Check whether replace_text has escapes, especially regexp submatches. */
+    escape_status = check_replace_text_has_escape(replace_text);
+
+    /* If no regexp submatches, we can use REG_NOSUB. */
+    if (escape_status < 2)
+    {
+        cflags |= REG_NOSUB;
+        /* Also tell pg_regexec we only want the whole-match location. */
+        nmatch = 1;
+    }
+
+    /* Prepare the regexp. */
+    re = RE_compile_and_cache(pattern_text, cflags, collation);

     /* start_ptr points to the data_pos'th character of src_text */
     start_ptr = (char *) VARDATA_ANY(src_text);
@@ -4546,7 +4552,7 @@ replace_text_regexp(text *src_text, void *regexp,
                                     data_len,
                                     search_start,
                                     NULL,    /* no details */
-                                    REGEXP_REPLACE_BACKREF_CNT,
+                                    nmatch,
                                     pmatch,
                                     0);

@@ -4602,10 +4608,9 @@ replace_text_regexp(text *src_text, void *regexp,
         }

         /*
-         * Copy the replace_text. Process back references when the
-         * replace_text has escape characters.
+         * Copy the replace_text, processing escapes if any are present.
          */
-        if (have_escape)
+        if (escape_status > 0)
             appendStringInfoRegexpSubstr(&buf, replace_text, pmatch,
                                          start_ptr, data_pos);
         else
diff --git a/src/include/utils/varlena.h b/src/include/utils/varlena.h
index 6645e2af13..cd12252ed9 100644
--- a/src/include/utils/varlena.h
+++ b/src/include/utils/varlena.h
@@ -33,8 +33,9 @@ extern bool SplitDirectoriesString(char *rawstring, char separator,
                                    List **namelist);
 extern bool SplitGUCList(char *rawstring, char separator,
                          List **namelist);
-extern text *replace_text_regexp(text *src_text, void *regexp,
+extern text *replace_text_regexp(text *src_text, text *pattern_text,
                                  text *replace_text,
+                                 int cflags, Oid collation,
                                  int search_start, int n);

 #endif

Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:
Tom,

I can still trigger the old bug for which we thought we'd pushed a fix.  The test case below crashes on master
(e12694523e7e4482a052236f12d3d8b58be9a22c),and also on the fixed version "Make regexp engine's backref-related
compilationstate more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a). 

Can you test if it crashes for you, too?  I'm not sure I see why this one fails when millions of others pass.

The backtrace is still complaining about regc_nfa.c:1265:

+select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
+server closed the connection unexpectedly
+   This probably means the server terminated abnormally
+   before or while processing the request.
+connection to server was lost

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I can still trigger the old bug for which we thought we'd pushed a fix.  The test case below crashes on master
(e12694523e7e4482a052236f12d3d8b58be9a22c),and also on the fixed version "Make regexp engine's backref-related
compilationstate more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a). 

> Can you test if it crashes for you, too?  I'm not sure I see why this one fails when millions of others pass.

> The backtrace is still complaining about regc_nfa.c:1265:

> +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
> +server closed the connection unexpectedly

Hmmm ... yeah, I see it too.  This points up something I'd wondered
about before, which is whether the code that "cancels everything"
after detecting {0} is really OK.  It throws away the outer subre
*and children* without worrying about what might be inside, and
here we see that that's not good enough --- there's still a v->subs
pointer to the first capturing paren set, which we just deleted,
so that the \1 later on messes up.  I'm not sure why the back
branches are managing not to crash, but that might just be a memory
management artifact.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
I wrote:
> Hmmm ... yeah, I see it too.  This points up something I'd wondered
> about before, which is whether the code that "cancels everything"
> after detecting {0} is really OK.  It throws away the outer subre
> *and children* without worrying about what might be inside, and
> here we see that that's not good enough --- there's still a v->subs
> pointer to the first capturing paren set, which we just deleted,
> so that the \1 later on messes up.  I'm not sure why the back
> branches are managing not to crash, but that might just be a memory
> management artifact.

... yeah, it is.  For me, this variant hits the assertion in all
branches:

regression=# select regexp_split_to_array('', '((.)){0}(\2){0}');
server closed the connection unexpectedly

So that's a pre-existing (and very long-standing) bug.  I'm not
sure if it has any serious impact in non-assert builds though.
Failure to clean out some disconnected arcs probably has no
real effect on the regex's behavior later.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
> +server closed the connection unexpectedly

Here's a quick draft patch for this.  Basically it moves the
responsibility for clearing v->subs[] pointers into the freesubre()
recursion, so that it will happen for contained capturing parens
not only the top level.

There is a potentially interesting definitional question:
what exactly ought this regexp do?

        ((.)){0}\2

Because the capturing paren sets are zero-quantified, they will
never be matched to any characters, so the backref can never
have any defined referent.  I suspect that study of the POSIX
spec would lead to the conclusion that this is a legal regexp
but it will never match anything.  Implementing that would be
tedious though, and what's more it seems very unlikely that
the user wanted any such behavior.  So I think throwing an
error is an appropriate response.  The existing code will
throw such an error for

        ((.)){0}\1

so I guess Spencer did think about this to some extent -- he
just forgot about the possibility of nested parens.

This patch should work OK in HEAD and v14, but it will need
a bit of fooling-about for older branches I think, given that
they fill v->subs[] a little differently.

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ae3a7b6a38..ae1ff670f9 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -550,8 +550,6 @@ freev(struct vars *v,
 {
     if (v->re != NULL)
         rfree(v->re);
-    if (v->subs != v->sub10)
-        FREE(v->subs);
     if (v->nfa != NULL)
         freenfa(v->nfa);
     if (v->tree != NULL)
@@ -562,6 +560,8 @@ freev(struct vars *v,
         freecvec(v->cv);
     if (v->cv2 != NULL)
         freecvec(v->cv2);
+    if (v->subs != v->sub10)
+        FREE(v->subs);
     if (v->lacons != NULL)
         freelacons(v->lacons, v->nlacons);
     ERR(err);                    /* nop if err==0 */
@@ -1091,8 +1091,8 @@ parseqatom(struct vars *v,
     {
         if (atom != NULL)
             freesubre(v, atom);
-        if (atomtype == '(')
-            v->subs[subno] = NULL;
+        /* freesubre() will clean up any v->subs[] pointers into the atom */
+        assert(atomtype != '(' || v->subs[subno] == NULL);
         delsub(v->nfa, lp, rp);
         EMPTYARC(lp, rp);
         return top;
@@ -2127,9 +2127,24 @@ freesrnode(struct vars *v,        /* might be NULL */
     if (sr == NULL)
         return;

+    /*
+     * If the node is referenced in v->subs[], clear that to avoid leaving a
+     * dangling pointer.  This should only ever matter when parseqatom() is
+     * throwing away a {0,0}-quantified atom that contains capturing parens.
+     * Doing this will cause later backrefs to such parens to fail, which is
+     * maybe not strictly POSIX but seems more helpful than allowing a backref
+     * that can never have a defined referent.
+     */
+    if (sr->capno > 0 && v != NULL)
+    {
+        assert(v->subs[sr->capno] == sr);
+        v->subs[sr->capno] = NULL;
+    }
+
     if (!NULLCNFA(sr->cnfa))
         freecnfa(&sr->cnfa);
     sr->flags = 0;                /* in particular, not INUSE */
+    sr->capno = 0;
     sr->child = sr->sibling = NULL;
     sr->begin = sr->end = NULL;

diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 5a6cdf47c2..850e0c8ef5 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3506,6 +3506,10 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP');
  {yy,NULL,NULL,NULL}
 (2 rows)

+-- # throwing an error for this is arguably not per POSIX
+-- expectError    21.39 -        {((.)){0}(\2){0}}    ESUBREG
+select * from test_regex('((.)){0}(\2){0}', '', '-');
+ERROR:  invalid regular expression: invalid backreference number
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 3419564203..378f22b67b 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1020,6 +1020,9 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
 select * from test_regex('((.))(\2)', 'xyy', 'RP');
 -- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
 select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+-- # throwing an error for this is arguably not per POSIX
+-- expectError    21.39 -        {((.)){0}(\2){0}}    ESUBREG
+select * from test_regex('((.)){0}(\2){0}', '', '-');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Pushed, but while re-reading it before commit I noticed that there's
> some more fairly low-hanging fruit in regexp_replace().  As I had it
> in that patch, it never used REG_NOSUB because of the possibility
> that the replacement string uses "\N".  However, we're already
> pre-checking the replacement string to see if it has backslashes
> at all, so while we're at it we can check for \N to discover if we
> actually need any subexpression match data or not.  We do need to
> refactor a little to postpone calling pg_regcomp until after we
> know that, but I think that makes replace_text_regexp's API less
> ugly not more so.
>
> While I was at it, I changed the search-for-backslash loops to
> use memchr rather than handwritten looping.  Their use of
> pg_mblen was pretty unnecessary given we only need to find
> backslashes, and we can assume the backend encoding is ASCII-safe.
>
> Using a bunch of random cases generated by your little perl
> script, I see maybe 10-15% speedup on test cases that don't
> use \N in the replacement string, while it's about a wash
> on cases that do.  (If I'd been using a multibyte encoding,
> maybe the memchr change would have made a difference, but
> I didn't try that.)

I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't
seemto break it.  There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> There is a potentially interesting definitional question:
> what exactly ought this regexp do?
>
>         ((.)){0}\2
>
> Because the capturing paren sets are zero-quantified, they will
> never be matched to any characters, so the backref can never
> have any defined referent.

Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we
shouldbe clear what that is: 

    #!/usr/bin/perl

    use strict;
    use warnings;

    our $match;
    if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
    {
        print "captured 1 $1\n" if defined $1;
        print "captured 2 $2\n" if defined $2;
        print "captured 3 $3\n" if defined $3;
        print "captured 4 $4\n" if defined $4;
        print "match = $match\n" if defined $match;
    }

This will print "captured 3 fo", proving that although the regular expression is parsed with the (..) bound to the
thirdcapture group, the first two capture groups never run.  If you don't believe that, change the {0} to {1} and
observethat the script dies. 

> So I think throwing an
> error is an appropriate response.  The existing code will
> throw such an error for
>
>         ((.)){0}\1
>
> so I guess Spencer did think about this to some extent -- he
> just forgot about the possibility of nested parens.


Ugg.  That means our code throws an error where perl does not, pretty well negating my point above.  If we're already
throwingan error for this type of thing, I agree we should be consistent about it.  My personal preference would have
beento do the same thing as perl, but it seems that ship has already sailed. 


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 5:14 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>    our $match;
>    if ('foo' =~ m/((.)(??{ die; })){0}(..)/)

I left in a stray variable.  A prior version of this script was assigning to $match where it now has die.  Sorry for
anyconfusion. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Pushed, but while re-reading it before commit I noticed that there's
>> some more fairly low-hanging fruit in regexp_replace().

> I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't
seemto break it.  There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage. 

Pushed that bit, thanks for testing!

I plan to not do anything about the (()){0} bug until after the release
window, since that will need to be back-patched.  That bug's gotta be
twenty years old, so while I kinda wish we'd found it a few days
earlier, waiting another 3 months won't make much difference.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There is a potentially interesting definitional question:
>> what exactly ought this regexp do?
>>     ((.)){0}\2
>> Because the capturing paren sets are zero-quantified, they will
>> never be matched to any characters, so the backref can never
>> have any defined referent.

> Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we
shouldbe clear what that is: 

>     if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
>     {
>         print "captured 1 $1\n" if defined $1;
>         print "captured 2 $2\n" if defined $2;
>         print "captured 3 $3\n" if defined $3;
>         print "captured 4 $4\n" if defined $4;
>         print "match = $match\n" if defined $match;
>     }

Hm.  I'm not sure that this example proves anything about Perl's handling
of the situation, since you didn't use a backref.  I tried both

    if ('foo' =~ m/((.)){0}\1/)

    if ('foo' =~ m/((.)){0}\2/)

and while neither throws an error, they don't succeed either.
So AFAICS Perl is acting in the way I'm attributing to POSIX.
But maybe we should actually read POSIX ...

>> ... I guess Spencer did think about this to some extent -- he
>> just forgot about the possibility of nested parens.

> Ugg.  That means our code throws an error where perl does not, pretty
> well negating my point above.  If we're already throwing an error for
> this type of thing, I agree we should be consistent about it.  My
> personal preference would have been to do the same thing as perl, but it
> seems that ship has already sailed.

Removing an error case is usually an easier sell than adding one.
However, the fact that the simplest case (viz, '(.){0}\1') has always
thrown an error and nobody has complained in twenty-ish years suggests
that nobody much cares.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 6:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hm.  I'm not sure that this example proves anything about Perl's handling
> of the situation, since you didn't use a backref.

Well, this doesn't die either:

if ('foo' =~ m/((??{ die; })(.)(??{ die $1; })){0}((??{ die "got here"; })|\2)/)
{
    print "matched\n";
}

The point is that the regex engine never walks the part of the pattern that {0} qualifies.  I thought it was more clear
inthe prior example, because that example proves that the engine does get as far as capturing.  This example also does
that,and with a backref, because it dies with "got here". 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 6:17 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Well, this doesn't die either:

Meaning it doesn't die in the part of the pattern qualified by {0} either.  It does die in the other part.  Sorry again
forthe confusion. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> This patch should work OK in HEAD and v14, but it will need
> a bit of fooling-about for older branches I think, given that
> they fill v->subs[] a little differently.

Note that I tested your patch *before* master, so the changes look backwards.

I tested this fix and it seems good to me.  Some patterns that used to work (by chance?) now raise an error, such as:

 select 'bpgouiwcquu' ~ '(((([e])*?)){0,0}?(\3))';
-ERROR:  invalid regular expression: invalid backreference number
+ ?column?
+----------
+ f
+(1 row)

I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the user
facingdifferences.  If we had a good reason for raising an error for these back-references, maybe that'd be fine, but
itseems to just be an implementation detail. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the
userfacing differences.  If we had a good reason for raising an error for these back-references, maybe that'd be fine,
butit seems to just be an implementation detail. 

I thought about this some more, and I'm coming around to the idea that
throwing an error is the wrong thing.  As a contrary example, consider

    (.)|(\1\1)

We don't throw an error for this, and neither does Perl, even though
the capturing parens can never be defined in the branch where the
backrefs are.  So it seems hard to argue that this is okay but the
other thing isn't.  Another interesting example is

    (.){0}(\1){0}

I think that the correct interpretation is that this is a valid
regexp matching an empty string (i.e., zero repetitions of each
part), even though neither capture group will be defined.
That's different from

    (.){0}(\1)

which can never match.

So I took another look at the code, and it doesn't seem that hard
to make it act this way.  The attached passes regression, but
I've not beat on it with random strings.

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ae3a7b6a38..d9840171a3 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1089,11 +1089,23 @@ parseqatom(struct vars *v,
     /* annoying special case:  {0} or {0,0} cancels everything */
     if (m == 0 && n == 0)
     {
-        if (atom != NULL)
-            freesubre(v, atom);
-        if (atomtype == '(')
-            v->subs[subno] = NULL;
-        delsub(v->nfa, lp, rp);
+        /*
+         * If we had capturing subexpression(s) within the atom, we don't want
+         * to destroy them, because it's legal (if useless) to back-ref them
+         * later.  Hence, just unlink the atom from lp/rp and then ignore it.
+         */
+        if (atom != NULL && (atom->flags & CAP))
+        {
+            delsub(v->nfa, lp, atom->begin);
+            delsub(v->nfa, atom->end, rp);
+        }
+        else
+        {
+            /* Otherwise, we can clean up any subre infrastructure we made */
+            if (atom != NULL)
+                freesubre(v, atom);
+            delsub(v->nfa, lp, rp);
+        }
         EMPTYARC(lp, rp);
         return top;
     }
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 5a6cdf47c2..96c0e6fa59 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3506,6 +3506,21 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP');
  {yy,NULL,NULL,NULL}
 (2 rows)

+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+                         test_regex
+------------------------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX,REG_UEMPTYMATCH}
+ {"",NULL,NULL,NULL}
+(2 rows)
+
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');
+                 test_regex
+--------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX}
+(1 row)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 3419564203..29806e9417 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1020,6 +1020,10 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
 select * from test_regex('((.))(\2)', 'xyy', 'RP');
 -- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
 select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
I wrote:
> So AFAICS Perl is acting in the way I'm attributing to POSIX.
> But maybe we should actually read POSIX ...

I went to look at the POSIX spec, and was reminded that it lacks
backrefs altogether.  (POSIX specifies the "BRE" and "ERE" regex
flavors as described in our docs, but not "ARE".)  So there's no
help to be had there.  The fact that Perl doesn't throw an error
is probably the most useful precedent available.

            regards, tom lane



Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So I took another look at the code, and it doesn't seem that hard
> to make it act this way.  The attached passes regression, but
> I've not beat on it with random strings.

I expect to get back around to testing this in a day or so.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Mark Dilger
Date:

> On Aug 9, 2021, at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So I took another look at the code, and it doesn't seem that hard
> to make it act this way.  The attached passes regression, but
> I've not beat on it with random strings.
> alternate-fix-zero-quantified-nested-parens.patch

I've beaten on this with random patterns and it seems to hold up just fine.  I have also reviewed the diffs and, for
thepatterns where the output changes, everything looks correct.  I can't find anything wrong with this patch. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Another regexp performance improvement: skip useless paren-captures

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I've beaten on this with random patterns and it seems to hold up just fine.  I have also reviewed the diffs and, for
thepatterns where the output changes, everything looks correct.  I can't find anything wrong with this patch. 

Thanks for testing!  I'll push it in a bit.

            regards, tom lane