Home > mailing lists

Re: Another regexp performance improvement: skip useless paren-captures - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Another regexp performance improvement: skip useless paren-captures
Date	August 10, 2021 02:20:33
Msg-id	3756810.1628562033@sss.pgh.pa.us Whole thread Raw
In response to	Re: Another regexp performance improvement: skip useless paren-captures (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses	Re: Another regexp performance improvement: skip useless paren-captures Re: Another regexp performance improvement: skip useless paren-captures
List	pgsql-hackers

Tree view

Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the
userfacing differences.  If we had a good reason for raising an error for these back-references, maybe that'd be fine,
butit seems to just be an implementation detail. 

I thought about this some more, and I'm coming around to the idea that
throwing an error is the wrong thing.  As a contrary example, consider

    (.)|(\1\1)

We don't throw an error for this, and neither does Perl, even though
the capturing parens can never be defined in the branch where the
backrefs are.  So it seems hard to argue that this is okay but the
other thing isn't.  Another interesting example is

    (.){0}(\1){0}

I think that the correct interpretation is that this is a valid
regexp matching an empty string (i.e., zero repetitions of each
part), even though neither capture group will be defined.
That's different from

    (.){0}(\1)

which can never match.

So I took another look at the code, and it doesn't seem that hard
to make it act this way.  The attached passes regression, but
I've not beat on it with random strings.

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ae3a7b6a38..d9840171a3 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1089,11 +1089,23 @@ parseqatom(struct vars *v,
     /* annoying special case:  {0} or {0,0} cancels everything */
     if (m == 0 && n == 0)
     {
-        if (atom != NULL)
-            freesubre(v, atom);
-        if (atomtype == '(')
-            v->subs[subno] = NULL;
-        delsub(v->nfa, lp, rp);
+        /*
+         * If we had capturing subexpression(s) within the atom, we don't want
+         * to destroy them, because it's legal (if useless) to back-ref them
+         * later.  Hence, just unlink the atom from lp/rp and then ignore it.
+         */
+        if (atom != NULL && (atom->flags & CAP))
+        {
+            delsub(v->nfa, lp, atom->begin);
+            delsub(v->nfa, atom->end, rp);
+        }
+        else
+        {
+            /* Otherwise, we can clean up any subre infrastructure we made */
+            if (atom != NULL)
+                freesubre(v, atom);
+            delsub(v->nfa, lp, rp);
+        }
         EMPTYARC(lp, rp);
         return top;
     }
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 5a6cdf47c2..96c0e6fa59 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3506,6 +3506,21 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP');
  {yy,NULL,NULL,NULL}
 (2 rows)

+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+                         test_regex
+------------------------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX,REG_UEMPTYMATCH}
+ {"",NULL,NULL,NULL}
+(2 rows)
+
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');
+                 test_regex
+--------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX}
+(1 row)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 3419564203..29806e9417 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1020,6 +1020,10 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
 select * from test_regex('((.))(\2)', 'xyy', 'RP');
 -- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
 select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

pgsql-hackers by date:

From: Peter Smith
Date: 10 August 2021, 02:17:07
Subject: Re: Small documentation improvement for ALTER SUBSCRIPTION

From: Masahiko Sawada
Date: 10 August 2021, 02:35:17
Subject: Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

Re: Another regexp performance improvement: skip useless paren-captures - Mailing list pgsql-hackers

Previous

Next