Silliness in regexp's citerdissect/creviterdissect - Mailing list pgsql-hackers

From Tom Lane
Subject Silliness in regexp's citerdissect/creviterdissect
Date
Msg-id 1808998.1629412269@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
While poking at a report of slow regexp matching [1], I happened
to notice that citerdissect and creviterdissect are being remarkably
stupid about how to backtrack after a match failure.  Specifically,
having used the DFA logic to identify K possible submatches, they
then start the slow recursive "cdissect" check of each submatch.
But, if the I'th submatch fails dissection, their response is to
see about adjusting the length of the last submatch ... which will
do nothing at all to make the result of the I'th submatch change,
if it's before K.  When this happens, we should immediately
start adjusting the length of the I'th submatch.

Attached is a simple patch to do this.  It passes check-world, and
shows the same results as before on Jacobson's web-regexps corpus,
as well as on a sample of random regexps made by Dilger's script.
I don't really see any performance difference on Jacobson's corpus,
but Dilger's script did find an example where this makes a huge
difference:

HEAD:

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
 regexp_match
--------------

(1 row)

Time: 9655.141 ms (00:09.655)

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
 regexp_match
--------------
 {x,x,x}
(1 row)

Time: 271106.324 ms (04:31.106)

With patch:

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
 regexp_match
--------------

(1 row)

Time: 9.385 ms

regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
 regexp_match
--------------
 {x,x,x}
(1 row)

Time: 25.103 ms

Admittedly this is a bit contrived.  (In v14/HEAD, you can make the
performance problem go away by getting rid of the redundant capturing
parens, as then we don't invoke citerdissect at all.  That trick doesn't
help in the back branches though.)

I think this is a pretty clear performance bug, and unless there are
objections I plan to push this fix into all branches.

            regards, tom lane

[1] https://www.postgresql.org/message-id/CALZg0g4FA1Xc5UMLrGBKM--erUGEAhe8GGLE-YcN7%3DO6rw2%3D0A%40mail.gmail.com

diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index db54ebfba4..aa5ba85514 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -1147,8 +1147,8 @@ citerdissect(struct vars *v,
      * Our strategy is to first find a set of sub-match endpoints that are
      * valid according to the child node's DFA, and then recursively dissect
      * each sub-match to confirm validity.  If any validity check fails,
-     * backtrack the last sub-match and try again.  And, when we next try for
-     * a validity check, we need not recheck any successfully verified
+     * backtrack that sub-match and try again.  And, when we next try for a
+     * validity check, we need not recheck any successfully verified
      * sub-matches that we didn't move the endpoints of.  nverified remembers
      * how many sub-matches are currently known okay.
      */
@@ -1236,12 +1236,13 @@ citerdissect(struct vars *v,
             return REG_OKAY;
         }
 
-        /* match failed to verify, so backtrack */
+        /* i'th match failed to verify, so backtrack it */
+        k = i;
 
 backtrack:
 
         /*
-         * Must consider shorter versions of the current sub-match.  However,
+         * Must consider shorter versions of the k'th sub-match.  However,
          * we'll only ask for a zero-length match if necessary.
          */
         while (k > 0)
@@ -1352,8 +1353,8 @@ creviterdissect(struct vars *v,
      * Our strategy is to first find a set of sub-match endpoints that are
      * valid according to the child node's DFA, and then recursively dissect
      * each sub-match to confirm validity.  If any validity check fails,
-     * backtrack the last sub-match and try again.  And, when we next try for
-     * a validity check, we need not recheck any successfully verified
+     * backtrack that sub-match and try again.  And, when we next try for a
+     * validity check, we need not recheck any successfully verified
      * sub-matches that we didn't move the endpoints of.  nverified remembers
      * how many sub-matches are currently known okay.
      */
@@ -1447,12 +1448,13 @@ creviterdissect(struct vars *v,
             return REG_OKAY;
         }
 
-        /* match failed to verify, so backtrack */
+        /* i'th match failed to verify, so backtrack it */
+        k = i;
 
 backtrack:
 
         /*
-         * Must consider longer versions of the current sub-match.
+         * Must consider longer versions of the k'th sub-match.
          */
         while (k > 0)
         {

pgsql-hackers by date:

Previous
From: Vik Fearing
Date:
Subject: Re: [PATCH] Proof of concept for GUC improvements
Next
From: David Christensen
Date:
Subject: Re: [PATCH] Proof of concept for GUC improvements