Regex performance regression induced by match-all code - Mailing list pgsql-hackers

From Tom Lane
Subject Regex performance regression induced by match-all code
Date
Msg-id 3483895.1619898362@sss.pgh.pa.us
Whole thread Raw
Responses Re: Regex performance regression induced by match-all code
List pgsql-hackers
I found a nasty performance problem in commit 824bf7190: given the
right sort of regex, checkmatchall() takes an unreasonable amount
of time.  For example,

regression=# SELECT regexp_matches('', '(.|){20}','');
 regexp_matches 
----------------
 {""}
(1 row)

Time: 129.213 ms
regression=# SELECT regexp_matches('', '(.|){25}','');
 regexp_matches 
----------------
 {""}
(1 row)

Time: 4101.416 ms (00:04.101)
regression=# SELECT regexp_matches('', '(.|){30}','');
 regexp_matches 
----------------
 {""}
(1 row)

Time: 130803.927 ms (02:10.804)

That's quite awful compared to v13, where these cases take
only a couple ms.

Worse still, you can't get out of it with control-C, because
checkmatchall_recurse lacks any check-for-interrupt.

The problem here is basically that we're willing to recursively
examine all paths out of the same NFA state over and over, once for
each possible way of arriving at that state.  That's dumb and we can
do better, though it takes a little more code and some more memory.
The attached patch applies a few different methods to make this
better:

* Before starting the recursive search, do a linear-time pass
through all the states to check whether there are any non-RAINBOW
arcs.  This allows failing fast for most non-matchall NFAs.

* Memo-ize the results of the recursive search, by storing an
array of possible path lengths for each state after we've examined
it once.

* Rewrite the checks for pseudocolor arcs to make them linear
time rather than O(N^2) --- I decided I'd better do that too,
after noting that the problem cases had fairly large numbers
of pre-state outarcs.  This makes them cheap enough to do
before the recursive search not after.

* Put a heuristic upper bound on the number of NFA states for
which we'll attempt this optimization at all.  The main reason
for this is to bound the amount of memory we can expend on
memoization results.  I think that it will result in little
if any degradation in our ability to recognize matchall NFAs,
because of the existing restriction that we can't represent
cases involving path lengths that are finite but more than DUPINF.
If there are a lot more than DUPINF states then (I think) it becomes
pretty likely that we'd fail due to that restriction anyhow.
However, I've not made much attempt to quantify that argument;
I just chose DUPINF * 4 out of the air.

* Just in case that's not enough to fix things, add a cancel check
within checkmatchall_recurse.

The main thing I find a bit ugly about this solution is that
I'm abusing the state->tmp fields (which are declared struct state *)
to hold the memoization results (which are "bool *").  It'd be
possible to avoid that by allocating an additional "bool **" array
indexed by state number, but whether it's worth that depends on how
allergic you are to weird pointer casts.

Comments?

            regards, tom lane

diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 77b860cb0f..04920fa4e3 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -3032,46 +3032,109 @@ analyze(struct nfa *nfa)
 static void
 checkmatchall(struct nfa *nfa)
 {
-    bool        hasmatch[DUPINF + 1];
+    struct state *s;
+    bool        recur_result;
+    bool       *hasmatch;
     int            minmatch,
                 maxmatch,
                 morematch;

     /*
-     * hasmatch[i] will be set true if a match of length i is feasible, for i
-     * from 0 to DUPINF-1.  hasmatch[DUPINF] will be set true if every match
-     * length of DUPINF or more is feasible.
+     * If there are too many states, don't bother trying to detect matchall.
+     * This limit serves to bound the time and memory we could consume below.
+     * Note that even if the graph is all-RAINBOW, if there are significantly
+     * more than DUPINF states then it's likely that there are paths of length
+     * more than DUPINF, which would force us to fail anyhow.  The exact
+     * cutoff used here is a bit arbitrary.
      */
-    memset(hasmatch, 0, sizeof(hasmatch));
+    if (nfa->nstates > DUPINF * 4)
+        return;

     /*
-     * Recursively search the graph for all-RAINBOW paths to the "post" state,
-     * starting at the "pre" state.  The -1 initial depth accounts for the
-     * fact that transitions out of the "pre" state are not part of the
-     * matched string.  We likewise don't count the final transition to the
-     * "post" state as part of the match length.  (But we still insist that
-     * those transitions have RAINBOW arcs, otherwise there are lookbehind or
-     * lookahead constraints at the start/end of the pattern.)
+     * First, scan all the states to verify that only RAINBOW arcs appear,
+     * plus pseudocolor arcs adjacent to the pre and post states.  This lets
+     * us quickly eliminate most cases that aren't matchall NFAs.
      */
-    if (!checkmatchall_recurse(nfa, nfa->pre, false, -1, hasmatch))
-        return;
+    for (s = nfa->states; s != NULL; s = s->next)
+    {
+        struct arc *a;
+
+        for (a = s->outs; a != NULL; a = a->outchain)
+        {
+            if (a->type != PLAIN)
+                return;            /* any LACONs make it non-matchall */
+            if (a->co != RAINBOW)
+            {
+                if (nfa->cm->cd[a->co].flags & PSEUDO)
+                {
+                    /*
+                     * Pseudocolor arc: verify it's in a valid place (this
+                     * seems quite unlikely to fail, but let's be sure).
+                     */
+                    if (s == nfa->pre &&
+                        (a->co == nfa->bos[0] || a->co == nfa->bos[1]))
+                         /* okay BOS/BOL arc */ ;
+                    else if (a->to == nfa->post &&
+                             (a->co == nfa->eos[0] || a->co == nfa->eos[1]))
+                         /* okay EOS/EOL arc */ ;
+                    else
+                        return; /* unexpected pseudocolor arc */
+                    /* We'll check these arcs some more below. */
+                }
+                else
+                    return;        /* any other color makes it non-matchall */
+            }
+        }
+        /* Also, assert that the tmp fields are available for use. */
+        assert(s->tmp == NULL);
+    }

     /*
-     * We found some all-RAINBOW paths, and not anything that we couldn't
-     * handle.  Now verify that pseudocolor arcs adjacent to the pre and post
-     * states match the RAINBOW arcs there.  (We could do this while
-     * recursing, but it's expensive and unlikely to fail, so do it last.)
+     * The next cheapest check we can make is to verify that the BOS/BOL
+     * outarcs of the pre state reach the same states as its RAINBOW outarcs.
+     * If they don't, the NFA expresses some constraints on the character
+     * before the matched string, making it non-matchall.  Likewise, the
+     * EOS/EOL inarcs of the post state must match its RAINBOW inarcs.
      */
     if (!check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[0]) ||
-        !check_out_colors_match(nfa->pre, nfa->bos[0], RAINBOW) ||
         !check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[1]) ||
-        !check_out_colors_match(nfa->pre, nfa->bos[1], RAINBOW))
+        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[0]) ||
+        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[1]))
         return;
-    if (!check_in_colors_match(nfa->post, RAINBOW, nfa->eos[0]) ||
-        !check_in_colors_match(nfa->post, nfa->eos[0], RAINBOW) ||
-        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[1]) ||
-        !check_in_colors_match(nfa->post, nfa->eos[1], RAINBOW))
+
+    /*
+     * Recursively search the graph for all-RAINBOW paths to the "post" state,
+     * starting at the "pre" state, and computing the lengths of the paths.
+     * (Given the preceding checks, there should be at least one such path.
+     * However we could get back a false result anyway, in case there are
+     * multi-state loops, paths exceeding DUPINF+1 length, or non-algorithmic
+     * failures such as ENOMEM.)
+     */
+    recur_result = checkmatchall_recurse(nfa, nfa->pre);
+
+    /* Extract the pre state's hasmatch array so we don't free it below. */
+    if (nfa->pre->tmp != nfa->pre)
+        hasmatch = (bool *) nfa->pre->tmp;
+    else
+        hasmatch = NULL;
+    nfa->pre->tmp = NULL;
+
+    /* Free all other hasmatch arrays, and reset the tmp fields. */
+    for (s = nfa->states; s != NULL; s = s->next)
+    {
+        if (s->tmp != NULL && s->tmp != s)
+            FREE(s->tmp);
+        s->tmp = NULL;
+    }
+
+    /* Now clean up and exit, if we failed. */
+    if (!recur_result)
+    {
+        if (hasmatch)
+            FREE(hasmatch);
         return;
+    }
+    assert(hasmatch);

     /*
      * hasmatch[] now represents the set of possible match lengths; but we
@@ -3080,48 +3143,77 @@ checkmatchall(struct nfa *nfa)
      * lengths.  Find min and max of first run of lengths, then verify there
      * are no nonconsecutive lengths.
      */
-    for (minmatch = 0; minmatch <= DUPINF; minmatch++)
+    for (minmatch = 0; minmatch <= DUPINF + 1; minmatch++)
     {
         if (hasmatch[minmatch])
             break;
     }
-    assert(minmatch <= DUPINF); /* else checkmatchall_recurse lied */
-    for (maxmatch = minmatch; maxmatch < DUPINF; maxmatch++)
+    assert(minmatch <= DUPINF + 1); /* else checkmatchall_recurse lied */
+    for (maxmatch = minmatch; maxmatch < DUPINF + 1; maxmatch++)
     {
         if (!hasmatch[maxmatch + 1])
             break;
     }
-    for (morematch = maxmatch + 1; morematch <= DUPINF; morematch++)
+    for (morematch = maxmatch + 1; morematch <= DUPINF + 1; morematch++)
     {
         if (hasmatch[morematch])
+        {
+            FREE(hasmatch);
             return;                /* fail, there are nonconsecutive lengths */
+        }
     }

-    /* Success, so record the info */
-    nfa->minmatchall = minmatch;
-    nfa->maxmatchall = maxmatch;
+    /*
+     * Success, so record the info.  Here we have a fine point: the "path
+     * length" from the pre state includes the pre-to-initial transition, so
+     * it's one more than the actually matched string length.  (We avoided
+     * counting the final-to-post transition within checkmatchall_recurse, but
+     * not this one.)  This is why checkmatchall_recurse allows one more level
+     * of path length than might seem necessary.  This decrement also takes
+     * care of converting checkmatchall_recurse's definition of "infinity" as
+     * "DUPINF+1" to our normal representation as "DUPINF".
+     */
+    assert(minmatch > 0);        /* else pre and post states were adjacent */
+    nfa->minmatchall = minmatch - 1;
+    nfa->maxmatchall = maxmatch - 1;
     nfa->flags |= MATCHALL;
+
+    FREE(hasmatch);
 }

 /*
  * checkmatchall_recurse - recursive search for checkmatchall
  *
- * s is the current state
- * foundloop is true if any predecessor state has a loop-to-self
- * depth is the current recursion depth (starting at -1)
- * hasmatch[] is the output area for recording feasible match lengths
+ * s is the state to be examined in this recursion level.
+ *
+ * We return true if the search was performed successfully, false if
+ * we had to fail because of multi-state loops or other internal reasons.
+ * (Because "dead" states that can't reach the post state have been
+ * eliminated, and we already verified that only RAINBOW and parallel
+ * pseudocolor arcs exist, every state should have RAINBOW path(s) to
+ * the post state.  Hence we take a false result from recursive calls
+ * as meaning that we'd better fail altogether, not just that that
+ * particular state can't reach the post state.)
  *
- * We return true if there is at least one all-RAINBOW path to the "post"
- * state and no non-matchall paths; otherwise false.  Note we assume that
- * any dead-end paths have already been removed, else we might return
- * false unnecessarily.
+ * We abuse the state.tmp fields by using them this way:
+ *    s->tmp == NULL if s has not been visited yet;
+ *    s->tmp == s if we are currently searching s in this or an outer level;
+ *    otherwise, s->tmp points to a malloc'd bool hasmatch[] array showing
+ *    the possible path lengths from s to the post state.
+ *
+ * Each state's hasmatch[] array is of length DUPINF+2.  The entries from
+ * k = 0 to DUPINF are true if there is an all-RAINBOW path of length k
+ * from this state to the string end.  hasmatch[DUPINF+1] is true if all
+ * path lengths >= DUPINF+1 are possible.
+ *
+ * checkmatchall is responsible for eventually freeing the hasmatch[] arrays.
  */
 static bool
-checkmatchall_recurse(struct nfa *nfa, struct state *s,
-                      bool foundloop, int depth,
-                      bool *hasmatch)
+checkmatchall_recurse(struct nfa *nfa, struct state *s)
 {
     bool        result = false;
+    bool        foundloop = false;
+    bool       *hasmatch;
     struct arc *a;

     /*
@@ -3132,60 +3224,21 @@ checkmatchall_recurse(struct nfa *nfa, struct state *s,
         return false;

     /*
-     * Likewise, if we get to a depth too large to represent correctly in
-     * maxmatchall, fail quietly.
-     */
-    if (depth >= DUPINF)
-        return false;
-
-    /*
-     * Scan the outarcs to detect cases we can't handle, and to see if there
-     * is a loop-to-self here.  We need to know about any such loop before we
-     * recurse, so it's hard to avoid making two passes over the outarcs.  In
-     * any case, checking for showstoppers before we recurse is probably best.
+     * In case the search takes a long time, check for cancel.
      */
-    for (a = s->outs; a != NULL; a = a->outchain)
+    if (CANCEL_REQUESTED(nfa->v->re))
     {
-        if (a->type != PLAIN)
-            return false;        /* any LACONs make it non-matchall */
-        if (a->co != RAINBOW)
-        {
-            if (nfa->cm->cd[a->co].flags & PSEUDO)
-            {
-                /*
-                 * Pseudocolor arc: verify it's in a valid place (this seems
-                 * quite unlikely to fail, but let's be sure).
-                 */
-                if (s == nfa->pre &&
-                    (a->co == nfa->bos[0] || a->co == nfa->bos[1]))
-                     /* okay BOS/BOL arc */ ;
-                else if (a->to == nfa->post &&
-                         (a->co == nfa->eos[0] || a->co == nfa->eos[1]))
-                     /* okay EOS/EOL arc */ ;
-                else
-                    return false;    /* unexpected pseudocolor arc */
-                /* We'll finish checking these arcs after the recursion */
-                continue;
-            }
-            return false;        /* any other color makes it non-matchall */
-        }
-        if (a->to == s)
-        {
-            /*
-             * We found a cycle of length 1, so remember that to pass down to
-             * successor states.  (It doesn't matter if there was also such a
-             * loop at a predecessor state.)
-             */
-            foundloop = true;
-        }
-        else if (a->to->tmp)
-        {
-            /* We found a cycle of length > 1, so fail. */
-            return false;
-        }
+        NERR(REG_CANCEL);
+        return false;
     }

-    /* We need to recurse, so mark state as under consideration */
+    /* Create a hasmatch array for this state */
+    hasmatch = (bool *) MALLOC((DUPINF + 2) * sizeof(bool));
+    if (hasmatch == NULL)
+        return false;            /* again, treat as non-matchall */
+    memset(hasmatch, 0, (DUPINF + 2) * sizeof(bool));
+
+    /* Mark this state as being visited */
     assert(s->tmp == NULL);
     s->tmp = s;

@@ -3197,95 +3250,197 @@ checkmatchall_recurse(struct nfa *nfa, struct state *s,
         {
             /* We found an all-RAINBOW path to the post state */
             result = true;
-            /* ... which should not be adjacent to the pre state */
-            if (depth < 0)
+
+            /*
+             * Mark this state as being zero steps away from the string end
+             * (the transition to the post state doesn't count).
+             */
+            hasmatch[0] = true;
+        }
+        else if (a->to == s)
+        {
+            /* We found a cycle of length 1, which we'll deal with below. */
+            foundloop = true;
+        }
+        else if (a->to->tmp == a->to)
+        {
+            /* We found a cycle of length > 1, so fail. */
+            result = false;
+            break;
+        }
+        else
+        {
+            /* Consider paths forward through this to-state. */
+            bool       *nexthasmatch;
+            int            i;
+
+            /* If to-state was not already visited, recurse. */
+            if (a->to->tmp == NULL)
             {
-                NERR(REG_ASSERT);
-                return false;
+                result = checkmatchall_recurse(nfa, a->to);
+                /* Fail if any recursive path fails */
+                if (!result)
+                    break;
             }
-            /* Record potential match lengths */
-            hasmatch[depth] = true;
-            if (foundloop)
+            else
             {
-                /* A predecessor loop makes all larger lengths match, too */
-                int            i;
+                /* The previous visit must have found path(s) to the end. */
+                result = true;
+            }
+            assert(a->to->tmp != NULL);
+            assert(a->to->tmp != a->to);

-                for (i = depth + 1; i <= DUPINF; i++)
-                    hasmatch[i] = true;
+            /*
+             * Now, for every path of length i from a->to to the string end,
+             * there is a path of length i + 1 from s to the string end.
+             */
+            nexthasmatch = (bool *) a->to->tmp;
+            if (nexthasmatch[DUPINF] != nexthasmatch[DUPINF + 1])
+            {
+                /*
+                 * a->to has a path of length exactly DUPINF, but not longer;
+                 * or it has paths of all lengths > DUPINF but not one of
+                 * exactly that length.  In either case, we cannot represent
+                 * the possible path lengths from s correctly, so fail.
+                 */
+                result = false;
+                break;
             }
+            /* Merge knowledge of these path lengths into what we have. */
+            for (i = 0; i < DUPINF; i++)
+                hasmatch[i + 1] |= nexthasmatch[i];
+            /* Infinity + 1 is still infinity, though. */
+            hasmatch[DUPINF + 1] |= nexthasmatch[DUPINF + 1];
         }
-        else if (a->to != s)
+    }
+
+    if (result && foundloop)
+    {
+        /*
+         * If there is a loop at this state, then find the shortest known path
+         * length to the end.  The loop means that every larger path length is
+         * possible, too.  (It doesn't matter whether any of the longer
+         * lengths were already known possible.)
+         */
+        int            i;
+
+        for (i = 0; i <= DUPINF; i++)
         {
-            /* This is a new path forward; recurse to investigate */
-            result = checkmatchall_recurse(nfa, a->to,
-                                           foundloop, depth + 1,
-                                           hasmatch);
-            /* Fail if any recursive path fails */
-            if (!result)
+            if (hasmatch[i])
                 break;
         }
+        for (i++; i <= DUPINF + 1; i++)
+            hasmatch[i] = true;
     }

-    s->tmp = NULL;
+    /* Report out the completed path length map.  Ugly type pun here! */
+    s->tmp = (struct state *) hasmatch;
+
     return result;
 }

 /*
  * check_out_colors_match - subroutine for checkmatchall
  *
- * Check if every s outarc of color co1 has a matching outarc of color co2.
- * (checkmatchall_recurse already verified that all of the outarcs are PLAIN,
- * so we need not examine arc types here.  Also, since it verified that there
- * are only RAINBOW and pseudocolor arcs, there shouldn't be enough arcs for
- * this brute-force O(N^2) implementation to cause problems.)
+ * Check whether the set of states reachable from s by arcs of color co1
+ * is equivalent to the set reachable by arcs of color co2.
+ * checkmatchall already verified that all of the NFA's arcs are PLAIN,
+ * so we need not examine arc types here.
  */
 static bool
 check_out_colors_match(struct state *s, color co1, color co2)
 {
-    struct arc *a1;
-    struct arc *a2;
+    bool        result = true;
+    struct arc *a;

-    for (a1 = s->outs; a1 != NULL; a1 = a1->outchain)
+    /*
+     * To do this in linear time, we assume that the NFA contains no duplicate
+     * arcs.  Run through the out-arcs, marking states reachable by arcs of
+     * color co1.  Run through again, un-marking states reachable by arcs of
+     * color co2; if we see a not-marked state, we know this co2 arc is
+     * unmatched.  Then run through again, checking for still-marked states,
+     * and in any case leaving all the tmp fields reset to NULL.
+     */
+    for (a = s->outs; a != NULL; a = a->outchain)
     {
-        if (a1->co != co1)
-            continue;
-        for (a2 = s->outs; a2 != NULL; a2 = a2->outchain)
+        if (a->co == co1)
         {
-            if (a2->co == co2 && a2->to == a1->to)
-                break;
+            assert(a->to->tmp == NULL);
+            a->to->tmp = a->to;
+        }
+    }
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co == co2)
+        {
+            if (a->to->tmp != NULL)
+                a->to->tmp = NULL;
+            else
+                result = false; /* unmatched co2 arc */
+        }
+    }
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co == co1)
+        {
+            if (a->to->tmp != NULL)
+            {
+                result = false; /* unmatched co1 arc */
+                a->to->tmp = NULL;
+            }
         }
-        if (a2 == NULL)
-            return false;
     }
-    return true;
+    return result;
 }

 /*
  * check_in_colors_match - subroutine for checkmatchall
  *
- * Check if every s inarc of color co1 has a matching inarc of color co2.
- * (For paranoia's sake, ignore any non-PLAIN arcs here.  But we're still
- * not expecting very many arcs.)
+ * Check whether the set of states that can reach s by arcs of color co1
+ * is equivalent to the set that can reach s by arcs of color co2.
+ * checkmatchall already verified that all of the NFA's arcs are PLAIN,
+ * so we need not examine arc types here.
  */
 static bool
 check_in_colors_match(struct state *s, color co1, color co2)
 {
-    struct arc *a1;
-    struct arc *a2;
+    bool        result = true;
+    struct arc *a;

-    for (a1 = s->ins; a1 != NULL; a1 = a1->inchain)
+    /*
+     * Identical algorithm to check_out_colors_match, except examine the
+     * from-states of s' inarcs.
+     */
+    for (a = s->ins; a != NULL; a = a->inchain)
     {
-        if (a1->type != PLAIN || a1->co != co1)
-            continue;
-        for (a2 = s->ins; a2 != NULL; a2 = a2->inchain)
+        if (a->co == co1)
         {
-            if (a2->type == PLAIN && a2->co == co2 && a2->from == a1->from)
-                break;
+            assert(a->from->tmp == NULL);
+            a->from->tmp = a->from;
+        }
+    }
+    for (a = s->ins; a != NULL; a = a->inchain)
+    {
+        if (a->co == co2)
+        {
+            if (a->from->tmp != NULL)
+                a->from->tmp = NULL;
+            else
+                result = false; /* unmatched co2 arc */
         }
-        if (a2 == NULL)
-            return false;
     }
-    return true;
+    for (a = s->ins; a != NULL; a = a->inchain)
+    {
+        if (a->co == co1)
+        {
+            if (a->from->tmp != NULL)
+            {
+                result = false; /* unmatched co1 arc */
+                a->from->tmp = NULL;
+            }
+        }
+    }
+    return result;
 }

 /*
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ba8dd86464..6786349807 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -182,8 +182,7 @@ static void markreachable(struct nfa *, struct state *, struct state *, struct s
 static void markcanreach(struct nfa *, struct state *, struct state *, struct state *);
 static long analyze(struct nfa *);
 static void checkmatchall(struct nfa *);
-static bool checkmatchall_recurse(struct nfa *, struct state *,
-                                  bool, int, bool *);
+static bool checkmatchall_recurse(struct nfa *, struct state *);
 static bool check_out_colors_match(struct state *, color, color);
 static bool check_in_colors_match(struct state *, color, color);
 static void compact(struct nfa *, struct cnfa *);

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Enhanced error message to include hint messages for redundant options error
Next
From: Chapman Flack
Date:
Subject: Re: Re: Perform COPY FROM encoding conversions in larger chunks