Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0" - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
Date
Msg-id 2242483.1731627374@sss.pgh.pa.us
Whole thread Raw
In response to Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"  (Aleksander Alekseev <aleksander@timescale.com>)
Responses Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
List pgsql-bugs
Aleksander Alekseev <aleksander@timescale.com> writes:
>> If you run
>> SELECT '' ~ '(?:[^\d\D]){0}';
>> it will assert with  "lp->nouts == 0 && rp->nins == 0"

> I wonder if the Assert is just wrong or if it's more complicated than that.

Interesting example.  I think the bug's actually in my commit
08c0d6ad6 (Invent "rainbow" arcs), although it is unreachable
before 2a0af7fe4.  What's happening is:

* "\d\D" can match any character whatsoever.  optimizebracket()
notices this and replaces a sheaf of NFA arcs with a single
RAINBOW arc, which is the same representation as for ".".

* Then we apply the complementation process in cbracket(),
and that calls colorcomplement() which does this:

    /* A RAINBOW arc matches all colors, making the complement empty */
     if (findarc(of, PLAIN, RAINBOW) != NULL)
         return;

We thus end up with no arcs from the bracket expression's start
state to its end state.  This is not wrong in itself: "you can't
get there from here" is a perfectly reasonable representation
of [^\d\D], which cannot match any character.

* However, the (?: ... ){0} superstructure around that eventually
leads us to

    /* annoying special case:  {0} or {0,0} cancels everything */
    if (m == 0 && n == 0)
    {
            ...
            /* Otherwise, we can clean up any subre infrastructure we made */
            if (atom != NULL)
                freesubre(v, atom);
            delsub(v->nfa, lp, rp);
    }

which is trying to throw away the detritus from inside the quantified
subexpression.  But delsub fails to remove it all, because there's no
path between the two specified states, and its assert notices that.

This is, I believe, not harmful in itself: the leftover unreachable
states/arcs will be cleaned up later, and we end with a valid NFA.

So, as you noted, removing that assert would suppress all visible
symptoms of the problem.  But I really don't want to do that; this
code is complicated and we need all the help we can get to find
bugs in it.

So I think the right thing to do is to fix colorcomplement() so
that it still produces some arc between its "from" and "to"
states, keeping the graph connected until we finish parsing the
regex.  It has to be a dummy arc that can't match anything, of
course.  We can throw away such an arc once we get to regex
optimization, because we don't need delsub() to work anymore.

In the attached draft patch I represented the dummy arc as a PLAIN
arc with a new fake color BLACK.  Another way to do it could be to
introduce a new arc "type" value, but that seemed like it would
involve touching more code.

This is admittedly a lot more code than removing one assert,
but I think we'll regret it if we allow delsub failures to go
undetected.

            regards, tom lane

diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 30bda0e5ad..27d324b8a6 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -1075,16 +1075,26 @@ colorcomplement(struct nfa *nfa,

     assert(of != from);

-    /* A RAINBOW arc matches all colors, making the complement empty */
+    /*
+     * A RAINBOW arc matches all colors, making the complement empty.  But we
+     * can't just return without making any arcs, because that would leave the
+     * NFA disconnected which would break any future delsub().  Instead, make
+     * a BLACK arc.  Also set the HASBLACK flag so we know to clean that up at
+     * the start of NFA optimization.
+     */
     if (findarc(of, PLAIN, RAINBOW) != NULL)
+    {
+        newarc(nfa, type, BLACK, from, to);
+        nfa->flags |= HASBLACK;
         return;
+    }

     /* Otherwise, transiently mark the colors that appear in of's out-arcs */
     for (a = of->outs; a != NULL; a = a->outchain)
     {
         if (a->type == PLAIN)
         {
-            assert(a->co >= 0);
+            assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */
             cd = &cm->cd[a->co];
             assert(!UNUSEDCOLOR(cd));
             cd->flags |= COLMARK;
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index f1819a24f6..4b57a2c2b1 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -1599,6 +1599,12 @@ optimize(struct nfa *nfa,
     if (verbose)
         fprintf(f, "\ninitial cleanup:\n");
 #endif
+    /* If we have any BLACK arcs, drop them; but this is uncommon */
+    if (nfa->flags & HASBLACK)
+    {
+        removeblack(nfa);
+        nfa->flags &= ~HASBLACK;
+    }
     cleanup(nfa);                /* may simplify situation */
 #ifdef REG_DEBUG
     if (verbose)
@@ -2922,6 +2928,32 @@ clonesuccessorstates(struct nfa *nfa,
     }
 }

+/*
+ * removeblack - remove BLACK arcs, which are no longer useful
+ */
+static void
+removeblack(struct nfa *nfa)
+{
+    struct state *s;
+
+    for (s = nfa->states; s != NULL; s = s->next)
+    {
+        struct arc *a;
+        struct arc *nexta;
+
+        for (a = s->outs; a != NULL; a = nexta)
+        {
+            nexta = a->outchain;
+            if (a->type == PLAIN && a->co == BLACK)
+            {
+                freearc(nfa, a);
+                if (NISERR())
+                    return;
+            }
+        }
+    }
+}
+
 /*
  * cleanup - clean up NFA after optimizations
  */
@@ -3627,6 +3659,8 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
     if (nfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (nfa->flags & HASBLACK)
+        fprintf(f, ", hasblack");
     if (nfa->flags & MATCHALL)
     {
         fprintf(f, ", minmatchall %d", nfa->minmatchall);
@@ -3725,6 +3759,8 @@ dumparc(struct arc *a,
         case PLAIN:
             if (a->co == RAINBOW)
                 fprintf(f, "[*]");
+            else if (a->co == BLACK)
+                fprintf(f, "[-]");
             else
                 fprintf(f, "[%ld]", (long) a->co);
             break;
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 8a6cfb2973..432fdb8eb8 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource,
                                  struct state *spredecessor,
                                  struct arc *refarc, char *curdonemap,
                                  char *outerdonemap, int nstates);
+static void removeblack(struct nfa *nfa);
 static void cleanup(struct nfa *nfa);
 static void markreachable(struct nfa *nfa, struct state *s,
                           struct state *okay, struct state *mark);
@@ -346,7 +347,7 @@ struct vars
 #define SEND    'Z'                /* end of string (even if not EOL) */

 /* is an arc colored, and hence should belong to a color chain? */
-/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
+/* testing co eliminates RAINBOW/BLACK arcs, which we don't bother to chain */
 #define COLORED(a) \
     ((a)->co >= 0 && \
      ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))
@@ -1949,7 +1950,7 @@ optimizebracket(struct vars *v,
     for (a = lp->outs; a != NULL; a = a->outchain)
     {
         assert(a->type == PLAIN);
-        assert(a->co >= 0);        /* i.e. not RAINBOW */
+        assert(a->co >= 0);        /* i.e. not RAINBOW or BLACK */
         assert(a->to == rp);
         cd = &v->cm->cd[a->co];
         assert(!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO));
@@ -2368,6 +2369,7 @@ nfanode(struct vars *v,
     nfa = newnfa(v, v->cm, v->nfa);
     NOERRZ();
     dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final);
+    nfa->flags = v->nfa->flags;
     if (!ISERR())
         specialcolors(nfa);
     if (!ISERR())
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 3ca3647e11..a62df145b3 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -150,13 +150,16 @@ enum char_classes
  *
  * To further reduce the number of arcs in NFAs and DFAs, we also have a
  * special RAINBOW "color" that can be assigned to an arc.  This is not a
- * real color, in that it has no entry in color maps.
+ * real color, in that it has no entry in color maps.  To allow representing
+ * the complement of RAINBOW, there is also BLACK which matches no colors.
+ * (BLACK arcs exist only during regex parsing, not in finished NFAs.)
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
 #define RAINBOW        (-2)        /* represents all colors except pseudocolors */
+#define BLACK        (-3)        /* represents no colors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -410,6 +413,8 @@ struct cnfa
     int            flags;            /* bitmask of the following flags: */
 #define  HASLACONS    01            /* uses lookaround constraints */
 #define  MATCHALL    02            /* matches all strings of a range of lengths */
+#define  HASBLACK    04            /* contains BLACK arcs */
+    /* Note: HASBLACK appears in nfa structs' flags, but never in cnfas */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 731ba506d3..c44c717edf 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
  {"012  3456789abc_*"}
 (2 rows)

+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+                             test_regex
+--------------------------------------------------------------------
+ {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH}
+ {""}
+(2 rows)
+
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');
+                       test_regex
+--------------------------------------------------------
+ {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE}
+(1 row)
+
 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');
           test_regex
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 478fa2c547..b2a847577e 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE');
 select * from test_regex('\W', '0123456789abc_*', 'LP');
 select * from test_regex('[\W]', '0123456789abc_*', 'LPE');
 select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');

 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');

pgsql-bugs by date:

Previous
From: Aleksander Alekseev
Date:
Subject: Re: Libpq library error when doing physical Replication
Next
From: Tom Lane
Date:
Subject: Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"