Thread: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"

The following bug has been logged on the website:

Bug reference:      18708
Logged by:          Nikolay Shaplov (PostgresPro)
Email address:      dhyan@nataraj.su
PostgreSQL version: 16.4
Operating system:   Debian 12
Description:

Hi! 

We've found a bug: 

If you run 

SELECT '' ~ '(?:[^\d\D]){0}'; 

it will assert with  "lp->nouts == 0 && rp->nins == 0"

This behavior have been introduced in 2a0af7fe460 commit.

This bug have been found while fuzzing jsonpath_in function, and then
narrowed down to regex problem. Thanks to Andrey Bille for help with
narrowing sample down and finding commit that caused the problem.


Hi Nikolay,

> If you run
>
> SELECT '' ~ '(?:[^\d\D]){0}';
>
> it will assert with  "lp->nouts == 0 && rp->nins == 0"
>
> This behavior have been introduced in 2a0af7fe460 commit.
>
> This bug have been found while fuzzing jsonpath_in function, and then
> narrowed down to regex problem. Thanks to Andrey Bille for help with
> narrowing sample down and finding commit that caused the problem.

Thanks for the report. I can reproduce it with 18devel too:

```
#6  0x00005906af1a1e5b in delsub (nfa=0x5906b179f8b0,
lp=0x5906b179fd88, rp=0x5906b179fdc0) at
../src/backend/regex/regc_nfa.c:1292
1292        assert(lp->nouts == 0 && rp->nins == 0);    /* did the job */

(gdb) p *lp
$1 = {no = 4, flag = 0 '\000', nins = 1, nouts = 0, ins =
0x5906b17a3418, outs = 0x0, tmp = 0x0, next = 0x5906b179fdc0, prev =
0x5906b179fd50}

(gdb) p *rp
$2 = {no = 5, flag = 0 '\000', nins = 1, nouts = 1, ins =
0x5906b17a34f0, outs = 0x5906b17a3460, tmp = 0x5906b179fdc0, next =
0x5906b179fe30,
  prev = 0x5906b179fd88}
```

I wonder if the Assert is just wrong or if it's more complicated than that.

For the record:

([^\d\D]){0} - OK
(?:[^\d\D]){1} - OK
(?:[^\D]){0} - OK
(?:[^\d]){0} - OK
'(?:[^\d\D]){0}' - FAIL

The value of the left argument of the `~` operator is not important.

Thoughts?

-- 
Best regards,
Aleksander Alekseev



Hi,

> I wonder if the Assert is just wrong or if it's more complicated than that.
>
> For the record:
>
> ([^\d\D]){0} - OK
> (?:[^\d\D]){1} - OK
> (?:[^\D]){0} - OK
> (?:[^\d]){0} - OK
> '(?:[^\d\D]){0}' - FAIL

Well, removing the assert helps, and the regex seems to work correctly
after that.

This however is almost certainly not a correct fix (the assert is
right about lp->nouts == 0 it's only unhappy about rp->nins != 0) and
a second opinion is certainly needed since I'm looking at
src/backend/regex/regc_nfa.c for the first time in my life :)

-- 
Best regards,
Aleksander Alekseev



PG Bug reporting form <noreply@postgresql.org> writes:
> If you run 
> SELECT '' ~ '(?:[^\d\D]){0}'; 
> it will assert with  "lp->nouts == 0 && rp->nins == 0"
> This behavior have been introduced in 2a0af7fe460 commit.

Thanks for the report --- I'll dig into this later.

            regards, tom lane



Aleksander Alekseev <aleksander@timescale.com> writes:
>> If you run
>> SELECT '' ~ '(?:[^\d\D]){0}';
>> it will assert with  "lp->nouts == 0 && rp->nins == 0"

> I wonder if the Assert is just wrong or if it's more complicated than that.

Interesting example.  I think the bug's actually in my commit
08c0d6ad6 (Invent "rainbow" arcs), although it is unreachable
before 2a0af7fe4.  What's happening is:

* "\d\D" can match any character whatsoever.  optimizebracket()
notices this and replaces a sheaf of NFA arcs with a single
RAINBOW arc, which is the same representation as for ".".

* Then we apply the complementation process in cbracket(),
and that calls colorcomplement() which does this:

    /* A RAINBOW arc matches all colors, making the complement empty */
     if (findarc(of, PLAIN, RAINBOW) != NULL)
         return;

We thus end up with no arcs from the bracket expression's start
state to its end state.  This is not wrong in itself: "you can't
get there from here" is a perfectly reasonable representation
of [^\d\D], which cannot match any character.

* However, the (?: ... ){0} superstructure around that eventually
leads us to

    /* annoying special case:  {0} or {0,0} cancels everything */
    if (m == 0 && n == 0)
    {
            ...
            /* Otherwise, we can clean up any subre infrastructure we made */
            if (atom != NULL)
                freesubre(v, atom);
            delsub(v->nfa, lp, rp);
    }

which is trying to throw away the detritus from inside the quantified
subexpression.  But delsub fails to remove it all, because there's no
path between the two specified states, and its assert notices that.

This is, I believe, not harmful in itself: the leftover unreachable
states/arcs will be cleaned up later, and we end with a valid NFA.

So, as you noted, removing that assert would suppress all visible
symptoms of the problem.  But I really don't want to do that; this
code is complicated and we need all the help we can get to find
bugs in it.

So I think the right thing to do is to fix colorcomplement() so
that it still produces some arc between its "from" and "to"
states, keeping the graph connected until we finish parsing the
regex.  It has to be a dummy arc that can't match anything, of
course.  We can throw away such an arc once we get to regex
optimization, because we don't need delsub() to work anymore.

In the attached draft patch I represented the dummy arc as a PLAIN
arc with a new fake color BLACK.  Another way to do it could be to
introduce a new arc "type" value, but that seemed like it would
involve touching more code.

This is admittedly a lot more code than removing one assert,
but I think we'll regret it if we allow delsub failures to go
undetected.

            regards, tom lane

diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 30bda0e5ad..27d324b8a6 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -1075,16 +1075,26 @@ colorcomplement(struct nfa *nfa,

     assert(of != from);

-    /* A RAINBOW arc matches all colors, making the complement empty */
+    /*
+     * A RAINBOW arc matches all colors, making the complement empty.  But we
+     * can't just return without making any arcs, because that would leave the
+     * NFA disconnected which would break any future delsub().  Instead, make
+     * a BLACK arc.  Also set the HASBLACK flag so we know to clean that up at
+     * the start of NFA optimization.
+     */
     if (findarc(of, PLAIN, RAINBOW) != NULL)
+    {
+        newarc(nfa, type, BLACK, from, to);
+        nfa->flags |= HASBLACK;
         return;
+    }

     /* Otherwise, transiently mark the colors that appear in of's out-arcs */
     for (a = of->outs; a != NULL; a = a->outchain)
     {
         if (a->type == PLAIN)
         {
-            assert(a->co >= 0);
+            assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */
             cd = &cm->cd[a->co];
             assert(!UNUSEDCOLOR(cd));
             cd->flags |= COLMARK;
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index f1819a24f6..4b57a2c2b1 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -1599,6 +1599,12 @@ optimize(struct nfa *nfa,
     if (verbose)
         fprintf(f, "\ninitial cleanup:\n");
 #endif
+    /* If we have any BLACK arcs, drop them; but this is uncommon */
+    if (nfa->flags & HASBLACK)
+    {
+        removeblack(nfa);
+        nfa->flags &= ~HASBLACK;
+    }
     cleanup(nfa);                /* may simplify situation */
 #ifdef REG_DEBUG
     if (verbose)
@@ -2922,6 +2928,32 @@ clonesuccessorstates(struct nfa *nfa,
     }
 }

+/*
+ * removeblack - remove BLACK arcs, which are no longer useful
+ */
+static void
+removeblack(struct nfa *nfa)
+{
+    struct state *s;
+
+    for (s = nfa->states; s != NULL; s = s->next)
+    {
+        struct arc *a;
+        struct arc *nexta;
+
+        for (a = s->outs; a != NULL; a = nexta)
+        {
+            nexta = a->outchain;
+            if (a->type == PLAIN && a->co == BLACK)
+            {
+                freearc(nfa, a);
+                if (NISERR())
+                    return;
+            }
+        }
+    }
+}
+
 /*
  * cleanup - clean up NFA after optimizations
  */
@@ -3627,6 +3659,8 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
     if (nfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (nfa->flags & HASBLACK)
+        fprintf(f, ", hasblack");
     if (nfa->flags & MATCHALL)
     {
         fprintf(f, ", minmatchall %d", nfa->minmatchall);
@@ -3725,6 +3759,8 @@ dumparc(struct arc *a,
         case PLAIN:
             if (a->co == RAINBOW)
                 fprintf(f, "[*]");
+            else if (a->co == BLACK)
+                fprintf(f, "[-]");
             else
                 fprintf(f, "[%ld]", (long) a->co);
             break;
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 8a6cfb2973..432fdb8eb8 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource,
                                  struct state *spredecessor,
                                  struct arc *refarc, char *curdonemap,
                                  char *outerdonemap, int nstates);
+static void removeblack(struct nfa *nfa);
 static void cleanup(struct nfa *nfa);
 static void markreachable(struct nfa *nfa, struct state *s,
                           struct state *okay, struct state *mark);
@@ -346,7 +347,7 @@ struct vars
 #define SEND    'Z'                /* end of string (even if not EOL) */

 /* is an arc colored, and hence should belong to a color chain? */
-/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
+/* testing co eliminates RAINBOW/BLACK arcs, which we don't bother to chain */
 #define COLORED(a) \
     ((a)->co >= 0 && \
      ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))
@@ -1949,7 +1950,7 @@ optimizebracket(struct vars *v,
     for (a = lp->outs; a != NULL; a = a->outchain)
     {
         assert(a->type == PLAIN);
-        assert(a->co >= 0);        /* i.e. not RAINBOW */
+        assert(a->co >= 0);        /* i.e. not RAINBOW or BLACK */
         assert(a->to == rp);
         cd = &v->cm->cd[a->co];
         assert(!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO));
@@ -2368,6 +2369,7 @@ nfanode(struct vars *v,
     nfa = newnfa(v, v->cm, v->nfa);
     NOERRZ();
     dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final);
+    nfa->flags = v->nfa->flags;
     if (!ISERR())
         specialcolors(nfa);
     if (!ISERR())
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 3ca3647e11..a62df145b3 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -150,13 +150,16 @@ enum char_classes
  *
  * To further reduce the number of arcs in NFAs and DFAs, we also have a
  * special RAINBOW "color" that can be assigned to an arc.  This is not a
- * real color, in that it has no entry in color maps.
+ * real color, in that it has no entry in color maps.  To allow representing
+ * the complement of RAINBOW, there is also BLACK which matches no colors.
+ * (BLACK arcs exist only during regex parsing, not in finished NFAs.)
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
 #define RAINBOW        (-2)        /* represents all colors except pseudocolors */
+#define BLACK        (-3)        /* represents no colors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -410,6 +413,8 @@ struct cnfa
     int            flags;            /* bitmask of the following flags: */
 #define  HASLACONS    01            /* uses lookaround constraints */
 #define  MATCHALL    02            /* matches all strings of a range of lengths */
+#define  HASBLACK    04            /* contains BLACK arcs */
+    /* Note: HASBLACK appears in nfa structs' flags, but never in cnfas */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 731ba506d3..c44c717edf 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
  {"012  3456789abc_*"}
 (2 rows)

+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+                             test_regex
+--------------------------------------------------------------------
+ {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH}
+ {""}
+(2 rows)
+
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');
+                       test_regex
+--------------------------------------------------------
+ {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE}
+(1 row)
+
 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');
           test_regex
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 478fa2c547..b2a847577e 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE');
 select * from test_regex('\W', '0123456789abc_*', 'LP');
 select * from test_regex('[\W]', '0123456789abc_*', 'LPE');
 select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');

 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');

I wrote:
> So I think the right thing to do is to fix colorcomplement() so
> that it still produces some arc between its "from" and "to"
> states, keeping the graph connected until we finish parsing the
> regex.  It has to be a dummy arc that can't match anything, of
> course.  We can throw away such an arc once we get to regex
> optimization, because we don't need delsub() to work anymore.

> In the attached draft patch I represented the dummy arc as a PLAIN
> arc with a new fake color BLACK.  Another way to do it could be to
> introduce a new arc "type" value, but that seemed like it would
> involve touching more code.

After thinking awhile longer, I decided to try it that way (with
a new arc type), and I believe I like the result better after all.
It's just a few lines more code, and I think it's more robust.
Up to now PLAIN arcs always matched exactly one character,
so the first version was putting a major dent in their semantics.
We got rid of the bogus arcs before doing anything that really leans
hard on arc semantics, but still it seems messy.  Hence v2 attached.

I'm not especially in love with the CANTMATCH arc type name, but
other possibilities such as DUMMY or NOMATCH felt too generic.
Better ideas anyone?

            regards, tom lane

diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 30bda0e5ad..8ae788f519 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -1075,9 +1075,19 @@ colorcomplement(struct nfa *nfa,

     assert(of != from);

-    /* A RAINBOW arc matches all colors, making the complement empty */
+    /*
+     * A RAINBOW arc matches all colors, making the complement empty.  But we
+     * can't just return without making any arcs, because that would leave the
+     * NFA disconnected which would break any future delsub().  Instead, make
+     * a CANTMATCH arc.  Also set the HASCANTMATCH flag so we know we need to
+     * clean that up at the start of NFA optimization.
+     */
     if (findarc(of, PLAIN, RAINBOW) != NULL)
+    {
+        newarc(nfa, CANTMATCH, 0, from, to);
+        nfa->flags |= HASCANTMATCH;
         return;
+    }

     /* Otherwise, transiently mark the colors that appear in of's out-arcs */
     for (a = of->outs; a != NULL; a = a->outchain)
@@ -1089,6 +1099,12 @@ colorcomplement(struct nfa *nfa,
             assert(!UNUSEDCOLOR(cd));
             cd->flags |= COLMARK;
         }
+
+        /*
+         * There's no syntax for re-complementing a color set, so we cannot
+         * see CANTMATCH arcs here.
+         */
+        assert(a->type != CANTMATCH);
     }

     /* Scan colors, clear transient marks, add arcs for unmarked colors */
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index f1819a24f6..acd2286def 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -1462,6 +1462,7 @@ removetraverse(struct nfa *nfa,
         {
             case PLAIN:
             case EMPTY:
+            case CANTMATCH:
                 /* nothing to do */
                 break;
             case AHEAD:
@@ -1599,6 +1600,12 @@ optimize(struct nfa *nfa,
     if (verbose)
         fprintf(f, "\ninitial cleanup:\n");
 #endif
+    /* If we have any CANTMATCH arcs, drop them; but this is uncommon */
+    if (nfa->flags & HASCANTMATCH)
+    {
+        removecantmatch(nfa);
+        nfa->flags &= ~HASCANTMATCH;
+    }
     cleanup(nfa);                /* may simplify situation */
 #ifdef REG_DEBUG
     if (verbose)
@@ -2922,6 +2929,34 @@ clonesuccessorstates(struct nfa *nfa,
     }
 }

+/*
+ * removecantmatch - remove CANTMATCH arcs, which are no longer useful
+ * once we are done with the parsing phase.  (We need them only to
+ * preserve connectedness of NFA subgraphs during parsing.)
+ */
+static void
+removecantmatch(struct nfa *nfa)
+{
+    struct state *s;
+
+    for (s = nfa->states; s != NULL; s = s->next)
+    {
+        struct arc *a;
+        struct arc *nexta;
+
+        for (a = s->outs; a != NULL; a = nexta)
+        {
+            nexta = a->outchain;
+            if (a->type == CANTMATCH)
+            {
+                freearc(nfa, a);
+                if (NISERR())
+                    return;
+            }
+        }
+    }
+}
+
 /*
  * cleanup - clean up NFA after optimizations
  */
@@ -3627,6 +3662,8 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
     if (nfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (nfa->flags & HASCANTMATCH)
+        fprintf(f, ", hascantmatch");
     if (nfa->flags & MATCHALL)
     {
         fprintf(f, ", minmatchall %d", nfa->minmatchall);
@@ -3749,6 +3786,9 @@ dumparc(struct arc *a,
             break;
         case EMPTY:
             break;
+        case CANTMATCH:
+            fprintf(f, "X");
+            break;
         default:
             fprintf(f, "0x%x/0%lo", a->type, (long) a->co);
             break;
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 8a6cfb2973..15b264e50f 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource,
                                  struct state *spredecessor,
                                  struct arc *refarc, char *curdonemap,
                                  char *outerdonemap, int nstates);
+static void removecantmatch(struct nfa *nfa);
 static void cleanup(struct nfa *nfa);
 static void markreachable(struct nfa *nfa, struct state *s,
                           struct state *okay, struct state *mark);
@@ -342,6 +343,7 @@ struct vars
 #define BEHIND    'r'                /* color-lookbehind arc */
 #define WBDRY    'w'                /* word boundary constraint */
 #define NWBDRY    'W'                /* non-word-boundary constraint */
+#define CANTMATCH 'x'            /* arc that cannot match anything */
 #define SBEGIN    'A'                /* beginning of string (even if not BOL) */
 #define SEND    'Z'                /* end of string (even if not EOL) */

@@ -2368,6 +2370,7 @@ nfanode(struct vars *v,
     nfa = newnfa(v, v->cm, v->nfa);
     NOERRZ();
     dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final);
+    nfa->flags = v->nfa->flags;
     if (!ISERR())
         specialcolors(nfa);
     if (!ISERR())
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 3ca3647e11..fd69299a16 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -410,6 +410,8 @@ struct cnfa
     int            flags;            /* bitmask of the following flags: */
 #define  HASLACONS    01            /* uses lookaround constraints */
 #define  MATCHALL    02            /* matches all strings of a range of lengths */
+#define  HASCANTMATCH 04        /* contains CANTMATCH arcs */
+    /* Note: HASCANTMATCH appears in nfa structs' flags, but never in cnfas */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 731ba506d3..c44c717edf 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
  {"012  3456789abc_*"}
 (2 rows)

+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+                             test_regex
+--------------------------------------------------------------------
+ {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH}
+ {""}
+(2 rows)
+
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');
+                       test_regex
+--------------------------------------------------------
+ {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE}
+(1 row)
+
 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');
           test_regex
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 478fa2c547..b2a847577e 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE');
 select * from test_regex('\W', '0123456789abc_*', 'LP');
 select * from test_regex('[\W]', '0123456789abc_*', 'LPE');
 select * from test_regex('[\s\S]*', '012  3456789abc_*', 'LNPE');
+-- bug #18708:
+select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE');
+select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE');

 -- check char classes' handling of newlines
 select * from test_regex('\s+', E'abc  \n  def', 'LP');

Hi,

I reviewed, applied and tested patch v2 on Linux x64 and MacOS x64. LGTM.

> I'm not especially in love with the CANTMATCH arc type name, but
> other possibilities such as DUMMY or NOMATCH felt too generic.
> Better ideas anyone?

Well... at least it's consistent with the current naming e.g.
HASCANTMATCH. I suggest keeping it as is.

--
Best regards,
Aleksander Alekseev



Hi,

> Well... at least it's consistent with the current naming e.g.
> HASCANTMATCH. I suggest keeping it as is.

Oh I wrote something stupid, HASCANTMATCH is part of the patch.

In any case, IMO the name is OK.

-- 
Best regards,
Aleksander Alekseev



Aleksander Alekseev <aleksander@timescale.com> writes:
> In any case, IMO the name is OK.

I've not thought of a better name since yesterday, so pushed.
Thanks for reviewing!

            regards, tom lane



Hello Tom,

16.11.2024 03:08, Tom Lane wrote:
> I've not thought of a better name since yesterday, so pushed.
> Thanks for reviewing!

Please look at the hornet's failure on processing a test query added here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> Please look at the hornet's failure on processing a test query added here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38

Hmm... for the archives' sake, that looks like

 select * from test_regex('[^\\d\\D]', '0123456789abc*', 'ILPE');
-                        test_regex
- --------------------------------------------------------
-  {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE}
- (1 row)
-
+ ERROR:  invalid regular expression: out of memory
  -- check char classes' handling of newlines

I'm not sure what to make of it.  That regex shouldn't consume very
much memory.  To confirm that, I stepped through it and found that
newstate() is reached 14 times and newarc() 35 times.  That's a
pretty tiny amount of memory, and there are other regexps in the
tests that are far larger.

Moreover, no other animal has shown this, including hornet itself
on the v16 branch.  (It's only run this test in v15 and v16 so far,
so that's not a lot of data points.)

I'm inclined to guess this was some weird momentary glitch.  If it
reproduces then I'll look closer.

            regards, tom lane



16.11.2024 19:02, Tom Lane wrote:
> Moreover, no other animal has shown this, including hornet itself
> on the v16 branch.  (It's only run this test in v15 and v16 so far,
> so that's not a lot of data points.)
>
> I'm inclined to guess this was some weird momentary glitch.  If it
> reproduces then I'll look closer.

(Un)fortunately, tern (which is also a ppc animal) has produced the same
failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> 16.11.2024 19:02, Tom Lane wrote:
>> I'm inclined to guess this was some weird momentary glitch.  If it
>> reproduces then I'll look closer.

> (Un)fortunately, tern (which is also a ppc animal) has produced the same
> failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12

Yeah, I saw that.  Even more confused now about what it could be.

            regards, tom lane



I wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> (Un)fortunately, tern (which is also a ppc animal) has produced the same
>> failure:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12

> Yeah, I saw that.  Even more confused now about what it could be.

After testing on hornet's host, it seems that this is a pre-existing
issue that we didn't happen to hit before.  Since the regex
'[^\d\D]' is unsatisfiable, it collapses to nothing (start state,
end state, and no arcs) in the first cleanup() call in optimize().
Then fixempties() counts the number of in-arcs and gets zero,
and then it does

    arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
    if (arcarray == NULL)
    {
        NERR(REG_ESPACE);
        ...

On a machine where malloc(0) returns NULL, this mistakenly
thinks that's an error.

I verified that

-    if (arcarray == NULL)
+    if (arcarray == NULL && totalinarcs != 0)

makes the failure go away, but I wonder if any other places in
backend/regex/ are at the same hazard.  Maybe the smartest fix
would be to put in a wrapper layer that does what pg_malloc
does:

    /* Avoid unportable behavior of malloc(0) */
    if (size == 0)
        size = 1;

One other point is that this theory fails to explain why
hornet didn't fail in the v16 branch ... oh, wait:
v15 has

#define MALLOC(n)        malloc(n)

where later branches have

#define MALLOC(n)        palloc_extended((n), MCXT_ALLOC_NO_OOM)

So the right answer seems to be to figure out why we didn't
back-patch that change.

            regards, tom lane



On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote:
>     arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
>     if (arcarray == NULL)
>     {
>         NERR(REG_ESPACE);
>         ...
> 
> On a machine where malloc(0) returns NULL, this mistakenly
> thinks that's an error.
> 
> I verified that
> 
> -    if (arcarray == NULL)
> +    if (arcarray == NULL && totalinarcs != 0)
> 
> makes the failure go away, but I wonder if any other places in
> backend/regex/ are at the same hazard.  Maybe the smartest fix
> would be to put in a wrapper layer that does what pg_malloc
> does:
> 
>     /* Avoid unportable behavior of malloc(0) */
>     if (size == 0)
>         size = 1;

Either of those sound reasonable.  The consequence of missing this hazard, a
deterministic ERROR, is modest.  This affects just one platform, in the oldest
branches.  There's a lack of complaints.  To me, all that would make the
one-line diff tempting.

> One other point is that this theory fails to explain why
> hornet didn't fail in the v16 branch ... oh, wait:
> v15 has 
> 
> #define MALLOC(n)        malloc(n)
> 
> where later branches have
> 
> #define MALLOC(n)        palloc_extended((n), MCXT_ALLOC_NO_OOM)
> 
> So the right answer seems to be to figure out why we didn't
> back-patch that change.

I don't recall a specific reason or see one in the discussion of commit
bea3d7e38.  It was done mainly to unblock commit db4f21e, which in turn
unblocked commit 0da096d.  The last commit is heavy, so I can understand it
skipping the back branches.  If I were making a (weak) argument against
back-patching bea3d7e38, I might cite the extra memory use from
RegexpCacheMemoryContext and children.



Noah Misch <noah@leadboat.com> writes:
> On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote:
>> arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
>> if (arcarray == NULL)

>> On a machine where malloc(0) returns NULL, this mistakenly
>> thinks that's an error.

> Either of those sound reasonable.  The consequence of missing this hazard, a
> deterministic ERROR, is modest.  This affects just one platform, in the oldest
> branches.  There's a lack of complaints.  To me, all that would make the
> one-line diff tempting.

I dug through the MALLOC and REALLOC calls in backend/regex/ and
convinced myself that this is the only one that's at risk.  (Some
of those conclusions depend on the assumption that a regex NFA
never has nstates == 0, but I think that's okay: we create start
and end states to begin with and never remove them.)  So the
one-liner fix is looking attractive.  I'd prefer a malloc wrapper
for future-proofing if this code were likely to receive a lot of
churn in the pre-v16 branches, but that seems pretty improbable
at this point.

>> So the right answer seems to be to figure out why we didn't
>> back-patch that change.

> I don't recall a specific reason or see one in the discussion of commit
> bea3d7e38.  It was done mainly to unblock commit db4f21e, which in turn
> unblocked commit 0da096d.  The last commit is heavy, so I can understand it
> skipping the back branches.  If I were making a (weak) argument against
> back-patching bea3d7e38, I might cite the extra memory use from
> RegexpCacheMemoryContext and children.

I think just on minimum-risk grounds, I wouldn't consider
back-patching bea3d7e38.  I had more in mind a bespoke
three-line malloc wrapper function.  But the one-line fix
seems sufficient for the problem at hand.

            regards, tom lane