Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0" - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0" |
Date | |
Msg-id | 2242483.1731627374@sss.pgh.pa.us Whole thread Raw |
In response to | Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0" (Aleksander Alekseev <aleksander@timescale.com>) |
Responses |
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
|
List | pgsql-bugs |
Aleksander Alekseev <aleksander@timescale.com> writes: >> If you run >> SELECT '' ~ '(?:[^\d\D]){0}'; >> it will assert with "lp->nouts == 0 && rp->nins == 0" > I wonder if the Assert is just wrong or if it's more complicated than that. Interesting example. I think the bug's actually in my commit 08c0d6ad6 (Invent "rainbow" arcs), although it is unreachable before 2a0af7fe4. What's happening is: * "\d\D" can match any character whatsoever. optimizebracket() notices this and replaces a sheaf of NFA arcs with a single RAINBOW arc, which is the same representation as for ".". * Then we apply the complementation process in cbracket(), and that calls colorcomplement() which does this: /* A RAINBOW arc matches all colors, making the complement empty */ if (findarc(of, PLAIN, RAINBOW) != NULL) return; We thus end up with no arcs from the bracket expression's start state to its end state. This is not wrong in itself: "you can't get there from here" is a perfectly reasonable representation of [^\d\D], which cannot match any character. * However, the (?: ... ){0} superstructure around that eventually leads us to /* annoying special case: {0} or {0,0} cancels everything */ if (m == 0 && n == 0) { ... /* Otherwise, we can clean up any subre infrastructure we made */ if (atom != NULL) freesubre(v, atom); delsub(v->nfa, lp, rp); } which is trying to throw away the detritus from inside the quantified subexpression. But delsub fails to remove it all, because there's no path between the two specified states, and its assert notices that. This is, I believe, not harmful in itself: the leftover unreachable states/arcs will be cleaned up later, and we end with a valid NFA. So, as you noted, removing that assert would suppress all visible symptoms of the problem. But I really don't want to do that; this code is complicated and we need all the help we can get to find bugs in it. So I think the right thing to do is to fix colorcomplement() so that it still produces some arc between its "from" and "to" states, keeping the graph connected until we finish parsing the regex. It has to be a dummy arc that can't match anything, of course. We can throw away such an arc once we get to regex optimization, because we don't need delsub() to work anymore. In the attached draft patch I represented the dummy arc as a PLAIN arc with a new fake color BLACK. Another way to do it could be to introduce a new arc "type" value, but that seemed like it would involve touching more code. This is admittedly a lot more code than removing one assert, but I think we'll regret it if we allow delsub failures to go undetected. regards, tom lane diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c index 30bda0e5ad..27d324b8a6 100644 --- a/src/backend/regex/regc_color.c +++ b/src/backend/regex/regc_color.c @@ -1075,16 +1075,26 @@ colorcomplement(struct nfa *nfa, assert(of != from); - /* A RAINBOW arc matches all colors, making the complement empty */ + /* + * A RAINBOW arc matches all colors, making the complement empty. But we + * can't just return without making any arcs, because that would leave the + * NFA disconnected which would break any future delsub(). Instead, make + * a BLACK arc. Also set the HASBLACK flag so we know to clean that up at + * the start of NFA optimization. + */ if (findarc(of, PLAIN, RAINBOW) != NULL) + { + newarc(nfa, type, BLACK, from, to); + nfa->flags |= HASBLACK; return; + } /* Otherwise, transiently mark the colors that appear in of's out-arcs */ for (a = of->outs; a != NULL; a = a->outchain) { if (a->type == PLAIN) { - assert(a->co >= 0); + assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */ cd = &cm->cd[a->co]; assert(!UNUSEDCOLOR(cd)); cd->flags |= COLMARK; diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c index f1819a24f6..4b57a2c2b1 100644 --- a/src/backend/regex/regc_nfa.c +++ b/src/backend/regex/regc_nfa.c @@ -1599,6 +1599,12 @@ optimize(struct nfa *nfa, if (verbose) fprintf(f, "\ninitial cleanup:\n"); #endif + /* If we have any BLACK arcs, drop them; but this is uncommon */ + if (nfa->flags & HASBLACK) + { + removeblack(nfa); + nfa->flags &= ~HASBLACK; + } cleanup(nfa); /* may simplify situation */ #ifdef REG_DEBUG if (verbose) @@ -2922,6 +2928,32 @@ clonesuccessorstates(struct nfa *nfa, } } +/* + * removeblack - remove BLACK arcs, which are no longer useful + */ +static void +removeblack(struct nfa *nfa) +{ + struct state *s; + + for (s = nfa->states; s != NULL; s = s->next) + { + struct arc *a; + struct arc *nexta; + + for (a = s->outs; a != NULL; a = nexta) + { + nexta = a->outchain; + if (a->type == PLAIN && a->co == BLACK) + { + freearc(nfa, a); + if (NISERR()) + return; + } + } + } +} + /* * cleanup - clean up NFA after optimizations */ @@ -3627,6 +3659,8 @@ dumpnfa(struct nfa *nfa, fprintf(f, ", eol [%ld]", (long) nfa->eos[1]); if (nfa->flags & HASLACONS) fprintf(f, ", haslacons"); + if (nfa->flags & HASBLACK) + fprintf(f, ", hasblack"); if (nfa->flags & MATCHALL) { fprintf(f, ", minmatchall %d", nfa->minmatchall); @@ -3725,6 +3759,8 @@ dumparc(struct arc *a, case PLAIN: if (a->co == RAINBOW) fprintf(f, "[*]"); + else if (a->co == BLACK) + fprintf(f, "[-]"); else fprintf(f, "[%ld]", (long) a->co); break; diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 8a6cfb2973..432fdb8eb8 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource, struct state *spredecessor, struct arc *refarc, char *curdonemap, char *outerdonemap, int nstates); +static void removeblack(struct nfa *nfa); static void cleanup(struct nfa *nfa); static void markreachable(struct nfa *nfa, struct state *s, struct state *okay, struct state *mark); @@ -346,7 +347,7 @@ struct vars #define SEND 'Z' /* end of string (even if not EOL) */ /* is an arc colored, and hence should belong to a color chain? */ -/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */ +/* testing co eliminates RAINBOW/BLACK arcs, which we don't bother to chain */ #define COLORED(a) \ ((a)->co >= 0 && \ ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)) @@ -1949,7 +1950,7 @@ optimizebracket(struct vars *v, for (a = lp->outs; a != NULL; a = a->outchain) { assert(a->type == PLAIN); - assert(a->co >= 0); /* i.e. not RAINBOW */ + assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */ assert(a->to == rp); cd = &v->cm->cd[a->co]; assert(!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO)); @@ -2368,6 +2369,7 @@ nfanode(struct vars *v, nfa = newnfa(v, v->cm, v->nfa); NOERRZ(); dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final); + nfa->flags = v->nfa->flags; if (!ISERR()) specialcolors(nfa); if (!ISERR()) diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 3ca3647e11..a62df145b3 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -150,13 +150,16 @@ enum char_classes * * To further reduce the number of arcs in NFAs and DFAs, we also have a * special RAINBOW "color" that can be assigned to an arc. This is not a - * real color, in that it has no entry in color maps. + * real color, in that it has no entry in color maps. To allow representing + * the complement of RAINBOW, there is also BLACK which matches no colors. + * (BLACK arcs exist only during regex parsing, not in finished NFAs.) */ typedef short color; /* colors of characters */ #define MAX_COLOR 32767 /* max color (must fit in 'color' datatype) */ #define COLORLESS (-1) /* impossible color */ #define RAINBOW (-2) /* represents all colors except pseudocolors */ +#define BLACK (-3) /* represents no colors */ #define WHITE 0 /* default color, parent of all others */ /* Note: various places in the code know that WHITE is zero */ @@ -410,6 +413,8 @@ struct cnfa int flags; /* bitmask of the following flags: */ #define HASLACONS 01 /* uses lookaround constraints */ #define MATCHALL 02 /* matches all strings of a range of lengths */ +#define HASBLACK 04 /* contains BLACK arcs */ + /* Note: HASBLACK appears in nfa structs' flags, but never in cnfas */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 731ba506d3..c44c717edf 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); {"012 3456789abc_*"} (2 rows) +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); + test_regex +-------------------------------------------------------------------- + {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH} + {""} +(2 rows) + +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); + test_regex +-------------------------------------------------------- + {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE} +(1 row) + -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP'); test_regex diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 478fa2c547..b2a847577e 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE'); select * from test_regex('\W', '0123456789abc_*', 'LP'); select * from test_regex('[\W]', '0123456789abc_*', 'LPE'); select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP');
pgsql-bugs by date: