Thread: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18708 Logged by: Nikolay Shaplov (PostgresPro) Email address: dhyan@nataraj.su PostgreSQL version: 16.4 Operating system: Debian 12 Description: Hi! We've found a bug: If you run SELECT '' ~ '(?:[^\d\D]){0}'; it will assert with "lp->nouts == 0 && rp->nins == 0" This behavior have been introduced in 2a0af7fe460 commit. This bug have been found while fuzzing jsonpath_in function, and then narrowed down to regex problem. Thanks to Andrey Bille for help with narrowing sample down and finding commit that caused the problem.
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Aleksander Alekseev
Date:
Hi Nikolay, > If you run > > SELECT '' ~ '(?:[^\d\D]){0}'; > > it will assert with "lp->nouts == 0 && rp->nins == 0" > > This behavior have been introduced in 2a0af7fe460 commit. > > This bug have been found while fuzzing jsonpath_in function, and then > narrowed down to regex problem. Thanks to Andrey Bille for help with > narrowing sample down and finding commit that caused the problem. Thanks for the report. I can reproduce it with 18devel too: ``` #6 0x00005906af1a1e5b in delsub (nfa=0x5906b179f8b0, lp=0x5906b179fd88, rp=0x5906b179fdc0) at ../src/backend/regex/regc_nfa.c:1292 1292 assert(lp->nouts == 0 && rp->nins == 0); /* did the job */ (gdb) p *lp $1 = {no = 4, flag = 0 '\000', nins = 1, nouts = 0, ins = 0x5906b17a3418, outs = 0x0, tmp = 0x0, next = 0x5906b179fdc0, prev = 0x5906b179fd50} (gdb) p *rp $2 = {no = 5, flag = 0 '\000', nins = 1, nouts = 1, ins = 0x5906b17a34f0, outs = 0x5906b17a3460, tmp = 0x5906b179fdc0, next = 0x5906b179fe30, prev = 0x5906b179fd88} ``` I wonder if the Assert is just wrong or if it's more complicated than that. For the record: ([^\d\D]){0} - OK (?:[^\d\D]){1} - OK (?:[^\D]){0} - OK (?:[^\d]){0} - OK '(?:[^\d\D]){0}' - FAIL The value of the left argument of the `~` operator is not important. Thoughts? -- Best regards, Aleksander Alekseev
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Aleksander Alekseev
Date:
Hi, > I wonder if the Assert is just wrong or if it's more complicated than that. > > For the record: > > ([^\d\D]){0} - OK > (?:[^\d\D]){1} - OK > (?:[^\D]){0} - OK > (?:[^\d]){0} - OK > '(?:[^\d\D]){0}' - FAIL Well, removing the assert helps, and the regex seems to work correctly after that. This however is almost certainly not a correct fix (the assert is right about lp->nouts == 0 it's only unhappy about rp->nins != 0) and a second opinion is certainly needed since I'm looking at src/backend/regex/regc_nfa.c for the first time in my life :) -- Best regards, Aleksander Alekseev
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > If you run > SELECT '' ~ '(?:[^\d\D]){0}'; > it will assert with "lp->nouts == 0 && rp->nins == 0" > This behavior have been introduced in 2a0af7fe460 commit. Thanks for the report --- I'll dig into this later. regards, tom lane
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
Aleksander Alekseev <aleksander@timescale.com> writes: >> If you run >> SELECT '' ~ '(?:[^\d\D]){0}'; >> it will assert with "lp->nouts == 0 && rp->nins == 0" > I wonder if the Assert is just wrong or if it's more complicated than that. Interesting example. I think the bug's actually in my commit 08c0d6ad6 (Invent "rainbow" arcs), although it is unreachable before 2a0af7fe4. What's happening is: * "\d\D" can match any character whatsoever. optimizebracket() notices this and replaces a sheaf of NFA arcs with a single RAINBOW arc, which is the same representation as for ".". * Then we apply the complementation process in cbracket(), and that calls colorcomplement() which does this: /* A RAINBOW arc matches all colors, making the complement empty */ if (findarc(of, PLAIN, RAINBOW) != NULL) return; We thus end up with no arcs from the bracket expression's start state to its end state. This is not wrong in itself: "you can't get there from here" is a perfectly reasonable representation of [^\d\D], which cannot match any character. * However, the (?: ... ){0} superstructure around that eventually leads us to /* annoying special case: {0} or {0,0} cancels everything */ if (m == 0 && n == 0) { ... /* Otherwise, we can clean up any subre infrastructure we made */ if (atom != NULL) freesubre(v, atom); delsub(v->nfa, lp, rp); } which is trying to throw away the detritus from inside the quantified subexpression. But delsub fails to remove it all, because there's no path between the two specified states, and its assert notices that. This is, I believe, not harmful in itself: the leftover unreachable states/arcs will be cleaned up later, and we end with a valid NFA. So, as you noted, removing that assert would suppress all visible symptoms of the problem. But I really don't want to do that; this code is complicated and we need all the help we can get to find bugs in it. So I think the right thing to do is to fix colorcomplement() so that it still produces some arc between its "from" and "to" states, keeping the graph connected until we finish parsing the regex. It has to be a dummy arc that can't match anything, of course. We can throw away such an arc once we get to regex optimization, because we don't need delsub() to work anymore. In the attached draft patch I represented the dummy arc as a PLAIN arc with a new fake color BLACK. Another way to do it could be to introduce a new arc "type" value, but that seemed like it would involve touching more code. This is admittedly a lot more code than removing one assert, but I think we'll regret it if we allow delsub failures to go undetected. regards, tom lane diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c index 30bda0e5ad..27d324b8a6 100644 --- a/src/backend/regex/regc_color.c +++ b/src/backend/regex/regc_color.c @@ -1075,16 +1075,26 @@ colorcomplement(struct nfa *nfa, assert(of != from); - /* A RAINBOW arc matches all colors, making the complement empty */ + /* + * A RAINBOW arc matches all colors, making the complement empty. But we + * can't just return without making any arcs, because that would leave the + * NFA disconnected which would break any future delsub(). Instead, make + * a BLACK arc. Also set the HASBLACK flag so we know to clean that up at + * the start of NFA optimization. + */ if (findarc(of, PLAIN, RAINBOW) != NULL) + { + newarc(nfa, type, BLACK, from, to); + nfa->flags |= HASBLACK; return; + } /* Otherwise, transiently mark the colors that appear in of's out-arcs */ for (a = of->outs; a != NULL; a = a->outchain) { if (a->type == PLAIN) { - assert(a->co >= 0); + assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */ cd = &cm->cd[a->co]; assert(!UNUSEDCOLOR(cd)); cd->flags |= COLMARK; diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c index f1819a24f6..4b57a2c2b1 100644 --- a/src/backend/regex/regc_nfa.c +++ b/src/backend/regex/regc_nfa.c @@ -1599,6 +1599,12 @@ optimize(struct nfa *nfa, if (verbose) fprintf(f, "\ninitial cleanup:\n"); #endif + /* If we have any BLACK arcs, drop them; but this is uncommon */ + if (nfa->flags & HASBLACK) + { + removeblack(nfa); + nfa->flags &= ~HASBLACK; + } cleanup(nfa); /* may simplify situation */ #ifdef REG_DEBUG if (verbose) @@ -2922,6 +2928,32 @@ clonesuccessorstates(struct nfa *nfa, } } +/* + * removeblack - remove BLACK arcs, which are no longer useful + */ +static void +removeblack(struct nfa *nfa) +{ + struct state *s; + + for (s = nfa->states; s != NULL; s = s->next) + { + struct arc *a; + struct arc *nexta; + + for (a = s->outs; a != NULL; a = nexta) + { + nexta = a->outchain; + if (a->type == PLAIN && a->co == BLACK) + { + freearc(nfa, a); + if (NISERR()) + return; + } + } + } +} + /* * cleanup - clean up NFA after optimizations */ @@ -3627,6 +3659,8 @@ dumpnfa(struct nfa *nfa, fprintf(f, ", eol [%ld]", (long) nfa->eos[1]); if (nfa->flags & HASLACONS) fprintf(f, ", haslacons"); + if (nfa->flags & HASBLACK) + fprintf(f, ", hasblack"); if (nfa->flags & MATCHALL) { fprintf(f, ", minmatchall %d", nfa->minmatchall); @@ -3725,6 +3759,8 @@ dumparc(struct arc *a, case PLAIN: if (a->co == RAINBOW) fprintf(f, "[*]"); + else if (a->co == BLACK) + fprintf(f, "[-]"); else fprintf(f, "[%ld]", (long) a->co); break; diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 8a6cfb2973..432fdb8eb8 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource, struct state *spredecessor, struct arc *refarc, char *curdonemap, char *outerdonemap, int nstates); +static void removeblack(struct nfa *nfa); static void cleanup(struct nfa *nfa); static void markreachable(struct nfa *nfa, struct state *s, struct state *okay, struct state *mark); @@ -346,7 +347,7 @@ struct vars #define SEND 'Z' /* end of string (even if not EOL) */ /* is an arc colored, and hence should belong to a color chain? */ -/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */ +/* testing co eliminates RAINBOW/BLACK arcs, which we don't bother to chain */ #define COLORED(a) \ ((a)->co >= 0 && \ ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)) @@ -1949,7 +1950,7 @@ optimizebracket(struct vars *v, for (a = lp->outs; a != NULL; a = a->outchain) { assert(a->type == PLAIN); - assert(a->co >= 0); /* i.e. not RAINBOW */ + assert(a->co >= 0); /* i.e. not RAINBOW or BLACK */ assert(a->to == rp); cd = &v->cm->cd[a->co]; assert(!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO)); @@ -2368,6 +2369,7 @@ nfanode(struct vars *v, nfa = newnfa(v, v->cm, v->nfa); NOERRZ(); dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final); + nfa->flags = v->nfa->flags; if (!ISERR()) specialcolors(nfa); if (!ISERR()) diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 3ca3647e11..a62df145b3 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -150,13 +150,16 @@ enum char_classes * * To further reduce the number of arcs in NFAs and DFAs, we also have a * special RAINBOW "color" that can be assigned to an arc. This is not a - * real color, in that it has no entry in color maps. + * real color, in that it has no entry in color maps. To allow representing + * the complement of RAINBOW, there is also BLACK which matches no colors. + * (BLACK arcs exist only during regex parsing, not in finished NFAs.) */ typedef short color; /* colors of characters */ #define MAX_COLOR 32767 /* max color (must fit in 'color' datatype) */ #define COLORLESS (-1) /* impossible color */ #define RAINBOW (-2) /* represents all colors except pseudocolors */ +#define BLACK (-3) /* represents no colors */ #define WHITE 0 /* default color, parent of all others */ /* Note: various places in the code know that WHITE is zero */ @@ -410,6 +413,8 @@ struct cnfa int flags; /* bitmask of the following flags: */ #define HASLACONS 01 /* uses lookaround constraints */ #define MATCHALL 02 /* matches all strings of a range of lengths */ +#define HASBLACK 04 /* contains BLACK arcs */ + /* Note: HASBLACK appears in nfa structs' flags, but never in cnfas */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 731ba506d3..c44c717edf 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); {"012 3456789abc_*"} (2 rows) +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); + test_regex +-------------------------------------------------------------------- + {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH} + {""} +(2 rows) + +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); + test_regex +-------------------------------------------------------- + {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE} +(1 row) + -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP'); test_regex diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 478fa2c547..b2a847577e 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE'); select * from test_regex('\W', '0123456789abc_*', 'LP'); select * from test_regex('[\W]', '0123456789abc_*', 'LPE'); select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP');
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
I wrote: > So I think the right thing to do is to fix colorcomplement() so > that it still produces some arc between its "from" and "to" > states, keeping the graph connected until we finish parsing the > regex. It has to be a dummy arc that can't match anything, of > course. We can throw away such an arc once we get to regex > optimization, because we don't need delsub() to work anymore. > In the attached draft patch I represented the dummy arc as a PLAIN > arc with a new fake color BLACK. Another way to do it could be to > introduce a new arc "type" value, but that seemed like it would > involve touching more code. After thinking awhile longer, I decided to try it that way (with a new arc type), and I believe I like the result better after all. It's just a few lines more code, and I think it's more robust. Up to now PLAIN arcs always matched exactly one character, so the first version was putting a major dent in their semantics. We got rid of the bogus arcs before doing anything that really leans hard on arc semantics, but still it seems messy. Hence v2 attached. I'm not especially in love with the CANTMATCH arc type name, but other possibilities such as DUMMY or NOMATCH felt too generic. Better ideas anyone? regards, tom lane diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c index 30bda0e5ad..8ae788f519 100644 --- a/src/backend/regex/regc_color.c +++ b/src/backend/regex/regc_color.c @@ -1075,9 +1075,19 @@ colorcomplement(struct nfa *nfa, assert(of != from); - /* A RAINBOW arc matches all colors, making the complement empty */ + /* + * A RAINBOW arc matches all colors, making the complement empty. But we + * can't just return without making any arcs, because that would leave the + * NFA disconnected which would break any future delsub(). Instead, make + * a CANTMATCH arc. Also set the HASCANTMATCH flag so we know we need to + * clean that up at the start of NFA optimization. + */ if (findarc(of, PLAIN, RAINBOW) != NULL) + { + newarc(nfa, CANTMATCH, 0, from, to); + nfa->flags |= HASCANTMATCH; return; + } /* Otherwise, transiently mark the colors that appear in of's out-arcs */ for (a = of->outs; a != NULL; a = a->outchain) @@ -1089,6 +1099,12 @@ colorcomplement(struct nfa *nfa, assert(!UNUSEDCOLOR(cd)); cd->flags |= COLMARK; } + + /* + * There's no syntax for re-complementing a color set, so we cannot + * see CANTMATCH arcs here. + */ + assert(a->type != CANTMATCH); } /* Scan colors, clear transient marks, add arcs for unmarked colors */ diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c index f1819a24f6..acd2286def 100644 --- a/src/backend/regex/regc_nfa.c +++ b/src/backend/regex/regc_nfa.c @@ -1462,6 +1462,7 @@ removetraverse(struct nfa *nfa, { case PLAIN: case EMPTY: + case CANTMATCH: /* nothing to do */ break; case AHEAD: @@ -1599,6 +1600,12 @@ optimize(struct nfa *nfa, if (verbose) fprintf(f, "\ninitial cleanup:\n"); #endif + /* If we have any CANTMATCH arcs, drop them; but this is uncommon */ + if (nfa->flags & HASCANTMATCH) + { + removecantmatch(nfa); + nfa->flags &= ~HASCANTMATCH; + } cleanup(nfa); /* may simplify situation */ #ifdef REG_DEBUG if (verbose) @@ -2922,6 +2929,34 @@ clonesuccessorstates(struct nfa *nfa, } } +/* + * removecantmatch - remove CANTMATCH arcs, which are no longer useful + * once we are done with the parsing phase. (We need them only to + * preserve connectedness of NFA subgraphs during parsing.) + */ +static void +removecantmatch(struct nfa *nfa) +{ + struct state *s; + + for (s = nfa->states; s != NULL; s = s->next) + { + struct arc *a; + struct arc *nexta; + + for (a = s->outs; a != NULL; a = nexta) + { + nexta = a->outchain; + if (a->type == CANTMATCH) + { + freearc(nfa, a); + if (NISERR()) + return; + } + } + } +} + /* * cleanup - clean up NFA after optimizations */ @@ -3627,6 +3662,8 @@ dumpnfa(struct nfa *nfa, fprintf(f, ", eol [%ld]", (long) nfa->eos[1]); if (nfa->flags & HASLACONS) fprintf(f, ", haslacons"); + if (nfa->flags & HASCANTMATCH) + fprintf(f, ", hascantmatch"); if (nfa->flags & MATCHALL) { fprintf(f, ", minmatchall %d", nfa->minmatchall); @@ -3749,6 +3786,9 @@ dumparc(struct arc *a, break; case EMPTY: break; + case CANTMATCH: + fprintf(f, "X"); + break; default: fprintf(f, "0x%x/0%lo", a->type, (long) a->co); break; diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index 8a6cfb2973..15b264e50f 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -215,6 +215,7 @@ static void clonesuccessorstates(struct nfa *nfa, struct state *ssource, struct state *spredecessor, struct arc *refarc, char *curdonemap, char *outerdonemap, int nstates); +static void removecantmatch(struct nfa *nfa); static void cleanup(struct nfa *nfa); static void markreachable(struct nfa *nfa, struct state *s, struct state *okay, struct state *mark); @@ -342,6 +343,7 @@ struct vars #define BEHIND 'r' /* color-lookbehind arc */ #define WBDRY 'w' /* word boundary constraint */ #define NWBDRY 'W' /* non-word-boundary constraint */ +#define CANTMATCH 'x' /* arc that cannot match anything */ #define SBEGIN 'A' /* beginning of string (even if not BOL) */ #define SEND 'Z' /* end of string (even if not EOL) */ @@ -2368,6 +2370,7 @@ nfanode(struct vars *v, nfa = newnfa(v, v->cm, v->nfa); NOERRZ(); dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final); + nfa->flags = v->nfa->flags; if (!ISERR()) specialcolors(nfa); if (!ISERR()) diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 3ca3647e11..fd69299a16 100644 --- a/src/include/regex/regguts.h +++ b/src/include/regex/regguts.h @@ -410,6 +410,8 @@ struct cnfa int flags; /* bitmask of the following flags: */ #define HASLACONS 01 /* uses lookaround constraints */ #define MATCHALL 02 /* matches all strings of a range of lengths */ +#define HASCANTMATCH 04 /* contains CANTMATCH arcs */ + /* Note: HASCANTMATCH appears in nfa structs' flags, but never in cnfas */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out index 731ba506d3..c44c717edf 100644 --- a/src/test/modules/test_regex/expected/test_regex.out +++ b/src/test/modules/test_regex/expected/test_regex.out @@ -2071,6 +2071,20 @@ select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); {"012 3456789abc_*"} (2 rows) +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); + test_regex +-------------------------------------------------------------------- + {0,REG_UBOUNDS,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UEMPTYMATCH} + {""} +(2 rows) + +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); + test_regex +-------------------------------------------------------- + {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE} +(1 row) + -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP'); test_regex diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql index 478fa2c547..b2a847577e 100644 --- a/src/test/modules/test_regex/sql/test_regex.sql +++ b/src/test/modules/test_regex/sql/test_regex.sql @@ -619,6 +619,9 @@ select * from test_regex('[^1\D0]', 'abc0123456789*', 'LPE'); select * from test_regex('\W', '0123456789abc_*', 'LP'); select * from test_regex('[\W]', '0123456789abc_*', 'LPE'); select * from test_regex('[\s\S]*', '012 3456789abc_*', 'LNPE'); +-- bug #18708: +select * from test_regex('(?:[^\d\D]){0}', '0123456789abc*', 'LNPQE'); +select * from test_regex('[^\d\D]', '0123456789abc*', 'ILPE'); -- check char classes' handling of newlines select * from test_regex('\s+', E'abc \n def', 'LP');
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Aleksander Alekseev
Date:
Hi, I reviewed, applied and tested patch v2 on Linux x64 and MacOS x64. LGTM. > I'm not especially in love with the CANTMATCH arc type name, but > other possibilities such as DUMMY or NOMATCH felt too generic. > Better ideas anyone? Well... at least it's consistent with the current naming e.g. HASCANTMATCH. I suggest keeping it as is. -- Best regards, Aleksander Alekseev
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Aleksander Alekseev
Date:
Hi, > Well... at least it's consistent with the current naming e.g. > HASCANTMATCH. I suggest keeping it as is. Oh I wrote something stupid, HASCANTMATCH is part of the patch. In any case, IMO the name is OK. -- Best regards, Aleksander Alekseev
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
Aleksander Alekseev <aleksander@timescale.com> writes: > In any case, IMO the name is OK. I've not thought of a better name since yesterday, so pushed. Thanks for reviewing! regards, tom lane
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Alexander Lakhin
Date:
Hello Tom, 16.11.2024 03:08, Tom Lane wrote: > I've not thought of a better name since yesterday, so pushed. > Thanks for reviewing! Please look at the hornet's failure on processing a test query added here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38 Best regards, Alexander
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes: > Please look at the hornet's failure on processing a test query added here: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38 Hmm... for the archives' sake, that looks like select * from test_regex('[^\\d\\D]', '0123456789abc*', 'ILPE'); - test_regex - -------------------------------------------------------- - {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE} - (1 row) - + ERROR: invalid regular expression: out of memory -- check char classes' handling of newlines I'm not sure what to make of it. That regex shouldn't consume very much memory. To confirm that, I stepped through it and found that newstate() is reached 14 times and newarc() 35 times. That's a pretty tiny amount of memory, and there are other regexps in the tests that are far larger. Moreover, no other animal has shown this, including hornet itself on the v16 branch. (It's only run this test in v15 and v16 so far, so that's not a lot of data points.) I'm inclined to guess this was some weird momentary glitch. If it reproduces then I'll look closer. regards, tom lane
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Alexander Lakhin
Date:
16.11.2024 19:02, Tom Lane wrote: > Moreover, no other animal has shown this, including hornet itself > on the v16 branch. (It's only run this test in v15 and v16 so far, > so that's not a lot of data points.) > > I'm inclined to guess this was some weird momentary glitch. If it > reproduces then I'll look closer. (Un)fortunately, tern (which is also a ppc animal) has produced the same failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12 Best regards, Alexander
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes: > 16.11.2024 19:02, Tom Lane wrote: >> I'm inclined to guess this was some weird momentary glitch. If it >> reproduces then I'll look closer. > (Un)fortunately, tern (which is also a ppc animal) has produced the same > failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12 Yeah, I saw that. Even more confused now about what it could be. regards, tom lane
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
I wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> (Un)fortunately, tern (which is also a ppc animal) has produced the same >> failure: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12 > Yeah, I saw that. Even more confused now about what it could be. After testing on hornet's host, it seems that this is a pre-existing issue that we didn't happen to hit before. Since the regex '[^\d\D]' is unsatisfiable, it collapses to nothing (start state, end state, and no arcs) in the first cleanup() call in optimize(). Then fixempties() counts the number of in-arcs and gets zero, and then it does arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *)); if (arcarray == NULL) { NERR(REG_ESPACE); ... On a machine where malloc(0) returns NULL, this mistakenly thinks that's an error. I verified that - if (arcarray == NULL) + if (arcarray == NULL && totalinarcs != 0) makes the failure go away, but I wonder if any other places in backend/regex/ are at the same hazard. Maybe the smartest fix would be to put in a wrapper layer that does what pg_malloc does: /* Avoid unportable behavior of malloc(0) */ if (size == 0) size = 1; One other point is that this theory fails to explain why hornet didn't fail in the v16 branch ... oh, wait: v15 has #define MALLOC(n) malloc(n) where later branches have #define MALLOC(n) palloc_extended((n), MCXT_ALLOC_NO_OOM) So the right answer seems to be to figure out why we didn't back-patch that change. regards, tom lane
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Noah Misch
Date:
On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote: > arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *)); > if (arcarray == NULL) > { > NERR(REG_ESPACE); > ... > > On a machine where malloc(0) returns NULL, this mistakenly > thinks that's an error. > > I verified that > > - if (arcarray == NULL) > + if (arcarray == NULL && totalinarcs != 0) > > makes the failure go away, but I wonder if any other places in > backend/regex/ are at the same hazard. Maybe the smartest fix > would be to put in a wrapper layer that does what pg_malloc > does: > > /* Avoid unportable behavior of malloc(0) */ > if (size == 0) > size = 1; Either of those sound reasonable. The consequence of missing this hazard, a deterministic ERROR, is modest. This affects just one platform, in the oldest branches. There's a lack of complaints. To me, all that would make the one-line diff tempting. > One other point is that this theory fails to explain why > hornet didn't fail in the v16 branch ... oh, wait: > v15 has > > #define MALLOC(n) malloc(n) > > where later branches have > > #define MALLOC(n) palloc_extended((n), MCXT_ALLOC_NO_OOM) > > So the right answer seems to be to figure out why we didn't > back-patch that change. I don't recall a specific reason or see one in the discussion of commit bea3d7e38. It was done mainly to unblock commit db4f21e, which in turn unblocked commit 0da096d. The last commit is heavy, so I can understand it skipping the back branches. If I were making a (weak) argument against back-patching bea3d7e38, I might cite the extra memory use from RegexpCacheMemoryContext and children.
Re: BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote: >> arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *)); >> if (arcarray == NULL) >> On a machine where malloc(0) returns NULL, this mistakenly >> thinks that's an error. > Either of those sound reasonable. The consequence of missing this hazard, a > deterministic ERROR, is modest. This affects just one platform, in the oldest > branches. There's a lack of complaints. To me, all that would make the > one-line diff tempting. I dug through the MALLOC and REALLOC calls in backend/regex/ and convinced myself that this is the only one that's at risk. (Some of those conclusions depend on the assumption that a regex NFA never has nstates == 0, but I think that's okay: we create start and end states to begin with and never remove them.) So the one-liner fix is looking attractive. I'd prefer a malloc wrapper for future-proofing if this code were likely to receive a lot of churn in the pre-v16 branches, but that seems pretty improbable at this point. >> So the right answer seems to be to figure out why we didn't >> back-patch that change. > I don't recall a specific reason or see one in the discussion of commit > bea3d7e38. It was done mainly to unblock commit db4f21e, which in turn > unblocked commit 0da096d. The last commit is heavy, so I can understand it > skipping the back branches. If I were making a (weak) argument against > back-patching bea3d7e38, I might cite the extra memory use from > RegexpCacheMemoryContext and children. I think just on minimum-risk grounds, I wouldn't consider back-patching bea3d7e38. I had more in mind a bespoke three-line malloc wrapper function. But the one-line fix seems sufficient for the problem at hand. regards, tom lane