Thread: pgbench's expression parsing & negative numbers
Hi, working on overflow correctness in pg I noticed that pgbench isn't quite there. I assume it won't matter terribly often, but the way it parses integers makes it incorrect for, at least, the negativemost number. It directly parses the integer input using: {digit}+ { yylval->ival = strtoint64(yytext); return INTEGER_CONST; } but that unfortunately means that the sign is not included in the call to strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN, because that's not representable as a positive number on a two's complement machine (i.e. everywhere). Preliminary testing shows that that can relatively easily be fixed by just prefixing [-+]? to the relevant exprscan.l rules. But it might also not be a bad idea to simply defer parsing of the number until exprparse.y has its hands on the number? There are plenty of other things wrong with overflow handling too, especially that evalFunc() basically just doesn't care. I'll come up with a patch for that sometime soon. A third complaint I have is that the tap tests are pretty hard to parse for humans. Greetings, Andres Freund
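The failure mode is easy to reproduce with plain strtoll (the hypothetical `digits_then_negate` helper below mimics the pgbench rule, which never sees the sign): converting the digits first and negating afterwards saturates at the top of the range, while a converter that sees the '-' itself parses the most negative value fine.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Mimics the pgbench lexer rule: the '-' is consumed elsewhere, so the
 * converter only ever sees the digits, and the result is negated later.
 * On overflow strtoll saturates to LLONG_MAX and sets errno to ERANGE. */
static long long
digits_then_negate(const char *digits, int *overflowed)
{
    long long v;

    errno = 0;
    v = strtoll(digits, NULL, 10);
    *overflowed = (errno == ERANGE);
    return -v;
}
```

Running `digits_then_negate("9223372036854775808", ...)` flags an overflow and yields -LLONG_MAX rather than LLONG_MIN, whereas `strtoll("-9223372036854775808", NULL, 10)` returns LLONG_MIN exactly, because the magnitude of LLONG_MIN is one larger than LLONG_MAX on a two's complement machine.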
Hello Andres, > working on overflow correctness in pg I noticed that pgbench isn't quite > there. I assume it won't matter terribly often, but the way it parses > integers makes it incorrect for, at least, the negativemost number. > > It directly parses the integer input using: > {digit}+ { > yylval->ival = strtoint64(yytext); > return INTEGER_CONST; > } > > but that unfortunately means that the sign is not included in the call to > strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN, Indeed. For a benchmarking script which generates a pseudo-random load I do not see it as a big issue, but maybe I'm missing some use case. > because that's not representable as a positive number on a two's > complement machine (i.e. everywhere). Preliminary testing shows that > that can relatively easily be fixed by just prefixing [-+]? to the relevant > exprscan.l rules. Beware that it must handle the unary/binary plus/minus case consistently: "-123" vs "a -123" vs "a + -123" > But it might also not be a bad idea to simply defer parsing of the > number until exprparse.y has its hands on the number? It is usually simpler to let the lexer handle constants. > There are plenty of other things wrong with overflow handling too, especially > that evalFunc() basically just doesn't care. Sure. There are some overflow checks with div and double-to-int casts, which were added because of previous complaints, but which are not very useful to me. ISTM that it is a voluntary feature, which handles int64 operations with a modulo overflow like C. Generating exceptions on such overflows does not look very attractive to me. > I'll come up with a patch for that sometime soon. I wish you could provide some motivation: why does it matter much to a benchmarking script to handle overflows with more care? > A third complaint I have is that the tap tests are pretty hard to parse > for humans. Probably. 
Perl is pretty hard for humans to start with:-) There is a lot of code factorization to cram many tests together, so that one test is more or less one line, and there are hundreds of them. Not sure it would help if it was expanded. There are a lot of regular expressions to check outputs, which does not help. I wanted to have the pgbench scripts outside, but this was rejected, so the tested scripts themselves sit in the middle of the perl code, which I think has degraded the readability significantly. -- Fabien.
On 2017-12-12 07:21:21 +0100, Fabien COELHO wrote: > > Hello Andres, > > > working on overflow correctness in pg I noticed that pgbench isn't quite > > there. I assume it won't matter terribly often, but the way it parses > > integers makes it incorrect for, at least, the negativemost number. > > > > It directly parses the integer input using: > > {digit}+ { > > yylval->ival = strtoint64(yytext); > > return INTEGER_CONST; > > } > > > > but that unfortunately means that the sign is not included in the call to > > strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN, > > Indeed. For a benchmarking script which generates a pseudo-random load I do > not see it as a big issue, but maybe I'm missing some use case. IDK, behaving correctly seems like a good idea. Also far from impossible that that code gets propagated further. Doesn't seem like an entirely crazy idea that somebody actually wants to benchmark boundary cases, which might be slower and such. > > because that's not representable as a positive number on a two's > > complement machine (i.e. everywhere). Preliminary testing shows that > > that can relatively easily be fixed by just prefixing [-+]? to the relevant > > exprscan.l rules. > > Beware that it must handle the unary/binary plus/minus case consistently: > > "-123" vs "a -123" vs "a + -123" Which is a lot easier to solve on the parser side of things... > > But it might also not be a bad idea to simply defer parsing of the > > number until exprparse.y has its hands on the number? > > It is usually simpler to let the lexer handle constants. > > There are plenty of other things wrong with overflow handling too, especially > > that evalFunc() basically just doesn't care. > > Sure. > > There are some overflow checks with div and double-to-int casts, which were > added because of previous complaints, but which are not very useful to me. I think handling it inconsistently is the worst of all worlds. 
> ISTM that it is a voluntary feature, which handles int64 operations with a > modulo overflow like C. Generating exceptions on such overflows does not > look very attractive to me. No, it's not really voluntary. Signed overflow isn't actually defined behaviour in C. We kinda make it so on some platforms, but that's not really a good idea. > > I'll come up with a patch for that sometime soon. > > I wish you could provide some motivation: why does it matter much to a > benchmarking script to handle overflows with more care? a) I want to be able to compile without -fwrapv in the near future. Which means you can't have signed overflows, because they're undefined behaviour in C. b) I want to be able to compile with -ftrapv / -fsanitize=signed-integer-overflow, to be sure code is actually correct. Currently pgbench crashes with that. c) Randomly handling things in some but not all places is a bad idea. > > A third complaint I have is that the tap tests are pretty hard to parse > > for humans. > > Probably. > > Perl is pretty hard for humans to start with:-) Meh. > There is a lot of code factorization to cram many tests together, so that > one test is more or less one line and there are hundreds of them. Not sure > it would help if it was expanded. I don't think expanding it is really a problem, I think they're just largely not well documented or formatted. With some effort the tests could be a lot easier to read. Greetings, Andres Freund
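For context, the pg_*_s64_overflow macros mentioned elsewhere in the thread reduce, on GCC and Clang, to the compiler's checked-arithmetic builtins; a minimal stand-in (hypothetical names, assuming a compiler that provides __builtin_mul_overflow and friends) could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for pg_mul_s64_overflow / pg_sub_s64_overflow:
 * return true if the operation overflows int64, otherwise store the
 * result in *result. The builtins compute in infinite precision and
 * report whether the true result fits in the destination type. */
static bool
mul_s64_overflow(int64_t a, int64_t b, int64_t *result)
{
    return __builtin_mul_overflow(a, b, result);
}

static bool
sub_s64_overflow(int64_t a, int64_t b, int64_t *result)
{
    return __builtin_sub_overflow(a, b, result);
}
```

Unlike relying on -fwrapv, these report the overflow to the caller without ever executing undefined behaviour, which is what makes code using them clean under -ftrapv and -fsanitize=signed-integer-overflow.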
Hello Andres, >> There are some overflow checks with div and double-to-int casts, which were >> added because of previous complaints, but which are not very useful to me. > > I think handling it inconsistently is the worst of all worlds. Hmmm... I cannot say that inconsistency is a good thing, that would not be consistent:-) My 0.02€: - I do not think that updating pgbench arithmetic for managing integer overflows is worth Andres Freund's time. My guess is that most scripts would not trigger client-side overflows, so the change would be a no-op in practice. - I think that pgbench has more important defects/missing features, to be fixed more urgently. Given the time patches spend in the cf queue, obviously committers disagree on this one:-) Now ISTM that it does not harm anything to add such a feature, so fine with me. Maybe the global compiler option removal is worth the effort. -- Fabien.
Hi, On 2017-12-14 10:41:04 +0100, Fabien COELHO wrote: > - I do not think that updating pgbench arithmetic for managing integer > overflows is worth Andres Freund's time. My guess is that most > scripts would not trigger client-side overflows, so the change would > be a no-op in practice. It might not be if you view it in isolation (although I'm not convinced). The problem is that it has cost beyond pgbench. Due to pgbench's overflow handling I can't run make check on a build that has -ftrapv, which found several bugs already. I'd be happy if somebody else would tackle the issue, but I don't quite see it happening... Greetings, Andres Freund
Hello, >> - I do not think that updating pgbench arithmetic for managing integer >> overflows is worth Andres Freund's time. My guess is that most >> scripts would not trigger client-side overflows, so the change would >> be a no-op in practice. > > It might not be if you view it in isolation (although I'm not > convinced). The problem is that it has cost beyond pgbench. Due to > pgbench's overflow handling Lack of? > I can't run make check on a build that has -ftrapv, which found several > bugs already. Hmmm. You suggest that integer overflows do occur when running pgbench. Indeed, this tap test case: "\set maxint debug(:minint - 1)" Otherwise, some stat counters may overflow on very long runs? Although overflowing an int64 takes some time... > I'd be happy if somebody else would tackle the issue, but I don't quite > see it happening... I must admit that it is not very high on my may-do list. I'll review it if it appears, though. -- Fabien.
Hello Andres, > working on overflow correctness in pg I noticed that pgbench isn't quite > there. Indeed. > I assume it won't matter terribly often, but the way it parses integers > makes it incorrect for, at least, the negativemost number. [...] but > that unfortunately means that the sign is not included in the call to > strtoint64. The strtoint64 function is indeed a mess. It does not really report errors (more precisely, an error message is printed out, but execution goes on nevertheless...). > Which in turn means you can't correctly parse PG_INT64_MIN, > because that's not representable as a positive number on a two's > complement machine (i.e. everywhere). Preliminary testing shows that > that can relatively easily be fixed by just prefixing [-+]? to the relevant > exprscan.l rules. I'm not sure I get it, because then "1 -2" would be scanned as "int signed_int", which requires adding some fine-grained rules to the parser to handle that as an addition, and I am unclear about how precedence would be handled, e.g. "1 -2 * 3". But I agree that it can be made to work by transferring the conversion to the parser and maybe some careful tweaking there. > But it might also not be a bad idea to simply defer > parsing of the number until exprparse.y has its hands on the number? > > There are plenty of other things wrong with overflow handling too, especially > that evalFunc() basically just doesn't care. Indeed. Some overflow issues are easy to fix with the existing pg_*_s64_overflow macros that you provided. It should also handle double-to-int64 overflows. I have tried to trigger errors with a -fsanitize=signed-integer-overflow compilation, without much success, but ISTM that you suggested that pgbench does not work with that in another thread. Could you be more precise? > I'll come up with a patch for that sometime soon. ISTM that you have not sent any patch on the subject, otherwise I would have reviewed it. Maybe I could do one some time later, unless you think that I should not. -- Fabien.
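The double-to-int64 case is subtler than it looks, because the exact boundary matters: 2^63 is representable as a double but is out of int64 range, while -2^63 is exactly PG_INT64_MIN. A hedged sketch of a safe cast (hypothetical name, not the actual pgbench code) could be:

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of a safe double->int64 conversion. The upper
 * bound must be exclusive: 9223372036854775808.0 is exactly 2^63,
 * which does not fit, while -2^63 is exactly representable and is a
 * legal int64 value. NaN and infinities are rejected as well. */
static bool
double_to_int64(double d, int64_t *result)
{
    if (isnan(d) ||
        d < -9223372036854775808.0 ||   /* below INT64_MIN */
        d >= 9223372036854775808.0)     /* 2^63 and above don't fit */
        return false;
    *result = (int64_t) d;              /* truncates toward zero */
    return true;
}
```

Note the asymmetry: a naive `d > (double) INT64_MAX` check is wrong because INT64_MAX rounds up to 2^63 when converted to double, letting an out-of-range value slip through the comparison.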
Hello Andres, >> I'll come up with a patch for that sometime soon. > > ISTM that you have not sent any patch on the subject, otherwise I would > have reviewed it. Maybe I could do one some time later, unless you think > that I should not. Here is a patch which detects pgbench overflows on int & double constants, and on integer operators. -- Fabien.
Attachment
>>> I'll come up with a patch for that sometime soon. >> >> ISTM that you have not sent any patch on the subject, otherwise I would >> have reviewed it. Maybe I could do one some time later, unless you think >> that I should not. > > Here is a patch which detects pgbench overflows on int & double constants, > and on integer operators. ... but it forgot to handle parsing min int, which was the initial focus of this thread. This patch does that as well by handling it as a special case between lexer & parser (the issue being that 9223372036854775808 cannot be lexed as a standard integer, as it is too large, and -9223372036854775808 is really two tokens, so it must be managed from the parser). -- Fabien.
Attachment
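One way to make the conversion itself immune to the min-int problem is to accumulate the value as a negative number, so that PG_INT64_MIN needs no special case and only the positive extreme has to be rejected. A hypothetical, self-contained sketch of this accumulate-as-negative approach (not the actual patch code; overflow checks here use the GCC/Clang checked-arithmetic builtins):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical int64 parser: builds the value as a negative number
 * (tmp = tmp*10 - digit), then flips the sign at the end for positive
 * input. INT64_MIN parses with no special case; +9223372036854775808
 * is correctly rejected because -INT64_MIN is not representable. */
static bool
parse_int64(const char *str, int64_t *result)
{
    const char *ptr = str;
    int64_t tmp = 0;
    bool neg = false;

    if (*ptr == '-')
    {
        ptr++;
        neg = true;
    }
    else if (*ptr == '+')
        ptr++;

    if (!isdigit((unsigned char) *ptr))
        return false;               /* require at least one digit */

    while (isdigit((unsigned char) *ptr))
    {
        int digit = *ptr++ - '0';

        if (__builtin_mul_overflow(tmp, 10, &tmp) ||
            __builtin_sub_overflow(tmp, digit, &tmp))
            return false;           /* out of int64 range */
    }
    if (*ptr != '\0')
        return false;               /* trailing garbage */

    if (!neg)
    {
        if (tmp == INT64_MIN)       /* -INT64_MIN does not exist */
            return false;
        tmp = -tmp;
    }
    *result = tmp;
    return true;
}
```

The same trick appears in a number of strtol-family implementations, precisely because the negative range is one wider than the positive one.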
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: tested, passed Spec compliant: tested, passed Documentation: not tested Patch does not apply cleanly on the master branch; anyway, I managed that. Patch works according to specs, and no issues were found. The only minor nit is that you could keep the full comment of the strtoint64 function: /* * If not errorOK, an error message is printed out. * If errorOK is true, just return "false" for bad input. */
Hello, > The following review has been posted through the commitfest application: > make installcheck-world: not tested > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: not tested > > Patch does not apply cleanly on the master branch; anyway, I managed that. Patch works according to specs, and no issues were found. > The only minor nit is that you could keep the full comment of the strtoint64 function: > > /* > * If not errorOK, an error message is printed out. > * If errorOK is true, just return "false" for bad input. > */ Thanks for the review. Attached is a v4, with improved comments on strtoint64 as you requested. I also added two "unlikely" compiler directives. -- Fabien.
Attachment
Patch is good to go from my side. The new status of this patch is: Ready for Committer
Hello Ibrar, > The new status of this patch is: Ready for Committer Ok, thanks. Let's see what committers think about it. Andres, are you still interested in overflow detection and handling? -- Fabien.
Hi, On 2018-08-16 17:28:06 +0200, Fabien COELHO wrote: > > The new status of this patch is: Ready for Committer > > Ok, thanks. Let's see what committers think about it. > > Andres, are you still interested in overflow detection and handling? Yes. I'll try to look at it at some point not too far away. - Andres
Hi, On 2018-08-10 10:24:29 +0200, Fabien COELHO wrote: > diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l > index 5c1bd88128..e8faea3857 100644 > --- a/src/bin/pgbench/exprscan.l > +++ b/src/bin/pgbench/exprscan.l > @@ -191,16 +191,26 @@ notnull [Nn][Oo][Tt][Nn][Uu][Ll][Ll] > yylval->bval = false; > return BOOLEAN_CONST; > } > +"9223372036854775808" { > + /* yuk: special handling for minint */ > + return MAXINT_PLUS_ONE_CONST; > + } Yikes, do we really need this? If we dealt with - in the lexer, we shouldn't need it, no? > /* > * strtoint64 -- convert a string to 64-bit integer > * > - * This function is a modified version of scanint8() from > + * This function is a slightly modified version of scanint8() from > * src/backend/utils/adt/int8.c. > + * > + * The function returns whether the conversion worked, and if so > + * "*result" is set to the result. > + * > + * If not errorOK, an error message is also printed out on errors. > */ > -int64 > -strtoint64(const char *str) > +bool > +strtoint64(const char *str, bool errorOK, int64 *result) > { > const char *ptr = str; > - int64 result = 0; > - int sign = 1; > + int64 tmp = 0; > + bool neg = false; > > /* > * Do our own scan, rather than relying on sscanf which might be broken > * for long long. > + * > + * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate > + * value as a negative number. > */ > > /* skip leading spaces */ > @@ -650,46 +660,80 @@ strtoint64(const char *str) > if (*ptr == '-') > { > ptr++; > - > - /* > - * Do an explicit check for INT64_MIN. Ugly though this is, it's > - * cleaner than trying to get the loop below to handle it portably. 
> - */ > - if (strncmp(ptr, "9223372036854775808", 19) == 0) > - { > - result = PG_INT64_MIN; > - ptr += 19; > - goto gotdigits; > - } > - sign = -1; > + neg = true; > } > else if (*ptr == '+') > ptr++; > > /* require at least one digit */ > - if (!isdigit((unsigned char) *ptr)) > - fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str); > + if (unlikely(!isdigit((unsigned char) *ptr))) > + goto invalid_syntax; > > /* process digits */ > while (*ptr && isdigit((unsigned char) *ptr)) > { > - int64 tmp = result * 10 + (*ptr++ - '0'); > + int8 digit = (*ptr++ - '0'); > > - if ((tmp / 10) != result) /* overflow? */ > - fprintf(stderr, "value \"%s\" is out of range for type bigint\n", str); > - result = tmp; > + if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) || > + unlikely(pg_sub_s64_overflow(tmp, digit, &tmp))) > + goto out_of_range; > } > > -gotdigits: > - > /* allow trailing whitespace, but not other trailing chars */ > while (*ptr != '\0' && isspace((unsigned char) *ptr)) > ptr++; > > - if (*ptr != '\0') > - fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str); > + if (unlikely(*ptr != '\0')) > + goto invalid_syntax; > > - return ((sign < 0) ? -result : result); > + if (!neg) > + { > + if (unlikely(tmp == PG_INT64_MIN)) > + goto out_of_range; > + tmp = -tmp; > + } > + > + *result = tmp; > + return true; > + > +out_of_range: > + if (!errorOK) > + fprintf(stderr, > + "value \"%s\" is out of range for type bigint\n", str); > + return false; > + > +invalid_syntax: > + /* this can't happen here, function called on int-looking strings. */ This comment doesn't strike me as a good idea, it's almost guaranteed to be outdated at some point. > + if (!errorOK) > + fprintf(stderr, > + "invalid input syntax for type bigint: \"%s\"\n",str); > + return false; > +} Duplicating these routines is pretty ugly. I really wish we had more infrastructure to avoid this (in particular, a portable way to report an error and jump out). 
But probably out of scope for this patch? > +/* convert string to double, detecting overflows/underflows */ > +bool > +strtodouble(const char *str, bool errorOK, double *dv) > +{ > + char *end; > + > + *dv = strtod(str, &end); > + > + if (unlikely(errno != 0)) > + { > + if (!errorOK) > + fprintf(stderr, > + "value \"%s\" is out of range for type double\n", str); > + return false; > + } > + > + if (unlikely(end == str || *end != '\0')) > + { > + if (!errorOK) > + fprintf(stderr, > + "invalid input syntax for type double: \"%s\"\n",str); > + return false; > + } > + return true; > } Not sure I see much point in the unlikelys here; contrasting to the ones in the backend (and the copy for the frontend), it's extremely unlikely that anything like this will ever be close to a bottleneck. In general, I'd strongly suggest not placing unlikelys in non-critical codepaths - they're way too often wrongly estimated. Greetings, Andres Freund
Hello Andres, >> +"9223372036854775808" { >> + /* yuk: special handling for minint */ >> + return MAXINT_PLUS_ONE_CONST; >> + } > > Yikes, do we really need this? If we dealt with - in the lexer, we > shouldn't need it, no? I'm not sure how to handle a unary minus constant in the lexer, given that the same '-' character is also used as the binary minus operator. The proposed solution is functional and has a reduced overall impact (one lexer rule and one parser rule added), so it looks good to me. Probably other approaches are possible. >> + /* this can't happen here, function called on int-looking strings. */ > > This comment doesn't strike me as a good idea, it's almost guaranteed to > be outdated at some point. It is valid now, and it can be removed anyway. > [...] Duplicating these routines is pretty ugly. Sure. > I really wish we had more infrastructure to avoid this (in particular, > a portable way to report an error and jump out). But probably out of > scope for this patch? Yes. Devising an appropriate client-side error handling/reporting infrastructure is a non-trivial task, and would cost more than a few duplicate functions. "fprintf(stderr) + return/exit" currently does the job with minimal hassle. I do not think that this patch is the right one to change how errors are handled in postgres client applications. >> + if (unlikely(end == str || *end != '\0')) > Not sure I see much point in the unlikelys here; contrasting to the ones > in the backend (and the copy for the frontend), it's extremely unlikely > that anything like this will ever be close to a bottleneck. In general, I'd > strongly suggest not placing unlikelys in non-critical codepaths - > they're way too often wrongly estimated. I put "unlikely" where I really thought it made sense. I do not know when they would have an actual performance impact, but I appreciate that they convey information to the reader of the code, telling that this path is expected not to be taken. 
It can be removed anyway if you do not like it. -- Fabien.
Hi, On 2018-09-26 22:38:41 +0200, Fabien COELHO wrote: > > > +"9223372036854775808" { > > > + /* yuk: special handling for minint */ > > > + return MAXINT_PLUS_ONE_CONST; > > > + } > > > > Yikes, do we really need this? If we dealt with - in the lexer, we > > shouldn't need it, no? > > I'm not sure how to handle a unary minus constant in the lexer, given that the > same '-' character is also used as the binary minus operator. True. I think a nicer fix than what you've done here would be to instead simply always store the number, as it comes from the lexer, as a negative number. We do similarly in a number of places. I've gone with yours for now. > > > + /* this can't happen here, function called on int-looking strings. */ > > > > This comment doesn't strike me as a good idea, it's almost guaranteed to > > be outdated at some point. > > It is valid now, and it can be removed anyway. Removed > > > + if (unlikely(end == str || *end != '\0')) > > > Not sure I see much point in the unlikelys here; contrasting to the ones > > in the backend (and the copy for the frontend), it's extremely unlikely > > that anything like this will ever be close to a bottleneck. In general, I'd > > strongly suggest not placing unlikelys in non-critical codepaths - > > they're way too often wrongly estimated. > > I put "unlikely" where I really thought it made sense. I do not know when > they would have an actual performance impact, but I appreciate that they > convey information to the reader of the code, telling that this path is > expected not to be taken. I removed them. Pushed, thanks for the patch! I only did some very minor comment changes, reset errno to 0 before strtod, and removed a redundant multiplication in PGBENCH_MUL. FWIW, after this, and fixing the prerequisite general code paths, the pgbench tests pass without -fsanitize=signed-integer-overflow errors. Greetings, Andres Freund
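On the errno point: strtod only ever sets errno, it never clears it, so the caller must reset it to 0 before the call and then check both ERANGE and the end pointer afterwards. A hedged sketch of the full error protocol (hypothetical helper name, not the committed pgbench code):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch of strtod's error protocol: errno must be reset
 * before the call (a stale ERANGE from an earlier call would otherwise
 * be misread as an overflow here), and the end pointer distinguishes
 * "no digits at all" and "trailing garbage" from a clean parse. */
static bool
parse_double(const char *str, double *result)
{
    char *end;

    errno = 0;
    *result = strtod(str, &end);
    if (errno == ERANGE)
        return false;           /* overflow or underflow */
    if (end == str || *end != '\0')
        return false;           /* no digits, or trailing garbage */
    return true;
}
```

This is exactly the kind of bug the "reset errno to 0 before strtod" tweak in the committed patch addresses: without the reset, a leftover errno value from unrelated code makes a perfectly valid input look out of range.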