pgbench's expression parsing & negative numbers

From: Andres Freund
Hi,

working on overflow correctness in pg I noticed that pgbench isn't quite
there.  I assume it won't matter terribly often, but the way it parses
integers makes it incorrect for, at least, the negativemost number.

It directly parses the integer input using:
{digit}+        {
                    yylval->ival = strtoint64(yytext);
                    return INTEGER_CONST;
                }

but that unfortunately means that the sign is not included in the call to
strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,
because that's not representable as a positive number on a two's
complement machine (i.e. everywhere).  Preliminary testing shows that
that can relatively easily be fixed by just prefixing [-+]? to the relevant
exprscan.l rules.  But it might also not be a bad idea to simply defer
parsing of the number until exprparse.y has its hand on the number?

There are plenty of other things wrong with overflow handling too, especially
evalFunc() basically just doesn't care.  I'll come up with a patch for
that sometime soon.

A third complaint I have is that the tap tests are pretty hard to parse
for humans.

Greetings,

Andres Freund


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Andres,

> working on overflow correctness in pg I noticed that pgbench isn't quite
> there.  I assume it won't matter terribly often, but the way it parses
> integers makes it incorrect for, at least, the negativemost number.
>
> It directly parses the integer input using:
> {digit}+        {
>                     yylval->ival = strtoint64(yytext);
>                     return INTEGER_CONST;
>                 }
>
> but that unfortunately means that the sign is not included in the call to
> strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,

Indeed. For a benchmarking script which generates a pseudo-random load I 
do not see it as a big issue, but maybe I'm missing some use case.

> because that's not representable as a positive number on a two's 
> complement machine (i.e. everywhere).  Preliminary testing shows that 
> that can relatively easily fixed by just prefixing [-+]? to the relevant 
> exprscan.l rules.

Beware that it must handle the unary/binary plus/minus case consistently:

   "-123" vs "a -123" vs "a + -123"

> But it might also not be a bad idea to simply defer parsing of the 
> number until exprparse.y has its hand on the number?

It is usually simpler to let the lexer handle constants.

> There's plenty other things wrong with overflow handling too, especially 
> evalFunc() basically just doesn't care.

Sure.

There is some overflow checking on division and on the double-to-int cast, 
which was added because of previous complaints, but which does not seem 
very useful to me.

ISTM that it is a voluntary feature, which handles int64 operations with 
modulo (wraparound) overflow, as in C. Generating exceptions on such 
overflows does not look very attractive to me.

>  I'll come up with a patch for that sometime soon.

I wish you could provide some motivation: why does it matter much for a 
benchmarking script to handle overflows with more care?

> A third complaint I have is that the tap tests are pretty hard to parse
> for humans.

Probably.

Perl is pretty hard for humans to start with :-)

There is a lot of code factorization to cram many tests together, so that 
one test is more or less one line and there are hundreds of them. Not sure 
it would help if it was expanded.

There are a lot of regular expressions to check outputs, which does not 
help.

I wanted to have the pgbench scripts outside, but this has been rejected, 
so the tested scripts themselves are in the middle of the Perl code, which 
I think has degraded the readability significantly.

-- 
Fabien.


Re: pgbench's expression parsing & negative numbers

From: Andres Freund
On 2017-12-12 07:21:21 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> > working on overflow correctness in pg I noticed that pgbench isn't quite
> > there.  I assume it won't matter terribly often, but the way it parses
> > integers makes it incorrect for, at least, the negativemost number.
> > 
> > It directly parses the integer input using:
> > {digit}+        {
> >                     yylval->ival = strtoint64(yytext);
> >                     return INTEGER_CONST;
> >                 }
> > 
> > but that unfortunately means that the sign is not included in the call to
> > strtoint64. Which in turn means you can't correctly parse PG_INT64_MIN,
> 
> Indeed. For a benchmarking script which generates a pseudo-random load I do
> not see it as a big issue, but maybe I'm missing some use case.

IDK, behaving correctly seems like a good idea. Also far from impossible
that that code gets propagated further.  Doesn't seem like an entirely
crazy idea that somebody actually wants to benchmark boundary cases,
which might be slower and such.

> > because that's not representable as a positive number on a two's
> > complement machine (i.e. everywhere).  Preliminary testing shows that
> > that can relatively easily fixed by just prefixing [-+]? to the relevant
> > exprscan.l rules.
> 
> Beware that it must handle the unary/binary plus/minus case consistently:
> 
>   "-123" vs "a -123" vs "a + -123"

Which is a lot easier to solve on the parser side of things...


> > But it might also not be a bad idea to simply defer parsing of the
> > number until exprparse.y has its hand on the number?
> 
> It is usually simpler to let the lexer handle constants.


> > There's plenty other things wrong with overflow handling too, especially
> > evalFunc() basically just doesn't care.
> 
> Sure.
> 
> There is some overflow checking on division and on the double-to-int cast,
> which was added because of previous complaints, but which does not seem very
> useful to me.

I think handling it inconsistently is the worst of all worlds.


> ISTM that it is a voluntary feature, which handles int64 operations with
> modulo (wraparound) overflow, as in C. Generating exceptions on such
> overflows does not look very attractive to me.

No, it's not really voluntary. Signed overflow isn't actually defined
behaviour in C. We kinda make it so on some platforms, but that's not
really a good idea.


> >  I'll come up with a patch for that sometime soon.
> 
> I wish you could provide some motivation: why does it matter much for a
> benchmarking script to handle overflows with more care?

a) I want to be able to compile without -fwrapv in the near
   future. Which means you can't have signed overflows, because they're
   undefined behaviour in C.
b) I want to be able to compile with -ftrapv /
   -fsanitize=signed-integer-overflow, to be sure code is actually
   correct. Currently pgbench crashes with that.
c) Randomly handling things in some but not all places is a bad idea.


> > A third complaint I have is that the tap tests are pretty hard to parse
> > for humans.
> 
> Probably.
> 
> Perl is pretty hard for humans to start with :-)

Meh.


> There is a lot of code factorization to cram many tests together, so that
> one test is more or less one line and there are hundreds of them. Not sure
> it would help if it was expanded.

I don't think expanding it is really the problem; I think they're just
largely not well documented or formatted. With some effort the tests could
be a lot easier to read.


Greetings,

Andres Freund


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Andres,

>> There is some overflow checking on division and on the double-to-int cast,
>> which was added because of previous complaints, but which does not seem
>> very useful to me.
>
> I think handling it inconsistently is the worst of all worlds.

Hmmm... I cannot say that inconsistency is a good thing, that would not 
be consistent:-)

My 0.02€:

  - I do not think that updating pgbench arithmetic for managing integer
    overflows is worth Andres Freund's time. My guess is that most
    scripts would not trigger client-side overflows, so the change would
    be a no-op in practice.

  - I think that pgbench has more important defects/missing features, to
    be fixed more urgently. Given the time patches spend in the cf queue,
    obviously committers disagree on this one:-)

Now ISTM that it does not harm anything to add such a feature, so fine 
with me. Maybe the global compiler option removal is worth the effort.

-- 
Fabien.

Re: pgbench's expression parsing & negative numbers

From: Andres Freund
Hi,

On 2017-12-14 10:41:04 +0100, Fabien COELHO wrote:
>  - I do not think that updating pgbench arithmetic for managing integer
>    overflows is worth Andres Freund's time. My guess is that most
>    scripts would not trigger client-side overflows, so the change would
>    be a no-op in practice.

It might not be if you view it in isolation (although I'm not
convinced). The problem is that it has cost beyond pgbench. Due to
pgbench's overflow handling I can't run make check on a build that has
-ftrapv, which found several bugs already.

I'd be happy if somebody else would tackle the issue, but I don't quite
see it happening...

Greetings,

Andres Freund


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello,

>>  - I do not think that updating pgbench arithmetic for managing integer
>>    overflows is worth Andres Freund's time. My guess is that most
>>    scripts would not trigger client-side overflows, so the change would
>>    be a no-op in practice.
>
> It might not be if you view it in isolation (although I'm not
> convinced). The problem is that it has cost beyond pgbench. Due to
> pgbench's overflow handling

Lack of?

> I can't run make check on a build that has -ftrapv, which found several 
> bugs already.

Hmmm. You suggest that integer overflows do occur when running pgbench.

Indeed, this tap test case: "\set maxint debug(:minint - 1)"

Otherwise, some stat counters may overflow on very long runs? Although
overflowing an int64 takes some time...

> I'd be happy if somebody else would tackle the issue, but I don't quite 
> see it happening...

I must admit that it is not very high on my may-do list. I'll review it if 
it appears, though.

-- 
Fabien.


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Andres,

> working on overflow correctness in pg I noticed that pgbench isn't quite
> there.

Indeed.

> I assume it won't matter terribly often, but the way it parses integers 
> makes it incorrect for, at least, the negativemost number. [...] but 
> that unfortunately means that the sign is not included in the call to 
> strtoint64.

The strtoint64 function is indeed a mess. It does not really report errors 
(more precisely, an error message is printed out, but the execution goes 
on nevertheless...).

> Which in turn means you can't correctly parse PG_INT64_MIN,
> because that's not representable as a positive number on a two's
> complement machine (i.e. everywhere).  Preliminary testing shows that
> that can relatively easily fixed by just prefixing [-+]? to the relevant
> exprscan.l rules.

I'm not sure I get it, because then "1 -2" would be scanned as "int 
signed_int", which requires adding some fine-grained rules to the parser to 
handle that as an addition, and I am unclear about the precedence handling,
e.g. "1 -2 * 3". But I agree that it can be made to work by transferring
the conversion to the parser and maybe some careful tweaking there.

> But it might also not be a bad idea to simply defer
> parsing of the number until exprparse.y has its hand on the number?
>
> There's plenty other things wrong with overflow handling too, especially
> evalFunc() basically just doesn't care.

Indeed.

Some overflow issues are easy to fix with the existing pg_*_s64_overflow 
macros that you provided. It should also handle double2int64 overflows.

I have tried to trigger errors with a -fsanitize=signed-integer-overflow 
compilation, without much success, but ISTM that you suggested that 
pgbench does not work with that in another thread. Could you be more 
precise?

> I'll come up with a patch for that sometime soon.

ISTM that you have not sent any patch on the subject, otherwise I would 
have reviewed it. Maybe I could do one some time later, unless you think 
that I should not.

-- 
Fabien.


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Andres,

>> I'll come up with a patch for that sometime soon.
>
> ISTM that you have not sent any patch on the subject, otherwise I would 
> have reviewed it. Maybe I could do one some time later, unless you think 
> that I should not.

Here is a patch which detects pgbench overflows on int & double constants, 
and on integer operators.

-- 
Fabien.
Attachment

Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
>>> I'll come up with a patch for that sometime soon.
>> 
>> ISTM that you have not sent any patch on the subject, otherwise I would 
>> have reviewed it. Maybe I could do one some time later, unless you think 
>> that I should not.
>
> Here is a patch which detects pgbench overflows on int & double constants, 
> and on integer operators.

... it, but I forgot to handle parsing min int, which was the initial focus 
of this thread.

This patch does that as well by handling it as a special case between 
lexer & parser (the issue being that 9223372036854775808 cannot be lexed
as a standard integer, as it is too large, and -9223372036854775808 is 
really two tokens, so must be managed from the parser).

-- 
Fabien.
Attachment

Re: pgbench's expression parsing & negative numbers

From: Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

The patch does not apply cleanly on the master branch; anyway, I managed 
that. The patch works according to spec, and no issues were found.
 
The only minor nit is that you can keep the full comments of function strtoint64

/*
 * If not errorOK, an error message is printed out.
 * If errorOK is true, just return "false" for bad input.
 */

Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello,

> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            not tested
>
> The patch does not apply cleanly on the master branch; anyway, I managed
> that. The patch works according to spec, and no issues were found.
>
> The only minor nit is that you can keep the full comments of function strtoint64
>
> /*
> * If not errorOK, an error message is printed out.
> * If errorOK is true, just return "false" for bad input.
> */

Thanks for the review.

Attached is a v4, with improved comments on strtoint64 as you requested.
I also added 2 "unlikely" compiler directives.

-- 
Fabien.
Attachment

Re: pgbench's expression parsing & negative numbers

From: Ibrar Ahmed
Patch is good to go from my side.

The new status of this patch is: Ready for Committer

Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Ibrar,

> The new status of this patch is: Ready for Committer

Ok, thanks. Let's see what committers think about it.

Andres, are you still interested in overflow detection and handling?

-- 
Fabien.


Re: pgbench's expression parsing & negative numbers

From: Andres Freund
Hi,

On 2018-08-16 17:28:06 +0200, Fabien COELHO wrote:
> > The new status of this patch is: Ready for Committer
> 
> Ok, thanks. Let's see what committers think about it.
> 
> Andres, are you still interested in overflow detection and handling?

Yes.  I'll try to look at it at some point not too far away.

- Andres


Re: pgbench's expression parsing & negative numbers

From: Andres Freund
Hi,

On 2018-08-10 10:24:29 +0200, Fabien COELHO wrote:
> diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
> index 5c1bd88128..e8faea3857 100644
> --- a/src/bin/pgbench/exprscan.l
> +++ b/src/bin/pgbench/exprscan.l
> @@ -191,16 +191,26 @@ notnull            [Nn][Oo][Tt][Nn][Uu][Ll][Ll]
>                      yylval->bval = false;
>                      return BOOLEAN_CONST;
>                  }
> +"9223372036854775808" {
> +                    /* yuk: special handling for minint */
> +                    return MAXINT_PLUS_ONE_CONST;
> +                }

Yikes, do we really need this?  If we dealt with - in the lexer, we
shouldn't need it, no?


>  /*
>   * strtoint64 -- convert a string to 64-bit integer
>   *
> - * This function is a modified version of scanint8() from
> + * This function is a slightly modified version of scanint8() from
>   * src/backend/utils/adt/int8.c.
> + *
> + * The function returns whether the conversion worked, and if so
> + * "*result" is set to the result.
> + *
> + * If not errorOK, an error message is also printed out on errors.
>   */
> -int64
> -strtoint64(const char *str)
> +bool
> +strtoint64(const char *str, bool errorOK, int64 *result)
>  {
>      const char *ptr = str;
> -    int64        result = 0;
> -    int            sign = 1;
> +    int64        tmp = 0;
> +    bool        neg = false;
>  
>      /*
>       * Do our own scan, rather than relying on sscanf which might be broken
>       * for long long.
> +     *
> +     * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
> +     * value as a negative number.
>       */
>  
>      /* skip leading spaces */
> @@ -650,46 +660,80 @@ strtoint64(const char *str)
>      if (*ptr == '-')
>      {
>          ptr++;
> -
> -        /*
> -         * Do an explicit check for INT64_MIN.  Ugly though this is, it's
> -         * cleaner than trying to get the loop below to handle it portably.
> -         */
> -        if (strncmp(ptr, "9223372036854775808", 19) == 0)
> -        {
> -            result = PG_INT64_MIN;
> -            ptr += 19;
> -            goto gotdigits;
> -        }
> -        sign = -1;
> +        neg = true;
>      }
>      else if (*ptr == '+')
>          ptr++;
>  
>      /* require at least one digit */
> -    if (!isdigit((unsigned char) *ptr))
> -        fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> +        goto invalid_syntax;
>  
>      /* process digits */
>      while (*ptr && isdigit((unsigned char) *ptr))
>      {
> -        int64        tmp = result * 10 + (*ptr++ - '0');
> +        int8        digit = (*ptr++ - '0');
>  
> -        if ((tmp / 10) != result)    /* overflow? */
> -            fprintf(stderr, "value \"%s\" is out of range for type bigint\n", str);
> -        result = tmp;
> +        if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
> +            unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
> +            goto out_of_range;
>      }
>  
> -gotdigits:
> -
>      /* allow trailing whitespace, but not other trailing chars */
>      while (*ptr != '\0' && isspace((unsigned char) *ptr))
>          ptr++;
>  
> -    if (*ptr != '\0')
> -        fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
> +    if (unlikely(*ptr != '\0'))
> +        goto invalid_syntax;
>  
> -    return ((sign < 0) ? -result : result);
> +    if (!neg)
> +    {
> +        if (unlikely(tmp == PG_INT64_MIN))
> +            goto out_of_range;
> +        tmp = -tmp;
> +    }
> +
> +    *result = tmp;
> +    return true;
> +
> +out_of_range:
> +    if (!errorOK)
> +        fprintf(stderr,
> +                "value \"%s\" is out of range for type bigint\n", str);
> +    return false;
> +
> +invalid_syntax:
> +    /* this can't happen here, function called on int-looking strings. */

This comment doesn't strike me as a good idea, it's almost guaranteed to
be outdated at some point.

> +    if (!errorOK)
> +        fprintf(stderr,
> +                "invalid input syntax for type bigint: \"%s\"\n",str);
> +    return false;
> +}


Duplicating these routines is pretty ugly.  I really wish we had more
infrastructure to avoid this (in particular a portable way to report an
error and jump out).  But that's probably out of scope for this patch?

> +/* convert string to double, detecting overflows/underflows */
> +bool
> +strtodouble(const char *str, bool errorOK, double *dv)
> +{
> +    char *end;
> +
> +    *dv = strtod(str, &end);
> +
> +    if (unlikely(errno != 0))
> +    {
> +        if (!errorOK)
> +            fprintf(stderr,
> +                    "value \"%s\" is out of range for type double\n", str);
> +        return false;
> +    }
> +
> +    if (unlikely(end == str || *end != '\0'))
> +    {
> +        if (!errorOK)
> +            fprintf(stderr,
> +                    "invalid input syntax for type double: \"%s\"\n",str);
> +        return false;
> +    }
> +    return true;
>  }

Not sure I see much point in the unlikelys here; in contrast to the ones
in the backend (and the copy for the frontend), it's extremely unlikely
anything like this will ever be close to a bottleneck.  In general, I'd
strongly suggest not placing unlikelys in non-critical codepaths -
they're way too often wrongly estimated.

Greetings,

Andres Freund


Re: pgbench's expression parsing & negative numbers

From: Fabien COELHO
Hello Andres,

>> +"9223372036854775808" {
>> +                    /* yuk: special handling for minint */
>> +                    return MAXINT_PLUS_ONE_CONST;
>> +                }
>
> Yikes, do we really need this?  If we dealt with - in the lexer, we
> shouldn't need it, no?

I'm not sure how to handle a unary minus constant in the lexer, given that 
the same '-' character is also used as the binary minus operator.

The proposed solution is functional and has a reduced overall impact (one 
lexer rule and one parser rule added), so it looks good to me.

Probably other approaches are possible.

>> +    /* this can't happen here, function called on int-looking strings. */
>
> This comment doesn't strike me as a good idea, it's almost guaranteed to
> be outdated at some point.

It is valid now, and it can be removed anyway.

> [...] Duplicating these routines is pretty ugly.

Sure.

> I really wish we had more infrastructure to avoid this (in particular 
> a portable way to report an error and jump out).  But that's probably 
> out of scope for this patch?

Yes.

Devising an appropriate client-side error handling/reporting 
infrastructure is a non-trivial task, and would cost more than a few 
duplicate functions. "fprintf(stderr) + return/exit" currently does the job 
with minimal hassle. I do not think that this patch is the right place to 
change how errors are handled in postgres client applications.

>> +    if (unlikely(end == str || *end != '\0'))

> Not sure I see much point in the unlikelys here; in contrast to the ones
> in the backend (and the copy for the frontend), it's extremely unlikely
> anything like this will ever be close to a bottleneck.  In general, I'd
> strongly suggest not placing unlikelys in non-critical codepaths -
> they're way too often wrongly estimated.

I put "unlikely" where I really thought it made sense. I do not know when 
they would have an actual performance impact, but I appreciate that they 
convey information to the reader of the code, telling that this path is 
expected not to be taken.

It can be removed anyway if you do not like it.

-- 
Fabien.


Re: pgbench's expression parsing & negative numbers

From: Andres Freund
Hi,

On 2018-09-26 22:38:41 +0200, Fabien COELHO wrote:
> > > +"9223372036854775808" {
> > > +                    /* yuk: special handling for minint */
> > > +                    return MAXINT_PLUS_ONE_CONST;
> > > +                }
> > 
> > Yikes, do we really need this?  If we dealt with - in the lexer, we
> > shouldn't need it, no?
> 
> I'm not sure how to handle a unary minus constant in the lexer, given that
> the same '-' character is also used as the binary minus operator.

True. I think a nicer fix than what you've done here would be to instead
simply always store the number, as it comes from the lexer, as a negative
number.  We do similarly in a number of places.  I've gone with yours for
now.

> > > +    /* this can't happen here, function called on int-looking strings. */
> > 
> > This comment doesn't strike me as a good idea, it's almost guaranteed to
> > be outdated at some point.
> 
> It is valid now, and it can be removed anyway.

Removed



> > > +    if (unlikely(end == str || *end != '\0'))
> 
> > Not sure I see much point in the unlikelys here; in contrast to the ones
> > in the backend (and the copy for the frontend), it's extremely unlikely
> > anything like this will ever be close to a bottleneck.  In general, I'd
> > strongly suggest not placing unlikelys in non-critical codepaths -
> > they're way too often wrongly estimated.
> 
> I put "unlikely" where I really thought it made sense. I do not know when
> they would have an actual performance impact, but I appreciate that they
> convey information to the reader of the code, telling that this path is
> expected not to be taken.

I removed them.

Pushed, thanks for the patch!  I only did some very minor comment
changes, reset errno to 0 before strtod, removed a redundant
multiplication in PGBENCH_MUL.

FWIW, after this, and fixing the prerequisite general code paths, the
pgbench tests pass without -fsanitize=signed-integer-overflow errors.

Greetings,

Andres Freund