Numeric multiplication overflow errors - Mailing list pgsql-hackers

From Dean Rasheed
Subject Numeric multiplication overflow errors
Date
Msg-id CAEZATCUmeFWCrq2dNzZpRj5+6LfN85jYiDoqm+ucSXhb9U2TbA@mail.gmail.com
Whole thread Raw
Responses Re: Numeric multiplication overflow errors
List pgsql-hackers
I found a couple of places where numeric multiplication suffers from
overflow errors for inputs that aren't necessarily very large in
magnitude.

The first is with the numeric * operator, which attempts to always
produce the exact result, even though the numeric type has a maximum
of 16383 digits after the decimal point. If the limit is exceeded an
overflow error is produced, e.g.:

SELECT (1+2e-10000) * (1+3e-10000);
ERROR:  value overflows numeric format

I can't imagine anyone actually wanting that many digits after the
decimal point, but it can happen with a sequence of multiplications,
where the number of digits after the decimal point grows with each
multiply. Throwing an error is not particularly useful, and that error
message is quite misleading, since the result is not very large.
Therefore I propose to make this round the result to 16383 digits if
it exceeds that, as in the first attached patch.

It's worth noting that to get correct rounding, it's necessary to
compute the full exact product (which we're actually already doing)
before rounding, as opposed to passing rscale=16383 to mul_var(), and
letting it round. The latter approach would compute a truncated
product with MUL_GUARD_DIGITS extra digits of precision, which doesn't
necessarily round the final digit the right way. The test case in the
patch is an example that would round the wrong way with a truncated
product.

I considered doing the final rounding in mul_var() (after computing
the full product), to prevent such overflows for all callers, but I
think that's the wrong approach because it risks covering up other
problems, such as the following:

While looking through the remaining numeric code, I found one other
place that has a similar problem -- when calculating the sum of
squares for aggregates like variance() and stddev(), the squares can
end up with more than 16383 digits after the decimal point. When the
query is running on a single backend, that's no problem, because the
final result is rounded to 1000 digits. However, if it uses parallel
workers, the result from each worker is sent using numeric_send/recv()
which attempts to convert to numeric before sending. Thus it's
possible to have an aggregate query that fails if it uses parallel
workers and succeeds otherwise.

In this case, I don't think that rounding the result from each worker
is the right approach, since that can lead to the final result being
different depending on whether or not the query uses parallel workers.
Also, given that each worker is already doing the hard work of
computing these squares, it seems a waste to just discard that
information.

So the second patch fixes this by adding new numericvar_send/recv()
functions capable of sending the full precision NumericVar's from each
worker, without rounding. The first test case in this patch is an
example that would round the wrong way if the result from each worker
were rounded before being sent.

An additional benefit to this approach is that it also addresses the
issue noted in the old code about its use of numeric_send/recv() being
wasteful:

    /*
     * This is a little wasteful since make_result converts the NumericVar
     * into a Numeric and numeric_send converts it back again. Is it worth
     * splitting the tasks in numeric_send into separate functions to stop
     * this? Doing so would also remove the fmgr call overhead.
     */

So the patch converts all aggregate serialization/deserialization code
to use the new numericvar_send/recv() functions. I doubt that that
gives much in the way of a performance improvement, but it makes the
code a little neater as well as preventing overflows.

After writing that, I realised that there's another overflow risk --
if the input values are very large in magnitude, the sum of squares
could genuinely overflow the numeric type, while the final variance
could be quite small (and that can't be fixed by rounding). So this
too fails when parallel workers are used, and succeeds otherwise, and
the patch fixes this too, so I added a separate test case for it.

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: What is "wraparound failure", really?
Next
From: Amit Kapila
Date:
Subject: Re: [bug?] Missed parallel safety checks, and wrong parallel safety