Numeric multiplication overflow errors - Mailing list pgsql-hackers
From: Dean Rasheed
Subject: Numeric multiplication overflow errors
Msg-id: CAEZATCUmeFWCrq2dNzZpRj5+6LfN85jYiDoqm+ucSXhb9U2TbA@mail.gmail.com
List: pgsql-hackers
I found a couple of places where numeric multiplication suffers from overflow errors for inputs that aren't necessarily very large in magnitude.

The first is with the numeric * operator, which attempts to always produce the exact result, even though the numeric type has a maximum of 16383 digits after the decimal point. If that limit is exceeded, an overflow error is produced, e.g.:

  SELECT (1+2e-10000) * (1+3e-10000);
  ERROR:  value overflows numeric format

I can't imagine anyone actually wanting that many digits after the decimal point, but it can happen with a sequence of multiplications, where the number of digits after the decimal point grows with each multiply. Throwing an error is not particularly useful, and that error message is quite misleading, since the result is not very large. Therefore I propose to make this round the result to 16383 digits if it exceeds that, as in the first attached patch.

It's worth noting that to get correct rounding, it's necessary to compute the full exact product (which we're actually already doing) before rounding, as opposed to passing rscale=16383 to mul_var() and letting it round. The latter approach would compute a truncated product with MUL_GUARD_DIGITS extra digits of precision, which doesn't necessarily round the final digit the right way. The test case in the patch is an example that would round the wrong way with a truncated product.

I considered doing the final rounding in mul_var() (after computing the full product), to prevent such overflows for all callers, but I think that's the wrong approach because it risks covering up other problems, such as the following:

While looking through the remaining numeric code, I found one other place that has a similar problem -- when calculating the sum of squares for aggregates like variance() and stddev(), the squares can end up with more than 16383 digits after the decimal point.
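To make the correct-rounding point above concrete, here is a toy sketch using Python's decimal module (the scales are small illustrative stand-ins, not PostgreSQL's actual values): a truncated product can fall just short of the exact product, and when the exact product sits exactly on the rounding boundary, that tiny shortfall flips the final digit.

```python
from decimal import Decimal, ROUND_HALF_UP

RSCALE = 4   # target digits after the decimal point (stand-in for 16383)
GUARD = 2    # extra guard digits carried by a truncated computation

a = Decimal("1.00005")
b = Decimal("3")
exact = a * b                      # 3.00015, exactly on the rounding boundary
unit = Decimal(1).scaleb(-RSCALE)  # 0.0001

# Rounding the full exact product gives the correctly rounded result.
correct = exact.quantize(unit, rounding=ROUND_HALF_UP)   # 3.0002

# A truncated computation can fall short of the exact product by less than
# one unit in the last guard digit; here that shortfall is simulated by
# subtracting a tiny epsilon before rounding.
truncated = exact - Decimal(1).scaleb(-(RSCALE + GUARD + 1))  # 3.0001499
wrong = truncated.quantize(unit, rounding=ROUND_HALF_UP)      # 3.0001

print(correct, wrong)  # 3.0002 3.0001
```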
When the query is running on a single backend, that's no problem, because the final result is rounded to 1000 digits. However, if it uses parallel workers, the result from each worker is sent using numeric_send/recv(), which attempts to convert to numeric before sending. Thus it's possible to have an aggregate query that fails if it uses parallel workers and succeeds otherwise.

In this case, I don't think that rounding the result from each worker is the right approach, since that can lead to the final result being different depending on whether or not the query uses parallel workers. Also, given that each worker is already doing the hard work of computing these squares, it seems a waste to just discard that information.

So the second patch fixes this by adding new numericvar_send/recv() functions capable of sending the full-precision NumericVars from each worker, without rounding. The first test case in this patch is an example that would round the wrong way if the result from each worker were rounded before being sent.

An additional benefit to this approach is that it also addresses the issue noted in the old code about its use of numeric_send/recv() being wasteful:

  /*
   * This is a little wasteful since make_result converts the NumericVar
   * into a Numeric and numeric_send converts it back again. Is it worth
   * splitting the tasks in numeric_send into separate functions to stop
   * this? Doing so would also remove the fmgr call overhead.
   */

So the patch converts all aggregate serialization/deserialization code to use the new numericvar_send/recv() functions. I doubt that that gives much in the way of a performance improvement, but it makes the code a little neater as well as preventing overflows.

After writing that, I realised that there's another overflow risk -- if the input values are very large in magnitude, the sum of squares could genuinely overflow the numeric type, while the final variance could be quite small (and that can't be fixed by rounding).
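The same shape of failure can be sketched with binary floats (an analogy only -- numeric's limits are vastly larger, and the patch's actual fix is to keep the full-precision accumulator out of the constrained numeric format, not to change the computation): the sum of squares overflows even though the variance of the inputs is tiny relative to their magnitude.

```python
import math

# Two large-magnitude values with a comparatively tiny spread.
xs = [1e160, 1e160 + 1e145]
n = len(xs)

sum_x = sum(xs)
sum_x2 = sum(x * x for x in xs)  # each square is ~1e320: overflows to inf

# The sum-of-squares form of the variance gives nan (inf - inf) ...
naive = (sum_x2 - sum_x * sum_x / n) / (n - 1)

# ... even though the variance itself is perfectly representable,
# as the two-pass form shows (used here only for illustration).
mean = sum_x / n
variance = sum((x - mean) ** 2 for x in xs) / (n - 1)

print(naive, variance)  # nan, then a finite positive value
```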
So this too fails when parallel workers are used, and succeeds otherwise, and the patch fixes this too, so I added a separate test case for it.

Regards,
Dean