Thread: BUG #15307: Low numerical precision of (Co-) Variance
The following bug has been logged on the website:

Bug reference:      15307
Logged by:          Erich Schubert
Email address:      erich@debian.org
PostgreSQL version: 9.6.6
Operating system:   Linux
Description:

Numerical precision of variance computations in PostgreSQL is too low.

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x

The first two give the low-precision answer 0.000244140625 instead of
0.0002. Interestingly enough, VAR_SAMP(x) is okay - I guess that Postgres
may autodetect a fixed-precision decimal here, or use some high-precision
code.

If you add just another digit (10 million +- 1 cent), the bad functions
even return a variance of 0:

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 10000000.01 as x UNION SELECT 9999999.99 as x) AS x

The reason is catastrophic cancellation. Apparently, the covariance is
computed using the E[XY]-E[X]E[Y] approach, which suffers from low
numerical precision. While I report this against version 9.6.6 (since I
used sqlfiddle), it is clearly still present in Git:

https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2606
https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2969

It should also affect the general "numeric" version (i.e. all versions of
variance/covariance/standard deviation use the known-bad equation), but
for integers it will usually be fine as long as the sum of squares can be
stored:

https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/numeric.c#L4883

There are a number of methods to get reasonable precision. The simplest to
implement is a two-pass approach: first compute the mean of each variable,
then compute E[(X-meanX)*(Y-meanY)] in a second pass. This will usually
give the best precision. Faster (single-pass) methods can be found in the
literature from the 1970s, and also in Donald Knuth's "The Art of Computer
Programming". In particular, Youngs and Cramer's version performed quite
well (and surprisingly much better than Welford's, supposedly due to CPU
pipelining).

Single-pass and parallelization-friendly approaches (interesting in
particular for distributed databases, but also to avoid I/O cost) with
good accuracy are studied in:

> Erich Schubert and Michael Gertz. Numerically Stable Parallel Computation
> of (Co-)Variance. In: Proceedings of the 30th International Conference on
> Scientific and Statistical Database Management (SSDBM), Bolzano-Bozen,
> Italy. 2018, 10:1–10:12. DOI: 10.1145/3221269.3223036.
> https://dl.acm.org/citation.cfm?doid=3221269.3223036

The problem has also been observed in other SQL databases (e.g. MS SQL
Server), while others like MySQL have implemented a numerically stable
single-pass approach; see:

> Kamat, Niranjan, and Arnab Nandi. "A Closer Look at Variance
> Implementations In Modern Database Systems." ACM SIGMOD Record 45.4
> (2017): 28-33. DOI: 10.1145/3092931.3092936.
> https://dl.acm.org/citation.cfm?id=3092936

Sorry, I do not know the PostgreSQL internals (aggregation...) well enough
to provide a patch.
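To make the cancellation concrete, here is a small standalone sketch (not
PostgreSQL source; the names var_naive and var_twopass are made up for
illustration) that applies the schoolbook formula N*sum(X^2) - sum(X)^2 and
a two-pass computation to the values from the report:

#include <cstdio>
#include <cstddef>

int main() {
    const double x[] = {1000000.01, 999999.99};
    const size_t n = 2;

    // Single-pass schoolbook accumulation: sum and sum of squares.
    double sum = 0, sumsq = 0;
    for (size_t i = 0; i < n; i++) { sum += x[i]; sumsq += x[i] * x[i]; }
    const double var_naive = (n * sumsq - sum * sum) / (n * (n - 1.0));

    // Two-pass: compute the mean first, then sum squared deviations.
    const double mean = sum / n;
    double ss = 0;
    for (size_t i = 0; i < n; i++) { ss += (x[i] - mean) * (x[i] - mean); }
    const double var_twopass = ss / (n - 1.0);

    printf("schoolbook: %.12g\n", var_naive);   // ~0.000244140625 (cancellation)
    printf("two-pass:   %.12g\n", var_twopass); // ~0.0002
    return 0;
}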
On 1 August 2018 at 14:35, PG Bug reporting form <noreply@postgresql.org> wrote:
> Numerical precision of variance computations in PostgreSQL is too low.
>
> SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
> FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x
>
> The first two give the low-precision answer 0.000244140625 instead of
> 0.0002. Interestingly enough, VAR_SAMP(x) is okay ...

For a number of those statistical aggregates, PostgreSQL provides 2
implementations -- one implemented using double precision floating point
arithmetic, which is much faster, but necessarily less accurate and
possibly platform-dependent; and one implemented using the arbitrary
precision numeric datatype, which will return much more accurate results.
For any input datatypes other than floating point, you will automatically
get the latter, which is what you're seeing with var_samp(x) when you're
not explicitly casting the input to float8.

However, all the 2-argument aggregates such as corr() and covar_pop/samp()
currently only have floating point implementations, and suffer from the
problem you report, which, I agree, is not great. If we can easily improve
the accuracy of those aggregates, then I think it is worthwhile.

Using a two-pass approach isn't really an option, given the way that
aggregates work in PostgreSQL; however, implementing Welford's algorithm
looks to be quite straightforward. I had a quick play and found that it
fixed the accuracy problem with no noticeable performance penalty -- there
are a few extra cycles in the accumulator functions, but compared to the
overall per-tuple overhead, that appears to be negligible.

I'll post something shortly.

Regards,
Dean
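For readers unfamiliar with what such a single-pass accumulator looks like,
here is a minimal sketch of a Welford/Knuth-style state and update step.
This is purely illustrative and not the PostgreSQL accumulator API (the
actual float8 aggregates keep their state in an internal array and use
transition/final functions); the names WelfordState, welford_accum,
var_pop and var_samp are made up here:

#include <cmath>

// Hypothetical transition state for an incremental variance aggregate.
struct WelfordState {
    long   n    = 0;  // number of values seen so far
    double mean = 0;  // running mean
    double m2   = 0;  // running sum of squared deviations from the mean
};

// Per-value update (Knuth's formulation of Welford's algorithm).
inline void welford_accum(WelfordState &s, double x) {
    s.n += 1;
    const double delta = x - s.mean;
    s.mean += delta / s.n;
    s.m2 += delta * (x - s.mean);  // uses the updated mean
}

// Final steps for population / sample variance.
inline double var_pop(const WelfordState &s)  { return s.n > 0 ? s.m2 / s.n : NAN; }
inline double var_samp(const WelfordState &s) { return s.n > 1 ? s.m2 / (s.n - 1) : NAN; }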
Hi,

> For a number of those statistical aggregates, PostgreSQL provides 2
> implementations -- one implemented using double precision floating
> point arithmetic, which is much faster, but necessarily less accurate
> and possibly platform-dependent; and one implemented using the
> arbitrary precision numeric datatype, which will return much more
> accurate results. For any input datatypes other than floating point,
> you will automatically get the latter, which is what you're seeing
> with var_samp(x) when you're not explicitly casting the input to
> float8.

OK, I am surprised that this is possible to do with arbitrary precision
here, since for example with 8 data points it will eventually involve a
division by 7, which cannot be represented even with arbitrary (finite)
precision floating point. But maybe that division just comes late enough
(after the differences) that it works before being converted to a float
for the division.

> However, all the 2-argument aggregates such as corr() and
> covar_pop/samp() currently only have floating point implementations,
> and suffer from the problem you report, which, I agree, is not great.
> If we can easily improve the accuracy of those aggregates, then I
> think it is worthwhile.
>
> Using a two-pass approach isn't really an option, given the way that
> aggregates work in PostgreSQL; however, implementing Welford's
> algorithm looks to be quite straightforward. I had a quick play and
> found that it fixed the accuracy problem with no noticeable
> performance penalty -- there are a few extra cycles in the accumulator
> functions, but compared to the overall per-tuple overhead, that
> appears to be negligible.

Instead of Welford, I recommend using the Youngs-Cramer approach. It is
almost the same; roughly, it aggregates sum(X) and sum((X-meanX)²)
directly (whereas Welford updates mean(X) and sum((X-meanX)²)), so the
first aggregate is actually unchanged from the current approach.

In my experiments, this was surprisingly much faster when the data was a
double[]; this is explainable by CPU pipelining (see the figure in the
paper I linked - the division comes late, and the next step does not
depend on the division, so the pipeline can already process the next
double without waiting for the division to finish).
Youngs & Cramer original publication:
https://www.jstor.org/stable/pdf/1267176.pdf

The (non-parallel) implementations of the classic algorithms used in the
SSDBM 2018 paper are (templated only for the aggregate precision; no guard
against len=0):

template <typename T = double>
double youngs_cramer_var(const double* const data, const size_t len) {
    T sum = data[0], var = 0;
    for(size_t i=1; i<len;) {
        const size_t oldi = i;
        const double x = data[i++];
        sum += x;
        double tmp = i * x - sum;
        var += tmp * tmp / (i * (double) oldi);
    }
    return var / len;
}

Close to Welford's original article:

template <typename T = double>
double welford_var(const double* const data, const size_t len) {
    T mean = data[0], var = 0;
    for(size_t i=1; i<len;) {
        const size_t oldi = i;
        const double x = data[i++];
        const double delta = x - mean;
        mean = mean * oldi / (double) i + x / i;
        var += oldi / (double) i * delta * delta;
    }
    return var / len;
}

From Knuth's book, attributed to Welford (but using fewer divisions),
likely the most widely known incremental version:

template <typename T = double>
double knuth_var(const double* const data, const size_t len) {
    T mean = data[0], xx = 0;
    for(size_t i=1; i<len;) {
        const double v = data[i++];
        const double delta = v - mean;
        mean += delta / i;
        xx += delta * (v - mean);
    }
    return xx / len;
}

In both Welford and Knuth, the "mean" aggregate depends on the division;
with YC it does not, so I recommend YC.

If Postgres could use parallel aggregation, let me know. I can assist with
the proper aggregation of several partitions (both distributed and
multithreaded). Using multiple partitions can also slightly improve
precision; but it is mostly interesting for processing multiple chunks in
parallel.

If I have some spare time to clean it up some more, I intend to
open-source the entire experimental code from SSDBM18 for reproducibility.

Regards,
Erich
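As a concrete illustration of the partition merging offered above, here is
a sketch of combining two partial states (count, sum, and Sxx =
sum((x-mean)²)) in the same style as the snippets above. The merge formula
is the standard pairwise one; the names VarPartial and combine are made up
for illustration and are not from the SSDBM18 code or PostgreSQL:

// Partial aggregate state for one partition of the data.
struct VarPartial {
    double n   = 0;  // count (double, to mirror float8 aggregate states)
    double sx  = 0;  // sum of x
    double sxx = 0;  // sum of squared deviations from the partition mean
};

// Merge two partitions' states. Mathematically this adds
// n1*n2/(n1+n2) * (mean1 - mean2)^2 to the sum of the two Sxx values.
VarPartial combine(const VarPartial &a, const VarPartial &b) {
    if (a.n == 0) return b;   // empty partitions merge trivially
    if (b.n == 0) return a;
    VarPartial r;
    r.n  = a.n + b.n;
    r.sx = a.sx + b.sx;
    const double tmp = a.sx / a.n - b.sx / b.n;   // difference of the means
    r.sxx = a.sxx + b.sxx + a.n * b.n * tmp * tmp / r.n;
    return r;
}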
On 7 August 2018 at 12:58, Erich Schubert <erich@debian.org> wrote:
> I am surprised that this is possible to do with arbitrary precision
> here, since for example with 8 data points it will eventually involve a
> division by 7, which cannot be represented even with arbitrary (finite)
> precision floating point. But maybe that division just comes late enough
> (after the differences) that it works before being converted to a float
> for the division.

Right, the division comes at the end in that case.

> Instead of Welford, I recommend using the Youngs-Cramer approach. It is
> almost the same; roughly, it aggregates sum(X) and sum((X-meanX)²)
> directly (whereas Welford updates mean(X) and sum((X-meanX)²)), so the
> first aggregate is actually unchanged from the current approach.
>
> In my experiments, this was surprisingly much faster when the data was a
> double[]; this is explainable by CPU pipelining (see the figure in the
> paper I linked - the division comes late, and the next step does not
> depend on the division, so the pipeline can already process the next
> double without waiting for the division to finish).
>
> Youngs & Cramer original publication:
> https://www.jstor.org/stable/pdf/1267176.pdf

OK, thanks for the reference. That was very interesting. It was actually
the Knuth variant of the Welford algorithm that I was playing with, but I
have now tried out the Youngs-Cramer algorithm as well.

In terms of performance, the YC algorithm was maybe 2-3% slower than
Welford, which was roughly on a par with the current schoolbook algorithm
in PostgreSQL, but there was some variability in those results, depending
on the exact query in question. One thing to bear in mind is that the kind
of micro-optimisations that make a huge difference when processing
in-memory data in tight loops, as in your examples, are not nearly as
important in the PostgreSQL environment, where there is a significant
per-tuple overhead in fetching data, depending on the enclosing SQL query.
So the odd cycle here and there from a division isn't going to matter much
in this context.

Both algorithms eliminate the main loss-of-precision problem that the
current schoolbook algorithm suffers from, which I think has to be the
primary aim of this. The YC paper suggests that the results from their
algorithm might be slightly more accurate (although maybe not enough that
we really care). What I do like about the YC algorithm is that it
guarantees the same result from avg(), whereas the Welford algorithm will
change avg() slightly, possibly making it very slightly less accurate. So
I concur, the YC algorithm is probably preferable. For the record,
attached are both versions that I tried.

> If Postgres could use parallel aggregation, let me know. I can assist
> with the proper aggregation of several partitions (both distributed and
> multithreaded). Using multiple partitions can also slightly improve
> precision; but it is mostly interesting for processing multiple chunks
> in parallel.

PostgreSQL isn't multithreaded, but it will parallelise large aggregates
by distributing the computation over a small number of worker backend
processes, and then combining the results. I did a quick
back-of-the-envelope calculation to see how to combine the results, and in
testing it seemed to work correctly, but it would be worth having a second
pair of eyes to validate that.

I'll add this to the next commitfest for consideration.

Regards,
Dean
Attachment
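To show how the single-pass idea extends to the 2-argument aggregates
discussed in this thread, below is a sketch of a Youngs-Cramer-style
sample covariance over two arrays, in the same style as the earlier
variance snippets. This is illustrative only, not the patch attached
above; the function name youngs_cramer_covar_samp is made up:

#include <cstddef>

// Single-pass sample covariance. State per pair of columns: running sums
// of x and y, plus Sxy = sum((x - meanX) * (y - meanY)).
// No guard against len < 2, in the spirit of the earlier snippets.
double youngs_cramer_covar_samp(const double *x, const double *y, const size_t len) {
    double sx = x[0], sy = y[0], sxy = 0;
    for (size_t i = 1; i < len;) {
        const size_t oldi = i;       // number of values already accumulated
        const double xv = x[i], yv = y[i];
        i++;                         // i is now the new count n
        sx += xv;
        sy += yv;
        // Same shape as the variance update:
        // (n*x - sumX) * (n*y - sumY) / (n * (n-1))
        const double tx = i * xv - sx;
        const double ty = i * yv - sy;
        sxy += tx * ty / (i * (double) oldi);
    }
    return sxy / (len - 1);          // sample covariance
}

Applied to two identical columns, this reduces to the variance computation
shown earlier, which is why COVAR_SAMP(x, x) and VAR_SAMP(x::float8)
should then agree.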
On 9 August 2018 at 12:02, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> ... the YC algorithm is probably preferable. For the record,
> attached are both versions that I tried.

Here is an updated, more complete patch, based on the YC algorithm, with
updated regression tests for the September commitfest.

All the existing tests pass unchanged, although I'm somewhat surprised
that the current tests pass with no platform variations. I've added new
tests to cover infinity/NaN handling and parallel aggregation, and to
confirm the improved accuracy with large offsets. The latter tests operate
well within the limits of double precision arithmetic, so I wouldn't
expect any platform variation, but that's difficult to guarantee. If there
are problems, it may be necessary to round the test results.

Notable changes from the previous patch:

I have rewritten the overflow checks in the accum functions to be clearer
and more efficient, which, if anything, makes these aggregates now
slightly faster than HEAD. More importantly though, I've added explicit
code to force Sxx to be NaN if any input is infinite, which the previous
coding didn't guarantee. I think NaN is the right result for quantities
like variance if any input value is infinite, since it logically involves
'infinity minus infinity'. That's also consistent with the current
behaviour.

I have also made the aggregate combine functions SQL-callable to make
testing easier -- there was a bug in the previous version, due to a typo,
which meant that float8_regr_combine() was incorrect when N1 was non-zero
and N2 was zero. That situation is unlikely to happen in practice, and is
difficult to provoke deliberately without manually calling the combine
function, which is why I didn't spot it before. The new tests cover all
code branches, and make it easier to see that the combine functions are
producing the correct results.

Regards,
Dean
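For illustration only, a sketch of the kind of per-value check described
above (forcing Sxx to NaN when any input is infinite). This is not the
patch code; the function name accum_with_inf_check and the bare
(n, sx, sxx) state are assumptions for the example:

#include <cmath>

// If any input is infinite, a variance-like quantity logically involves
// "infinity minus infinity", so force the running Sxx to NaN explicitly
// rather than relying on whatever the accumulation happens to produce.
inline void accum_with_inf_check(double x, double &n, double &sx, double &sxx) {
    n += 1.0;
    if (std::isinf(x) || std::isinf(sx)) {
        sx += x;          // sum becomes +/-Inf (or NaN if signs conflict)
        sxx = NAN;        // variance-like quantities are explicitly undefined
        return;
    }
    sx += x;
    if (n > 1.0) {
        const double tmp = n * x - sx;      // Youngs-Cramer style update
        sxx += tmp * tmp / (n * (n - 1.0));
    }
}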