
From Erich Schubert
Subject Re: BUG #15307: Low numerical precision of (Co-) Variance
Date
Msg-id c95877fb-4441-5c2a-7398-1e5305c32656@vitavonni.de
In response to Re: BUG #15307: Low numerical precision of (Co-) Variance  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: BUG #15307: Low numerical precision of (Co-) Variance  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-bugs
Hi,

> For a number of those statistical aggregates, PostgreSQL provides 2
> implementations -- one implemented using double precision floating
> point arithmetic, which is much faster, but necessarily less accurate
> and possibly platform-dependent; and one implemented using the
> arbitrary precision numeric datatype, which will return much more
> accurate results. For any input datatypes other than floating point,
> you will automatically get the latter, which is what you're seeing
> with var_samp(x), when you're not explicitly casting the input to
> float8.
Ok, I am surprised that this is possible with arbitrary precision here:
with, for example, 8 data points, the computation eventually involves a
division by 7, whose result cannot be represented exactly even with
arbitrary (but finite) precision. But presumably that division comes late
enough (after the differences have been accumulated exactly) that only
the final division has to round.
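
To make that concrete (this is only my sketch of the idea, not
PostgreSQL's actual numeric aggregate code): if the accumulator keeps N,
sum(X) and sum(X*X) exactly, then the cancellation-prone subtraction
happens in exact arithmetic, and only the single final division rounds:

#include <cstddef>
#include <cstdint>

// Sample variance of integer inputs via exact accumulation.
// Assumes n >= 2 and no overflow of the 64-bit sums.
double var_samp_exact(const int64_t* data, const size_t n) {
    int64_t sum = 0, sumsq = 0;            // exact, no rounding here
    for (size_t i = 0; i < n; ++i) {
        sum   += data[i];
        sumsq += data[i] * data[i];
    }
    const int64_t N = (int64_t) n;
    // The subtraction below is exact; only this one division rounds.
    return (N * sumsq - sum * sum) / ((double) N * (N - 1));
}

With plain double accumulators, that same subtraction is where the
float8 aggregates lose their digits -- which is the problem reported here.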
> However, all the 2-argument aggregates such as corr() and
> covar_pop/samp() currently only have floating point implementations,
> and suffer from the problem you report, which I agree, is not great.
> If we can easily improve the accuracy of those aggregates, then I
> think it is worthwhile.
>
> Using a two pass approach isn't really an option, given the way that
> aggregates work in PostgreSQL, however, implementing Welford's
> algorithm looks to be quite straightforward. I had a quick play and I
> found that it fixed the accuracy problem with no noticeable
> performance penalty -- there are a few extra cycles in the accumulator
> functions, but compared to the overall per-tuple overhead, that
> appears to be negligible.
Instead of Welford, I recommend using the Youngs-Cramer approach. It is 
almost the same; roughly, it aggregates sum(X) and sum((X-meanX)²) 
directly (whereas Welford updates mean(X) and sum((X-meanX)²)). So the 
first aggregate is actually unchanged from the current approach.
In my experiments this was, surprisingly, much faster when the data was a 
double[]; this is explainable by CPU pipelining (see the figure in the 
paper I linked: the division comes late, and the next step does not 
depend on its result, so the pipeline can already start on the next 
double without waiting for the division to finish).

Youngs & Cramer original publication: 
https://www.jstor.org/stable/pdf/1267176.pdf

The (non-parallel) implementations of the classic algorithms used in the 
SSDBM 2018 paper are below (templated only for the aggregate precision; 
there is no guard against len=0):

#include <cstddef>    // for size_t

// Youngs-Cramer: aggregate sum(X) and the squared deviations directly.
template <typename T = double>
double youngs_cramer_var(const double* const data, const size_t len) {
    T sum = data[0], var = 0;
    for(size_t i=1; i<len;) {
        const size_t oldi = i;
        const double x = data[i++];    // after the increment, i values have been seen
        sum += x;
        const double tmp = i * x - sum;
        var += tmp * tmp / (i * (double) oldi);
    }
    return var / len;    // population variance
}

Close to Welford's original article:

// Welford: update the running mean and the sum of squared deviations.
template <typename T = double>
double welford_var(const double* const data, const size_t len) {
    T mean = data[0], var = 0;
    for(size_t i=1; i<len; ) {
        const size_t oldi = i;
        const double x = data[i++];
        const double delta = x - mean;
        mean = mean * oldi / (double) i + x / i;    // the mean update waits on divisions
        var += oldi / (double) i * delta * delta;
    }
    return var / len;
}

From Knuth's book, attributed to Welford (but using fewer divisions), 
likely the most widely known incremental version:

// Knuth's formulation of Welford: a single division per value.
template <typename T = double>
double knuth_var(const double* const data, const size_t len) {
    T mean = data[0], xx = 0;
    for(size_t i=1; i<len;) {
        const double v = data[i++];
        const double delta = v - mean;
        mean += delta / i;             // the mean update still depends on a division
        xx += delta * (v - mean);
    }
    return xx / len;
}

In both Welford and Knuth, the "mean" aggregate depends on the division; 
with YC it does not, so I recommend YC.
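
For reference, a small driver of the kind I use to compare these (the 
input values are made up here, just to reproduce the large-offset 
cancellation from the bug report; the three functions above are assumed 
to be in the same file):

#include <cstddef>
#include <cstdio>
#include <vector>

// Naive single-pass textbook formula in double, for contrast only:
// this is the kind of computation that collapses with a large common offset.
double naive_var(const double* data, size_t len) {
    double sum = 0, sumsq = 0;
    for (size_t i = 0; i < len; ++i) {
        sum   += data[i];
        sumsq += data[i] * data[i];
    }
    return (sumsq - sum * sum / len) / len;
}

int main() {
    // 1000 values around 1e9 with a tiny spread; the true population
    // variance is ~8.25e-6, far below the rounding error of sum(x*x).
    std::vector<double> data;
    for (int i = 0; i < 1000; ++i)
        data.push_back(1e9 + (i % 10) * 0.001);

    std::printf("naive:         %.12g\n", naive_var(data.data(), data.size()));
    std::printf("Youngs-Cramer: %.12g\n", youngs_cramer_var(data.data(), data.size()));
    std::printf("Welford:       %.12g\n", welford_var(data.data(), data.size()));
    std::printf("Knuth:         %.12g\n", knuth_var(data.data(), data.size()));
    return 0;
}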

If Postgres could use parallel aggregation, let me know; I can assist 
with properly combining the partial aggregates of several partitions 
(both distributed and multithreaded; see the sketch below). Using 
multiple partitions can also slightly improve precision, but the main 
benefit is processing multiple chunks in parallel.
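
A minimal sketch of the combine step I have in mind (my own formulation 
of the standard merge, in the style of Chan et al.; the struct and names 
are mine, not anything existing in Postgres):

#include <cstdint>

// Partial aggregate state of one partition/chunk:
// count, sum of the values, and sum of squared deviations from the chunk mean.
struct VarState {
    uint64_t n;
    double sum;
    double sqdev;
};

// Merge two partial states. The correction term accounts for the
// difference between the two partition means.
VarState combine(const VarState& a, const VarState& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    VarState r;
    r.n = a.n + b.n;
    r.sum = a.sum + b.sum;
    const double tmp = b.sum * a.n - a.sum * b.n;    // = a.n * b.n * (mean_b - mean_a)
    r.sqdev = a.sqdev + b.sqdev + tmp * tmp / ((double) a.n * b.n * r.n);
    return r;
}

// Finalize: population variance = sqdev / n, sample variance = sqdev / (n - 1).

Since the merge does not care which partition came first, partial states 
can be combined pairwise or in a tree, whichever fits the executor.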
If I find some spare time to clean it up further, I intend to 
open-source the entire experimental code from SSDBM18 for reproducibility.

Regards,
Erich

