
From Erich Schubert
Subject Re: BUG #15307: Low numerical precision of (Co-) Variance
Date
Msg-id c95877fb-4441-5c2a-7398-1e5305c32656@vitavonni.de
In response to Re: BUG #15307: Low numerical precision of (Co-) Variance  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: BUG #15307: Low numerical precision of (Co-) Variance  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-bugs
Hi,

> For a number of those statistical aggregates, PostgreSQL provides 2
> implementations -- one implemented using double precision floating
> point arithmetic, which is much faster, but necessarily less accurate
> and possibly platform-dependent; and one implemented using the
> arbitrary precision numeric datatype, which will return much more
> accurate results. For any input datatypes other than floating point,
> you will automatically get the latter, which is what you're seeing
> with var_samp(x), when you're not explicitly casting the input to
> float8.
Ok, I am surprised that this is possible with arbitrary precision here:
with, for example, 8 data points, the computation eventually involves a
division by 7, whose result cannot be represented exactly even with
arbitrary (but finite) precision. But presumably that division comes late
enough (after the differences have been accumulated exactly) that only
the final division has to round.
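
To make that concrete (this is only my sketch of the idea, not
PostgreSQL's actual numeric aggregate code): if the accumulator keeps N,
sum(X) and sum(X*X) exactly, then the cancellation-prone subtraction
happens in exact arithmetic, and only the single final division rounds:

#include <cstddef>
#include <cstdint>

// Sample variance of integer inputs via exact accumulation.
// Assumes n >= 2 and no overflow of the 64-bit sums.
double var_samp_exact(const int64_t* data, const size_t n) {
    int64_t sum = 0, sumsq = 0;            // exact, no rounding here
    for (size_t i = 0; i < n; ++i) {
        sum   += data[i];
        sumsq += data[i] * data[i];
    }
    const int64_t N = (int64_t) n;
    // The subtraction below is exact; only this one division rounds.
    return (N * sumsq - sum * sum) / ((double) N * (N - 1));
}

With plain double accumulators, that same subtraction is where the
float8 aggregates lose their digits -- which is the problem reported here.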
> However, all the 2-argument aggregates such as corr() and
> covar_pop/samp() currently only have floating point implementations,
> and suffer from the problem you report, which I agree, is not great.
> If we can easily improve the accuracy of those aggregates, then I
> think it is worthwhile.
>
> Using a two pass approach isn't really an option, given the way that
> aggregates work in PostgreSQL, however, implementing Welford's
> algorithm looks to be quite straightforward. I had a quick play and I
> found that it fixed the accuracy problem with no noticeable
> performance penalty -- there are a few extra cycles in the accumulator
> functions, but compared to the overall per-tuple overhead, that
> appears to be negligible.
Instead of Welford, I recommend using the Youngs-Cramer approach. It is 
almost the same; roughly, it aggregates sum(X) and sum((X-meanX)²) 
directly (whereas Welford updates mean(X) and sum((X-meanX)²)). So the 
first aggregate is actually unchanged from the current approach.
In my experiments this was, surprisingly, much faster when the data was a 
double[]; this is explainable by CPU pipelining (see the figure in the 
paper I linked: the division comes late, and the next step does not 
depend on its result, so the pipeline can already start on the next 
double without waiting for the division to finish).

Youngs & Cramer original publication: 
https://www.jstor.org/stable/pdf/1267176.pdf

The (non-parallel) implementations of the classic algorithms used in the 
SSDBM 2018 paper are below (templated only for the aggregate precision; 
there is no guard against len=0):

#include <cstddef>    // for size_t

// Youngs-Cramer: aggregate sum(X) and the squared deviations directly.
template <typename T = double>
double youngs_cramer_var(const double* const data, const size_t len) {
    T sum = data[0], var = 0;
    for(size_t i=1; i<len;) {
        const size_t oldi = i;
        const double x = data[i++];    // after the increment, i values have been seen
        sum += x;
        const double tmp = i * x - sum;
        var += tmp * tmp / (i * (double) oldi);
    }
    return var / len;    // population variance
}

Close to Welford's original article:

// Welford: update the running mean and the sum of squared deviations.
template <typename T = double>
double welford_var(const double* const data, const size_t len) {
    T mean = data[0], var = 0;
    for(size_t i=1; i<len; ) {
        const size_t oldi = i;
        const double x = data[i++];
        const double delta = x - mean;
        mean = mean * oldi / (double) i + x / i;    // the mean update waits on divisions
        var += oldi / (double) i * delta * delta;
    }
    return var / len;
}

From Knuth's book, attributed to Welford (but using fewer divisions), 
likely the most widely known incremental version:

// Knuth's formulation of Welford: a single division per value.
template <typename T = double>
double knuth_var(const double* const data, const size_t len) {
    T mean = data[0], xx = 0;
    for(size_t i=1; i<len;) {
        const double v = data[i++];
        const double delta = v - mean;
        mean += delta / i;             // the mean update still depends on a division
        xx += delta * (v - mean);
    }
    return xx / len;
}

In both Welford and Knuth, the "mean" aggregate depends on the division; 
with YC it does not, so I recommend YC.
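
For reference, a small driver of the kind I use to compare these (the 
input values are made up here, just to reproduce the large-offset 
cancellation from the bug report; the three functions above are assumed 
to be in the same file):

#include <cstddef>
#include <cstdio>
#include <vector>

// Naive single-pass textbook formula in double, for contrast only:
// this is the kind of computation that collapses with a large common offset.
double naive_var(const double* data, size_t len) {
    double sum = 0, sumsq = 0;
    for (size_t i = 0; i < len; ++i) {
        sum   += data[i];
        sumsq += data[i] * data[i];
    }
    return (sumsq - sum * sum / len) / len;
}

int main() {
    // 1000 values around 1e9 with a tiny spread; the true population
    // variance is ~8.25e-6, far below the rounding error of sum(x*x).
    std::vector<double> data;
    for (int i = 0; i < 1000; ++i)
        data.push_back(1e9 + (i % 10) * 0.001);

    std::printf("naive:         %.12g\n", naive_var(data.data(), data.size()));
    std::printf("Youngs-Cramer: %.12g\n", youngs_cramer_var(data.data(), data.size()));
    std::printf("Welford:       %.12g\n", welford_var(data.data(), data.size()));
    std::printf("Knuth:         %.12g\n", knuth_var(data.data(), data.size()));
    return 0;
}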

If Postgres could use parallel aggregation, let me know; I can assist 
with properly combining the partial aggregates of several partitions 
(both distributed and multithreaded; see the sketch below). Using 
multiple partitions can also slightly improve precision, but the main 
benefit is processing multiple chunks in parallel.
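
A minimal sketch of the combine step I have in mind (my own formulation 
of the standard merge, in the style of Chan et al.; the struct and names 
are mine, not anything existing in Postgres):

#include <cstdint>

// Partial aggregate state of one partition/chunk:
// count, sum of the values, and sum of squared deviations from the chunk mean.
struct VarState {
    uint64_t n;
    double sum;
    double sqdev;
};

// Merge two partial states. The correction term accounts for the
// difference between the two partition means.
VarState combine(const VarState& a, const VarState& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    VarState r;
    r.n = a.n + b.n;
    r.sum = a.sum + b.sum;
    const double tmp = b.sum * a.n - a.sum * b.n;    // = a.n * b.n * (mean_b - mean_a)
    r.sqdev = a.sqdev + b.sqdev + tmp * tmp / ((double) a.n * b.n * r.n);
    return r;
}

// Finalize: population variance = sqdev / n, sample variance = sqdev / (n - 1).

Since the merge does not care which partition came first, partial states 
can be combined pairwise or in a tree, whichever fits the executor.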
If I find some spare time to clean it up further, I intend to 
open-source the entire experimental code from SSDBM18 for reproducibility.

Regards,
Erich

