On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Poking at this, I soon found a test case where even with the separate
> sqrt() calls we'd produce a result slightly outside [-1, 1] (running
> this test over more values of x is sufficient). So now I think we
> should do both the separate sqrt and the clamp.
>
I'm starting to have doubts about having 2 sqrt() calls. The problem
is that it seems to produce a noticeable reduction in accuracy in
quite a few cases. This is especially noticeable with fully-correlated
data. For example:
SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x)
FROM generate_series(1, 10) g(n);
n | corr
----+--------------------
1 |
2 | 0.9999999999999998
3 | 0.9999999999999998
4 | 0.9999999999999998
5 | 0.9999999999999998
6 | 1
7 | 0.9999999999999999
8 | 1
9 | 0.9999999999999999
10 | 1
(10 rows)
Now I'm not sure that the current code can be expected to get cases
like this exactly right 100% of the time, but it's pretty close. For
example, if I do this:
WITH t1 AS (
SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n
), t2 AS (
SELECT corr(r, r) FROM t1 GROUP BY n % 10000
)
SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2;
on HEAD it produced corr = 1 every time I ran it, whereas the patch
gives rounding errors roughly 25% of the time, which seems likely to
be noticed.
Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows.
Regards,
Dean