Re: [PATCH] random_normal function - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: [PATCH] random_normal function
Date:
Msg-id: 67201.1673307519@sss.pgh.pa.us
In response to: Re: [PATCH] random_normal function  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: [PATCH] random_normal function
List: pgsql-hackers
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Mon, 9 Jan 2023 at 18:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (We could probably go further
>> than this, like trying to verify distribution properties.  But
>> it's been too long since college statistics for me to want to
>> write such tests myself, and I'm not real sure we need them.)

> I played around with the Kolmogorov–Smirnov test, to test random() for
> uniformity. The following passes roughly 99.9% of the time:

Ah, cool, thanks for this code.
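
(For the archives, my reading of the constant in it: the function
compares the sorted samples against the uniform CDF, roughly
D = max|i/n - r_i|, and rejects when D > c/sqrt(n).  Using the
asymptotic tail bound P(sqrt(n)*D > c) ~= 2*exp(-2*c^2), a
significance level of 0.001 gives c = sqrt(ln(2000)/2) ~= 1.94947,
which matches the critical value in the function and explains the
~99.9% pass rate.)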

> With a one-in-a-thousand chance of failing, if you wanted something
> with around a one-in-a-billion chance of failing, you could just try
> it 3 times:
> SELECT ks_test_uniform_random() OR
>        ks_test_uniform_random() OR
>        ks_test_uniform_random();
> but it feels pretty hacky, and probably not really necessary.

That seems like a good way to do it, because I'm not satisfied with
one-in-a-thousand odds if we want to remove the "ignore" marker.
It's still plenty fast enough: on my machine, the v2 patch below
takes about 19ms, versus 22ms for the script as it stands in HEAD.

> Rigorous tests for other distributions are harder, but also probably
> not necessary if we have confidence that the underlying PRNG is
> uniform.

Agreed.

>> BTW, if this does bring the probability of failure down to the
>> one-in-a-billion range, I think we could also nuke the whole
>> "ignore:" business, simplifying pg_regress and allowing the
>> random test to be run in parallel with others.

> I didn't check the one-in-a-billion claim, but +1 for that.
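
(Just to check it: assuming the three invocations are independent,
the combined failure probability is about 0.001^3 = 1e-9, which is
the level the new comments in random.sql aim for.)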

I realized that we do already run random in a parallel group;
the "ignore: random" line in parallel_schedule just marks it
as failure-ignorable; it doesn't schedule it.  (The comment is a
bit misleading about this, but I want to remove that, not rewrite it.)
Nonetheless, nuking the whole ignore-failures mechanism seems like
good cleanup to me.

Also, I tried this on some 32-bit big-endian hardware (NetBSD on macppc)
to verify my thesis that the results of random() are now machine
independent.  That part works, but the random_normal() tests didn't;
I saw low-order-bit differences from the results on x86_64 Linux.
Presumably, that's because one or more of sqrt()/log()/sin() are
rounding off a bit differently there.  v2 attached deals with this by
backing off to "extra_float_digits = 0" for that test.  Once it hits the
buildfarm we might find we have to reduce extra_float_digits some more,
but that was enough to make NetBSD/macppc happy.
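
If anyone wants to see the effect in isolation, here's a toy example
(not from the patch; the exact digits depend on the platform's float
output rules, but if I have them right, extra_float_digits = 0 falls
back to about 15 significant digits, while the default shortest-exact
output can use up to 17):

    SHOW extra_float_digits;      -- default is 1: shortest-exact output
    SELECT 2.0::float8 / 3;       -- 0.6666666666666666
    SET extra_float_digits = 0;   -- clamp to ~15 significant digits
    SELECT 2.0::float8 / 3;       -- 0.666666666666667
    RESET extra_float_digits;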

            regards, tom lane

diff --git a/src/test/regress/expected/random.out b/src/test/regress/expected/random.out
index 30bd866138..8785c88ad2 100644
--- a/src/test/regress/expected/random.out
+++ b/src/test/regress/expected/random.out
@@ -1,81 +1,146 @@
 --
 -- RANDOM
--- Test the random function
+-- Test random() and allies
 --
--- count the number of tuples originally, should be 1000
-SELECT count(*) FROM onek;
- count
--------
-  1000
-(1 row)
-
--- pick three random rows, they shouldn't match
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1)
-INTERSECT
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1)
-INTERSECT
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1);
- random
---------
+-- Tests in this file may have a small probability of failure,
+-- since we are dealing with randomness.  Try to keep the failure
+-- risk for any one test case under 1e-9.
+--
+-- There should be no duplicates in 1000 random() values.
+-- (Assuming 52 random bits in the float8 results, we could
+-- take as many as 3000 values and still have less than 1e-9 chance
+-- of failure, per https://en.wikipedia.org/wiki/Birthday_problem)
+SELECT r, count(*)
+FROM (SELECT random() r FROM generate_series(1, 1000)) ss
+GROUP BY r HAVING count(*) > 1;
+ r | count
+---+-------
 (0 rows)

--- count roughly 1/10 of the tuples
-CREATE TABLE RANDOM_TBL AS
-  SELECT count(*) AS random
-  FROM onek WHERE random() < 1.0/10;
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
--- now test that they are different counts
-SELECT random, count(random) FROM RANDOM_TBL
-  GROUP BY random HAVING count(random) > 3;
- random | count
---------+-------
-(0 rows)
+-- The range should be [0, 1).  We can expect that at least one out of 2000
+-- random values is in the lowest or highest 1% of the range with failure
+-- probability less than about 1e-9.
+SELECT count(*) FILTER (WHERE r < 0 OR r >= 1) AS out_of_range,
+       (count(*) FILTER (WHERE r < 0.01)) > 0 AS has_small,
+       (count(*) FILTER (WHERE r > 0.99)) > 0 AS has_large
+FROM (SELECT random() r FROM generate_series(1, 2000)) ss;
+ out_of_range | has_small | has_large
+--------------+-----------+-----------
+            0 | t         | t
+(1 row)

-SELECT AVG(random) FROM RANDOM_TBL
-  HAVING AVG(random) NOT BETWEEN 80 AND 120;
- avg
------
-(0 rows)
+-- Check for uniform distribution using the Kolmogorov–Smirnov test.
+CREATE FUNCTION ks_test_uniform_random()
+RETURNS boolean AS
+$$
+DECLARE
+  n int := 1000;        -- Number of samples
+  c float8 := 1.94947;  -- Critical value for 99.9% confidence
+  ok boolean;
+BEGIN
+  ok := (
+    WITH samples AS (
+      SELECT random() r FROM generate_series(1, n) ORDER BY 1
+    ), indexed_samples AS (
+      SELECT (row_number() OVER())-1.0 i, r FROM samples
+    )
+    SELECT max(abs(i/n-r)) < c / sqrt(n) FROM indexed_samples
+  );
+  RETURN ok;
+END
+$$
+LANGUAGE plpgsql;
+-- As written, ks_test_uniform_random() returns true about 99.9%
+-- of the time.  To get down to a roughly 1e-9 test failure rate,
+-- just run it 3 times and accept if any one of them passes.
+SELECT ks_test_uniform_random() OR
+       ks_test_uniform_random() OR
+       ks_test_uniform_random() AS uniform;
+ uniform
+---------
+ t
+(1 row)

 -- now test random_normal()
-TRUNCATE random_tbl;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(0, 1) < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(0) < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal() < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(stddev => 1, mean => 0) < 0;
--- expect similar, but not identical values
-SELECT random, count(random) FROM random_tbl
-  GROUP BY random HAVING count(random) > 3;
- random | count
---------+-------
+-- As above, there should be no duplicates in 1000 random_normal() values.
+SELECT r, count(*)
+FROM (SELECT random_normal() r FROM generate_series(1, 1000)) ss
+GROUP BY r HAVING count(*) > 1;
+ r | count
+---+-------
 (0 rows)

--- approximately check expected distribution
-SELECT AVG(random) FROM random_tbl
-  HAVING AVG(random) NOT BETWEEN 400 AND 600;
- avg
------
-(0 rows)
+-- ... unless we force the range (standard deviation) to zero.
+-- This is a good place to check that the mean input does something, too.
+SELECT r, count(*)
+FROM (SELECT random_normal(10, 0) r FROM generate_series(1, 100)) ss
+GROUP BY r;
+ r  | count
+----+-------
+ 10 |   100
+(1 row)
+
+SELECT r, count(*)
+FROM (SELECT random_normal(-10, 0) r FROM generate_series(1, 100)) ss
+GROUP BY r;
+  r  | count
+-----+-------
+ -10 |   100
+(1 row)
+
+-- setseed() should produce a reproducible series of random() values.
+SELECT setseed(0.5);
+ setseed
+---------
+
+(1 row)
+
+SELECT random() FROM generate_series(1, 10);
+       random
+---------------------
+  0.9851677175347999
+   0.825301858027981
+ 0.12974610012450416
+ 0.16356291958601088
+     0.6476186144084
+  0.8822771983038762
+  0.1404566845227775
+ 0.15619865764623442
+  0.5145227426983392
+  0.7712969548127826
+(10 rows)
+
+-- Likewise for random_normal(); however, since its implementation relies
+-- on libm functions that have different roundoff behaviors on different
+-- machines, we have to round off the results a bit to get consistent output.
+SET extra_float_digits = 0;
+SELECT random_normal() FROM generate_series(1, 10);
+   random_normal
+--------------------
+  0.208534644938377
+  0.264530240540963
+ -0.606752467900428
+  0.825799427852654
+   1.70111611735357
+ -0.223445463716189
+  0.249712419190998
+   -1.2494722990669
+  0.125627152043677
+  0.475391614544013
+(10 rows)
+
+SELECT random_normal(mean => 1, stddev => 0.1) r FROM generate_series(1, 10);
+         r
+-------------------
+  1.00605972811732
+  1.09685453015002
+  1.02869206132007
+ 0.909475676712336
+ 0.983724763134265
+ 0.939344549577623
+  1.18713500206363
+ 0.962257684292933
+ 0.914441206800407
+ 0.964031055575433
+(10 rows)

diff --git a/src/test/regress/sql/random.sql b/src/test/regress/sql/random.sql
index 3104af46b7..ce1cc37176 100644
--- a/src/test/regress/sql/random.sql
+++ b/src/test/regress/sql/random.sql
@@ -1,68 +1,85 @@
 --
 -- RANDOM
--- Test the random function
+-- Test random() and allies
+--
+-- Tests in this file may have a small probability of failure,
+-- since we are dealing with randomness.  Try to keep the failure
+-- risk for any one test case under 1e-9.
 --

--- count the number of tuples originally, should be 1000
-SELECT count(*) FROM onek;
-
--- pick three random rows, they shouldn't match
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1)
-INTERSECT
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1)
-INTERSECT
-(SELECT unique1 AS random
-  FROM onek ORDER BY random() LIMIT 1);
-
--- count roughly 1/10 of the tuples
-CREATE TABLE RANDOM_TBL AS
-  SELECT count(*) AS random
-  FROM onek WHERE random() < 1.0/10;
-
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
-
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
-
--- select again, the count should be different
-INSERT INTO RANDOM_TBL (random)
-  SELECT count(*)
-  FROM onek WHERE random() < 1.0/10;
-
--- now test that they are different counts
-SELECT random, count(random) FROM RANDOM_TBL
-  GROUP BY random HAVING count(random) > 3;
-
-SELECT AVG(random) FROM RANDOM_TBL
-  HAVING AVG(random) NOT BETWEEN 80 AND 120;
+-- There should be no duplicates in 1000 random() values.
+-- (Assuming 52 random bits in the float8 results, we could
+-- take as many as 3000 values and still have less than 1e-9 chance
+-- of failure, per https://en.wikipedia.org/wiki/Birthday_problem)
+SELECT r, count(*)
+FROM (SELECT random() r FROM generate_series(1, 1000)) ss
+GROUP BY r HAVING count(*) > 1;
+
+-- The range should be [0, 1).  We can expect that at least one out of 2000
+-- random values is in the lowest or highest 1% of the range with failure
+-- probability less than about 1e-9.
+
+SELECT count(*) FILTER (WHERE r < 0 OR r >= 1) AS out_of_range,
+       (count(*) FILTER (WHERE r < 0.01)) > 0 AS has_small,
+       (count(*) FILTER (WHERE r > 0.99)) > 0 AS has_large
+FROM (SELECT random() r FROM generate_series(1, 2000)) ss;
+
+-- Check for uniform distribution using the Kolmogorov–Smirnov test.
+
+CREATE FUNCTION ks_test_uniform_random()
+RETURNS boolean AS
+$$
+DECLARE
+  n int := 1000;        -- Number of samples
+  c float8 := 1.94947;  -- Critical value for 99.9% confidence
+  ok boolean;
+BEGIN
+  ok := (
+    WITH samples AS (
+      SELECT random() r FROM generate_series(1, n) ORDER BY 1
+    ), indexed_samples AS (
+      SELECT (row_number() OVER())-1.0 i, r FROM samples
+    )
+    SELECT max(abs(i/n-r)) < c / sqrt(n) FROM indexed_samples
+  );
+  RETURN ok;
+END
+$$
+LANGUAGE plpgsql;
+
+-- As written, ks_test_uniform_random() returns true about 99.9%
+-- of the time.  To get down to a roughly 1e-9 test failure rate,
+-- just run it 3 times and accept if any one of them passes.
+SELECT ks_test_uniform_random() OR
+       ks_test_uniform_random() OR
+       ks_test_uniform_random() AS uniform;

 -- now test random_normal()

-TRUNCATE random_tbl;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(0, 1) < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(0) < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal() < 0;
-INSERT INTO random_tbl (random)
-  SELECT count(*)
-  FROM onek WHERE random_normal(stddev => 1, mean => 0) < 0;
+-- As above, there should be no duplicates in 1000 random_normal() values.
+SELECT r, count(*)
+FROM (SELECT random_normal() r FROM generate_series(1, 1000)) ss
+GROUP BY r HAVING count(*) > 1;

--- expect similar, but not identical values
-SELECT random, count(random) FROM random_tbl
-  GROUP BY random HAVING count(random) > 3;
+-- ... unless we force the range (standard deviation) to zero.
+-- This is a good place to check that the mean input does something, too.
+SELECT r, count(*)
+FROM (SELECT random_normal(10, 0) r FROM generate_series(1, 100)) ss
+GROUP BY r;
+SELECT r, count(*)
+FROM (SELECT random_normal(-10, 0) r FROM generate_series(1, 100)) ss
+GROUP BY r;

--- approximately check expected distribution
-SELECT AVG(random) FROM random_tbl
-  HAVING AVG(random) NOT BETWEEN 400 AND 600;
+-- setseed() should produce a reproducible series of random() values.
+
+SELECT setseed(0.5);
+
+SELECT random() FROM generate_series(1, 10);
+
+-- Likewise for random_normal(); however, since its implementation relies
+-- on libm functions that have different roundoff behaviors on different
+-- machines, we have to round off the results a bit to get consistent output.
+SET extra_float_digits = 0;
+
+SELECT random_normal() FROM generate_series(1, 10);
+SELECT random_normal(mean => 1, stddev => 0.1) r FROM generate_series(1, 10);
