Re: New function normal_rand_array function to contrib/tablefunc. - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: New function normal_rand_array function to contrib/tablefunc.
Date
Msg-id CAEZATCX_Rd24mS4ScEYV=cLKMoQEJX=+oq40kYbok8rw211W8g@mail.gmail.com
Whole thread Raw
In response to Re: New function normal_rand_array function to contrib/tablefunc.  (Jim Jones <jim.jones@uni-muenster.de>)
List pgsql-hackers
On Tue, 2 Jul 2024 at 11:18, Jim Jones <jim.jones@uni-muenster.de> wrote:
>
> When either minval or maxval exceeds int4 the function cannot be
> executed/found
>
> SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
>
> ERROR:  function normal_rand_array(integer, integer, integer, bigint)
> does not exist
> LINE 1: SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
>                       ^
> HINT:  No function matches the given name and argument types. You might
> need to add explicit type casts.
>

This could be solved by defining separate functions for each supported
type, rather than one function with type anyelement. Internally, they
could continue to share common code, of course.

> In some cases the function returns an empty array. Is it also expected?
>

Perhaps it would be useful to have separate minimum and maximum array
length arguments, rather than a mean array length argument.

Actually, I find the current behaviour somewhat counterintuitive. Only
after reading the source code did I realise what it's actually doing,
which is this:

Row 1: array of random length in range [0, meanarraylen]
Row 2: array of length 2*meanarraylen - length of previous array
Row 3: array of random length in range [0, meanarraylen]
Row 4: array of length 2*meanarraylen - length of previous array
...

That's far from obvious (it's certainly not documented) and I don't
think it's a particularly good way of achieving a specified mean array
length, because only half the lengths are random.

One thing that's definitely needed is more documentation. It should
have its own new subsection, like the other tablefunc functions.

Something else that confused me is why this function is called
normal_rand_array(). The underlying random functions that it's calling
return random values from a uniform distribution, not a normal
distribution. Arrays of normally distributed random values might also
be useful, but that's not what this patch is doing.

Also, the function accepts float8 minval and maxval arguments, and
then simply ignores them and returns random float8 values in the range
[0,1), which is highly counterintuitive.

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
    returns setof float8[]

2). rand_array(numvals int, minlen int, maxlen int,
               minval int, maxval int)
    returns setof int[]

3). rand_array(numvals int, minlen int, maxlen int,
               minval bigint, maxval bigint)
    returns setof bigint[]

4). rand_array(numvals int, minlen int, maxlen int,
               minval numeric, maxval numeric)
    returns setof numeric[]

Also, I'd recommend giving the function arguments names in SQL, like
this, since that makes them easier to use.

Something else that's not obvious is that this patch is relying on the
core random functions, which means that it's using the same PRNG state
as those functions. That's probably OK, but it should be documented,
because it's different from tablefunc's normal_rand() function, which
uses pg_global_prng_state. In particular, what this means is that
calling setseed() will affect the output of these new functions, but
not normal_rand(). Incidentally, that makes writing tests much simpler
-- just call setseed() first and the output will be repeatable.

Regards,
Dean



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Parent/child context relation in pg_get_backend_memory_contexts()
Next
From: Stepan Neretin
Date:
Subject: Re: Sort functions with specialized comparators