Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [PATCH] Introduce array_shuffle() and array_sample()
Date
Msg-id CAEZATCUPcdxeRLmiFvuzZ=kKVpx8kOKTYDPY4oVv3hQBJYfsCQ@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Introduce array_shuffle() and array_sample()  (Martin Kalcher <martin.kalcher@aboutsource.net>)
Responses Re: [PATCH] Introduce array_shuffle() and array_sample()
Re: [PATCH] Introduce array_shuffle() and array_sample()
List pgsql-hackers
On Tue, 19 Jul 2022 at 21:21, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
>
> Here is a patch with dimension aware array_shuffle() and array_sample().
>

+1 for this feature, and this way of handling multi-dimensional arrays.

> If you think array_flatten() is desirable, i can take a look at it.

That's not something I've ever wanted -- personally, I rarely use
multi-dimensional arrays in Postgres.

A couple of quick comments on the current patch:

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?

Regards,
Dean



pgsql-hackers by date:

Previous
From: Junwang Zhao
Date:
Subject: add a missing space
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: i.e. and e.g.