Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

From Martin Kalcher
Subject Re: [PATCH] Introduce array_shuffle() and array_sample()
Date
Msg-id f5af7698-d1eb-bdf4-9969-82b18ab4f8cf@aboutsource.net
Whole thread Raw
In response to Re: [PATCH] Introduce array_shuffle() and array_sample()  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: [PATCH] Introduce array_shuffle() and array_sample()
List pgsql-hackers
Am 24.07.22 um 10:15 schrieb Fabien COELHO:
> 
> My 0,02€ about this patch:

Thank you for your feedback. I attached a patch, that addresses most of 
your points.

> Use (void) when declaring no parameters in headers or in functions.

Done.

> Should the exchange be skipped when i == k?

The additional branch is actually slower (on my machine, tested with an 
10M element int array) for 1D-arrays, which i assume is the most common 
case. Even with a 2D-array with a subarray size of 1M there is no 
difference in execution time. i == k should be relatively rare for large 
arrays, given a good prng.

> I do not see the point of having *only* inline functions in a c file. 
> The external functions should not be inlined?

Done.

> The random and array shuffling functions share a common state. I'm 
> wondering whether it should ot should not be so. It seems ok, but then 
> ISTM that the documentation suggests implicitely that setseed applies to 
> random() only, which is not the case anymore after the patch.

I really like the idea of a prng state owned by the user, that is used 
by all user facing random functions, but see that their might be 
concerns about this change. I would update the setseed() documentation, 
if this proposal is accepted. Do you think my patch should already 
include the documentation update?

> If more samples are required than the number of elements, it does not 
> error out. I'm wondering whether it should.

The second argument to array_sample() is treated like a LIMIT, rather 
than the actual number of elements. I updated the documentation. My 
use-case is: take max random items from an array of unknown size.

> Also, the sampling should not return its result in order when the number 
> of elements required is the full array, ISTM that it should be shuffled 
> there as well.

You are the second person asking for this. It's done. I am thinking 
about ditching array_sample() and replace it with a second signature for 
array_shuffle():

   array_shuffle(array anyarray) -> anyarray
   array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

> I must say that without significant knowledge of the array internal 
> implementation, the swap code looks pretty strange. ISTM that the code 
> would be clearer if pointers and array references style were not 
> intermixed.

Done. Went with pointers.

> Maybe you could add a test with a 3D array? Some sample with NULLs?

Done.

> Unrelated: I notice again that (postgre)SQL does not provide a way to 
> generate random integers. I do not see why not. Should we provide one?

Maybe. I think it is outside the scope of this patch.

Thank you, Martin
Attachment

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: fairywren hung in pg_basebackup tests
Next
From: Fabien COELHO
Date:
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()