Home > mailing lists

Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

From	Fabien COELHO
Subject	Re: [PATCH] Introduce array_shuffle() and array_sample()
Date	July 24, 2022 19:42:48
Msg-id	alpine.DEB.2.22.394.2207242132460.1359119@pseudo Whole thread
In response to	Re: [PATCH] Introduce array_shuffle() and array_sample() (Martin Kalcher <martin.kalcher@aboutsource.net>)
Responses	Re: [PATCH] Introduce array_shuffle() and array_sample()
List	pgsql-hackers

Tree view

Hello,

> Thank you for your feedback. I attached a patch, that addresses most of your 
> points.

I'll look into it. It would help if the patch could include a version 
number at the end.

>> Should the exchange be skipped when i == k?
>
> The additional branch is actually slower (on my machine, tested with an 10M 
> element int array) for 1D-arrays, which i assume is the most common case. 
> Even with a 2D-array with a subarray size of 1M there is no difference in 
> execution time. i == k should be relatively rare for large arrays, given a 
> good prng.

Ok, slower is bad.

>> The random and array shuffling functions share a common state. I'm 
>> wondering whether it should ot should not be so. It seems ok, but then ISTM 
>> that the documentation suggests implicitely that setseed applies to 
>> random() only, which is not the case anymore after the patch.
>
> I really like the idea of a prng state owned by the user, that is used by all 
> user facing random functions, but see that their might be concerns about this 
> change. I would update the setseed() documentation, if this proposal is 
> accepted. Do you think my patch should already include the documentation 
> update?

Duno. I'm still wondering what it should do. I'm pretty sure that the 
documentation should be clear about a shared seed, if any. I do not think 
that departing from the standard is a good thing, either.

>> If more samples are required than the number of elements, it does not error 
>> out. I'm wondering whether it should.
>
> The second argument to array_sample() is treated like a LIMIT, rather than 
> the actual number of elements. I updated the documentation. My use-case is: 
> take max random items from an array of unknown size.

Hmmm. ISTM that the use case of wanting "this many" stuff would make more 
sense because it is strictier so less likely to result in unforseen 
consequences. On principle I do not like not knowing the output size.

If someone wants a limit, they can easily "LEAST(#1 dim size, other 
limit)" to get it, it is easy enough with a strict function.

>> Also, the sampling should not return its result in order when the number of 
>> elements required is the full array, ISTM that it should be shuffled there 
>> as well.
>
> You are the second person asking for this. It's done. I am thinking about 
> ditching array_sample() and replace it with a second signature for 
> array_shuffle():
>
>  array_shuffle(array anyarray) -> anyarray
>  array_shuffle(array anyarray, limit int) -> anyarray
>
> What do you think?

I think that shuffle means shuffle, not partial shuffle, so a different 
name seems better to me.

-- 
Fabien.

pgsql-hackers by date:

From: Martin Kalcher
Date: 24 July 2022, 19:29:39
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()

From: Pavel Stehule
Date: 24 July 2022, 20:23:39
Subject: Re: psql - factor out echo code

Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

Previous

Next