Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: [PATCH] Introduce array_shuffle() and array_sample()
Date
Msg-id alpine.DEB.2.22.394.2207242132460.1359119@pseudo
Whole thread Raw
In response to Re: [PATCH] Introduce array_shuffle() and array_sample()  (Martin Kalcher <martin.kalcher@aboutsource.net>)
Responses Re: [PATCH] Introduce array_shuffle() and array_sample()
List pgsql-hackers
Hello,

> Thank you for your feedback. I attached a patch, that addresses most of your 
> points.

I'll look into it. It would help if the patch could include a version 
number at the end.

>> Should the exchange be skipped when i == k?
>
> The additional branch is actually slower (on my machine, tested with an 10M 
> element int array) for 1D-arrays, which i assume is the most common case. 
> Even with a 2D-array with a subarray size of 1M there is no difference in 
> execution time. i == k should be relatively rare for large arrays, given a 
> good prng.

Ok, slower is bad.

>> The random and array shuffling functions share a common state. I'm 
>> wondering whether it should ot should not be so. It seems ok, but then ISTM 
>> that the documentation suggests implicitely that setseed applies to 
>> random() only, which is not the case anymore after the patch.
>
> I really like the idea of a prng state owned by the user, that is used by all 
> user facing random functions, but see that their might be concerns about this 
> change. I would update the setseed() documentation, if this proposal is 
> accepted. Do you think my patch should already include the documentation 
> update?

Duno. I'm still wondering what it should do. I'm pretty sure that the 
documentation should be clear about a shared seed, if any. I do not think 
that departing from the standard is a good thing, either.

>> If more samples are required than the number of elements, it does not error 
>> out. I'm wondering whether it should.
>
> The second argument to array_sample() is treated like a LIMIT, rather than 
> the actual number of elements. I updated the documentation. My use-case is: 
> take max random items from an array of unknown size.

Hmmm. ISTM that the use case of wanting "this many" stuff would make more 
sense because it is strictier so less likely to result in unforseen 
consequences. On principle I do not like not knowing the output size.

If someone wants a limit, they can easily "LEAST(#1 dim size, other 
limit)" to get it, it is easy enough with a strict function.

>> Also, the sampling should not return its result in order when the number of 
>> elements required is the full array, ISTM that it should be shuffled there 
>> as well.
>
> You are the second person asking for this. It's done. I am thinking about 
> ditching array_sample() and replace it with a second signature for 
> array_shuffle():
>
>  array_shuffle(array anyarray) -> anyarray
>  array_shuffle(array anyarray, limit int) -> anyarray
>
> What do you think?

I think that shuffle means shuffle, not partial shuffle, so a different 
name seems better to me.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Martin Kalcher
Date:
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()
Next
From: Pavel Stehule
Date:
Subject: Re: psql - factor out echo code