Thread: Re: Proposal to introduce a shuffle function to intarray extension
On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > I would like to see a function like this inside the intarray extension. > Is there any way to get to this point? How is the process to deal with > such proposals? Hi Martin, I'm redirecting this to the pgsql-hackers@ mailing list, where we talk about code. For the archives, Martin's initial message to -general was: https://www.postgresql.org/message-id/9d160a44-7675-51e8-60cf-6d64b76db831%40aboutsource.net The first question is whether such a thing belongs in an external extension, or in the contrib/intarray module. The latter seems like a reasonable thing to want to me. The first step towards that will be to get your code into .patch format, as in git format-patch or git diff, that can be applied to the master branch. https://wiki.postgresql.org/wiki/Submitting_a_Patch Some initial feedback from me: I'd recommend adding a couple of comments to the code, like the algorithm name for someone who wants to read more about it (I think it's a Fisher-Yates shuffle?). You'll need to have the contrib/intarrays/intarray--1.5--1.6.sql file that creates the function. You might want to add something to control/intarray/sql/_int.sql that invokes the function when you run make check in there (but doesn't display the results, since that'd be unstable across machines?), just to have 'code coverage' (I mean, it'd prove it doesn't crash at least). Once details are settled, you'd also want to add documentation in doc/src/sgml/intarray.sgml. I understand that this is a specialised int[] shuffle, but I wonder if someone would ever want to have a general array shuffle, and how that would work, in terms of naming convention etc.
Am 16.07.22 um 23:56 schrieb Thomas Munro: > On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher > <martin.kalcher@aboutsource.net> wrote: >> I would like to see a function like this inside the intarray extension. >> Is there any way to get to this point? How is the process to deal with >> such proposals? > > Hi Martin, > > I'm redirecting this to the pgsql-hackers@ mailing list, where we talk > about code. For the archives, Martin's initial message to -general > was: > > https://www.postgresql.org/message-id/9d160a44-7675-51e8-60cf-6d64b76db831%40aboutsource.net > > The first question is whether such a thing belongs in an external > extension, or in the contrib/intarray module. The latter seems like a > reasonable thing to want to me. The first step towards that will be > to get your code into .patch format, as in git format-patch or git > diff, that can be applied to the master branch. > > https://wiki.postgresql.org/wiki/Submitting_a_Patch > > Some initial feedback from me: I'd recommend adding a couple of > comments to the code, like the algorithm name for someone who wants to > read more about it (I think it's a Fisher-Yates shuffle?). You'll > need to have the contrib/intarrays/intarray--1.5--1.6.sql file that > creates the function. You might want to add something to > control/intarray/sql/_int.sql that invokes the function when you run > make check in there (but doesn't display the results, since that'd be > unstable across machines?), just to have 'code coverage' (I mean, it'd > prove it doesn't crash at least). Once details are settled, you'd > also want to add documentation in doc/src/sgml/intarray.sgml. I > understand that this is a specialised int[] shuffle, but I wonder if > someone would ever want to have a general array shuffle, and how that > would work, in terms of naming convention etc. Hello Thomas, Thank you for pointing me towards the correct list. I do not feel qualified to answer the question wether this belongs in an external extension or in contrib/intarray. That reason i would like to see it in contrib/intarray is, that it is lot easier for me to get our operations team to upgrade our database system, because of new features we need, than to get them to install a self-maintained extension on our database servers. Thank you for your feedback. I tried to address all your points and made a first patch. Some points are still open: - Documentation is postponed until further feedback. - I am not shure about the naming. intarray has generic names like sort() and uniq() and specialised names like icount(). I guess in case someone wants to have a general array shuffle it could be accomplished with function overloading. Or am i wrong here? - I added a second function sample(), because it is a lot faster to take some elements from an array than to shuffle the whole array and slice it. This function can be removed if it is not wanted. The important one is shuffle(). Martin
Attachment
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > Am 16.07.22 um 23:56 schrieb Thomas Munro: >> I understand that this is a specialised int[] shuffle, but I wonder if >> someone would ever want to have a general array shuffle, and how that >> would work, in terms of naming convention etc. > - I am not shure about the naming. intarray has generic names like > sort() and uniq() and specialised names like icount(). I guess in case > someone wants to have a general array shuffle it could be accomplished > with function overloading. Or am i wrong here? I suppose this is exactly the point Thomas was wondering about: if we use a generic function name for this, will it cause problems for someone trying to add a generic function later? We can investigate that question with a couple of toy functions: regression=# create function foo(int[]) returns text as 'select ''int[] version''' language sql; CREATE FUNCTION regression=# create function foo(anyarray) returns text as 'select ''anyarray version''' language sql; CREATE FUNCTION regression=# select foo('{1,2,3}'); ERROR: function foo(unknown) is not unique LINE 1: select foo('{1,2,3}'); ^ HINT: Could not choose a best candidate function. You might need to add explicit type casts. OK, that's not too surprising: with an unknown input there's just not anything that the parser can use to disambiguate. But this happens with just about any overloaded name, so I don't think it's a showstopper. regression=# select foo('{1,2,3}'::int[]); foo --------------- int[] version (1 row) regression=# select foo('{1,2,3}'::int8[]); foo ------------------ anyarray version (1 row) Good, that's more or less the minimum functionality we should expect. regression=# select foo('{1,2,3}'::int2[]); ERROR: function foo(smallint[]) is not unique LINE 1: select foo('{1,2,3}'::int2[]); ^ HINT: Could not choose a best candidate function. You might need to add explicit type casts. Oh, that's slightly annoying ... regression=# select foo('{1,2,3}'::oid[]); foo ------------------ anyarray version (1 row) regression=# select foo('{1,2,3}'::text[]); foo ------------------ anyarray version (1 row) regression=# select foo('{1,2,3}'::float8[]); foo ------------------ anyarray version (1 row) I couldn't readily find any case that misbehaves except for int2[]. You can force that to work, at least in one direction: regression=# select foo('{1,2,3}'::int2[]::int[]); foo --------------- int[] version (1 row) On the whole, I'd vote for calling it shuffle(), and expecting that we'd also use that name for any future generic version. That's clearly the easiest-to-remember definition, and it doesn't seem like the gotchas are bad enough to justify using separate names. > - I added a second function sample(), because it is a lot faster to take > some elements from an array than to shuffle the whole array and > slice it. This function can be removed if it is not wanted. I have no opinion about whether this one is valuable enough to include in intarray, but I do feel like sample() is a vague name, and easily confused with marginally-related operations like TABLESAMPLE. Can we think of a more on-point name? Something like "random_subset" would be pretty clear, but it's also clunky. It's too late here for me to think of le mot juste... regards, tom lane
On Sat, Jul 16, 2022 at 7:25 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote:
- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column? +----------
+ t
+(1 row)
+
While small, there is a non-zero chance for both samples to be equal. This test should probably just go, I don't see what it tests that isn't covered by other tests or even trivial usage.
Same goes for:
+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') != shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column?
+----------
+ t
+(1 row)
+
+ ?column?
+----------
+ t
+(1 row)
+
David J.
On Sat, Jul 16, 2022 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> - I added a second function sample(), because it is a lot faster to take
> some elements from an array than to shuffle the whole array and
> slice it. This function can be removed if it is not wanted.
I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...
choose(input anyarray, size integer, with_replacement boolean default false, algorithm text default 'default')?
David J.
I wrote: > On the whole, I'd vote for calling it shuffle(), and expecting that > we'd also use that name for any future generic version. Actually ... is there a reason to bother with an intarray version at all, rather than going straight for an in-core anyarray function? It's not obvious to me that an int4-only version would have major performance advantages. regards, tom lane
On Sun, Jul 17, 2022 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > On the whole, I'd vote for calling it shuffle(), and expecting that > > we'd also use that name for any future generic version. > > Actually ... is there a reason to bother with an intarray version > at all, rather than going straight for an in-core anyarray function? > It's not obvious to me that an int4-only version would have > major performance advantages. Yeah, that seems like a good direction. If there is a performance advantage to specialising, then perhaps we only have to specialise on size, not type. Perhaps there could be a general function that internally looks out for typbyval && typlen == 4, and dispatches to a specialised 4-byte, and likewise for 8, if it can, and that'd already be enough to cover int, bigint, float etc, without needing specialisations for each type. I went to see what Professor Lemire would have to say about all this, expecting to find a SIMD rabbit hole to fall down for some Sunday evening reading, but the main thing that jumped out was this article about the modulo operation required by textbook Fisher-Yates to get a bounded random number, the random() % n that appears in the patch. He talks about shuffling twice as fast by using a no-division trick to get bounded random numbers[1]. I guess you might need to use our pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might introduce bias. Anyway, file that under go-faster ideas for later. [1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/
Am 17.07.22 um 05:37 schrieb Tom Lane: > > Actually ... is there a reason to bother with an intarray version > at all, rather than going straight for an in-core anyarray function? > It's not obvious to me that an int4-only version would have > major performance advantages. > > regards, tom lane Hi Tom, thank you for your thoughts. There are two reasons for choosing an int4-only version. I am not familiar with postgres development (yet) and i was not sure how open you are about such changes to core and if the proposed feature is considered valuable enough to go into core. The second reason was ease of implementation. The intarray extension does not allow any NULL elements in arrays and treats multidimensional arrays as though they were linear. Which makes the implementation straight forward, because there are fewer cases to consider. However, i will take a look at an implementation for anyarray in core. Martin
Am 17.07.22 um 08:00 schrieb Thomas Munro: > > I went to see what Professor Lemire would have to say about all this, > expecting to find a SIMD rabbit hole to fall down for some Sunday > evening reading, but the main thing that jumped out was this article > about the modulo operation required by textbook Fisher-Yates to get a > bounded random number, the random() % n that appears in the patch. He > talks about shuffling twice as fast by using a no-division trick to > get bounded random numbers[1]. I guess you might need to use our > pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might > introduce bias. Anyway, file that under go-faster ideas for later. > > [1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/ Hi Thomas, the small bias of random() % n is not a problem for my use case, but might be for others. Its easily replaceable with (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1) Unfortunately it is a bit slower (on my machine), but thats negligible. Martin
Am 17.07.22 um 05:32 schrieb David G. Johnston: > > +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != > sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6); > + ?column? > +---------- > + t > +(1 row) > + > > While small, there is a non-zero chance for both samples to be equal. This > test should probably just go, I don't see what it tests that isn't covered > by other tests or even trivial usage. > Hey David, you are right. There is a small chance for this test to fail. I wanted to test, that two invocations produce different results (after all the main feature of the function). But it can probably go. Martin
Am 17.07.22 um 08:00 schrieb Thomas Munro: >> Actually ... is there a reason to bother with an intarray version >> at all, rather than going straight for an in-core anyarray function? >> It's not obvious to me that an int4-only version would have >> major performance advantages. > > Yeah, that seems like a good direction. If there is a performance > advantage to specialising, then perhaps we only have to specialise on > size, not type. Perhaps there could be a general function that > internally looks out for typbyval && typlen == 4, and dispatches to a > specialised 4-byte, and likewise for 8, if it can, and that'd already > be enough to cover int, bigint, float etc, without needing > specialisations for each type. I played around with the idea of an anyarray shuffle(). The hard part was to deal with arrays with variable length elements, as they can not be swapped easily in place. I solved it by creating an intermediate array of references to the elements. I'll attach a patch with the proof of concept. Unfortunatly it is already about 5 times slower than the specialised version and i am not sure if it is worth going down that road. Martin
Attachment
On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > Am 17.07.22 um 08:00 schrieb Thomas Munro: > >> Actually ... is there a reason to bother with an intarray version > >> at all, rather than going straight for an in-core anyarray function? > >> It's not obvious to me that an int4-only version would have > >> major performance advantages. > > > > Yeah, that seems like a good direction. If there is a performance > > advantage to specialising, then perhaps we only have to specialise on > > size, not type. Perhaps there could be a general function that > > internally looks out for typbyval && typlen == 4, and dispatches to a > > specialised 4-byte, and likewise for 8, if it can, and that'd already > > be enough to cover int, bigint, float etc, without needing > > specialisations for each type. > > I played around with the idea of an anyarray shuffle(). The hard part > was to deal with arrays with variable length elements, as they can not > be swapped easily in place. I solved it by creating an intermediate > array of references to the elements. I'll attach a patch with the proof > of concept. Unfortunatly it is already about 5 times slower than the > specialised version and i am not sure if it is worth going down that road. Seems OK for a worst case. It must still be a lot faster than doing it in SQL. Now I wonder what the exact requirements would be to dispatch to a faster version that would handle int4. I haven't studied this in detail but perhaps to dispatch to a fast shuffle for objects of size X, the requirement would be something like typlen == X && align_bytes <= typlen && typlen % align_bytes == 0, where align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}? Or in English, 'the data consists of densely packed objects of fixed size X, no padding'. Or perhaps you can work out the padded size and use that, to catch a few more types. Then you call array_shuffle_{2,4,8}() as appropriate, which should be as fast as your original int[] proposal, but work also for float, date, ...? About your experimental patch, I haven't reviewed it properly or tried it but I wonder if uint32 dat_offset, uint32 size (= half size elements) would be enough due to limitations on varlenas.
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > Am 17.07.22 um 08:00 schrieb Thomas Munro: >>> Actually ... is there a reason to bother with an intarray version >>> at all, rather than going straight for an in-core anyarray function? > I played around with the idea of an anyarray shuffle(). The hard part > was to deal with arrays with variable length elements, as they can not > be swapped easily in place. I solved it by creating an intermediate > array of references to the elements. I'll attach a patch with the proof > of concept. This does not look particularly idiomatic, or even type-safe. What you should have done was use deconstruct_array to get an array of Datums and isnull flags, then shuffled those, then used construct_array to build the output. (Or, perhaps, use construct_md_array to replicate the input's precise dimensionality. Not sure if anyone would care.) regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes: > Seems OK for a worst case. It must still be a lot faster than doing > it in SQL. Now I wonder what the exact requirements would be to > dispatch to a faster version that would handle int4. I find it impossible to believe that it's worth micro-optimizing shuffle() to that extent. Now, maybe doing something in that line in deconstruct_array and construct_array would be worth our time, as that'd benefit a pretty wide group of functions. regards, tom lane
Am 18.07.22 um 00:46 schrieb Tom Lane: > > This does not look particularly idiomatic, or even type-safe. What you > should have done was use deconstruct_array to get an array of Datums and > isnull flags, then shuffled those, then used construct_array to build the > output. > > (Or, perhaps, use construct_md_array to replicate the input's > precise dimensionality. Not sure if anyone would care.) > > regards, tom lane deconstruct_array() would destroy the arrays dimensions. I would expect that shuffle() only shuffles the first dimension and keeps the inner arrays intact. Martin
Am 18.07.22 um 00:37 schrieb Thomas Munro: > Seems OK for a worst case. It must still be a lot faster than doing > it in SQL. Now I wonder what the exact requirements would be to > dispatch to a faster version that would handle int4. I haven't > studied this in detail but perhaps to dispatch to a fast shuffle for > objects of size X, the requirement would be something like typlen == X > && align_bytes <= typlen && typlen % align_bytes == 0, where > align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}? > Or in English, 'the data consists of densely packed objects of fixed > size X, no padding'. Or perhaps you can work out the padded size and > use that, to catch a few more types. Then you call > array_shuffle_{2,4,8}() as appropriate, which should be as fast as > your original int[] proposal, but work also for float, date, ...? > > About your experimental patch, I haven't reviewed it properly or tried > it but I wonder if uint32 dat_offset, uint32 size (= half size > elements) would be enough due to limitations on varlenas. I made another experimental patch with fast tracks for typelen4 and typelen8. alignments are not yet considered.
Attachment
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > Am 18.07.22 um 00:46 schrieb Tom Lane: >> This does not look particularly idiomatic, or even type-safe. What you >> should have done was use deconstruct_array to get an array of Datums and >> isnull flags, then shuffled those, then used construct_array to build the >> output. >> >> (Or, perhaps, use construct_md_array to replicate the input's >> precise dimensionality. Not sure if anyone would care.) > deconstruct_array() would destroy the arrays dimensions. As I said, you can use construct_md_array if you want to preserve the array shape. > I would expect that shuffle() only shuffles the first dimension and > keeps the inner arrays intact. This argument is based on a false premise, ie that Postgres thinks multidimensional arrays are arrays-of-arrays. They aren't, and we're not going to start making them so by defining shuffle() at variance with every other array-manipulating function. Shuffling the individual elements regardless of array shape is the definition that's consistent with our existing functionality. (Having said that, even if we were going to implement it with that definition, I should think that it'd be easiest to do so on the array-of-Datums representation produced by deconstruct_array. That way you don't need to do different things for different element types.) regards, tom lane
Am 18.07.22 um 01:20 schrieb Tom Lane: >> I would expect that shuffle() only shuffles the first dimension and >> keeps the inner arrays intact. > > This argument is based on a false premise, ie that Postgres thinks > multidimensional arrays are arrays-of-arrays. They aren't, and > we're not going to start making them so by defining shuffle() > at variance with every other array-manipulating function. Shuffling > the individual elements regardless of array shape is the definition > that's consistent with our existing functionality. Hey Tom, thank you for clarification. I did not know that. I will make a patch that is using deconstruct_array(). Martin
Am 18.07.22 um 01:20 schrieb Tom Lane: > (Having said that, even if we were going to implement it with that > definition, I should think that it'd be easiest to do so on the > array-of-Datums representation produced by deconstruct_array. > That way you don't need to do different things for different element > types.) Thank you Tom, here is a patch utilising deconstruct_array(). If we agree, that this is the way to go, i would like to add array_sample() (good name?), some test cases, and documentation. One more question. How do i pick a Oid for the functions? Martin
Attachment
On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote:
> One more question. How do i pick a Oid for the functions?
For this, we recommend running src/include/catalog/unused_oids, and it will give you a random range to pick from. That reduces the chance of different patches conflicting with each other. It doesn't really matter what the oid here is, since at feature freeze a committer will change them anyway.
--
John Naylor
EDB: http://www.enterprisedb.com
John Naylor <john.naylor@enterprisedb.com> writes: > On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher < > martin.kalcher@aboutsource.net> wrote: >> One more question. How do i pick a Oid for the functions? > For this, we recommend running src/include/catalog/unused_oids, and it will > give you a random range to pick from. That reduces the chance of different > patches conflicting with each other. It doesn't really matter what the oid > here is, since at feature freeze a committer will change them anyway. If you want the nitty gritty details here, see https://www.postgresql.org/docs/devel/system-catalog-initial-data.html#SYSTEM-CATALOG-OID-ASSIGNMENT regards, tom lane
Thanks for all your feedback and help. I got a patch that i consider ready for review. It introduces two new functions: array_shuffle(anyarray) -> anyarray array_sample(anyarray, integer) -> anyarray array_shuffle() shuffles an array (obviously). array_sample() picks n random elements from an array. Is someone interested in looking at it? What are the next steps? Martin
Attachment
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > Is someone interested in looking at it? What are the next steps? The preferred thing to do is to add it to our "commitfest" queue, which will ensure that it gets looked at eventually. The currently open cycle is 2022-09 [1] (see the "New Patch" button there). I believe you have to have signed up as a community member[2] in order to put yourself in as the patch author. I think "New Patch" will work anyway, but it's better to have an author listed. regards, tom lane [1] https://commitfest.postgresql.org/39/ [2] https://www.postgresql.org/account/login/
Am 18.07.22 um 21:29 schrieb Tom Lane: > The preferred thing to do is to add it to our "commitfest" queue, > which will ensure that it gets looked at eventually. The currently > open cycle is 2022-09 [1] (see the "New Patch" button there). Thanks Tom, did that. I am not sure if "SQL Commands" is the correct topic for this patch, but i guess its not that important. I am impressed by all the infrastructure build around this project. Martin
On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > Thanks for all your feedback and help. I got a patch that i consider > ready for review. It introduces two new functions: > > array_shuffle(anyarray) -> anyarray > array_sample(anyarray, integer) -> anyarray > > array_shuffle() shuffles an array (obviously). array_sample() picks n > random elements from an array. I like this idea. I think it's questionable whether the behavior of array_shuffle() is correct for a multi-dimensional array. The implemented behavior is to keep the dimensions as they were, but permute the elements across all levels at random. But there are at least two other behaviors that seem potentially defensible: (1) always return a 1-dimensional array, (2) shuffle the sub-arrays at the top-level without the possibility of moving elements within or between sub-arrays. What behavior we decide is best here should be documented. array_sample() will return elements in random order when sample_size < array_size, but in the original order when sample_size >= array_size. Similarly, it will always return a 1-dimensional array in the former case, but will keep the original dimensions in the latter case. That seems pretty hard to defend. I think it should always return a 1-dimensional array with elements in random order, and I think this should be documented. I also think you should add test cases involving multi-dimensional arrays, as well as arrays with non-default bounds. e.g. trying shuffling or sampling some values like '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[] -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher > <martin.kalcher@aboutsource.net> wrote: >> array_shuffle(anyarray) -> anyarray >> array_sample(anyarray, integer) -> anyarray > I think it's questionable whether the behavior of array_shuffle() is > correct for a multi-dimensional array. The implemented behavior is to > keep the dimensions as they were, but permute the elements across all > levels at random. But there are at least two other behaviors that seem > potentially defensible: (1) always return a 1-dimensional array, (2) > shuffle the sub-arrays at the top-level without the possibility of > moving elements within or between sub-arrays. What behavior we decide > is best here should be documented. Martin had originally proposed (2), which I rejected on the grounds that we don't treat multi-dimensional arrays as arrays-of-arrays for any other purpose. Maybe we should have, but that ship sailed decades ago, and I doubt that shuffle() is the place to start changing it. I could get behind your option (1) though, to make it clearer that the input array's dimensionality is not of interest. Especially since, as you say, (1) is pretty much the only sensible choice for array_sample. > I also think you should add test cases involving multi-dimensional > arrays, as well as arrays with non-default bounds. e.g. trying > shuffling or sampling some values like > '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[] This'd only matter if we decide not to ignore the input's dimensionality. regards, tom lane
I wrote: > Martin had originally proposed (2), which I rejected on the grounds > that we don't treat multi-dimensional arrays as arrays-of-arrays for > any other purpose. Actually, after poking at it for awhile, that's an overstatement. It's true that the type system doesn't think N-D arrays are arrays-of-arrays, but there are individual functions/operators that do. For instance, @> is in the its-a-flat-list-of-elements camp: regression=# select array[[1,2],[3,4]] @> array[1,3]; ?column? ---------- t (1 row) but || wants to preserve dimensionality: regression=# select array[[1,2],[3,4]] || array[1]; ERROR: cannot concatenate incompatible arrays DETAIL: Arrays with differing dimensions are not compatible for concatenation. Various other functions dodge the issue by refusing to work on arrays of more than one dimension. There seem to be more functions that think arrays are flat than otherwise, but it's not as black-and-white as I was thinking. regards, tom lane
Am 18.07.22 um 23:03 schrieb Tom Lane: > I wrote: >> Martin had originally proposed (2), which I rejected on the grounds >> that we don't treat multi-dimensional arrays as arrays-of-arrays for >> any other purpose. > > Actually, after poking at it for awhile, that's an overstatement. > It's true that the type system doesn't think N-D arrays are > arrays-of-arrays, but there are individual functions/operators that do. > Thanks Robert for pointing out the inconsistent behavior of array_sample(). That needs to be fixed. As Tom's investigation showed, there is no consensus in the code if multi-dimensional arrays are treated as arrays-of-arrays or not. We need to decide what should be the correct treatment. If we go with (1) array_shuffle() and array_sample() should shuffle each element individually and always return a one-dimensional array. select array_shuffle('{{1,2},{3,4},{5,6}}'); ----------- {1,4,3,5,6,2} select array_sample('{{1,2},{3,4},{5,6}}', 3); ---------- {1,4,3} If we go with (2) both functions should only operate on the first dimension and shuffle whole subarrays and keep the dimensions intact. select array_shuffle('{{1,2},{3,4},{5,6}}'); --------------------- {{3,4},{1,2},{5,6}} select array_sample('{{1,2},{3,4},{5,6}}', 2); --------------- {{3,4},{1,2}} I do not feel qualified to make that decision. (2) complicates the code a bit, but that should not be the main argument here. Martin
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > If we go with (1) array_shuffle() and array_sample() should shuffle each > element individually and always return a one-dimensional array. > select array_shuffle('{{1,2},{3,4},{5,6}}'); > ----------- > {1,4,3,5,6,2} > select array_sample('{{1,2},{3,4},{5,6}}', 3); > ---------- > {1,4,3} Independently of the dimensionality question --- I'd imagined that array_sample would select a random subset of the array elements but keep their order intact. If you want the behavior shown above, you can do array_shuffle(array_sample(...)). But if we randomize it, and that's not what the user wanted, she has no recourse. Now, if you're convinced that the set of people wanting sampling-without-shuffling is the empty set, then making everybody else call two functions is a loser. But I'm not convinced. At the least, I'd like to see the argument made why nobody would want that. regards, tom lane
On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.
And for those that want to know in what order those elements were chosen they have no recourse in the other setup.
I really think this function needs to grow an algorithm argument that can be used to specify stuff like ordering, replacement/without-replacement, etc...just some enums separated by commas that can be added to the call.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Independently of the dimensionality question --- I'd imagined that >> array_sample would select a random subset of the array elements >> but keep their order intact. If you want the behavior shown >> above, you can do array_shuffle(array_sample(...)). But if we >> randomize it, and that's not what the user wanted, she has no >> recourse. > And for those that want to know in what order those elements were chosen > they have no recourse in the other setup. Um ... why is "the order in which the elements were chosen" a concept we want to expose? ISTM sample() is a black box in which notionally the decisions could all be made at once. > I really think this function needs to grow an algorithm argument that can > be used to specify stuff like ordering, replacement/without-replacement, > etc...just some enums separated by commas that can be added to the call. I think you might run out of gold paint somewhere around here. I'm still not totally convinced we should bother with the sample() function at all, let alone that it needs algorithm variants. At some point we say to the user "here's a PL, write what you want for yourself". regards, tom lane
Am 19.07.22 um 00:18 schrieb Tom Lane: > > Independently of the dimensionality question --- I'd imagined that > array_sample would select a random subset of the array elements > but keep their order intact. If you want the behavior shown > above, you can do array_shuffle(array_sample(...)). But if we > randomize it, and that's not what the user wanted, she has no > recourse. > > Now, if you're convinced that the set of people wanting > sampling-without-shuffling is the empty set, then making everybody > else call two functions is a loser. But I'm not convinced. > At the least, I'd like to see the argument made why nobody > would want that. > On the contrary! I am pretty sure there are people out there wanting sampling-without-shuffling. I will think about that.
On Tue, Jul 19, 2022 at 8:15 AM Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > Am 18.07.22 um 21:29 schrieb Tom Lane: > > The preferred thing to do is to add it to our "commitfest" queue, > > which will ensure that it gets looked at eventually. The currently > > open cycle is 2022-09 [1] (see the "New Patch" button there). > > Thanks Tom, did that. FYI that brought your patch to the attention of this CI robot, which is showing a couple of warnings. See the FAQ link there for an explanation of that infrastructure. http://cfbot.cputube.org/martin-kalcher.html
Am 19.07.22 um 00:52 schrieb Martin Kalcher: > > On the contrary! I am pretty sure there are people out there wanting > sampling-without-shuffling. I will think about that. I gave it some thought. Even though there might be use cases, where a stable order is desired, i would consider them edge cases, not worth the additional complexity. I personally would not expect array_sample() to return elements in any specific order. I looked up some sample() implementations. None of them makes guarantees about the order of the resulting array or explicitly states that the resulting array is in random or selection order. - Python random.sample [0] - Ruby Array#sample [1] - Rust rand::seq::SliceRandom::choose_multiple [2] - Julia StatsBase.sample [3] stable order needs explicit request [0] https://docs.python.org/3/library/random.html#random.sample [1] https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample [2] https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple [3] https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample
Hi Martin, I didn't look at the code yet but I very much like the idea. Many thanks for working on this! It's a pity your patch was too late for the July commitfest. In September, please keep an eye on cfbot [1] to make sure your patch applies properly. > As Tom's investigation showed, there is no consensus in the code if > multi-dimensional arrays are treated as arrays-of-arrays or not. We need > to decide what should be the correct treatment. Here are my two cents. From the user perspective I would expect that by default a multidimensional array should be treated as an array of arrays. So for instance: ``` array_shuffle([ [1,2], [3,4], [5,6] ]) ``` ... should return something like: ``` [ [3,4], [1,2], [5,6] ] ``` Note that the order of the elements in the internal arrays is preserved. However, I believe there should be an optional argument that overrides this behavior. For instance: ``` array_shuffle([ [1,2], [3,4], [5,6] ], depth => 2) ``` BTW, while on it, shouldn't we add similar functions for JSON and/or JSONB? Or is this going to be too much for a single discussion? [1]: http://cfbot.cputube.org/ -- Best regards, Aleksander Alekseev
Am 18.07.22 um 23:48 schrieb Martin Kalcher: > > If we go with (1) array_shuffle() and array_sample() should shuffle each > element individually and always return a one-dimensional array. > > select array_shuffle('{{1,2},{3,4},{5,6}}'); > ----------- > {1,4,3,5,6,2} > > select array_sample('{{1,2},{3,4},{5,6}}', 3); > ---------- > {1,4,3} > > If we go with (2) both functions should only operate on the first > dimension and shuffle whole subarrays and keep the dimensions intact. > > select array_shuffle('{{1,2},{3,4},{5,6}}'); > --------------------- > {{3,4},{1,2},{5,6}} > > select array_sample('{{1,2},{3,4},{5,6}}', 2); > --------------- > {{3,4},{1,2}} > Having thought about it, i would go with (2). It gives the user the ability to decide wether or not array-of-arrays behavior is desired. If he wants the behavior of (1) he can flatten the array before applying array_shuffle(). Unfortunately there is no array_flatten() function (at the moment) and the user would have to work around it with unnest() and array_agg().
On Mon, Jul 18, 2022 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Um ... why is "the order in which the elements were chosen" a concept > we want to expose? ISTM sample() is a black box in which notionally > the decisions could all be made at once. I agree with that. But I also think it's fine for the elements to be returned in a shuffled order rather than the original order. > > I really think this function needs to grow an algorithm argument that can > > be used to specify stuff like ordering, replacement/without-replacement, > > etc...just some enums separated by commas that can be added to the call. > > I think you might run out of gold paint somewhere around here. I'm > still not totally convinced we should bother with the sample() function > at all, let alone that it needs algorithm variants. At some point we > say to the user "here's a PL, write what you want for yourself". I don't know what gold paint has to do with anything here, but I agree that David's proposal seems to be moving the goalposts a very long way. The thing is, as Martin points out, these functions already exist in a bunch of other systems. For one example I've used myself, see https://underscorejs.org/ I probably wouldn't have called a function to put a list into a random order "shuffle" in a vacuum, but it seems to be common nomenclature these days. I believe that if you don't make reference to Fisher-Yates in the documentation, they kick you out of the cool programmers club. -- Robert Haas EDB: http://www.enterprisedb.com
On 2022-07-19 Tu 07:15, Martin Kalcher wrote: > Am 18.07.22 um 23:48 schrieb Martin Kalcher: >> >> If we go with (1) array_shuffle() and array_sample() should shuffle >> each element individually and always return a one-dimensional array. >> >> select array_shuffle('{{1,2},{3,4},{5,6}}'); >> ----------- >> {1,4,3,5,6,2} >> >> select array_sample('{{1,2},{3,4},{5,6}}', 3); >> ---------- >> {1,4,3} >> >> If we go with (2) both functions should only operate on the first >> dimension and shuffle whole subarrays and keep the dimensions intact. >> >> select array_shuffle('{{1,2},{3,4},{5,6}}'); >> --------------------- >> {{3,4},{1,2},{5,6}} >> >> select array_sample('{{1,2},{3,4},{5,6}}', 2); >> --------------- >> {{3,4},{1,2}} >> > > Having thought about it, i would go with (2). It gives the user the > ability to decide wether or not array-of-arrays behavior is desired. > If he wants the behavior of (1) he can flatten the array before > applying array_shuffle(). Unfortunately there is no array_flatten() > function (at the moment) and the user would have to work around it > with unnest() and array_agg(). > > Why not have an optional second parameter for array_shuffle that indicates whether or not to flatten? e.g. array_shuffle(my_array, flatten => true) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > Having thought about it, i would go with (2). It gives the user the > > ability to decide wether or not array-of-arrays behavior is desired. > > If he wants the behavior of (1) he can flatten the array before > > applying array_shuffle(). Unfortunately there is no array_flatten() > > function (at the moment) and the user would have to work around it > > with unnest() and array_agg(). > > Why not have an optional second parameter for array_shuffle that > indicates whether or not to flatten? e.g. array_shuffle(my_array, > flatten => true) IMHO, if we think that's something many people are going to want, it would be better to add an array_flatten() function, because that could be used for a variety of purposes, whereas an option to this function can only be used for this function. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Why not have an optional second parameter for array_shuffle that >> indicates whether or not to flatten? e.g. array_shuffle(my_array, >> flatten => true) > IMHO, if we think that's something many people are going to want, it > would be better to add an array_flatten() function, because that could > be used for a variety of purposes, whereas an option to this function > can only be used for this function. Agreed. Whether it's really needed, I dunno --- I don't recall the issue having come up before. After taking a second look through https://www.postgresql.org/docs/current/functions-array.html it seems like the things that treat arrays as flat even when they are multi-D are just * unnest(), which is more or less forced into that position by our type system: it has to be anyarray returning anyelement, not anyarray returning anyelement-or-anyarray-depending. * array_to_string(), which maybe could have done it differently but can reasonably be considered a variant of unnest(). * The containment/overlap operators, which are kind of their own thing anyway. Arguably they got this wrong, though I suppose it's too late to rethink that. Everything else either explicitly rejects more-than-one-D arrays or does something that is compatible with thinking of them as arrays-of-arrays. So I withdraw my original position. These functions should just shuffle or select in the array's first dimension, preserving subarrays. Or else be lazy and reject more-than-one-D arrays; but it's probably not that hard to handle them. regards, tom lane
Am 19.07.22 um 16:20 schrieb Tom Lane: > > So I withdraw my original position. These functions should just > shuffle or select in the array's first dimension, preserving > subarrays. Or else be lazy and reject more-than-one-D arrays; > but it's probably not that hard to handle them. > Here is a patch with dimension aware array_shuffle() and array_sample(). If you think array_flatten() is desirable, i can take a look at it. Maybe a second parameter would be nice to specify the target dimension: select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 1); ------------------- {1,2,3,4,5,6,7,8} select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 2); ----------------------- {{1,2,3,4},{5,6,7,8}} Martin
Attachment
On Tue, 19 Jul 2022 at 21:21, Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > > Here is a patch with dimension aware array_shuffle() and array_sample(). > +1 for this feature, and this way of handling multi-dimensional arrays. > If you think array_flatten() is desirable, i can take a look at it. That's not something I've ever wanted -- personally, I rarely use multi-dimensional arrays in Postgres. A couple of quick comments on the current patch: It's important to mark these new functions as VOLATILE, not IMMUTABLE, otherwise they won't work as expected in queries. See https://www.postgresql.org/docs/current/xfunc-volatility.html It would be better to use pg_prng_uint64_range() rather than rand() to pick elements. Partly, that's because it uses a higher quality PRNG, with a larger internal state, and it ensures that the results are unbiased across the range. But more importantly, it interoperates with setseed(), allowing predictable sequences of "random" numbers to be generated -- something that's useful in writing repeatable regression tests. Assuming these new functions are made to interoperate with setseed(), which I think they should be, then they also need to be marked as PARALLEL RESTRICTED, rather than PARALLEL SAFE. See https://www.postgresql.org/docs/current/parallel-safety.html, which explains why setseed() and random() are parallel restricted. In my experience, the requirement for sampling with replacement is about as common as the requirement for sampling without replacement, so it seems odd to provide one but not the other. Of course, we can always add a with-replacement function later, and give it a different name. But maybe array_sample() could be used for both, with a "with_replacement" boolean parameter? Regards, Dean
Am 21.07.22 um 10:41 schrieb Dean Rasheed: > > A couple of quick comments on the current patch: Thank you for your feedback! > It's important to mark these new functions as VOLATILE, not IMMUTABLE, > otherwise they won't work as expected in queries. See > https://www.postgresql.org/docs/current/xfunc-volatility.html CREATE FUNCTION marks functions as VOLATILE by default if not explicitly marked otherwise. I assumed function definitions in pg_proc.dat have the same behavior. I will fix that. https://www.postgresql.org/docs/current/sql-createfunction.html > It would be better to use pg_prng_uint64_range() rather than rand() to > pick elements. Partly, that's because it uses a higher quality PRNG, > with a larger internal state, and it ensures that the results are > unbiased across the range. But more importantly, it interoperates with > setseed(), allowing predictable sequences of "random" numbers to be > generated -- something that's useful in writing repeatable regression > tests. I agree that we should use pg_prng_uint64_range(). However, in order to achieve interoperability with setseed() we would have to use drandom_seed (rather than pg_global_prng_state) as rng state, which is declared statically in float.c and exclusively used by random(). Do we want to expose drandom_seed to other functions? > Assuming these new functions are made to interoperate with setseed(), > which I think they should be, then they also need to be marked as > PARALLEL RESTRICTED, rather than PARALLEL SAFE. See > https://www.postgresql.org/docs/current/parallel-safety.html, which > explains why setseed() and random() are parallel restricted. As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll fix that. > In my experience, the requirement for sampling with replacement is > about as common as the requirement for sampling without replacement, > so it seems odd to provide one but not the other. Of course, we can > always add a with-replacement function later, and give it a different > name. But maybe array_sample() could be used for both, with a > "with_replacement" boolean parameter? We can also add a "with_replacement" boolean parameter which is false by default to array_sample() later. I do not have a strong opinion about that and will implement it, if desired. Any opinions about default/no-default? Martin
On Thu, 21 Jul 2022 at 12:15, Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > > I agree that we should use pg_prng_uint64_range(). However, in order to > achieve interoperability with setseed() we would have to use > drandom_seed (rather than pg_global_prng_state) as rng state, which is > declared statically in float.c and exclusively used by random(). Do we > want to expose drandom_seed to other functions? > Ah, I didn't realise that setseed() and random() were bound up so tightly. It does feel as though, if we're adding more user-facing functions that return random sequences, there ought to be a way to seed them, and I wouldn't want to have separate setseed functions for each one. I'm inclined to say that we want a new pg_global_prng_user_state that is updated by setseed(), and used by random(), array_shuffle(), array_sample(), and any other user-facing random functions we add later. I can also see that others might not like expanding the scope of setseed() in this way. Regards, Dean
Am 21.07.22 um 14:25 schrieb Dean Rasheed: > > I'm inclined to say that we want a new pg_global_prng_user_state that > is updated by setseed(), and used by random(), array_shuffle(), > array_sample(), and any other user-facing random functions we add > later. > I like the idea. How would you organize the code? I imagine some sort of user prng that is encapsulated in something like utils/adt/random.c and is guaranteed to always be seeded. Martin
Am 21.07.22 um 10:41 schrieb Dean Rasheed: > > It's important to mark these new functions as VOLATILE, not IMMUTABLE, > otherwise they won't work as expected in queries. See > https://www.postgresql.org/docs/current/xfunc-volatility.html > > It would be better to use pg_prng_uint64_range() rather than rand() to > pick elements. Partly, that's because it uses a higher quality PRNG, > with a larger internal state, and it ensures that the results are > unbiased across the range. But more importantly, it interoperates with > setseed(), allowing predictable sequences of "random" numbers to be > generated -- something that's useful in writing repeatable regression > tests. > > Assuming these new functions are made to interoperate with setseed(), > which I think they should be, then they also need to be marked as > PARALLEL RESTRICTED, rather than PARALLEL SAFE. See > https://www.postgresql.org/docs/current/parallel-safety.html, which > explains why setseed() and random() are parallel restricted. > Here is an updated patch that marks the functions VOLATILE PARALLEL RESTRICTED and uses pg_prng_uint64_range() rather than rand().
Attachment
On Thu, 21 Jul 2022 at 16:43, Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > > Am 21.07.22 um 14:25 schrieb Dean Rasheed: > > > > I'm inclined to say that we want a new pg_global_prng_user_state that > > is updated by setseed(), and used by random(), array_shuffle(), > > array_sample(), and any other user-facing random functions we add > > later. > > > > I like the idea. How would you organize the code? I imagine some sort of > user prng that is encapsulated in something like utils/adt/random.c and > is guaranteed to always be seeded. > Something like that, perhaps. I can see 2 ways it could go: Option 1: Keep random.c small - Responsible for initialisation of the user prng on demand - Expose the user prng state to other code like float.c and arrayfuncs.c Option 2: Move all random functions wanting to use the user prng to random.c - Starting with drandom() and setseed() - Then, array_shuffle() and array_sample() - Later, any new SQL-callable random functions we might add - Keep the user prng state local to random.c Option 2 seems quite appealing, because it keeps all SQL-callable random functions together in one place, and keeps the state local, making it easier to keep track of which functions are using it. Code like the Fisher-Yates algorithm is more to do with random than it is to do with arrays, so putting it in random.c seems to make more sense. It's possible that some hypothetical random function might need access to type-specific internals. For example, if we wanted to add a function to return a random numeric, it would probably have to go in numeric.c, but that could be solved by having random.c call numeric.c, passing it the prng state. Regards, Dean
Am 22.07.22 um 09:59 schrieb Dean Rasheed:> > Option 1: > Keep random.c small > - Responsible for initialisation of the user prng on demand > - Expose the user prng state to other code like float.c and arrayfuncs.c > > Option 2: > Move all random functions wanting to use the user prng to random.c > - Starting with drandom() and setseed() > - Then, array_shuffle() and array_sample() > - Later, any new SQL-callable random functions we might add > - Keep the user prng state local to random.c > Hey Dean, i came to the same conclusions and went with Option 1 (see patch). Mainly because most code in utils/adt is organized by type and this way it is clear, that this is a thin wrapper around pg_prng. What do you think?
Attachment
On Fri, 22 Jul 2022 at 10:31, Martin Kalcher <martin.kalcher@aboutsource.net> wrote: > > i came to the same conclusions and went with Option 1 (see patch). > Mainly because most code in utils/adt is organized by type and this way > it is clear, that this is a thin wrapper around pg_prng. > > What do you think? Looks fairly neat, on a quick read-through. There's certainly something to be said for preserving the organisation by type. Regards, Dean
On 7/19/22 10:20, Tom Lane wrote: > Everything else either explicitly rejects more-than-one-D arrays > or does something that is compatible with thinking of them as > arrays-of-arrays. I think I am responsible for at least some of those, and I agree that thinking of MD arrays as arrays-of-arrays is preferable even though they are not actually that. Long ago[1] Peter E asked me to fix that as I recall but it was one of those round tuits that I never found. > So I withdraw my original position. These functions should just > shuffle or select in the array's first dimension, preserving > subarrays. Or else be lazy and reject more-than-one-D arrays; > but it's probably not that hard to handle them. +1 Joe [1] https://www.postgresql.org/message-id/flat/Pine.LNX.4.44.0306281418020.2178-100000%40peter.localdomain#a064d6dd8593993d799db453a3ee04d1 -- Joe Conway RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Am 22.07.22 um 11:31 schrieb Martin Kalcher: > > i came to the same conclusions and went with Option 1 (see patch). > Mainly because most code in utils/adt is organized by type and this way > it is clear, that this is a thin wrapper around pg_prng. > Small patch update. I realized the new functions should live array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and added some comments to the code.
Attachment
>> i came to the same conclusions and went with Option 1 (see patch). Mainly >> because most code in utils/adt is organized by type and this way it is >> clear, that this is a thin wrapper around pg_prng. >> > > Small patch update. I realized the new functions should live > array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and > added some comments to the code. My 0,02€ about this patch: Use (void) when declaring no parameters in headers or in functions. Should the exchange be skipped when i == k? I do not see the point of having *only* inline functions in a c file. The external functions should not be inlined? The random and array shuffling functions share a common state. I'm wondering whether it should ot should not be so. It seems ok, but then ISTM that the documentation suggests implicitely that setseed applies to random() only, which is not the case anymore after the patch. If more samples are required than the number of elements, it does not error out. I'm wondering whether it should. Also, the sampling should not return its result in order when the number of elements required is the full array, ISTM that it should be shuffled there as well. I must say that without significant knowledge of the array internal implementation, the swap code looks pretty strange. ISTM that the code would be clearer if pointers and array references style were not intermixed. Maybe you could add a test with a 3D array? Some sample with NULLs? Unrelated: I notice again that (postgre)SQL does not provide a way to generate random integers. I do not see why not. Should we provide one? -- Fabien.
Am 24.07.22 um 10:15 schrieb Fabien COELHO: > > My 0,02€ about this patch: Thank you for your feedback. I attached a patch, that addresses most of your points. > Use (void) when declaring no parameters in headers or in functions. Done. > Should the exchange be skipped when i == k? The additional branch is actually slower (on my machine, tested with an 10M element int array) for 1D-arrays, which i assume is the most common case. Even with a 2D-array with a subarray size of 1M there is no difference in execution time. i == k should be relatively rare for large arrays, given a good prng. > I do not see the point of having *only* inline functions in a c file. > The external functions should not be inlined? Done. > The random and array shuffling functions share a common state. I'm > wondering whether it should ot should not be so. It seems ok, but then > ISTM that the documentation suggests implicitely that setseed applies to > random() only, which is not the case anymore after the patch. I really like the idea of a prng state owned by the user, that is used by all user facing random functions, but see that their might be concerns about this change. I would update the setseed() documentation, if this proposal is accepted. Do you think my patch should already include the documentation update? > If more samples are required than the number of elements, it does not > error out. I'm wondering whether it should. The second argument to array_sample() is treated like a LIMIT, rather than the actual number of elements. I updated the documentation. My use-case is: take max random items from an array of unknown size. > Also, the sampling should not return its result in order when the number > of elements required is the full array, ISTM that it should be shuffled > there as well. You are the second person asking for this. It's done. I am thinking about ditching array_sample() and replace it with a second signature for array_shuffle(): array_shuffle(array anyarray) -> anyarray array_shuffle(array anyarray, limit int) -> anyarray What do you think? > I must say that without significant knowledge of the array internal > implementation, the swap code looks pretty strange. ISTM that the code > would be clearer if pointers and array references style were not > intermixed. Done. Went with pointers. > Maybe you could add a test with a 3D array? Some sample with NULLs? Done. > Unrelated: I notice again that (postgre)SQL does not provide a way to > generate random integers. I do not see why not. Should we provide one? Maybe. I think it is outside the scope of this patch. Thank you, Martin
Attachment
Hello, > Thank you for your feedback. I attached a patch, that addresses most of your > points. I'll look into it. It would help if the patch could include a version number at the end. >> Should the exchange be skipped when i == k? > > The additional branch is actually slower (on my machine, tested with an 10M > element int array) for 1D-arrays, which i assume is the most common case. > Even with a 2D-array with a subarray size of 1M there is no difference in > execution time. i == k should be relatively rare for large arrays, given a > good prng. Ok, slower is bad. >> The random and array shuffling functions share a common state. I'm >> wondering whether it should ot should not be so. It seems ok, but then ISTM >> that the documentation suggests implicitely that setseed applies to >> random() only, which is not the case anymore after the patch. > > I really like the idea of a prng state owned by the user, that is used by all > user facing random functions, but see that their might be concerns about this > change. I would update the setseed() documentation, if this proposal is > accepted. Do you think my patch should already include the documentation > update? Duno. I'm still wondering what it should do. I'm pretty sure that the documentation should be clear about a shared seed, if any. I do not think that departing from the standard is a good thing, either. >> If more samples are required than the number of elements, it does not error >> out. I'm wondering whether it should. > > The second argument to array_sample() is treated like a LIMIT, rather than > the actual number of elements. I updated the documentation. My use-case is: > take max random items from an array of unknown size. Hmmm. ISTM that the use case of wanting "this many" stuff would make more sense because it is strictier so less likely to result in unforseen consequences. On principle I do not like not knowing the output size. If someone wants a limit, they can easily "LEAST(#1 dim size, other limit)" to get it, it is easy enough with a strict function. >> Also, the sampling should not return its result in order when the number of >> elements required is the full array, ISTM that it should be shuffled there >> as well. > > You are the second person asking for this. It's done. I am thinking about > ditching array_sample() and replace it with a second signature for > array_shuffle(): > > array_shuffle(array anyarray) -> anyarray > array_shuffle(array anyarray, limit int) -> anyarray > > What do you think? I think that shuffle means shuffle, not partial shuffle, so a different name seems better to me. -- Fabien.
Am 24.07.22 um 21:42 schrieb Fabien COELHO: > > Duno. I'm still wondering what it should do. I'm pretty sure that the > documentation should be clear about a shared seed, if any. I do not > think that departing from the standard is a good thing, either. Are sure it violates the standard? I could not find anything about it. The private prng state for random() was introduced in 2018 [0]. Neither commit nor discussion mentions any standard compliance. [0] https://www.postgresql.org/message-id/E1gdNAo-00036g-TB%40gemulon.postgresql.org I updated the documentation for setseed(). > If someone wants a limit, they can easily "LEAST(#1 dim size, other > limit)" to get it, it is easy enough with a strict function. Convinced. It errors out now if n is out of bounds. Martin
Attachment
Patch update without merge conflicts. Martin
Attachment
Hi, On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote: > Patch update without merge conflicts. Due to the merge of the meson based build, this patch needs to be adjusted. See https://cirrus-ci.com/build/6580671765282816 Looks like it'd just be adding user_prng.c to src/backend/utils/adt/meson.build's list. Greetings, Andres Freund
Am 22.09.22 um 17:23 schrieb Andres Freund: > Hi, > > On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote: >> Patch update without merge conflicts. > > Due to the merge of the meson based build, this patch needs to be adjusted. See > https://cirrus-ci.com/build/6580671765282816 > Looks like it'd just be adding user_prng.c to > src/backend/utils/adt/meson.build's list. > > Greetings, > > Andres Freund Hi Andres, thanks for the heads up. Adjusted patch is attached. - Martin
Attachment
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > [ v4-0001-Introduce-array_shuffle-and-array_sample.patch ] I think this idea of exporting drandom()'s PRNG for all and sundry to use is completely misguided. If we go down that path we'll be right back in the swamp that we were in when we used random(3), namely that (a) it's not clear what affects what, and (b) to the extent that there are such interferences, it could be bad, maybe even a security problem. We very intentionally decoupled drandom() from the rest of the world at commit 6645ad6bd, and I'm not ready to unlearn that lesson. With our current PRNG infrastructure it doesn't cost much to have a separate PRNG for every purpose. I don't object to having array_shuffle() and array_sample() share one PRNG, but I don't think it should go much further than that. regards, tom lane
Am 26.09.22 um 22:16 schrieb Tom Lane: > > With our current PRNG infrastructure it doesn't cost much to have > a separate PRNG for every purpose. I don't object to having > array_shuffle() and array_sample() share one PRNG, but I don't > think it should go much further than that. > Thanks for your thoughts, Tom. I have a couple of questions. Should we introduce a new seed function for the new PRNG state, used by array_shuffle() and array_sample()? What would be a good name? Or should those functions use pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is seeded? Martin
>> With our current PRNG infrastructure it doesn't cost much to have >> a separate PRNG for every purpose. I don't object to having >> array_shuffle() and array_sample() share one PRNG, but I don't >> think it should go much further than that. > > Thanks for your thoughts, Tom. I have a couple of questions. Should we > introduce a new seed function for the new PRNG state, used by array_shuffle() > and array_sample()? What would be a good name? Or should those functions use > pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is > seeded? I'd suggest to use the existing global state. The global state should be seeded at the process start, AFAIKR. -- Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> Thanks for your thoughts, Tom. I have a couple of questions. Should we >> introduce a new seed function for the new PRNG state, used by array_shuffle() >> and array_sample()? What would be a good name? Or should those functions use >> pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is >> seeded? > I'd suggest to use the existing global state. The global state should be > seeded at the process start, AFAIKR. It is seeded at process start, yes. If you don't feel a need for user control over the sequence used by these functions, then using pg_global_prng_state is fine. (Basically the point to be made here is that we need to keep a tight rein on what can be affected by setseed().) regards, tom lane
Am 28.09.22 um 16:18 schrieb Tom Lane: > It is seeded at process start, yes. If you don't feel a need for > user control over the sequence used by these functions, then using > pg_global_prng_state is fine. (Basically the point to be made > here is that we need to keep a tight rein on what can be affected > by setseed().) > > regards, tom lane New patch: array_shuffle() and array_sample() use pg_global_prng_state now. Martin
Attachment
Martin Kalcher <martin.kalcher@aboutsource.net> writes: > New patch: array_shuffle() and array_sample() use pg_global_prng_state now. I took a closer look at the patch today. I find this behavior a bit surprising: +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3)); + array_dims +------------- + [-1:1][2:3] +(1 row) I can buy preserving the lower bound in array_shuffle(), but array_sample() is not preserving the first-dimension indexes of the array, so ISTM it ought to reset the first lower bound to 1. Some other comments: + Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order. What's "selection order"? And this probably shouldn't just rely on the example to describe what happens with multi-D arrays. Writing "elements" seems somewhere between confusing and wrong. * Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n, and let it pull out what it needs. Less duplicate code that way. * I find array_shuffle_n drastically undercommented, and what comments it has are pretty misleading, eg + /* Swap all elements in item (i) with elements in item (j). */ j is *not* the index of the second item to be swapped. You could make it so, and that might be more readable: j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1); jelms = elms + j * nelm; jnuls = nuls + j * nelm; But if you want the code to stay as it is, this comment needs work. * I think good style for SQL-callable C functions is to make the arguments clear at the top: +array_sample(PG_FUNCTION_ARGS) +{ + ArrayType *array = PG_GETARG_ARRAYTYPE_P(0); + int n = PG_GETARG_INT32(1); + ArrayType *result; + ... other declarations as needed ... We can't quite make normal C declaration style work, but that's a poor excuse for not making the argument list visible as directly as possible. * I wouldn't bother with the PG_FREE_IF_COPY calls. Those are generally only used in btree comparison functions, in which there's a need to not leak memory. * I wonder if we really need these to be proparallel = 'r'. If we let a worker process execute them, they will take numbers from the worker's pg_global_prng_state seeding not the parent process's seeding, but why is that a problem? In the prior version where you were tying them to the parent's drandom() sequence, proparallel = 'r' made sense; but now I think it's unnecessary. regards, tom lane
On Thu, 29 Sept 2022 at 15:34, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Martin Kalcher <martin.kalcher@aboutsource.net> writes: > > New patch: array_shuffle() and array_sample() use pg_global_prng_state now. > > I took a closer look at the patch today. I find this behavior a bit > surprising: > It looks like this patch received useful feedback and it wouldn't take much to push it over the line. But it's been Waiting On Author since last September. Martin, any chance of getting these last bits of feedback resolved so it can be Ready for Commit? -- Gregory Stark As Commitfest Manager
Given that there's been no updates since September 22 I'm going to make this patch Returned with Feedback. The patch can be resurrected when there's more work done -- don't forget to move it to the new CF when you do that. -- Gregory Stark As Commitfest Manager
> On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Martin Kalcher <martin.kalcher@aboutsource.net> writes: >> New patch: array_shuffle() and array_sample() use pg_global_prng_state now. > > I took a closer look at the patch today. Since this seems pretty close to going in, and seems like quite useful functions, I took a look to see if I could get it across the line (although I noticed that CFM beat me to the clock in sending this =)). > I find this behavior a bit surprising: > > +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3)); > + array_dims > +------------- > + [-1:1][2:3] > +(1 row) > > I can buy preserving the lower bound in array_shuffle(), but > array_sample() is not preserving the first-dimension indexes of > the array, so ISTM it ought to reset the first lower bound to 1. I might be daft but I'm not sure I follow why not preserving here, can you explain? The rest of your comments have been addressed in the attached v6 I think (although I'm pretty sure the docs part is just as bad now, explaining these in concise words is hard, will take another look with fresh eyes tomorrow). -- Daniel Gustafsson
Attachment
Daniel Gustafsson <daniel@yesql.se> writes: > On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I find this behavior a bit surprising: >> >> +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3)); >> + array_dims >> +------------- >> + [-1:1][2:3] >> +(1 row) >> >> I can buy preserving the lower bound in array_shuffle(), but >> array_sample() is not preserving the first-dimension indexes of >> the array, so ISTM it ought to reset the first lower bound to 1. > I might be daft but I'm not sure I follow why not preserving here, can you > explain? Because array_sample selects only some of the (first level) array elements, those elements are typically not going to have the same indexes in the output as they did in the input. So I don't see why it would be useful to preserve the same lower-bound index. It does make sense to preserve the lower-order index bounds ([2:3] in this example) because we are including or not including those array slices as a whole. regards, tom lane
> On 3 Apr 2023, at 23:46, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Daniel Gustafsson <daniel@yesql.se> writes: >> On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I find this behavior a bit surprising: >>> >>> +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3)); >>> + array_dims >>> +------------- >>> + [-1:1][2:3] >>> +(1 row) >>> >>> I can buy preserving the lower bound in array_shuffle(), but >>> array_sample() is not preserving the first-dimension indexes of >>> the array, so ISTM it ought to reset the first lower bound to 1. > >> I might be daft but I'm not sure I follow why not preserving here, can you >> explain? > > Because array_sample selects only some of the (first level) array > elements, those elements are typically not going to have the same > indexes in the output as they did in the input. So I don't see why > it would be useful to preserve the same lower-bound index. It does > make sense to preserve the lower-order index bounds ([2:3] in this > example) because we are including or not including those array > slices as a whole. Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like this tomorrow. -- Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes: > Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like > this tomorrow. Since we're running out of time, I took the liberty of fixing and pushing this. regards, tom lane
> On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Daniel Gustafsson <daniel@yesql.se> writes: >> Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like >> this tomorrow. > > Since we're running out of time, I took the liberty of fixing and > pushing this. Great, thanks! -- Daniel Gustafsson
Hi all,
reading this blog post https://www.depesz.com/2023/04/18/waiting-for-postgresql-16-add-array_sample-and-array_shuffle-functions/ I became aware of the new feature and had a look at it and the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=888f2ea0a81ff171087bdd1c5c1eeda3b78d73d4
To me the description
/*
* Shuffle array using Fisher-Yates algorithm. Scan the array and swap
* current item (nelm datums starting at ielms) with a randomly chosen
* later item (nelm datums starting at jelms) in each iteration. We can
* stop once we've done n iterations; then first n items are the result.
*/
* Shuffle array using Fisher-Yates algorithm. Scan the array and swap
* current item (nelm datums starting at ielms) with a randomly chosen
* later item (nelm datums starting at jelms) in each iteration. We can
* stop once we've done n iterations; then first n items are the result.
*/
seems wrong. For n = 1 the returned item could never be the 1st item of the array (see "randomly chosen later item").
If this really is the case then the result is not really random. But to me it seems j later can be 0 (making it not really "later"), so this might only be a documentation issue.
Best regards
Salek Talangi
Am Mi., 19. Apr. 2023 um 13:48 Uhr schrieb Daniel Gustafsson <daniel@yesql.se>:
> On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like
>> this tomorrow.
>
> Since we're running out of time, I took the liberty of fixing and
> pushing this.
Great, thanks!
--
Daniel Gustafsson