Re: [PATCH] Introduce array_shuffle() and array_sample() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [PATCH] Introduce array_shuffle() and array_sample()
Date
Msg-id 344667.1664480006@sss.pgh.pa.us
Whole thread Raw
In response to Re: [PATCH] Introduce array_shuffle() and array_sample()  (Martin Kalcher <martin.kalcher@aboutsource.net>)
Responses Re: [PATCH] Introduce array_shuffle() and array_sample()
Re: [PATCH] Introduce array_shuffle() and array_sample()
List pgsql-hackers
Martin Kalcher <martin.kalcher@aboutsource.net> writes:
> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today.  I find this behavior a bit
surprising:

+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims
+-------------
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection
order.

What's "selection order"?  And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs.  Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+        /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped.  You could make
it so, and that might be more readable:

        j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1);
        jelms = elms + j * nelm;
        jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+    int         n = PG_GETARG_INT32(1);
+    ArrayType  *result;
+    ... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls.  Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'.  If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem?  In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: making relfilenodes 56 bits
Next
From: Robert Haas
Date:
Subject: Re: problems with making relfilenodes 56-bits