Re: pgbench - add pseudo-random permutation function - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: pgbench - add pseudo-random permutation function
Date
Msg-id CAEZATCXBZqSBr5iBhH-jQXjo2k04twnscX4Wi4JdMX8=2DdLSw@mail.gmail.com
Whole thread Raw
In response to Re: pgbench - add pseudo-random permutation function  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: pgbench - add pseudo-random permutation function
List pgsql-hackers
On Fri, 2 Apr 2021 at 06:38, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> >>   r = (uint64) (pg_erand48(random_state.xseed) * size);
> >>
> >> I do not understand why the random values are multiplied by anything in
> >> the first place…
> >
> > These are just random integers in the range [0,mask] and [0,size-1],
> > formed in exactly the same way as getrand().
>
> Indeed, erand returns a double, this was the part I was missing. I did not
> realize that you had switched to doubles in your approach.
>
> I think that permute should only use integer operations. I'd suggest to
> use one of the integer variants instead of going through a double
> computation and casting back to int. The internal state is based on
> integers, I do not see the added value of going through floats, possibly
> enduring floating point issues (undeflow, rounding, normalization,
> whatever) on the way, whereas from start to finish we just need ints.
>

This is the already-established coding pattern used in getrand() to
pick a random number uniformly in some range that's not necessarily a
power of 2.

Floating point underflow and normalisation issues are not possible
because erand48() takes a 48-bit integer N and uses ldexp() to store
N/2^48 in a double, which is an exact operation since IEEE doubles
have something like 56-bit mantissas. This is then turned back into an
integer in the required range by multiplying by the desired maximum
value, so there's never any risk of underflow or normalisation issues.
If you didn't do it that way, you'd need to rely on some larger
integer datatype, such as 128-bit integers.

I guess that there may be rounding variations once the required
maximum value exceeds something like 2^56 (although the comment in
getrand() is much more conservative than that), so it's possible that
a pgbench script calling random() with (ub-lb) larger than that might
give different results on different platforms. For the non-uniform
random functions, that effect might well kick in sooner. I'm not aware
of any field complaints about that though, possibly because real-world
data sizes are significantly smaller than that.

In practice, permute() is likely to take its input from one of the
non-uniform random functions, so it won't be permute() that first
introduces rounding issues.

> See attached v27 proposal.
>

This update has a number of flaws. For example, this:

+static uint64
+randu64(RandomState * state)
+{
+    uint64 r1 = pg_jrand48((*state).xseed),
+           r2 = pg_jrand48((*state).xseed);
+    return r1 << 51 | r2 << 13 | r1 >> 13;
+}

It still uses a 48-bit RandomState, so it doesn't improve on getrand()
in that respect.

It replaces a single erand48() call with 2 jrand48() calls, which
comes at a cost in terms of performance. (In fact, changing the number
of rounds in the previous version of permute() from 4 to 6 has a
smaller performance impact than this -- more about that below.)

jrand48() returns a signed 32-bit integer, which has a 50% chance of
being negative, so when that is cast to a uint64, there is a 50%
chance that the 32 most significant bits will be 1. When the various
parts are OR'ed together, that will then mask out any randomness from
the other parts. For example, 50% of the time, the jrand48() value
used for r1 will be negative, and so 32 bits in the middle of the
final result will all be set.

There is essentially no randomness at all in bits 45..50, and the r1
and r2 values overlap in bits 13..18, giving them a 75% chance of
being set.

So overall, the results will be highly non-uniform, with less
randomness and poorer performance than erand48().

In addition, it returns a result in the range [0,2^64), which is not
really what's wanted. For example:

+        /* Random offset */
+        r = randu64(&random_state2);
+        v = (v + r) % size;

The previous code used r = getrand(0, size-1), which gave a uniformly
random offset. However, the new code risks introducing additional
non-uniformity when size is not a power of 2.

Finally, worst of all, this random offset is no longer bijective, due
to 64-bit integer wrap-around. For example, suppose that size=100 and
r=(2^64-10), then the following 2 values both map to the same result:

  v = 20 -> (v + r) % size
          = (20 + (2^64 - 10)) % 100
          = (2^64 + 10) % 100
          = (10) % 100
          = 10

  v = 4 -> (v + r) % size
         = (4 + (2^64 - 10)) % 100
         = (2^64 - 6) % 100
         = (18446744073709551610) % 100
         = 10

So not only are the results no longer uniformly random, they might not
even form a permutation.

I did some more testing of the previous version (v26), this time
looking at much larger sizes, all the way up to the maximum, which is
2^63-1 since it comes from a signed int64. In general, the results
were very good, though I did notice some slight non-uniformity in the
way adjacent inputs were separated from another when the size was just
under a power of 2. I think that's the hardest case for this
algorithm, because there's very little overlap between the 2 halves.
Increasing the number of rounds from 4 to 6 ironed out that
non-uniformity (and as mentioned above, is still cheaper than using
randu64() with 4 rounds), so I think we should go with that.

> I still think that *rand48 is a poor (relatively small state) and
> inefficient (the implementation includes packing and unpacking 16 bits
> ints to build a 64 bits int) choice.
>

I can somewhat understand your dislike of *rand48(), but in the
context of pgbench I think that it is perfectly adequate. Since we now
use 2 RandomStates, I don't think the state size is an issue anymore,
if it ever really was. Using erand48() gives better performance than
jrand48() because it returns 48 bits rather than 32, so fewer calls
are needed, which allows more rounds for the same cost. Additionally,
following the same pattern as existing code reduces the risk of bugs,
and builds on proven, tested code.

You may wish to submit a separate patch to replace pgbench's use of
*rand48() with something else, and that would be discussed on its own
merits, but I don't see why that should hold up adding permute().

Regards,
Dean



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: ALTER TABLE ADD COLUMN fast default
Next
From: Tom Lane
Date:
Subject: Re: SP-GiST confusion: indexed column's type vs. index column type