Re: pgbench - add pseudo-random permutation function - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: pgbench - add pseudo-random permutation function
Date
Msg-id CAEZATCVbotfL4WOiwDshBn2CYo+-gKv45x74jM4A8qFFFiRALQ@mail.gmail.com
Whole thread Raw
In response to pgbench - add pseudo-random permutation function  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: pgbench - add pseudo-random permutation function
List pgsql-hackers
On Sun, 14 Mar 2021 at 16:08, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> > My main question on this now is, do you have a scholar reference for
> > this algorithm?
>
> Nope, otherwise I would have put a reference. I'm a scholar though, if
> it helps:-)
>
> I could not find any algorithm that fitted the bill. The usual approach
> (eg benchmarking designs) is too use some hash function and assume that it
> is not a permutation, too bad.
>
> Basically the algorithm mimics a few rounds of cryptographic encryption
> adapted to any size and simple operators, whereas encryption function are
> restricted to power of two blocks (eg the Feistel network). The structure
> is the same AES with its AddRoundKey the xor-ing stage (adapted to non
> power of two in whiten above), MixColumns which does the scattering, and
> for key expansion, I used Donald Knuth generator. Basically one could say
> that permute is an inexpensive and insecure AES:-)
>

I spent a little time playing around with this, trying to come up with
a reasonable way to test it. It's apparent from the code and
associated comments that the transformation is bijective and so will
produce a permutation, but it's less obvious how random that
permutation will be. Obviously the Fisher-Yates algorithm would give
the most random permutation, but that's not really practical in this
context. Any other algorithm will, in some sense, be less random than
that, so I think it's really just a question of testing that it's
random enough to satisfy the intended use case.

The approach to testing I tried was to use the Kolmogorov-Smirnov test
to test how uniformly random a couple of quantities are:

1). For a given size and fixed input value, and a large number of
randomly chosen seed values, is the output uniformly random?

2). For a given size and a pair of nearby input values, are the two
output values uniformly randomly spaced apart?

This second test seems desirable to ensure sufficient shuffling so
that inputs coming from some non-uniform random distribution are
scattered sufficiently to distribute the maxima/minima (x -> x + rand
mod size would pass test 1, so that by itself is insufficient).

I tested this using the attached (somewhat ugly) stand-alone test C
program, and for the most part these 2 tests seemed to pass. The
program also tests that the output really is a permutation, just to be
sure, and this was confirmed in all cases.

However, I noticed that the results are less good when size is a power
of 2. In this case, the results seem significantly less uniform,
suggesting that for such sizes, there is an increased chance that the
permuted output might still be skewed.

Based on the above, I also experimented with a different permutation
algorithm (permute2() in the test), which attempts to inject more
randomness into the bijection, and between pairs of inputs, to make
the output less predictable and less likely to be accidentally
non-uniform. It's possible that it suffers from other problems, but it
did do better in these 2 tests.

Maybe some other tests might be useful, but I really don't know. As
noted above, any algorithm is likely to lead to some pattern in the
output (e.g., it looks like both these algorithms cause adjacent
inputs to always end up separated by an odd number), so a judgement
call will be required to decide what is random enough.

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Failed assertion on standby while shutdown
Next
From: Stephen Frost
Date:
Subject: Re: Disable WAL logging to speed up data loading