Re: pgbench - add pseudo-random permutation function - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: pgbench - add pseudo-random permutation function |
Date | |
Msg-id | CAEZATCVbotfL4WOiwDshBn2CYo+-gKv45x74jM4A8qFFFiRALQ@mail.gmail.com Whole thread Raw |
In response to | pgbench - add pseudo-random permutation function (Fabien COELHO <coelho@cri.ensmp.fr>) |
Responses |
Re: pgbench - add pseudo-random permutation function
|
List | pgsql-hackers |
On Sun, 14 Mar 2021 at 16:08, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > My main question on this now is, do you have a scholar reference for > > this algorithm? > > Nope, otherwise I would have put a reference. I'm a scholar though, if > it helps:-) > > I could not find any algorithm that fitted the bill. The usual approach > (eg benchmarking designs) is too use some hash function and assume that it > is not a permutation, too bad. > > Basically the algorithm mimics a few rounds of cryptographic encryption > adapted to any size and simple operators, whereas encryption function are > restricted to power of two blocks (eg the Feistel network). The structure > is the same AES with its AddRoundKey the xor-ing stage (adapted to non > power of two in whiten above), MixColumns which does the scattering, and > for key expansion, I used Donald Knuth generator. Basically one could say > that permute is an inexpensive and insecure AES:-) > I spent a little time playing around with this, trying to come up with a reasonable way to test it. It's apparent from the code and associated comments that the transformation is bijective and so will produce a permutation, but it's less obvious how random that permutation will be. Obviously the Fisher-Yates algorithm would give the most random permutation, but that's not really practical in this context. Any other algorithm will, in some sense, be less random than that, so I think it's really just a question of testing that it's random enough to satisfy the intended use case. The approach to testing I tried was to use the Kolmogorov-Smirnov test to test how uniformly random a couple of quantities are: 1). For a given size and fixed input value, and a large number of randomly chosen seed values, is the output uniformly random? 2). For a given size and a pair of nearby input values, are the two output values uniformly randomly spaced apart? This second test seems desirable to ensure sufficient shuffling so that inputs coming from some non-uniform random distribution are scattered sufficiently to distribute the maxima/minima (x -> x + rand mod size would pass test 1, so that by itself is insufficient). I tested this using the attached (somewhat ugly) stand-alone test C program, and for the most part these 2 tests seemed to pass. The program also tests that the output really is a permutation, just to be sure, and this was confirmed in all cases. However, I noticed that the results are less good when size is a power of 2. In this case, the results seem significantly less uniform, suggesting that for such sizes, there is an increased chance that the permuted output might still be skewed. Based on the above, I also experimented with a different permutation algorithm (permute2() in the test), which attempts to inject more randomness into the bijection, and between pairs of inputs, to make the output less predictable and less likely to be accidentally non-uniform. It's possible that it suffers from other problems, but it did do better in these 2 tests. Maybe some other tests might be useful, but I really don't know. As noted above, any algorithm is likely to lead to some pattern in the output (e.g., it looks like both these algorithms cause adjacent inputs to always end up separated by an odd number), so a judgement call will be required to decide what is random enough. Regards, Dean
Attachment
pgsql-hackers by date: