Re: pgbench - add pseudo-random permutation function - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: pgbench - add pseudo-random permutation function |
Date | |
Msg-id | CAEZATCXxmmgjEFFkB4-N-xMrbMit89oGhef4iLiusVvWzDnqcw@mail.gmail.com Whole thread Raw |
In response to | Re: pgbench - add pseudo-random permutation function (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: pgbench - add pseudo-random permutation function
|
List | pgsql-hackers |
On Mon, 22 Mar 2021 at 13:43, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > On Sun, 14 Mar 2021 at 16:08, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > > > My main question on this now is, do you have a scholar reference for > > > this algorithm? > > > > Nope, otherwise I would have put a reference. I'm a scholar though, if > > it helps:-) > > > > I could not find any algorithm that fitted the bill. The usual approach > > (eg benchmarking designs) is too use some hash function and assume that it > > is not a permutation, too bad. > > > > Basically the algorithm mimics a few rounds of cryptographic encryption > > adapted to any size and simple operators, whereas encryption function are > > restricted to power of two blocks (eg the Feistel network). The structure > > is the same AES with its AddRoundKey the xor-ing stage (adapted to non > > power of two in whiten above), MixColumns which does the scattering, and > > for key expansion, I used Donald Knuth generator. Basically one could say > > that permute is an inexpensive and insecure AES:-) > > > > I spent a little time playing around with this, trying to come up with > a reasonable way to test it. I spent more time testing this -- the previous testing was mostly for large values of size, so I decided to also test it for small sizes. The theoretical number of possible permutations grows rapidly with size (size!), and the actual number of permutations observed grew almost as quickly: size size! distinct perms 2 2 2 3 6 6 4 24 24 5 120 120 6 720 720 7 5040 5040 8 40320 24382 9 362880 181440 10 3628800 3627355 My test script stopped once the first permutation had been seen 100 times, so it might have seen more permutations had it kept going for longer. However, looking at the actual output, there is a very significant non-uniformity in the probability of particular permutations being chosen, especially at certain sizes like 8 and 10. For example, at size = 8, the identity permutation is significantly more likely than almost any other permutation (roughly 13 times more likely than it would be by random chance). Additionally, the signs are that this non-uniformity tends to increase with size. At size = 10, the average number of occurrences of each permutation in the test was 148, but there were some that didn't appear at all, many that only appeared 2 or 3 times, and then some that appeared nearly 5000 times. Also, there appears to be an unfortunate dependence on the seed -- if the seed is even and the size is a power of 2, it looks like the number of distinct permutations produced is just size/2, which is a tiny fraction of size!. This, together with the results from the previous K-S uniformity tests at larger sizes, suggests that the current algorithm may not be random enough to scatter values and remove correlations from a non-uniform distribution. In an attempt to do better, I tweaked the algorithm in the attached patch, which makes use of erand48() to inject more randomness into the permutation choice. Repeating the tests with the updated algorithm shows that it has a number of advantages: 1). For small sizes (2-10), each of the size! possible permutations is now produced with roughly equal probability. No single permutation was much more likely than expected. 2). The loss of randomness for even seeds is gone. 3). For any given input, the output is more uniformly randomly distributed for different seeds. 4). For any pair of nearby inputs, the corresponding outputs are more uniformly randomly separated from one another. 5). The updated algorithm no longer uses modular_multiply(), so it works the same on all platforms. I still feel slightly uneasy about inventing our own algorithm here, but I wasn't able to find anything else suitable, and I've now tested this quite extensively. All the indications are that this updated algorithm works well at all sizes and produces sufficiently random results, though if anyone else can think of other approaches to testing the core algorithm, that would be useful. For reference, I attach 2 standalone test C programs I used for testing, which include copies of the old and new algorithms. I also did a quick copy editing pass over the docs, and I think the patch is in pretty good shape. Barring objections, I'll see about committing it later this week. Regards, Dean
Attachment
pgsql-hackers by date: