Thread: pgbench - add pseudo-random permutation function
Hello, This patch adds a pseudo-random permutation function to pgbench. It allows non-uniform random keys to be mixed, avoiding trivial correlations between neighboring values, and hence between pages. The function is a simplistic form of encryption adapted to any size, using a few iterations of scramble and scatter phases. The result is not cryptographically convincing, nor even statistically so, but it is quite inexpensive and achieves the desired result. A computation costs 0.22 µs per call on my laptop, about three times the cost of a simple function. Alternative designs, such as iterating over an actual encryption function or using some sbox, would lead to much more costly solutions and complex code. I also attach a few scripts I used for testing. -- Fabien.
Attachment
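A note for readers new to the thread: the "scatter" phase relies on a standard number-theoretic fact, namely that v -> (v * p) % size is a bijection on [0, size) whenever p is coprime to size. Below is a minimal, self-contained C illustration of that building block (a hypothetical demo, not code from the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t size = 10, prime = 7;    /* gcd(7, 10) == 1 */
    int      seen[10] = {0};
    uint64_t v;

    for (v = 0; v < size; v++)
    {
        uint64_t out = (v * prime) % size;  /* the "scatter" step */

        seen[out]++;
        printf("%llu -> %llu\n", (unsigned long long) v, (unsigned long long) out);
    }

    /* every slot is hit exactly once: no collisions, no holes */
    for (v = 0; v < size; v++)
        if (seen[v] != 1)
            printf("collision or hole at %llu!\n", (unsigned long long) v);
    return 0;
}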
Hi Fabien, I reviewed `pgbench-prp-func-1.patch` and found an incomplete implementation. In the pseudorandom_perm function, I confirmed that the scramble and scatter operations are mathematically bijections, so the function is mathematically correct. However, the implementation of the scatter operation in this patch overflows in many cases when the variable size is a 38-bit integer or greater, because size is multiplied by items of the array primes[], which stores 27-29 bit integers. If overflow occurs, the scatter operation is no longer bijective. I did a sampling inspection with size = 1099511627773 (a 40-bit integer) and seed = 5432. Even with very few samples, I found a huge number of collisions, as shown below: pr_perm(409749816, 1099511627773, 5432) = pr_perm(491041141, 1099511627773, 5432) = pr_perm(729171766, 1099511627773, 5432) = pr_perm(849775914, 1099511627773, 5432) = 22445482629, pr_perm(45609644, 1099511627773, 5432) = pr_perm(174005122, 1099511627773, 5432) = pr_perm(811754941, 1099511627773, 5432) = pr_perm(1131176891, 1099511627773, 5432) = 10626826534, and so on. There are two ways to resolve this issue. The first is to reduce the maximum value that can be set for size. The second is to add a modular multiplication function that avoids overflow. I made a patch including a function that performs modular multiplication without overflow, and attached it to this mail. (I also attached a simple test suite for the function I added.) In the other parts of the original patch, I could apply the patch and run tests without trouble; the documentation and comments are fine, except for one comment in the function, shown below: >> (1) scramble: partial xors on power-or-2 subsets I could not understand its meaning when I read it the first time. Best regards, On 2018/07/28 15:03, Fabien COELHO wrote: > > Hello, > > This patch adds a pseudo-random permutation function to pgbench. It > allows non-uniform random keys to be mixed, avoiding trivial correlations > between neighboring values, and hence between pages. > > The function is a simplistic form of encryption adapted to any size, > using a few iterations of scramble and scatter phases. The result is not > cryptographically convincing, nor even statistically so, but it is quite > inexpensive and achieves the desired result. A computation costs 0.22 µs > per call on my laptop, about three times the cost of a simple function. > > Alternative designs, such as iterating over an actual encryption > function or using some sbox, would lead to much more costly solutions > and complex code. > > I also attach a few scripts I used for testing. >
Attachment
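The fix discussed above amounts to what the thread later calls interleaved modular multiplication: compute (a * b) % m by adding a into the result once per set bit of b, doubling a modulo m at each step, so that no intermediate value ever exceeds twice the modulus. A minimal sketch of the idea, assuming the modulus stays below 2^63 so the additions cannot wrap (hypothetical code, not the patch's exact implementation):

#include <stdint.h>

/* (a * b) % m without 64-bit overflow, by the double-and-add method. */
static uint64_t
mul_mod(uint64_t a, uint64_t b, uint64_t m)
{
    uint64_t result = 0;

    a %= m;
    while (b > 0)
    {
        if (b & 1)
            result = (result + a) % m;  /* safe: result, a < m < 2^63 */
        a = (a + a) % m;                /* double a modulo m */
        b >>= 1;
    }
    return result;
}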
Hello Hironobu-san, > I reviewed `pgbench-prp-func-1.patch` and found an incomplete implementation. Indeed; thanks a lot for the debugging, and for the proposed fix! I'm going to work a little more on the patch based on your proposed changes, and submit a v3, hopefully soon. -- Fabien.
Hello Hironobu-san, > However, the implementation of the scatter operation in this patch overflows > in many cases when the variable size is a 38-bit integer or greater, because > size is multiplied by items of the array primes[], which stores 27-29 bit > integers. If overflow occurs, the scatter operation is no longer bijective. Indeed. Again, thanks for the debugging! As you contributed some code, I added you as a co-author in the CF entry. Attached is a v3, based on your fix, plus some additional changes: - explicitly declare unsigned variables where appropriate, to avoid casts - use smaller 24-bit primes instead of 27-29 bit ones - add a shortcut for a multiplier below 24 bits and a y value below 40 bits, which should avoid the manually implemented multiplication in most practical cases (tables with over 2^40 rows are pretty rare...) - change the existing shortcut to look at the number of bits instead of using 32-bit limits - add a test for minimal code coverage with over-40-bit sizes - attempt to improve the documentation - some comments were updated, hopefully for the better The main idea behind the smaller primes is to avoid the expensive modmul implementation in most realistic cases. -- Fabien.
Attachment
Hello Hironobu-san, Here is a v4, based on our out-of-list discussion: - the mask function is factored out - the popcount builtin is used if available > Attached is a v3, based on your fix, plus some additional changes: > - explicitly declare unsigned variables where appropriate, to avoid casts > - use smaller 24-bit primes instead of 27-29 bit ones > - add a shortcut for a multiplier below 24 bits and a y value below 40 bits, > which should avoid the manually implemented multiplication in most > practical cases (tables with over 2^40 rows are pretty rare...) > - change the existing shortcut to look at the number of bits instead of > using 32-bit limits > - add a test for minimal code coverage with over-40-bit sizes > - attempt to improve the documentation > - some comments were updated, hopefully for the better -- Fabien.
Attachment
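For reference, a rough sketch of how the factored-out mask function and the popcount-based bit count fit together (reconstructed from snippets quoted later in the thread; the patch's code may differ in details):

#include <stdint.h>

/* Smallest all-ones mask covering n: propagate the highest set bit down. */
static uint64_t
compute_mask(uint64_t n)
{
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n |= n >> 32;
    return n;
}

/* Length of n's binary representation: the mask has exactly one set bit
 * per significant bit of n, so popcount of the mask is the bit length. */
static int
nbits(uint64_t n)
{
#ifdef HAVE__BUILTIN_POPCOUNTLL
    return __builtin_popcountll(compute_mask(n));
#else
    /* naive stand-in for illustration; the patch uses a branch-free version */
    int      count = 0;
    uint64_t mask = compute_mask(n);

    while (mask)
    {
        count++;
        mask >>= 1;
    }
    return count;
#endif
}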
Hi Fabien-san, I reviewed 'pgbench-prp-func/pgbench-prp-func-4.patch'. I could apply it and ran the TAP test successfully. I could not find any typos in the documentation or comments. To make sure, I checked the new routine which contains the __builtin_popcountll() function on Linux + gcc 7.3.1, and I confirmed that it works well. I think this patch is fine. Best regards, On 2018/09/16 21:05, Fabien COELHO wrote: > > Hello Hironobu-san, > > Here is a v4, based on our out-of-list discussion: > - the mask function is factored out > - the popcount builtin is used if available > >> Attached is a v3, based on your fix, plus some additional changes: >> - explicitly declare unsigned variables where appropriate, to avoid casts >> - use smaller 24-bit primes instead of 27-29 bit ones >> - add a shortcut for a multiplier below 24 bits and a y value below 40 bits, >> which should avoid the manually implemented multiplication in most >> practical cases (tables with over 2^40 rows are pretty rare...) >> - change the existing shortcut to look at the number of bits instead of >> using 32-bit limits >> - add a test for minimal code coverage with over-40-bit sizes >> - attempt to improve the documentation >> - some comments were updated, hopefully for the better >
> I reviewed 'pgbench-prp-func/pgbench-prp-func-4.patch'. [...] I think > this patch is fine. Thanks! You may consider turning it "ready" in the CF app, so as to see whether some committer agrees with you. -- Fabien.
On Wed, Sep 19, 2018 at 7:07 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > I reviewed 'pgbench-prp-func/pgbench-prp-func-4.patch'. [...] I thinks > > this patch is fine. modular_multiply() is an interesting device. I will leave this to committers with a stronger mathematical background than me, but I have a small comment in passing: +#ifdef HAVE__BUILTIN_POPCOUNTLL + return __builtin_popcountll(n); +#else /* use clever no branching bitwise operator version */ I think it is not enough for the compiler to have __builtin_popcountll(). The CPU that this is eventually executed on must actually have the POPCNT instruction[1] (or equivalent on ARM, POWER etc), or the program will die with SIGILL. There are CPUs in circulation produced in this decade that don't have it. I have previously considered something like this[2], but realised you would therefore need a runtime check. There are a couple of ways to do that: see commit a7a73875 for one example, also __builtin_cpu_supports(), and direct CPU ID bit checks (some platforms). There is also the GCC "multiversion" system that takes care of that for you but that worked only for C++ last time I checked. [1] https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT [2] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
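For reference, the "clever no branching bitwise operator version" that the quoted #else comment alludes to is typically the classic SWAR bit-counting trick, sketched below (a standard formulation; the patch's exact fallback may differ):

#include <stdint.h>

/* Branch-free population count: sum bits in parallel within 2-, 4- and
 * 8-bit lanes, then gather the per-byte sums with one multiplication. */
static int
popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) + ((x >> 2) & UINT64_C(0x3333333333333333));
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}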
Hello Thomas, > modular_multiply() is an interesting device. I will leave this to > committers with a stronger mathematical background than me, but I have > a small comment in passing: For testing this function, I have manually commented out the various shortcuts so that the manual multiplication code is always used, and the tests passed. I just did it again. > +#ifdef HAVE__BUILTIN_POPCOUNTLL > + return __builtin_popcountll(n); > +#else /* use clever no branching bitwise operator version */ > > I think it is not enough for the compiler to have > __builtin_popcountll(). The CPU that this is eventually executed on > must actually have the POPCNT instruction[1] (or equivalent on ARM, > POWER etc), or the program will die with SIGILL. Hmmm, I'd be pretty surprised: the point of the builtin is to delegate to the compiler the hassle of selecting the best option available, depending on the target hardware class: whether it issues a cpu/sse4 instruction is not/should not be our problem. > There are CPUs in circulation produced in this decade that don't have > it. Then the compiler, when generating code that is expected to run on a large class of hardware which includes such old ones, should not use a possibly unavailable instruction, or the runtime should take responsibility for dynamically selecting a viable option. My understanding is that it should always be safe to call __builtin_XYZ functions when available. Then if you compile saying that you want code specific to cpu X and then run it on cpu Y and it fails, basically you just shot yourself in the foot. > I have previously considered something like this[2], but realised you > would therefore need a runtime check. There are a couple of ways to do > that: see commit a7a73875 for one example, also > __builtin_cpu_supports(), and direct CPU ID bit checks (some platforms). > There is also the GCC "multiversion" system that takes care of that for > you but that worked only for C++ last time I checked. ISTM that the purpose of a dynamic check would be to get the benefit of the better hardware even when compiling for a generic class of hardware which may or may not have the better option. This approach is fine for reaching a better performance/portability compromise, but I do not think that it is that useful for pgbench to go to this level of optimization, unless someone else takes care of it, hence the compiler builtin. An interesting side effect of your mail is that while researching a bit on the subject I noticed __builtin_clzll(), which helps improve the nbits code compared to using popcount. The attached patch uses CLZ instead of POPCOUNT. -- Fabien.
Attachment
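The CLZ-based nbits mentioned above can be sketched as follows. Note that __builtin_clzll(0) is undefined, so the zero case needs handling; the guard below is one simple way to do it, although the thread elsewhere expresses a preference for branch-free code (an illustrative sketch, not necessarily the patch's formulation):

#include <stdint.h>

/* Bit length of n, derived from the position of the highest set bit. */
static int
nbits_clz(uint64_t n)
{
#ifdef HAVE__BUILTIN_CLZLL
    return n == 0 ? 0 : 64 - __builtin_clzll(n);
#else
    int count = 0;

    while (n)
    {
        count++;
        n >>= 1;
    }
    return count;
#endif
}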
On Wed, Sep 26, 2018 at 8:26 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > modular_multiply() is an interesting device. I will leave this to > > committers with a stronger mathematical background than me, but I have > > a small comment in passing: > > For testing this function, I have manually commented out the various > shortcuts so that the manual multiplication code is always used, and the > tests passed. I just did it again. > > > +#ifdef HAVE__BUILTIN_POPCOUNTLL > > + return __builtin_popcountll(n); > > +#else /* use clever no branching bitwise operator version */ > > > > I think it is not enough for the compiler to have > > __builtin_popcountll(). The CPU that this is eventually executed on > > must actually have the POPCNT instruction[1] (or equivalent on ARM, > > POWER etc), or the program will die with SIGILL. > > Hmmm, I'd be pretty surprised: the point of the builtin is to delegate to > the compiler the hassle of selecting the best option available, depending > on the target hardware class: whether it issues a cpu/sse4 instruction is > not/should not be our problem. True, it selects based on what you tell it: $ cat x.c #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { printf("%d\n", __builtin_popcount(atoi(argv[1]))); } $ gdb -q a.out ... (gdb) disass main ... 0x00000000004005a4 <+39>: callq 0x4005c0 <__popcountdi2> ... $ gcc -mpopcnt x.c $ gdb -q a.out ... (gdb) disass main ... 0x000000000040059f <+34>: popcnt %eax,%eax ... I'd forgotten one detail... we figure out if we need to tell it that we have SSE4.2 instructions explicitly in the configure script: checking which CRC-32C implementation to use... SSE 4.2 with runtime check We enable -msse4.2 just on the one file that needs it, in src/port/Makefile: pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42) On this CentOS machine I see: gcc ... -msse4.2 ... -c pg_crc32c_sse42.c -o pg_crc32c_sse42_srv.o That is necessary because most people consume PostgreSQL through packages from distributions that have to work on an Athlon II or whatever, so we can't just use -msse4.2 for every translation unit. So it becomes our job to isolate small bits of code that use newer instructions, if it's really worth the effort to do that, and supply our own runtime checks and provide a fallback. > > There are CPUs in circulation produced in this decade that don't have > > it. > > Then the compiler, when generating code that is expected to run on a > large class of hardware which includes such old ones, should not use a > possibly unavailable instruction, or the runtime should take > responsibility for dynamically selecting a viable option. Right. Our problem is that if we didn't do our own runtime testing, users of (say) Debian and RHEL packages (= most users of PostgreSQL) would effectively never be able to use things like CRC32 instructions. None of that seems worth it for something like this. > An interesting side effect of your mail is that while researching a bit on > the subject I noticed __builtin_clzll(), which helps improve the nbits code > compared to using popcount. The attached patch uses CLZ instead of POPCOUNT. It's annoying that fls() didn't make it into POSIX along with ffs(). On my system it compiles to a BSR instruction, just like __builtin_clz(). -- Thomas Munro http://www.enterprisedb.com
Hello, > That is necessary because most people consume PostgreSQL through > packages from distributions that have to work on an Athlon II or > whatever, so we can't just use -msse4.2 for every translation unit. > So it becomes our job to isolate small bits of code that use newer > instructions, if it's really worth the effort to do that, and supply > our own runtime checks and provide a fallback. Ok. That was my understanding of how to improve the portability/performance compromise. I do not think that pgbench is worth the effort on this particular point. > [...] None of that seems worth it for something like this. Indeed. So, am I right in deducing that you are satisfied with the current status of the patch, with the nbits implementation based either on popcount (v4) or clz (v5) compiler intrinsics? I think that the clz option is better. -- Fabien.
On Wed, Sep 26, 2018 at 01:20:49PM +0200, Fabien COELHO wrote: > So, am I right in deducing that you are satisfied with the current status of > the patch, with the nbits implementation based either on popcount (v4) or > clz (v5) compiler intrinsics? I think that the clz option is better. Fabien, please note that v5 does not apply, so I moved this patch to the next CF, waiting on author. -- Michael
Attachment
> On Wed, Sep 26, 2018 at 01:20:49PM +0200, Fabien COELHO wrote: >> So, am I right in deducing that you are satisfied with the current status of >> the patch, with the nbits implementation based either on popcount (v4) or >> clz (v5) compiler intrinsics? I think that the clz option is better. > > Fabien, please note that v5 does not apply, Indeed, the tests interact with 92a0342, committed on Thursday. Here is a rebase of the latest version, based on CLZ. According to basic tests I made, the underlying hardware instruction seems to be more often available. > so I moved this patch to the next CF, waiting on author. I'm going to change its state to "Needs review". -- Fabien.
Attachment
Hi, I reviewed pgbench-prp-func-6.patch. (1) modular_multiply() In modular_multiply(), the numbers of digits of the inputs are checked in order to decide whether or not to use the interleaved modular multiplication algorithm. However, the check condition depends entirely on the implementation of pseudorandom_perm(), even though modular_multiply() is a general function. I think it is better for pseudorandom_perm() to check the condition directly, because then it works efficiently and modular_multiply() can be used as a general-purpose function. (2) pseudorandom_perm() and 001_pgbench_with_server.pl Reading the discussion of the __builtin_xxx functions, I remembered another overflow issue. pseudorandom_perm() uses Donald Knuth's linear congruential generator to obtain pseudo-random numbers. This method, like all linear congruential generator methods, always overflows. If we did not need to guarantee the portability of this code, we would not care about the result of this method, because we just use *pseudo* random numbers. (In fact, I ignored it before.) However, since we have to guarantee portability, we have to calculate it accurately. I therefore implemented the function dk_lcg() and rewrote pseudorandom_perm(). Just to be sure, I wrote a Python script to check the algorithm of pseudorandom_perm() and ran it. Fortunately, my Python code and the pseudorandom_perm() I rewrote returned the same answers, so I updated the answers in 001_pgbench_with_server.pl. I attached the new patch `pgbench-prp-func-7.patch`, the Python code `pseudorandom_perm.py`, and the pr_perm check script file `pr_perm_check.sql`. Best regards, On 2018/10/01 12:19, Fabien COELHO wrote: >> On Wed, Sep 26, 2018 at 01:20:49PM +0200, Fabien COELHO wrote: >>> So, am I right in deducing that you are satisfied with the current >>> status of >>> the patch, with the nbits implementation based either on popcount >>> (v4) or >>> clz (v5) compiler intrinsics? I think that the clz option is better. >> >> Fabien, please note that v5 does not apply, > > Indeed, the tests interact with 92a0342, committed on Thursday. > > Here is a rebase of the latest version, based on CLZ. According to basic > tests I made, the underlying hardware instruction seems to be more often > available. > >> so I moved this patch to the next CF, waiting on author. > > I'm going to change its state to "Needs review". >
Attachment
Hello Hironobu-san, > I reviewed pgbench-prp-func-6.patch. Thanks. > (1) modular_multiply() > In modular_multiply(), the numbers of digits of the inputs are checked in order > to decide whether or not to use the interleaved modular multiplication algorithm. > However, the check condition depends entirely on the implementation of > pseudorandom_perm(), even though modular_multiply() is a general function. > > I think it is better for pseudorandom_perm() to check the condition directly, > because then it works efficiently and modular_multiply() can be used as a > general-purpose function. You moved the shortcut to the caller. Why not: it removes one modulo operation and avoids the call altogether. > (2) pseudorandom_perm() and 001_pgbench_with_server.pl > Reading the discussion of the __builtin_xxx functions, I remembered another > overflow issue. > > pseudorandom_perm() uses Donald Knuth's linear congruential generator to > obtain pseudo-random numbers. This method, like all linear congruential > generator methods, always overflows. > If we did not need to guarantee the portability of this code, ISTM that we do not need such changes: the code is already portable because standard C unsigned operations overflow consistently, and this is exactly what the linear congruential generator expects. If an alternate implementation had to be provided, given the heavy cost of the modular multiplication function, it would be only for those architectures which fail, detected at configure time. I would not go into this without an actual failing architecture & C compiler. Also, the alternate implementation should not change the result, so something looks amiss in your version. Moreover, I'm unclear how to implement an overflowing multiply with the safe no-overflow version. Here is a v8 which: - moves the shortcut to the caller - changes the r formula as you did - does nothing about Knuth's formula, as nothing should be needed - fixes tests after Peter's exit status changes -- Fabien.
Attachment
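On the portability point settled above: C defines arithmetic on unsigned integers to wrap around modulo 2^64, which is exactly the reduction a 64-bit linear congruential generator wants, so no special handling is needed. A minimal sketch using Knuth's well-known MMIX constants (the patch's constants and shifting may differ):

#include <stdint.h>

/* One LCG step: the multiply and add wrap modulo 2^64 by definition of
 * unsigned arithmetic, so this is portable as-is. */
static uint64_t
lcg_next(uint64_t key)
{
    return key * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
}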
Hello, > Also, the alternate implementation should not change the result, so > something looks amiss in your version. Moreover, I'm unclear how to > implement an overflowing multiply with the safe no-overflow version. (snip) I made an honest mistake: I had assumed that the modulus of Knuth's LCG is (2^64 - 1). BTW, I found another overflow issue. In pseudorandom_perm(), `modular_multiply() + (key >> LCG_SHIFT)` may overflow if the result of modular_multiply() is large. Therefore, I've improved it. Also, I've simplified Step 5 in modular_multiply(). I attached pgbench-prp-func-9.patch. Best regards,
Attachment
Hello Hironobu-san, > In pseudorandom_perm(), `modular_multiply() + (key >> LCG_SHIFT)` may > overflow if the result of modular_multiply() is large. Therefore, I've > improved it. > Also, I've simplified Step 5 in modular_multiply(). Attached is a v10, where I have: - updated some comments - put back the simple formula, with a comment explaining why the + cannot overflow: size is taken from a signed int and the added value is small thanks to the shift - added a few test cases, and fixed the associated checks -- Fabien.
Attachment
Hi Fabien-san, I reviewed 'pgbench-prp-func/pgbench-prp-func-10.patch'. On 2018/10/24 12:55, Fabien COELHO wrote: > > Hello Hironobu-san, > >> In pseudorandom_perm(), `modular_multiply() + (key >> LCG_SHIFT)` may >> overflow if the result of modular_multiply() is large. Therefore, I've >> improved it. > >> Also, I've simplified Step 5 in modular_multiply(). > > Attached is a v10, where I have: > - updated some comments > - put back the simple formula, with a comment explaining why the + cannot > overflow: size is taken from a signed int and the added value is small > thanks to the shift > - added a few test cases, and fixed the associated checks > I agree with the discussion above. I checked the tests you added in this patch and I confirmed that there is no problem. I think this patch is fine. Best regards,
> I think this patch is fine. Thanks! Hopefully some committer will pick it up at some point. -- Fabien.
Can you please pgindent this? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Alvaro, > Can you please pgindent this? Hmmm. After some investigation, I installed "pg_bsd_indent" and ran the "pgindent" script, which reindented far more than the patch... So I picked up the patch-related changes and integrated them manually, although not the comment changes, which break the logic of the algorithm descriptions. I have not found how to tell pgindent to leave comment indentation alone. Here is the result for the code, and for part of the comments. -- Fabien.
Attachment
On 2018-Oct-24, Fabien COELHO wrote: > > Hello Alvaro, > > > Can you please pgindent this? > > Hmmm. After some investigation, I installed "pg_bsd_indent" and ran the > "pgindent" script, which reindented far more than the patch... So I picked > up the patch-related changes and integrated them manually, Cool, thanks. > although not the comment changes, which break the logic of the algorithm > descriptions. I have not found how to tell pgindent to leave comment > indentation alone. You can use /*----- for such comments. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Oct 24, 2018 at 06:00:08PM -0300, Alvaro Herrera wrote: > On 2018-Oct-24, Fabien COELHO wrote: >> although not the comment changes, which break the logic of the algorithm >> descriptions. I have not found how to tell pgindent to leave comment >> indentation alone. > > You can use /*----- for such comments. A recent example of that is what d55241af has done in pg_verify_checksums.c. -- Michael
Attachment
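To illustrate the hint above: pgindent leaves the interior layout of a comment alone when its opening line starts with dashes, so an algorithm description can keep its hand-made formatting, for example:

/*-----
 * scramble: partial xors on power-of-2 subsets
 * (this layout survives pgindent because the comment opens with dashes)
 */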
Hello Alvaro, >> although not the comment changes, which break the logic of the algorithm >> descriptions. I have not found how to tell pgindent to leave comment >> indentation alone. > > You can use /*----- for such comments. Thanks for the hint. Here is an updated patch using this marker. I noticed that some instances use a closing "*-----" sequence, but it does not seem useful, so I left it out. I have also tried to improve a few comments, and moved a declaration into a loop because I did not like the pg-indented version much. -- Fabien.
Attachment
On Thu, Oct 25, 2018 at 08:21:27AM +0200, Fabien COELHO wrote: > Thanks for the hint. Here is an updated patch using this marker. > > I noticed that some instances use a closing "*-----" sequence, but it does > not seem useful, so I left it out. > > I have also tried to improve a few comments, and moved a declaration into a > loop because I did not like the pg-indented version much. This patch has been marked as ready for committer for some time now, and it has rotten, so I am moving it to the next CF, waiting on author. -- Michael
Attachment
I updated the patch, and also added some explanations and simple examples to the modular_multiply function. Fabien-san, to make it easier to understand, I think it would be a good idea to add a brief explanation outlining the pseudorandom_perm function and the bijective functions/transformations it uses. What do you think? Best regards, On 2019/02/02 1:16, Michael Paquier wrote: > On Thu, Oct 25, 2018 at 08:21:27AM +0200, Fabien COELHO wrote: >> Thanks for the hint. Here is an updated patch using this marker. >> >> I noticed that some instances use a closing "*-----" sequence, but it does >> not seem useful, so I left it out. >> >> I have also tried to improve a few comments, and moved a declaration into a >> loop because I did not like the pg-indented version much. > > This patch has been marked as ready for committer for some time now, and it > has rotten, so I am moving it to the next CF, waiting on author. > -- > Michael >
Attachment
Hi, On 2019-02-10 17:46:15 +0000, Hironobu SUZUKI wrote: > I updated the patch, and also added some explanations and simple examples > to the modular_multiply function. It'd be good to update the commitfest entry to say 'needs review' the next time. > +# PGAC_C_BUILTIN_CLZLL > +# ------------------------- > +# Check if the C compiler understands __builtin_clzll(), > +# and define HAVE__BUILTIN_CLZLL if so. > +# Both GCC & CLANG seem to have one. > +AC_DEFUN([PGAC_C_BUILTIN_CLZLL], > +[AC_CACHE_CHECK(for __builtin_clzll, pgac_cv__builtin_clzll, > +[AC_COMPILE_IFELSE([AC_LANG_SOURCE( > +[static unsigned long int x = __builtin_clzll(0xaabbccddeeff0011);] > +)], > +[pgac_cv__builtin_clzll=yes], > +[pgac_cv__builtin_clzll=no])]) > +if test x"$pgac_cv__builtin_clzll" = xyes ; then > +AC_DEFINE(HAVE__BUILTIN_CLZLL, 1, > + [Define to 1 if your compiler understands __builtin_clzll.]) > +fi])# PGAC_C_BUILTIN_CLZLL I think this has been partially superseded by commit 711bab1e4d19b5c9967328315a542d93386b1ac5 Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: 2019-02-13 16:10:06 -0300 Add basic support for using the POPCNT and SSE4.2s LZCNT opcodes Could you make sure it's integrated appropriately? > <para> > + Function <literal>pr_perm</literal> implements a pseudo-random permutation. > + It allows to mix the output of non uniform random functions so that > + values drawn more often are not trivially correlated. > + It permutes integers in [0, size) using a seed by applying rounds of > + simple invertible functions, similarly to an encryption function, > + although beware that it is not at all cryptographically secure. > + Compared to <literal>hash</literal> functions discussed above, the function > + ensures that a perfect permutation is applied: there are no collisions > + nor holes in the output values. > + Values outside the interval are interpreted modulo the size. > + The function errors if size is not positive. > + If no seed is provided, <literal>:default_seed</literal> is used. > + For a given size and seed, the function is fully deterministic: if two > + permutations on the same size must not be correlated, use distinct seeds > + as outlined in the previous example about hash functions. > + </para> This doesn't really explain why we want this in pgbench. Greetings, Andres Freund
Hello Andres, >> +# PGAC_C_BUILTIN_CLZLL > > I think this has been partially superseded by > > commit 711bab1e4d19b5c9967328315a542d93386b1ac5 > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: 2019-02-13 16:10:06 -0300 Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later. >> <para> >> + Function <literal>pr_perm</literal> implements a pseudo-random permutation. >> + It allows to mix the output of non uniform random functions so that >> + values drawn more often are not trivially correlated. >> + It permutes integers in [0, size) using a seed by applying rounds of >> + simple invertible functions, similarly to an encryption function, >> + although beware that it is not at all cryptographically secure. >> + Compared to <literal>hash</literal> functions discussed above, the function >> + ensures that a perfect permutation is applied: there are no collisions >> + nor holes in the output values. >> + Values outside the interval are interpreted modulo the size. >> + The function errors if size is not positive. >> + If no seed is provided, <literal>:default_seed</literal> is used. >> + For a given size and seed, the function is fully deterministic: if two >> + permutations on the same size must not be correlated, use distinct seeds >> + as outlined in the previous example about hash functions. >> + </para> > > This doesn't really explain why we want this in pgbench. Who is "we"? If someone runs non-uniform tests, i.e. with random_exp/zipf/gauss, close values are drawn with similar frequencies and are thus correlated, inducing an undeserved correlation at the page level (e.g. for reads) and better performance than would be the case if relative frequencies were not correlated to key values. So the function allows having more realistic non-uniform tests, whereas currently we can only have non-uniform tests with values unrealistically correlated at the key level, and possibly at the page level, meaning non-representative performance because of this induced bias. This is under the assumption that pgbench should allow more realistic performance test scenarios, which I believe is a desirable purpose. If someone disagrees with this purpose, then they would consider both the non-uniform random functions and this proposed pseudo-random permutation function useless, as probably most other additions to pgbench. -- Fabien.
> Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later. Here is an update: - take advantage of pg_bitutils (although I noted that the "slow" popcount there could be sped up and shortened with a bitwise-operator implementation that I just removed from pgbench). - add comments about the bijective transformations in the code. As already stated, this function makes sense for people who want to test performance with pgbench using non-uniform rands. If you do not want to do that, you will probably find the function pretty useless. I can't help it. Also, non-uniform rands are a way to test pg lock contention behavior. -- Fabien.
Attachment
Hi Hironobu, On 3/3/19 12:55 PM, Fabien COELHO wrote: > >> Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later. > > Here is an update: > > - take advantage of pg_bitutils (although I noted that the "slow" > popcount there could be sped up and shortened with a bitwise-operator > implementation that I just removed from pgbench). > > - add comments about the bijective transformations in the code. > > As already stated, this function makes sense for people who want to test > performance with pgbench using non-uniform rands. If you do not want to > do that, you will probably find the function pretty useless. I can't > help it. > > Also, non-uniform rands are a way to test pg lock contention behavior. You have signed up as a reviewer for this patch. Do you know when you'll have time to do the review? Regards, -- -David david@pgmasters.net
On 2019/03/21 17:27, David Steele wrote: > Hi Hironobu, > Sorry for the late reply. I reviewed this patch. Function nbits(), which was previously discussed, has been simplified by using the function pg_popcount64(). By adding the mathematical explanation, it has become easier to understand the operation of this function. I believe that these improvements will have a positive impact on maintenance. The patch could be applied successfully and the tests passed without problems. So, I think the latest patch is fine. Best regards, > On 3/3/19 12:55 PM, Fabien COELHO wrote: >> >>> Indeed, the patch needs a rebase & conflict resolution. I'll do it. >>> Later. >> >> Here is an update: >> >> - take advantage of pg_bitutils (although I noted that the "slow" >> popcount there could be sped up and shortened with a bitwise-operator >> implementation that I just removed from pgbench). >> >> - add comments about the bijective transformations in the code. >> >> As already stated, this function makes sense for people who want to >> test performance with pgbench using non-uniform rands. If you do not >> want to do that, you will probably find the function pretty useless. I >> can't help it. >> >> Also, non-uniform rands are a way to test pg lock contention >> behavior. > > You have signed up as a reviewer for this patch. Do you know when > you'll have time to do the review? > > Regards,
Hello Hironobu-san, Here is a v15, which is a rebase plus a large simplification of the modmul function if an int128 type is available, which is probably always… Could you have a look and possibly revalidate? > Sorry for the late reply. I reviewed this patch. > > Function nbits(), which was previously discussed, has been simplified by > using the function pg_popcount64(). > > By adding the mathematical explanation, it has become easier to understand the > operation of this function. > > I believe that these improvements will have a positive impact on maintenance. > > The patch could be applied successfully and the tests passed without > problems. > > So, I think the latest patch is fine. > > > Best regards, > > > >> On 3/3/19 12:55 PM, Fabien COELHO wrote: >>> >>>> Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later. >>> >>> Here is an update: >>> >>> - take advantage of pg_bitutils (although I noted that the "slow" >>> popcount there could be sped up and shortened with a bitwise-operator >>> implementation that I just removed from pgbench). >>> >>> - add comments about the bijective transformations in the code. >>> >>> As already stated, this function makes sense for people who want to test >>> performance with pgbench using non-uniform rands. If you do not want to do >>> that, you will probably find the function pretty useless. I can't help it. >>> >>> Also, non-uniform rands are a way to test pg lock contention behavior. >> >> You have signed up as a reviewer for this patch. Do you know when you'll >> have time to do the review? >> >> Regards, > > -- Fabien Coelho - CRI, MINES ParisTech
Attachment
On Fri, May 24, 2019 at 2:46 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > Here is a v15, which is a rebase plus a large simplification of the modmul > function if an int128 type is available, which is probably always… > > Function nbits(), which was previously discussed, has been simplified by > > using the function pg_popcount64(). Hi Fabien, Suzuki-san, I am not smart enough to commit this or judge its value for benchmarking, but I have a few trivial comments on the language: + It allows to mix the output of non uniform random functions so that "It allows the output of non-uniform random functions to be mixed so that" + ensures that a perfect permutation is applied: there are no collisions + nor holes in the output values. "neither collisions nor holes", or "no collisions or holes" + The function errors if size is not positive. "raises an error" + * 24 bits mega primes from https://primes.utm.edu/lists/small/millions/ "24 bit mega primes" +/* length of n binary representation */ +static int +nbits(uint64 n) +{ + /* set lower bits to 1 and count them */ + return pg_popcount64(compute_mask(n)); +} I suppose you could use n == 0 ? 0 : pg_leftmost_one_pos64(n) + 1, and then... +/* return smallest mask holding n */ +static uint64 +compute_mask(uint64 n) +{ + n |= n >> 1; + n |= n >> 2; + n |= n >> 4; + n |= n >> 8; + n |= n >> 16; + n |= n >> 32; + return n; +} ... here you could use (1 << nbits(n)) - 1. I have no idea if doing it that way around is better, it's just a thought and removes a few lines of bit-swizzling code. -- Thomas Munro https://enterprisedb.com
Hello Thomas, >>> Function nbits(), which was previously discussed, has been simplified by >>> using the function pg_popcount64(). > > Hi Fabien, Suzuki-san, > > I am not smart enough to commit this I'm in no hurry:-) > or judge its value for benchmarking, I think that it is valuable given that we have non-uniform random generators in pgbench. I'm wondering about the manual modular_multiply implementation, which adds quite a few lines of non-trivial code. If int128 is available on all or most platforms, it could be removed and marked as not supported on the rest, although that would create an issue with the non-regression tests. > but I have a few trivial comments on the language: > > + It allows to mix the output of non uniform random functions so that > > "It allows the output of non-uniform random functions to be mixed so that" Fixed. > + ensures that a perfect permutation is applied: there are no collisions > + nor holes in the output values. > > "neither collisions nor holes", or "no collisions or holes" I chose the first. > + The function errors if size is not positive. > > "raises an error" Fixed. > + * 24 bits mega primes from https://primes.utm.edu/lists/small/millions/ > > "24 bit mega primes" Fixed. > +/* length of n binary representation */ > +static int > +nbits(uint64 n) > +{ > + /* set lower bits to 1 and count them */ > + return pg_popcount64(compute_mask(n)); > +} > > I suppose you could use n == 0 ? 0 : pg_leftmost_one_pos64(n) + 1, and then... It would create a branch, which I'm trying to avoid. > +/* return smallest mask holding n */ > +static uint64 > +compute_mask(uint64 n) > +{ > + n |= n >> 1; > + n |= n >> 2; > + n |= n >> 4; > + n |= n >> 8; > + n |= n >> 16; > + n |= n >> 32; > + return n; > +} > > ... here you could use (1 << nbits(n)) - 1. I have no idea if doing it > that way around is better, it's just a thought and removes a few lines > of bit-swizzling code. This would create an infinite recursion, as nbits currently uses compute_mask. The six shift-and-or operations above are pretty efficient, so I'd leave it at that. Attached is a v16. -- Fabien.
Attachment
Hi, This patch was marked as RFC on 2019-03-30, but since then there have been a couple more issues pointed out in a review by Thomas Munro, and it went through 2019-09 and 2019-11 without any attention. Is the RFC status still appropriate? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> This patch was marked as RFC on 2019-03-30, but since then there have > been a couple more issues pointed out in a review by Thomas Munro, and > it went through 2019-09 and 2019-11 without any attention. Is the RFC > status still appropriate? Thomas's review was about comments/documentation wording and requests for explanations, which I think I have addressed, and the code did not actually change, so I'm not sure that the "needs review" status is really needed, but do as you feel. -- Fabien
On 2020-01-05 10:02, Fabien COELHO wrote: > >> This patch was marked as RFC on 2019-03-30, but since then there have >> been a couple more issues pointed out in a review by Thomas Munro, and >> it went through 2019-09 and 2019-11 without any attention. Is the RFC >> status still appropriate? > > Thomas's review was about comments/documentation wording and requests for > explanations, which I think I have addressed, and the code did not actually > change, so I'm not sure that the "needs review" status is really needed, but do > as you feel. I read the whole thread, and I still don't know what this patch is supposed to do. I know what the words in the subject line mean, but I don't know how this helps a pgbench user run better benchmarks. I feel this is also the sentiment expressed by others earlier in the thread. You indicated that this functionality makes sense to those who want this functionality, but so far only two people, namely the patch author and the reviewer, have participated in the discussion on the substance of this patch. So either the feature is extremely niche, or nobody understands it. I think you ought to take about three steps back and explain this in more basic terms, even just in email at first so that we can then discuss what to put into the documentation. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Jan-30, Peter Eisentraut wrote: > I read the whole thread, I still don't know what this patch is supposed to > do. I know what the words in the subject line mean, but I don't know how > this helps a pgbench user run better benchmarks. I feel this is also the > sentiment expressed by others earlier in the thread. You indicated that > this functionality makes sense to those who want this functionality, but so > far only two people, namely the patch author and the reviewer, have > participated in the discussion on the substance of this patch. So either > the feature is extremely niche, or nobody understands it. I think you ought > to take about three steps back and explain this in more basic terms, even > just in email at first so that we can then discuss what to put into the > documentation. After re-reading one more time, it dawned on me that the point of this is similar in spirit to this one: https://wiki.postgresql.org/wiki/Pseudo_encrypt The idea seems to be to map the int4 domain into itself, so you can use a sequence to generate numbers that will not look like a sequence, allowing the user to hide some properties (such as the generation rate) that might be useful to an eavesdropper/attacker. In terms of writing benchmarks, it seems useful to destroy all locality of access, which changes the benchmark completely. (I'm not sure if this is something benchmark writers really want to have.) If I'm right, then I agree that the documentation provided with the patch does a pretty bad job at explaining it, because until now I didn't at all realize this is what it was. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Peter, >>> This patch was marked as RFC on 2019-03-30, but since then there have >>> been a couple more issues pointed out in a review by Thomas Munro, and >>> it went through 2019-09 and 2019-11 without any attention. Is the RFC >>> status still appropriate? >> >> Thomas's review was about comments/documentation wording and requests for >> explanations, which I think I have addressed, and the code did not actually >> change, so I'm not sure that the "needs review" status is really needed, but do >> as you feel. > I read the whole thread, and I still don't know what this patch is supposed to do. I know what the words in the subject line mean, but I don't know how this helps a pgbench user run better benchmarks. I feel this is also the sentiment expressed by others earlier in the thread. You indicated that this functionality makes sense to those who want this functionality, but so far only two people, namely the patch author and the reviewer, have participated in the discussion on the substance of this patch. So either the feature is extremely niche, or nobody understands it. I think you ought to take about three steps back and explain this in more basic terms, even just in email at first so that we can then discuss what to put into the documentation. Here is the motivation for this feature: When you design a benchmark to test performance, you want to avoid unwanted correlations which may unduly impact performance, one way or another. For the default pgbench benchmarks, accounts are chosen uniformly, so there is no problem. However, if you want to test a non-uniform distribution, which is what most people would encounter in practice, things are different. For instance, you would replace random with random_exp in the default benchmarks. If you do that, then all accounts with lower ids come out more often. However, this creates an unwanted correlation with a performance effect: why would the frequent accounts just happen to be those with small ids, which just happen to reside together in the first few pages of the table? In order to avoid these effects, you need to shuffle the chosen account ids, so that the frequent accounts are not specifically those at the beginning of the table. Basically, as soon as you have a non-uniform random selection step, you want to add some shuffling, otherwise you are going to skew your benchmark measures. Nobody should ever use a non-uniform generator for selecting rows without some shuffling. I have no doubt that many people do, without realizing that they are skewing their performance data. Once you realize that, you can try to invent your own shuffling, but frankly it is not as easy as it looks to come up with something non-trivial that does not generate unwanted correlations. I had a lot of (very) bad designs before coming up with the one in the patch: you want something fast and you want good shuffling, which are contradictory objectives. There is also the question of being able to change the shuffling for a given size, for instance to check that there are no unwanted effects, hence the seeding. Good seeded shuffling is what an encryption algorithm does, but for those it is somewhat simpler because they usually work on power-of-two sizes. In conclusion, using a seeded shuffle step is needed as soon as you start using non-uniform random generators. Providing one in pgbench, a tool designed to run possibly non-uniform benchmarks against postgres, looks like common courtesy.
Not providing one is helping people design bad benchmarks, possibly without realizing it, which is outright thoughtlessness. I have no doubt that the documentation should be improved to explain this in a concise but clear way. -- Fabien.
Hello Alvaro, >> I read the whole thread, and I still don't know what this patch is supposed to >> do. I know what the words in the subject line mean, but I don't know how >> this helps a pgbench user run better benchmarks. I feel this is also the >> sentiment expressed by others earlier in the thread. You indicated that >> this functionality makes sense to those who want this functionality, but so >> far only two people, namely the patch author and the reviewer, have >> participated in the discussion on the substance of this patch. So either >> the feature is extremely niche, or nobody understands it. I think you ought >> to take about three steps back and explain this in more basic terms, even >> just in email at first so that we can then discuss what to put into the >> documentation. > After re-reading one more time, it dawned on me that the point of this > is similar in spirit to this one: > https://wiki.postgresql.org/wiki/Pseudo_encrypt Indeed. The one in the wiki is useless here because it works on the whole integer domain, whereas in a benchmark you want it for a given size, and you want seeding; otherwise it addresses the same correlation-avoidance problem. > The idea seems to be to map the int4 domain into itself, so you can use > a sequence to generate numbers that will not look like a sequence, > allowing the user to hide some properties (such as the generation rate) > that might be useful to an eavesdropper/attacker. In terms of writing > benchmarks, it seems useful to destroy all locality of access, which > changes the benchmark completely. Yes. > (I'm not sure if this is something benchmark writers really want to > have.) I do not get this sentence. I'm sure that a benchmark writer should really want to avoid unrealistic correlations that have a performance impact. > If I'm right, then I agree that the documentation provided with the > patch does a pretty bad job at explaining it, because until now I didn't > at all realize this is what it was. The documentation is improvable, no doubt. Attached is an attempt at improving things. I have added an explicit note and hijacked an existing example to better illustrate the purpose of the function. -- Fabien.
Attachment
Hi Fabien, On 2/1/20 5:12 AM, Fabien COELHO wrote: > > Attached is an attempt at improving things. I have added an explicit note > and hijacked an existing example to better illustrate the purpose of the > function. This patch does not build on Linux due to some unused functions and variables: http://cfbot.cputube.org/patch_27_1712.log The CF entry has been updated to Waiting on Author. A few committers have expressed doubts about whether this patch is needed and it doesn't make sense to keep moving it from CF to CF. I'm planning to mark this patch RwF on April 8 and I suggest you resubmit if you are able to get some consensus. Regards, -- -David david@pgmasters.net
Hello David, >> Attached is an attempt at improving things. I have added an explicit note >> and hijacked an existing example to better illustrate the purpose of the >> function. > This patch does not build on Linux due to some unused functions and > variables: http://cfbot.cputube.org/patch_27_1712.log This link is about some other patch, but indeed there is something amiss in v18. Attached is a v19 which fixes that. > The CF entry has been updated to Waiting on Author. I put it back to "needs review". > A few committers have expressed doubts about whether this patch is needed Yep. The key point is that if you (think that you) do not need it, it is by definition useless. If you finally figure out that you need it (IMHO you must, for any benchmark with non-uniform randoms, otherwise the performance results are biased and thus invalid) and it is not available, then you are just stuck. > and it doesn't make sense to keep moving it from CF to CF. You do as you feel. IMO such a feature is useful and consistent with providing non-uniform random functions. > I'm planning to mark this patch RwF on April 8 and I suggest you resubmit if > you are able to get some consensus. People interested in non-uniform benchmarks would see the point. Why many people would be happy with uniform-only benchmarks, when life is not uniform at all, is beyond me. -- Fabien.
On 2020-Apr-02, Fabien COELHO wrote: > > I'm planning to mark this patch RwF on April 8 and I suggest you > > resubmit if you are able to get some consensus. > > People interested in non-uniform benchmarks would see the point. Why many > people would be happy with uniform-only benchmarks, when life is not uniform > at all, is beyond me. I don't think we should boot this patch. I don't think I would be able to get this over the commit line in this CF, but let's not discard it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Attached is an attempt at improving things. I have added an explicit note and > hijacked an existing example to better illustrate the purpose of the > function. A significant part of the complexity of the patch is the overflow-handling implementation of (a * b % c) for 64-bit integers. However, this implementation is probably never used, because either an int128 type is available and the one-line implementation takes precedence, or the size is small enough (less than 48 bits) that there are no overflows with the primes involved, so the operation is done directly on 64-bit integers. I could remove the implementation and replace it with a "not available on this architecture" message, which would seldom be triggered: you would have to use a 32-bit arch (?) and test against a table with about 262 Trows, which I guess does not exist anywhere. This approach would remove about 40% of the code & comments. Thoughts? -- Fabien.
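For reference, the one-line int128 variant mentioned above looks like the sketch below where the compiler provides unsigned __int128 (GCC and Clang do); the full 128-bit product cannot overflow, so a single reduction suffices:

#include <stdint.h>

#ifdef HAVE_INT128
static uint64_t
mul_mod_int128(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t) ((unsigned __int128) a * b % m);
}
#endif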
On 4/2/20 3:01 AM, Alvaro Herrera wrote: > On 2020-Apr-02, Fabien COELHO wrote: > >>> I'm planning to mark this patch RwF on April 8 and I suggest you >>> resubmit if you are able to get some consensus. >> >> People interested in non-uniform benchmarks would see the point. Why many >> people would be happy with uniform-only benchmarks, when life is not uniform >> at all, is beyond me. > > I don't think we should boot this patch. I don't think I would be able > to get this over the commit line in this CF, but let's not discard it. Understood. I have moved the patch to the 2020-07 CF in Needs Review state. Regards, -- -David david@pgmasters.net
>> I don't think we should boot this patch. I don't think I would be able >> to get this over the commit line in this CF, but let's not discard it. > > Understood. I have moved the patch to the 2020-07 CF in Needs Review state. Attached v19 is a rebase, per cfbot. -- Fabien.
Attachment
> Attached v19 is a rebase, per cfbot. Attached v20 fixes a doc xml typo, per cfbot again. -- Fabien.
Attachment
The following review has been posted through the commitfest application:
  make installcheck-world: tested, passed
  Implements feature:      tested, passed
  Spec compliant:          tested, passed
  Documentation:           not tested

There are a few "Stripping trailing CRs from the patch" warnings and one "Hunk succeeded, offset -2 lines" warning. Other than that, the patch applies successfully and works as it claims. The new status of this patch is: Ready for Committer
On Sun, 14 Mar 2021 at 16:08, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > My main question on this now is, do you have a scholar reference for > > this algorithm? > > Nope, otherwise I would have put a reference. I'm a scholar though, if > it helps:-) > > I could not find any algorithm that fitted the bill. The usual approach > (eg benchmarking designs) is to use some hash function and assume that it > is not a permutation, too bad. > > Basically the algorithm mimics a few rounds of cryptographic encryption > adapted to any size and simple operators, whereas encryption functions are > restricted to power-of-two blocks (eg the Feistel network). The structure > is the same as AES, with its AddRoundKey xor-ing stage (adapted to non > powers of two in whiten above), MixColumns which does the scattering, and > for key expansion, I used Donald Knuth's generator. Basically one could say > that permute is an inexpensive and insecure AES:-) > I spent a little time playing around with this, trying to come up with a reasonable way to test it. It's apparent from the code and associated comments that the transformation is bijective and so will produce a permutation, but it's less obvious how random that permutation will be. Obviously the Fisher-Yates algorithm would give the most random permutation, but that's not really practical in this context. Any other algorithm will, in some sense, be less random than that, so I think it's really just a question of testing that it's random enough to satisfy the intended use case. The approach to testing I tried was to use the Kolmogorov-Smirnov test to test how uniformly random a couple of quantities are: 1). For a given size and fixed input value, and a large number of randomly chosen seed values, is the output uniformly random? 2). For a given size and a pair of nearby input values, are the two output values uniformly randomly spaced apart? This second test seems desirable to ensure sufficient shuffling so that inputs coming from some non-uniform random distribution are scattered sufficiently to distribute the maxima/minima (x -> x + rand mod size would pass test 1, so that by itself is insufficient). I tested this using the attached (somewhat ugly) stand-alone test C program, and for the most part these 2 tests seemed to pass. The program also tests that the output really is a permutation, just to be sure, and this was confirmed in all cases. However, I noticed that the results are less good when size is a power of 2. In this case, the results seem significantly less uniform, suggesting that for such sizes, there is an increased chance that the permuted output might still be skewed. Based on the above, I also experimented with a different permutation algorithm (permute2() in the test), which attempts to inject more randomness into the bijection, and between pairs of inputs, to make the output less predictable and less likely to be accidentally non-uniform. It's possible that it suffers from other problems, but it did do better in these 2 tests. Maybe some other tests might be useful, but I really don't know. As noted above, any algorithm is likely to lead to some pattern in the output (e.g., it looks like both these algorithms cause adjacent inputs to always end up separated by an odd number), so a judgement call will be required to decide what is random enough. Regards, Dean
Attachment
On Mon, 22 Mar 2021 at 13:43, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Sun, 14 Mar 2021 at 16:08, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> >
> > > My main question on this now is, do you have a scholar reference for
> > > this algorithm?
> >
> > Nope, otherwise I would have put a reference. I'm a scholar though, if
> > it helps:-)
> >
> > I could not find any algorithm that fitted the bill. The usual approach
> > (eg benchmarking designs) is to use some hash function and assume that
> > it is not a permutation, too bad.
> >
> > Basically the algorithm mimics a few rounds of cryptographic encryption
> > adapted to any size and simple operators, whereas encryption functions
> > are restricted to power-of-two blocks (eg the Feistel network). The
> > structure is the same as AES: AddRoundKey is the xor-ing stage (adapted
> > to non powers of two in whiten above), MixColumns does the scattering,
> > and for key expansion I used Donald Knuth's generator. Basically one
> > could say that permute is an inexpensive and insecure AES:-)
> >
>
> I spent a little time playing around with this, trying to come up with
> a reasonable way to test it.

I spent more time testing this -- the previous testing was mostly for
large values of size, so I decided to also test it for small sizes. The
theoretical number of possible permutations grows rapidly with size
(size!), and the actual number of permutations observed grew almost as
quickly:

    size      size!   distinct perms
       2          2                2
       3          6                6
       4         24               24
       5        120              120
       6        720              720
       7       5040             5040
       8      40320            24382
       9     362880           181440
      10    3628800          3627355

My test script stopped once the first permutation had been seen 100
times, so it might have seen more permutations had it kept going for
longer. However, looking at the actual output, there is a very
significant non-uniformity in the probability of particular permutations
being chosen, especially at certain sizes like 8 and 10.

For example, at size = 8, the identity permutation is significantly more
likely than almost any other permutation (roughly 13 times more likely
than it would be by random chance). Additionally, the signs are that this
non-uniformity tends to increase with size. At size = 10, the average
number of occurrences of each permutation in the test was 148, but there
were some that didn't appear at all, many that only appeared 2 or 3
times, and then some that appeared nearly 5000 times.

Also, there appears to be an unfortunate dependence on the seed -- if the
seed is even and the size is a power of 2, it looks like the number of
distinct permutations produced is just size/2, which is a tiny fraction
of size!.

This, together with the results from the previous K-S uniformity tests at
larger sizes, suggests that the current algorithm may not be random
enough to scatter values and remove correlations from a non-uniform
distribution.

In an attempt to do better, I tweaked the algorithm in the attached
patch, which makes use of erand48() to inject more randomness into the
permutation choice. Repeating the tests with the updated algorithm shows
that it has a number of advantages:

1). For small sizes (2-10), each of the size! possible permutations is
now produced with roughly equal probability. No single permutation was
much more likely than expected.

2). The loss of randomness for even seeds is gone.

3). For any given input, the output is more uniformly randomly
distributed for different seeds.

4). For any pair of nearby inputs, the corresponding outputs are more
uniformly randomly separated from one another.

5). The updated algorithm no longer uses modular_multiply(), so it works
the same on all platforms.

I still feel slightly uneasy about inventing our own algorithm here, but
I wasn't able to find anything else suitable, and I've now tested this
quite extensively. All the indications are that this updated algorithm
works well at all sizes and produces sufficiently random results, though
if anyone else can think of other approaches to testing the core
algorithm, that would be useful. For reference, I attach 2 standalone
test C programs I used for testing, which include copies of the old and
new algorithms.

I also did a quick copy editing pass over the docs, and I think the patch
is in pretty good shape. Barring objections, I'll see about committing it
later this week.

Regards,
Dean
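[A minimal standalone sketch of the small-size counting experiment
described above -- not the attached script. permute_under_test() is the
same trivial placeholder as before; each permutation is encoded as a
base-size integer so distinct ones can be counted by sorting.]

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Placeholder permutation, as in the earlier sketch; swap in the real one. */
static int64_t
permute_under_test(int64_t v, int64_t size, int64_t seed)
{
	return (v * 48271 + seed) % size;
}

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t	x = *(const uint64_t *) a;
	uint64_t	y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	const int64_t size = 8;		/* small: size^size must fit in 64 bits */
	const int	trials = 1000000;
	uint64_t   *codes = malloc(trials * sizeof(uint64_t));
	int			distinct = 0;

	/* Encode the whole permutation for each seed as a base-"size" number. */
	for (int t = 0; t < trials; t++)
	{
		uint64_t	code = 0;

		for (int64_t v = 0; v < size; v++)
			code = code * size + (uint64_t) permute_under_test(v, size, t);
		codes[t] = code;
	}

	/* Sort the codes and count how many distinct permutations appeared. */
	qsort(codes, trials, sizeof(uint64_t), cmp_u64);
	for (int t = 0; t < trials; t++)
		if (t == 0 || codes[t] != codes[t - 1])
			distinct++;

	printf("size=%lld: %d distinct permutations in %d trials\n",
		   (long long) size, distinct, trials);
	free(codes);
	return 0;
}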
Attachment
Hello Dean,

Thanks a lot for this work. This version looks much better than the
previous one you sent, and has significant advantages over the one I
sent, in particular avoiding the prime number stuff and the large modular
multiply. So this looks good!

I'm happy that halves-xoring is back, because it was an essential part of
the design. ISTM that the previous version you sent only had
linear/affine transforms, which was a bad idea. The linear transform is
moved inside the halves, why not, and the any-odd-number-multiplication
trick (any odd number is coprime with a power of two) is neat on these.

However, I still have some reservations.

First, I have a thing against erand48. The erand 48-bit design looks like
something that made sense when computer architectures were 16/32 bits and
large integers were 32 bits, so a 16-bit margin looked good enough. Using
an int16[3] type now is really depressing and probably costly on modern
architectures. With 64-bit architectures, and given that we are
manipulating 64-bit integers (well, 63 really), ISTM that the minimal
state size for a PRNG should be at least 64 bits. Is there a good reason
NOT to use a 64-bit-state PRN generator? I took Knuth's because it is
cheap and 64 bits. I'd accept anything which is at least 64 bits. I
looked at xor-shift things, but there was some controversy around these,
so I kept it simple and just shifted small values to get rid of the
obvious cycles on Knuth's generator.

Second, I have a significant reservation about the very structure of the
transformation in this version:

    loop 4 times :

      // FIRST HALF STEER
      m/r = pseudo randoms
      if v in first "half"
        v = ((v * m) ^ r) & mask;
        rotate1(v)

      // FULL SHIFT 1
      r = pseudo random
      v = (v + r) % size

      // SECOND HALF STEER
      m/r = pseudo randoms
      if v in second "half"
        same as previous on second half

      // FULL SHIFT 2
      r = pseudo random
      v = (v + r) % size

I'm really at odds with FULL SHIFT 1, because it means that up to 1/256
of values are kept out of STEERING. Whole chunks of values could be kept
unshuffled because they would only have SHIFTS applied to them and each
time fall in the not-steered half. It should be an essential part of the
design that at least one steer is applied to a value at each round; if
two are applied then fine, but certainly not zero. So basically I think
that the design would be significantly improved by removing "FULL SHIFT 1".

Third, I think that the rotate code can be simplified; in particular the
?: should be avoided because it may induce branches quite damaging to
processor performance.

I'll give it some more thought.

--
Fabien.
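[To make the quoted structure concrete, here is a rough C rendering of
one round -- a sketch only, mirroring the pseudocode above rather than
any posted patch. pr_next() and rotate1() are hypothetical placeholders;
mask + 1 is the largest power of two <= size, so [0, mask] is the first
"half". Note that v * m may wrap modulo 2^64, which is harmless because
the result is reduced with & mask (a power-of-two modulus dividing 2^64).]

typedef unsigned long long uint64;

extern uint64 pr_next(void);				/* placeholder keyed PRNG step */
extern uint64 rotate1(uint64 v, int masklen);	/* placeholder 1-bit rotate */

static uint64
one_round(uint64 v, uint64 size, uint64 mask, int masklen)
{
	uint64		m, r;

	/* FIRST HALF STEER: bijective on [0, mask] (an odd m is invertible) */
	m = (pr_next() & mask) | 1;
	r = pr_next() & mask;
	if (v <= mask)
		v = rotate1(((v * m) ^ r) & mask, masklen);

	/* FULL SHIFT 1: bijective offset over the whole domain */
	r = pr_next() % size;
	v = (v + r) % size;

	/* SECOND HALF STEER: the same transform, relocated to the top mask+1
	 * values; one way (among others) to realize "same as previous" */
	m = (pr_next() & mask) | 1;
	r = pr_next() & mask;
	if (v >= size - mask - 1)
	{
		uint64		t = v - (size - mask - 1);	/* map into [0, mask] */

		v = rotate1(((t * m) ^ r) & mask, masklen) + (size - mask - 1);
	}

	/* FULL SHIFT 2 */
	r = pr_next() % size;
	return (v + r) % size;
}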
On Tue, 30 Mar 2021 at 19:26, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> First, I have a thing against erand48.

Yeah, that's probably a fair point. However, all the existing pgbench
random functions are using it, so I think it's fair enough for permute()
to do the same (and actually 2^48 is pretty huge). Switching to a 64-bit
PRNG might not be a bad idea, but I think that's something we'd want to
do across the board, and so I think it should be out of scope for this
patch.

> Second, I have a significant reservation about the very structure of the
> transformation in this version:
>
>     loop 4 times :
>
>       // FIRST HALF STEER
>       m/r = pseudo randoms
>       if v in first "half"
>         v = ((v * m) ^ r) & mask;
>         rotate1(v)
>
>       // FULL SHIFT 1
>       r = pseudo random
>       v = (v + r) % size
>
>       // SECOND HALF STEER
>       m/r = pseudo randoms
>       if v in second "half"
>         same as previous on second half
>
>       // FULL SHIFT 2
>       r = pseudo random
>       v = (v + r) % size
>
> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256
> of values are kept out of STEERING. Whole chunks of values could be kept
> unshuffled because they would only have SHIFTS applied to them and each
> time fall in the not-steered half. It should be an essential part of the
> design that at least one steer is applied to a value at each round; if
> two are applied then fine, but certainly not zero. So basically I think
> that the design would be significantly improved by removing "FULL SHIFT 1".

Ah, that's a good point. Something else that also concerned me there was
that it might lead to 2 consecutive full shifts with nothing in between,
which would lead to less uniform randomness (like the Irwin-Hall
distribution). I just did a quick test without the first full shift, and
the results do appear to be better, so removing that looks like a good
idea.

> Third, I think that the rotate code can be simplified; in particular
> the ?: should be avoided because it may induce branches quite damaging
> to processor performance.

Yeah, I wondered about that. Perhaps there's a "trick" that can be used
to simplify it. Pre-computing the number of bits in the mask would
probably help. I'll give it some thought.

Regards,
Dean
On Tue, 30 Mar 2021 at 20:31, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Yeah, that's probably a fair point. However, all the existing pgbench
> random functions are using it, so I think it's fair enough for
> permute() to do the same (and actually 2^48 is pretty huge). Switching
> to a 64-bit PRNG might not be a bad idea, but I think that's something
> we'd want to do across the board, and so I think it should be out of
> scope for this patch.
>

Of course the immediate counter-argument to changing the existing random
functions would be that doing so would break lots of people's tests, and
no one would thank us for that. Still, I think that, since the existing
random functions use a 48-bit PRNG, it's not unreasonable for permute()
to do the same.

Regards,
Dean
Hello Dean,

>> First, I have a thing against erand48.
>
> Yeah, that's probably a fair point. However, all the existing pgbench
> random functions are using it, so I think it's fair enough for permute()
> to do the same (and actually 2^48 is pretty huge). Switching to a 64-bit
> PRNG might not be a bad idea, but I think that's something we'd want to
> do across the board, and so I think it should be out of scope for this
> patch.

But less likely to pass, whereas here we have an internal function that
we can set as we want. Also, there is a 64-bit seed provided to the
function which instantly ignores 16 bits of it, which looks pretty silly
to me. Also, the function is named everywhere erand48, with its hardcoded
int16[3] state, which makes a poor abstraction.

At least, I suggest that two 48-bit PRNGs could be initialized with parts
of the seed and used in different places, eg for r & m. Also, the seed
could be used to adjust the rotation, maybe.

>> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256 of
>> values are kept out of STEERING. [...]
>
> Ah, that's a good point. Something else that also concerned me there was
> that it might lead to 2 consecutive full shifts with nothing in between,
> which would lead to less uniform randomness (like the Irwin-Hall
> distribution). I just did a quick test without the first full shift, and
> the results do appear to be better,

Indeed, it makes sense to me.

> so removing that looks like a good idea.

>> Third, I think that the rotate code can be simplified; in particular
>> the ?: should be avoided because it may induce branches quite damaging
>> to processor performance.
>
> Yeah, I wondered about that. Perhaps there's a "trick" that can be
> used to simplify it. Pre-computing the number of bits in the mask
> would probably help.

See pg_popcount64().

--
Fabien.
On Wed, 31 Mar 2021 at 09:02, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> >> First, I have a thing against erand48.
>
> Also, there is a 64-bit seed provided to the function which instantly
> ignores 16 bits of it, which looks pretty silly to me.
>

Yeah, that was copied from set_random_seed().

> At least, I suggest that two 48-bit PRNGs could be initialized with parts
> of the seed and used in different places, eg for r & m.
>

That could work. I'd certainly feel better about that than implementing a
whole new PRNG.

> Also, the seed could be used to adjust the rotation, maybe.
>

Perhaps. I'm not sure it's really necessary though.

> >> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256 of
> >> values are kept out of STEERING. [...]
> >
> > Ah, that's a good point. Something else that also concerned me there was
> > that it might lead to 2 consecutive full shifts with nothing in between,
> > which would lead to less uniform randomness (like the Irwin-Hall
> > distribution). I just did a quick test without the first full shift, and
> > the results do appear to be better,
>
> Indeed, it makes sense to me.
>

OK, attached is an update making this change and simplifying the rotate
code, which hopefully just leaves the question of what (if anything) to
do with pg_erand48().

> >> Third, I think that the rotate code can be simplified; in particular
> >> the ?: should be avoided because it may induce branches quite damaging
> >> to processor performance.
> >
> > Yeah, I wondered about that. Perhaps there's a "trick" that can be
> > used to simplify it. Pre-computing the number of bits in the mask
> > would probably help.
>
> See pg_popcount64().
>

Actually, I used pg_leftmost_one_pos64() to calculate the mask length,
allowing the mask to be computed from that, so there is no longer a need
for compute_mask(), which seems like a neat little simplification.

Regards,
Dean
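[For concreteness, the mask derivation being described might look like the
following sketch; the committed code may differ in details.
pg_leftmost_one_pos64() is PostgreSQL's bit utility returning the 0-based
position of the most significant set bit of a nonzero uint64.]

#include "port/pg_bitutils.h"	/* assumes a PostgreSQL build environment */

/*
 * Sketch: derive the mask 2^k - 1 where 2^k is the largest power of two
 * not exceeding size (size must be >= 1).  No compute_mask() loop needed.
 */
static uint64
mask_for_size(uint64 size)
{
	int			masklen = pg_leftmost_one_pos64(size);	/* 0-based MSB position */

	return (((uint64) 1) << masklen) - 1;
}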
Attachment
Hello Dean,

> OK, attached is an update making this change and simplifying the rotate
> code, which hopefully just leaves the question of what (if anything) to
> do with pg_erand48().

Yep. While looking at it, I have some doubts on this part:

    m = (uint64) (pg_erand48(random_state.xseed) * (mask + 1)) | 1;
    r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
    r = (uint64) (pg_erand48(random_state.xseed) * size);

I do not understand why the random values are multiplied by anything in
the first place…

This one looks like a no-op:

    r = (uint64) (pg_erand48(random_state.xseed) * size);
    v = (v + r) % size;

    v = (v + r) % size
      = (v + rand * size) % size
     =? (v % size + rand * size % size) % size
     =? (v % size + 0) % size
      = v % size
      = v

I'm also skeptical about this one:

    r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
    if (v <= mask)
        v = ((v * m) ^ r) & mask;

    v = ((v * m) ^ r) & mask
      = ((v * m) ^ r) % (mask + 1)
      = ((v * m) ^ (rand * (mask + 1))) % (mask + 1)
     =? ((v * m) % (mask + 1)) ^ (rand * (mask + 1) % (mask + 1))
     =? ((v * m) % (mask + 1)) ^ 0
      = (v * m) & mask

Or possibly I'm missing something obvious and I'm wrong with my
arithmetic?

--
Fabien.
On Wed, 31 Mar 2021 at 18:53, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> While looking at it, I have some doubts on this part:
>
>     m = (uint64) (pg_erand48(random_state.xseed) * (mask + 1)) | 1;
>     r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
>     r = (uint64) (pg_erand48(random_state.xseed) * size);
>
> I do not understand why the random values are multiplied by anything in
> the first place…
>

These are just random integers in the range [0,mask] and [0,size-1],
formed in exactly the same way as getrand().

> This one looks like a no-op:
>
>     r = (uint64) (pg_erand48(random_state.xseed) * size);
>     v = (v + r) % size;
>

rand * size % size is not zero because rand is a floating point number in
the range [0,1), so actually rand * size % size = rand * size. Similarly
in the other case, you're forgetting that rand is not an integer.

Thinking more about our use of erand48(), the only real impact it has is
to limit the number of possible permutations produced, and actually 2^48
is so huge (roughly 281 million million) that I can't ever see that being
an issue in practice.

(In a quick dummy test, replacing erand48() with a silly "erand8()"
function that only returned one of 256 distinct values, permute() still
worked fine at any size, except for the fact that only up to 256 distinct
permutations were produced. In other words, limitations on the source of
randomness don't prevent it from producing permutations of any size, they
just limit the number of distinct permutations possible. And since 2^48
is so big, that shouldn't be an issue.)

Also, I think the source of the input seed is most likely to be either
manually hand-picked integers or pgbench's own random() function, so the
only real issue I can see is that by ignoring the upper 16 bits, there's
a very small chance of always using the same random sequence if some
hand-picked numbers only vary in the top 16 bits, though I think that's
highly unlikely in practice.

Nonetheless, it's not much more effort to make another random state and
use those remaining bits of the seed and get more internal random states,
so here's an update doing that. I intentionally chose to reuse the lower
16 bits of the seed in the second random function (in a different slot of
the random state), since those are probably the ones most likely to vary
in practice. This doesn't actually make any measurable difference to any
of the tests, but it closes that potential loophole of ignoring part of
the seed.

In all my tests, the biggest improvement was between v23 and v24 of the
patch. By comparison, the later versions have been relatively small
improvements, and it's probably now "random enough" for the intended
purposes.

Regards,
Dean
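[To make the "rand is not an integer" point concrete, a tiny standalone
demo of the pattern in question, using POSIX erand48() (pg_erand48()
behaves the same way): the double lies in [0,1), so scaling by size and
truncating always yields an integer in [0, size), never a multiple of
size.]

#include <stdio.h>
#include <stdlib.h>		/* erand48() is POSIX */

int
main(void)
{
	unsigned short xseed[3] = {0x1234, 0x5678, 0x9abc};
	unsigned long long size = 100;

	for (int i = 0; i < 5; i++)
	{
		double		d = erand48(xseed);	/* uniform in [0,1) */
		unsigned long long r = (unsigned long long) (d * size);

		/* r is uniform in [0, size), so (v + r) % size is no no-op */
		printf("d = %.6f  ->  r = %llu\n", d, r);
	}
	return 0;
}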
Attachment
>> r = (uint64) (pg_erand48(random_state.xseed) * size);
>>
>> I do not understand why the random values are multiplied by anything in
>> the first place…
>
> These are just random integers in the range [0,mask] and [0,size-1],
> formed in exactly the same way as getrand().

Indeed, erand returns a double; this was the part I was missing. I did
not realize that you had switched to doubles in your approach.

I think that permute should only use integer operations. I'd suggest
using one of the integer variants instead of going through a double
computation and casting back to int. The internal state is based on
integers; I do not see the added value of going through floats, possibly
enduring floating point issues (underflow, rounding, normalization,
whatever) on the way, whereas from start to finish we just need ints.

See attached v27 proposal.

I still think that *rand48 is a poor (relatively small state) and
inefficient (the implementation includes packing and unpacking 16-bit
ints to build a 64-bit int) choice.

--
Fabien.
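[As an aside, one standard all-integer way to draw a uniform value in
[0, size) -- shown purely to illustrate the suggestion, not taken from
any posted patch -- is rejection sampling on masked bits of a
hypothetical 64-bit generator next64():]

#include <stdint.h>

extern uint64_t next64(void);	/* hypothetical 64-bit PRNG step */

/*
 * Sketch: uniform integer in [0, size) using integer operations only.
 * Build the smallest all-ones mask covering size-1, then draw masked
 * bits and reject anything >= size.  The acceptance probability is
 * above 1/2, so fewer than 2 draws are needed on average, and there is
 * no modulo or truncation bias.  Assumes size >= 2.
 */
static uint64_t
rand_range_int(uint64_t size)
{
	uint64_t	mask = size - 1;
	uint64_t	r;

	/* Smear the top set bit rightwards to get 2^k - 1 >= size - 1. */
	mask |= mask >> 1;
	mask |= mask >> 2;
	mask |= mask >> 4;
	mask |= mask >> 8;
	mask |= mask >> 16;
	mask |= mask >> 32;

	do
		r = next64() & mask;
	while (r >= size);

	return r;
}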
> See attached v27 proposal.

As usual, it is easier to see with the actual attachment:-)

--
Fabien.
Attachment
On Fri, 2 Apr 2021 at 06:38, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> I think that permute should only use integer operations. I'd suggest
> using one of the integer variants instead of going through a double
> computation and casting back to int. The internal state is based on
> integers; I do not see the added value of going through floats, possibly
> enduring floating point issues (underflow, rounding, normalization,
> whatever) on the way, whereas from start to finish we just need ints.
>

This is the already-established coding pattern used in getrand() to pick
a random number uniformly in some range that's not necessarily a power
of 2.

Floating point underflow and normalisation issues are not possible
because erand48() takes a 48-bit integer N and uses ldexp() to store
N/2^48 in a double, which is an exact operation since IEEE doubles have
something like 56-bit mantissas. This is then turned back into an integer
in the required range by multiplying by the desired maximum value, so
there's never any risk of underflow or normalisation issues. If you
didn't do it that way, you'd need to rely on some larger integer
datatype, such as 128-bit integers.

I guess that there may be rounding variations once the required maximum
value exceeds something like 2^56 (although the comment in getrand() is
much more conservative than that), so it's possible that a pgbench script
calling random() with (ub-lb) larger than that might give different
results on different platforms. For the non-uniform random functions,
that effect might well kick in sooner. I'm not aware of any field
complaints about that though, possibly because real-world data sizes are
significantly smaller than that.

In practice, permute() is likely to take its input from one of the
non-uniform random functions, so it won't be permute() that first
introduces rounding issues.

> See attached v27 proposal.
>

This update has a number of flaws. For example, this:

+static uint64
+randu64(RandomState * state)
+{
+    uint64 r1 = pg_jrand48((*state).xseed),
+           r2 = pg_jrand48((*state).xseed);
+    return r1 << 51 | r2 << 13 | r1 >> 13;
+}

It still uses a 48-bit RandomState, so it doesn't improve on getrand() in
that respect. It replaces a single erand48() call with 2 jrand48() calls,
which comes at a cost in terms of performance. (In fact, changing the
number of rounds in the previous version of permute() from 4 to 6 has a
smaller performance impact than this -- more about that below.)

jrand48() returns a signed 32-bit integer, which has a 50% chance of
being negative, so when that is cast to a uint64, there is a 50% chance
that the 32 most significant bits will be 1. When the various parts are
OR'ed together, that will then mask out any randomness from the other
parts. For example, 50% of the time, the jrand48() value used for r1 will
be negative, and so 32 bits in the middle of the final result will all be
set. There is essentially no randomness at all in bits 45..50, and the r1
and r2 values overlap in bits 13..18, giving them a 75% chance of being
set. So overall, the results will be highly non-uniform, with less
randomness and poorer performance than erand48().

In addition, it returns a result in the range [0,2^64), which is not
really what's wanted. For example:

+    /* Random offset */
+    r = randu64(&random_state2);
+    v = (v + r) % size;

The previous code used r = getrand(0, size-1), which gave a uniformly
random offset. However, the new code risks introducing additional
non-uniformity when size is not a power of 2.

Finally, worst of all, this random offset is no longer bijective, due to
64-bit integer wrap-around. For example, suppose that size=100 and
r=(2^64-10); then the following 2 values both map to the same result:

    v = 20 -> (v + r) % size
            = (20 + (2^64 - 10)) % 100
            = (2^64 + 10) % 100
            = (10) % 100
            = 10

    v = 4  -> (v + r) % size
            = (4 + (2^64 - 10)) % 100
            = (2^64 - 6) % 100
            = (18446744073709551610) % 100
            = 10

So not only are the results no longer uniformly random, they might not
even form a permutation.

I did some more testing of the previous version (v26), this time looking
at much larger sizes, all the way up to the maximum, which is 2^63-1
since it comes from a signed int64. In general, the results were very
good, though I did notice some slight non-uniformity in the way adjacent
inputs were separated from one another when the size was just under a
power of 2. I think that's the hardest case for this algorithm, because
there's very little overlap between the 2 halves. Increasing the number
of rounds from 4 to 6 ironed out that non-uniformity (and, as mentioned
above, is still cheaper than using randu64() with 4 rounds), so I think
we should go with that.

> I still think that *rand48 is a poor (relatively small state) and
> inefficient (the implementation includes packing and unpacking 16-bit
> ints to build a 64-bit int) choice.
>

I can somewhat understand your dislike of *rand48(), but in the context
of pgbench I think that it is perfectly adequate. Since we now use 2
RandomStates, I don't think the state size is an issue anymore, if it
ever really was. Using erand48() gives better performance than jrand48()
because it returns 48 bits rather than 32, so fewer calls are needed,
which allows more rounds for the same cost. Additionally, following the
same pattern as existing code reduces the risk of bugs, and builds on
proven, tested code.

You may wish to submit a separate patch to replace pgbench's use of
*rand48() with something else, and that would be discussed on its own
merits, but I don't see why that should hold up adding permute().

Regards,
Dean
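[The wrap-around collision is easy to confirm standalone; the numbers
below are exactly those from the example above:]

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t	size = 100;
	uint64_t	r = UINT64_MAX - 9;	/* 2^64 - 10 */

	/* Both inputs collide on 10 because v + r wraps modulo 2^64. */
	printf("v=20 -> %llu\n", (unsigned long long) ((20 + r) % size));
	printf("v=4  -> %llu\n", (unsigned long long) ((4 + r) % size));
	return 0;
}

[Both lines print 10, confirming the collision.]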
Hello Dean,

>> I think that permute should only use integer operations. [...]
>
> This is the already-established coding pattern used in getrand() to
> pick a random number uniformly in some range that's not necessarily a
> power of 2.

Indeed. I'm not particularly happy with that one either.

> Floating point underflow and normalisation issues are not possible
> because erand48() takes a 48-bit integer N and uses ldexp() to store
> N/2^48 in a double, which is an exact operation since IEEE doubles
> have something like 56-bit mantissas.

Double mantissa size is 52 bits.

> This is then turned back into an integer in the required range by
> multiplying by the desired maximum value, so there's never any risk of
> underflow or normalisation issues.

ISTM that there are significant issues when multiplying with an integer,
because the integer is cast to a double before multiplying, so if the int
is over 52 bits then it is silently truncated, and some values are just
lost in the process and will never be drawn. Probably not too many of
them, but some of them anyway.

> I guess that there may be rounding variations once the required
> maximum value exceeds something like 2^56 (although the comment in
> getrand() is much more conservative than that), so it's possible that
> a pgbench script calling random() with (ub-lb) larger than that might
> give different results on different platforms.

Dunno. This may be the same issue I'm pointing out above.

> In practice, permute() is likely to take its input from one of the
> non-uniform random functions, so it won't be permute() that first
> introduces rounding issues.

Sure. I'd like permute to be immune to that.

>> See attached v27 proposal.
>
> This update has a number of flaws. For example, this:

Indeed:-)

> It still uses a 48-bit RandomState, so it doesn't improve on getrand()
> in that respect.

Sure. I'm pretty unhappy with that one, but I was not trying to address
that. My idea was that randu64 would be replaced with something better at
some point. My intention was "64-bit pseudo-random"; my implementation
does not work, ok:-)

> It replaces a single erand48() call with 2 jrand48() calls, which
> comes at a cost in terms of performance. (In fact, changing the number
> of rounds in the previous version of permute() from 4 to 6 has a
> smaller performance impact than this -- more about that below.)

Sure, same remark as above: I was not trying to address that point.

> jrand48() returns a signed 32-bit integer, which has a 50% chance of
> being negative, so when that is cast to a uint64, there is a 50%
> chance that the 32 most significant bits will be 1.

Argh.

> When the various parts are OR'ed together, that will then mask out any
> randomness from the other parts. For example, 50% of the time, the
> jrand48() value used for r1 will be negative, and so 32 bits in the
> middle of the final result will all be set.

Argh. I hesitated to use xor. I should not have:-)

> So overall, the results will be highly non-uniform, with less
> randomness and poorer performance than erand48().

Indeed, bad choice. I wanted to use the unsigned version but it is not
implemented, and I switched to the signed version without thinking of
some of the implications.

> In addition, it returns a result in the range [0,2^64), which is not
> really what's wanted. [...] The previous code used r = getrand(0,
> size-1), which gave a uniformly random offset. However, the new code
> risks introducing additional non-uniformity when size is not a power
> of 2.

ISTM that the overall non-uniformity is worse with the float approach as
opposed to the int approach. Conceptually, the same kind of bias is
expected whether you go through floats or through ints, because the
underlying power-of-two maths is the same, so what makes the difference
in reducing non-uniformity is using more bits. Basically, when enough
bits are used, each value appears either n or n+1 times, and the
imbalance is negligible. When not enough bits are provided, things get
ugly: for instance, with size = 2^53, even if the floats carried a full
52-bit pseudo-random mantissa (it is really 48 bits with erand48), only
even numbers would be produced, whereas with ints all numbers are
produced. With erand48, when size is above 48 bits, ISTM that the last
bits are always zero with the double approach. I'm not counting lost
values because of size truncation when converting it to double.

> Finally, worst of all, this random offset is no longer bijective, due
> to 64-bit integer wrap-around. [...] So not only are the results no
> longer uniformly random, they might not even form a permutation.

Indeed, this one is pretty fun! Probably the right formula for this
approach is "(v + r % size) % size", which is kind of a mouthful (see the
check after this message).

I fully agree that my v27 implementation is botched on many dimensions,
some of them intentional and temporary (using jrand48 twice) and some of
them accidental (not considering int overflows, being optimistic on
signed-to-unsigned casts…).

I still disagree, though, that going through floating point is the right
thing to do, because of some of the issues I outlined above (eg
truncation and rounding for sizes above 48/52 bits). Basically I think
that an algorithm dealing with integers should not have to resort to
floating point computations unless it is actually required. This is not
the case for permute, where v26 is using doubles as glorified 48-bit
integers, which could be extended to 52-bit integers, but no more. The
only benefit I see is implicitly using the internal 104-bit rounding by
truncation on multiply, but I do not think that implicitly reducing the
practical int values to 52 bits is worth it, and the same quality (bias)
can be achieved for 63-bit integers by keeping them as ints and writing
the right formula, which I fully failed to demonstrate in v27.

> I did some more testing of the previous version (v26), this time
> looking at much larger sizes, all the way up to the maximum, which is
> 2^63-1 since it comes from a signed int64. In general, the results
> were very good, though I did notice some slight non-uniformity in the
> way adjacent inputs were separated from one another when the size was
> just under a power of 2. I think that's the hardest case for this
> algorithm, because there's very little overlap between the 2 halves.

Yes, fewer values are steered twice per round. However, as for adjacent
values at large sizes, I'm wondering whether this may have more to do
with the 48-bit limitations, so that the lower bits are not really xored,
for instance. Not sure.

> Increasing the number of rounds from 4 to 6 ironed out that
> non-uniformity (and as mentioned above, is still cheaper than using
> randu64() with 4 rounds), so I think we should go with that.

There is a quality-cost tradeoff. With the previous version I convinced
myself that 4 rounds were a good compromise (not perfect, but ok for
keeping the cost low on practical sizes). With this version, I'll admit
that I do not have an opinion.

> You may wish to submit a separate patch to replace pgbench's use of
> *rand48() with something else, and that would be discussed on its own
> merits, but I don't see why that should hold up adding permute().

I'll see.

Attached is a v28 which I hope fixes the many above issues and stays with
ints. The randu64 is still some kind of a joke: I artificially reduced
the cost by calling jrand48 once and extending it to 64 bits, to give an
idea of the cost that would be endured if a 64-bit PRNG were used.

Now you are the committer, you can do as you please; I'm just stating my
(mathematical) opinions about using floating point computations for that.
I think that apart from this point of principle/philosophy the permute
performance and implementation are reasonable, and better than my initial
version because it avoids int128 computations and the large prime number
business.

--
Fabien.
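[As a postscript to the "(v + r % size) % size" suggestion above, a
standalone brute-force check that the corrected offset is a bijection on
[0, size), even for the problematic offset from the earlier example:
reducing r modulo size first keeps v + (r % size) below 2*size, so no
64-bit wrap-around can occur for size < 2^63.]

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define SIZE 100

int
main(void)
{
	uint64_t	r = UINT64_MAX - 9;	/* 2^64 - 10, the offset from the example */
	int			seen[SIZE];
	int			collisions = 0;

	memset(seen, 0, sizeof(seen));

	for (uint64_t v = 0; v < SIZE; v++)
	{
		uint64_t	out = (v + r % SIZE) % SIZE;

		if (seen[out]++)
			collisions++;
	}

	printf("collisions: %d\n", collisions);	/* prints 0: a bijection */
	return 0;
}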
Attachment
On Mon, 5 Apr 2021 at 13:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Attached is a v28 which I hope fixes the many above issues and stays with
> ints. The randu64 is still some kind of a joke: I artificially reduced
> the cost by calling jrand48 once and extending it to 64 bits, to give an
> idea of the cost that would be endured if a 64-bit PRNG were used.
>
> Now you are the committer, you can do as you please; I'm just stating my
> (mathematical) opinions about using floating point computations for that.
> I think that apart from this point of principle/philosophy the permute
> performance and implementation are reasonable, and better than my initial
> version because it avoids int128 computations and the large prime number
> business.
>

Pushed.

I decided not to go with the "joke" randu64() function, but instead used
getrand() directly. This at least removes any *direct* use of doubles in
permute() (though of course they're still used indirectly), and means
that if getrand() is improved in the future, permute() will benefit too.

Regards,
Dean
Hello Dean,

> Pushed.
>
> I decided not to go with the "joke" randu64() function, but instead
> used getrand() directly. This at least removes any *direct* use of
> doubles in permute() (though of course they're still used indirectly),
> and means that if getrand() is improved in the future, permute() will
> benefit too.

Good idea; at least it is hidden and in one place.

Thanks for the push!

--
Fabien.