Re: gaussian distribution pgbench -- splits v4 - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: gaussian distribution pgbench -- splits v4
Date
Msg-id alpine.DEB.2.10.1408010905040.9457@sto
Whole thread Raw
In response to Re: gaussian distribution pgbench -- splits v4  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: gaussian distribution pgbench -- splits v4
List pgsql-hackers
Hello,

>> Version one is "k' = 1 + (a * k + b) modulo n" with "a" prime with 
>> respect to "n", "n" being the number of keys. This is nearly possible, 
>> but for the modulo operator which is currently missing, and that I'm 
>> planning to submit for this very reason, but probably another time.
>
> That's pretty crude,

Yep. It is very simple, it is much better than nothing, and for a database 
test is may be "good enough".

> although I don't object to a modulo operator.  It would be nice to be 
> able to use a truly random permutation, which is not hard to generate 
> but probably requires O(n) storage, likely a problem for large scale 
> factors.

That is indeed the actual issue in my mind. I was thinking of permutations 
with a formula, which are not so easy to find and may end-up looking like 
"(a*k+b)%n" anyway. I had the same issue for generating random data for a 
schema (see http://www.coelho.net/datafiller.html).

> Maybe somebody who knows more math than I do (like you, probably!) can 
> come up with something more clever.

I can certainly suggest other formula, but that does not mean beautiful 
code, thus would probably be rejected. I'll see.

An alternative to this whole process may be to hash/modulo a non uniform 
random value.
      id = 1 + hash(some-random()) % n

But the hashing changes the distribution as it adds collisions, so I have 
to think about how to be able to control the distribution in that case, 
and what hash function to use.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: numeric and float comparison oddities
Next
From: Mitsumasa KONDO
Date:
Subject: Re: gaussian distribution pgbench -- splits v4