Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: General purpose hashing func in pgbench |
Date | |
Msg-id | alpine.DEB.2.20.1801121614470.13422@lancre |
In response to | Re: General purpose hashing func in pgbench (Ildar Musin <i.musin@postgrespro.ru>) |
Responses | Re: General purpose hashing func in pgbench |
List | pgsql-hackers |
Hello Ildar,

>> so that different instances of hash function within one script would
>> have different seeds. Yes, that is a good idea, I can do that.
>>
> Added this feature in attached patch. But on a second thought this could
> be something that user won't expect. For example, they may want to run
> pgbench with two scripts:
> - the first one updates row by key that is a hashed random_zipfian value;
> - the second one reads row by key generated the same way
>   (that is actually what YCSB workloads A and B do)
>
> It feels natural to write something like this:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> in both scripts and expect that they both would have the same
> distribution. But they wouldn't. We could of course describe this
> implicit behaviour in documentation, but ISTM that shared seed would be
> more clear.

I think that it depends on the use case and that both can be useful, so
there should be a way to do either.

With an "always different" default seed, distinct distributions are
achieved with:

   -- DIFF distinct seeds inside and between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

and the same distribution can be obtained with an explicit seed:

   -- DIFF same seed inside and between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 5432)) % 1000

The drawback is that the same seed is used between runs in this case,
which is not desirable. This could be circumvented by adding the random
seed as an automatic variable and using it, e.g.:

   -- DIFF same seed inside run, distinct between runs
   \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000

Now, with a shared hash_seed, the same distribution is the default:

   -- SHARED same underlying hash_seed inside run, distinct between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

However, some trick is now needed to get distinct seeds.
With

   -- SHARED distinct seed inside run, but same between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 2345)) % 1000

we are back to the same issue as in the previous case, because the
distribution is then the same from one run to the next, which is not
desirable. I found this workaround trick:

   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, hash(5432))) % 1000
   \set j1 abs(hash(:r2, hash(2345))) % 1000

Or, with a new :hash_seed or :random_seed automatic variable, we could
also have:

   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000

This provides controllable seeds that are distinct between runs, but that
can be made equal within a run if desired, by reusing the same value to be
hashed as a seed.

I also agree with your argument that the user may reasonably expect that
hash(5432) == hash(5432) inside and between scripts, at least within the
same run, and so would be surprised if that were not the case.

So I've changed my mind; I'm sorry for making you go back and forth on the
subject. I'm now okay with one shared 64-bit hash seed, with clear
documentation of that fact, an outline of the trick to achieve distinct
distributions inside a run if desired, and an explanation of why it would
be desirable to avoid correlations.

Also, I think that providing the seed as an automatic variable (:hash_seed
or :hseed or whatever) would make some sense as well. Maybe this could be
used as a way to fix the seed explicitly, e.g.:

   pgbench -D hash_seed=1234 ...

would use this value instead of the randomly generated one. Also, with
that, the default inserted second argument could simply be ":hash_seed",
which would simplify the executor, since it would not have to check for an
optional second argument.

-- 
Fabien.
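
The seeding options discussed in this message can be illustrated outside pgbench with a minimal Python sketch. It is only an approximation under stated assumptions: a seeded 64-bit FNV-1a stands in for the proposed hash() (the function in the actual patch may well differ), and the names seeded_hash, key and run are hypothetical helpers introduced here, not pgbench identifiers.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in for hash(value, seed): FNV-1a over the 8 little-endian
    bytes of value, with the seed XORed into the initial state.
    (Assumption: this is not necessarily the hash used by the patch.)"""
    h = (FNV_OFFSET ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def key(value: int, seed: int, buckets: int = 1000) -> int:
    """Rough analogue of abs(hash(:r, seed)) % 1000; the result is
    unsigned here, so no abs() is needed."""
    return seeded_hash(value, seed) % buckets


def run(run_seed: int, offset_i: int = 0, offset_j: int = 0, n: int = 200):
    """Keys that two scripts compute for the same inputs 0..n-1, each
    using the per-run seed plus its own offset."""
    seed_i = (run_seed + offset_i) & MASK64
    seed_j = (run_seed + offset_j) & MASK64
    i_keys = [key(v, seed_i) for v in range(n)]
    j_keys = [key(v, seed_j) for v in range(n)]
    return i_keys, j_keys


if __name__ == "__main__":
    # Shared seed: both scripts map a value to the same key.
    i_keys, j_keys = run(run_seed=1234)
    print("shared seed, same mapping:", i_keys == j_keys)

    # Distinct offsets (the ":hash_seed + 5432" trick): mappings differ.
    i_keys, j_keys = run(run_seed=1234, offset_i=5432, offset_j=2345)
    print("distinct offsets, same mapping:", i_keys == j_keys)

    # A different per-run seed changes the mapping between runs.
    print("same mapping across runs:", run(1234)[0] == run(4321)[0])
```

In this sketch the per-run seed plays the role of the proposed :hash_seed variable: leaving the offsets equal keeps two scripts on the same key mapping, while distinct offsets decorrelate them without fixing the mapping across runs.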