Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers
From | Ildar Musin |
---|---|
Subject | Re: General purpose hashing func in pgbench |
Date | |
Msg-id | bca39e87-6d40-98f6-e4ef-e2d88e109f9c@postgrespro.ru |
In response to | Re: General purpose hashing func in pgbench (Fabien COELHO <coelho@cri.ensmp.fr>) |
Responses |
Re: General purpose hashing func in pgbench
|
List | pgsql-hackers |
Hi Fabien,

On 13/01/2018 11:16, Fabien COELHO wrote:
>
> Hello Ildar,
>
>>> so that different instances of hash function within one script would
>>> have different seeds. Yes, that is a good idea, I can do that.
>>>
>> Added this feature in the attached patch. But on second thought this
>> could be something that users won't expect. For example, they may want
>> to run pgbench with two scripts:
>> - the first one updates a row by a key that is a hashed random_zipfian
>>   value;
>> - the second one reads a row by a key generated the same way
>>   (that is actually what YCSB workloads A and B do)
>>
>> It feels natural to write something like this:
>>   \set rnd random_zipfian(0, 1000000, 0.99)
>>   \set key abs(hash(:rnd)) % 1000
>> in both scripts and expect that they both would have the same
>> distribution. But they wouldn't. We could of course describe this
>> implicit behaviour in the documentation, but ISTM that a shared seed
>> would be clearer.
>
> I think that it depends on the use case, that both can be useful, so
> there should be a way to do either.
>
> With an "always different" default seed, distinct distributions are
> achieved with:
>
>   -- DIFF distinct seeds inside and between runs
>   \set i1 abs(hash(:r1)) % 1000
>   \set j1 abs(hash(:r2)) % 1000
>
> and the same distribution can be done with an explicit seed:
>
>   -- DIFF same seed inside and between runs
>   \set i1 abs(hash(:r1, 5432)) % 1000
>   \set j1 abs(hash(:r2, 5432)) % 1000
>
> The drawback is that the same seed is used between runs in this case,
> which is not desirable. This could be circumvented by adding the
> random seed as an automatic variable and using it, eg:
>
>   -- DIFF same seed inside run, distinct between runs
>   \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
>   \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000
>
>
> Now with a shared hash_seed the same distribution is the default:
>
>   -- SHARED same underlying hash_seed inside run, distinct between runs
>   \set i1 abs(hash(:r1)) % 1000
>   \set j1 abs(hash(:r2)) % 1000
>
> However some trick is needed now to get distinct seeds. With
>
>   -- SHARED distinct seed inside run, but same between runs
>   \set i1 abs(hash(:r1, 5432)) % 1000
>   \set j1 abs(hash(:r2, 2345)) % 1000
>
> we are back to the same issue as the previous case because then the
> distribution is the same from one run to the next, which is not
> desirable. I found this workaround trick:
>
>   -- SHARED distinct seeds inside and between runs
>   \set i1 abs(hash(:r1, hash(5432))) % 1000
>   \set j1 abs(hash(:r2, hash(2345))) % 1000
>
> Or with a new :hash_seed or :random_seed automatic variable, we could
> also have:
>
>   -- SHARED distinct seeds inside and between runs
>   \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
>   \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000
>
> It provides controllable distinct seeds between runs but equal ones
> between calls, if desired, by reusing the same value to be hashed as a
> seed.
>
> I also agree with your argument that the user may reasonably expect
> that hash(5432) == hash(5432) inside and between scripts, at least on
> the same run, so would be surprised that it is not the case.
>
> So I've changed my mind, I'm sorry for making you go back and forth
> on the subject. I'm now okay with one shared 64 bit hash seed, with
> clear documentation about the fact, and an outline of the trick to
> achieve distinct distributions inside a run if desired and why it
> would be desirable to avoid correlations. Also, I think that providing
> the seed as an automatic variable (:hash_seed or :hseed or whatever)
> would make some sense as well. Maybe this could be used as a way to
> fix the seed explicitly, eg:
>
>   pgbench -D hash_seed=1234 ...
>
> would use this value instead of the randomly generated one. Also, with
> that the default inserted second argument could be simply
> ":hash_seed", which would simplify the executor, which would not have
> to check for an optional second argument.
>

Here is a new version of the patch. I've split it into two parts. The
first one is almost the same as v4 from [1] with some refactoring. The
second part introduces the random_seed variable as you proposed. I didn't
do the executor simplification yet because I'm a little concerned
about inventive users who may want to change the random_seed variable at
runtime (which is possible since pgbench doesn't have read-only variables,
aka constants, AFAIK).

[1] https://www.postgresql.org/message-id/43a8fbbb-32fa-6478-30a9-f64041adf019%40postgrespro.ru

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
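
For readers who want to see the shared-seed versus per-call-seed behaviour discussed above outside pgbench, here is a minimal Python sketch. It is not the patch's code. seeded_hash is a hypothetical stand-in (64-bit FNV-1a with the seed XORed into the offset basis), run_seed plays the role of the proposed random_seed variable, and the constants 5432 and 2345 are simply the ones used in the examples above. Because this stand-in returns an unsigned value, no abs() is needed.

```python
import random

MASK64 = (1 << 64) - 1
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def seeded_hash(value: int, seed: int) -> int:
    # Hypothetical stand-in for hash(value, seed): 64-bit FNV-1a over the
    # eight little-endian bytes of value, seed XORed into the offset basis.
    h = (FNV_OFFSET_BASIS ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def keys(values, seed, buckets=1000):
    # Rough analogue of: \set key abs(hash(:rnd, seed)) % 1000
    return [seeded_hash(v, seed) % buckets for v in values]


if __name__ == "__main__":
    values = list(range(10))
    # Assumption: one seed is drawn per run and shared by every script.
    run_seed = random.getrandbits(64)

    # Shared seed: the update script and the select script agree on keys.
    update_keys = keys(values, run_seed)
    select_keys = keys(values, run_seed)
    print("shared seed, same keys:", update_keys == select_keys)

    # Per-call seeds derived from the run seed: the mappings differ within
    # a run and also change from one run to the next.
    i1 = keys(values, run_seed + 5432)
    j1 = keys(values, run_seed + 2345)
    print("derived seeds, same keys:", i1 == j1)
```

With a shared seed, two scripts that hash the same random_zipfian value land on the same key, which is what the YCSB-style update/read pair needs. Adding a per-call constant to the run seed is one way to decorrelate two call sites without fixing the distribution across runs.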