Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: General purpose hashing func in pgbench
Date
Msg-id alpine.DEB.2.20.1801121614470.13422@lancre
In response to Re: General purpose hashing func in pgbench  (Ildar Musin <i.musin@postgrespro.ru>)
Responses Re: General purpose hashing func in pgbench
List pgsql-hackers
Hello Ildar,

>> so that different instances of hash function within one script would
>> have different seeds. Yes, that is a good idea, I can do that.
>>
> Added this feature in attached patch. But on a second thought this could
> be something that user won't expect. For example, they may want to run
> pgbench with two scripts:
> - the first one updates row by key that is a hashed random_zipfian value;
> - the second one reads row by key generated the same way
> (that is actually what YCSB workloads A and B do)
>
> It feels natural to write something like this:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> in both scripts and expect that they both would have the same
> distribution. But they wouldn't. We could of course describe this
> implicit behaviour in documentation, but ISTM that shared seed would be
> more clear.

I think that it depends on the use case, that both can be useful, so there 
should be a way to do either.

With an "always different" default seed, distinct distributions are achieved
with:

    -- DIFF distinct seeds inside and between runs
    \set i1 abs(hash(:r1)) % 1000
    \set j1 abs(hash(:r2)) % 1000

and the same distribution can be done with an explicit seed:

    -- DIFF same seed inside and between runs
    \set i1 abs(hash(:r1, 5432)) % 1000
    \set j1 abs(hash(:r2, 5432)) % 1000

The drawback is that the same seed is used between runs in this case, 
which is not desirable. This could be circumvented by adding the random 
seed as an automatic variable and using it, eg:

    -- DIFF same seed inside run, distinct between runs
    \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
    \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000


Now with a shared hash_seed, the same distribution is the default:

    -- SHARED same underlying hash_seed inside run, distinct between runs
    \set i1 abs(hash(:r1)) % 1000
    \set j1 abs(hash(:r2)) % 1000

However, some trick is now needed to get distinct seeds. With

    -- SHARED distinct seed inside run, but same between runs
    \set i1 abs(hash(:r1, 5432)) % 1000
    \set j1 abs(hash(:r2, 2345)) % 1000

we are back to the same issue as in the previous case, because the
distribution is then the same from one run to the next, which is not
desirable. I found this workaround trick:

    -- SHARED distinct seeds inside and between runs
    \set i1 abs(hash(:r1, hash(5432))) % 1000
    \set j1 abs(hash(:r2, hash(2345))) % 1000

Or with a new :hash_seed or :random_seed automatic variable, we could also 
have:

    -- SHARED distinct seeds inside and between runs
    \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
    \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000

This provides controllable seeds that are distinct between runs, and that
can be made equal inside a run if desired by reusing the same constant
(the hashed value or the offset) in both expressions.
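
To make the cases above concrete, here is a small Python model of the
"shared seed" behaviour. It is only a sketch: seeded_hash() is a stand-in
FNV-1a style function and make_hash() is a hypothetical helper, neither is
what the patch implements, and the seed values 111 and 222 are arbitrary.

```python
MASK64 = (1 << 64) - 1

def seeded_hash(value, seed):
    """Stand-in seeded 64-bit hash (FNV-1a style), not pgbench's real function."""
    h = (0xCBF29CE484222325 ^ seed) & MASK64
    value &= MASK64
    for i in range(8):  # mix the value one octet at a time
        h ^= (value >> (8 * i)) & 0xFF
        h = (h * 0x100000001B3) & MASK64
    return h

def make_hash(hash_seed):
    """Model one pgbench run: hash(v) falls back to the run's shared seed."""
    def hash_(value, seed=None):
        return seeded_hash(value, hash_seed if seed is None else seed)
    return hash_

run_a = make_hash(hash_seed=111)  # first run (arbitrary shared seed)
run_b = make_hash(hash_seed=222)  # second run, different shared seed
r = 42                            # some value to hash

# SHARED default: same result inside a run, different between runs
shared_same_in_run = run_a(r) == run_a(r)
shared_differs_between_runs = run_a(r) != run_b(r)

# Explicit constant seed: the result is frozen across runs
constant_seed_same_between_runs = run_a(r, 5432) == run_b(r, 5432)

# hash(:r, hash(5432)) trick: distinct inside a run and between runs
trick_distinct_in_run = run_a(r, run_a(5432)) != run_a(r, run_a(2345))
trick_distinct_between_runs = run_a(r, run_a(5432)) != run_b(r, run_b(5432))
```

All five flags come out true for this stand-in function. With a real hash
function the "distinct" cases would only hold with very high probability,
not with certainty.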

I also agree with your argument that the user may reasonably expect that 
hash(5432) == hash(5432) inside and between scripts, at least on the same 
run, so would be surprised that it is not the case.

So I've changed my mind; I'm sorry for making you go back and forth on
the subject. I'm now okay with one shared 64-bit hash seed, with clear
documentation of that fact, and an outline of the trick to achieve
distinct distributions inside a run if desired and why it would be
desirable to avoid correlations. Also, I think that providing the seed as
an automatic variable (:hash_seed or :hseed or whatever) would make some
sense as well. Maybe this could be used as a way to fix the seed
explicitly, eg:

    pgbench -D hash_seed=1234 ...

This would use the given value instead of the randomly generated one.
Also, with that, the default inserted second argument could simply be
":hash_seed", which would simplify the executor, as it would not have to
check for an optional second argument.
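
As a rough illustration of that idea, in Python rather than pgbench's C:
the seed is taken from a user definition when there is one, and a
one-argument hash() call is completed at parse time. Both function names
and the list-of-strings argument representation are hypothetical, chosen
only for this sketch.

```python
import secrets

def resolve_hash_seed(defines):
    """Use a user-supplied hash_seed (as from -D) if present, else a random one."""
    if "hash_seed" in defines:
        return int(defines["hash_seed"])
    return secrets.randbits(64)  # random 64-bit seed for this run

def fill_default_seed(args):
    """Append ':hash_seed' when hash() was written with a single argument."""
    if len(args) == 2:
        return list(args)
    return [args[0], ":hash_seed"]

explicit = resolve_hash_seed({"hash_seed": "1234"})   # seed fixed by the user
generated = resolve_hash_seed({})                     # seed drawn at random
completed = fill_default_seed([":rnd"])               # hash(:rnd)
untouched = fill_default_seed([":rnd", "5432"])       # hash(:rnd, 5432)
```

With this shape the executor always sees two arguments, so the
"is there a seed?" test lives only in the parser.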

-- 
Fabien.

