Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers

From	Ildar Musin
Subject	Re: General purpose hashing func in pgbench
Date	January 16, 2018 17:49:23
Msg-id	bca39e87-6d40-98f6-e4ef-e2d88e109f9c@postgrespro.ru
In response to	Re: General purpose hashing func in pgbench (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses	Re: General purpose hashing func in pgbench
List	pgsql-hackers

Hi Fabien,


13/01/2018 11:16, Fabien COELHO wrote:
>
> Hello Ildar,
>
>>> so that different instances of hash function within one script would
>>> have different seeds. Yes, that is a good idea, I can do that.
>>>
>> Added this feature in the attached patch. But on second thought this could
>> be something that users won't expect. For example, they may want to run
>> pgbench with two scripts:
>> - the first one updates a row by a key that is a hashed random_zipfian
>> value;
>> - the second one reads a row by a key generated the same way
>> (that is actually what YCSB workloads A and B do)
>>
>> It feels natural to write something like this:
>> \set rnd random_zipfian(0, 1000000, 0.99)
>> \set key abs(hash(:rnd)) % 1000
>> in both scripts and expect that they both would have the same
>> distribution. But they wouldn't. We could of course describe this
>> implicit behaviour in the documentation, but ISTM that a shared seed
>> would be clearer.
>
> I think that it depends on the use case, that both can be useful, so
> there should be a way to do either.
>
> With an "always different" default seed, distinct distributions are
> achieved with:
>
>    -- DIFF distinct seeds inside and between runs
>    \set i1 abs(hash(:r1)) % 1000
>    \set j1 abs(hash(:r2)) % 1000
>
> and the same distribution can be done with an explicit seed:
>
>    -- DIFF same seed inside and between runs
>    \set i1 abs(hash(:r1, 5432)) % 1000
>    \set j1 abs(hash(:r2, 5432)) % 1000
>
> The drawback is that the same seed is used between runs in this case,
> which is not desirable. This could be circumvented by adding the
> random seed as an automatic variable and using it, eg:
>
>    -- DIFF same seed inside run, distinct between runs
>    \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
>    \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000
>
>
> Now with a shared hash_seed, the same distribution is the default:
>
>    -- SHARED same underlying hash_seed inside run, distinct between runs
>    \set i1 abs(hash(:r1)) % 1000
>    \set j1 abs(hash(:r2)) % 1000
>
> However some trick is needed now to get distinct seeds. With
>
>    -- SHARED distinct seed inside run, but same between runs
>    \set i1 abs(hash(:r1, 5432)) % 1000
>    \set j1 abs(hash(:r2, 2345)) % 1000
>
> We are back to the same issue as in the previous case, because the
> distribution is then the same from one run to the next, which is not
> desirable. I found this workaround trick:
>
>    -- SHARED distinct seeds inside and between runs
>    \set i1 abs(hash(:r1, hash(5432))) % 1000
>    \set j1 abs(hash(:r2, hash(2345))) % 1000
>
> Or with a new :hash_seed or :random_seed automatic variable, we could
> also have:
>
>    -- SHARED distinct seeds inside and between runs
>    \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
>    \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000
>
> It provides controllable distinct seeds between runs, but equal ones
> within a run if desired, by reusing the same value to be hashed as a seed.
>
> I also agree with your argument that the user may reasonably expect
> that hash(5432) == hash(5432) inside and between scripts, at least on
> the same run, so would be surprised that it is not the case.
>
> So I've changed my mind, and I'm sorry for making you go back and forth
> on the subject. I'm now okay with one shared 64-bit hash seed, with
> clear documentation of that fact, an outline of the trick to achieve
> distinct distributions inside a run if desired, and an explanation of
> why avoiding correlations would be desirable. Also, I think that
> providing the seed as an automatic variable (:hash_seed or :hseed or
> whatever) would make some sense as well. Maybe this could be used as a
> way to fix the seed explicitly, e.g.:
>
>    pgbench -D hash_seed=1234 ...
>
> would use this value instead of the randomly generated one. Also, with
> that, the default inserted second argument could simply be
> ":hash_seed", which would simplify the executor, since it would not
> have to check for an optional second argument.
>
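To make the seed options discussed above concrete, here is a rough Python sketch. It is not pgbench code: seeded_hash() is a hypothetical stand-in (keyed BLAKE2b) for whatever seeded 64-bit hash pgbench ends up using, and run_seed plays the role of the per-run :hash_seed / :random_seed value. It only models which expressions share a mapping and which do not.

```python
import hashlib
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in seeded 64-bit hash (keyed BLAKE2b); NOT pgbench's real hash."""
    key = struct.pack("<Q", seed & MASK64)
    digest = hashlib.blake2b(struct.pack("<q", value), digest_size=8, key=key).digest()
    return struct.unpack("<q", digest)[0]  # signed 64-bit result


def bucket(value: int, seed: int) -> int:
    """Models `abs(hash(:r, seed)) % 1000` from the examples above."""
    return abs(seeded_hash(value, seed)) % 1000


def run(run_seed: int):
    """Return the key sequences produced during one simulated run."""
    values = range(50)
    # Shared default seed: every hash(:r) in every script maps the same way.
    shared = [bucket(v, run_seed) for v in values]
    # Assumed `:hash_seed + constant` trick: distinct mappings inside a run.
    i1 = [bucket(v, run_seed + 5432) for v in values]
    j1 = [bucket(v, run_seed + 2345) for v in values]
    return shared, i1, j1
```

With this model, two calls to run() with the same run_seed give identical sequences, a different run_seed changes them, and i1 and j1 differ from each other inside a single run.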
Here is a new version of the patch. I've split it into two parts. The
first one is almost the same as v4 from [1] with some refactoring. The
second part introduces the random_seed variable as you proposed. I haven't
done the executor simplification yet because I'm a little concerned
about inventive users who may want to change the random_seed variable at
runtime (which is possible since pgbench doesn't have read-only
variables, aka constants, AFAIK).
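
The concern can be illustrated with a small Python model. This is only a sketch under assumptions, not the patch itself: Session, hash_internal() and hash_via_variable() are hypothetical names, and seeded_hash() is a stand-in (keyed BLAKE2b) for the real hash. It contrasts a seed kept internally for the whole run with a default second argument that is looked up in a writable variable on every evaluation.

```python
import hashlib
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in seeded 64-bit hash (keyed BLAKE2b); NOT pgbench's real hash."""
    key = struct.pack("<Q", seed & MASK64)
    digest = hashlib.blake2b(struct.pack("<q", value), digest_size=8, key=key).digest()
    return struct.unpack("<q", digest)[0]


class Session:
    """Hypothetical model of one pgbench run with a user-visible seed variable."""

    def __init__(self, seed: int):
        self.internal_seed = seed                # fixed for the whole run
        self.variables = {"random_seed": seed}   # writable, like any \set variable

    def hash_internal(self, value: int) -> int:
        # Executor fills in the missing second argument from its own fixed seed.
        return seeded_hash(value, self.internal_seed)

    def hash_via_variable(self, value: int) -> int:
        # Models rewriting hash(x) as hash(x, :random_seed) at parse time:
        # the variable is read at evaluation time, so later \set calls matter.
        return seeded_hash(value, self.variables["random_seed"])


def demo():
    s = Session(1234)
    before = (s.hash_internal(7), s.hash_via_variable(7))
    s.variables["random_seed"] = 42              # user does: \set random_seed 42
    after = (s.hash_internal(7), s.hash_via_variable(7))
    return before, after
```

In this model the internally seeded result stays stable after the reassignment, while the variable-based default silently changes, which is the behaviour I would like to avoid surprising users with.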

[1]
https://www.postgresql.org/message-id/43a8fbbb-32fa-6478-30a9-f64041adf019%40postgrespro.ru

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
