Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: General purpose hashing func in pgbench
Date
Msg-id alpine.DEB.2.20.1712200807550.5165@lancre
Whole thread Raw
In response to General purpose hashing func in pgbench  (Ildar Musin <i.musin@postgrespro.ru>)
Responses Re: General purpose hashing func in pgbench
List pgsql-hackers
Hello Ildar,

> Following up the recent discussion on zipfian distribution I was trying
> to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB
> uses zipfian distribution to generate keys in order simulate intensive
> load on small number of records as it happens in real world applications
> (e.g. blogs). One problem is that most popular records keys are
> clustered together. To scatter them across the keyspace authors use
> hashing, the FNV-1a hash function in particular [2].

Yes, clustering is an issue for realistic load tests and may influence the 
resulting measured performance unduly.

> I've read Fabien Coelho's thread on additional operators and functions.
> Generally it could be possible to implement some basic hashing
> algorithms right in a pgbench script using just bitwise and arithmetic
> operators. But should we probably provide users with some general
> purpose hash function?

The problem I see with hash functions is that it generates collisions, 
which is an undesirable effect when dealing with keys as it biases the 
initial randomization.

Thus I intend to submit a parametric pseudo-random permutation function, 
some day. As I foresaw that it might require some arguing I did not 
include the function in the operators and functions collection I 
submitted, so as not to slow down the inclusion process. Given that the 
patch was submitted over 20 months ago and is still stuck in the CF, that 
was a misjudgement:-)

This is no no way a point against including one/some hash functions, 
especially if they actually appear in a benchmark.

> The attached patch introduces hash() function which implements FNV-1a as
> an example of such hashing algorithm. There are also couple of images in
> the attachement that I have got from visualizing original zipfian
> distribution and the hashed one. Usage example:

> In psql:
> create table abc as select generate_series(0, 999) as a, 0 as b;
>
> pgbench script:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> begin;
> update abc set b = b + 1 where a = :key;
> end;
>
> Any thoughts or suggestions?

As there may be several hash functions included in the long run. I'd 
suggest that the hash function should be named more precisely, eg 
"hash_fnv1a".

The image looks like the distribution is more regularly scattered than 
actually randomized... Maybe this is because the first highest 256 values 
are really scattered by the process multiply/modulo process. Or maybe this 
is an optical effect?

ISTM that there are undesired utf8 chars in a comment. Should be kept 
ASCII.

I would put the actual hash computation in a separate function rather than 
inlined in the evaluator.

Add the submission to the next CF?

-- 
Fabien.


pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: User defined data types in Logical Replication
Next
From: Amit Langote
Date:
Subject: Re: [HACKERS] path toward faster partition pruning