Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: General purpose hashing func in pgbench |
Date | |
Msg-id | alpine.DEB.2.20.1712200807550.5165@lancre Whole thread Raw |
In response to | General purpose hashing func in pgbench (Ildar Musin <i.musin@postgrespro.ru>) |
Responses |
Re: General purpose hashing func in pgbench
|
List | pgsql-hackers |
Hello Ildar, > Following up the recent discussion on zipfian distribution I was trying > to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB > uses zipfian distribution to generate keys in order simulate intensive > load on small number of records as it happens in real world applications > (e.g. blogs). One problem is that most popular records keys are > clustered together. To scatter them across the keyspace authors use > hashing, the FNV-1a hash function in particular [2]. Yes, clustering is an issue for realistic load tests and may influence the resulting measured performance unduly. > I've read Fabien Coelho's thread on additional operators and functions. > Generally it could be possible to implement some basic hashing > algorithms right in a pgbench script using just bitwise and arithmetic > operators. But should we probably provide users with some general > purpose hash function? The problem I see with hash functions is that it generates collisions, which is an undesirable effect when dealing with keys as it biases the initial randomization. Thus I intend to submit a parametric pseudo-random permutation function, some day. As I foresaw that it might require some arguing I did not include the function in the operators and functions collection I submitted, so as not to slow down the inclusion process. Given that the patch was submitted over 20 months ago and is still stuck in the CF, that was a misjudgement:-) This is no no way a point against including one/some hash functions, especially if they actually appear in a benchmark. > The attached patch introduces hash() function which implements FNV-1a as > an example of such hashing algorithm. There are also couple of images in > the attachement that I have got from visualizing original zipfian > distribution and the hashed one. Usage example: > In psql: > create table abc as select generate_series(0, 999) as a, 0 as b; > > pgbench script: > \set rnd random_zipfian(0, 1000000, 0.99) > \set key abs(hash(:rnd)) % 1000 > begin; > update abc set b = b + 1 where a = :key; > end; > > Any thoughts or suggestions? As there may be several hash functions included in the long run. I'd suggest that the hash function should be named more precisely, eg "hash_fnv1a". The image looks like the distribution is more regularly scattered than actually randomized... Maybe this is because the first highest 256 values are really scattered by the process multiply/modulo process. Or maybe this is an optical effect? ISTM that there are undesired utf8 chars in a comment. Should be kept ASCII. I would put the actual hash computation in a separate function rather than inlined in the evaluator. Add the submission to the next CF? -- Fabien.
pgsql-hackers by date: