Re: General purpose hashing func in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: General purpose hashing func in pgbench
Date
Msg-id alpine.DEB.2.20.1801170727240.11884@lancre
Whole thread Raw
In response to Re: General purpose hashing func in pgbench  (Ildar Musin <i.musin@postgrespro.ru>)
Responses Re: General purpose hashing func in pgbench
List pgsql-hackers
Hello Ildar,

> Here is a new version of patch. I've splitted it into two parts. The 
> first one is almost the same as v4 from [1] with some refactoring. The 
> second part introduces random_seed variable as you proposed.

Patch 1 applies. Compilations fails, there are two "hash_seed" declared in 
"pgbench.c".

Patch 2 applies cleanly on top of the previous patch and compiles, because 
the variable is removed...

If an ":hash_seed" pgbench variable is used, ISTM that there is no need 
for a global variable at all, so the two patches are going back and forth, 
which is unhelpful. ISTM better to provide just one combined patch for the 
feature.

If the hash_seed variable really needs to be kept, it should be an "int64" 
variable, like other pgbench values. Calling random just usually 
initializes about 31 bits, so random should be called 2 or maybe 3 times? 
Or maybe use the internal getrand which has 48 pseudorandom bits?

For me "random seed" is what is passed to srandom, so the variable should 
rather be named hash_seed because there could also be a random seed 
(actually, there is in another parallel patch:-). Moreover, this seed may 
or may not be random if set, so calling it "random_seed" is not desirable.

The len < 1 || len > 2 is checked twice, once in the "switch", on in an 
"if" just after the "switch". Once is enough.

> I didn't do the executor simplification thing yet because I'm a little 
> concerned about inventive users, who may want to change random_seed 
> variable in runtime (which is possible since pgbench doesn't have read 
> only variables aka constants AFAIK).

If the user choses to overide hash_seed in their script, it is their 
decision, the documentation has only to be clear about :hash_seed being 
the default seed. I see no clear reason to work around this possibility by 
evaluating the seed at parse time, especially as the variable may not have 
its final value yet depending on the option order. I'd suggest to just use 
make_variable("hash_seed") for the default second argument and simplify 
the executor.

The seed variable is not tested explicitely in the script, you could add
a "hash(5432) == hash(5432, :hash_seed)" for instance.

It would be nice if an native English speaker could proofread the 
documentation text. I'd suggest: "*an* optional seed parameter". "In case 
*the* seed...". "<literal>:hash_seed</literal>". "shared for" -> "shared 
by". "following listing" -> "following pgbench script". "few accounts 
generates" -> "few accounts generate".

For the document example, I'd use larger values for the random & modulo, 
eg 100000000 and 1000000. The drawback is that zipfian does a costly 
computation of the generalized harmonic number when the parameter is lower 
than 1.0. For cities, the parameter found by Zipf was 1.07 (says 
Wikipedia). Maybe use this historical value. Or maybe use an exponential 
distribution in the example.

-- 
Fabien.


pgsql-hackers by date:

Previous
From: Ashutosh Sharma
Date:
Subject: Test-cases for exclusion constraints is missing in alter_table.sql file
Next
From: Marina Polyakova
Date:
Subject: Re: WIP Patch: Precalculate stable functions, infrastructure v1