Re: CPU costs of random_zipfian in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: CPU costs of random_zipfian in pgbench
Date
Msg-id alpine.DEB.2.21.1902170907110.8818@lancre
Whole thread Raw
In response to CPU costs of random_zipfian in pgbench  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: CPU costs of random_zipfian in pgbench
Re: CPU costs of random_zipfian in pgbench
List pgsql-hackers
Hello Tomas,

> I'm trying to use random_zipfian() for benchmarking of skewed data sets, 
> and I ran head-first into an issue with rather excessive CPU costs. 
> [...] This happens because generalizedHarmonicNumber() does this:
>
>     for (i = n; i > 1; i--)
>         ans += pow(i, -s);
>
> where n happens to be 1000000000 (range passed to random_zipfian), so
> the loop takes quite a bit of time.

If you find a better formula for the harmonic number, you are welcome 
and probably get your name on it:-)

> So much that we only ever complete a
> single transaction, because this work happens in the context of the
> first transction, and so it counts into the 30-second limit.

That is why the value is cached, so it is done once per size & value.

If you want skewed but not especially zipfian, use exponential which is 
quite cheap. Also zipfian with a > 1.0 parameter does not have to compute 
the harmonic number, so it depends in the parameter.

Please also beware that non uniform keys are correlated (the more frequent 
are close values), which is somewhat non representative of what you would 
expect in real life. This is why I submitted a pseudo-random permutation 
function, which alas does not get much momentum from committers.

> The docs actually do mention performance of this function:
>
>    The function's performance is poor for parameter values close and
>    above 1.0 and on a small range.
>
> But that does not seem to cover the example I just posted, because 0.1
> is not particularly close to 1.0 (or above 1.0), and 1e9 values hardly
> counts as "small range".

Yep. The warning is about the repeated cost of calling random_zipfian, 
which is not good when the parameter is close to and above 1.0. It is not 
about the setup cost when the value is between 0 and &. This could indeed 
deserve a warning.

Now if you do benchmarking on a database, probably you want to run hours 
of to level out checkpointing and other background tasks, so the setup 
cost should be negligeable, in the end...

> I see this part of random_zipfian comes from "Quickly Generating
> Billion-Record Synthetic Databases" which deals with generating data
> sets, where wasting a bit of time is not a huge deal. But when running
> benchmarks it matters much more. So maybe there's a disconnect here?

Hmmm. The first author of this paper got a Turing award:-) The disconnect 
is just that the setup cost is neglected, or computed offline.

> Interestingly enough, I don't see this issue on values above 1.0, no
> matter how close to 1.0 those are. Although the throughput seems lower
> than with uniform access, so this part of random_zipfian is not quite
> cheap either.

Indeed. Pg provides an underlying uniform pseudo-random function, so 
generating uniform is cheap. Others need more or less expensive 
transformations.

> Now, I don't know what to do about this. Ideally, we'd have a faster
> algorithm to generate zipfian distributions

You are welcome to find one, and get famous (hmmm... among some 
specialized mathematicians at least:-) for it.

> - I don't know if such thing exists though. Or maybe we could precompute 
> the distribution first, not counting it into the benchmark duration.

Could be done, but this would require significant partial evaluation 
efforts and only work when the size and parameter are constants (although 
using the function with variable parameters would be a very bad idea 
anyway).

> But I guess the very least we can/should do is improving the docs to
> make it more obvious which cases are expected to be slow.

Yep. Attached is a doc & comment improvement.

-- 
Fabien.
Attachment

pgsql-hackers by date:

Previous
From: Ramanarayana
Date:
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Next
From: Andrey Borodin
Date:
Subject: Re: amcheck verification for GiST