Re: CPU costs of random_zipfian in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: CPU costs of random_zipfian in pgbench
Date
Msg-id alpine.DEB.2.21.1903231802440.18811@lancre
Whole thread Raw
In response to Re: CPU costs of random_zipfian in pgbench  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: CPU costs of random_zipfian in pgbench
Re: CPU costs of random_zipfian in pgbench
List pgsql-hackers
Hello Tom,

> I started to look through this, and the more I looked the more unhappy
> I got that we're having this discussion at all.  The zipfian support
> in pgbench is seriously over-engineered and under-documented.  As an
> example, I was flabbergasted to find out that the end-of-run summary
> statistics now include this:
>
>    /* Report zipfian cache overflow */
>    for (i = 0; i < nthreads; i++)
>    {
>        totalCacheOverflows += threads[i].zipf_cache.overflowCount;
>    }
>    if (totalCacheOverflows > 0)
>    {
>        printf("zipfian cache array overflowed %d time(s)\n", totalCacheOverflows);
>    }
>
> What is the point of that, and if there is a point, why is it nowhere
> mentioned in pgbench.sgml?

Indeed, there should.

> What would a user do with this information, and how would they know what 
> to do?

Sure, but it was unclear what to do. Extending the cache to avoid that 
would look like over-engineering.

> I remain of the opinion that we ought to simply rip out support for
> zipfian with s < 1.

Some people really want zipfian because it reflects their data access 
pattern, possibly with s < 1.

We cannot helpt it: real life seems zipfianly distributed:-)

> It's not useful for benchmarking purposes to have a random-number 
> function with such poor computational properties.

This is mostly a startup cost, the generation cost when a bench is running 
is reasonable. How to best implement the precomputation is an open 
question.

As a reviewer I was not thrilled by the cache stuff, but I had no better 
idea that would not fall under "over-over engineering" or the like.

Maybe it could error out and say "recompile me", but then someone
would have said "that is unacceptable".

Maybe it could auto extend the cache, but that is still more 
unnecessary over-engineering, IMHO.

Maybe a there could be some mandatory declarations or partial eval that 
could precompute the needed parameters out/before the bench is started, 
with a clear message "precomputing stuff...", but that would be over over 
over engineering again... and that would mean restricting random_zipfian 
parameters to near-constants, which would require some explaining, but 
maybe it is an option. I guess that in the paper original context, the 
parameters (s & n) are known before the bench is started, so that the 
needed value are computed offline once and for all.

> I think leaving it in there is just a foot-gun: we'd be a lot better off 
> throwing an error that tells people to use some other distribution.

When s < 1, the startup cost is indeed a pain. However, it is a pain 
prescribed by a Turing Award.

> Or if we do leave it in there, we for sure have to have documentation
> that *actually* explains how to use it, which this patch still doesn't.

I'm not sure what explaining there could be about how to use it: one calls 
the function to obtain pseudo-random integers with the desired 
distribution?

> There's nothing suggesting that you'd better not use a large number of
> different (n,s) combinations.

Indeed, there is no caveat about this point, as noted above.

Please find an updated patch for the documentation, pointing out the 
existence of the cache and an advice not to overdo it.

It does not solve the underlying problem which raised your complaint, but 
at least it is documented.

-- 
Fabien.


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: CPU costs of random_zipfian in pgbench
Next
From: Fabien COELHO
Date:
Subject: Re: CPU costs of random_zipfian in pgbench