Re: CPU costs of random_zipfian in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: CPU costs of random_zipfian in pgbench
Date
Msg-id alpine.DEB.2.21.1903242006310.9939@lancre
Whole thread Raw
In response to Re: CPU costs of random_zipfian in pgbench  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: CPU costs of random_zipfian in pgbench
List pgsql-hackers
Hello Tom,

>>> If this is done, some people with zipfian distribution that currently
>>> work might be unhappy.
>
>> After giving it some thought, I think that this cannot be fully fixed for
>> 12.
>
> Just to clarify --- my complaint about "over engineering" referred to
> the fact that a cache exists at all; fooling around with the overflow
> behavior isn't really going to answer that.

Having some cache really makes sense for s < 1, AFAICS.

> The bigger picture here is that a benchmarking tool that contains its
> own performance surprises is not a nice thing to have.

Hmmm. It seems that it depends. Sometimes self-inflicted wounds are the 
soldier's responsibility, sometimes they must be prevented.

> It's not very hard to imagine somebody wasting a lot of time trying to 
> explain weird results, only to find out that the cause is unstable 
> performance of random_zipfian.  Or worse, people might draw totally 
> incorrect conclusions if they fail to drill down enough to realize that 
> there's a problem in pgbench rather than on the server side.

Yep, benchmarking is tougher than it looks: it is very easy to get it 
wrong without knowing, whatever tool you use.

>> Given the constraint of Jim Gray's approximated method for s in (0, 1),
>> which really does zipfian for the first two integers and then uses an
>> exponential approximation, the only approach is that the parameters must
>> be computed in a partial eval preparation phase before the bench code is
>> run. This means that only (mostly) constants would be allowed as
>> parameters when s is in (0, 1), but I think that this is acceptable
>> because anyway the method fundamentally requires it.
>
> Yeah, if we could store all the required harmonic numbers before the
> test run timing starts, that would address the concern about stable
> performance. But I have to wonder whether zipfian with s < 1 is useful
> enough to justify so much code.

I do not know about that. I do not think that Jim Gray chose to invent an 
approximated method for s < 1 because he thought it was fun. I think he 
did it because his benchmark data required it. If you need it, you need 
it. If you do not need it, you do not care.
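
For reference, Gray's approximated method for s in (0, 1) can be sketched as
follows (a Python illustration of the published algorithm, not pgbench's exact
C code; it assumes zetan = H(n, s) has been precomputed, which is exactly why
the parameters must be known before the timed run starts):

```python
import random

def zipf_gray(n, s, zetan, u=None):
    """Approximated zipfian draw for 0 < s < 1 (Gray et al.'s method).

    zetan must be the precomputed generalized harmonic number H(n, s).
    Ranks 1 and 2 get their exact zipfian probabilities; the remaining
    ranks come from an exponential approximation.
    """
    if u is None:
        u = random.random()          # uniform in [0, 1)
    alpha = 1.0 / (1.0 - s)
    # eta rescales the approximation so it lines up with the exact head
    eta = (1.0 - (2.0 / n) ** (1.0 - s)) / (1.0 - (1.0 + 0.5 ** s) / zetan)
    uz = u * zetan
    if uz < 1.0:                     # exact probability for rank 1
        return 1
    if uz < 1.0 + 0.5 ** s:         # exact probability for rank 2
        return 2
    # exponential approximation for ranks 3..n
    return 1 + int(n * (eta * u - eta + 1.0) ** alpha)
```

Everything here except u is a pure function of (n, s), so once zetan, alpha
and eta are cached, each draw is constant-time.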

>> The other attached patch illustrates what I call poor performance
>> for stupid parameters (no point in doing zipfian on 2 integers…):
>> ...
>> Maybe the implementation could impose that s is at least 1.001 to avoid
>> the lower performance?
>
> I was wondering about that too.  It seems like it'd be a wise idea to
> further constrain s and/or n to ensure that the s > 1 code path doesn't do
> anything too awful ...

Yep. The attached version enforces s >= 1.001, which avoids the worst cost
of iterating, according to my small tests.
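
To illustrate why a floor on s is needed at all (a Python sketch of the
rejection method after Devroye, for illustration rather than pgbench's exact
C code): the sampler for s > 1 loops until a candidate is accepted, and its
acceptance rate collapses as s approaches 1, so iteration counts blow up.

```python
import math
import random

def zipf_rejection(n, s, rng=random.random):
    """Bounded zipfian draw for s > 1 via rejection sampling (after
    Devroye).  Returns (value, iterations) so the cost is visible.

    The expected number of iterations grows without bound as s -> 1,
    which is what a floor like s >= 1.001 guards against.
    """
    assert s > 1.0
    b = 2.0 ** (s - 1.0)
    iterations = 0
    while True:
        iterations += 1
        u = 1.0 - rng()              # uniform in (0, 1], avoids u == 0
        v = rng()
        x = math.floor(u ** (-1.0 / (s - 1.0)))
        t = (1.0 + 1.0 / x) ** (s - 1.0)
        # accept only candidates within bounds and under the envelope
        if x <= n and v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return int(x), iterations
```

Unlike Gray's method, this needs no precomputation and its per-draw cost is
stable for s comfortably above 1, which is why only the small-s side needs
the cache.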

> unless someone comes up with a better implementation that has stable 
> performance without such constraints.

Hmmm…

-- 
Fabien.
Attachment
