From: Fabien COELHO
Subject: Re: CPU costs of random_zipfian in pgbench
Msg-id: alpine.DEB.2.21.1903241513570.9939@lancre
In response to: Re: CPU costs of random_zipfian in pgbench (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers

Hello Tomas,

>>> What would a user do with this information, and how would they know
>>> what to do?
>>
>> Sure, but it was unclear what to do. Extending the cache to avoid that 
>> would look like over-engineering.
>
> That seems like a rather strange argument. What exactly is so complex
> about resizing the cache to qualify as over-engineering?

Because the cache size can diverge on a bad script, and the search is 
currently linear. Even with a log search it does not look attractive.
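
To make the concern concrete, here is a rough sketch of the kind of
per-(n, s) cache and linear lookup under discussion; the type, constant
and function names are illustrative, not the actual pgbench identifiers:

    #include <stdint.h>
    #include <stddef.h>

    #define PARAM_CACHE_SIZE 32          /* illustrative fixed bound */

    typedef struct
    {
        int64_t n;          /* size of the interval */
        double  s;          /* zipfian parameter, here s < 1 */
        double  harmonicn;  /* precomputed generalized harmonic number */
    } ParamCacheCell;

    static ParamCacheCell cache[PARAM_CACHE_SIZE];
    static int cache_used = 0;

    /*
     * O(cache_used) scan; every new (n, s) pair adds a cell, so a script
     * that keeps producing fresh pairs eventually exceeds the fixed bound.
     */
    static ParamCacheCell *
    find_or_create_cell(int64_t n, double s)
    {
        for (int i = 0; i < cache_used; i++)
            if (cache[i].n == n && cache[i].s == s)
                return &cache[i];

        if (cache_used >= PARAM_CACHE_SIZE)
            return NULL;    /* full: must evict an entry or report overflow */

        cache[cache_used].n = n;
        cache[cache_used].s = s;
        cache[cache_used].harmonicn = 0.0;  /* set by the O(n) precomputation */
        return &cache[cache_used++];
    }

Auto-extending that would mean either reallocating and still scanning
linearly, or adding a proper hash/LRU structure, which is the kind of
over-engineering I was referring to.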

> If the choice is between reporting the failure to the user, and
> addressing the failure, surely the latter would be the default option?
> Particularly if the user can't really address the issue easily
> (recompiling psql is not very practical solution).
>
>>> I remain of the opinion that we ought to simply rip out support for
>>> zipfian with s < 1.
>
> +1 to that

If this is done, some people whose zipfian distributions currently work
might be unhappy.

>> Some people really want zipfian because it reflects their data access
>> pattern, possibly with s < 1.
>>
>> We cannot help it: real life seems zipfianly distributed:-)
>
> Sure. But that hardly means we ought to provide algorithms that we know
> are not suitable for benchmarking tools, which I'd argue is this case.

Hmmm. It really depends on the values actually used.

> Also, we have two algorithms for generating zipfian distributions. Why
> wouldn't it be sufficient to keep just one of them?

Because the two methods work for different values of the parameter: which
one is needed depends on the targeted distribution. If you want s < 1,
the exclusion method does not help: strictly speaking, the "infinite"
zipfian distribution is not mathematically defined when s < 1 because the
underlying sum diverges, so s < 1 can only be used on a finite set.
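
For reference, the sum in question is the generalized harmonic number
(standard textbook material, nothing specific to the patch):

    H(n, s) = \sum_{k=1}^{n} \frac{1}{k^s}

It has a finite limit as n goes to infinity only when s > 1; for s <= 1
it grows without bound, hence the restriction to a finite n and the need
to compute the normalization explicitly.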

>>> It's not useful for benchmarking purposes to have a random-number
>>> function with such poor computational properties.
>>
>> This is mostly a startup cost, the generation cost when a bench is
>> running is reasonable. How to best implement the precomputation is an
>> open question.
>
> ... which means it's not a startup cost. IMHO this simply shows pgbench
> does not have the necessary infrastructure to provide this feature.

Well, a quick one has been proposed with a cache. Now I can imagine
alternatives, but I'm wondering whether they would be worth it.

E.g., restrict random_zipfian to more or less constant parameters when
s < 1, so that a partial evaluation could collect the needed (n, s) pairs
and perform the pre-computations before the bench is started.

Now, that looks like over-engineering, and then someone would complain
about the restrictions.

>> As a reviewer I was not thrilled by the cache stuff, but I had no better
>> idea that would not fall under "over-over engineering" or the like.
>>
>> Maybe it could error out and say "recompile me", but then someone
>> would have said "that is unacceptable".
>>
>> Maybe it could auto extend the cache, but that is still more unnecessary
>> over-engineering, IMHO.
>
> I'm puzzled. Why would that be over-engineering?

As stated above: the cache size can diverge and the search is O(n). Even
a log2(n) search would be bad, IMO.

>>> I think leaving it in there is just a foot-gun: we'd be a lot better
>>> off throwing an error that tells people to use some other distribution.
>>
>> When s < 1, the startup cost is indeed a pain. However, it is a pain
>> prescribed by a Turing Award.
>
> Then we shouldn't have it, probably. Or we should at least implement a
> proper startup phase, so that the costly precomputation is not included
> in the test interval.

That can be done, but I'm not sure of a very elegant design. And I was
just the reviewer on this one.
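
For scale, the precomputation in question is essentially an O(n) pass
over the interval. A minimal sketch, assuming a plain summation (not the
exact pgbench code):

    #include <stdint.h>
    #include <math.h>

    /*
     * Illustrative O(n) startup cost for s < 1: one pow() per value of
     * the interval, i.e. hundreds of millions of calls for n ~ 10^8.
     */
    static double
    generalized_harmonic(int64_t n, double s)
    {
        double sum = 0.0;

        /* summing the small terms first limits floating-point error */
        for (int64_t k = n; k >= 1; k--)
            sum += pow((double) k, -s);

        return sum;
    }

Running that once per (n, s) pair in a dedicated startup phase, before
the timed part of the run, is roughly what such a design would have to
provide.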

>>> Or if we do leave it in there, we for sure have to have documentation
>>> that *actually* explains how to use it, which this patch still doesn't.
>>
>> I'm not sure what explaining there could be about how to use it: one
>> calls the function to obtain pseudo-random integers with the desired
>> distribution?
>
> Well, I'd argue the current description "performance is poor" is not
> particularly clear.
>
>>> There's nothing suggesting that you'd better not use a large number of
>>> different (n,s) combinations.
>>
>> Indeed, there is no caveat about this point, as noted above.
>>
>> Please find an updated patch for the documentation, pointing out the
>> existence of the cache and an advice not to overdo it.
>>
>> It does not solve the underlying problem which raised your complaint,
>> but at least it is documented.
>>
>
> I think you forgot to attach the patch ...

Indeed, this is becoming a habit:-( Attached. Hopefully.

-- 
Fabien.
