Re: CPU costs of random_zipfian in pgbench - Mailing list pgsql-hackers
From | Fabien COELHO
---|---
Subject | Re: CPU costs of random_zipfian in pgbench
Date |
Msg-id | alpine.DEB.2.21.1903241513570.9939@lancre
In response to | Re: CPU costs of random_zipfian in pgbench (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: CPU costs of random_zipfian in pgbench
List | pgsql-hackers
Hello Tomas,

>>> What would a user do with this information, and how would they know
>>> what to do?
>>
>> Sure, but it was unclear what to do. Extending the cache to avoid that
>> would look like over-engineering.
>
> That seems like a rather strange argument. What exactly is so complex on
> resizing the cache to qualify as over-engineering?

Because the cache size can diverge on a bad script, and the search is
currently linear. Even with a log search it does not look attractive.

> If the choice is between reporting the failure to the user, and
> addressing the failure, surely the latter would be the default option?
> Particularly if the user can't really address the issue easily
> (recompiling psql is not a very practical solution).
>
>>> I remain of the opinion that we ought to simply rip out support for
>>> zipfian with s < 1.
>
> +1 to that

If this is done, some people with zipfian distributions that currently
work might be unhappy.

>> Some people really want zipfian because it reflects their data access
>> pattern, possibly with s < 1.
>>
>> We cannot help it: real life seems zipfianly distributed:-)
>
> Sure. But that hardly means we ought to provide algorithms that we know
> are not suitable for benchmarking tools, which I'd argue is this case.

Hmmm. It really depends on the values actually used.

> Also, we have two algorithms for generating zipfian distributions. Why
> wouldn't it be sufficient to keep just one of them?

Because the two methods work for different values of the parameter, so it
depends on the distribution which is targeted. If you want s < 1, the
exclusion method does not help (precisely, the real "infinite" zipfian
distribution is not mathematically defined when s < 1 because the
underlying sum diverges; having s < 1 can only be done on a finite set).

>>> It's not useful for benchmarking purposes to have a random-number
>>> function with such poor computational properties.
>>
>> This is mostly a startup cost, the generation cost when a bench is
>> running is reasonable. How to best implement the precomputation is an
>> open question.
>
> ... which means it's not a startup cost. IMHO this simply shows pgbench
> does not have the necessary infrastructure to provide this feature.

Well, a quick one has been proposed with a cache. Now I can imagine
alternatives, but I'm wondering how much it is worth it. Eg, restrict
random_zipfian to more or less constant parameters when s < 1, so that a
partial evaluation could collect the needed pairs and perform the
pre-computations before the bench is started. Now, that looks like
over-engineering, and then someone would complain of the restrictions.

>> As a reviewer I was not thrilled by the cache stuff, but I had no better
>> idea that would not fall under "over-over engineering" or the like.
>>
>> Maybe it could error out and say "recompile me", but then someone
>> would have said "that is unacceptable".
>>
>> Maybe it could auto extend the cache, but that is still more unnecessary
>> over-engineering, IMHO.
>
> I'm puzzled. Why would that be over-engineering?

As stated above, cache size divergence and O(n) search. Even a log2(n)
search would be bad, IMO.

>>> I think leaving it in there is just a foot-gun: we'd be a lot better
>>> off throwing an error that tells people to use some other distribution.
>>
>> When s < 1, the startup cost is indeed a pain. However, it is a pain
>> prescribed by a Turing Award.
>
> Then we shouldn't have it, probably. Or we should at least implement a
> proper startup phase, so that the costly precomputation is not included
> in the test interval.

That can be done, but I'm not sure of a very elegant design. And I was
just the reviewer on this one.

>>> Or if we do leave it in there, we for sure have to have documentation
>>> that *actually* explains how to use it, which this patch still doesn't.
>>
>> I'm not sure what explaining there could be about how to use it: one
>> calls the function to obtain pseudo-random integers with the desired
>> distribution?
>
> Well, I'd argue the current description "performance is poor" is not
> particularly clear.
>
>>> There's nothing suggesting that you'd better not use a large number of
>>> different (n,s) combinations.
>>
>> Indeed, there is no caveat about this point, as noted above.
>>
>> Please find an updated patch for the documentation, pointing out the
>> existence of the cache and advice not to overdo it.
>>
>> It does not solve the underlying problem which raised your complaint,
>> but at least it is documented.
>
> I think you forgot to attach the patch ...

Indeed, this is becoming a habit:-(

Attached. Hopefully.

--
Fabien.