Re: CPU costs of random_zipfian in pgbench

From: Fabien COELHO
Subject: Re: CPU costs of random_zipfian in pgbench
Msg-id: alpine.DEB.2.21.1902170907110.8818@lancre
In response to: CPU costs of random_zipfian in pgbench (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
Hello Tomas,

> I'm trying to use random_zipfian() for benchmarking of skewed data sets,
> and I ran head-first into an issue with rather excessive CPU costs.
> [...] This happens because generalizedHarmonicNumber() does this:
>
>     for (i = n; i > 1; i--)
>         ans += pow(i, -s);
>
> where n happens to be 1000000000 (range passed to random_zipfian), so
> the loop takes quite a bit of time.

If you find a better formula for the generalized harmonic number, you
are welcome, and will probably get your name on it :-)

> So much that we only ever complete a single transaction, because this
> work happens in the context of the first transaction, and so it counts
> towards the 30-second limit.

That is why the value is cached, so that it is computed only once per
size & parameter combination (a minimal sketch of the idea is appended
at the end of this mail).

If you want skewed but not especially zipfian data, use exponential,
which is quite cheap (also sketched below). Also, zipfian with a
parameter above 1.0 does not have to compute the harmonic number, so
the cost depends on the parameter (sketch below as well).

Please also beware that non-uniform keys are correlated (the most
frequent values are close to one another), which is not really
representative of what you would expect in real life. This is why I
submitted a pseudo-random permutation function (illustrated below),
which alas has not gained much momentum with committers.

> The docs actually do mention performance of this function:
>
>     The function's performance is poor for parameter values close and
>     above 1.0 and on a small range.
>
> But that does not seem to cover the example I just posted, because 0.1
> is not particularly close to 1.0 (or above 1.0), and 1e9 values hardly
> counts as "small range".

Yep. The warning is about the repeated cost of calling random_zipfian,
which is high when the parameter is close to, or above, 1.0. It is not
about the one-time setup cost incurred when the parameter is between 0
and 1. This could indeed deserve a warning.

Now, if you benchmark a database, you probably want to run it for hours
to level out checkpointing and other background tasks, so the setup
cost should be negligible in the end...

> I see this part of random_zipfian comes from "Quickly Generating
> Billion-Record Synthetic Databases" which deals with generating data
> sets, where wasting a bit of time is not a huge deal. But when running
> benchmarks it matters much more. So maybe there's a disconnect here?

Hmmm. The first author of this paper got a Turing award :-) The
disconnect is just that the setup cost is neglected, or computed
offline; once the harmonic number is known, each draw is cheap (see
the last sketch below).

> Interestingly enough, I don't see this issue on values above 1.0, no
> matter how close to 1.0 those are. Although the throughput seems lower
> than with uniform access, so this part of random_zipfian is not quite
> cheap either.

Indeed. Pg provides an underlying uniform pseudo-random function, so
generating uniform values is cheap. Other distributions require more or
less expensive transformations.

> Now, I don't know what to do about this. Ideally, we'd have a faster
> algorithm to generate zipfian distributions

You are welcome to find one, and get famous (hmmm... among some
specialized mathematicians, at least :-) for it.

> - I don't know if such a thing exists though. Or maybe we could
> precompute the distribution first, not counting it into the benchmark
> duration.

Could be done, but this would require significant partial evaluation
efforts, and would only work when the size and parameter are constants
(although using the function with variable parameters would be a very
bad idea anyway).

> But I guess the very least we can/should do is improving the docs to
> make it more obvious which cases are expected to be slow.

Yep.
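For illustration, here is a minimal C sketch of the setup cost and of
the caching, in the spirit of what pgbench does; the names and the
one-entry cache are mine, not pgbench's actual internals:

    #include <math.h>

    /* H(n, s) = sum_{i=1..n} 1/i^s: O(n) pow() calls, hence the heavy
     * setup for a 10^9 range; a one-entry cache makes later calls with
     * the same (n, s) free. */
    static double
    generalized_harmonic(long n, double s)
    {
        static long   cached_n = -1;
        static double cached_s, cached_h;
        double        ans = 0.0;
        long          i;

        if (n == cached_n && s == cached_s)
            return cached_h;
        for (i = n; i > 1; i--)
            ans += pow(i, -s);
        cached_n = n;
        cached_s = s;
        cached_h = ans + 1.0;    /* add the i = 1 term */
        return cached_h;
    }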
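The exponential alternative is cheap because it is a pure inverse
transform: one log() per draw and no setup that depends on the range.
A sketch, assuming a uniform() helper returning doubles in (0, 1); this
is not pgbench's exact code:

    #include <math.h>

    extern double uniform(void);   /* assumed: uniform draw in (0, 1) */

    /* Truncated-exponential skew over [min, max]: O(1) per draw,
     * assuming param > 0. */
    static long
    exponential_rand(long min, long max, double param)
    {
        double e = -log(1.0 - uniform() * (1.0 - exp(-param))) / param;

        return min + (long) ((max - min + 1) * e);   /* e is in [0, 1) */
    }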
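For a parameter above 1.0, the classical rejection method needs no
harmonic number at all, which is why that case avoids the setup cost.
A sketch of the standard algorithm (Devroye's), with the same assumed
uniform() helper:

    #include <math.h>

    extern double uniform(void);   /* assumed: uniform draw in (0, 1) */

    /* Zipfian draw for s > 1 by rejection: constant expected number of
     * iterations, independent of the range size n. */
    static long
    zipf_rejection(long n, double s)
    {
        double b = pow(2.0, s - 1.0);
        double x, t;

        do
        {
            x = floor(pow(uniform(), -1.0 / (s - 1.0)));
            t = pow(1.0 + 1.0 / x, s - 1.0);
        } while (x > n || uniform() * x * (t - 1.0) / (b - 1.0) > t / b);

        return (long) x;
    }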
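On the correlation point: a pseudo-random permutation decorrelates the
frequent values from their positions in the key space. A hypothetical,
deliberately crude illustration of the idea using a multiplicative
bijection; a real version would need a much less guessable permutation:

    /* When gcd(mult, n) == 1, k -> (k * mult) % n is a bijection on
     * [0, n), so the frequent small zipfian values are scattered over
     * the whole key space. Assumes k * mult fits in 64 bits; use
     * 128-bit arithmetic otherwise. */
    static long
    scramble(long k, long n, long mult)
    {
        return (long) (((unsigned long long) k * (unsigned long long) mult)
                       % (unsigned long long) n);
    }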
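Finally, on the paper's approach: once the harmonic number
zetan = H(n, s) has been computed (offline, or cached as above), each
draw for 0 < s < 1 is O(1). A sketch following the formulas in
"Quickly Generating Billion-Record Synthetic Databases":

    #include <math.h>

    extern double uniform(void);   /* assumed: uniform draw in (0, 1) */

    /* O(1) zipfian draw for 0 < s < 1, given the precomputed zetan. */
    static long
    zipf_harmonic(long n, double s, double zetan)
    {
        double alpha = 1.0 / (1.0 - s);
        double eta = (1.0 - pow(2.0 / n, 1.0 - s))
                     / (1.0 - (1.0 + pow(0.5, s)) / zetan);
        double u = uniform();
        double uz = u * zetan;

        if (uz < 1.0)
            return 1;
        if (uz < 1.0 + pow(0.5, s))
            return 2;
        return 1 + (long) (n * pow(eta * u - eta + 1.0, alpha));
    }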
Attached is a doc & comment improvement.

-- 
Fabien.