Re: gaussian distribution pgbench - Mailing list pgsql-hackers

From KONDO Mitsumasa
Subject Re: gaussian distribution pgbench
Date
Msg-id 52FCBA54.3020203@lab.ntt.co.jp
Whole thread Raw
In response to Re: gaussian distribution pgbench  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: gaussian distribution pgbench  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List pgsql-hackers
Hi Febien,

Thank you very much for your very detail and useful comments!
I read your comment, I agree most of your advice:)

Attached patch is fixed for your comment. That are...
  - Remove redundant long-option.
    - We can use "--gaussian=NUM -S" or "--gaussian=NUMN -N" options.
  - Add sentence in document
  - Separate two random generate function which are uniform and gaussian.
    - getGaussianrand() is created.
  - Fix ranged random number more strictly, ex. (0,1) or [0,1).
    - Please see comment of source code in detail:).
  - Fix typo.
  - Use cos() and sin() function when we generate gaussian random number.
  - Add fast sqrt calculation algorithm.
  - Reuse sqrt result and pre generate random number for reducing calculation cost.
    - Experience of this method is under following. It will be little-bit faster
than non-reuse method. And distribution of gaussian is still good.

* Settings
  shared_buffers = 1024MB

* Test script
  pgbench -i -s 1
  pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
  pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
  pgbench --gaussian=2 -T 30 -S -c8 -j4 -n

* Result
   method         |  try1  |  try2  |  try3  |
--------------------------------------------|
reuse method     | 44189  | 44453  | 44013  |
non-reuse method | 43567  | 43635  | 43508  |



(2014/02/09 21:32), Fabien COELHO wrote:
>    This is a valuable contribution to enable pgbench to generate more realistic
>    loads, which is seldom uniform in practice.
Thanks!

>    However, ISTM that other distributions such an exponantial one would make
>    more sense,
I can easy to create exponential distribution. Here, I assume exponential
distribution that is f(x) = lambda * exp^(-lambda * x) in general.
What do you think under following interface?

custom script: \setexp [varname] min max threshold
command      : --exponential=NUM(threshold)

I don't want to use lambda variable for simple implementation. So lambda is
always 1. Because it can enough to control distribution by threshold. Threshold
parameter is f(x) value. And using created distribution projects to 'aid' by same
method. If you think OK, I will impliment under followings tomorrow, and also
create parseing part of this function...

do
{
    rand = 1.0 - pg_erand48(thread->random_state);
    rand = -log(rand);
}while( rand > exp_threshold)

return rand / exp_threshold;


>    and also the values should be further randomized so that
>    neighboring values are not more likely to be drawn. The latest point is non
>    trivial.
That's right, but I worry about gaussian randomness and benchmark reproducibility
might be disappeared when we re-randomized access pattern, because Postgres
storage method manages records by each pages and it is difficult to realize
access randomness in whole pages, not record. If we solve this problem, we have
to need algorithm for smart shuffule projection function that is still having
gaussian randomized. I think it will be difficult, and it have to impement in
another patch in the future.


> * Mathematical soundness
>
>    We want to derive a discrete normal distribution from a uniform one.
>    Well, normal distributions are for continuous variables... Anyway, this is
>    done by computing a continuous normal distribution which is then projected
>    onto integers. I'm basically fine with that.
>
>    The system uses a Box-Muller transform (1958) to do this transformation.
>    The Ziggurat method seems to be prefered for this purpose, *but* it would
>    require precalculated tables which depends on the target values. So I'm
>    fine with the Box-Muller transform for pgbench.
Yes, that's right. I selected simple and relatively faster algorithm, that is
Box-Muller transform.

>    The BM method uses 2 uniformly distributed numbers to derive 2 normally
>    distributed numbers. The implementation computes one of these, and loops
>    over till one match a threshold criterion.
>
>    More explanations, at least in comments, are needed about this threshold
>    and its meaning. It is required to be more than 2. I guess is that it allows
>    to limit the number of iterations of the while loop,
Yes. This loop could not almost go on, because min stdev_threshold is 2.
The possibility of retry-loop is under 4 percent. It might not be problem.

>    but in what proportion
>    is unclear. The documentation does not also help the user to understand
>    this value and its meaning.
Yes, it is huristic method. So I added the comments in document.


>    What I think it is: it is the deviation for the FURTHEST point around the
>    mean, that is the actual deviation associated to the "min" and "max" target
>    values. The 2 minimum value induces that there is a least 4 stddev lengths
>    between min & max, with the most likely mean in the middle.
Correct!

>    If the threshold test fails, one of the 2 uniform number is redrawn, a new
>    candidate value is tested. I'm not at ease about why only 1 value is redrawn
>    and not both, some explanations would be welcome. Also, on the other hand,
>    why not test the other possible value (with cos) if the first one fails?
Yes, I think so too. So I fixed this partan and it will be better. Past
implementations are not good:(

>    Also, as suggested above, I would like some explanations about how much this
>    while loop may iterate without success, say with the expected average number
>    of iterations with its explanation in a comment.
I add my comments in source code.

> * Implementation
>
>    Random values :
>    double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
>    rand2 should be in (0, 1], but it is in [0, 1), use "1.0 - ..." as well?!
It's more smart method. I change to this method.

>    What is called "stdev*" in getrand() is really the chosen deviation from
>    the target mean, so it would make more sense to name it "dev".
Hmm, I like stdev*. Short variable makes us more confuse:( And it's not big problem.

>    I do not think that the getrand refactoring was such a good idea. I'm sorry
>    if I may have suggested that in a previous comment.
>    The new getrand possibly ignores its parameters, hmmmm. ISTM that it would
>    be much simpler in the code to have a separate and clean "getrand_normal"
>    or "getrand_gauss" called for "\setgaussian", and that's it. This would
>    allow to get rid of DistType and all of getrand changes in the code.
I separate two function that are getrand() and getGaussianrand(), it becomes more
clear I think.

>    There are heavy constants computations (sqrt(log()) within the while
>    loop which would be moved out of the loop.
>
>    ISTM that the while condition would be easier to read as:
>
>       while ( dev < - threshold || threshold < dev )
OK, fixed.

>
>    Maybe the \\setgaussian argument handling may be transformed into a function,
>    so that it could be used easily later for some other distribution (say some
>    setexp:-)


> * Options
>
>    ISTM that the test options would be better if made orthogonal, i.e. not to
>    have three --gaussian* options. I would suggest to have only one
>    --gaussian=NUM which would trigger gaussian tests with this threshold,
>    and --gaussian=3.5 --select-only would use the select-only variant,
>    and so on.
Agreed. Fixed.

> * Typos
>
>    gausian -> gaussian
>    patern -> pattern
Oh, fixed.

> * Conclusion :
>
>   - this is a valuable patch to help create more realistic load and make pgbench
>     a more useful tool. I'm greatly in favor of having such a functionality.
>
>   - it seems to me that the patch should be further improved before being
>     committed, in particular I would suggest:
>
>     (1) improve the explanations in the code and in the documentation, especially
>     about what is the "deviation threshold" and its precise link to generated
>     values.
>
>     (2) simplify the code with a separate gaussian getrand, and simpler or
>     more efficient code here and there, see comments above.
>
>     (3) use only one option to trigger gaussian tests.
>
>     (bonus) \setexp would be a nice:-)
Thank you for your comments. They make my patch more polished:)
I think my patch is fixed for supporting all your comments, but it might not be fixed
as you think. And if you notice other part, please send me.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: [BUG] Archive recovery failure on 9.3+.
Next
From: Kohei KaiGai
Date:
Subject: Re: New hook after raw parsing, before analyze