Re: gaussian distribution pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: gaussian distribution pgbench
Date
Msg-id alpine.DEB.2.10.1402091328110.1388@sto
Whole thread Raw
In response to Re: gaussian distribution pgbench  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses Re: gaussian distribution pgbench
List pgsql-hackers
Hello,

> I revise my gaussian pgbench patch which wss requested from community.

With a lot of delay for which I apologise, please find hereafter the 
review.

Gaussian Pgbench v3 patch by Mitsumasa KONDO review

* The purpose of the patch is to allow a pgbench script to draw from normally  distributed integer values instead of
uniformlydistributed.
 
  This is a valuable contribution to enable pgbench to generate more realistic  loads, which is seldom uniform in
practice.
  However, ISTM that other distributions such an exponantial one would make  more sense, and also the values should be
furtherrandomized so that  neighboring values are not more likely to be drawn. The latest point is non  trivial.
 

* Compilation
  The patch applies and compiles against current head. It works as expected,  although there is few feedback from the
scriptto show that.
 

* Mathematical soundness
  We want to derive a discrete normal distribution from a uniform one.  Well, normal distributions are for continuous
variables...Anyway, this is  done by computing a continuous normal distribution which is then projected  onto integers.
I'mbasically fine with that.
 
  The system uses a Box-Muller transform (1958) to do this transformation.  The Ziggurat method seems to be prefered
forthis purpose, *but* it would  require precalculated tables which depends on the target values. So I'm  fine with the
Box-Mullertransform for pgbench.
 
  The BM method uses 2 uniformly distributed numbers to derive 2 normally  distributed numbers. The implementation
computesone of these, and loops  over till one match a threshold criterion.
 
  More explanations, at least in comments, are needed about this threshold  and its meaning. It is required to be more
than2. I guess is that it allows  to limit the number of iterations of the while loop, but in what proportion  is
unclear.The documentation does not also help the user to understand  this value and its meaning.
 
  What I think it is: it is the deviation for the FURTHEST point around the  mean, that is the actual deviation
associatedto the "min" and "max" target  values. The 2 minimum value induces that there is a least 4 stddev lengths
betweenmin & max, with the most likely mean in the middle.
 
  If the threshold test fails, one of the 2 uniform number is redrawn, a new  candidate value is tested. I'm not at
easeabout why only 1 value is redrawn  and not both, some explanations would be welcome. Also, on the other hand,  why
nottest the other possible value (with cos) if the first one fails?
 
  Also, as suggested above, I would like some explanations about how much this  while loop may iterate without success,
saywith the expected average number  of iterations with its explanation in a comment.
 

* Implementation
  Random values :  double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h  rand2 should be in (0,
1],but it is in [0, 1), use "1.0 - ..." as well?!
 
  What is called "stdev*" in getrand() is really the chosen deviation from  the target mean, so it would make more
senseto name it "dev".
 
  I do not think that the getrand refactoring was such a good idea. I'm sorry  if I may have suggested that in a
previouscomment.  The new getrand possibly ignores its parameters, hmmmm. ISTM that it would  be much simpler in the
codeto have a separate and clean "getrand_normal"  or "getrand_gauss" called for "\setgaussian", and that's it. This
would allow to get rid of DistType and all of getrand changes in the code.
 
  There are heavy constants computations (sqrt(log()) within the while  loop which would be moved out of the loop.
  ISTM that the while condition would be easier to read as:
     while ( dev < - threshold || threshold < dev )
  Maybe the \\setgaussian argument handling may be transformed into a function,  so that it could be used easily later
forsome other distribution (say some  setexp:-)
 

* Options
  ISTM that the test options would be better if made orthogonal, i.e. not to  have three --gaussian* options. I would
suggestto have only one  --gaussian=NUM which would trigger gaussian tests with this threshold,  and --gaussian=3.5
--select-onlywould use the select-only variant,  and so on.
 

* Typos
  gausian -> gaussian  patern -> pattern

* Conclusion :
 - this is a valuable patch to help create more realistic load and make pgbench   a more useful tool. I'm greatly in
favorof having such a functionality.
 
 - it seems to me that the patch should be further improved before being   committed, in particular I would suggest:
   (1) improve the explanations in the code and in the documentation, especially   about what is the "deviation
threshold"and its precise link to generated   values.
 
   (2) simplify the code with a separate gaussian getrand, and simpler or   more efficient code here and there, see
commentsabove.
 
   (3) use only one option to trigger gaussian tests.
   (bonus) \setexp would be a nice:-)

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Terminating pg_basebackup background streamer
Next
From: Andrew Dunstan
Date:
Subject: Re: narwhal and PGDLLIMPORT