Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: pgbench gaussian/exponential docs improvements |
Date | |
Msg-id | alpine.DEB.2.10.1510252141040.24734@sto |
In response to | Re: pgbench gaussian/exponential docs improvements (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: pgbench gaussian/exponential docs improvements
|
List | pgsql-hackers |
> [...]
>
> So either the information is important and then should be placed in the
> docs directly, or it's not and then linking to wikipedia is pointless
> because the users are not interested in learning all the details about
> each distribution function.

What is important is that these distributions can be used from pgbench. What a gaussian or an exponential distribution actually is, is *not* important as such. For me it is not the point of the pg documentation to explain probability theory, but just to provide *precise* information about what is actually available, for someone who would be interested, without having to read the source code. At least that is the idea behind the current documentation.

>>> Firstly, it'd be nice if we could add some figures illustrating the
>>> distributions - much better than explaining the shapes in text. I
>>> don't know if we include figures in the existing docs (probably not),
>>> but generating the figures is rather simple.
>>
>> There is basically no figures in the documentation. Too bad, but it is
>> understandable: what should be the format (svg, jpg, png, ...), should
>> it be generated (gnuplot, others), what is the impact on the
>> documentation build (html, epub, pdf, ...), how portable should it be,
>> what about compressed formats vs git diffs?
>>
>> Once you start asking these questions you understand why there are no
>> figures:-)
>
> I don't see why diffs would be a problem.

I was not only thinking of mathematical figures, I was also thinking of graphics: some formats may be a zip containing XML stuff, for instance.

>>> Probably nitpicking, but left/right of what? I assume the normal
>>> distribution is placed at 0, so it's left/right of zero.
>>
>> No, it is around the middle of the interval.
>
> You mean [min,max] interval?

Yep.

> I believe the transformation
>
>   2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)
>
> essentially moves the mean into 0, scales the data to [0,1] and then applies
> the threshold.
Probably:-) I wrote that some time ago, and it is 10 pm for me:-).

> In other words, the general shape of the curve will be exactly the same no
> matter the actual min/max (except that for longer intervals the values will
> be lower, as there are more possible values).
>
> I don't really see how it's related to this?
>
>   [(max-min)/2 - thresholds, (max-min)/2 + threshold]

The gaussian distribution is about reals, but it is used for integers, so there is a projection on integers from the real values. The function should compute the probability of drawing a given integer "i" in the interval: that is, given min, max and threshold, what is the probability of drawing i.

>>> Could we simplify the equation a bit? It's needlessly difficult to
>>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>>> good to first define the CDF and then just use that.
>>
>> ISTM that PHI is *the* normal CDF, which is more or less available as
>> such in various environment (matlab, python, excel...). Well, why not
>> defined the particular CDF and use it. Not sure the text would be that
>> much lighter, though.
>
> PHI is the CDF of the normal distribution, not the modified probability
> distribution here (with threshold and scaled to the desired interval).

Yep, that is exactly what I was saying, I think.

>>> This seems broken - too many sentences about the 67% and 95%.
>>
>> The point is to provide rules of thumb to describe how the distribution
>> is shaped. Any better sentence is welcome.
>
> Ah, I misread the sentence initially. I haven't realized it speaks about
> 1/threshold in the first part, and the second part is an example for
> threshold=4.0. So I thought it's a repetition of the first part.

Maybe it needs spacing and colons and rewording, if it is too hard to parse.

>>> Does it make sense to explicitly mention the implementation detail
>>> (Box-Muller transform) here?
>
> No, my point was exactly the opposite - removing the mention of Box-Muller
> entirely, not adding more details about it.

Ok. I think that the fact that it relies on the Box-Muller transform is relevant, because there are other methods to generate a gaussian distribution, and I would say that there is no reason to have to go to the source code to check that. But I would not provide further details. So I'm fine with the current status.

-- 
Fabien.
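
For readers who want to see the two points discussed above in concrete form (the probability of drawing a given integer i written as CDF(i+0.5) - CDF(i-0.5), and a Box-Muller based draw), here is a minimal Python sketch. It is an illustration, not the pgbench source: the function names are hypothetical, and it assumes the curve is centered on the middle of [min,max] and cut at +/- threshold, as described in the thread.

```python
import math
import random


def phi(x):
    """CDF of the standard normal distribution (the PHI of the discussion)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_int_prob(i, lo, hi, threshold):
    """Probability of drawing integer i in [lo, hi], as CDF(i+0.5) - CDF(i-0.5)."""
    center = (lo + hi) / 2.0  # assumption: the curve is centered mid-interval
    width = hi - lo + 1

    def cdf(x):
        # Scaled so that lo - 0.5 maps to -threshold and hi + 0.5 to +threshold.
        return phi(2.0 * threshold * (x - center) / width)

    # Renormalize for the tails cut off beyond +/- threshold.
    return (cdf(i + 0.5) - cdf(i - 0.5)) / (2.0 * phi(threshold) - 1.0)


def gaussian_int_sample(rng, lo, hi, threshold):
    """Draw one integer in [lo, hi] using a Box-Muller transform plus rejection."""
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        if -threshold <= z < threshold:
            break  # keep only values inside the threshold
    frac = (z + threshold) / (2.0 * threshold)  # position within the interval
    # min() guards against a floating-point rounding of frac up to 1.0.
    return min(hi, lo + int((hi - lo + 1) * frac))
```

With these definitions, summing `gaussian_int_prob(i, 1, 100, 4.0)` over i from 1 to 100 gives 1 up to rounding, because the CDF differences telescope to PHI(threshold) - PHI(-threshold). The sampler maps a real value to an integer with the same interval boundaries that the probability function uses.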