Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: pgbench gaussian/exponential docs improvements
Date
Msg-id alpine.DEB.2.10.1510252141040.24734@sto
In response to Re: pgbench gaussian/exponential docs improvements  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: pgbench gaussian/exponential docs improvements  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
> [...]
>
> So either the information is important and then should be placed in the 
> docs directly, or it's not and then linking to wikipedia is pointless 
> because the users are not interested in learning all the details about 
> each distribution function.

What is important is that these distributions can be used from pgbench.
What a gaussian or an exponential distribution actually is, is *not*
important as such.

For me it is not the point of pg documentation to explain probability 
theory, but just to provide *precise* information about what is actually 
available, for someone who would be interested, without having to read the 
source code. At least that is the idea behind the current documentation.

>>> Firstly, it'd be nice if we could add some figures illustrating the
>>> distributions - much better than explaining the shapes in text. I
>>> don't know if we include figures in the existing docs (probably not),
>>> but generating the figures is rather simple.
>> 
>> There is basically no figures in the documentation. Too bad, but it is
>> understandable: what should be the format (svg, jpg, png, ...), should
>> it be generated (gnuplot, others), what is the impact on the
>> documentation build (html, epub, pdf, ...), how portable should it be,
>> what about compressed formats vs git diffs?
>> 
>> Once you start asking these questions you understand why there are no
>> figures:-)
>
> I don't see why diffs would be a problem.

I was not only thinking of mathematical figures but also of graphics in
general: some formats may be a zip archive containing XML, for instance.

>>> Probably nitpicking, but left/right of what? I assume the normal
>>> distribution is placed at 0, so it's left/right of zero.
>> 
>> No, it is around the middle of the interval.
>
> You mean [min,max] interval?

Yep.

> I believe the transformation
>
>    2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)
>
> essentially moves the mean into 0, scales the data to [0,1] and then applies 
> the threshold.

Probably:-) I wrote that some time ago, and it is 10 pm for me:-).
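
For what it is worth, here is a small sketch to check the quoted expression
numerically. It assumes that "mu" in that expression stands for
(max - min) / 2, which is not stated above, and the function name is made up
for the illustration. Under that assumption the expression is the scaled
position of the upper edge of integer i, and it runs from -threshold just
below min to +threshold at max:

```python
def scaled_upper_edge(i, lo, hi, threshold):
    """Quoted expression, evaluated for integer i in [lo, hi].

    Assumption: mu is the half-width of the interval, (hi - lo) / 2.
    """
    mu = (hi - lo) / 2.0
    return 2.0 * threshold * (i - lo - mu + 0.5) / (hi - lo + 1)


# Upper edge of the last integer maps to +threshold.
print(scaled_upper_edge(100, 1, 100, 4.0))  # 4.0
# Upper edge of lo - 1 (the lower edge of lo) maps to -threshold.
print(scaled_upper_edge(0, 1, 100, 4.0))    # -4.0
```

The result does not depend on where the interval sits, only on the relative
position of i inside it, which matches the remark about the shape being the
same whatever min and max are.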

> In other words, the general shape of the curve will be exactly the same no 
> matter the actual min/max (except that for longer intervals the values will 
> be lower, as there are more possible values).
>
> I don't really see how it's related to this?
>
>    [(max-min)/2 - thresholds, (max-min)/2 + threshold]

The gaussian distribution is defined over the reals, but here it is used to
draw integers, so the real values are projected onto integers. The function
should give the probability of drawing a given integer "i" in the interval:
that is, given min, max and threshold, what is the probability of drawing i?
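
A minimal sketch of that probability, to make the idea concrete. It is not
taken from the pgbench source: it assumes the distribution is the standard
normal truncated to [-threshold, threshold] and stretched over
[min - 0.5, max + 0.5], and the names phi and gaussian_prob are chosen for
the illustration only.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_prob(i, lo, hi, threshold):
    """Probability of drawing integer i from [lo, hi] (assumed model)."""
    mu = (lo + hi) / 2.0      # middle of the interval
    width = hi - lo + 1       # number of integers in the interval

    def f(x):
        # Scaled CDF, normalised by the mass kept inside +/- threshold.
        return phi(2.0 * threshold * (x - mu) / width) / (
            2.0 * phi(threshold) - 1.0
        )

    # Mass of the real interval [i - 0.5, i + 0.5) that projects onto i.
    return f(i + 0.5) - f(i - 0.5)


probs = [gaussian_prob(i, 1, 100, 4.0) for i in range(1, 101)]
print(sum(probs))            # very close to 1
print(probs[49], probs[0])   # the middle is far more likely than the edge
```

The probabilities sum to one because the scaled argument reaches exactly
-threshold at min - 0.5 and +threshold at max + 0.5.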

>>> Could we simplify the equation a bit? It's needlessly difficult to
>>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>>> good to first define the CDF and then just use that.
>> 
>> ISTM that PHI is *the* normal CDF, which is more or less available as
>> such in various environment (matlab, python, excel...). Well, why not
>> defined the particular CDF and use it. Not sure the text would be that
>> much lighter, though.
>
> PHI is the CDF of the normal distribution, not the modified probability 
> distribution here (with threshold and scaled to the desired interval).

Yep, that is exactly what I was saying, I think.

>>> This seems broken - too many sentences about the 67% and 95%.
>> 
>> The point is to provide rules of thumb to describe how the distribution
>> is shaped. Any better sentence is welcome.
>
> Ah, I misread the sentence initially. I haven't realized it speaks about 
> 1/threshold in the first part, and the second part is an example for 
> threshold=4.0. So I thought it's a repetition of the first part.

Maybe it needs spacing, colons and rewording, if it is too hard to parse.
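
The rule of thumb can be checked with a few lines. This is a sketch under
the same assumed model as the documented formula (standard normal truncated
at +/- threshold), treated as continuous, so integer rounding is ignored.
The function name middle_mass is invented for the example.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_mass(k, threshold):
    """Share of draws falling in the middle k / threshold of the interval.

    The middle k / threshold of the interval corresponds to +/- k on the
    scaled axis, and the total kept mass is 2 * PHI(threshold) - 1.
    """
    return (2.0 * phi(k) - 1.0) / (2.0 * phi(threshold) - 1.0)


# threshold = 4.0: middle quarter (1.0 / 4.0) and middle half (2.0 / 4.0).
print(middle_mass(1.0, 4.0))  # roughly two thirds of the draws
print(middle_mass(2.0, 4.0))  # roughly 95% of the draws
```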

>>> Does it make sense to explicitly mention the implementation detail
>>> (Box-Muller transform) here?
>
> No, my point was exactly the opposite - removing the mention of Box-Muller 
> entirely, not adding more details about it.

Ok. I think that the fact that it relies on the Box-Muller transform is 
relevant, because there are other methods to generate a gaussian 
distribution, and I would say that there is no reason to have to go to the 
source code to check that. But I would not provide further details. So I'm 
fine with the current status.
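
For readers who do not know it, here is a rough sketch of what drawing a
bounded gaussian integer with the Box-Muller transform can look like. It is
an illustration written for this message, not a copy of the pgbench code:
the function name, the rejection loop and the final clamp are assumptions.

```python
import math
import random


def gaussian_int(lo, hi, threshold, rng=random):
    """Draw an integer in [lo, hi], gaussian around the middle (sketch)."""
    while True:
        u1 = 1.0 - rng.random()   # in (0, 1], so log(u1) is defined
        u2 = rng.random()         # in [0, 1)
        # Box-Muller: two uniform draws give one standard normal value.
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Reject values outside +/- threshold and draw again.
        if -threshold <= z < threshold:
            break
    frac = (z + threshold) / (2.0 * threshold)   # rescaled to [0, 1)
    # Project onto the integers; the clamp guards against float rounding.
    return min(hi, lo + int((hi - lo + 1) * frac))


random.seed(0)
sample = [gaussian_int(1, 100, 4.0) for _ in range(2000)]
print(min(sample), max(sample), sum(sample) / len(sample))
```

Other generators (polar method, ziggurat, inverse CDF) would give the same
distribution with different costs, which is why naming the method in the
documentation seems useful to me.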

-- 
Fabien.


