Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pgbench gaussian/exponential docs improvements
Msg-id 562D8EDD.3080801@2ndquadrant.com
In response to Re: pgbench gaussian/exponential docs improvements  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: pgbench gaussian/exponential docs improvements
List pgsql-hackers

On 10/25/2015 10:01 PM, Fabien COELHO wrote:
>
>> [...]
>>
>> So either the information is important and then should be placed in
>> the docs directly, or it's not and then linking to wikipedia is
>> pointless because the users are not interested in learning all the
>> details about each distribution function.
>
> What is important is that these distributions can be used from pgbench.
> What is a gaussian or an exponential distribution is *not* important as
> such.
>
> For me it is not the point of pg documentation to explain probability
> theory, but just to provide *precise* information about what is actually
> available, for someone who would be interested, without having to read
> the source code. At least that is the idea behind the current
> documentation.

OK, fair enough. OTOH many of our users don't have immediate knowledge 
of statistical distributions, so if we could give them additional info 
in a reasonable way, that'd be good.

>
>>>> Firstly, it'd be nice if we could add some figures illustrating the
>>>> distributions - much better than explaining the shapes in text. I
>>>> don't know if we include figures in the existing docs (probably not),
>>>> but generating the figures is rather simple.
>>>
>>> There is basically no figures in the documentation. Too bad, but it is
>>> understandable: what should be the format (svg, jpg, png, ...), should
>>> it be generated (gnuplot, others), what is the impact on the
>>> documentation build (html, epub, pdf, ...), how portable should it be,
>>> what about compressed formats vs git diffs?
>>>
>>> Once you start asking these questions you understand why there are no
>>> figures:-)
>>
>> I don't see why diffs would be a problem.
>
> I was not only thinking of mathematical figures, I was also thinking of
> graphics, some format may be zip containing XML stuff for instance.

But we don't need it here, so why should we care about it too much?

>
>> In other words, the general shape of the curve will be exactly the
>> same no matter the actual min/max (except that for longer intervals
>> the values will be lower, as there are more possible values).
>>
>> I don't really see how it's related to this?
>>
>>    [(max-min)/2 - thresholds, (max-min)/2 + threshold]
>
> The gaussian distribution is about reals, but it is used for integers,
> so there is a projection on integers from the real values. The function
> should compute the probability of drawing a given integer "i" in the
> interval, that is given min, max and threshold, what is the probability
> of drawing i.

I do understand that. I'm trying to explain that "threshold" is in fact 
completely disconnected from min and max, as the transformation scales 
the data to [-1,1] like this
    2.0 * (i - min - mu + 0.5) / (max - min + 1)

and only then the 'threshold' coefficient is applied. And if I read the 
Box-Muller transformation correctly, it generates data with standard 
Normal distribution from [-threshold, threshold] and then transforms 
them to the right mean etc.

But maybe that's what the first sentence is trying to say? I mean this:
    For a Gaussian distribution, the interval is mapped onto a standard
    normal distribution (the classical bell-shaped Gaussian curve)
    truncated at -threshold on the left and +threshold on the right.

I'm asking about this because it wasn't immediately clear to me whether
I need to tweak this for data sets with different scales, but apparently
not. After reading the docs again I think that's also clear from the last
sentence, which relates the threshold to 67% and 95%.
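
To make the scale-independence concrete, here is a minimal sketch of the
scheme as described above: draw a standard normal value via Box-Muller,
reject anything outside [-threshold, threshold), then stretch the result
onto the integer interval. This is an illustration only, not the pgbench
source; the names gaussian_int and middle_share are hypothetical.

```python
import math
import random


def gaussian_int(rng, lo, hi, threshold):
    """Draw an integer in [lo, hi] from a truncated, rescaled normal."""
    while True:
        # Box-Muller: two uniform draws give one standard normal deviate.
        u1 = 1.0 - rng.random()  # in (0, 1], keeps log() finite
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Truncation: retry until the deviate is inside the threshold.
        if -threshold <= z < threshold:
            break
    # Map [-threshold, threshold) onto [0, 1), then onto lo..hi.
    frac = (z + threshold) / (2.0 * threshold)
    return min(hi, lo + int((hi - lo + 1) * frac))


def middle_share(values, lo, hi, width):
    """Fraction of values falling in the central `width` part of [lo, hi]."""
    mid = (lo + hi) / 2.0
    half = (hi - lo + 1) * width / 2.0
    return sum(1 for v in values if abs(v - mid) <= half) / len(values)


rng = random.Random(42)
small = [gaussian_int(rng, 1, 100, 4.0) for _ in range(20000)]
large = [gaussian_int(rng, 1, 1000000, 4.0) for _ in range(20000)]

# Same threshold, very different scales: the shares should be similar.
print(middle_share(small, 1, 100, 0.25))
print(middle_share(large, 1, 1000000, 0.25))
```

The threshold only enters before the values are stretched onto lo..hi,
so the same threshold should give roughly the same share of draws in the
middle quarter for both intervals.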

Anyway, the references to "standard normal distribution" are a bit
sloppy - "standard" usually means the normal distribution with exactly
mu=0 and sigma=1. So it's a bit strange to say
    standard normal distribution, with mean mu defined as (max+min)/2.0

because that's not a standard normal distribution at all. I propose to 
fix this by removing the "standard".

[1] as wikipedia notes, Gauss himself used different sigma

>
>>>> Could we simplify the equation a bit? It's needlessly difficult to
>>>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>>>> good to first define the CDF and then just use that.
>>>
>>> ISTM that PHI is *the* normal CDF, which is more or less available as
>>> such in various environment (matlab, python, excel...). Well, why not
>>> defined the particular CDF and use it. Not sure the text would be that
>>> much lighter, though.
>>
>> PHI is the CDF of the normal distribution, not the modified
>> probability distribution here (with threshold and scaled to the
>> desired interval).
>
> Yep, that is exactly what I was saying, I think.

I think we're talking about slightly different things. Essentially the
transformation turns the Normal distribution (with PHI as CDF) into
another statistical distribution (with the thresholds and such) that has
a different CDF, let's say CDF2, which is
  CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)

and then the probability of "i" is
  P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)

Which is what I meant by simplifying the equation. Not that it'd make
it easier to imagine the shape, though ...
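
A rough sketch of the two-step formulation, to show that it is easy to
evaluate. The argument of PHI is elided above; the code assumes it is the
offset from the midpoint mu = (max+min)/2.0 scaled by (max - min + 1).
That form, and the names phi, cdf2 and prob, are assumptions made for
illustration.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """CDF2 as written above; correct up to a constant that cancels below."""
    mu = (lo + hi) / 2.0
    # Assumption: the elided factor is (x - mu) / (hi - lo + 1).
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X = i) = CDF2(i + 0.5) - CDF2(i - 0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


probs = [prob(i, 1, 10, 2.5) for i in range(1, 11)]
for i, p in enumerate(probs, start=1):
    print(i, round(p, 4))
print("total:", sum(probs))
```

Under that assumption the probabilities over min..max sum to 1, because
the endpoints min-0.5 and max+0.5 map exactly onto -threshold and
+threshold.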

>>>> This seems broken - too many sentences about the 67% and 95%.
>>>
>>> The point is to provide rules of thumb to describe how the distribution
>>> is shaped. Any better sentence is welcome.
>>
>> Ah, I misread the sentence initially. I haven't realized it speaks
>> about 1/threshold in the first part, and the second part is an example
>> for threshold=4.0. So I thought it's a repetition of the first part.
>
> Maybe it needs spacing and colons and rewording, if it too hard to parse.

Maybe. Another thing is that "middle quarter" and "middle half" seem a
bit strange - if you split data into 1/4s there's no middle one (sure, I
understand what the sentence is meant to say).
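
For what it's worth, the rule of thumb for threshold=4.0 can be checked
numerically. The sketch below treats the interval as continuous (ignoring
the rounding to integers), so the central fraction w of the interval
corresponds to [-w*threshold, +w*threshold] on the truncated normal. The
helper name middle_mass is hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_mass(width, threshold):
    """Probability mass in the central `width` fraction of the interval.

    Continuous approximation: the whole interval corresponds to
    [-threshold, threshold], so the central part corresponds to
    [-width * threshold, width * threshold].
    """
    inside = 2.0 * phi(width * threshold) - 1.0
    total = 2.0 * phi(threshold) - 1.0
    return inside / total


threshold = 4.0
quarter = middle_mass(1.0 / threshold, threshold)  # central 1/4 of the range
half = middle_mass(2.0 / threshold, threshold)     # central 1/2 of the range
print(round(quarter, 3), round(half, 3))
```

So "middle quarter" here means the central 25% of the range around the
mean, which holds roughly two thirds of the draws, and "middle half" the
central 50%, which holds roughly 95% of them.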

>
>>>> Does it make sense to explicitly mention the implementation detail
>>>> (Box-Muller transform) here?
>>
>> No, my point was exactly the opposite - removing the mention of
>> Box-Muller entirely, not adding more details about it.
>
> Ok. I think that the fact that it relies on the Box-Muller transform is
> relevant, because there are other methods to generate a gaussian
> distribution, and I would say that there is no reason to have to go to
> the source code to check that. But I would not provide further details.
> So I'm fine with the current status.

There are alternative methods for almost every non-trivial piece of 
code, and we generally don't mention that in user docs. Why should we 
mention it in this case? Why would the user care which particular PRNG 
was used to generate the numbers? Maybe there really is a reason for 
that, I don't know.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


