Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pgbench gaussian/exponential docs improvements
Date
Msg-id 562D3CA0.8080100@2ndquadrant.com
In response to Re: pgbench gaussian/exponential docs improvements  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: pgbench gaussian/exponential docs improvements  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers

On 10/25/2015 08:11 PM, Fabien COELHO wrote:
>
> Hello Tomas,
>
>> I've been looking at the checkpoint patches (sorting, flush and FPW
>> compensation) and realized we got gaussian/exponential distributions
>> in pgbench which is useful for simulating simple non-uniform workloads.
>
> Indeed.
>
>> But I think the current docs is a bit too difficult to understand for
>> people without deep insight into statistics and shapes of probability
>> distributions.
>
> I think the idea is that (1) if you do not know anything about
> distributions, probably you do not want expo/gauss, and (2) pg
> documentation should not try to be an introductory course in
> probability theory.
>
> AFAICR I suggested to point to relevant wikipedia pages but this has
> been more or less rejected, so it ended up as it is, which is indeed
> pretty unconvincing.

I don't think links to Wikipedia are all that useful in this context.

We have no control over Wikipedia pages, so we can't point users to
particular sections of a page (we don't know whether it gets rewritten
tomorrow).

So either the information is important and should be placed in the docs
directly, or it's not, and then linking to Wikipedia is pointless
because users are not interested in learning all the details about each
distribution function.

>> Firstly, it'd be nice if we could add some figures illustrating the
>> distributions - much better than explaining the shapes in text. I
>> don't know if we include figures in the existing docs (probably not),
>> but generating the figures is rather simple.
>
> There are basically no figures in the documentation. Too bad, but it is
> understandable: what should be the format (svg, jpg, png, ...), should
> it be generated (gnuplot, others), what is the impact on the
> documentation build (html, epub, pdf, ...), how portable should it be,
> what about compressed formats vs git diffs?
>
> Once you start asking these questions you understand why there are no
> figures:-)

I don't see why diffs would be a problem. Include the gnuplot source
files, then build the appropriate image format for each output target
(eps for pdf, png for web, ...).

But yes, it definitely requires some work on the Makefiles.

>>> For a Gaussian distribution, the interval is mapped onto a standard
>>> normal distribution (the classical bell-shaped Gaussian curve)
>>> truncated at -threshold on the left and +threshold on the right.
>>
>> Probably nitpicking, but left/right of what? I assume the normal
>> distribution is placed at 0, so it's left/right of zero.
>
> No, it is around the middle of the interval.

You mean the [min,max] interval? I believe the transformation
    2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)

essentially moves the mean to 0, scales the data to [0,1] and then
applies the threshold.

In other words, the general shape of the curve will be exactly the same
no matter the actual min/max (except that for longer intervals the
per-value probabilities will be lower, as there are more possible
values).

I don't really see how it's related to this?
    [(max-min)/2 - threshold, (max-min)/2 + threshold]
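
The shape claim above can be checked numerically. The following is a
minimal sketch, not pgbench code: phi and gaussian_probs are
hypothetical helper names, and the expression is written in the centred
form (i - mu) with mu = (min + max) / 2, which coincides with the
expression quoted from the docs when min = 0.

```python
import math


def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_probs(lo, hi, threshold):
    """Probability of drawing each integer i in [lo, hi] (inclusive).

    Assumption: centred form, i.e. the argument of PHI uses (x - mu)
    with mu = (lo + hi) / 2.
    """
    mu = (lo + hi) / 2.0
    n = hi - lo + 1
    norm = 2.0 * phi(threshold) - 1.0  # mass kept after truncation

    def f(x):
        return phi(2.0 * threshold * (x - mu) / n)

    return [(f(i + 0.5) - f(i - 0.5)) / norm for i in range(lo, hi + 1)]


# Same interval length, different position: identical probabilities.
probs_a = gaussian_probs(0, 99, 4.0)
probs_b = gaussian_probs(1000, 1099, 4.0)

# Ten times longer interval: same shape, lower per-value probabilities.
probs_c = gaussian_probs(0, 999, 4.0)

print(sum(probs_a), max(probs_a), max(probs_c))
```

Under these assumptions, shifting the interval leaves the per-value
probabilities unchanged, and a ten times longer interval lowers the
peak roughly tenfold while the probabilities still sum to 1.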



>>> To be precise, if PHI(x) is the cumulative distribution function of
>>> the standard normal distribution, with mean mu defined as (max + min)
>>> / 2.0, then value i between min and max inclusive is drawn with
>>> probability: (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max -
>>> min + 1)) - PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min +
>>> 1))) / (2.0 * PHI(threshold) - 1.0). Intuitively, the larger the
>>> threshold, the more frequently values close to the middle of the
>>> interval are drawn, and the less frequently values close to the min
>>> and max bounds.
>>
>> Could we simplify the equation a bit? It's needlessly difficult to
>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>> good to first define the CDF and then just use that.
>
> ISTM that PHI is *the* normal CDF, which is more or less available as
> such in various environments (matlab, python, excel...). Well, why not
> define the particular CDF and use it. Not sure the text would be that
> much lighter, though.

PHI is the CDF of the standard normal distribution, not of the modified
distribution used here (truncated at the threshold and scaled to the
desired interval).
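
One possible way to write this, defining the transformed CDF first. The
name f is an assumption (it does not appear in the current docs), and
the expression uses the centred form (x - mu), which matches the quoted
formula when min = 0:

```latex
f(x) = \frac{\Phi\!\left(\dfrac{2\,\mathit{threshold}\,(x - \mu)}
                              {\mathit{max} - \mathit{min} + 1}\right)}
            {2\,\Phi(\mathit{threshold}) - 1},
\qquad \mu = \frac{\mathit{max} + \mathit{min}}{2}

P(i) = f(i + 0.5) - f(i - 0.5),
\qquad \mathit{min} \le i \le \mathit{max}
```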

>
>>> About 67% of values are drawn from the middle 1.0 / threshold and 95%
>>> in the middle 2.0 / threshold; for instance, if threshold is 4.0, 67%
>>> of values are drawn from the middle quarter and 95% from the middle
>>> half of the interval.
>>
>> This seems broken - too many sentences about the 67% and 95%.
>
> The point is to provide rules of thumb to describe how the distribution
> is shaped. Any better sentence is welcome.

Ah, I misread the sentence initially. I hadn't realized that the first
part talks about 1/threshold and the second part is an example for
threshold=4.0, so I thought it was a repetition of the first part.
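
The rule of thumb can be sanity-checked with a continuous
approximation. This is only a sketch: middle_fraction_mass is a
hypothetical helper, and it ignores the discretisation into integer
values. The central fraction q of the interval corresponds to
[-threshold * q, +threshold * q] on the standard normal scale.

```python
import math


def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_fraction_mass(threshold, fraction):
    """Probability mass in the central `fraction` of the interval.

    Continuous approximation of the truncated Gaussian: the whole
    interval spans [-threshold, +threshold], so the central part
    spans [-threshold * fraction, +threshold * fraction].
    """
    z = threshold * fraction
    return (2.0 * phi(z) - 1.0) / (2.0 * phi(threshold) - 1.0)


for t in (2.0, 4.0, 8.0):
    inner = middle_fraction_mass(t, 1.0 / t)  # "middle 1.0 / threshold"
    outer = middle_fraction_mass(t, 2.0 / t)  # "middle 2.0 / threshold"
    print(f"threshold={t}: {inner:.4f} {outer:.4f}")
```

For threshold=4.0 this gives roughly 68% and 95%, in line with the
quoted sentence. For the minimum threshold of 2.0 the "middle 2.0 /
threshold" is the whole interval, so the rule only holds approximately
for larger thresholds.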

>
>>> The minimum threshold is 2.0 for performance of the Box-Muller
>>> transform.
>>
>> Does it make sense to explicitly mention the implementation detail
>> (Box-Muller transform) here?
>
> It is too complex, I would avoid it. I would point to the wikipedia page
> if that could be allowed.
>
> https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

No, my point was exactly the opposite - removing the mention of 
Box-Muller entirely, not adding more details about it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


