Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: pgbench gaussian/exponential docs improvements
Date
Msg-id alpine.DEB.2.10.1510260711240.24734@sto
In response to Re: pgbench gaussian/exponential docs improvements  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: pgbench gaussian/exponential docs improvements  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers
>> I was not only thinking of mathematical figures, I was also thinking of
>> graphics, some format may be zip containing XML stuff for instance.
>
> But we don't need it here, so why should we care about it too much?

I was just digressing from the main subject :-) Having some graphics in
the doc would help here and there, though.

> I do understand that. I'm trying to explain that "threshold" is in fact 
> completely disconnected from min and max, as the transformation scales the 
> data to [-1,1] like this
>
>    2.0 * (i - min - mu + 0.5) / (max - min + 1)
>
> and only then the 'threshold' coefficient is applied. And if I read the 
> Box-Muller transformation correctly, it generates data with standard Normal 
> distribution from [-threshold, threshold] and then transforms them to the 
> right mean etc.

Yep, the threshold parameter is designed to be somewhat independent of the
actual [min, max] range.

> But maybe that's what the first sentence is trying to say? I mean this:
>
>    For a Gaussian distribution, the interval is mapped onto a standard
>    normal distribution (the classical bell-shaped Gaussian curve)
>    truncated at -threshold on the left and +threshold on the right.

Yep, that looks like it.

> I'm asking about this because it wasn't immediately clear to me whether I
> need to tweak this for data sets with different scales, but apparently not.

Indeed, this is the idea behind how the parameter is used.
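
To illustrate the scale independence discussed above, here is a minimal
Python sketch of a truncated Gaussian draw built on the Box-Muller
transform. It is an illustration under assumptions, not pgbench's C source:
the names gaussian_rand and central_half_share, the sample size, the seed
and the final clamp are all hypothetical choices made for this example.

```python
import math
import random


def gaussian_rand(rng, lo, hi, threshold):
    """Draw an integer in [lo, hi] from a Gaussian truncated at +/- threshold."""
    while True:
        # Box-Muller: two uniform deviates give one standard normal deviate.
        u1 = 1.0 - rng.random()  # in (0, 1], so log() stays finite
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Reject deviates outside [-threshold, threshold).
        if -threshold <= z < threshold:
            break
    # Rescale to [0, 1); lo and hi only enter in the final mapping.
    unit = (z + threshold) / (2.0 * threshold)
    # The clamp is a guard against floating-point rounding (an assumption).
    return min(hi, lo + int((hi - lo + 1) * unit))


def central_half_share(lo, hi, threshold, n=20000, seed=1):
    """Fraction of n draws landing in the central half of [lo, hi].

    Assumes the interval width is divisible by 4, to keep the bounds exact.
    """
    rng = random.Random(seed)
    width = hi - lo + 1
    first = lo + width // 4            # first value of the central half
    last = lo + (3 * width) // 4 - 1   # last value of the central half
    hits = sum(
        first <= gaussian_rand(rng, lo, hi, threshold) <= last
        for _ in range(n)
    )
    return hits / n


if __name__ == "__main__":
    # Same threshold, very different scales.
    print(central_half_share(1, 100, 4.0))
    print(central_half_share(1, 1000000, 4.0))
```

With threshold 4.0 the central half of the interval corresponds to deviates
within two standard deviations of the mean, so both calls should report
roughly 95%, whatever the width of the range.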

> After reading the docs again I think that's also clear from last sentence 
> that relates threshold and 67% and 95%.

Yep.

> Anyway, the references to "standard normal distribution" are a bit sloppy - 
> "standard" usually means normal distribution with exactly mu=0 and sigma=1. 
> So it's a bit strange to say
>
>    standard normal distribution, with mean mu defined as (max+min)/2.0
>
> because that's not a standard normal distribution at all. I propose to fix 
> this by removing the "standard".

Hmmm, probably fine if it is both more precise and shorter!

> [...]
>  CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)
>
> and then the probability of "i" is
>
>  P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)

I agree that defining the shifted/scaled CDF and using it afterwards looks 
cleaner.
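
As a rough check of that formulation, the shifted/scaled CDF can be written
down and summed over the interval. This is only a sketch: the argument of
PHI, elided as "..." in the quoted formula, is assumed here to be
(x - mu) / (max - min + 1) with mu = (max + min) / 2.0, and the names phi,
cdf2 and prob are hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted and scaled CDF, normalized for truncation at +/- threshold.

    The argument passed to phi() is an assumption (see the text above).
    """
    mu = (hi + lo) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


if __name__ == "__main__":
    lo, hi, threshold = 1, 10, 2.5
    for i in range(lo, hi + 1):
        print(i, round(prob(i, lo, hi, threshold), 4))
    # The differences telescope, so the total should be 1 up to rounding.
    print(sum(prob(i, lo, hi, threshold) for i in range(lo, hi + 1)))
```

Under that assumption the sum telescopes to
(PHI(threshold) - PHI(-threshold)) / (2.0 * PHI(threshold) - 1.0), which is
exactly 1, and the probabilities are symmetric around mu.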

> Which is what I meant by simplifying the equation. Not that it'd make it
> easier to imagine the shape, though ...

Sure. This is the part that provides the "precise" information: the actual
probability of drawing i, depending on the parameters.

> Maybe. Another thing is that "middle quarter" and "middle half" seem a bit
> strange - if you split data into 1/4s there's no middle one (sure, I
> understand what the sentence is meant to say).

Improvements are welcome!

>> Ok. I think that the fact that it relies on the Box-Muller transform is
>> relevant, because there are other methods to generate a gaussian
>> distribution, and I would say that there is no reason to have to go to
>> the source code to check that. But I would not provide further details.
>> So I'm fine with the current status.
>
> There are alternative methods for almost every non-trivial piece of code, and 
> we generally don't mention that in user docs. Why should we mention it in 
> this case? Why would the user care which particular PRNG was used to generate 
> the numbers? Maybe there really is a reason for that, I don't know.

If it were security-related, you would care: when an algorithm has just been
announced to be broken, you want to know whether you depend on it.

As a scientist, I like it when fellow scientists who achieved useful
things have their names cited :-).

-- 
Fabien.


