Re: pgbench gaussian/exponential docs improvements - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: pgbench gaussian/exponential docs improvements |
Date | |
Msg-id | alpine.DEB.2.10.1510260711240.24734@sto |
In response to | Re: pgbench gaussian/exponential docs improvements (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses | Re: pgbench gaussian/exponential docs improvements |
List | pgsql-hackers |
>> I was not only thinking of mathematical figures, I was also thinking of
>> graphics, some format may be zip containing XML stuff for instance.
>
> But we don't need it here, so why should we care about it too much?

I was just digressing about the main subject:-) Having some graphics in the
doc would help here and there, though.

> I do understand that. I'm trying to explain that "threshold" is in fact
> completely disconnected from min and max, as the transformation scales the
> data to [-1,1] like this
>
>    2.0 * (i - min - mu + 0.5) / (max - min + 1)
>
> and only then the 'threshold' coefficient is applied. And if I read the
> Box-Muller transformation correctly, it generates data with standard Normal
> distribution from [-threshold, threshold] and then transforms them to the
> right mean etc.

Yep, the threshold parameter is designed to be somehow independent of the
actual [min max] range.

> But maybe that's what the first sentence is trying to say? I mean this:
>
>    For a Gaussian distribution, the interval is mapped onto a standard
>    normal distribution (the classical bell-shaped Gaussian curve)
>    truncated at -threshold on the left and +threshold on the right.

Yep, that looks like it.

> I'm asking about this because it wasn't to me immediately clear whether I
> need to tweak this for data sets with different scales, but apparently not.

Indeed, this is the idea of how the parameter is used.

> After reading the docs again I think that's also clear from last sentence
> that relates threshold and 67% and 95%.

Yep.

> Anyway, the references to "standard normal distribution" are a bit sloppy -
> "standard" usually means normal distribution with exactly mu=0 and sigma=1.
> So it's a bit strange to say
>
>    standard normal distribution, with mean mu defined as (max+min)/2.0
>
> because that's not a standard normal distribution at all. I propose to fix
> this by removing the "standard".

Hmmm, probably fine if it is both more precise and shorter!

> [...]
>    CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)
>
> and then the probability of "i" is
>
>    P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)

I agree that defining the shifted/scaled CDF and using it afterwards looks
cleaner.

> Which is what I meant by simplifying the equation. Not that it'd make easier
> to imagine the shape, though ...

Sure. This is the part about providing the "precise" information, what is the
actual probability of drawing i depending on the parameters.

> Maybe. Another thing is that "middle quarter" and "middle half" seems a bit
> strange - if you split data into 1/4s there's no middle one (sure, I
> understand what the sentence is meant to say).

Improvements are welcome!

>> Ok. I think that the fact that it relies on the Box-Muller transform is
>> relevant, because there are other methods to generate a gaussian
>> distribution, and I would say that there is no reason to have to go to
>> the source code to check that. But I would not provide further details.
>> So I'm fine with the current status.
>
> There are alternative methods for almost every non-trivial piece of code, and
> we generally don't mention that in user docs. Why should we mention it in
> this case? Why would the user care which particular PRNG was used to generate
> the numbers? Maybe there really is a reason for that, I don't know.

If that was about security, for instance because one method has just been
announced to be broken and you want to know whether you depend on it.

As a scientist, I like it when fellow scientists who achieved useful things
have their name cited:-).

--
Fabien.
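
As an illustration of the CDF2 formulation quoted above, here is a minimal Python sketch that evaluates P(X=i) over a whole range. It is not pgbench code. The argument elided as "..." in the message is assumed here to be (x - mu) / (max - min + 1) with mu = (max + min) / 2.0, and the names phi, cdf2, prob and middle_half_mass are made up for the example.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted/scaled PHI, normalized by the mass inside [-threshold, threshold].

    ASSUMPTION: the argument elided as "..." in the message is taken to be
    (x - mu) / (hi - lo + 1) with mu = (lo + hi) / 2.0.
    Only differences of this function are meaningful as probabilities.
    """
    mu = (lo + hi) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X = i) = CDF2(i + 0.5) - CDF2(i - 0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


def middle_half_mass(lo, hi, threshold):
    """Total probability of the centered half of [lo, hi].

    Assumes (hi - lo + 1) is divisible by 4 so that the split is exact.
    """
    n = hi - lo + 1
    first = lo + n // 4
    last = hi - n // 4
    return sum(prob(i, lo, hi, threshold) for i in range(first, last + 1))


if __name__ == "__main__":
    # Same threshold, two ranges of very different size.
    for lo, hi in [(1, 100), (1, 10000)]:
        total = sum(prob(i, lo, hi, 4.0) for i in range(lo, hi + 1))
        print(lo, hi, round(total, 6), round(middle_half_mass(lo, hi, 4.0), 4))
```

Under that assumption the probabilities over [min, max] sum to 1, and running the sketch for two ranges of different size at the same threshold is one way to check the point made in the thread, namely that the threshold should not need tweaking for the scale of the data.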