Thread: Add error functions: erf() and erfc()
Now that we have random_normal(), it seems like it would be useful to add the error functions erf() and erfc(), which I think are potentially useful to the people who will find random_normal() useful, and possibly others. An immediate use for erf() is that it allows us to do a Kolmogorov-Smirnov test for random_normal(), similar to the one for random(). Both of these functions are defined in POSIX and C99, so in theory they should be available on all platforms. If that turns out not to be the case, then there's a commonly used implementation (e.g., see [1]), which we could include. I played around with that (replacing the direct bit manipulation stuff with frexp()/ldexp(), see pg_erf.c attached), and it appeared to be accurate to +/-1 ULP across the full range of inputs. Hopefully we won't need that though. I tested this on a couple of different platforms and found I needed to reduce extra_float_digits to -1 to get the regression tests to pass consistently, due to rounding errors. It wouldn't surprise me if that needs to be reduced further, though perhaps it's not necessary to have so many tests (I included one test value from each branch, while testing the hand-rolled implementation). Regards, Dean [1] https://github.com/lsds/musl/blob/master/src/math/erf.c
Attachment
On Mon, Feb 27, 2023 at 12:54:35PM +0000, Dean Rasheed wrote: > + /* > + * For erf, we don't need an errno check because it never overflows. > + */ > + /* > + * For erfc, we don't need an errno check because it never overflows. > + */ The man pages for these seem to indicate that underflow can occur. Do we need to check for that? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 1:54 AM Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > Now that we have random_normal(), it seems like it would be useful to > add the error functions erf() and erfc(), which I think are > potentially useful to the people who will find random_normal() useful, > and possibly others. > > An immediate use for erf() is that it allows us to do a > Kolmogorov-Smirnov test for random_normal(), similar to the one for > random(). > > Both of these functions are defined in POSIX and C99, so in theory > they should be available on all platforms. If that turns out not to be > the case, then there's a commonly used implementation (e.g., see [1]), > which we could include. I played around with that (replacing the > direct bit manipulation stuff with frexp()/ldexp(), see pg_erf.c > attached), and it appeared to be accurate to +/-1 ULP across the full > range of inputs. Hopefully we won't need that though. Hi, No comment on the maths, but I'm pretty sure we won't need a fallback implementation. That stuff goes back to the math libraries of 80s Unixes, even though it didn't make it into C until '99. I just checked the man pages for all our target systems and they all show it. (There might be some portability details around the tgmath.h versions on some systems, eg to support different float sizes, I dunno, but you're using the plain math.h versions.) I wonder if the SQL standard has anything to say about these, or the related normal CDF. I can't check currently but I doubt it, based on searches and other systems' manuals. Two related functions that also arrived in C99 are lgamma() and tgamma(). If you'll excuse the digression, that reminded me of something I was trying to figure out once, for a practical problem. My statistics knowledge is extremely patchy, but I have been trying to up my benchmarking game, and that led to a bit of remedial reading on Student's t tests and related stuff. A few shaven yaks later, I understood that you could probably (if you like pain) do that sort of stuff inside PostgreSQL using our existing aggregates, if you took the approach of ministat[1]. That tool has a table of critical values inside it, indexed by degrees-of-freedom (1-100) and confidence level (80, 90, 95, 98, 99, 99.5), and one could probably write SQL queries that spit out an answer like "p is less than 5%, ship it!", if we stole that table. But what if I actually want to know p? Of course you can do all that good stuff very easily with tools like R, SciPy etc and maybe that's the best way to do it. But Oracle, and I think several other analytics-focused SQL systems, can do it in a very easy built-in way. I think to get at that you probably need the t CDF, and in there[2] I see... Γ(). Huh. [1] https://man.freebsd.org/cgi/man.cgi?query=ministat [2] https://www.mathworks.com/help/stats/tcdf.html
On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Feb 27, 2023 at 12:54:35PM +0000, Dean Rasheed wrote: > > + /* > > + * For erf, we don't need an errno check because it never overflows. > > + */ > > > + /* > > + * For erfc, we don't need an errno check because it never overflows. > > + */ > > The man pages for these seem to indicate that underflow can occur. Do we > need to check for that? > No, I don't think so. The docs indicate that if an underflow occurs, the correct result (after rounding) should be returned, so I think we should just return that result (as we do for tanh(), for example). Regards, Dean
On Wed, Mar 08, 2023 at 11:29:12PM +0000, Dean Rasheed wrote: > On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote: >> The man pages for these seem to indicate that underflow can occur. Do we >> need to check for that? > > No, I don't think so. The docs indicate that if an underflow occurs, > the correct result (after rounding) should be returned, so I think we > should just return that result (as we do for tanh(), for example). Makes sense. I'm also wondering about whether we need the isinf() checks. IIUC that should never happen, but maybe you added that "just in case." -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, 8 Mar 2023 at 23:24, Thomas Munro <thomas.munro@gmail.com> wrote: > > No comment on the maths, but I'm pretty sure we won't need a fallback > implementation. That stuff goes back to the math libraries of 80s > Unixes, even though it didn't make it into C until '99. I just > checked the man pages for all our target systems and they all show it. > (There might be some portability details around the tgmath.h versions > on some systems, eg to support different float sizes, I dunno, but > you're using the plain math.h versions.) > Thanks for checking. Hopefully they will be available everywhere. I think what's more likely to happen is that the tests will reveal implementation variations, as they did when the hyperbolic functions were added, and it'll be necessary to adjust or remove some of the test cases. When I originally wrote those tests, I picked a value from each branch that the FreeBSD implementation handled differently, but I think that's overkill. If the purpose of the tests is just to confirm that the right C library function has been exposed, they could probably be pared all the way down to just testing erf(1) and erfc(1), but it might be useful to first see what platform variations exist. > Two related functions that also arrived in C99 are lgamma() and > tgamma(). If you'll excuse the digression, that reminded me of > something I was trying to figure out once, for a practical problem. > My statistics knowledge is extremely patchy, but I have been trying to > up my benchmarking game, and that led to a bit of remedial reading on > Student's t tests and related stuff. A few shaven yaks later, I > understood that you could probably (if you like pain) do that sort of > stuff inside PostgreSQL using our existing aggregates, if you took the > approach of ministat[1]. That tool has a table of critical values > inside it, indexed by degrees-of-freedom (1-100) and confidence level > (80, 90, 95, 98, 99, 99.5), and one could probably write SQL queries > that spit out an answer like "p is less than 5%, ship it!", if we > stole that table. But what if I actually want to know p? Of course > you can do all that good stuff very easily with tools like R, SciPy > etc and maybe that's the best way to do it. But Oracle, and I think > several other analytics-focused SQL systems, can do it in a very easy > built-in way. I think to get at that you probably need the t CDF, and > in there[2] I see... Γ(). Huh. > Hmm, possibly having the gamma function would be independently useful for other things too. I don't want to get side-tracked though. Regards, Dean
On Thu, 9 Mar 2023 at 00:13, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Mar 08, 2023 at 11:29:12PM +0000, Dean Rasheed wrote: > > On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote: > >> The man pages for these seem to indicate that underflow can occur. Do we > >> need to check for that? > > > > No, I don't think so. The docs indicate that if an underflow occurs, > > the correct result (after rounding) should be returned, so I think we > > should just return that result (as we do for tanh(), for example). > > Makes sense. > > I'm also wondering about whether we need the isinf() checks. IIUC that > should never happen, but maybe you added that "just in case." > I copied those from dtanh(), otherwise I probably wouldn't have bothered. erf() is always in the range [-1, 1], just like tanh(), so it should never overflow, but maybe it can happen in a broken implementation. Regards, Dean
On Thu, Mar 09, 2023 at 12:27:47AM +0000, Dean Rasheed wrote: > On Thu, 9 Mar 2023 at 00:13, Nathan Bossart <nathandbossart@gmail.com> wrote: >> I'm also wondering about whether we need the isinf() checks. IIUC that >> should never happen, but maybe you added that "just in case." > > I copied those from dtanh(), otherwise I probably wouldn't have > bothered. erf() is always in the range [-1, 1], just like tanh(), so > it should never overflow, but maybe it can happen in a broken > implementation. Okay. This looks good to me, then. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Mar 9, 2023 at 1:16 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On Wed, 8 Mar 2023 at 23:24, Thomas Munro <thomas.munro@gmail.com> wrote: > > ... But Oracle, and I think > > several other analytics-focused SQL systems, can do it in a very easy > > built-in way. I think to get at that you probably need the t CDF, and > > in there[2] I see... Γ(). Huh. > > Hmm, possibly having the gamma function would be independently useful > for other things too. I don't want to get side-tracked though. I guess if we did want to add some nice easy-to-use hypothesis testing tools to PostgreSQL, then perhaps gamma wouldn't actually be needed from SQL, but it might be used inside C code for something higher level like tcdf()[1], or even very high level like t_test_independent_agg(s1, s2) etc. Anyway, just thought I'd mention those in passing, as I see they arrived together; sorry for getting off topic. [1] https://stats.stackexchange.com/questions/394978/how-to-approximate-the-student-t-cdf-at-a-point-without-the-hypergeometric-funct