Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers

From David Fetter
Subject Re: [HACKERS] multivariate statistics (v19)
Date
Msg-id 20170212194207.GA2120@fetter.org
Whole thread Raw
In response to Re: [HACKERS] multivariate statistics (v19)  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] multivariate statistics (v24)
List pgsql-hackers
On Sun, Feb 12, 2017 at 10:35:04AM +0000, Dean Rasheed wrote:
> On 11 February 2017 at 01:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> > sloppy because the number of attributes in the statistics is currently
> > limited to 8, so the overflows are currently not an issue. But it doesn't
> > hurt to make it future-proof, in case we change that mostly artificial limit
> > sometime in the future.
> >
> 
> Ah right, so it can't overflow at present, but it's neater to have an
> overflow-proof algorithm.
> 
> Thinking about the exactness of the division steps is quite
> interesting. Actually, the order of the multiplying factors doesn't
> matter as long as the divisors are in increasing order. So in both my
> proposal:
> 
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-k+i)) / i;
> 
> and David's proposal, which is equivalent but has the multiplying
> factors in the opposite order, equivalent to:
> 
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-i+1)) / i;
> 
> the divisions are exact at each step. The first time through the loop
> it divides by 1 which is trivially exact. The second time it divides
> by 2, having multiplied by 2 consecutive factors, one of which is
> therefore guaranteed to be divisible by 2. The third time it divides
> by 3, having multiplied by 3 consecutive factors, one of which is
> therefore guaranteed to be divisible by 3, and so on.

Right.  You know you can use integer division, which make sense as
permutations of discrete sets are always integers.

> My approach originally seemed more logical to me because of the way it
> derives from the recurrence relation binomial(n, k) = binomial(n-1,
> k-1) * n / k, but they both work fine as long as they have suitable
> overflow checks.

Right.  We could even cache those checks (sorry) based on data type
limits by architecture and OS if performance on those operations ever
matters that much.

> It's also interesting that descriptions of this algorithm tend to
> talk about setting k to min(k, n-k) at the start as an optimisation
> step, as I did in fact, whereas it's actually more than that -- it
> helps prevent unnecessary intermediate overflows when k > n/2. Of
> course, that's not a worry for the current use of this function, but
> it's good to have a robust algorithm.

Indeed. :)

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



pgsql-hackers by date:

Previous
From: Vladimir Rusinov
Date:
Subject: Re: [HACKERS] WIP: About CMake v2
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] Improve OR conditions on joined columns (common starschema problem)