Re: Weighted Stats - Mailing list pgsql-hackers

From David Fetter
Subject Re: Weighted Stats
Date
Msg-id 20160320223840.GA17918@fetter.org
Whole thread Raw
In response to Re: Weighted Stats  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Weighted Stats  (David Fetter <david@fetter.org>)
List pgsql-hackers
On Sat, Mar 19, 2016 at 05:04:08PM +0100, Tomas Vondra wrote:
> Hi,
> 
> On 03/19/2016 07:34 AM, David Fetter wrote:
> >On Fri, Mar 18, 2016 at 06:12:12PM -0700, Jeff Janes wrote:
> >>On Tue, Mar 15, 2016 at 8:36 AM, David Fetter <david@fetter.org> wrote:
> >>>
> >>>Please find attached a patch that uses the float8 version to cover the
> >>>numeric types.
> >>
> >>Is there a well-defined meaning for having a negative weight?  If no,
> >>should it be disallowed?
> >
> >Opinions on this appear to vary.  A Wikipedia article defines weights
> >as non-negative, while a manual to which it refers only uses non-zero.
> >
> >https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
> >https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html
> 
> I don't think that actually allows negative weights. It says that
> 
>    w_i = 1/\sigma_i^2
> 
> and variance is always > 0, thus w_i > 0. The zero is used as a special flag
> to remove the sample from the data set in a simple way.
> 
> >I'm not sure which if either would be authoritative, but I could
> >certainly make up variants for each assumption.
> >
> >The assumption they have in common about weights is that a zero
> >weight is not part of the calculation, which assumption is
> >implemented in the previously submitted code.
> 
> I think that if we're not sure what should happen with negative weights,
> then we should disallow them. It's easy to allow them later once we have a
> reasonable definition, but if we allow them now and later realize it should
> behave differently, we'll be in trouble because of breaking existing uses.

OK

> I can't really come up with a reasonable example that would actually use
> negative weights. Can you? That would probably help with defining the
> behavior correctly.

No, but I'm not a statistician.  I've seen them mentioned in contexts
that appear to be discussions among same, and again opinions vary.

> Allowing negative weights has other consequences. For example, what if
> sum(W) ends up being 0? For example
> 
> CREATE TABLE t (a float, b float);
> INSERT INTO  t SELECT i,  1 FROM generate_series(1,1000) s(i);
> INSERT INTO  t SELECT i, -1 FROM generate_series(1,1000) s(i);
> 
> SELECT weighted_avg(a,b) FROM t;
>  weighted_avg
> --------------
>           NaN
> (1 row)
> 
> Is that the correct behavior? Why?

It's not, and you're right.

I will send a patch that disallows negative weights this evening or
tomorrow.  It will be slightly more complicated as I believe I will
need to create a new accumulator function for the weighted_avg() case
where I had been using an extant one before.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: extend pgbench expressions with functions
Next
From: Yuri Niyazov
Date:
Subject: pg_upgrade documentation improvement patch