Thread: Multi-parameter aggregates.

Multi-parameter aggregates.

From
Berend Tober
Date:
I'm interested in defining a covariance aggregate function. (As a
refresher, remember that covariance is a little bit like variance, but
is between two variables:

cov(X,Y)    =  <XY> - <X><Y>,

where the angular brackets in this case denote taking the averag.
Variance is a special case when X and Y are the same.)

But the whole user-defined aggregate thing is tough to get a handle on.
I'm not even sure if the direction I'm heading in below will actually
work, but as far as I got, I'm stuck on not knowing how to define a
aggregate that takes more that one variable as its argument, so its use
in SQL would look like, e.g.,

SELECT company, COVAR(year, sales) FROM annual_sales GROUP BY company;

Here is what I tried, and I'm wondering if the team here can help me
make this work (or tell me that the whole approach makes no sense,
etc.). All the DDL executes without syntactical errors until the last
function definition, and the problem is with the "  BASETYPE=numeric"
line, i.e., "ERROR:  AggregateCreate: function
covariance_accum(numeric[], numeric) does not exist"

CREATE TYPE public._covariance AS
    (n integer, x numeric, y numeric, xy numeric);


CREATE OR REPLACE FUNCTION public.covariance_accum(_covariance, numeric,
numeric)
    RETURNS _covariance AS '
    BEGIN
    _covariance.n := _covariance.n+1;
    _covariance.x := _covariance.x+$2;
    _covariance.y := _covariance.x+$3;
    _covariance.xy:= _covariance.xy+($1*$2);
    END;
    'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.covariance_accum(_covariance, numeric,
numeric) TO public;
COMMENT ON FUNCTION public.covariance_accum(_covariance, numeric,
numeric) IS 'covariance aggregate transition function';


-- Need to include a check for N equal zero data points

CREATE OR REPLACE FUNCTION public.numeric_covariance(_covariance)
    RETURNS numeric AS '
    BEGIN
    (_covariance.xy/_covariance.n) -
(_covariance.x/_covariance.n)*(_covariance.y/_covariance.n);
    END;
    'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.numeric_covariance(_covariance) TO public;
COMMENT ON FUNCTION public.numeric_covariance(_covariance) IS
'covariance aggregate final function';



CREATE AGGREGATE public.covariance(
  BASETYPE=numeric,
  SFUNC=covariance_accum,
  STYPE=numeric[],
  FINALFUNC=numeric_covariance,
  INITCOND='{0,0,0,0}'
);


/*
--I also tried this:

CREATE AGGREGATE covariance(
  BASETYPE='numeric, numeric',
  SFUNC=covariance_accum,
  STYPE=numeric[],
  FINALFUNC=numeric_covariance,
  INITCOND='{0,0,0,0}'
);

-- to no avail.
*/

Regards,
Berend



Attachment

Re: Multi-parameter aggregates.

From
Joe Conway
Date:
Berend Tober wrote:
> I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument

This is currently unsupported.

Joe

Re: Multi-parameter aggregates.

From
"Hakan Kocaman"
Date:
Hello Berend,

have you considered using pl/r.
http://www.joeconway.com/plr/

I think R got a covariance-function.
http://www.r-project.org/

Best regards

Hakan Kocaman

Software-Developer
digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98

Email: hakan.kocaman@digame.de



> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Berend Tober
> Sent: Monday, November 21, 2005 5:03 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Multi-parameter aggregates.
>
>
> I'm interested in defining a covariance aggregate function. (As a
> refresher, remember that covariance is a little bit like
> variance, but
> is between two variables:
>
> cov(X,Y)    =  <XY> - <X><Y>,
>
> where the angular brackets in this case denote taking the averag.
> Variance is a special case when X and Y are the same.)
>
> But the whole user-defined aggregate thing is tough to get a
> handle on.
> I'm not even sure if the direction I'm heading in below will actually
> work, but as far as I got, I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument,
> so its use
> in SQL would look like, e.g.,
>
> SELECT company, COVAR(year, sales) FROM annual_sales GROUP BY company;
>
> Here is what I tried, and I'm wondering if the team here can help me
> make this work (or tell me that the whole approach makes no sense,
> etc.). All the DDL executes without syntactical errors until the last
> function definition, and the problem is with the "  BASETYPE=numeric"
> line, i.e., "ERROR:  AggregateCreate: function
> covariance_accum(numeric[], numeric) does not exist"
>
> CREATE TYPE public._covariance AS
>     (n integer, x numeric, y numeric, xy numeric);
>
>
> CREATE OR REPLACE FUNCTION
> public.covariance_accum(_covariance, numeric,
> numeric)
>     RETURNS _covariance AS '
>     BEGIN
>     _covariance.n := _covariance.n+1;
>     _covariance.x := _covariance.x+$2;
>     _covariance.y := _covariance.x+$3;
>     _covariance.xy:= _covariance.xy+($1*$2);
>     END;
>     'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
> GRANT EXECUTE ON FUNCTION
> public.covariance_accum(_covariance, numeric,
> numeric) TO public;
> COMMENT ON FUNCTION public.covariance_accum(_covariance, numeric,
> numeric) IS 'covariance aggregate transition function';
>
>
> -- Need to include a check for N equal zero data points
>
> CREATE OR REPLACE FUNCTION public.numeric_covariance(_covariance)
>     RETURNS numeric AS '
>     BEGIN
>     (_covariance.xy/_covariance.n) -
> (_covariance.x/_covariance.n)*(_covariance.y/_covariance.n);
>     END;
>     'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
> GRANT EXECUTE ON FUNCTION
> public.numeric_covariance(_covariance) TO public;
> COMMENT ON FUNCTION public.numeric_covariance(_covariance) IS
> 'covariance aggregate final function';
>
>
>
> CREATE AGGREGATE public.covariance(
>   BASETYPE=numeric,
>   SFUNC=covariance_accum,
>   STYPE=numeric[],
>   FINALFUNC=numeric_covariance,
>   INITCOND='{0,0,0,0}'
> );
>
>
> /*
> --I also tried this:
>
> CREATE AGGREGATE covariance(
>   BASETYPE='numeric, numeric',
>   SFUNC=covariance_accum,
>   STYPE=numeric[],
>   FINALFUNC=numeric_covariance,
>   INITCOND='{0,0,0,0}'
> );
>
> -- to no avail.
> */
>
> Regards,
> Berend
>
>
>

Re: Multi-parameter aggregates.

From
"Florian G. Pflug"
Date:
Joe Conway wrote:
> Berend Tober wrote:
>
>> I'm stuck on not knowing how to define a aggregate that takes more
>> that one variable as its argument
But I guess it _could_ take an array as argument, maybe even
a record (postgresql pseudonym for what's called a structure in C).

You'd use it with the following syntax (array):
select covariance(array[x, y]) from t ;

or, for a "record"-parameter (I didn't test it - I just
guess that it should work...)
select covariance((x, y)) from t ;

greetings, Florian Pflug

Re: Multi-parameter aggregates.

From
Michael Fuhr
Date:
On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
> I'm interested in defining a covariance aggregate function. (As a
> refresher, remember that covariance is a little bit like variance, but
> is between two variables:
>
> cov(X,Y)    =  <XY> - <X><Y>,
>
> where the angular brackets in this case denote taking the averag.
> Variance is a special case when X and Y are the same.)
>
> But the whole user-defined aggregate thing is tough to get a handle on.
> I'm not even sure if the direction I'm heading in below will actually
> work, but as far as I got, I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument, so its use
> in SQL would look like, e.g.,
>
> SELECT company, COVAR(year, sales) FROM annual_sales GROUP BY company;

I think aggregates must take a single value, so the above won't
work as written.  However, in PostgreSQL 8.0 or later you could
define the aggregate's base type to be a composite type and do
something like

SELECT company, COVAR(ROW(year, sales)) FROM annual_sales GROUP BY company;

You'd create the aggregate like this:

CREATE TYPE covar_state AS (...);
CREATE TYPE xypair AS (x numeric, y numeric);

CREATE FUNCTION covar_accum(covar_state, xypair) RETURNS covar_state AS ...
CREATE FUNCTION covar_final(covar_state) RETURNS numeric AS ...

CREATE AGGREGATE covar (
    BASETYPE  = xypair,
    SFUNC     = covar_accum,
    FINALFUNC = covar_final,
    STYPE     = covar_state,
    INITCOND  = '(...)'
);

--
Michael Fuhr

Re: Multi-parameter aggregates.

From
Tom Lane
Date:
Berend Tober <btober@seaworthysys.com> writes:
> I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument,

That's because there isn't any way to do that.  It's on the TODO list
I believe.  In the meantime, you could possibly kluge it up by
defining a composite type to be the aggregate's argument, and calling
it like
    SELECT covariance(row(...)) FROM ...

            regards, tom lane

Re: Multi-parameter aggregates.

From
Berend Tober
Date:
Hakan Kocaman wrote:

>have you considered using pl/r.
>http://www.joeconway.com/plr/
>
>I think R got a covariance-function.
>http://www.r-project.org/
>
>
>
That would be, like, the easy way.

Thanks!
Berend



Attachment

Re: Multi-parameter aggregates.

From
Berend Tober
Date:
Michael Fuhr wrote:

>On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
>
>
>>I'm interested in defining a covariance aggregate function.
>>
>>
>
>I think aggregates must take a single value, so the above won't
>work as written.  However, in PostgreSQL 8.0 or later you could
>define the aggregate's base type to be a composite type
>

Thanks. I briefly explored something like that, using the existing POINT
data type, but didn't press it too far, pending mailling list advice.

Regards,
Berend


Attachment

Re: Multi-parameter aggregates.

From
Berend Tober
Date:
Michael Fuhr wrote:

>On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
>
>
>>I'm interested in defining a covariance aggregate function. (As a
>>
>>
>...aggregates must take a single value, so the above won't
>work as written.  However, in PostgreSQL 8.0 or later you could
>define the aggregate's base type to be a composite type ...
>
>

Thanks for you help. Here is what I came up with, using the existing
POINT data type, which when you think about it makes a lot of sense, and
even thoug having to explicitly cast a pair of columns as a point type
is a little cludgy, I'm get the analytical tool I need, regardless.

CREATE TYPE public._regression AS (
    n integer,
    x double precision,
    y double precision,
    xy double precision
    );


CREATE OR REPLACE FUNCTION public.weighted_average(double precision,
double precision, double precision)
    RETURNS double precision AS '
    BEGIN
    IF $1>1.0 THEN
        RAISE EXCEPTION ''Weighted average % coefficient exceeds
unity.'', $1;
    ELSIF $1<0 THEN
        RAISE EXCEPTION ''Weighted average % coefficient less than
zero.'', $1;
    END IF;
    RETURN $1*$2 + (1.-$1)*$3;
    END;
    'LANGUAGE 'plpgsql' IMMUTABLE STRICT;


CREATE OR REPLACE FUNCTION public.regression_accum(_regression, point)
    RETURNS _regression AS '
    SELECT
        $1.n+1,
        weighted_average($1.n/($1.n+1.), $1.x, $2[0]),
        weighted_average($1.n/($1.n+1.), $1.y, $2[1]),
        weighted_average($1.n/($1.n+1.), $1.xy, $2[0]*$2[1]);
    'LANGUAGE 'sql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.regression_accum(_regression, point) TO
public;
COMMENT ON FUNCTION public.regression_accum(_regression, point)
    IS 'regression aggregate transition function';

CREATE OR REPLACE FUNCTION public.covariance(_regression)
    RETURNS double precision AS '
    BEGIN
    IF $1.n = 0 THEN
        RAISE EXCEPTION ''No covariance data'';
    END IF;
    RETURN $1.xy - $1.x*$1.y;
    END;
    'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.covariance(_regression) TO public;
COMMENT ON FUNCTION public.covariance(_regression)
    IS 'cov aggregate final function';

CREATE AGGREGATE public.cov(
  BASETYPE=point,
  SFUNC=regression_accum,
  STYPE=_regression,
  FINALFUNC=covariance,
  INITCOND='(0,0,0,0)'
);

-- Examples

CREATE LOCAL TEMPORARY TABLE test (
    example integer,
    x numeric,
    y numeric
) without oids;

-- Hand calculated: cov(x,y)=0.25
INSERT INTO test VALUES (1, 0.0, 0.0);
INSERT INTO test VALUES (1, 1.0, 1.0);

-- Hand calculated: cov(x,x)=0.389, cov(y,y)=1.556, cov(x,y)=0.778
INSERT INTO test VALUES (2, 0.0, 0.0);
INSERT INTO test VALUES (2, 1.0, 2.0);
INSERT INTO test VALUES (2, 1.5, 3.0);

-- Hand calculated: cov(x,x)=0.264, cov(y,y)=0.192, cov(x,y)=0.214
INSERT INTO test VALUES (3, 0.25, 0.125);
INSERT INTO test VALUES (3, 1.0, 1.0);
INSERT INTO test VALUES (3, 1.5, 1.1);


--
http://www.quantlet.com/mdstat/scripts/mva/htmlbook/mvahtmlnode22.html:
cov(x,y) = -80.02
INSERT INTO test VALUES (4, 230, 125);
INSERT INTO test VALUES (4, 181,  99);
INSERT INTO test VALUES (4, 165,  97);
INSERT INTO test VALUES (4, 150, 115);
INSERT INTO test VALUES (4,  97, 120);
INSERT INTO test VALUES (4, 192, 100);
INSERT INTO test VALUES (4, 181,  80);
INSERT INTO test VALUES (4, 189,  90);
INSERT INTO test VALUES (4, 172,  95);
INSERT INTO test VALUES (4, 170, 125);

/*
Note that the value COV(X,X) and COV(Y,Y) do not equal output from
the native Postgresql aggregates VARIANCE(x) and VARIANCE(y),
respectively. I cannot explain why they are different, but COV()
is consistent with direct calculation.
*/

SELECT
    'Example '||example AS title,
    COUNT(*) AS N,
    COV(POINT(x,x)) AS covXX,
    COV(POINT(y,y)) AS covYY,
    COV(POINT(x,y)) AS covXY,
    VARIANCE(x),
    VARIANCE(y)
    FROM test
    GROUP BY 1
    ORDER BY 1;