Thread: aggregate function for median calculation

aggregate function for median calculation

From

"Thalis A. Kalfigopoulos"

Date:

18 June 2001, 15:01:48

Hippl,
    I'm interested in calculating the median of a set of numbers. The algorithm requires that all values are known in
advance(ie stored in an array). So the question is: how can I store everything first in an array so I can later process
itgiven that I'd like this to be an aggregate function. I thought of creating an aggregate function and have the
state_function()gather all the values of a group in an array and the final_function() to do the actuall median
calculationon this array. But the intermmediate state cannot hold multiple values in an array (can it?)  
Any ideas on how to go with this?

TIA,
thalis

Re: aggregate function for median calculation

From

Tom Lane

Date:

18 June 2001, 16:28:42

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:
> But the intermmediate state cannot hold multiple values in an array
> (can it?)

Sure, why not?  Might not scale too well to lots of values, however.

            regards, tom lane

Re: aggregate function for median calculation

From

Alex Pilosov

Date:

18 June 2001, 19:06:18

On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:

> Hippl,
>     I'm interested in calculating the median of a set of numbers.
> The algorithm requires that all values are known in advance (ie stored
> in an array). So the question is: how can I store everything first in
> an array so I can later process it given that I'd like this to be an
> aggregate function. I thought of creating an aggregate function and
> have the state_function() gather all the values of a group in an array
> and the final_function() to do the actuall median calculation on this
> array. But the intermmediate state cannot hold multiple values in an
> array (can it?)  Any ideas on how to go with this?

With current architecture, its kinda painful to implement such a function.
Your 'state' function should allocate (palloc) memory for each element
processed and then pfree it when you are done.

-alex

Re: aggregate function for median calculation

From

Philip Hallstrom

Date:

18 June 2001, 19:31:46

I missed the first part, but if the numbers are rows in a table, why not
do something like:

numrows = select count(*) from table1 where some_condition
median_value = select some_col from table1 where some_condition order by
               some_col limit numrows/2, 1

(or something very close to that anyway).

-philip



On Mon, 18 Jun 2001, Alex Pilosov wrote:

> On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:
>
> > Hippl,
> >     I'm interested in calculating the median of a set of numbers.
> > The algorithm requires that all values are known in advance (ie stored
> > in an array). So the question is: how can I store everything first in
> > an array so I can later process it given that I'd like this to be an
> > aggregate function. I thought of creating an aggregate function and
> > have the state_function() gather all the values of a group in an array
> > and the final_function() to do the actuall median calculation on this
> > array. But the intermmediate state cannot hold multiple values in an
> > array (can it?)  Any ideas on how to go with this?
>
> With current architecture, its kinda painful to implement such a function.
> Your 'state' function should allocate (palloc) memory for each element
> processed and then pfree it when you are done.
>
> -alex
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://www.postgresql.org/search.mpl
>

Re: aggregate function for median calculation

From

"Thalis A. Kalfigopoulos"

Date:

19 June 2001, 16:17:33

On Mon, 18 Jun 2001, Philip Hallstrom wrote:

> I missed the first part, but if the numbers are rows in a table, why not
> do something like:
>
> numrows = select count(*) from table1 where some_condition
> median_value = select some_col from table1 where some_condition order by
>                some_col limit numrows/2, 1
>
> (or something very close to that anyway).
>
> -philip

Because I'll be using this median() as an aggregate function and grouping by multiple attributes of different tables.
Soinstead of me doing the aggregation bookkeeping, I'd rather have it as a Pg aggr. function. 

cheers,
thalis

>
>
>
> On Mon, 18 Jun 2001, Alex Pilosov wrote:
>
> > On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:
> >
> > > Hippl,
> > >     I'm interested in calculating the median of a set of numbers.
> > > The algorithm requires that all values are known in advance (ie stored
> > > in an array). So the question is: how can I store everything first in
> > > an array so I can later process it given that I'd like this to be an
> > > aggregate function. I thought of creating an aggregate function and
> > > have the state_function() gather all the values of a group in an array
> > > and the final_function() to do the actuall median calculation on this
> > > array. But the intermmediate state cannot hold multiple values in an
> > > array (can it?)  Any ideas on how to go with this?
> >
> > With current architecture, its kinda painful to implement such a function.
> > Your 'state' function should allocate (palloc) memory for each element
> > processed and then pfree it when you are done.
> >
> > -alex
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: Have you searched our list archives?
> >
> > http://www.postgresql.org/search.mpl
> >
>
>

Re: aggregate function for median calculation

From

"Thalis A. Kalfigopoulos"

Date:

19 June 2001, 16:20:20

On Mon, 18 Jun 2001, Alex Pilosov wrote:

> On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:
>
> > Hippl,
> >     I'm interested in calculating the median of a set of numbers.
> > The algorithm requires that all values are known in advance (ie stored
> > in an array). So the question is: how can I store everything first in
> > an array so I can later process it given that I'd like this to be an
> > aggregate function. I thought of creating an aggregate function and
> > have the state_function() gather all the values of a group in an array
> > and the final_function() to do the actuall median calculation on this
> > array. But the intermmediate state cannot hold multiple values in an
> > array (can it?)  Any ideas on how to go with this?
>
> With current architecture, its kinda painful to implement such a function.
> Your 'state' function should allocate (palloc) memory for each element
> processed and then pfree it when you are done.
>
> -alex

So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same
function?I'm trying to figure how to design the aggregate's state_function that will do the element accumulation
(beforeactually passing this array to the final to calculate the median). 

TIA,
thalis

Re: aggregate function for median calculation

From

Peter Eisentraut

Date:

19 June 2001, 16:50:41

Thalis A. Kalfigopoulos writes:

> So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same
function?I'm trying to figure how to design the aggregate's state_function that will do the element accumulation
(beforeactually passing this array to the final to calculate the median). 

Sure, you create a (static) global variable and reallocate memory for it
in each call and free it by the finalizer function.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter

Re: aggregate function for median calculation

From

Tom Lane

Date:

19 June 2001, 18:54:01

Peter Eisentraut <peter_e@gmx.net> writes:
> Sure, you create a (static) global variable and reallocate memory for it
> in each call and free it by the finalizer function.

A static would be a bad idea (consider a query with multiple instances
of this aggregate being evaluated in parallel).

But there's no reason that the transition function can't palloc a larger
and larger chunk of memory for each result (as long as you don't run out
of memory, anyway).

            regards, tom lane

Re: aggregate function for median calculation

From

"Thalis A. Kalfigopoulos"

Date:

20 June 2001, 21:34:29

On Tue, 19 Jun 2001, Tom Lane wrote:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > Sure, you create a (static) global variable and reallocate memory for it
> > in each call and free it by the finalizer function.
>
> A static would be a bad idea (consider a query with multiple instances
> of this aggregate being evaluated in parallel).
>
> But there's no reason that the transition function can't palloc a larger
> and larger chunk of memory for each result (as long as you don't run out
> of memory, anyway).
>
>             regards, tom lane
>

I'm still a bit confused about how to declare the type of the transition state.
I have a structure that will hold the current state:
struct state {
    int elem_count; //how many numbers have been assigned so far
    int *elem_area; //ptr to area holding these integers whose median I'll later calculate
}

So the transition function is:
struct state *trans_function(struct state *last_state,int new_element){
    //Allocate mem for a new state structure to hold all previous numbers + new one
    //Copy over all the previous elements from last_state and add the new_element
    //Return ptr to this new state struct
}

My question is: how do I declare this in Pg?
declare function transition_func(???,int4) RETURNS ??? AS 'path_to_.so_file','trans_function' LANGUAGE 'C';

I assume that I have mixed up what can be a transition state :-/

Any help welcome.


TIA,
thalis

Re: aggregate function for median calculation

From

Tom Lane

Date:

21 June 2001, 01:21:30

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:
> I'm still a bit confused about how to declare the type of the transition state.
> I have a structure that will hold the current state:
> struct state {
>     int elem_count; //how many numbers have been assigned so far
>     int *elem_area; //ptr to area holding these integers whose median I'll later calculate
> }

> My question is: how do I declare this in Pg?

Since you haven't made this a full-fledged Postgres type, you don't.

There's no real *need* to make it a full-fledged type, actually, since
nobody except your two aggregate functions will ever manipulate the
value.  So I'd suggest cheating.  Make it a legal "varlena" value by
having the first word of the struct contain the total length in bytes
of the object (including itself).  Then pretend in the SQL transition
function declaration and aggregate declaration that the transition
datatype is any old varlena type --- text or bytea would do fine.
Nothing except your code will look at anything except the varlena
length, so the fact that the body of the value means something unusual
won't matter.

Or, if you want to be rigidly correct, you could make the transition
value be a legitimate int4 array (_int4), which is only a field or
three more overhead.  But I don't see any value in it, except possibly
for manual testing of your transition functions.

            regards, tom lane

Re: aggregate function for median calculation

From

"Thalis A. Kalfigopoulos"

Date:

21 June 2001, 12:10:20

On Thu, 21 Jun 2001, Tom Lane wrote:

> "Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:
> > I'm still a bit confused about how to declare the type of the transition state.
> > I have a structure that will hold the current state:
> > struct state {
> >     int elem_count; //how many numbers have been assigned so far
> >     int *elem_area; //ptr to area holding these integers whose median I'll later calculate
> > }
>
> > My question is: how do I declare this in Pg?
>
> Since you haven't made this a full-fledged Postgres type, you don't.
>
> There's no real *need* to make it a full-fledged type, actually, since
> nobody except your two aggregate functions will ever manipulate the
> value.  So I'd suggest cheating.  Make it a legal "varlena" value by
> having the first word of the struct contain the total length in bytes
> of the object (including itself).  Then pretend in the SQL transition
> function declaration and aggregate declaration that the transition
> datatype is any old varlena type --- text or bytea would do fine.
> Nothing except your code will look at anything except the varlena
> length, so the fact that the body of the value means something unusual
> won't matter.
>
> Or, if you want to be rigidly correct, you could make the transition
> value be a legitimate int4 array (_int4), which is only a field or
> three more overhead.  But I don't see any value in it, except possibly
> for manual testing of your transition functions.
>
>             regards, tom lane
>

I worked on it so the struct state now has 3 fields: int length, int elem_count, int *elements. Now the issue is that
thetransition function, which I declare as strict, has a different input and state type (int4 and text respectively),
soit requires an INITCOND. What should that be? It has to be the text representation of what the state struct is
initially.The state initially has elem_count=0, elements=NULL and length=sizeof(struct state) which is 3*sizeof(int)
(the2 integer fields and a pointer i.e. another integer). How do I explain that? 

I tried some scenarios for INITCOD like '\0' and 0 but they all crashed :-/

Any ideas?

tia,
thalis

Re: aggregate function for median calculation

From

Tom Lane

Date:

21 June 2001, 12:47:38

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:
> I worked on it so the struct state now has 3 fields: int length, int
> elem_count, int *elements. Now the issue is that the transition
> function, which I declare as strict, has a different input and state
> type (int4 and text respectively), so it requires an INITCOND. What
> should that be?

You're going to have to make your code able to start from a value that
can be represented as an INITCOND.  Say, an empty string (zero length
word and nothing else).  You're cheating, remember?

            regards, tom lane