Thread: aggregate function for median calculation
Hippl, I'm interested in calculating the median of a set of numbers. The algorithm requires that all values are known in advance(ie stored in an array). So the question is: how can I store everything first in an array so I can later process itgiven that I'd like this to be an aggregate function. I thought of creating an aggregate function and have the state_function()gather all the values of a group in an array and the final_function() to do the actuall median calculationon this array. But the intermmediate state cannot hold multiple values in an array (can it?) Any ideas on how to go with this? TIA, thalis
"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes: > But the intermmediate state cannot hold multiple values in an array > (can it?) Sure, why not? Might not scale too well to lots of values, however. regards, tom lane
On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote: > Hippl, > I'm interested in calculating the median of a set of numbers. > The algorithm requires that all values are known in advance (ie stored > in an array). So the question is: how can I store everything first in > an array so I can later process it given that I'd like this to be an > aggregate function. I thought of creating an aggregate function and > have the state_function() gather all the values of a group in an array > and the final_function() to do the actuall median calculation on this > array. But the intermmediate state cannot hold multiple values in an > array (can it?) Any ideas on how to go with this? With current architecture, its kinda painful to implement such a function. Your 'state' function should allocate (palloc) memory for each element processed and then pfree it when you are done. -alex
I missed the first part, but if the numbers are rows in a table, why not do something like: numrows = select count(*) from table1 where some_condition median_value = select some_col from table1 where some_condition order by some_col limit numrows/2, 1 (or something very close to that anyway). -philip On Mon, 18 Jun 2001, Alex Pilosov wrote: > On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote: > > > Hippl, > > I'm interested in calculating the median of a set of numbers. > > The algorithm requires that all values are known in advance (ie stored > > in an array). So the question is: how can I store everything first in > > an array so I can later process it given that I'd like this to be an > > aggregate function. I thought of creating an aggregate function and > > have the state_function() gather all the values of a group in an array > > and the final_function() to do the actuall median calculation on this > > array. But the intermmediate state cannot hold multiple values in an > > array (can it?) Any ideas on how to go with this? > > With current architecture, its kinda painful to implement such a function. > Your 'state' function should allocate (palloc) memory for each element > processed and then pfree it when you are done. > > -alex > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://www.postgresql.org/search.mpl >
On Mon, 18 Jun 2001, Philip Hallstrom wrote: > I missed the first part, but if the numbers are rows in a table, why not > do something like: > > numrows = select count(*) from table1 where some_condition > median_value = select some_col from table1 where some_condition order by > some_col limit numrows/2, 1 > > (or something very close to that anyway). > > -philip Because I'll be using this median() as an aggregate function and grouping by multiple attributes of different tables. Soinstead of me doing the aggregation bookkeeping, I'd rather have it as a Pg aggr. function. cheers, thalis > > > > On Mon, 18 Jun 2001, Alex Pilosov wrote: > > > On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote: > > > > > Hippl, > > > I'm interested in calculating the median of a set of numbers. > > > The algorithm requires that all values are known in advance (ie stored > > > in an array). So the question is: how can I store everything first in > > > an array so I can later process it given that I'd like this to be an > > > aggregate function. I thought of creating an aggregate function and > > > have the state_function() gather all the values of a group in an array > > > and the final_function() to do the actuall median calculation on this > > > array. But the intermmediate state cannot hold multiple values in an > > > array (can it?) Any ideas on how to go with this? > > > > With current architecture, its kinda painful to implement such a function. > > Your 'state' function should allocate (palloc) memory for each element > > processed and then pfree it when you are done. > > > > -alex > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 6: Have you searched our list archives? > > > > http://www.postgresql.org/search.mpl > > > >
On Mon, 18 Jun 2001, Alex Pilosov wrote: > On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote: > > > Hippl, > > I'm interested in calculating the median of a set of numbers. > > The algorithm requires that all values are known in advance (ie stored > > in an array). So the question is: how can I store everything first in > > an array so I can later process it given that I'd like this to be an > > aggregate function. I thought of creating an aggregate function and > > have the state_function() gather all the values of a group in an array > > and the final_function() to do the actuall median calculation on this > > array. But the intermmediate state cannot hold multiple values in an > > array (can it?) Any ideas on how to go with this? > > With current architecture, its kinda painful to implement such a function. > Your 'state' function should allocate (palloc) memory for each element > processed and then pfree it when you are done. > > -alex So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same function?I'm trying to figure how to design the aggregate's state_function that will do the element accumulation (beforeactually passing this array to the final to calculate the median). TIA, thalis
Thalis A. Kalfigopoulos writes: > So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same function?I'm trying to figure how to design the aggregate's state_function that will do the element accumulation (beforeactually passing this array to the final to calculate the median). Sure, you create a (static) global variable and reallocate memory for it in each call and free it by the finalizer function. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes: > Sure, you create a (static) global variable and reallocate memory for it > in each call and free it by the finalizer function. A static would be a bad idea (consider a query with multiple instances of this aggregate being evaluated in parallel). But there's no reason that the transition function can't palloc a larger and larger chunk of memory for each result (as long as you don't run out of memory, anyway). regards, tom lane
On Tue, 19 Jun 2001, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > Sure, you create a (static) global variable and reallocate memory for it > > in each call and free it by the finalizer function. > > A static would be a bad idea (consider a query with multiple instances > of this aggregate being evaluated in parallel). > > But there's no reason that the transition function can't palloc a larger > and larger chunk of memory for each result (as long as you don't run out > of memory, anyway). > > regards, tom lane > I'm still a bit confused about how to declare the type of the transition state. I have a structure that will hold the current state: struct state { int elem_count; //how many numbers have been assigned so far int *elem_area; //ptr to area holding these integers whose median I'll later calculate } So the transition function is: struct state *trans_function(struct state *last_state,int new_element){ //Allocate mem for a new state structure to hold all previous numbers + new one //Copy over all the previous elements from last_state and add the new_element //Return ptr to this new state struct } My question is: how do I declare this in Pg? declare function transition_func(???,int4) RETURNS ??? AS 'path_to_.so_file','trans_function' LANGUAGE 'C'; I assume that I have mixed up what can be a transition state :-/ Any help welcome. TIA, thalis
"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes: > I'm still a bit confused about how to declare the type of the transition state. > I have a structure that will hold the current state: > struct state { > int elem_count; //how many numbers have been assigned so far > int *elem_area; //ptr to area holding these integers whose median I'll later calculate > } > My question is: how do I declare this in Pg? Since you haven't made this a full-fledged Postgres type, you don't. There's no real *need* to make it a full-fledged type, actually, since nobody except your two aggregate functions will ever manipulate the value. So I'd suggest cheating. Make it a legal "varlena" value by having the first word of the struct contain the total length in bytes of the object (including itself). Then pretend in the SQL transition function declaration and aggregate declaration that the transition datatype is any old varlena type --- text or bytea would do fine. Nothing except your code will look at anything except the varlena length, so the fact that the body of the value means something unusual won't matter. Or, if you want to be rigidly correct, you could make the transition value be a legitimate int4 array (_int4), which is only a field or three more overhead. But I don't see any value in it, except possibly for manual testing of your transition functions. regards, tom lane
On Thu, 21 Jun 2001, Tom Lane wrote: > "Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes: > > I'm still a bit confused about how to declare the type of the transition state. > > I have a structure that will hold the current state: > > struct state { > > int elem_count; //how many numbers have been assigned so far > > int *elem_area; //ptr to area holding these integers whose median I'll later calculate > > } > > > My question is: how do I declare this in Pg? > > Since you haven't made this a full-fledged Postgres type, you don't. > > There's no real *need* to make it a full-fledged type, actually, since > nobody except your two aggregate functions will ever manipulate the > value. So I'd suggest cheating. Make it a legal "varlena" value by > having the first word of the struct contain the total length in bytes > of the object (including itself). Then pretend in the SQL transition > function declaration and aggregate declaration that the transition > datatype is any old varlena type --- text or bytea would do fine. > Nothing except your code will look at anything except the varlena > length, so the fact that the body of the value means something unusual > won't matter. > > Or, if you want to be rigidly correct, you could make the transition > value be a legitimate int4 array (_int4), which is only a field or > three more overhead. But I don't see any value in it, except possibly > for manual testing of your transition functions. > > regards, tom lane > I worked on it so the struct state now has 3 fields: int length, int elem_count, int *elements. Now the issue is that thetransition function, which I declare as strict, has a different input and state type (int4 and text respectively), soit requires an INITCOND. What should that be? It has to be the text representation of what the state struct is initially.The state initially has elem_count=0, elements=NULL and length=sizeof(struct state) which is 3*sizeof(int) (the2 integer fields and a pointer i.e. another integer). How do I explain that? I tried some scenarios for INITCOD like '\0' and 0 but they all crashed :-/ Any ideas? tia, thalis
"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes: > I worked on it so the struct state now has 3 fields: int length, int > elem_count, int *elements. Now the issue is that the transition > function, which I declare as strict, has a different input and state > type (int4 and text respectively), so it requires an INITCOND. What > should that be? You're going to have to make your code able to start from a value that can be represented as an INITCOND. Say, an empty string (zero length word and nothing else). You're cheating, remember? regards, tom lane