Thread: proposal: tuplestore, tuplesort aggregate functions
Hello I still thinking about a "median" type functions. My idea is to introduce a new syntax for stype definition - like stype = type, or stype = ARRAY OF type [ ORDER [ DESC | ASC ]], or stype = TUPLESTORE OF type, or stype = TUPLESORT OF type [ DESC | ASC ] when stype is ARRAY of then final and transistent functions can be a PL functions. When stype isn't scalar, then sfunc can be undefined (it use a buildin functions). Then we can implement a aggregate only with final functions. so median function can be defined: CREATE FUNCTION num_median_final(internal) RETURNS numeric AS ... CREATE AGGREGATE median(numeric) (stype = TUPLESORT OF numeric, finalfunc = num_median_final); This feature has impact primary on agg executor, and can be relative simple - no planner changes (or not big), minimal parser changes. Main reason for this feature is possible access to tuplesort and tuplesort. I hope, so this can solve a problems with computing a median and similar functions on very large datasets. comments? regards
Pavel Stehule <pavel.stehule@gmail.com> writes: > I still thinking about a "median" type functions. My idea is to > introduce a new syntax for stype definition - like > stype = type, or > stype = ARRAY OF type [ ORDER [ DESC | ASC ]], or > stype = TUPLESTORE OF type, or > stype = TUPLESORT OF type [ DESC | ASC ] This seems like a fairly enormous amount of conceptual (and code) infrastructure just to make it possible to build median() out of spare parts. It's also exposing some implementation details that I'd just as soon not expose in SQL. I'd rather just implement median as a special-purpose aggregate. regards, tom lane
2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> I still thinking about a "median" type functions. My idea is to >> introduce a new syntax for stype definition - like > >> stype = type, or >> stype = ARRAY OF type [ ORDER [ DESC | ASC ]], or >> stype = TUPLESTORE OF type, or >> stype = TUPLESORT OF type [ DESC | ASC ] > > This seems like a fairly enormous amount of conceptual (and code) > infrastructure just to make it possible to build median() out of spare > parts. It's also exposing some implementation details that I'd just as > soon not expose in SQL. I'd rather just implement median as a > special-purpose aggregate. yes, it is little bit strange - but when we talked last time about this topic, I understand, so you dislike any special solution for this functionality. So I searched different more general way. On the other hand, I agree so special purpose aggregate (with a few changes in nodeAgg) can be enough. The median (and additional forms) is really special and there are not wide used use case. Regards Pavel > > regards, tom lane >
On Wed, Aug 18, 2010 at 04:03:25PM +0200, Pavel Stehule wrote: > 2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: > > Pavel Stehule <pavel.stehule@gmail.com> writes: > >> I still thinking about a "median" type functions. My idea is to > >> introduce a new syntax for stype definition - like > > > >> stype = type, or > >> stype = ARRAY OF type [ ORDER [ DESC | ASC ]], or > >> stype = TUPLESTORE OF type, or > >> stype = TUPLESORT OF type [ DESC | ASC ] > > > > This seems like a fairly enormous amount of conceptual (and code) > > infrastructure just to make it possible to build median() out of > > spare parts. It's also exposing some implementation details that > > I'd just as soon not expose in SQL. I'd rather just implement > > median as a special-purpose aggregate. > > yes, it is little bit strange - but when we talked last time about > this topic, I understand, so you dislike any special solution for > this functionality. So I searched different more general way. On the > other hand, I agree so special purpose aggregate (with a few changes > in nodeAgg) can be enough. The median (and additional forms) is > really special and there are not wide used use case. Which median do you plan to implement? Or do you plan to implement several different medians, each with distinguishing names? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
2010/8/18 David Fetter <david@fetter.org>: > On Wed, Aug 18, 2010 at 04:03:25PM +0200, Pavel Stehule wrote: >> 2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: >> > Pavel Stehule <pavel.stehule@gmail.com> writes: >> >> I still thinking about a "median" type functions. My idea is to >> >> introduce a new syntax for stype definition - like >> > >> >> stype = type, or >> >> stype = ARRAY OF type [ ORDER [ DESC | ASC ]], or >> >> stype = TUPLESTORE OF type, or >> >> stype = TUPLESORT OF type [ DESC | ASC ] >> > >> > This seems like a fairly enormous amount of conceptual (and code) >> > infrastructure just to make it possible to build median() out of >> > spare parts. It's also exposing some implementation details that >> > I'd just as soon not expose in SQL. I'd rather just implement >> > median as a special-purpose aggregate. >> >> yes, it is little bit strange - but when we talked last time about >> this topic, I understand, so you dislike any special solution for >> this functionality. So I searched different more general way. On the >> other hand, I agree so special purpose aggregate (with a few changes >> in nodeAgg) can be enough. The median (and additional forms) is >> really special and there are not wide used use case. > > Which median do you plan to implement? Or do you plan to implement > several different medians, each with distinguishing names? my proposal enabled implementation of any "median like" function. But if we implement median as special case of aggregate, then some basic "median" will be implemented. Regards Pavel > > Cheers, > David. > -- > David Fetter <david@fetter.org> http://fetter.org/ > Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter > Skype: davidfetter XMPP: david.fetter@gmail.com > iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics > > Remember to vote! > Consider donating to Postgres: http://www.postgresql.org/about/donate >
On Wed, Aug 18, 2010 at 04:10:18PM +0200, Pavel Stehule wrote: > 2010/8/18 David Fetter <david@fetter.org>: > > Which median do you plan to implement? Or do you plan to implement > > several different medians, each with distinguishing names? > > my proposal enabled implementation of any "median like" function. But > if we implement median as special case of aggregate, then some basic > "median" will be implemented. Apart from the medians, which "median-like" aggregates do you have in mind to start with? If you can provide examples of "median-like" aggregates that people might need to implement as user-defined aggregates, or other places where people would use this machinery, it will make your case stronger for this refactoring. Otherwise, it seems like a more reasonable thing to make the medians special case code. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > Apart from the medians, which "median-like" aggregates do you have in > mind to start with? If you can provide examples of "median-like" > aggregates that people might need to implement as user-defined > aggregates, or other places where people would use this machinery, it > will make your case stronger for this refactoring. There would be plenty of scope to re-use the machinery without any SQL-level extensions. All you need is a polymorphic aggregate transition function that maintains a tuplestore or whatever. I don't see that extra syntax in CREATE AGGREGATE is really buying much of anything. regards, tom lane
2010/8/18 David Fetter <david@fetter.org>: > On Wed, Aug 18, 2010 at 04:10:18PM +0200, Pavel Stehule wrote: >> 2010/8/18 David Fetter <david@fetter.org>: >> > Which median do you plan to implement? Or do you plan to implement >> > several different medians, each with distinguishing names? >> >> my proposal enabled implementation of any "median like" function. But >> if we implement median as special case of aggregate, then some basic >> "median" will be implemented. > > Apart from the medians, which "median-like" aggregates do you have in > mind to start with? If you can provide examples of "median-like" > aggregates that people might need to implement as user-defined > aggregates, or other places where people would use this machinery, it > will make your case stronger for this refactoring. > I didn't think about some special median - this proposal is just about aggregates with large a transistent data. Then the access to tuplestore can be very usefull. > Otherwise, it seems like a more reasonable thing to make the medians > special case code. yes, minimally for this moment. Regards Pavel > > Cheers, > David. > -- > David Fetter <david@fetter.org> http://fetter.org/ > Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter > Skype: davidfetter XMPP: david.fetter@gmail.com > iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics > > Remember to vote! > Consider donating to Postgres: http://www.postgresql.org/about/donate >
On Wed, Aug 18, 2010 at 10:39:33AM -0400, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > Apart from the medians, which "median-like" aggregates do you have in > > mind to start with? If you can provide examples of "median-like" > > aggregates that people might need to implement as user-defined > > aggregates, or other places where people would use this machinery, it > > will make your case stronger for this refactoring. > > There would be plenty of scope to re-use the machinery without any > SQL-level extensions. All you need is a polymorphic aggregate > transition function that maintains a tuplestore or whatever. > I don't see that extra syntax in CREATE AGGREGATE is really buying > much of anything. Thanks for clarifying. Might this help out with things like GROUPING SETS or wCTEs? Cheers, David (a little slow today). -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: > David Fetter <david@fetter.org> writes: >> Apart from the medians, which "median-like" aggregates do you have in >> mind to start with? If you can provide examples of "median-like" >> aggregates that people might need to implement as user-defined >> aggregates, or other places where people would use this machinery, it >> will make your case stronger for this refactoring. > > There would be plenty of scope to re-use the machinery without any > SQL-level extensions. All you need is a polymorphic aggregate > transition function that maintains a tuplestore or whatever. > I don't see that extra syntax in CREATE AGGREGATE is really buying > much of anything. > Have we to use a transisdent function? If we implement median as special variant of aggregate - because we need to push an sort, then we can skip a transident function function - and call directly final function. This mechanism is used for aggregates with ORDER BY now. So there can be a special path for direct call of final func. There is useles to call transident function. Regards Pavel > regards, tom lane >
David Fetter <david@fetter.org> writes: > On Wed, Aug 18, 2010 at 10:39:33AM -0400, Tom Lane wrote: >> There would be plenty of scope to re-use the machinery without any >> SQL-level extensions. All you need is a polymorphic aggregate >> transition function that maintains a tuplestore or whatever. >> I don't see that extra syntax in CREATE AGGREGATE is really buying >> much of anything. > Thanks for clarifying. Might this help out with things like GROUPING > SETS or wCTEs? Don't see how --- this is just about what you can do within an aggregate. regards, tom lane
On Wed, Aug 18, 2010 at 04:46:57PM +0200, Pavel Stehule wrote: > 2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: > > David Fetter <david@fetter.org> writes: > >> Apart from the medians, which "median-like" aggregates do you > >> have in mind to start with? If you can provide examples of > >> "median-like" aggregates that people might need to implement as > >> user-defined aggregates, or other places where people would use > >> this machinery, it will make your case stronger for this > >> refactoring. > > > > There would be plenty of scope to re-use the machinery without any > > SQL-level extensions. All you need is a polymorphic aggregate > > transition function that maintains a tuplestore or whatever. I > > don't see that extra syntax in CREATE AGGREGATE is really buying > > much of anything. > > > > Have we to use a transisdent function? If we implement median as > special variant of aggregate - because we need to push an sort, then > we can skip a transident function function - and call directly final > function. This mechanism is used for aggregates with ORDER BY now. > So there can be a special path for direct call of final func. There > is useles to call transident function. Just a wacky idea here. Could we make a special state transition function called IDENTITY or some such that would turn into a noop? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Pavel Stehule <pavel.stehule@gmail.com> writes: > 2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: >> There would be plenty of scope to re-use the machinery without any >> SQL-level extensions. All you need is a polymorphic aggregate >> transition function that maintains a tuplestore or whatever. > Have we to use a transisdent function? If we implement median as > special variant of aggregate - because we need to push an sort, then > we can skip a transident function function - and call directly final > function. Well, that would require a whole bunch of *other* mechanisms, which you weren't saying anything about in your original proposal. But driving it off the transtype declaration would be quite inappropriate anyway IMO. regards, tom lane
2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> 2010/8/18 Tom Lane <tgl@sss.pgh.pa.us>: >>> There would be plenty of scope to re-use the machinery without any >>> SQL-level extensions. All you need is a polymorphic aggregate >>> transition function that maintains a tuplestore or whatever. > >> Have we to use a transisdent function? If we implement median as >> special variant of aggregate - because we need to push an sort, then >> we can skip a transident function function - and call directly final >> function. > > Well, that would require a whole bunch of *other* mechanisms, which you > weren't saying anything about in your original proposal. But driving > it off the transtype declaration would be quite inappropriate anyway IMO. > I'll test both variant first. Maybe there are not any significant difference between them. Now nodeAgg can build, fill a tuplesort. So I think is natural use it. It needs only one - skip a calling a transident function and directly call final function with external tuplesort. Minimally you don't need 2x same code. Regards Pavel Stehule > regards, tom lane >
Hello > > I'll test both variant first. Maybe there are not any significant > difference between them. Now nodeAgg can build, fill a tuplesort. So I > think is natural use it. It needs only one - skip a calling a > transident function and directly call final function with external > tuplesort. Minimally you don't need 2x same code. yesterday I did a small test. Aggregates without transident functions are only about 2% faster, so there has no sense thinking more about them. I'll send a patch with median and percentile functions immediately - these functions are implemented like usual aggregates. Regards Pavel > > Regards > > Pavel Stehule > >> regards, tom lane >> >
On Thu, Aug 19, 2010 at 12:45:13PM +0200, Pavel Stehule wrote: > Hello > > > I'll test both variant first. Maybe there are not any significant > > difference between them. Now nodeAgg can build, fill a tuplesort. > > So I think is natural use it. It needs only one - skip a calling a > > transident function and directly call final function with external > > tuplesort. Minimally you don't need 2x same code. > > yesterday I did a small test. Aggregates without transident > functions are only about 2% faster, so there has no sense thinking > more about them. I'll send a patch with median and percentile > functions immediately - these functions are implemented like usual > aggregates. NTILE is already a windowing function. Might want to check into any performance improvements you can give that. As to median, please make sure you say in detail which median you're using and name it so, as there is no single, authoritative median. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
2010/8/19 David Fetter <david@fetter.org>: > On Thu, Aug 19, 2010 at 12:45:13PM +0200, Pavel Stehule wrote: >> Hello >> >> > I'll test both variant first. Maybe there are not any significant >> > difference between them. Now nodeAgg can build, fill a tuplesort. >> > So I think is natural use it. It needs only one - skip a calling a >> > transident function and directly call final function with external >> > tuplesort. Minimally you don't need 2x same code. >> >> yesterday I did a small test. Aggregates without transident >> functions are only about 2% faster, so there has no sense thinking >> more about them. I'll send a patch with median and percentile >> functions immediately - these functions are implemented like usual >> aggregates. > > NTILE is already a windowing function. Might want to check into any > performance improvements you can give that. The performance has to be +/- same. It is based on same technology - tuplesort. > > As to median, please make sure you say in detail which median you're > using and name it so, as there is no single, authoritative median. I searched about this. And I found so there is a different methods for same statistic value with different names. But result has to be same. I don't found a other median than "arithmetic" or "financial" median (with a two derived forms - left, right median). These methods are a different forms of some SQL - see http://www.simple-talk.com/sql/t-sql-programming/median-workbench/ and it is SQL related solutions. With tuplesort I can to use a simple solution - I have a number of elements, have a sorted set and can move to n position is set. Next forms of median what I found are estimations. regards Pavel > > Cheers, > David. > -- > David Fetter <david@fetter.org> http://fetter.org/ > Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter > Skype: davidfetter XMPP: david.fetter@gmail.com > iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics > > Remember to vote! > Consider donating to Postgres: http://www.postgresql.org/about/donate >
On Thu, Aug 19, 2010 at 9:46 AM, David Fetter <david@fetter.org> wrote: > As to median, please make sure you say in detail which median you're > using and name it so, as there is no single, authoritative median. You've made this assertion at least three times now, but I confess that I've only ever learned one way to compute a median; and quick Google searches for "median", "kinds of median", and few other variants failed to turn up anything obvious either. It seems to me that if median is good enough for Oracle, Sybase, Excel, and the nun who taught my fifth-grade elementary school class, it ought to be good enough for us, too. (Don't mess with Sr. Catherine!) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > You've made this assertion at least three times now, but I confess > that I've only ever learned one way to compute a median; and quick > Google searches for "median", "kinds of median", and few other > variants failed to turn up anything obvious either. There are different ways to define it when the number of samples is even. However I believe that "use the mean of the two middle values" is much the most common way to deal with that. regards, tom lane
On Thu, Aug 19, 2010 at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> You've made this assertion at least three times now, but I confess >> that I've only ever learned one way to compute a median; and quick >> Google searches for "median", "kinds of median", and few other >> variants failed to turn up anything obvious either. > > There are different ways to define it when the number of samples is even. > However I believe that "use the mean of the two middle values" is much > the most common way to deal with that. I suppose there could also be a bit of an ambiguity if you're working with a type like int4 where the values are discrete steps. Like, what do you do with {1, 2}? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote: > I suppose there could also be a bit of an ambiguity if you're working > with a type like int4 where the values are discrete steps. Like, what > do you do with {1, 2}? The same thing you do with the avg function? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> I suppose there could also be a bit of an ambiguity if you're working >> with a type like int4 where the values are discrete steps. Like, what >> do you do with {1, 2}? Hmm, good point. > The same thing you do with the avg function? avg's approach is not at all datatype-independent though. If you're willing to give up the idea of a polymorphic median() function, that would probably be the thing to do. If not, you need to take the left or right one of the two central elements. regards, tom lane
On Thu, Aug 19, 2010 at 05:40:46PM -0400, Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > Robert Haas <robertmhaas@gmail.com> wrote: > >> I suppose there could also be a bit of an ambiguity if you're working > >> with a type like int4 where the values are discrete steps. Like, what > >> do you do with {1, 2}? > > Hmm, good point. > > > The same thing you do with the avg function? > > avg's approach is not at all datatype-independent though. If you're > willing to give up the idea of a polymorphic median() function, that > would probably be the thing to do. If not, you need to take the left > or right one of the two central elements. Whether the median needs to be in the sample is one question that doesn't really have a universal right answer. A separate question, also without a universal right answer, is whether the median needs to be in the same domain as the sample, int4 being the case above. We could just go with "whatever Oracle, DB2 and MS-SQL Server have," assuming it's the same thing, until something appears in the SQL standard. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Aug 19, 2010 at 5:53 PM, David Fetter <david@fetter.org> wrote: > On Thu, Aug 19, 2010 at 05:40:46PM -0400, Tom Lane wrote: >> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> > Robert Haas <robertmhaas@gmail.com> wrote: >> >> I suppose there could also be a bit of an ambiguity if you're working >> >> with a type like int4 where the values are discrete steps. Like, what >> >> do you do with {1, 2}? >> >> Hmm, good point. >> >> > The same thing you do with the avg function? >> >> avg's approach is not at all datatype-independent though. If you're >> willing to give up the idea of a polymorphic median() function, that >> would probably be the thing to do. If not, you need to take the left >> or right one of the two central elements. > > Whether the median needs to be in the sample is one question that > doesn't really have a universal right answer. A separate question, > also without a universal right answer, is whether the median needs to > be in the same domain as the sample, int4 being the case above. > > We could just go with "whatever Oracle, DB2 and MS-SQL Server have," > assuming it's the same thing, until something appears in the SQL > standard. That's usually a sensible default, when in doubt. If nothing else, it improves compatibility. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, Aug 19, 2010 at 06:03:57PM -0400, Robert Haas wrote: > On Thu, Aug 19, 2010 at 5:53 PM, David Fetter <david@fetter.org> wrote: > > On Thu, Aug 19, 2010 at 05:40:46PM -0400, Tom Lane wrote: > >> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > >> > Robert Haas <robertmhaas@gmail.com> wrote: > >> >> I suppose there could also be a bit of an ambiguity if you're working > >> >> with a type like int4 where the values are discrete steps. Like, what > >> >> do you do with {1, 2}? > >> > >> Hmm, good point. > >> > >> > The same thing you do with the avg function? > >> > >> avg's approach is not at all datatype-independent though. If > >> you're willing to give up the idea of a polymorphic median() > >> function, that would probably be the thing to do. If not, you > >> need to take the left or right one of the two central elements. > > > > Whether the median needs to be in the sample is one question that > > doesn't really have a universal right answer. A separate > > question, also without a universal right answer, is whether the > > median needs to be in the same domain as the sample, int4 being > > the case above. > > > > We could just go with "whatever Oracle, DB2 and MS-SQL Server > > have," assuming it's the same thing, until something appears in > > the SQL standard. > > That's usually a sensible default, when in doubt. If nothing else, > it improves compatibility. That's assuming we find such a thing, which I haven't yet. Trying to stay cheery, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate