Thread: two-argument aggregates and SQL 2003
Hello All,

I have been thinking about implementing some of the two-argument aggregate functions from SQL 2003, such as CORR(x,y), REGR_SLOPE(x,y), etc. (http://www.wiscorp.com/SQL2003Features.pdf, page 10).

1) I looked into how aggregate functions are created and used, and it seems to me that the structure of the pg_aggregate and pg_proc tables does not prevent the creation of two-argument aggregate functions. For each two-arg. aggregate, the corresponding three-arg. transition function and a two-arg. aggregate_dummy function should be added to pg_proc, and a record should be added to pg_aggregate. Nothing else, and nothing internal, needs to be changed to add new two-arg. aggregate functions to the core.
Am I right about this?

2) I also thought about allowing users to create new two-arg. aggregates. There I saw only one thing that could/should be changed: the handling of the BASETYPE attribute of the CREATE AGGREGATE command.

    CREATE AGGREGATE name (
        BASETYPE = input_data_type,
        SFUNC = sfunc,
        STYPE = state_data_type
        ...
    )

I am not very familiar with the parser/lexer details in Postgres, but would it be possible to allow something like this:

    CREATE AGGREGATE new_2arg_agg (
        BASETYPE = (int,int),
        ....
    )

to create two-arg. aggregates?

I'd like to hear any comments/advice/objections...

Regards,
Sergey

*****************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
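To make the "three-arg. transition function for a two-arg. aggregate" idea concrete, here is a minimal sketch in plain C, using a REGR_SLOPE-style aggregate as the example. The names SlopeState, slope_accum and slope_final are hypothetical and this is not PostgreSQL's function-manager interface or its actual implementation. It assumes the dependent variable is passed first, as in REGR_SLOPE(Y, X), and uses the naive sum-based formula with no attention to numerical stability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transition state for a REGR_SLOPE(y, x)-style aggregate. */
typedef struct SlopeState {
    long   n;    /* number of (y, x) pairs accumulated */
    double sx;   /* sum of x */
    double sy;   /* sum of y */
    double sxx;  /* sum of x*x */
    double sxy;  /* sum of x*y */
} SlopeState;

/* Transition function: three inputs (state, y, x) for a two-input aggregate. */
SlopeState slope_accum(SlopeState s, double y, double x)
{
    s.n   += 1;
    s.sx  += x;
    s.sy  += y;
    s.sxx += x * x;
    s.sxy += x * y;
    return s;
}

/*
 * Final function: writes the least-squares slope to *result and returns 0,
 * or returns -1 when the slope is undefined (no rows, or a zero denominator
 * because all x values are equal). The -1 case would map to SQL NULL.
 */
int slope_final(const SlopeState *s, double *result)
{
    assert(s != NULL && result != NULL);

    double n   = (double) s->n;
    double den = n * s->sxx - s->sx * s->sx;

    if (s->n == 0 || den == 0.0)
        return -1;

    *result = (n * s->sxy - s->sx * s->sy) / den;
    return 0;
}
```

A caller would start from a zeroed SlopeState, call slope_accum once per input row, and call slope_final at the end. That mirrors the transition function / final function split that pg_aggregate records for each aggregate.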
"Sergey E. Koposov" <math@sai.msu.ru> writes: > ... Nothing else and nothing internal need not to be changed to > insert new two-arg. aggregate functions into the core. > Am I right in this ? IIRC the main issues are the syntax of CREATE AGGREGATE and the actual implementation in nodeAgg.c. See previous discussions, eg http://archives.postgresql.org/pgsql-general/2006-03/msg00512.php I would really prefer to see CREATE AGGREGATE normalized to have a syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that matter):CREATE AGGREGATE aggname (typname [, ... ]) ...definition... but it's not clear how to get there without breaking backwards compatibility :-( regards, tom lane
On Thu, 13 Apr 2006, Tom Lane wrote:

> "Sergey E. Koposov" <math@sai.msu.ru> writes:
> > ... Nothing else, and nothing internal, needs to be changed to
> > add new two-arg. aggregate functions to the core.
> > Am I right about this?
>
> IIRC the main issues are the syntax of CREATE AGGREGATE and the actual
> implementation in nodeAgg.c. See previous discussions, eg
> http://archives.postgresql.org/pgsql-general/2006-03/msg00512.php

Actually, I think I'll try to implement that. I have already spent some time looking at the things that would need to change, and I have a question.

Does it make sense to extend aggregate functions to the two-argument case only? I mean, would that have a chance of being accepted? It seems that it would be much simpler for me to implement one- or two-arg. aggregates (rather than aggregates with ANY number of args), since that does not require variable-length arrays and the additional burden of memory allocation, contexts, etc...

> I would really prefer to see CREATE AGGREGATE normalized to have a
> syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that
> matter):
>     CREATE AGGREGATE aggname (typname [, ... ]) ...definition...
> but it's not clear how to get there without breaking backwards
> compatibility :-(

I don't know what to do about the CREATE AGGREGATE syntax. I think I won't work on that, since at a minimum I want to enable the core (not user-created) two-arg. aggregates. I hope that's acceptable...

Regards,
Sergey
"Sergey E. Koposov" <math@sai.msu.ru> writes: > Does it make sense to extend the aggregate > functions to the only two-argument case? No, I don't think so, for two reasons: 1. The user's-eye view: if someone wants 2 arguments, tomorrow he'll want 3, etc. There's an old saying that "the only good numbers in programming language design are zero, one, and N" --- if you allow more than one of anything, there shouldn't be an upper limit on how many you allow. In practice there are many places in PG where we break that rule to the extent of having a configurable upper limit (eg MAX_INDEX_KEYS) ... but small limits hard-wired into the code are just not pleasant. 2. The implementor's view: hard-wired limits are usually not that nice from a coding standpoint either. Polya's Inventors' Paradox states that "the more general problem may be easier to solve", and I've found that usually holds up in program design too. Code that handles exactly 2 of something is generally uglier and less maintainable than code that handles N of something, because for example you are tempted to duplicate chunks of code instead of turning them into loops. regards, tom lane
Tom Lane wrote:
> I would really prefer to see CREATE AGGREGATE normalized to have a
> syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that
> matter):
>     CREATE AGGREGATE aggname (typname [, ... ]) ...definition...
> but it's not clear how to get there without breaking backwards
> compatibility :-(

To turn the CREATE FUNCTION syntax into a new CREATE AGGREGATE syntax, we would need to modify a few things, I think:

    CREATE [ OR REPLACE ] FUNCTION
        name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
        [ RETURNS rettype ]
      { LANGUAGE langname
        | IMMUTABLE | STABLE | VOLATILE
        | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
        | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
        | AS 'definition'
        | AS 'obj_file', 'link_symbol'
      } ...
        [ WITH ( attribute [, ...] ) ]

1) Drop [ argmode ], because no OUT or INOUT parameters are possible.

2) Change the implicit meaning of the [ rettype ] parameter to not allow SETOF. (I'd love to have aggregate functions that take arbitrary numbers of rows as input and return arbitrary numbers of rows as output. But I'm guessing the internals of the backend would require much work to handle it?)

3) Add a state_data_type

4) Add an optional initial_condition

5) Add an optional sort_operator

6) Add some handling of final_function-like behavior, which I have not handled below. Should it be done like the current CREATE AGGREGATE syntax, where you must reference another function, or can anybody see a clean way to let this one function do it all in one shot?

This might give us, excluding any final_function syntax:

    CREATE [ OR REPLACE ] AGGREGATE
        name ( [ [ argname ] argtype [, ...] ] )
        STYPE state_data_type
        [ INITCOND initial_condition ]
        [ SORTOP sort_operator ]
        [ RETURNS rettype ]
      { LANGUAGE langname
        | IMMUTABLE | STABLE | VOLATILE
        | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
        | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
        | AS 'definition'
        | AS 'obj_file', 'link_symbol'
      } ...
        [ WITH ( attribute [, ...] ) ]

It seems that this syntax is distinct from the current syntax and that the parser could support both. Thoughts?
I wrote [ in an off-list reply to Mark Dilger ]:
> I don't think this solves the parsing problem at all. The problem as I
> see it is that given
>     CREATE AGGREGATE foo (bar ...
> it's not obvious whether bar is a def_elem name (old syntax) or a type
> name (new syntax). It's possible that we can get bison to eat both
> anyway on the basis that the lookahead token must be '=' at this point
> for old syntax while it could not be '=' for new syntax.

I did some idle investigation of this and found that it's indeed possible, as long as we make the further restriction that none of the definition-list keywords used by old-style CREATE AGGREGATE be keywords of the SQL grammar. (We could probably allow selected ones if we had to, but using ColLabel rather than IDENT in the patch below leads to tons of reduce/reduce conflicts...)

Attached is a proof-of-concept patch, which doesn't do anything useful as-is because none of the rest of the backend has been updated, but it does prove that bison can be made to handle CREATE AGGREGATE syntax with an initial list of type names. For instance the first example in
http://www.postgresql.org/docs/8.1/static/xaggr.html
would become

    CREATE AGGREGATE complex_sum (complex)
    (
        sfunc = complex_add,
        stype = complex,
        initcond = '(0,0)'
    );

I'm inclined to flesh this out and apply it with or without any further work by Sergey, simply because it makes the syntax of CREATE AGGREGATE more in harmony with DROP AGGREGATE and the other AGGREGATE commands. Any objections out there?

Another thing we could look into is doing something similar to CREATE OPERATOR, so that it names the new operator the same way you would do in DROP OPERATOR. Not sure if this is worth the trouble or not, as I don't find DROP OPERATOR amazingly intuitive.
            regards, tom lane

*** src/backend/parser/gram.y.orig	Wed Mar 22 19:19:29 2006
--- src/backend/parser/gram.y	Fri Apr 14 17:50:52 2006
***************
*** 224,231 ****

  %type <list>    stmtblock stmtmulti
                  OptTableElementList TableElementList OptInherit definition
                  opt_distinct opt_definition func_args
                  func_args_list func_as createfunc_opt_list alterfunc_opt_list
                  oper_argtypes RuleActionList RuleActionMulti
                  opt_column_list columnList opt_name_list
                  sort_clause opt_sort_clause sortby_list index_params
--- 224,232 ----

  %type <list>    stmtblock stmtmulti
                  OptTableElementList TableElementList OptInherit definition
                  opt_distinct opt_definition func_args
                  func_args_list func_as createfunc_opt_list alterfunc_opt_list
+                 aggr_args aggr_args_list old_aggr_definition old_aggr_list
                  oper_argtypes RuleActionList RuleActionMulti
                  opt_column_list columnList opt_name_list
                  sort_clause opt_sort_clause sortby_list index_params
***************
*** 246,252 ****
  %type <defelt>  createfunc_opt_item common_func_opt_item
  %type <fun_param> func_arg
  %type <fun_param_mode> arg_class
! %type <typnam>  func_return func_type aggr_argtype

  %type <boolean> TriggerForType OptTemp
  %type <oncommit> OnCommitOption
--- 247,253 ----
  %type <defelt>  createfunc_opt_item common_func_opt_item
  %type <fun_param> func_arg
  %type <fun_param_mode> arg_class
! %type <typnam>  func_return func_type

  %type <boolean> TriggerForType OptTemp
  %type <oncommit> OnCommitOption
***************
*** 285,291 ****

  %type <node>    TableElement ConstraintElem TableFuncElement
  %type <node>    columnDef
! %type <defelt>  def_elem
  %type <node>    def_arg columnElem where_clause
                  a_expr b_expr c_expr func_expr AexprConst indirection_el
                  columnref in_expr having_clause func_table array_expr
--- 286,292 ----

  %type <node>    TableElement ConstraintElem TableFuncElement
  %type <node>    columnDef
! %type <defelt>  def_elem old_aggr_elem
  %type <node>    def_arg columnElem where_clause
                  a_expr b_expr c_expr func_expr AexprConst indirection_el
                  columnref in_expr having_clause func_table array_expr
***************
*** 2671,2681 ****
   *****************************************************************************/

  DefineStmt:
!             CREATE AGGREGATE func_name definition
                  {
                      DefineStmt *n = makeNode(DefineStmt);
                      n->kind = OBJECT_AGGREGATE;
                      n->defnames = $3;
                      n->definition = $4;
                      $$ = (Node *)n;
                  }
--- 2672,2692 ----
   *****************************************************************************/

  DefineStmt:
!             CREATE AGGREGATE func_name aggr_args definition
                  {
                      DefineStmt *n = makeNode(DefineStmt);
                      n->kind = OBJECT_AGGREGATE;
                      n->defnames = $3;
+                     /* XXX put args somewhere */
+                     n->definition = $5;
+                     $$ = (Node *)n;
+                 }
+             | CREATE AGGREGATE func_name old_aggr_definition
+                 {
+                     /* old-style syntax for CREATE AGGREGATE */
+                     DefineStmt *n = makeNode(DefineStmt);
+                     n->kind = OBJECT_AGGREGATE;
+                     n->defnames = $3;
                      n->definition = $4;
                      $$ = (Node *)n;
                  }
***************
*** 2764,2769 ****
--- 2775,2802 ----
              | Sconst                        { $$ = (Node *)makeString($1); }
          ;

+ aggr_args:  '(' aggr_args_list ')'          { $$ = $2; }
+             | '(' '*' ')'                   { $$ = NIL; }
+         ;
+
+ aggr_args_list:
+             Typename                        { $$ = list_make1($1); }
+             | aggr_args_list ',' Typename   { $$ = lappend($1, $3); }
+         ;
+
+ old_aggr_definition: '(' old_aggr_list ')'  { $$ = $2; }
+         ;
+
+ old_aggr_list: old_aggr_elem                { $$ = list_make1($1); }
+             | old_aggr_list ',' old_aggr_elem   { $$ = lappend($1, $3); }
+         ;
+
+ old_aggr_elem:  IDENT '=' def_arg
+                 {
+                     $$ = makeDefElem($1, (Node *)$3);
+                 }
+         ;
+
  /*****************************************************************************
   *
***************
*** 2960,2966 ****
   *  COMMENT ON [ [ DATABASE | DOMAIN | INDEX | SEQUENCE | TABLE | TYPE | VIEW |
   *                 CONVERSION | LANGUAGE | OPERATOR CLASS | LARGE OBJECT |
   *                 CAST | COLUMN | SCHEMA | TABLESPACE | ROLE ] <objname> |
!  *               AGGREGATE <aggname> (<aggtype>) |
   *               FUNCTION <funcname> (arg1, arg2, ...) |
   *               OPERATOR <op> (leftoperand_typ, rightoperand_typ) |
   *               TRIGGER <triggername> ON <relname> |
--- 2993,2999 ----
   *  COMMENT ON [ [ DATABASE | DOMAIN | INDEX | SEQUENCE | TABLE | TYPE | VIEW |
   *                 CONVERSION | LANGUAGE | OPERATOR CLASS | LARGE OBJECT |
   *                 CAST | COLUMN | SCHEMA | TABLESPACE | ROLE ] <objname> |
!  *               AGGREGATE <aggname> (arg1, ...) |
   *               FUNCTION <funcname> (arg1, arg2, ...) |
   *               OPERATOR <op> (leftoperand_typ, rightoperand_typ) |
   *               TRIGGER <triggername> ON <relname> |
***************
*** 2980,2993 ****
                      n->comment = $6;
                      $$ = (Node *) n;
                  }
!             | COMMENT ON AGGREGATE func_name '(' aggr_argtype ')'
!             IS comment_text
                  {
                      CommentStmt *n = makeNode(CommentStmt);
                      n->objtype = OBJECT_AGGREGATE;
                      n->objname = $4;
!                     n->objargs = list_make1($6);
!                     n->comment = $9;
                      $$ = (Node *) n;
                  }
              | COMMENT ON FUNCTION func_name func_args IS comment_text
--- 3013,3025 ----
                      n->comment = $6;
                      $$ = (Node *) n;
                  }
!             | COMMENT ON AGGREGATE func_name aggr_args IS comment_text
                  {
                      CommentStmt *n = makeNode(CommentStmt);
                      n->objtype = OBJECT_AGGREGATE;
                      n->objname = $4;
!                     n->objargs = $5;
!                     n->comment = $7;
                      $$ = (Node *) n;
                  }
              | COMMENT ON FUNCTION func_name func_args IS comment_text
***************
*** 3844,3850 ****
   *      QUERY:
   *
   *      DROP FUNCTION funcname (arg1, arg2, ...) [ RESTRICT | CASCADE ]
!  *      DROP AGGREGATE aggname (aggtype) [ RESTRICT | CASCADE ]
   *      DROP OPERATOR opname (leftoperand_typ, rightoperand_typ) [ RESTRICT | CASCADE ]
   *
   *****************************************************************************/
--- 3876,3882 ----
   *      QUERY:
   *
   *      DROP FUNCTION funcname (arg1, arg2, ...) [ RESTRICT | CASCADE ]
!  *      DROP AGGREGATE aggname (arg1, ...) [ RESTRICT | CASCADE ]
   *      DROP OPERATOR opname (leftoperand_typ, rightoperand_typ) [ RESTRICT | CASCADE ]
   *
   *****************************************************************************/
***************
*** 3861,3881 ****
          ;

  RemoveAggrStmt:
!             DROP AGGREGATE func_name '(' aggr_argtype ')' opt_drop_behavior
                  {
                          RemoveAggrStmt *n = makeNode(RemoveAggrStmt);
                          n->aggname = $3;
!                         n->aggtype = $5;
!                         n->behavior = $7;
                          $$ = (Node *)n;
                  }
          ;

- aggr_argtype:
-             Typename                        { $$ = $1; }
-             | '*'                           { $$ = NULL; }
-         ;
-
  RemoveOperStmt:
              DROP OPERATOR any_operator '(' oper_argtypes ')' opt_drop_behavior
                  {
--- 3893,3908 ----
          ;

  RemoveAggrStmt:
!             DROP AGGREGATE func_name aggr_args opt_drop_behavior
                  {
                          RemoveAggrStmt *n = makeNode(RemoveAggrStmt);
                          n->aggname = $3;
!                         n->aggtype = $4;
!                         n->behavior = $5;
                          $$ = (Node *)n;
                  }
          ;

  RemoveOperStmt:
              DROP OPERATOR any_operator '(' oper_argtypes ')' opt_drop_behavior
                  {
***************
*** 4013,4025 ****
   *
   *****************************************************************************/

! RenameStmt: ALTER AGGREGATE func_name '(' aggr_argtype ')' RENAME TO name
                  {
                      RenameStmt *n = makeNode(RenameStmt);
                      n->renameType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = list_make1($5);
!                     n->newname = $9;
                      $$ = (Node *)n;
                  }
              | ALTER CONVERSION_P any_name RENAME TO name
--- 4040,4052 ----
   *
   *****************************************************************************/

! RenameStmt: ALTER AGGREGATE func_name aggr_args RENAME TO name
                  {
                      RenameStmt *n = makeNode(RenameStmt);
                      n->renameType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = $4;
!                     n->newname = $7;
                      $$ = (Node *)n;
                  }
              | ALTER CONVERSION_P any_name RENAME TO name
***************
*** 4153,4165 ****
   *****************************************************************************/

  AlterObjectSchemaStmt:
!             ALTER AGGREGATE func_name '(' aggr_argtype ')' SET SCHEMA name
                  {
                      AlterObjectSchemaStmt *n = makeNode(AlterObjectSchemaStmt);
                      n->objectType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = list_make1($5);
!                     n->newschema = $9;
                      $$ = (Node *)n;
                  }
              | ALTER DOMAIN_P any_name SET SCHEMA name
--- 4180,4192 ----
   *****************************************************************************/

  AlterObjectSchemaStmt:
!             ALTER AGGREGATE func_name aggr_args SET SCHEMA name
                  {
                      AlterObjectSchemaStmt *n = makeNode(AlterObjectSchemaStmt);
                      n->objectType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = $4;
!                     n->newschema = $7;
                      $$ = (Node *)n;
                  }
              | ALTER DOMAIN_P any_name SET SCHEMA name
***************
*** 4211,4223 ****
   *
   *****************************************************************************/

! AlterOwnerStmt: ALTER AGGREGATE func_name '(' aggr_argtype ')' OWNER TO RoleId
                  {
                      AlterOwnerStmt *n = makeNode(AlterOwnerStmt);
                      n->objectType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = list_make1($5);
!                     n->newowner = $9;
                      $$ = (Node *)n;
                  }
              | ALTER CONVERSION_P any_name OWNER TO RoleId
--- 4238,4250 ----
   *
   *****************************************************************************/

! AlterOwnerStmt: ALTER AGGREGATE func_name aggr_args OWNER TO RoleId
                  {
                      AlterOwnerStmt *n = makeNode(AlterOwnerStmt);
                      n->objectType = OBJECT_AGGREGATE;
                      n->object = $3;
!                     n->objarg = $4;
!                     n->newowner = $7;
                      $$ = (Node *)n;
                  }
              | ALTER CONVERSION_P any_name OWNER TO RoleId
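The disambiguation rule discussed with this patch (after the first identifier inside the parentheses, old syntax must continue with '=' while new syntax cannot) can be illustrated with a small hand-written check. This is only a sketch of the idea in plain C: bison makes the decision from its lookahead token inside the generated parser, not with code like this. The function classify_create_aggregate and the AggSyntax enum are hypothetical, and the identifier scanning is far simpler than the real lexer (no quoting, no qualified names).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { AGG_OLD_STYLE, AGG_NEW_STYLE, AGG_NOT_RECOGNIZED } AggSyntax;

/* Advance past any whitespace and return the new position. */
static const char *skip_spaces(const char *s)
{
    while (isspace((unsigned char) *s))
        s++;
    return s;
}

/*
 * 'rest' is the text that follows "CREATE AGGREGATE aggname".
 * Old style:  ( ident = value, ... )
 * New style:  ( typename [, ...] ) ( ...definition... )   or   ( * ) ...
 */
AggSyntax classify_create_aggregate(const char *rest)
{
    assert(rest != NULL);

    const char *s = skip_spaces(rest);
    if (*s != '(')
        return AGG_NOT_RECOGNIZED;
    s = skip_spaces(s + 1);

    /* '(' '*' ')' only exists in the new-style argument list. */
    if (*s == '*')
        return AGG_NEW_STYLE;

    /* Consume one simple identifier. */
    if (!isalpha((unsigned char) *s) && *s != '_')
        return AGG_NOT_RECOGNIZED;
    while (isalnum((unsigned char) *s) || *s == '_')
        s++;

    /* The token after the identifier decides: '=' means a def_elem name. */
    s = skip_spaces(s);
    return (*s == '=') ? AGG_OLD_STYLE : AGG_NEW_STYLE;
}
```

With this sketch, "(basetype = complex, ...)" classifies as old style, while "(complex) (sfunc = complex_add, ...)" and "(int, int) ..." classify as new style. The restriction to plain identifiers mirrors the patch's use of IDENT rather than ColLabel in old_aggr_elem.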
I wrote:
> ... Polya's Inventors' Paradox states that
> "the more general problem may be easier to solve", and I've found that
> usually holds up in program design too.

While fooling around with the grammar patch that I showed earlier today, I had an epiphany that might serve as illustration of the above. We have traditionally thought of COUNT(*) as an "aggregate over any base type". But wouldn't it be cleaner to think of it as an aggregate over zero inputs? That would get rid of the rather artificial need to convert COUNT(*) to COUNT(1). We would actually have two separate aggregate functions, which could most accurately be described as

    count()
    count(anyelement)

where the latter is the form that has the behavior of counting the non-null values of the input.

While this doesn't really simplify nodeAgg.c, it wouldn't add any complexity either (once the code has been recast to support variable numbers of arguments). And it seems to me that it clarifies the semantics noticeably --- in particular, there'd no longer be this weird special case that an aggregate over ANY should have a one-input transition function where everything else takes two-input. The rule would be simple: an N-input aggregate uses an N-plus-one-input transition function.

            regards, tom lane
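The "N-input aggregate uses an N-plus-one-input transition function" rule can be sketched for the two count variants as follows. This is a toy model in plain C, not nodeAgg.c: the function names are hypothetical, rows are modelled as an array of pointers to int, and a NULL pointer stands in for a SQL NULL input.

```c
#include <assert.h>
#include <stddef.h>

/* count(): zero inputs, so the transition function takes only the state. */
long count_star_trans(long state)
{
    return state + 1;
}

/*
 * count(anyelement): one input, so the transition function takes the state
 * plus that input. A NULL pointer models a SQL NULL, which is not counted.
 */
long count_any_trans(long state, const int *value)
{
    return (value != NULL) ? state + 1 : state;
}

/* Drive count() over nrows rows; no column values are needed at all. */
long run_count_star(size_t nrows)
{
    long state = 0;
    for (size_t i = 0; i < nrows; i++)
        state = count_star_trans(state);
    return state;
}

/* Drive count(anyelement) over nrows rows, each possibly NULL. */
long run_count_any(const int *const *rows, size_t nrows)
{
    assert(rows != NULL || nrows == 0);

    long state = 0;
    for (size_t i = 0; i < nrows; i++)
        state = count_any_trans(state, rows[i]);
    return state;
}
```

For three rows whose values are 10, NULL and 20, run_count_star returns 3 and run_count_any returns 2, which is the distinction between the zero-argument and one-argument forms described above.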
On Sat, Apr 15, 2006 at 12:51:24AM -0400, Tom Lane wrote:
> I wrote:
> > ... Polya's Inventors' Paradox states that
> > "the more general problem may be easier to solve", and I've found that
> > usually holds up in program design too.
>
> While fooling around with the grammar patch that I showed earlier today,
> I had an epiphany that might serve as illustration of the above. We
> have traditionally thought of COUNT(*) as an "aggregate over any base
> type". But wouldn't it be cleaner to think of it as an aggregate over
> zero inputs? That would get rid of the rather artificial need to
> convert COUNT(*) to COUNT(1). We would actually have two separate
> aggregate functions, which could most accurately be described as
>     count()
>     count(anyelement)
> where the latter is the form that has the behavior of counting the
> non-null values of the input.
>
> While this doesn't really simplify nodeAgg.c, it wouldn't add any
> complexity either (once the code has been recast to support variable
> numbers of arguments). And it seems to me that it clarifies the
> semantics noticeably --- in particular, there'd no longer be this weird
> special case that an aggregate over ANY should have a one-input
> transition function where everything else takes two-input. The rule
> would be simple: an N-input aggregate uses an N-plus-one-input
> transition function.

Speaking strictly from a user's PoV, I'm not sure this is a great idea, since it encourages non-standard code (AFAIK no one else accepts 'count()'), and getting rid of support for count(*) seems like a non-starter, so I'm not sure there's any benefit.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Sat, Apr 15, 2006 at 12:51:24AM -0400, Tom Lane wrote: >> I had an epiphany that might serve as illustration of the above. We >> have traditionally thought of COUNT(*) as an "aggregate over any base >> type". But wouldn't it be cleaner to think of it as an aggregate over >> zero inputs? > Speaking strictly from a users PoV, I'm not sure this is a great idea, > since it encourages non-standard code (AFAIK no one else accepts > 'count()'), and getting rid of support for count(*) seems like a > non-starter, so I'm not sure there's any benefit. Well, if you want, we can still insist that actual invocations of a zero-argument aggregate be spelled with (*). But from a conceptual and documentation standpoint we should think of them as zero-argument, not sort-of-one-argument. regards, tom lane