Thread: Review: listagg aggregate
Pavel, My review of your listagg patch. Submission Review ----------------- * The diff is a context diff and applies cleanly to HEAD (with just two hunks offset by 2 lines each). * There is documentation, though I'm not sure it needs to be mentioned in the string functions documentation. No harm init, I guess. I would like to see an example, though, and the documentation does not currently explain what each of the parameters arefor. In fact, it looks like all the existing aggregates take only one parameter, so there was not previously a need toexplain it. But listagg() has an optional second param. I think that the description should explain what it's for. * There are tests and they look fine. Usability Review ---------------- * The patch does in fact implement the aggregate function it describes, and OH YES do we want it (I've written my own inSQL a few times). * No, we don't already have it. * Yes it follows community-agreed behavior. I'm assuming that there is no special parsing of aggregate functions, so thesimple use of commas to separate the two parameters is appropriate, rather than using a keyword like MySQL's SEPARATORin the group_concat() aggregate. * No need to have pg_dump support, no dangers that I can see, looks like all the bases have been covered. Feature Test ------------ * Everything built cleanly, but I got an OID dupe error when I tried to init the DB. Looks like 2997 and 2998 have been usedfor something else since you created the patch. I changed them to 2995 and 2996 and then it worked. * The feature appears to work. I didn't see any tests for encodings or other data types, so I ran a few myself and they workfine: postgres=# select listagg(a, U&'-\0441\043B\043E\043D-') from (values('aaaa'),('bbbb'),('cccc' listagg -------------------------- aaaa-слон-bbbb-слон-cccc (1 row) postgres=# select listagg(a, U&'\2014') from (values(U&'\0441\043B\043E\043D'),(U&'d\0061t\+000061'),(U&'\0441\043B\043E\043D'))AS g(a); listagg ---------------- слон—data—слон (1 row) postgres=# select listagg(a::text) from (values(1),(2),(3)) AS g(a); listagg --------- 123 (1 row) Performance Review ------------------ No performance issues, except that it should be faster than a custom aggregate that does the same thing. To test, I createda quick custom aggregate (no second argument, alas, so listagg() is more flexible) like so: CREATE OR REPLACE FUNCTION a2s(ANYARRAY) RETURNS TEXT LANGUAGE SQL AS $$ SELECT array_to_string($1, ','); $$; CREATE AGGREGATE string_accum ( SFUNC = array_accum, BASETYPE = ANYELEMENT, STYPE = ANYARRAY, INITCOND = '{}', FINALFUNC = a2s ); Then I ran some simple tests (thanks for the clue, depesz): postgres=# select count(*) from (select string_accum(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000)i) AS x(i); count ------- 1 (1 row) Time: 1365.382 ms postgres=# select count(*) from (select listagg(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000)i) AS x(i); count ------- 1 (1 row) Time: 17.989 ms So overall, it looks like listagg() is 1-2 orders of magnitude faster. YMMV, and my system is built with --enable-cassertand --enable-debug. Still, good job. Coding Review ------------- * Is varchar.c really the best place to put the ListAggState struct and the listagg() function? I grepped the source forarray_agg() and it's in src/backend/utils/adt/array_userfuncs.c. Maybe there's an equivalent file for string functions?Otherwise, the style of the C code looks fine to my untrained eye. Actually, shouldn't it return text rather than varchar? * Does it really require four functions to do its work? Might there be some way to use the array_agg() C functions and thenjust a different final function to turn it into a string (using the internal array_to_string() function, perhaps)? I'mnot at all sure about it, but given how little code was required to create the same basic functionality in SQL, I'm surprisedthat the C implementation requires four functions (accumStringResult(), listagg1_transfn(), listagg2_transfn(),and listagg_finalfn()). Maybe they're required to make it fast and avoid the overhead of an array? * No compiler warnings, I never made it crash, good comments, does what it says on the tin. I doubt that there are any portabilityissues, as the code seems to use standard PostgreSQL internal macros and functions. Architecture Review ------------------- * No dependencies, things seem to make sense overall, notwithstanding my questions in the Coding Review. Review Review ------------- The only thing I haven't covered so far is the name. I agree with Tom's assertion that the name is awful. Sure there maybe a precedent in Oracle, but I hardly find that convincing (some of the big corporations seem to do a really shitty jobnaming things in their APIs). Given that we already have array_agg(), what about simply concat_agg(), as Tom (wincingly,I grant you) suggested? Or string_agg()? My bike shed is puce. Attached is a new patch with the changed OIDs and an added phrase in the documentation that indicates that the second argumentcan be used to separate the concatenated values. Best, David
Attachment
Hello 2010/1/22 David E. Wheeler <david@kineticode.com>: > Pavel, > > My review of your listagg patch. > > Submission Review > ----------------- > * The diff is a context diff and applies cleanly to HEAD (with just two hunks offset by 2 lines each). > > * There is documentation, though I'm not sure it needs to be mentioned in the string functions documentation. No harm init, I guess. > > I would like to see an example, though, and the documentation does not currently explain what each of the parameters arefor. In fact, it looks like all the existing aggregates take only one parameter, so there was not previously a need toexplain it. But listagg() has an optional second param. I think that the description should explain what it's for. > can I help with it, please. My English is terrible. > * There are tests and they look fine. > > Usability Review > ---------------- > * The patch does in fact implement the aggregate function it describes, and OH YES do we want it (I've written my own inSQL a few times). > > * No, we don't already have it. > > * Yes it follows community-agreed behavior. I'm assuming that there is no special parsing of aggregate functions, so thesimple use of commas to separate the two parameters is appropriate, rather than using a keyword like MySQL's SEPARATORin the group_concat() aggregate. > > * No need to have pg_dump support, no dangers that I can see, looks like all the bases have been covered. > > Feature Test > ------------ > * Everything built cleanly, but I got an OID dupe error when I tried to init the DB. Looks like 2997 and 2998 have beenused for something else since you created the patch. I changed them to 2995 and 2996 and then it worked. > * The feature appears to work. I didn't see any tests for encodings or other data types, so I ran a few myself and theywork fine: > > postgres=# select listagg(a, U&'-\0441\043B\043E\043D-') from (values('aaaa'),('bbbb'),('cccc' > listagg > -------------------------- > aaaa-слон-bbbb-слон-cccc > (1 row) > > postgres=# select listagg(a, U&'\2014') from (values(U&'\0441\043B\043E\043D'),(U&'d\0061t\+000061'),(U&'\0441\043B\043E\043D'))AS g(a); > listagg > ---------------- > слон—data—слон > (1 row) > > > postgres=# select listagg(a::text) from (values(1),(2),(3)) AS g(a); > listagg > --------- > 123 > (1 row) > > > Performance Review > ------------------ > > No performance issues, except that it should be faster than a custom aggregate that does the same thing. To test, I createda quick custom aggregate (no second argument, alas, so listagg() is more flexible) like so: > > CREATE OR REPLACE FUNCTION a2s(ANYARRAY) > RETURNS TEXT LANGUAGE SQL AS $$ > SELECT array_to_string($1, ','); > $$; > > CREATE AGGREGATE string_accum ( > SFUNC = array_accum, > BASETYPE = ANYELEMENT, > STYPE = ANYARRAY, > INITCOND = '{}', > FINALFUNC = a2s > ); > > Then I ran some simple tests (thanks for the clue, depesz): > > postgres=# select count(*) from (select string_accum(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000)i) AS x(i); > count > ------- > 1 > (1 row) > > Time: 1365.382 ms > > postgres=# select count(*) from (select listagg(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000)i) AS x(i); > count > ------- > 1 > (1 row) > > Time: 17.989 ms > > So overall, it looks like listagg() is 1-2 orders of magnitude faster. YMMV, and my system is built with --enable-cassertand --enable-debug. Still, good job. > > Coding Review > ------------- > > * Is varchar.c really the best place to put the ListAggState struct and the listagg() function? I grepped the source forarray_agg() and it's in src/backend/utils/adt/array_userfuncs.c. Maybe there's an equivalent file for string functions?Otherwise, the style of the C code looks fine to my untrained eye. > array user functions are used more time in pg core. The complexity of array functions are much higher, so I don't think we need special file. > Actually, shouldn't it return text rather than varchar? > I'll recheck it. I am sure so all parameters should be a text. > * Does it really require four functions to do its work? Might there be some way to use the array_agg() C functions andthen just a different final function to turn it into a string (using the internal array_to_string() function, perhaps)? We can, but it isn't good way. Processing of arrays is little bit more expensive then processing plain text. It is reason why listagg is faster, than your custom aggregate. More, the final function could be faster - the content is final. I'm not at all sure about it, but given how little code was required to create the same basic functionality in SQL, I'm surprised that the C implementation requires four functions (accumStringResult(), listagg1_transfn(), listagg2_transfn(), and listagg_finalfn()). Maybe they're required to make it fast and avoid the overhead of an array? It normal for aggregate functions. We need more transfn function, because we need two two variant: listagg(col), listagg(col, sep). Our coding guidlines doesn't advice share C functions - but these functions are +/- wrapper for accumStringResult - so there is zero overhead. > > * No compiler warnings, I never made it crash, good comments, does what it says on the tin. I doubt that there are anyportability issues, as the code seems to use standard PostgreSQL internal macros and functions. > > Architecture Review > ------------------- > > * No dependencies, things seem to make sense overall, notwithstanding my questions in the Coding Review. > > Review Review > ------------- > > The only thing I haven't covered so far is the name. I agree with Tom's assertion that the name is awful. Sure there maybe a precedent in Oracle, but I hardly find that convincing (some of the big corporations seem to do a really shitty jobnaming things in their APIs). Given that we already have array_agg(), what about simply concat_agg(), as Tom (wincingly,I grant you) suggested? Or string_agg()? I don't think. When we have function, with same parameters, same behave like some Oracle function, then I am strongly prefer Oracle name. I don't see any benefit from different name. It can only confuse developers and add the trable to people who porting applications. > > My bike shed is puce. > > Attached is a new patch with the changed OIDs and an added phrase in the documentation that indicates that the second argumentcan be used to separate the concatenated values. > > Best, > Thank you very much Regards Pavel > David > > > > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > >
On Jan 24, 2010, at 1:19 AM, Pavel Stehule wrote: > can I help with it, please. My English is terrible. Yes, I added a bit in the patch I submitted. > array user functions are used more time in pg core. The complexity of > array functions are much higher, so I don't think we need special > file. Okay. Should have tried it in PL/pgSQL then, perhaps. > I'll recheck it. I am sure so all parameters should be a text. Probably shouldn't go into varchar.c then, yes? > We can, but it isn't good way. Processing of arrays is little bit more > expensive then processing plain text. It is reason why listagg is > faster, than your custom aggregate. More, the final function could be > faster - the content is final. Understood. > It normal for aggregate functions. We need more transfn function, > because we need two two variant: listagg(col), listagg(col, sep). Our > coding guidlines doesn't advice share C functions - but these > functions are +/- wrapper for accumStringResult - so there is zero > overhead. Ah, okay, it's because of the second argument. Now I understand. > I don't think. When we have function, with same parameters, same > behave like some Oracle function, then I am strongly prefer Oracle > name. I don't see any benefit from different name. It can only confuse > developers and add the trable to people who porting applications. Meh. If the name is terrible, we don't have to use it, and it's easy enough to create an alias in SQL for those who needit. Best, David
On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote: > No performance issues ISTM that this class of function is inherently dangerous performance wise. * It looks incredibly easy to construct enormous lists. We should test the explosion limit of this to see how it is handled. Perhaps we need some parameter limits to control that, depending upon results. * Optimizer doesn't consider whether the result type of an aggregate get bigger as the aggregate processes more rows. If we're adding this function we should give some thought in that area also, or at least a comment to note that it can and will cause the optimizer problems in complex queries. -- Simon Riggs www.2ndQuadrant.com
2010/1/24 Simon Riggs <simon@2ndquadrant.com>: > On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote: >> No performance issues > > ISTM that this class of function is inherently dangerous performance > wise. there is potencial risk, but this risk isn't new. The almost all what you say is true for array aggregates. > > * It looks incredibly easy to construct enormous lists. We should test > the explosion limit of this to see how it is handled. Perhaps we need > some parameter limits to control that, depending upon results. There are no limit for generating large values - like bytea, xml, or text. There are not limit for array_accum or array(query). So, I don't think we need some special mechanism for listagg. If somebody will generate too large a value, then he will get "out of memory" exception. > > * Optimizer doesn't consider whether the result type of an aggregate get > bigger as the aggregate processes more rows. If we're adding this > function we should give some thought in that area also, or at least a > comment to note that it can and will cause the optimizer problems in > complex queries. > this is true, but this isn't some new. array_accum working well without optimizer problems. > -- > Simon Riggs www.2ndQuadrant.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote: >> No performance issues > ISTM that this class of function is inherently dangerous performance > wise. > * It looks incredibly easy to construct enormous lists. We should test > the explosion limit of this to see how it is handled. Perhaps we need > some parameter limits to control that, depending upon results. > * Optimizer doesn't consider whether the result type of an aggregate get > bigger as the aggregate processes more rows. If we're adding this > function we should give some thought in that area also, or at least a > comment to note that it can and will cause the optimizer problems in > complex queries. We have that problem already with array_agg(), and I don't recall many complaints about it. It might be worth worrying about at some point, but I don't think it's reasonable to insist that it be fixed before any more such aggregates are created. I agree that testing-to-failure would be a good idea just to be sure it fails cleanly. regards, tom lane
2010/1/24 David E. Wheeler <david@kineticode.com>: > On Jan 24, 2010, at 1:19 AM, Pavel Stehule wrote: > >> can I help with it, please. My English is terrible. > > Yes, I added a bit in the patch I submitted. > >> array user functions are used more time in pg core. The complexity of >> array functions are much higher, so I don't think we need special >> file. > > Okay. Should have tried it in PL/pgSQL then, perhaps. > :) - I'll look on it again. >> I'll recheck it. I am sure so all parameters should be a text. > > Probably shouldn't go into varchar.c then, yes? > >> We can, but it isn't good way. Processing of arrays is little bit more >> expensive then processing plain text. It is reason why listagg is >> faster, than your custom aggregate. More, the final function could be >> faster - the content is final. > > Understood. > >> It normal for aggregate functions. We need more transfn function, >> because we need two two variant: listagg(col), listagg(col, sep). Our >> coding guidlines doesn't advice share C functions - but these >> functions are +/- wrapper for accumStringResult - so there is zero >> overhead. > > Ah, okay, it's because of the second argument. Now I understand. > >> I don't think. When we have function, with same parameters, same >> behave like some Oracle function, then I am strongly prefer Oracle >> name. I don't see any benefit from different name. It can only confuse >> developers and add the trable to people who porting applications. > > Meh. If the name is terrible, we don't have to use it, and it's easy enough to create an alias in SQL for those who needit. > The "list" is common name for this content - it is usual in Microsoft SQL Server, it is usual in Oracle. Maybe we can vote about the name Regards Pavel Stehule > Best, > > David > >
>> I don't think. When we have function, with same parameters, same >> behave like some Oracle function, then I am strongly prefer Oracle >> name. I don't see any benefit from different name. It can only confuse >> developers and add the trable to people who porting applications. > > Meh. If the name is terrible, we don't have to use it, and it's easy enough to create an alias in SQL for those who needit. The corresponding function in Oracle is called wm_concat. In MySQL its called group_concat. I don't believe DB2 or SQL Server have built in equivalents. The Oracle name isn't really an option ("wm' stands for workspace manager) I think listagg or string_agg would be the most appropriate names. Oh and before Oracle had wm_concat, Tom Kyte wrote a function called stragg that was pretty popular. Scott
On sön, 2010-01-24 at 21:29 -0800, Scott Bailey wrote: > I think listagg or string_agg would be the most appropriate names. Oh > and before Oracle had wm_concat, Tom Kyte wrote a function called > stragg that was pretty popular. Well, xmlagg -> concatenates values to form xml datum array_agg -> concatenates values to form array datum ??? -> concatenates values to form string datum So it's pretty clear that listagg does not fit into this scheme.
2010/1/25 Peter Eisentraut <peter_e@gmx.net>: > On sön, 2010-01-24 at 21:29 -0800, Scott Bailey wrote: >> I think listagg or string_agg would be the most appropriate names. Oh >> and before Oracle had wm_concat, Tom Kyte wrote a function called >> stragg that was pretty popular. > > Well, > > xmlagg -> concatenates values to form xml datum > array_agg -> concatenates values to form array datum > ??? -> concatenates values to form string datum > > So it's pretty clear that listagg does not fit into this scheme. when you define list as text domain, then this the name is correct. searching "list sql string" on google. You can see, so "list" is usually used. Regards Pavel > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Pavel Stehule <pavel.stehule@gmail.com> writes: > 2010/1/25 Peter Eisentraut <peter_e@gmx.net>: >> xmlagg -> concatenates values to form xml datum >> array_agg -> concatenates values to form array datum >> ??? -> concatenates values to form string datum >> >> So it's pretty clear that listagg does not fit into this scheme. > when you define list as text domain, then this the name is correct. IOW, if you define away the problem then there's no problem? I agree that "list" is a terrible choice of name here. "string_agg" seemed reasonable and in keeping with the standardized "array_agg". regards, tom lane
2010/1/25 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> 2010/1/25 Peter Eisentraut <peter_e@gmx.net>: >>> xmlagg -> concatenates values to form xml datum >>> array_agg -> concatenates values to form array datum >>> ??? -> concatenates values to form string datum >>> >>> So it's pretty clear that listagg does not fit into this scheme. > >> when you define list as text domain, then this the name is correct. > > IOW, if you define away the problem then there's no problem? > > I agree that "list" is a terrible choice of name here. "string_agg" > seemed reasonable and in keeping with the standardized "array_agg". > I am not happy, I thing so we do bigger chaos then it is. But it hasn't progress. So I agree with name string_agg. In this case isn't a problem rename this function if somebody would. I'll send patch over hour. regards Pavel Stehule > regards, tom lane >
2010/1/25 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> 2010/1/25 Peter Eisentraut <peter_e@gmx.net>: >>> xmlagg -> concatenates values to form xml datum >>> array_agg -> concatenates values to form array datum >>> ??? -> concatenates values to form string datum >>> >>> So it's pretty clear that listagg does not fit into this scheme. > >> when you define list as text domain, then this the name is correct. > > IOW, if you define away the problem then there's no problem? > > I agree that "list" is a terrible choice of name here. "string_agg" > seemed reasonable and in keeping with the standardized "array_agg". actualised patch - the name is string_agg regards Pavel Stehule > > regards, tom lane >
Attachment
2010/1/24 Tom Lane <tgl@sss.pgh.pa.us>: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote: >>> No performance issues > >> ISTM that this class of function is inherently dangerous performance >> wise. > >> * It looks incredibly easy to construct enormous lists. We should test >> the explosion limit of this to see how it is handled. Perhaps we need >> some parameter limits to control that, depending upon results. > >> * Optimizer doesn't consider whether the result type of an aggregate get >> bigger as the aggregate processes more rows. If we're adding this >> function we should give some thought in that area also, or at least a >> comment to note that it can and will cause the optimizer problems in >> complex queries. > > We have that problem already with array_agg(), and I don't recall many > complaints about it. It might be worth worrying about at some point, > but I don't think it's reasonable to insist that it be fixed before > any more such aggregates are created. > > I agree that testing-to-failure would be a good idea just to be sure it > fails cleanly. postgres=# \timing Timing is on. postgres=# select pg_size_pretty(length(string_agg('012345678901234567890'::text,','))) from generate_series(1,10000000) g(i);pg_size_pretty ----------------210 MB (1 row) Time: 5831,218 ms postgres=# select pg_size_pretty(length(string_agg('012345678901234567890'::text,','))) from generate_series(1,50000000) g(i); ^[^[ERROR: out of memory DETAIL: Cannot enlarge string buffer containing 1073741812 bytes by 21 more bytes. postgres=# I thing, so 210 MB is more then is necessary :) Regards Pavel Stehule > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Jan 25, 2010, at 2:09 AM, Peter Eisentraut wrote: > xmlagg -> concatenates values to form xml datum > array_agg -> concatenates values to form array datum > ??? -> concatenates values to form string datum concat_agg(). David
On Jan 25, 2010, at 6:12 AM, Pavel Stehule wrote: > I am not happy, I thing so we do bigger chaos then it is. But it > hasn't progress. So I agree with name string_agg. In this case isn't a > problem rename this function if somebody would. Could you not use CREATE AGGREGATE to create an alias named listagg if you needed it? Or are we limited to a single argumentin that case? David
On Mon, Jan 25, 2010 at 2:27 PM, David E. Wheeler <david@kineticode.com> wrote: > On Jan 25, 2010, at 2:09 AM, Peter Eisentraut wrote: >> xmlagg -> concatenates values to form xml datum >> array_agg -> concatenates values to form array datum >> ??? -> concatenates values to form string datum > > concat_agg(). I like that one... ...Robert
2010/1/25 David E. Wheeler <david@kineticode.com>: > On Jan 25, 2010, at 6:12 AM, Pavel Stehule wrote: > >> I am not happy, I thing so we do bigger chaos then it is. But it >> hasn't progress. So I agree with name string_agg. In this case isn't a >> problem rename this function if somebody would. > > Could you not use CREATE AGGREGATE to create an alias named listagg if you needed it? Or are we limited to a single argumentin that case? we can, but without one common known name we will have a propriety name. Pavel > > David > > > >
2010/1/25 Robert Haas <robertmhaas@gmail.com>: > On Mon, Jan 25, 2010 at 2:27 PM, David E. Wheeler <david@kineticode.com> wrote: >> On Jan 25, 2010, at 2:09 AM, Peter Eisentraut wrote: >>> xmlagg -> concatenates values to form xml datum >>> array_agg -> concatenates values to form array datum >>> ??? -> concatenates values to form string datum >> >> concat_agg(). > > I like that one... > why is concat_agg better than listagg ? Pavel > ...Robert > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Jan 25, 2010, at 23:14, Pavel Stehule <pavel.stehule@gmail.com> wrote: > why is concat_agg better than listagg ? Because it's an aggregate that cocatenates values. It's not an aggregate that lists things. I also like concat_agg better than string_agg because it's not limited to acting on strings. Best, David
On Tue, Jan 26, 2010 at 1:08 PM, David E. Wheeler <david@kineticode.com> wrote: ..... > > Because it's an aggregate that cocatenates values. It's not an aggregate > that lists things. I also like concat_agg better than string_agg because > it's not limited to acting on strings. > ..... Given that it potentially produces a delimited list, not a straight conacatenation (and that list is unacceptable since it would be descriptive as a noun but not as a verb) would implode_agg not be the most descriptive name?
On Tue, Jan 26, 2010 at 1:23 PM, Alastair Turner <bell@ctrlf5.co.za> wrote: > ..... > > Given that it potentially produces a delimited list, not a straight > conacatenation (and that list is unacceptable since it would be > descriptive as a noun but not as a verb) would implode_agg not be the > most descriptive name? > Actually, scratch that. The other *agg functions are named for what they produce as output. Not for their process - as per the objection to list_agg and suggestions of conact_agg and implode_agg. string_agg would be consistent, which is a wonderful thing if you can get it in a naming scheme.
On tis, 2010-01-26 at 03:08 -0800, David E. Wheeler wrote: > On Jan 25, 2010, at 23:14, Pavel Stehule <pavel.stehule@gmail.com> > wrote: > > > why is concat_agg better than listagg ? > > Because it's an aggregate that cocatenates values. It's not an > aggregate that lists things. I also like concat_agg better than > string_agg because it's not limited to acting on strings. What else can it act on?
On Tue, Jan 26, 2010 at 2:14 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > 2010/1/25 Robert Haas <robertmhaas@gmail.com>: >> On Mon, Jan 25, 2010 at 2:27 PM, David E. Wheeler <david@kineticode.com> wrote: >>> On Jan 25, 2010, at 2:09 AM, Peter Eisentraut wrote: >>>> xmlagg -> concatenates values to form xml datum >>>> array_agg -> concatenates values to form array datum >>>> ??? -> concatenates values to form string datum >>> >>> concat_agg(). >> >> I like that one... > > why is concat_agg better than listagg ? Because it doesn't make lists. Honestly, I don't love concat_agg() either - why should something need to have agg in the name just because it's an aggregate? I think the most descriptive name would be something like concatenate_with_separator(), but that's kind of long. ...Robert
2010/1/26 Robert Haas <robertmhaas@gmail.com>: > On Tue, Jan 26, 2010 at 2:14 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> 2010/1/25 Robert Haas <robertmhaas@gmail.com>: >>> On Mon, Jan 25, 2010 at 2:27 PM, David E. Wheeler <david@kineticode.com> wrote: >>>> On Jan 25, 2010, at 2:09 AM, Peter Eisentraut wrote: >>>>> xmlagg -> concatenates values to form xml datum >>>>> array_agg -> concatenates values to form array datum >>>>> ??? -> concatenates values to form string datum >>>> >>>> concat_agg(). >>> >>> I like that one... >> >> why is concat_agg better than listagg ? > > Because it doesn't make lists. > > Honestly, I don't love concat_agg() either - why should something need > to have agg in the name just because it's an aggregate? I think the > most descriptive name would be something like > concatenate_with_separator(), but that's kind of long. This is never ending story :) MySQL has function concate_ws - but this function has different semantic. I thing so string_agg is short, and from one view consistent Pavel > > ...Robert >
On Jan 26, 2010, at 4:03, Peter Eisentraut <peter_e@gmx.net> wrote: > What else can it act on? Any data type, since they all can be converted to text. Integers would be a common choice. David
Pavel Stehule <pavel.stehule@gmail.com> writes: > 2010/1/25 Robert Haas <robertmhaas@gmail.com>: >> On Mon, Jan 25, 2010 at 2:27 PM, David E. Wheeler <david@kineticode.com> wrote: >>> concat_agg(). >> >> I like that one... > why is concat_agg better than listagg ? It isn't ... it's the wrong part of speech. "concat"enate is a verb, whereas the other functions we would like it to be named parallel to are using nouns there. (Yes, I know "array" can be used as a verb, but I don't think anyone reads it that way in "array_agg"...) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> why is concat_agg better than listagg ? > > It isn't ... it's the wrong part of speech. "concat"enate is a > verb, Concatenation is a noun. "concat" doesn't get far enough to distinguish. -Kevin
I don't know if it has anything to do with HS/SR, I haven't tried it on a single CVS vanilla installation (yet). To test with HS/SR, I've setup three 9.0devel instances (cvs as of today) on a single machine, one as a primary, and two as slaves. I used the instructions in http://wiki.postgresql.org/wiki/Streaming_Replication and ./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary \ --with-pgport=6565 --quiet --enable-depend --with-openssl--with-perl \ --with-libxml --with-libxslt These seem to work/replicate well. Now, when restoring a 700MB dump (made with a 9.0devel pg_dump + pg_restore) into the primary, errors like the following occur, on all three instances: FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory So, there are three logfiles: pgsql.sr_primary/logfile pgsql.sr_slavery/logfile pgsql.sr_slave02/logfile $ grep -E 'FATAL|ERROR' pgsql.sr_*/logfile pgsql.sr_primary/logfile:ERROR: canceling autovacuum task pgsql.sr_primary/logfile:ERROR: canceling autovacuum task pgsql.sr_primary/logfile:FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory pgsql.sr_primary/logfile:FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory pgsql.sr_primary/logfile:ERROR: canceling autovacuum task pgsql.sr_primary/logfile:ERROR: canceling autovacuum task pgsql.sr_slave02/logfile:ERROR: could not read xlog records: FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory pgsql.sr_slavery/logfile:ERROR: could not read xlog records: FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory This has happened several times, always 'segment' 255, 'log file' 1, 3 or 4. hth, Erik Rijkers
"David E. Wheeler" <david@kineticode.com> writes: > Because it's an aggregate that cocatenates values. It's not an > aggregate that lists things. I also like concat_agg better than > string_agg because it's not limited to acting on strings. But what it *produces* is a string. For comparison, the SQL-standard-specified array_agg produces arrays, but what it acts on isn't an array. regards, tom lane
"Erik Rijkers" <er@xs4all.nl> wrote: > FATAL: could not open file "pg_xlog/0000000100000001000000FF" > (log file 1, segment 255): No such file or directory Yeah, log file segment numbers skip FF; they go from FE to 00 with the "log file" number (the middle part) bumped by one. Whatever is coming up with that segment number is wrong. (Unless this is changing in 9.0?) -Kevin
On Jan 26, 2010, at 9:36 AM, Tom Lane wrote: > But what it *produces* is a string. For comparison, the > SQL-standard-specified array_agg produces arrays, but what it > acts on isn't an array. Meh. This is all just bike-shedding. I'm fine with string_agg(), though in truth none of the names has really been great.The inclusion of "agg" in the name is unfortunate. I'll have a look at Pavel's new patch now. David
"David E. Wheeler" <david@kineticode.com> writes: > Meh. This is all just bike-shedding. I'm fine with string_agg(), though in truth none of the names has really been great.The inclusion of "agg" in the name is unfortunate. Yeah, I wouldn't be for it either if it weren't for the precedent of array_agg. I was quite surprised that the SQL committee chose that name, because they've avoided using the term "aggregate function" at all, but there it is ... regards, tom lane
On Jan 25, 2010, at 6:56 AM, Pavel Stehule wrote: > actualised patch - the name is string_agg All looks fine except I'm getting this error during initdb: creating template1 database in /usr/local/pgsql-devel/data/base/1 ... FATAL: could not create unique index "pg_proc_oid_index" DETAIL: Key (oid)=(3031) is duplicated. child process exited with exit code 1 Would you mind re-submitting with unique OIDs? Thanks, David
On Tue, Jan 26, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> Because it's an aggregate that cocatenates values. It's not an >> aggregate that lists things. I also like concat_agg better than >> string_agg because it's not limited to acting on strings. > > But what it *produces* is a string. For comparison, the > SQL-standard-specified array_agg produces arrays, but what it > acts on isn't an array. This point is well-taken, but naming it string_agg() because it produces a string doesn't seem quite descriptive enough. We might someday (if we don't already) have a number of aggregates that produce an output that is a string; we can't name them all by the output type. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jan 26, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But what it *produces* is a string. �For comparison, the >> SQL-standard-specified array_agg produces arrays, but what it >> acts on isn't an array. > This point is well-taken, but naming it string_agg() because it > produces a string doesn't seem quite descriptive enough. We might > someday (if we don't already) have a number of aggregates that produce > an output that is a string; we can't name them all by the output type. True, but the same point could be made against array_agg, and that didn't stop the committee from choosing that name. As long as string_agg is the "most obvious" aggregate-to-string functionality, which ISTM it is, I think it's all right for it to have pride of place in naming. regards, tom lane
On Wed, Jan 27, 2010 at 2:19 AM, Erik Rijkers <er@xs4all.nl> wrote: > Now, when restoring a 700MB dump (made with a 9.0devel pg_dump + pg_restore) into the primary, > errors like the following occur, on all three instances: > > FATAL: could not open file "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such > file or directory Thanks for the report! This is the bug of SR :( I think that walsender wrongly treats the WAL-boundary. > pgsql.sr_slave02/logfile:ERROR: could not read xlog records: FATAL: could not open file > "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory > pgsql.sr_slavery/logfile:ERROR: could not read xlog records: FATAL: could not open file > "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory Also the ReadRecord() or its surrounding functions seem to have treated wrongly the WAL-boundary. I'll fix those bugs after a night's sleep. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jan 27, 2010 at 4:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > This is the bug of SR :( I think that walsender wrongly treats the WAL-boundary. The attached patch would fix this bug. >> pgsql.sr_slave02/logfile:ERROR: could not read xlog records: FATAL: could not open file >> "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory >> pgsql.sr_slavery/logfile:ERROR: could not read xlog records: FATAL: could not open file >> "pg_xlog/0000000100000001000000FF" (log file 1, segment 255): No such file or directory > > Also the ReadRecord() or its surrounding functions seem to have treated > wrongly the WAL-boundary. Oops! I've misread the log messages. This is not the bug. Because of the FATAL error from the primary, the standby seems to just emit an ERROR message, and exit. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Fujii Masao wrote: > *** a/src/backend/replication/walsender.c > --- b/src/backend/replication/walsender.c > *************** > *** 661,666 **** XLogSend(StringInfo outMsg) > --- 661,673 ---- > > sentPtr = endptr; > > + if (sentPtr.xrecoff >= XLogFileSize) > + { > + /* crossing a logid boundary */ > + sentPtr.xlogid += 1; > + sentPtr.xrecoff = 0; > + } > + > /* > * Read the log directly into the output buffer to prevent > * extra memcpy calls. > Before that, endptr is advanced using XLByteAdvance() macro, which does handle xlogid boundaries. Is XLByteAdvance() broken? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 27, 2010 at 7:05 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Before that, endptr is advanced using XLByteAdvance() macro, which does > handle xlogid boundaries. Is XLByteAdvance() broken? No. The cause of the bug is that endptr might be set to the SendRqstPtr that has crossed a xlogid boundary in the following code. > /* if we went beyond SendRqstPtr, back off */ > if (XLByteLT(SendRqstPtr, endptr)) > endptr = SendRqstPtr; Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > On Wed, Jan 27, 2010 at 7:05 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Before that, endptr is advanced using XLByteAdvance() macro, which does >> handle xlogid boundaries. Is XLByteAdvance() broken? > > No. The cause of the bug is that endptr might be set to the SendRqstPtr > that has crossed a xlogid boundary in the following code. > >> /* if we went beyond SendRqstPtr, back off */ >> if (XLByteLT(SendRqstPtr, endptr)) >> endptr = SendRqstPtr; But SendRqstPtr comes from LogwrtResult.Write, surely that's correct, no? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 27, 2010 at 8:37 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > But SendRqstPtr comes from LogwrtResult.Write, surely that's correct, no? Right. But the point is that LogwrtResult.Write might indicate 0/FF000000 because it's the last byte + 1 written out. XLogRead() treats this as the location where WAL begins to be read, so converts it to the WAL file name 0000000100000000000000FF by using XLByteToSeg. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
with actualised oids Regards Pavel Stehule 2010/1/26 David E. Wheeler <david@kineticode.com>: > On Jan 25, 2010, at 6:56 AM, Pavel Stehule wrote: > >> actualised patch - the name is string_agg > > All looks fine except I'm getting this error during initdb: > > creating template1 database in /usr/local/pgsql-devel/data/base/1 ... FATAL: could not create unique index "pg_proc_oid_index" > DETAIL: Key (oid)=(3031) is duplicated. > child process exited with exit code 1 > > Would you mind re-submitting with unique OIDs? > > Thanks, > > David
Attachment
Fujii Masao wrote: > On Wed, Jan 27, 2010 at 8:37 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> But SendRqstPtr comes from LogwrtResult.Write, surely that's correct, no? > > Right. But the point is that LogwrtResult.Write might indicate 0/FF000000 > because it's the last byte + 1 written out. XLogRead() treats this as the > location where WAL begins to be read, so converts it to the WAL file name > 0000000100000000000000FF by using XLByteToSeg. Ah, I see it now, thanks. How confusing. Your patch clearly works, but I wonder how we should logically treat that boundary value. You're setting sentPtr to the beginning of the logid, when LogwrtResult.Write is 0/FF000000. So you're logically thinking that the non-existent FF log segment has been sent to the standby as soon as the previous segment has been fully sent. Another option would be to set SendRqstPtr to the beginning of next logid, when LogwrtResult.Write is 0/FF000000.. Yet another option would be to set startptr to the beginning of next logid, when sentPtr is 0/FF000000: *************** *** 633,638 **** --- 633,648 ---- * WAL record. */ startptr = sentPtr; + if (startptr.xrecoff >= XLogFileSize) + { + /* + * crossing a logid boundary, skip the non-existent last log + * segment in previous logical log file. + */ + startptr.xlogid += 1; + startptr.xrecoff = 0; + } + endptr = startptr; XLByteAdvance(endptr, MAX_SEND_SIZE); /* round down to page boundary. */ I think I prefer that, it feels logically more correct. Setting sentPtr beyond LogwrtResult.Write feels wrong, even though 0/FF000000 and 1/00000000 are really the same physical location. I'll commit it that way. I also note that I broke that same thing in ReadRecord() in the retry-logic patch I just comitted. I'll fix that too. Thanks for the testing, Erik! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Jan 26, 2010 at 1:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jan 26, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> But what it *produces* is a string. For comparison, the >>> SQL-standard-specified array_agg produces arrays, but what it >>> acts on isn't an array. > >> This point is well-taken, but naming it string_agg() because it >> produces a string doesn't seem quite descriptive enough. We might >> someday (if we don't already) have a number of aggregates that produce >> an output that is a string; we can't name them all by the output type. > > True, but the same point could be made against array_agg, and that > didn't stop the committee from choosing that name. As long as > string_agg is the "most obvious" aggregate-to-string functionality, > which ISTM it is, I think it's all right for it to have pride of place > in naming. Maybe so, but personally, I'd still prefer something more descriptive. ...Robert
On Wed, January 27, 2010 17:38, Heikki Linnakangas wrote: > I'll commit it that way. This is just to let you know that these commits seem to have solved the bug. I repeated the original test (700 MB pg_restore into a primary, with two replicating slaves, on one machine). Checking afterwards: $ for port in 6565 6566 6567; do psql -qtAp $port -c "select pg_database_size('${PGDATABASE}'), version()"; done 17057193792|PostgreSQL 9.0devel-sr_primary [...] 17057193792|PostgreSQL 9.0devel-sr_slavery 17057193792|PostgreSQL 9.0devel-sr_slave02 So that looked promising :) To make sure I also COPYd all tables to file and diff'ed them. There were indeed no differences. I did get at one point in the sr_slave02 log: (consecutive lines) LOG: consistent recovery state reached at 0/2000000 LOG: database system is ready to accept read only connections LOG: could not send data to client: Broken pipe LOG: unexpected EOF on client connection but apparently this is not problematic? HS/SR is a fantastic set of features. I'll keep hammering away a bit at the dynamic duo; if you have specific testing ideas, let me know. Thanks! Erik Rijkers
On Jan 27, 2010, at 7:58 AM, Pavel Stehule wrote: > with actualised oids Thanks. Looks good, modulo my preference for concat_agg(). I'll mark it ready for committer. Best, David
On Thu, Jan 28, 2010 at 9:32 AM, Erik Rijkers <er@xs4all.nl> wrote: > On Wed, January 27, 2010 17:38, Heikki Linnakangas wrote: > >> I'll commit it that way. Thanks Heikki! Yeah, your approach makes more sense. > This is just to let you know that these commits seem to have solved the bug. Good! Thanks Erik! > I did get at one point in the sr_slave02 log: (consecutive lines) > > LOG: consistent recovery state reached at 0/2000000 > LOG: database system is ready to accept read only connections > LOG: could not send data to client: Broken pipe > LOG: unexpected EOF on client connection > > but apparently this is not problematic? It seems to me that you've unexpectedly killed (e.g., kill -9) a client connected to the standby during executing a read-only query. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Pavel Stehule <pavel.stehule@gmail.com> wrote: > with actualised oids I'm checking the patch for commit, and have a couple of comments. * I think we cannot cache the delimiter at the first call. For example, SELECT string_agg(elem, delim) FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); should return 'A+B*C' rather than 'A,B,C'. * Can we use StringInfo directly as the aggregate context instead of StringAggState? For the first reason, we need to drop 'delimiter' field from struct StringAggState. Now it has only StringInfo field. * We'd better avoiding to call text_to_cstring() for delimitors and elements for performance reason. We can use appendBinaryStringInfo() here. My proposal patch attached. Also, I've not changed it yet, but it might be considerable: * Do we need better names for string_agg1_transfn and string_agg2_transfn? They are almost "internal names", but we could have more like string_agg_with_sep_transfn. Comments? Regards, --- Takahiro Itagaki NTT Open Source Software Center
Attachment
On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: > * I think we cannot cache the delimiter at the first call. > For example, > SELECT string_agg(elem, delim) > FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); > should return 'A+B*C' rather than 'A,B,C'. Ooh, nice. > * Can we use StringInfo directly as the aggregate context instead of > StringAggState? For the first reason, we need to drop 'delimiter' field > from struct StringAggState. Now it has only StringInfo field. Makes sense. > * We'd better avoiding to call text_to_cstring() for delimitors and elements > for performance reason. We can use appendBinaryStringInfo() here. > > My proposal patch attached. > > Also, I've not changed it yet, but it might be considerable: > > * Do we need better names for string_agg1_transfn and string_agg2_transfn? > They are almost "internal names", but we could have more > like string_agg_with_sep_transfn. Yes please. > Comments? Patch looks great, thank you! David
2010/1/28 Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp>: > > Pavel Stehule <pavel.stehule@gmail.com> wrote: > >> with actualised oids > > I'm checking the patch for commit, and have a couple of comments. > > * I think we cannot cache the delimiter at the first call. > For example, > SELECT string_agg(elem, delim) > FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); > should return 'A+B*C' rather than 'A,B,C'. no I dislike it. This using is nonsense. Regards Pavel > > * Can we use StringInfo directly as the aggregate context instead of > StringAggState? For the first reason, we need to drop 'delimiter' field > from struct StringAggState. Now it has only StringInfo field. > > * We'd better avoiding to call text_to_cstring() for delimitors and elements > for performance reason. We can use appendBinaryStringInfo() here. > > My proposal patch attached. > > Also, I've not changed it yet, but it might be considerable: > > * Do we need better names for string_agg1_transfn and string_agg2_transfn? > They are almost "internal names", but we could have more > like string_agg_with_sep_transfn. > > Comments? > > Regards, > --- > Takahiro Itagaki > NTT Open Source Software Center > >
2010/1/28 David E. Wheeler <david@kineticode.com>: > On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: > >> * I think we cannot cache the delimiter at the first call. >> For example, >> SELECT string_agg(elem, delim) >> FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); >> should return 'A+B*C' rather than 'A,B,C'. > > Ooh, nice. > >> * Can we use StringInfo directly as the aggregate context instead of >> StringAggState? For the first reason, we need to drop 'delimiter' field >> from struct StringAggState. Now it has only StringInfo field. > > Makes sense. no, has not. Pavel > >> * We'd better avoiding to call text_to_cstring() for delimitors and elements >> for performance reason. We can use appendBinaryStringInfo() here. >> >> My proposal patch attached. >> >> Also, I've not changed it yet, but it might be considerable: >> >> * Do we need better names for string_agg1_transfn and string_agg2_transfn? >> They are almost "internal names", but we could have more >> like string_agg_with_sep_transfn. > > Yes please. > >> Comments? > > Patch looks great, thank you! > > David > > >
2010/1/28 Pavel Stehule <pavel.stehule@gmail.com>: > 2010/1/28 David E. Wheeler <david@kineticode.com>: >> On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: >> >>> * I think we cannot cache the delimiter at the first call. >>> For example, >>> SELECT string_agg(elem, delim) >>> FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); >>> should return 'A+B*C' rather than 'A,B,C'. >> >> Ooh, nice. >> >>> * Can we use StringInfo directly as the aggregate context instead of >>> StringAggState? For the first reason, we need to drop 'delimiter' field >>> from struct StringAggState. Now it has only StringInfo field. >> >> Makes sense. > > no, has not. What is use case for this behave?? Pavel > > Pavel > >> >>> * We'd better avoiding to call text_to_cstring() for delimitors and elements >>> for performance reason. We can use appendBinaryStringInfo() here. >>> >>> My proposal patch attached. >>> >>> Also, I've not changed it yet, but it might be considerable: >>> >>> * Do we need better names for string_agg1_transfn and string_agg2_transfn? >>> They are almost "internal names", but we could have more >>> like string_agg_with_sep_transfn. >> >> Yes please. >> >>> Comments? >> >> Patch looks great, thank you! >> >> David >> >> >> >
Pavel Stehule <pavel.stehule@gmail.com> wrote: > 2010/1/28 Pavel Stehule <pavel.stehule@gmail.com>: > > 2010/1/28 David E. Wheeler <david@kineticode.com>: > >> On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: > >> > >>> * I think we cannot cache the delimiter at the first call. > >>> For example, > >>> SELECT string_agg(elem, delim) > >>> FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); > >>> should return 'A+B*C' rather than 'A,B,C'. > > > > no, has not. > What is use case for this behave?? I also think this usage is nonsense, but seems to be the most consistent behavior for me. I didn't say anything about use-cases, but just capability. Since we allow such kinds of usage for now, you need to verify the delimiter is not changed rather than ignoring it if you want disallow to change the delimiter during an aggregation. Of course you can cache the first delimiter at start, and check delimiters are not changed every calls -- but I think it is just a waste of cpu cycle. Regards, --- Takahiro Itagaki NTT Open Source Software Center
On Thu, Jan 28, 2010 at 3:59 AM, Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> wrote: > Pavel Stehule <pavel.stehule@gmail.com> wrote: > >> 2010/1/28 Pavel Stehule <pavel.stehule@gmail.com>: >> > 2010/1/28 David E. Wheeler <david@kineticode.com>: >> >> On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: >> >> >> >>> * I think we cannot cache the delimiter at the first call. >> >>> For example, >> >>> SELECT string_agg(elem, delim) >> >>> FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); >> >>> should return 'A+B*C' rather than 'A,B,C'. >> > >> > no, has not. >> What is use case for this behave?? > > I also think this usage is nonsense, but seems to be the most consistent > behavior for me. I didn't say anything about use-cases, but just capability. > Since we allow such kinds of usage for now, you need to verify the > delimiter is not changed rather than ignoring it if you want disallow > to change the delimiter during an aggregation. > > Of course you can cache the first delimiter at start, and check delimiters > are not changed every calls -- but I think it is just a waste of cpu cycle. Agreed. Not caching it seems the simplest solution. ...Robert
2010/1/28 Robert Haas <robertmhaas@gmail.com>: > On Thu, Jan 28, 2010 at 3:59 AM, Takahiro Itagaki > <itagaki.takahiro@oss.ntt.co.jp> wrote: >> Pavel Stehule <pavel.stehule@gmail.com> wrote: >> >>> 2010/1/28 Pavel Stehule <pavel.stehule@gmail.com>: >>> > 2010/1/28 David E. Wheeler <david@kineticode.com>: >>> >> On Jan 27, 2010, at 6:47 PM, Takahiro Itagaki wrote: >>> >> >>> >>> * I think we cannot cache the delimiter at the first call. >>> >>> For example, >>> >>> SELECT string_agg(elem, delim) >>> >>> FROM (VALUES('A', ','), ('B', '+'), ('C', '*')) t(elem, delim); >>> >>> should return 'A+B*C' rather than 'A,B,C'. >>> > >>> > no, has not. >>> What is use case for this behave?? >> >> I also think this usage is nonsense, but seems to be the most consistent >> behavior for me. I didn't say anything about use-cases, but just capability. >> Since we allow such kinds of usage for now, you need to verify the >> delimiter is not changed rather than ignoring it if you want disallow >> to change the delimiter during an aggregation. >> >> Of course you can cache the first delimiter at start, and check delimiters >> are not changed every calls -- but I think it is just a waste of cpu cycle. > > Agreed. Not caching it seems the simplest solution. simplest could not be a best. There have to be only a const expression. But we have not possibility to check it in pg. Pavel > > ...Robert >
On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > simplest could not be a best. There have to be only a const > expression. But we have not possibility to check it in pg. Well... that's an entirely arbitrary limitation. I admit that it doesn't seem likely that someone would want to have a variable delimiter, but putting extra effort and code complexity into preventing it seems pointless. ...Robert
2010/1/28 Robert Haas <robertmhaas@gmail.com>: > On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> simplest could not be a best. There have to be only a const >> expression. But we have not possibility to check it in pg. > > Well... that's an entirely arbitrary limitation. I admit that it > doesn't seem likely that someone would want to have a variable > delimiter, but putting extra effort and code complexity into > preventing it seems pointless. It is only a few lines with zero complexity. The main issue of Takahiro proposal is "unclean" behave. we can have a content c1 c2 ----------- c11, c12, c21, c22 and result of string_agg(c1, c2) have to be ?? c11 c12 c21 or c11 c22 c21 ?? What if some content of c2 will be NULL ?? I checked oracle. Oracle doesn't allow variable as delimiter. We can't check it. But we can fix first value and using it as constant. More - Takahiro proposal has one performance gain. It have to deTOAST separator value in every cycle. Takahiro has true - we can store length of separator and we can add separator to cumulative value as binary value. I prefer original behave with note in documentation - so as separator is used a value on first row, when is used expression and not constant. Regards Pavel > > ...Robert >
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> simplest could not be a best. There have to be only a const >> expression. But we have not possibility to check it in pg. > Well... that's an entirely arbitrary limitation. I admit that it > doesn't seem likely that someone would want to have a variable > delimiter, but putting extra effort and code complexity into > preventing it seems pointless. Yeah. The real issue here is that in some cases you'd like to have non-aggregated parameters to an aggregate, but SQL has no notation to express that. I think Pavel's underlying complaint is that if the delimiter argument isn't constant, then we're exposing an implementation dependency in terms of just which values get separated by which delimiters. The most practical implementation seems to be that the first-call delimiter isn't actually used at all, and on subsequent calls the delimiter *precedes* the associated value, which is a bit surprising given the order in which one writes them. Not sure if this is worth documenting though. Those two or three people who actually try it will figure it out soon enough. regards, tom lane
On Thu, Jan 28, 2010 at 10:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>> simplest could not be a best. There have to be only a const >>> expression. But we have not possibility to check it in pg. > >> Well... that's an entirely arbitrary limitation. I admit that it >> doesn't seem likely that someone would want to have a variable >> delimiter, but putting extra effort and code complexity into >> preventing it seems pointless. > > Yeah. The real issue here is that in some cases you'd like to have > non-aggregated parameters to an aggregate, but SQL has no notation > to express that. Right. > I think Pavel's underlying complaint is that if the delimiter > argument isn't constant, then we're exposing an implementation > dependency in terms of just which values get separated by which > delimiters. The most practical implementation seems to be that > the first-call delimiter isn't actually used at all, and on > subsequent calls the delimiter *precedes* the associated value, > which is a bit surprising given the order in which one writes > them. Not sure if this is worth documenting though. Those two > or three people who actually try it will figure it out soon enough. Yeah, I'm thoroughly unworried about it. ...Robert
On Wed, Jan 27, 2010 at 9:47 PM, Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> wrote: > * Do we need better names for string_agg1_transfn and string_agg2_transfn? > They are almost "internal names", but we could have more > like string_agg_with_sep_transfn. Surely the names of the transition and final functions should match the name of the aggregate. But if there's a proposal on the table to call this string_app_with_sep() instead of just string_agg(), +1 from me. ...Robert
2010/1/29 Pavel Stehule <pavel.stehule@gmail.com>: > 2010/1/28 Robert Haas <robertmhaas@gmail.com>: >> On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>> simplest could not be a best. There have to be only a const >>> expression. But we have not possibility to check it in pg. >> >> Well... that's an entirely arbitrary limitation. I admit that it >> doesn't seem likely that someone would want to have a variable >> delimiter, but putting extra effort and code complexity into >> preventing it seems pointless. > > It is only a few lines with zero complexity. > > The main issue of Takahiro proposal is "unclean" behave. > > we can have a content > > c1 c2 > ----------- > c11, c12, > c21, c22 > > and result of string_agg(c1, c2) > > have to be ?? c11 c12 c21 or c11 c22 c21 ?? What if some content of c2 > will be NULL ?? I checked oracle. Oracle doesn't allow variable as > delimiter. We can't check it. But we can fix first value and using it > as constant. What about get_fn_expr_arg_stable() to check if the argument is stable during aggregate? Regards, -- Hitoshi Harada
2010/1/28 Robert Haas <robertmhaas@gmail.com>: > On Thu, Jan 28, 2010 at 10:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>>> simplest could not be a best. There have to be only a const >>>> expression. But we have not possibility to check it in pg. >> >>> Well... that's an entirely arbitrary limitation. I admit that it >>> doesn't seem likely that someone would want to have a variable >>> delimiter, but putting extra effort and code complexity into >>> preventing it seems pointless. >> >> Yeah. The real issue here is that in some cases you'd like to have >> non-aggregated parameters to an aggregate, but SQL has no notation >> to express that. > > Right. > >> I think Pavel's underlying complaint is that if the delimiter >> argument isn't constant, then we're exposing an implementation >> dependency in terms of just which values get separated by which >> delimiters. The most practical implementation seems to be that >> the first-call delimiter isn't actually used at all, and on >> subsequent calls the delimiter *precedes* the associated value, >> which is a bit surprising given the order in which one writes >> them. Not sure if this is worth documenting though. Those two >> or three people who actually try it will figure it out soon enough. > ok - there is one query, in 99.99% the second argument will be a constant. Can we use this information and optimize function for this case? The detoast on every row can take some percent from a performance. Pavel > Yeah, I'm thoroughly unworried about it. > > ...Robert >
2010/1/28 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 27, 2010 at 9:47 PM, Takahiro Itagaki > <itagaki.takahiro@oss.ntt.co.jp> wrote: >> * Do we need better names for string_agg1_transfn and string_agg2_transfn? >> They are almost "internal names", but we could have more >> like string_agg_with_sep_transfn. > > Surely the names of the transition and final functions should match > the name of the aggregate. But if there's a proposal on the table to > call this string_app_with_sep() instead of just string_agg(), +1 from > me. no, >>>string_app_with_sep<<< is too long for common using. Pavel > > ...Robert >
Pavel Stehule <pavel.stehule@gmail.com> writes: > in 99.99% the second argument will be a constant. Can we use this > information and optimize function for this case? > The detoast on every row can take some percent from a performance. What detoast? There won't be one for a constant, nor even for a variable in any sane situation --- who's going to be using multi-kilobyte delimiter values? And if they do, aren't they likely to run out of memory for the result long before the repeated detoasts become an interesting problem? You're arguing about a case that seems quite irrelevant to the real world. regards, tom lane
2010/1/28 Hitoshi Harada <umi.tanuki@gmail.com>: > 2010/1/29 Pavel Stehule <pavel.stehule@gmail.com>: >> 2010/1/28 Robert Haas <robertmhaas@gmail.com>: >>> On Thu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>>> simplest could not be a best. There have to be only a const >>>> expression. But we have not possibility to check it in pg. >>> >>> Well... that's an entirely arbitrary limitation. I admit that it >>> doesn't seem likely that someone would want to have a variable >>> delimiter, but putting extra effort and code complexity into >>> preventing it seems pointless. >> >> It is only a few lines with zero complexity. >> >> The main issue of Takahiro proposal is "unclean" behave. >> >> we can have a content >> >> c1 c2 >> ----------- >> c11, c12, >> c21, c22 >> >> and result of string_agg(c1, c2) >> >> have to be ?? c11 c12 c21 or c11 c22 c21 ?? What if some content of c2 >> will be NULL ?? I checked oracle. Oracle doesn't allow variable as >> delimiter. We can't check it. But we can fix first value and using it >> as constant. > > What about get_fn_expr_arg_stable() to check if the argument is stable > during aggregate? I newer know so this function exists. Now we can a) check and allow only stable params b) when second parameter is stable, then store it and use it as constant. I prefer a) Pavel > > Regards, > > > -- > Hitoshi Harada >
2010/1/28 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> in 99.99% the second argument will be a constant. Can we use this >> information and optimize function for this case? > >> The detoast on every row can take some percent from a performance. > > What detoast? There won't be one for a constant, nor even for a > variable in any sane situation --- who's going to be using > multi-kilobyte delimiter values? And if they do, aren't they likely > to run out of memory for the result long before the repeated detoasts > become an interesting problem? You're arguing about a case that > seems quite irrelevant to the real world. > I asking with get_fn_expr_arg_stable() we are able to fix second parameter without some performance issues. Regards Pavel > regards, tom lane >
Hitoshi Harada <umi.tanuki@gmail.com> writes: > What about get_fn_expr_arg_stable() to check if the argument is stable > during aggregate? Seems quite expensive (possible catalog lookups) and there still isn't any strong argument why we should bother. regards, tom lane
Pavel Stehule <pavel.stehule@gmail.com> writes: > with get_fn_expr_arg_stable() we are able to fix second parameter > without some performance issues. No, that will create its own performance issues --- get_fn_expr_arg_stable isn't especially cheap. If there were a really strong reason why we had to do it, then I'd agree, but frankly the argument for disallowing a variable delimiter is too thin. regards, tom lane
On 2010-01-28 19:17, Pavel Stehule wrote: > 2010/1/28 Hitoshi Harada <umi.tanuki@gmail.com>: >> What about get_fn_expr_arg_stable() to check if the argument is stable >> during aggregate? > > I newer know so this function exists. Now we can > > a) check and allow only stable params > b) when second parameter is stable, then store it and use it as constant. > > I prefer a) Someone might have a perfectly good use case for using different delimiters. I don't think it's a good idea to be artificially limiting what you can and can't do. Regards, Marko Tiikkaja
On Jan 28, 2010, at 9:29 AM, Marko Tiikkaja wrote: > Someone might have a perfectly good use case for using different > delimiters. I don't think it's a good idea to be artificially limiting > what you can and can't do. +1 David
<p>One situation where this could actually matter in the long term is if we want to have an optimization for aggregate functionswhose state variables can be combined. this could be important if we ever want to do parallel processing someday.<p>Sowe could have, for example two subjobs build two sublists and then combiner the two lists later. in that casethe separator might not be the one the user expected - our put another way the one the user expected might not be availablewhen we need it.<p>We could say this isn't a problem because not all aggregate functions will be amenable to suchan optimization and perhaps this will just be one of them. Alternately we could just have faith that a solution willbe found - it doesn't seem like it should be an insoluble problem to me.<p>greg<p><blockquote type="cite">On 28 Jan 201015:57, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br /><br /><p><font color="#500050">RobertHaas <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>> writes:<br /> > OnThu, Jan 28, 2010 at 9:01 AM, Pavel Stehule <pavel....</font>Yeah. The real issue here is that in some cases you'dlike to have<br /> non-aggregated parameters to an aggregate, but SQL has no notation<br /> to express that.<br /><br/> I think Pavel's underlying complaint is that if the delimiter<br /> argument isn't constant, then we're exposingan implementation<br /> dependency in terms of just which values get separated by which<br /> delimiters. The mostpractical implementation seems to be that<br /> the first-call delimiter isn't actually used at all, and on<br /> subsequentcalls the delimiter *precedes* the associated value,<br /> which is a bit surprising given the order in which onewrites<br /> them. Not sure if this is worth documenting though. Those two<br /> or three people who actually try itwill figure it out soon enough.<br /><br /> regards, tom lane<br /><p><font color="#500050"><br/>-- <br />Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/>To make changes to your subs...</font></blockquote>
2010/1/28 Tom Lane <tgl@sss.pgh.pa.us>: > Pavel Stehule <pavel.stehule@gmail.com> writes: >> with get_fn_expr_arg_stable() we are able to fix second parameter >> without some performance issues. > > No, that will create its own performance issues --- > get_fn_expr_arg_stable isn't especially cheap. > If there were a really strong reason why we had to do it, > then I'd agree, but frankly the argument for disallowing > a variable delimiter is too thin. it could be called only once. But I agree, so could be better, check it in parser time. ok - I am only one who like original behave - so I am withdrawing. Robert, If you like, please commit Itagaki's patch. + add note about behave when second parameter isn't stable. Regards Pavel Stehule > > regards, tom lane >
On Fri, Jan 29, 2010 at 2:43 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > 2010/1/28 Tom Lane <tgl@sss.pgh.pa.us>: >> Pavel Stehule <pavel.stehule@gmail.com> writes: >>> with get_fn_expr_arg_stable() we are able to fix second parameter >>> without some performance issues. >> >> No, that will create its own performance issues --- >> get_fn_expr_arg_stable isn't especially cheap. >> If there were a really strong reason why we had to do it, >> then I'd agree, but frankly the argument for disallowing >> a variable delimiter is too thin. > > it could be called only once. But I agree, so could be better, check > it in parser time. > > ok - I am only one who like original behave - so I am withdrawing. > Robert, If you like, please commit Itagaki's patch. + add note about > behave when second parameter isn't stable. I haven't even looked at this code - I sort of assumed Itagaki was handling this one. But it might be good to make sure that the docs have been read through by a native English speaker prior to commit... ...Robert
On Jan 29, 2010, at 10:43 AM, Robert Haas wrote: > I haven't even looked at this code - I sort of assumed Itagaki was > handling this one. But it might be good to make sure that the docs > have been read through by a native English speaker prior to commit... I did and revised them slightly. There isn't much, just a brief comment in the table of aggregate functions. The documentationfor all the functions on that page could use a little love, frankly. Best, David
On Fri, Jan 29, 2010 at 1:45 PM, David E. Wheeler <david@kineticode.com> wrote: > On Jan 29, 2010, at 10:43 AM, Robert Haas wrote: > >> I haven't even looked at this code - I sort of assumed Itagaki was >> handling this one. But it might be good to make sure that the docs >> have been read through by a native English speaker prior to commit... > > I did and revised them slightly. There isn't much, just a brief comment in the table of aggregate functions. The documentationfor all the functions on that page could use a little love, frankly. Want to take a short at it? ...Robert
On Jan 29, 2010, at 10:46 AM, Robert Haas wrote: >> I did and revised them slightly. There isn't much, just a brief comment in the table of aggregate functions. The documentationfor all the functions on that page could use a little love, frankly. > > Want to take a short at it? ENOTUITS! /me is already sorely over-committed… David
Robert Haas <robertmhaas@gmail.com> wrote: > > ok - I am only one who like original behave - so I ?am withdrawing. > > Robert, If you like, please commit Itagaki's patch. + add note about > > behave when second parameter isn't stable. > > I haven't even looked at this code - I sort of assumed Itagaki was > handling this one. But it might be good to make sure that the docs > have been read through by a native English speaker prior to commit... Applied with some editorialization. Please check the docs are good enough. I removed a test pattern for variable delimiters from regression test because it might be an undefined behavior. We might change the behavior in the future, so we guarantee only constant delimiters. The transition functions are renamed to "string_agg_transfn" and "string_agg_delim_transfn". We cannot use "string_agg_transfn" for both names because we cast the function name as regproc in bootstrap; it complains about duplicated function names. Regards, --- Takahiro Itagaki NTT Open Source Software Center
2010/2/1 Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp>: > > Robert Haas <robertmhaas@gmail.com> wrote: > >> > ok - I am only one who like original behave - so I ?am withdrawing. >> > Robert, If you like, please commit Itagaki's patch. + add note about >> > behave when second parameter isn't stable. >> >> I haven't even looked at this code - I sort of assumed Itagaki was >> handling this one. But it might be good to make sure that the docs >> have been read through by a native English speaker prior to commit... > > Applied with some editorialization. Please check the docs are good enough. > > I removed a test pattern for variable delimiters from regression test > because it might be an undefined behavior. We might change the behavior > in the future, so we guarantee only constant delimiters. > > The transition functions are renamed to "string_agg_transfn" and > "string_agg_delim_transfn". We cannot use "string_agg_transfn" for > both names because we cast the function name as regproc in bootstrap; > it complains about duplicated function names. thank you very much Pavel Stehule > > Regards, > --- > Takahiro Itagaki > NTT Open Source Software Center > > >
I noticed that the regression test results don't include the following case:<br /><br />select string_agg(a) from (values(null),(null))g(a);<br /><br /> There is one similar where a delimiter is provided.<br /><br /> Which leads me toask for clarification, would it be safe to assume that string_agg can never itself return null?<br /><br /> Thom<br /><br/>
2010/2/1 Thom Brown <thombrown@gmail.com>: > I noticed that the regression test results don't include the following case: > > select string_agg(a) from (values(null),(null)) g(a); > > There is one similar where a delimiter is provided. > > Which leads me to ask for clarification, would it be safe to assume that > string_agg can never itself return null? if all values are null, then result is null. Pavel > > Thom > >
On 1 February 2010 13:40, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Ah, I was looking at the expected results, and couldn't see a NULL outcome, but then these aren't indicated in such results anyway.
Thom
2010/2/1 Thom Brown <thombrown@gmail.com>:if all values are null, then result is null.> I noticed that the regression test results don't include the following case:
>
> select string_agg(a) from (values(null),(null)) g(a);
>
> There is one similar where a delimiter is provided.
>
> Which leads me to ask for clarification, would it be safe to assume that
> string_agg can never itself return null?
Pavel
Ah, I was looking at the expected results, and couldn't see a NULL outcome, but then these aren't indicated in such results anyway.
Thom
On Sun, Jan 31, 2010 at 10:29 PM, Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> wrote: > Applied with some editorialization. Please check the docs are good enough. Looks pretty good. I have committed a couple very minor adjustments. ...Robert