Thread: an aggregate array function
Dear hackers,

Do you think there would be any use for an aggregate which returns an array of the aggregated (usually simple) type? Has this already been done by anyone? I looked at the source and noticed that for each inserted item, the array utility functions perform a deep copy of the entire array (plus a reallocation). Normally, this is no big deal, but if executed in the query stage, it could be kind of slow. I also noticed that null values inside is an item on the todo list. Is anybody currently working on this?

Merlin
Merlin Moncure wrote:
> Dear hackers,
>
> Do you think there would be any use for an aggregate which returns an
> array of the aggregated (usually simple) type? Has this already been
> done by anyone? I looked at the source and noticed that for each
> inserted item, the array utility functions perform a deep copy of the
> entire array (plus a reallocation). Normally, this is no big deal, but
> if executed in the query stage, it could be kind of slow.

What exactly have you looked at? In current cvs there is array_append and array_cat. There *was* array_accum, but that was yanked due to an objection that it was redundant with the other two.

There is a contrib (intagg) that avoids the deep copy by passing pointers as integers, but I don't think that would be accepted as a builtin solution. I've considered maybe using a hash table to register valid pointers, but have not thought too hard about it yet. In any case it won't happen for 7.4.

BTW, I tried array_accum() (which is not really much different than array_append()) with groups of about 10,000 elements and feeding the array into a plr final function for a mean calculation. It was for sure slow compared to a native AVG() aggregate, but it wasn't that bad either. I don't remember the specifics, but it would be easy enough to try it out for yourself.

> I also noticed that null values inside is an item on the todo list.
> Is anybody currently working on this?

No one is currently working on it that I'm aware of, but I was considering working on it for 7.5.

Joe
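The contrib/intagg trick Joe mentions - keeping the real array behind a pointer that is smuggled through an integer-typed aggregate state, so each transition call mutates the buffer in place instead of deep-copying it - can be sketched in C roughly as follows. This is a toy illustration of the general idea; the names, struct layout, and growth policy are hypothetical, not the actual contrib code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical accumulator state: a growable int buffer. */
typedef struct {
    int    *data;
    size_t  len;
    size_t  cap;
} accum;

/* Transition function: the "state" arrives as an integer wide enough
 * to hold a pointer, is cast back, appended to in place, and returned
 * unchanged.  No per-call deep copy of the whole array occurs.
 * (Toy code: memory is never freed, and allocation failures are not
 * handled.) */
intptr_t accum_step(intptr_t state, int value)
{
    accum *a = (accum *) state;
    if (a == NULL) {                       /* first call: build the state */
        a = malloc(sizeof(accum));
        a->cap  = 4;
        a->len  = 0;
        a->data = malloc(a->cap * sizeof(int));
    }
    if (a->len == a->cap) {                /* full: grow geometrically */
        a->cap *= 2;
        a->data = realloc(a->data, a->cap * sizeof(int));
    }
    a->data[a->len++] = value;
    return (intptr_t) a;
}
```

Accumulating 100 values this way performs only a handful of reallocations, and the elements already stored are never recopied on an ordinary append.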
Joe Conway wrote:
> > Do you think there would be any use for an aggregate which returns an
> > array of the aggregated (usually simple) type?
>
> What exactly have you looked at? In current cvs there is array_append
> and array_cat. There *was* array_accum, but that was yanked due to an
> objection that it was redundant with the other two.

Actually, I was looking at array_set, which has a provision to grow a 1d array. I was looking at the 7.3.2 sources, so it's pretty clear I would have to look at something newer. I'll wait for 7.4.

> BTW, I tried array_accum() (which is not really much different than
> array_append()) with groups of about 10,000 elements and feeding the
> array into a plr final function for a mean calculation. It was for sure
> slow compared to a native AVG() aggregate, but it wasn't that bad
> either. I don't remember the specifics, but it would be easy enough to
> try it out for yourself.

Well, if dynamic growth is expected, there are some easy optimizations that could reduce the time spent reallocating the array. If there was a detectable difference vs. the avg() function, which performs zero reallocations, it's probably worthwhile. Based on what I saw with the older source, I assumed there was little or no dynamic growth in normal use.

> No one is currently working on it that I'm aware of, but I was
> considering working on it for 7.5.

By all means! What do you think about the other question about an 'array creating aggregate'? Is that a useful contribution?

Merlin
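The "easy optimizations" Merlin alludes to come down to growing the buffer geometrically instead of reallocating on every append. A small C sketch (illustrative arithmetic only, not PostgreSQL source) counts the elements copied under each policy: reallocate-per-append copies O(n^2) elements in total, while doubling the capacity keeps the total under 2n.

```c
#include <assert.h>

/* Elements copied if every append deep-copies the existing array, as
 * described for the 7.3-era array utilities: append number i recopies
 * the i elements already present, so the total is
 * 0 + 1 + ... + (n-1) = n(n-1)/2. */
long long copies_naive(long long n)
{
    long long total = 0;
    for (long long len = 0; len < n; len++)
        total += len;              /* recopy all existing elements */
    return total;
}

/* Elements copied if the array instead doubles its capacity whenever
 * it fills up: existing elements move only at each doubling, so the
 * total is 1 + 2 + 4 + ... < 2n. */
long long copies_doubling(long long n)
{
    long long total = 0, cap = 1;
    for (long long len = 0; len < n; len++) {
        if (len == cap) {          /* buffer full: copy into a bigger one */
            total += len;
            cap   *= 2;
        }
    }
    return total;
}
```

For Joe's 10,000-element groups this is 49,995,000 element copies versus 16,383 - which is why aggregation-time growth, unlike ordinary one-shot array_set calls, rewards a predictive growth policy.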
Merlin Moncure wrote:
> What do you think about the other question about an
> 'array creating aggregate', is that a useful contribution?

Hmm, either I'm not understanding you, or you're not understanding me ;-) First, see contrib/intagg. Second, the following works in 7.4devel:

-- create test data for polymorphic aggregates
create table t(f1 int, f2 float, f3 float);
insert into t values(1,11.1,21.1);
insert into t values(1,11.2,21.2);
insert into t values(1,11.3,21.3);
insert into t values(2,12.1,22.1);
insert into t values(2,12.2,22.2);
insert into t values(2,12.3,22.3);
insert into t values(3,13.1,23.1);
insert into t values(3,13.2,23.2);

CREATE AGGREGATE myagg1 (
  BASETYPE = float8,
  SFUNC = array_append,
  STYPE = float8[],
  INITCOND = '{}'
);

CREATE AGGREGATE myagg2 (
  BASETYPE = float8[],
  SFUNC = array_cat,
  STYPE = float8[],
  INITCOND = '{}'
);

regression=# select f1, myagg1(f2) from t group by f1;
 f1 |      myagg1
----+------------------
  3 | {13.1,13.2}
  2 | {12.1,12.2,12.3}
  1 | {11.1,11.2,11.3}
(3 rows)

regression=# select f1, myagg2(array[f2,f3]) from t group by f1;
 f1 |                myagg2
----+---------------------------------------
  3 | {{13.1,23.1},{13.2,23.2}}
  2 | {{12.1,22.1},{12.2,22.2},{12.3,22.3}}
  1 | {{11.1,21.1},{11.2,21.2},{11.3,21.3}}
(3 rows)

Joe
Merlin Moncure wrote:
> Dear hackers,
>
> Do you think there would be any use for an aggregate which returns an
> array of the aggregated (usually simple) type? Has this already been
> done by anyone? I looked at the source and noticed that for each
> inserted item, the array utility functions perform a deep copy of the
> entire array (plus a reallocation). Normally, this is no big deal,
> but if executed in the query stage, it could be kind of slow. I also
> noticed that null values inside is an item on the todo list. Is
> anybody currently working on this?
>
> Merlin

I have another question about this: why does PostgreSQL support arrays in the first place? For my taste, it's a very messy concept...

Regards,
Dani
It's in the SQL99 standard. There's nothing forcing you to use them - I am a (possibly) old-fashioned data architect, so I never use them ;-)

SQL99 actually allows you to use more or less arbitrary composite types as columns (although Pg currently doesn't) - many would argue that this violates first normal form. OTOH there's probably a good case to make that all this is necessary to provide good support for object-relational mappings, and other OO stuff.

andrew

Dani Oderbolz wrote:
> I have another question about this:
> why does PostgreSQL support arrays in the first place?
> For my taste, it's a very messy concept...
>
> Regards,
> Dani
Andrew wrote:
> It's in the SQL99 standard. There's nothing forcing you to use them - I
> am a (possibly) old-fashioned data architect, so I never use them ;-)
> SQL99 actually allows you to use more or less arbitrary composite types
> as columns (although Pg currently doesn't) - many would argue that this
> violates first normal form.
<snip>

I would (strongly) disagree with your last statement: I believe the opposite to be true. I do agree that they are a violation of first normal form when used as a mechanism for general data storage; however, the intent here is for arrays to help get around SQL's difficulty (largely due to the lack of recursive queries in postgres, but also in a more general way) in dealing with post-query related data.

Arrays (as a result of a query) help to enhance relational use of the database by indirectly allowing a more natural method of storage, giving you more power to query the data. The main problem is SQL's general treatment of result sets as two-dimensional tables when in fact they are just possible branch points in an n-dimensional space (specially expressed by arrays in limited, but useful circumstances).

In other words: arrays for input = bad, arrays for output = not so bad. When recursive queries become available, I'll probably use them instead (I've never had the luxury), but in the mean time...

p.s. Joe, you were right: I did misunderstand both you and postgres's capabilities at the present time. The functionality in your example was exactly what I was looking for. I still hold to my point that if the array is performing deep copies upon growth, there can be vast speed improvements in cases (such as during array aggregation) when aggressive growth can be predicted in advance. The worst case of 'reallocate after each aggregation' can be particularly bad. In any case, I'll shut up now :)

Regards,
Merlin
well, (smile) I didn't say *I* saw violation of FNF as an objection. I think my statement is true - many would see it as a violation of FNF. Many others like you might argue differently.

I first got into this business nearly 20 years ago when I came to realise the severe limitations of the relational algebra. Nothing much has changed about that - it can still be very cumbersome, to say the least. So I'm not dogmatic about obeying some design law that some theorist has laid down. It's more like the English grammar rule about split infinitives - I understand why some people disagree with the rule (or deny it exists), but I still (almost) never use split infinitives myself - I just don't find the need. *shrug*

Never mind - I'll go back to wrestling with Ant now.

cheers

andrew

Merlin Moncure wrote:
> I would (strongly) disagree with your last statement: I believe the
> opposite to be true. I do agree that they are a violation of first
> normal form when used as a mechanism for general data storage; however,
> the intent here is for arrays to help get around SQL's difficulty in
> dealing with post-query related data.
<snip>