Thread: an aggregate array function

an aggregate array function

From
"Merlin Moncure"
Date:
<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">Dear hackers,</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">Do you think there would be any use for an aggregate which returns an array of the aggregated
(usuallysimple) type?<span style="mso-spacerun:yes">  </span>Has this already been done by anyone?<span
style="mso-spacerun:yes"> </span>I looked at the source and noticed that for each inserted item, the array utility
functionsperform a deep copy of the entire array (plus a reallocation). <span
style="mso-spacerun:yes"> </span>Normally,this is no big deal, but if executed in the query stage, it could be kind of
slow.<spanstyle="mso-spacerun:yes">  </span>I also noticed that null values inside is an item on the <span
class="SpellE">todo</span>list.<span style="mso-spacerun:yes">  </span>Is anybody currently working on
this?</span></font><pclass="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">Merlin</span></font></div>

Re: an aggregate array function

From
Joe Conway
Date:
Merlin Moncure wrote:
> Dear hackers,
>  
> Do you think there would be any use for an aggregate which returns an
> array of the aggregated (usually simple) type?  Has this already been
> done by anyone?  I looked at the source and noticed that for each
> inserted item, the array utility functions perform a deep copy of the
> entire array (plus a reallocation).  Normally, this is no big deal, but
> if executed in the query stage, it could be kind of slow.

What exactly have you looked at? In current cvs there is array_append 
and array_cat. There *was* array_accum, but that was yanked due to an 
objection that it was redundant with the other two.

There is a contrib (intagg) that avoids the deep copy by passing 
pointers as integers, but I don't think that would be accepted as a 
builtin solution. I've considered maybe using a hash table to register 
valid pointers, but have not thought too hard about it yet. In any case 
it won't happen for 7.4.

BTW, I tried array_accum() (which is not really much different than 
array_append()) with groups of about 10,000 elements and feeding the 
array into a plr final function for a mean calculation. It was for sure 
slow compared to a native AVG() aggregate, but it wasn't that bad 
either. I don't remember the specifics, but it would be easy enough to 
try it out for yourself.

> I also noticed that null values inside is an item on the todo list.  Is anybody
> currently working on this?
>  

No one is currently working on it that I'm aware of, but I was 
considering working on it for 7.5

Joe




Re: an aggregate array function

From
"Merlin Moncure"
Date:
Joe Conway wrote:
> > Do you think there would be any use for an aggregate which returns
an
> > array of the aggregated (usually simple) type?

>What exactly have you looked at? In current cvs there is array_append
>and array_cat. There *was* array_accum, but that was yanked due to an
>objection that it was redundant with the other two.

Actually, I was looking at array_set, which has a provision to grow a 1d
array.  I was looking at the 7.3.2 sources, so it's pretty clear I would
have to look at something newer.  I'll wait for 7.4.

>BTW, I tried array_accum() (which is not really much different than
>array_append()) with groups of about 10,000 elements and feeding the
>array into a plr final function for a mean calculation. It was for sure

>slow compared to a native AVG() aggregate, but it wasn't that bad
>either. I don't remember the specifics, but it would be easy enough to
>try it out for yourself.

Well, if dynamic growth is expected, there are some easy optimizations
that could reduce the time spent reallocating the array.  If there was a
detectable difference vs. the avg() function, which performs zero
reallocations, its probably worthwhile.

Based on what I saw with the older source, I assumed there was little or
no dynamic growth in normal use.

> No one is currently working on it that I'm aware of, but I was
> considering working on it for 7.5

By all means!  What do you think about the other question about an
'array creating aggregate', is that a useful contribution?

Joe




Re: an aggregate array function

From
Joe Conway
Date:
Merlin Moncure wrote:
> What do you think about the other question about an
> 'array creating aggregate', is that a useful contribution?
> 

Hmm, either I'm not understanding you, or you're not understanding me ;-)

First, see contrib/intagg.
Second, the following works in 7.4devel:

-- create test data for polymorphic aggregates
create table t(f1 int, f2 float, f3 float);
insert into t values(1,11.1,21.1);
insert into t values(1,11.2,21.2);
insert into t values(1,11.3,21.3);
insert into t values(2,12.1,22.1);
insert into t values(2,12.2,22.2);
insert into t values(2,12.3,22.3);
insert into t values(3,13.1,23.1);
insert into t values(3,13.2,23.2);

CREATE AGGREGATE myagg1
(  BASETYPE = float8,  SFUNC = array_append,  STYPE = float8[],  INITCOND = '{}'
);

CREATE AGGREGATE myagg2
(  BASETYPE = float8[],  SFUNC = array_cat,  STYPE = float8[],  INITCOND = '{}'
);

regression=# select f1, myagg1(f2) from t group by f1; f1 |      myagg1
----+------------------  3 | {13.1,13.2}  2 | {12.1,12.2,12.3}  1 | {11.1,11.2,11.3}
(3 rows)

regression=# select f1, myagg2(array[f2,f3]) from t group by f1; f1 |                myagg2
----+---------------------------------------  3 | {{13.1,23.1},{13.2,23.2}}  2 | {{12.1,22.1},{12.2,22.2},{12.3,22.3}}
1| {{11.1,21.1},{11.2,21.2},{11.3,21.3}}
 
(3 rows)

Joe



Re: an aggregate array function

From
Dani Oderbolz
Date:
Merlin Moncure wrote:

> Dear hackers,
>
>  
>
> Do you think there would be any use for an aggregate which returns an 
> array of the aggregated (usually simple) type?  Has this already been 
> done by anyone?  I looked at the source and noticed that for each 
> inserted item, the array utility functions perform a deep copy of the 
> entire array (plus a reallocation).  Normally, this is no big deal, 
> but if executed in the query stage, it could be kind of slow.  I also 
> noticed that null values inside is an item on the todo list.  Is 
> anybody currently working on this?
>
>  
>
> Merlin
>
I got another question to this:
why does PostgreSQL support Arrays in the first place?
For my taste, its a very messy concept...

Regards,
Dani



Re: an aggregate array function

From
Andrew Dunstan
Date:
It's in the SQL99 standard. There's nothing forcing you to use them - I 
am a (possibly) old-fashioned data architect, so I never use them ;-)

SQL99 actually allows you to use more or less arbitrary composite types 
as columns (although Pg currently doesn't) - many would argue that this 
violates first normal form. OTOH there's probably a good case to make 
that all this is necessary to provide good support to ObjectRelational 
mappings, and other OO stuff.

andrew

Dani Oderbolz wrote:

> I got another question to this:
> why does PostgreSQL support Arrays in the first place?
> For my taste, its a very messy concept...
>
> Regards,
> Dani
>
>




Re: an aggregate array function

From
"Merlin Moncure"
Date:
Andrew wrote:
> It's in the SQL99 standard. There's nothing forcing you to use them -
I
> am a (possibly) old-fashioned data architect, so I never use them ;-)

> SQL99 actually allows you to use more or less arbitrary composite
types
> as columns (although Pg currently doesn't) - many would argue that
this
> violates first normal form.
<snip>

I would (strongly) disagree with your last statement: I believe the
opposite to be true. I do agree that they are a violation of the first
normal form when used a mechanism for general data storage;  however the
intent here is for arrays to help get around sql's difficulty (largely
due to the lack of recursive queries in postgres, but also in a more
general way) in dealing with post-query related data.

Arrays (as a result of a query) help to enhance relational use of the
database by indirectly allowing a more natural method of storage by
giving you more power to query the data.  The main problem is sql's
general treatment of results sets as two dimensional tables when in fact
they are just possible branch points in a 'n' dimensional space
(specially expressed by arrays in limited, but useful circumstances).

In other words, arrays for input = bad, arrays for output = not so bad.
When recursive queries become available, I'll probably use them
instead(I've never had the luxury), but in the mean time...

p.s.
Joe, you were right I did misunderstand both you and postgres's
capabilities at the present time.  The functionality in your example was
exactly what I was looking for.  I still hold to my point that if the
array is performing deep copies upon growth, there can be vast speed
improvements in cases (such as during the array aggregation) when
aggressive growth can be predicted in advance.  The worst case of
'reallocate after each aggregation' can be particularly bad.  In any
case, I'll shut up now :)

Regards,
Merlin





Re: an aggregate array function

From
Andrew Dunstan
Date:
well, (smile) I didn't say *I* saw violation of FNF as an objection. I 
think my statement is true - many would see it as a violation of FNF. 
Many others like you might argue differently.

I first got into this business nearly 20 years ago when I came to 
realise the severe limitations of the relational algebra. Nothing much 
has changed about that - it can still be very cumbersome, to say the 
least. So I'm not dogmatic about obeying some design law that some 
theorist has laid down. It's more like the English grammar rule about 
split infinitives - I understand why some people disagree with the rule 
(or deny it exists), but I still (almost) never use split infinitives 
myself - I just don't find the need.

*shrug* Never mind - I'll go back to wrestling with Ant now.

cheers

andrew


Merlin Moncure wrote:

>Andrew wrote:
>  
>
>>It's in the SQL99 standard. There's nothing forcing you to use them -
>>    
>>
>I 
>  
>
>>am a (possibly) old-fashioned data architect, so I never use them ;-)
>>    
>>
>
>  
>
>>SQL99 actually allows you to use more or less arbitrary composite
>>    
>>
>types 
>  
>
>>as columns (although Pg currently doesn't) - many would argue that
>>    
>>
>this 
>  
>
>>violates first normal form. 
>>    
>>
><snip>
>
>I would (strongly) disagree with your last statement: I believe the
>opposite to be true. I do agree that they are a violation of the first
>normal form when used a mechanism for general data storage;  however the
>intent here is for arrays to help get around sql's difficulty (largely
>due to the lack of recursive queries in postgres, but also in a more
>general way) in dealing with post-query related data.  
>
>Arrays (as a result of a query) help to enhance relational use of the
>database by indirectly allowing a more natural method of storage by
>giving you more power to query the data.  The main problem is sql's
>general treatment of results sets as two dimensional tables when in fact
>they are just possible branch points in a 'n' dimensional space
>(specially expressed by arrays in limited, but useful circumstances).
>
>In other words, arrays for input = bad, arrays for output = not so bad.
>When recursive queries become available, I'll probably use them
>instead(I've never had the luxury), but in the mean time...
>
>p.s.  
>Joe, you were right I did misunderstand both you and postgres's
>capabilities at the present time.  The functionality in your example was
>exactly what I was looking for.  I still hold to my point that if the
>array is performing deep copies upon growth, there can be vast speed
>improvements in cases (such as during the array aggregation) when
>aggressive growth can be predicted in advance.  The worst case of
>'reallocate after each aggregation' can be particularly bad.  In any
>case, I'll shut up now :)
>
>Regards,
>Merlin
>
>  
>