Greetings,
The array_accum example aggregate in the user documentation works
reasonably on small data sets but doesn't work too hot on large ones.
http://www.postgresql.org/docs/8.1/static/xaggr.html
Normally I wouldn't care particularly much but it turns out that PL/R
uses arrays for quite a bit (eg: histograms and other statistics
functions). I've also heard other complaints about the performance of
arrays, though I'm not sure if those were due to array_accum or
something else.
Long story short, I set out to build a faster array_accum. Much to my
suprise and delight, we already *had* one. accumArrayResult() and
makeArrayResult()/construct_md_array() appear to do a fantastic job.
I've created a couple of 'glue' functions to expose these functions so
they can be used in an aggregate. I'm sure they could be improved
upon and possibly made even smaller than they already are (90 lines
total for both) but I'd like to throw out the idea of including them
in core. The aggregate created with them could also be considered for
inclusion though I'm less concerned with that. I don't expect general
PostgreSQL users would have trouble creating the aggregate- I don't
know that the average user would be able or willing to write the C
functions.
For comparison, the new functions run with:
time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null
4.24s real 0.34s user 0.06s system
Compared to:
time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null
...
Well, it's still running and it's been over an hour.
The main differences, as I see it, are: accumArrayResult() works in
chunks of 64 elements, and uses repalloc(). array_accum uses
array_set() which works on individual elements and uses
palloc()/memcpy(). I appriciate that this is done because for most
cases of array_set() it's not acceptable to modify the input and am
not suggesting that be changed. An alternative might be to modify
array_set() to check if it is in an aggregate and change its behavior
but adding the seperate functions seemed cleaner and much less
intrusive to me.
Please find the functions attached.
Thanks,
Stephen