array_accum aggregate - Mailing list pgsql-hackers

From Stephen Frost
Subject array_accum aggregate
Date
Msg-id 20061006203453.GJ24675@kenobi.snowman.net
Whole thread Raw
Responses Re: array_accum aggregate  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: array_accum aggregate  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
Greetings,

  The array_accum example aggregate in the user documentation works
  reasonably on small data sets but doesn't work too hot on large ones.
  http://www.postgresql.org/docs/8.1/static/xaggr.html

  Normally I wouldn't care particularly much but it turns out that PL/R
  uses arrays for quite a bit (eg: histograms and other statistics
  functions).  I've also heard other complaints about the performance of
  arrays, though I'm not sure if those were due to array_accum or
  something else.

  Long story short, I set out to build a faster array_accum.  Much to my
  suprise and delight, we already *had* one.  accumArrayResult() and
  makeArrayResult()/construct_md_array() appear to do a fantastic job.
  I've created a couple of 'glue' functions to expose these functions so
  they can be used in an aggregate.  I'm sure they could be improved
  upon and possibly made even smaller than they already are (90 lines
  total for both) but I'd like to throw out the idea of including them
  in core.  The aggregate created with them could also be considered for
  inclusion though I'm less concerned with that.  I don't expect general
  PostgreSQL users would have trouble creating the aggregate- I don't
  know that the average user would be able or willing to write the C
  functions.

  For comparison, the new functions run with:
  time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null
      4.24s real     0.34s user     0.06s system

  Compared to:
  time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null
  ...

  Well, it's still running and it's been over an hour.

  The main differences, as I see it, are: accumArrayResult() works in
  chunks of 64 elements, and uses repalloc().  array_accum uses
  array_set() which works on individual elements and uses
  palloc()/memcpy().  I appriciate that this is done because for most
  cases of array_set() it's not acceptable to modify the input and am
  not suggesting that be changed.  An alternative might be to modify
  array_set() to check if it is in an aggregate and change its behavior
  but adding the seperate functions seemed cleaner and much less
  intrusive to me.

  Please find the functions attached.

          Thanks,

            Stephen

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Should there be a charcat?
Next
From: Peter Eisentraut
Date:
Subject: Re: 8.2 translation status?