Thread: nodeAgg perf tweak

nodeAgg perf tweak

From

Neil Conway

Date:

01 December 2004, 02:48:57

I noticed that we have a bottleneck in aggregate performance in
advance_transition_function(): according to callgrind, doing datumCopy()
and pfree() for every tuple produced by the transition function is
pretty expensive. Some queries bare this out:

dvl=# SELECT W.element_id, count(W.element_ID) FROM watch_list_element W
GROUP by W.element_id ORDER by count(W.element_ID) LIMIT 5;
 element_id | count
------------+-------
      65525 |     1
     163816 |     1
      16341 |     1
     131023 |     1
      65469 |     1
(5 rows)

Time: 176.723 ms

dvl=# select count(*) from watch_list_element;
 count
--------
 138044
(1 row)

Time: 94.246 ms

I've attached a quick and dirty hack that avoids the need to palloc()
and pfree() for every tuple produced by the aggregate's transition
function. This results in:

dvl=# SELECT W.element_id, count(W.element_ID) FROM watch_list_element W
GROUP by W.element_id ORDER by count(W.element_ID) LIMIT 5;
 element_id | count
------------+-------
      65525 |     1
     163816 |     1
      16341 |     1
     131023 |     1
      65469 |     1
(5 rows)

Time: 154.378 ms

dvl=# select count(*) from watch_list_element;
 count
--------
 138044
(1 row)

Time: 73.975 ms

I can reproduce this performance difference consistently. I thought this
might have been attributable to memory checking overhead because
assertions were enabled, but that doesn't appear to be the case (the
above results are without --enable-cassert).

The attached patch invokes the transition function in the current memory
context. I don't think that's right: a memory leak in an aggregate's
transition function would be problematic when we're invoked from a
per-query memory context, as is the case with advance_aggregates().
Perhaps we need an additional short-lived memory context in
AggStatePerAggData: we could invoke the transition function in that
context, and reset it once per, say, 1000 tuples.

Alternatively we could just mandate that aggregate transition function's
not leak memory and then invoke the transition function in, say, the
aggregate's memory context, but that seems a little fragile.

Comments?

-Neil

Attachment

node_agg_perf-1.patch

Re: nodeAgg perf tweak

From

Tom Lane

Date:

01 December 2004, 04:15:43

Neil Conway <neilc@samurai.com> writes:
> I've attached a quick and dirty hack that avoids the need to palloc()
> and pfree() for every tuple produced by the aggregate's transition
> function.

And how badly does it leak memory?  I do not believe this patch is
tenable.

Something that occurred to me the other morning in the shower is that we
could trivially inline MemoryContextSwitchTo() when using gcc, much as
you did for list_length().  I haven't gotten around to doing it but it
seems like a free percent-or-two improvement.
        regards, tom lane

Re: nodeAgg perf tweak

From

Neil Conway

Date:

01 December 2004, 04:31:13

On Tue, 2004-11-30 at 23:15 -0500, Tom Lane wrote:
> And how badly does it leak memory?  I do not believe this patch is
> tenable.

Did you read the rest of my mail?

> Something that occurred to me the other morning in the shower is that we
> could trivially inline MemoryContextSwitchTo() when using gcc, much as
> you did for list_length().  I haven't gotten around to doing it but it
> seems like a free percent-or-two improvement.

Yeah, it actually occurred to me as well this would be worth doing. It's
not relevant to this particular example though.

-Neil

Re: nodeAgg perf tweak

From

Simon Riggs

Date:

01 December 2004, 08:28:49

On Wed, 2004-12-01 at 02:48, Neil Conway wrote:
> I noticed that we have a bottleneck in aggregate performance in
> advance_transition_function(): according to callgrind, doing datumCopy()
> and pfree() for every tuple produced by the transition function is
> pretty expensive. Some queries bare this out:
> 
...

> I can reproduce this performance difference consistently. I thought this
> might have been attributable to memory checking overhead because
> assertions were enabled, but that doesn't appear to be the case (the
> above results are without --enable-cassert).

That looks like a useful performance gain, and count(*) is a weak point
on the performance roadmap. Thanks.

> The attached patch invokes the transition function in the current memory
> context. I don't think that's right: a memory leak in an aggregate's
> transition function would be problematic when we're invoked from a
> per-query memory context, as is the case with advance_aggregates().
> Perhaps we need an additional short-lived memory context in
> AggStatePerAggData: we could invoke the transition function in that
> context, and reset it once per, say, 1000 tuples.

I'd be a little twitchy about new memory contexts at this stage of beta,
but if the code is fairly well isolated that could be good.

> Alternatively we could just mandate that aggregate transition function's
> not leak memory and then invoke the transition function in, say, the
> aggregate's memory context, but that seems a little fragile.

Would it possible to differentiate between well-known builtin functions
and added transition functions? I realise nothing is that simple, but it
seems we could trust some functions more than others. The trust level
could be determined at planning time. [DB2 uses FENCED or UNFENCED
declarations for this purpose, though the exact meaning is different to
your proposal - I'm not suggesting adding external syntax though]

A performance gain on the built-ins alone would satisfy 80% of users.

-- 
Best Regards, Simon Riggs

Re: nodeAgg perf tweak

From

Neil Conway

Date:

01 December 2004, 08:37:43

On Wed, 2004-12-01 at 08:25 +0000, Simon Riggs wrote:
> I'd be a little twitchy about new memory contexts at this stage of beta,
> but if the code is fairly well isolated that could be good.

This would be for 8.1 anyway.

> Would it possible to differentiate between well-known builtin functions
> and added transition functions?

IMHO, this would be ugly and add unnecessary complexity. I'd like to
find a solution that actually fixes the problem, rather than just
working around it in the common case.

-Neil

Re: nodeAgg perf tweak

From

Simon Riggs

Date:

01 December 2004, 10:07:11

On Wed, 2004-12-01 at 08:37, Neil Conway wrote:
> On Wed, 2004-12-01 at 08:25 +0000, Simon Riggs wrote:
> > I'd be a little twitchy about new memory contexts at this stage of beta,
> > but if the code is fairly well isolated that could be good.
> 
> This would be for 8.1 anyway.
> 
> > Would it possible to differentiate between well-known builtin functions
> > and added transition functions?
> 
> IMHO, this would be ugly and add unnecessary complexity. I'd like to
> find a solution that actually fixes the problem, rather than just
> working around it in the common case.

It would be my suggestion to implement the optimisation for the common
case *now*, then fix the general case later.

Please shave 20% off everybody's aggregation queries, ugly or not.
You'll see a lot of happy people.

-- 
Best Regards, Simon Riggs

Please release (was Re: nodeAgg perf tweak)

From

Alvaro Herrera

Date:

01 December 2004, 14:08:19

On Wed, Dec 01, 2004 at 10:03:40AM +0000, Simon Riggs wrote:

> Please shave 20% off everybody's aggregation queries, ugly or not.
> You'll see a lot of happy people.

When would the experimentation end, and 8.0 be finally released?
There's real development waiting only for release to happen ...

I have submitted three patches already that are pending for 8.1, and the
code keeps drifting.  There has to be a release soon.  We can't keep in
beta forever.

Also, I think we should consider using a time-based release process
instead of the unpredictable mess that it is used now.  If hard
development was done in branches and only merged when stable, we could
do this.  (This is also something that a better SCM could help us do,
though GCC is living proof that it can be done with CVS too.)

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"God is real, unless declared as int"

Re: Please release (was Re: nodeAgg perf tweak)

From

Neil Conway

Date:

02 December 2004, 01:05:58

On Wed, 2004-12-01 at 11:08 -0300, Alvaro Herrera wrote:
> When would the experimentation end, and 8.0 be finally released?

As I said, this is work for 8.1; I don't think it ought to affect the
release schedule of 8.0 (perhaps in some marginal way because Tom might
spend some time discussing issues with me, but I'll leave it up to Tom
to decide how best to spend his time).

> There's real development waiting only for release to happen ...

There is real development being done as we speak, that will be landed to
8.1 when it branches :)

I think you've raised two distinct issues:

(1) Release scheduling: time-based releases vs. feature-based releases,
how long the beta period is, and so on.

(2) Development methodology: how do we make it easy for people to engage
in new development while the mainline is being stabilized for a release?

> If hard development was done in branches and only merged when stable,
> we could do this.

I think this might be worth considering as an improvement to #2. What
I've been experimenting with locally is doing development in a private
VC system, and merging in mainline CVS changes every week or so. That
allows me to keep long-running projects in their own private branchs and
to take advantage of the VC system's merge support, without needing to
change the work environment used by other developers.

-Neil

Re: Please release (was Re: nodeAgg perf tweak)

From

"Andrew Dunstan"

Date:

02 December 2004, 02:46:54

Neil Conway said:
> On Wed, 2004-12-01 at 11:08 -0300, Alvaro Herrera wrote:
>
>> There's real development waiting only for release to happen ...
>
> There is real development being done as we speak, that will be landed
> to 8.1 when it branches :)
>

I've raised this before, but would like to suggest again that there might be
virtue in branching earlier in the dev cycle - maybe around the time of the
first beta.

cheers

andrew

Re: nodeAgg perf tweak

From

Tom Lane

Date:

02 December 2004, 03:13:04

Neil Conway <neilc@samurai.com> writes:
> The attached patch invokes the transition function in the current memory
> context. I don't think that's right: a memory leak in an aggregate's
> transition function would be problematic when we're invoked from a
> per-query memory context, as is the case with advance_aggregates().

Agreed.  You can't really assume that the transition function is
leak-free, particularly not when it's returning a pass-by-ref datatype.

> Perhaps we need an additional short-lived memory context in
> AggStatePerAggData: we could invoke the transition function in that
> context, and reset it once per, say, 1000 tuples.

This seems like it might work.  Instead of copying the result into the
aggcontext on every cycle, we could copy it only when we intend to reset
the working context.  However there is still a risk: the function might
return a pointer into the source tuple, not something that's in the
working context at all.  (See text_smaller() as one example.)  This is
problematic since the source tuple will go away on the next cycle.
The existing copying code takes care of this case as well as the one
where the result is in per_tuple_context.

I'm a bit surprised that you measured a significant performance speedup
from removing the copying step.  Awhile ago I experimented with hacking
the count() aggregate function internally to avoid pallocs, and was
disappointed to measure no noticeable speedup at all.  Digging in the
list archives, it looks like I neglected to report that failure, but
I do still have the patch and I attach it here.  This was from late
Dec '03 so it'd be against just-past-7.4 sources.  I'd be interested
to hear how this compares to what you did under your test conditions;
maybe I was using a bogus test case.
        regards, tom lane

*** src/backend/executor/nodeAgg.c.orig    Sat Nov 29 14:51:48 2003
--- src/backend/executor/nodeAgg.c    Sun Dec 21 16:02:25 2003
***************
*** 350,356 ****      */      /* MemSet(&fcinfo, 0, sizeof(fcinfo)); */
!     fcinfo.context = NULL;     fcinfo.resultinfo = NULL;     fcinfo.isnull = false; 
--- 350,356 ----      */      /* MemSet(&fcinfo, 0, sizeof(fcinfo)); */
!     fcinfo.context = (void *) aggstate;     fcinfo.resultinfo = NULL;     fcinfo.isnull = false; 
***************
*** 528,533 ****
--- 528,534 ----         FunctionCallInfoData fcinfo;          MemSet(&fcinfo, 0, sizeof(fcinfo));
+         fcinfo.context = (void *) aggstate;         fcinfo.flinfo = &peraggstate->finalfn;         fcinfo.nargs = 1;
      fcinfo.arg[0] = pergroupstate->transValue;
 
*** /home/postgres/pgsql/src/backend/utils/adt/int8.c.orig    Mon Dec  1 17:29:19 2003
--- /home/postgres/pgsql/src/backend/utils/adt/int8.c    Sun Dec 21 16:03:43 2003
***************
*** 17,22 ****
--- 17,23 ---- #include <math.h>  #include "libpq/pqformat.h"
+ #include "nodes/nodes.h" #include "utils/int8.h"  
***************
*** 565,573 **** Datum int8inc(PG_FUNCTION_ARGS) {
!     int64        arg = PG_GETARG_INT64(0); 
!     PG_RETURN_INT64(arg + 1); }  Datum
--- 566,594 ---- Datum int8inc(PG_FUNCTION_ARGS) {
!     if (fcinfo->context != NULL && IsA(fcinfo->context, AggState))
!     {
!         /*
!          * Special case to avoid palloc overhead for COUNT().  Note: this
!          * assumes int8 is a pass-by-ref type; if we ever support pass-by-val
!          * int8, this should be ifdef'd out when int8 is pass-by-val.
!          *
!          * When called from nodeAgg, we know that the argument is modifiable
!          * local storage, so just update it in-place.
!          */
!         int64       *arg = (int64 *) PG_GETARG_POINTER(0);
! 
!         (*arg)++; 
!         PG_RETURN_POINTER(arg);
!     }
!     else
!     {
!         /* Not called by nodeAgg, so just do it the dumb way */
!         int64        arg = PG_GETARG_INT64(0);
! 
!         PG_RETURN_INT64(arg + 1);
!     } }  Datum

Re: Please release (was Re: nodeAgg perf tweak)

From

Tom Lane

Date:

02 December 2004, 03:14:47

"Andrew Dunstan" <andrew@dunslane.net> writes:
> I've raised this before, but would like to suggest again that there might be
> virtue in branching earlier in the dev cycle - maybe around the time of the
> first beta.

Given the amount of patching that's gone on, branching 8.1 earlier would
have meant a great deal more work for the committers, in the form of
double-patching stuff.  And I don't know about the other committers, but
I've certainly had no spare bandwidth for reviewing or applying 8.1-cycle
patches anyway.  So I think we made the right move to delay the branch.

Core hasn't really talked about the point yet, but I suspect we'll make
the branch after 8.0RC1 appears (which hopefully will be Real Soon).
        regards, tom lane

Re: nodeAgg perf tweak

From

Neil Conway

Date:

02 December 2004, 05:33:51

On Wed, 2004-12-01 at 15:54 -0500, Tom Lane wrote:
> This seems like it might work.  Instead of copying the result into the
> aggcontext on every cycle, we could copy it only when we intend to reset
> the working context.

Right.

> This is
> problematic since the source tuple will go away on the next cycle.
> The existing copying code takes care of this case as well as the one
> where the result is in per_tuple_context.

ISTM it would be reasonable to mandate that aggregate authors return one
of three things from their state transition functions:
 (a) return the previous state value (b) return the "next data item" value (c) return some other value; if by a
pass-by-referencetype, allocated

in CurrentMemoryContext

In the case of (b), we need to copy it in advance_transition_function()
to ensure it survives to the next invocation of the state transition
function, but that should be doable. For both (a) and (c) we can assume
the value has been allocated in the per-1000-rows-transition-function
temporary context, and thus we need only copy it when we reset that
context.

> Digging in the
> list archives, it looks like I neglected to report that failure, but
> I do still have the patch and I attach it here.  This was from late
> Dec '03 so it'd be against just-past-7.4 sources.  I'd be interested
> to hear how this compares to what you did under your test conditions;
> maybe I was using a bogus test case.

I can reproduce the performance improvement with the patch you sent (I
can repost numbers if you like, but your patch and mine resulted in
about the same performance when I quickly tested it). You can find the
test data at:
   http://www.langille.org/watch_list_elements.sql.tgz

Just load the .sql file, run ANALYZE, and try the queries I posted.

-Neil

Re: nodeAgg perf tweak

From

Tom Lane

Date:

02 December 2004, 15:45:48

Neil Conway <neilc@samurai.com> writes:
> ISTM it would be reasonable to mandate that aggregate authors return one
> of three things from their state transition functions:

>   (a) return the previous state value
>   (b) return the "next data item" value
>   (c) return some other value; if by a pass-by-reference type, allocated
> in CurrentMemoryContext

> In the case of (b), we need to copy it in advance_transition_function()
> to ensure it survives to the next invocation of the state transition
> function, but that should be doable. For both (a) and (c) we can assume
> the value has been allocated in the per-1000-rows-transition-function
> temporary context, and thus we need only copy it when we reset that
> context.

Well, (1) I'm uncomfortable with "mandating" that aggregate functions do
anything special, and (2) I think you lose much of the performance
benefit as soon as you have to distinguish cases (b) and (c).  Ideally
you would use MemoryContextContains for this, but that doesn't work
reliably on pointers that point to fields of a tuple.

I like the approach shown in my prototype patch better, because it
doesn't create any backwards-compatibility issues for existing aggregate
functions; instead individual aggregates may be rewritten to avoid
palloc's.  (In fact, you can get to a zero-pallocs-per-cycle condition,
rather than reducing from two to one which is the most that the approach
you suggest could save.)

> I can reproduce the performance improvement with the patch you sent (I
> can repost numbers if you like, but your patch and mine resulted in
> about the same performance when I quickly tested it).

Hmph.  I'll have to repeat my test and figure out why I failed to see
much benefit.  I had sort of abandoned it in disgust after not getting
results, but obviously I shoulda dug deeper.
        regards, tom lane

Re: Please release (was Re: nodeAgg perf tweak)

From

Andrew Dunstan

Date:

02 December 2004, 16:13:56


Tom Lane wrote:

>"Andrew Dunstan" <andrew@dunslane.net> writes:
>  
>
>>I've raised this before, but would like to suggest again that there might be
>>virtue in branching earlier in the dev cycle - maybe around the time of the
>>first beta.
>>    
>>
>
>Given the amount of patching that's gone on, branching 8.1 earlier would
>have meant a great deal more work for the committers, in the form of
>double-patching stuff.  And I don't know about the other committers, but
>I've certainly had no spare bandwidth for reviewing or applying 8.1-cycle
>patches anyway.  So I think we made the right move to delay the branch.
>
>Core hasn't really talked about the point yet, but I suspect we'll make
>the branch after 8.0RC1 appears (which hopefully will be Real Soon).
>
>
>  
>

I do understand. Really. It just would be nice not to have development 
work queued for months every release cycle. But limited resources as 
always ...

cheers

andrew

Re: Please release (was Re: nodeAgg perf tweak)

From

Alvaro Herrera

Date:

02 December 2004, 17:04:02

On Wed, Dec 01, 2004 at 10:04:43PM -0500, Tom Lane wrote:
> "Andrew Dunstan" <andrew@dunslane.net> writes:
> > I've raised this before, but would like to suggest again that there might be
> > virtue in branching earlier in the dev cycle - maybe around the time of the
> > first beta.
> 
> Given the amount of patching that's gone on, branching 8.1 earlier would
> have meant a great deal more work for the committers, in the form of
> double-patching stuff.

That's because of CVS.  Other systems deal with this issues in a more
committer-friendly manner.

You just have to look at how using BitKeeper has dramatically increased
the development going on at the Linux kernel.  Whatever your opinion is
on Linux itself, that change alone is something to consider.

Also, committers are not the only users of our SCM tool.  Consider the
autovacuum-backend-integration patch.  Nobody reviewed it because it's a
lot of hassle to take the patch, apply it locally, test it, and continue
development.  Because if later Matthew comes up with an updated patch,
the reviewer has trouble merging both things, and re-reviewing the new
patch ("did he change this in the new patch, or is it the same as
before?").  Branching the whole thing could mean that I can integrate
his own change history in my private tree, so I can see what went on
since the last time; and he can see what changes I did and merge it;
etc.  The final step, which would be your own review and integration
into the official tree, has a better patch to begin with.  Which is the
way Linux works.

I think I will start using Subversion to track Postgres, just to see how
it goes.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"If it wasn't for my companion, I believe I'd be having
the time of my life"  (John Dunbar)

Re: nodeAgg perf tweak

From

Neil Conway

Date:

02 December 2004, 23:57:59

On Thu, 2004-12-02 at 10:45 -0500, Tom Lane wrote:
> (2) I think you lose much of the performance
> benefit as soon as you have to distinguish cases (b) and (c).  Ideally
> you would use MemoryContextContains for this, but that doesn't work
> reliably on pointers that point to fields of a tuple.

Why wouldn't a simple comparison work? We're passing two arguments into
the aggregate function: (a) corresponds to returning the first argument,
and (b) corresponds to returning the second argument. If the aggregate
wants to do something more than just return one of the arguments, they
can copy/alloc in the current context.

> I like the approach shown in my prototype patch better, because it
> doesn't create any backwards-compatibility issues for existing aggregate
> functions; instead individual aggregates may be rewritten to avoid
> palloc's.

Yeah, I like your approach as well (sorry, I had thought Simon's earlier
suggestion along these lines would have required adding knowledge about
builtin aggregates to advance_transition_function() itself; adding
knowledge to the aggregate implementation is a lot nicer).

-Neil

Re: nodeAgg perf tweak

From

Tom Lane

Date:

03 December 2004, 00:07:34

Neil Conway <neilc@samurai.com> writes:
> On Thu, 2004-12-02 at 10:45 -0500, Tom Lane wrote:
>> (2) I think you lose much of the performance
>> benefit as soon as you have to distinguish cases (b) and (c).

> Why wouldn't a simple comparison work? We're passing two arguments into
> the aggregate function: (a) corresponds to returning the first argument,
> and (b) corresponds to returning the second argument.

True, but you still have to palloc if it returns the second argument.

> Yeah, I like your approach as well (sorry, I had thought Simon's earlier
> suggestion along these lines would have required adding knowledge about
> builtin aggregates to advance_transition_function() itself; adding
> knowledge to the aggregate implementation is a lot nicer).

It needs documentation --- what I sent you was just a crude hack for
testing, not something I would've committed as-is.  IIRC, the spec I had
in mind was "if fcinfo->context is an AggState node then the function
may assume it's being called as an aggregate's transition or final
function.  In this case, the first argument is certainly either an
initial value or the output of a prior transition function call.  The
function may assume that it's OK to scribble on the first argument
instead of allocating a fresh object for its result."  One could also
imagine that the transition and final functions could make a private 
agreement about the contents of the state value, such that it needn't
be a strictly valid Datum --- for instance it might contain pointers to
additional transient storage someplace.  I think that at the time we
were discussing such hacks in connection with user-defined aggregates
that needed a large amount of state.
        regards, tom lane

Re: nodeAgg perf tweak

From

Neil Conway

Date:

03 December 2004, 01:45:17

On Thu, 2004-12-02 at 19:07 -0500, Tom Lane wrote:
> True, but you still have to palloc if it returns the second argument.

Is that common? In any case, I don't see how you can _ever_ avoid a
palloc if the aggregate returns the second argument. The second argument
is in a per-tuple memory context: there's nothing the aggregate, or
nodeAgg, can do about it.

I think the tradeoffs between our patches are:

- mine would apply to all aggregates, without the need to modify any of
them individually
- yours would mean that int8inc() and similar aggregates wouldn't ever
need to do palloc(); mine would require a palloc() every k calls to the
transition function. I don't really see this as a problem: in practice k
will be sufficiently large that the palloc overhead should be
negligible.

-Neil

Re: nodeAgg perf tweak

From

Tom Lane

Date:

03 December 2004, 01:51:52

Neil Conway <neilc@samurai.com> writes:
> - yours would mean that int8inc() and similar aggregates wouldn't ever
> need to do palloc(); mine would require a palloc() every k calls to the
> transition function.

No.  The current code involves two pallocs per cycle, one inside the
aggregate function to construct its result value, and then one in
datumCopy to copy the result into the proper context.  Your patch
reduces that to 1 + 1/k pallocs per cycle, mine to zero.

The fact that it's a central fix for all aggregate functions is
definitely a nice feature of your approach, but I am concerned about the
possible side-effects on user-defined aggregate functions that may not
work as you expect them to.  I think it's safer to keep the aggregate
code behaving as-is and get the performance win in the individual
functions.  There are not that many aggregates that we really care that
much about.
        regards, tom lane

Re: nodeAgg perf tweak

From

Neil Conway

Date:

03 December 2004, 02:19:58

On Thu, 2004-12-02 at 20:51 -0500, Tom Lane wrote:
> No.  The current code involves two pallocs per cycle, one inside the
> aggregate function to construct its result value, and then one in
> datumCopy to copy the result into the proper context.

Ah, true -- missed the fact that PG_RETURN_INT64() does a palloc(). (We
really ought to fix that on 64-bit machines...)

> The fact that it's a central fix for all aggregate functions is
> definitely a nice feature of your approach, but I am concerned about the
> possible side-effects on user-defined aggregate functions that may not
> work as you expect them to.  I think it's safer to keep the aggregate
> code behaving as-is and get the performance win in the individual
> functions.  There are not that many aggregates that we really care that
> much about.

Okay, fair enough :)

BTW, the spec you posted in your previous message makes sense to me.

-Neil