Thread: nodeAgg perf tweak
I noticed that we have a bottleneck in aggregate performance in advance_transition_function(): according to callgrind, doing datumCopy() and pfree() for every tuple produced by the transition function is pretty expensive. Some queries bare this out: dvl=# SELECT W.element_id, count(W.element_ID) FROM watch_list_element W GROUP by W.element_id ORDER by count(W.element_ID) LIMIT 5; element_id | count ------------+------- 65525 | 1 163816 | 1 16341 | 1 131023 | 1 65469 | 1 (5 rows) Time: 176.723 ms dvl=# select count(*) from watch_list_element; count -------- 138044 (1 row) Time: 94.246 ms I've attached a quick and dirty hack that avoids the need to palloc() and pfree() for every tuple produced by the aggregate's transition function. This results in: dvl=# SELECT W.element_id, count(W.element_ID) FROM watch_list_element W GROUP by W.element_id ORDER by count(W.element_ID) LIMIT 5; element_id | count ------------+------- 65525 | 1 163816 | 1 16341 | 1 131023 | 1 65469 | 1 (5 rows) Time: 154.378 ms dvl=# select count(*) from watch_list_element; count -------- 138044 (1 row) Time: 73.975 ms I can reproduce this performance difference consistently. I thought this might have been attributable to memory checking overhead because assertions were enabled, but that doesn't appear to be the case (the above results are without --enable-cassert). The attached patch invokes the transition function in the current memory context. I don't think that's right: a memory leak in an aggregate's transition function would be problematic when we're invoked from a per-query memory context, as is the case with advance_aggregates(). Perhaps we need an additional short-lived memory context in AggStatePerAggData: we could invoke the transition function in that context, and reset it once per, say, 1000 tuples. Alternatively we could just mandate that aggregate transition function's not leak memory and then invoke the transition function in, say, the aggregate's memory context, but that seems a little fragile. Comments? -Neil
Attachment
Neil Conway <neilc@samurai.com> writes: > I've attached a quick and dirty hack that avoids the need to palloc() > and pfree() for every tuple produced by the aggregate's transition > function. And how badly does it leak memory? I do not believe this patch is tenable. Something that occurred to me the other morning in the shower is that we could trivially inline MemoryContextSwitchTo() when using gcc, much as you did for list_length(). I haven't gotten around to doing it but it seems like a free percent-or-two improvement. regards, tom lane
On Tue, 2004-11-30 at 23:15 -0500, Tom Lane wrote: > And how badly does it leak memory? I do not believe this patch is > tenable. Did you read the rest of my mail? > Something that occurred to me the other morning in the shower is that we > could trivially inline MemoryContextSwitchTo() when using gcc, much as > you did for list_length(). I haven't gotten around to doing it but it > seems like a free percent-or-two improvement. Yeah, it actually occurred to me as well this would be worth doing. It's not relevant to this particular example though. -Neil
On Wed, 2004-12-01 at 02:48, Neil Conway wrote: > I noticed that we have a bottleneck in aggregate performance in > advance_transition_function(): according to callgrind, doing datumCopy() > and pfree() for every tuple produced by the transition function is > pretty expensive. Some queries bare this out: > ... > I can reproduce this performance difference consistently. I thought this > might have been attributable to memory checking overhead because > assertions were enabled, but that doesn't appear to be the case (the > above results are without --enable-cassert). That looks like a useful performance gain, and count(*) is a weak point on the performance roadmap. Thanks. > The attached patch invokes the transition function in the current memory > context. I don't think that's right: a memory leak in an aggregate's > transition function would be problematic when we're invoked from a > per-query memory context, as is the case with advance_aggregates(). > Perhaps we need an additional short-lived memory context in > AggStatePerAggData: we could invoke the transition function in that > context, and reset it once per, say, 1000 tuples. I'd be a little twitchy about new memory contexts at this stage of beta, but if the code is fairly well isolated that could be good. > Alternatively we could just mandate that aggregate transition function's > not leak memory and then invoke the transition function in, say, the > aggregate's memory context, but that seems a little fragile. Would it possible to differentiate between well-known builtin functions and added transition functions? I realise nothing is that simple, but it seems we could trust some functions more than others. The trust level could be determined at planning time. [DB2 uses FENCED or UNFENCED declarations for this purpose, though the exact meaning is different to your proposal - I'm not suggesting adding external syntax though] A performance gain on the built-ins alone would satisfy 80% of users. -- Best Regards, Simon Riggs
On Wed, 2004-12-01 at 08:25 +0000, Simon Riggs wrote: > I'd be a little twitchy about new memory contexts at this stage of beta, > but if the code is fairly well isolated that could be good. This would be for 8.1 anyway. > Would it possible to differentiate between well-known builtin functions > and added transition functions? IMHO, this would be ugly and add unnecessary complexity. I'd like to find a solution that actually fixes the problem, rather than just working around it in the common case. -Neil
On Wed, 2004-12-01 at 08:37, Neil Conway wrote: > On Wed, 2004-12-01 at 08:25 +0000, Simon Riggs wrote: > > I'd be a little twitchy about new memory contexts at this stage of beta, > > but if the code is fairly well isolated that could be good. > > This would be for 8.1 anyway. > > > Would it possible to differentiate between well-known builtin functions > > and added transition functions? > > IMHO, this would be ugly and add unnecessary complexity. I'd like to > find a solution that actually fixes the problem, rather than just > working around it in the common case. It would be my suggestion to implement the optimisation for the common case *now*, then fix the general case later. Please shave 20% off everybody's aggregation queries, ugly or not. You'll see a lot of happy people. -- Best Regards, Simon Riggs
On Wed, Dec 01, 2004 at 10:03:40AM +0000, Simon Riggs wrote: > Please shave 20% off everybody's aggregation queries, ugly or not. > You'll see a lot of happy people. When would the experimentation end, and 8.0 be finally released? There's real development waiting only for release to happen ... I have submitted three patches already that are pending for 8.1, and the code keeps drifting. There has to be a release soon. We can't keep in beta forever. Also, I think we should consider using a time-based release process instead of the unpredictable mess that it is used now. If hard development was done in branches and only merged when stable, we could do this. (This is also something that a better SCM could help us do, though GCC is living proof that it can be done with CVS too.) -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "God is real, unless declared as int"
On Wed, 2004-12-01 at 11:08 -0300, Alvaro Herrera wrote: > When would the experimentation end, and 8.0 be finally released? As I said, this is work for 8.1; I don't think it ought to affect the release schedule of 8.0 (perhaps in some marginal way because Tom might spend some time discussing issues with me, but I'll leave it up to Tom to decide how best to spend his time). > There's real development waiting only for release to happen ... There is real development being done as we speak, that will be landed to 8.1 when it branches :) I think you've raised two distinct issues: (1) Release scheduling: time-based releases vs. feature-based releases, how long the beta period is, and so on. (2) Development methodology: how do we make it easy for people to engage in new development while the mainline is being stabilized for a release? > If hard development was done in branches and only merged when stable, > we could do this. I think this might be worth considering as an improvement to #2. What I've been experimenting with locally is doing development in a private VC system, and merging in mainline CVS changes every week or so. That allows me to keep long-running projects in their own private branchs and to take advantage of the VC system's merge support, without needing to change the work environment used by other developers. -Neil
Neil Conway said: > On Wed, 2004-12-01 at 11:08 -0300, Alvaro Herrera wrote: > >> There's real development waiting only for release to happen ... > > There is real development being done as we speak, that will be landed > to 8.1 when it branches :) > I've raised this before, but would like to suggest again that there might be virtue in branching earlier in the dev cycle - maybe around the time of the first beta. cheers andrew
Neil Conway <neilc@samurai.com> writes: > The attached patch invokes the transition function in the current memory > context. I don't think that's right: a memory leak in an aggregate's > transition function would be problematic when we're invoked from a > per-query memory context, as is the case with advance_aggregates(). Agreed. You can't really assume that the transition function is leak-free, particularly not when it's returning a pass-by-ref datatype. > Perhaps we need an additional short-lived memory context in > AggStatePerAggData: we could invoke the transition function in that > context, and reset it once per, say, 1000 tuples. This seems like it might work. Instead of copying the result into the aggcontext on every cycle, we could copy it only when we intend to reset the working context. However there is still a risk: the function might return a pointer into the source tuple, not something that's in the working context at all. (See text_smaller() as one example.) This is problematic since the source tuple will go away on the next cycle. The existing copying code takes care of this case as well as the one where the result is in per_tuple_context. I'm a bit surprised that you measured a significant performance speedup from removing the copying step. Awhile ago I experimented with hacking the count() aggregate function internally to avoid pallocs, and was disappointed to measure no noticeable speedup at all. Digging in the list archives, it looks like I neglected to report that failure, but I do still have the patch and I attach it here. This was from late Dec '03 so it'd be against just-past-7.4 sources. I'd be interested to hear how this compares to what you did under your test conditions; maybe I was using a bogus test case. regards, tom lane *** src/backend/executor/nodeAgg.c.orig Sat Nov 29 14:51:48 2003 --- src/backend/executor/nodeAgg.c Sun Dec 21 16:02:25 2003 *************** *** 350,356 **** */ /* MemSet(&fcinfo, 0, sizeof(fcinfo)); */ ! fcinfo.context = NULL; fcinfo.resultinfo = NULL; fcinfo.isnull = false; --- 350,356 ---- */ /* MemSet(&fcinfo, 0, sizeof(fcinfo)); */ ! fcinfo.context = (void *) aggstate; fcinfo.resultinfo = NULL; fcinfo.isnull = false; *************** *** 528,533 **** --- 528,534 ---- FunctionCallInfoData fcinfo; MemSet(&fcinfo, 0, sizeof(fcinfo)); + fcinfo.context = (void *) aggstate; fcinfo.flinfo = &peraggstate->finalfn; fcinfo.nargs = 1; fcinfo.arg[0] = pergroupstate->transValue; *** /home/postgres/pgsql/src/backend/utils/adt/int8.c.orig Mon Dec 1 17:29:19 2003 --- /home/postgres/pgsql/src/backend/utils/adt/int8.c Sun Dec 21 16:03:43 2003 *************** *** 17,22 **** --- 17,23 ---- #include <math.h> #include "libpq/pqformat.h" + #include "nodes/nodes.h" #include "utils/int8.h" *************** *** 565,573 **** Datum int8inc(PG_FUNCTION_ARGS) { ! int64 arg = PG_GETARG_INT64(0); ! PG_RETURN_INT64(arg + 1); } Datum --- 566,594 ---- Datum int8inc(PG_FUNCTION_ARGS) { ! if (fcinfo->context != NULL && IsA(fcinfo->context, AggState)) ! { ! /* ! * Special case to avoid palloc overhead for COUNT(). Note: this ! * assumes int8 is a pass-by-ref type; if we ever support pass-by-val ! * int8, this should be ifdef'd out when int8 is pass-by-val. ! * ! * When called from nodeAgg, we know that the argument is modifiable ! * local storage, so just update it in-place. ! */ ! int64 *arg = (int64 *) PG_GETARG_POINTER(0); ! ! (*arg)++; ! PG_RETURN_POINTER(arg); ! } ! else ! { ! /* Not called by nodeAgg, so just do it the dumb way */ ! int64 arg = PG_GETARG_INT64(0); ! ! PG_RETURN_INT64(arg + 1); ! } } Datum
"Andrew Dunstan" <andrew@dunslane.net> writes: > I've raised this before, but would like to suggest again that there might be > virtue in branching earlier in the dev cycle - maybe around the time of the > first beta. Given the amount of patching that's gone on, branching 8.1 earlier would have meant a great deal more work for the committers, in the form of double-patching stuff. And I don't know about the other committers, but I've certainly had no spare bandwidth for reviewing or applying 8.1-cycle patches anyway. So I think we made the right move to delay the branch. Core hasn't really talked about the point yet, but I suspect we'll make the branch after 8.0RC1 appears (which hopefully will be Real Soon). regards, tom lane
On Wed, 2004-12-01 at 15:54 -0500, Tom Lane wrote: > This seems like it might work. Instead of copying the result into the > aggcontext on every cycle, we could copy it only when we intend to reset > the working context. Right. > This is > problematic since the source tuple will go away on the next cycle. > The existing copying code takes care of this case as well as the one > where the result is in per_tuple_context. ISTM it would be reasonable to mandate that aggregate authors return one of three things from their state transition functions: (a) return the previous state value (b) return the "next data item" value (c) return some other value; if by a pass-by-referencetype, allocated in CurrentMemoryContext In the case of (b), we need to copy it in advance_transition_function() to ensure it survives to the next invocation of the state transition function, but that should be doable. For both (a) and (c) we can assume the value has been allocated in the per-1000-rows-transition-function temporary context, and thus we need only copy it when we reset that context. > Digging in the > list archives, it looks like I neglected to report that failure, but > I do still have the patch and I attach it here. This was from late > Dec '03 so it'd be against just-past-7.4 sources. I'd be interested > to hear how this compares to what you did under your test conditions; > maybe I was using a bogus test case. I can reproduce the performance improvement with the patch you sent (I can repost numbers if you like, but your patch and mine resulted in about the same performance when I quickly tested it). You can find the test data at: http://www.langille.org/watch_list_elements.sql.tgz Just load the .sql file, run ANALYZE, and try the queries I posted. -Neil
Neil Conway <neilc@samurai.com> writes: > ISTM it would be reasonable to mandate that aggregate authors return one > of three things from their state transition functions: > (a) return the previous state value > (b) return the "next data item" value > (c) return some other value; if by a pass-by-reference type, allocated > in CurrentMemoryContext > In the case of (b), we need to copy it in advance_transition_function() > to ensure it survives to the next invocation of the state transition > function, but that should be doable. For both (a) and (c) we can assume > the value has been allocated in the per-1000-rows-transition-function > temporary context, and thus we need only copy it when we reset that > context. Well, (1) I'm uncomfortable with "mandating" that aggregate functions do anything special, and (2) I think you lose much of the performance benefit as soon as you have to distinguish cases (b) and (c). Ideally you would use MemoryContextContains for this, but that doesn't work reliably on pointers that point to fields of a tuple. I like the approach shown in my prototype patch better, because it doesn't create any backwards-compatibility issues for existing aggregate functions; instead individual aggregates may be rewritten to avoid palloc's. (In fact, you can get to a zero-pallocs-per-cycle condition, rather than reducing from two to one which is the most that the approach you suggest could save.) > I can reproduce the performance improvement with the patch you sent (I > can repost numbers if you like, but your patch and mine resulted in > about the same performance when I quickly tested it). Hmph. I'll have to repeat my test and figure out why I failed to see much benefit. I had sort of abandoned it in disgust after not getting results, but obviously I shoulda dug deeper. regards, tom lane
Tom Lane wrote: >"Andrew Dunstan" <andrew@dunslane.net> writes: > > >>I've raised this before, but would like to suggest again that there might be >>virtue in branching earlier in the dev cycle - maybe around the time of the >>first beta. >> >> > >Given the amount of patching that's gone on, branching 8.1 earlier would >have meant a great deal more work for the committers, in the form of >double-patching stuff. And I don't know about the other committers, but >I've certainly had no spare bandwidth for reviewing or applying 8.1-cycle >patches anyway. So I think we made the right move to delay the branch. > >Core hasn't really talked about the point yet, but I suspect we'll make >the branch after 8.0RC1 appears (which hopefully will be Real Soon). > > > > I do understand. Really. It just would be nice not to have development work queued for months every release cycle. But limited resources as always ... cheers andrew
On Wed, Dec 01, 2004 at 10:04:43PM -0500, Tom Lane wrote: > "Andrew Dunstan" <andrew@dunslane.net> writes: > > I've raised this before, but would like to suggest again that there might be > > virtue in branching earlier in the dev cycle - maybe around the time of the > > first beta. > > Given the amount of patching that's gone on, branching 8.1 earlier would > have meant a great deal more work for the committers, in the form of > double-patching stuff. That's because of CVS. Other systems deal with this issues in a more committer-friendly manner. You just have to look at how using BitKeeper has dramatically increased the development going on at the Linux kernel. Whatever your opinion is on Linux itself, that change alone is something to consider. Also, committers are not the only users of our SCM tool. Consider the autovacuum-backend-integration patch. Nobody reviewed it because it's a lot of hassle to take the patch, apply it locally, test it, and continue development. Because if later Matthew comes up with an updated patch, the reviewer has trouble merging both things, and re-reviewing the new patch ("did he change this in the new patch, or is it the same as before?"). Branching the whole thing could mean that I can integrate his own change history in my private tree, so I can see what went on since the last time; and he can see what changes I did and merge it; etc. The final step, which would be your own review and integration into the official tree, has a better patch to begin with. Which is the way Linux works. I think I will start using Subversion to track Postgres, just to see how it goes. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "If it wasn't for my companion, I believe I'd be having the time of my life" (John Dunbar)
On Thu, 2004-12-02 at 10:45 -0500, Tom Lane wrote: > (2) I think you lose much of the performance > benefit as soon as you have to distinguish cases (b) and (c). Ideally > you would use MemoryContextContains for this, but that doesn't work > reliably on pointers that point to fields of a tuple. Why wouldn't a simple comparison work? We're passing two arguments into the aggregate function: (a) corresponds to returning the first argument, and (b) corresponds to returning the second argument. If the aggregate wants to do something more than just return one of the arguments, they can copy/alloc in the current context. > I like the approach shown in my prototype patch better, because it > doesn't create any backwards-compatibility issues for existing aggregate > functions; instead individual aggregates may be rewritten to avoid > palloc's. Yeah, I like your approach as well (sorry, I had thought Simon's earlier suggestion along these lines would have required adding knowledge about builtin aggregates to advance_transition_function() itself; adding knowledge to the aggregate implementation is a lot nicer). -Neil
Neil Conway <neilc@samurai.com> writes: > On Thu, 2004-12-02 at 10:45 -0500, Tom Lane wrote: >> (2) I think you lose much of the performance >> benefit as soon as you have to distinguish cases (b) and (c). > Why wouldn't a simple comparison work? We're passing two arguments into > the aggregate function: (a) corresponds to returning the first argument, > and (b) corresponds to returning the second argument. True, but you still have to palloc if it returns the second argument. > Yeah, I like your approach as well (sorry, I had thought Simon's earlier > suggestion along these lines would have required adding knowledge about > builtin aggregates to advance_transition_function() itself; adding > knowledge to the aggregate implementation is a lot nicer). It needs documentation --- what I sent you was just a crude hack for testing, not something I would've committed as-is. IIRC, the spec I had in mind was "if fcinfo->context is an AggState node then the function may assume it's being called as an aggregate's transition or final function. In this case, the first argument is certainly either an initial value or the output of a prior transition function call. The function may assume that it's OK to scribble on the first argument instead of allocating a fresh object for its result." One could also imagine that the transition and final functions could make a private agreement about the contents of the state value, such that it needn't be a strictly valid Datum --- for instance it might contain pointers to additional transient storage someplace. I think that at the time we were discussing such hacks in connection with user-defined aggregates that needed a large amount of state. regards, tom lane
On Thu, 2004-12-02 at 19:07 -0500, Tom Lane wrote: > True, but you still have to palloc if it returns the second argument. Is that common? In any case, I don't see how you can _ever_ avoid a palloc if the aggregate returns the second argument. The second argument is in a per-tuple memory context: there's nothing the aggregate, or nodeAgg, can do about it. I think the tradeoffs between our patches are: - mine would apply to all aggregates, without the need to modify any of them individually - yours would mean that int8inc() and similar aggregates wouldn't ever need to do palloc(); mine would require a palloc() every k calls to the transition function. I don't really see this as a problem: in practice k will be sufficiently large that the palloc overhead should be negligible. -Neil
Neil Conway <neilc@samurai.com> writes: > - yours would mean that int8inc() and similar aggregates wouldn't ever > need to do palloc(); mine would require a palloc() every k calls to the > transition function. No. The current code involves two pallocs per cycle, one inside the aggregate function to construct its result value, and then one in datumCopy to copy the result into the proper context. Your patch reduces that to 1 + 1/k pallocs per cycle, mine to zero. The fact that it's a central fix for all aggregate functions is definitely a nice feature of your approach, but I am concerned about the possible side-effects on user-defined aggregate functions that may not work as you expect them to. I think it's safer to keep the aggregate code behaving as-is and get the performance win in the individual functions. There are not that many aggregates that we really care that much about. regards, tom lane
On Thu, 2004-12-02 at 20:51 -0500, Tom Lane wrote: > No. The current code involves two pallocs per cycle, one inside the > aggregate function to construct its result value, and then one in > datumCopy to copy the result into the proper context. Ah, true -- missed the fact that PG_RETURN_INT64() does a palloc(). (We really ought to fix that on 64-bit machines...) > The fact that it's a central fix for all aggregate functions is > definitely a nice feature of your approach, but I am concerned about the > possible side-effects on user-defined aggregate functions that may not > work as you expect them to. I think it's safer to keep the aggregate > code behaving as-is and get the performance win in the individual > functions. There are not that many aggregates that we really care that > much about. Okay, fair enough :) BTW, the spec you posted in your previous message makes sense to me. -Neil