Thread: Memory Accounting v11
This patch tracks memory usage (at the block level) for all memory contexts. Individual palloc()s aren't tracked; only the new blocks allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated(), which can either retrieve the total allocated for the context, or it can recurse and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will bound the memory and spill to disk.

Previous discussion here:

http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact version of the patch) on an IBM POWER machine when doing an index rebuild. The results were fairly noisy for me, but it seemed to hold up. See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU caching behavior, and may be the cause of the above slowdown.
* Previous versions of the patch updated the parent contexts' allocations as well, which avoided the need to recurse when querying the total allocation. That seemed to penalize cases that didn't need to track the allocation. We discussed trying to "opt-in" to this behavior, but it seemed more awkward than helpful. Now, the patch only updates the allocation of the current context, and querying means recursing through the child contexts.
* There was a concern that, if MemoryContextMemAllocated() needs to recurse to the child contexts, it will be too slow for HashAgg of array_agg, which creates a child context per group. That was solved with http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

My general answer to the performance concerns is that they aren't a reason to block this patch, unless someone has a suggestion about how to fix them.
Adding one field to a struct and a few arithmetic operations for each malloc() or realloc() seems reasonable to me. The current state, where HashAgg just blows up the memory, is just not reasonable, and we need to track the memory to fix that problem.

Others have also mentioned that we might want to use this mechanism to track memory for other operators, like Sort or HashJoin, which might be simpler and more accurate.

Regards,
Jeff Davis
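For readers skimming the thread, the mechanism described above can be condensed into a toy sketch. This is illustrative C only, not the patch or PostgreSQL code; the names merely echo `mem_allocated` and `MemoryContextMemAllocated()`, and everything else is simplified:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for MemoryContextData: each context counts only the
 * blocks it obtains itself from malloc(); individual "pallocs" carved
 * out of those blocks are not tracked. Children are summed on demand. */
typedef struct Ctx
{
    int64_t     mem_allocated;  /* bytes obtained from malloc() */
    struct Ctx *firstchild;
    struct Ctx *nextsibling;
} Ctx;

static void *
ctx_alloc_block(Ctx *ctx, size_t blksize)
{
    void *block = malloc(blksize);

    if (block != NULL)
        ctx->mem_allocated += (int64_t) blksize;   /* block-level accounting */
    return block;
}

/* Analogue of MemoryContextMemAllocated(ctx, recurse): either report
 * this context alone, or recurse and include all subcontexts. */
static int64_t
ctx_mem_allocated(const Ctx *ctx, int recurse)
{
    int64_t total = ctx->mem_allocated;

    if (recurse)
        for (const Ctx *c = ctx->firstchild; c != NULL; c = c->nextsibling)
            total += ctx_mem_allocated(c, recurse);
    return total;
}
```

The recursion cost on query is exactly the trade-off debated downthread: v11 keeps allocation cheap and pushes the cost to the reader of the total.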
Attachment
Jeff Davis <pgsql@j-davis.com> writes:
> This patch tracks memory usage (at the block level) for all memory
> contexts. Individual palloc()s aren't tracked; only the new blocks
> allocated to the memory context with malloc().
> ...
> My general answer to the performance concerns is that they aren't a
> reason to block this patch, unless someone has a suggestion about how to
> fix them. Adding one field to a struct and a few arithmetic operations
> for each malloc() or realloc() seems reasonable to me.

Maybe, but this here is a pretty weak argument:

> The current state, where HashAgg just blows up the memory, is just not
> reasonable, and we need to track the memory to fix that problem.

Meh. HashAgg could track its memory usage without loading the entire system with a penalty. Moreover, this is about fourth or fifth down the list of the implementation problems with spilling hash aggregation to disk. It would be good to see credible solutions for the bigger issues before we buy into adding overhead for a mechanism with no uses in core.

> Others have also mentioned that we might want to use this mechanism to
> track memory for other operators, like Sort or HashJoin, which might be
> simpler and more accurate.

That would imply redefining the meaning of work_mem for those operations; it would suddenly include a lot more overhead space than it used to, causing them to spill to disk much more quickly than before, unless one changes the work_mem setting to compensate. Not sure that people would like such a change.

An even bigger problem is that it would pretty much break the logic around LACKMEM() in tuplesort.c, which supposes that spilling individual tuples to disk is a reliable and linear way to decrease the accounted-for memory. Individual pfree's would typically not decrease the accounted total in this implementation.
Depending on what you have in mind for the spill-to-disk behavior, this issue could be a fail for HashAgg as well, which is another reason not to commit this patch in advance of seeing the use-case.

regards, tom lane
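For context, the tuplesort.c scheme Tom refers to charges each tuple against a private budget held by the operator itself, with no allocator hooks at all. A self-contained toy version follows; the macro shapes roughly follow tuplesort.c's USEMEM/FREEMEM/LACKMEM, but the SortState struct here is a simplified stand-in, not the real Tuplesortstate:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified echo of tuplesort.c's accounting: the operator tracks a
 * remaining budget (availMem), charges itself when it keeps a tuple,
 * credits itself when a tuple is written out, and switches to external
 * sort when the budget goes negative. */
typedef struct SortState
{
    int64_t availMem;           /* bytes of work_mem still available */
} SortState;

#define USEMEM(state, amt)  ((state)->availMem -= (int64_t) (amt))
#define FREEMEM(state, amt) ((state)->availMem += (int64_t) (amt))
#define LACKMEM(state)      ((state)->availMem < 0)
```

Tom's objection is visible here: this scheme assumes every FREEMEM (e.g. when a tuple is spilled) linearly reduces the accounted total, which block-level context accounting cannot guarantee, since pfree'ing a chunk does not return its block to malloc().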
Hi,

On 06/14/15 21:43, Jeff Davis wrote:
> This patch tracks memory usage (at the block level) for all memory
> contexts. Individual palloc()s aren't tracked; only the new blocks
> allocated to the memory context with malloc().

I see it's adding the new field as int64 - wouldn't a Size be more appropriate, considering that's what we use in mctx.h and aset.c?

> It also adds a new function, MemoryContextMemAllocated() which can
> either retrieve the total allocated for the context, or it can
> recurse and sum up the allocations for all subcontexts as well.
>
> This is intended to be used by HashAgg in an upcoming patch that will
> bound the memory and spill to disk.
>
> Previous discussion here:
>
> http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop
>
> Previous concerns:
>
> * There was a slowdown reported of around 1-3% (depending on the exact
> version of the patch) on an IBM power machine when doing an index
> rebuild. The results were fairly noisy for me, but it seemed to hold up.
> See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
> * Adding a struct field to MemoryContextData may be bad for the CPU
> caching behavior, and may be the cause of the above slowdown.
> * Previous versions of the patch updated the parent contexts'
> allocations as well, which avoided the need to recurse when querying
> the total allocation. That seemed to penalize cases that didn't need
> to track the allocation. We discussed trying to "opt-in" to this
> behavior, but it seemed more awkward than helpful. Now, the patch
> only updates the allocation of the current context, and querying
> means recursing through the child contexts.

I don't think the opt-in idea itself was awkward; it was rather about the particular APIs that we came up with, especially when combined with the 'context inheritance' thing.
I still think the opt-in approach and updating accounting for the parent contexts was the best one, because it (a) minimizes impact in cases that don't use the accounting, and (b) makes finding the current amount of memory cheap ...

> * There was a concern that, if MemoryContextMemAllocated needs to
> recurse to the child contexts, it will be too slow for HashAgg of
> array_agg, which creates a child context per group. That was solved with
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

I wouldn't say this was "solved" - we have fixed one particular example of such aggregate implementation, because it was causing OOM issues with many groups, but there may be other custom aggregates using the same pattern. Granted, built-in aggregates are probably more critical than aggregates provided by extensions, but I wouldn't dare to mark this solved ...

> My general answer to the performance concerns is that they aren't a
> reason to block this patch, unless someone has a suggestion about how
> to fix them. Adding one field to a struct and a few arithmetic
> operations for each malloc() or realloc() seems reasonable to me.

I'm not buying this, sorry. While I agree that we should not expect the memory accounting to be entirely free, we should be very careful about the overhead, especially if we're dropping the opt-in and thus imposing the overhead on everyone. But the "performance concerns are not a reason to block this patch" approach seems wrong. With any other patch a 3% regression would be considered a serious issue, IMNSHO.

> The current state, where HashAgg just blows up the memory, is just
> not reasonable, and we need to track the memory to fix that problem.
> Others have also mentioned that we might want to use this mechanism
> to track memory for other operators, like Sort or HashJoin, which
> might be simpler and more accurate.
Dropping the existing memory accounting implementations and keeping just this new solution would be nice, but only if we agree that the performance impact is acceptable. We already have an accounting solution for each of those places, so I don't think the unification alone outweighs the regression.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 06/14/15 22:21, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> This patch tracks memory usage (at the block level) for all memory
>> contexts. Individual palloc()s aren't tracked; only the new blocks
>> allocated to the memory context with malloc().
>> ...
>> My general answer to the performance concerns is that they aren't a
>> reason to block this patch, unless someone has a suggestion about how to
>> fix them. Adding one field to a struct and a few arithmetic operations
>> for each malloc() or realloc() seems reasonable to me.
>
> Maybe, but this here is a pretty weak argument:
>
>> The current state, where HashAgg just blows up the memory, is just not
>> reasonable, and we need to track the memory to fix that problem.
>
> Meh. HashAgg could track its memory usage without loading the entire
> system with a penalty.

+1 to a solution like that, although I don't think that's doable without digging the info from memory contexts somehow.

> Moreover, this is about fourth or fifth down the
> list of the implementation problems with spilling hash aggregation to
> disk. It would be good to see credible solutions for the bigger issues
> before we buy into adding overhead for a mechanism with no uses in core.

I don't think so. If we don't have an acceptable solution for tracking memory in HashAgg, there's not really much point in doing any of the following steps. That's why Jeff is tackling this problem first.

Also, Jeff already posted a PoC of the memory-bounded HashAgg, although it only worked for aggregates with fixed-size state, and not that well for cases like array_agg where the state grows. Maybe the improvements to aggregate functions proposed by David Rowley last week might help in those cases, as that'd allow spilling those states to disk.

So while the approach proposed by Jeff may turn out not to be the right one, I think memory accounting needs to be solved first (even if it's not committed until the whole feature is ready).
>> Others have also mentioned that we might want to use this mechanism
>> to track memory for other operators, like Sort or HashJoin, which
>> might be simpler and more accurate.
>
> That would imply redefining the meaning of work_mem for those
> operations; it would suddenly include a lot more overhead space than
> it used to, causing them to spill to disk much more quickly than
> before, unless one changes the work_mem setting to compensate. Not
> sure that people would like such a change.

Don't we tweak the work_mem meaning over time anyway? Maybe not to such an extent, but for example the hashjoin improvements in 9.5 certainly change this by packing the tuples more densely, sizing the buckets differently, etc. It's true the changes are in the other direction (i.e. batching less frequently), though.

OTOH this would make the accounting more accurate (e.g. eliminating differences between cases with different amounts of overhead because of different tuple width, for example). Wouldn't that be a good thing, even if people need to tweak work_mem? Isn't that kinda acceptable when upgrading to the next major version?

> An even bigger problem is that it would pretty much break the logic
> around LACKMEM() in tuplesort.c, which supposes that spilling
> individual tuples to disk is a reliable and linear way to decrease
> the accounted-for memory. Individual pfree's would typically not
> decrease the accounted total in this implementation. Depending on
> what you have in mind for the spill-to-disk behavior, this issue
> could be a fail for HashAgg as well, which is another reason not to
> commit this patch in advance of seeing the use-case.

That's true, but I think the plan was always to wait for the whole feature, and only then commit all the pieces.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote:
> Meh. HashAgg could track its memory usage without loading the entire
> system with a penalty.

When I tried doing it outside of the MemoryContexts, it seemed to get awkward quickly. I'm open to suggestions if you can point me in the right direction. Maybe I can peek at the sizes of chunks holding state values and group keys, and combine that with the hash table size-estimating code?

> Moreover, this is about fourth or fifth down the
> list of the implementation problems with spilling hash aggregation to
> disk. It would be good to see credible solutions for the bigger issues
> before we buy into adding overhead for a mechanism with no uses in core.

I had several iterations of a full implementation of the spill-to-disk HashAgg patch[1]. Tomas Vondra had some constructive review comments, but all of them seemed solvable. What do you see as a major unsolved issue?

If I recall, you were concerned about things like array_agg, where an individual state could get larger than work_mem. That's a valid concern, but it's not the problem I was trying to solve.

Regards,
Jeff Davis

[1] http://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop
On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> The current state, where HashAgg just blows up the memory, is just not
>>> reasonable, and we need to track the memory to fix that problem.
>>
>> Meh. HashAgg could track its memory usage without loading the entire
>> system with a penalty.
>
> +1 to a solution like that, although I don't think that's doable without
> digging the info from memory contexts somehow.
Jeff is right, we desperately need a solution and this is the place to start.
Tom's concern remains valid: we must not load the entire system with a penalty.
The only questions I have are:
* If the memory allocations adapt to the usage pattern, then we expect to see few memory chunk allocations. Why are we expecting "the entire system" to experience a penalty?
* If we do not manage our resources, how are we certain this does not induce a penalty? Not tracking memory could be worse than tracking it.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> The current state, where HashAgg just blows up the memory, is just not
>>> reasonable, and we need to track the memory to fix that problem.
>>
>> Meh. HashAgg could track its memory usage without loading the entire
>> system with a penalty.
>
> +1 to a solution like that, although I don't think that's doable without
> digging the info from memory contexts somehow.
I am sorry to ask questions unrelated to the subject, but how is tracking memory going to fix the HashAgg blow up problem? Is there a plan to make HashAgg not blow up (i.e. spill the hash table)?
Thanks,
-cktan
On Thu, Jul 2, 2015 at 11:37 AM, CK Tan <cktan@vitessedata.com> wrote:
> I am sorry to ask questions unrelated to the subject, but how is tracking
> memory going to fix the HashAgg blow up problem? Is there a plan to make
> HashAgg not blow up (i.e. spill the hash table)?

Your question is probably answered here: http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Regards,
Qingqing
On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:
* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
Hi Jeff,
I've been looking over the code and reading the previous emails about this patch.

I don't yet understand if the reported slowdown is from the increase in struct size or from the additional work being done in palloc() calls; however, on reading the code I did notice an existing redundant NULL check in AllocSetAlloc(), right where you put your extra accounting code.

The attached patch should apply on top of your patch and removes the extra NULL check.

Perhaps if some of the overhead comes from the extra instructions, then this can help get us back to where we were before.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:
This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().
It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.
This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.
Previous discussion here:
http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop
Previous concerns:
* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.
I've been looking at this patch and trying to reproduce the reported slowdown by using Tomas' function to try to exercise palloc() with minimal overhead of other code:

https://github.com/tvondra/palloc_bench
I also wanted to attempt to determine if the slowdown was due to the mem_allocated maintenance or if it was down to the struct size increasing.
I decided to test this by simply commenting out all of the context->mem_allocated += / -= operations, while keeping the mem_allocated field in the struct, but I've been really struggling to see any sort of performance decrease anywhere. I'm just getting too much variation in the test results to get any sort of idea.
I've attached a spreadsheet of the results I saw. It would be really good if Robert was able to test with the IBM PPC just with the extra field in the struct so we can see if the 1-3% lies there, or if it's in overhead of keeping mem_allocated up-to-date.
My main question here is: How sure are you that none of your intended use cases will need to call MemoryContextMemAllocated with recurse = true? I'd imagine if you ever have to use MemoryContextMemAllocated(ctx, true) then we'll all be sitting around with our stopwatches out to see if the call incurs too much of a performance penalty.
Other small issues:
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize - oldblksize;
+
Shouldn't you be setting oldblksize after the "if"? I'd imagine the realloc() failure is rare, but it just seems better to do it just before it's required and only when required.
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
Here there's a double assignment of "child".
***************
*** 632,637 **** AllocSetDelete(MemoryContext context)
--- 637,644 ----
{
AllocBlock next = block->next;
+ context->mem_allocated -= block->endptr - ((char *) block);
+
I might be mistaken here, but can't you just set context->mem_allocated = 0; after that loop?
Or maybe it would be an improvement to only do the decrement if MEMORY_CONTEXT_CHECKING is defined, then Assert() that mem_allocated == 0 after the loop...
One other thing that I don't remember seeing discussed would be to just store a List in each context which would store all of the mem_allocated's that need to be updated during each allocation and deallocation of memory. The list would be setup once when the context is created. When a child context is setup we'd just loop up the parent context chain and list_concat() all of the parent's lists onto the child's lists. Would this method add too much overhead when a context is deleted? Has this been explored already?
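If it helps to visualize the suggestion above, the "per-context list of counters to bump" idea might look roughly like this. This is purely hypothetical C using an array of counter pointers rather than a List (as also comes up later in the thread); none of these names exist in the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical "precomputed ancestor list": each context keeps pointers
 * to every counter it must bump (its own plus all ancestors'), built
 * once at creation time, so a block allocation updates all ancestors
 * with a flat loop instead of a recursive query later. */
typedef struct Ctx
{
    int64_t   mem_allocated;
    int       ntrack;           /* counters to bump on each block alloc */
    int64_t **track;            /* [0] = self, [1..] = ancestors */
} Ctx;

static void
ctx_init(Ctx *ctx, Ctx *parent)
{
    ctx->mem_allocated = 0;
    ctx->ntrack = (parent ? parent->ntrack : 0) + 1;
    ctx->track = malloc(ctx->ntrack * sizeof(int64_t *));
    ctx->track[0] = &ctx->mem_allocated;
    /* inherit the parent's list: its self-counter plus its ancestors */
    for (int i = 1; i < ctx->ntrack; i++)
        ctx->track[i] = parent->track[i - 1];
}

static void
ctx_account_block(Ctx *ctx, size_t blksize)
{
    for (int i = 0; i < ctx->ntrack; i++)
        *ctx->track[i] += (int64_t) blksize;
}
```

The appeal is that the cost moves from query time (recursion over children) to allocation time (one extra addition per ancestor), and context trees are usually shallow.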
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:
> of performance decrease anywhere. I'm just getting too much variation
> in the test results to get any sort of idea.

That was my experience as well. Thank you for taking a look.

> My main question here is: How sure are you that none of your intended
> use cases will need to call MemoryContextMemAllocated with recurse =
> true? I'd imagine if you ever have to
> use MemoryContextMemAllocated(ctx, true) then we'll all be sitting
> around with our stopwatches out to see if the call incurs too much of
> a performance penalty.

I don't expect HashAgg to be using many memory contexts. It did with array_agg, but that's been changed, and was a bad pattern anyway (one context per group is quite wasteful). If it turns out to be slow, we can call it less frequently (e.g. extrapolate the growth rate to determine when to check).

> Shouldn't you be setting oldblksize after the "if"? I'd imagine the
> realloc() failure is rare, but it just seems better to do it just
> before it's required and only when required.

I don't think that's safe. Realloc can return an entirely new pointer, so the endptr would be pointing outside of the allocation. I'd have to hang on to the original pointer before the realloc, so I'd need an extra variable anyway.

> Here there's a double assignment of "child".

Will fix.

> I might be mistaken here, but can't you just set context->mem_allocated
> = 0; after that loop?
> Or maybe it would be an improvement to only do the decrement
> if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
> mem_allocated == 0 after the loop...

OK. Why not just always Assert that?

> One other thing that I don't remember seeing discussed would be to
> just store a List in each context which would store all of the
> mem_allocated's that need to be updated during each allocation and
> deallocation of memory. The list would be setup once when the context
> is created.
> When a child context is setup we'd just loop up the parent
> context chain and list_concat() all of the parent's lists onto the
> child's lists. Would this method add too much overhead when a context
> is deleted? Has this been explored already?

That's a good idea, as it would be faster than recursing. Also, the number of parents can't change, so we can just make an array, and that would be quite fast to update. Unless I'm missing something, this sounds like a nice solution. It would require more space in MemoryContextData, but that might not be a problem.

Regards,
Jeff Davis
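Jeff's realloc() ordering point can be made concrete with a small sketch. This is illustrative only; in the actual patch oldblksize is computed from the block header, whereas here it is simply passed in. The key constraint is the same: the old size must be captured before calling realloc(), because on success the old pointer may already have been freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of block-level accounting across a realloc(). The caller must
 * compute oldblksize from the block BEFORE this call: once realloc()
 * succeeds, the old block may be gone, so reading its header (e.g.
 * endptr arithmetic in the patch) afterwards would be use-after-free. */
static void *
ctx_realloc_block(int64_t *mem_allocated, void *block,
                  size_t oldblksize, size_t blksize)
{
    void *newblock = realloc(block, blksize);

    if (newblock == NULL)
        return NULL;            /* failure: accounting stays unchanged */

    *mem_allocated += (int64_t) blksize - (int64_t) oldblksize;
    return newblock;
}
```

This also shows why moving the oldblksize computation after the "if" buys nothing: an extra variable holding pre-realloc state is needed either way.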
On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:
> I might be mistaken here, but can't you just set context->mem_allocated
> = 0; after that loop?
> Or maybe it would be an improvement to only do the decrement
> if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
> mem_allocated == 0 after the loop...
>
OK. Why not just always Assert that?
>
Well, I thought the per-loop decrement of mem_allocated was wasteful; shouldn't it always get to zero after the final loop anyway?
If so, for efficiency it would be better to zero it after the loop, but if we do that then we can't assert for zero.
> One other thing that I don't remember seeing discussed would be to
> just store a List in each context which would store all of the
> mem_allocated's that need to be updated during each allocation and
> deallocation of memory. The list would be setup once when the context
> is created. When a child context is setup we'd just loop up the parent
> context chain and list_concat() all of the parent's lists onto the
> child's lists. Would this method add too much overhead when a context
> is deleted? Has this been explored already?
>
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.
Oh, an array is even better, but why more space? Won't it just be a pointer to an array? It can be NULL if there's nothing to track. Sounds like it would be the same amount of space, at least on a 64-bit machine.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-07-07 16:49:58 +1200, David Rowley wrote:
> I've been looking at this patch and trying to reproduce the reported
> slowdown by using Tomas' function to try to exercise palloc() with minimal
> overhead of other code:
>
> https://github.com/tvondra/palloc_bench

That's not necessarily meaningful. Increased cache footprint (both data and branch prediction) is often only noticeable with additional code running, pushing stuff out.
On 7 July 2015 at 20:23, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:
> One other thing that I don't remember seeing discussed would be to
> just store a List in each context which would store all of the
> mem_allocated's that need to be updated during each allocation and
> deallocation of memory. The list would be setup once when the context
> is created. When a child context is setup we'd just loop up the parent
> context chain and list_concat() all of the parent's lists onto the
> child's lists. Would this method add too much overhead when a context
> is deleted? Has this been explored already?
>
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.

Oh, an array is even better, but why more space? Won't it just be a pointer to an array? It can be NULL if there's nothing to track. Sounds like it would be the same amount of space, at least on a 64-bit machine.
I'd imagine that the first element of the array could just store the length of it. The type might be slightly wider for what you need for an array length, but this'll save having an extra field in the struct for it.
Are you going to implement this? Or are you happy that single-level context tracking is the right thing?
I'm going to mark the patch as waiting on author for now.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote:
> Are you going to implement this? Or are you happy that single-level
> context tracking is the right thing?
> I'm going to mark the patch as waiting on author for now.

Attached is a version of the patch that does multi-level tracking, v12. It does this in a simpler way, just like an earlier version of the patch: it simply traverses up the chain and adds to each parent in a "total_allocated" field. The idea you mentioned is a possible optimization of this idea, but it only makes sense if I'm able to detect a difference between v11 (single-level) and v12 (multi-level).

I tried Robert's test[1] again and I didn't see a difference on my workstation (technically, v12 came out the fastest, which means I'm just seeing noise anyway), so I can't evaluate whether your idea will improve things.

After talking with a few people at PGCon, small noisy differences in CPU timings can appear for almost any tweak to the code, and aren't necessarily cause for major concern.

Regards,
Jeff Davis

[1] pgbench -i -s 300, then do the following 3 times each for master, v11, and v12, and take the median of logged traces: start server; set trace_sort=on; reindex index pgbench_accounts_pkey; stop server
Attachment
On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> After talking with a few people at PGCon, small noisy differences in CPU
> timings can appear for almost any tweak to the code, and aren't
> necessarily cause for major concern.

I agree with that in general, but the concern is a lot bigger when the function is something that is called everywhere and accounts for a measurable percentage of our total CPU usage on almost any workload. If memory allocation got slower because, say, you added some code to regexp.c and it caused AllocSetAlloc to split a cache line where it hadn't previously, I wouldn't be worried about that; the next patch, like as not, will buy the cost back again. But here you really are adding code to a hot path.

tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. If the transition state is a varlena or a fixed-length type, we can know how much space it's consuming without hooking into the memory context framework.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
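Robert's last point rests on the fact that a varlena datum carries its own total size in its header (PostgreSQL reads it with the VARSIZE family of macros), so a consumer can account for such a state without allocator hooks. A toy illustration follows; it is heavily simplified (real varlena headers also have short and TOASTed forms), and trans_state_size is a hypothetical helper, not a PostgreSQL function:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy varlena: like PostgreSQL's, the first word stores the datum's
 * total size, so its footprint can be read directly from the value. */
typedef struct
{
    uint32_t vl_len_;           /* total size in bytes, header included */
    char     data[];
} Varlena;

/* Hypothetical sizing helper: fixed-length types are typlen bytes;
 * typlen == -1 means varlena, whose header gives the size. */
static int64_t
trans_state_size(const void *state, int typlen)
{
    if (typlen > 0)
        return typlen;                              /* fixed-length type */
    return ((const Varlena *) state)->vl_len_;      /* varlena */
}
```

This is why, on Robert's view, only 'internal'-typed transition states (arbitrary structures with no self-describing size) pose a sizing problem, and those can't be spilled to disk anyway.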
Hi,

On 07/14/2015 10:19 PM, Robert Haas wrote:
> On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> After talking with a few people at PGCon, small noisy differences
>> in CPU timings can appear for almost any tweak to the code, and
>> aren't necessarily cause for major concern.
>
> I agree with that in general, but the concern is a lot bigger when the
> function is something that is called everywhere and accounts for a
> measurable percentage of our total CPU usage on almost any workload.
> If memory allocation got slower because, say, you added some code to
> regexp.c and it caused AllocSetAlloc to split a cache line where it
> hadn't previously, I wouldn't be worried about that; the next patch,
> like as not, will buy the cost back again. But here you really are
> adding code to a hot path.

I don't think Jeff was suggesting that we should ignore changes measurably affecting performance - I'm one of those he discussed this patch with at PGCon, and I can assure you impact on performance was one of the main topics of the discussion.

Firstly, do we really have good benchmarks and measurements? I really doubt that. We do have some numbers for REINDEX, where we observed a 0.5-1% regression on noisy results from a POWER machine (and we've been unable to reproduce that on x86). I don't think that's a representative benchmark, and I'd like to see more thorough measurements. And I agreed to do this once Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by the additional instructions, or by other things - say, random padding, as was explained by Andrew Gierth in [1]. I don't know whether that's happening in this patch, but if it is, it seems rather strange to use this against this patch and not the others (because there surely will be other patches causing similar issues).
[1] http://www.postgresql.org/message-id/87vbitb2zp.fsf@news-spur.riddles.org.uk > tuplesort.c does its own accounting, and TBH that seems like the right > thing to do here, too. The difficulty is, I think, that some > transition functions use an internal data type for the transition > state, which might not be a single palloc'd chunk. But since we can't > spill those aggregates to disk *anyway*, that doesn't really matter. > If the transition is a varlena or a fixed-length type, we can know how > much space it's consuming without hooking into the memory context > framework. I respectfully disagree. Our current inability to dump/load the state has little to do with how we measure consumed memory, IMHO. It's true that we do have two kinds of aggregates, depending on the nature of the aggregate state: (a) fixed-size state (data types passed by value, variable length types that do not grow once allocated, ...) (b) continuously growing state (as in string_agg/array_agg) Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a solution for dump/load of the aggregate stats - which we need to implement anyway for parallel aggregate. I know there was a proposal to force all aggregates to use regular data types as aggregate stats, but I can't see how that could work without a significant performance penalty. For example array_agg() is using internal to pass ArrayBuildState - how do you turn that to a regular data type without effectively serializing/deserializing the whole array on every transition? And even if we come up with a solution for array_agg, do we really believe it's possible to do for all custom aggregates? Maybe I'm missing something but I doubt that. ISTM designing ephemeral data structure allows tweaks that are impossible otherwise. What might be possible is extending the aggregate API with another custom function returning size of the aggregate state. 
So when defining an aggregate using 'internal' for aggregate state, you'd specify transfunc, finalfunc and sizefunc. That seems doable, I guess. I find the memory accounting as a way more elegant solution, though. kind regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
> tuplesort.c does its own accounting, and TBH that seems like the right
> thing to do here, too. The difficulty is, I think, that some
> transition functions use an internal data type for the transition
> state, which might not be a single palloc'd chunk. But since we can't
> spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Regards,
Jeff Davis
On Wed, Jul 15, 2015 at 12:57 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
> tuplesort.c does its own accounting, and TBH that seems like the right
> thing to do here, too. The difficulty is, I think, that some
> transition functions use an internal data type for the transition
> state, which might not be a single palloc'd chunk. But since we can't
> spill those aggregates to disk *anyway*, that doesn't really matter.
So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?
Regards,
Jeff Davis
I think a heuristic would be better suited here; ignoring memory
consumption for internal types would mean the memory accounting is not
useful for a whole set of use cases.
--
Regards,
Atri
l'apprenant
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:
> I think a heuristic would be more suited here and ignoring memory
> consumption for internal types means that we are not making the memory
> accounting useful for a set of usecases.

OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Regards,
Jeff Davis
On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Firstly, do we really have good benchmarks and measurements? I really
> doubt that. We do have some numbers for REINDEX, where we observed
> 0.5-1% regression on noisy results from a Power machine (and we've
> been unable to reproduce that on x86). I don't think that's a
> representative benchmark, and I'd like to see more thorough
> measurements. And I agreed to do this, once Jeff comes up with a new
> version of the patch.
>
> Secondly, the question is whether the performance is impacted more by
> the additional instructions, or by other things - say, random
> padding, as was explained by Andrew Gierth in [1].
>
> I don't know whether that's happening in this patch, but if it is, it
> seems rather strange to use this against this patch and not the
> others (because there surely will be other patches causing similar
> issues).

It matters, at least to me, an awful lot where we're adding lines of
code. If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really matter
to performance even if we add a significant chunk of overhead, and
we're doing other expensive stuff at the same time, like fsync. The
same can't be said about functions like LWLockAcquire and AllocSetAlloc
that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did are
sufficient reason to think that the abstract concern - that putting
more code into a function might make it slower - translates into a
measurable drop in performance in practice. But I think when it comes
to these very hot code paths, extreme conservatism is warranted. We can
agree to disagree about that.

>> tuplesort.c does its own accounting, and TBH that seems like the
>> right thing to do here, too. The difficulty is, I think, that some
>> transition functions use an internal data type for the transition
>> state, which might not be a single palloc'd chunk. But since we
>> can't spill those aggregates to disk *anyway*, that doesn't really
>> matter. If the transition state is a varlena or a fixed-length type,
>> we can know how much space it's consuming without hooking into the
>> memory context framework.
>
> I respectfully disagree. Our current inability to dump/load the state
> has little to do with how we measure consumed memory, IMHO.
>
> It's true that we do have two kinds of aggregates, depending on the
> nature of the aggregate state:
>
> (a) fixed-size state (data types passed by value, variable-length
> types that do not grow once allocated, ...)
>
> (b) continuously growing state (as in string_agg/array_agg)
>
> Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a
> solution for dump/load of the aggregate state - which we need to
> implement anyway for parallel aggregate.
>
> I know there was a proposal to force all aggregates to use regular
> data types as aggregate state, but I can't see how that could work
> without a significant performance penalty. For example array_agg() is
> using internal to pass ArrayBuildState - how do you turn that into a
> regular data type without effectively serializing/deserializing the
> whole array on every transition?

That is a good point. Tom suggested that his new expanded-format stuff
might be able to be adapted to the purpose, but I have no idea how
possible that really is.

> And even if we come up with a solution for array_agg, do we really
> believe it's possible to do for all custom aggregates? Maybe I'm
> missing something, but I doubt that. ISTM designing an ephemeral data
> structure allows tweaks that are impossible otherwise.
>
> What might be possible is extending the aggregate API with another
> custom function returning the size of the aggregate state. So when
> defining an aggregate using 'internal' for the aggregate state, you'd
> specify transfunc, finalfunc and sizefunc. That seems doable, I guess.

And infunc and outfunc. If we don't use the expanded-format stuff for
aggregates with a type-internal transition state, we won't be able to
use input and output functions to serialize and deserialize them,
either.

> I find the memory accounting a way more elegant solution, though.

I think we're just going to have to agree to disagree on that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 15, 2015 at 3:27 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
>> tuplesort.c does its own accounting, and TBH that seems like the
>> right thing to do here, too. The difficulty is, I think, that some
>> transition functions use an internal data type for the transition
>> state, which might not be a single palloc'd chunk. But since we
>> can't spill those aggregates to disk *anyway*, that doesn't really
>> matter.
>
> So would it be acceptable to just ignore the memory consumed by
> "internal", or come up with some heuristic?

Either one would be OK with me. I'd probably ignore that for the first
version of the patch and just let the hash table grow without bound in
that case. Then in a later patch I'd introduce additional
infrastructure to deal with that part of the problem. But if something
else works out well while coding it up, I'd be OK with that, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 07/15/2015 09:21 PM, Robert Haas wrote:
> On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Firstly, do we really have good benchmarks and measurements? I
>> really doubt that. We do have some numbers for REINDEX, where we
>> observed 0.5-1% regression on noisy results from a Power machine
>> (and we've been unable to reproduce that on x86). I don't think
>> that's a representative benchmark, and I'd like to see more thorough
>> measurements. And I agreed to do this, once Jeff comes up with a new
>> version of the patch.
>>
>> Secondly, the question is whether the performance is impacted more
>> by the additional instructions, or by other things - say, random
>> padding, as was explained by Andrew Gierth in [1].
>>
>> I don't know whether that's happening in this patch, but if it is,
>> it seems rather strange to use this against this patch and not the
>> others (because there surely will be other patches causing similar
>> issues).
>
> It matters, at least to me, an awful lot where we're adding lines of
> code. If you want to add modest amounts of additional code to CREATE
> TABLE or CHECKPOINT or something like that, I really don't care,
> because that stuff doesn't execute frequently enough to really
> matter to performance even if we add a significant chunk of overhead,
> and we're doing other expensive stuff at the same time, like fsync.
> The same can't be said about functions like LWLockAcquire and
> AllocSetAlloc that routinely show up at the top of CPU profiles.
>
> I agree that there is room to question whether the benchmarks I did
> are sufficient reason to think that the abstract concern that putting
> more code into a function might make it slower translates into a
> measurable drop in performance in practice. But I think when it comes
> to these very hot code paths, extreme conservatism is warranted. We
> can agree to disagree about that.

No, that is not what I tried to say. I certainly agree that we need to
pay attention when adding stuff to hot paths - there's no disagreement
about that.

The problem with random padding is that adding code somewhere may
introduce padding that affects other pieces of code. That is
essentially the point of Andrew's explanation that I linked in my
previous message. And the question is - are the differences we've
measured really due to code added to the hot path, or something
introduced by random padding due to some other changes (possibly in
code that was not even executed during the test)?

I don't know how significant an impact this may have in this case, or
how serious this is in general, but we're talking about a 0.5-1%
difference on a noisy benchmark. And if such cases of random padding
really are a problem in practice, we've certainly missed plenty of
other patches with the same impact.

Because effectively what Jeff's last patch did was add a single int64
counter to the MemoryContextData structure, and increment it for each
allocated block (not chunk). I can't really imagine a solution making
it cheaper, because there are very few faster operations. Even "opt-in"
won't make this much faster, because you'll have to check a flag and
you'll need two fields (flag + counter).

Of course, this assumes a "local counter" (i.e. not updating the
counters of the parent contexts), but I claim that's OK. I had
previously been pushing for eagerly updating all the parent contexts,
so that finding out the allocated memory is fast even when there are
many child contexts - a prime example was array_agg() before I fixed
it. But I changed my mind on this and now say "screw them".

I claim that aggregates using a separate memory context for each group
are a lost cause - they already suffer a significant overhead, and
should be fixed just like we fixed array_agg(). That was effectively
the outcome of the PGCon discussions - use the simple int64 counter, do
the accounting for all contexts, and update only the local counter. For
cases with many child contexts, finding out the amount of allocated
memory won't be cheap, but well - there's not much we can do about
that.

>> I know there was a proposal to force all aggregates to use regular
>> data types as aggregate state, but I can't see how that could work
>> without a significant performance penalty. For example array_agg()
>> is using internal to pass ArrayBuildState - how do you turn that
>> into a regular data type without effectively
>> serializing/deserializing the whole array on every transition?
>
> That is a good point. Tom suggested that his new expanded-format
> stuff might be able to be adapted to the purpose, but I have no idea
> how possible that really is.

Me neither, and maybe there's a good solution for that, making my
concerns unfounded.

>> What might be possible is extending the aggregate API with another
>> custom function returning the size of the aggregate state. So when
>> defining an aggregate using 'internal' for the aggregate state,
>> you'd specify transfunc, finalfunc and sizefunc. That seems doable,
>> I guess.
>
> And infunc and outfunc. If we don't use the expanded-format stuff for
> aggregates with a type-internal transition state, we won't be able to
> use input and output functions to serialize and deserialize them,
> either.

Sure, that is indeed necessary for spilling the aggregates to disk. But
for aggregates with fixed-size state that's not necessary (Jeff's
HashAgg patch handles them fine), so I see this as a separate thing.

>> I find the memory accounting a way more elegant solution, though.
>
> I think we're just going to have to agree to disagree on that.

Sure, it's certainly a matter of personal taste.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 07/15/2015 07:07 PM, Jeff Davis wrote:
> On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:
>> I think a heuristic would be more suited here and ignoring memory
>> consumption for internal types means that we are not making the
>> memory accounting useful for a set of usecases.
>
> OK, I will drop this patch and proceed with the HashAgg patch, with
> a heuristic for internal types.

Could someone briefly explain what heuristics we are talking about?

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16 July 2015 at 05:07, Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:
>
> I think a heuristic would be more suited here and ignoring memory
> consumption for internal types means that we are not making the memory
> accounting useful for a set of usecases.
>
OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.
Should we mark the patch as "returned with feedback" in the commitfest app then?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote:
> Should we mark the patch as "returned with feedback" in the commitfest
> app then?

I believe the memory accounting patch has been rejected. Instead, the
work will be done in the HashAgg patch.

Thank you for the review!

Regards,
Jeff Davis