
Memory Accounting v11

From
Jeff Davis
Date:
This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.
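
For readers following along, the interface described above can be sketched roughly like this (a hedged, standalone sketch: the struct and function are simplified stand-ins, not the actual mctx.h definitions from the patch):

```c
#include <stddef.h>

/* Simplified stand-ins for the real context structs */
typedef struct MemoryContextData
{
	size_t		mem_allocated;	/* bytes of blocks malloc'd for this context */
	struct MemoryContextData *firstchild;
	struct MemoryContextData *nextchild;
} MemoryContextData;
typedef MemoryContextData *MemoryContext;

/* Total bytes allocated for 'context', optionally including subcontexts */
static size_t
MemoryContextMemAllocated(MemoryContext context, int recurse)
{
	size_t		total = context->mem_allocated;

	if (recurse)
	{
		MemoryContext child;

		for (child = context->firstchild; child != NULL;
			 child = child->nextchild)
			total += MemoryContextMemAllocated(child, recurse);
	}
	return total;
}
```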

Previous discussion here:

http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.
* Previous versions of the patch updated the parent contexts'
allocations as well, which avoided the need to recurse when querying the
total allocation. That seemed to penalize cases that didn't need to
track the allocation. We discussed trying to "opt-in" to this behavior,
but it seemed more awkward than helpful. Now, the patch only updates the
allocation of the current context, and querying means recursing through
the child contexts.
* There was a concern that, if MemoryContextMemAllocated needs to
recurse to the child contexts, it will be too slow for HashAgg of
array_agg, which creates a child context per group. That was solved with
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.

The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem. Others
have also mentioned that we might want to use this mechanism to track
memory for other operators, like Sort or HashJoin, which might be
simpler and more accurate.

Regards,
    Jeff Davis


Attachment

Re: Memory Accounting v11

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> This patch tracks memory usage (at the block level) for all memory
> contexts. Individual palloc()s aren't tracked; only the new blocks
> allocated to the memory context with malloc().
> ...
> My general answer to the performance concerns is that they aren't a
> reason to block this patch, unless someone has a suggestion about how to
> fix them. Adding one field to a struct and a few arithmetic operations
> for each malloc() or realloc() seems reasonable to me.

Maybe, but this here is a pretty weak argument:

> The current state, where HashAgg just blows up the memory, is just not
> reasonable, and we need to track the memory to fix that problem.

Meh.  HashAgg could track its memory usage without loading the entire
system with a penalty.  Moreover, this is about fourth or fifth down the
list of the implementation problems with spilling hash aggregation to
disk.  It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.

> Others have also mentioned that we might want to use this mechanism to
> track memory for other operators, like Sort or HashJoin, which might be
> simpler and more accurate.

That would imply redefining the meaning of work_mem for those operations;
it would suddenly include a lot more overhead space than it used to,
causing them to spill to disk much more quickly than before, unless one
changes the work_mem setting to compensate.  Not sure that people would
like such a change.

An even bigger problem is that it would pretty much break the logic around
LACKMEM() in tuplesort.c, which supposes that spilling individual tuples
to disk is a reliable and linear way to decrease the accounted-for memory.
Individual pfree's would typically not decrease the accounted total in
this implementation.  Depending on what you have in mind for the
spill-to-disk behavior, this issue could be a fail for HashAgg as well,
which is another reason not to commit this patch in advance of seeing the
use-case.
        regards, tom lane



Re: Memory Accounting v11

From
Tomas Vondra
Date:
Hi,

On 06/14/15 21:43, Jeff Davis wrote:
> This patch tracks memory usage (at the block level) for all memory
> contexts. Individual palloc()s aren't tracked; only the new blocks
> allocated to the memory context with malloc().

I see it's adding the new field as int64 - wouldn't a Size be more 
appropriate, considering that's what we use in mctx.h and aset.c?

> It also adds a new function, MemoryContextMemAllocated() which can
> either retrieve the total allocated for the context, or it can
> recurse and sum up the allocations for all subcontexts as well.
>
> This is intended to be used by HashAgg in an upcoming patch that will
> bound the memory and spill to disk.
>
> Previous discussion here:
>
> http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop
>
> Previous concerns:
>
> * There was a slowdown reported of around 1-3% (depending on the exact
> version of the patch) on an IBM power machine when doing an index
> rebuild. The results were fairly noisy for me, but it seemed to hold up.
> See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
> * Adding a struct field to MemoryContextData may be bad for the CPU
> caching behavior, and may be the cause of the above slowdown.

> * Previous versions of the patch updated the parent contexts'
> allocations as well, which avoided the need to recurse when querying
> the total allocation. That seemed to penalize cases that didn't need
> to track the allocation. We discussed trying to "opt-in" to this
> behavior, but it seemed more awkward than helpful. Now, the patch
> only updates the allocation of the current context, and querying
> means recursing through the child contexts.

I don't think the opt-in idea itself was awkward; it was rather the 
particular APIs we came up with, especially when combined with the 
'context inheritance' thing.

I still think the opt-in approach and updating accounting for the parent 
contexts was the best one, because it (a) minimizes impact in cases that 
don't use the accounting, and (b) makes finding the current amount of 
memory cheap ...

> * There was a concern that, if MemoryContextMemAllocated needs to
> recurse to the child contexts, it will be too slow for HashAgg of
> array_agg, which creates a child context per group. That was solved with
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

I wouldn't say this was "solved" - we have fixed one particular example 
of such an aggregate implementation, because it was causing OOM issues 
with many groups, but there may be other custom aggregates using the 
same pattern.

Granted, built-in aggregates are probably more critical than aggregates 
provided by extensions, but I wouldn't dare to mark this solved ...

> My general answer to the performance concerns is that they aren't a
> reason to block this patch, unless someone has a suggestion about how
> to fix them. Adding one field to a struct and a few arithmetic
> operations for each malloc() or realloc() seems reasonable to me.

I'm not buying this, sorry. While I agree that we should not expect the 
memory accounting to be entirely free, we should be very careful about 
the overhead, especially if we're dropping the opt-in and thus imposing 
the overhead on everyone.

But the "performance concerns are not a reason to block this patch" 
approach seems wrong. With any other patch, a 3% regression would be 
considered a serious issue IMNSHO.

> The current state, where HashAgg just blows up the memory, is just
> not reasonable, and we need to track the memory to fix that problem.
> Others have also mentioned that we might want to use this mechanism
> to track memory for other operators, like Sort or HashJoin, which
> might be simpler and more accurate.

Dropping the existing memory accounting implementations and keeping just 
this new solution would be nice, but only if we agree that the 
performance impact is acceptable. We already have an accounting solution 
for each of those places, so I don't think the unification alone 
outweighs the regression.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting v11

From
Tomas Vondra
Date:
Hi,

On 06/14/15 22:21, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> This patch tracks memory usage (at the block level) for all memory
>> contexts. Individual palloc()s aren't tracked; only the new blocks
>> allocated to the memory context with malloc().
>> ...
>> My general answer to the performance concerns is that they aren't a
>> reason to block this patch, unless someone has a suggestion about how to
>> fix them. Adding one field to a struct and a few arithmetic operations
>> for each malloc() or realloc() seems reasonable to me.
>
> Maybe, but this here is a pretty weak argument:
>
>> The current state, where HashAgg just blows up the memory, is just not
>> reasonable, and we need to track the memory to fix that problem.
>
> Meh. HashAgg could track its memory usage without loading the entire
> system with a penalty.

+1 to a solution like that, although I don't think that's doable without 
digging the info from memory contexts somehow.

> Moreover, this is about fourth or fifth down the
> list of the implementation problems with spilling hash aggregation to
> disk.  It would be good to see credible solutions for the bigger issues
> before we buy into adding overhead for a mechanism with no uses in core.

I don't think so. If we don't have an acceptable solution for tracking 
memory in hashagg, there's not really much point in doing any of the 
following steps. That's why Jeff is tackling this problem first.

Also, Jeff already posted a PoC of the memory-bounded hashagg, although 
it only worked for aggregates with fixed-size state and wasn't that 
great for cases like array_agg where the state grows. Maybe the 
improvements to aggregate functions proposed by David Rowley last week 
might help in those cases, as that'd allow spilling those states to disk.

So while the approach proposed by Jeff may turn out not to be the right 
one, I think memory accounting needs to be solved first (even if it's 
not committed until the whole feature is ready).

>> Others have also mentioned that we might want to use this mechanism
>> to track memory for other operators, like Sort or HashJoin, which
>> might be simpler and more accurate.
>
> That would imply redefining the meaning of work_mem for those
> operations; it would suddenly include a lot more overhead space than
> it used to, causing them to spill to disk much more quickly than
> before, unless one changes the work_mem setting to compensate. Not
> sure that people would like such a change.

Don't we tweak the work_mem meaning over time anyway? Maybe not to such 
an extent, but for example the 9.5 hashjoin improvements certainly change 
this by packing the tuples more densely, sizing the buckets differently, 
etc. It's true the changes are in the other direction (i.e. batching 
less frequently), though.

OTOH this would make the accounting more accurate (e.g. eliminating 
differences between cases with different amounts of overhead because of 
different tuple width, for example). Wouldn't that be a good thing, even 
if people need to tweak the work_mem? Isn't that kinda acceptable when 
upgrading to the next major version?

> An even bigger problem is that it would pretty much break the logic
> around LACKMEM() in tuplesort.c, which supposes that spilling
> individual tuples to disk is a reliable and linear way to decrease
> the accounted-for memory. Individual pfree's would typically not
> decrease the accounted total in this implementation. Depending on
> what you have in mind for the spill-to-disk behavior, this issue
> could be a fail for HashAgg as well, which is another reason not to
> commit this patch in advance of seeing the use-case.

That's true, but I think the plan was always to wait for the whole 
feature, and only then commit all the pieces.


--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting v11

From
Jeff Davis
Date:
On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote:
> Meh.  HashAgg could track its memory usage without loading the entire
> system with a penalty.

When I tried doing it outside of the MemoryContexts, it seemed to get
awkward quickly. I'm open to suggestion if you can point me in the right
direction. Maybe I can peek at the sizes of chunks holding state values
and group keys, and combine that with the hash table size-estimating
code?

>   Moreover, this is about fourth or fifth down the
> list of the implementation problems with spilling hash aggregation to
> disk.  It would be good to see credible solutions for the bigger issues
> before we buy into adding overhead for a mechanism with no uses in core.

I had several iterations of a full implementation of the spill-to-disk
HashAgg patch[1]. Tomas Vondra had some constructive review comments,
but all of them seemed solvable. What do you see as a major unsolved
issue?

If I recall, you were concerned about things like array_agg, where an
individual state could get larger than work_mem. That's a valid concern,
but it's not the problem I was trying to solve.

Regards,
    Jeff Davis

[1]
http://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop




Re: Memory Accounting v11

From
Simon Riggs
Date:
On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
>>> The current state, where HashAgg just blows up the memory, is just not
>>> reasonable, and we need to track the memory to fix that problem.
>>
>> Meh. HashAgg could track its memory usage without loading the entire
>> system with a penalty.
>
> +1 to a solution like that, although I don't think that's doable without digging the info from memory contexts somehow.

Jeff is right, we desperately need a solution and this is the place to start.

Tom's concern remains valid: we must not load the entire system with a penalty.


The only questions I have are:

* If the memory allocations adapt to the usage pattern, then we expect to see few memory chunk allocations. Why are we expecting "the entire system" to experience a penalty?

* If we do not manage our resources, how are we certain this does not induce a penalty? Not tracking memory could be worse than tracking it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Memory Accounting v11

From
CK Tan
Date:
On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
>>> The current state, where HashAgg just blows up the memory, is just not
>>> reasonable, and we need to track the memory to fix that problem.
>>
>> Meh. HashAgg could track its memory usage without loading the entire
>> system with a penalty.
>
> +1 to a solution like that, although I don't think that's doable without digging the info from memory contexts somehow.

I am sorry to ask questions unrelated to the subject, but how is tracking memory going to fix the HashAgg blow up problem? Is there a plan to make HashAgg not blow up (i.e. spill the hash table)?

Thanks,
-cktan



On Thu, Jul 2, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Jeff is right, we desperately need a solution and this is the place to start.
>
> Tom's concern remains valid: we must not load the entire system with a penalty.

Re: Memory Accounting v11

From
Qingqing Zhou
Date:
On Thu, Jul 2, 2015 at 11:37 AM, CK Tan <cktan@vitessedata.com> wrote:
>
> I am sorry to ask questions unrelated to the subject, but how is tracking
> memory going to fix the HashAgg blow up problem? Is there a plan to make
> HashAgg not blow up (i.e. spill the hash table)?
>

Your question is probably answered here:
http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Regards,
Qingqing



Re: Memory Accounting v11

From
David Rowley
Date:
On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com

Hi Jeff,

I've been looking over the code and re-reading the previous emails about this patch.
I don't yet understand whether the reported slowdown comes from the increase in struct size or from the additional work being done in palloc() calls. However, while reading the code I did notice an existing redundant NULL check in AllocSetAlloc(), right where you put your extra accounting code.

The attached patch should apply on top of your patch and removes the extra NULL check.

Perhaps if some of the overhead comes from the extra instructions, then this can help get us back to where we were before.

Regards

David Rowley
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 
Attachment

Re: Memory Accounting v11

From
David Rowley
Date:

On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:
This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.

Previous discussion here:

http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.

I've been looking at this patch and trying to reproduce the reported slowdown by using Tomas' function to try to exercise palloc() with minimal overhead of other code:

https://github.com/tvondra/palloc_bench

I also wanted to attempt to determine whether the slowdown was due to the mem_allocated maintenance or due to the struct size increasing.
I decided to test this simply by commenting out all of the context->mem_allocated += / -= operations while keeping the mem_allocated field in the struct, but I've been really struggling to see any sort of performance decrease anywhere. I'm just getting too much variation in the test results to get any sort of idea.

I've attached a spreadsheet of the results I saw. It would be really good if Robert were able to test on the IBM PPC with just the extra field in the struct, so we can see whether the 1-3% lies there, or in the overhead of keeping mem_allocated up-to-date.

My main question here is: How sure are you that none of your intended use cases will need to call MemoryContextMemAllocated with recurse = true? I'd imagine if you ever have to use MemoryContextMemAllocated(ctx, true) then we'll all be sitting around with our stopwatches out to see if the call incurs too much of a performance penalty.

Other small issues:


+ oldblksize = block->endptr - ((char *)block);
  block = (AllocBlock) realloc(block, blksize);
  if (block == NULL)
  return NULL;
+ context->mem_allocated += blksize - oldblksize;

Shouldn't you be setting oldblksize after the "if"? I'd imagine the realloc() failure is rare, but it just seems better to do it just before it's required and only when required.

+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;

Here there's a double assignment of "child".

***************
*** 632,637 **** AllocSetDelete(MemoryContext context)
--- 637,644 ----
  {
  AllocBlock next = block->next;
  
+ context->mem_allocated -= block->endptr - ((char *) block);

I might be mistaken here, but can't you just set context->mem_allocated = 0; after that loop? 
Or maybe it would be an improvement to only do the decrement if MEMORY_CONTEXT_CHECKING is defined, then Assert() that mem_allocated == 0 after the loop...
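
To make the two options concrete, here is a standalone sketch of the decrement-plus-Assert variant (mocked types, not the real AllocSetDelete() from aset.c): each freed block decrements the counter, and the final check confirms every block was accounted for exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mocked block header: 'endptr' points one byte past the end of the block */
typedef struct AllocBlockData
{
	struct AllocBlockData *next;
	char	   *endptr;
} AllocBlockData;
typedef AllocBlockData *AllocBlock;

typedef struct
{
	size_t		mem_allocated;	/* total bytes of blocks malloc'd */
	AllocBlock	blocks;
} MockContext;

/* malloc one block of blksize bytes and account for it */
static void
context_add_block(MockContext *context, size_t blksize)
{
	AllocBlock	block = (AllocBlock) malloc(blksize);

	block->next = context->blocks;
	block->endptr = (char *) block + blksize;
	context->blocks = block;
	context->mem_allocated += blksize;
}

/* free all blocks, decrementing per block, then check the total is zero */
static void
context_delete(MockContext *context)
{
	AllocBlock	block = context->blocks;

	while (block != NULL)
	{
		AllocBlock	next = block->next;

		context->mem_allocated -= (size_t) (block->endptr - (char *) block);
		free(block);
		block = next;
	}
	assert(context->mem_allocated == 0);	/* every block accounted once */
	context->blocks = NULL;
}
```

The alternative David suggests would replace the per-block subtraction with a single `context->mem_allocated = 0;` after the loop, trading the self-check for fewer instructions.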


One other thing that I don't remember seeing discussed would be to just store a List in each context which would store all of the mem_allocated's that need to be updated during each allocation and deallocation of memory. The list would be setup once when the context is created. When a child context is setup we'd just loop up the parent context chain and list_concat() all of the parent's lists onto the child's lists. Would this method add too much overhead when a context is deleted? Has this been explored already?

Regards

David Rowley
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Memory Accounting v11

From
Jeff Davis
Date:
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:

> of performance decrease anywhere. I'm just getting too much variation
> in the test results to get any sort of idea.
> 
That was my experience as well. Thank you for taking a look.

> My main question here is: How sure are you that none of your intended
> use cases will need to call MemoryContextMemAllocated with recurse =
> true? I'd imagine if you ever have to
> use MemoryContextMemAllocated(ctx, true) then we'll all be sitting
> around with our stopwatches out to see if the call incurs too much of
> a performance penalty.

I don't expect HashAgg to be using many memory contexts. It did with
array_agg, but that's been changed, and was a bad pattern anyway (one
context per group is quite wasteful).

If it turns out to be slow, we can call it less frequently (e.g.
extrapolate growth rate to determine when to check).

> 
> Shouldn't you be setting oldblksize after the "if"? I'd imagine the
> realloc() failure is rare, but it just seems better to do it just
> before it's required and only when required.

I don't think that's safe. Realloc can return an entirely new pointer,
so the endptr would be pointing outside of the allocation. I'd have to
hang on to the original pointer before the realloc, so I'd need an extra
variable anyway.
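
A standalone sketch of the ordering Jeff describes (mocked types, not the actual aset.c code): the old block size is captured while the old header is still valid, and the accounting delta is applied only after a successful realloc().

```c
#include <stddef.h>
#include <stdlib.h>

/* Mocked block header; 'endptr' points one byte past the end of the block */
typedef struct AllocBlockData
{
	char	   *endptr;
} AllocBlockData;
typedef AllocBlockData *AllocBlock;

/*
 * Grow 'block' to blksize bytes, keeping 'mem_allocated' in sync.  The old
 * size must be read *before* realloc(): on success the old header may have
 * been freed, so block->endptr would no longer be safe to read afterwards.
 */
static AllocBlock
resize_block(AllocBlock block, size_t blksize, size_t *mem_allocated)
{
	size_t		oldblksize = (size_t) (block->endptr - (char *) block);

	block = (AllocBlock) realloc(block, blksize);
	if (block == NULL)
		return NULL;			/* accounting unchanged on failure */
	block->endptr = (char *) block + blksize;
	*mem_allocated += blksize - oldblksize;
	return block;
}
```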


> 
> Here there's a double assignment of "child".

Will fix.

> 
> I might be mistaken here, but can't you just set context->mem_allocated
> = 0; after that loop? 
> Or maybe it would be an improvement to only do the decrement
> if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
> mem_allocated == 0 after the loop...
> 
OK. Why not just always Assert that?
> 
> One other thing that I don't remember seeing discussed would be to
> just store a List in each context which would store all of the
> mem_allocated's that need to be updated during each allocation and
> deallocation of memory. The list would be setup once when the context
> is created. When a child context is setup we'd just loop up the parent
> context chain and list_concat() all of the parent's lists onto the
> child's lists. Would this method add too much overhead when a context
> is deleted? Has this been explored already?
> 
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.
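
A hedged sketch of what such an ancestor-pointer array might look like (entirely hypothetical layout, with an arbitrary fixed depth limit purely for brevity; the real patch would size the array from the actual parent chain):

```c
#include <stddef.h>

#define MAX_DEPTH 8				/* arbitrary limit, just for this sketch */

typedef struct MemoryContextData
{
	size_t		mem_allocated;	/* bytes charged to this counter */
	int			ndepth;			/* number of counters: self + ancestors */
	size_t	   *counters[MAX_DEPTH];	/* [0] is self, then each parent */
} MemoryContextData;
typedef MemoryContextData *MemoryContext;

/* build the counter array once at creation, copying the parent's array */
static void
context_init(MemoryContext context, MemoryContext parent)
{
	context->mem_allocated = 0;
	context->counters[0] = &context->mem_allocated;
	context->ndepth = 1;
	if (parent != NULL)
	{
		int			i;

		for (i = 0; i < parent->ndepth; i++)
			context->counters[context->ndepth++] = parent->counters[i];
	}
}

/* each allocation updates every level with a flat loop, no recursion */
static void
account_alloc(MemoryContext context, size_t size)
{
	int			i;

	for (i = 0; i < context->ndepth; i++)
		*context->counters[i] += size;
}
```

With this shape, reading a context's total (including children) is a single field read, at the cost of a few extra stores per block allocation.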

Regards,
    Jeff Davis






Re: Memory Accounting v11

From
David Rowley
Date:
On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:
> I might be mistaken here, but can't you just set context->mem_allocated
> = 0; after that loop?
> Or maybe it would be an improvement to only do the decrement
> if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
> mem_allocated == 0 after the loop...
>
OK. Why not just always Assert that?
>

Well, I thought the per-loop decrement of mem_allocated was wasteful; shouldn't it always get to zero after the final loop anyway?
If so, for efficiency it would be better to zero it after the loop, but if we do that then we can't Assert that it's zero.
 
> One other thing that I don't remember seeing discussed would be to
> just store a List in each context which would store all of the
> mem_allocated's that need to be updated during each allocation and
> deallocation of memory. The list would be setup once when the context
> is created. When a child context is setup we'd just loop up the parent
> context chain and list_concat() all of the parent's lists onto the
> child's lists. Would this method add too much overhead when a context
> is deleted? Has this been explored already?
>
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.


Oh, an array is even better, but why more space? Won't it just be a pointer to an array? It can be NULL if there's nothing to track. Sounds like it would be the same amount of space, at least on a 64-bit machine.

One thing to keep in mind is that I think if a parent is tracking memory, then a child will also need to track memory too, as when the child context is deleted, we'll need to decrement the parent's mem_allocated by that of all its child contexts.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Memory Accounting v11

From
Andres Freund
Date:
On 2015-07-07 16:49:58 +1200, David Rowley wrote:
> I've been looking at this patch and trying to reproduce the reported
> slowdown by using Tomas' function to try to exercise palloc() with minimal
> overhead of other code:
> 
> https://github.com/tvondra/palloc_bench

That's not necessarily meaningful. Increased cache footprint (both data
and branch prediction) is often only noticeable with additional code
running pushing stuff out.




Re: Memory Accounting v11

From
David Rowley
Date:
On 7 July 2015 at 20:23, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:
>>> One other thing that I don't remember seeing discussed would be to
>>> just store a List in each context which would store all of the
>>> mem_allocated's that need to be updated during each allocation and
>>> deallocation of memory. The list would be setup once when the context
>>> is created. When a child context is setup we'd just loop up the parent
>>> context chain and list_concat() all of the parent's lists onto the
>>> child's lists. Would this method add too much overhead when a context
>>> is deleted? Has this been explored already?
>>>
>> That's a good idea, as it would be faster than recursing. Also, the
>> number of parents can't change, so we can just make an array, and that
>> would be quite fast to update. Unless I'm missing something, this sounds
>> like a nice solution. It would require more space in MemoryContextData,
>> but that might not be a problem.
>
> Oh an array is even better, but why more space? won't it just be a pointer to an array? It can be NULL if there's nothing to track. Sounds like it would be the same amount of space, at least on a 64 bit machine.


I'd imagine that the first element of the array could just store its length. The type might be slightly wider than needed for an array length, but this'll save having an extra field in the struct for it.

Are you going to implement this? Or are you happy that single-level context tracking is the right thing?
I'm going to mark the patch as waiting on author for now.
 
Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

Re: Memory Accounting v11

From
Jeff Davis
Date:
On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote:

> Are you going to implement this? or are you happy with the single
> level context tracking is the right thing?
> I'm going to mark the patch as waiting on author for now.

Attached a version of the patch that does multi-level tracking, v12. It
does this in a simpler way, just like an earlier version of the patch:
it simply traverses up the chain and adds to each parent in a
"total_allocated" field.

The idea you mentioned is a possible optimization of this idea, but it
only makes sense if I'm able to detect a difference between v11
(single-level) and v12 (multi-level). I tried Robert's test[1] again and
I didn't see a difference on my workstation (technically, v12 came out
the fastest, which means I'm just seeing noise anyway), so I can't
evaluate whether your idea will improve things.

After talking with a few people at PGCon, small noisy differences in CPU
timings can appear for almost any tweak to the code, and aren't
necessarily cause for major concern.

Regards,
    Jeff Davis

[1] pgbench -i -s 300, then do the following 3 times each for master,
v11, and v12, and take the median of logged traces:

start server; set trace_sort=on; reindex index pgbench_accounts_pkey;
stop server


Attachment

Re: Memory Accounting v11

From
Robert Haas
Date:
On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> After talking with a few people at PGCon, small noisy differences in CPU
> timings can appear for almost any tweak to the code, and aren't
> necessarily cause for major concern.

I agree with that in general, but the concern is a lot bigger when the
function is something that is called everywhere and accounts for a
measurable percentage of our total CPU usage on almost any workload.
If memory allocation got slower because, say, you added some code to
regexp.c and it caused AllocSetAlloc to split a cache line where it
hadn't previously, I wouldn't be worried about that; the next patch,
like as not, will buy the cost back again.  But here you really are
adding code to a hot path.

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too.  The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk.  But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Memory Accounting v11

From
Tomas Vondra
Date:
Hi,

On 07/14/2015 10:19 PM, Robert Haas wrote:
> On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> After talking with a few people at PGCon, small noisy differences
>> in CPU timings can appear for almost any tweak to the code, and
>> aren't necessarily cause for major concern.
>
> I agree with that in general, but the concern is a lot bigger when the
> function is something that is called everywhere and accounts for a
> measurable percentage of our total CPU usage on almost any workload.
> If memory allocation got slower because, say, you added some code to
> regexp.c and it caused AllocSetAlloc to split a cache line where it
> hadn't previously, I wouldn't be worried about that; the next patch,
> like as not, will buy the cost back again.  But here you really are
> adding code to a hot path.

I don't think Jeff was suggesting that we should ignore changes measurably 
affecting performance - I'm one of those he discussed this patch with at 
pgcon, and I can assure you the impact on performance was one of the main 
topics of the discussion.

Firstly, do we really have good benchmarks and measurements? I really 
doubt that. We do have some numbers for REINDEX, where we observed 
0.5-1% regression on noisy results from a Power machine (and we've been 
unable to reproduce that on x86). I don't think that's a representative 
benchmark, and I'd like to see more thorough measurements. And I agreed 
to do this, once Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by 
the additional instructions, or by other things - say, random padding, 
as was explained by Andrew Gierth in [1].

I don't know whether that's happening in this patch, but if it is, it 
seems rather strange to use this against this patch and not the others 
(because there surely will be other patches causing similar issues).

[1] 
http://www.postgresql.org/message-id/87vbitb2zp.fsf@news-spur.riddles.org.uk

> tuplesort.c does its own accounting, and TBH that seems like the right
> thing to do here, too.  The difficulty is, I think, that some
> transition functions use an internal data type for the transition
> state, which might not be a single palloc'd chunk.  But since we can't
> spill those aggregates to disk *anyway*, that doesn't really matter.
> If the transition is a varlena or a fixed-length type, we can know how
> much space it's consuming without hooking into the memory context
> framework.

I respectfully disagree. Our current inability to dump/load the state 
has little to do with how we measure consumed memory, IMHO.

It's true that we do have two kinds of aggregates, depending on the 
nature of the aggregate state:

(a) fixed-size state (data types passed by value, variable length types
    that do not grow once allocated, ...)

(b) continuously growing state (as in string_agg/array_agg)

Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a 
solution for dump/load of the aggregate state - which we need to 
implement anyway for parallel aggregate.

I know there was a proposal to force all aggregates to use regular data 
types for the aggregate state, but I can't see how that could work without 
a significant performance penalty. For example, array_agg() is using 
internal to pass ArrayBuildState - how do you turn that into a regular 
data type without effectively serializing/deserializing the whole array 
on every transition?

And even if we come up with a solution for array_agg, do we really 
believe it's possible to do for all custom aggregates? Maybe I'm missing 
something but I doubt that. ISTM designing an ephemeral data structure 
allows tweaks that are impossible otherwise.

What might be possible is extending the aggregate API with another 
custom function returning the size of the aggregate state. So when 
defining an aggregate using 'internal' for the aggregate state, you'd 
specify a transfunc, finalfunc and sizefunc. That seems doable, I guess.

I find the memory accounting a way more elegant solution, though.

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting v11

From
Jeff Davis
Date:
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
> tuplesort.c does its own accounting, and TBH that seems like the right
> thing to do here, too.  The difficulty is, I think, that some
> transition functions use an internal data type for the transition
> state, which might not be a single palloc'd chunk.  But since we can't
> spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Regards,
    Jeff Davis





Re: Memory Accounting v11

From
Atri Sharma
Date:


On Wed, Jul 15, 2015 at 12:57 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
> tuplesort.c does its own accounting, and TBH that seems like the right
> thing to do here, too.  The difficulty is, I think, that some
> transition functions use an internal data type for the transition
> state, which might not be a single palloc'd chunk.  But since we can't
> spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Regards,
        Jeff Davis


I think a heuristic would be better suited here; ignoring memory
consumption for internal types means that we are not making the memory
accounting useful for a set of use cases.



--
Regards,
 
Atri
l'apprenant

Re: Memory Accounting v11

From
Jeff Davis
Date:
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

> 
> I think a heuristic would be better suited here; ignoring memory
> consumption for internal types means that we are not making the memory
> accounting useful for a set of use cases.
> 
OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Regards,
    Jeff Davis






Re: Memory Accounting v11

From
Robert Haas
Date:
On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Firstly, do we really have good benchmarks and measurements? I really doubt
> that. We do have some numbers for REINDEX, where we observed 0.5-1%
> regression on noisy results from a Power machine (and we've been unable to
> reproduce that on x86). I don't think that's a representative benchmark, and
> I'd like to see more thorough measurements. And I agreed to do this, once
> Jeff comes up with a new version of the patch.
>
> Secondly, the question is whether the performance is impacted more by the
> additional instructions, or by other things - say, random padding, as was
> explained by Andrew Gierth in [1].
>
> I don't know whether that's happening in this patch, but if it is, it seems
> rather strange to use this against this patch and not the others (because
> there surely will be other patches causing similar issues).

It matters, at least to me, an awful lot where we're adding lines of
code.  If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really matter
to performance even if we add a significant chunk of overhead, and
we're doing other expensive stuff at the same time, like fsync.  The
same can't be said about functions like LWLockAcquire and
AllocSetAlloc that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did
are sufficient reason to think that the abstract concern that putting
more code into a function might make it slower translates into a
measurable drop in performance in practice.  But I think when it comes
to these very hot code paths, extreme conservatism is warranted.  We
can agree to disagree about that.

>> tuplesort.c does its own accounting, and TBH that seems like the right
>> thing to do here, too.  The difficulty is, I think, that some
>> transition functions use an internal data type for the transition
>> state, which might not be a single palloc'd chunk.  But since we can't
>> spill those aggregates to disk *anyway*, that doesn't really matter.
>> If the transition is a varlena or a fixed-length type, we can know how
>> much space it's consuming without hooking into the memory context
>> framework.
>
> I respectfully disagree. Our current inability to dump/load the state has
> little to do with how we measure consumed memory, IMHO.
>
> It's true that we do have two kinds of aggregates, depending on the nature
> of the aggregate state:
>
> (a) fixed-size state (data types passed by value, variable length types
>     that do not grow once allocated, ...)
>
> (b) continuously growing state (as in string_agg/array_agg)
>
> Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a
> solution for dump/load of the aggregate stats - which we need to implement
> anyway for parallel aggregate.
>
> I know there was a proposal to force all aggregates to use regular data
> types as aggregate stats, but I can't see how that could work without a
> significant performance penalty. For example array_agg() is using internal
> to pass ArrayBuildState - how do you turn that to a regular data type
> without effectively serializing/deserializing the whole array on every
> transition?

That is a good point.  Tom suggested that his new expanded-format
stuff might be able to be adapted to the purpose, but I have no idea
how possible that really is.

> And even if we come up with a solution for array_agg, do we really believe
> it's possible to do for all custom aggregates? Maybe I'm missing something
> but I doubt that. ISTM designing ephemeral data structure allows tweaks that
> are impossible otherwise.
>
> What might be possible is extending the aggregate API with another custom
> function returning size of the aggregate state. So when defining an
> aggregate using 'internal' for aggregate state, you'd specify transfunc,
> finalfunc and sizefunc. That seems doable, I guess.

And infunc and outfunc.  If we don't use the expanded-format stuff for
aggregates with a type-internal transition state, we won't be able to
use input and output functions to serialize and deserialize them,
either.

> I find the memory accounting as a way more elegant solution, though.

I think we're just going to have to agree to disagree on that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Memory Accounting v11

From
Robert Haas
Date:
On Wed, Jul 15, 2015 at 3:27 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
>> tuplesort.c does its own accounting, and TBH that seems like the right
>> thing to do here, too.  The difficulty is, I think, that some
>> transition functions use an internal data type for the transition
>> state, which might not be a single palloc'd chunk.  But since we can't
>> spill those aggregates to disk *anyway*, that doesn't really matter.
>
> So would it be acceptable to just ignore the memory consumed by
> "internal", or come up with some heuristic?

Either one would be OK with me.  I'd probably ignore that for the
first version of the patch and just let the hash table grow without
bound in that case.  Then in a later patch I'd introduce additional
infrastructure to deal with that part of the problem.  But if
something else works out well while coding it up, I'd be OK with that,
too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Memory Accounting v11

From
Tomas Vondra
Date:
Hi,

On 07/15/2015 09:21 PM, Robert Haas wrote:
> On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Firstly, do we really have good benchmarks and measurements? I really doubt
>> that. We do have some numbers for REINDEX, where we observed 0.5-1%
>> regression on noisy results from a Power machine (and we've been unable to
>> reproduce that on x86). I don't think that's a representative benchmark, and
>> I'd like to see more thorough measurements. And I agreed to do this, once
>> Jeff comes up with a new version of the patch.
>>
>> Secondly, the question is whether the performance is impacted more by the
>> additional instructions, or by other things - say, random padding, as was
>> explained by Andrew Gierth in [1].
>>
>> I don't know whether that's happening in this patch, but if it is,
>> it seems rather strange to use this against this patch and not the
>> others (because there surely will be other patches causing similar
>> issues).
>
> It matters, at least to me, an awful lot where we're adding lines of
> code. If you want to add modest amounts of additional code to CREATE
> TABLE or CHECKPOINT or something like that, I really don't care,
> because that stuff doesn't execute frequently enough to really
> matter to performance even if we add a significant chunk of overhead,
> and we're doing other expensive stuff at the same time, like fsync.
> The same can't be said about functions like LWLockAcquire and
> AllocSetAlloc that routinely show up at the top of CPU profiles.
>
> I agree that there is room to question whether the benchmarks I did
> are sufficient reason to think that the abstract concern that putting
> more code into a function might make it slower translates into a
> measurable drop in performance in practice. But I think when it comes
> to these very hot code paths, extreme conservatism is warranted. We
> can agree to disagree about that.

No, that is not what I tried to say. I certainly agree that we need to 
pay attention when adding stuff to hot paths - there's no disagreement 
about this.

The problem with random padding is that adding code somewhere may 
introduce padding that affects other pieces of code. That is essentially 
the point of Andrew's explanation that I linked in my previous message.

And the question is - are the differences we've measured really due to 
code added to the hot path, or something introduced by random padding 
due to some other changes (possibly in a code that was not even executed 
during the test)?

I don't know how significant an impact this may have in this case, or how 
serious this is in general, but we're talking about 0.5-1% difference on 
a noisy benchmark. And if such cases of random padding really are a 
problem in practice, we've certainly missed plenty of other patches with 
the same impact.

Because effectively what Jeff's last patch did was add a single int64 
counter to the MemoryContextData structure, incrementing it for each 
allocated block (not chunk). I can't really imagine a solution making it 
cheaper, because there are very few faster operations. Even "opt-in" 
won't make this much faster, because you'll have to check a flag and 
you'll need two fields (flag + counter).

Of course, this assumes "local counter" (i.e. not updating counters the 
parent contexts), but I claim that's OK. I've been previously pushing 
for eagerly updating all the parent contexts, so that finding out the 
allocated memory is fast even when there are many child contexts - a 
prime example was array_agg() before I fixed it. But I changed my mind 
on this and now say "screw them". I claim that aggregates using a 
separate memory context for each group are a lost cause - they already 
suffer a significant overhead, and should be fixed just like we fixed 
array_agg().

That was effectively the outcome of pgcon discussions - to use the 
simple int64 counter, do the accounting for all contexts, and update 
only the local counter. For cases with many child contexts finding out 
the amount of allocated memory won't be cheap, but well - there's not 
much we can do about that.

>> I know there was a proposal to force all aggregates to use regular
>> data types as aggregate stats, but I can't see how that could work
>> without a significant performance penalty. For example array_agg()
>> is using internal to pass ArrayBuildState - how do you turn that to
>> a regular data type without effectively serializing/deserializing
>> the whole array on every transition?
>
> That is a good point. Tom suggested that his new expanded-format
> stuff might be able to be adapted to the purpose, but I have no idea
>  how possible that really is.

Me neither, and maybe there's a good solution for that, making my 
concerns unfounded.

>> What might be possible is extending the aggregate API with another
>> custom function returning size of the aggregate state. So when
>> defining an aggregate using 'internal' for aggregate state, you'd
>> specify transfunc, finalfunc and sizefunc. That seems doable, I
>> guess.
>
> And infunc and outfunc.  If we don't use the expanded-format stuff for
> aggregates with a type-internal transition state, we won't be able to
> use input and output functions to serialize and deserialize them,
> either.

Sure, that is indeed necessary for spilling the aggregates to disk. But 
for aggregates with fixed-size state that's not necessary (Jeff's 
HashAgg patch handles them fine), so I see this as a separate thing.

>
>> I find the memory accounting as a way more elegant solution, though.
>
> I think we're just going to have to agree to disagree on that.

Sure, it's certainly a matter of personal taste.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting v11

From
Tomas Vondra
Date:
Hi,

On 07/15/2015 07:07 PM, Jeff Davis wrote:
> On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:
>
>>
>> I think a heuristic would be better suited here; ignoring memory
>> consumption for internal types means that we are not making the
>> memory accounting useful for a set of use cases.
>>
> OK, I will drop this patch and proceed with the HashAgg patch, with
> a heuristic for internal types.

Could someone briefly explain what heuristics we are talking about?


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting v11

From
David Rowley
Date:

On 16 July 2015 at 05:07, Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

>
> I think a heuristic would be better suited here; ignoring memory
> consumption for internal types means that we are not making the memory
> accounting useful for a set of use cases.
>
OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.


Should we mark the patch as "returned with feedback" in the commitfest app then?
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Memory Accounting v11

From
Jeff Davis
Date:
On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote:

> Should we mark the patch as "returned with feedback" in the commitfest
> app then?

I believe the memory accounting patch has been rejected. Instead, the
work will be done in the HashAgg patch.

Thank you for the review!

Regards,
    Jeff Davis