Thread: Make MemoryContextMemAllocated() more precise

Make MemoryContextMemAllocated() more precise

From: Jeff Davis
AllocSet allocates memory for itself in blocks, which double in size up
to maxBlockSize. So, the current block (the last one malloc'd) may
represent half of the total memory allocated for the context itself.

The free space at the end of that block hasn't been touched at all, and
doesn't represent fragmentation or overhead. That means that the
"allocated" memory can be 2X the memory ever touched in the worst case.

Although that's technically correct, the purpose of
MemoryContextMemAllocated() is to give a "real" usage number so we know
when we're out of work_mem and need to spill (in particular, the disk-
based HashAgg work, but ideally other operators as well). This "real"
number should include fragmentation, freed-and-not-reused chunks, and
other overhead. But it should not include significant amounts of
allocated-but-never-touched memory, which says more about economizing
calls to malloc than it does about the algorithm's memory usage. 

Attached is a patch that makes mem_allocated a method (rather than a
field) of MemoryContext, and allows each memory context type to track
the memory in its own way. They all do the same thing as before
(increment/decrement a field), but AllocSet also subtracts out the free
space in the current block. For Slab and Generation, we could do
something similar, but it's not as much of a problem because there's no
doubling of the allocation size.
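
For AllocSet, the method amounts to something like this (a minimal
sketch of the idea, not the committed code; AllocSetMemAllocated and
the memAllocated field are assumed names, and set->blocks points to
the block currently being carved into chunks):

    static Size
    AllocSetMemAllocated(MemoryContext context)
    {
        AllocSetContext *set = (AllocSetContext *) context;
        Size        untouched = 0;

        /* the tail of the active block has never been touched */
        if (set->blocks != NULL)
            untouched = set->blocks->endptr - set->blocks->freeptr;

        return set->memAllocated - untouched;
    }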

Although I think this still matches the word "allocation" in spirit,
it's not technically correct, so feel free to suggest a new name for
MemoryContextMemAllocated().

Regards,
    Jeff Davis


Attachment

Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> Attached is a patch that makes mem_allocated a method (rather than a
> field) of MemoryContext, and allows each memory context type to track
> the memory in its own way. They all do the same thing as before
> (increment/decrement a field), but AllocSet also subtracts out the
> free
> space in the current block. For Slab and Generation, we could do
> something similar, but it's not as much of a problem because there's
> no
> doubling of the allocation size.

Committed.

In an off-list discussion, Andres suggested that MemoryContextStats
could be refactored to achieve this purpose, perhaps with flags to
avoid walking through the blocks and freelists when those are not
needed.

We discussed a few other names, such as "space", "active memory", and
"touched". We didn't settle on any great name, but "touched" seemed to
be the most descriptive.

This refactoring/renaming can be done later; right now I committed this
to unblock disk-based Hash Aggregation, which is ready.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Robert Haas
On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached is a patch that makes mem_allocated a method (rather than a
> field) of MemoryContext, and allows each memory context type to track
> the memory in its own way. They all do the same thing as before
> (increment/decrement a field), but AllocSet also subtracts out the free
> space in the current block. For Slab and Generation, we could do
> something similar, but it's not as much of a problem because there's no
> doubling of the allocation size.
>
> Although I think this still matches the word "allocation" in spirit,
> it's not technically correct, so feel free to suggest a new name for
> MemoryContextMemAllocated().

Procedurally, I think that it is highly inappropriate to submit a
patch two weeks after the start of the final CommitFest and then
commit it just over 48 hours later without a single endorsement of the
change from anyone.

Substantively, I think that whether or not this is an improvement depends
considerably on how your OS handles overcommit. I do not have enough
knowledge to know whether it will be better in general, but would
welcome opinions from others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make MemoryContextMemAllocated() more precise

From: Tomas Vondra
On Thu, Mar 19, 2020 at 11:44:05AM -0400, Robert Haas wrote:
>On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Attached is a patch that makes mem_allocated a method (rather than a
>> field) of MemoryContext, and allows each memory context type to track
>> the memory in its own way. They all do the same thing as before
>> (increment/decrement a field), but AllocSet also subtracts out the free
>> space in the current block. For Slab and Generation, we could do
>> something similar, but it's not as much of a problem because there's no
>> doubling of the allocation size.
>>
>> Although I think this still matches the word "allocation" in spirit,
>> it's not technically correct, so feel free to suggest a new name for
>> MemoryContextMemAllocated().
>
>Procedurally, I think that it is highly inappropriate to submit a
>patch two weeks after the start of the final CommitFest and then
>commit it just over 48 hours later without a single endorsement of the
>change from anyone.
>

True.

>Substantively, I think that whether or not this is an improvement depends
>considerably on how your OS handles overcommit. I do not have enough
>knowledge to know whether it will be better in general, but would
>welcome opinions from others.
>

Not sure overcommit is a major factor, and if it is then maybe it's the
strategy of doubling block size that's causing problems.

AFAICS the 2x allocation is the worst case, because it only happens
right after allocating a new block (of twice the size), when the
"utilization" drops from 100% to 50%. But in practice the utilization
will be somewhere in between, with an average of 75%. And we're not
doubling the block size indefinitely - there's an upper limit, so over
time the utilization drops less and less. So as the contexts grow, the
discrepancy disappears. And I'd argue the smaller the context, the less
of an issue the overcommit behavior is.

My understanding is that this is really just an accounting issue, where
allocating a block would get us over the limit, which I suppose might be
an issue with low work_mem values.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:
> AFAICS the 2x allocation is the worst case, because it only happens
> right after allocating a new block (of twice the size), when the
> "utilization" drops from 100% to 50%. But in practice the utilization
> will be somewhere in between, with an average of 75%.

Sort of. Hash Agg is constantly watching the memory, so it will
typically spill right at the point where the accounting for that memory
context is off by 2X. 

That's mitigated because the hash table itself (the array of
TupleHashEntryData) ends up allocated as its own block, so does not
have any waste. The total (table mem + out of line) might be close to
right if the hash table array itself is a large fraction of the data,
but I don't think that's what we want.

>  And we're not
> doubling the block size indefinitely - there's an upper limit, so
> over
> time the utilization drops less and less. So as the contexts grow,
> the
> discrepancy disappears. And I'd argue the smaller the context, the
> less
> of an issue the overcommit behavior is.

The problem is that the default work_mem is 4MB, and the doubling
behavior goes to 8MB, so it's a problem with default settings.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Robert Haas
On Thu, Mar 19, 2020 at 2:11 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> My understanding is that this is really just an accounting issue, where
> allocating a block would get us over the limit, which I suppose might be
> an issue with low work_mem values.

Well, the issue is, if I understand correctly, that this means that
MemoryContextStats() might now report a smaller amount of memory than
what we actually allocated from the operating system. That seems like
it might lead someone trying to figure out where a backend is leaking
memory to erroneous conclusions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Thu, 2020-03-19 at 11:44 -0400, Robert Haas wrote:
> Procedurally, I think that it is highly inappropriate to submit a
> patch two weeks after the start of the final CommitFest and then
> commit it just over 48 hours later without a single endorsement of
> the
> change from anyone.

Reverted.

Sorry, I misjudged this as a "supporting fix for a specific problem",
but it seems others feel it requires discussion.

> Substantively, I think that whether or not this is improvement
> depends
> considerably on how your OS handles overcommit. I do not have enough
> knowledge to know whether it will be better in general, but would
> welcome opinions from others.

I think omitting the tail of the current block is an unqualified
improvement for the purpose of obeying work_mem, regardless of the OS.
The block sizes keep doubling up to 8MB, and it doesn't make a lot of
sense to count that last large mostly-empty block against work_mem.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Robert Haas
On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I think omitting the tail of the current block is an unqualified
> improvement for the purpose of obeying work_mem, regardless of the OS.
> The block sizes keep doubling up to 8MB, and it doesn't make a lot of
> sense to count that last large mostly-empty block against work_mem.

Well, again, my point is that whether or not it counts depends on your
system's overcommit behavior. Depending on how you have that
configured, or what your OS likes to do, it may in reality count or
not count. Or so I believe. Am I wrong?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make MemoryContextMemAllocated() more precise

From: Tomas Vondra
On Thu, Mar 19, 2020 at 12:25:16PM -0700, Jeff Davis wrote:
>On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:
>> AFAICS the 2x allocation is the worst case, because it only happens
>> right after allocating a new block (of twice the size), when the
>> "utilization" drops from 100% to 50%. But in practice the utilization
>> will be somewhere in between, with an average of 75%.
>
>Sort of. Hash Agg is constantly watching the memory, so it will
>typically spill right at the point where the accounting for that memory
>context is off by 2X.
>
>That's mitigated because the hash table itself (the array of
>TupleHashEntryData) ends up allocated as its own block, so does not
>have any waste. The total (table mem + out of line) might be close to
>right if the hash table array itself is a large fraction of the data,
>but I don't think that's what we want.
>
>>  And we're not
>> doubling the block size indefinitely - there's an upper limit, so
>> over
>> time the utilization drops less and less. So as the contexts grow,
>> the
>> discrepancy disappears. And I'd argue the smaller the context, the
>> less
>> of an issue the overcommit behavior is.
>
>The problem is that the default work_mem is 4MB, and the doubling
>behavior goes to 8MB, so it's a problem with default settings.
>

Yes, it's an issue for the accuracy of our accounting. What Robert was
talking about is overcommit behavior at the OS level, which I'm arguing
is unlikely to be an issue, because for low work_mem values the absolute
difference is small, and on large work_mem values it's limited by the
block size limit.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:
> Well, the issue is, if I understand correctly, that this means that
> MemoryContextStats() might now report a smaller amount of memory than
> what we actually allocated from the operating system. That seems like
> it might lead someone trying to figure out where a backend is leaking
> memory to erroneous conclusions.

MemoryContextStats() was unaffected by my now-reverted change.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> Although that's technically correct, the purpose of
> MemoryContextMemAllocated() is to give a "real" usage number so we
> know
> when we're out of work_mem and need to spill (in particular, the
> disk-
> based HashAgg work, but ideally other operators as well). This "real"
> number should include fragmentation, freed-and-not-reused chunks, and
> other overhead. But it should not include significant amounts of
> allocated-but-never-touched memory, which says more about economizing
> calls to malloc than it does about the algorithm's memory usage. 

To expand on this point:

work_mem is to keep executor algorithms somewhat constrained in the
memory that they use. With that in mind, it should reflect things that
the algorithm has some control over, and can be measured cheaply.

Therefore, we shouldn't include huge nearly-empty blocks of memory that
the system decided to allocate in response to a request for a small
chunk (the algorithm has little control). Nor should we try to walk a
list of blocks or free chunks (expensive).

We should include used memory, reasonable overhead (chunk headers,
alignment, etc.), and probably free chunks (which represent
fragmentation).

For AllocSet, the notion of "all touched memory", which is everything
except the current block's tail, seems to fit the requirements well.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Robert Haas
On Thu, Mar 19, 2020 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:
> > Well, the issue is, if I understand correctly, that this means that
> > MemoryContextStats() might now report a smaller amount of memory than
> > what we actually allocated from the operating system. That seems like
> > it might lead someone trying to figure out where a backend is leaking
> > memory to erroneous conclusions.
>
> MemoryContextStats() was unaffected by my now-reverted change.

Oh. Well, that addresses my concern, then. If this only affects the
accounting for memory-bounded hash aggregation and nothing else is
going to be touched, including MemoryContextStats(), then it's not an
issue for me.

Other people may have different concerns, but that was the only thing
that was bothering me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Thu, 2020-03-19 at 15:33 -0400, Robert Haas wrote:
> On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I think omitting the tail of the current block is an unqualified
> > improvement for the purpose of obeying work_mem, regardless of the
> > OS.
> > The block sizes keep doubling up to 8MB, and it doesn't make a lot
> > of
> > sense to count that last large mostly-empty block against work_mem.
> 
> Well, again, my point is that whether or not it counts depends on
> your
> system's overcommit behavior. Depending on how you have that
> configured, or what your OS likes to do, it may in reality count or
> not count. Or so I believe. Am I wrong?

I don't believe it should be counted for the purposes of work_mem.

Let's say that the OS eagerly allocates it, then what is the algorithm
supposed to do in response? It can either:

1. just accept that all of the space is used, even though it's
potentially as low as 50% used, and spill earlier than may be
necessary; or

2. find a way to measure the free space, and somehow predict whether
that space will be reused the next time a group is added to the hash
table

It just doesn't seem reasonable to me for the algorithm to change its
behavior based on these large block allocations. It may be valuable
information for other purposes (like tuning your OS, or tracking down
memory issues), though.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Thu, 2020-03-19 at 16:04 -0400, Robert Haas wrote:
> Other people may have different concerns, but that was the only thing
> that was bothering me.

OK, thank you for raising it.

Perhaps we can re-fix the issue for HashAgg if necessary, or I can
tweak some accounting things within HashAgg itself, or both.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Wed, 2020-03-18 at 15:41 -0700, Jeff Davis wrote:
> In an off-list discussion, Andres suggested that MemoryContextStats
> could be refactored to achieve this purpose, perhaps with flags to
> avoid walking through the blocks and freelists when those are not
> needed.

Attached refactoring patch. There's enough in here that warrants
discussion that I don't think this makes sense for v13 and I'm adding
it to the July commitfest.

I still think we should do something for v13, such as the originally-
proposed patch[1]. It's not critical, but it simply reports a better
number for memory consumption. Currently, the memory usage appears to
jump, often right past work_mem (by a reasonable but noticeable amount),
which could be confusing.

Regarding the attached patch (target v14):

  * there's a new MemoryContextCount() that simply calculates the
    statistics without printing anything, and returns a struct
    - it supports flags to indicate which stats should be
      calculated, so that some callers can avoid walking through
      blocks/freelists
  * it adds a new statistic for "new space" (i.e. untouched)
  * it eliminates specialization of the memory context printing
    - the only specialization was for generation.c to output the
      number of chunks, which can be done easily enough for the
      other types, too
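
Put together, a caller that only needs the totals would do roughly the
following (a sketch against this patch; hash_mem_exceeded and the
spill check are illustrative, not code from nodeAgg.c):

    static bool
    hash_mem_exceeded(MemoryContext context)
    {
        MemoryContextCounters counters;
        int         flags = MCXT_STAT_TOTALSPACE | MCXT_STAT_NEWSPACE;

        /* walk this context and its children; skip freelist scans */
        counters = MemoryContextCount(context, flags, true);

        /* count touched memory only: total minus the never-used tail */
        return (counters.totalspace - counters.newspace) >
               (Size) work_mem * 1024;
    }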

Regards,
    Jeff Davis


[1] 
https://postgr.es/m/ec63d70b668818255486a83ffadc3aec492c1f57.camel%40j-davis.com

Attachment

Re: Make MemoryContextMemAllocated() more precise

From: Andres Freund
Hi,

On 2020-03-27 17:21:10 -0700, Jeff Davis wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

IDK, adding a commit to v13 that we know we should do architecturally
differently in v14, when the difference in complexity between the two
patches isn't actually *that* big...

I'd like to see others jump in here...


> I still think we should do something for v13, such as the originally-
> proposed patch[1]. It's not critical, but it simply reports a better
> number for memory consumption. Currently, the memory usage appears to
> jump, often right past work mem (by a reasonable but noticable amount),
> which could be confusing.

Is that really a significant issue for most work_mem sizes? Shouldn't
the way we increase sizes lead to the max difference between the
measurements being somewhat limited?


>   * there's a new MemoryContextCount() that simply calculates the
>     statistics without printing anything, and returns a struct
>     - it supports flags to indicate which stats should be
>       calculated, so that some callers can avoid walking through
>       blocks/freelists
>   * it adds a new statistic for "new space" (i.e. untouched)
>   * it eliminates specialization of the memory context printing
>     - the only specialization was for generation.c to output the
>       number of chunks, which can be done easily enough for the
>       other types, too

That sounds like a good direction.



> +    if (flags & MCXT_STAT_NBLOCKS)
> +        counters.nblocks = nblocks;
> +    if (flags & MCXT_STAT_NCHUNKS)
> +        counters.nchunks = set->nChunks;
> +    if (flags & MCXT_STAT_FREECHUNKS)
> +        counters.freechunks = freechunks;
> +    if (flags & MCXT_STAT_TOTALSPACE)
> +        counters.totalspace = set->memAllocated;
> +    if (flags & MCXT_STAT_FREESPACE)
> +        counters.freespace = freespace;
> +    if (flags & MCXT_STAT_NEWSPACE)
> +        counters.newspace = set->blocks->endptr - set->blocks->freeptr;

I'd spec it so that context implementations are allowed to
unconditionally fill fields, even when the flag isn't specified. The
branches quoted don't buy us anything...



> diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
> index c9f2bbcb367..cc545852968 100644
> --- a/src/include/nodes/memnodes.h
> +++ b/src/include/nodes/memnodes.h
> @@ -29,11 +29,21 @@
>  typedef struct MemoryContextCounters
>  {
>      Size        nblocks;        /* Total number of malloc blocks */
> +    Size        nchunks;        /* Total number of chunks (used+free) */
>      Size        freechunks;        /* Total number of free chunks */
>      Size        totalspace;        /* Total bytes requested from malloc */
>      Size        freespace;        /* The unused portion of totalspace */
> +    Size        newspace;        /* Allocated but never held any chunks */

I'd add some reasoning as to why this is useful.


>  } MemoryContextCounters;
>  
> +#define MCXT_STAT_NBLOCKS        (1 << 0)
> +#define MCXT_STAT_NCHUNKS        (1 << 1)
> +#define MCXT_STAT_FREECHUNKS    (1 << 2)
> +#define MCXT_STAT_TOTALSPACE    (1 << 3)
> +#define MCXT_STAT_FREESPACE        (1 << 4)
> +#define MCXT_STAT_NEWSPACE        (1 << 5)

s/MCXT_STAT/MCXT_STAT_NEED/?


> +#define MCXT_STAT_ALL            ((1 << 6) - 1)

Hm, why not go for ~0 or such?


Greetings,

Andres Freund



Re: Make MemoryContextMemAllocated() more precise

From: David Rowley
On Sat, 28 Mar 2020 at 13:21, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

I had a read over this too. I noted down the following during my pass of it.

1. The comment mentions "passthru", but you've removed that parameter.

  * For now, the passthru pointer just points to "int level"; later we might
  * make that more complicated.
  */
 static void
-MemoryContextStatsPrint(MemoryContext context, void *passthru,
+MemoryContextStatsPrint(MemoryContext context, int level,
  const char *stats_string)

2. I don't think MemoryContextCount is the best name for this
function. When I saw:

counters = MemoryContextCount(aggstate->hash_metacxt, flags, true);

i assumed it was returning some integer number, that is until I looked
at the "counters" datatype.

Maybe it would be better to name the function
MemoryContextGetTelemetry and the struct MemoryContextTelemetry rather
than MemoryContextCounters? Or maybe MemoryContextTally and call the
function either MemoryContextGetTelemetry() or
MemoryContextGetTally(). Or perhaps MemoryContextGetAccounting() and
the struct MemoryContextAccounting.

3. I feel like it would be nicer if you didn't change the "count"
methods to return a MemoryContextCounters. It means you may need to
zero a struct for each level, assign the values, then add them to the
total.  If you were just to zero the struct in MemoryContextCount()
then pass it into the count function, then you could just have it do
all the += work. It would reduce the code in MemoryContextCount() too.
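
Something like this (a sketch of the suggested shape; the method name
and pointer-taking signature are assumed, not from the patch):

    MemoryContextCounters total;

    memset(&total, 0, sizeof(total));   /* zeroed once, up front */
    context->methods->count(context, flags, &total);    /* each level += */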

4. Do you think it would be better to have two separate functions for
MemoryContextCount(), a recursive version and a non-recursive version.
I feel that the function should be so simple that it does not make a
great deal of sense to load it up to handle both cases.  Looking
around mcxt.c, I see MemoryContextResetOnly() and
MemoryContextResetChildren(), while that's not a perfect example, it
does seem like a better lead to follow.

5. For performance testing, I tried using the following table with 1MB
work_mem then again with 1GB work_mem.  I wondered if making the
accounting more complex would cause a slowdown in nodeAgg.c, as I see
we're calling this function each time we get a tuple that belongs in a
new group. The 1MB test is likely not such a great way to measure this
since we do spill to disk in that case and the change in accounting
means we do that at a slightly different time, but the 1GB test does
not spill and it's a bit slower.

create table t (a int);
insert into t select x from generate_Series(1,10000000) x;
analyze t;

-- Unpatched

set work_mem = '1MB';
explain analyze select a from t group by a; -- Ran 3 times.

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=481750.10..659875.95 rows=10000048 width=4) (actual time=3350.190..8193.400 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 32
   Peak Memory Usage: 1177 kB
   Disk Usage: 234920 kB
   HashAgg Batches: 1188
   ->  Seq Scan on t  (cost=0.00..144248.48 rows=10000048 width=4) (actual time=0.013..1013.755 rows=10000000 loops=1)
 Planning Time: 0.131 ms
 Execution Time: 8586.420 ms
 Execution Time: 8446.961 ms
 Execution Time: 8449.492 ms
(9 rows)

-- Patched

set work_mem = '1MB';
explain analyze select a from t group by a;

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=481748.00..659873.00 rows=10000000 width=4) (actual time=3470.107..8598.836 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 32
   Peak Memory Usage: 1033 kB
   Disk Usage: 234816 kB
   HashAgg Batches: 1056
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.017..1091.820 rows=10000000 loops=1)
 Planning Time: 0.285 ms
 Execution Time: 8996.824 ms
 Execution Time: 8781.624 ms
 Execution Time: 8900.324 ms
(9 rows)

-- Unpatched

set work_mem = '1GB';
explain analyze select a from t group by a;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169248.00..269248.00 rows=10000000 width=4) (actual time=4537.779..7033.318 rows=10000000 loops=1)
   Group Key: a
   Peak Memory Usage: 868369 kB
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.018..820.136 rows=10000000 loops=1)
 Planning Time: 0.054 ms
 Execution Time: 7561.063 ms
 Execution Time: 7573.985 ms
 Execution Time: 7572.058 ms
(6 rows)

-- Patched

set work_mem = '1GB';
explain analyze select a from t group by a;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169248.00..269248.00 rows=10000000 width=4) (actual time=4840.045..7359.970 rows=10000000 loops=1)
   Group Key: a
   Peak Memory Usage: 861975 kB
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.018..789.975 rows=10000000 loops=1)
 Planning Time: 0.055 ms
 Execution Time: 7904.069 ms
 Execution Time: 7913.692 ms
 Execution Time: 7927.061 ms
(6 rows)

Perhaps the slowdown is unrelated. If it's not, then maybe the
reduction in branching mentioned by Andres might help a bit, plus
maybe a bit more from what I mentioned above about passing in the
counter struct instead of returning it at each level.

David



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:
> 1. The comment mentions "passthru", but you've removed that
> parameter.

Fixed, thank you.

> 2. I don't think MemoryContextCount is the best name for this
> function. When I saw:

I've gone back and forth on naming a bit. The right name, in my
opinion, is MemoryContextStats(), but that's taken by something that
should be called MemoryContextReport(). But I didn't want to change
that as it would probably annoy a lot of people who are used to calling
MemoryContextStats() from gdb.

I changed the new function to be called MemoryContextGetCounters(),
which is more directly what it's doing. "Telemetry" makes me think more
of a stream of information rather than a particular point in time.

> 3. I feel like it would be nicer if you didn't change the "count"
> methods to return a MemoryContextCounters. It means you may need to
> zero a struct for each level, assign the values, then add them to the
> total.  If you were just to zero the struct in MemoryContextCount()
> then pass it into the count function, then you could just have it do
> all the += work. It would reduce the code in MemoryContextCount()
> too.

I changed it to use a pointer out parameter, but I don't think an
in/out parameter is quite right there. MemoryContextStats() ends up
using both the per-context counters as well as the totals, so it's not
helpful to return just the totals.
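
So the entry point now looks roughly like this (a sketch; the exact
signature is whatever the attached patch has):

    extern void MemoryContextGetCounters(MemoryContext context, int flags,
                                         bool recurse,
                                         MemoryContextCounters *totals);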

> 4. Do you think it would be better to have two separate functions for
> MemoryContextCount(), a recursive version and a non-recursive
> version.

I could, but right now the only caller passes recurse=true, so I'm not
really eliminating any code in that path by specializing the functions.
Are you thinking about performance or you just think it would be better
to have two entry points?

> 5. For performance testing, I tried using the following table with
> 1MB
> work_mem then again with 1GB work_mem.  I wondered if making the
> accounting more complex would cause a slowdown in nodeAgg.c, as I see
> we're calling this function each time we get a tuple that belongs in
> a
> new group. The 1MB test is likely not such a great way to measure
> this
> since we do spill to disk in that case and the changing in accounting
> means we do that at a slightly different time, but the 1GB test does
> not spill and it's a bit slower.

I think this was because I was previously returning a struct. By just
passing a pointer as an out param, it seems to have mitigated it, but
not completely eliminated the cost (< 2% in my tests). I was using an
OFFSET 100000000 instead of EXPLAIN ANALYZE in my test measurements
because it was less noisy, and I focused on the 1GB test for the reason
you mention.

I also addressed Andres's comments:

* changed the name of the flags from MCXT_STAT to MCXT_COUNT
* changed ((1<<6)-1) to ~0
* removed unnecessary branches from the GetCounters method
* expanded some comments

Regards,
    Jeff Davis


Attachment

Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Sun, 2020-04-05 at 16:48 -0700, Andres Freund wrote:
> > I still think we should do something for v13, such as the
> > originally-
> > proposed patch[1]. It's not critical, but it simply reports a
> > better
> > number for memory consumption. Currently, the memory usage appears
> > to
> jump, often right past work_mem (by a reasonable but noticeable
> > amount),
> > which could be confusing.
> 
> Is that really a significant issue for most work_mem sizes? Shouldn't
> the way we increase sizes lead to the max difference between the
> measurements being somewhat limited?

For work_mem less than 16MB, it's essentially spilling when the table
context is about 75% of what it could be (as bad as 50% and as good as
100% depending on a number of factors). That's not terrible, but it is
significant.

It also means that the reported memory peak jumps up rather than going
up gradually, so it ends up surpassing work_mem (e.g. 4MB of work_mem
might show a memory peak of 5MB). So it's a weird combination of under-
utilizing and over-reporting.

> I'd spec it so that context implementations are allowed to
> unconditionally fill fields, even when the flag isn't specified. The
> branches quoted don't buy us anyting...

Done.

> I'd add some reasoning as to why this is useful.

Done.

> s/MCXT_STAT/MCXT_STAT_NEED/?

I changed it to MCXT_COUNT_. MCXT_STAT_NEED seemed slightly verbose
to me.

> > +#define MCXT_STAT_ALL            ((1 << 6) - 1)
> 
> Hm, why not go for ~0 or such?

Done.

Regards,
    Jeff Davis





Re: Make MemoryContextMemAllocated() more precise

From
David Rowley
Date:
On Tue, 7 Apr 2020 at 14:26, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:
> > 4. Do you think it would be better to have two separate functions for
> > MemoryContextCount(), a recursive version and a non-recursive
> > version.
>
> I could, but right now the only caller passes recurse=true, so I'm not
> really eliminating any code in that path by specializing the functions.
> Are you thinking about performance or you just think it would be better
> to have two entry points?

I was thinking in terms of both performance, by eliminating a branch,
but mostly it was down to ease of code reading.

I thought MemoryContextGetCountersRecurse(&counters); was easier to
read than MemoryContextGetCounters(&counters, true); since I might
need to go see what "true" is for.

The non-recursive version, if we decide we need one, would likely just
be a one-line function calling the implementation's getcounter
method.

David



Re: Make MemoryContextMemAllocated() more precise

From: Jeff Davis
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> AllocSet allocates memory for itself in blocks, which double in size
> up
> to maxBlockSize. So, the current block (the last one malloc'd) may
> represent half of the total memory allocated for the context itself.

Narrower approach that doesn't touch memory context internals:

If the blocks double up in size to maxBlockSize, why not just create
the memory context with a smaller maxBlockSize? I had originally
dismissed this as a hack that could slow down some workloads when
work_mem is large.

But we can simply make it proportional to work_mem, which makes a lot
of sense for an operator like HashAgg that controls its memory usage.
It can allocate in blocks large enough that we don't call malloc() too
often when work_mem is large; but small enough that we don't overrun
work_mem when work_mem is small.

I have attached a patch to do this only for HashAgg, using a new entry
point in execUtils.c called CreateWorkExprContext(). It sets
maxBlockSize to 1/16th of work_mem (rounded down to a power of two),
with a minimum of initBlockSize.
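
In code, that is roughly the following (a sketch of the described
behavior, not the patch itself; work_mem is in kB, and the ALLOCSET_*
constants come from memutils.h):

    MemoryContext
    CreateWorkExprContext(MemoryContext parent)
    {
        Size        maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

        /* cap at 1/16th of work_mem, rounded down to a power of two */
        while (16 * maxBlockSize > (Size) work_mem * 1024)
            maxBlockSize >>= 1;

        /* but don't go below the initial block size */
        if (maxBlockSize < ALLOCSET_DEFAULT_INITSIZE)
            maxBlockSize = ALLOCSET_DEFAULT_INITSIZE;

        return AllocSetContextCreate(parent, "WorkExprContext",
                                     ALLOCSET_DEFAULT_MINSIZE,
                                     ALLOCSET_DEFAULT_INITSIZE,
                                     maxBlockSize);
    }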

This could be a good general solution for other operators as well, but
that requires a bit more investigation, so I'll leave that for v14.

The attached patch is narrow and solves the problem for HashAgg nicely
without interfering with anything else, so I plan to commit it soon for
v13.

Regards,
    Jeff Davis


Attachment

Re: Make MemoryContextMemAllocated() more precise

From: Daniel Gustafsson
> On 8 Apr 2020, at 02:21, Jeff Davis <pgsql@j-davis.com> wrote:

> The attached patch is narrow and solves the problem for HashAgg nicely
> without interfering with anything else, so I plan to commit it soon for
> v13.

If I read this thread correctly, there is nothing outstanding here for v14 from
this patch?  I've marked the entry committed in the July commitfest; feel free
to change that in case I misunderstood.

cheers ./daniel