Thread: 9.5: Better memory accounting, towards memory-bounded HashAgg

9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
Attached is a patch that explicitly tracks allocated memory (the blocks,
not the chunks) for each memory context, as well as its children.

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
the hash table, which is a good estimate for Hashjoin. But I don't think
it's as easy for Hashagg, for which we need to track transition values,
etc. (also, for HashAgg, I expect that the overhead will be more
significant than for Hashjoin). If we track the space used by the memory
contexts directly, it's easier and more accurate.
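
To illustrate the idea (a simplified sketch with made-up details, not the
code from the attached patch), the update performed on each block
malloc/free could look roughly like this, assuming a total_allocated
counter added to MemoryContextData, so that each context's figure already
includes its children:

    static void
    update_allocation(MemoryContext context, int64 delta)
    {
        MemoryContext c;

        /* charge this context and every ancestor; delta < 0 on free */
        for (c = context; c != NULL; c = c->parent)
            c->total_allocated += delta;
    }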

I did some simple pgbench select-only tests, and I didn't see any TPS
difference.

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Peter Geoghegan
Date:
On Sat, Aug 2, 2014 at 1:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> This is a prerequisite for memory-bounded HashAgg, which I intend to
> submit for the next CF.

FWIW, I think that's a good project. A large number of these TPC-H
queries used HashAggs when I checked, on a moderate sized sample TPC-H
database:

http://examples.citusdata.com/tpch_queries.html

(I found these queries through random Googling, but happened to have a
~2GB TPC-H database on my laptop). I attach EXPLAIN ANALYZE output for
each, as shown on master. From this admittedly unscientific random
sampling, 5 out of 8 query plans have a hash aggregate node. TPC-H is
a benchmark that Postgres does not tend to do too well on [1], and I
suspect that this has something to do with it; lower work_mem settings
will spook the optimizer into using a group aggregate within
choose_hashed_grouping(). Of course, in order to get the benefit of
your patch, that will need to be adjusted. I think that part is
surprisingly straightforward, though.

[1] https://wiki.postgresql.org/wiki/TPC-H
--
Peter Geoghegan

Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Robert Haas
Date:
On Sat, Aug 2, 2014 at 4:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached is a patch that explicitly tracks allocated memory (the blocks,
> not the chunks) for each memory context, as well as its children.
>
> This is a prerequisite for memory-bounded HashAgg, which I intend to
> submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
> the hash table, which is a good estimate for Hashjoin. But I don't think
> it's as easy for Hashagg, for which we need to track transition values,
> etc. (also, for HashAgg, I expect that the overhead will be more
> significant than for Hashjoin). If we track the space used by the memory
> contexts directly, it's easier and more accurate.
>
> I did some simple pgbench select-only tests, and I didn't see any TPS
> difference.

I was curious whether a performance difference would show up when
sorting, so I tried it out.  I set up a test with pgbench -i -s 300.  I
then repeatedly restarted the database, and after each restart, did
this:

time psql -c 'set trace_sort=on; reindex index pgbench_accounts_pkey;'

I alternated runs between master and master with this patch, and got
the following results:

master:
LOG:  internal sort ended, 1723933 KB used: CPU 2.58s/11.54u sec
elapsed 16.88 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.50s/12.37u sec
elapsed 17.60 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.14s/11.28u sec
elapsed 16.11 sec

memory-accounting:
LOG:  internal sort ended, 1723933 KB used: CPU 2.57s/11.97u sec
elapsed 17.39 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.30s/12.57u sec
elapsed 17.68 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.54s/11.99u sec
elapsed 17.25 sec

Comparing the median times, that's about a 3% regression.  For this
particular case, we might be able to recapture that by replacing the
bespoke memory-tracking logic in tuplesort.c with use of this new
facility.  I'm not sure whether there are other cases that we might
also want to test; I think stuff that runs all on the server side is
likely to show up problems more clearly than pgbench.  Maybe a
PL/pgsql loop that does something allocation-intensive on each
iteration, for example, like parsing a big JSON document.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 2.8.2014 22:40, Jeff Davis wrote:
> Attached is a patch that explicitly tracks allocated memory (the blocks,
> not the chunks) for each memory context, as well as its children.
> 
> This is a prerequisite for memory-bounded HashAgg, which I intend to

Anyway, I'm really looking forward to the memory-bounded hashagg, and
I'm willing to spend some time testing it.

> submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
> the hash table, which is a good estimate for Hashjoin. But I don't think

Actually, even for HashJoin the estimate is pretty bad, and varies a lot
depending on the tuple width. With "narrow" tuples (e.g. ~40B of data),
which is actually quite a common case, you easily get ~100% palloc overhead.

I managed to address that by using a custom allocator. See this:
  https://commitfest.postgresql.org/action/patch_view?id=1503

I wonder whether something like that would be possible for the hashagg?

That would make the memory accounting accurate with 0% overhead (because
it's not messing with the memory context at all), but only for that one
node (maybe that's OK?).

> it's as easy for Hashagg, for which we need to track transition values,
> etc. (also, for HashAgg, I expect that the overhead will be more
> significant than for Hashjoin). If we track the space used by the memory
> contexts directly, it's easier and more accurate.

I don't think that's comparable - I can easily think about cases leading
to extreme palloc overhead with HashAgg (think of an aggregate
implementing COUNT DISTINCT - that effectively needs to store all the
values, and with short values the palloc overhead will be terrible).

Actually, I was looking at HashAgg (a somewhat natural direction
after messing with Hash Join), and my plan was to use a similar dense
allocation approach. The trickiest part would probably be making this
available from the custom aggregates.

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote:
> Comparing the median times, that's about a 3% regression.  For this
> particular case, we might be able to recapture that by replacing the
> bespoke memory-tracking logic in tuplesort.c with use of this new
> facility.  I'm not sure whether there are other cases that we might
> also want to test; I think stuff that runs all on the server side is
> likely to show up problems more clearly than pgbench.  Maybe a
> PL/pgsql loop that does something allocation-intensive on each
> iteration, for example, like parsing a big JSON document.

I wasn't able to reproduce your results on my machine. At -s 300, with
maintenance_work_mem set high enough to do internal sort, it took about
40s and I heard some disk activity, so I didn't think it was a valid
result. I went down to -s 150, and it took around 5.3s on both master
and memory-accounting.

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.
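
A sketch of the inheritance rule only (parameter and field names are
illustrative; the attached patch is authoritative):

    /* at context creation: track if asked to, or if the parent tracks */
    set->header.track_mem = track_mem_requested ||
        (parent != NULL && parent->track_mem);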

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
> Either way, it's better to be conservative. Attached is a version of the
> patch with opt-in memory usage tracking. Child contexts inherit the
> setting from their parent.

There was a problem with the previous patch: the parent was unlinked
before the delete_context method was called, which meant that the
parent's memory was not being decremented.

Attached is a fix. It would be simpler to just reorder the operations in
MemoryContextDelete, but there is a comment warning against that. So, I
pass the old parent as an argument to the delete_context method.
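
In other words, the delete path now looks roughly like this (a sketch;
the attached patch is authoritative):

    static void
    AllocSetDelete(MemoryContext context, MemoryContext parent)
    {
        /*
         * 'parent' is the context's former parent, passed in explicitly
         * because MemoryContextDelete has already unlinked us from it.
         */
        while (parent != NULL)
        {
            parent->total_allocated -= context->total_allocated;
            parent = parent->parent;
        }

        /* ... then release the context's blocks as before ... */
    }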

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Robert Haas
Date:
On Fri, Aug 8, 2014 at 4:16 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I wasn't able to reproduce your results on my machine. At -s 300, with
> maintenance_work_mem set high enough to do internal sort, it took about
> 40s and I heard some disk activity, so I didn't think it was a valid
> result. I went down to -s 150, and it took around 5.3s on both master
> and memory-accounting.
>
> Either way, it's better to be conservative. Attached is a version of the
> patch with opt-in memory usage tracking. Child contexts inherit the
> setting from their parent.

I repeated my tests with your v3 patch.  Here are the new results:

master, as of commit a4287a689d10bd4863e3dfbf9c4f232feeca0cdd:
2014-08-14 16:45:24 UTC [2940] LOG:  internal sort ended, 1723933 KB
used: CPU 2.44s/11.52u sec elapsed 16.68 sec
2014-08-14 16:46:34 UTC [2960] LOG:  internal sort ended, 1723933 KB
used: CPU 2.63s/11.65u sec elapsed 16.94 sec
2014-08-14 16:47:26 UTC [2979] LOG:  internal sort ended, 1723933 KB
used: CPU 2.63s/11.48u sec elapsed 16.85 sec

memory-accounting-v3, on top of the aforementioned master commit:
2014-08-14 16:46:05 UTC [2950] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/12.16u sec elapsed 17.36 sec
2014-08-14 16:47:00 UTC [2969] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/11.90u sec elapsed 17.11 sec
2014-08-14 16:47:51 UTC [2988] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/11.98u sec elapsed 17.31 sec

It appears to me that the performance characteristics for this version
are not significantly different from version 1.  I have not looked at
the code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 10.8.2014 22:50, Jeff Davis wrote:
> On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
>> Either way, it's better to be conservative. Attached is a version
>> of the patch with opt-in memory usage tracking. Child contexts
>> inherit the setting from their parent.
> 
> There was a problem with the previous patch: the parent was unlinked 
> before the delete_context method was called, which meant that the 
> parent's memory was not being decremented.
> 
> Attached is a fix. It would be simpler to just reorder the operations
> in MemoryContextDelete, but there is a comment warning against that.
> So, I pass the old parent as an argument to the delete_context
> method.

Hi,

I did a quick check of the patch, mostly because I wasn't sure whether
it allows accounting for a selected subtree only (e.g. aggcontext and
its children). And yes, it does exactly that, which is great.

Also, the code seems fine to me, except maybe for this piece in
AllocSetDelete:
   /*
    * Parent is already unlinked from context, so can't use
    * update_allocation().
    */
   while (parent != NULL)
   {
       parent->total_allocated -= context->total_allocated;
       parent = parent->parent;
   }

I believe this should check parent->track_mem, just like
update_allocation, because as written it walks all the parent contexts up
to TopMemoryContext.

It does not really break anything (because parents without tracking
enabled don't report allocated space), but for plans that create and
destroy a lot of contexts (say, GroupAggregate with a lot of groups)
this might add unnecessary overhead.
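
I.e. something like this (just a sketch; whether the loop could also break
out early depends on how strictly the inheritance rule is enforced):

    while (parent != NULL)
    {
        if (parent->track_mem)
            parent->total_allocated -= context->total_allocated;
        parent = parent->parent;
    }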

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> I believe this should check parent->track_mem, just like
> update_allocation, because this way it walks all the parent context up
> to the TopMemoryContext.

TBH, I don't think this "opt-in tracking" business has been thought
through even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared
to track?  And it certainly has not been proven that the extra
complexity actually saves cycles rather than the reverse.
        regards, tom lane



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 16.8.2014 20:00, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I believe this should check parent->track_mem, just like
>> update_allocation, because this way it walks all the parent context up
>> to the TopMemoryContext.
> 
> TBH, I don't think this "opt-in tracking" business has been thought 
> through even a little bit; for example, what happens if there is a 
> declared-to-not-track context in between levels that are declared to
> track? And it certainly has not been proven that the extra complexity
> actually saves cycles rather than the reverse.


AFAIK there's no other built-in accounting mechanism at the moment.
That's one of the reasons why some of the nodes had to invent their own
custom accounting - hashjoin is an example of that.

We do have MemoryContextStats, which essentially walks the memory
context hierarchy and collects all the important information. While
working on the hashjoin patch, I was thinking that maybe we could create
a MemoryContextStats clone, computing the accounting info on request.

So I did an experiment, writing this new MemoryContextStats clone and
rewriting the nodeHash.c to use this instead of custom tracking. The
impact was consistently ~2-3%, for the hashjoin queries I measured (e.g.
9100 ms instead of 8900 ms etc.).

I just tried the same with the hashjoin patch applied. With the built-in
accounting the queries now run ~2.5x faster - around 3500ms. With the
MemoryContextStats clone, it takes a whopping 18000 ms.

I don't know what causes such an increase (I assume I'm using the
accounting info within a loop or something). The point is that with Jeff's
patch, it goes down to 3500 ms, with pretty much zero overhead.

I'll check why the patched hashjoin is so much slower, but even if I can
optimize it somehow I don't think I'll get it to 0 overhead. So I'd say
this accounting actually saves the cycles.


Of course, that does not prove it's a well thought-out design,
although it seems sensible to me. It's the least complex accounting
implementation I can think of.

For example, I don't think being able to create nonsensical hierarchies
is a major flaw. We certainly should not do that in our code, and we
should take reasonable precautions to prevent this in user code (as
for example inheriting the tracking flag in MemoryContextCreate, as Jeff
does). Can this be seen as a foot gun? Maybe, but we're handing
ammunition to users on a daily basis - for example they can switch to
TopMemoryContext, but that does not make memory contexts bad.

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect this. Because my observation is that most of the
time is actually spent in walking through blocks and freelists.

Then the issues with nonsensical hierarchies would disappear, and the
possible overhead of updating all the parent contexts would also go away
(not sure if it's even measurable, though).

regards
Tomas






Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote:
> It appears to me that the performance characteristics for this version
> are not significantly different from version 1.  I have not looked at
> the code.

While trying to reproduce your results, I noticed what might be around a
1% regression just from adding the 3 fields to MemoryContextData. If I
cut it down to adding just one field, the regression disappears.

The results are fairly noisy, so I could be chasing the wrong thing. But
one reason to believe it is that I pushed the size of MemoryContextData
above 64 bytes, which sounds like it might be an important threshold.

Regards,
    Jeff Davis





Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:
> But maybe the inheritance really is not necessary - maybe it would be
> enough to track this per-context, and then just walk through the
> contexts and collect this. Because my observation is that most of the
> time is actually spent in walking through blocks and freelists.

That makes a lot of sense to me.

Another approach is to pass a flag to hash_create that tells it not to
create a subcontext. Then we don't need to traverse anything; we already
know which context we want to look at. Perhaps I was being too clever
with the idea of tracking space for an entire hierarchy.

Also, as I pointed out in my reply to Robert, adding too many fields to
MemoryContextData may be the cause of the regression. Your idea requires
only one field, which doesn't show the same regression in my tests.

Regards,
    Jeff Davis





Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
"Tomas Vondra"
Date:
On 19 Srpen 2014, 10:26, Jeff Davis wrote:
> On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:
>> But maybe the inheritance really is not necessary - maybe it would be
>> enough to track this per-context, and then just walk through the
>> contexts and collect this. Because my observation is that most of the
>> time is actually spent in walking through blocks and freelists.
>
> That makes a lot of sense to me.
>
> Another approach is to pass a flag to hash_create that tells it not to
> create a subcontext. Then we don't need to traverse anything; we already
> know which context we want to look at. Perhaps I was being too clever
> with the idea of tracking space for an entire hierarchy.
>
> Also, as I pointed out in my reply to Robert, adding too many fields to
> MemoryContextData may be the cause of the regression. Your idea requires
> only one field, which doesn't show the same regression in my tests.

Yeah, keeping the structure size below 64B seems like a good idea.

The use-case for this is tracking a chosen subtree of contexts - e.g.
aggcontext and below, so I'd expect the tracked subtrees to be relatively
shallow. Am I right?

My fear is that by removing the inheritance bit, we'll hurt cases with a
lot of child contexts. For example, array_agg currently creates a separate
context for each group - what happens if you have 100k groups and do
MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Wouldn't the solution with inheritance, propagating the accounting info
to the parent, actually be better? Or maybe allow both, having two flags
when creating a context instead of one?
    AllocSetCreateTracked(...., bool track, bool propagate_immediately)

By squashing both flags into a single mask you wouldn't increase the size.
Also, do we really need to track allocated bytes - couldn't we track
kilobytes or something and use smaller data types to get below the 64B?

regards
Tomas




Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
> The use-case for this is tracking a chosen subtree of contexts - e.g.
> aggcontext and below, so I'd expect the tracked subtrees to be relatively
> shallow. Am I right?

Right.

> My fear is that by removing the inheritance bit, we'll hurt cases with a
> lot of child contexts. For example, array_agg currently creates a separate
> context for each group - what happens if you have 100k groups and do
> MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Good point. We don't want to make checking the allocated space into an
expensive operation, because then we will be forced to call it less
frequently, which implies that we'd be sloppier about actually fitting
in work_mem.

> Wouldn't the solution with inheritance and propagating the accounting info
> to the parent actually better? Or maybe allowing both, having two flags
> when creating a context instead of one?

That seems overly complicated. We only have one driving use case, so two
orthogonal options sound excessive. Perhaps one option, if we can't
solve the performance problem and we need to isolate the changes to
hashagg.

> Also, do we really need to track allocated bytes - couldn't we track
> kilobytes or something and use smaller data types to get below the 64B?

Good idea.

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.
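
Roughly, the idea is to account in coarser units so the counters fit in
32 bits. A sketch (the field names are illustrative, and it assumes block
sizes are whole kilobytes; the attached patch is authoritative):

    uint32      total_allocated_kb;     /* this context plus children, in kB */
    uint32      self_allocated_kb;      /* this context only, in kB */

    /* on block allocation (relies on blksize being a multiple of 1 kB) */
    context->self_allocated_kb += blksize / 1024;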

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Robert Haas
Date:
On Wed, Aug 20, 2014 at 2:11 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I attached a patch that uses two uint32 fields so that it doesn't
> increase the size of MemoryContextData, and it tracks memory usage for
> all contexts. I was unable to detect any performance regression vs.
> master, but on my machine the results are noisy.

Still doesn't look good here.  On the same PPC64 machine I've been
using for testing:

master:
2014-08-22 16:26:42 UTC [13153] LOG:  internal sort ended, 1723933 KB
used: CPU 2.19s/11.41u sec elapsed 16.40 sec
2014-08-22 16:28:18 UTC [13173] LOG:  internal sort ended, 1723933 KB
used: CPU 2.56s/11.37u sec elapsed 16.64 sec
2014-08-22 16:29:37 UTC [13194] LOG:  internal sort ended, 1723933 KB
used: CPU 2.57s/11.48u sec elapsed 16.75 sec

memory-accounting-v4:
2014-08-22 16:27:35 UTC [13163] LOG:  internal sort ended, 1723933 KB
used: CPU 2.72s/11.91u sec elapsed 17.43 sec
2014-08-22 16:29:10 UTC [13185] LOG:  internal sort ended, 1723933 KB
used: CPU 2.67s/12.03u sec elapsed 17.40 sec
2014-08-22 16:30:01 UTC [13203] LOG:  internal sort ended, 1723933 KB
used: CPU 2.65s/11.97u sec elapsed 17.45 sec

That's a 4.7% regression on the median results, worse than before.

> It would be easier to resolve the performance concern if I could
> reliably get the results Robert is getting. I think I was able to
> reproduce the regression with the old patch, but the results were still
> noisy.

If you send me an SSH key by private email I can set you an account up
on this machine, if that's helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 20.8.2014 08:11, Jeff Davis wrote:
> On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
>> The use-case for this is tracking a chosen subtree of contexts - e.g.
>> aggcontext and below, so I'd expect the tracked subtrees to be relatively
>> shallow. Am I right?
> 
> Right.
> 
>> My fear is that by removing the inheritance bit, we'll hurt cases with a
>> lot of child contexts. For example, array_agg currently creates a separate
>> context for each group - what happens if you have 100k groups and do
>> MemoryContextGetAllocated? I guess iterating over 100k groups is not free.
> 
> Good point. We don't want to make checking the allocated space into an
> expensive operation, because then we will be forced to call it less
> frequently, which implies that we'd be sloppier about actually fitting
> in work_mem.
> 
>> Wouldn't the solution with inheritance and propagating the accounting info
>> to the parent actually better? Or maybe allowing both, having two flags
>> when creating a context instead of one?
> 
> That seems overly complicated. We only have one driving use case, so two
> orthogonal options sounds excessive. Perhaps one option if we can't
> solve the performance problem and we need to isolate the changes to
> hashagg.
> 
>> Also, do we really need to track allocated bytes - couldn't we track
>> kilobytes or something and use smaller data types to get below the 64B?
> 
> Good idea.
> 
> I attached a patch that uses two uint32 fields so that it doesn't
> increase the size of MemoryContextData, and it tracks memory usage for
> all contexts. I was unable to detect any performance regression vs.
> master, but on my machine the results are noisy.

I don't think we really need to abandon the 'tracked' flag (or that we
should). I think it was useful, and removing it might be one of the
reasons why Robert now sees worse impact than before.

I assume you did that to get below 64B, right? What about changing
'isReset' in MemoryContextData to a 'flags' field instead? It's a bool
now, i.e. 1B -> 8 bits.

I don't think it makes sense to abandon this only to get <= 64B, if it
forces you to touch multiple 64B structures unnecessarily.

Also, you're doing this for every block in AllocSetReset:
   update_allocation(context, -(block->endptr - ((char*) block)));

So if there are 1000 blocks, you'll call that 1000x (and it will
propagate up to TopMemoryContext). Instead, keep a local sum and only do
the call once at the end. Again, this might be one of the reasons why
Robert sees ~4% overhead.

BTW what happens when an AllocSetContext is created with an initBlockSize
that is not a multiple of 1kB? I see it's enforced to be at least 1024, but
I don't see anything forcing it to be a multiple of 1024.

What if I create a context with initBlockSize=1500? ISTM it'll break the
accounting, right? (Not that it would make sense to create such
contexts, but it's possible.)

Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Robert Haas
Date:
On Fri, Aug 22, 2014 at 2:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> I don't think we really need to abandon the 'tracked' flag (or that we
> should). I think it was useful, and removing it might be one of the
> reasons why Robert now sees worse impact than before.

The version that introduced that flag had the same overhead as the
previous version that didn't, at least in my testing, so I'm not sure
it's at all useful.  Part of the problem here is that the number of
instructions that you can afford to take to complete a memory
allocation is just really small, and so every one hurts.  Memory
latency may be an issue as well, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 20.8.2014 08:11, Jeff Davis wrote:
> On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
>
> It would be easier to resolve the performance concern if I could
> reliably get the results Robert is getting. I think I was able to
> reproduce the regression with the old patch, but the results were still
> noisy.

I created a small extension for this purpose, it's available here:

   https://github.com/tvondra/palloc_bench

In short, it creates a chain of memory contexts, and then repeats
palloc/pfree a given number of times (with a chosen request size).

It either calls AllocSetContextCreate or AllocSetContextCreateTracked,
depending on whether there's a

#define TRACKING_FLAG

so either leave it there or comment it out. The time is printed as a
warning.

I did a number of tests to get an idea of the overhead, using this call

  select palloc_bench(10,100000000,32768);

which means 10 memory contexts, 100000000 palloc/free cycles with 32kB
requests.
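
For reference, the guts of the benchmark are roughly this (a simplified
sketch, not the exact code in the repository; nlevels, ncycles and
request_size stand for the three function arguments):

    MemoryContext ctx = CurrentMemoryContext;
    int         i;

    /* build a chain of nested memory contexts */
    for (i = 0; i < nlevels; i++)
        ctx = AllocSetContextCreate(ctx, "palloc bench",
                                    ALLOCSET_DEFAULT_MINSIZE,
                                    ALLOCSET_DEFAULT_INITSIZE,
                                    ALLOCSET_DEFAULT_MAXSIZE);

    /* hammer the allocator in the innermost context */
    for (i = 0; i < ncycles; i++)
        pfree(MemoryContextAlloc(ctx, request_size));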

And I got these results:

master
-------
3246.03 ms
3125.24 ms
3247.86 ms
3251.85 ms
3113.03 ms
3140.35 ms

v3 patch (AllocSetContextCreate => no tracking)
3303.64 ms
3278.57 ms
3295.11 ms
3325.63 ms
3329.84 ms
3439.27 ms

-- v3 patch (AllocSetContextCreateTracked => tracking)
6296.43 ms
5228.83 ms
5271.43 ms
5158.60 ms
5404.57 ms
5240.40 ms

-- v4 (tracking all the time)
6728.84 ms
6478.88 ms
6478.82 ms
6473.57 ms
6554.96 ms
6528.66 ms

I think this makes the overhead clearly visible. It's also worth mentioning
that this does nothing else except palloc/free, which is not really
what a real workload does. And 100000000 palloc/free cycles of 32kB blocks
means ~3TB of RAM, unless my math is broken.

So looking at the numbers and concluding "7 seconds >> 3 seconds, all is
lost" is not really appropriate IMHO.

Anyway, ISTM that v4 is actually a bit more expensive than v3 for some
reason. I'm not entirely sure why, but I suspect it's because of
updating the few additional memory contexts up to TopMemoryContext.
That's something v3 didn't do.

I tried to hack a bit on the idea of using a single byte for the flags
(isReset and track_mem) - patch attached. That got me pretty much to v3
performance (or maybe slightly better):

-- v4 + flag (tracking all the time)
5222.38 ms
4958.37 ms
5072.21 ms
5100.43 ms
5059.65 ms
4995.52 ms

But nothing that'd magically save the day ... and with tracking disabled
we get pretty much to the v3 numbers (with track_mem=false). So this
gymnastics gained us pretty much nothing ...

But I have realized that maybe the problem is that we're handling memory
contexts and accounting as 1:1. But that's not really the case - most of
the contexts are not really interested in this. They don't use accounting
now, and that won't change. So only a small fraction of memory contexts
will ask for accounting. Yet all the contexts are forced to pass accounting
info from their children to their parent, if there happens to be a
context with track_mem=true somewhere above them.

And that's exactly the problem, because most of the time is spent in
update_accounting, in the loop over parents.

So my proposal is to separate those two things into two hierarchies,
that are more or less parallel, but not exactly.

That means:

(1) creating a structure with the accounting info

    typedef struct MemoryAccountingData {

        Size  total_allocated;
        Size  self_allocated;

        struct MemoryAccountingData * parent;

    } MemoryAccountingData;

(2) adding a pointer to MemoryAccountingData to MemoryContextData, and
    a 'track_mem' flag for contexts that requested tracking

    typedef struct MemoryContextData
    {
        ...
        MemoryAccounting  accounting;
        ...
    } MemoryContextData;

(3) when a context does not request accounting, it just uses the
    accounting pointer from the parent context, and track_mem=false

(4) when the context requests accounting, it allocates its own
    accounting structure, sets accounting->parent to the accounting
    from the parent, and sets track_mem=true

Now all the contexts have a direct pointer to the accounting of the
nearest parent context that explicitly requested accounting, and don't
need to walk through all the parents.

Contexts that did not request tracking have track_mem=false, and their
accounting points to the parent with explicit accounting, or is NULL if
there's no such parent. For these contexts, GetAllocated always returns
0, but that's OK because they haven't requested accounting anyway.

Contexts that requested tracking have track_mem=true, and have
their own specific accounting instance. The accounting->parent plays
pretty much the same role as 'accounting' with 'track_mem=false' (see
previous paragraph). These contexts return GetAllocated properly.
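
The update path then only walks the explicitly tracked levels, roughly
like this (a sketch; MemoryAccounting is assumed to be a pointer typedef
for the struct above):

    static void
    update_accounting(MemoryAccounting accounting, int64 delta)
    {
        if (accounting == NULL)     /* no tracked context above us at all */
            return;

        /* self_allocated changes only on the nearest tracked node ... */
        accounting->self_allocated += delta;

        /* ... while totals propagate up the (short) accounting chain */
        for (; accounting != NULL; accounting = accounting->parent)
            accounting->total_allocated += delta;
    }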

Now, I did a quick test with palloc_bench - 1 context with tracking
enabled, and a chain of 10 contexts without tracking (but updating the
accounting for the first context).

And I see this - with tracking enabled:

3235.57 ms
3240.09 ms
3225.47 ms
3306.95 ms
3249.14 ms
3225.56 ms

and with tracking disabled:

3193.43 ms
3169.57 ms
3156.48 ms
3147.12 ms
3142.25 ms
3161.91 ms
3149.97 ms

Which is quite good, IMHO. Disabled is pretty much exactly the same as
master (i.e. no accounting at all), enabled is about equal to v3 with
tracking disabled.

But maybe I did something stupid in those patches, it's 3AM here ...

Patch attached, consider it an early alpha version.

regards
Tomas

Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote:
> Still doesn't look good here.  On the same PPC64 machine I've been
> using for testing:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.
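
Roughly (a sketch with placeholder names, not the final patch):

    typedef void (*MemoryAllocCallback) (void *arg, int64 delta);

    static void
    add_to_counter(void *arg, int64 delta)
    {
        *((int64 *) arg) += delta;  /* one shared counter for a whole subtree */
    }

    /* in the block alloc/free path, roughly: */
    if (context->alloc_cb != NULL)
        context->alloc_cb(context->alloc_cb_arg, delta);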

I think I've made some progress (thanks to both Robert and Tomas), but I
have a bit more micro-optimization and testing to do. I'll mark it
"returned with feedback" for now, though if I find the time I'll do more
testing to see if the performance concerns are fully addressed.

Regards,
    Jeff Davis





Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> I have a new approach to the patch which is to call a callback at each
> block allocation and child contexts inherit the callback from their
> parents. The callback could be defined to simply dereference and
> increment its argument, which would mean block allocations anywhere in
> the hierarchy are incrementing the same variable, avoiding the need to
> traverse the hierarchy.

Cute, but it's impossible to believe that a function call isn't going
to be *even more* overhead than what you've got now.  This could only
measure out as a win when the feature is going unused.

What about removing the callback per se and just keeping the argument,
as it were.  That is, a MemoryContext has a field of type "size_t *"
that is normally NULL, but if set to non-NULL, then we increment the
pointed-to value for pallocs and decrement for pfrees.

One thought in either case is that we don't have to touch the API for
MemoryContextCreate: rather, the feature can be enabled in a separate
call after making the context.
        regards, tom lane



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 29.8.2014 16:12, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> I have a new approach to the patch which is to call a callback at each
>> block allocation and child contexts inherit the callback from their
>> parents. The callback could be defined to simply dereference and
>> increment its argument, which would mean block allocations anywhere in
>> the hierarchy are incrementing the same variable, avoiding the need to
>> traverse the hierarchy.
>
> Cute, but it's impossible to believe that a function call isn't going
> to be *even more* overhead than what you've got now.  This could only
> measure out as a win when the feature is going unused.
> 
> What about removing the callback per se and just keeping the argument,
> as it were.  That is, a MemoryContext has a field of type "size_t *"
> that is normally NULL, but if set to non-NULL, then we increment the
> pointed-to value for pallocs and decrement for pfrees.

How is that going to work with two memory contexts (parent/child), both
requesting accounting? If I understand the proposal correctly, the child
will either inherit the pointer to the field in the parent (and then won't
get accounting for its own memory), or will get a private field (and thus
won't update the accounting of the parent context).

Or am I missing something?

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
> What about removing the callback per se and just keeping the argument,
> as it were.  That is, a MemoryContext has a field of type "size_t *"
> that is normally NULL, but if set to non-NULL, then we increment the
> pointed-to value for pallocs and decrement for pfrees.

I like that idea; patch attached. Two differences:
  * At block level, not chunk level.
  * I use a uint64, because a size_t is only guaranteed to hold one
allocation, and we need to sum up many allocations.
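
Concretely, the hot path gains roughly this much (a sketch; see the patch
for the details):

    /* when a block is malloc'd */
    if (context->track_mem != NULL)
        *context->track_mem += blksize;

    /* and the mirror image when a block is free'd */
    if (context->track_mem != NULL)
        *context->track_mem -= blksize;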

When unused, it does still appear to have a little overhead in Robert's
test on his POWER machine. It seems to be between a 0.5-1.0% regression.
There's not much extra code on that path, just a branch, pointer
dereference, and an addition; so I don't think it will get much cheaper
than it is.

This regression seems harder to reproduce on my workstation, or perhaps
it's just noisier.

> One thought in either case is that we don't have to touch the API for
> MemoryContextCreate: rather, the feature can be enabled in a separate
> call after making the context.

That seems fairly awkward to me because the pointer needs to be
inherited from the parent context when not specified. I left the extra
API call in.

The inheritance is awkward anyway, though. If you create a tracked
context as a child of an already-tracked context, allocations in the
newer one won't count against the original. I don't see a way around
that without introducing even more performance problems.

If this patch is unacceptable, my only remaining idea is to add the
memory only for the current context with no inheritance (thereby
eliminating the branch, also). That will require HashAgg to traverse all
the child contexts to check whether the memory limit has been exceeded.
As Tomas pointed out, that could be a lot of work in the case of
array_agg with many groups.

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
Hi,

On 16.10.2014 03:26, Jeff Davis wrote:
> On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
>> What about removing the callback per se and just keeping the
>> argument, as it were. That is, a MemoryContext has a field of type
>> "size_t *" that is normally NULL, but if set to non-NULL, then we
>> increment the pointed-to value for pallocs and decrement for
>> pfrees.
> 
> I like that idea; patch attached. Two differences:
>   * At block level, not chunk level.
>   * I use a uint64, because a size_t is only guaranteed to hold one
> allocation, and we need to sum up many allocations.
>
> When unused, it does still appear to have a little overhead in 
> Robert's test on his power machine. It seems to be between a 0.5-1.0%
> regression. There's not much extra code on that path, just a branch,
> pointer dereference, and an addition; so I don't think it will get
> much cheaper than it is.
> 
> This regression seems harder to reproduce on my workstation, or
> perhaps it's just noisier.

I assume it's the "reindex pgbench_accounts_pkey" test, correct?

I did the same test on my workstation (with i7-4770R CPU), and ISTM
there's a measurable difference, although in surprising directions. For
binaries compiled with gcc 4.7.3 and clang 3.3 and 3.5, the results look
like this (5 runs for each combination):

gcc 4.7.3 / master
------------------
LOG:  internal sort ended, ... CPU 1.24s/4.32u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.29s/4.31u sec elapsed 9.80 sec
LOG:  internal sort ended, ... CPU 1.26s/4.31u sec elapsed 9.80 sec
LOG:  internal sort ended, ... CPU 1.26s/4.37u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.33s/4.29u sec elapsed 9.78 sec

gcc 4.7.3 / patched
-------------------
LOG:  internal sort ended, ... CPU 1.35s/4.27u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.33s/4.30u sec elapsed 9.74 sec
LOG:  internal sort ended, ... CPU 1.34s/4.27u sec elapsed 9.75 sec
LOG:  internal sort ended, ... CPU 1.27s/4.26u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.35s/4.26u sec elapsed 9.78 sec

clang 3.3 / master
------------------
LOG:  internal sort ended, ... CPU 1.32s/4.61u sec elapsed 10.00 sec
LOG:  internal sort ended, ... CPU 1.27s/4.48u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.35s/4.48u sec elapsed 9.99 sec
LOG:  internal sort ended, ... CPU 1.32s/4.49u sec elapsed 9.97 sec
LOG:  internal sort ended, ... CPU 1.32s/4.47u sec elapsed 10.04 sec

clang 3.3 / patched
-------------------
LOG:  internal sort ended, ... CPU 1.35s/4.59u sec elapsed 10.13 sec
LOG:  internal sort ended, ... CPU 1.31s/4.61u sec elapsed 10.06 sec
LOG:  internal sort ended, ... CPU 1.28s/4.63u sec elapsed 10.10 sec
LOG:  internal sort ended, ... CPU 1.27s/4.58u sec elapsed 10.01 sec
LOG:  internal sort ended, ... CPU 1.29s/4.60u sec elapsed 10.05 sec

clang 3.5 / master
------------------
LOG:  internal sort ended, ... CPU 1.30s/4.46u sec elapsed 9.96 sec
LOG:  internal sort ended, ... CPU 1.30s/4.49u sec elapsed 9.96 sec
LOG:  internal sort ended, ... CPU 1.30s/4.53u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.25s/4.51u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.30s/4.50u sec elapsed 9.95 sec

clang 3.5 / patched
-------------------
LOG:  internal sort ended, ... CPU 1.32s/4.59u sec elapsed 9.97 sec
LOG:  internal sort ended, ... CPU 1.27s/4.49u sec elapsed 9.91 sec
LOG:  internal sort ended, ... CPU 1.29s/4.48u sec elapsed 9.88 sec
LOG:  internal sort ended, ... CPU 1.31s/4.49u sec elapsed 9.93 sec
LOG:  internal sort ended, ... CPU 1.23s/4.56u sec elapsed 9.90 sec

So while on clang 3.3 I really do see a ~1% slowdown, on the other two
compilers it seems a tad faster (but that might easily be noise).

It'd be interesting to know what compiler/version Robert used ...

>> One thought in either case is that we don't have to touch the API
>> for MemoryContextCreate: rather, the feature can be enabled in a
>> separate call after making the context.
> 
> That seems fairly awkward to me because the pointer needs to be 
> inherited from the parent context when not specified. I left the
> extra API call in.

I don't see a big difference between adding a new API "create" method
and adding a separate "enable tracking" method.

What does seem awkward, however, is using the 'uint64 *' directly as a
parameter, instead of an 'enable_tracking' flag as in the previous
versions. Is there a reason for that?

This allows supplying a pointer to a uint64 variable, i.e. something
like this:
    {
        uint64        mem_used = 0;
        MemoryContext context = MemoryContextCreateTracked(..., &mem_used);

        ...

        if (mem_used > work_mem * 1024L)
        {
            ... do something ...
        }
    }

Thanks to exposing the counter like this, it's possible to read the
current value directly (without the API methods available in
the previous versions). It also makes it possible to "detach" the
accounting from the parent (i.e. not count it against the parent).

But is this intended/sane?


> The inheritance is awkward anyway, though. If you create a tracked 
> context as a child of an already-tracked context, allocations in the 
> newer one won't count against the original. I don't see a way around 
> that without introducing even more performance problems.

I see this is as the main drawback of the current design :-(

The question is whether we can accept such a limitation, and enforce it
somehow. For example, if you request a new context with tracking, and you
find out the parent context already has tracking enabled, we can error
out. This would effectively make it an "internal only" feature, because the
user code (e.g. custom aggregates in extensions) can't rely on getting a
non-tracking context, thus making it impossible to use tracking. Maybe
that's an acceptable limitation?

If we can get into such a conflicting situation without any external code,
it's pretty much DOA. I can't think of such an example, but I'm not
convinced there isn't one ...


Regarding the "performance problems" - ISTM it's important to
differentiate between impact on performance when no memory accounting is
enabled (but the code is there), and impact with the accounting.

In other words:
  (a) free when not used (or as close to 'free' as possible)
  (b) cheap when used (as cheap as possible)

I think the current patch achieves (a) - it's difficult to beat the
single (pointer != NULL) check.

We haven't seen any benchmarks for (b), at least not with the current
patch. But even if we introduce some additional overhead (and it's
impossible not to), we get something in return - the question is whether
the cost is adequate. Also, if we don't have this accounting, the
overhead of the local implementations may easily be higher.


> If this patch is unacceptable, my only remaining idea is to add the
> memory only for the current context with no inheritance (thereby
> eliminating the branch, also). That will require HashAgg to traverse all
> the child contexts to check whether the memory limit has been exceeded.
> As Tomas pointed out, that could be a lot of work in the case of
> array_agg with many groups.

I can't really imagine dropping the inheritance ...

A hierarchy of memory contexts with accounting that does not support
hierarchies seems rather strange IMNSHO, making it either useless or at
least very expensive in the interesting cases (array_agg with a
context per group is a prime example).

ISTM the reason for the issues with inheritance (with two memory
contexts with tracking enabled) is the "single uint64 counter".

In one of my previous messages (from 23/8), I proposed to use this
struct to build a "parallel" accounting hierarchy (tweaked to match the v6
patch):

    typedef struct MemoryAccountingData {

        uint64    track_mem;
        struct MemoryAccountingData *parent;

    } MemoryAccountingData;

Each context either has a private instance (if tracking was enabled), or
points to the nearest parent context with tracking enabled (if there is one).

In the 'tracking disabled' case, this should have exactly the same
overhead as the v6 patch (still a single (pointer != NULL) check).

In the 'tracking enabled' case, it'll be more expensive, because the
changes need to propagate up the tree. But that happens only if there
are multiple contexts with tracking enabled - if there's a single one,
it should be exactly the same as the v6 patch.

I haven't tried measuring the impact though, because we don't have actual
code using it. The measurements from 23/8 say that with palloc_bench,
the overhead was ~3% (compared to 'tracking disabled'). Also, all that
benchmark does is palloc, so it's not really +3% in practice.


A few minor comments regarding the patch:

(a) I see we're not tracking self/total values any more. I don't think
    we really need those two numbers, the total is enough. Correct?
 


(b) both AllocSetReset/AllocSetDelete update the accounting info for
    each block separately:

        if (context->track_mem != NULL)
            *context->track_mem -= block->endptr - ((char *) block);

    Wouldn't it be better to accumulate the total into a local variable
    and then update the track_mem at once? This might be more important
    when using the MemoryAccountingData structure and multiple
    tracking-enabled contexts, though.


(c) Most of the changes in AllocSetContextCreateTracked are only
    renames 'context => set', which can be easily eliminated by using

        AllocSet context = (AllocSet) MemoryContextCreate(...

    and casting to MemoryContext at the one place where it's needed.


regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Simon Riggs
Date:
On 16 October 2014 02:26, Jeff Davis <pgsql@j-davis.com> wrote:

> The inheritance is awkward anyway, though. If you create a tracked
> context as a child of an already-tracked context, allocations in the
> newer one won't count against the original. I don't see a way around
> that without introducing even more performance problems.

This seems to have reached impasse, which is a shame. Let me throw out
a few questions to see if that might help.

Do I understand correctly that we are trying to account for exact
memory usage at palloc/pfree time? Why??

Surely we just want to keep track of the big chunks? Does this need to
be exact? How exact does it need to be?

Or alternatively, can't we just sample the allocations to reduce the overhead?

Are there some assumptions we can make about micro-contexts never
needing to be tracked at all? Jeff seems not too bothered by
inheritance, whereas Tomas thinks its essential. Perhaps there is a
middle ground, depending upon the role of the context?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote:
> Do I understand correctly that we are trying to account for exact
> memory usage at palloc/pfree time? Why??

Not palloc chunks, only tracking at the level of allocated blocks (that
we allocate with malloc).

It was a surprise to me that accounting at that level would have any
measurable impact, but Robert found a reasonable case on a POWER machine
that degraded a couple percent. I wasn't able to reproduce it
consistently on x86.

> Or alternatively, can't we just sample the allocations to reduce the overhead?

Not sure quite what you mean by "sample", but it sounds like something
along those lines would work.


Attached is a patch that does something very simple: only tracks blocks
held in the current context, with no inheritance or anything like it.
This reduces it to a couple arithmetic instructions added to the
alloc/dealloc path, so hopefully that removes the general performance
concern raised by Robert[1].

To calculate the total memory used, I included a function
MemoryContextMemAllocated() that walks the memory context and its
children recursively.
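
The shape of that function is roughly this (a sketch; the attached patch
is authoritative):

    uint64
    MemoryContextMemAllocated(MemoryContext context)
    {
        uint64      total = context->mem_allocated;
        MemoryContext child;

        for (child = context->firstchild; child != NULL; child = child->nextchild)
            total += MemoryContextMemAllocated(child);

        return total;
    }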

Of course, I was originally trying to avoid that, because it moves the
problem to HashAgg. For each group, it will need to execute
MemoryContextMemAllocated() to see if work_mem has been exceeded. It
will have to visit a couple contexts, or perhaps many (in the case of
array_agg, which creates one per group), which could be a measurable
regression for HashAgg.

But if that does turn out to be a problem, I think it's solvable. First,
I could micro-optimize the function by making it iterative rather than
recursive, to save on function call overhead. Second, I could offer a
way to prevent the HTAB from creating its own context, which would be
one less context to visit. And if those don't work, perhaps I could
resort to a sampling method of some kind, as you allude to above.

Regards,
    Jeff Davis

[1] I'm fairly sure I tested something very similar on Robert's POWER
machine a while ago, and it was fine. But I think it's offline or moved
now, so I can't verify the results. If this patch still somehow turns
out to be a 1%+ regression on any reasonable test, I don't know what to
say.


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Simon Riggs
Date:
On 17 November 2014 07:31, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote:
>> Do I understand correctly that we are trying to account for exact
>> memory usage at palloc/pfree time? Why??
>
> Not palloc chunks, only tracking at the level of allocated blocks (that
> we allocate with malloc).
>
> It was a surprise to me that accounting at that level would have any
> measurable impact, but Robert found a reasonable case on a POWER machine
> that degraded a couple percent. I wasn't able to reproduce it
> consistently on x86.

Surprise to me also.

Robert's tests showed a deviation of 0.4 sec after a restart. ISTM
that we wouldn't see that every time.

AFAIK the whole purpose of the memory allocator is to reduce the
number of system calls, so if we are doing so many malloc() calls as
to be noticeable just for accounting then something is wrong.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Simon Riggs
Date:
On 17 November 2014 07:31, Jeff Davis <pgsql@j-davis.com> wrote:

> To calculate the total memory used, I included a function
> MemoryContextMemAllocated() that walks the memory context and its
> children recursively.

If we assume that we want the results accurate to within 5%, then we
can work out a good sampling rate to achieve that accuracy.

We would probably need a counting mechanism that didn't slow down O(N)
as our allocations get bigger.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 15.11.2014 22:36, Simon Riggs wrote:
> On 16 October 2014 02:26, Jeff Davis <pgsql@j-davis.com> wrote:
> 
>> The inheritance is awkward anyway, though. If you create a tracked 
>> context as a child of an already-tracked context, allocations in 
>> the newer one won't count against the original. I don't see a way 
>> around that without introducing even more performance problems.
> 
> This seems to have reached impasse, which is a shame. Let me throw 
> out a few questions to see if that might help.
> 
> Do I understand correctly that we are trying to account for exact 
> memory usage at palloc/pfree time? Why??
> 
> Surely we just want to keep track of the big chunks? Does this need 
> to be exact? How exact does it need to be?

What do you mean by "big chunks"? Blocks, or really over-sized chunks
(over 8kB)? I assume blocks, and that's what the proposed patches are doing.

> Or alternatively, can't we just sample the allocations to reduce
> the overhead?

That might work, but I'm afraid it's trickier than it might seem.

> Are there some assumptions we can make about micro-contexts never 
> needing to be tracked at all?

What is a micro-context?

> Jeff seems not too bothered by inheritance, whereas Tomas thinks its
> essential. Perhaps there is a middle ground, depending upon the role
> of the context?

Having a hierarchy of memory contexts and an accounting scheme that
completely ignores that hierarchy seems a bit strange to me.

Maybe we can impose some restrictions, for example:
  (a) a single context with tracking explicitly requested (in the whole
      hierarchy)

  (b) when tracking is requested for a context, the parent context must
      not have tracking enabled

Either of these restrictions would prevent a situation where a context
has to update accounting for two parent contexts. That'd allow updating
a single place (because each context has a single parent context with
tracking requested).

But I'm not sure whether those restrictions are acceptable, whether we can
enforce them reliably, or how frequently they'd break external code
(e.g. extensions with custom aggregates etc.). Because this is part of
the public API, including the option to enable accounting for a memory
context.

Imagine you write an extension with a custom aggregate (or whatever) and
it works just fine. Then we tweak something in the executor, requesting
accounting for another context, and the extension suddenly breaks.

Or maybe we don't change anything, and the aggregate works fine with
Group Aggregate, but breaks whenever the plan switches to Hash Aggregate
(because that's using accounting).

Or maybe the user just redirects the tracking somewhere else (e.g. by
supplying their own pointer, as the v6 patch allows), excluding the context
from the accounting entirely.

So yeah, I think the inheritance is an important aspect, at least for a
generic accounting infrastructure. I'd bet that if we choose to do that,
there'll be a good deal of various WTF moments and foot-shooting in the
near future ...

On 23/8 I proposed to use a separate hierarchy of accounting info,
parallel to the memory context hierarchy (but only for contexts with
tracking explicitly requested)
  http://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz

but no response so far. I do see ~1% regression on Robert's benchmark
(on x86), but it's quite noisy. Not sure about PPC, which is what Robert
used. I don't see how to make it faster without sacrificing full
inheritance support ...

Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Andres Freund
Date:
Hi,

On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
> *** a/src/include/nodes/memnodes.h
> --- b/src/include/nodes/memnodes.h
> ***************
> *** 60,65 **** typedef struct MemoryContextData
> --- 60,66 ----
>       MemoryContext nextchild;    /* next child of same parent */
>       char       *name;            /* context name (just for debugging) */
>       bool        isReset;        /* T = no space alloced since last reset */
> +     uint64        mem_allocated;    /* track memory allocated for this context */
>   #ifdef USE_ASSERT_CHECKING
>       bool        allowInCritSection;    /* allow palloc in critical section */
>   #endif

That's quite possibly one culprit for the slowdown. Right now one
AllocSetContext struct fits precisely into three cachelines. After your
change it doesn't anymore.

Consider not counting memory in bytes but blocks and adding it directly
after the NodeTag. That'd leave the size the same as right now.
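
I.e. something along these lines - a 4-byte counter slotted into the
alignment padding that already exists after the NodeTag (a sketch, with a
made-up field name):

    typedef struct MemoryContextData
    {
        NodeTag     type;               /* 4 bytes */
        uint32      mem_allocated_kb;   /* in kB (or blocks); fills the
                                         * padding before the first
                                         * pointer member */
        MemoryContextMethods *methods;
        /* ... remaining members unchanged ... */
    } MemoryContextData;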

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 17.11.2014 08:31, Jeff Davis wrote:
> On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote:
>> Do I understand correctly that we are trying to account for exact
>> memory usage at palloc/pfree time? Why??
> 
> Not palloc chunks, only tracking at the level of allocated blocks
> (that we allocate with malloc).
> 
> It was a surprise to me that accounting at that level would have any 
> measurable impact, but Robert found a reasonable case on a POWER
> machine that degraded a couple percent. I wasn't able to reproduce
> it consistently on x86.
> 
>> Or alternatively, can't we just sample the allocations to reduce
>> the overhead?
> 
> Not sure quite what you mean by "sample", but it sounds like
> something along those lines would work.

Maybe. It might also cause unexpected volatility / nondeterminism in the
query execution.

Imagine a Hash Aggregate, where a small number of the requests is
considerably larger than the rest. If you happen to sample only a few of
those large requests, the whole hash aggregate happens in-memory,
otherwise it starts batching. The users will observe this as random
variations to the runtimes - same query / parameters, idle machine,
sudden changes to the plan ...

Maybe I'm too wary, but I guess there are use cases where the latency
uniformity is a concern.

> Attached is a patch that does something very simple: only tracks 
> blocks held in the current context, with no inheritance or anything 
> like it. This reduces it to a couple arithmetic instructions added
> to the alloc/dealloc path, so hopefully that removes the general 
> performance concern raised by Robert[1].
> 
> To calculate the total memory used, I included a function
> MemoryContextMemAllocated() that walks the memory context and its
> children recursively.
> 
> Of course, I was originally trying to avoid that, because it moves 
> the problem to HashAgg. For each group, it will need to execute 
> MemoryContextMemAllocated() to see if work_mem has been exceeded. It
>  will have to visit a couple contexts, or perhaps many (in the case 
> of array_agg, which creates one per group), which could be a 
> measurable regression for HashAgg.

:-(

> But if that does turn out to be a problem, I think it's solvable. 
> First, I could micro-optimize the function by making it iterative 
> rather than recursive, to save on function call overhead. Second, I 
> could offer a way to prevent the HTAB from creating its own context, 
> which would be one less context to visit. And if those don't work, 
> perhaps I could resort to a sampling method of some kind, as you 
> allude to above.

Do you plan to try this approach with your hashagg patch, and do some
measurements?

I'd expect the performance hit to be significant, but I'd like to see
some numbers.
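
For reference, the recursive walk described above would presumably look
something like this (a sketch based on the description, not the actual
patch):

    int64
    MemoryContextMemAllocated(MemoryContext context, bool recurse)
    {
        int64       total = context->mem_allocated;

        if (recurse)
        {
            MemoryContext child;

            for (child = context->firstchild;
                 child != NULL;
                 child = child->nextchild)
                total += MemoryContextMemAllocated(child, true);
        }

        return total;
    }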

kind regards
Tomas




Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 17.11.2014 18:04, Andres Freund wrote:
> Hi,
> 
> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
>> *** a/src/include/nodes/memnodes.h
>> --- b/src/include/nodes/memnodes.h
>> ***************
>> *** 60,65 **** typedef struct MemoryContextData
>> --- 60,66 ----
>>       MemoryContext nextchild;    /* next child of same parent */
>>       char       *name;            /* context name (just for debugging) */
>>       bool        isReset;        /* T = no space alloced since last reset */
>> +     uint64        mem_allocated;    /* track memory allocated for this context */
>>   #ifdef USE_ASSERT_CHECKING
>>       bool        allowInCritSection;    /* allow palloc in critical section */
>>   #endif
> 
> That's quite possibly one culprit for the slowdown. Right now one 
> AllocSetContext struct fits precisely into three cachelines. After
> your change it doesn't anymore.

I'm no PPC64 expert, but I thought the cache lines are 128 B on that
platform, since at least Power6?

Also, if I'm counting right, the MemoryContextData structure is 56B
without the 'mem_allocated' field (and without the allowInCritSection),
and 64B with it (at that particular place). sizeof() seems to confirm
that. (But I'm on x86, so maybe the alignment on PPC64 is different?).

> Consider not counting memory in bytes but blocks and adding it
> directly after the NodeTag. That'd leave the size the same as right
> now.

I suppose you mean "kbytes", because the block size is not fixed.

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Andres Freund
Date:
On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
> On 17.11.2014 18:04, Andres Freund wrote:
> > Hi,
> > 
> > On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
> >> *** a/src/include/nodes/memnodes.h
> >> --- b/src/include/nodes/memnodes.h
> >> ***************
> >> *** 60,65 **** typedef struct MemoryContextData
> >> --- 60,66 ----
> >>       MemoryContext nextchild;    /* next child of same parent */
> >>       char       *name;            /* context name (just for debugging) */
> >>       bool        isReset;        /* T = no space alloced since last reset */
> >> +     uint64        mem_allocated;    /* track memory allocated for this context */
> >>   #ifdef USE_ASSERT_CHECKING
> >>       bool        allowInCritSection;    /* allow palloc in critical section */
> >>   #endif
> > 
> > That's quite possibly one culprit for the slowdown. Right now one 
> > AllocSetContext struct fits precisely into three cachelines. After
> > your change it doesn't anymore.
> 
> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
> platform, since at least Power6?

Hm, might be true.

> Also, if I'm counting right, the MemoryContextData structure is 56B
> without the 'mem_allocated' field (and without the allowInCritSection),
> and 64B with it (at that particular place). sizeof() seems to confirm
> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).

The MemoryContextData struct is embedded into AllocSetContext.

> > Consider not counting memory in bytes but blocks and adding it
> > directly after the NodeTag. That'd leave the size the same as right
> > now.
> 
> I suppose you mean "kbytes", because the block size is not fixed.

Some coarser size than bytes. Don't really care about the granularity.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 17.11.2014 19:46, Andres Freund wrote:
> On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
>> On 17.11.2014 18:04, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
>>>> *** a/src/include/nodes/memnodes.h
>>>> --- b/src/include/nodes/memnodes.h
>>>> ***************
>>>> *** 60,65 **** typedef struct MemoryContextData
>>>> --- 60,66 ----
>>>>       MemoryContext nextchild;    /* next child of same parent */
>>>>       char       *name;            /* context name (just for debugging) */
>>>>       bool        isReset;        /* T = no space alloced since last reset */
>>>> +     uint64        mem_allocated;    /* track memory allocated for this context */
>>>>   #ifdef USE_ASSERT_CHECKING
>>>>       bool        allowInCritSection;    /* allow palloc in critical section */
>>>>   #endif
>>>
>>> That's quite possibly one culprit for the slowdown. Right now one
>>> AllocSetContext struct fits precisely into three cachelines. After
>>> your change it doesn't anymore.
>>
>> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
>> platform, since at least Power6?
>
> Hm, might be true.
>
>> Also, if I'm counting right, the MemoryContextData structure is 56B
>> without the 'mem_allocated' field (and without the allowInCritSection),
>> and 64B with it (at that particular place). sizeof() seems to confirm
>> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
>
> The MemoryContextData struct is embedded into AllocSetContext.

Oh, right. That makes it slightly more complicated, though, because
AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
additional field. There might be some difference because of alignment,
but I still don't see how that one additional field might impact cachelines?

But if we separated the freelist, that might actually make it faster, at
least for calls that don't touch the freelist at all, no? Because most
of the palloc() calls will be handled from the current block.

I tried this on the patch I sent on 23/8 (with the MemoryAccounting
hierarchy parallel to the contexts), and I got this (running 10x the
reindex, without restarts or such):

without the patch
-----------------
  ... 1723933 KB used: CPU 1.35s/4.33u sec elapsed 10.38 sec
  ... 1723933 KB used: CPU 1.39s/4.19u sec elapsed 9.89 sec
  ... 1723933 KB used: CPU 1.39s/4.26u sec elapsed 9.90 sec
  ... 1723933 KB used: CPU 1.45s/4.22u sec elapsed 10.15 sec
  ... 1723933 KB used: CPU 1.36s/4.20u sec elapsed 9.84 sec
  ... 1723933 KB used: CPU 1.32s/4.20u sec elapsed 9.96 sec
  ... 1723933 KB used: CPU 1.50s/4.23u sec elapsed 9.91 sec
  ... 1723933 KB used: CPU 1.47s/4.24u sec elapsed 9.90 sec
  ... 1723933 KB used: CPU 1.38s/4.23u sec elapsed 10.09 sec
  ... 1723933 KB used: CPU 1.37s/4.19u sec elapsed 9.84 sec

with the patch (no tracking enabled)
------------------------------------
  ... 1723933 KB used: CPU 1.40s/4.28u sec elapsed 10.79 sec
  ... 1723933 KB used: CPU 1.39s/4.32u sec elapsed 10.14 sec
  ... 1723933 KB used: CPU 1.40s/4.16u sec elapsed 9.99 sec
  ... 1723933 KB used: CPU 1.32s/4.21u sec elapsed 10.16 sec
  ... 1723933 KB used: CPU 1.37s/4.34u sec elapsed 10.03 sec
  ... 1723933 KB used: CPU 1.35s/4.22u sec elapsed 9.97 sec
  ... 1723933 KB used: CPU 1.43s/4.20u sec elapsed 9.99 sec
  ... 1723933 KB used: CPU 1.39s/4.17u sec elapsed 9.71 sec
  ... 1723933 KB used: CPU 1.44s/4.16u sec elapsed 9.91 sec
  ... 1723933 KB used: CPU 1.40s/4.25u sec elapsed 9.99 sec

So it's about 9.986 vs 10.068 seconds for the averages, and 9.905 vs 9.99 for
the medians, i.e. a ~0.8% difference. That's on x86, though. I wonder what'd be
the effect on the PPC machine.

Patch attached - it's not perfect, though. For example it should free
the freelists at some point, but I feel a bit lazy today.

Tomas

Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote:
> That's quite possibly one culprit for the slowdown. Right now one
> AllocSetContext struct fits precisely into three cachelines. After your
> change it doesn't anymore.

Thank you! I made the same mistake as Tomas: I saw that
MemoryContextData went to 64 bytes and thought "that should be fine"
while ignoring AllocSetContext.

I did some tests in the past between a version of my patch that made
MemoryContextData 72 bytes and one that made it 64 bytes, but I may have
neglected to test against the original 56 byte version.

I'm still not able to test to see if this is the real problem, but it
seems best to eliminate it anyway.

> Consider not counting memory in bytes but blocks and adding it directly
> after the NodeTag. That'd leave the size the same as right now.

I can also just move isReset there, and keep mem_allocated as a uint64.
That way, if I find later that I want to track the aggregated value for
the child contexts as well, I can split it into two uint32s. I'll hold
off on any such optimizations until I see some numbers from HashAgg
though.
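
Roughly, the layout would then become something like this (sketch):

    typedef struct MemoryContextData
    {
        NodeTag     type;           /* identifies exact kind of context */
        bool        isReset;        /* moved up into the padding after 'type' */
        MemoryContextMethods *methods;  /* virtual function table */
        MemoryContext parent;       /* NULL if no parent (toplevel context) */
        MemoryContext firstchild;   /* head of linked list of children */
        MemoryContext nextchild;    /* next child of same parent */
        char       *name;           /* context name (just for debugging) */
        uint64      mem_allocated;  /* takes the space isReset used to
                                     * occupy, keeping the struct at 56 bytes */
    #ifdef USE_ASSERT_CHECKING
        bool        allowInCritSection; /* allow palloc in critical section */
    #endif
    } MemoryContextData;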

Attached new version.

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jim Nasby
Date:
On 11/17/14, 10:50 AM, Tomas Vondra wrote:
> Either of these restrictions would prevent a situation where a context
> has to update accounting for two parent contexts. That'd allow updating
> a single place (because each context has a single parent context with
> tracking requested).

Another option might be to be lazy about updating parent contexts. I'm thinking
something like keeping a boolean (or a size-of-0 magic value) that indicates
whether a context has up-to-date info from its children or not. That would allow
you to only update the complete size when you need it; if your children haven't
been touched since the last time you calculated, then you don't need to recalc.
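
Roughly something like this, perhaps (a sketch only - all the names here
are made up):

    /* return the cached subtree total, recomputing it only when stale */
    int64
    MemoryContextLazyTotal(MemoryContext context)
    {
        if (!context->subtree_total_valid)
        {
            MemoryContext child;
            int64       total = context->mem_allocated;

            for (child = context->firstchild;
                 child != NULL;
                 child = child->nextchild)
                total += MemoryContextLazyTotal(child);

            context->subtree_total = total;
            context->subtree_total_valid = true;
        }

        return context->subtree_total;
    }

    /* called whenever a context allocates or frees a block */
    static void
    invalidate_ancestors(MemoryContext context)
    {
        /* stop at the first ancestor that is already marked stale */
        for (; context != NULL && context->subtree_total_valid;
             context = context->parent)
            context->subtree_total_valid = false;
    }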
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Andres Freund
Date:
On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
> On 17.11.2014 19:46, Andres Freund wrote:
> > On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
> >> On 17.11.2014 18:04, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
> >>>> *** a/src/include/nodes/memnodes.h
> >>>> --- b/src/include/nodes/memnodes.h
> >>>> ***************
> >>>> *** 60,65 **** typedef struct MemoryContextData
> >>>> --- 60,66 ----
> >>>>       MemoryContext nextchild;    /* next child of same parent */
> >>>>       char       *name;            /* context name (just for debugging) */
> >>>>       bool        isReset;        /* T = no space alloced since last reset */
> >>>> +     uint64        mem_allocated;    /* track memory allocated for this context */
> >>>>   #ifdef USE_ASSERT_CHECKING
> >>>>       bool        allowInCritSection;    /* allow palloc in critical section */
> >>>>   #endif
> >>>
> >>> That's quite possibly one culprit for the slowdown. Right now one 
> >>> AllocSetContext struct fits precisely into three cachelines. After
> >>> your change it doesn't anymore.
> >>
> >> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
> >> platform, since at least Power6?
> > 
> > Hm, might be true.
> > 
> >> Also, if I'm counting right, the MemoryContextData structure is 56B
> >> without the 'mem_allocated' field (and without the allowInCritSection),
> >> and 64B with it (at that particular place). sizeof() seems to confirm
> >> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
> > 
> > The MemoryContextData struct is embedded into AllocSetContext.
> 
> Oh, right. That makes is slightly more complicated, though, because
> AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
> pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
> additional field. There might be some difference because of alignment,
> but I still don't see how that one additional field might impact cachelines?

It's actually 196 bytes:

struct AllocSetContext {
        MemoryContextData          header;               /*     0    56 */
        AllocBlock                 blocks;               /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        AllocChunk                 freelist[11];         /*    64    88 */
        /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
        Size                       initBlockSize;        /*   152     8 */
        Size                       maxBlockSize;         /*   160     8 */
        Size                       nextBlockSize;        /*   168     8 */
        Size                       allocChunkLimit;      /*   176     8 */
        AllocBlock                 keeper;               /*   184     8 */
        /* --- cacheline 3 boundary (192 bytes) --- */

        /* size: 192, cachelines: 3, members: 8 */
};

And thus one additional field tipps it over the edge.

"pahole" is a very useful utility.

> But if we separated the freelist, that might actually make it faster, at
> least for calls that don't touch the freelist at all, no? Because most
> of the palloc() calls will be handled from the current block.

I seriously doubt it. The additional indirection + additional branches
are likely to make it worse.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 21.11.2014 00:03, Andres Freund wrote:
> On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
>> On 17.11.2014 19:46, Andres Freund wrote:
>>
>>> The MemoryContextData struct is embedded into AllocSetContext.
>>
>> Oh, right. That makes is slightly more complicated, though, because
>> AllocSetContext adds 6 x 8B fields plus an in-line array of 
>> freelist pointers. Which is 11x8 bytes. So in total 56+56+88=200B, 
>> without the additional field. There might be some difference 
>> because of alignment, but I still don't see how that one
>> additional field might impact cachelines?
> 
> It's actually 196 bytes:

Ummmm, I think the pahole output shows 192, not 196? Otherwise it
wouldn't be exactly 3 cachelines anyway.

But yeah - my math-foo was weak for a moment, because 6x8 != 56. Which
is the 8B difference :-/

> struct AllocSetContext {
>         MemoryContextData          header;               /*     0    56 */
>         AllocBlock                 blocks;               /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         AllocChunk                 freelist[11];         /*    64    88 */
>         /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
>         Size                       initBlockSize;        /*   152     8 */
>         Size                       maxBlockSize;         /*   160     8 */
>         Size                       nextBlockSize;        /*   168     8 */
>         Size                       allocChunkLimit;      /*   176     8 */
>         AllocBlock                 keeper;               /*   184     8 */
>         /* --- cacheline 3 boundary (192 bytes) --- */
> 
>         /* size: 192, cachelines: 3, members: 8 */
> };
> 
> And thus one additional field tipps it over the edge.
> 
> "pahole" is a very useful utility.

Indeed.

>> But if we separated the freelist, that might actually make it
>> faster, at least for calls that don't touch the freelist at all,
>> no? Because most of the palloc() calls will be handled from the
>> current block.
> 
> I seriously doubt it. The additional indirection + additional
> branches are likely to make it worse.

That's possible, although I tried it on "my version" of the accounting
patch, and it showed slight improvement (lower overhead) on Robert's
reindex benchmark.

The question is how that would work with a regular workload, because
moving the freelist out of the structure makes it smaller (2 cachelines
instead of 3), and it can only impact workloads working with the
freelists (i.e. either by calling free, or realloc, or whatever).
Although palloc() checks the freelist too ...

Also, those pieces may be allocated together (next to each other), which
might keep locality.

But I haven't tested any of this, and my knowledge of this low-level
stuff is poor, so I might be completely wrong.

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Peter Geoghegan
Date:
On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> I can also just move isReset there, and keep mem_allocated as a uint64.
> That way, if I find later that I want to track the aggregated value for
> the child contexts as well, I can split it into two uint32s. I'll hold
> off any any such optimizations until I see some numbers from HashAgg
> though.

I took a quick look at memory-accounting-v8.patch.

Is there some reason why mem_allocated is a uint64? All other things
being equal, I'd follow the example of tuplesort.c's
MemoryContextAllocHuge() API, which (following bugfix commit
79e0f87a1) uses int64 variables to track available memory and so on.

-- 
Peter Geoghegan



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > I can also just move isReset there, and keep mem_allocated as a uint64.
> > That way, if I find later that I want to track the aggregated value for
> > the child contexts as well, I can split it into two uint32s. I'll hold
> > off any any such optimizations until I see some numbers from HashAgg
> > though.
>
> I took a quick look at memory-accounting-v8.patch.
>
> Is there some reason why mem_allocated is a uint64? All other things
> being equal, I'd follow the example of tuplesort.c's
> MemoryContextAllocHuge() API, which (following bugfix commit
> 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.

Regards,
    Jeff Davis



Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Michael Paquier
Date:
On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > I can also just move isReset there, and keep mem_allocated as a uint64.
>> > That way, if I find later that I want to track the aggregated value for
>> > the child contexts as well, I can split it into two uint32s. I'll hold
>> > off any any such optimizations until I see some numbers from HashAgg
>> > though.
>>
>> I took a quick look at memory-accounting-v8.patch.
>>
>> Is there some reason why mem_allocated is a uint64? All other things
>> being equal, I'd follow the example of tuplesort.c's
>> MemoryContextAllocHuge() API, which (following bugfix commit
>> 79e0f87a1) uses int64 variables to track available memory and so on.
>
> No reason. New version attached; that's the only change.
Note that I am marking this patch back to "Needs Review" state. It
doesn't seem that this patch has been reviewed completely.
-- 
Michael



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Michael Paquier
Date:
On Mon, Dec 8, 2014 at 9:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> > I can also just move isReset there, and keep mem_allocated as a uint64.
>>> > That way, if I find later that I want to track the aggregated value for
>>> > the child contexts as well, I can split it into two uint32s. I'll hold
>>> > off any any such optimizations until I see some numbers from HashAgg
>>> > though.
>>>
>>> I took a quick look at memory-accounting-v8.patch.
>>>
>>> Is there some reason why mem_allocated is a uint64? All other things
>>> being equal, I'd follow the example of tuplesort.c's
>>> MemoryContextAllocHuge() API, which (following bugfix commit
>>> 79e0f87a1) uses int64 variables to track available memory and so on.
>>
>> No reason. New version attached; that's the only change.
> Note that I am marking this patch back to "Needs Review" state. I
Moving to CF 2014-12.
-- 
Michael



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 2.12.2014 06:14, Jeff Davis wrote:
> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> I can also just move isReset there, and keep mem_allocated as a uint64.
>>> That way, if I find later that I want to track the aggregated value for
>>> the child contexts as well, I can split it into two uint32s. I'll hold
>>> off any any such optimizations until I see some numbers from HashAgg
>>> though.
>>
>> I took a quick look at memory-accounting-v8.patch.
>>
>> Is there some reason why mem_allocated is a uint64? All other things
>> being equal, I'd follow the example of tuplesort.c's
>> MemoryContextAllocHuge() API, which (following bugfix commit
>> 79e0f87a1) uses int64 variables to track available memory and so on.
>
> No reason. New version attached; that's the only change.

I've started reviewing this today. It does not apply cleanly on current
head, because of 4a14f13a committed a few days ago. That commit changed
the constants in  src/include/utils/hsearch.h, so the patch needs to use
this:

#define HASH_NOCHILDCXT 0x4000 /* ... */

That's the only conflict, and after fixing it it compiles OK. However, I
got a segfault on the very first query I tried :-(

  create table test_hash_agg as select i AS a, i AS b, i AS c, i AS d
    from generate_series(1,10000000) s(i);

  analyze test_hash_agg;

  select a, count(*) from test_hash_agg where a < 100000 group by a;

With a fresh cluster (default config), I get this:

  server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
  The connection to the server was lost. Attempting reset: Failed.

The backtrace looks like this (full attached):

Program received signal SIGSEGV, Segmentation fault.
advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
peraggstate=peraggstate@entry=0x2b5efd0,
pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
468                     if (pergroupstate->noTransValue)
(gdb) bt
#0  advance_transition_function at nodeAgg.c:468
#1  0x00000000005c3494 in advance_aggregates at nodeAgg.c:624
#2  0x00000000005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
#3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338

(gdb) print pergroupstate->noTransValue
Cannot access memory at address 0x11
(gdb) print pergroupstate
$1 = (AggStatePerGroup) 0x8

My guess is something is scribbling over the pergroup memory, or maybe
the memory context gets released, or something. In any case, it's easily
reproducible, and apparently deterministic (always exactly the same
values, no randomness).

regards
Tomas

Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 21.12.2014 20:19, Tomas Vondra wrote:
>
> However, I got a segfault on the very first query I tried :-(
> 
>   create table test_hash_agg as select i AS a, i AS b, i AS c, i AS d
>     from generate_series(1,10000000) s(i);
> 
>   analyze test_hash_agg;
> 
>   select a, count(*) from test_hash_agg where a < 100000 group by a;
> 
> With a fresh cluster (default config), I get this:
> 
>   server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>   The connection to the server was lost. Attempting reset: Failed.
> 
> The backtrace looks like this (full attached):
> 
> Program received signal SIGSEGV, Segmentation fault.
> advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
> peraggstate=peraggstate@entry=0x2b5efd0,
> pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
> 468                     if (pergroupstate->noTransValue)
> (gdb) bt
> #0  advance_transition_function at nodeAgg.c:468
> #1  0x00000000005c3494 in advance_aggregates at nodeAgg.c:624
> #2  0x00000000005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
> #3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338
> 
> (gdb) print pergroupstate->noTransValue
> Cannot access memory at address 0x11
> (gdb) print pergroupstate
> $1 = (AggStatePerGroup) 0x8
> 
> My guess is something is scribbling over the pergroup memory, or maybe
> the memory context gets released, or something. In any case, it's easily
> reproducible, and apparently deterministic (always exactly the same
> values, no randomness).

BTW the fact that 'make installcheck' and 'make check' both pass, yet a
simple query crashes, suggests there are no regression tests testing the
batching properly. Those should be part of the patch, I believe.

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
It seems that these two patches are being reviewed together. Should I
just combine them into one? My understanding was that some wanted to
review the memory accounting patch separately.

On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
> That's the only conflict, and after fixing it it compiles OK. However, I
> got a segfault on the very first query I tried :-(

If lookup_hash_entry doesn't find the group, and there's not enough
memory to create it, then it returns NULL; but the caller wasn't
checking for NULL. My apologies for such a trivial mistake, I was doing
most of my testing using DISTINCT. My fix here was done quickly, so I'll
take a closer look later to make sure I didn't miss something else.
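
In other words, the caller needs something along these lines (a sketch of
the idea, not the actual diff; hash_spill_tuple is a placeholder name for
whatever routes the tuple to the current batch's spill file):

    entry = lookup_hash_entry(aggstate, outerslot);

    if (entry == NULL)
    {
        /* no memory for a new group: defer this tuple to a later batch */
        hash_spill_tuple(aggstate, outerslot);
    }
    else
    {
        /* group found (or created): advance its transition values */
        advance_aggregates(aggstate, entry->pergroup);
    }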

New patch attached (rebased, as well).

I also see your other message about adding regression testing. I'm
hesitant to slow down the tests for everyone to run through this code
path though. Should I add regression tests, and then remove them later
after we're more comfortable that it works?

Regards
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.

I think we should keep the patches separate. Applying two patches is
trivial, splitting them not so much.

> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
> 
> If lookup_hash_entry doesn't find the group, and there's not enough 
> memory to create it, then it returns NULL; but the caller wasn't 
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
> 
> New patch attached (rebased, as well).
> 
> I also see your other message about adding regression testing. I'm 
> hesitant to slow down the tests for everyone to run through this
> code path though. Should I add regression tests, and then remove them
> later after we're more comfortable that it works?

I think when done right, the additional time will be negligible. By
setting the work_mem low, we don't need that much data.

regards
Tomas



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Tue, 2014-12-23 at 01:16 -0800, Jeff Davis wrote:
> New patch attached (rebased, as well).
>
> I also see your other message about adding regression testing. I'm
> hesitant to slow down the tests for everyone to run through this code
> path though. Should I add regression tests, and then remove them later
> after we're more comfortable that it works?

Attached are some tests I ran. First, generate the data sets with
hashagg_test_data.sql. Then, do (I used work_mem at default of 4MB):

  set enable_hashagg=false;
  \o /tmp/sort.out
  \i /tmp/hashagg_test.sql
  \o
  set enable_hashagg=true;
  \o /tmp/hash.out
  \i /tmp/hashagg_test.sql

and then diff'd the output to make sure the results are the same (except
the plans, of course). The script loads the results into a temp table,
then sorts it before outputting, to make the test order-independent. I
didn't just add an ORDER BY, because that would change the plan and it
would never use hashagg.

I think that has fairly good coverage of the hashagg code. I used 3
different input data sets, byval and byref types (for group key and
args), and a group aggregate query as well as DISTINCT. Let me know if I
missed something.

I also did some performance comparisons between disk-based sort+group
and disk-based hashagg. The results are quite favorable for hashagg
given the data sets I provided. Simply create the data using
hashagg_test_data.sql (if not already done), set the work_mem to the
value you want to test, and run hashagg_test_perf.sql. It uses EXPLAIN
ANALYZE for the timings.

singleton: 10M groups of 1
even: 1M groups of 10
skew: wildly different group sizes; see data script

q1: group aggregate query
q2: distinct query

The total memory requirements for the test to run without going to disk
ranges from about 100MB (for "even") to about 1GB (for "singleton").
Regardless of work_mem, these all fit in memory on my machine, so they
aren't *really* going to disk. Also note that, because of how the memory
blocks are allocated, and because hashagg waits until memory is exceeded,
hashagg might use about double work_mem when work_mem is small (the
effect is not important at higher values).

work_mem='1MB':
                  sort+group (s)    hashagg (s)
   singleton q1           12             10
   singleton q2            8              7
   even q1                14              7
   even q2                10              5
   skew q1                22              6
   skew q2                16              4

work_mem='4MB':
                  sort+group (s)    hashagg (s)
   singleton q1           12             11
   singleton q2            8              6
   even q1                12              7
   even q2                 9              5
   skew q1                19              6
   skew q2                13              3

work_mem='16MB':
                  sort+group (s)    hashagg (s)
   singleton q1           12             11
   singleton q2            8              7
   even q1                14              7
   even q2                10              5
   skew q1                15              6
   skew q2                12              4

work_mem='64MB':
                  sort+group (s)    hashagg (s)
   singleton q1           13             12
   singleton q2            9              8
   even q1                14              8
   even q2                10              5
   skew q1                17              6
   skew q2                13              4

work_mem='256MB':
                  sort+group (s)    hashagg (s)
   singleton q1           12             12
   singleton q2            9              8
   even q1                14              7
   even q2                11              4
   skew q1                16              6
   skew q2                13              4

work_mem='512MB':
                  sort+group (s)    hashagg (s)
   singleton q1           12             12
   singleton q2            9              7
   even q1                14              7
   even q2                10              4
   skew q1                16              6
   skew q2                12              4

work_mem='2GB':
                  sort+group (s)    hashagg (s)
   singleton q1            9             12
   singleton q2            6              6
   even q1                 8              7
   even q2                 6              4
   skew q1                 7              6
   skew q2                 5              4


These numbers are great news for disk-based hashagg. It seems to be the
same or better than sort+group in nearly all cases (again, this example
doesn't actually go to disk, so those numbers may come out differently).
Also, the numbers are remarkably stable for varying work_mem for both
plans. That means that it doesn't cost much to keep a lower work_mem as
long as your system has plenty of memory.

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might be a different
story though). I feel like I made a mistake -- can someone please do a
sanity check on my numbers?

Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Peter Geoghegan
Date:
On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Do others have similar numbers? I'm quite surprised at how little
> work_mem seems to matter for these plans (HashJoin might be a different
> story though). I feel like I made a mistake -- can someone please do a
> sanity check on my numbers?

I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.

I was surprised by this too, but it has been independently reported by
Jeff Janes.
-- 
Peter Geoghegan



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Sun, 2014-12-28 at 12:37 -0800, Jeff Davis wrote:
> I feel like I made a mistake -- can someone please do a
> sanity check on my numbers?

I forgot to randomize the inputs, which doesn't matter much for hashagg
but does matter for sort. New data script attached. The results are even
*better* for disk-based hashagg than the previous numbers suggest. Here
are some new numbers:

work_mem='1MB':
                  sort+group (s)    hashagg (s)
   singleton q1           21             10
   singleton q2           12              8
   even q1                20              7
   even q2                13              5
   skew q1                22              6
   skew q2                16              4

work_mem='4MB':
                  sort+group (s)    hashagg (s)
   singleton q1           17             10
   singleton q2           11              6
   even q1                16              7
   even q2                11              5
   skew q1                19              6
   skew q2                13              4

work_mem='16MB':
                  sort+group (s)    hashagg (s)
   singleton q1           16             11
   singleton q2           11              7
   even q1                15              8
   even q2                12              6
   skew q1                15              6
   skew q2                12              4

work_mem='64MB':
                  sort+group (s)    hashagg (s)
   singleton q1           18             12
   singleton q2           13              8
   even q1                17             10
   even q2                13              6
   skew q1                17              6
   skew q2                14              4

work_mem='256MB':
                  sort+group (s)    hashagg (s)
   singleton q1           18             12
   singleton q2           14              7
   even q1                16              9
   even q2                14              5
   skew q1                18              6
   skew q2                13              4

work_mem='512MB':
                  sort+group (s)    hashagg (s)
   singleton q1           18             12
   singleton q2           14              7
   even q1                17              9
   even q2                14              5
   skew q1                17              6
   skew q2                13              4

work_mem='2GB':
                  sort+group (s)    hashagg (s)
   singleton q1           11             11
   singleton q2            7              6
   even q1                10              9
   even q2                 7              5
   skew q1                 7              6
   skew q2                 4              4


Regards,
    Jeff Davis


Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jim Nasby
Date:
On 12/28/14, 2:45 PM, Peter Geoghegan wrote:
> On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Do others have similar numbers? I'm quite surprised at how little
>> work_mem seems to matter for these plans (HashJoin might be a different
>> story though). I feel like I made a mistake -- can someone please do a
>> sanity check on my numbers?
>
> I have seen external sorts that were quicker than internal sorts
> before. With my abbreviated key patch, under certain circumstances
> external sorts are faster, while presumably the same thing is true of
> int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
> that was marginally faster than a multi-gigabyte one that fit the
> entire sort in memory. It probably has something to do with caching
> effects dominating over the expense of more comparisons, since higher
> work_mem settings that still resulted in an external sort were slower
> than the 10MB setting.
>
> I was surprised by this too, but it has been independently reported by
> Jeff Janes.

I haven't tested for external faster than internal in a while, but I've certainly seen this effect before. Generally,
once you get beyond a certain size (maybe 100MB?) you run the risk of a tapesort being faster than an internal sort.
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Janes
Date:
On Sun, Dec 28, 2014 at 12:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Do others have similar numbers? I'm quite surprised at how little
> > work_mem seems to matter for these plans (HashJoin might be a different
> > story though). I feel like I made a mistake -- can someone please do a
> > sanity check on my numbers?
>
> I have seen external sorts that were quicker than internal sorts
> before. With my abbreviated key patch, under certain circumstances
> external sorts are faster, while presumably the same thing is true of
> int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
> that was marginally faster than a multi-gigabyte one that fit the
> entire sort in memory. It probably has something to do with caching
> effects dominating over the expense of more comparisons, since higher
> work_mem settings that still resulted in an external sort were slower
> than the 10MB setting.
>
> I was surprised by this too, but it has been independently reported by
> Jeff Janes.

I don't recall (at the moment) seeing our external sort actually faster than quick-sort, but I've very reliably seen external sorts get faster with less memory than with more.  It is almost certainly a CPU caching issue.  Very large simple binary heaps are horrible on the CPU cache. And for sort-by-reference values, quick sort is also pretty bad.  

With a slow enough data bus between the CPU and main memory, I don't doubt that a 'tapesort' with small work_mem could actually be faster than quicksort with large work_mem.  But I don't recall seeing it myself.  But I'd be surprised that a tapesort as currently implemented would be faster than a quicksort if the tapesort is using just one byte less memory than the quicksort is.

But to Jeff Davis's question, yes, tapesort is not very sensitive to work_mem, and to the extent it is sensitive it is in the other direction of more memory being bad.  Once work_mem is so small that it takes multiple passes over the data to do the merge, then small memory would really be a problem.  But on modern hardware you have to get pretty absurd settings before that happens.

Cheers,

Jeff

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.
>
> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
>
> If lookup_hash_entry doesn't find the group, and there's not enough
> memory to create it, then it returns NULL; but the caller wasn't
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
>
> New patch attached (rebased, as well).

I did a review today, using these two patches:

    * memory-accounting-v9.patch (submitted on December 2)
    * hashagg-disk-20141222.patch

I started with some basic performance measurements comparing hashagg
queries without and with the patches (while you compared hashagg and
sort). That's IMHO an interesting comparison, especially when no
batching is necessary - in the optimal case the users should not see any
slowdown (we shouldn't make them pay for the batching unless it actually
is necessary).

So I did this:

    drop table if exists test_hash_agg;

    create table test_hash_agg as
        select
            i AS a,
            mod(i,1000000) AS b,
            mod(i,100000) AS c,
            mod(i,10000) AS d,
            mod(i,1000) AS e,
            mod(i,100) AS f
    from generate_series(1,10000000) s(i);

    vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different
cardinalities. And then queries like this:

   select count(*) from (select a, count(a) from test_hash_agg
                         group by a) foo;

   -- 10M groups (OOM)
   select count(*) from (select a, array_agg(a) from test_hash_agg
                         group by a) foo;

   -- 100 groups
   select count(*) from (select f, array_agg(f) from test_hash_agg
                         group by f) foo;

which performed quite well, i.e. I've seen absolutely no slowdown.
Which, in the array_agg case, is quite suspicious, because on
every lookup_hash_entry call, it has to do MemoryContextMemAllocated()
on 10M contexts, and I really doubt that can be done in ~0 time.

So I started digging in the code and I noticed this:

    hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the
hashtable itself. It might be correct for aggregates with state passed
by value, but it's incorrect for state passed by reference (e.g.
Numeric, arrays etc.), because initialize_aggregates does this:

    oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
    pergroupstate->transValue = datumCopy(peraggstate->initValue,
                                        peraggstate->transtypeByVal,
                                        peraggstate->transtypeLen);
    MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggregates that have no
access to hashcontext at all, and only get a reference to aggcontext using
AggCheckCallContext(). array_agg() is a prime example.

In those cases the patch actually does no memory accounting and as
hashcontext has no child contexts, there's no accounting overhead.

After fixing this bug (replacing hashcontext with aggcontext in both
calls to MemoryContextMemAllocated) the performance drops dramatically.
For the query with 100 groups (not requiring any batching) I see this:

test=# explain analyze select count(x) from (select f, array_agg(1) AS x
from test_hash_agg group by f) foo;

                        QUERY PLAN (original patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=2539.156..2539.156 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=2492.264..2539.012 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 24kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.022..547.379 rows=10000000 loops=1)
 Planning time: 0.039 ms
 Execution time: 2542.932 ms
(7 rows)

                        QUERY PLAN (fixed patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=5670.885..5670.885 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=5623.254..5670.803 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 117642kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.014..577.924 rows=10000000 loops=1)
 Planning time: 0.103 ms
 Execution time: 5677.187 ms
(7 rows)

So the performance drops 2x. With more groups, the performance impact is
even worse. For example with the first query (with 10M groups), this is
what I get in perf:

explain analyze select count(x) from (select a, array_agg(1) AS x from
test_hash_agg group by a) foo;


   PerfTop:    4671 irqs/sec  kernel:11.2%  exact:  0.0%
------------------------------------------------------------------------

    87.07%  postgres                 [.] MemoryContextMemAllocated
     1.63%  [zcommon]                [k] fletcher_4_native
     1.60%  [kernel]                 [k] acpi_processor_ffh_cstate_enter
     0.68%  [kernel]                 [k] xhci_irq
     0.30%  ld-2.19.so               [.] _dl_sysinfo_int80
     0.30%  [kernel]                 [k] memmove
     0.26%  [kernel]                 [k] ia32_syscall
     0.16%  libglib-2.0.so.0.3800.2  [.] 0x000000000008c52a
     0.15%  libpthread-2.19.so       [.] pthread_mutex_lock


and it runs indefinitely (I gave up after a few minutes). I believe this
renders the proposed memory accounting approach dead.

Granted, the 10M groups is a bit extreme example, but the first query
with 100 groups certainly is not.

I understand the effort to avoid the 2% regression measured by Robert on
a PowerPC machine, but I don't think that's a sufficient reason to cause
so much trouble to everyone using array_agg() or user-defined aggregates
based on the same 'create subcontext' pattern.

Especially when the reindex can often be improved by using more
maintenance_work_mem, but there's nothing you can do to improve this (if
it's not batching, then increasing work_mem will do nothing).

The array_agg() patch I submitted to this CF would fix this particular
query, as it removes the child contexts (so there's no need for
recursion in MemoryContextMemAllocated), but it does nothing to the
user-defined aggregates out there. And it's not committed yet.

Also, ISTM this makes it rather unusable as a general accounting
approach. If a mere 100 subcontexts result in a 2x slowdown, then a handful
of subcontexts will certainly have a measurable impact if we decide to
use this somewhere else (and not just in hash aggregate).

So I was curious how the accounting mechanism I proposed (parallel
MemoryAccounting hierarchy next to MemoryContext) would handle this, so
I've used it instead of the memory-accounting-v9.patch. And I measured
no difference compared to master (no slowdown at all).

I also tried array_agg() queries that actually require batching, e.g.

  select a, array_agg(1) x from test_hash_agg group by a;

which produces 10M groups, each using a separate 8kB context, which
amounts to 80GB in total. With work_mem=1GB this should proceed just
fine with ~80 batches. In practice, it runs indefinitely (again, I lost
patience after a few minutes), and I see this:

Device:   rrqm/s   wrqm/s     r/s      w/s    rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0,00     0,00    0,00   281,00     0,00  140784,00  1002,02   143,25  512,57    0,00  512,57   3,56 100,00

tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ du -s
374128  .
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | wc -l
130
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | head
celkem 372568
-rw------- 1 tomas users 2999480  7. led 19.46 pgsql_tmp23267.2637
-rw------- 1 tomas users 2973520  7. led 19.46 pgsql_tmp23267.2638
-rw------- 1 tomas users 2978800  7. led 19.46 pgsql_tmp23267.2639
-rw------- 1 tomas users 2959880  7. led 19.46 pgsql_tmp23267.2640
-rw------- 1 tomas users 3010040  7. led 19.46 pgsql_tmp23267.2641
-rw------- 1 tomas users 3083520  7. led 19.46 pgsql_tmp23267.2642
-rw------- 1 tomas users 3053160  7. led 19.46 pgsql_tmp23267.2643
-rw------- 1 tomas users 3044360  7. led 19.46 pgsql_tmp23267.2644
-rw------- 1 tomas users 3014000  7. led 19.46 pgsql_tmp23267.2645

That is, there are ~130 files, each ~3MB large, ~370MB in total. But the
system does ~140MB/s writes all the time. Also, the table has ~500MB in
total.

So either I'm missing something, or there's some sort of bug.

Attached you can find a bunch of files:

  * 0001-memory-accounting.patch (my memory accounting)
  * 0002-hashagg-patch.patch (jeff's patch, for completeness)
  * 0003-context-fix.patch (fix hashcontext -> aggcontext)
  * test.sql (data / queries I used for testing)

regards
Tomas

Attachment

Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Michael Paquier
Date:
On Thu, Jan 8, 2015 at 4:07 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> which is IMHO obviously wrong, because that accounts only for the
> hashtable itself.
> In those cases the patch actually does no memory accounting and as
> hashcontext has no child contexts, there's no accounting overhead.
> [blah]

> So the performance drops 2x. With more groups, the performance impact is
> even worse.
Hmm. It seems that this patch is not there yet, so marking it as
returned with feedback, especially knowing that the patch incorrectly
uses hashcontext, as mentioned by Tomas.
-- 
Michael



Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Jeff Davis
Date:
On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
> So I started digging in the code and I noticed this:
> 
>     hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
> 
> which is IMHO obviously wrong, because that accounts only for the
> hashtable itself. It might be correct for aggregates with state passed
> by value, but it's incorrect for state passed by reference (e.g.
> Numeric, arrays etc.), because initialize_aggregates does this:
> 
>     oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
>     pergroupstate->transValue = datumCopy(peraggstate->initValue,
>                                         peraggstate->transtypeByVal,
>                                         peraggstate->transtypeLen);
>     MemoryContextSwitchTo(oldContext);
> 
> and it's also wrong for all the user-defined aggretates that have no
> access to hashcontext at all, and only get reference to aggcontext using
> AggCheckCallContext(). array_agg() is a prime example.

Actually, I believe the first one is right, and the second one is wrong.
If we allocate pass-by-ref transition states in aggcontext, that defeats
the purpose of the patch, because we can't release the memory when we
start a new batch (because we can't reset aggcontext).

Instead, the transition states should be copied into hashcontext; we
should count the memory usage of hashcontext; AggCheckCallContext should
return hashcontext; and after each batch we should reset hashcontext.

It might be worth refactoring a bit to call it the "groupcontext" or
something instead, and use it for both HashAgg and GroupAgg. That would
reduce the need for conditionals.



[ ... problems with array_agg subcontexts ... ]

> and it runs indefinitely (I gave up after a few minutes). I believe this
> renders the proposed memory accounting approach dead.

...

> The array_agg() patch I submitted to this CF would fix this particular
> query, as it removes the child contexts (so there's no need for
> recursion in MemoryContextMemAllocated), but it does nothing to the
> user-defined aggregates out there. And it's not committed yet.

Now that your patch is committed, I don't think I'd call the memory
accounting patch I proposed "dead" yet. We decided that creating one
context per group is essentially a bug, so we don't need to optimize for
that case.

Your approach may be better, though.

Thank you for reviewing. I'll update my patches and resubmit for the
next CF.

Regards,
    Jeff Davis





Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From
Tomas Vondra
Date:
On 22.2.2015 09:14, Jeff Davis wrote:
> 
> On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
>> So I started digging in the code and I noticed this:
>>
>>     hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
>>
>> which is IMHO obviously wrong, because that accounts only for the
>> hashtable itself. It might be correct for aggregates with state passed
>> by value, but it's incorrect for state passed by reference (e.g.
>> Numeric, arrays etc.), because initialize_aggregates does this:
>>
>>     oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
>>     pergroupstate->transValue = datumCopy(peraggstate->initValue,
>>                                         peraggstate->transtypeByVal,
>>                                         peraggstate->transtypeLen);
>>     MemoryContextSwitchTo(oldContext);
>>
>> and it's also wrong for all the user-defined aggregates that have no
>> access to hashcontext at all, and only get a reference to aggcontext using
>> AggCheckCallContext(). array_agg() is a prime example.
> 
> Actually, I believe the first one is right, and the second one is
> wrong. If we allocate pass-by-ref transition states in aggcontext,
> that defeats the purpose of the patch, because we can't release the
> memory when we start a new batch (because we can't reset aggcontext).
> 
> Instead, the transition states should be copied into hashcontext; we 
> should count the memory usage of hashcontext; AggCheckCallContext
> should return hashcontext; and after each batch we should reset
> hashcontext.

Hmmm, maybe. However, I'd suggest not stuffing everything into a single
memory context, but creating a 'transvalue' context next to hashcontext,
and returning that one from AggCheckCallContext(). I.e. something like
this:

   aggcontext
       hashcontext  - hash table etc.
       transcontext - new context for transition values

It's better to keep things separated, I guess.
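
A rough sketch of how those contexts might be created (illustrative only;
the transcontext field and the context names are made up):

    #include "postgres.h"
    #include "nodes/execnodes.h"    /* AggState */
    #include "utils/memutils.h"     /* AllocSetContextCreate, ALLOCSET_* */

    static void
    create_hashagg_contexts_sketch(AggState *aggstate)
    {
        /* hash table and per-group entries */
        aggstate->hashcontext =
            AllocSetContextCreate(aggstate->aggcontext,
                                  "HashAgg hash table",
                                  ALLOCSET_DEFAULT_MINSIZE,
                                  ALLOCSET_DEFAULT_INITSIZE,
                                  ALLOCSET_DEFAULT_MAXSIZE);

        /* transition values; AggCheckCallContext() would hand this back */
        aggstate->transcontext =
            AllocSetContextCreate(aggstate->aggcontext,
                                  "HashAgg transition values",
                                  ALLOCSET_DEFAULT_MINSIZE,
                                  ALLOCSET_DEFAULT_INITSIZE,
                                  ALLOCSET_DEFAULT_MAXSIZE);
    }

Both children can then be reset independently between batches, and the
accounting would presumably sum the two (or just walk aggcontext
recursively).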

> It might be worth refactoring a bit to call it the "groupcontext" or 
> something instead, and use it for both HashAgg and GroupAgg. That
> would reduce the need for conditionals.

Not sure that's worth it. The patch already seems complex enough.

> [ ... problems with array_agg subcontexts ... ]
> 
>> and it runs indefinitely (I gave up after a few minutes). I believe this
>> renders the proposed memory accounting approach dead.
> 
> ...
> 
>> The array_agg() patch I submitted to this CF would fix this
>> particular query, as it removes the child contexts (so there's no
>> need for recursion in MemoryContextMemAllocated), but it does
>> nothing to the user-defined aggregates out there. And it's not
>> committed yet.
> 
> Now that your patch is committed, I don't think I'd call the memory 
> accounting patch I proposed "dead" yet. We decided that creating one 
> context per group is essentially a bug, so we don't need to optimize
> for that case.

Yes, but the question is how many other aggregates designed like
array_agg are out there (in extensions etc.), and whether we actually
care about them, because those will be equally affected and we can't
fix them.

But maybe we don't really care, and it's the authors' responsibility to
test their aggregates on a new major version and fix any performance
issues, just like we did with array_agg().

> 
> Your approach may be better, though.
>

I'm not sure - it really boils down to the goals and assumptions.

If we restrict ourselves to a HashAggregate-only solution, under the
assumption that the number of child contexts is small (and leave the fix
up to extension authors), then your approach is probably simpler and
better than the one I proposed.

If we're aiming for a general-purpose solution that might be used
elsewhere, then I'm afraid we can't make the 'small number of child
contexts' assumption and the simple solution may not be the right one.

Although I believe that we shouldn't be creating large numbers of
'parallel' memory contexts anyway.

> Thank you for reviewing. I'll update my patches and resubmit for the
>  next CF.

Thanks for working on this.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services