Thread: 9.5: Better memory accounting, towards memory-bounded HashAgg
Attached is a patch that explicitly tracks allocated memory (the blocks, not the chunks) for each memory context, as well as its children. This is a prerequisite for memory-bounded HashAgg, which I intend to submit for the next CF. Hashjoin tracks the tuple sizes that it adds to the hash table, which is a good estimate for Hashjoin. But I don't think it's as easy for Hashagg, for which we need to track transition values, etc. (also, for HashAgg, I expect that the overhead will be more significant than for Hashjoin). If we track the space used by the memory contexts directly, it's easier and more accurate. I did some simple pgbench select-only tests, and I didn't see any TPS difference. Regards, Jeff Davis
Attachment
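The block-level accounting Jeff describes can be sketched roughly as follows. This is an illustrative toy, not the patch itself: all names here are invented, and the real code hooks into aset.c's block alloc/free paths. The key idea is that each context counts the backing blocks (not chunks) it owns, and folds every delta into its ancestors so that a context's counter covers its children too.

```c
#include <assert.h>
#include <stddef.h>

/* Invented, simplified stand-in for MemoryContextData. */
typedef struct Ctx
{
    struct Ctx *parent;
    size_t      total_allocated;    /* this context plus children */
} Ctx;

/* Called whenever a backing block is allocated (delta > 0) or freed
 * (delta < 0); walks the parent chain so parents see child usage. */
static void
update_allocation(Ctx *ctx, long delta)
{
    for (; ctx != NULL; ctx = ctx->parent)
        ctx->total_allocated += delta;
}
```

Under this scheme, checking "am I over work_mem?" is a single field read on the context of interest, which is what makes it attractive for a memory-bounded HashAgg.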
On Sat, Aug 2, 2014 at 1:40 PM, Jeff Davis <pgsql@j-davis.com> wrote: > This is a prerequisite for memory-bounded HashAgg, which I intend to > submit for the next CF. FWIW, I think that's a good project. A large number of these TPC-H queries used HashAggs when I checked, on a moderate-sized sample TPC-H database: http://examples.citusdata.com/tpch_queries.html (I found these queries at random from Googling, but happened to have a ~2GB TPC-H database on my laptop). I attach EXPLAIN ANALYZE output for each, as shown on master. From this admittedly unscientific random sampling, 5 out of 8 query plans have a hash aggregate node. TPC-H is a benchmark that Postgres does not tend to do too well on [1], and I suspect that this has something to do with it; lower work_mem settings will spook the optimizer into using a group aggregate within choose_hashed_grouping(). Of course, in order to get the benefit of your patch, that will need to be adjusted. I think that part is surprisingly straightforward, though. [1] https://wiki.postgresql.org/wiki/TPC-H -- Peter Geoghegan
Attachment
On Sat, Aug 2, 2014 at 4:40 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Attached is a patch that explicitly tracks allocated memory (the blocks, > not the chunks) for each memory context, as well as its children. > > This is a prerequisite for memory-bounded HashAgg, which I intend to > submit for the next CF. Hashjoin tracks the tuple sizes that it adds to > the hash table, which is a good estimate for Hashjoin. But I don't think > it's as easy for Hashagg, for which we need to track transition values, > etc. (also, for HashAgg, I expect that the overhead will be more > significant than for Hashjoin). If we track the space used by the memory > contexts directly, it's easier and more accurate. > > I did some simple pgbench select-only tests, and I didn't see any TPS > difference. I was curious whether a performance difference would show up when sorting, so I tried it out. I set up a test with pgbench -i 300. I then repeatedly restarted the database, and after each restart, did this: time psql -c 'set trace_sort=on; reindex index pgbench_accounts_pkey;' I alternated runs between master and master with this patch, and got the following results: master: LOG: internal sort ended, 1723933 KB used: CPU 2.58s/11.54u sec elapsed 16.88 sec LOG: internal sort ended, 1723933 KB used: CPU 2.50s/12.37u sec elapsed 17.60 sec LOG: internal sort ended, 1723933 KB used: CPU 2.14s/11.28u sec elapsed 16.11 sec memory-accounting: LOG: internal sort ended, 1723933 KB used: CPU 2.57s/11.97u sec elapsed 17.39 sec LOG: internal sort ended, 1723933 KB used: CPU 2.30s/12.57u sec elapsed 17.68 sec LOG: internal sort ended, 1723933 KB used: CPU 2.54s/11.99u sec elapsed 17.25 sec Comparing the median times, that's about a 3% regression. For this particular case, we might be able to recapture that by replacing the bespoke memory-tracking logic in tuplesort.c with use of this new facility. 
I'm not sure whether there are other cases that we might also want to test; I think stuff that runs all on the server side is likely to show up problems more clearly than pgbench. Maybe a PL/pgsql loop that does something allocation-intensive on each iteration, for example, like parsing a big JSON document. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2.8.2014 22:40, Jeff Davis wrote: > Attached is a patch that explicitly tracks allocated memory (the blocks, > not the chunks) for each memory context, as well as its children. > > This is a prerequisite for memory-bounded HashAgg, which I intend to Anyway, I'm really looking forward to the memory-bounded hashagg, and I'm willing to spend some time testing it. > submit for the next CF. Hashjoin tracks the tuple sizes that it adds to > the hash table, which is a good estimate for Hashjoin. But I don't think Actually, even for HashJoin the estimate is pretty bad, and varies a lot depending on the tuple width. With "narrow" tuples (e.g. ~40B of data), which is actually a quite common case, you easily get ~100% palloc overhead. I managed to address that by using a custom allocator. See this: https://commitfest.postgresql.org/action/patch_view?id=1503 I wonder whether something like that would be possible for the hashagg? That would make the memory accounting accurate with 0% overhead (because it's not messing with the memory context at all), but only for the one node (but maybe that's OK?). > it's as easy for Hashagg, for which we need to track transition values, > etc. (also, for HashAgg, I expect that the overhead will be more > significant than for Hashjoin). If we track the space used by the memory > contexts directly, it's easier and more accurate. I don't think that's comparable - I can easily think about cases leading to extreme palloc overhead with HashAgg (think of an aggregate implementing COUNT DISTINCT - that effectively needs to store all the values, and with short values the palloc overhead will be terrible). Actually, I was looking at HashAgg (it's a somewhat natural direction after messing with Hash Join), and my plan was to use a similar dense allocation approach. The trickiest part would probably be making this available from the custom aggregates. regards Tomas
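The dense-allocation idea Tomas mentions can be sketched as follows (invented names and a chunk size chosen for illustration; error handling omitted, and requests are assumed to be much smaller than a chunk). Items are carved sequentially out of large chunks, so there are no per-item chunk headers or freelists, and the memory consumed is known exactly as the number of chunks times the chunk size:

```c
#include <stdlib.h>

#define CHUNK_SIZE 32768        /* assumed; the real hashjoin patch may differ */

typedef struct DenseChunk
{
    struct DenseChunk *next;
    size_t      used;           /* bytes handed out from data[] so far */
    char        data[CHUNK_SIZE];
} DenseChunk;

typedef struct DenseAlloc
{
    DenseChunk *head;           /* chunk currently being filled */
    size_t      total_allocated;    /* exact: counts chunks, not items */
} DenseAlloc;

/* Hand out the next `size` bytes, starting a new chunk when needed. */
static void *
dense_alloc(DenseAlloc *da, size_t size)
{
    void       *result;

    size = (size + 7) & ~(size_t) 7;    /* keep 8-byte alignment */
    if (da->head == NULL || da->head->used + size > CHUNK_SIZE)
    {
        DenseChunk *chunk = malloc(sizeof(DenseChunk));

        chunk->next = da->head;
        chunk->used = 0;
        da->head = chunk;
        da->total_allocated += sizeof(DenseChunk);
    }
    result = da->head->data + da->head->used;
    da->head->used += size;
    return result;
}
```

The trade-off is that individual items can never be pfree'd, only the whole chunk list at once, which is acceptable for a hash table that is built and then thrown away.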
On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote: > Comparing the median times, that's about a 3% regression. For this > particular case, we might be able to recapture that by replacing the > bespoke memory-tracking logic in tuplesort.c with use of this new > facility. I'm not sure whether there are other cases that we might > also want to test; I think stuff that runs all on the server side is > likely to show up problems more clearly than pgbench. Maybe a > PL/pgsql loop that does something allocation-intensive on each > iteration, for example, like parsing a big JSON document. I wasn't able to reproduce your results on my machine. At -s 300, with maintenance_work_mem set high enough to do internal sort, it took about 40s and I heard some disk activity, so I didn't think it was a valid result. I went down to -s 150, and it took around 5.3s on both master and memory-accounting. Either way, it's better to be conservative. Attached is a version of the patch with opt-in memory usage tracking. Child contexts inherit the setting from their parent. Regards, Jeff Davis
Attachment
On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote: > Either way, it's better to be conservative. Attached is a version of the > patch with opt-in memory usage tracking. Child contexts inherit the > setting from their parent. There was a problem with the previous patch: the parent was unlinked before the delete_context method was called, which meant that the parent's memory was not being decremented. Attached is a fix. It would be simpler to just reorder the operations in MemoryContextDelete, but there is a comment warning against that. So, I pass the old parent as an argument to the delete_context method. Regards, Jeff Davis
Attachment
On Fri, Aug 8, 2014 at 4:16 AM, Jeff Davis <pgsql@j-davis.com> wrote: > I wasn't able to reproduce your results on my machine. At -s 300, with > maintenance_work_mem set high enough to do internal sort, it took about > 40s and I heard some disk activity, so I didn't think it was a valid > result. I went down to -s 150, and it took around 5.3s on both master > and memory-accounting. > > Either way, it's better to be conservative. Attached is a version of the > patch with opt-in memory usage tracking. Child contexts inherit the > setting from their parent. I repeated my tests with your v3 patch. Here are the new results: master, as of commit a4287a689d10bd4863e3dfbf9c4f232feeca0cdd: 2014-08-14 16:45:24 UTC [2940] LOG: internal sort ended, 1723933 KB used: CPU 2.44s/11.52u sec elapsed 16.68 sec 2014-08-14 16:46:34 UTC [2960] LOG: internal sort ended, 1723933 KB used: CPU 2.63s/11.65u sec elapsed 16.94 sec 2014-08-14 16:47:26 UTC [2979] LOG: internal sort ended, 1723933 KB used: CPU 2.63s/11.48u sec elapsed 16.85 sec memory-accounting-v3, on top of the aforementioned master commit: 2014-08-14 16:46:05 UTC [2950] LOG: internal sort ended, 1723933 KB used: CPU 2.52s/12.16u sec elapsed 17.36 sec 2014-08-14 16:47:00 UTC [2969] LOG: internal sort ended, 1723933 KB used: CPU 2.52s/11.90u sec elapsed 17.11 sec 2014-08-14 16:47:51 UTC [2988] LOG: internal sort ended, 1723933 KB used: CPU 2.52s/11.98u sec elapsed 17.31 sec It appears to me that the performance characteristics for this version are not significantly different from version 1. I have not looked at the code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10.8.2014 22:50, Jeff Davis wrote: > On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote: >> Either way, it's better to be conservative. Attached is a version >> of the patch with opt-in memory usage tracking. Child contexts >> inherit the setting from their parent. > > There was a problem with the previous patch: the parent was unlinked > before the delete_context method was called, which meant that the > parent's memory was not being decremented. > > Attached is a fix. It would be simpler to just reorder the operations > in MemoryContextDelete, but there is a comment warning against that. > So, I pass the old parent as an argument to the delete_context > method. Hi, I did a quick check of the patch, mostly because I wasn't sure whether it allows accounting for a selected subtree only (e.g. aggcontext and its children). And yes, it does exactly that, which is great. Also, the code seems fine to me, except maybe for this piece in AllocSetDelete: /* * Parent is already unlinked from context, so can't use * update_allocation(). */ while (parent != NULL) { parent->total_allocated -= context->total_allocated; parent = parent->parent; } I believe this should check parent->track_mem, just like update_allocation, because this way it walks all the parent contexts up to TopMemoryContext. It does not really break anything (because parents without enabled tracking don't report allocated space), but for plans creating/destroying a lot of contexts (say, GroupAggregate with a lot of groups) this might unnecessarily add overhead. regards Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > I believe this should check parent->track_mem, just like > update_allocation, because this way it walks all the parent context up > to the TopMemoryContext. TBH, I don't think this "opt-in tracking" business has been thought through even a little bit; for example, what happens if there is a declared-to-not-track context in between levels that are declared to track? And it certainly has not been proven that the extra complexity actually saves cycles rather than the reverse. regards, tom lane
On 16.8.2014 20:00, Tom Lane wrote: > Tomas Vondra <tv@fuzzy.cz> writes: >> I believe this should check parent->track_mem, just like >> update_allocation, because this way it walks all the parent context up >> to the TopMemoryContext. > > TBH, I don't think this "opt-in tracking" business has been thought > through even a little bit; for example, what happens if there is a > declared-to-not-track context in between levels that are declared to > track? And it certainly has not been proven that the extra complexity > actually saves cycles rather than the reverse. AFAIK there's no other built-in accounting mechanism at the moment. That's one of the reasons why some of the nodes had to invent their own custom accounting - hashjoin is an example of that. We do have MemoryContextStats, which essentially walks the memory context hierarchy and collects all the important information. While working on the hashjoin patch, I was thinking that maybe we could create a MemoryContextStats clone, computing the accounting info on request. So I did an experiment, writing this new MemoryContextStats clone and rewriting nodeHash.c to use it instead of custom tracking. The impact was consistently ~2-3% for the hashjoin queries I measured (e.g. 9100 ms instead of 8900 ms etc.). I just tried the same with the hashjoin patch applied. With the built-in accounting the queries now run ~2.5x faster - around 3500 ms. With the MemoryContextStats clone, it takes a whopping 18000 ms. I don't know what causes such an increase (I assume I use the accounting info within a loop or something). The point is that with Jeff's patch, it goes down to 3500 ms, with pretty much zero overhead. I'll check why the patched hashjoin is so much slower, but even if I can optimize it somehow, I don't think I'll get it to 0 overhead. So I'd say this accounting actually saves the cycles. Of course, that does not prove it's a well thought-out design, although it seems sensible to me.
It's the least complex accounting implementation I can think of. For example, I don't think being able to create nonsensical hierarchies is a major flaw. We certainly should not do that in our code, and we should take reasonable precautions to prevent this in user code (as for example inheriting the tracking flag in MemoryContextCreate, as Jeff does). Can this be seen as a foot gun? Maybe, but we're handing ammunition to users on a daily basis - for example they can switch to TopMemoryContext, but that does not make memory contexts bad. But maybe the inheritance really is not necessary - maybe it would be enough to track this per-context, and then just walk through the contexts and collect it. Because my observation is that most of the time is actually spent in walking through blocks and freelists. Then the issues with nonsensical hierarchies would disappear, and also the possible overhead of updating all the parent hierarchies would go away (not sure if it's even measurable, though). regards Tomas
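The per-context variant Tomas floats here -- each context counts only its own blocks, and callers sum over a subtree on demand -- could look roughly like this (invented names, not code from any of the patches). Allocation stays cheap because no parent walk happens; the cost moves to inspection, which becomes proportional to the number of contexts in the subtree:

```c
#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *firstchild;
    struct Ctx *nextchild;      /* sibling link, as in MemoryContextData */
    size_t      self_allocated; /* blocks owned by this context only */
} Ctx;

/* Cheap on allocation: bump one counter, no walk up the hierarchy. */
static void
note_block(Ctx *ctx, long delta)
{
    ctx->self_allocated += delta;
}

/* Potentially expensive on inspection: walk the whole subtree. */
static size_t
get_allocated(Ctx *ctx)
{
    size_t      total = ctx->self_allocated;
    Ctx        *child;

    for (child = ctx->firstchild; child != NULL; child = child->nextchild)
        total += get_allocated(child);
    return total;
}
```

This is exactly the shape that later draws the array_agg objection: with 100k per-group child contexts, get_allocated() visits 100k nodes on every check.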
On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote: > It appears to me that the performance characteristics for this version > are not significantly different from version 1. I have not looked at > the code. While trying to reproduce your results, I noticed what might be around a 1% regression just from adding the 3 fields to MemoryContextData. If I cut it down to adding just one field, the regression disappears. The results are fairly noisy, so I could be chasing the wrong thing. But one reason to believe it is that I pushed the size of MemoryContextData above 64 bytes, which sounds like it might be an important threshold. Regards, Jeff Davis
On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote: > But maybe the inheritance really is not necessary - maybe it would be > enough to track this per-context, and then just walk through the > contexts and collect this. Because my observation is that most of the > time is actually spent in walking through blocks and freelists. That makes a lot of sense to me. Another approach is to pass a flag to hash_create that tells it not to create a subcontext. Then we don't need to traverse anything; we already know which context we want to look at. Perhaps I was being too clever with the idea of tracking space for an entire hierarchy. Also, as I pointed out in my reply to Robert, adding too many fields to MemoryContextData may be the cause of the regression. Your idea requires only one field, which doesn't show the same regression in my tests. Regards, Jeff Davis
On 19 Srpen 2014, 10:26, Jeff Davis wrote: > On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote: >> But maybe the inheritance really is not necessary - maybe it would be >> enough to track this per-context, and then just walk through the >> contexts and collect this. Because my observation is that most of the >> time is actually spent in walking through blocks and freelists. > > That makes a lot of sense to me. > > Another approach is to pass a flag to hash_create that tells it not to > create a subcontext. Then we don't need to traverse anything; we already > know which context we want to look at. Perhaps I was being too clever > with the idea of tracking space for an entire hierarchy. > > Also, as I pointed out in my reply to Robert, adding too many fields to > MemoryContextData may be the cause of the regression. Your idea requires > only one field, which doesn't show the same regression in my tests. Yeah, keeping the structure size below 64B seems like a good idea. The use-case for this is tracking a chosen subtree of contexts - e.g. aggcontext and below, so I'd expect the tracked subtrees to be relatively shallow. Am I right? My fear is that by removing the inheritance bit, we'll hurt cases with a lot of child contexts. For example, array_agg currently creates a separate context for each group - what happens if you have 100k groups and do MemoryContextGetAllocated? I guess iterating over 100k groups is not free. Wouldn't the solution with inheritance and propagating the accounting info to the parent actually be better? Or maybe allowing both, having two flags when creating a context instead of one? AllocSetCreateTracked( ...., bool track, bool propagate_immediately) By squashing both flags into a single mask you wouldn't increase the size. Also, do we really need to track allocated bytes - couldn't we track kilobytes or something and use smaller data types to get below the 64B? regards Tomas
On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote: > The use-case for this is tracking a chosen subtree of contexts - e.g. > aggcontext and below, so I'd expect the tracked subtrees to be relatively > shallow. Am I right? Right. > My fear is that by removing the inheritance bit, we'll hurt cases with a > lot of child contexts. For example, array_agg currently creates a separate > context for each group - what happens if you have 100k groups and do > MemoryContextGetAllocated? I guess iterating over 100k groups is not free. Good point. We don't want to make checking the allocated space into an expensive operation, because then we will be forced to call it less frequently, which implies that we'd be sloppier about actually fitting in work_mem. > Wouldn't the solution with inheritance and propagating the accounting info > to the parent actually better? Or maybe allowing both, having two flags > when creating a context instead of one? That seems overly complicated. We only have one driving use case, so two orthogonal options sounds excessive. Perhaps one option if we can't solve the performance problem and we need to isolate the changes to hashagg. > Also, do we really need to track allocated bytes - couldn't we track > kilobytes or something and use smaller data types to get below the 64B? Good idea. I attached a patch that uses two uint32 fields so that it doesn't increase the size of MemoryContextData, and it tracks memory usage for all contexts. I was unable to detect any performance regression vs. master, but on my machine the results are noisy. It would be easier to resolve the performance concern if I could reliably get the results Robert is getting. I think I was able to reproduce the regression with the old patch, but the results were still noisy. Regards, Jeff Davis
Attachment
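The kilobyte-granularity idea can be illustrated with a hypothetical helper (this is not code from the v4 patch). A uint32 counting kilobytes covers about 4TB of tracked memory, but rounding each block up to whole kilobytes means the counter can drift from the true byte total whenever block sizes aren't multiples of 1kB:

```c
#include <stdint.h>
#include <stddef.h>

/* Round a block size up to whole kilobytes so it fits a uint32 counter. */
static uint32_t
size_in_kb(size_t bytes)
{
    return (uint32_t) ((bytes + 1023) / 1024);
}
```

Note that accounting per block is not the same as accounting the sum: two 1500-byte blocks count as 4kB, while one 3000-byte block counts as 3kB, which is the kind of inaccuracy an odd initBlockSize would introduce.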
On Wed, Aug 20, 2014 at 2:11 AM, Jeff Davis <pgsql@j-davis.com> wrote: > I attached a patch that uses two uint32 fields so that it doesn't > increase the size of MemoryContextData, and it tracks memory usage for > all contexts. I was unable to detect any performance regression vs. > master, but on my machine the results are noisy. Still doesn't look good here. On the same PPC64 machine I've been using for testing: master: 2014-08-22 16:26:42 UTC [13153] LOG: internal sort ended, 1723933 KB used: CPU 2.19s/11.41u sec elapsed 16.40 sec 2014-08-22 16:28:18 UTC [13173] LOG: internal sort ended, 1723933 KB used: CPU 2.56s/11.37u sec elapsed 16.64 sec 2014-08-22 16:29:37 UTC [13194] LOG: internal sort ended, 1723933 KB used: CPU 2.57s/11.48u sec elapsed 16.75 sec memory-accounting-v4: 2014-08-22 16:27:35 UTC [13163] LOG: internal sort ended, 1723933 KB used: CPU 2.72s/11.91u sec elapsed 17.43 sec 2014-08-22 16:29:10 UTC [13185] LOG: internal sort ended, 1723933 KB used: CPU 2.67s/12.03u sec elapsed 17.40 sec 2014-08-22 16:30:01 UTC [13203] LOG: internal sort ended, 1723933 KB used: CPU 2.65s/11.97u sec elapsed 17.45 sec That's a 4.7% regression on the median results, worse than before. > It would be easier to resolve the performance concern if I could > reliably get the results Robert is getting. I think I was able to > reproduce the regression with the old patch, but the results were still > noisy. If you send me an SSH key by private email I can set you an account up on this machine, if that's helpful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20.8.2014 08:11, Jeff Davis wrote: > On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote: >> The use-case for this is tracking a chosen subtree of contexts - e.g. >> aggcontext and below, so I'd expect the tracked subtrees to be relatively >> shallow. Am I right? > > Right. > >> My fear is that by removing the inheritance bit, we'll hurt cases with a >> lot of child contexts. For example, array_agg currently creates a separate >> context for each group - what happens if you have 100k groups and do >> MemoryContextGetAllocated? I guess iterating over 100k groups is not free. > > Good point. We don't want to make checking the allocated space into an > expensive operation, because then we will be forced to call it less > frequently, which implies that we'd be sloppier about actually fitting > in work_mem. > >> Wouldn't the solution with inheritance and propagating the accounting info >> to the parent actually better? Or maybe allowing both, having two flags >> when creating a context instead of one? > > That seems overly complicated. We only have one driving use case, so two > orthogonal options sounds excessive. Perhaps one option if we can't > solve the performance problem and we need to isolate the changes to > hashagg. > >> Also, do we really need to track allocated bytes - couldn't we track >> kilobytes or something and use smaller data types to get below the 64B? > > Good idea. > > I attached a patch that uses two uint32 fields so that it doesn't > increase the size of MemoryContextData, and it tracks memory usage for > all contexts. I was unable to detect any performance regression vs. > master, but on my machine the results are noisy. I don't think we really need to abandon the 'tracked' flag (or that we should). I think it was useful, and removing it might be one of the reasons why Robert now sees worse impact than before. I assume you did that to get below 64B, right? What about changing 'isReset' in MemoryContextData to a 'flags' field instead? 
It's a bool now, i.e. 1B -> 8 bits. I don't think it makes sense to abandon this only to get <= 64B, if it forces you to touch multiple 64B structures unnecessarily. Also, you're doing this for every block in AllocSetReset: update_allocation(context, -(block->endptr - ((char*) block))); So if there are 1000 blocks, you'll call that 1000x (and it will propagate up to TopMemoryContext). Instead keep a local sum and only do the call once at the end. Again, this might be one of the reasons why Robert sees ~4% overhead. BTW what happens when AllocSetContext is created with initBlockSize that is not a multiple of 1kB? I see it's enforced to be at least 1024, but I don't see anything forcing it to be a multiple of 1024? What if I create a context with initBlockSize=1500? ISTM it'll break the accounting, right? (Not that it would make sense to create such contexts, but it's possible.) Tomas
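The AllocSetReset point can be made concrete with a toy sketch (invented names, simplified structures): accumulate the freed block sizes locally and make one accounting call at the end, instead of walking the parent chain once per block:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct Block
{
    struct Block *next;
    size_t      size;
} Block;

typedef struct Ctx
{
    struct Ctx *parent;
    size_t      total_allocated;
    Block      *blocks;
} Ctx;

static void
update_allocation(Ctx *ctx, long delta)
{
    for (; ctx != NULL; ctx = ctx->parent)
        ctx->total_allocated += delta;
}

/* Free every block, but walk the parent chain only once. */
static void
reset_context(Ctx *ctx)
{
    size_t      freed = 0;

    while (ctx->blocks != NULL)
    {
        Block      *next = ctx->blocks->next;

        freed += ctx->blocks->size;
        free(ctx->blocks);
        ctx->blocks = next;
    }
    update_allocation(ctx, -(long) freed);  /* one call, not one per block */
}
```

With 1000 blocks and a deep hierarchy, this turns 1000 parent-chain walks into a single one, which is exactly the overhead being flagged above.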
On Fri, Aug 22, 2014 at 2:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > I don't think we really need to abandon the 'tracked' flag (or that we > should). I think it was useful, and removing it might be one of the > reasons why Robert now sees worse impact than before. The version that introduced that flag had the same overhead as the previous version that didn't, at least in my testing, so I'm not sure it's at all useful. Part of the problem here is that the number of instructions that you can afford to take to complete a memory allocation is just really small, and so every one hurts. Memory latency may be an issue as well, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20.8.2014 08:11, Jeff Davis wrote: > On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote: > > It would be easier to resolve the performance concern if I could > reliably get the results Robert is getting. I think I was able to > reproduce the regression with the old patch, but the results were still > noisy. I created a small extension for this purpose, it's available here: https://github.com/tvondra/palloc_bench In short, it creates a chain of memory contexts, and then repeats palloc/pfree a given number of times (with a chosen request size). It either calls AllocSetContextCreate or AllocSetContextCreateTracked, depending on whether there's a #define TRACKING_FLAG - so either leave it there or comment it out. The time is printed as a warning. I did a number of tests to get an idea of the overhead, using this call: select palloc_bench(10,100000000,32768); which means 10 memory contexts, 100000000 palloc/free cycles with 32kB requests. And I got these results: master ------- 3246.03 ms 3125.24 ms 3247.86 ms 3251.85 ms 3113.03 ms 3140.35 ms v3 patch (AllocSetContextCreate => no tracking) 3303.64 ms 3278.57 ms 3295.11 ms 3325.63 ms 3329.84 ms 3439.27 ms -- v3 patch (AllocSetContextCreateTracked => tracking) 6296.43 ms 5228.83 ms 5271.43 ms 5158.60 ms 5404.57 ms 5240.40 ms -- v4 (tracking all the time) 6728.84 ms 6478.88 ms 6478.82 ms 6473.57 ms 6554.96 ms 6528.66 ms I think this makes the overhead clearly visible. It's also worth mentioning that this does nothing else except for palloc/free, which is not really what a real workload does. And 100000000 palloc/free cycles of 32kB blocks means ~3TB of RAM, unless my math is broken. So looking at the numbers and saying "7 seconds >> 3 seconds, so all is lost" is not really appropriate IMHO. Anyway, ISTM that v4 is actually a bit more expensive than v3 for some reason. I'm not entirely sure why, but I suspect it's because of updating the few additional memory contexts up to TopMemoryContext. That's something v3 didn't do.
I tried to hack a bit on the idea of using a single byte for the flags (isReset and track_mem) - patch attached. That got me pretty much to v3 performance (or maybe slightly better): -- v4 + flag (tracking all the time) 5222.38 ms 4958.37 ms 5072.21 ms 5100.43 ms 5059.65 ms 4995.52 ms But nothing that'd magically save the day ... and with disabled tracking we get pretty much to v3 numbers (with track_mem=false). So this gymnastics gave us pretty much nothing ... But I have realized that maybe the problem is that we're handling memory contexts and accounting as 1:1. But that's not really the case - most of the contexts are not really interested in this. They don't use accounting now, and that won't change. So only a small fraction of memory contexts will ask for accounting. Yet all the contexts are forced to pass accounting info from their children to their parent, if there happens to be a context with track_mem=true somewhere above them. And that's exactly the problem, because most of the time is spent in update_accounting, in the loop over parents. So my proposal is to separate those two things into two hierarchies, that are somehow parallel, but not exactly. That means: (1) creating a structure with the accounting info typedef struct MemoryAccountingData { Size total_allocated; Size self_allocated; struct MemoryAccountingData * parent; } MemoryAccountingData; (2) adding a pointer to MemoryAccountingData to MemoryContextData, and a 'track_mem' flag for contexts that requested tracking typedef struct MemoryContextData { ... MemoryAccounting accounting; ... 
} MemoryContextData; (3) when a context does not request accounting, it just uses the accounting pointer from the parent context, and track_mem=false (4) when the context requests accounting, it allocates its own accounting structure, sets accounting->parent to the accounting from the parent, and sets track_mem=true Now all the contexts have a direct pointer to the accounting of the nearest parent context that explicitly requested accounting, and don't need to walk through all the parents. Contexts that did not request tracking have track_mem=false, and their accounting points to the parent with explicit accounting, or is NULL if there's no such parent. For these contexts, GetAllocated always returns 0, but that's OK because they haven't requested accounting anyway. Contexts that requested tracking have track_mem=true, and have their own specific accounting instance. The accounting->parent plays pretty much the same role as 'accounting' with 'track_mem=false' (see previous paragraph). These contexts return GetAllocated properly. Now, I did a quick test with palloc_bench - 1 context with tracking enabled, and a chain of 10 contexts without tracking (but updating the accounting for the first context). And I see this - with tracking enabled: 3235.57 ms 3240.09 ms 3225.47 ms 3306.95 ms 3249.14 ms 3225.56 ms and with tracking disabled: 3193.43 ms 3169.57 ms 3156.48 ms 3147.12 ms 3142.25 ms 3161.91 ms 3149.97 ms Which is quite good, IMHO. Disabled is pretty much exactly as master (i.e. no accounting at all), enabled is about equal to v3 with tracking disabled. But maybe I did something stupid in those patches, it's 3AM here ... Patch attached, consider it an early alpha version. regards Tomas
Attachment
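A compilable toy version of the two-hierarchy scheme described above might look as follows (MemoryAccountingData and track_mem follow the mail; everything else is invented for illustration). Untracked contexts borrow the nearest tracked ancestor's accounting struct, so an allocation only walks the short chain of tracked ancestors, never the full context hierarchy:

```c
#include <stddef.h>

typedef struct MemoryAccountingData
{
    size_t      total_allocated;
    struct MemoryAccountingData *parent;    /* nearest tracked ancestor */
} MemoryAccountingData;

typedef struct Ctx
{
    struct Ctx *parent;
    MemoryAccountingData *accounting;   /* own, borrowed, or NULL */
    int         track_mem;
    MemoryAccountingData own_accounting;    /* used only if track_mem */
} Ctx;

static void
ctx_init(Ctx *ctx, Ctx *parent, int track_mem)
{
    ctx->parent = parent;
    ctx->track_mem = track_mem;
    if (track_mem)
    {
        ctx->own_accounting.total_allocated = 0;
        ctx->own_accounting.parent = parent ? parent->accounting : NULL;
        ctx->accounting = &ctx->own_accounting;
    }
    else
        ctx->accounting = parent ? parent->accounting : NULL;
}

/* Walks only the chain of *tracked* accountings, not all contexts. */
static void
account_block(Ctx *ctx, long delta)
{
    MemoryAccountingData *acct;

    for (acct = ctx->accounting; acct != NULL; acct = acct->parent)
        acct->total_allocated += delta;
}

/* Untracked contexts report 0, as described in the mail. */
static size_t
get_allocated(Ctx *ctx)
{
    return ctx->track_mem ? ctx->accounting->total_allocated : 0;
}
```

With one tracked aggcontext and a chain of ten untracked children, account_block() touches a single counter, which is consistent with the near-master numbers reported above.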
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote: > Still doesn't look good here. On the same PPC64 machine I've been > using for testing: I have a new approach to the patch which is to call a callback at each block allocation and child contexts inherit the callback from their parents. The callback could be defined to simply dereference and increment its argument, which would mean block allocations anywhere in the hierarchy are incrementing the same variable, avoiding the need to traverse the hierarchy. I've made some progress I think (thanks to both Robert and Tomas), but I have a bit more microoptimization and testing to do. I'll mark it "returned with feedback" for now, though if I find the time I'll do more testing to see if the performance concerns are fully addressed. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > I have a new approach to the patch which is to call a callback at each > block allocation and child contexts inherit the callback from their > parents. The callback could be defined to simply dereference and > increment its argument, which would mean block allocations anywhere in > the hierarchy are incrementing the same variable, avoiding the need to > traverse the hierarchy. Cute, but it's impossible to believe that a function call isn't going to be *even more* overhead than what you've got now. This could only measure out as a win when the feature is going unused. What about removing the callback per se and just keeping the argument, as it were. That is, a MemoryContext has a field of type "size_t *" that is normally NULL, but if set to non-NULL, then we increment the pointed-to value for pallocs and decrement for pfrees. One thought in either case is that we don't have to touch the API for MemoryContextCreate: rather, the feature can be enabled in a separate call after making the context. regards, tom lane
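Tom's variant reduces the hot path to a NULL test and one addition. A minimal sketch under that design (invented hook names; the real integration point would be aset.c's block allocation and free paths):

```c
#include <stddef.h>

typedef struct Ctx
{
    size_t     *mem_allocated;  /* NULL means no accounting */
} Ctx;

/* Called when a backing block is obtained from malloc. */
static void
block_alloc_hook(Ctx *ctx, size_t blksize)
{
    if (ctx->mem_allocated != NULL)
        *ctx->mem_allocated += blksize;
}

/* Called when a backing block is returned to malloc. */
static void
block_free_hook(Ctx *ctx, size_t blksize)
{
    if (ctx->mem_allocated != NULL)
        *ctx->mem_allocated -= blksize;
}
```

If child contexts inherit the same pointer at creation time, blocks allocated anywhere in the subtree increment one shared counter, with no hierarchy walk and no function-call overhead.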
On 29.8.2014 16:12, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> I have a new approach to the patch which is to call a callback at each >> block allocation and child contexts inherit the callback from their >> parents. The callback could be defined to simply dereference and >> increment its argument, which would mean block allocations anywhere in >> the hierarchy are incrementing the same variable, avoiding the need to >> traverse the hierarchy. > > Cute, but it's impossible to believe that a function call isn't going > to be *even more* overhead than what you've got now. This could only > measure out as a win when the feature is going unused. > > What about removing the callback per se and just keeping the argument, > as it were. That is, a MemoryContext has a field of type "size_t *" > that is normally NULL, but if set to non-NULL, then we increment the > pointed-to value for pallocs and decrement for pfrees. How is that going to work with two memory contexts (parent/child), both requesting accounting? If I understand the proposal correctly, the child will either inherit the pointer to the field in the parent (and then won't get accounting for its own memory), or will get a private field (and thus won't update the accounting of the parent context). Or am I missing something? regards Tomas
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote: > What about removing the callback per se and just keeping the argument, > as it were. That is, a MemoryContext has a field of type "size_t *" > that is normally NULL, but if set to non-NULL, then we increment the > pointed-to value for pallocs and decrement for pfrees. I like that idea; patch attached. Two differences: * At block level, not chunk level. * I use a uint64, because a size_t is only guaranteed to hold one allocation, and we need to sum up many allocations. When unused, it does still appear to have a little overhead in Robert's test on his power machine. It seems to be between a 0.5-1.0% regression. There's not much extra code on that path, just a branch, pointer dereference, and an addition; so I don't think it will get much cheaper than it is. This regression seems harder to reproduce on my workstation, or perhaps it's just noisier. > One thought in either case is that we don't have to touch the API for > MemoryContextCreate: rather, the feature can be enabled in a separate > call after making the context. That seems fairly awkward to me because the pointer needs to be inherited from the parent context when not specified. I left the extra API call in. The inheritance is awkward anyway, though. If you create a tracked context as a child of an already-tracked context, allocations in the newer one won't count against the original. I don't see a way around that without introducing even more performance problems. If this patch is unacceptable, my only remaining idea is to add the memory only for the current context with no inheritance (thereby eliminating the branch, also). That will require HashAgg to traverse all the child contexts to check whether the memory limit has been exceeded. As Tomas pointed out, that could be a lot of work in the case of array_agg with many groups. Regards, Jeff Davis
Attachment
Hi, On 16.10.2014 03:26, Jeff Davis wrote: > On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote: >> What about removing the callback per se and just keeping the >> argument, as it were. That is, a MemoryContext has a field of type >> "size_t *" that is normally NULL, but if set to non-NULL, then we >> increment the pointed-to value for pallocs and decrement for >> pfrees. > > I like that idea; patch attached. Two differences: > * At block level, not chunk level. > * I use a uint64, because a size_t is only guaranteed to hold one > allocation, and we need to sum up many allocations. > > When unused, it does still appear to have a little overhead in > Robert's test on his power machine. It seems to be between a 0.5-1.0% > regression. There's not much extra code on that path, just a branch, > pointer dereference, and an addition; so I don't think it will get > much cheaper than it is. > > This regression seems harder to reproduce on my workstation, or > perhaps it's just noisier. I assume it's the "reindex pgbench_accounts_pkey" test, correct? I did the same test on my workstation (with i7-4770R CPU), and ISTM there's a measurable difference, although in surprising directions. For binaries compiled with gcc 4.7.3 and clang 3.3 and 3.5, the results look like this (5 runs for each combination): gcc 4.7.3 / master ------------------ LOG: internal sort ended, ... CPU 1.24s/4.32u sec elapsed 9.78 sec LOG: internal sort ended, ... CPU 1.29s/4.31u sec elapsed 9.80 sec LOG: internal sort ended, ... CPU 1.26s/4.31u sec elapsed 9.80 sec LOG: internal sort ended, ... CPU 1.26s/4.37u sec elapsed 9.78 sec LOG: internal sort ended, ... CPU 1.33s/4.29u sec elapsed 9.78 sec gcc 4.7.3 / patched ------------------- LOG: internal sort ended, ... CPU 1.35s/4.27u sec elapsed 9.78 sec LOG: internal sort ended, ... CPU 1.33s/4.30u sec elapsed 9.74 sec LOG: internal sort ended, ... CPU 1.34s/4.27u sec elapsed 9.75 sec LOG: internal sort ended, ... 
CPU 1.27s/4.26u sec elapsed 9.78 sec LOG: internal sort ended, ... CPU 1.35s/4.26u sec elapsed 9.78 sec clang 3.3 / master ------------------ LOG: internal sort ended, ... CPU 1.32s/4.61u sec elapsed 10.00 sec LOG: internal sort ended, ... CPU 1.27s/4.48u sec elapsed 9.95 sec LOG: internal sort ended, ... CPU 1.35s/4.48u sec elapsed 9.99 sec LOG: internal sort ended, ... CPU 1.32s/4.49u sec elapsed 9.97 sec LOG: internal sort ended, ... CPU 1.32s/4.47u sec elapsed 10.04 sec clang 3.3 / patched ------------------- LOG: internal sort ended, ... CPU 1.35s/4.59u sec elapsed 10.13 sec LOG: internal sort ended, ... CPU 1.31s/4.61u sec elapsed 10.06 sec LOG: internal sort ended, ... CPU 1.28s/4.63u sec elapsed 10.10 sec LOG: internal sort ended, ... CPU 1.27s/4.58u sec elapsed 10.01 sec LOG: internal sort ended, ... CPU 1.29s/4.60u sec elapsed 10.05 sec clang 3.5 / master ------------------ LOG: internal sort ended, ... CPU 1.30s/4.46u sec elapsed 9.96 sec LOG: internal sort ended, ... CPU 1.30s/4.49u sec elapsed 9.96 sec LOG: internal sort ended, ... CPU 1.30s/4.53u sec elapsed 9.95 sec LOG: internal sort ended, ... CPU 1.25s/4.51u sec elapsed 9.95 sec LOG: internal sort ended, ... CPU 1.30s/4.50u sec elapsed 9.95 sec clang 3.5 / patched ------------------- LOG: internal sort ended, ... CPU 1.32s/4.59u sec elapsed 9.97 sec LOG: internal sort ended, ... CPU 1.27s/4.49u sec elapsed 9.91 sec LOG: internal sort ended, ... CPU 1.29s/4.48u sec elapsed 9.88 sec LOG: internal sort ended, ... CPU 1.31s/4.49u sec elapsed 9.93 sec LOG: internal sort ended, ... CPU 1.23s/4.56u sec elapsed 9.90 sec So while on clang 3.3 I really see about ~1% slowdown, on the other two compilers it seems a tad faster (but it might easily be a noise). It'd be interesting to know what compiler/version Robert used ... >> One thought in either case is that we don't have to touch the API >> for MemoryContextCreate: rather, the feature can be enabled in a >> separate call after making the context. 
> > That seems fairly awkward to me because the pointer needs to be > inherited from the parent context when not specified. I left the > extra API call in. I don't see a big difference between adding a new API "create" method and adding a separate "enable tracking" method. What seems awkward, however, is using the 'uint64*' directly as a parameter, instead of using an 'enable_tracking' flag as in the previous versions. Is there a reason for that? This allows supplying a pointer to a uint64 variable, i.e. something like this:

{
    uint64 mem_used = 0;
    MemoryContext context = MemoryContextCreateTracked(..., &mem_used);

    ...

    if (mem_used > work_mem * 1024L)
    {
        ... do something ...
    }
}

Thanks to exposing the total_mem like this, it's possible to get the current value directly like this (without the API methods available in the previous versions). It also makes it possible to "detach" the accounting from the parent (i.e. not count it against the parent). But is this intended/sane? > The inheritance is awkward anyway, though. If you create a tracked > context as a child of an already-tracked context, allocations in the > newer one won't count against the original. I don't see a way around > that without introducing even more performance problems. I see this as the main drawback of the current design :-( The question is whether we can accept such a limitation, and enforce it somehow. For example if you request a new context with tracking, and you find out the parent context already has tracking enabled, we can error out. This would effectively make it an "internal only" feature, because the user code (e.g. custom aggregates in extensions) can't rely on getting a non-tracking context, thus making it impossible to use tracking. Maybe that's an acceptable limitation? If we can get into such a conflicting situation without any external code, it's pretty much DOA. I can't think of such an example, but am not convinced there isn't any ...
Regarding the "performance problems" - ISTM it's important to differentiate between impact on performance when no memory accounting is enabled (but the code is there), and impact with the accounting. In other words: (a) free when not used (or as close to 'free' as possible) (b) cheap when used (as cheap as possible) I think the current patch achieves (a) - it's difficult to beat the single (a != NULL) check. We haven't seen any benchmarks for (b), at least not with the current patch. But even if we introduce some additional overhead (and it's impossible not to), we get something in return - the question is whether the cost is adequate. Also, if we don't have this accounting, the overhead for the local implementations may be easily higher. > If this patch is unacceptable, my only remaining idea is to add the > memory only for the current context with no inheritance (thereby > eliminating the branch, also). That will require HashAgg to traverse all > the child contexts to check whether the memory limit has been exceeded. > As Tomas pointed out, that could be a lot of work in the case of > array_agg with many groups. I can't really imagine dropping the inheritance ... Hierarchy of memory contexts with accounting that does not support hierarchies seems rather strange IMNSHO, making it either useless or at least very expensive in the interesting cases (the array_agg with a context per group is a prime example). ISTM the reason for the issues with inheritance (with two memory contexts with tracking enabled) is the "single uint64 counter". In one of my previous messages (from 23/8), I proposed to use this struct to build "parallel" accounting hierarchy (tweaked to match the v6 patch): typedef struct MemoryAccountingData { uint64 track_mem; struct MemoryAccountingData * parent; } MemoryAccountingData; Each context has either a private instance (if tracking was enabled), or point to a nearest context with tracking enabled (if existing). 
In the 'tracking disabled' case, this should have exactly the same overhead as the v6 patch (still a single (pointer != NULL) check). In the 'tracking enabled' case, it'll be more expensive, because the changes need to propagate up the tree. But that happens only if there are multiple contexts with tracking enabled - if there's a single one, it should be exactly the same as the v6 patch. Haven't tried measuring the impact though, because we don't have actual code using it. The measurements from 23/8 say that with palloc_bench, the overhead was ~3% (compared to 'tracking disabled'). Also, all that benchmark is doing is palloc, so it's not really +3% in practice.

A few minor comments regarding the patch:

(a) I see we're not tracking self/total values any more. I don't think we really need those two numbers, the total is enough. Correct?

(b) both AllocSetReset/AllocSetDelete update the accounting info for each block separately: if (context->track_mem != NULL) *context->track_mem -= block->endptr - ((char *) block); Wouldn't it be better to accumulate the total into a local variable and then update the track_mem at once? This might be more important when using the MemoryAccountingData structure and multiple tracking-enabled contexts, though.

(c) Most of the changes in AllocSetContextCreateTracked are only renames 'context => set', which can be easily eliminated by using AllocSet context = (AllocSet)MemoryContextCreate(... and casting to MemoryContext at the one place where it's needed.

regards Tomas
On 16 October 2014 02:26, Jeff Davis <pgsql@j-davis.com> wrote: > The inheritance is awkward anyway, though. If you create a tracked > context as a child of an already-tracked context, allocations in the > newer one won't count against the original. I don't see a way around > that without introducing even more performance problems. This seems to have reached impasse, which is a shame. Let me throw out a few questions to see if that might help. Do I understand correctly that we are trying to account for exact memory usage at palloc/pfree time? Why?? Surely we just want to keep track of the big chunks? Does this need to be exact? How exact does it need to be? Or alternatively, can't we just sample the allocations to reduce the overhead? Are there some assumptions we can make about micro-contexts never needing to be tracked at all? Jeff seems not too bothered by inheritance, whereas Tomas thinks its essential. Perhaps there is a middle ground, depending upon the role of the context? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote: > Do I understand correctly that we are trying to account for exact > memory usage at palloc/pfree time? Why?? Not palloc chunks, only tracking at the level of allocated blocks (that we allocate with malloc). It was a surprise to me that accounting at that level would have any measurable impact, but Robert found a reasonable case on a POWER machine that degraded a couple percent. I wasn't able to reproduce it consistently on x86. > Or alternatively, can't we just sample the allocations to reduce the overhead? Not sure quite what you mean by "sample", but it sounds like something along those lines would work. Attached is a patch that does something very simple: only tracks blocks held in the current context, with no inheritance or anything like it. This reduces it to a couple arithmetic instructions added to the alloc/dealloc path, so hopefully that removes the general performance concern raised by Robert[1]. To calculate the total memory used, I included a function MemoryContextMemAllocated() that walks the memory context and its children recursively. Of course, I was originally trying to avoid that, because it moves the problem to HashAgg. For each group, it will need to execute MemoryContextMemAllocated() to see if work_mem has been exceeded. It will have to visit a couple contexts, or perhaps many (in the case of array_agg, which creates one per group), which could be a measurable regression for HashAgg. But if that does turn out to be a problem, I think it's solvable. First, I could micro-optimize the function by making it iterative rather than recursive, to save on function call overhead. Second, I could offer a way to prevent the HTAB from creating its own context, which would be one less context to visit. And if those don't work, perhaps I could resort to a sampling method of some kind, as you allude to above. 
Regards, Jeff Davis [1] I'm fairly sure I tested something very similar on Robert's POWER machine a while ago, and it was fine. But I think it's offline or moved now, so I can't verify the results. If this patch still somehow turns out to be a 1%+ regression on any reasonable test, I don't know what to say.
Attachment
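The recursive walk Jeff describes can be sketched like this (a simplified model of the memory-context tree; the actual MemoryContextMemAllocated() signature in the patch may differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct MemCtx
{
    uint64_t       mem_allocated;   /* blocks held by this context alone */
    struct MemCtx *firstchild;      /* head of child list */
    struct MemCtx *nextchild;       /* next sibling */
} MemCtx;

/* Sum the blocks held by a context and all of its descendants. */
static uint64_t
mem_allocated_recurse(const MemCtx *ctx)
{
    uint64_t       total = ctx->mem_allocated;
    const MemCtx  *child;

    for (child = ctx->firstchild; child != NULL; child = child->nextchild)
        total += mem_allocated_recurse(child);
    return total;
}
```

The cost is proportional to the number of contexts under the one being checked, which is exactly why calling it per group from HashAgg (or under array_agg's context-per-group pattern) could regress — hence the iterative-traversal, fewer-contexts, and sampling fallbacks mentioned above.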
On 17 November 2014 07:31, Jeff Davis <pgsql@j-davis.com> wrote: > On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote: >> Do I understand correctly that we are trying to account for exact >> memory usage at palloc/pfree time? Why?? > > Not palloc chunks, only tracking at the level of allocated blocks (that > we allocate with malloc). > > It was a surprise to me that accounting at that level would have any > measurable impact, but Robert found a reasonable case on a POWER machine > that degraded a couple percent. I wasn't able to reproduce it > consistently on x86. Surprise to me also. Robert's tests showed a deviation of 0.4 sec after a restart. ISTM that we wouldn't see that every time. AFAIK the whole purpose of the memory allocator is to reduce the number of system calls, so if we are doing so many malloc() calls as to be noticeable just for accounting then something is wrong. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17 November 2014 07:31, Jeff Davis <pgsql@j-davis.com> wrote: > To calculate the total memory used, I included a function > MemoryContextMemAllocated() that walks the memory context and its > children recursively. If we assume that we want the results accurate to within 5%, then we can work out a good sampling rate to achieve that accuracy. We would probably need a counting mechanism that didn't slow down O(N) as our allocations get bigger. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
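One cheap way to get "approximately right, never O(N)" behaviour along the lines Simon suggests is to batch updates: accumulate a local delta and only fold it into the shared total once it crosses a grain, which bounds the error by the grain size. A sketch (purely illustrative, not from any posted patch; the grain constant is arbitrary):

```c
#include <assert.h>
#include <stdint.h>

#define ACCT_GRAIN (64 * 1024)      /* flush every 64 kB; bounds the error */

typedef struct LazyAcct
{
    uint64_t total;                 /* coarse, grain-accurate total */
    uint64_t pending;               /* bytes not yet folded into total */
} LazyAcct;

static void
lazy_acct_add(LazyAcct *acct, uint64_t nbytes)
{
    acct->pending += nbytes;
    if (acct->pending >= ACCT_GRAIN)
    {
        acct->total += acct->pending;   /* one shared update per grain */
        acct->pending = 0;
    }
}

/* Reading folds in the pending delta, so a caller comparing against
 * work_mem still sees an exact value at decision time. */
static uint64_t
lazy_acct_read(LazyAcct *acct)
{
    acct->total += acct->pending;
    acct->pending = 0;
    return acct->total;
}
```

The amortized cost on the allocation path stays constant regardless of how many allocations occur, at the price of the running total lagging by at most one grain.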
On 15.11.2014 22:36, Simon Riggs wrote: > On 16 October 2014 02:26, Jeff Davis <pgsql@j-davis.com> wrote: > >> The inheritance is awkward anyway, though. If you create a tracked >> context as a child of an already-tracked context, allocations in >> the newer one won't count against the original. I don't see a way >> around that without introducing even more performance problems. > > This seems to have reached impasse, which is a shame. Let me throw > out a few questions to see if that might help. > > Do I understand correctly that we are trying to account for exact > memory usage at palloc/pfree time? Why?? > > Surely we just want to keep track of the big chunks? Does this need > to be exact? How exact does it need to be? What do you mean by "big chunks"? Blocks, or really over-sized chunks (over 8kB)? I assume blocks, and that's what the proposed patches are doing. > Or alternatively, can't we just sample the allocations to reduce > the overhead? That might work, but I'm afraid it's trickier than it might seem. > Are there some assumptions we can make about micro-contexts never > needing to be tracked at all? What is a micro-context? > Jeff seems not too bothered by inheritance, whereas Tomas thinks its > essential. Perhaps there is a middle ground, depending upon the role > of the context? Having a hierarchy of memory contexts and an accounting completely ignoring that hierarchy seems a bit strange to me. Maybe we can impose some restrictions, for example: (a) a single context with tracking explicitly requested (in the whole hierarchy) (b) when tracking is requested for a context, the parent context must not have tracking enabled Either of these restrictions would prevent a situation where a context has to update accounting for two parent contexts. That'd allow updating a single place (because each context has a single parent context with tracking requested). 
But I'm not sure that those restrictions are acceptable, that we can enforce them reliably, or how frequently that'd break external code (e.g. extensions with custom aggregates etc.). Because this is part of the public API, including the option to enable accounting for a memory context. Imagine you write an extension with a custom aggregate (or whatever) and it works just fine. Then we tweak something in the executor, requesting accounting for another context, and the extension suddenly breaks. Or maybe we don't change anything, and the aggregate works fine with Group Aggregate, but breaks whenever the plan switches to Hash Aggregate (because that's using accounting). Or maybe the user just redirects the tracking somewhere else (e.g. by supplying a pointer to the v6 of the patch), excluding the context from the accounting entirely. So yeah, I think the inheritance is an important aspect, at least for a generic accounting infrastructure. I'd bet that if we choose to do that, there'll be a good deal of various WTF moments and foot-shooting in the near future ... On 23/8 I proposed to use a separate hierarchy of accounting info, parallel to the memory context hierarchy (but only for contexts with tracking explicitly requested) http://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz but no response so far. I do see ~1% regression on Robert's benchmark (on x86), but it's quite noisy. Not sure about PPC, which is what Robert used. I don't see how to make it faster without sacrificing full inheritance support ... Tomas
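Restriction (b) could be enforced at creation time with a straightforward ancestor walk — sketched here with invented names; whether refusing (rather than, say, silently redirecting the accounting) is the right behaviour is exactly the open question raised above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *parent;
    bool        track_mem;
} Ctx;

/* Refuses (returns false) when any ancestor already tracks, enforcing
 * the "no nested tracking" rule; real code would ereport(ERROR, ...). */
static bool
ctx_create_tracked(Ctx *ctx, Ctx *parent)
{
    const Ctx *anc;

    for (anc = parent; anc != NULL; anc = anc->parent)
        if (anc->track_mem)
            return false;

    ctx->parent = parent;
    ctx->track_mem = true;
    return true;
}
```

The walk happens once per context creation, not per allocation, so the cost is off the hot path — but as Tomas notes, an extension that trips this error only under Hash Aggregate plans would be painful to debug.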
Hi, On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: > *** a/src/include/nodes/memnodes.h > --- b/src/include/nodes/memnodes.h > *************** > *** 60,65 **** typedef struct MemoryContextData > --- 60,66 ---- > MemoryContext nextchild; /* next child of same parent */ > char *name; /* context name (just for debugging) */ > bool isReset; /* T = no space alloced since last reset */ > + uint64 mem_allocated; /* track memory allocated for this context */ > #ifdef USE_ASSERT_CHECKING > bool allowInCritSection; /* allow palloc in critical section */ > #endif That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
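Andres's point can be demonstrated with a stripped-down model of the struct. On a typical LP64 platform (8-byte pointers, structs padded to 8-byte multiples), appending a uint64 after the trailing bool grows the struct by a full 8 bytes, while a 4-byte counter tucked into the padding hole right after the 4-byte NodeTag costs nothing. The field names below are illustrative, not the exact MemoryContextData layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int32_t NodeTag;            /* stand-in for the real enum */

typedef struct Base                 /* rough shape of MemoryContextData */
{
    NodeTag     type;               /* 4 bytes, then a 4-byte padding hole */
    void       *ptrs[6];            /* methods, parent, children, ... */
    char       *name;
    bool        isReset;            /* 1 byte + tail padding */
} Base;

typedef struct WithUint64           /* counter appended, as in the patch */
{
    NodeTag     type;
    void       *ptrs[6];
    char       *name;
    bool        isReset;
    uint64_t    mem_allocated;      /* forces 8 more bytes */
} WithUint64;

typedef struct WithUint32           /* Andres's suggestion: coarser units */
{
    NodeTag     type;
    uint32_t    mem_allocated_kb;   /* fills the existing padding hole */
    void       *ptrs[6];
    char       *name;
    bool        isReset;
} WithUint32;
```

The trade-off is granularity: counting in blocks (or kB) rather than bytes keeps the struct size — and hence the cacheline footprint — unchanged.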
On 17.11.2014 08:31, Jeff Davis wrote: > On Sat, 2014-11-15 at 21:36 +0000, Simon Riggs wrote: >> Do I understand correctly that we are trying to account for exact >> memory usage at palloc/pfree time? Why?? > > Not palloc chunks, only tracking at the level of allocated blocks > (that we allocate with malloc). > > It was a surprise to me that accounting at that level would have any > measurable impact, but Robert found a reasonable case on a POWER > machine that degraded a couple percent. I wasn't able to reproduce > it consistently on x86. > >> Or alternatively, can't we just sample the allocations to reduce >> the overhead? > > Not sure quite what you mean by "sample", but it sounds like > something along those lines would work. Maybe. It might also cause unexpected volatility / nondeterminism in the query execution. Imagine a Hash Aggregate, where a small number of the requests is considerably larger than the rest. If you happen to sample only a few of those large requests, the whole hash aggregate happens in-memory, otherwise it starts batching. The users will observe this as random variations to the runtimes - same query / parameters, idle machine, sudden changes to the plan ... Maybe I'm too wary, but I guess there are use cases where the latency uniformity is a concern. > Attached is a patch that does something very simple: only tracks > blocks held in the current context, with no inheritance or anything > like it. This reduces it to a couple arithmetic instructions added > to the alloc/dealloc path, so hopefully that removes the general > performance concern raised by Robert[1]. > > To calculate the total memory used, I included a function > MemoryContextMemAllocated() that walks the memory context and its > children recursively. > > Of course, I was originally trying to avoid that, because it moves > the problem to HashAgg. For each group, it will need to execute > MemoryContextMemAllocated() to see if work_mem has been exceeded. 
> It will have to visit a couple contexts, or perhaps many (in the case > of array_agg, which creates one per group), which could be a > measurable regression for HashAgg. :-( > But if that does turn out to be a problem, I think it's solvable. > First, I could micro-optimize the function by making it iterative > rather than recursive, to save on function call overhead. Second, I > could offer a way to prevent the HTAB from creating its own context, > which would be one less context to visit. And if those don't work, > perhaps I could resort to a sampling method of some kind, as you > allude to above. Do you plan to try this approach with your hashagg patch, and do some measurements? I'd expect the performance hit to be significant, but I'd like to see some numbers. kind regards Tomas
On 17.11.2014 18:04, Andres Freund wrote: > Hi, > > On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: >> *** a/src/include/nodes/memnodes.h >> --- b/src/include/nodes/memnodes.h >> *************** >> *** 60,65 **** typedef struct MemoryContextData >> --- 60,66 ---- >> MemoryContext nextchild; /* next child of same parent */ >> char *name; /* context name (just for debugging) */ >> bool isReset; /* T = no space alloced since last reset */ >> + uint64 mem_allocated; /* track memory allocated for this context */ >> #ifdef USE_ASSERT_CHECKING >> bool allowInCritSection; /* allow palloc in critical section */ >> #endif > > That's quite possibly one culprit for the slowdown. Right now one > AllocSetContext struct fits precisely into three cachelines. After > your change it doesn't anymore. I'm no PPC64 expert, but I thought the cache lines are 128 B on that platform, since at least Power6? Also, if I'm counting right, the MemoryContextData structure is 56B without the 'mem_allocated' field (and without the allowInCritSection), and 64B with it (at that particular place). sizeof() seems to confirm that. (But I'm on x86, so maybe the alignment on PPC64 is different?). > Consider not counting memory in bytes but blocks and adding it > directly after the NodeTag. That'd leave the size the same as right > now. I suppose you mean "kbytes", because the block size is not fixed. regards Tomas
On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote: > On 17.11.2014 18:04, Andres Freund wrote: > > Hi, > > > > On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: > >> *** a/src/include/nodes/memnodes.h > >> --- b/src/include/nodes/memnodes.h > >> *************** > >> *** 60,65 **** typedef struct MemoryContextData > >> --- 60,66 ---- > >> MemoryContext nextchild; /* next child of same parent */ > >> char *name; /* context name (just for debugging) */ > >> bool isReset; /* T = no space alloced since last reset */ > >> + uint64 mem_allocated; /* track memory allocated for this context */ > >> #ifdef USE_ASSERT_CHECKING > >> bool allowInCritSection; /* allow palloc in critical section */ > >> #endif > > > > That's quite possibly one culprit for the slowdown. Right now one > > AllocSetContext struct fits precisely into three cachelines. After > > your change it doesn't anymore. > > I'm no PPC64 expert, but I thought the cache lines are 128 B on that > platform, since at least Power6? Hm, might be true. > Also, if I'm counting right, the MemoryContextData structure is 56B > without the 'mem_allocated' field (and without the allowInCritSection), > and 64B with it (at that particular place). sizeof() seems to confirm > that. (But I'm on x86, so maybe the alignment on PPC64 is different?). The MemoryContextData struct is embedded into AllocSetContext. > > Consider not counting memory in bytes but blocks and adding it > > directly after the NodeTag. That'd leave the size the same as right > > now. > > I suppose you mean "kbytes", because the block size is not fixed. Some coarser size than bytes. Don't really care about the granularity. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17.11.2014 19:46, Andres Freund wrote: > On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote: >> On 17.11.2014 18:04, Andres Freund wrote: >>> Hi, >>> >>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: >>>> *** a/src/include/nodes/memnodes.h >>>> --- b/src/include/nodes/memnodes.h >>>> *************** >>>> *** 60,65 **** typedef struct MemoryContextData >>>> --- 60,66 ---- >>>> MemoryContext nextchild; /* next child of same parent */ >>>> char *name; /* context name (just for debugging) */ >>>> bool isReset; /* T = no space alloced since last reset */ >>>> + uint64 mem_allocated; /* track memory allocated for this context */ >>>> #ifdef USE_ASSERT_CHECKING >>>> bool allowInCritSection; /* allow palloc in critical section */ >>>> #endif >>> >>> That's quite possibly one culprit for the slowdown. Right now one >>> AllocSetContext struct fits precisely into three cachelines. After >>> your change it doesn't anymore. >> >> I'm no PPC64 expert, but I thought the cache lines are 128 B on that >> platform, since at least Power6? > > Hm, might be true. > >> Also, if I'm counting right, the MemoryContextData structure is 56B >> without the 'mem_allocated' field (and without the allowInCritSection), >> and 64B with it (at that particular place). sizeof() seems to confirm >> that. (But I'm on x86, so maybe the alignment on PPC64 is different?). > > The MemoryContextData struct is embedded into AllocSetContext. Oh, right. That makes it slightly more complicated, though, because AllocSetContext adds 6 x 8B fields plus an in-line array of freelist pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the additional field. There might be some difference because of alignment, but I still don't see how that one additional field might impact cachelines? But if we separated the freelist, that might actually make it faster, at least for calls that don't touch the freelist at all, no? Because most of the palloc() calls will be handled from the current block.
I tried this on the patch I sent on 23/8 (with the MemoryAccounting hierarchy parallel to the contexts), and I got this (running 10x the reindex, without restarts or such): without the patch ----------------- ... 1723933 KB used: CPU 1.35s/4.33u sec elapsed 10.38 sec ... 1723933 KB used: CPU 1.39s/4.19u sec elapsed 9.89 sec ... 1723933 KB used: CPU 1.39s/4.26u sec elapsed 9.90 sec ... 1723933 KB used: CPU 1.45s/4.22u sec elapsed 10.15 sec ... 1723933 KB used: CPU 1.36s/4.20u sec elapsed 9.84 sec ... 1723933 KB used: CPU 1.32s/4.20u sec elapsed 9.96 sec ... 1723933 KB used: CPU 1.50s/4.23u sec elapsed 9.91 sec ... 1723933 KB used: CPU 1.47s/4.24u sec elapsed 9.90 sec ... 1723933 KB used: CPU 1.38s/4.23u sec elapsed 10.09 sec ... 1723933 KB used: CPU 1.37s/4.19u sec elapsed 9.84 sec with the patch (no tracking enabled) ------------------------------------ ... 1723933 KB used: CPU 1.40s/4.28u sec elapsed 10.79 sec ... 1723933 KB used: CPU 1.39s/4.32u sec elapsed 10.14 sec ... 1723933 KB used: CPU 1.40s/4.16u sec elapsed 9.99 sec ... 1723933 KB used: CPU 1.32s/4.21u sec elapsed 10.16 sec ... 1723933 KB used: CPU 1.37s/4.34u sec elapsed 10.03 sec ... 1723933 KB used: CPU 1.35s/4.22u sec elapsed 9.97 sec ... 1723933 KB used: CPU 1.43s/4.20u sec elapsed 9.99 sec ... 1723933 KB used: CPU 1.39s/4.17u sec elapsed 9.71 sec ... 1723933 KB used: CPU 1.44s/4.16u sec elapsed 9.91 sec ... 1723933 KB used: CPU 1.40s/4.25u sec elapsed 9.99 sec So it's about 9,986 vs 10,068 for averages, and 9,905 vs 9,99 for a median. I.e. ~0.8% difference. That's on x86, though. I wonder what'd be the effect on the PPC machine. Patch attached - it's not perfect, though. For example it should free the freelists at some point, but I feel a bit lazy today. Tomas
Attachment
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote: > That's quite possibly one culprit for the slowdown. Right now one > AllocSetContext struct fits precisely into three cachelines. After your > change it doesn't anymore. Thank you! I made the same mistake as Tomas: I saw that MemoryContextData went to 64 bytes and thought "that should be fine" while ignoring AllocSetContext. I did some tests in the past between a version of my patch that made MemoryContextData 72 bytes and one that made it 64 bytes, but I may have neglected to test against the original 56 byte version. I'm still not able to test to see if this is the real problem, but it seems best to eliminate it anyway. > Consider not counting memory in bytes but blocks and adding it directly > after the NodeTag. That'd leave the size the same as right now. I can also just move isReset there, and keep mem_allocated as a uint64. That way, if I find later that I want to track the aggregated value for the child contexts as well, I can split it into two uint32s. I'll hold off on any such optimizations until I see some numbers from HashAgg though. Attached new version. Regards, Jeff Davis
Attachment
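The layout point being negotiated above can be demonstrated with simplified stand-ins for the structs (field set abbreviated, names taken loosely from the thread; sizes assume an LP64 ABI with 8-byte pointers, as on x86-64):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Patch as first posted: mem_allocated appended after isReset.
 * The bool ends the pointer run, so the uint64 needs 7 bytes of
 * padding before it -> 64 bytes total. */
typedef struct ContextNaive
{
    int      type;                          /* NodeTag, 4 bytes */
    void    *methods, *parent, *firstchild, *nextchild;
    char    *name;
    bool     isReset;
    uint64_t mem_allocated;
} ContextNaive;

/* Reordered as proposed: isReset moved into the padding right after
 * the tag, so mem_allocated lands on a natural 8-byte boundary and
 * the struct stays at the original 56 bytes. */
typedef struct ContextReordered
{
    int      type;                          /* NodeTag, 4 bytes */
    bool     isReset;                       /* fills tag padding */
    uint64_t mem_allocated;
    void    *methods, *parent, *firstchild, *nextchild;
    char    *name;
} ContextReordered;
```

Since the header struct is embedded at the front of the allocator's per-context struct, keeping it at 56 bytes is what keeps the whole thing within three cachelines, per the discussion that follows.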
On 11/17/14, 10:50 AM, Tomas Vondra wrote:
> Either of these restrictions would prevent a situation where a context
> has to update accounting for two parent contexts. That'd allow updating
> a single place (because each context has a single parent context with
> tracking requested).

Another option might be to be lazy about updating parent contexts. I'm thinking something like keeping a boolean (or making a size of 0 magic) that indicates whether a context has up-to-date info from its children or not. That would allow you to only update the complete size when you need it; if your children haven't been touched since the last time you calculated, then you don't need to recalculate.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
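The lazy scheme suggested above could look roughly like the following sketch (hypothetical names, not from any posted patch): allocations only invalidate the cached totals of ancestors, and the expensive subtree walk happens only when somebody actually asks for the total.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each node caches the total for its subtree plus a validity flag. */
typedef struct LazyCtx
{
    struct LazyCtx *parent;
    struct LazyCtx *firstchild;
    struct LazyCtx *nextsibling;
    int64_t self_allocated;     /* blocks owned by this context */
    int64_t subtree_cached;     /* last computed subtree total */
    bool    cache_valid;
} LazyCtx;

static void
lazy_alloc(LazyCtx *ctx, int64_t bytes)
{
    ctx->self_allocated += bytes;
    /* Invalidate cached totals up the tree; stop early once we hit an
     * ancestor that is already marked dirty (its ancestors must be
     * dirty too, by induction). */
    for (LazyCtx *p = ctx; p != NULL && p->cache_valid; p = p->parent)
        p->cache_valid = false;
}

static int64_t
lazy_total(LazyCtx *ctx)
{
    if (!ctx->cache_valid)
    {
        int64_t total = ctx->self_allocated;

        for (LazyCtx *c = ctx->firstchild; c != NULL; c = c->nextsibling)
            total += lazy_total(c);
        ctx->subtree_cached = total;
        ctx->cache_valid = true;
    }
    return ctx->subtree_cached;
}
```

The trade-off: the hot allocation path becomes a short invalidation walk (often length zero), while reading the total is O(subtree) only for the parts that changed since the last read.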
On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
> On 17.11.2014 19:46, Andres Freund wrote:
> > On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
> >> On 17.11.2014 18:04, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
> >>>> *** a/src/include/nodes/memnodes.h
> >>>> --- b/src/include/nodes/memnodes.h
> >>>> ***************
> >>>> *** 60,65 **** typedef struct MemoryContextData
> >>>> --- 60,66 ----
> >>>>       MemoryContext nextchild;    /* next child of same parent */
> >>>>       char       *name;           /* context name (just for debugging) */
> >>>>       bool        isReset;        /* T = no space alloced since last reset */
> >>>> +     uint64      mem_allocated;  /* track memory allocated for this context */
> >>>>   #ifdef USE_ASSERT_CHECKING
> >>>>       bool        allowInCritSection; /* allow palloc in critical section */
> >>>>   #endif
> >>>
> >>> That's quite possibly one culprit for the slowdown. Right now one
> >>> AllocSetContext struct fits precisely into three cachelines. After
> >>> your change it doesn't anymore.
> >>
> >> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
> >> platform, since at least Power6?
> >
> > Hm, might be true.
> >
> >> Also, if I'm counting right, the MemoryContextData structure is 56B
> >> without the 'mem_allocated' field (and without the allowInCritSection),
> >> and 64B with it (at that particular place). sizeof() seems to confirm
> >> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
> >
> > The MemoryContextData struct is embedded into AllocSetContext.
>
> Oh, right. That makes it slightly more complicated, though, because
> AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
> pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
> additional field. There might be some difference because of alignment,
> but I still don't see how that one additional field might impact
> cachelines?
It's actually 196 bytes:

struct AllocSetContext {
        MemoryContextData     header;           /*     0    56 */
        AllocBlock            blocks;           /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        AllocChunk            freelist[11];     /*    64    88 */
        /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
        Size                  initBlockSize;    /*   152     8 */
        Size                  maxBlockSize;     /*   160     8 */
        Size                  nextBlockSize;    /*   168     8 */
        Size                  allocChunkLimit;  /*   176     8 */
        AllocBlock            keeper;           /*   184     8 */
        /* --- cacheline 3 boundary (192 bytes) --- */

        /* size: 192, cachelines: 3, members: 8 */
};

And thus one additional field tips it over the edge.

"pahole" is a very useful utility.

> But if we separated the freelist, that might actually make it faster, at
> least for calls that don't touch the freelist at all, no? Because most
> of the palloc() calls will be handled from the current block.

I seriously doubt it. The additional indirection + additional branches are likely to make it worse.

Greetings,

Andres Freund

-- 
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21.11.2014 00:03, Andres Freund wrote:
> On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
>> On 17.11.2014 19:46, Andres Freund wrote:
>>
>>> The MemoryContextData struct is embedded into AllocSetContext.
>>
>> Oh, right. That makes it slightly more complicated, though, because
>> AllocSetContext adds 6 x 8B fields plus an in-line array of
>> freelist pointers. Which is 11x8 bytes. So in total 56+56+88=200B,
>> without the additional field. There might be some difference
>> because of alignment, but I still don't see how that one
>> additional field might impact cachelines?
>
> It's actually 196 bytes:

Ummmm, I think the pahole output shows 192, not 196? Otherwise it wouldn't be exactly 3 cachelines anyway. But yeah - my math-foo was weak for a moment, because 6x8 != 56. Which is the 8B difference :-/

> struct AllocSetContext {
>         MemoryContextData     header;           /*     0    56 */
>         AllocBlock            blocks;           /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         AllocChunk            freelist[11];     /*    64    88 */
>         /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
>         Size                  initBlockSize;    /*   152     8 */
>         Size                  maxBlockSize;     /*   160     8 */
>         Size                  nextBlockSize;    /*   168     8 */
>         Size                  allocChunkLimit;  /*   176     8 */
>         AllocBlock            keeper;           /*   184     8 */
>         /* --- cacheline 3 boundary (192 bytes) --- */
>
>         /* size: 192, cachelines: 3, members: 8 */
> };
>
> And thus one additional field tips it over the edge.
>
> "pahole" is a very useful utility.

Indeed.

>> But if we separated the freelist, that might actually make it
>> faster, at least for calls that don't touch the freelist at all,
>> no? Because most of the palloc() calls will be handled from the
>> current block.
>
> I seriously doubt it. The additional indirection + additional
> branches are likely to make it worse.

That's possible, although I tried it on "my version" of the accounting patch, and it showed a slight improvement (lower overhead) on Robert's reindex benchmark.

The question is how that would work with a regular workload, because moving the freelist out of the structure makes it smaller (2 cachelines instead of 3), and it can only impact workloads working with the freelists (i.e. either by calling free, or realloc, or whatever). Although palloc() checks the freelist too ...

Also, those pieces may be allocated together (next to each other), which might preserve locality.

But I haven't tested any of this, and my knowledge of this low-level stuff is poor, so I might be completely wrong.

regards,
Tomas
On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> I can also just move isReset there, and keep mem_allocated as a uint64.
> That way, if I find later that I want to track the aggregated value for
> the child contexts as well, I can split it into two uint32s. I'll hold
> off on any such optimizations until I see some numbers from HashAgg
> though.

I took a quick look at memory-accounting-v8.patch.

Is there some reason why mem_allocated is a uint64? All other things being equal, I'd follow the example of tuplesort.c's MemoryContextAllocHuge() API, which (following bugfix commit 79e0f87a1) uses int64 variables to track available memory and so on.

-- 
Peter Geoghegan
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > I can also just move isReset there, and keep mem_allocated as a uint64.
> > That way, if I find later that I want to track the aggregated value for
> > the child contexts as well, I can split it into two uint32s. I'll hold
> > off on any such optimizations until I see some numbers from HashAgg
> > though.
>
> I took a quick look at memory-accounting-v8.patch.
>
> Is there some reason why mem_allocated is a uint64? All other things
> being equal, I'd follow the example of tuplesort.c's
> MemoryContextAllocHuge() API, which (following bugfix commit
> 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.

Regards,
Jeff Davis
Attachment
On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > I can also just move isReset there, and keep mem_allocated as a uint64.
>> > That way, if I find later that I want to track the aggregated value for
>> > the child contexts as well, I can split it into two uint32s. I'll hold
>> > off on any such optimizations until I see some numbers from HashAgg
>> > though.
>>
>> I took a quick look at memory-accounting-v8.patch.
>>
>> Is there some reason why mem_allocated is a uint64? All other things
>> being equal, I'd follow the example of tuplesort.c's
>> MemoryContextAllocHuge() API, which (following bugfix commit
>> 79e0f87a1) uses int64 variables to track available memory and so on.
>
> No reason. New version attached; that's the only change.

Note that I am marking this patch back to "Needs Review" state. It doesn't seem that this patch has been reviewed completely.

-- 
Michael
On Mon, Dec 8, 2014 at 9:39 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> > I can also just move isReset there, and keep mem_allocated as a uint64.
>>> > That way, if I find later that I want to track the aggregated value for
>>> > the child contexts as well, I can split it into two uint32s. I'll hold
>>> > off on any such optimizations until I see some numbers from HashAgg
>>> > though.
>>>
>>> I took a quick look at memory-accounting-v8.patch.
>>>
>>> Is there some reason why mem_allocated is a uint64? All other things
>>> being equal, I'd follow the example of tuplesort.c's
>>> MemoryContextAllocHuge() API, which (following bugfix commit
>>> 79e0f87a1) uses int64 variables to track available memory and so on.
>>
>> No reason. New version attached; that's the only change.
> Note that I am marking this patch back to "Needs Review" state.

Moving to CF 2014-12.

-- 
Michael
On 2.12.2014 06:14, Jeff Davis wrote:
> On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
>> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> I can also just move isReset there, and keep mem_allocated as a uint64.
>>> That way, if I find later that I want to track the aggregated value for
>>> the child contexts as well, I can split it into two uint32s. I'll hold
>>> off on any such optimizations until I see some numbers from HashAgg
>>> though.
>>
>> I took a quick look at memory-accounting-v8.patch.
>>
>> Is there some reason why mem_allocated is a uint64? All other things
>> being equal, I'd follow the example of tuplesort.c's
>> MemoryContextAllocHuge() API, which (following bugfix commit
>> 79e0f87a1) uses int64 variables to track available memory and so on.
>
> No reason. New version attached; that's the only change.

I've started reviewing this today. It does not apply cleanly on current head, because of 4a14f13a committed a few days ago. That commit changed the constants in src/include/utils/hsearch.h, so the patch needs to use this:

    #define HASH_NOCHILDCXT 0x4000 /* ... */

That's the only conflict, and after fixing it it compiles OK. However, I got a segfault on the very first query I tried :-(

    create table test_hash_agg as
        select i AS a, i AS b, i AS c, i AS d
          from generate_series(1,10000000) s(i);

    analyze test_hash_agg;

    select a, count(*) from test_hash_agg where a < 100000 group by a;

With a fresh cluster (default config), I get this:

    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.

The backtrace looks like this (full attached):

    Program received signal SIGSEGV, Segmentation fault.
    advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
        peraggstate=peraggstate@entry=0x2b5efd0,
        pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
    468         if (pergroupstate->noTransValue)
    (gdb) bt
    #0  advance_transition_function at nodeAgg.c:468
    #1  0x00000000005c3494 in advance_aggregates at nodeAgg.c:624
    #2  0x00000000005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
    #3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338

    (gdb) print pergroupstate->noTransValue
    Cannot access memory at address 0x11
    (gdb) print pergroupstate
    $1 = (AggStatePerGroup) 0x8

My guess is something is scribbling over the pergroup memory, or maybe the memory context gets released, or something. In any case, it's easily reproducible, and apparently deterministic (always exactly the same values, no randomness).

regards,
Tomas
Attachment
On 21.12.2014 20:19, Tomas Vondra wrote:
>
> However, I got a segfault on the very first query I tried :-(
>
>     create table test_hash_agg as
>         select i AS a, i AS b, i AS c, i AS d
>           from generate_series(1,10000000) s(i);
>
>     analyze test_hash_agg;
>
>     select a, count(*) from test_hash_agg where a < 100000 group by a;
>
> With a fresh cluster (default config), I get this:
>
>     server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>     The connection to the server was lost. Attempting reset: Failed.
>
> The backtrace looks like this (full attached):
>
>     Program received signal SIGSEGV, Segmentation fault.
>     advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
>         peraggstate=peraggstate@entry=0x2b5efd0,
>         pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
>     468         if (pergroupstate->noTransValue)
>     (gdb) bt
>     #0  advance_transition_function at nodeAgg.c:468
>     #1  0x00000000005c3494 in advance_aggregates at nodeAgg.c:624
>     #2  0x00000000005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
>     #3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338
>
>     (gdb) print pergroupstate->noTransValue
>     Cannot access memory at address 0x11
>     (gdb) print pergroupstate
>     $1 = (AggStatePerGroup) 0x8
>
> My guess is something is scribbling over the pergroup memory, or maybe
> the memory context gets released, or something. In any case, it's easily
> reproducible, and apparently deterministic (always exactly the same
> values, no randomness).

BTW the fact that 'make installcheck' and 'make check' both pass, yet a simple query crashes, suggests there are no regression tests exercising the batching properly. That should be part of the patch, I believe.

regards,
Tomas
It seems that these two patches are being reviewed together. Should I just combine them into one? My understanding was that some wanted to review the memory accounting patch separately.

On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
> That's the only conflict, and after fixing it it compiles OK. However, I
> got a segfault on the very first query I tried :-(

If lookup_hash_entry doesn't find the group, and there's not enough memory to create it, then it returns NULL; but the caller wasn't checking for NULL. My apologies for such a trivial mistake; I was doing most of my testing using DISTINCT. My fix here was done quickly, so I'll take a closer look later to make sure I didn't miss something else.

New patch attached (rebased, as well).

I also see your other message about adding regression testing. I'm hesitant to slow down the tests for everyone to run through this code path, though. Should I add regression tests, and then remove them later after we're more comfortable that it works?

Regards,
Jeff Davis
Attachment
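The class of bug described above can be illustrated with a simplified stand-in (this is not the actual PostgreSQL code; all names besides lookup_hash_entry's role are invented): a lookup that may now legitimately return NULL under memory pressure, and a caller that must route such tuples to a batch file instead of dereferencing the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct Entry { int dummy; } Entry;

/* Stand-in for a lookup_hash_entry-style function: returns NULL when
 * the group is absent and creating it would exceed the memory limit. */
static Entry *
lookup_entry(bool found, bool memory_exhausted, Entry *slot)
{
    if (found)
        return slot;            /* existing group */
    if (memory_exhausted)
        return NULL;            /* caller must spill this tuple */
    return slot;                /* create the group */
}

/* Caller pattern: the missing NULL check was the crash described above. */
static const char *
process_tuple(bool found, bool memory_exhausted, Entry *slot)
{
    Entry *entry = lookup_entry(found, memory_exhausted, slot);

    if (entry == NULL)
        return "spilled";       /* would write the tuple to a batch file */
    return "advanced";          /* would advance the aggregates on entry */
}
```

The subtle part is that the NULL return is a *new* possibility introduced by the memory-bounded behavior, so every pre-existing caller written under the "lookup always succeeds" assumption becomes a latent crash.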
On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.

I think we should keep the patches separate. Applying two patches is trivial, splitting them not so much.

> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
>
> If lookup_hash_entry doesn't find the group, and there's not enough
> memory to create it, then it returns NULL; but the caller wasn't
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
>
> New patch attached (rebased, as well).
>
> I also see your other message about adding regression testing. I'm
> hesitant to slow down the tests for everyone to run through this
> code path though. Should I add regression tests, and then remove them
> later after we're more comfortable that it works?

I think when done right, the additional time will be negligible. By setting work_mem low, we don't need that much data.

regards,
Tomas
On Tue, 2014-12-23 at 01:16 -0800, Jeff Davis wrote:
> New patch attached (rebased, as well).
>
> I also see your other message about adding regression testing. I'm
> hesitant to slow down the tests for everyone to run through this code
> path though. Should I add regression tests, and then remove them later
> after we're more comfortable that it works?

Attached are some tests I ran. First, generate the data sets with hashagg_test_data.sql. Then, do (I used work_mem at the default of 4MB):

    set enable_hashagg=false;
    \o /tmp/sort.out
    \i /tmp/hashagg_test.sql
    \o

    set enable_hashagg=true;
    \o /tmp/hash.out
    \i /tmp/hashagg_test.sql

and then diff'd the output to make sure the results are the same (except the plans, of course). The script loads the results into a temp table, then sorts it before outputting, to make the test order-independent. I didn't just add an ORDER BY, because that would change the plan and it would never use hashagg.

I think that has fairly good coverage of the hashagg code. I used 3 different input data sets, byval and byref types (for group key and args), and a group aggregate query as well as DISTINCT. Let me know if I missed something.

I also did some performance comparisons between disk-based sort+group and disk-based hashagg. The results are quite favorable for hashagg given the data sets I provided. Simply create the data using hashagg_test_data.sql (if not already done), set work_mem to the value you want to test, and run hashagg_test_perf.sql. It uses EXPLAIN ANALYZE for the timings.

    singleton: 10M groups of 1
    even:      1M groups of 10
    skew:      wildly different group sizes; see data script

    q1: group aggregate query
    q2: distinct query

The total memory requirements for the test to run without going to disk range from about 100MB (for "even") to about 1GB (for "singleton"). Regardless of work_mem, these all fit in memory on my machine, so they aren't *really* going to disk.
Also note that, because of how the memory blocks are allocated, and because hashagg waits until memory is exceeded, hashagg might use about double work_mem when work_mem is small (the effect is not important at higher values).

work_mem='1MB':
                 sort+group (s)   hashagg (s)
   singleton q1        12             10
   singleton q2         8              7
   even q1             14              7
   even q2             10              5
   skew q1             22              6
   skew q2             16              4

work_mem='4MB':
                 sort+group (s)   hashagg (s)
   singleton q1        12             11
   singleton q2         8              6
   even q1             12              7
   even q2              9              5
   skew q1             19              6
   skew q2             13              3

work_mem='16MB':
                 sort+group (s)   hashagg (s)
   singleton q1        12             11
   singleton q2         8              7
   even q1             14              7
   even q2             10              5
   skew q1             15              6
   skew q2             12              4

work_mem='64MB':
                 sort+group (s)   hashagg (s)
   singleton q1        13             12
   singleton q2         9              8
   even q1             14              8
   even q2             10              5
   skew q1             17              6
   skew q2             13              4

work_mem='256MB':
                 sort+group (s)   hashagg (s)
   singleton q1        12             12
   singleton q2         9              8
   even q1             14              7
   even q2             11              4
   skew q1             16              6
   skew q2             13              4

work_mem='512MB':
                 sort+group (s)   hashagg (s)
   singleton q1        12             12
   singleton q2         9              7
   even q1             14              7
   even q2             10              4
   skew q1             16              6
   skew q2             12              4

work_mem='2GB':
                 sort+group (s)   hashagg (s)
   singleton q1         9             12
   singleton q2         6              6
   even q1              8              7
   even q2              6              4
   skew q1              7              6
   skew q2              5              4

These numbers are great news for disk-based hashagg. It seems to be the same or better than sort+group in nearly all cases (again, this example doesn't actually go to disk, so those numbers may come out differently). Also, the numbers are remarkably stable across work_mem settings for both plans. That means it doesn't cost much to keep a lower work_mem as long as your system has plenty of memory.

Do others have similar numbers? I'm quite surprised at how little work_mem seems to matter for these plans (HashJoin might be a different story though). I feel like I made a mistake -- can someone please do a sanity check on my numbers?

Regards,
Jeff Davis
Attachment
On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Do others have similar numbers? I'm quite surprised at how little
> work_mem seems to matter for these plans (HashJoin might be a different
> story though). I feel like I made a mistake -- can someone please do a
> sanity check on my numbers?

I have seen external sorts that were quicker than internal sorts before. With my abbreviated key patch, under certain circumstances external sorts are faster, while presumably the same thing is true of int4 attribute sorts today. Actually, I saw a 10MB work_mem setting that was marginally faster than a multi-gigabyte one that fit the entire sort in memory. It probably has something to do with caching effects dominating over the expense of more comparisons, since higher work_mem settings that still resulted in an external sort were slower than the 10MB setting.

I was surprised by this too, but it has been independently reported by Jeff Janes.

-- 
Peter Geoghegan
On Sun, 2014-12-28 at 12:37 -0800, Jeff Davis wrote:
> I feel like I made a mistake -- can someone please do a
> sanity check on my numbers?

I forgot to randomize the inputs, which doesn't matter much for hashagg but does matter for sort. New data script attached. The results are even *better* for disk-based hashagg than the previous numbers suggest.

Here are some new numbers:

work_mem='1MB':
                 sort+group (s)   hashagg (s)
   singleton q1        21             10
   singleton q2        12              8
   even q1             20              7
   even q2             13              5
   skew q1             22              6
   skew q2             16              4

work_mem='4MB':
                 sort+group (s)   hashagg (s)
   singleton q1        17             10
   singleton q2        11              6
   even q1             16              7
   even q2             11              5
   skew q1             19              6
   skew q2             13              4

work_mem='16MB':
                 sort+group (s)   hashagg (s)
   singleton q1        16             11
   singleton q2        11              7
   even q1             15              8
   even q2             12              6
   skew q1             15              6
   skew q2             12              4

work_mem='64MB':
                 sort+group (s)   hashagg (s)
   singleton q1        18             12
   singleton q2        13              8
   even q1             17             10
   even q2             13              6
   skew q1             17              6
   skew q2             14              4

work_mem='256MB':
                 sort+group (s)   hashagg (s)
   singleton q1        18             12
   singleton q2        14              7
   even q1             16              9
   even q2             14              5
   skew q1             18              6
   skew q2             13              4

work_mem='512MB':
                 sort+group (s)   hashagg (s)
   singleton q1        18             12
   singleton q2        14              7
   even q1             17              9
   even q2             14              5
   skew q1             17              6
   skew q2             13              4

work_mem='2GB':
                 sort+group (s)   hashagg (s)
   singleton q1        11             11
   singleton q2         7              6
   even q1             10              9
   even q2              7              5
   skew q1              7              6
   skew q2              4              4

Regards,
Jeff Davis
Attachment
On 12/28/14, 2:45 PM, Peter Geoghegan wrote:
> On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Do others have similar numbers? I'm quite surprised at how little
>> work_mem seems to matter for these plans (HashJoin might be a different
>> story though). I feel like I made a mistake -- can someone please do a
>> sanity check on my numbers?
>
> I have seen external sorts that were quicker than internal sorts
> before. With my abbreviated key patch, under certain circumstances
> external sorts are faster, while presumably the same thing is true of
> int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
> that was marginally faster than a multi-gigabyte one that fit the
> entire sort in memory. It probably has something to do with caching
> effects dominating over the expense of more comparisons, since higher
> work_mem settings that still resulted in an external sort were slower
> than the 10MB setting.
>
> I was surprised by this too, but it has been independently reported by
> Jeff Janes.

I haven't tested for external faster than internal in a while, but I've certainly seen this effect before. Generally, once you get beyond a certain size (maybe 100MB?) you run the risk of a tapesort being faster than an internal sort.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Dec 28, 2014 at 12:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Do others have similar numbers? I'm quite surprised at how little
> work_mem seems to matter for these plans (HashJoin might be a different
> story though). I feel like I made a mistake -- can someone please do a
> sanity check on my numbers?
I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.
I was surprised by this too, but it has been independently reported by
Jeff Janes.
I don't recall (at the moment) seeing our external sort actually faster than quick-sort, but I've very reliably seen external sorts get faster with less memory than with more. It is almost certainly a CPU caching issue. Very large simple binary heaps are horrible on the CPU cache. And for sort-by-reference values, quick sort is also pretty bad.
With a slow enough data bus between the CPU and main memory, I don't doubt that a 'tapesort' with small work_mem could actually be faster than quicksort with large work_mem. But I don't recall seeing it myself. But I'd be surprised that a tapesort as currently implemented would be faster than a quicksort if the tapesort is using just one byte less memory than the quicksort is.
But to Jeff Davis's question, yes, tapesort is not very sensitive to work_mem, and to the extent it is sensitive it is in the other direction of more memory being bad. Once work_mem is so small that it takes multiple passes over the data to do the merge, then small memory would really be a problem. But on modern hardware you have to get pretty absurd settings before that happens.
Cheers,
Jeff
Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.
>
> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
>
> If lookup_hash_entry doesn't find the group, and there's not enough
> memory to create it, then it returns NULL; but the caller wasn't
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
>
> New patch attached (rebased, as well).

I did a review today, using these two patches:

* memory-accounting-v9.patch (submitted on December 2)
* hashagg-disk-20141222.patch

I started with some basic performance measurements comparing hashagg queries without and with the patches (while you compared hashagg and sort). That's IMHO an interesting comparison, especially when no batching is necessary - in the optimal case the users should not see any slowdown (we shouldn't make them pay for the batching unless it actually is necessary).

So I did this:

    drop table if exists test_hash_agg;
    create table test_hash_agg as
        select i AS a, mod(i,1000000) AS b, mod(i,100000) AS c,
               mod(i,10000) AS d, mod(i,1000) AS e, mod(i,100) AS f
          from generate_series(1,10000000) s(i);
    vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different cardinalities.
And then queries like this:

    select count(*) from (select a, count(a)
                            from test_hash_agg group by a) foo;

    -- 10M groups (OOM)
    select count(*) from (select a, array_agg(a)
                            from test_hash_agg group by a) foo;

    -- 100 groups
    select count(*) from (select f, array_agg(f)
                            from test_hash_agg group by f) foo;

which performed quite well, i.e. I've seen absolutely no slowdown. Which, in the array_agg case, is quite suspicious, because on every lookup_hash_entry call, it has to do MemoryContextMemAllocated() on 10M contexts, and I really doubt that can be done in ~0 time.

So I started digging in the code and I noticed this:

    hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the hashtable itself. It might be correct for aggregates with state passed by value, but it's incorrect for state passed by reference (e.g. Numeric, arrays etc.), because initialize_aggregates does this:

    oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
    pergroupstate->transValue = datumCopy(peraggstate->initValue,
                                          peraggstate->transtypeByVal,
                                          peraggstate->transtypeLen);
    MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggregates that have no access to hashcontext at all, and only get a reference to aggcontext using AggCheckCallContext(). array_agg() is a prime example.

In those cases the patch actually does no memory accounting, and as hashcontext has no child contexts, there's no accounting overhead.

After fixing this bug (replacing hashcontext with aggcontext in both calls to MemoryContextMemAllocated) the performance drops dramatically.
For the query with 100 groups (not requiring any batching) I see this:

test=# explain analyze select count(x)
         from (select f, array_agg(1) AS x
                 from test_hash_agg group by f) foo;

                       QUERY PLAN (original patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=2539.156..2539.156 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=2492.264..2539.012 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 24kB  Disk Usage: 0kB
         ->  Seq Scan on test_hash_agg
               (cost=0.00..163693.71 rows=9999871 width=4)
               (actual time=0.022..547.379 rows=10000000 loops=1)
 Planning time: 0.039 ms
 Execution time: 2542.932 ms
(7 rows)

                        QUERY PLAN (fixed patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=5670.885..5670.885 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=5623.254..5670.803 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 117642kB  Disk Usage: 0kB
         ->  Seq Scan on test_hash_agg
               (cost=0.00..163693.71 rows=9999871 width=4)
               (actual time=0.014..577.924 rows=10000000 loops=1)
 Planning time: 0.103 ms
 Execution time: 5677.187 ms
(7 rows)

So the performance drops 2x. With more groups, the performance impact is even worse. For example with the first query (with 10M groups), this is what I get in perf:

explain analyze select count(x)
  from (select a, array_agg(1) AS x
          from test_hash_agg group by a) foo;

   PerfTop:    4671 irqs/sec  kernel:11.2%  exact:  0.0%
------------------------------------------------------------------------
    87.07%  postgres                 [.] MemoryContextMemAllocated
     1.63%  [zcommon]                [k] fletcher_4_native
     1.60%  [kernel]                 [k] acpi_processor_ffh_cstate_enter
     0.68%  [kernel]                 [k] xhci_irq
     0.30%  ld-2.19.so               [.] _dl_sysinfo_int80
     0.30%  [kernel]                 [k] memmove
     0.26%  [kernel]                 [k] ia32_syscall
     0.16%  libglib-2.0.so.0.3800.2  [.] 0x000000000008c52a
     0.15%  libpthread-2.19.so       [.] pthread_mutex_lock

and it runs indefinitely (I gave up after a few minutes).

I believe this renders the proposed memory accounting approach dead. Granted, the 10M groups is a bit extreme example, but the first query with 100 groups certainly is not.

I understand the effort to avoid the 2% regression measured by Robert on a PowerPC machine, but I don't think that's a sufficient reason to cause so much trouble to everyone using array_agg() or user-defined aggregates based on the same 'create subcontext' pattern. Especially when the reindex may often be improved by using more maintenance_work_mem, but there's nothing you can do to improve this (if it's not batching, then increasing work_mem will do nothing).

The array_agg() patch I submitted to this CF would fix this particular query, as it removes the child contexts (so there's no need for recursion in MemoryContextMemAllocated), but it does nothing for the user-defined aggregates out there. And it's not committed yet.

Also, ISTM this makes it rather unusable as a general accounting approach. If a mere 100 subcontexts results in a 2x slowdown, then a handful of subcontexts will certainly have a measurable impact if we decide to use this somewhere else (and not just in hash aggregate).

So I was curious how the accounting mechanism I proposed (a parallel MemoryAccounting hierarchy next to MemoryContext) would handle this, so I used it instead of memory-accounting-v9.patch. And I measured no difference compared to master (no slowdown at all).

I also tried array_agg() queries that actually require batching, e.g.

    select a, array_agg(1) x from test_hash_agg group by a;

which produces 10M groups, each using a separate 8kB context, which amounts to 80GB in total. With work_mem=1GB this should proceed just fine with ~80 batches.
In practice, it runs indefinitely (again, I lost patience after a few minutes), and I see this:

Device:  rrqm/s  wrqm/s  r/s   w/s     rkB/s  wkB/s      avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sda      0,00    0,00    0,00  281,00  0,00   140784,00  1002,02   143,25    512,57  0,00     512,57   3,56   100,00

tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ du -s
374128  .
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | wc -l
130
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | head
celkem 372568
-rw------- 1 tomas users 2999480  7. led 19.46 pgsql_tmp23267.2637
-rw------- 1 tomas users 2973520  7. led 19.46 pgsql_tmp23267.2638
-rw------- 1 tomas users 2978800  7. led 19.46 pgsql_tmp23267.2639
-rw------- 1 tomas users 2959880  7. led 19.46 pgsql_tmp23267.2640
-rw------- 1 tomas users 3010040  7. led 19.46 pgsql_tmp23267.2641
-rw------- 1 tomas users 3083520  7. led 19.46 pgsql_tmp23267.2642
-rw------- 1 tomas users 3053160  7. led 19.46 pgsql_tmp23267.2643
-rw------- 1 tomas users 3044360  7. led 19.46 pgsql_tmp23267.2644
-rw------- 1 tomas users 3014000  7. led 19.46 pgsql_tmp23267.2645

That is, there are ~130 files, each ~3MB large, ~370MB in total. But the system does ~140MB/s of writes all the time. Also, the table has ~500MB in total. So either I'm missing something, or there's some sort of bug.

Attached you can find a bunch of files:

* 0001-memory-accounting.patch (my memory accounting)
* 0002-hashagg-patch.patch (Jeff's patch, for completeness)
* 0003-context-fix.patch (fix hashcontext -> aggcontext)
* test.sql (data / queries I used for testing)

regards
Tomas
On Thu, Jan 8, 2015 at 4:07 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> which is IMHO obviously wrong, because that accounts only for the
> hashtable itself.
> In those cases the patch actually does no memory accounting and as
> hashcontext has no child contexts, there's no accounting overhead.
> [blah]
> So the performance drops 2x. With more groups, the performance impact is
> even worse.

Hmm. It seems that this patch is not there yet, so marking it as returned with feedback, especially given that the patch incorrectly uses hashcontext, as mentioned by Tomas.
--
Michael
On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
> So I started digging in the code and I noticed this:
>
>     hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
>
> which is IMHO obviously wrong, because that accounts only for the
> hashtable itself. It might be correct for aggregates with state passed
> by value, but it's incorrect for state passed by reference (e.g.
> Numeric, arrays etc.), because initialize_aggregates does this:
>
>     oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
>     pergroupstate->transValue = datumCopy(peraggstate->initValue,
>                                           peraggstate->transtypeByVal,
>                                           peraggstate->transtypeLen);
>     MemoryContextSwitchTo(oldContext);
>
> and it's also wrong for all the user-defined aggregates that have no
> access to hashcontext at all, and only get a reference to aggcontext
> using AggCheckCallContext(). array_agg() is a prime example.

Actually, I believe the first one is right, and the second one is wrong. If we allocate pass-by-ref transition states in aggcontext, that defeats the purpose of the patch, because we can't release the memory when we start a new batch (because we can't reset aggcontext).

Instead, the transition states should be copied into hashcontext; we should count the memory usage of hashcontext; AggCheckCallContext should return hashcontext; and after each batch we should reset hashcontext.

It might be worth refactoring a bit to call it the "groupcontext" or something instead, and use it for both HashAgg and GroupAgg. That would reduce the need for conditionals.
> And it's not committed yet.

Now that your patch is committed, I don't think I'd call the memory accounting patch I proposed "dead" yet. We decided that creating one context per group is essentially a bug, so we don't need to optimize for that case.

Your approach may be better, though.

Thank you for reviewing. I'll update my patches and resubmit for the next CF.

Regards,
Jeff Davis
On 22.2.2015 09:14, Jeff Davis wrote:
> On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
>> So I started digging in the code and I noticed this:
>>
>>     hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
>>
>> which is IMHO obviously wrong, because that accounts only for the
>> hashtable itself. It might be correct for aggregates with state passed
>> by value, but it's incorrect for state passed by reference (e.g.
>> Numeric, arrays etc.), because initialize_aggregates does this:
>>
>>     oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
>>     pergroupstate->transValue = datumCopy(peraggstate->initValue,
>>                                           peraggstate->transtypeByVal,
>>                                           peraggstate->transtypeLen);
>>     MemoryContextSwitchTo(oldContext);
>>
>> and it's also wrong for all the user-defined aggregates that have no
>> access to hashcontext at all, and only get a reference to aggcontext
>> using AggCheckCallContext(). array_agg() is a prime example.
>
> Actually, I believe the first one is right, and the second one is
> wrong. If we allocate pass-by-ref transition states in aggcontext,
> that defeats the purpose of the patch, because we can't release the
> memory when we start a new batch (because we can't reset aggcontext).
>
> Instead, the transition states should be copied into hashcontext; we
> should count the memory usage of hashcontext; AggCheckCallContext
> should return hashcontext; and after each batch we should reset
> hashcontext.

Hmmm, maybe. I'd however suggest not stuffing everything into a single memory context, but creating a 'transvalue' context next to hashcontext, and returning that one from AggCheckCallContext(). I.e. something like this:

    aggcontext
        hashcontext  - hash table etc.
        transcontext - new context for transition values

It's better to keep things separated, I guess.

> It might be worth refactoring a bit to call it the "groupcontext" or
> something instead, and use it for both HashAgg and GroupAgg. That
> would reduce the need for conditionals.
Not sure that's worth it. The patch already seems complex enough.

> [ ... problems with array_agg subcontexts ... ]
>
>> and it runs indefinitely (I gave up after a few minutes). I believe this
>> renders the proposed memory accounting approach dead.
>
> ...
>
>> The array_agg() patch I submitted to this CF would fix this
>> particular query, as it removes the child contexts (so there's no
>> need for recursion in MemoryContextMemAllocated), but it does
>> nothing to the user-defined aggregates out there. And it's not
>> committed yet.
>
> Now that your patch is committed, I don't think I'd call the memory
> accounting patch I proposed "dead" yet. We decided that creating one
> context per group is essentially a bug, so we don't need to optimize
> for that case.

Yes, but the question is how many other aggregates designed similarly to array_agg are out there (in extensions etc.), and whether we actually care about them. Those will be equally impacted by this, and we can't fix them. But maybe we don't really care, and it's the author's responsibility to test them on a new major version and fix the performance issue, just like we did with array_agg().

> Your approach may be better, though.

I'm not sure - it really boils down to the goal and assumptions. If we restrict ourselves to a HashAggregate-only solution, under the assumption that the number of child contexts is small (and leave the fix up to the author of the extension), then your approach is probably simpler and better than the approach I proposed.

If we're aiming for a general-purpose solution that might be used elsewhere, then I'm afraid we can't make the 'small number of child contexts' assumption, and the simple solution may not be the right one. Although I believe that we shouldn't be creating large numbers of 'parallel' memory contexts anyway.

> Thank you for reviewing. I'll update my patches and resubmit for the
> next CF.

Thanks for working on this.
--
Tomas Vondra                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services