Thread: Make MemoryContextMemAllocated() more precise
AllocSet allocates memory for itself in blocks, which double in size up to maxBlockSize. So the current block (the last one malloc'd) may represent half of the total memory allocated for the context itself. The free space at the end of that block hasn't been touched at all, and doesn't represent fragmentation or overhead. That means that the "allocated" memory can be 2X the memory ever touched, in the worst case.

Although that's technically correct, the purpose of MemoryContextMemAllocated() is to give a "real" usage number so we know when we're out of work_mem and need to spill (in particular, for the disk-based HashAgg work, but ideally other operators as well). This "real" number should include fragmentation, freed-and-not-reused chunks, and other overhead. But it should not include significant amounts of allocated-but-never-touched memory, which says more about economizing calls to malloc than it does about the algorithm's memory usage.

Attached is a patch that makes mem_allocated a method (rather than a field) of MemoryContext, and allows each memory context type to track its memory in its own way. They all do the same thing as before (increment/decrement a field), but AllocSet also subtracts out the free space in the current block. For Slab and Generation, we could do something similar, but it's not as much of a problem there because there's no doubling of the allocation size.

Although I think this still matches the word "allocation" in spirit, it's not technically correct, so feel free to suggest a new name for MemoryContextMemAllocated().

Regards,
	Jeff Davis
Attachment
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> Attached is a patch that makes mem_allocated a method (rather than a
> field) of MemoryContext, and allows each memory context type to track
> the memory its own way. They all do the same thing as before
> (increment/decrement a field), but AllocSet also subtracts out the free
> space in the current block. For Slab and Generation, we could do
> something similar, but it's not as much of a problem because there's no
> doubling of the allocation size.

Committed.

In an off-list discussion, Andres suggested that MemoryContextStats could be refactored to achieve this purpose, perhaps with flags to avoid walking through the blocks and freelists when those are not needed.

We discussed a few other names, such as "space", "active memory", and "touched". We didn't settle on any great name, but "touched" seemed to be the most descriptive.

This refactoring/renaming can be done later; right now I committed this to unblock disk-based Hash Aggregation, which is ready.

Regards,
	Jeff Davis
On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached is a patch that makes mem_allocated a method (rather than a
> field) of MemoryContext, and allows each memory context type to track
> the memory its own way. They all do the same thing as before
> (increment/decrement a field), but AllocSet also subtracts out the free
> space in the current block. For Slab and Generation, we could do
> something similar, but it's not as much of a problem because there's no
> doubling of the allocation size.
>
> Although I think this still matches the word "allocation" in spirit,
> it's not technically correct, so feel free to suggest a new name for
> MemoryContextMemAllocated().

Procedurally, I think that it is highly inappropriate to submit a patch two weeks after the start of the final CommitFest and then commit it just over 48 hours later without a single endorsement of the change from anyone.

Substantively, I think that whether or not this is an improvement depends considerably on how your OS handles overcommit. I do not have enough knowledge to know whether it will be better in general, but would welcome opinions from others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 19, 2020 at 11:44:05AM -0400, Robert Haas wrote:
>On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Attached is a patch that makes mem_allocated a method (rather than a
>> field) of MemoryContext, and allows each memory context type to track
>> the memory its own way. They all do the same thing as before
>> (increment/decrement a field), but AllocSet also subtracts out the free
>> space in the current block. For Slab and Generation, we could do
>> something similar, but it's not as much of a problem because there's no
>> doubling of the allocation size.
>>
>> Although I think this still matches the word "allocation" in spirit,
>> it's not technically correct, so feel free to suggest a new name for
>> MemoryContextMemAllocated().
>
>Procedurally, I think that it is highly inappropriate to submit a
>patch two weeks after the start of the final CommitFest and then
>commit it just over 48 hours later without a single endorsement of the
>change from anyone.
>

True.

>Substantively, I think that whether or not this is an improvement depends
>considerably on how your OS handles overcommit. I do not have enough
>knowledge to know whether it will be better in general, but would
>welcome opinions from others.
>

Not sure overcommit is a major factor, and if it is, then maybe it's the strategy of doubling block size that's causing problems.

AFAICS the 2x allocation is the worst case, because it only happens right after allocating a new block (of twice the size), when the "utilization" drops from 100% to 50%. But in practice the utilization will be somewhere in between, with an average of 75%. And we're not doubling the block size indefinitely - there's an upper limit, so over time the utilization drops less and less. So as the contexts grow, the discrepancy disappears. And I'd argue the smaller the context, the less of an issue the overcommit behavior is.
My understanding is that this is really just an accounting issue, where allocating a block would get us over the limit, which I suppose might be an issue with low work_mem values.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:
> AFAICS the 2x allocation is the worst case, because it only happens
> right after allocating a new block (of twice the size), when the
> "utilization" drops from 100% to 50%. But in practice the utilization
> will be somewhere in between, with an average of 75%.

Sort of. Hash Agg is constantly watching the memory, so it will typically spill right at the point where the accounting for that memory context is off by 2X.

That's mitigated because the hash table itself (the array of TupleHashEntryData) ends up allocated as its own block, so does not have any waste. The total (table mem + out of line) might be close to right if the hash table array itself is a large fraction of the data, but I don't think that's what we want.

> And we're not
> doubling the block size indefinitely - there's an upper limit, so over
> time the utilization drops less and less. So as the contexts grow, the
> discrepancy disappears. And I'd argue the smaller the context, the less
> of an issue the overcommit behavior is.

The problem is that the default work_mem is 4MB, and the doubling behavior goes to 8MB, so it's a problem with default settings.

Regards,
	Jeff Davis
On Thu, Mar 19, 2020 at 2:11 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> My understanding is that this is really just an accounting issue, where
> allocating a block would get us over the limit, which I suppose might be
> an issue with low work_mem values.

Well, the issue is, if I understand correctly, that this means that MemoryContextStats() might now report a smaller amount of memory than what we actually allocated from the operating system. That seems like it might lead someone trying to figure out where a backend is leaking memory to erroneous conclusions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2020-03-19 at 11:44 -0400, Robert Haas wrote:
> Procedurally, I think that it is highly inappropriate to submit a
> patch two weeks after the start of the final CommitFest and then
> commit it just over 48 hours later without a single endorsement of the
> change from anyone.

Reverted.

Sorry, I misjudged this as a "supporting fix for a specific problem", but it seems others feel it requires discussion.

> Substantively, I think that whether or not this is an improvement
> depends considerably on how your OS handles overcommit. I do not have
> enough knowledge to know whether it will be better in general, but
> would welcome opinions from others.

I think omitting the tail of the current block is an unqualified improvement for the purpose of obeying work_mem, regardless of the OS. The block sizes keep doubling up to 8MB, and it doesn't make a lot of sense to count that last large mostly-empty block against work_mem.

Regards,
	Jeff Davis
On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I think omitting the tail of the current block is an unqualified
> improvement for the purpose of obeying work_mem, regardless of the OS.
> The block sizes keep doubling up to 8MB, and it doesn't make a lot of
> sense to count that last large mostly-empty block against work_mem.

Well, again, my point is that whether or not it counts depends on your system's overcommit behavior. Depending on how you have that configured, or what your OS likes to do, it may in reality count or not count. Or so I believe. Am I wrong?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 19, 2020 at 12:25:16PM -0700, Jeff Davis wrote:
>On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:
>> AFAICS the 2x allocation is the worst case, because it only happens
>> right after allocating a new block (of twice the size), when the
>> "utilization" drops from 100% to 50%. But in practice the utilization
>> will be somewhere in between, with an average of 75%.
>
>Sort of. Hash Agg is constantly watching the memory, so it will
>typically spill right at the point where the accounting for that memory
>context is off by 2X.
>
>That's mitigated because the hash table itself (the array of
>TupleHashEntryData) ends up allocated as its own block, so does not
>have any waste. The total (table mem + out of line) might be close to
>right if the hash table array itself is a large fraction of the data,
>but I don't think that's what we want.
>
>> And we're not
>> doubling the block size indefinitely - there's an upper limit, so over
>> time the utilization drops less and less. So as the contexts grow, the
>> discrepancy disappears. And I'd argue the smaller the context, the less
>> of an issue the overcommit behavior is.
>
>The problem is that the default work_mem is 4MB, and the doubling
>behavior goes to 8MB, so it's a problem with default settings.
>

Yes, it's an issue for the accuracy of our accounting. What Robert was talking about is overcommit behavior at the OS level, which I'm arguing is unlikely to be an issue, because for low work_mem values the absolute difference is small, and on large work_mem values it's limited by the block size limit.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:
> Well, the issue is, if I understand correctly, that this means that
> MemoryContextStats() might now report a smaller amount of memory than
> what we actually allocated from the operating system. That seems like
> it might lead someone trying to figure out where a backend is leaking
> memory to erroneous conclusions.

MemoryContextStats() was unaffected by my now-reverted change.

Regards,
	Jeff Davis
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> Although that's technically correct, the purpose of
> MemoryContextMemAllocated() is to give a "real" usage number so we know
> when we're out of work_mem and need to spill (in particular, the
> disk-based HashAgg work, but ideally other operators as well). This
> "real" number should include fragmentation, freed-and-not-reused
> chunks, and other overhead. But it should not include significant
> amounts of allocated-but-never-touched memory, which says more about
> economizing calls to malloc than it does about the algorithm's memory
> usage.

To expand on this point: work_mem is there to keep executor algorithms somewhat constrained in the memory that they use. With that in mind, it should reflect things that the algorithm has some control over, and that can be measured cheaply.

Therefore, we shouldn't include huge nearly-empty blocks of memory that the system decided to allocate in response to a request for a small chunk (the algorithm has little control). Nor should we try to walk a list of blocks or free chunks (expensive).

We should include used memory, reasonable overhead (chunk headers, alignment, etc.), and probably free chunks (which represent fragmentation).

For AllocSet, the notion of "all touched memory", which is everything except the current block's tail, seems to fit the requirements well.

Regards,
	Jeff Davis
On Thu, Mar 19, 2020 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:
> > Well, the issue is, if I understand correctly, that this means that
> > MemoryContextStats() might now report a smaller amount of memory than
> > what we actually allocated from the operating system. That seems like
> > it might lead someone trying to figure out where a backend is leaking
> > memory to erroneous conclusions.
>
> MemoryContextStats() was unaffected by my now-reverted change.

Oh. Well, that addresses my concern, then. If this only affects the accounting for memory-bounded hash aggregation and nothing else is going to be touched, including MemoryContextStats(), then it's not an issue for me. Other people may have different concerns, but that was the only thing that was bothering me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2020-03-19 at 15:33 -0400, Robert Haas wrote:
> On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I think omitting the tail of the current block is an unqualified
> > improvement for the purpose of obeying work_mem, regardless of the OS.
> > The block sizes keep doubling up to 8MB, and it doesn't make a lot of
> > sense to count that last large mostly-empty block against work_mem.
>
> Well, again, my point is that whether or not it counts depends on your
> system's overcommit behavior. Depending on how you have that
> configured, or what your OS likes to do, it may in reality count or
> not count. Or so I believe. Am I wrong?

I don't believe it should be counted for the purposes of work_mem. Let's say that the OS eagerly allocates it; then what is the algorithm supposed to do in response? It can either:

1. just accept that all of the space is used, even though it's potentially as low as 50% used, and spill earlier than may be necessary; or
2. find a way to measure the free space, and somehow predict whether that space will be reused the next time a group is added to the hash table.

It just doesn't seem reasonable to me for the algorithm to change its behavior based on these large block allocations. It may be valuable information for other purposes (like tuning your OS, or tracking down memory issues), though.

Regards,
	Jeff Davis
On Thu, 2020-03-19 at 16:04 -0400, Robert Haas wrote:
> Other people may have different concerns, but that was the only thing
> that was bothering me.

OK, thank you for raising it. Perhaps we can re-fix the issue for HashAgg if necessary, or I can tweak some accounting things within HashAgg itself, or both.

Regards,
	Jeff Davis
On Wed, 2020-03-18 at 15:41 -0700, Jeff Davis wrote:
> In an off-list discussion, Andres suggested that MemoryContextStats
> could be refactored to achieve this purpose, perhaps with flags to
> avoid walking through the blocks and freelists when those are not
> needed.

Attached refactoring patch. There's enough in here that warrants discussion that I don't think this makes sense for v13, and I'm adding it to the July commitfest.

I still think we should do something for v13, such as the originally-proposed patch[1]. It's not critical, but it simply reports a better number for memory consumption. Currently, the memory usage appears to jump, often right past work_mem (by a reasonable but noticeable amount), which could be confusing.

Regarding the attached patch (target v14):

* there's a new MemoryContextCount() that simply calculates the statistics without printing anything, and returns a struct
  - it supports flags to indicate which stats should be calculated, so that some callers can avoid walking through blocks/freelists
* it adds a new statistic for "new space" (i.e. untouched)
* it eliminates specialization of the memory context printing
  - the only specialization was for generation.c to output the number of chunks, which can be done easily enough for the other types, too

Regards,
	Jeff Davis

[1] https://postgr.es/m/ec63d70b668818255486a83ffadc3aec492c1f57.camel%40j-davis.com
Attachment
Hi,

On 2020-03-27 17:21:10 -0700, Jeff Davis wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

IDK, adding a commit to v13 that we know we should do architecturally differently in v14, when the difference in complexity between the two patches isn't actually *that* big... I'd like to see others jump in here...

> I still think we should do something for v13, such as the originally-
> proposed patch[1]. It's not critical, but it simply reports a better
> number for memory consumption. Currently, the memory usage appears to
> jump, often right past work_mem (by a reasonable but noticable amount),
> which could be confusing.

Is that really a significant issue for most work_mem sizes? Shouldn't the way we increase sizes lead to the max difference between the measurements being somewhat limited?

> * there's a new MemoryContextCount() that simply calculates the
>   statistics without printing anything, and returns a struct
>   - it supports flags to indicate which stats should be
>     calculated, so that some callers can avoid walking through
>     blocks/freelists
> * it adds a new statistic for "new space" (i.e. untouched)
> * it eliminates specialization of the memory context printing
>   - the only specialization was for generation.c to output the
>     number of chunks, which can be done easily enough for the
>     other types, too

That sounds like a good direction.
> +	if (flags & MCXT_STAT_NBLOCKS)
> +		counters.nblocks = nblocks;
> +	if (flags & MCXT_STAT_NCHUNKS)
> +		counters.nchunks = set->nChunks;
> +	if (flags & MCXT_STAT_FREECHUNKS)
> +		counters.freechunks = freechunks;
> +	if (flags & MCXT_STAT_TOTALSPACE)
> +		counters.totalspace = set->memAllocated;
> +	if (flags & MCXT_STAT_FREESPACE)
> +		counters.freespace = freespace;
> +	if (flags & MCXT_STAT_NEWSPACE)
> +		counters.newspace = set->blocks->endptr - set->blocks->freeptr;

I'd spec it so that context implementations are allowed to unconditionally fill fields, even when the flag isn't specified. The branches quoted don't buy us anything...

> diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
> index c9f2bbcb367..cc545852968 100644
> --- a/src/include/nodes/memnodes.h
> +++ b/src/include/nodes/memnodes.h
> @@ -29,11 +29,21 @@
>  typedef struct MemoryContextCounters
>  {
>  	Size		nblocks;		/* Total number of malloc blocks */
> +	Size		nchunks;		/* Total number of chunks (used+free) */
>  	Size		freechunks;		/* Total number of free chunks */
>  	Size		totalspace;		/* Total bytes requested from malloc */
>  	Size		freespace;		/* The unused portion of totalspace */
> +	Size		newspace;		/* Allocated but never held any chunks */

I'd add some reasoning as to why this is useful.

>  } MemoryContextCounters;
>
> +#define MCXT_STAT_NBLOCKS		(1 << 0)
> +#define MCXT_STAT_NCHUNKS		(1 << 1)
> +#define MCXT_STAT_FREECHUNKS	(1 << 2)
> +#define MCXT_STAT_TOTALSPACE	(1 << 3)
> +#define MCXT_STAT_FREESPACE		(1 << 4)
> +#define MCXT_STAT_NEWSPACE		(1 << 5)

s/MCXT_STAT/MCXT_STAT_NEED/?

> +#define MCXT_STAT_ALL			((1 << 6) - 1)

Hm, why not go for ~0 or such?

Greetings,

Andres Freund
On Sat, 28 Mar 2020 at 13:21, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

I had a read over this too. I noted down the following during my pass of it.

1. The comment mentions "passthru", but you've removed that parameter.

 * For now, the passthru pointer just points to "int level"; later we might
 * make that more complicated.
 */
 static void
-MemoryContextStatsPrint(MemoryContext context, void *passthru,
+MemoryContextStatsPrint(MemoryContext context, int level,
 						const char *stats_string)

2. I don't think MemoryContextCount is the best name for this function. When I saw:

    counters = MemoryContextCount(aggstate->hash_metacxt, flags, true);

I assumed it was returning some integer number, that is, until I looked at the "counters" datatype. Maybe it would be better to name the function MemoryContextGetTelemetry and the struct MemoryContextTelemetry rather than MemoryContextCounters? Or maybe MemoryContextTally and call the function either MemoryContextGetTelemetry() or MemoryContextGetTally(). Or perhaps MemoryContextGetAccounting() and the struct MemoryContextAccounting.

3. I feel like it would be nicer if you didn't change the "count" methods to return a MemoryContextCounters. It means you may need to zero a struct for each level, assign the values, then add them to the total. If you were just to zero the struct in MemoryContextCount() then pass it into the count function, then you could just have it do all the += work. It would reduce the code in MemoryContextCount() too.

4. Do you think it would be better to have two separate functions for MemoryContextCount(): a recursive version and a non-recursive version? I feel that the function should be so simple that it does not make a great deal of sense to load it up to handle both cases.
Looking around mcxt.c, I see MemoryContextResetOnly() and MemoryContextResetChildren(); while that's not a perfect example, it does seem like a better lead to follow.

5. For performance testing, I tried using the following table with 1MB work_mem then again with 1GB work_mem. I wondered if making the accounting more complex would cause a slowdown in nodeAgg.c, as I see we're calling this function each time we get a tuple that belongs in a new group. The 1MB test is likely not such a great way to measure this since we do spill to disk in that case and the change in accounting means we do that at a slightly different time, but the 1GB test does not spill and it's a bit slower.

create table t (a int);
insert into t select x from generate_Series(1,10000000) x;
analyze t;

-- Unpatched
set work_mem = '1MB';
explain analyze select a from t group by a; -- Ran 3 times.

                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=481750.10..659875.95 rows=10000048 width=4) (actual time=3350.190..8193.400 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 32
   Peak Memory Usage: 1177 kB
   Disk Usage: 234920 kB
   HashAgg Batches: 1188
   ->  Seq Scan on t  (cost=0.00..144248.48 rows=10000048 width=4) (actual time=0.013..1013.755 rows=10000000 loops=1)
 Planning Time: 0.131 ms
 Execution Time: 8586.420 ms
 Execution Time: 8446.961 ms
 Execution Time: 8449.492 ms
(9 rows)

-- Patched
set work_mem = '1MB';
explain analyze select a from t group by a;

                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=481748.00..659873.00 rows=10000000 width=4) (actual time=3470.107..8598.836 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 32
   Peak Memory Usage: 1033 kB
   Disk Usage: 234816 kB
   HashAgg Batches: 1056
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.017..1091.820 rows=10000000 loops=1)
 Planning Time: 0.285 ms
 Execution Time: 8996.824 ms
 Execution Time: 8781.624 ms
 Execution Time: 8900.324 ms
(9 rows)

-- Unpatched
set work_mem = '1GB';
explain analyze select a from t group by a;

                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169248.00..269248.00 rows=10000000 width=4) (actual time=4537.779..7033.318 rows=10000000 loops=1)
   Group Key: a
   Peak Memory Usage: 868369 kB
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.018..820.136 rows=10000000 loops=1)
 Planning Time: 0.054 ms
 Execution Time: 7561.063 ms
 Execution Time: 7573.985 ms
 Execution Time: 7572.058 ms
(6 rows)

-- Patched
set work_mem = '1GB';
explain analyze select a from t group by a;

                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169248.00..269248.00 rows=10000000 width=4) (actual time=4840.045..7359.970 rows=10000000 loops=1)
   Group Key: a
   Peak Memory Usage: 861975 kB
   ->  Seq Scan on t  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.018..789.975 rows=10000000 loops=1)
 Planning Time: 0.055 ms
 Execution Time: 7904.069 ms
 Execution Time: 7913.692 ms
 Execution Time: 7927.061 ms
(6 rows)

Perhaps the slowdown is unrelated. If it is, then maybe the reduction in branching mentioned by Andres might help a bit, plus maybe a bit from what I mentioned above about passing in the counter struct instead of returning it at each level.

David
On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:
> 1. The comment mentions "passthru", but you've removed that
> parameter.

Fixed, thank you.

> 2. I don't think MemoryContextCount is the best name for this
> function. When I saw:

I've gone back and forth on naming a bit. The right name, in my opinion, is MemoryContextStats(), but that's taken by something that should be called MemoryContextReport(). But I didn't want to change that, as it would probably annoy a lot of people who are used to calling MemoryContextStats() from gdb.

I changed the new function to be called MemoryContextGetCounters(), which is more directly what it's doing. "Telemetry" makes me think more of a stream of information rather than a particular point in time.

> 3. I feel like it would be nicer if you didn't change the "count"
> methods to return a MemoryContextCounters. It means you may need to
> zero a struct for each level, assign the values, then add them to the
> total. If you were just to zero the struct in MemoryContextCount()
> then pass it into the count function, then you could just have it do
> all the += work. It would reduce the code in MemoryContextCount()
> too.

I changed it to use a pointer out parameter, but I don't think an in/out parameter is quite right there. MemoryContextStats() ends up using both the per-context counters as well as the totals, so it's not helpful to return just the totals.

> 4. Do you think it would be better to have two separate functions for
> MemoryContextCount(), a recursive version and a non-recursive
> version.

I could, but right now the only caller passes recurse=true, so I'm not really eliminating any code in that path by specializing the functions. Are you thinking about performance, or do you just think it would be better to have two entry points?

> 5. For performance testing, I tried using the following table with 1MB
> work_mem then again with 1GB work_mem. I wondered if making the
> accounting more complex would cause a slowdown in nodeAgg.c, as I see
> we're calling this function each time we get a tuple that belongs in a
> new group. The 1MB test is likely not such a great way to measure this
> since we do spill to disk in that case and the changing in accounting
> means we do that at a slightly different time, but the 1GB test does
> not spill and it's a bit slower.

I think this was because I was previously returning a struct. By just passing a pointer as an out param, it seems to have mitigated it, but not completely eliminated the cost (< 2% in my tests).

I was using an OFFSET 100000000 instead of EXPLAIN ANALYZE in my test measurements because it was less noisy, and I focused on the 1GB test for the reason you mention.

I also addressed Andres's comments:

* changed the name of the flags from MCXT_STAT to MCXT_COUNT
* changed ((1<<6)-1) to ~0
* removed unnecessary branches from the GetCounters method
* expanded some comments

Regards,
	Jeff Davis
Attachment
On Sun, 2020-04-05 at 16:48 -0700, Andres Freund wrote:
> > I still think we should do something for v13, such as the originally-
> > proposed patch[1]. It's not critical, but it simply reports a better
> > number for memory consumption. Currently, the memory usage appears to
> > jump, often right past work_mem (by a reasonable but noticable
> > amount), which could be confusing.
>
> Is that really a significant issue for most work mem sizes? Shouldn't
> the way we increase sizes lead to the max difference between the
> measurements to be somewhat limited?

For work_mem less than 16MB, it's essentially spilling when the table context is about 75% of what it could be (as bad as 50% and as good as 100%, depending on a number of factors). That's not terrible, but it is significant.

It also means that the reported memory peak jumps up rather than going up gradually, so it ends up surpassing work_mem (e.g. 4MB of work_mem might show a memory peak of 5MB). So it's a weird combination of under-utilizing and over-reporting.

> I'd spec it so that context implementations are allowed to
> unconditionally fill fields, even when the flag isn't specified. The
> branches quoted don't buy us anyting...

Done.

> I'd add some reasoning as to why this is useful.

Done.

> s/MCXT_STAT/MCXT_STAT_NEED/?

I changed to MCXT_COUNT_. MCXT_STAT_NEED seemed slightly verbose to me.

> > +#define MCXT_STAT_ALL ((1 << 6) - 1)
>
> Hm, why not go for ~0 or such?

Done.

Regards,
	Jeff Davis
On Tue, 7 Apr 2020 at 14:26, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:
> > 4. Do you think it would be better to have two separate functions for
> > MemoryContextCount(), a recursive version and a non-recursive
> > version.
>
> I could, but right now the only caller passes recurse=true, so I'm not
> really eliminating any code in that path by specializing the functions.
> Are you thinking about performance or you just think it would be better
> to have two entry points?

I was thinking in terms of both performance, by eliminating a branch, but mostly it was down to ease of code reading. I thought it was easier to read:

MemoryContextGetCountersRecurse(&counters);

than

MemoryContextGetCounters(&counters, true);

since I might need to go see what "true" is for. The non-recursive version, if we decide we need one, would likely just be a one-line body function calling the implementation's getcounter method.

David
On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:
> AllocSet allocates memory for itself in blocks, which double in size up
> to maxBlockSize. So, the current block (the last one malloc'd) may
> represent half of the total memory allocated for the context itself.

Narrower approach that doesn't touch memory context internals: if the blocks double up in size to maxBlockSize, why not just create the memory context with a smaller maxBlockSize?

I had originally dismissed this as a hack that could slow down some workloads when work_mem is large. But we can simply make it proportional to work_mem, which makes a lot of sense for an operator like HashAgg that controls its memory usage. It can allocate in blocks large enough that we don't call malloc() too often when work_mem is large, but small enough that we don't overrun work_mem when work_mem is small.

I have attached a patch to do this only for HashAgg, using a new entry point in execUtils.c called CreateWorkExprContext(). It sets maxBlockSize to 1/16th of work_mem (rounded down to a power of two), with a minimum of initBlockSize.

This could be a good general solution for other operators as well, but that requires a bit more investigation, so I'll leave that for v14. The attached patch is narrow and solves the problem for HashAgg nicely without interfering with anything else, so I plan to commit it soon for v13.

Regards,
	Jeff Davis
Attachment
> On 8 Apr 2020, at 02:21, Jeff Davis <pgsql@j-davis.com> wrote:
> The attached patch is narrow and solves the problem for HashAgg nicely
> without interfering with anything else, so I plan to commit it soon for
> v13.

If I read this thread correctly, there is nothing outstanding here for v14 from this patch? I've marked the entry committed in the July commitfest; feel free to change that in case I misunderstood.

cheers ./daniel