Thread: Memory context dump considered harmful
Hi all!

We just had a case of a very long-running process of ours that creates a lot of prepared statements through Perl's DBD::Pg running into: https://rt.cpan.org/Public/Bug/Display.html?id=88827

This resulted in millions of prepared statements being created, but not removed, in the affected backends over the course of 1-2 hours, until the backends in question ran out of memory. The "out of memory" condition resulted in one memory context dump generated per occurrence, each consisting of >2M lines(!) (basically a line of CachedPlan/CachedPlanSource per statement/function). In the 20 minutes or so it took monitoring to alert and operations to react, this caused a follow-up incident, because repeated "out of memory" conditions caused over 400M(!!) log lines amounting to some 15GB of data, running the log partition dangerously close to full.

An example memory context dump looks like this:

TopMemoryContext: 582728880 total in 71126 blocks; 6168 free (52 chunks); 582722712 used
TopTransactionContext: 8192 total in 1 blocks; 6096 free (1 chunks); 2096 used
ExecutorState: 8192 total in 1 blocks; 5392 free (0 chunks); 2800 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
SPI Exec: 0 total in 0 blocks; 0 free (0 chunks); 0 used
SPI Proc: 8192 total in 1 blocks; 5416 free (0 chunks); 2776 used
PL/pgSQL function context: 8192 total in 1 blocks; 1152 free (1 chunks); 7040 used
PL/pgSQL function context: 24576 total in 2 blocks; 11400 free (1 chunks); 13176 used
Type information cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
PL/pgSQL function context: 8192 total in 1 blocks; 1120 free (1 chunks); 7072 used
PL/pgSQL function context: 24576 total in 2 blocks; 10984 free (1 chunks); 13592 used
PL/pgSQL function context: 57344 total in 3 blocks; 29928 free (2 chunks); 27416 used
PL/pgSQL function context: 57344 total in 3 blocks; 28808 free (2 chunks); 28536 used
PL/pgSQL function context: 24576 total in 2 blocks; 5944 free (3 chunks); 18632 used
RI compare cache: 24576 total in 2 blocks; 15984 free (5 chunks); 8592 used
RI query cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
PL/pgSQL function context: 57344 total in 3 blocks; 31832 free (2 chunks); 25512 used
PL/pgSQL function context: 57344 total in 3 blocks; 29600 free (2 chunks); 27744 used
PL/pgSQL function context: 57344 total in 3 blocks; 39688 free (5 chunks); 17656 used
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
Rendezvous variable hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PLpgSQL function cache: 24520 total in 2 blocks; 3744 free (0 chunks); 20776 used
Prepared Queries: 125886512 total in 25 blocks; 4764208 free (91 chunks); 121122304 used
TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
MessageContext: 8192 total in 1 blocks; 6976 free (0 chunks); 1216 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
smgr relation table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880 used
TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
ExecutorState: 57344 total in 3 blocks; 21856 free (2 chunks); 35488 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (1 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 12832 free (3 chunks); 11744 used
CacheMemoryContext: 42236592 total in 28 blocks; 7160904 free (298 chunks); 35075688 used
CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
CachedPlan: 7168 total in 3 blocks; 2248 free (0 chunks); 4920 used
CachedPlanSource: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used
CachedPlanSource: 7168 total in 3 blocks; 1888 free (0 chunks); 5280 used
CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used
CachedPlanSource: 7168 total in 3 blocks; 1888 free (0 chunks); 5280 used
CachedPlan: 7168 total in 3 blocks; 2248 free (0 chunks); 4920 used
CachedPlanSource: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used
... goes on for some 2M lines ...

Given that an "out of memory" situation is already bad enough, essentially mounting a DoS attack on the rest of the system by creating excessive amounts of logs feels even worse. I wonder if we should have a default of capping the dump to say 1k lines or such and only optionally do a full one. The other option would be to find an alternative way of displaying the information, like: if there is a large number of repeated entries, just show the first X of them and say "123456 similar ones".

This is 9.1.14 on Debian Wheezy/amd64 fwiw - but I don't think we have made relevant changes in more recent versions.

regards

Stefan
On 08/20/2015 08:51 AM, Stefan Kaltenbrunner wrote:
> This is 9.1.14 on Debian Wheezy/amd64 fwiw - but I don't think we have
> made relevant changes in more recent versions.

It seems we may also want to consider a way to drop those prepared queries after a period of time of non-use.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I wonder if we should have a default of capping the dump to say 1k lines
> or such and only optionally do a full one.

-1.  It's worked like this for the last fifteen years or thereabouts, and you're the first one to complain.  I suspect some weirdness in your logging setup, rather than any systemic problem that we need to lobotomize our diagnostic output in order to prevent.

(The reason I say "lobotomize" is that there's no particularly good reason to assume that the first N lines will tell you what you need to know.  And the filter rule would have to be *very* stupid, because we can't risk trying to allocate any additional memory to track what we're doing here.)

			regards, tom lane
On 08/20/2015 06:09 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I wonder if we should have a default of capping the dump to say 1k lines
>> or such and only optionally do a full one.
>
> -1.  It's worked like this for the last fifteen years or thereabouts,
> and you're the first one to complain.  I suspect some weirdness in
> your logging setup, rather than any systemic problem that we
> need to lobotomize our diagnostic output in order to prevent.

not sure what you consider weird in the logging setup here - the context dump is imho borderline internal diagnostic output at a debug level (rather than making sense to an average sysadmin) already, and there is no way to control it. But having (like in our case) the backend dumping 2 million basically identical lines into a general logfile per event seems excessive and rather abusive towards the rest of the system (just from an IO perspective, for example, or from a logfile post-processing tool perspective).

> (The reason I say "lobotomize" is that there's no particularly good
> reason to assume that the first N lines will tell you what you need
> to know.  And the filter rule would have to be *very* stupid, because
> we can't risk trying to allocate any additional memory to track what
> we're doing here.)

I do understand that there might be challenges there, but in the last 15 years machines got way faster and pg got way more capable, and some of those capabilities might need to get revisited in that regard - and while it is very nice that pg survives multiple oom cases pretty nicely, I don't think it is entitled to put an imho unreasonable burden on the rest of the system by writing "insane" amounts of data.

Just from a sysadmin perspective it also means that it is trivial for a misbehaving app to fill up the logfile on a system, because unlike almost all other logging output there seems to be no way to control/disable it on a per role/database level.


Stefan
Hi,

On 08/20/2015 11:04 PM, Stefan Kaltenbrunner wrote:
> On 08/20/2015 06:09 PM, Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> I wonder if we should have a default of capping the dump to say 1k lines
>>> or such and only optionally do a full one.
>>
>> -1.  It's worked like this for the last fifteen years or
>> thereabouts, and you're the first one to complain.  I suspect some
>> weirdness in your logging setup, rather than any systemic problem
>> that we need to lobotomize our diagnostic output in order to
>> prevent.
>
> not sure what you consider weird in the logging setup here - the
> context dump is imho borderline internal diagnostic output at a
> debug level (rather than making sense to an average sysadmin) already,
> and there is no way to control it. But having (like in our case) the
> backend dumping 2 million basically identical lines into a general
> logfile per event seems excessive and rather abusive towards the rest
> of the system (just from an IO perspective, for example, or from a
> logfile post-processing tool perspective).

I've seen similar issues too, on systems with perfectly common logging setups. So even if your logging setup were strange, it's not the main factor here.

>> (The reason I say "lobotomize" is that there's no particularly
>> good reason to assume that the first N lines will tell you what you
>> need to know.  And the filter rule would have to be *very* stupid,
>> because we can't risk trying to allocate any additional memory to
>> track what we're doing here.)
>
> I do understand that there might be challenges there, but in the last
> 15 years machines got way faster and pg got way more capable, and some
> of those capabilities might need to get revisited in that regard - and
> while it is very nice that pg survives multiple oom cases pretty
> nicely, I don't think it is entitled to put an imho unreasonable burden
> on the rest of the system by writing "insane" amounts of data.
>
> Just from a sysadmin perspective it also means that it is trivial for
> a misbehaving app to fill up the logfile on a system, because unlike
> almost all other logging output there seems to be no way
> to control/disable it on a per role/database level.

IMHO the first thing we might do is provide a log_memory_stats GUC controlling that. I'm not a big fan of adding a GUC for everything, but in this case it seems appropriate, just like the other log_ options.

I also don't think logging just a subset of the stats is a lost cause. Sure, we can't know which of the lines are important, but for example logging just the top-level contexts with a summary of the child contexts would be OK. So something like this:

TopMemoryContext: 582728880 total in 71126 blocks; 6168 free ...
TopTransactionContext: 8192 total in 1 blocks; 6096 free (1 ...
  3 child contexts, 10 blocks, 34873 free
PL/pgSQL function context: 8192 total in 1 blocks; 1152 fre ...
PL/pgSQL function context: 24576 total in 2 blocks; 11400 f ...
Type information cache: 24576 total in 2 blocks; 11888 free ...
PL/pgSQL function context: 8192 total in 1 blocks; 1120 fre ...
PL/pgSQL function context: 24576 total in 2 blocks; 10984 f ...
PL/pgSQL function context: 57344 total in 3 blocks; 29928 f ...
PL/pgSQL function context: 57344 total in 3 blocks; 28808 f ...
PL/pgSQL function context: 24576 total in 2 blocks; 5944 fr ...
RI compare cache: 24576 total in 2 blocks; 15984 free (5 ch ...
RI query cache: 24576 total in 2 blocks; 11888 free (5 chun ...
PL/pgSQL function context: 57344 total in 3 blocks; 31832 f ...
PL/pgSQL function context: 57344 total in 3 blocks; 29600 f ...
PL/pgSQL function context: 57344 total in 3 blocks; 39688 f ...
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 65 ...
Rendezvous variable hash: 8192 total in 1 blocks; 1680 free ...
PLpgSQL function cache: 24520 total in 2 blocks; 3744 free ...
Prepared Queries: 125886512 total in 25 blocks; 4764208 fre ...
TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chun ...
Operator lookup cache: 24576 total in 2 blocks; 11888 free ...
MessageContext: 8192 total in 1 blocks; 6976 free (0 chunks ...
Operator class cache: 8192 total in 1 blocks; 1680 free (0 ...
smgr relation table: 24576 total in 2 blocks; 5696 free (4 ...
TransactionAbortContext: 32768 total in 1 blocks; 32736 fre ...
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); ...
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); ...
  4 child contexts, 34 blocks, 87283 free
Relcache by OID: 24576 total in 2 blocks; 12832 free (3 chu ...
CacheMemoryContext: 42236592 total in 28 blocks; 7160904 fr ...
  1487438 child contexts, 14874382 blocks, 3498349 free

instead of the millions of lines with all the details.

So either log_memory_stats = (on|off) or log_memory_stats = (full|short|off) might work.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
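[To make the log_memory_stats idea above concrete, here is a rough sketch of what such an enum GUC could look like as a fragment of src/backend/utils/misc/guc.c. Nothing like this exists in the tree; the LOG_MEMORY_STATS_* enum names, the variable name, the PGC_SUSET context and the LOGGING_WHAT grouping are illustrative assumptions only.]

/*
 * Hypothetical sketch of a log_memory_stats = (full|short|off) GUC,
 * written as it might appear inside guc.c (which already includes the
 * headers providing config_enum_entry, PGC_SUSET, LOGGING_WHAT, etc.).
 */
typedef enum
{
	LOG_MEMORY_STATS_OFF,		/* suppress the context dump entirely */
	LOG_MEMORY_STATS_SHORT,		/* top-level contexts plus child summaries */
	LOG_MEMORY_STATS_FULL		/* current behaviour: dump everything */
} LogMemoryStatsOption;

static int	log_memory_stats = LOG_MEMORY_STATS_FULL;

static const struct config_enum_entry log_memory_stats_options[] = {
	{"off", LOG_MEMORY_STATS_OFF, false},
	{"short", LOG_MEMORY_STATS_SHORT, false},
	{"full", LOG_MEMORY_STATS_FULL, false},
	{NULL, 0, false}
};

/* ... plus an entry in guc.c's ConfigureNamesEnum[] table: */
{
	{"log_memory_stats", PGC_SUSET, LOGGING_WHAT,
		gettext_noop("Controls how much of the memory context dump is "
					 "written to the server log on out-of-memory errors."),
		NULL
	},
	&log_memory_stats,
	LOG_MEMORY_STATS_FULL,
	log_memory_stats_options,
	NULL, NULL, NULL
},

[Whatever code ends up emitting the dump would then check this setting and choose between the full dump, a summarized form as sketched above, or no dump at all.]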
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 08/20/2015 11:04 PM, Stefan Kaltenbrunner wrote:
>> On 08/20/2015 06:09 PM, Tom Lane wrote:
>>> (The reason I say "lobotomize" is that there's no particularly
>>> good reason to assume that the first N lines will tell you what you
>>> need to know.  And the filter rule would have to be *very* stupid,
>>> because we can't risk trying to allocate any additional memory to
>>> track what we're doing here.)

> IMHO the first thing we might do is provide a log_memory_stats GUC
> controlling that. I'm not a big fan of adding a GUC for everything, but
> in this case it seems appropriate, just like the other log_ options.

I don't think this is advisable.  You would have to either keep it turned off and risk not being able to debug OOM situations, or keep it turned on and risk log-volume problems; neither is very satisfactory.

> I also don't think logging just a subset of the stats is a lost cause.
> Sure, we can't know which of the lines are important, but for example
> logging just the top-level contexts with a summary of the child contexts
> would be OK.

I thought a bit more about this.  Generally, what you want to know about a given situation is which contexts have a whole lot of allocations and/or a whole lot of child contexts.  What you suggest above won't work very well if the problem is buried more than about two levels down in the context tree.  But suppose we add a parameter to memory context stats collection that is the maximum number of child contexts to print *per parent context*.  If there are more than that, summarize the rest as per your suggestion.  So any given recursion level might look like

FooContext: m total in n blocks ...
  ChildContext1: m total in n blocks ...
    possible grandchildren...
  ChildContext2: m total in n blocks ...
    possible grandchildren...
  ChildContext3: m total in n blocks ...
    possible grandchildren...
  k more child contexts containing m total in n blocks ...

This would require a fixed amount of extra state per recursion level, so it could be done with a few more parameters/local variables in MemoryContextStats and no need to risk a malloc().

The case where you would lose important data is where the serious bloat is in some specific child context that is after the first N children of its direct parent.  I don't believe I've ever seen a case where that was critical information as long as N isn't too tiny.  I think we could hard-wire N at 100 or something like that and pretty much fix Stefan's complaint, while losing little if any detail in typical cases.  Manual invocation of MemoryContextStats could pass larger values if you really needed it during debugging.

Thoughts?

			regards, tom lane
I wrote:
> I thought a bit more about this.  Generally, what you want to know about
> a given situation is which contexts have a whole lot of allocations
> and/or a whole lot of child contexts.  What you suggest above won't work
> very well if the problem is buried more than about two levels down in
> the context tree.  But suppose we add a parameter to memory context stats
> collection that is the maximum number of child contexts to print *per
> parent context*.  If there are more than that, summarize the rest as per
> your suggestion.

Here's a draft patch along this line.  After playing with it a bit, I think that a wired-in limit of 100 children per level would be fine.  We could imagine adding a GUC for that, but I think it's overkill.

A couple of notes:

Since the summarization mechanism requires passing totals back up anyway, I took the opportunity to add a "grand total" line to the end of the printout.  I'm half tempted to modify that to convert the numbers in the grand total line to kilobytes or even MB, but it could be argued either way.  This assumes that the total memory allocation couldn't exceed the range of size_t, a thing which is not promised by the C standard --- but AFAIK it would be true on any modern machine (and really we were already making such an assumption in AllocSetStats).

Also, if we ever add a memory context type that works fundamentally differently from AllocSet, we might need to rethink what we compute/print as summary stats.  I figure we can cross that bridge when we come to it.

Comments?

			regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index febeb6e..e86c7cb 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*************** static void AllocSetReset(MemoryContext
*** 253,259 ****
  static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
! static void AllocSetStats(MemoryContext context, int level);

  #ifdef MEMORY_CONTEXT_CHECKING
  static void AllocSetCheck(MemoryContext context);
--- 253,260 ----
  static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
! static void AllocSetStats(MemoryContext context, int level, bool print,
!               MemoryContextCounters *totals);

  #ifdef MEMORY_CONTEXT_CHECKING
  static void AllocSetCheck(MemoryContext context);
*************** AllocSetIsEmpty(MemoryContext context)
*** 1228,1237 ****

  /*
   * AllocSetStats
!  *      Displays stats about memory consumption of an allocset.
   */
  static void
! AllocSetStats(MemoryContext context, int level)
  {
      AllocSet    set = (AllocSet) context;
      Size        nblocks = 0;
--- 1229,1243 ----

  /*
   * AllocSetStats
!  *      Compute stats about memory consumption of an allocset.
!  *
!  * level: recursion level (0 at top level); used for print indentation.
!  * print: true to print stats to stderr.
!  * totals: if not NULL, add stats about this allocset into *totals.
   */
  static void
! AllocSetStats(MemoryContext context, int level, bool print,
!               MemoryContextCounters *totals)
  {
      AllocSet    set = (AllocSet) context;
      Size        nblocks = 0;
*************** AllocSetStats(MemoryContext context, int
*** 1239,1247 ****
      Size        totalspace = 0;
      Size        freespace = 0;
      AllocBlock  block;
-     AllocChunk  chunk;
      int         fidx;
-     int         i;

      for (block = set->blocks; block != NULL; block = block->next)
      {
--- 1245,1251 ----
*************** AllocSetStats(MemoryContext context, int
*** 1251,1256 ****
--- 1255,1262 ----
      }
      for (fidx = 0; fidx < ALLOCSET_NUM_FREELISTS; fidx++)
      {
+         AllocChunk  chunk;
+ 
          for (chunk = set->freelist[fidx]; chunk != NULL;
               chunk = (AllocChunk) chunk->aset)
          {
*************** AllocSetStats(MemoryContext context, int
*** 1259,1271 ****
          }
      }

!     for (i = 0; i < level; i++)
!         fprintf(stderr, "  ");
!     fprintf(stderr, "%s: %zu total in %zd blocks; %zu free (%zd chunks); %zu used\n",
!             set->header.name, totalspace, nblocks, freespace, nchunks,
!             totalspace - freespace);
  }

--- 1265,1289 ----
          }
      }

!     if (print)
!     {
!         int         i;
!         for (i = 0; i < level; i++)
!             fprintf(stderr, "  ");
!         fprintf(stderr, "%s: %zu total in %zd blocks; %zu free (%zd chunks); %zu used\n",
!                 set->header.name, totalspace, nblocks, freespace, nchunks,
!                 totalspace - freespace);
!     }
! 
!     if (totals)
!     {
!         totals->nblocks += nblocks;
!         totals->nchunks += nchunks;
!         totals->totalspace += totalspace;
!         totals->freespace += freespace;
!     }
  }

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 12d29f7..ad1421c 100644
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
*************** MemoryContext CurTransactionContext = NU
*** 52,58 ****
  MemoryContext PortalContext = NULL;

  static void MemoryContextCallResetCallbacks(MemoryContext context);
! static void MemoryContextStatsInternal(MemoryContext context, int level);

  /*
   * You should not do memory allocations within a critical section, because
--- 52,60 ----
  MemoryContext PortalContext = NULL;

  static void MemoryContextCallResetCallbacks(MemoryContext context);
! static void MemoryContextStatsInternal(MemoryContext context, int level,
!                         bool print, int max_children,
!                         MemoryContextCounters *totals);

  /*
   * You should not do memory allocations within a critical section, because
*************** MemoryContextIsEmpty(MemoryContext conte
*** 477,501 ****
   * MemoryContextStats
   *      Print statistics about the named context and all its descendants.
   *
!  * This is just a debugging utility, so it's not fancy.  The statistics
!  * are merely sent to stderr.
   */
  void
  MemoryContextStats(MemoryContext context)
  {
!     MemoryContextStatsInternal(context, 0);
  }

  static void
! MemoryContextStatsInternal(MemoryContext context, int level)
  {
      MemoryContext child;

      AssertArg(MemoryContextIsValid(context));

!     (*context->methods->stats) (context, level);
!     for (child = context->firstchild; child != NULL; child = child->nextchild)
!         MemoryContextStatsInternal(child, level + 1);
  }

  /*
--- 479,584 ----
   * MemoryContextStats
   *      Print statistics about the named context and all its descendants.
   *
!  * This is just a debugging utility, so it's not very fancy.  However, we do
!  * make some effort to summarize when the output would otherwise be very long.
!  * The statistics are sent to stderr.
   */
  void
  MemoryContextStats(MemoryContext context)
  {
!     /* A hard-wired limit on the number of children is usually good enough */
!     MemoryContextStatsDetail(context, 100);
! }
! 
! /*
!  * MemoryContextStatsDetail
!  *
!  * Entry point for use if you want to vary the number of child contexts shown.
!  */
! void
! MemoryContextStatsDetail(MemoryContext context, int max_children)
! {
!     MemoryContextCounters grand_totals;
! 
!     memset(&grand_totals, 0, sizeof(grand_totals));
! 
!     MemoryContextStatsInternal(context, 0, true, max_children, &grand_totals);
! 
!     fprintf(stderr,
!     "Grand total: %zu bytes in %zd blocks; %zu free (%zd chunks); %zu used\n",
!             grand_totals.totalspace, grand_totals.nblocks,
!             grand_totals.freespace, grand_totals.nchunks,
!             grand_totals.totalspace - grand_totals.freespace);
  }

+ /*
+  * MemoryContextStatsInternal
+  *      One recursion level for MemoryContextStats
+  *
+  * Print this context if print is true, but in any case accumulate counts into
+  * *totals (if given).
+  */
  static void
! MemoryContextStatsInternal(MemoryContext context, int level,
!                         bool print, int max_children,
!                         MemoryContextCounters *totals)
  {
+     MemoryContextCounters local_totals;
      MemoryContext child;
+     int         ichild;

      AssertArg(MemoryContextIsValid(context));

!     /* Examine the context itself */
!     (*context->methods->stats) (context, level, print, totals);
! 
!     /*
!      * Examine children.  If there are more than max_children of them, we do
!      * not print the rest explicitly, but just summarize them.
!      */
!     memset(&local_totals, 0, sizeof(local_totals));
! 
!     for (child = context->firstchild, ichild = 0;
!          child != NULL;
!          child = child->nextchild, ichild++)
!     {
!         if (ichild < max_children)
!             MemoryContextStatsInternal(child, level + 1,
!                                        print, max_children,
!                                        totals);
!         else
!             MemoryContextStatsInternal(child, level + 1,
!                                        false, max_children,
!                                        &local_totals);
!     }
! 
!     /* Deal with excess children */
!     if (ichild > max_children)
!     {
!         if (print)
!         {
!             int         i;
! 
!             for (i = 0; i <= level; i++)
!                 fprintf(stderr, "  ");
!             fprintf(stderr,
!                 "%d more child contexts containing %zu total in %zd blocks; %zu free (%zd chunks); %zu used\n",
!                     ichild - max_children,
!                     local_totals.totalspace,
!                     local_totals.nblocks,
!                     local_totals.freespace,
!                     local_totals.nchunks,
!                     local_totals.totalspace - local_totals.freespace);
!         }
! 
!         if (totals)
!         {
!             totals->nblocks += local_totals.nblocks;
!             totals->nchunks += local_totals.nchunks;
!             totals->totalspace += local_totals.totalspace;
!             totals->freespace += local_totals.freespace;
!         }
!     }
  }

  /*
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 5e036b9..5e6bb60 100644
*** a/src/include/nodes/memnodes.h
--- b/src/include/nodes/memnodes.h
***************
*** 17,22 ****
--- 17,40 ----

  #include "nodes/nodes.h"

  /*
+  * MemoryContextCounters
+  *      Summarization state for MemoryContextStats collection.
+  *
+  * The set of counters in this struct is biased towards AllocSet; if we ever
+  * add any context types that are based on fundamentally different approaches,
+  * we might need more or different counters here.  A possible API spec then
+  * would be to print only nonzero counters, but for now we just summarize in
+  * the format historically used by AllocSet.
+  */
+ typedef struct MemoryContextCounters
+ {
+     Size        nblocks;        /* Total number of malloc blocks */
+     Size        nchunks;        /* Total number of free chunks */
+     Size        totalspace;     /* Total bytes requested from malloc */
+     Size        freespace;      /* The unused portion of totalspace */
+ } MemoryContextCounters;
+ 
+ /*
   * MemoryContext
   *      A logical context in which memory allocations occur.
   *
*************** typedef struct MemoryContextMethods
*** 44,50 ****
      void        (*delete_context) (MemoryContext context);
      Size        (*get_chunk_space) (MemoryContext context, void *pointer);
      bool        (*is_empty) (MemoryContext context);
!     void        (*stats) (MemoryContext context, int level);
  #ifdef MEMORY_CONTEXT_CHECKING
      void        (*check) (MemoryContext context);
  #endif
--- 62,69 ----
      void        (*delete_context) (MemoryContext context);
      Size        (*get_chunk_space) (MemoryContext context, void *pointer);
      bool        (*is_empty) (MemoryContext context);
!     void        (*stats) (MemoryContext context, int level, bool print,
!                           MemoryContextCounters *totals);
  #ifdef MEMORY_CONTEXT_CHECKING
      void        (*check) (MemoryContext context);
  #endif
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index f0fe0f4..38106d7 100644
*** a/src/include/utils/memutils.h
--- b/src/include/utils/memutils.h
*************** extern MemoryContext GetMemoryChunkConte
*** 104,109 ****
--- 104,110 ----
  extern MemoryContext MemoryContextGetParent(MemoryContext context);
  extern bool MemoryContextIsEmpty(MemoryContext context);
  extern void MemoryContextStats(MemoryContext context);
+ extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
  extern void MemoryContextAllowInCriticalSection(MemoryContext context,
                                    bool allow);
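[A short usage sketch, not part of the patch: with the patch applied, backend or extension code could ask for a less aggressive summarization than the built-in limit of 100 by calling the new entry point directly. The helper name below and the limit of 1000 are arbitrary illustrative choices; MemoryContextStatsDetail() and TopMemoryContext come from the patch and memutils.h.]

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Hypothetical debugging helper: dump memory context stats to stderr,
 * showing up to 1000 children per parent context instead of the
 * hard-wired default of 100 used by MemoryContextStats().
 */
static void
dump_contexts_verbose(void)
{
	MemoryContextStatsDetail(TopMemoryContext, 1000);
}

[The same call could equally be issued by hand from a debugger attached to a suspect backend, which is the "manual invocation ... could pass larger values" escape hatch mentioned upthread.]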
On 08/21/2015 08:37 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I also don't think logging just a subset of the stats is a lost cause.
>> Sure, we can't know which of the lines are important, but for example
>> logging just the top-level contexts with a summary of the child contexts
>> would be OK.
>
> I thought a bit more about this.  Generally, what you want to know about
> a given situation is which contexts have a whole lot of allocations
> and/or a whole lot of child contexts.  What you suggest above won't work
> very well if the problem is buried more than about two levels down in
> the context tree.  But suppose we add a parameter to memory context stats
> collection that is the maximum number of child contexts to print *per
> parent context*.  If there are more than that, summarize the rest as per
> your suggestion.  So any given recursion level might look like
>
> FooContext: m total in n blocks ...
>   ChildContext1: m total in n blocks ...
>     possible grandchildren...
>   ChildContext2: m total in n blocks ...
>     possible grandchildren...
>   ChildContext3: m total in n blocks ...
>     possible grandchildren...
>   k more child contexts containing m total in n blocks ...
>
> This would require a fixed amount of extra state per recursion level,
> so it could be done with a few more parameters/local variables in
> MemoryContextStats and no need to risk a malloc().
>
> The case where you would lose important data is where the serious
> bloat is in some specific child context that is after the first N
> children of its direct parent.  I don't believe I've ever seen a case
> where that was critical information as long as N isn't too tiny.

Couldn't we make it a bit smarter to handle even cases like this? For example we might first count/sum the child contexts, and then print either all child contexts (if there are only a few of them) or just those that are >5% of the total, 2x the average or something like that.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 08/21/2015 08:37 PM, Tom Lane wrote:
>> ... But suppose we add a parameter to memory context stats
>> collection that is the maximum number of child contexts to print *per
>> parent context*.  If there are more than that, summarize the rest as per
>> your suggestion.
>>
>> The case where you would lose important data is where the serious
>> bloat is in some specific child context that is after the first N
>> children of its direct parent.  I don't believe I've ever seen a case
>> where that was critical information as long as N isn't too tiny.

> Couldn't we make it a bit smarter to handle even cases like this? For
> example we might first count/sum the child contexts, and then print
> either all child contexts (if there are only a few of them) or just
> those that are >5% of the total, 2x the average or something like that.

That seems overly complicated.  In the first place, you couldn't possibly implement that without two passes over the context set, which would be mighty expensive.  (In the situations where this is getting run, most likely portions of the address space have been swapped out, and we'll have to swap them back in again.  Bad enough to iterate over the whole process address space once, but twice?)  In the second place, it would seem to require a much wider API between MemoryContextStats and the per-context-type stats methods, including for example a way to decide which numbers about a context were the most important ones.  I'm already concerned about having had to expose a set of accumulatable numbers at all ... don't want the API to get into their semantics in that kind of detail.

As I said, based on my experience in looking at memory context dumps (and I've seen a few), a show-the-first-N-children heuristic probably will work fine.  If we find out differently we can work on refining it, but I don't think a complex design is warranted at this stage.

One thing we could consider doing to improve the odds that it's fine would be to rearrange things so that child contexts of the same parent are more likely to be "similar" --- for example, break out all relcache entries to be children of a RelCacheContext instead of the generic CacheMemoryContext, likewise for cached plans, etc.  But I'm not yet convinced that'd be worth the trouble.

			regards, tom lane
On 08/22/2015 06:06 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Couldn't we make it a bit smarter to handle even cases like this? For
>> example we might first count/sum the child contexts, and then print
>> either all child contexts (if there are only a few of them) or just
>> those that are >5% of the total, 2x the average or something like that.
>
> That seems overly complicated.  In the first place, you couldn't
> possibly implement that without two passes over the context set,
> which would be mighty expensive.  (In the situations where this is
> getting run, most likely portions of the address space have been
> swapped out, and we'll have to swap them back in again.  Bad enough to
> iterate over the whole process address space once, but twice?)  In the
> second place, it would seem to require a much wider API between
> MemoryContextStats and the per-context-type stats methods, including
> for example a way to decide which numbers about a context were the
> most important ones.  I'm already concerned about having had to expose
> a set of accumulatable numbers at all ... don't want the API to get
> into their semantics in that kind of detail.

Sure, the two passes seem quite annoying, although I'm not convinced it'd be a problem in practice - most modern systems I've been dealing with recently were configured with the 'no swapping' goal, using only a tiny swap for extreme cases (e.g. 256GB RAM with 4GB of swap).

If only we had some memory accounting in place ;-) Then we wouldn't have to do the two passes ...

> As I said, based on my experience in looking at memory context dumps
> (and I've seen a few), a show-the-first-N-children heuristic
> probably will work fine.  If we find out differently we can work on
> refining it, but I don't think a complex design is warranted at this
> stage.

OK, let's go with the simple approach.

I'm still not quite convinced that having no GUC for turning this off is a good idea, though. From time to time we're dealing with OOM issues on customer systems, with very limited access (e.g. no gdb). In those cases the memory context dump is the only information we have initially, and it's usually quite difficult to get additional info. I agree that the 'first-N-children' heuristic is going to work in most cases, but if it doesn't, it'd be nice to have a way to force "full" stats in an ad-hoc manner (i.e. disable before query, etc.).

OTOH now that I'm thinking about it, most OOM errors I've seen (at least on Linux) were either because of exceptionally large palloc() requests (e.g. because of varlena header corruption, overflow bugs, ...) or because of the OOM killer. In the first case the memory context stats are utterly useless because the issue has nothing to do with other memory contexts; in the latter case you don't get any stats at all because the process is simply killed.

One question regarding the proposed patch though - if I get it right (just looking at the diff), it simply limits the output to the first 100 child contexts at each level independently. So if each of those 100 child contexts has >100 child contexts of its own, we get 100x100 lines. And so on if the hierarchy is deeper. This probably is not addressable without introducing some global counter of printed contexts, and it may not be an issue at all (all the cases I could remember have a single huge context or many sibling contexts).

> One thing we could consider doing to improve the odds that it's fine
> would be to rearrange things so that child contexts of the same
> parent are more likely to be "similar" --- for example, break out
> all relcache entries to be children of a RelCacheContext instead of
> the generic CacheMemoryContext, likewise for cached plans, etc.  But
> I'm not yet convinced that'd be worth the trouble.

That'd be nice, but I see that as an independent improvement - it might improve the odds for internal contexts, but what about contexts coming from user code (e.g. custom aggregates)?

kind regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> One question regarding the proposed patch though - if I get it right
> (just looking at the diff), it simply limits the output to the first 100
> child contexts at each level independently. So if each of those 100
> child contexts has >100 child contexts of its own, we get 100x100
> lines. And so on if the hierarchy is deeper. This probably is not
> addressable without introducing some global counter of printed contexts,
> and it may not be an issue at all (all the cases I could remember have a
> single huge context or many sibling contexts).

Right.  The situation Stefan was complaining of, almost certainly, involved a huge number of children of the same context.  This patch would successfully abbreviate that case, no matter where it happened in the context tree exactly.  In principle, if you were getting that sort of expansion at multiple levels of the context tree concurrently, you could still get a mighty long context dump ... but I really doubt that would happen in practice.  (And if it did happen, an overall limit on the number of contexts printed would hide the fact that it was happening, which wouldn't be desirable.)

>> One thing we could consider doing to improve the odds that it's fine
>> would be to rearrange things so that child contexts of the same
>> parent are more likely to be "similar" --- for example, break out
>> all relcache entries to be children of a RelCacheContext instead of
>> the generic CacheMemoryContext, likewise for cached plans, etc.  But
>> I'm not yet convinced that'd be worth the trouble.

> That'd be nice, but I see that as an independent improvement - it might
> improve the odds for internal contexts, but what about contexts coming
> from user code (e.g. custom aggregates)?

Yeah, cases like custom aggregates would be hard to classify.

			regards, tom lane
On 08/22/2015 06:25 AM, Tomas Vondra wrote:
> On 08/21/2015 08:37 PM, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I also don't think logging just a subset of the stats is a lost cause.
>>> Sure, we can't know which of the lines are important, but for example
>>> logging just the top-level contexts with a summary of the child contexts
>>> would be OK.
>>
>> I thought a bit more about this.  Generally, what you want to know about
>> a given situation is which contexts have a whole lot of allocations
>> and/or a whole lot of child contexts.  What you suggest above won't work
>> very well if the problem is buried more than about two levels down in
>> the context tree.  But suppose we add a parameter to memory context stats
>> collection that is the maximum number of child contexts to print *per
>> parent context*.  If there are more than that, summarize the rest as per
>> your suggestion.  So any given recursion level might look like
>>
>> FooContext: m total in n blocks ...
>>   ChildContext1: m total in n blocks ...
>>     possible grandchildren...
>>   ChildContext2: m total in n blocks ...
>>     possible grandchildren...
>>   ChildContext3: m total in n blocks ...
>>     possible grandchildren...
>>   k more child contexts containing m total in n blocks ...
>>
>> This would require a fixed amount of extra state per recursion level,
>> so it could be done with a few more parameters/local variables in
>> MemoryContextStats and no need to risk a malloc().
>>
>> The case where you would lose important data is where the serious
>> bloat is in some specific child context that is after the first N
>> children of its direct parent.  I don't believe I've ever seen a case
>> where that was critical information as long as N isn't too tiny.
>
> Couldn't we make it a bit smarter to handle even cases like this? For
> example we might first count/sum the child contexts, and then print
> either all child contexts (if there are only a few of them) or just
> those that are >5% of the total, 2x the average or something like that.

While having that kind of logic would be nice, I don't think it is required. For the case I had, the proposed patch from Tom seems perfectly fine to me - not sure we would want a GUC. From a DBA perspective I don't think anybody needs millions of lines of almost duplicated memory context dumps, and I'm not sure we need it from a developer perspective either (other than the information "there were more than those I printed").

regards

Stefan