Thread: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
Hi!

So after 83 days, the regression tests on barnacle completed, and it smells like another memory leak in CacheMemoryContext, similar to those fixed in 078b2ed on May 18. Barnacle is one of those machines with -DCLOBBER_CACHE_RECURSIVELY, and the tests were running with a snapshot from May 19, so the memory leaks detected with "only" -DCLOBBER_CACHE_ALWAYS should already be fixed there.

I have a script in place watching for excessive memory usage by postgres backends: whenever a backend's VSS exceeds 512MB, the script runs "MemoryContextStats(TopMemoryContext)" on it (from gdb). The results get logged into the postmaster log, and thus end up in the buildfarm. See this:

http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=barnacle&dt=2014-05-19%2011%3A25%3A03

Essentially, it looks like this:

TopMemoryContext: 69984 total in 10 blocks; 6216 free (24 chunks); 63768 used
  TopTransactionContext: 8192 total in 1 blocks; 5768 free (23 chunks); 2424 used
  ...
  Relcache by OID: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
  CacheMemoryContext: 301981696 total in 45 blocks; 3701744 free (140 chunks); 298279952 used
  ...

or like this:

  CacheMemoryContext: 1602215936 total in 200 blocks; 497336 free (138 chunks); 1601718600 used

So probably another low-probability memory leak, apparently happening only with CLOBBER_CACHE_RECURSIVELY.

regards
Tomas
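[For illustration: a minimal standalone C sketch of the watchdog idea described above. It only shows the trigger condition (poll VmSize from /proc and flag any backend over 512MB); it is not the actual script used on barnacle, which attaches gdb to the oversized backend and calls MemoryContextStats(TopMemoryContext). The printf at the end stands in for that gdb step.]

/*
 * Illustration only; not the actual monitoring script from this thread.
 * Read VmSize from /proc/<pid>/status and flag any process above 512MB.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define VSS_LIMIT_KB (512L * 1024)		/* 512MB, as in the thread */

/* Return VmSize of the given PID in kB, or -1 if it cannot be read. */
static long
vm_size_kb(pid_t pid)
{
	char	path[64], line[256];
	long	kb = -1;
	FILE   *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int) pid);
	if ((f = fopen(path, "r")) == NULL)
		return -1;
	while (fgets(line, sizeof(line), f) != NULL)
		if (sscanf(line, "VmSize: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int
main(int argc, char **argv)
{
	/* PIDs of postgres backends are passed on the command line */
	for (int i = 1; i < argc; i++)
	{
		pid_t	pid = (pid_t) atoi(argv[i]);
		long	kb = vm_size_kb(pid);

		if (kb > VSS_LIMIT_KB)
			printf("backend %d uses %ld kB; dump its memory contexts\n",
				   (int) pid, kb);
	}
	return 0;
}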
"Tomas Vondra" <tv@fuzzy.cz> writes: > So after 83 days, the regression tests on barnacle completed, Hah, that's perseverance! > and it > smells like another memory leak in CacheMemoryContext, similar to those > fixed in 078b2ed on May 18. Ugh, will look. > See this: > http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=barnacle&dt=2014-05-19%2011%3A25%3A03 Initially I was more worried about the crashes, but it looks like that's probably just a side-effect of the leak: [537a053d.1f4b:2] LOG: server process (PID 32110) was terminated by signal 9: Killed [537a053d.1f4b:3] DETAIL: Failed process was running: select pg_get_viewdef('vv3', true); [537a053d.1f4b:4] LOG: terminating any other active server processes [537a053d.1f52:2] WARNING: terminating connection because of crash of another server process ... [537a053d.1f4b:7] LOG: server process (PID 3882) was terminated by signal 9: Killed [537a053d.1f4b:8] DETAIL: Failed process was running: SELECT gid FROM pg_prepared_xacts; [537a053d.1f4b:9] LOG: terminating any other active server processes ... Evidently the OOM killer is at large on this machine. regards, tom lane
On 12.8.2014 02:05, Tom Lane wrote:
>
> Evidently the OOM killer is at large on this machine.

Yes. It's a machine with only 8GB of RAM, running 3 VMs (LXC containers) with 2GB of RAM each. That's not much, and while the setup is mostly a matter of necessity, it's apparently a good way to catch leaks.

Tomas
I wrote: > "Tomas Vondra" <tv@fuzzy.cz> writes: >> So after 83 days, the regression tests on barnacle completed, > Hah, that's perseverance! >> and it >> smells like another memory leak in CacheMemoryContext, similar to those >> fixed in 078b2ed on May 18. > Ugh, will look. I've been experimenting by running the create_view test in isolation (it's not quite self-contained, but close enough) and I find that there are two memory leaks exposed here: 1. The relcache.c functions that provide on-demand caching, namely RelationGetIndexList and RelationGetIndexAttrBitmap, are not careful to free the old values (if any) of the relcache fields they fill. I think we believed that the old values would surely always be null ... but that's not necessarily the case when doing a recursive cache reload of a system catalog, because we might've used/filled the fields while fetching the data needed to fill them. This results in a session-lifespan leak of data in CacheMemoryContext, which is what Tomas saw. (I'm not real sure that this is a live issue for RelationGetIndexAttrBitmap, but it demonstrably is for RelationGetIndexList.) 2. reloptions.c's parseRelOptions() leaks memory when disassembling a reloptions array. The leak is in the caller's memory context which is typically query-lifespan, so under normal circumstances this is not significant. However, a CLOBBER_CACHE_RECURSIVELY build can call this an awful lot of times within a single query. The queries in create_view that operate on views with security_barrier reloptions manage to eat quite a lot of memory this way. The attached patch fixes both of these things. I'm inclined to think it is not worth back-patching because neither effect could amount to anything noticeable without CLOBBER_CACHE_RECURSIVELY. Anyone think otherwise? regards, tom lane diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index c7ad6f9..e5f0240 100644 *** a/src/backend/access/common/reloptions.c --- b/src/backend/access/common/reloptions.c *************** parseRelOptions(Datum options, bool vali *** 912,925 **** /* Done if no options */ if (PointerIsValid(DatumGetPointer(options))) { ! ArrayType *array; Datum *optiondatums; int noptions; - array = DatumGetArrayTypeP(options); - - Assert(ARR_ELEMTYPE(array) == TEXTOID); - deconstruct_array(array, TEXTOID, -1, false, 'i', &optiondatums, NULL, &noptions); --- 912,921 ---- /* Done if no options */ if (PointerIsValid(DatumGetPointer(options))) { ! 
ArrayType *array = DatumGetArrayTypeP(options); Datum *optiondatums; int noptions; deconstruct_array(array, TEXTOID, -1, false, 'i', &optiondatums, NULL, &noptions); *************** parseRelOptions(Datum options, bool vali *** 959,964 **** --- 955,965 ---- errmsg("unrecognized parameter \"%s\"", s))); } } + + /* It's worth avoiding memory leaks in this function */ + pfree(optiondatums); + if (((void *) array) != DatumGetPointer(options)) + pfree(array); } *numrelopts = numoptions; diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 10d300a..fd84853 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *************** RelationGetIndexList(Relation relation) *** 3646,3651 **** --- 3646,3652 ---- ScanKeyData skey; HeapTuple htup; List *result; + List *oldlist; char replident = relation->rd_rel->relreplident; Oid oidIndex = InvalidOid; Oid pkeyIndex = InvalidOid; *************** RelationGetIndexList(Relation relation) *** 3737,3742 **** --- 3738,3744 ---- /* Now save a copy of the completed list in the relcache entry. */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); + oldlist = relation->rd_indexlist; relation->rd_indexlist = list_copy(result); relation->rd_oidindex = oidIndex; if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex)) *************** RelationGetIndexList(Relation relation) *** 3748,3753 **** --- 3750,3758 ---- relation->rd_indexvalid = 1; MemoryContextSwitchTo(oldcxt); + /* Don't leak the old list, if there is one */ + list_free(oldlist); + return result; } *************** RelationGetIndexAttrBitmap(Relation rela *** 4141,4146 **** --- 4146,4159 ---- list_free(indexoidlist); + /* Don't leak any old values of these bitmaps */ + bms_free(relation->rd_indexattr); + relation->rd_indexattr = NULL; + bms_free(relation->rd_keyattr); + relation->rd_keyattr = NULL; + bms_free(relation->rd_idattr); + relation->rd_idattr = NULL; + /* * Now save copies of the bitmaps in the relcache entry. We intentionally * set rd_indexattr last, because that's the one that signals validity of
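[To make the failure mode in point 1 concrete, here is a toy standalone C example of the pattern the patch above addresses; it is not PostgreSQL code. The cache fill assumed its slot was always empty on entry, which stops being true once building the value can recursively refill the same slot.]

/*
 * Toy illustration of the leak pattern fixed above; not PostgreSQL code.
 * cache_fill() builds a value and stores it in a session-lifetime slot.
 * If building the value can recursively refill the same slot (the analogue
 * of a recursive cache reload), the slot is no longer guaranteed to be NULL
 * when the outer call stores its result, so the old value must be freed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *cached_value = NULL;	/* session-lifetime cache slot */
static int	depth = 0;

static void cache_fill(void);

/* Stand-in for fetching catalog data; may itself refill the cache. */
static char *
expensive_build(void)
{
	if (depth++ == 0)
		cache_fill();			/* simulate a recursive cache reload */
	return strdup("cached data");
}

static void
cache_fill(void)
{
	char	   *newval = expensive_build();
	char	   *oldval = cached_value;	/* remember what is already there... */

	cached_value = newval;
	free(oldval);				/* ...and free it instead of leaking it */
}

int
main(void)
{
	cache_fill();
	printf("%s\n", cached_value);
	free(cached_value);
	return 0;
}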
"Tomas Vondra" <tv@fuzzy.cz> writes: > So after 83 days, the regression tests on barnacle completed, and it > smells like another memory leak in CacheMemoryContext, similar to those > fixed in 078b2ed on May 18. I've pushed fixes for the issues I was able to identify by running the create_view test. I definitely don't have the patience to run all the tests this way, but if you do, please start a new run. A couple thoughts about barnacle's configuration: * It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING in CPPFLAGS; those are already selected by --enable-cassert. Removing those -D switches would suppress a whole boatload of compiler warnings. * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is darned expensive and it's not clear you'd learn anything by running them both together. I think you might be better advised to run two separate buildfarm critters with those two options, and thereby perhaps get turnaround in something less than 80 days. * It'd likely be a good idea to take out the TestUpgrade and TestDecoding modules from the config too. Otherwise, we won't be seeing barnacle's next report before 2015, judging from the runtime of the check step compared to some of the other slow buildfarm machines. (I wonder whether there's an easy way to skip the installcheck step, as that's going to require a much longer run than it can possibly be worth too.) regards, tom lane
On 13.8.2014 17:52, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> So after 83 days, the regression tests on barnacle completed, and it
>> smells like another memory leak in CacheMemoryContext, similar to those
>> fixed in 078b2ed on May 18.
>
> I've pushed fixes for the issues I was able to identify by running the
> create_view test. I definitely don't have the patience to run all the
> tests this way, but if you do, please start a new run.
>
> A couple thoughts about barnacle's configuration:
>
> * It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING
> in CPPFLAGS; those are already selected by --enable-cassert. Removing
> those -D switches would suppress a whole boatload of compiler warnings.

I know - I already fixed this two months ago, but barnacle was still running with the old settings (and I decided to let it run).

> * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
> same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
> darned expensive and it's not clear you'd learn anything by running
> them both together. I think you might be better advised to run two
> separate buildfarm critters with those two options, and thereby perhaps
> get turnaround in something less than 80 days.

OK, I removed this for barnacle/addax/mite; let's see what the impact is.

FWIW, we have three other animals running with CLOBBER_CACHE_ALWAYS + RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the price when combined with CLOBBER_CACHE_RECURSIVELY is much higher.

> * It'd likely be a good idea to take out the TestUpgrade and TestDecoding
> modules from the config too. Otherwise, we won't be seeing barnacle's
> next report before 2015, judging from the runtime of the check step
> compared to some of the other slow buildfarm machines. (I wonder whether
> there's an easy way to skip the installcheck step, as that's going to
> require a much longer run than it can possibly be worth too.)

OK, I did this too.

I stopped the already running test on addax and started the test on barnacle again. Let's see in a few days/weeks/months what the result is.

regards
Tomas
Hi,

On 13.8.2014 19:17, Tomas Vondra wrote:
> On 13.8.2014 17:52, Tom Lane wrote:
>
>> * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
>> same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
>> darned expensive and it's not clear you'd learn anything by running
>> them both together. I think you might be better advised to run two
>> separate buildfarm critters with those two options, and thereby perhaps
>> get turnaround in something less than 80 days.
>
> OK, I removed this for barnacle/addax/mite; let's see what the impact is.
>
> FWIW, we have three other animals running with CLOBBER_CACHE_ALWAYS +
> RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the
> price when combined with CLOBBER_CACHE_RECURSIVELY is much higher.
>
>> * It'd likely be a good idea to take out the TestUpgrade and TestDecoding
>> modules from the config too. Otherwise, we won't be seeing barnacle's
>> next report before 2015, judging from the runtime of the check step
>> compared to some of the other slow buildfarm machines. (I wonder whether
>> there's an easy way to skip the installcheck step, as that's going to
>> require a much longer run than it can possibly be worth too.)
>
> OK, I did this too.
>
> I stopped the already running test on addax and started the test on
> barnacle again. Let's see in a few days/weeks/months what the result is.

It seems to be running much faster (probably after removing the randomization), and apparently it passed the create_view tests without crashing, but then it crashed at 'constraints' (again, because of OOM).

PortalMemory: 8192 total in 1 blocks; 7880 free (0 chunks); 312 used
  PortalHeapMemory: 1024 total in 1 blocks; 840 free (0 chunks); 184 used
    ExecutorState: 769654952 total in 103 blocks; 114984 free (296 chunks); 769539968 used
      ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used

I suppose we don't expect a 760MB ExecutorState here. Also, there's a ~60MB MessageContext.

It's still running, so I'm attaching the relevant part of the log (again, with MemoryContextStats for backends with VSS >= 512MB).

FWIW, it's running against a844c29966d7c0cd6a457e9324f175349bb12df0.

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
>> I stopped the already running test on addax and started the test on
>> barnacle again. Let's see in a few days/weeks/months what the result is.

> It seems to be running much faster (probably after removing the
> randomization), and apparently it passed the create_view tests without
> crashing, but then it crashed at 'constraints' (again, because of OOM).

I poked into the "constraints" test and soon found another leak just like the other one :-(, which I've now fixed. I lack the patience to let "constraints" run to conclusion with C_C_R on --- I let it run for about 20 hours and it still was only maybe 1/4th done --- but it got further than it did in your report and there was zero sign of leakage.

			regards, tom lane
On 24.8.2014 18:01, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>>> I stopped the already running test on addax and started the test on
>>> barnacle again. Let's see in a few days/weeks/months what the result is.
>
>> It seems to be running much faster (probably after removing the
>> randomization), and apparently it passed the create_view tests without
>> crashing, but then it crashed at 'constraints' (again, because of OOM).
>
> I poked into the "constraints" test and soon found another leak just like
> the other one :-(, which I've now fixed. I lack the patience to let
> "constraints" run to conclusion with C_C_R on --- I let it run for about
> 20 hours and it still was only maybe 1/4th done --- but it got further
> than it did in your report and there was zero sign of leakage.

OK. There was an outage in the facility hosting this machine, and it came back up just a few hours ago. So I killed the current run, and it'll start chewing on this new commit in a few minutes.

Regarding the leaks we've detected so far - are these the kind of leaks that can happen only in testing with those specific flags, or is it something that can happen in production too? (Assuming no one is running with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it seem worth the effort to run these tests and fix the leaks?

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> Regarding the leaks we've detected so far - are these the kind of leaks
> that can happen only in testing with those specific flags, or is it
> something that can happen in production too? (Assuming no one is running
> with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is,
> does it seem worth the effort to run these tests and fix the leaks?

I believe most or all of these leaks were understood and intentionally ignored in the original coding, on the grounds that they were intraquery leaks and no real-world situation would ever cause so many cache reloads in a single query that the leakage would amount to anything problematic. I think that reasoning is still valid for production usage. It seems worth fixing the leaks in HEAD so that we can get through the regression tests on barnacle and see if anything of greater significance turns up --- but if this is all we find, it might not have been worth the trouble.

			regards, tom lane
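[The "intraquery leak" reasoning above can be made concrete with a small standalone C sketch; it is not the PostgreSQL memory-context code itself. Allocations made into a query-lifespan region are released in one shot when the query ends, so a little garbage per cache reload is harmless unless something forces the reload path to run an enormous number of times within a single query, which is exactly what CLOBBER_CACHE_RECURSIVELY does.]

/*
 * Toy model of a query-lifespan memory context; not PostgreSQL code.
 * Everything allocated from the arena is freed wholesale at "end of query",
 * so a few stray allocations per cache reload never outlive the query.
 */
#include <stdlib.h>

typedef struct chunk
{
	struct chunk *next;
} chunk;

typedef struct
{
	chunk	   *head;
} arena;						/* stands in for a query-lifespan context */

static void *
arena_alloc(arena *a, size_t size)
{
	chunk	   *c = malloc(sizeof(chunk) + size);

	c->next = a->head;
	a->head = c;
	return c + 1;				/* caller's usable memory follows the header */
}

static void
arena_reset(arena *a)			/* end-of-query cleanup */
{
	while (a->head != NULL)
	{
		chunk	   *c = a->head;

		a->head = c->next;
		free(c);
	}
}

int
main(void)
{
	arena		query_ctx = {NULL};

	/* each simulated "cache reload" leaks a little into the query context */
	for (int i = 0; i < 1000; i++)
		(void) arena_alloc(&query_ctx, 64);

	/* ...but it is all reclaimed at once when the query finishes */
	arena_reset(&query_ctx);
	return 0;
}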
On 24.8.2014 18:28, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Regarding the leaks we've detected so far - are these the kind of leaks
>> that can happen only in testing with those specific flags, or is it
>> something that can happen in production too? (Assuming no one is running
>> with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is,
>> does it seem worth the effort to run these tests and fix the leaks?
>
> I believe most or all of these leaks were understood and intentionally
> ignored in the original coding, on the grounds that they were intraquery
> leaks and no real-world situation would ever cause so many cache reloads
> in a single query that the leakage would amount to anything problematic.
> I think that reasoning is still valid for production usage. It seems
> worth fixing the leaks in HEAD so that we can get through the regression
> tests on barnacle and see if anything of greater significance turns up ---
> but if this is all we find, it might not have been worth the trouble.

OK.

Some time ago we got a report (on the Czech mailing list) with this:

TopMemoryContext: 1375320 total in 168 blocks; 6472 free (18 chunks); 1368848 used
...
  CacheMemoryContext: 232883248 total in 5251 blocks; 5644000 free (2 chunks); 227239248 used

Apparently they're using some sort of persistent connections, and there are ~8000 tables in that particular database, which eventually leads to OOM for them. Could this be related?

Anyway, let's leave the tests running - either we find something interesting or we don't. I think it's worth it.

regards
Tomas