Thread: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: "Tomas Vondra"

Hi!

So after 83 days, the regression tests on barnacle completed, and it
smells like another memory leak in CacheMemoryContext, similar to those
fixed in 078b2ed on May 18.

Barnacle is one of those machines with -DCLOBBER_CACHE_RECURSIVELY, and
the tests were running with a snapshot from May 19, so the memory leaks
detected with "only" -DCLOBBER_CACHE_ALWAYS should be fixed there.

I have a script in place looking for excessive memory usage of postgres
backends - whenever a backend allocates more than 512MB of VSS, it does
"MemoryContextStats(TopMemoryContext)" on it (from gdb).

The results get logged into the postmaster log, and thus get to the
buildfarm.
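
In case it's useful, a minimal sketch of that watchdog is below. This is a
Python approximation of what's described above, not the actual script
running on barnacle; the 60s polling interval and the exact pgrep/gdb
invocations are assumptions.

#!/usr/bin/env python
import subprocess
import time

THRESHOLD_KB = 512 * 1024        # 512MB of virtual memory, as described above
already_dumped = set()           # avoid attaching to the same backend twice

def postgres_pids():
    """Return the PIDs of all processes named exactly 'postgres'."""
    try:
        out = subprocess.check_output(["pgrep", "-x", "postgres"])
    except subprocess.CalledProcessError:
        return []                # pgrep exits non-zero when nothing matches
    return [int(p) for p in out.decode().split()]

def vsz_kb(pid):
    """Read VmSize (in kB) from /proc/<pid>/status."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmSize:"):
                return int(line.split()[1])
    return 0

while True:
    for pid in postgres_pids():
        if pid in already_dumped:
            continue
        try:
            if vsz_kb(pid) > THRESHOLD_KB:
                # MemoryContextStats() writes to the backend's stderr,
                # so the dump ends up in the postmaster log.
                subprocess.call(["gdb", "-p", str(pid), "--batch", "-ex",
                                 "call MemoryContextStats(TopMemoryContext)"])
                already_dumped.add(pid)
        except (IOError, OSError):
            pass                 # the backend may have exited in the meantime
    time.sleep(60)

It has to run as a user allowed to ptrace the backends (the postgres user
or root), and gdb needs debug symbols to resolve MemoryContextStats.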

See this:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=barnacle&dt=2014-05-19%2011%3A25%3A03

Essentially, it looks like this:

TopMemoryContext: 69984 total in 10 blocks; 6216 free (24 chunks); 63768 used
  TopTransactionContext: 8192 total in 1 blocks; 5768 free (23 chunks); 2424 used
  ...
  Relcache by OID: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
  CacheMemoryContext: 301981696 total in 45 blocks; 3701744 free (140 chunks); 298279952 used
  ...

or like this:

  CacheMemoryContext: 1602215936 total in 200 blocks; 497336 free (138 chunks); 1601718600 used


So it's probably another low-probability memory leak, apparently happening
only under CLOBBER_CACHE_RECURSIVELY.


regards
Tomas




"Tomas Vondra" <tv@fuzzy.cz> writes:
> So after 83 days, the regression tests on barnacle completed,

Hah, that's perseverance!

> and it
> smells like another memory leak in CacheMemoryContext, similar to those
> fixed in 078b2ed on May 18.

Ugh, will look.

> See this:
> http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=barnacle&dt=2014-05-19%2011%3A25%3A03

Initially I was more worried about the crashes, but it looks like that's
probably just a side-effect of the leak:

[537a053d.1f4b:2] LOG:  server process (PID 32110) was terminated by signal 9: Killed
[537a053d.1f4b:3] DETAIL:  Failed process was running: select pg_get_viewdef('vv3', true);
[537a053d.1f4b:4] LOG:  terminating any other active server processes
[537a053d.1f52:2] WARNING:  terminating connection because of crash of another server process
...
[537a053d.1f4b:7] LOG:  server process (PID 3882) was terminated by signal 9: Killed
[537a053d.1f4b:8] DETAIL:  Failed process was running: SELECT gid FROM pg_prepared_xacts;
[537a053d.1f4b:9] LOG:  terminating any other active server processes
...

Evidently the OOM killer is at large on this machine.
        regards, tom lane



Re: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: Tomas Vondra

On 12.8.2014 02:05, Tom Lane wrote:
>
> Evidently the OOM killer is at large on this machine.

Yes. It's a machine with only 8GB of RAM, and there are 3 VMs (LXC
containers) with 2GB of RAM each. That's not much, and while it's mostly
out of necessity, it's apparently a good way to catch leaks.

Tomas




I wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> So after 83 days, the regression tests on barnacle completed,

> Hah, that's perseverance!

>> and it
>> smells like another memory leak in CacheMemoryContext, similar to those
>> fixed in 078b2ed on May 18.

> Ugh, will look.

I've been experimenting by running the create_view test in isolation
(it's not quite self-contained, but close enough) and I find that there
are two memory leaks exposed here:

1. The relcache.c functions that provide on-demand caching, namely
RelationGetIndexList and RelationGetIndexAttrBitmap, are not careful
to free the old values (if any) of the relcache fields they fill.
I think we believed that the old values would surely always be null ...
but that's not necessarily the case when doing a recursive cache reload
of a system catalog, because we might've used/filled the fields while
fetching the data needed to fill them.  This results in a session-lifespan
leak of data in CacheMemoryContext, which is what Tomas saw.  (I'm not
real sure that this is a live issue for RelationGetIndexAttrBitmap, but
it demonstrably is for RelationGetIndexList.)

2. reloptions.c's parseRelOptions() leaks memory when disassembling a
reloptions array.  The leak is in the caller's memory context which
is typically query-lifespan, so under normal circumstances this is not
significant.  However, a CLOBBER_CACHE_RECURSIVELY build can call this
an awful lot of times within a single query.  The queries in create_view
that operate on views with security_barrier reloptions manage to eat
quite a lot of memory this way.

The attached patch fixes both of these things.  I'm inclined to think
it is not worth back-patching because neither effect could amount to
anything noticeable without CLOBBER_CACHE_RECURSIVELY.  Anyone think
otherwise?

            regards, tom lane

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c7ad6f9..e5f0240 100644
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
*************** parseRelOptions(Datum options, bool vali
*** 912,925 ****
      /* Done if no options */
      if (PointerIsValid(DatumGetPointer(options)))
      {
!         ArrayType  *array;
          Datum       *optiondatums;
          int            noptions;

-         array = DatumGetArrayTypeP(options);
-
-         Assert(ARR_ELEMTYPE(array) == TEXTOID);
-
          deconstruct_array(array, TEXTOID, -1, false, 'i',
                            &optiondatums, NULL, &noptions);

--- 912,921 ----
      /* Done if no options */
      if (PointerIsValid(DatumGetPointer(options)))
      {
!         ArrayType  *array = DatumGetArrayTypeP(options);
          Datum       *optiondatums;
          int            noptions;

          deconstruct_array(array, TEXTOID, -1, false, 'i',
                            &optiondatums, NULL, &noptions);

*************** parseRelOptions(Datum options, bool vali
*** 959,964 ****
--- 955,965 ----
                           errmsg("unrecognized parameter \"%s\"", s)));
              }
          }
+
+         /* It's worth avoiding memory leaks in this function */
+         pfree(optiondatums);
+         if (((void *) array) != DatumGetPointer(options))
+             pfree(array);
      }

      *numrelopts = numoptions;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 10d300a..fd84853 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationGetIndexList(Relation relation)
*** 3646,3651 ****
--- 3646,3652 ----
      ScanKeyData skey;
      HeapTuple    htup;
      List       *result;
+     List       *oldlist;
      char        replident = relation->rd_rel->relreplident;
      Oid            oidIndex = InvalidOid;
      Oid            pkeyIndex = InvalidOid;
*************** RelationGetIndexList(Relation relation)
*** 3737,3742 ****
--- 3738,3744 ----

      /* Now save a copy of the completed list in the relcache entry. */
      oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+     oldlist = relation->rd_indexlist;
      relation->rd_indexlist = list_copy(result);
      relation->rd_oidindex = oidIndex;
      if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
*************** RelationGetIndexList(Relation relation)
*** 3748,3753 ****
--- 3750,3758 ----
      relation->rd_indexvalid = 1;
      MemoryContextSwitchTo(oldcxt);

+     /* Don't leak the old list, if there is one */
+     list_free(oldlist);
+
      return result;
  }

*************** RelationGetIndexAttrBitmap(Relation rela
*** 4141,4146 ****
--- 4146,4159 ----

      list_free(indexoidlist);

+     /* Don't leak any old values of these bitmaps */
+     bms_free(relation->rd_indexattr);
+     relation->rd_indexattr = NULL;
+     bms_free(relation->rd_keyattr);
+     relation->rd_keyattr = NULL;
+     bms_free(relation->rd_idattr);
+     relation->rd_idattr = NULL;
+
      /*
       * Now save copies of the bitmaps in the relcache entry.  We intentionally
       * set rd_indexattr last, because that's the one that signals validity of

"Tomas Vondra" <tv@fuzzy.cz> writes:
> So after 83 days, the regression tests on barnacle completed, and it
> smells like another memory leak in CacheMemoryContext, similar to those
> fixed in 078b2ed on May 18.

I've pushed fixes for the issues I was able to identify by running the
create_view test.  I definitely don't have the patience to run all the
tests this way, but if you do, please start a new run.

A couple thoughts about barnacle's configuration:

* It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING
in CPPFLAGS; those are already selected by --enable-cassert.  Removing
those -D switches would suppress a whole boatload of compiler warnings.

* I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
darned expensive and it's not clear you'd learn anything by running
them both together.  I think you might be better advised to run two
separate buildfarm critters with those two options, and thereby perhaps
get turnaround in something less than 80 days.

* It'd likely be a good idea to take out the TestUpgrade and TestDecoding
modules from the config too.  Otherwise, we won't be seeing barnacle's
next report before 2015, judging from the runtime of the check step
compared to some of the other slow buildfarm machines.  (I wonder whether
there's an easy way to skip the installcheck step, as that's going to
require a much longer run than it can possibly be worth too.)
        regards, tom lane



Re: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: Tomas Vondra

On 13.8.2014 17:52, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> So after 83 days, the regression tests on barnacle completed, and it
>> smells like another memory leak in CacheMemoryContext, similar to those
>> fixed in 078b2ed on May 18.
> 
> I've pushed fixes for the issues I was able to identify by running the
> create_view test.  I definitely don't have the patience to run all the
> tests this way, but if you do, please start a new run.
> 
> A couple thoughts about barnacle's configuration:
> 
> * It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING
> in CPPFLAGS; those are already selected by --enable-cassert.  Removing
> those -D switches would suppress a whole boatload of compiler warnings.

I know - I already fixed this two months ago, but barnacle was still
running with the old settings (and I decided to let it run).

> * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
> same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
> darned expensive and it's not clear you'd learn anything by running
> them both together.  I think you might be better advised to run two
> separate buildfarm critters with those two options, and thereby perhaps
> get turnaround in something less than 80 days.

OK, I removed this for barnacle/addax/mite; let's see what the impact is.

FWIW, we have three other animals running with CLOBBER_CACHE_ALWAYS +
RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the
cost is much higher when combined with CLOBBER_CACHE_RECURSIVELY.

> * It'd likely be a good idea to take out the TestUpgrade and TestDecoding
> modules from the config too.  Otherwise, we won't be seeing barnacle's
> next report before 2015, judging from the runtime of the check step
> compared to some of the other slow buildfarm machines.  (I wonder whether
> there's an easy way to skip the installcheck step, as that's going to
> require a much longer run than it can possibly be worth too.)

OK, I did this too.

I stopped the already-running test on addax and started the test on
barnacle again. Let's see in a few days/weeks/months what the result is.

regards
Tomas



Re: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: Tomas Vondra

Hi,

On 13.8.2014 19:17, Tomas Vondra wrote:
> On 13.8.2014 17:52, Tom Lane wrote:
>
>> * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
>> same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
>> darned expensive and it's not clear you'd learn anything by running
>> them both together.  I think you might be better advised to run two
>> separate buildfarm critters with those two options, and thereby perhaps
>> get turnaround in something less than 80 days.
>
> OK, I removed this for barnacle/addax/mite; let's see what the impact is.
>
> FWIW, we have three other animals running with CLOBBER_CACHE_ALWAYS +
> RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the
> cost is much higher when combined with CLOBBER_CACHE_RECURSIVELY.
>
>> * It'd likely be a good idea to take out the TestUpgrade and TestDecoding
>> modules from the config too.  Otherwise, we won't be seeing barnacle's
>> next report before 2015, judging from the runtime of the check step
>> compared to some of the other slow buildfarm machines.  (I wonder whether
>> there's an easy way to skip the installcheck step, as that's going to
>> require a much longer run than it can possibly be worth too.)
>
> OK, I did this too.
>
> I stopped the already-running test on addax and started the test on
> barnacle again. Let's see in a few days/weeks/months what the result is.

It seems to be running much faster (probably thanks to removing the
randomization), and apparently it passed the create_view tests without
crashing, but then crashed at 'constraints' (again, because of OOM).

  PortalMemory: 8192 total in 1 blocks; 7880 free (0 chunks); 312 used
    PortalHeapMemory: 1024 total in 1 blocks; 840 free (0 chunks); 184 used
      ExecutorState: 769654952 total in 103 blocks; 114984 free (296 chunks); 769539968 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used

I suppose we don't expect a 760MB ExecutorState here. Also, there's a ~60MB
MessageContext.

It's still running, so I'm attaching the relevant part of the log (again,
with MemoryContextStats for backends with VSS >= 512MB).

FWIW, it's running against a844c29966d7c0cd6a457e9324f175349bb12df0.

regards
Tomas

Tomas Vondra <tv@fuzzy.cz> writes:
>> I stopped the already-running test on addax and started the test on
>> barnacle again. Let's see in a few days/weeks/months what the result is.

> It seems to be running much faster (probably thanks to removing the
> randomization), and apparently it passed the create_view tests without
> crashing, but then crashed at 'constraints' (again, because of OOM).

I poked into the "constraints" test and soon found another leak just like
the other one :-(, which I've now fixed.  I lack the patience to let
"constraints" run to conclusion with C_C_R on --- I let it run for about
20 hours and it still was only maybe 1/4th done --- but it got further
than it did in your report and there was zero sign of leakage.
        regards, tom lane



Re: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: Tomas Vondra

On 24.8.2014 18:01, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>>> I stopped the already-running test on addax and started the test on
>>> barnacle again. Let's see in a few days/weeks/months what the result is.
> 
>> It seems to be running much faster (probably thanks to removing the
>> randomization), and apparently it passed the create_view tests without
>> crashing, but then crashed at 'constraints' (again, because of OOM).
> 
> I poked into the "constraints" test and soon found another leak just like
> the other one :-(, which I've now fixed.  I lack the patience to let
> "constraints" run to conclusion with C_C_R on --- I let it run for about
> 20 hours and it still was only maybe 1/4th done --- but it got further
> than it did in your report and there was zero sign of leakage.

OK. There was an outage in the facility hosting this machine, and it
came back up just a few hours ago. So I killed the current run, and it'll
start chewing on this new commit in a few minutes.

Regarding those leaks we've detected so far - are these the kind of leaks
that can happen only in testing with those specific flags, or something
that can happen in production too? (Assuming no one is running with
CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it
seem worth the effort of running those tests / fixing those leaks?

regards
Tomas



Tomas Vondra <tv@fuzzy.cz> writes:
> Regarding those leaks we've detected so far - are these the kind of leaks
> that can happen only in testing with those specific flags, or something
> that can happen in production too? (Assuming no one is running with
> CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it
> seem worth the effort of running those tests / fixing those leaks?

I believe most or all of these leaks were understood and intentionally
ignored in the original coding, on the grounds that they were intraquery
leaks and no real-world situation would ever cause so many cache reloads
in a single query that the leakage would amount to anything problematic.
I think that reasoning is still valid for production usage.  It seems
worth fixing the leaks in HEAD so that we can get through the regression
tests on barnacle and see if anything of greater significance turns up ---
but if this is all we find, it might not have been worth the trouble.
        regards, tom lane



Re: failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

From: Tomas Vondra

On 24.8.2014 18:28, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Regarding those leaks we've detected so far - are these the kind of leaks
>> that can happen only in testing with those specific flags, or something
>> that can happen in production too? (Assuming no one is running with
>> CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it
>> seem worth the effort of running those tests / fixing those leaks?
> 
> I believe most or all of these leaks were understood and intentionally
> ignored in the original coding, on the grounds that they were intraquery
> leaks and no real-world situation would ever cause so many cache reloads
> in a single query that the leakage would amount to anything problematic.
> I think that reasoning is still valid for production usage.  It seems
> worth fixing the leaks in HEAD so that we can get through the regression
> tests on barnacle and see if anything of greater significance turns up ---
> but if this is all we find, it might not have been worth the trouble.

OK. Some time ago we got a report (on the Czech mailing list) with this:

TopMemoryContext: 1375320 total in 168 blocks; 6472 free (18 chunks); 1368848 used
  ...
  CacheMemoryContext: 232883248 total in 5251 blocks; 5644000 free (2 chunks); 227239248 used

Apparently they're using some sort of persistent connections, and there
are ~8000 tables in that particular database, which eventually leads to
OOM for them. Could this be related?

Anyway, let's leave the tests running - either we find something
interesting or not. I think it's worth it.

regards
Tomas