Thread: buildfarm: strange OOM failures on markhor (running CLOBBER_CACHE_RECURSIVELY)

Hi all,

a few days ago I set up a buildfarm animal, markhor, running the tests
with CLOBBER_CACHE_RECURSIVELY. As the tests run for a very long time,
reporting the results back to the server fails because of a safeguard
limit in the buildfarm server. Anyway, that's being discussed in a
different thread - I mention it here merely as a 'don't bother looking for
addax on the buildfarm website' warning.

I've been checking the progress of the recursive tests today, and I
found it actually failed in the 'make check' step. The logs are
available here:

buildfarm logs: http://www.fuzzy.cz/tmp/buildfarm/recursive-oom.tgz
kernel logs: http://www.fuzzy.cz/tmp/buildfarm/messages

The tests are running within a LXC container (operated through libvirt),
so whenever I say 'VM' I actually mean a LXC container. It might be some
VM/LXC misconfiguration, but as this happens only to a single VM (the
one running tests with recursive clobber), I find it unlikely.

================== An example of the failure ==================

parallel group (20 tests):  pg_lsn regproc oid name char money float4
txid text int2 varchar int4 float8 boolean int8 uuid rangetypes bit
numeric enum
     ...
     float4      ... ok
     float8      ... ok
     bit         ... FAILED (test process exited with exit code 2)
     numeric     ... FAILED (test process exited with exit code 2)
     txid        ... ok
     ...
 
===============================================================

and then of course the usual 'terminating connection because of crash of
another server process' warning. Apparently, it's getting killed by the
OOM killer, because it exhausts all the memory assigned to that VM (2GB).

May 15 19:44:53 postgres invoked oom-killer: gfp_mask=0xd0, order=0,
oom_adj=0, oom_score_adj=0
May 15 19:44:53 cspug kernel: postgres cpuset=recursive-builds
mems_allowed=0
May 15 19:44:53 cspug kernel: Pid: 17159, comm: postgres Not tainted
2.6.32-431.17.1.el6.centos.plus.x86_64 #1

AFAIK 2GB is more than enough for a buildfarm machine (after all,
chipmunk has just 512MB). Also, this only happens on this VM
(cpuset=recursive-builds); the other two VMs, with exactly the same
limits, running other buildfarm animals (regular or with
CLOBBER_CACHE_ALWAYS), are perfectly happy. See magpie or markhor for
example. And I don't see any reason why a build with recursive clobber
should require more memory than a regular build. So this seems like a
memory leak somewhere in the cache invalidation code.

I thought it might be fixed by commit b23b0f5588 (Code review for recent
changes in relcache.c), but mite is currently working on
7894ac5004 and yet it already failed on OOM.

The failures apparently happen within a few hours of the test start. For
example on addax (gcc), the build started at 02:50 and the first OOM
failure happened at 05:19; on mite (clang), it's 03:20 vs. 06:50. So
it's roughly 3-4 hours after the tests start.

regards
Tomas



Tomas Vondra <tv@fuzzy.cz> writes:
> ... then of course the usual 'terminating connection because of crash of
> another server process' warning. Apparently, it's getting killed by the
> OOM killer, because it exhausts all the memory assigned to that VM (2GB).

Can you fix things so it runs into its process ulimit before the OOM killer
triggers?  Then we'd get a memory map dumped to stderr, which would be
helpful in localizing the problem.

> ... So this seems like a
> memory leak somewhere in the cache invalidation code.

Smells that way to me too, but let's get some more evidence.
        regards, tom lane



On 17.5.2014 19:55, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> ... then of course the usual 'terminating connection because of crash of
>> another server process' warning. Apparently, it's getting killed by the
>> OOM killer, because it exhausts all the memory assigned to that VM (2GB).
>
> Can you fix things so it runs into its process ulimit before the OOM killer
> triggers?  Then we'd get a memory map dumped to stderr, which would be
> helpful in localizing the problem.

I did this in /etc/security/limits.d/80-pgbuild.conf:

pgbuild hard as 1835008

so the user the buildfarm runs under will have up to ~1.75GB of address
space (of the 2GB available to the container).
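
(To double-check that the limit is actually picked up, a trivial sketch
like this can be run as the pgbuild user - note that RLIMIT_AS is reported
in bytes, while the limits.conf value above is in kB:)

    # prints the (soft, hard) address-space limit of the current process
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print(soft, hard)   # the hard limit should come out as 1835008 * 1024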

>
>> ... So this seems like a
>> memory leak somewhere in the cache invalidation code.
>
> Smells that way to me too, but let's get some more evidence.

The tests are already running, and there are a few postgres processes:

  PID   VIRT  RES %CPU    TIME+  COMMAND
11478   449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
CREATE VIEW
11423   219m  19m  0.0   0:00.17 postgres: checkpointer process
11424   219m 2880  0.0   0:00.05 postgres: writer process
11425   219m 5920  0.0   0:00.12 postgres: wal writer process
11426   219m 2708  0.0   0:00.05 postgres: autovacuum launcher process
11427  79544 1836  0.0   0:00.17 postgres: stats collector process
11479  1198m 1.0g  0.0  91:09.99 postgres: pgbuild regression [local]
CREATE INDEX waiting

Attached is 'pmap -x' output for the two interesting processes (11478,
11479).

Tomas

On 2014-05-17 20:41:37 +0200, Tomas Vondra wrote:
> On 17.5.2014 19:55, Tom Lane wrote:
> > Tomas Vondra <tv@fuzzy.cz> writes:
> The tests are already running, and there are a few postgres processes:
> 
>   PID   VIRT  RES %CPU    TIME+  COMMAND
> 11478   449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
> CREATE VIEW
> 11423   219m  19m  0.0   0:00.17 postgres: checkpointer process
> 11424   219m 2880  0.0   0:00.05 postgres: writer process
> 11425   219m 5920  0.0   0:00.12 postgres: wal writer process
> 11426   219m 2708  0.0   0:00.05 postgres: autovacuum launcher process
> 11427  79544 1836  0.0   0:00.17 postgres: stats collector process
> 11479  1198m 1.0g  0.0  91:09.99 postgres: pgbuild regression [local]
> CREATE INDEX waiting
> 
> Attached is 'pmap -x' output for the two interesting processes (11478,
> 11479).

Could you gdb -p 11479 into the process and issue 'p
MemoryContextStats(TopMemoryContext)'. That should print information
about the server's allocation to its stderr.
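
A non-interactive way to do the same thing would be something like this
(just a sketch, using the PID from your listing):

    # attach gdb in batch mode and call the function; the stats end up on
    # the backend's stderr, i.e. in the postmaster log
    import subprocess

    subprocess.call(['gdb', '-p', '11479', '-batch',
                     '-ex', 'print MemoryContextStats(TopMemoryContext)'])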

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On 17.5.2014 21:31, Andres Freund wrote:
> On 2014-05-17 20:41:37 +0200, Tomas Vondra wrote:
>> On 17.5.2014 19:55, Tom Lane wrote:
>>> Tomas Vondra <tv@fuzzy.cz> writes:
>> The tests are already running, and there are a few postgres processes:
>>
>>   PID   VIRT  RES %CPU    TIME+  COMMAND
>> 11478   449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
>> CREATE VIEW
>> 11423   219m  19m  0.0   0:00.17 postgres: checkpointer process
>> 11424   219m 2880  0.0   0:00.05 postgres: writer process
>> 11425   219m 5920  0.0   0:00.12 postgres: wal writer process
>> 11426   219m 2708  0.0   0:00.05 postgres: autovacuum launcher process
>> 11427  79544 1836  0.0   0:00.17 postgres: stats collector process
>> 11479  1198m 1.0g  0.0  91:09.99 postgres: pgbuild regression [local]
>> CREATE INDEX waiting
>>
>> Attached is 'pmap -x' output for the two interesting processes (11478,
>> 11479).
>
> Could you gdb -p 11479 into the process and issue 'p
> MemoryContextStats(TopMemoryContext)'. That should print information
> about the server's allocation to its stderr.

That process already finished, but I've done that for another process
(which had ~400MB allocated and was growing steadily - about 1MB / 10
seconds). It was running a SELECT query, which has already completed.

Then it executed ANALYZE, and I took several snapshots of it - I'm not sure
how much memory it had at the beginning (maybe ~250MB), then snapshots when
it had ~350MB, 400MB, 500MB and 600MB. It's still running, and I'm not sure
how much more it will grow.

Anyway, the main difference between the analyze snapshot seems to be this:

  init:  CacheMemoryContext: 67100672 total in 17 blocks; ...
  350MB: CacheMemoryContext: 134209536 total in 25 blocks; ...
  400MB: CacheMemoryContext: 192929792 total in 32 blocks; ...
  500MB: CacheMemoryContext: 293593088 total in 44 blocks; ...
  600MB: CacheMemoryContext: 411033600 total in 58 blocks; ...

Not sure if there's something wrong with the SELECT memory context. It
has ~1500 nested nodes like these:

   SQL function data: 24576 total in 2 blocks; ...
  ExecutorState: 24576 total in 2 blocks; ...
    SQL function data: 24576 total in 2 blocks; ...
    ExprContext: 8192 total in 1 blocks; ...

But maybe it's expected / OK.

regards
Tomas

On 17.5.2014 22:33, Tomas Vondra wrote:
> On 17.5.2014 21:31, Andres Freund wrote:
>> On 2014-05-17 20:41:37 +0200, Tomas Vondra wrote:
>>> On 17.5.2014 19:55, Tom Lane wrote:
>>>> Tomas Vondra <tv@fuzzy.cz> writes:
>>> The tests are already running, and there are a few postgres processes:
>>>
>>>   PID   VIRT  RES %CPU    TIME+  COMMAND
>>> 11478   449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
>>> CREATE VIEW
>>> 11423   219m  19m  0.0   0:00.17 postgres: checkpointer process
>>> 11424   219m 2880  0.0   0:00.05 postgres: writer process
>>> 11425   219m 5920  0.0   0:00.12 postgres: wal writer process
>>> 11426   219m 2708  0.0   0:00.05 postgres: autovacuum launcher process
>>> 11427  79544 1836  0.0   0:00.17 postgres: stats collector process
>>> 11479  1198m 1.0g  0.0  91:09.99 postgres: pgbuild regression [local]
>>> CREATE INDEX waiting
>>>
>>> Attached is 'pmap -x' output for the two interesting processes (11478,
>>> 11479).
>>
>> Could you gdb -p 11479 into the process and issue 'p
>> MemoryContextStats(TopMemoryContext)'. That should print information
>> about the server's allocation to its stderr.

And here is another set of memory context stats, this time for a session
executing CREATE INDEX while it had a large amount of memory allocated. The
interesting thing is there are ~11k lines that look exactly like this:

pg_namespace_oid_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used

Seems a bit strange to me.

Tomas

Hi,

On 2014-05-17 22:33:31 +0200, Tomas Vondra wrote:
> Anyway, the main difference between the analyze snapshot seems to be this:
> 
>   init:  CacheMemoryContext: 67100672 total in 17 blocks; ...
>   350MB: CacheMemoryContext: 134209536 total in 25 blocks; ...
>   400MB: CacheMemoryContext: 192929792 total in 32 blocks; ...
>   500MB: CacheMemoryContext: 293593088 total in 44 blocks; ...
>   600MB: CacheMemoryContext: 411033600 total in 58 blocks; ...

Hm, so something is definitely leaking memory inside CacheMemoryContext
itself. Is that happening constantly or just with individual tests?

> Not sure if there's something wrong with the SELECT memory context. It
> has ~1500 of nested nodes like these:
> 
>    SQL function data: 24576 total in 2 blocks; ...
>   ExecutorState: 24576 total in 2 blocks; ...
>     SQL function data: 24576 total in 2 blocks; ...
>     ExprContext: 8192 total in 1 blocks; ...
> 
> But maybe it's expected / OK.

I'd guess that's a recursive function call. Any chance you know what's
been executing at that point? I'd bet it's been the 'errors' check. That
has:
-- Check that stack depth detection mechanism works and
-- max_stack_depth is not set too high
create function infinite_recurse() returns int as
'select infinite_recurse()' language sql;
\set VERBOSITY terse
select infinite_recurse();
ERROR:  stack depth limit exceeded

which'd pretty much produce a tree of executions like yours.

Greetings,

Andres Freund



On 17.5.2014 23:09, Andres Freund wrote:
> Hi,
>
> On 2014-05-17 22:33:31 +0200, Tomas Vondra wrote:
>> Anyway, the main difference between the analyze snapshot seems to be this:
>>
>>   init:  CacheMemoryContext: 67100672 total in 17 blocks; ...
>>   350MB: CacheMemoryContext: 134209536 total in 25 blocks; ...
>>   400MB: CacheMemoryContext: 192929792 total in 32 blocks; ...
>>   500MB: CacheMemoryContext: 293593088 total in 44 blocks; ...
>>   600MB: CacheMemoryContext: 411033600 total in 58 blocks; ...
>
> Hm, so something is definitely leaking memory inside CacheMemoryContext
> itself. Is that happening constantly or just with individual tests?

I think it's happening constantly - I've been eyeballing the VM for some
time, and the backends usually hover around ~220 MB while executing SELECTs
etc. Then one of them executes CREATE INDEX and it starts growing.

Attached is a log for a CREATE INDEX session at the moment it had ~1.3GB
allocated (create-index-5.log.gz). It has ~217k copies of

pg_attrdef_adrelid_adnum_index: 3072 total in 2 blocks; 1968 free (0
chunks); 1104 used

And also another log (AFAIK from another CREATE INDEX session), with
~11k copies of

pg_namespace_oid_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used

>> Not sure if there's something wrong with the SELECT memory context. It
>> has ~1500 of nested nodes like these:
>>
>>    SQL function data: 24576 total in 2 blocks; ...
>>   ExecutorState: 24576 total in 2 blocks; ...
>>     SQL function data: 24576 total in 2 blocks; ...
>>     ExprContext: 8192 total in 1 blocks; ...
>>
>> But maybe it's expected / OK.
>
> I'd guess that's a recursive function call. Any chance you know what's
> been executing at that point? I'd bet it's been the 'errors' check. That
> has:
> -- Check that stack depth detection mechanism works and
> -- max_stack_depth is not set too high
> create function infinite_recurse() returns int as
> 'select infinite_recurse()' language sql;
> \set VERBOSITY terse
> select infinite_recurse();
> ERROR:  stack depth limit exceeded
>
> which'd pretty much produce a tree of executions like yours.

Yeah, I was thinking it's something like that. I'm not sure which tests
were running at that particular moment, so maybe it really was the
'errors' check.

Tomas

On 2014-05-17 22:55:14 +0200, Tomas Vondra wrote:
> And another memory context stats for a session executing CREATE INDEX,
> while having allocated  The interesting thing is there are ~11k lines
> that look exactly like this:
> 
> pg_namespace_oid_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used

Heh. That certainly doesn't look right. I bet it's the contexts from
RelationInitIndexAccessInfo().

Looks like there generally are recursion 'troubles' with system
indexes. RelationClearRelation() will mark them as invalid, causing any
further lookup to do RelationBuildDesc() calls, which will again rebuild
the same relation if it's used somewhere inside RelationBuildDesc() or
RelationInitIndexAccessInfo(). From a quick look it looks like it should
resolve itself after some time, even freeing the superfluous memory
contexts. But I am not sure whether there are scenarios where that won't
happen...


Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-17 22:55:14 +0200, Tomas Vondra wrote:
>> And another memory context stats for a session executing CREATE INDEX,
>> while having allocated  The interesting thing is there are ~11k lines
>> that look exactly like this:
>> 
>> pg_namespace_oid_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used

> Heh. That certainly doesn't look right. I bet it's the contexts from
> RelationInitIndexAccessInfo().

> Looks like there generally are recursion 'troubles' with system
> indexes. RelationClearRelation() will mark them as invalid causing any
> further lookup to do RelationBuildDesc() calls. Which will again rebuild
> the same relation if it's used somewhere inside RelationBuildDesc() or
> RelationInitIndexAccessInfo(). From a quick look it looks like it should
> resolve itself after some time. Even freeing the superflous memory
> contexts. But I am not sure if there scenarios where that won't
> happen...

I've identified a couple of memory leakage scenarios in relcache.c:

1. The RelationCacheInsert macro just blithely ignores the possibility
that hash_search(HASH_ENTER) returns found = true.  There's a comment

    /* used to give notice if found -- now just keep quiet */

which AFAICT was there when we got the code from Berkeley.  Well, it
turns out that that's a can-happen scenario, at least for a few system
catalogs and indexes that are referenced during relcache build.  The
scenario is that we come in through RelationIdGetRelation, don't find
a cache entry for, say, pg_index, and set about building one in
RelationBuildDesc.  Somewhere in the guts of that we have need to read
pg_index, whereupon there's a recursive call of RelationIdGetRelation,
which still doesn't find the entry, so we again call RelationBuildDesc
which builds an entirely separate Relation structure.  The recursion
does terminate thanks to the recursion-stopping provisions in relcache.c,
so the inner call finishes and enters its completed Relation structure
into the RelationIdCache hashtable.  Control returns out to the outer
invocation of RelationBuildDesc, which finishes, and enters its completed
Relation structure into the RelationIdCache hashtable --- overwriting the
link to the Relation made by the inner invocation.  That Relation and
subsidiary data are now unreferenced and permanently leaked in
CacheMemoryContext.  If it's an index you can see its leaked subsidiary
rd_indexcxt in a memory dump, which is what we're looking at above.

AFAICT, the inner invocation's Relation should always have zero reference
count by the time we get back to the outer invocation.  Therefore it
should be possible for RelationCacheInsert to just delete the
about-to-be-unreachable Relation struct.  I'm experimenting with a patch
that adds logic like this to RelationCacheInsert:
    if (found)
    {
    Relation oldrel = idhentry->reldesc;
    idhentry->reldesc = RELATION;
    if (RelationHasReferenceCountZero(oldrel))
        RelationDestroyRelation(oldrel, false);
    else
        elog(WARNING, "leaking still-referenced duplicate relation");
    }
 

and so far it looks good.

2. There's a much smaller leak in AttrDefaultFetch: it doesn't
bother to pfree the result of TextDatumGetCString().  This leakage
occurs in the caller's context not CacheMemoryContext, so it's only
query lifespan not session lifespan, and it would not ordinarily be
significant --- but with the CLOBBER_CACHE logic enabled, we rebuild
some relcache entries a damn lot of times within some queries, so the
leak adds up.

With both of these things fixed, I'm not seeing any significant memory
bloat during the first parallel group of the regression tests.  I don't
think I'll have the patience to let it run much further than that
(the uuid and enum tests are still running after an hour :-().

BTW, it strikes me that we could probably improve the runtime of the
CLOBBER tests noticeably if we were to nail AttrDefaultIndexId,
IndexIndrelidIndexId, and ConstraintRelidIndexId into cache.  I see
no very good reason not to do that; it should help performance a bit
in normal cases too.

While I'm at it: I could not help noticing RememberToFreeTupleDescAtEOX,
which was not there last time I looked at this code.  Isn't that broken
by design?  It's basically a deliberately induced transaction-lifespan
memory leak, and AFAICS if it does anything at all, it's supporting
incorrect calling code.  There should *never* be any situation where it's
not okay to free a tupledesc with zero refcount.  And the comment
justifying it is insane on its face:
         * If we Rebuilt a relcache entry during a transaction then its
         * possible we did that because the TupDesc changed as the result of
         * an ALTER TABLE that ran at less than AccessExclusiveLock. It's
         * possible someone copied that TupDesc, in which case the copy would
         * point to free'd memory. So if we rebuild an entry we keep the
 

If someone copied the tupdesc, how would freeing the original result
in the copy being damaged?  And even more to the point, if anyone is
doing anything that involves inspecting a table's rowtype, how could
it possibly be safe for someone else to be altering the rowtype with
less than exclusive lock?  Unless I hear an adequate defense of that
code PDQ, it's going away.
        regards, tom lane



Hi,

On 2014-05-18 14:49:10 -0400, Tom Lane wrote:
> AFAICT, the inner invocation's Relation should always have zero reference
> count by the time we get back to the outer invocation.  Therefore it
> should be possible for RelationCacheInsert to just delete the
> about-to-be-unreachable Relation struct.  I'm experimenting with a patch
> that adds logic like this to RelationCacheInsert:
> 
>     if (found)
>     {
>     Relation oldrel = idhentry->reldesc;
>     idhentry->reldesc = RELATION;
>     if (RelationHasReferenceCountZero(oldrel))
>         RelationDestroyRelation(oldrel, false);
>     else
>         elog(WARNING, "leaking still-referenced duplicate relation");
>     }
> 
> and so far it looks good.

If that happens we'd essentially have a too low reference count on the
entry remaining in the relcache. I'd consider putting an Assert() in
that branch. Perhaps it should also be only allowed for system
relations? If it somehow happens for user-defined ones we'd lose
rd_newRelfilenodeSubid and such, which we treat as reliable these days.

> With both of these things fixed, I'm not seeing any significant memory
> bloat during the first parallel group of the regression tests.  I don't
> think I'll have the patience to let it run much further than that
> (the uuid and enum tests are still running after an hour :-().

Yea, it's really unbearable. A *single* UPDATE took close to 5 minutes
here when testing the decoding stuff.

> BTW, it strikes me that we could probably improve the runtime of the
> CLOBBER tests noticeably if we were to nail AttrDefaultIndexId,
> IndexIndrelidIndexId, and ConstraintRelidIndexId into cache.  I see
> no very good reason not to do that; it should help performance a bit
> in normal cases too.

Sounds like a good plan.

From what I've seen here, the majority of the time is spent iterating
over various hashes, primarily in relcache.c and catcache.c. Removing
some syscache lookups will reduce the repetitiveness of that a bit.

> While I'm at it: I could not help noticing RememberToFreeTupleDescAtEOX,
> which was not there last time I looked at this code.  Isn't that broken
> by design?  It's basically a deliberately induced transaction-lifespan
> memory leak, and AFAICS if it does anything at all, it's supporting
> incorrect calling code.  There should *never* be any situation where it's
> not okay to free a tupledesc with zero refcount.

I am not sure I understand that code. The whole keep_tupdesc/keep_rules
logic isn't new and is a bit underdocumented as well :(. It seems to come
from 491dd4a9.
If I understand correctly, the issue is that callers have accessed the
tupledesc with RelationGetDescr(), which doesn't increase the
refcount. If there's then a rebuild of that relation's relcache entry,
they'll still be looking at the old tupledesc, which will now have a
reference count of 0. The existing code tries to handle that by not
invalidating the old tupledesc if it's equal to the rebuilt one, thereby
leaving the uncounted reference from the outside in a valid state.
The reason that mostly works < 9.4 is that the tupledesc normally won't
change without an AEL lock on the relation which can't happen while
there are still open references.
But I am wondering, is that actually true, even before today? ... No,
doesn't look that way:
CREATE TABLE a(id serial primary key);
BEGIN;
DECLARE foo CURSOR FOR SELECT * FROM a;
ALTER TABLE a RENAME COLUMN id TO id2;

So, renaming a column doesn't enforce CheckTableNotInUse(), which means
it's possible to grab a tupledesc via RelationGetDescr() before calling an
external function which might do an ALTER TABLE .. RENAME COLUMN, thereby
triggering !keep_tupdesc and freeing the existing tupledesc.
Now, it will likely work in this case because the DECLARE has probably
properly incremented the refcount, but nonetheless... I am pretty sure I
could construct a crasher with that, even < 9.4.

This is a mess...

I think the code really only triggers in the !keep_tupdesc case, which
should be a fairly small number of cases. So it's probably not too bad
performance-wise.

>          * If we Rebuilt a relcache entry during a transaction then its
>          * possible we did that because the TupDesc changed as the result of
>          * an ALTER TABLE that ran at less than AccessExclusiveLock. It's
>          * possible someone copied that TupDesc, in which case the copy would
>          * point to free'd memory. So if we rebuild an entry we keep the
> 
> If someone copied the tupdesc, how would freeing the original result
> in the copy being damaged?

The sentence only seems to make sense with s/copy/accessed without
increasing the refcount, e.g. with RelationGetDescr,/.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-18 14:49:10 -0400, Tom Lane wrote:
>> if (RelationHasReferenceCountZero(oldrel))
>>     RelationDestroyRelation(oldrel, false);
>> else
>>     elog(WARNING, "leaking still-referenced duplicate relation");

> If that happens we'd essentially have a too low reference count on the
> entry remaining in the relcache.

No, we'd have two independent entries, each with its own correct refcount.
When the refcount on the no-longer-linked-in-the-hashtable entry goes to
zero, it'd be leaked, same as it's always been.  (The refcount presumably
corresponds to someone holding a direct pointer to the Relation struct,
which is what they'd use to decrement the refcount.)

> I'd consider putting an Assert() in that branch.

I'm a bit afraid to do that for a condition that the system's never tested
for at all up to now; in any case, a WARNING would be visible in production
whereas an Assert would probably do nothing.  If we see no reports of this
WARNING for a release cycle or so, maybe asserting would be appropriate.

I did put in an assert to prevent this path from being taken in cases
where we don't expect that a previous entry could possibly exist.

> Perhaps it should also be only allowed for system
> relations?

One would hope those are the only ones that get opened during relcache
load ;-)

>> BTW, it strikes me that we could probably improve the runtime of the
>> CLOBBER tests noticeably if we were to nail AttrDefaultIndexId,
>> IndexIndrelidIndexId, and ConstraintRelidIndexId into cache.  I see
>> no very good reason not to do that; it should help performance a bit
>> in normal cases too.

> Sounds like a good plan.

I experimented with that and didn't really see any marked improvement
in the CLOBBER_CACHE_ALWAYS runtime, so I'm less excited about it than
I was yesterday.

Still concerned about RememberToFreeTupleDescAtEOX, but that's an
independent issue really.
        regards, tom lane



On 2014-05-19 11:25:04 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-05-18 14:49:10 -0400, Tom Lane wrote:
> >> if (RelationHasReferenceCountZero(oldrel))
> >>     RelationDestroyRelation(oldrel, false);
> >> else
> >>     elog(WARNING, "leaking still-referenced duplicate relation");
> 
> > If that happens we'd essentially have a too low reference count on the
> > entry remaining in the relcache.
> 
> No, we'd have two independent entries, each with its own correct refcount.
> When the refcount on the no-longer-linked-in-the-hashtable entry goes to
> zero, it'd be leaked, same as it's always been.  (The refcount presumably
> corresponds to someone holding a direct pointer to the Relation struct,
> which is what they'd use to decrement the refcount.)

The problem is that only one of these entries will get properly handled
by cache invalidation. I wonder if the correct thing wouldn't be to
return the entry already in the cache. But that'd not be trivial to do
either, without the potential to return an invalid entry :(

> > I'd consider putting an Assert() in that branch.
> 
> I'm a bit afraid to do that for a condition that the system's never tested
> for at all up to now; in any case, a WARNING would be visible in production
> whereas an Assert would probably do nothing.  If we see no reports of this
> WARNING for a release cycle or so, maybe asserting would be appropriate.

Fair point.

> > Perhaps it should also be only allowed for system
> > relations?
> 
> One would hope those are the only ones that get opened during relcache
> load ;-)

I thought about it for a while and I wonder if that's necessarily
correct. If somebody registers a relcache invalidation callback, couldn't
that happen when invalidations are processed somewhere while rebuilding
an entry?

> Still concerned about RememberToFreeTupleDescAtEOX, but that's an
> independent issue really.

Me too and yes.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-19 11:25:04 -0400, Tom Lane wrote:
>> No, we'd have two independent entries, each with its own correct refcount.
>> When the refcount on the no-longer-linked-in-the-hashtable entry goes to
>> zero, it'd be leaked, same as it's always been.  (The refcount presumably
>> corresponds to someone holding a direct pointer to the Relation struct,
>> which is what they'd use to decrement the refcount.)

> The problem is that only one of these entries will get properly handled
> by cache invalidation.

I don't think that's a problem; if the old entry is still referenced,
presumably the holder of the reference is also holding a lock sufficient
to prevent any problematic schema changes.

> I wonder if the correct thing wouldn't be to
> return the entry already in the cache. But that'd not be trivial to do
> either, without the potential to return a invalid entry :(

The one we just finished computing is presumably "newer" in some sense,
otherwise I would've done it like that.  In particular, if the older
entry is no longer rd_isvalid, it'd definitely be unsafe to return.

>> One would hope those are the only ones that get opened during relcache
>> load ;-)

> I thought about it for a while and I wonder if that's necessarily
> correct. If somebody registers a relcache invalidation callback that
> could happen when invalidations are processed somewhere while rebuilding
> a entry?

Cache invalidation callbacks shouldn't do anything that would involve
opening non-system relations, IMO.
        regards, tom lane



On 18.5.2014 20:49, Tom Lane wrote:
> With both of these things fixed, I'm not seeing any significant memory
> bloat during the first parallel group of the regression tests.  I don't
> think I'll have the patience to let it run much further than that
> (the uuid and enum tests are still running after an hour :-().

I've been watching barnacle running the tests with this commit applied,
and so far it seems fine - the memory consumption is hovering around
240MB, and it's been running for ~8 hours now without a single OOM.
Previously the first OOM happened within ~3 hours.

I've also included a simple python script that checks the amount of memory
used by the postgres processes every minute, and if it's more than 512MB,
uses gdb to flush memory context statistics into the postmaster log.
So far that hasn't happened.
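
For illustration, a simplified sketch of such a watchdog (not the exact
script attached here; it assumes pgrep is available and checks per-process
RSS) might look like this:

    #!/usr/bin/env python
    # once a minute, check the RSS of every postgres process and dump the
    # memory context stats of any process that exceeds 512 MB
    import subprocess
    import time

    LIMIT_KB = 512 * 1024

    def rss_kb(pid):
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])   # VmRSS is reported in kB
        return 0

    while True:
        pids = subprocess.check_output(['pgrep', '-x', 'postgres']).split()
        for pid in pids:
            pid = pid.decode()
            if rss_kb(pid) > LIMIT_KB:
                # MemoryContextStats() writes to the backend's stderr,
                # which ends up in the postmaster log
                subprocess.call(['gdb', '-p', pid, '-batch', '-ex',
                                 'print MemoryContextStats(TopMemoryContext)'])
        time.sleep(60)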

Sadly, with 'mite' this resulted in a 900+ MB postmaster.log, which then
apparently caused the buildfarm client to be killed by the OOM killer (it
seems the client loads the log into memory, not expecting it to be this
large). That's why there are no results from mite available.

BTW, I see this was applied to HEAD only. Shouldn't this be applied to
the other branches too?

regards
Tomas



Tomas Vondra <tv@fuzzy.cz> writes:
> On 18.5.2014 20:49, Tom Lane wrote:
>> With both of these things fixed, I'm not seeing any significant memory
>> bloat during the first parallel group of the regression tests.  I don't
>> think I'll have the patience to let it run much further than that
>> (the uuid and enum tests are still running after an hour :-().

> BTW, I see this was applied to HEAD only. Shouldn't this be applied to
> the other branches too?

I intentionally didn't do that, first because I have only a limited amount
of confidence in the patch, and second because I don't think it matters
for anything except CLOBBER_CACHE_RECURSIVELY testing.  Which surely
you're not going to do in the back branches?
        regards, tom lane



On 19.5.2014 22:11, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> On 18.5.2014 20:49, Tom Lane wrote:
>>> With both of these things fixed, I'm not seeing any significant memory
>>> bloat during the first parallel group of the regression tests.  I don't
>>> think I'll have the patience to let it run much further than that
>>> (the uuid and enum tests are still running after an hour :-().
> 
>> BTW, I see this was applied to HEAD only. Shouldn't this be applied to
>> the other branches too?
> 
> I intentionally didn't do that, first because I have only a limited
> amount of confidence in the patch, and second because I don't think
> it matters for anything except CLOBBER_CACHE_RECURSIVELY testing.
> Which surely you're not going to do in the back branches?

Yes, I'm running this on HEAD only.

I was however wondering if this might be related to OOM errors a few
local users reported to us. IIRC they've been using temporary tables
quite heavily - not sure if that could be related.

regards
Tomas



On 2014-05-19 23:40:32 +0200, Tomas Vondra wrote:
> On 19.5.2014 22:11, Tom Lane wrote:
> > Tomas Vondra <tv@fuzzy.cz> writes:
> > I intentionally didn't do that, first because I have only a limited
> > amount of confidence in the patch, and second because I don't think
> > it matters for anything except CLOBBER_CACHE_RECURSIVELY testing.
> > Which surely you're not going to do in the back branches?
> 
> Yes, I'm running this on HEAD only.
> 
> I was however wondering if this might be related to OOM errors a few
> local users reported to us. IIRC they've been using temporary tables
> quite heavily - not sure if that could be related.

I've significant doubts that it is. I suggest you disable memory
overcommit - then you'll get nice memory dumps on stderr instead of oom
kills.
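
On Linux that means setting vm.overcommit_memory = 2. One way to flip it
(a sketch; needs root, and assumes /proc/sys is actually writable inside
the container):

    # equivalent of 'sysctl -w vm.overcommit_memory=2'
    with open('/proc/sys/vm/overcommit_memory', 'w') as f:
        f.write('2\n')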

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-19 23:40:32 +0200, Tomas Vondra wrote:
>> I was however wondering if this might be related to OOM errors a few
>> local users reported to us. IIRC they've been using temporary tables
>> quite heavily - not sure if that could be related.

> I've significant doubts that it is.

The main leak (the whole-Relation one) only applies to entries for certain
system catalogs.  The smaller leak would apply to any relation having
DEFAULT column values --- but it's fairly hard to see how any table would
get invalidated so many times in one query as to make that leak an issue,
unless you were doing CLOBBER_CACHE or similar stress testing.

> I suggest you disable memory overcommit - then you'll get nice memory
> dumps on stderr instead of oom kills.

Yeah, a context stats dump would help a lot here.
        regards, tom lane



On 2014-05-19 13:45:15 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-05-19 11:25:04 -0400, Tom Lane wrote:
> >> No, we'd have two independent entries, each with its own correct refcount.
> >> When the refcount on the no-longer-linked-in-the-hashtable entry goes to
> >> zero, it'd be leaked, same as it's always been.  (The refcount presumably
> >> corresponds to someone holding a direct pointer to the Relation struct,
> >> which is what they'd use to decrement the refcount.)
> 
> > The problem is that only one of these entries will get properly handled
> > by cache invalidation.
> 
> I don't think that's a problem; if the old entry is still referenced,
> presumably the holder of the reference is also holding a lock sufficient
> to prevent any problematic schema changes.

Except if it's during DDL itself... But we'd probably have seen problems
by now if there actually was a problem.

> > I thought about it for a while and I wonder if that's necessarily
> > correct. If somebody registers a relcache invalidation callback that
> > could happen when invalidations are processed somewhere while rebuilding
> > a entry?
> 
> Cache invalidation callbacks shouldn't do anything that would involve
> opening non-system relations, IMO.

I think that's unnecessarily restrictive. There are good reasons to build
caches on top of the relcache that also have to visit user-defined
relations - imagine the configuration tables of a replication solution,
for example. Since there's no builtin, integrated way to add data to
relcache entries, that's pretty much the only choice.
It's probably best to avoid these problems by simply setting an
->invalid flag or something, but I can imagine cases where that's not
the best strategy.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Mon, May 19, 2014 at 05:08:04PM +0200, Andres Freund wrote:
> On 2014-05-18 14:49:10 -0400, Tom Lane wrote:
> > While I'm at it: I could not help noticing RememberToFreeTupleDescAtEOX,
> > which was not there last time I looked at this code.  Isn't that broken
> > by design?  It's basically a deliberately induced transaction-lifespan
> > memory leak, and AFAICS if it does anything at all, it's supporting
> > incorrect calling code.  There should *never* be any situation where it's
> > not okay to free a tupledesc with zero refcount.
> 
> I am not sure I understand that code. The whole keep_tupdesc/keep_rules
> logic isn't new and a bit underdocumented as well :(. Seems to come from
> 491dd4a9.
> If I understand correctly the issue is that callers have accessed the
> tupledesc with RelationGetDescr() which doesn't increase the
> refcount.

Right.  See here for background:
http://www.postgresql.org/message-id/flat/20130801005351.GA325106@tornado.leadboat.com

> >          * If we Rebuilt a relcache entry during a transaction then its
> >          * possible we did that because the TupDesc changed as the result of
> >          * an ALTER TABLE that ran at less than AccessExclusiveLock. It's
> >          * possible someone copied that TupDesc, in which case the copy would
> >          * point to free'd memory. So if we rebuild an entry we keep the
> > 
> > If someone copied the tupdesc, how would freeing the original result
> > in the copy being damaged?
> 
> The sentence only seems to make sense with s/copy/accessed without
> increasing the refcoung, e.g. with RelationGetDescr,/.

Agreed.  The hazards arise when copying the rd_att pointer, not when
deep-copying the whole TupleDesc.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com