We need to rethink relation cache entry rebuild - Mailing list pgsql-hackers

From: Tom Lane
Subject: We need to rethink relation cache entry rebuild
Date:
Msg-id: 24663.1262998895@sss.pgh.pa.us
List: pgsql-hackers
I mentioned earlier that buildfarm member jaguar (that's the one that
builds with CLOBBER_CACHE_ALWAYS) was showing suspicious intermittent
failures.  There's another one today:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-01-08%2004:00:02
and I also managed to reproduce a similar problem locally after running
the parallel regression tests with CLOBBER_CACHE_ALWAYS for about a day.
I'm not entirely sure yet that I understand everything that is going
wrong, but one thing is quite clear: honoring a SIGINT or SIGTERM during
a relcache rebuild leads to an Assert failure or worse.
The stack trace at the bottom of the above-mentioned page shows the
problem pretty clearly.  What appears to have happened is that a DROP
DATABASE was issued, causing a cancel to be sent to an autovac worker in
that database, and the worker happened to be in the middle of a relcache
entry rebuild when it accepted the signal.  (CLOBBER_CACHE_ALWAYS makes
this hugely more probable, but it could happen in ordinary builds too.)
The problem is that the relcache entry being rebuilt is referenced, so
it has a positive reference count that is tracked in a ResourceOwner
... but the refcount field has been temporarily zeroed while
RelationBuildDesc runs, so when transaction abort tries to release the
refcount we hit that Assert in RelationDecrementReferenceCount.
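
For reference, the decrement path looks roughly like this (simplified and
quoted from memory, so treat it as a sketch of the relcache.c logic rather
than an exact excerpt):

    void
    RelationDecrementReferenceCount(Relation rel)
    {
        /* this is the Assert that trips when the rebuild has temporarily
         * zeroed rd_refcnt out from under us */
        Assert(rel->rd_refcnt > 0);
        rel->rd_refcnt -= 1;
        if (!IsBootstrapProcessingMode())
            ResourceOwnerForgetRelationRef(CurrentResourceOwner, rel);
    }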

In a non-Assert build you wouldn't notice an immediate problem, but that
doesn't mean things are okay in the field.  The problem can be stated
more generally as: relcache rebuild temporarily messes up a relcache
entry that other places have got pointers to.  If an error occurs during
rebuild, those other places might still try to use their pointers,
expecting to reference a valid relcache entry.  I think this could
happen for example if the error occurred inside a subtransaction, and
we trapped the exception and allowed the outer transaction to resume.
Code in the outer transaction would naturally still expect its pointer
to a valid, reference-count-incremented relcache entry to be good.
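
To make that concrete, here is the sort of calling pattern I'm worried
about, in heavily simplified form (the failing helper is hypothetical, and
the usual memory-context and resource-owner bookkeeping around the
subtransaction is omitted):

    Relation    rel = relation_open(relid, AccessShareLock);   /* refcount bumped */

    BeginInternalSubTransaction(NULL);
    PG_TRY();
    {
        /* hypothetical: something in here invalidates rel's relcache entry
         * and then errors out partway through the rebuild */
        do_something_that_forces_a_rebuild_and_fails(rel);
    }
    PG_CATCH();
    {
        /* trap the error; only the subtransaction gets rolled back */
        FlushErrorState();
        RollbackAndReleaseCurrentSubTransaction();
    }
    PG_END_TRY();

    /* the outer code quite reasonably keeps using its pointer, but the
     * entry behind "rel" may have been left half-rebuilt */
    relation_close(rel, AccessShareLock);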

RelationClearRelation does take the step of de-linking the entry it's
going to rebuild from the hash table.  When that code was designed,
before resource owners or subtransactions, I think it was actually
safe --- an error could result in a leaked entry in CacheMemoryContext,
but there could not be any live pointers to the partially-rebuilt entry
after we did transaction abort cleanup.  Now, however, it is not safe at
all, and only the rather low probability of failure has kept us from
spotting the problem before.

Basically I think we have to fix this by ensuring that an error escape
can't occur while a relcache entry is in a partially rebuilt state.
What I have in mind is to rewrite RelationClearRelation so that it
does a rebuild this way:

1. Build a new relcache entry from scratch.  This entry won't be linked
from anywhere.  A failure here results in no worse than some leaked
memory.

2. Swap the contents of the old and new relcache entries.  Do this in
straight-line code that has no possibility of CHECK_FOR_INTERRUPTS.
But don't swap the refcount fields, nor certain infrastructure pointers
that we try to preserve intact if possible --- for example, don't swap
the tupledesc pointers if the old and new tupledescs are equal.

3. Free the new relcache entry along with the (possibly swapped)
infrastructure for it.

One slightly tricky spot is that the index infrastructure has to be
swapped all or nothing, because it all lives in a single subsidiary
memory context.  I think this is all right, but if it isn't we'll need
to expend more code on cleaning that infrastructure up properly instead
of just destroying a context.
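
In rough pseudo-C, the whole sequence might look something like the sketch
below.  This is only a sketch, not a patch: the RelationBuildDesc call is
schematic, the cleanup of the superseded tupledesc, rules, triggers, etc.
is elided, and only the existing fields rd_refcnt, rd_att, and rd_indexcxt
are shown.

    Relation    newrel;

    /*
     * 1. Build a completely fresh entry.  It isn't linked from anywhere,
     * so an error here costs nothing worse than some leaked memory.
     */
    newrel = RelationBuildDesc(RelationGetRelid(relation), false);

    /*
     * 2. Swap the contents of the two entries in straight-line code with
     * no CHECK_FOR_INTERRUPTS and no possibility of elog(ERROR).  Keep the
     * old refcount (that's what ResourceOwners are tracking), and keep old
     * infrastructure pointers when the new data turns out to be identical,
     * so that callers' pointers (e.g. to the tupledesc) stay valid.
     */
    {
        RelationData tmp;

        memcpy(&tmp, newrel, sizeof(RelationData));
        memcpy(newrel, relation, sizeof(RelationData));   /* newrel = old contents */
        memcpy(relation, &tmp, sizeof(RelationData));     /* relation = new contents */

        relation->rd_refcnt = newrel->rd_refcnt;   /* preserve old refcount */
        newrel->rd_refcnt = 0;

        if (equalTupleDescs(relation->rd_att, newrel->rd_att))
        {
            /* hand the old tupledesc back to the surviving entry */
            TupleDesc   newdesc = relation->rd_att;

            relation->rd_att = newrel->rd_att;
            newrel->rd_att = newdesc;
        }
    }

    /*
     * 3. Throw away the temporary entry, which now holds the superseded
     * data.  The old index support info goes all-or-nothing because it
     * lives in the single child memory context rd_indexcxt.
     */
    if (newrel->rd_indexcxt != NULL)
        MemoryContextDelete(newrel->rd_indexcxt);
    pfree(newrel);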

Comments?
        regards, tom lane

