Re: [BUGS] BUG #5412: test case produced, possible race condition. - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [BUGS] BUG #5412: test case produced, possible race condition.
Msg-id 8731.1271269904@sss.pgh.pa.us
Responses Re: [BUGS] BUG #5412: test case produced, possible race condition.  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: [BUGS] BUG #5412: test case produced, possible race condition.  (Rusty Conover <rconover@infogears.com>)
List pgsql-hackers
I wrote:
> [ theory about cause of Rusty's crash ]

I started to doubt this theory after wondering why the problem hadn't
been exposed by CLOBBER_CACHE_ALWAYS testing, which is done routinely
by the buildfarm.  That setting would surely cause the cache flush to
happen at the troublesome time.  After a good deal more investigation,
I found out why it doesn't crash with that.  The problematic case is
for a relation that has rd_newRelfilenodeSubid nonzero but
rd_createSubid zero (i.e., it's been truncated in the current xact).
Given that, RelationFlushRelation will attempt a rebuild but
RelationCacheInvalidate won't exempt the relation from destruction.
However, if you do a TRUNCATE under CLOBBER_CACHE_ALWAYS, the relcache
entry gets blown away immediately at the conclusion of that command,
because we'll do a RelationCacheInvalidate as a consequence of
CLOBBER_CACHE_ALWAYS.  When the relcache entry is rebuilt for later use,
it won't have rd_newRelfilenodeSubid set, so it's not a hazard anymore.
In order to expose this bug, the relcache entry has to survive past the
TRUNCATE and then a cache flush has to occur while we are in the process
of rebuilding it, not before.
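
To make the condition concrete, here is a toy model in C, a sketch only.
ToyRelation and the three helper functions are hypothetical names and are
not the real relcache code; they encode just the two checks described
above and ignore refcounts and everything else the real functions look at.

```c
#include <stdbool.h>

/*
 * Toy stand-in for the two relcache fields discussed above.
 * A value of 0 means "not set" in this model.
 */
typedef struct ToyRelation
{
    unsigned int rd_createSubid;         /* nonzero: rel created in current xact */
    unsigned int rd_newRelfilenodeSubid; /* nonzero: new relfilenode in current xact,
                                          * e.g. after TRUNCATE */
} ToyRelation;

/* Models the flush path as described: rebuild if either field is set. */
static bool
toy_flush_attempts_rebuild(const ToyRelation *rel)
{
    return rel->rd_createSubid != 0 || rel->rd_newRelfilenodeSubid != 0;
}

/* Models the invalidate-everything path as described: only rd_createSubid exempts. */
static bool
toy_invalidate_exempts(const ToyRelation *rel)
{
    return rel->rd_createSubid != 0;
}

/*
 * The problematic case: one path wants to rebuild the entry in place,
 * while the other is still willing to destroy it.
 */
static bool
toy_is_hazardous(const ToyRelation *rel)
{
    return toy_flush_attempts_rebuild(rel) && !toy_invalidate_exempts(rel);
}
```

In this model only the truncated-in-current-xact combination
(rd_createSubid zero, rd_newRelfilenodeSubid nonzero) is flagged; a rel
created in the current xact, or one with neither field set, is not.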

What this suggests is that CLOBBER_CACHE_ALWAYS is actually too strong
to provide a thorough test of cache flush hazards.  Maybe we need an
alternate setting along the lines of CLOBBER_CACHE_SOMETIMES that would
randomly choose whether or not to flush at any given opportunity.  But
if such a setup did produce a crash, it'd be awfully hard to reproduce
for investigation.  Ideas?
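
One possible shape for such a setting, as a rough sketch in C rather than
a worked-out proposal: draw the flush decision from a small PRNG with an
explicit seed, so that a crashing run could in principle be replayed by
reusing the seed. The seed idea is an assumption added here, and it would
only help if the sequence of flush opportunities were itself deterministic.
ClobberState and the clobber_* functions are hypothetical names, not
existing PostgreSQL code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for a "clobber sometimes" decision source. */
typedef struct ClobberState
{
    uint32_t state;   /* xorshift32 state; must never be zero */
    uint32_t percent; /* chance, 0..100, of flushing at each opportunity */
} ClobberState;

/* Initialize from a seed; the same seed yields the same decision sequence. */
static void
clobber_init(ClobberState *cs, uint32_t seed, uint32_t percent)
{
    cs->state = (seed != 0) ? seed : 1u;             /* xorshift32 needs nonzero state */
    cs->percent = (percent > 100u) ? 100u : percent; /* clamp to 0..100 */
}

/* One step of the xorshift32 generator (shifts 13, 17, 5). */
static uint32_t
clobber_next(ClobberState *cs)
{
    uint32_t x = cs->state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    cs->state = x;
    return x;
}

/* Decide whether to flush at this opportunity. */
static bool
clobber_should_flush(ClobberState *cs)
{
    return (clobber_next(cs) % 100u) < cs->percent;
}
```

With percent set to 100 this degenerates to flushing at every opportunity,
which is the CLOBBER_CACHE_ALWAYS behavior; with 0 it never flushes.
Logging the seed at startup would be the minimum needed to retry a
failing run.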

There is another slightly odd thing here, which is that the stack trace
Rusty provided clearly shows the crash occurring during processing of a
local relcache invalidation message for the truncated relation.  This
would be expected during execution of the TRUNCATE itself, but at that
point the rel has positive refcnt so there's no problem.  According to
the stack trace the active SQL command is an INSERT ... SELECT, and I
wouldn't expect that to queue any relcache invals.  Are there any
triggers or other unusual things in the real application (not the
watered-down test case) that would fire during the INSERT ... SELECT?
        regards, tom lane