Re: [BUGS] BUG #5412: test case produced, possible race condition. - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: [BUGS] BUG #5412: test case produced, possible race condition.
Date	April 14, 2010 18:32:05
Msg-id	8731.1271269904@sss.pgh.pa.us
Responses	Re: [BUGS] BUG #5412: test case produced, possible race condition. ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) Re: [BUGS] BUG #5412: test case produced, possible race condition. (Rusty Conover <rconover@infogears.com>)
List	pgsql-hackers

I wrote:
> [ theory about cause of Rusty's crash ]

I started to doubt this theory after wondering why the problem hadn't
been exposed by CLOBBER_CACHE_ALWAYS testing, which is done routinely
by the buildfarm.  That setting would surely cause the cache flush to
happen at the troublesome time.  After a good deal more investigation,
I found out why it doesn't crash with that.  The problematic case is
for a relation that has rd_newRelfilenodeSubid nonzero but
rd_createSubid zero (i.e., it's been truncated in the current xact).
Given that, RelationFlushRelation will attempt a rebuild but
RelationCacheInvalidate won't exempt the relation from destruction.
However, if you do a TRUNCATE under CLOBBER_CACHE_ALWAYS, the relcache
entry gets blown away immediately at the conclusion of that command,
because we'll do a RelationCacheInvalidate as a consequence of
CLOBBER_CACHE_ALWAYS.  When the relcache entry is rebuilt for later use,
it won't have rd_newRelfilenodeSubid set, so it's not a hazard anymore.
In order to expose this bug, the relcache entry has to survive past the
TRUNCATE and then a cache flush has to occur while we are in the process
of rebuilding it, not before.
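
As an illustration only, the reasoning above can be reduced to a toy model.
This is not PostgreSQL source: the struct and every toy_* function are
hypothetical stand-ins, and only the two field names are taken from the
message. The sketch assumes that a destroy-and-rebuild of the entry loses
rd_newRelfilenodeSubid, as described above, and shows that a flush right
after TRUNCATE removes the hazardous state before a later rebuild can hit it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a relcache entry; only the two fields discussed above. */
typedef struct ToyRelcacheEntry
{
    unsigned rd_createSubid;          /* nonzero if created in current xact */
    unsigned rd_newRelfilenodeSubid;  /* nonzero if given a new relfilenode */
} ToyRelcacheEntry;

/* The problematic state: truncated in this xact, but not created in it. */
static bool
toy_entry_is_hazardous(const ToyRelcacheEntry *e)
{
    return e->rd_newRelfilenodeSubid != 0 && e->rd_createSubid == 0;
}

/* Model TRUNCATE: the entry records the subxact that assigned a new file. */
static void
toy_truncate(ToyRelcacheEntry *e, unsigned subid)
{
    assert(subid != 0);
    e->rd_newRelfilenodeSubid = subid;
}

/* Model a destroy-and-rebuild: the rebuilt entry no longer has the flag. */
static void
toy_destroy_and_rebuild(ToyRelcacheEntry *e)
{
    e->rd_newRelfilenodeSubid = 0;
}

/*
 * Returns whether the entry is still in the hazardous state when a later
 * command runs.  With flush_after_truncate set (the always-clobber case),
 * the entry is rebuilt at the end of TRUNCATE and the hazard is gone.
 */
static bool
toy_hazard_survives_truncate(bool flush_after_truncate)
{
    ToyRelcacheEntry e = {0, 0};

    toy_truncate(&e, 1);
    if (flush_after_truncate)
        toy_destroy_and_rebuild(&e);
    return toy_entry_is_hazardous(&e);
}
```

In this model, toy_hazard_survives_truncate(true) is false and
toy_hazard_survives_truncate(false) is true, which matches the observation
that the always-clobber setting hides the problem.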

What this suggests is that CLOBBER_CACHE_ALWAYS is actually too strong
to provide a thorough test of cache flush hazards.  Maybe we need an
alternate setting along the lines of CLOBBER_CACHE_SOMETIMES that would
randomly choose whether or not to flush at any given opportunity.  But
if such a setup did produce a crash, it'd be awfully hard to reproduce
for investigation.  Ideas?
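
One possible shape for such a setting, sketched under stated assumptions:
drive the flush decision from a seeded pseudo-random generator, so that a
crashing run could in principle be replayed by reusing the same seed and
flush probability. Nothing below exists in PostgreSQL; ClobberState,
clobber_init and clobber_should_flush are hypothetical names, and the
generator is a plain xorshift64 chosen only for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for a "sometimes" clobber mode; not PostgreSQL code. */
typedef struct ClobberState
{
    uint64_t rng;      /* xorshift64 state, must never be zero */
    uint32_t percent;  /* chance (0..100) of flushing at each opportunity */
    uint64_t calls;    /* number of flush opportunities seen so far */
} ClobberState;

/* Start a run; the same seed and percent give the same decisions. */
static void
clobber_init(ClobberState *st, uint64_t seed, uint32_t percent)
{
    assert(percent <= 100);
    st->rng = seed ? seed : 1;  /* xorshift cannot leave the zero state */
    st->percent = percent;
    st->calls = 0;
}

/* One step of a 64-bit xorshift generator. */
static uint64_t
xorshift64(uint64_t *s)
{
    uint64_t x = *s;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *s = x;
    return x;
}

/* Decide whether to flush at this opportunity, and count the opportunity. */
static bool
clobber_should_flush(ClobberState *st)
{
    st->calls++;
    return (xorshift64(&st->rng) % 100) < st->percent;
}
```

Replay only works if the sequence of flush opportunities is itself the same
from run to run, which would not hold when timing or concurrent sessions
affect the code path. Logging the seed and the value of calls at the point
of failure would at least narrow down which opportunity triggered the crash.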

There is another slightly odd thing here, which is that the stack trace
Rusty provided clearly shows the crash occurring during processing of a
local relcache invalidation message for the truncated relation.  This
would be expected during execution of the TRUNCATE itself, but at that
point the rel has a positive refcnt so there's no problem.  According to
the stack trace the active SQL command is an INSERT ... SELECT, and I
wouldn't expect that to queue any relcache invals.  Are there any
triggers or other unusual things in the real application (not the
watered-down test case) that would be triggered in INSERT/SELECT?
        regards, tom lane
