Re: Postgresql 8.4.1 segfault, backtrace - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Postgresql 8.4.1 segfault, backtrace
Msg-id 730.1253833254@sss.pgh.pa.us
In response to Postgresql 8.4.1 segfault, backtrace  (Richard Neill <rn214@cam.ac.uk>)
Responses Re: Postgresql 8.4.1 segfault, backtrace  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Postgresql 8.4.1 segfault, backtrace  ("Michael Brown" <mbrown@fensystems.co.uk>)
Re: Postgresql 8.4.1 segfault, backtrace  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-bugs
Michael Brown <mbrown@fensystems.co.uk> writes:
>> ... (If you have a spare machine with the same OS and
>> the same postgres executables, maybe you could put the core file on that
>> and let me ssh in to have a look?)

[ ssh details ]

Thanks for letting me poke around.  What I found out is that the
hash_seq_search loop in RelationCacheInitializePhase2 is crashing
because it's attempting to examine a hashtable entry that is on the
hashtable's freelist!?  Given that information I think the cause of
the bug is fairly clear:

1. RelationCacheInitializePhase2 loads the rules or trigger descriptions
for some system catalog (actually it must be the latter; we haven't got
any catalogs with rules attached).

2. By chance, a shared-cache-inval flush comes through while it's doing
that, causing all non-open, non-nailed relcache entries to be discarded,
including, in particular, the one that is "next" according to the
hash_seq_search's status.

3. Now the loop iterates into the freelist, and kaboom.  It will
probably fail to fail on entries that are actually discarded, because
they still have valid pointers in them ... but as soon as it gets to
a never-yet-used freelist entry, it'll do a null dereference.
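
The three steps above can be modelled in a few lines. This is only a toy
sketch, not PostgreSQL's dynahash code: ToyTable, ToyScan, and all the
functions below are hypothetical names made up for illustration. It assumes
a single bucket chain and a scan cursor that remembers the "next" entry in
advance, and it counts stale visits via a flag instead of crashing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical, not dynahash): one bucket chain plus a freelist. */
typedef struct Entry {
    struct Entry *link;    /* next entry in the bucket chain or the freelist */
    bool          in_use;  /* false once the entry sits on the freelist */
    int           key;
} Entry;

typedef struct {
    Entry *bucket;         /* single bucket chain, for simplicity */
    Entry *freelist;
} ToyTable;

typedef struct {
    Entry *next;           /* entry the scan will hand out next */
} ToyScan;

/* Push an entry onto the head of the bucket chain. */
void toy_insert(ToyTable *t, Entry *e, int key)
{
    e->key = key;
    e->in_use = true;
    e->link = t->bucket;
    t->bucket = e;
}

/* Unlink the entry with the given key and push it onto the freelist. */
void toy_remove(ToyTable *t, int key)
{
    Entry **pp = &t->bucket;

    while (*pp != NULL) {
        Entry *e = *pp;

        if (e->key == key) {
            *pp = e->link;
            e->in_use = false;
            e->link = t->freelist;   /* link now points into the freelist */
            t->freelist = e;
            return;
        }
        pp = &e->link;
    }
}

void toy_scan_init(ToyScan *s, ToyTable *t)
{
    s->next = t->bucket;
}

/* Return the remembered entry and remember its successor in advance. */
Entry *toy_scan_next(ToyScan *s)
{
    Entry *e = s->next;

    if (e != NULL)
        s->next = e->link;
    return e;
}

/* Returns how many freelist entries the scan wandered into. */
int demo_stale_visits(void)
{
    Entry    pool[3];
    ToyTable t = {NULL, NULL};
    ToyScan  s;
    Entry   *e;
    int      stale = 0;

    toy_insert(&t, &pool[0], 10);
    toy_insert(&t, &pool[1], 20);
    toy_insert(&t, &pool[2], 30);    /* chain is now 30 -> 20 -> 10 */

    toy_scan_init(&s, &t);
    e = toy_scan_next(&s);           /* returns 30; cursor remembers 20 */
    toy_remove(&t, 20);              /* the "flush" discards that entry */

    while ((e = toy_scan_next(&s)) != NULL)
        if (!e->in_use)
            stale++;                 /* scan is now walking the freelist */
    return stale;
}
```

In this toy setup demo_stale_visits() returns 1: the scan hands out the
discarded entry 20, follows its link into the (empty) freelist, and never
reaches entry 10. The toy freelist ends in NULL, so nothing crashes here;
the model only shows how the cursor ends up on the wrong list.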

RelationCacheInitializePhase2 is breaking the rules by assuming that it
can continue to iterate the hash_seq_search after doing something that
might cause a hash entry other than the current one to be discarded.
We can probably fix that without too much trouble, e.g., by restarting the
loop after an update.
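
A minimal sketch of the restart-after-update idea, again as a toy model
rather than the actual relcache code: Item, flush_others, do_work, and
process_all are hypothetical names, and the "flush" here simply unlinks
every discardable item other than the one being worked on. The scan never
trusts a saved position across do_work(); it starts over from the head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cache entry (hypothetical, not a relcache entry). */
typedef struct Item {
    struct Item *link;        /* next item in the chain */
    bool         needs_work;  /* still has something to load */
    bool         discardable; /* may be thrown away by a flush */
    bool         discarded;   /* set once a flush removed it */
    int          key;
} Item;

/* Simulated flush: unlink every discardable item except "keep". */
void flush_others(Item **head, Item *keep)
{
    Item **pp = head;

    while (*pp != NULL) {
        Item *e = *pp;

        if (e != keep && e->discardable) {
            *pp = e->link;
            e->link = NULL;
            e->discarded = true;
        } else {
            pp = &e->link;
        }
    }
}

/* Simulated update; assumed to trigger a flush as a side effect. */
void do_work(Item **head, Item *it)
{
    it->needs_work = false;
    flush_others(head, it);
}

/*
 * Process every item that needs work.  After each update the chain may
 * have changed, so restart from the head instead of following a saved
 * pointer.  Each restart clears one needs_work flag, so the loop ends.
 * Returns the number of items that were updated.
 */
int process_all(Item **head)
{
    int   worked = 0;
    Item *it = *head;

    while (it != NULL) {
        if (it->needs_work) {
            do_work(head, it);
            worked++;
            it = *head;       /* restart the scan */
            continue;
        }
        it = it->link;
    }
    return worked;
}

int demo_restart(void)
{
    Item  a = {NULL, true,  false, false, 1};
    Item  b = {&a,   false, true,  false, 2};
    Item  c = {&b,   true,  false, false, 3};
    Item *head = &c;          /* chain is c -> b -> a */

    return process_all(&head);
}
```

Here demo_restart() returns 2: both c and a get updated even though b is
discarded while c is being worked on. The cost is that already-finished
items are rescanned after every update, which seems acceptable if updates
are rare, as they should be in a startup-time loop like this one.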

But: the question at this point is why we've never seen such a report
before 8.4.  If this theory is correct, it's been broken for a *long*
time.  I can think of a couple of possible explanations:

A: the problem can only manifest if this loop has work to do for
a relcache entry that is not the last one in its bucket chain.
8.4 might have added more preloaded relcache entries than were there
before.  Or the 8.4 changes in the hash functions might have shuffled
the entries' bucket placement around so that the problem can happen
when it couldn't before.

B: the 8.4 changes in the shared-cache-inval mechanism might have
made it more likely that a freshly started backend could get hit with a
relcache flush request.  I should think that those changes would have
made this *less* likely not more so, so maybe there is an additional
bug lurking in that area.

I shall go and do some further investigation, but at least it's now
clear where to look.  Thanks for the report, and for being so helpful
in providing information!

            regards, tom lane
