Re: catalog corruption bug - Mailing list pgsql-hackers

From Tom Lane
Subject Re: catalog corruption bug
Msg-id 480.1136664517@sss.pgh.pa.us
In response to Re: catalog corruption bug  (Jeremy Drake <pgsql@jdrake.com>)
Responses Re: catalog corruption bug
List pgsql-hackers
Jeremy Drake <pgsql@jdrake.com> writes:
> On Sat, 7 Jan 2006, Tom Lane wrote:
>> I'll go fix CatCacheRemoveCList, but I think this is not the bug
>> we're looking for.

> Incidentally, one of my processes did get that error at the same time.
> All of the other processes had an error
> DBD::Pg::st execute failed: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> But this one had the DBD::Pg::st execute failed: ERROR:  duplicate key
> violates unique constraint "pg_type_typname_nsp_index"

Oh, that's interesting ... maybe there is some relation after all?
Hard to see what ...

> It looks like my kernel did not have the option to append the pid to core
> files, so perhaps they both croaked at the same time but only this one got
> to write a core file?

Yeah, they'd all be dumping into the same directory.  It's reasonable to
suppose that the corefile you have is from the one that aborted last.
That would suggest that this is effect not cause ... hmmm ...

A bit of a leap in the dark, but: maybe the triggering event for this
situation is not a "VACUUM pg_amop" but a global cache reset due to
sinval message buffer overrun.  It's fairly clear how that would lead
to the CatCacheRemoveCList bug.  The duplicate-key failure could be an
unrelated bug triggered by the same condition.  I have no idea yet what
the mechanism could be, but cache reset is a sufficiently seldom-exercised
code path that it's entirely plausible that there are bugs lurking in it.

If this is correct then we could vastly increase the probability of
seeing the bug by setting up something to force cache resets at a high
rate.  If you're interested I could put together a code patch for that.
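
For readers who want to picture the suspected mechanism, here is a toy
model only. It is not PostgreSQL's sinval code and not the patch offered
above; the class name, method names, and capacity are all hypothetical.
It assumes a fixed-capacity shared message queue where a reader that falls
behind far enough to miss a message is told to throw away its whole cache
instead of applying messages one by one.

```python
from collections import deque


class ToyInvalQueue:
    """Hypothetical fixed-capacity invalidation queue shared by several readers."""

    def __init__(self, capacity, n_readers):
        self.capacity = capacity
        self.messages = deque()
        self.base = 0                         # sequence number of messages[0]
        self.next_read = [0] * n_readers      # next sequence number each reader needs
        self.must_reset = [False] * n_readers

    def send(self, msg):
        if len(self.messages) == self.capacity:
            # Overrun: the oldest message is dropped. Any reader that had not
            # consumed it can no longer catch up incrementally.
            self.messages.popleft()
            self.base += 1
            for reader, pos in enumerate(self.next_read):
                if pos < self.base:
                    self.must_reset[reader] = True
        self.messages.append(msg)

    def receive(self, reader):
        """Return ("reset", []) after an overrun, else ("ok", pending messages)."""
        end = self.base + len(self.messages)
        if self.must_reset[reader]:
            # The reader must discard its entire cache and start from the tail.
            self.must_reset[reader] = False
            self.next_read[reader] = end
            return "reset", []
        start = self.next_read[reader] - self.base
        pending = list(self.messages)[start:]
        self.next_read[reader] = end
        return "ok", pending
```

In this model, shrinking `capacity` makes the "reset" outcome far more
frequent for any slow reader, which is the general idea behind forcing
resets at a high rate to flush out bugs in that path.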

> BTW, nothing of any interest made it into the backend log regarding what
> assert(s) failed.

What you'd be looking for is a line starting "TRAP:".
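
A minimal sketch of that search, assuming the log is plain text and that
no log line prefix is configured (with a prefix, the marker would not be at
the very start of the line). The function name is made up for illustration.

```python
def find_trap_lines(log_lines):
    """Return (line_number, text) pairs for lines that start with "TRAP:"."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if line.startswith("TRAP:"):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It takes any iterable of lines, so an open file object for the backend log
can be passed in directly.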
        regards, tom lane

