Re: buffer assertion tripping under repeat pgbench load - Mailing list pgsql-hackers

From Tom Lane
Subject Re: buffer assertion tripping under repeat pgbench load
Msg-id 29851.1356578244@sss.pgh.pa.us
In response to Re: buffer assertion tripping under repeat pgbench load  (Greg Stark <stark@mit.edu>)
Responses Re: buffer assertion tripping under repeat pgbench load  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Greg Stark <stark@mit.edu> writes:
> On Wed, Dec 26, 2012 at 11:47 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> It would be nice if this were just something like a memory issue on this
>> system.  That I'm getting the same very odd value every time--this refcount
>> of 1073741824--makes it seem less random than I expect from bad memory.
>> Once I get a few more crash samples (with buffer ids) I'll shut the system
>> down for a pass of memtest86+.

> Well that's a one-bit error and it would never get detected until the
> value was decremented down to what should be zero so that's pretty
> much exactly what I would expect to see from a memory or cpu error.

Yeah, the fact that it's always the same bit makes it seem like it could
be one bad physical bit.  (Does this machine have ECC memory?)
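
For readers checking the arithmetic: the reported refcount of 1073741824 is
2^30, i.e. 0x40000000, a value with exactly one bit set.  A minimal sketch of
that check follows; the helper names is_single_bit and lowest_set_bit are
made up for illustration and are not PostgreSQL functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit of v is set, i.e. v is a power of two. */
bool
is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Index of the lowest set bit (0 = least significant), or -1 if v is zero. */
int
lowest_set_bit(uint32_t v)
{
    int pos = 0;

    if (v == 0)
        return -1;
    while ((v & 1u) == 0)
    {
        v >>= 1;
        pos++;
    }
    return pos;
}
```

Calling is_single_bit(1073741824u) returns true and
lowest_set_bit(1073741824u) returns 30, which is consistent with a count that
should have been zero but has bit 30 set.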

The thing that this theory has a hard time with is that the buffer's
global refcount is zero.  If you assume that there's a bit that
sometimes randomly goes to 1 when it should be 0, then what I'd expect
to typically happen is that UnpinBuffer sees nonzero LocalRefCount and
hence doesn't drop the session's global pin when it should.  The only
way that doesn't happen is if decrementing LocalRefCount to zero stores
a nonzero pattern when it should store zero, but nonetheless the CPU
thinks it stored zero.  As you say there's some small possibility of a
CPU glitch doing that, but then why is it only happening to
LocalRefCount and not any other similar coding?
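
The argument above can be made concrete with a toy model.  This is only a
sketch under stated assumptions, not the real bufmgr code: PinModel,
model_pin and model_unpin are hypothetical names, and the model assumes one
backend and one buffer, with the shared pin taken when the local count goes
from zero to one and dropped when it returns to zero.

```c
#include <stdint.h>

/* Hypothetical simplified model: one backend pinning one buffer. */
typedef struct
{
    int32_t local_refcount;   /* this backend's count (stands in for LocalRefCount) */
    int32_t shared_refcount;  /* the buffer's global refcount */
} PinModel;

/* Take the shared pin only on the backend's first local pin. */
void
model_pin(PinModel *m)
{
    if (m->local_refcount == 0)
        m->shared_refcount++;
    m->local_refcount++;
}

/* Drop the shared pin only when the local count returns to zero. */
void
model_unpin(PinModel *m)
{
    m->local_refcount--;
    if (m->local_refcount == 0)
        m->shared_refcount--;
}
```

In this model, if bit 30 of local_refcount gets set while the buffer is
pinned, model_unpin leaves local_refcount at 1073741824 and never drops the
shared pin, so shared_refcount stays at 1.  The reported state (local count
1073741824, global refcount zero) only arises if the bit gets set after the
shared pin has already been released, or if the decrement stores a wrong
value while the comparison still sees zero.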

At the moment I like the other theory you alluded to, that this is a
wild store from code that thinks it's manipulating some other data
structure entirely.  The buffer IDs may help confirm or refute that.
No idea at the moment how we would find the problem if that's the
cause ...
        regards, tom lane
