Re: buffer assertion tripping under repeat pgbench load - Mailing list pgsql-hackers

From Greg Smith
Subject Re: buffer assertion tripping under repeat pgbench load
Date
Msg-id 50DFB001.7010000@2ndQuadrant.com
In response to Re: buffer assertion tripping under repeat pgbench load  (Greg Stark <stark@mit.edu>)
Responses Re: buffer assertion tripping under repeat pgbench load  (Robert Haas <robertmhaas@gmail.com>)
Re: buffer assertion tripping under repeat pgbench load  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On 12/27/12 7:43 AM, Greg Stark wrote:
> If it's always the first buffer then it could conceivably still be
> some other heap allocated object that always lands before
> LocalRefCount. It does seem a bit weird to be storing 1<<30 though --
> there are no 1<<30 constants that we might be storing for example.

It is a strange power of two to be appearing there.  I can follow your
reasoning for why this could be a bit-flipping error.  There's no sign
of that elsewhere though: no other crashes under load.  I'm using this
server here because it's worked fine for a while now.

I added printing of the buffer number, and the numbers are all over the
place:

2012-12-27 06:36:39 EST [26306]: WARNING:  refcount of buf 29270 containing base/16384/90124 blockNum=82884, flags=0x127 is 1073741824 should be 0, globally: 0
2012-12-27 02:08:19 EST [21719]: WARNING:  refcount of buf 114262 containing base/16384/81932 blockNum=133333, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 20:03:05 EST [15117]: WARNING:  refcount of buf 142934 containing base/16384/73740 blockNum=87961, flags=0x127 is 1073741824 should be 0, globally: 0
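
As a quick sanity check on the "strange power of two" observation, here is a minimal sketch that confirms the refcount in the warnings above, 1073741824, has exactly one bit set and that it is bit 30 (i.e. 1<<30).  The helper names is_single_bit and lowest_set_bit are hypothetical, made up for this illustration; they are not part of bufmgr.c or any PostgreSQL header.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if exactly one bit is set in v, else 0.
 * Clearing the lowest set bit with v & (v - 1) leaves zero only when
 * there was a single bit to begin with.
 */
static int
is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: returns the index of the lowest set bit in v,
 * or -1 if v is zero.
 */
static int
lowest_set_bit(uint32_t v)
{
    int pos = 0;

    if (v == 0)
        return -1;
    while ((v & 1u) == 0)
    {
        v >>= 1;
        pos++;
    }
    return pos;
}
```

Calling is_single_bit(1073741824u) returns 1 and lowest_set_bit(1073741824u) returns 30, which is consistent with a single flipped bit in a counter that should have been zero.  It does not by itself say whether the flip came from hardware or from a stray store in software.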

The relation continues to bounce between pgbench_accounts and its
primary key; there's no pattern there that I can see either.  To answer
a few other questions:  this system does not have ECC RAM.  It did
survive many passes of memtest86+ without any problems though, right
after the above.

I tried duplicating the problem on a similar server.  It keeps hanging 
due to some Linux software RAID bug before it runs for very long. 
Whatever is going on here, it really doesn't want to be discovered.

For reference, the debugging code those latest messages came from is
now:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..60d3ad3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,27 @@ AtEOXact_Buffers(bool isCommit)
        if (assert_enabled)
        {
                int                     i;
 
+               int                     RefCountErrors = 0;
                for (i = 0; i < NBuffers; i++)
                {
-                       Assert(PrivateRefCount[i] == 0);
+
+                       if (PrivateRefCount[i] != 0)
+                       {
+                               /*
+                               PrintBufferLeakWarning(&BufferDescriptors[i]);
+                               */
+                               BufferDesc *bufHdr = &BufferDescriptors[i];
+                               elog(WARNING,
+                                       "refcount of buf %d containing %s blockNum=%u, flags=0x%x is %u should be 0, globally: %u",
+                                       i,relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+                                       bufHdr->tag.blockNum, bufHdr->flags, PrivateRefCount[i], bufHdr->refcount);
+                               RefCountErrors++;
+                       }
                }
+               if (RefCountErrors > 0)
+                       elog(WARNING, "buffers with non-zero refcount is %d", RefCountErrors);
+               Assert(RefCountErrors == 0);
        }
 #endif



