Re: buffer assertion tripping under repeat pgbench load - Mailing list pgsql-hackers

From Andres Freund
Subject Re: buffer assertion tripping under repeat pgbench load
Date
Msg-id 20121223091533.GA1653@alap2.fritz.box
Whole thread Raw
In response to buffer assertion tripping under repeat pgbench load  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: buffer assertion tripping under repeat pgbench load  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2012-12-23 02:36:42 -0500, Greg Smith wrote:
> I'm testing a checkout from a few days ago and trying to complete a day long
> pgbench stress test, with assertions and debugging on.  I want to make sure
> the base code works as expected before moving on to testing checksums.  It's
> crashing before finishing though.  Here's a sample:
>
> 2012-12-20 20:36:11 EST [26613]: LOG:  checkpoint complete: wrote 32 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108
> s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-20 20:36:44 EST [26611]: LOG:  server process (PID 4834) was
> terminated by signal 6: Aborted
> 2012-12-20 20:36:44 EST [26611]: DETAIL:  Failed process was running: END;
>
> Running the same test set again gave the same crash, so this looks
> repeatable.  It's not trivial to trigger given the average runtime to crash
> I'm seeing.  Second sample:
>
> 2012-12-22 21:27:17 EST [6006]: LOG:  checkpoint complete: wrote 34 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310
> s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-22 21:27:23 EST [26611]: LOG:  server process (PID 12143) was
> terminated by signal 6: Aborted
> 2012-12-22 21:27:23 EST [26611]: DETAIL:  Failed process was running: END;
>
> The important part:
>
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
>
> That's the check done by "AtEOXact_Buffers - clean up at end of
> transaction".  It fired while executing the "END;"" statement at the end of
> the standard pgbench test.  The test looks for loose things that weren't
> pinned/unpinned correctly by the transaction.  I'm not sure if getting a
> stack trace at that point will be useful.  The ref count mistake could
> easily have been made by an earlier statement in the transaction block,
> right?
>
> This is on a server where shared_buffers=2GB, checkpoint_segments=64. Test
> runs were automated with pgbench-tools, using the following configuration:
>
> MAX_WORKERS="4"
> SCRIPT="tpc-b.sql"
> SCALES="500 1000"
> SETCLIENTS="4 8 16 32 64 96"
> SETTIMES=3
> RUNTIME=600
>
> All of the crashes so far have been at scale=500, and I haven't seen those
> finish yet.  It failed 7 hours in the first time, then 4 hours in.
>
> For code reference, the last commit in the source code tree I built against
> was:
>
> af275a12dfeecd621ed9899a9382f26a68310263
> Thu Dec 20 14:23:31 2012 +0200
>
> Looking at recent activity, the only buffer pin related changes I saw was
> this one:
>
> commit 62656617dbe49cca140f3da588a5e311b3fc35ea
> Date:   Mon Dec 3 18:53:31 2012 +0000
>     Avoid holding vmbuffer pin after VACUUM.

I don't immediately see anything bad here.

> This is my first test like this against 9.3 development though, so the cause
> could be an earlier commit.  I'm just starting with the most recent work as
> the first suspect.  Next I think I'll try autovacuum=off and see if the
> crash goes away.  Other ideas are welcome.

Something like the (untested) debug message below might be helpful:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7be767b..20f61a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit)        for (i = 0; i < NBuffers; i++)        {
-            Assert(PrivateRefCount[i] == 0);
+            if (PrivateRefCount[i] != 0)
+            {
+                BufferDesc *bufHdr = &BufferDescriptors[i];
+                elog(PANIC, "refcount of %s is %u should be 0, globally: %u",
+                     relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+                     PrivateRefCount[i], bufHdr->refcount);
+            }        }    }#endif

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: buffer assertion tripping under repeat pgbench load
Next
From: Marko Kreen
Date:
Subject: Re: pgcrypto seeding problem when ssl=on