buffer assertion tripping under repeat pgbench load - Mailing list pgsql-hackers

From Greg Smith
Subject buffer assertion tripping under repeat pgbench load
Date
Msg-id 50D6B48A.1070905@2ndQuadrant.com
Whole thread Raw
Responses Re: buffer assertion tripping under repeat pgbench load
List pgsql-hackers
I'm testing a checkout from a few days ago and trying to complete a day 
long pgbench stress test, with assertions and debugging on.  I want to 
make sure the base code works as expected before moving on to testing 
checksums.  It's crashing before finishing though.  Here's a sample:

2012-12-20 20:36:11 EST [26613]: LOG:  checkpoint complete: wrote 32 
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; 
write=3.108 s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 
s, average=0.139 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", 
Line: 1703)
2012-12-20 20:36:44 EST [26611]: LOG:  server process (PID 4834) was 
terminated by signal 6: Aborted
2012-12-20 20:36:44 EST [26611]: DETAIL:  Failed process was running: END;

Running the same test set again gave the same crash, so this looks 
repeatable.  It's not trivial to trigger given the average runtime to 
crash I'm seeing.  Second sample:

2012-12-22 21:27:17 EST [6006]: LOG:  checkpoint complete: wrote 34 
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; 
write=3.310 s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 
s, average=0.415 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", 
Line: 1703)
2012-12-22 21:27:23 EST [26611]: LOG:  server process (PID 12143) was 
terminated by signal 6: Aborted
2012-12-22 21:27:23 EST [26611]: DETAIL:  Failed process was running: END;

The important part:

TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", 
Line: 1703)

That's the check done by "AtEOXact_Buffers - clean up at end of 
transaction".  It fired while executing the "END;"" statement at the end 
of the standard pgbench test.  The test looks for loose things that 
weren't pinned/unpinned correctly by the transaction.  I'm not sure if 
getting a stack trace at that point will be useful.  The ref count 
mistake could easily have been made by an earlier statement in the 
transaction block, right?

This is on a server where shared_buffers=2GB, checkpoint_segments=64. 
Test runs were automated with pgbench-tools, using the following 
configuration:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="500 1000"
SETCLIENTS="4 8 16 32 64 96"
SETTIMES=3
RUNTIME=600

All of the crashes so far have been at scale=500, and I haven't seen 
those finish yet.  It failed 7 hours in the first time, then 4 hours in.

For code reference, the last commit in the source code tree I built 
against was:

af275a12dfeecd621ed9899a9382f26a68310263
Thu Dec 20 14:23:31 2012 +0200

Looking at recent activity, the only buffer pin related changes I saw 
was this one:

commit 62656617dbe49cca140f3da588a5e311b3fc35ea
Date:   Mon Dec 3 18:53:31 2012 +0000    Avoid holding vmbuffer pin after VACUUM.

This is my first test like this against 9.3 development though, so the 
cause could be an earlier commit.  I'm just starting with the most 
recent work as the first suspect.  Next I think I'll try autovacuum=off 
and see if the crash goes away.  Other ideas are welcome.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: pgcrypto seeding problem when ssl=on
Next
From: Andres Freund
Date:
Subject: Re: buffer assertion tripping under repeat pgbench load