Thread: buffer assertion tripping under repeated pgbench load
I'm testing a checkout from a few days ago, trying to complete a day-long pgbench stress test with assertions and debugging on. I want to make sure the base code works as expected before moving on to testing checksums. It's crashing before finishing, though. Here's a sample:

2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108 s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703)
2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was terminated by signal 6: Aborted
2012-12-20 20:36:44 EST [26611]: DETAIL: Failed process was running: END;

Running the same test set again gave the same crash, so this looks repeatable. It's not trivial to trigger, given the average runtime to crash I'm seeing. Second sample:

2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310 s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415 s
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703)
2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was terminated by signal 6: Aborted
2012-12-22 21:27:23 EST [26611]: DETAIL: Failed process was running: END;

The important part:

TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1703)

That's the check done by "AtEOXact_Buffers - clean up at end of transaction". It fired while executing the "END;" statement at the end of the standard pgbench test. The check looks for buffers that weren't pinned/unpinned correctly by the transaction. I'm not sure whether getting a stack trace at that point will be useful; the refcount mistake could easily have been made by an earlier statement in the transaction block, right?

This is on a server with shared_buffers=2GB and checkpoint_segments=64. Test runs were automated with pgbench-tools, using the following configuration:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="500 1000"
SETCLIENTS="4 8 16 32 64 96"
SETTIMES=3
RUNTIME=600

All of the crashes so far have been at scale=500, and I haven't seen those runs finish yet. It failed 7 hours in the first time, then 4 hours in.

For code reference, the last commit in the source code tree I built against was:

af275a12dfeecd621ed9899a9382f26a68310263
Thu Dec 20 14:23:31 2012 +0200

Looking at recent activity, the only buffer-pin-related change I saw was this one:

commit 62656617dbe49cca140f3da588a5e311b3fc35ea
Date: Mon Dec 3 18:53:31 2012 +0000
Avoid holding vmbuffer pin after VACUUM.

This is my first test like this against 9.3 development, though, so the cause could be an earlier commit. I'm just starting with the most recent work as the first suspect. Next I think I'll try autovacuum=off and see if the crash goes away. Other ideas are welcome.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Hi,

On 2012-12-23 02:36:42 -0500, Greg Smith wrote:
> Looking at recent activity, the only buffer-pin-related change I saw was
> this one:
>
> commit 62656617dbe49cca140f3da588a5e311b3fc35ea
> Date: Mon Dec 3 18:53:31 2012 +0000
> Avoid holding vmbuffer pin after VACUUM.

I don't immediately see anything bad here.

> This is my first test like this against 9.3 development, though, so the
> cause could be an earlier commit. I'm just starting with the most recent
> work as the first suspect. Next I think I'll try autovacuum=off and see
> if the crash goes away. Other ideas are welcome.
Something like the (untested) debug message below might be helpful:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7be767b..20f61a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit)
 
 		for (i = 0; i < NBuffers; i++)
 		{
-			Assert(PrivateRefCount[i] == 0);
+			if (PrivateRefCount[i] != 0)
+			{
+				BufferDesc *bufHdr = &BufferDescriptors[i];
+
+				elog(PANIC, "refcount of %s is %u should be 0, globally: %u",
+					 relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+					 PrivateRefCount[i], bufHdr->refcount);
+			}
 		}
 	}
 #endif

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
>> This is my first test like this against 9.3 development, though, so the
>> cause could be an earlier commit. I'm just starting with the most recent
>> work as the first suspect. Next I think I'll try autovacuum=off and see
>> if the crash goes away. Other ideas are welcome.

> Something like the (untested) debug message below might be helpful:

It might also be interesting to know if there is more than one
still-pinned buffer --- that is, if you're going to hack the code, fix
it to elog(LOG) each pinned buffer and then panic after completing the
loop.

			regards, tom lane
On 12/23/12 1:10 PM, Tom Lane wrote:
> It might also be interesting to know if there is more than one
> still-pinned buffer --- that is, if you're going to hack the code, fix
> it to elog(LOG) each pinned buffer and then panic after completing the
> loop.

Easy enough; I kept it so the actual source of panic is still an assertion
failure:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..df43643 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,21 @@ AtEOXact_Buffers(bool isCommit)
 	if (assert_enabled)
 	{
 		int			i;
+		int			RefCountErrors = 0;
 
 		for (i = 0; i < NBuffers; i++)
 		{
-			Assert(PrivateRefCount[i] == 0);
+
+			if (PrivateRefCount[i] != 0)
+			{
+				BufferDesc *bufHdr = &BufferDescriptors[i];
+				elog(LOG, "refcount of %s is %u should be 0, globally: %u",
+					 relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+					 PrivateRefCount[i], bufHdr->refcount);
+				RefCountErrors++;
+			}
 		}
+		Assert(RefCountErrors == 0);
 	}
 #endif
On 23 December 2012 19:42, Greg Smith <greg@2ndquadrant.com> wrote:
> +			if (PrivateRefCount[i] != 0)
> +			{
> +				BufferDesc *bufHdr = &BufferDescriptors[i];
> +				elog(LOG, "refcount of %s is %u should be 0, globally: %u",
> +					 relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
> +					 PrivateRefCount[i], bufHdr->refcount);
> +				RefCountErrors++;
> +			}
>  		}
> +		Assert(RefCountErrors == 0);

We already have PrintBufferLeakWarning() for this, which might be a bit
neater.

If that last change was the cause, then it's caused within VACUUM. I'm
running a thrash test with autovacuum set to run much more frequently, but
nothing yet.

Are you by any chance running with unlogged tables? There was a change
there recently, something around checkpoints IIRC. Can you set checkpoints
to be more frequent also?

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 12/23/12 3:17 PM, Simon Riggs wrote:
> If that last change was the cause, then it's caused within VACUUM. I'm
> running a thrash test with autovacuum set to run much more frequently, but
> nothing yet.

I am not very suspicious of that VACUUM change; I just pointed it out for
completeness' sake.

> Are you by any chance running with unlogged tables? There was a change
> there recently, something around checkpoints IIRC. Can you set checkpoints
> to be more frequent also?

This is straight pgbench, modifying only scale, client count, workers, and
duration. No unlogged tables involved.

The crash is so intermittent that I'm trying not to change anything until I
get more productive output out of one. The last run I kicked off has been
chugging for 14 hours without a hiccup yet.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On 23 December 2012 21:52, Greg Smith <greg@2ndquadrant.com> wrote:
> I am not very suspicious of that VACUUM change; I just pointed it out for
> completeness' sake.

Nothing seen here either, after 100 minutes of thrash with ~1 VACUUM per
second at scale factor 1.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 12/23/12 3:17 PM, Simon Riggs wrote:
> We already have PrintBufferLeakWarning() for this, which might be a bit
> neater.

Maybe. I tried using this, and I just got a seg fault within that code. I
can't figure out if I called it incorrectly or if the buffer involved is so
damaged that PrintBufferLeakWarning chokes on it. I'll look at that myself
later.

I did get some output from the variation Andres suggested. There was
exactly one screwed-up buffer:

2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169 is 1073741824 should be 0, globally: 0

That is pgbench_accounts_pkey.

1073741824 = 0100 0000 0000 0000 0000 0000 0000 0000 = 2^30

Pretty odd value to find in a PrivateRefCount. What makes me nervous about
all of the PrivateRefCount coding is how it switches between references
like PrivateRefCount[(bufnum) - 1] and PrivateRefCount[b]. Might this be an
off-by-one error in one of those, where the wrong form was used?

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
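To make the two indexing conventions concrete: a Buffer number is 1-based (zero means invalid), while PrivateRefCount is an ordinary 0-based array, so every access has to know which kind of value it is holding. Here is a small standalone toy model of that situation (not PostgreSQL source; NBUFFERS, pin_by_buffer, and pin_by_index are invented for this example):

#include <stdio.h>

#define NBUFFERS 4

/* Toy model: buffer numbers are 1-based, the refcount array is 0-based. */
static int PrivateRefCount[NBUFFERS];

/* Pin by 1-based buffer number: the "- 1" is required here. */
static void pin_by_buffer(int bufnum)
{
	PrivateRefCount[bufnum - 1]++;
}

/* Pin by 0-based array index: no correction wanted here. */
static void pin_by_index(int i)
{
	PrivateRefCount[i]++;
}

int main(void)
{
	pin_by_buffer(1);	/* touches slot 0 */
	pin_by_index(0);	/* also touches slot 0 */
	printf("slot 0 = %d\n", PrivateRefCount[0]);	/* prints "slot 0 = 2" */
	return 0;
}

Using the wrong form in either helper shifts every count by one slot, which is exactly the kind of off-by-one mix-up being worried about above.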
Greg Smith <greg@2ndQuadrant.com> writes:
> I did get some output from the variation Andres suggested. There was
> exactly one screwed-up buffer:
>
> 2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169 is
> 1073741824 should be 0, globally: 0
>
> That is pgbench_accounts_pkey.
>
> 1073741824 = 0100 0000 0000 0000 0000 0000 0000 0000 = 2^30

Huh.  Looks a bit like overflow of the refcount, which would explain why
it takes such a long test case to reproduce it.  But how could that be
happening without somebody forgetting to decrement the refcount, which
ought to lead to a visible failure in shorter tests?  Even more
interesting that the buffer's global refcount is zero.

Now that we know it's an index, it would be mighty useful to know which
block is the problem (I'm wondering if it's metapage or root).  Could you
add the block number to the printout?

			regards, tom lane
On 24 December 2012 15:57, Greg Smith <greg@2ndquadrant.com> wrote:
> 2012-12-24 06:08:46 EST [26015]: WARNING: refcount of base/16384/49169 is
> 1073741824 should be 0, globally: 0
>
> That is pgbench_accounts_pkey.
>
> 1073741824 = 0100 0000 0000 0000 0000 0000 0000 0000 = 2^30
>
> Pretty odd value to find in a PrivateRefCount. What makes me nervous about
> all of the PrivateRefCount coding is how it switches between references
> like PrivateRefCount[(bufnum) - 1] and PrivateRefCount[b]. Might this be
> an off-by-one error in one of those, where the wrong form was used?

PrivateRefCount is also one of the only large chunks of memory in a running
server where single-bit errors aren't tolerated. In shared buffers, most
SLRUs, and various other places, single-bit errors would often go
unnoticed. I wonder if you're having a hardware problem?

If not, it might be worth looking to see if that block is the root; maybe
there's a problem in the fastroot code, or something that only kicks in
with larger indexes.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 24 December 2012 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Huh.  Looks a bit like overflow of the refcount, which would explain why
> it takes such a long test case to reproduce it.  But how could that be
> happening without somebody forgetting to decrement the refcount, which
> ought to lead to a visible failure in shorter tests?  Even more
> interesting that the buffer's global refcount is zero.

But we test at the end of each transaction whether any pin count is
non-zero, so it can't have slowly built up. It would be easier for it to
have decremented too many times within just one transaction. Something to
do with tail recursion during VACUUM?

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 12/24/12 11:10 AM, Simon Riggs wrote:
> I wonder if you're having a hardware problem?

Always possible. I didn't report this until I had replicated the crash and
seen exactly the same thing twice; I've seen it crash on this assertion 6
times now. Bad hardware is not normally so consistent. I do have another
system that's almost identical to this one, except for the hard drives. If
progress on tracking down the problem stalls, I can try to duplicate this
there.

I kicked off another test that includes the block number just before Tom
suggested it, so I should have the block by tomorrow at the latest. The
range of runtime before a crash is 3 to 14 hours so far.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On 24 December 2012 16:25, Greg Smith <greg@2ndquadrant.com> wrote:
> On 12/24/12 11:10 AM, Simon Riggs wrote:
>> I wonder if you're having a hardware problem?
>
> Always possible. I didn't report this until I had replicated the crash and
> seen exactly the same thing twice; I've seen it crash on this assertion 6
> times now. Bad hardware is not normally so consistent.

Consistency in bugs is a wonderful thing, so I figure we can immediately
rule out the hardware.

> I kicked off another test that includes the block number just before Tom
> suggested it, so I should have the block by tomorrow at the latest. The
> range of runtime before a crash is 3 to 14 hours so far.

Tom will have fixed it by then ;-)

Sorry, I should have mentioned it was the block number that I wanted, and
that was why I suggested using the other print function.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Greg Smith <greg@2ndQuadrant.com> writes:
> I kicked off another test that includes the block number just before Tom
> suggested it, so I should have the block by tomorrow at the latest. The
> range of runtime before a crash is 3 to 14 hours so far.

Cool.  Once you get the crash, please also capture the contents of the
index's metapage --- pg_filedump or contrib/pageinspect's bt_metap()
should do.  If the problem page isn't the metapage itself, we'll need that
info to tell if it's the root.

Actually, it might be interesting to capture the content of the problem
page itself, too, if it isn't the metapage.  "pg_filedump -i -F" or
contrib/pageinspect's bt_page_stats() + bt_page_items() should show
anything useful.

			regards, tom lane
To try and speed up replicating this problem, I switched to a smaller
database scale, 100, and I was able to get a crash there. Here's the
latest:

2012-12-26 00:01:19 EST [2278]: WARNING: refcount of base/16384/57610 blockNum=118571, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 00:01:19 EST [2278]: WARNING: buffers with non-zero refcount is 1
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1720)

That's the same weird 1073741824 count as before. I was planning to dump
some index info, but then I saw this:

$ psql -d pgbench -c "select relname,relkind,relfilenode from pg_class where relfilenode=57610"
     relname      | relkind | relfilenode
------------------+---------+-------------
 pgbench_accounts | r       |       57610

That makes me think this isn't isolated to being an index problem.

I tried to soldier on with pg_filedump anyway. It looks like the last
version I saw there (9.2.0 from November) doesn't compile anymore:

$ make -f Makefile.contrib USE_PGXS=1
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -I. -I. -I/var/lib/pgsql/pgwork/inst/baseline/include/server -I/var/lib/pgsql/pgwork/inst/baseline/include/internal -D_GNU_SOURCE -c -o pg_filedump.o pg_filedump.c -MMD -MP -MF .deps/pg_filedump.Po
pg_filedump.c: In function ‘FormatHeader’:
pg_filedump.c:617: error: request for member ‘xlogid’ in something not a structure or union
pg_filedump.c:617: error: request for member ‘xrecoff’ in something not a structure or union
pg_filedump.c: In function ‘FormatItem’:
pg_filedump.c:904: error: invalid application of ‘sizeof’ to incomplete type ‘HeapTupleHeaderData’
...

There's lots more after that. Does pg_filedump need an update, or is there
a development version I should grab instead?

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Greg Smith <greg@2ndQuadrant.com> writes:
> To try and speed up replicating this problem, I switched to a smaller
> database scale, 100, and I was able to get a crash there. Here's the
> latest:
>
> 2012-12-26 00:01:19 EST [2278]: WARNING: refcount of base/16384/57610
> blockNum=118571, flags=0x106 is 1073741824 should be 0, globally: 0
> 2012-12-26 00:01:19 EST [2278]: WARNING: buffers with non-zero refcount is 1
> TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1720)
>
> That's the same weird 1073741824 count as before. I was planning to dump
> some index info, but then I saw this:
>
> $ psql -d pgbench -c "select relname,relkind,relfilenode from pg_class
> where relfilenode=57610"
>      relname      | relkind | relfilenode
> ------------------+---------+-------------
>  pgbench_accounts | r       |       57610
>
> That makes me think this isn't isolated to being an index problem.

Yeah, that destroys my theory that there's something broken about index
management specifically.  Now we're looking for something that can affect
any buffer's refcount, which more than likely means it has nothing to do
with the buffer's contents ...

> I tried to soldier on with pg_filedump anyway. It looks like the last
> version I saw there (9.2.0 from November) doesn't compile anymore:

Meh, looks like it needs fixes for Heikki's int64-xlogrecoff patch.
I haven't gotten around to doing that yet, but would gladly take a patch
if anyone wants to do it.  However, I now doubt that examining the buffer
content will help much on this problem.

Now that we know the bug's reproducible on smaller instances, could you
put together an exact description of what you're doing to trigger it?
What is the DB configuration, pgbench parameters, etc?

Also, it'd be worthwhile to just repeat the test a few more times to see
if there's any sort of pattern in which buffers get affected.  I'm now
suspicious that it might not always be just one buffer, for example.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> schrieb:
> Also, it'd be worthwhile to just repeat the test a few more times to see
> if there's any sort of pattern in which buffers get affected.  I'm now
> suspicious that it might not always be just one buffer, for example.

I don't think it's necessarily only one buffer - if I read the above output
correctly, Greg used the suggested debug output, which just put the
elog(WARN) before the Assert...

Greg, could you output all "bad" buffers and only assert after the loop if
there was at least one refcounted buffer?

Andres

--- 
Please excuse the brevity and formatting - I am writing this on my mobile
phone.
On 12/26/12 1:58 PM, anarazel@anarazel.de wrote:
> I don't think it's necessarily only one buffer - if I read the above
> output correctly, Greg used the suggested debug output, which just put the
> elog(WARN) before the Assert...
>
> Greg, could you output all "bad" buffers and only assert after the loop if
> there was at least one refcounted buffer?

I've been doing that for a while; the only thing that's new is getting the
block number. It's only ever been one buffer involved.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Wed, Dec 26, 2012 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that destroys my theory that there's something broken about index
> management specifically.  Now we're looking for something that can affect
> any buffer's refcount, which more than likely means it has nothing to do
> with the buffer's contents ...

Hardware problem is still a plausible candidate. Have you run any memory
checker software, or seen anything else crash? Classically, gcc is as good
at detecting memory problems as memory-checking software. Or a bad CPU can
also sometimes cause problems like this. Have you been able to reproduce
this on any other machines? Did you ever say what kind of hardware it was?

This is the local reference count, so I can't see how it could be a race
condition or anything like that, but it sure smells a bit like one.

-- 
greg
On Wed, Dec 26, 2012 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that destroys my theory that there's something broken about index
> management specifically.  Now we're looking for something that can affect
> any buffer's refcount, which more than likely means it has nothing to do
> with the buffer's contents ...

Also, do you have the buffer id of the broken buffer? I wonder if it's not
just any buffer but always the same buffer, even if it's a different block
in that buffer. E.g., maybe it's always the first buffer because something
is overwriting past the end of the BufferBlocks array, which is declared
immediately before PrivateRefCount. (Or maybe your compiler is laying out
these objects in a different way from most people's compilers, and we're
routinely overwriting past the end of some other object, but yours is the
only place where it's being laid out preceding a critical data structure.)

-- 
greg
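As a standalone illustration of that stray-store idea (a contrived toy model only, not the real memory layout; the names NSLOTS, memory, other, and refcount are invented here), an index that runs one slot past the object sitting just before the refcount array leaves a bogus power-of-two value in the first refcount slot:

#include <stdio.h>

#define NSLOTS 4

int main(void)
{
	/* One flat allocation: the second half plays the role of the refcount
	 * array, the first half is whatever happens to sit just before it. */
	int		memory[2 * NSLOTS] = {0};
	int	   *other = &memory[0];
	int	   *refcount = &memory[NSLOTS];

	/* A miscomputed index on the preceding object spills into the first
	 * refcount slot. */
	other[NSLOTS] = 1 << 30;	/* logically out of range for "other" */

	printf("refcount[0] = %d\n", refcount[0]);	/* prints 1073741824 */
	return 0;
}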
On 12/26/12 5:40 PM, Greg Stark wrote:
> Also, do you have the buffer id of the broken buffer? I wonder if it's not
> just any buffer but always the same buffer, even if it's a different block
> in that buffer.

I just added something looking for that. Before I got to that, I found
another crash:

2012-12-26 18:01:42 EST [973]: WARNING: refcount of base/16384/65553 blockNum=22140, flags=0x1a7 is 1073741824 should be 0, globally: 0
2012-12-26 18:01:42 EST [973]: WARNING: buffers with non-zero refcount is 1

-bash-4.1$ export PGPORT=5433
-bash-4.1$ psql -d pgbench -c "select relname,relkind,relfilenode from pg_class where relfilenode=65553"
        relname        | relkind | relfilenode
-----------------------+---------+-------------
 pgbench_accounts_pkey | i       |       65553

So back to an index again.

> (Or maybe your compiler is laying out these objects in a different way
> from most people's compilers, and we're routinely overwriting past the end
> of some other object, but yours is the only place where it's being laid
> out preceding a critical data structure.)

I doubt there is anything special about this compiler, given that it's the
standard RedHat 6 build stack cloned via Scientific Linux 6.0. The two
things I expect I'm doing differently than most tests are:

-Using 2GB for shared_buffers
-Running a write-heavy test that goes on for many hours

It would be nice if this were just something like a memory issue on this
system. But the fact that I'm getting the same very odd value every
time--this refcount of 1073741824--makes it seem less random than I expect
from bad memory. Once I get a few more crash samples (with buffer ids),
I'll shut the system down for a pass of memtest86+.

Regardless, I've copied the same source code and test configuration over to
a similar system here. If I can reproduce this on a second system, I'll
push all the details out to the list, hopeful that other people might see
it too.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On 12/26/12 5:28 PM, Greg Stark wrote:
> Did you ever say what kind of hardware it was? This is the local reference
> count, so I can't see how it could be a race condition or anything like
> that, but it sure smells a bit like one.

Agreed; that smell is the reason I'm proceeding so far as if this were an
obscure bug rather than something wrong with my hardware or install. The
CPU is an Intel i7 870 using HyperThreading to simulate 8 cores, with 16GB
of RAM. The database is on a single regular SATA drive; I simplified the
storage setup before I started the public reports.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Wed, Dec 26, 2012 at 11:47 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> It would be nice if this were just something like a memory issue on this
> system. But the fact that I'm getting the same very odd value every
> time--this refcount of 1073741824--makes it seem less random than I expect
> from bad memory. Once I get a few more crash samples (with buffer ids),
> I'll shut the system down for a pass of memtest86+.

Well, that's a one-bit error, and it would never get detected until the
value was decremented down to what should be zero, so that's pretty much
exactly what I would expect to see from a memory or CPU error.

What's odd is that it's always hitting the LocalRefCount array, not any
other large data structure. For 2GB of buffers, LocalRefCount will be 1MB
per client. That's a pretty big target, but it's hardly the only such data
structure in Postgres.

It's also possible it's a bad CPU, not bad memory. If it affects decrement
or increment in particular, it's possible that the pattern of usage on
LocalRefCount is particularly prone to triggering it.

-- 
greg
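One quick way to check whether a corrupt value is consistent with a single flipped bit: XOR it with the expected value and see whether the result is a power of two. A small standalone sketch (the helper name differs_by_one_bit is made up for this example):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True when observed differs from expected in exactly one bit position. */
static bool differs_by_one_bit(uint32_t observed, uint32_t expected)
{
	uint32_t	diff = observed ^ expected;

	return diff != 0 && (diff & (diff - 1)) == 0;
}

int main(void)
{
	/* 1073741824 is 1 << 30: exactly one bit away from the expected 0. */
	printf("%d\n", differs_by_one_bit(1073741824U, 0));	/* prints 1 */
	printf("%d\n", differs_by_one_bit(3U, 0));		/* prints 0 */
	return 0;
}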
Greg Stark <stark@mit.edu> writes:
> On Wed, Dec 26, 2012 at 11:47 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> It would be nice if this were just something like a memory issue on this
>> system. But the fact that I'm getting the same very odd value every
>> time--this refcount of 1073741824--makes it seem less random than I
>> expect from bad memory. Once I get a few more crash samples (with buffer
>> ids), I'll shut the system down for a pass of memtest86+.

> Well, that's a one-bit error, and it would never get detected until the
> value was decremented down to what should be zero, so that's pretty much
> exactly what I would expect to see from a memory or CPU error.

Yeah, the fact that it's always the same bit makes it seem like it could be
one bad physical bit.  (Does this machine have ECC memory??)

The thing that this theory has a hard time with is that the buffer's global
refcount is zero.  If you assume that there's a bit that sometimes randomly
goes to 1 when it should be 0, then what I'd expect to typically happen is
that UnpinBuffer sees nonzero LocalRefCount and hence doesn't drop the
session's global pin when it should.  The only way that doesn't happen is
if decrementing LocalRefCount to zero stores a nonzero pattern when it
should store zero, but nonetheless the CPU thinks it stored zero.  As you
say there's some small possibility of a CPU glitch doing that, but then why
is it only happening to LocalRefCount and not any other similar coding?

At the moment I like the other theory you alluded to, that this is a wild
store from code that thinks it's manipulating some other data structure
entirely.  The buffer IDs will help confirm or refute that perhaps.  No
idea ATM how we would find the problem if it's like that ...

			regards, tom lane
On Thu, Dec 27, 2012 at 3:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The thing that this theory has a hard time with is that the buffer's global
> refcount is zero.  If you assume that there's a bit that sometimes randomly
> goes to 1 when it should be 0, then what I'd expect to typically happen is
> that UnpinBuffer sees nonzero LocalRefCount and hence doesn't drop the
> session's global pin when it should.  The only way that doesn't happen is
> if decrementing LocalRefCount to zero stores a nonzero pattern when it
> should store zero, but nonetheless the CPU thinks it stored zero.

It seems entirely plausible when you factor in the L2 cache. The CPU could
be happily incrementing and decrementing the count entirely correctly,
blissfully unaware that the value being stored in the DRAM has this extra
bit set every time. Not until the transaction ends and it has to refetch
the cache line, because enough time has passed for it to age out of the L2
cache, does it find the corrupt value.

> At the moment I like the other theory you alluded to, that this is a wild
> store from code that thinks it's manipulating some other data structure
> entirely.  The buffer IDs will help confirm or refute that perhaps.  No
> idea ATM how we would find the problem if it's like that

Yeah, though I have to admit the particular data structures I mentioned are
entirely irrelevant. Looking back at the code, LocalRefCount is heap
allocated and BufferBlocks is actually a pointer to shmem. If it's always
the first buffer, then it could conceivably still be some other heap
allocated object that always lands before LocalRefCount. It does seem a bit
weird to be storing 1<<30 though -- there are no 1<<30 constants that we
might be storing, for example.

It occurs to me that it should be possible to walk the list of resource
owners and count up what the correct LocalRefCount should be every time we
increment or decrement it. That should catch the error sooner -- if it's
not a hardware problem that the cache is hiding.

-- 
greg
Greg Stark <stark@mit.edu> writes:
> It seems entirely plausible when you factor in the L2 cache. The CPU could
> be happily incrementing and decrementing the count entirely correctly,
> blissfully unaware that the value being stored in the DRAM has this extra
> bit set every time. Not until the transaction ends and it has to refetch
> the cache line, because enough time has passed for it to age out of the L2
> cache, does it find the corrupt value.

Hmm ... that could be plausible.  It would be good if we could reproduce
this (or fail to) on some other machine.  Greg mentioned running some
memory diagnostics as well.

			regards, tom lane
On Thu, Dec 27, 2012 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm ... that could be plausible.  It would be good if we could reproduce
> this (or fail to) on some other machine.  Greg mentioned running some
> memory diagnostics as well.

For what it's worth, I ran a bunch of pgbench runs at the same scale
factors, and with the same settings for shared_buffers and
checkpoint_segments, that Greg mentioned. I used the IBM POWER7 box. I did
not get a failure there; but there are enough other variables that I'm not
sure this proves much.

I'm always skeptical of the explanation that the hardware is bad. That
could be it, but bad memory rarely causes such predictable failures, at
least IME.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/27/12 7:43 AM, Greg Stark wrote:
> If it's always the first buffer, then it could conceivably still be some
> other heap allocated object that always lands before LocalRefCount. It
> does seem a bit weird to be storing 1<<30 though -- there are no 1<<30
> constants that we might be storing, for example.

It is a strange power of two to be appearing there. I can follow your
reasoning for why this could be a bit-flipping error. There's no sign of
that elsewhere, though: no other crashes under load. I'm using this server
here because it's worked fine for a while now.

I added printing of the buffer number, and the numbers are all over the
place:

2012-12-27 06:36:39 EST [26306]: WARNING: refcount of buf 29270 containing base/16384/90124 blockNum=82884, flags=0x127 is 1073741824 should be 0, globally: 0
2012-12-27 02:08:19 EST [21719]: WARNING: refcount of buf 114262 containing base/16384/81932 blockNum=133333, flags=0x106 is 1073741824 should be 0, globally: 0
2012-12-26 20:03:05 EST [15117]: WARNING: refcount of buf 142934 containing base/16384/73740 blockNum=87961, flags=0x127 is 1073741824 should be 0, globally: 0

The relation continues to bounce between pgbench_accounts and its primary
key; no pattern there either that I can see.

To answer a few other questions: this system does not have ECC RAM. It did
survive many passes of memtest86+ without any problems, though, right after
the above. I tried duplicating the problem on a similar server, but it
keeps hanging due to some Linux software RAID bug before it runs for very
long. Whatever is going on here, it really doesn't want to be discovered.

For reference's sake, the debugging code those latest messages came from is
now:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..60d3ad3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,27 @@ AtEOXact_Buffers(bool isCommit)
 	if (assert_enabled)
 	{
 		int			i;
+		int			RefCountErrors = 0;
 
 		for (i = 0; i < NBuffers; i++)
 		{
-			Assert(PrivateRefCount[i] == 0);
+
+			if (PrivateRefCount[i] != 0)
+			{
+				/*
+				PrintBufferLeakWarning(&BufferDescriptors[i]);
+				*/
+				BufferDesc *bufHdr = &BufferDescriptors[i];
+				elog(WARNING,
+					 "refcount of buf %d containing %s blockNum=%u, flags=0x%x is %u should be 0, globally: %u",
+					 i, relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+					 bufHdr->tag.blockNum, bufHdr->flags, PrivateRefCount[i], bufHdr->refcount);
+				RefCountErrors++;
+			}
 		}
+		if (RefCountErrors > 0)
+			elog(WARNING, "buffers with non-zero refcount is %d", RefCountErrors);
+		Assert(RefCountErrors == 0);
 	}
 #endif
On Sat, Dec 29, 2012 at 10:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> It is a strange power of two to be appearing there. I can follow your
> reasoning for why this could be a bit-flipping error. There's no sign of
> that elsewhere, though: no other crashes under load. I'm using this server
> here because it's worked fine for a while now.

In theory, if this is sufficiently reproducible, could you start backing up
through the commit history 100 or so commits at a time until you find the
culprit? Or use git bisect? It'd be kind of time-consuming, but...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 30 December 2012 04:37, Robert Haas <robertmhaas@gmail.com> wrote:
> In theory, if this is sufficiently reproducible, could you start backing
> up through the commit history 100 or so commits at a time until you find
> the culprit? Or use git bisect? It'd be kind of time-consuming, but...

FYI, Andrew Dunstan recently blogged about using git bisect to find
problems in Postgres. His example shell script will save you the trouble of
independently coming up with much the same boilerplate (the non-boilerplate
parts are probably what you need to worry about, though):

http://people.planetpostgresql.org/andrew/index.php?/archives/301-Finding-the-origin-of-problems-with-git-bisect.html

If you try to do a binary search through many commits *manually*, for a bug
that is this hard to reproduce, well, that's no fun.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Dec 30, 2012 at 3:07 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> It is a strange power of two to be appearing there. I can follow your
> reasoning for why this could be a bit-flipping error. There's no sign of
> that elsewhere, though: no other crashes under load. I'm using this server
> here because it's worked fine for a while now.

Hm. Another idea: can you use taskset to bind postgres to just one
processor at a time and see if it's reproducible on both processors? The
processor affinity is inherited across fork, so doing it once on the
postmaster before starting pgbench ought to be sufficient.

If it's not a hardware error, then the most plausible explanation seems to
me to be that one of the bitsets such as t_infomask or something is being
set based on a wild pointer. For example, src/backend/commands/trigger.c:3469
and 3572 set this bit offset on a bitmask at the end of a pointer. Those
lines haven't changed in 4 years, but you don't have any command triggers
on this database, do you? Maybe this pointer isn't being initialized?

-- 
greg
On 12/23/12 3:17 PM, Simon Riggs wrote:
> We already have PrintBufferLeakWarning() for this, which might be a bit
> neater.

It does look like basically the same info. I hacked the code to generate
this warning all the time. The patch from Andres I've been using gives:

WARNING: refcount of buf 1 containing global/12886 blockNum=0, flags=0x106 is 0 should be 0, globally: 0

And using PrintBufferLeakWarning with the same input:

WARNING: buffer refcount leak: [001] (rel=global/12886, blockNum=0, flags=0x106, refcount=0 0)

Attached is a change I'd propose be committed. This doesn't do anything in
a non-assertion build; it just follows all of the "why don't you try..."
suggestions I've gotten here. I still have no idea why these warnings are
popping out for me. I'm saving the state on all of this to try and track it
down again later. I'm out of time to bang my head on this particular thing
right now, and it doesn't seem that important given that it's not
reproducible anywhere else. Any assertion errors that come out will be more
useful with just this change, though. Output looks like this (from code
hacked to always trip it, with shared_buffers=16):

...
WARNING: buffer refcount leak: [016] (rel=pg_tblspc/0/PG_9.3_201212081/0/0_(null), blockNum=4294967295, flags=0x0, refcount=0 0)
WARNING: buffers with non-zero refcount is 16
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1724)

I intentionally used a regular diff for this small one because it shows the
subtle changes I made here better. I touched more than I strictly had to,
because this area of the code is filled with off-by-one corrections, and
adding this warning highlighted one I objected to.

Note how the function is defined:

PrintBufferLeakWarning(Buffer buffer)

The official buffer identifier description says:

typedef int Buffer;

Zero is invalid, positive is the index of a shared buffer (1..NBuffers),
negative is the index of a local buffer (-1 .. -NLocBuffer).

However, the loop in AtEOXact_Buffers aimed to iterate over
PrivateRefCount, using array indexes 0...NBuffers-1. It was using an array
index rather than a proper buffer identification number. That meant that
when I tried to pass it to PrintBufferLeakWarning, it asserted on the
buffer id--because "Zero is invalid".

I could have just fixed that with an off-by-one correction in the other
direction. That seemed wrong to me, though. All of the other code like
PrintBufferLeakWarning expects to see a Buffer value, and it already
corrects for the off-by-one problem like this:

buf = &BufferDescriptors[buffer - 1];
loccount = PrivateRefCount[buffer - 1];

It seemed cleaner to me to make the i variable in AtEOXact_Buffers be a
Buffer number. Then that code accesses PrivateRefCount[buffer - 1], making
it look more like the surrounding references, and the value goes directly
into PrintBufferLeakWarning without a problem.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
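Concretely, the loop change being described should come out roughly like this (a sketch reconstructed from the description above, not the attached patch itself):

	if (assert_enabled)
	{
		int			RefCountErrors = 0;
		Buffer		b;

		/* Iterate by 1-based buffer number, not by array index. */
		for (b = 1; b <= NBuffers; b++)
		{
			if (PrivateRefCount[b - 1] != 0)
			{
				PrintBufferLeakWarning(b);
				RefCountErrors++;
			}
		}
		Assert(RefCountErrors == 0);
	}

With the loop variable holding a real Buffer value, PrintBufferLeakWarning(b) can apply its own "buffer - 1" correction internally, and the "Zero is invalid" assertion no longer trips.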
On 12/26/12 7:23 PM, Greg Stark wrote:
> It's also possible it's a bad CPU, not bad memory. If it affects decrement
> or increment in particular, it's possible that the pattern of usage on
> LocalRefCount is particularly prone to triggering it.

This looks to be the winning answer. It turns out that under extended
multi-hour loads at high concurrency, something related to CPU overheating
was occasionally flipping a bit. One round of compressed air for all the
fans/vents, a little tweaking of the fan controls, and now the system runs
for more than 24 hours with no problems.

Sorry about all the noise over this. I do think the improved warning
messages that came out of the diagnosis ideas are useful. The reworked code
must slow down the checking by a few cycles, but if you care about
performance, these assertions are tacked onto the biggest pig around
anyway.

I added the patch to the January CF as "Improve buffer refcount leak
warning messages". The sample I showed with the patch submission was a
simulated one. Here's the output from the last crash before resolving the
issue, where the assertion really triggered:

WARNING: buffer refcount leak: [170583] (rel=base/16384/16578, blockNum=302295, flags=0x106, refcount=0 1073741824)
WARNING: buffers with non-zero refcount is 1
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1712)

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Sun, Jan 13, 2013 at 12:34:07AM -0500, Greg Smith wrote:
> On 12/26/12 7:23 PM, Greg Stark wrote:
> > It's also possible it's a bad CPU, not bad memory. If it affects
> > decrement or increment in particular, it's possible that the pattern of
> > usage on LocalRefCount is particularly prone to triggering it.
>
> This looks to be the winning answer. It turns out that under extended
> multi-hour loads at high concurrency, something related to CPU overheating
> was occasionally flipping a bit. One round of compressed air for all the
> fans/vents, a little tweaking of the fan controls, and now the system runs
> for more than 24 hours with no problems.

Odd your system didn't report the problem to you.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Hi,

I just reviewed this patch:

https://commitfest.postgresql.org/action/patch_view?id=1035

2013/1/13 Greg Smith <greg@2ndquadrant.com>:
> Sorry about all the noise over this. I do think the improved warning
> messages that came out of the diagnosis ideas are useful.
>
> I added the patch to the January CF as "Improve buffer refcount leak
> warning messages". The sample I showed with the patch submission was a
> simulated one. Here's the output from the last crash before resolving the
> issue, where the assertion really triggered:
>
> WARNING: buffer refcount leak: [170583] (rel=base/16384/16578,
> blockNum=302295, flags=0x106, refcount=0 1073741824)
> WARNING: buffers with non-zero refcount is 1
> TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1712)

This patch is intended to improve the warning message at
AtEOXact_Buffers(), but I guess another function, AtProcExit_Buffers(),
needs to be modified as well. Right?

With this additional fix, the patch can be applied to the current git
master and compiled with the --enable-cassert option.

Then, I need some suggestion from hackers to continue this review: how
should I reproduce this message? This is a debug warning message, so it's
not easy for me to reproduce. Any suggestions?

-- 
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC http://www.uptime.jp/
On 1/27/13 2:32 AM, Satoshi Nagayasu wrote:
> This patch is intended to improve the warning message at
> AtEOXact_Buffers(), but I guess another function, AtProcExit_Buffers(),
> needs to be modified as well. Right?

Yes, good catch. I've attached an updated patch that does the same sort of
modification to that one too.

> Then, I need some suggestion from hackers to continue this review: how
> should I reproduce this message? This is a debug warning message, so it's
> not easy for me to reproduce.

What I was doing to test the code out is alter this check:

if (PrivateRefCount[i - 1] != 0)

to have the opposite logic:

if (PrivateRefCount[i - 1] == 0)

Then the patch fires on every buffer, all the time. You can still tell
which are the real errors should you happen to get one, because the log
entry shows the count; if it's 0, you know it's just a debugging message.

To keep the output from a test like that from being completely
overwhelming, I also set:

shared_buffers = 16

which is its minimum value.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Greg Smith <greg@2ndQuadrant.com> writes:
> On 1/27/13 2:32 AM, Satoshi Nagayasu wrote:
>> This patch is intended to improve the warning message at
>> AtEOXact_Buffers(), but I guess another function, AtProcExit_Buffers(),
>> needs to be modified as well. Right?

> Yes, good catch. I've attached an updated patch that does the same sort of
> modification to that one too.

I went ahead and committed this, since it didn't really seem to me to be in
need of additional testing.

			regards, tom lane