Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" |
Date | |
Msg-id | 7317.1531694923@sss.pgh.pa.us Whole thread Raw |
In response to | Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
|
List | pgsql-hackers |
Michael Paquier <michael@paquier.xyz> writes: > On Wed, Jun 27, 2018 at 06:39:39PM -0500, Justin Pryzby wrote: >> < 2018-06-17 11:38:45.468 CDT >FATAL: could not extend file "base/17379/38665798": No space left on device >> < 2018-06-17 11:38:45.468 CDT >HINT: Check free disk space. >> < 2018-06-17 11:38:45.468 CDT >CONTEXT: WAL redo at 2AA/239676B8 for Heap/INSERT+INIT: off 1 >> < 2018-06-17 11:38:45.468 CDT >WARNING: buffer refcount leak: [2366] (rel=base/17379/38665798, blockNum=1241, flags=0x8a000000,refcount=1 1) >> TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > You may be interested in this thread, vintage 2017, where I also posted > a test case, assuming that you can create an extra partition and make it > run out of space: > https://www.postgresql.org/message-id/19533.1497283908@sss.pgh.pa.us > So that's a known problem. I got around to poking into this today. It's easily reproducible with Michael's script, and capturing a stack trace from the assertion seems to be enough to explain the problem: #0 ExceptionalCondition (conditionName=0xa60137 "!(RefCountErrors == 0)", errorType=0x8e50f7 "FailedAssertion", fileName=0xa5ffa3 "bufmgr.c", lineNumber=2523) at assert.c:30 #1 0x000000000075829b in CheckForBufferLeaks () at bufmgr.c:2523 #2 0x00000000007582b3 in AtProcExit_Buffers (code=<value optimized out>, arg=<value optimized out>) at bufmgr.c:2476 #3 0x00000000007651f5 in shmem_exit (code=1) at ipc.c:272 #4 0x000000000076526e in proc_exit_prepare (code=1) at ipc.c:194 #5 0x00000000007652f8 in proc_exit (code=1) at ipc.c:107 #6 0x000000000089af5d in errfinish (dummy=<value optimized out>) at elog.c:543 #7 0x000000000078615b in mdextend (reln=<value optimized out>, forknum=MAIN_FORKNUM, blocknum=<value optimized out>, buffer=<value optimized out>, skipFsync=false) at md.c:549 #8 0x0000000000758bcf in ReadBuffer_common (smgr=<value optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19603, mode=RBM_ZERO_AND_LOCK, strategy=0x0, hit=0x7fff70b892cf) at bufmgr.c:865 #9 0x0000000000759424 in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=4294967295, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:692 #10 0x000000000052a09b in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=19603, mode=RBM_ZERO_AND_LOCK) at xlogutils.c:489 #11 0x000000000052a419 in XLogReadBufferForRedoExtended (record=0x13827e8, block_id=0 '\000', mode=RBM_ZERO_AND_LOCK, get_cleanup_lock=false, buf=0x7fff70b893ac) at xlogutils.c:390 #12 0x000000000052a649 in XLogInitBufferForRedo (record=<value optimized out>, block_id=<value optimized out>) at xlogutils.c:305 #13 0x00000000004bfa8a in heap_xlog_insert (record=0x13827e8) at heapam.c:8580 #14 0x00000000004c0b55 in heap_redo (record=0x13827e8) at heapam.c:9282 #15 0x000000000051fab0 in StartupXLOG () at xlog.c:7319 #16 0x0000000000710b71 in StartupProcessMain () at startup.c:217 #17 0x000000000052eeb5 in AuxiliaryProcessMain (argc=2, argv=0x7fff70b8c320) at bootstrap.c:441 #18 0x000000000070b6d7 in StartChildProcess (type=StartupProcess) at postmaster.c:5331 #19 0x000000000070fc25 in PostmasterMain (argc=3, argv=<value optimized out>) at postmaster.c:1371 So basically, WAL replay hits an error while holding a buffer pin, and nothing is done to release the buffer pin, but AtProcExit_Buffers thinks something should have been done. I'm not sure that it's worth trying to fix this "nicely". The startup process does not run a transaction and hence doesn't have any of the infrastructure that'd be needed to clean up in a nice way --- for instance, it has no CurrentResourceOwner. Perhaps in some distant future somebody will think it's worth trying to improve that, but I'm not interested in working on it right now, and even less interested in back-patching it. What seems like a better answer for now is to adjust AtProcExit_Buffers so that it doesn't cause an assertion failure in this case. I think we can define "this case" as being "failure exit from the startup process": if that happens, the postmaster is going to throw up its hands and force a panic shutdown anyway, so the failure to free a buffer pin shouldn't have any serious consequences. The attached one-liner patch does that, and is enough to get through Michael's test case without an assertion. This is just proof of concept though --- my inclination, if we go this route, is to make a slightly longer patch that would fix CheckForBufferLeaks to still print the WARNING messages if any, but not die with an assertion afterwards. Another question is whether this is really a sufficient band-aid. It looks to me like AtProcExit_Buffers will be called in any auxiliary process type, not only the startup process. Do, or should we, force a panic restart for nonzero-exit-code failures of all aux process types? If not, what are we going to do to clean up similar failures in other aux process types? BTW, this assertion has been there since fe548629; before that, there was code that would release any leaked buffer pins, relying on the PrivateRefCount data for that. So another idea is to restore some version of that capability, although I think we really really don't wanna scan the PrivateRefCount array if we can avoid it. Thoughts? regards, tom lane diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 01eabe5..d814229 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2473,7 +2473,14 @@ AtProcExit_Buffers(int code, Datum arg) AbortBufferIO(); UnlockBuffers(); - CheckForBufferLeaks(); + /* + * The startup process might fail to release buffer pins if it fatals out. + * Since that will result in postmaster shutdown anyway, there's not + * really a reason to work hard to prevent it; but let's not raise an + * assertion failure when it happens. + */ + if (code == 0 || !AmStartupProcess()) + CheckForBufferLeaks(); /* localbuf.c needs a chance too */ AtProcExit_LocalBuffers();
pgsql-hackers by date: