Home > mailing lists

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Date	July 16, 2018 19:13:37
Msg-id	20180716131337.usestwjviwwocpyz@alap3.anarazel.de Whole thread Raw
In response to	Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
List	pgsql-hackers

Tree view

Hi,

On 2018-07-15 18:48:43 -0400, Tom Lane wrote:
> So basically, WAL replay hits an error while holding a buffer pin, and
> nothing is done to release the buffer pin, but AtProcExit_Buffers thinks
> something should have been done.

I think there's a few other cases where we hit this. I've seen something
similar from inside checkpointer / BufferSync(). I'd be surprised if
bgwriter couldn't be triggered into the same.


> What seems like a better answer for now is to adjust AtProcExit_Buffers
> so that it doesn't cause an assertion failure in this case.  I think we
> can define "this case" as being "failure exit from the startup process":
> if that happens, the postmaster is going to throw up its hands and force
> a panic shutdown anyway, so the failure to free a buffer pin shouldn't
> have any serious consequences.


> The attached one-liner patch does that, and is enough to get through
> Michael's test case without an assertion.  This is just proof of concept
> though --- my inclination, if we go this route, is to make a slightly
> longer patch that would fix CheckForBufferLeaks to still print the WARNING
> messages if any, but not die with an assertion afterwards.
> 
> Another question is whether this is really a sufficient band-aid.  It
> looks to me like AtProcExit_Buffers will be called in any auxiliary
> process type, not only the startup process.

We could just replace the Assert() with a PANIC?


> Do, or should we, force a panic restart for nonzero-exit-code failures
> of all aux process types?  If not, what are we going to do to clean up
> similar failures in other aux process types?

I'm pretty sure that we do *not* force a panic on all nonzero-exit-code
cases for other subprocesses.


> BTW, this assertion has been there since fe548629; before that, there
> was code that would release any leaked buffer pins, relying on the
> PrivateRefCount data for that.  So another idea is to restore some
> version of that capability, although I think we really really don't
> wanna scan the PrivateRefCount array if we can avoid it.

I don't think scanning PrivateRefCount would be particularly problematic
- in almost all cases it's going to be tiny. These day it's not NBuffers
sized anymore, so I can't forsee any performance problems?

I think we could invent something like like enter/leave "transactional
mode" wherein we throw a PANIC when a buffer leaked. Processes that
don't enter it - like startup, checkpointer, etc - would just WARN?

Greetings,

Andres Freund

pgsql-hackers by date:

From: Andres Freund
Date: 16 July 2018, 19:06:10
Subject: Re: patch to allow disable of WAL recycling

From: Andres Freund
Date: 16 July 2018, 19:19:25
Subject: Re: patch to allow disable of WAL recycling

Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)" - Mailing list pgsql-hackers

Previous

Next