Thread: Catching resource leaks during WAL replay

Catching resource leaks during WAL replay

From: Heikki Linnakangas
Date: 27 March 2013, 20:40:22

While looking at bug #7969, it occurred to me that it would be nice if
we could catch resource leaks in WAL redo routines better. It would be
useful during development, to catch bugs earlier, and it could've turned
that replay-stopping error into a warning.

For regular transactions, we use ResourceOwners to track buffer pins
(like in #7969) and other resources. There's no fundamental reason we
couldn't use one during replay. After running a redo routine, there
should be no buffer pins or other resources held.

LWLocks are not tracked by resource owners, but we could still easily
warn if any are held after the redo routine exits.
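
To make the proposal concrete, here is a minimal toy model of the idea in Python. It is not PostgreSQL code and not the attached patch: `ReplayResourceOwner`, `replay_record`, `clean_redo` and `leaky_redo` are hypothetical names invented for illustration, and buffers and locks are plain integers and strings. It only shows the shape of the check: run a redo routine, then warn about and release anything it left behind.

```python
import warnings


class ReplayResourceOwner:
    """Toy stand-in for a resource owner: it only remembers pinned buffers."""

    def __init__(self):
        self.pinned_buffers = []

    def pin(self, buf):
        self.pinned_buffers.append(buf)

    def unpin(self, buf):
        self.pinned_buffers.remove(buf)


def replay_record(redo_routine, record, owner, held_lwlocks):
    """Run one redo routine, then warn about and release leftovers.

    Returns a list of (kind, resource) tuples describing what was leaked.
    """
    redo_routine(record, owner, held_lwlocks)

    leaked = []
    # Buffer pins are tracked by the (toy) resource owner.
    for buf in list(owner.pinned_buffers):
        warnings.warn(f"buffer pin leak: buffer {buf} still pinned after {record}")
        owner.unpin(buf)
        leaked.append(("pin", buf))
    # Locks are not tracked by the owner, so they are checked separately.
    for lock in sorted(held_lwlocks):
        warnings.warn(f"lock leak: {lock} still held after {record}")
        held_lwlocks.discard(lock)
        leaked.append(("lwlock", lock))
    return leaked


def clean_redo(record, owner, held_lwlocks):
    """Hypothetical well-behaved redo routine: pins and unpins buffer 3."""
    owner.pin(3)
    owner.unpin(3)


def leaky_redo(record, owner, held_lwlocks):
    """Hypothetical buggy redo routine: forgets to unpin buffer 7."""
    owner.pin(7)
```

In this model a leak becomes a warning plus a forced release, so replay can continue instead of stopping at a later error.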

- Heikki

Attachment: xlog-redo-resource-leak-warning.patch

Re: Catching resource leaks during WAL replay

From: Simon Riggs
Date: 27 March 2013, 21:42:19

On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> While looking at bug #7969, it occurred to me that it would be nice if we
> could catch resource leaks in WAL redo routines better. It would be useful
> during development, to catch bugs earlier, and it could've turned that
> replay-stopping error into a warning.
>
> For regular transactions, we use ResourceOwners to track buffer pins (like
> in #7969) and other resources. There's no fundamental reason we couldn't use
> one during replay. After running a redo routine, there should be no buffer
> pins or other resources held.
>
> LWLocks are not tracked by resource owners, but we could still easily warn
> if any are held after the redo routine exits.

I'm inclined to think that the overhead isn't worth the trouble. This
is the only bug of its type we had in recent years.

Perhaps we need another compile-time level for checks that are enabled only in beta builds?

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Catching resource leaks during WAL replay

From: Tom Lane
Date: 27 March 2013, 23:01:32

Simon Riggs <simon@2ndQuadrant.com> writes:
> On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> While looking at bug #7969, it occurred to me that it would be nice if we
>> could catch resource leaks in WAL redo routines better. It would be useful
>> during development, to catch bugs earlier, and it could've turned that
>> replay-stopping error into a warning.

> I'm inclined to think that the overhead isn't worth the trouble. This
> is the only bug of its type we had in recent years.

I agree that checking for resource leaks after each WAL record seems
too expensive compared to what we'd get for it.  But perhaps it's worth
making a check every so often, like at restartpoints?

        regards, tom lane

Re: Catching resource leaks during WAL replay

From: Simon Riggs
Date: 27 March 2013, 23:09:35

On 27 March 2013 23:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> While looking at bug #7969, it occurred to me that it would be nice if we
>>> could catch resource leaks in WAL redo routines better. It would be useful
>>> during development, to catch bugs earlier, and it could've turned that
>>> replay-stopping error into a warning.
>
>> I'm inclined to think that the overhead isn't worth the trouble. This
>> is the only bug of its type we had in recent years.
>
> I agree that checking for resource leaks after each WAL record seems
> too expensive compared to what we'd get for it.  But perhaps it's worth
> making a check every so often, like at restartpoints?

+1

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Catching resource leaks during WAL replay

From: Heikki Linnakangas
Date: 28 March 2013, 15:00:36

On 28.03.2013 01:01, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> While looking at bug #7969, it occurred to me that it would be nice if we
>>> could catch resource leaks in WAL redo routines better. It would be useful
>>> during development, to catch bugs earlier, and it could've turned that
>>> replay-stopping error into a warning.
>
>> I'm inclined to think that the overhead isn't worth the trouble. This
>> is the only bug of its type we had in recent years.
>
> I agree that checking for resource leaks after each WAL record seems
> too expensive compared to what we'd get for it.  But perhaps it's worth
> making a check every so often, like at restartpoints?

That sounds very seldom. How about making it an assertion to check after 
every record? I guess I'll have to do some testing to see how expensive 
it really is.
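
The per-record assertion idea can be sketched as follows, again as a toy Python model rather than PostgreSQL code. `replay_all`, `leak_on_third` and the `check_each_record` flag are hypothetical; the flag stands in for an assert-enabled build, so the check costs nothing when it is switched off.

```python
def replay_all(records, redo, check_each_record):
    """Apply every record; optionally fail hard as soon as one leaks.

    `pinned` plays the role of the resources tracked during replay.
    Returns whatever is still tracked when the loop ends.
    """
    pinned = []
    for rec in records:
        redo(rec, pinned)
        # Raised explicitly (instead of using `assert`) so the sketch behaves
        # the same even when Python runs with assertions stripped.
        if check_each_record and pinned:
            raise AssertionError(f"resource leak after record {rec!r}: {pinned}")
    return pinned


def leak_on_third(rec, pinned):
    """Hypothetical redo routine that leaks one pin while applying record 3."""
    if rec == 3:
        pinned.append("buffer 42")
```

With the flag on, the failure names the exact record that leaked; with it off, the leak goes unnoticed until something else inspects the leftover state.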

- Heikki

Re: Catching resource leaks during WAL replay

From: Tom Lane
Date: 28 March 2013, 15:09:41

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 28.03.2013 01:01, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> I'm inclined to think that the overhead isn't worth the trouble. This
>>> is the only bug of its type we had in recent years.

>> I agree that checking for resource leaks after each WAL record seems
>> too expensive compared to what we'd get for it.  But perhaps it's worth
>> making a check every so often, like at restartpoints?

> That sounds very seldom. How about making it an assertion to check after 
> every record? I guess I'll have to do some testing to see how expensive 
> it really is.

Well, the actually productive part of this patch is to reduce such a
failure from ERROR to WARNING, which seems like it probably only
requires *one* resource cleanup after we exit the apply loop.  Doing it
per restartpoint is probably reasonable to limit the resource owner's
memory consumption (if there were a leak) over a long replay sequence.
I am really not seeing much advantage to doing it per record.
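
The trade-off described above can be illustrated with a small toy model in Python. It is not PostgreSQL code: `replay_with_periodic_cleanup`, `always_leaks` and the `restartpoint_every` parameter are hypothetical names, and a "restartpoint" is simulated as happening every N records. The sketch compares a single cleanup after the apply loop with a cleanup at each simulated restartpoint, and records how large the tracked set grows in between.

```python
def replay_with_periodic_cleanup(records, redo, restartpoint_every):
    """Replay all records, releasing leaked resources only occasionally.

    restartpoint_every: run a cleanup after every N records (0 disables
    the periodic cleanup, leaving only the one after the loop).
    Returns (messages, peak), where peak is the largest number of leaked
    resources tracked at any one time.
    """
    pinned = []      # leaked resources currently tracked
    messages = []    # stand-in for WARNING lines in the server log
    peak = 0

    def cleanup(where):
        if pinned:
            messages.append(f"{len(pinned)} leaked resource(s) released at {where}")
            pinned.clear()

    for i, rec in enumerate(records, start=1):
        redo(rec, pinned)
        peak = max(peak, len(pinned))
        if restartpoint_every and i % restartpoint_every == 0:
            cleanup(f"restartpoint after record {i}")

    cleanup("end of apply loop")
    return messages, peak


def always_leaks(rec, pinned):
    """Hypothetical redo routine that leaks one resource per record."""
    pinned.append(rec)
```

For ten leaking records, a cleanup every four records keeps the peak at four tracked resources, while a single cleanup at the end lets it grow to ten. Both variants still report the leak as a warning rather than an error.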

I suppose you are thinking of being helpful during development, but if
anything I would argue that the current behavior of a hard failure is
best for development.  It guarantees that the developer will notice the
failure, if it occurs at all in his test scenario; whereas a WARNING
that goes only to the postmaster log will be very very easily missed.

        regards, tom lane