Thread: clog_redo causing very long recovery time

clog_redo causing very long recovery time

From
Joseph Conway
Date:
I'm working with a client that uses Postgres on what amounts to an
appliance.

The database is therefore subject to occasional torture such as, in this
particular case, running out of disk space while performing a million
plus queries (of mixed varieties, many using plpgsql with exception
handling -- more on that later), and eventually being power-cycled. Upon
restart, clog_redo was called approx 885000 times (CLOG_ZEROPAGE) during
recovery, which took almost 2 hours on their hardware. I should note
that this is on Postgres 8.3.x.

After studying the source, I can only see one possible way that this
could have occurred:

In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
these calls can lead to a page being pushed out of shared buffers and to
disk, so given a lack of disk space, sufficient clog buffers, but a lack
of subtrans buffers, this could happen. At that point the transaction id
does not get advanced, so clog zeros the same page again, ExtendSUBTRANS()
fails again, rinse and repeat.

I believe in the case above, subtrans buffers were exhausted due to the
extensive use of plpgsql with exception handling.

I can simulate this failure with the attached debug-clog patch, which
uses two pre-existing debug GUCs to selectively inject an ERROR between
the calls to ExtendCLOG() and ExtendSUBTRANS(). If you want to test this
yourself, apply the patch and use the function in test_clog.sql to
generate a million or so transactions. Within the first 32K transactions
(whenever clog first gets the opportunity to zero a new page) you should
start seeing messages about injected ERRORs. Let a few hundred thousand
ERRORs go by, then kill postgres. Recovery will take ages, because
clog_redo calls fsync hundreds of thousands of times in order to zero
the same page over and over.

The attached fix-clogredo diff is my proposal for a fix for this.

Thoughts/alternatives, etc appreciated.

Thanks,

Joe


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


Attachment

Re: clog_redo causing very long recovery time

From
Tom Lane
Date:
Joseph Conway <mail@joeconway.com> writes:
> I'm working with a client that uses Postgres on what amounts to an
> appliance.

> The database is therefore subject to occasional torture such as, in this
> particular case, running out of disk space while performing a million
> plus queries (of mixed varieties, many using plpgsql with exception
> handling -- more on that later), and eventually being power-cycled. Upon
> restart, clog_redo was called approx 885000 times (CLOG_ZEROPAGE) during
> recovery, which took almost 2 hours on their hardware. I should note
> that this is on Postgres 8.3.x.

> After studying the source, I can only see one possible way that this
> could have occurred:

> In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
> freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
> these calls can lead to a page being pushed out of shared buffers and to
> disk, so given a lack of disk space, sufficient clog buffers, but a lack
> of subtrans buffers, this could happen. At that point the transaction id
> does not get advanced, so clog zeros the same page again, ExtendSUBTRANS()
> fails again, rinse and repeat.

> I believe in the case above, subtrans buffers were exhausted due to the
> extensive use of plpgsql with exception handling.

Hmm, interesting.  I believe that it's not really a question of buffer
space or lack of it, but whether the OS will give us disk space when we
try to add a page to the current pg_subtrans file.  In any case, the
point is that a failure between ExtendCLOG and incrementing nextXid
is bad news.

> The attached fix-clogredo diff is my proposal for a fix for this.

That seems pretty grotty :-(

I think a more elegant fix might be to just swap the order of the 
ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
reason that would help is that pg_subtrans isn't WAL-logged, so if
we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
won't have written any XLOG entry, and thus repeated failures will not
result in repeated XLOG entries.  I seem to recall having considered
exactly that point when the clog WAL support was first done, but the
scenario evidently wasn't considered when subtransactions were stuck
in :-(.

It would probably also help to put in a comment admonishing people
to not add stuff right there.  I see the SSI guys have fallen into
the same trap.
        regards, tom lane


Re: clog_redo causing very long recovery time

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Fri May 06 00:22:43 -0300 2011:

> I think a more elegant fix might be to just swap the order of the 
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries.  I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.

I'm pretty sure it would have never occurred to me to consider such a
scenario.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: clog_redo causing very long recovery time

From
Joe Conway
Date:
On 05/05/2011 08:22 PM, Tom Lane wrote:
> Joseph Conway <mail@joeconway.com> writes:
>> The attached fix-clogredo diff is my proposal for a fix for this.
>
> That seems pretty grotty :-(
>
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries.  I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.

Yes, that does seem much nicer :-)

> It would probably also help to put in a comment admonishing people
> to not add stuff right there.  I see the SSI guys have fallen into
> the same trap.

Right -- I think another similar problem exists in GetNewMultiXactId,
where ExtendMultiXactOffset could succeed and write an XLOG entry and
then ExtendMultiXactMember could fail before advancing nextMXact. The
problem in this case is that both calls write XLOG entries, so a simple
reversal doesn't help.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


Re: clog_redo causing very long recovery time

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> Right -- I think another similar problem exists in GetNewMultiXactId
> where ExtendMultiXactOffset could succeed and write an XLOG entry and
> then  ExtendMultiXactMember could fail before advancing nextMXact. The
> problem in this case is that they both write XLOG entries, so a simple
> reversal doesn't help.

Hmm.  Maybe we need a real fix then.  I was just sitting here
speculating about whether we'd ever decide we need to WAL-log
pg_subtrans --- because if we did, my solution would fail.

I still think that the right fix is to avoid emitting redundant
XLOG records in the first place, rather than hacking recovery
to not process them.  Possibly we could modify slru.c so that
it could be determined whether zeroing of the current page had
already happened.  In a quick look, it looks like noting whether
latest_page_number had already been advanced to that page might
do the trick.
        regards, tom lane


Re: clog_redo causing very long recovery time

From
Joe Conway
Date:
On 05/05/2011 09:00 PM, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> Right -- I think another similar problem exists in GetNewMultiXactId
>> where ExtendMultiXactOffset could succeed and write an XLOG entry and
>> then  ExtendMultiXactMember could fail before advancing nextMXact. The
>> problem in this case is that they both write XLOG entries, so a simple
>> reversal doesn't help.
>
> Hmm.  Maybe we need a real fix then.  I was just sitting here
> speculating about whether we'd ever decide we need to WAL-log
> pg_subtrans --- because if we did, my solution would fail.
>
> I still think that the right fix is to avoid emitting redundant
> XLOG records in the first place, rather than hacking recovery
> to not process them.  Possibly we could modify slru.c so that
> it could be determined whether zeroing of the current page had
> already happened.  In a quick look, it looks like noting whether
> latest_page_number had already been advanced to that page might
> do the trick.

Thanks -- I'll test that out.

Joe


--
Joseph E Conway
President/CEO
credativ LLC
www.credativ.us

616 Burnham Street
El Cajon, CA 92019

Office: +1 619 270 8787
Mobile: +1 619 843 8340


Re: clog_redo causing very long recovery time

From
Simon Riggs
Date:
On Fri, May 6, 2011 at 4:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> The attached fix-clogredo diff is my proposal for a fix for this.
>
> That seems pretty grotty :-(
>
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries.  I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.

I agree with Tom about the need for a fix that prevents generation of
repeated WAL records.

OTOH, I also like Joe's fix in the recovery code to avoid responding
to repeated records.

Can we have both please?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: clog_redo causing very long recovery time

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> I agree with Tom about the need for a fix that prevents generation of
> repeated WAL records.

> OTOH, I also like Joe's fix in the recovery code to avoid responding
> to repeated records.

> Can we have both please?

Why?  The patch in the recovery code is seriously ugly, and it won't
do anything useful once we've fixed the other end.  Please notice also
that we'd need several instances of that kluge if we want to cover all
the SLRU-based cases.
        regards, tom lane