Thread: clog_redo causing very long recovery time
I'm working with a client that uses Postgres on what amounts to an
appliance. The database is therefore subject to occasional torture
such as, in this particular case, running out of disk space while
performing a million-plus queries (of mixed varieties, many using
plpgsql with exception handling -- more on that later), and eventually
being power-cycled. Upon restart, clog_redo was called approximately
885,000 times (CLOG_ZEROPAGE) during recovery, which took almost two
hours on their hardware. I should note that this is on Postgres 8.3.x.

After studying the source, I can see only one possible way that this
could have occurred: in varsup.c:GetNewTransactionId(), ExtendCLOG()
has to succeed in zeroing a fresh clog page, and ExtendSUBTRANS() has
to fail. Both of these calls can lead to a page being pushed out of
shared buffers and to disk, so given a lack of disk space, sufficient
clog buffers, but a lack of subtrans buffers, this can happen. At that
point the transaction id does not get advanced, so clog zeroes the
same page again, ExtendSUBTRANS() fails again, rinse and repeat.

I believe that in the case above, subtrans buffers were exhausted by
the extensive use of plpgsql with exception handling (each EXCEPTION
block starts a subtransaction).

I can simulate this failure with the attached debug-clog patch, which
uses two pre-existing debug GUCs to selectively inject an ERROR
between the calls to ExtendCLOG() and ExtendSUBTRANS(). If you want to
test this yourself, apply the patch and use the function in
test_clog.sql to generate a million or so transactions. At or before
the first 32K transactions (whenever clog first gets the opportunity
to zero a new page) you should start seeing messages about injected
ERRORs. Let a few hundred thousand ERRORs go by, then kill postgres.
Recovery will take ages, because clog_redo calls fsync hundreds of
thousands of times in order to zero the same page over and over.

The attached fix-clogredo diff is my proposal for a fix for this.
Thoughts, alternatives, etc. appreciated.

Thanks,

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
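For concreteness, here is a heavily simplified sketch of the window
being described (function and variable names follow varsup.c; locking,
wraparound checks, and all other detail are elided -- this is an
illustration, not the actual source):

    TransactionId
    GetNewTransactionId(bool isSubXact)
    {
        TransactionId xid = ShmemVariableCache->nextXid;

        /*
         * May zero a fresh clog page; the zeroing emits a CLOG_ZEROPAGE
         * WAL record and can push a dirty page out to disk.
         */
        ExtendCLOG(xid);

        /*
         * May likewise push a dirty pg_subtrans page out to disk.  If
         * that write fails (e.g. out of disk space), we ERROR out
         * here ...
         */
        ExtendSUBTRANS(xid);

        /*
         * ... and never reach this line, so nextXid is not advanced.
         * The next caller gets the same xid, zeroes the same clog page,
         * and emits yet another CLOG_ZEROPAGE record.
         */
        TransactionIdAdvance(ShmemVariableCache->nextXid);

        return xid;
    }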
Joseph Conway <mail@joeconway.com> writes:
> I'm working with a client that uses Postgres on what amounts to an
> appliance. The database is therefore subject to occasional torture
> such as, in this particular case, running out of disk space while
> performing a million plus queries (of mixed varieties, many using
> plpgsql with exception handling -- more on that later), and
> eventually being power-cycled. Upon restart, clog_redo was called
> approx 885000 times (CLOG_ZEROPAGE) during recovery, which took
> almost 2 hours on their hardware. I should note that this is on
> Postgres 8.3.x.

> After studying the source, I can only see one possible way that this
> could have occurred:

> In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
> freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
> these calls can lead to a page being pushed out of shared buffers and
> to disk, so given a lack of disk space, sufficient clog buffers, but
> lack of subtrans buffers, this could happen. At that point the
> transaction id does not get advanced, so clog zeros the same page,
> ExtendSUBTRANS() fails again, rinse and repeat.

> I believe in the case above, subtrans buffers were exhausted due to
> the extensive use of plpgsql with exception handling.

Hmm, interesting. I believe that it's not really a question of buffer
space or lack of it, but whether the OS will give us disk space when we
try to add a page to the current pg_subtrans file. In any case, the
point is that a failure between ExtendCLOG and incrementing nextXid is
bad news.

> The attached fix-clogredo diff is my proposal for a fix for this.

That seems pretty grotty :-(

I think a more elegant fix might be to just swap the order of the
ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The reason
that would help is that pg_subtrans isn't WAL-logged, so if we succeed
doing ExtendSUBTRANS and then fail in ExtendCLOG, we won't have written
any XLOG entry, and thus repeated failures will not result in repeated
XLOG entries. I seem to recall having considered exactly that point
when the clog WAL support was first done, but the scenario evidently
wasn't considered when subtransactions were stuck in :-(.

It would probably also help to put in a comment admonishing people to
not add stuff right there. I see the SSI guys have fallen into the
same trap.

			regards, tom lane
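In code terms, the proposed reordering would look something like this
(illustrative sketch only, not a patch):

    /*
     * pg_subtrans is not WAL-logged, so failing here -- even
     * repeatedly -- leaves nothing behind in XLOG.
     */
    ExtendSUBTRANS(xid);

    /*
     * May emit a CLOG_ZEROPAGE record, but only after the pg_subtrans
     * extension has already succeeded.
     */
    ExtendCLOG(xid);

    /*
     * NB: do not add anything that can fail between ExtendCLOG() and
     * the advance of nextXid below.
     */
    TransactionIdAdvance(ShmemVariableCache->nextXid);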
Excerpts from Tom Lane's message of Fri May 06 00:22:43 -0300 2011:
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will
> not result in repeated XLOG entries. I seem to recall having
> considered exactly that point when the clog WAL support was first
> done, but the scenario evidently wasn't considered when
> subtransactions were stuck in :-(.

I'm pretty sure it would never have occurred to me to consider such a
scenario.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 05/05/2011 08:22 PM, Tom Lane wrote:
> Joseph Conway <mail@joeconway.com> writes:
>> The attached fix-clogredo diff is my proposal for a fix for this.
>
> That seems pretty grotty :-(
>
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will
> not result in repeated XLOG entries. I seem to recall having
> considered exactly that point when the clog WAL support was first
> done, but the scenario evidently wasn't considered when
> subtransactions were stuck in :-(.

Yes, that does seem much nicer :-)

> It would probably also help to put in a comment admonishing people
> to not add stuff right there. I see the SSI guys have fallen into
> the same trap.

Right -- I think another similar problem exists in GetNewMultiXactId,
where ExtendMultiXactOffset could succeed and write an XLOG entry and
then ExtendMultiXactMember could fail before advancing nextMXact. The
problem in this case is that they both write XLOG entries, so a simple
reversal doesn't help.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
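The shape of that window, sketched from 8.3-era multixact.c (the two
function names are real; the argument names are recollection and
everything else is elided):

    /* May emit a WAL record for zeroing a fresh offsets page. */
    ExtendMultiXactOffset(result);

    /*
     * May emit a WAL record for zeroing a fresh members page.  An ERROR
     * here leaves the offsets record already written to XLOG, so
     * swapping the two calls cannot make the earlier failure harmless
     * the way it can for the clog/subtrans pair.
     */
    ExtendMultiXactMember(nextOffset, nmembers);

    /* nextMXact is advanced only after both calls succeed. */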
Joe Conway <mail@joeconway.com> writes:
> Right -- I think another similar problem exists in GetNewMultiXactId
> where ExtendMultiXactOffset could succeed and write an XLOG entry and
> then ExtendMultiXactMember could fail before advancing nextMXact. The
> problem in this case is that they both write XLOG entries, so a
> simple reversal doesn't help.

Hmm. Maybe we need a real fix then. I was just sitting here
speculating about whether we'd ever decide we need to WAL-log
pg_subtrans --- because if we did, my solution would fail.

I still think that the right fix is to avoid emitting redundant XLOG
records in the first place, rather than hacking recovery to not
process them. Possibly we could modify slru.c so that it could be
determined whether zeroing of the current page had already happened.
In a quick look, it looks like noting whether latest_page_number had
already been advanced to that page might do the trick.

			regards, tom lane
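Applied to clog, that idea might look roughly like the following. This
is a hypothetical sketch built on the shape of the 8.3 ExtendCLOG();
the latest_page_number guard is the illustrative addition, not
committed code:

    void
    ExtendCLOG(TransactionId newestXact)
    {
        int     pageno;

        /*
         * No work except at the first XID of a page (or just after
         * wraparound, when the first XID of page zero is
         * FirstNormalTransactionId).
         */
        if (TransactionIdToPgIndex(newestXact) != 0 &&
            !TransactionIdEquals(newestXact, FirstNormalTransactionId))
            return;

        pageno = TransactionIdToPage(newestXact);

        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

        /*
         * Hypothetical guard: if an earlier, failed call to
         * GetNewTransactionId() already zeroed this page (and wrote its
         * CLOG_ZEROPAGE record), SimpleLruZeroPage() will have advanced
         * latest_page_number to it, so don't zero and WAL-log it again.
         */
        if (ClogCtl->shared->latest_page_number != pageno)
            ZeroCLOGPage(pageno, true);     /* true => write WAL record */

        LWLockRelease(CLogControlLock);
    }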
On 05/05/2011 09:00 PM, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> Right -- I think another similar problem exists in GetNewMultiXactId
>> where ExtendMultiXactOffset could succeed and write an XLOG entry
>> and then ExtendMultiXactMember could fail before advancing
>> nextMXact. The problem in this case is that they both write XLOG
>> entries, so a simple reversal doesn't help.
>
> Hmm. Maybe we need a real fix then. I was just sitting here
> speculating about whether we'd ever decide we need to WAL-log
> pg_subtrans --- because if we did, my solution would fail.
>
> I still think that the right fix is to avoid emitting redundant
> XLOG records in the first place, rather than hacking recovery to
> not process them. Possibly we could modify slru.c so that it could
> be determined whether zeroing of the current page had already
> happened. In a quick look, it looks like noting whether
> latest_page_number had already been advanced to that page might do
> the trick.

Thanks -- I'll test that out.

Joe

--
Joseph E Conway
President/CEO
credativ LLC
www.credativ.us
616 Burnham Street
El Cajon, CA 92019
Office: +1 619 270 8787
Mobile: +1 619 843 8340
On Fri, May 6, 2011 at 4:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached fix-clogredo diff is my proposal for a fix for this.
>
> That seems pretty grotty :-(
>
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will
> not result in repeated XLOG entries. I seem to recall having
> considered exactly that point when the clog WAL support was first
> done, but the scenario evidently wasn't considered when
> subtransactions were stuck in :-(.

I agree with Tom about the need for a fix that prevents generation of
repeated WAL records.

OTOH, I also like Joe's fix in the recovery code to avoid responding
to repeated records.

Can we have both please?

--
Simon Riggs           http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> I agree with Tom about the need for a fix that prevents generation of
> repeated WAL records.

> OTOH, I also like Joe's fix in the recovery code to avoid responding
> to repeated records.

> Can we have both please?

Why? The patch in the recovery code is seriously ugly, and it won't do
anything useful once we've fixed the other end. Please notice also
that we'd need several instances of that kluge if we want to cover all
the SLRU-based cases.

			regards, tom lane