Re: Disaster! - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Disaster!
Msg-id: 871.1075093237@sss.pgh.pa.us
In response to: Disaster! (Christopher Kings-Lynne <chriskl@familyhealth.com.au>)
List: pgsql-hackers
Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I have now achieved a full understanding of what happened. The true bug is indeed somewhere else than slru.c, and we would not have found it if slru.c had had less-paranoid error checking.

The WAL log shows that checkpoints were happening every five minutes up to 2004-01-23 10:13:10, but no checkpoint completion record appears after that. However, the system remained up, with plenty of activity, until 10:45:24, when it was finally taken down by a panic. The last transaction commit records in the WAL log are

    commit: 14286807 at 2004-01-23 10:45:23
    commit: 14286811 at 2004-01-23 10:45:24
    commit: 14286814 at 2004-01-23 10:45:24
    commit: 14286824 at 2004-01-23 10:45:24
    commit: 14286825 at 2004-01-23 10:45:24
    commit: 14286836 at 2004-01-23 10:45:24
    commit: 14286838 at 2004-01-23 10:45:24
    commit: 14286850 at 2004-01-23 10:45:24
    commit: 14286851 at 2004-01-23 10:45:24

Over in the postmaster log, the first sign of trouble is

    Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG: could not close temporary statistics file "/usr/local/pgsql/data/global/pgstat.tmp.20035": No space left on device

and there is a steady stream of transactions failing with out-of-space errors over the next half hour, but none of the failures are worse than a transaction abort. Finally we see

    Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
    Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
    Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
    Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
    Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
    Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG: server process (PID 57237) was terminated by signal 6
    Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG: terminating any other active server processes

after which the postmaster's recovery attempts fail, as Chris already detailed. (Note: the reference to "transaction 0" is not significant; that just happens because SimpleLruWritePage doesn't have a specific transaction number to blame its write failures on.)

Those are the observed facts; what's the interpretation? I think it shows that Postgres is pretty darn robust, actually. We were able to stay up and do useful work for quite a long time with zero free space; what's more, we lost no transactions that were successfully committed. The data was successfully stored in preallocated WAL space. (If things had gone on this way for awhile longer, we would have panicked for lack of WAL space, but Chris was actually not anywhere near there; he'd only filled about two WAL segments in the half hour of operations.) Note also that checkpoints were attempted several times during that interval, and they all failed gracefully --- no panic, no incorrect WAL update.

But why did this panic finally happen? The key observation is that the first nonexistent page of pg_clog was the page beginning with transaction 14286848. Neither that xact nor the following one has any commit or abort record in WAL, but we do see entries for 14286850 and 14286851.
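For concreteness, here is a small standalone sketch (not from the original post) of the page arithmetic behind that observation, assuming the standard 8 kB clog page, 2 status bits per transaction, and 32 pages per clog segment file; the constants are re-declared locally for illustration rather than taken from the PostgreSQL headers:

    /* clog_arith.c -- sketch of how an xid maps onto pg_clog pages/segments */
    #include <stdio.h>

    #define BLCKSZ                  8192                            /* assumed page size */
    #define CLOG_BITS_PER_XACT      2
    #define CLOG_XACTS_PER_BYTE     (8 / CLOG_BITS_PER_XACT)        /* 4 */
    #define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32768 */
    #define SLRU_PAGES_PER_SEGMENT  32

    int
    main(void)
    {
        unsigned int xid = 14286848;
        unsigned int page = xid / CLOG_XACTS_PER_PAGE;               /* 436 */
        unsigned int seg = page / SLRU_PAGES_PER_SEGMENT;            /* 13 -> "000D" */
        unsigned int off = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ; /* 163840 */
        unsigned int failed_page = 147456 / BLCKSZ;                  /* 18 */

        printf("xid %u is slot %u of clog page %u (segment %04X, offset %u)\n",
               xid, xid % CLOG_XACTS_PER_PAGE, page, seg, off);
        printf("the failed write at offset 147456 hit page %u of that segment\n",
               failed_page);
        return 0;
    }

Under those assumptions, xid 14286848 is slot 0 of its page, i.e. the first transaction of a brand-new clog page, and the failed writes at offset 147456 in pg_clog/000D land two pages earlier in the same segment file.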
It is also notable that there is no WAL entry for extension of pg_clog to include this page --- normally a WAL entry is made each time a page of zeroes is added to pg_clog.

My interpretation of the sequence of events is: Transaction 14286848 started, and since it was the first for its pg_clog page, it tried to do ZeroCLOGPage() for that page (see ExtendCLOG). This required making room in the in-memory clog buffers, which required dumping one of the previously-buffered clog pages, which failed for lack of disk space, leading to this log entry:

    Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
    Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
    Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state

(Note: page offset 147456 is the page two before the one containing xid 14286848. This page had been allocated in clog buffers but never yet successfully written to disk. Ditto for the page in between.)

The next thing that happened was that transaction xids 14286849 and 14286850 were assigned (ie, those xacts started), and then 14286850 tried to commit. This again led to a failed attempt to write out a clog page, but this time the error was promoted to a panic because it happened inside the transaction commit critical section:

    Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
    Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device

The final commit record in WAL, from xid 14286851, must have come from a different backend that was able to get that far in its commit sequence before hearing the all-hands-abandon-ship signal from the postmaster. (AFAICT it was just chance that the same backend process was responsible for both 14286848 and 14286850. Presumably 14286849 was taken out by yet another backend that hadn't gotten as far as trying to commit.)

After Chris had freed some disk space, WAL replay was able to create the clog page at offset 147456, because there was a clog-extension WAL entry for it within the WAL entries since the last successful checkpoint. It was also able to correctly fill that page using the transaction commit data in WAL. Likewise for the page after that. But when it got to the commit record for 14286850, the error checks in slru.c barfed because there was no such page, thus exposing the real problem: there wasn't a clog-extension WAL record for that page.

In short, the bug is in GetNewTransactionId(), which shorn of extraneous details looks like

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    TransactionIdAdvance(ShmemVariableCache->nextXid);

    ExtendCLOG(xid);

    LWLockRelease(XidGenLock);

and the correct fix is to swap the order of the TransactionIdAdvance and ExtendCLOG lines. Because these lines are out of order, a failure occurring down inside ExtendCLOG leaves the shared-memory copy of nextXid already advanced, and so subsequent transactions coming through this bit of code will see that they are not the first transaction in their page and conclude that they need not do any work to extend clog. With the operations in the correct order, an ExtendCLOG failure will leave the counter unchanged, so every subsequent transaction will try to do ExtendCLOG and will fail until some disk space becomes available.
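To make that reasoning concrete, here is a minimal compilable sketch of the corrected ordering (again, not from the original post). ExtendCLOG, TransactionIdAdvance, ShmemVariableCache and XidGenLock are the real identifiers discussed above, but everything is stubbed out below so the sketch stands alone --- treat it as an illustration of the control flow, not the actual patch:

    typedef unsigned int TransactionId;

    /* throwaway stand-ins for the shared-memory state and locking */
    static struct { TransactionId nextXid; } VariableCacheStub = { 14286848 };
    #define ShmemVariableCache  (&VariableCacheStub)
    #define XidGenLock          0
    #define LW_EXCLUSIVE        0
    static void LWLockAcquire(int lockid, int mode) { (void) lockid; (void) mode; }
    static void LWLockRelease(int lockid) { (void) lockid; }

    /* in the real code this can ERROR out on ENOSPC; the stub cannot */
    static void ExtendCLOG(TransactionId xid) { (void) xid; }

    /* the real macro also handles xid wraparound; this one just increments */
    #define TransactionIdAdvance(xid)   ((xid)++)

    static TransactionId
    GetNewTransactionId_sketch(void)
    {
        TransactionId xid;

        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

        xid = ShmemVariableCache->nextXid;

        /*
         * Extend pg_clog (which also emits the clog-extension WAL record)
         * BEFORE advancing the shared counter.  If ExtendCLOG throws here,
         * nextXid is unchanged, so the next transaction through this code
         * retries the extension instead of concluding that its clog page
         * already exists.
         */
        ExtendCLOG(xid);
        TransactionIdAdvance(ShmemVariableCache->nextXid);

        LWLockRelease(XidGenLock);

        return xid;
    }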
(Note that this code is *not* in a critical section, indeed it's not yet inside a transaction at all, and so failure here does not mean a panic.)

If you like you can think of the real problem with this code as being that it violates the basic WAL rule: "make a WAL entry before making permanent changes". nextXid isn't on disk, but it's persistent shared-memory state, and it mustn't be updated until any WAL entries associated with that action have been made.

Any questions?

            regards, tom lane