
From: Tom Lane
Subject: Re: Disaster!
Msg-id: 871.1075093237@sss.pgh.pa.us
In response to: Disaster!  (Christopher Kings-Lynne <chriskl@familyhealth.com.au>)
List: pgsql-hackers
Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I have now achieved a full understanding of what happened.  The true
bug is indeed somewhere else than slru.c, and we would not have found
it if slru.c had had less-paranoid error checking.

The WAL log shows that checkpoints were happening every five minutes
up to 2004-01-23 10:13:10, but no checkpoint completion record appears
after that.  However, the system remained up, with plenty of activity,
until 10:45:24, when it was finally taken down by a panic.  The last
transaction commit records in the WAL log are

commit: 14286807 at 2004-01-23 10:45:23 
commit: 14286811 at 2004-01-23 10:45:24 
commit: 14286814 at 2004-01-23 10:45:24 
commit: 14286824 at 2004-01-23 10:45:24 
commit: 14286825 at 2004-01-23 10:45:24 
commit: 14286836 at 2004-01-23 10:45:24 
commit: 14286838 at 2004-01-23 10:45:24 
commit: 14286850 at 2004-01-23 10:45:24 
commit: 14286851 at 2004-01-23 10:45:24 

Over in the postmaster log, the first sign of trouble is

Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG:  could not close temporary statistics file "/usr/local/pgsql/data/global/pgstat.tmp.20035": No space left on device

and there is a steady stream of transactions failing with out-of-space
errors over the next half hour, but none of the failures are worse than
a transaction abort.  Finally we see

Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR:  could not access status of transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL:  could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING:  AbortTransaction and not in in-progress state
Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC:  could not access status of transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL:  could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG:  server process (PID 57237) was terminated by signal 6
Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG:  terminating any other active server processes

after which the postmaster's recovery attempts fail, as Chris already
detailed.  (Note: the reference to "transaction 0" is not significant;
that just happens because SimpleLruWritePage doesn't have a specific
transaction number to blame its write failures on.)

Those are the observed facts; what's the interpretation?  I think it
shows that Postgres is pretty darn robust, actually.  We were able to
stay up and do useful work for quite a long time with zero free space;
what's more, we lost no transactions that were successfully committed.
The data was successfully stored in preallocated WAL space.  (If things
had gone on this way for a while longer, we would have panicked for lack
of WAL space, but Chris was actually not anywhere near there; he'd only
filled about two WAL segments in the half hour of operations.)  Note
also that checkpoints were attempted several times during that interval,
and they all failed gracefully --- no panic, no incorrect WAL update.

But why did this panic finally happen?  The key observation is that
the first nonexistent page of pg_clog was the page beginning with
transaction 14286848.  Neither this xact nor the following one has any
commit or abort record in WAL, but we do see entries for 14286850 and
14286851.  It is also notable that there is no WAL entry for extension
of pg_clog to include this page --- normally a WAL entry is made each
time a page of zeroes is added to pg_clog.  My interpretation of the
sequence of events is:

Transaction 14286848 started, and since it was the first for its pg_clog
page, it tried to do ZeroCLOGPage() for that page (see ExtendCLOG).  This
required making room in the in-memory clog buffers, which required
dumping one of the previously-buffered clog pages, which failed for lack
of disk space, leading to this log entry:

Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR:  could not access status of transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL:  could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING:  AbortTransaction and not in in-progress state

(Note: page offset 147456 is two pages before the one containing xid
14286848.  This page had been allocated in clog buffers but never yet
successfully written to disk.  Ditto for the page in between.)  The next
thing that happened was that transaction xids 14286849 and 14286850 were
assigned (ie, those xacts started), and then 14286850 tried to commit.
This again led to a failed attempt to write out a clog page, but this
time the error was promoted to a panic because it happened inside the
transaction commit critical section:

Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC:  could not access status of transaction 0
Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL:  could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
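
(As an aside, the promotion itself is nothing clog-specific: the error
machinery escalates any ERROR raised while a critical section is open,
because shared state can no longer be trusted at that point.  Sketching
the check from memory rather than quoting it, it amounts to

   if (elevel == ERROR && CritSectionCount > 0)
       elevel = PANIC;      /* cannot recover safely; force a restart */

and the clog write during commit happens inside just such a critical
section.)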
 

The final commit record in WAL, from xid 14286851, must have come from a
different backend that was able to get that far in its commit sequence
before hearing the all-hands-abandon-ship signal from the postmaster.
(AFAICT it was just chance that the same backend process was responsible
for both 14286848 and 14286850.  Presumably 14286849 was taken out by
yet another backend that hadn't gotten as far as trying to commit.)

After Chris had freed some disk space, WAL replay was able to create the
clog page at offset 147456, because there was a clog-extension WAL entry
for it within the WAL entries since the last successful checkpoint.  It
was also able to correctly fill that page using the transaction commit
data in WAL.  Likewise for the page after that.  But when it got to the
commit record for 14286850, the error checks in slru.c barfed because
there was no such page, thus exposing the real problem: there wasn't a
clog extension WAL record for that page.
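
(For the record, the replay side of clog extension is simple; sketching
the 7.4-era redo logic from memory, not quoting it, it does about this
when it sees a clog-extension record:

   if (info == CLOG_ZEROPAGE)
   {
       int     pageno;

       memcpy(&pageno, XLogRecGetData(record), sizeof(int));
       ZeroCLOGPage(pageno, false);    /* recreate the page; write no new WAL */
   }

So a commit record whose clog page was never logged as extended has no
page to land on --- which is exactly the condition slru.c complained
about.)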

In short, the bug is in GetNewTransactionId(), which shorn of extraneous
details looks like
   LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
   xid = ShmemVariableCache->nextXid;
   TransactionIdAdvance(ShmemVariableCache->nextXid);
   ExtendCLOG(xid);
   LWLockRelease(XidGenLock);

and the correct fix is to swap the order of the TransactionIdAdvance and
ExtendCLOG lines.  Because these lines are out of order, a failure
occurring down inside ExtendCLOG leaves the shared-memory copy of
nextXid already advanced, and so subsequent transactions coming through
this bit of code will see that they are not the first transaction in
their page and conclude that they need not do any work to extend clog.
With the operations in the correct order, ExtendCLOG failure will leave
the counter unchanged, so every subsequent transaction will try to do
ExtendCLOG and will fail until some disk space becomes available.  (Note
that this code is *not* in a critical section, indeed it's not yet
inside a transaction at all, and so failure here does not mean a panic.)
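
With the two lines swapped, the same excerpt reads

   LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
   xid = ShmemVariableCache->nextXid;
   ExtendCLOG(xid);          /* may elog(ERROR) if the disk is full ... */
   TransactionIdAdvance(ShmemVariableCache->nextXid);   /* ... leaving this undone */
   LWLockRelease(XidGenLock);

so a failure inside ExtendCLOG leaves nextXid still pointing at the
not-yet-extended page, and the next transaction through here simply
retries the extension.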

If you like you can think of the real problem with this code as being
that it violates the basic WAL rule: "make a WAL entry before making
permanent changes".  nextXid isn't on disk, but it's persistent
shared-memory state, and it mustn't be updated until any WAL entries
associated with that action have been made.
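
To make that concrete: the WAL entry in question is the clog-extension
record, and emitting it is ExtendCLOG's job.  From memory (a sketch of
the idea, not the exact code), the extension path amounts to

   /* only the first xid on each clog page has any work to do */
   if (TransactionIdToPgIndex(newestXact) != 0)
       return;

   pageno = TransactionIdToPage(newestXact);
   LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
   ZeroCLOGPage(pageno, true);     /* zero the page and XLOG the fact */
   LWLockRelease(CLogControlLock);

so advancing nextXid past the page boundary before this has succeeded is
exactly what let the later transactions skip the extension --- and skip
the WAL record along with it.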

Any questions?
        regards, tom lane

