Re: Disaster! - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Disaster!
Msg-id 4221.1074892864@sss.pgh.pa.us
In response to Re: Disaster!  (Martín Marqués<martin@bugs.unl.edu.ar>)
Responses Re: Disaster!  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
List pgsql-hackers
Martín Marqués <martin@bugs.unl.edu.ar> writes:
> Tom, could you give a small insight on what occurred here, why those
> 8k of zeros fixed it, and what is a "WAL replay"?

I think what happened is that there was insufficient space to write out
a new page of the clog (transaction commit) file.  This would result in
a database panic, which is fine --- you're not gonna get much done
anyway if you are down to zero free disk space.  However, after Chris
freed up space, the system needed to replay the WAL from the last
checkpoint to ensure consistency.  The WAL entries evidently included
references to transactions whose commit bits were in the unwritten page.
Now there would also be WAL entries recording those commits, so once the
replay was complete everything would be cool.  But the clog access code
evidently got confused by being asked to read a page that didn't exist
in the file.  I'm not sure yet how that sequence of events occurred,
which is why I asked Chris for a stack trace.
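To illustrate the failure mode described above, here is a minimal Python sketch. It is not PostgreSQL's code: the 8192-byte page size is taken from the "8k of zeros" in the question, and the file name "0000" and the helpers read_page and append_zero_page are hypothetical. It shows only that a request for a page beyond the end of a file yields a short read, and that appending one page of zeroes removes the error.

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumption: 8 kB pages, matching the "8k of zeros" in the thread


def read_page(path, pageno):
    """Return page `pageno` of the file; raise OSError if it is not fully present."""
    with open(path, "rb") as f:
        f.seek(pageno * PAGE_SIZE)
        data = f.read(PAGE_SIZE)
    if len(data) != PAGE_SIZE:
        # The page was never written, e.g. because the disk was full.
        raise OSError(f"page {pageno}: short read ({len(data)} bytes)")
    return data


def append_zero_page(path):
    """Extend the file by one page of zero bytes."""
    with open(path, "ab") as f:
        f.write(bytes(PAGE_SIZE))


def demo():
    """Return (read failed before the fix, page is all zeroes after the fix)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "0000")  # hypothetical segment file name
        with open(path, "wb") as f:
            f.write(bytes(PAGE_SIZE))  # page 0 exists, page 1 does not
        try:
            read_page(path, 1)
            failed_before = False
        except OSError:
            failed_before = True
        append_zero_page(path)
        page = read_page(path, 1)
        return failed_before, page == bytes(PAGE_SIZE)
```

Calling demo() exercises both halves: the read of the missing page fails, and the same read succeeds once the zero page has been appended.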

Adding a page of zeroes fixed it by eliminating the read error
condition.  It was okay to do so because zeroes is the correct initial
state for a clog page (all transactions in it "still in progress").
After WAL replay, any completed transactions would be updated in the page.
        regards, tom lane
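
Why an all-zeroes page reads as "still in progress" can be sketched as follows. This is a simplified model, not the actual clog code. It assumes two status bits per transaction packed four to a byte, with 0 meaning in progress; the COMMITTED and ABORTED values and the helper names get_status and set_status are illustrative assumptions and may not match any particular PostgreSQL version.

```python
PAGE_SIZE = 8192                             # assumption: 8 kB page
BITS_PER_XACT = 2                            # assumption: two status bits per transaction
XACTS_PER_BYTE = 8 // BITS_PER_XACT          # four transactions per byte
XACTS_PER_PAGE = PAGE_SIZE * XACTS_PER_BYTE  # 32768 transactions per page

# Status codes: zero as "in progress" is stated in the message above;
# the other two values are assumptions made for this sketch.
IN_PROGRESS = 0b00
COMMITTED = 0b01
ABORTED = 0b10


def get_status(page, xid):
    """Read the two status bits for transaction `xid` from a page buffer."""
    entry = xid % XACTS_PER_PAGE
    shift = (entry % XACTS_PER_BYTE) * BITS_PER_XACT
    return (page[entry // XACTS_PER_BYTE] >> shift) & 0b11


def set_status(page, xid, status):
    """Overwrite the two status bits for transaction `xid` in a bytearray page."""
    entry = xid % XACTS_PER_PAGE
    idx = entry // XACTS_PER_BYTE
    shift = (entry % XACTS_PER_BYTE) * BITS_PER_XACT
    cleared = page[idx] & ~(0b11 << shift) & 0xFF
    page[idx] = cleared | (status << shift)


# A freshly zeroed page reports every transaction as in progress.
page = bytearray(PAGE_SIZE)
fresh_ok = all(get_status(page, x) == IN_PROGRESS for x in range(XACTS_PER_PAGE))

# Replaying a commit record would then flip just that transaction's bits.
set_status(page, 5, COMMITTED)
```

In this model, set_status stands in for what replaying a commit record does to the page: only the bits of transaction 5 change, and its neighbours remain in progress.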

