Re: Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Hard limit on WAL space used (because PANIC sucks)
Date
Msg-id 51B50F98.5000808@2ndquadrant.com
In response to Re: Hard limit on WAL space used (because PANIC sucks)  ("MauMau" <maumau307@gmail.com>)
Responses Re: Hard limit on WAL space used (because PANIC sucks)
Re: Hard limit on WAL space used (because PANIC sucks)
List pgsql-hackers
On 06/10/2013 06:39 AM, MauMau wrote:

> The problem is that the reliability of the database system decreases
> with more disks, because failure of any one of those disks would result
> in a database PANIC shutdown

More specifically, with more independent sets of disks / file systems.

>> I'd rather like to be able to recover from this by treating the
>> tablespace as dead, so any attempt to get a lock on any table within it
>> fails with an error and already-in-WAL writes to it just get discarded.
>> It's the sort of thing that'd only be reasonable to do as a recovery
>> option (like zero_damaged_pages) since if applied by default it'd lead
>> to potentially severe and unexpected data loss.
> 
> I'm in favor of taking a tablespace offline when I/O failure is
> encountered, and continue running the database server.  But WAL must not
> be discarded because committed transactions must be preserved for
> durability of ACID.
[snip]
> WAL is not affected by the offlining of tablespaces.  WAL records
> already written on the WAL buffer will be written to pg_xlog/ and
> archived as usual. Those WAL records will be used to recover committed
> transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so
the below's just my understanding):

The problem is that WAL for all tablespaces is mixed together in the
archives. If you lose a tablespace you have to keep *all* WAL around and
replay *all* of it when the tablespace comes back online. That would be
very inefficient, and it would require a lot of tricks to cope with
applying WAL to a database whose on-disk state is in the future as far
as the archives are concerned. It's not as simple as just replaying all
the WAL again: as I understand it, things like CLUSTER or TRUNCATE mean
relfilenodes won't be where old WAL archives expect them to be.
Selective replay would be required, and that leaves the door open to all
sorts of new and exciting bugs in areas that'd hardly ever get tested.
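
To make the selective-replay problem concrete, here's a toy sketch (not
actual PostgreSQL code; the struct is a simplification of what's in
relfilenode.h) of the per-record decision such a mode would need. Every
heap/index WAL record identifies its target by a (tablespace, database,
relfilenode) triple, so in principle recovery could skip records for the
dead tablespace - but relfilenode reuse after CLUSTER/TRUNCATE is exactly
what makes a naive comparison like this unsafe against old archives:

    /*
     * Toy sketch only: a simplified relation identifier and a
     * hypothetical "should we replay this record?" test.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct RelFileNode
    {
        Oid spcNode;   /* tablespace OID */
        Oid dbNode;    /* database OID */
        Oid relNode;   /* relfilenode */
    } RelFileNode;

    static bool
    should_replay(const RelFileNode *rnode, Oid dead_tablespace)
    {
        /*
         * The catch: after CLUSTER/TRUNCATE the table has a new
         * relfilenode, so old archived WAL can reference files that no
         * longer exist on disk; this simple OID check is not enough to
         * make replay safe.
         */
        return rnode->spcNode != dead_tablespace;
    }

    int
    main(void)
    {
        RelFileNode rec = { .spcNode = 16385, .dbNode = 12345, .relNode = 24576 };
        Oid dead_tablespace = 16385;    /* the offline tablespace */

        printf("replay this record? %s\n",
               should_replay(&rec, dead_tablespace) ? "yes" : "skip");
        return 0;
    }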

To solve the massive disk space explosion problem I imagine we'd have to
have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
and lose the rather nice property that WAL writes are sequential. It'd
be complicated and probably cause nightmares during recovery, for
archive-based replication, etc.
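
For a sense of the fsync problem: with a single WAL a commit needs one
flush no matter how many tablespaces it touched; with per-tablespace WAL
it would need one flush per tablespace. A toy illustration (file names
and layout entirely made up, nothing to do with how pg_xlog really
works):

    /*
     * Toy illustration of the fsync multiplication: a commit touching N
     * tablespaces would need N synchronous flushes before it is durable,
     * instead of one sequential flush of a single WAL.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* hypothetical per-tablespace WAL segments touched by one commit */
        const char *logs[] = { "wal_ts1.seg", "wal_ts2.seg", "wal_ts3.seg" };
        int         n = sizeof(logs) / sizeof(logs[0]);

        for (int i = 0; i < n; i++)
        {
            /* create/append so the demo runs; real WAL is preallocated */
            int fd = open(logs[i], O_WRONLY | O_APPEND | O_CREAT, 0600);

            if (fd < 0)
                continue;
            (void) write(fd, "commit\n", 7);  /* this log's share of the record */
            fsync(fd);                        /* one flush *per tablespace* */
            close(fd);
        }
        printf("one commit, %d separate fsyncs (vs. 1 with a single WAL)\n", n);
        return 0;
    }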

The only other thing I can think of is: when a tablespace is offline,
write WAL records that touch it to a separate "tablespace recovery log"
as they're encountered. Replay this log when the tablespace is restored,
before applying any other new WAL to the tablespace. This wouldn't
affect archive-based recovery, since the archives would already have the
records from the original WAL.
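
Roughly what I have in mind, as a hand-wavy sketch (all names are
hypothetical, this is nothing like an actual patch): records for an
offline tablespace get diverted to a side log instead of being applied,
and that side log is replayed, in LSN order, before any newer WAL is
allowed to touch the tablespace again.

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct WalRecord
    {
        Oid  spcNode;   /* tablespace the record targets (simplified) */
        long lsn;       /* position in the main WAL stream */
    } WalRecord;

    static bool tablespace_offline(Oid spc) { return spc == 16385; }

    static void apply_record(const WalRecord *rec)
    {
        printf("applied record at lsn %ld\n", rec->lsn);
    }

    static void append_to_recovery_log(const WalRecord *rec)
    {
        /* in a real design this would be a durable per-tablespace file */
        printf("deferred record at lsn %ld to tablespace recovery log\n",
               rec->lsn);
    }

    /* Called for each record as it is encountered during normal redo. */
    static void
    handle_record(const WalRecord *rec)
    {
        if (tablespace_offline(rec->spcNode))
            append_to_recovery_log(rec);   /* replay later, in LSN order */
        else
            apply_record(rec);
    }

    int
    main(void)
    {
        WalRecord recs[] = { { 1663, 100 }, { 16385, 101 }, { 1663, 102 } };

        for (int i = 0; i < 3; i++)
            handle_record(&recs[i]);
        return 0;
    }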

None of these options seems exactly simple or pretty, especially given
the additional complexities that'd be involved in allowing WAL records
to be applied out-of-order, something that AFAIK _never_ happens at the
moment.

The key problem, of course, is that this all sounds like a lot of
complicated work for a case that's not really supposed to happen. Right
now, the answer is "your database is unrecoverable, fail over to your
streaming warm standby and re-seed the failed node from it". Not pretty,
but at least there's the option of using a sync standby and avoiding
data loss.

How would you approach this?

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


