Re: Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

From:           Craig Ringer
Subject:        Re: Hard limit on WAL space used (because PANIC sucks)
Msg-id:         51B50F98.5000808@2ndquadrant.com
In response to: Re: Hard limit on WAL space used (because PANIC sucks) ("MauMau" <maumau307@gmail.com>)
Responses:      Re: Hard limit on WAL space used (because PANIC sucks)
                Re: Hard limit on WAL space used (because PANIC sucks)
List:           pgsql-hackers
On 06/10/2013 06:39 AM, MauMau wrote:
> The problem is that the reliability of the database system decreases
> with more disks, because failure of any one of those disks would result
> in a database PANIC shutdown

More specifically, with more independent sets of disks / file systems.

>> I'd rather like to be able to recover from this by treating the
>> tablespace as dead, so any attempt to get a lock on any table within it
>> fails with an error and already-in-WAL writes to it just get discarded.
>> It's the sort of thing that'd only be reasonable to do as a recovery
>> option (like zero_damaged_pages) since if applied by default it'd lead
>> to potentially severe and unexpected data loss.
>
> I'm in favor of taking a tablespace offline when I/O failure is
> encountered, and continue running the database server. But WAL must not
> be discarded because committed transactions must be preserved for
> durability of ACID.

[snip]

> WAL is not affected by the offlining of tablespaces. WAL records
> already written on the WAL buffer will be written to pg_xlog/ and
> archived as usual. Those WAL records will be used to recover committed
> transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so
the below is just my understanding):

The problem is that WAL for all tablespaces is mixed together in the
archives. If you lose your tablespace, then you have to keep *all* WAL
around and replay *all* of it again when the tablespace comes back
online. This would be very inefficient and would require a lot of tricks
to cope with applying WAL to a database whose on-disk state is in the
future as far as the archives are concerned. It's not as simple as just
replaying all WAL all over again - as I understand it, things like
CLUSTER or TRUNCATE will result in relfilenodes not being where they're
expected to be as far as old WAL archives are concerned. Selective
replay would be required, and that leaves the door open to all sorts of
new and exciting bugs in areas that'd hardly ever get tested.

To solve the massive disk space explosion problem I imagine we'd have to
have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
and loss of the rather nice property that WAL writes are nice sequential
writes. It'd be complicated and probably cause nightmares during
recovery, for archive-based replication, etc.

The only other thing I can think of is: when a tablespace is offline,
write WAL records to a separate "tablespace recovery log" as they're
encountered. Replay this log when the tablespace is restored, before
applying any other new WAL to the tablespace. (A very rough sketch of
what I mean is in the PS below.) This wouldn't affect archive-based
recovery since it'd already have the records from the original WAL.

None of these options seem exactly simple or pretty, especially given
the additional complexities that'd be involved in allowing WAL records
to be applied out-of-order, something that AFAIK _never_ happens at the
moment.

The key problem, of course, is that this all sounds like a lot of
complicated work for a case that's not really supposed to happen. Right
now, the answer is "your database is unrecoverable, switch to your
streaming warm standby and re-seed it from the standby". Not pretty, but
at least there's the option of using a sync standby and avoiding data
loss.

How would you approach this?

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
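PS: To make the "tablespace recovery log" idea a bit more concrete,
here's a very rough C sketch. Nothing below exists in the backend today:
TablespaceOfRecord, TsLogAppend and friends are all names I've invented
for illustration, and the sketch ignores complications like records that
touch more than one tablespace.

/*
 * Sketch only: every type and function name here is invented to
 * illustrate the idea; none of this is a real PostgreSQL API.
 */
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;          /* stand-in for the real Oid */

typedef struct WalRecord WalRecord;         /* a decoded WAL record */
typedef struct TsRecoveryLog TsRecoveryLog; /* per-tablespace side log */

/* Hypothetical helpers this sketch assumes exist. */
extern Oid  TablespaceOfRecord(const WalRecord *rec);
extern bool TablespaceIsOffline(Oid spcOid);
extern void ApplyRecord(WalRecord *rec);            /* normal redo */
extern TsRecoveryLog *GetTsRecoveryLog(Oid spcOid);
extern void TsLogAppend(TsRecoveryLog *log, const WalRecord *rec);
extern WalRecord *TsLogNext(TsRecoveryLog *log);    /* NULL at end */
extern void TsLogDiscard(TsRecoveryLog *log);

/*
 * Redo-loop hook: records touching an offline tablespace are diverted
 * to that tablespace's recovery log instead of being applied.
 */
void
ApplyOrDivert(WalRecord *rec)
{
    Oid spcOid = TablespaceOfRecord(rec);

    if (TablespaceIsOffline(spcOid))
    {
        /* Keep the record, in original LSN order, for later replay. */
        TsLogAppend(GetTsRecoveryLog(spcOid), rec);
        return;
    }

    ApplyRecord(rec);
}

/*
 * Run when the tablespace is restored, before any backend is allowed
 * to lock relations in it: replay the diverted records in their
 * original LSN order, then throw the side log away.
 */
void
CatchUpOfflineTablespace(Oid spcOid)
{
    TsRecoveryLog *log = GetTsRecoveryLog(spcOid);
    WalRecord     *rec;

    while ((rec = TsLogNext(log)) != NULL)
        ApplyRecord(rec);

    TsLogDiscard(log);
}

Within the tablespace the records still apply in LSN order; the
out-of-order part is only relative to the rest of the cluster, which is
exactly where the new-and-exciting-bugs risk above comes in.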