Re: quick review - Mailing list pgsql-hackers
| From | Christopher Browne |
|---|---|
| Subject | Re: quick review |
| Date | |
| Msg-id | 87bqlsk738.fsf@wolfe.cbbrowne.com |
| In response to | quick review ("Molle Bestefich" <molle.bestefich@gmail.com>) |
| List | pgsql-hackers |
A long time ago, in a galaxy far, far away, qnex42@gmail.com ("Dawid Kuroczko") wrote:
> On 12/24/06, tomas@tuxteam.de <tomas@tuxteam.de> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On Mon, Dec 18, 2006 at 03:47:42AM +0100, Molle Bestefich wrote:
>>
>> [...]
>>
>> > Simply put, a tool with just a single button named "recover
>> > all the data that you can" is by far the best solution in so
>> > many cases. Minimal fuss, minimal downtime, minimal money
>> > spent on recovery. And perhaps there's even a good chance that
>> > any missing data could be entered back into the system manually.
>>
>> I think the point which has been made here was that the recovery tool
>> *is already there*: i.e. all that can be done as a "one-click" recovery
>> is done by the system at start-up. Beyond this no cookbook exists (and
>> thus no way to put it under a one-click procedure).
>>
>> So this one-click thing would be mainly something to cater for the
>> "needs" of marketing.
>
> Well, start-up recovery is great and reliable. The only problem is
> that it won't help with some obscure hardware problem; if you hit one
> of those, you really have a problem. If you want to sleep well, you
> should know what to do when disaster happens.
>
> I really like the approach of the XFS filesystem, which ships with
> fsck.xfs, which is essentially equivalent to /bin/true. They write in
> their white paper that they did so because journaling should recover
> from all failures. Yet they also wrote that some time later they
> learned that hardware corruption is not as unlikely as one might
> assume, so they provide the xfs_check and xfs_repair utilities.
>
> I think there should be a documented way to recover from obscure
> hardware failure, with even more detailed information on how this can
> result only from using crappy hardware... And I don't think this
> should be a "one click" process -- some people might miss real
> (software) corruption, and that is the biggest drawback. Perhaps the
> disaster recoverer should leave a detailed log which would be enough
> to detect software corruption even after the recovery [and users
> should be advised to send it in].

The trouble is that it is often *impossible* to recover from the
"obscure hardware failure." If the failure is that a bunch of vital
bits have been lost or scribbled on, there may be NO way to recover
from this. And in practice, this in fact seems to be a common form for
"obscure hardware failure" to take: those problems are, in fact,
irretrievable.

There historically have been two main sorts of corruptions:

1. Hardware corruptions, where the only recovery is to have some sort
   of replica of the data, whether via near-hardware mechanisms
   (e.g. RAID) or more 'logical' mechanisms (e.g. replication systems).

2. Software corruptions, where the answer is not to provide some
   "recovery mechanism," but rather to FIX THE BUG that is leading to
   the problem. Once the bug is fixed, there is no more corruption (of
   this sort).

Neither of these is amenable to there being some mechanism such as you
describe. There are really only two possibilities:

a) The problem is one that the WAL recovery system can cope with, or

b) There has been True Data Loss, and there is NO recovery system
   short of recovering from a backup/replica.

-- 
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/slony.html
"Here I am, brain the size of a planet, and they ask me to take you
down to the bridge. Call that job satisfaction? 'Cos I don't."
-- Marvin the Paranoid Android
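For reference, the "recover all the data that you can" pass debated above usually boils down to dumping each table separately and skipping whatever errors out. Below is a minimal sketch of that idea in Python, assuming the server still starts and psycopg2 is available; the connection string and output directory are placeholders, not anything taken from this thread.

```python
# Hypothetical "dump what you can" pass: COPY each user table out on its
# own, logging any table whose pages can no longer be read.  DSN and
# OUTDIR below are illustrative placeholders.
import os
import psycopg2

DSN = "dbname=broken_db"   # assumed connection string
OUTDIR = "/tmp/salvage"    # assumed output directory
os.makedirs(OUTDIR, exist_ok=True)

conn = psycopg2.connect(DSN)
conn.autocommit = True  # a failed COPY must not poison later attempts

with conn.cursor() as cur:
    cur.execute("""
        SELECT schemaname, tablename
        FROM pg_tables
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
    """)
    tables = cur.fetchall()

for schema, table in tables:
    path = os.path.join(OUTDIR, f"{schema}.{table}.copy")
    try:
        with conn.cursor() as cur, open(path, "w") as out:
            # COPY aborts at the first unreadable page, so a damaged
            # table comes out partial (or not at all) and is logged.
            cur.copy_expert(f'COPY "{schema}"."{table}" TO STDOUT', out)
        print(f"salvaged {schema}.{table}")
    except psycopg2.Error as err:
        print(f"SKIPPED {schema}.{table}: {err}")

conn.close()
```

Whatever such a loop skips is precisely the data that, per the argument above, only a backup or replica can bring back.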