Re: disaster recovery - Mailing list pgsql-general
From | Marco Colombo |
---|---|
Subject | Re: disaster recovery |
Date | |
Msg-id | Pine.LNX.4.44.0311281505350.25502-100000@Megathlon.ESI |
In response to | Re: disaster recovery ("Craig O'Shannessy" <craig@ucw.com.au>) |
List | pgsql-general |
On Sat, 29 Nov 2003, Craig O'Shannessy wrote:

> On Fri, 28 Nov 2003, Marco Colombo wrote:
>
> > On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
> >
> > > > From my point of view, it's just support for my demands to have each
> > > > mission-critical server supported by a UPS, if not redundant power
> > > > supplies and two UPSes.
> > >
> > > Never had a kernel panic? I've had a few. Probably flakey hardware. I
> > > feel safer since journalling file systems hit linux.
> >
> > On any hardware flakey enough to cause panics, no FS code will save
> > you. The FS may "reliably" write total rubbish to disk. It may have been
> > doing that for hours, thrashing the whole FS structure, before something
> > triggered the panic.
> > You are no safer with journal than you are with a plain FAT (or any
> > other FS technology). Journal files get corrupted themselves.
>
> This isn't always true. For example, my most recent panic was due to a
> ide cdrom driver on a fairly expensive Intel dual xeon box, running 2.4.18
> I mounted the cdrom and boom, panic. If I'd been running ext2, I would
> have had a very lengthy reboot and lots of pissed off users, but as it's
> ext3, the system was back up in a couple of minutes, and I just removed
> the cdrom drive from fstab (I've got other cdrom drives :)

Sure, I didn't mean it to be _always_ true, just true in general.
And you've been lucky. You don't actually know what happened... a
runaway pointer that tried to write to some protected location in
kernel space? How can you be 100% sure it _did not_ write to some
write-enabled pages first, like, say, the in-core copy of the inode of
some very important file of yours? Or the cached copy of some directory,
orphaning a number of critical files? If ext3 wrote that to disk, the
journal won't help you much (unless, maybe, the filesystem was mounted
with data=journal). And what if that runaway pointer wrote some garbage
(with Murphy's laws in action) to _the in-core copy of the journal_
itself?
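
To make the point concrete, here is a toy sketch in Python. It is only a model under stated assumptions, not ext3's real on-disk format or recovery algorithm: ToyDisk, commit, replay_after_crash, full_check and the "dir:/" and "inode:N" keys are all hypothetical names made up for illustration. It shows that journal replay faithfully restores whatever was committed, including a directory that was already damaged in memory before the commit.

```python
# Toy model of a metadata journal. NOT ext3's actual implementation;
# every class, function and key name here is hypothetical.

class ToyDisk:
    """Persistent state that survives a crash."""
    def __init__(self):
        self.blocks = {}    # block name -> contents at the final location
        self.journal = []   # committed transactions not yet checkpointed


def commit(disk, updates):
    """Log a whole transaction in the journal before touching the blocks."""
    disk.journal.append(dict(updates))


def replay_after_crash(disk):
    """Recovery: re-apply every committed transaction, then clear the journal."""
    for txn in disk.journal:
        disk.blocks.update(txn)
    disk.journal.clear()


def full_check(disk):
    """Crude stand-in for a full fsck: each directory entry must point
    at an inode block that actually exists. Returns the dangling names."""
    root = disk.blocks.get("dir:/", {})
    return [name for name, ino in root.items()
            if f"inode:{ino}" not in disk.blocks]


def demo():
    disk = ToyDisk()

    # Healthy transaction, then a crash before the blocks were updated.
    commit(disk, {"inode:1": "data-a", "dir:/": {"a": 1}})
    replay_after_crash(disk)
    clean = full_check(disk)

    # Now a stray pointer scribbles on the cached directory *before* commit.
    cached_dir = {"a": 1, "b": 2}
    cached_dir["b"] = 999   # garbage: no such inode will ever be written
    commit(disk, {"inode:2": "data-b", "dir:/": cached_dir})
    replay_after_crash(disk)   # replay is quick and reports nothing wrong
    corrupted = full_check(disk)

    return clean, corrupted


if __name__ == "__main__":
    print(demo())
```

In this model the replay step is equally quick in both runs, which is the reboot-time benefit; only the full check notices that entry "b" points nowhere.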

And reboot time is another (lengthy) matter: some would advise doing a
full fsck after a crash even with ext3 (Red Hat systems do ask you for
that right after boot), so let's say ext3 gives you the option to boot
fast, if you're not _that_ paranoid about your data. But all this is
about being paranoid about our data, isn't it? B-)

> I can't remember what the problem was, but it was known and unusual, I
> think it might have been the drive firmware from memory.
>
> Of course cosmic rays etc can and do flip bits in memory, so any non-ecc
> system can panic if the wrong bit flips. Incredibly rare, but again, I'm
> glad I'm running a journalling file system, if just for the reboot time.

No need for cosmic rays. A faulty fan, either on the CPU, or in the
case, or (many MBs have one nowadays) on the chipset, will do. Do you
ever upgrade your RAM? I've seen faulty DIMMs. And what exactly happens
when something in your system overheats (CPU, RAM, MB, disks)? Does your
MB go into "protection" mode (i.e. it freezes, without giving any
message to the OS)?

Bit flipping is not "incredibly rare", believe me. I've seen all of
these failures. Usually the system just crashes, and you get it back up
pretty fast. Random corruption, however, is rare but possible.

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it