Re: disaster recovery - Mailing list pgsql-general
From | Marco Colombo |
---|---|
Subject | Re: disaster recovery |
Date | |
Msg-id | Pine.LNX.4.44.0311281126250.25502-100000@Megathlon.ESI |
In response to | Re: disaster recovery (Alex Satrapa <alex@lintelsys.com.au>) |
Responses |
Re: disaster recovery
Re: disaster recovery |
List | pgsql-general |
On Fri, 28 Nov 2003, Alex Satrapa wrote:

> Doug McNaught wrote:
>
> > I took it as a garbled understanding of the "Linux does async metadata
> > updates" criticism. Which is true for ext2, but was never the
> > show-stopper some BSD-ers wanted it to be. :)
>
> I have on several occasions demonstrated how "bad" asynchronous writes
> are to a BSD-bigot by pulling the plug on a mail server (having a
> terminal on another machine showing the results of tail -f
> /var/log/mail.log), then showing that when the machine comes back up the
> most we've ever lost is one message

Sorry, I can't resist. Posting this on a PostgreSQL list is too funny. This is the last place they want to hear about a lost transaction. Even just one.

The problem is (was) with programs using _directory_ operations as synchronization primitives, and mail spoolers (say, MTAs) are typical in that. Beware that in the MTA world losing a single message is just as bad as losing a committed transaction for DB people. MTAs are expected to return "OK, I received and stored the message." only _after_ they committed it to disk in a safe manner (that's because the other side is allowed to delete its copy after seeing the "OK").

The only acceptable behaviour in case of failure (which is of course unacceptable in the DB world) for MTAs is to deliver _two_ copies of a message, but _never_ zero (message lost). That's what might happen if something crashes (or the connection is lost) _after_ the MTA committed the message to disk and _before_ the peer received notification of that. Later the peer will try and send the message again (the receiving MTA has enough knowledge to detect the duplication, but usually real world MTAs don't do that AFAIK).

My understanding of the problem is: UNIX fsync(), historically, used to sync also directory data (filename entries) before returning. MTAs used to call rename()/fsync() or link()/unlink()/fsync() sequences to "commit" a message to disk.
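The "two copies, but never zero" rule described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Receiver` class, the `send` function, the `ack_lost` flag and the message id `msg-1` are all hypothetical names made up for the illustration, and an in-memory dict stands in for the on-disk spool. It also includes the duplicate detection that, as noted above, real world MTAs usually skip.

```python
class Receiver:
    """Hypothetical receiving MTA: store first, acknowledge second."""

    def __init__(self):
        # message id -> body; stands in for the on-disk spool
        self.stored = {}

    def accept(self, msg_id, body):
        # Duplicate detection: a retransmitted message is acknowledged
        # again, but it is not stored a second time.
        if msg_id not in self.stored:
            self.stored[msg_id] = body  # the "commit to disk" step
        return "OK"


def send(receiver, msg_id, body, ack_lost=False):
    """Hypothetical sender: returns True only if it actually saw the "OK"."""
    reply = receiver.accept(msg_id, body)
    if ack_lost:
        # The message was stored, but the "OK" never reached the sender.
        return False
    return reply == "OK"


receiver = Receiver()
outbox = {"msg-1": "hello"}

# First attempt: the receiver stores the message, but the "OK" is lost,
# so the sender must keep its copy.
if send(receiver, "msg-1", outbox["msg-1"], ack_lost=True):
    del outbox["msg-1"]

# Retry: the sender sees the "OK" this time and may delete its copy.
if send(receiver, "msg-1", outbox["msg-1"]):
    del outbox["msg-1"]
```

The sender deletes its copy only after seeing the "OK", so a lost acknowledgement leads to a retransmission rather than a lost message. Without the `msg_id` check in `accept`, the retry would leave two copies in the spool, which is the acceptable failure mode.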
In Linux, fsync() is documented _not_ to sync directory data, "just" file data and metadata (inode). While the UNIX behaviour turned out to be very useful, personally I don't think Linux fsync() is broken/buggy. A file in UNIX is just that, data blocks and inode. Syncing directory data was just a (useful) side-effect of one implementation. In Linux, an explicit fsync() on the directory itself is needed (and on each path component if you changed one of them too), if you want to commit changes to disk. Doing that is just as safe as on any filesystem, even on ext2 with async writes enabled (it doesn't mean "ignore fsync()" after all!).

AFAIK, but I might be wrong as I know little of this, PostgreSQL does not rely on directory operations for commits or WAL writes. It operates on file _data_ and uses fsync(). That works fine with ext2 in async writes mode, too, no wonder. No need to mount sync or to use chattr +S.

BTW, there's no change in the fsync() itself, AFAIK. Some journalled FS (maybe _all_ of them) will update directory data with fsync() too, but that's an implementation detail. In my very personal opinion, any application relying on that is buggy. A directory and a file are different "objects" in UNIX, and if you need both synced to disk, you need to call fsync() two times.

Note that syncing a file on most journalled FS means syncing the journal, _all_ pending writes on that FS, even those not related to your file. How could the FS "partially" sync the journal, to sync just _your_ file data and metadata? That's why directory data gets synced, too. There's no magic in fsync().

> From the BSD-bigot's point of view, this is equivalent to the end of
> the world as we know it.

From anyone's point of view, losing track of a committed transaction (and an accepted message is just that) is the end of the world.
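The "call fsync() two times" point made earlier in this message can be sketched as follows. This is a minimal illustration, assuming a Linux system where a directory can be opened read-only and passed to fsync(); the helper names `fsync_dir` and `commit_message`, the `.tmp` suffix and the file name `msg-1` are hypothetical and not taken from any real MTA.

```python
import os
import tempfile


def fsync_dir(path):
    """fsync() a directory so that its filename entries reach the disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def commit_message(spool_dir, name, data):
    """Store data as spool_dir/name and return only once it is on disk.

    Hypothetical helper: one fsync() for the file, one for the directory.
    """
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)

    # 1. Write the file data and sync it (file data + inode).
    with open(tmp_path, "xb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the disk

    # 2. The directory operation that "commits" the message.
    os.rename(tmp_path, final_path)

    # 3. Sync the directory itself, since fsync() on the file
    #    is not documented to cover the directory entry.
    fsync_dir(spool_dir)
    return final_path


# Usage: only after commit_message() returns would an MTA answer "OK".
with tempfile.TemporaryDirectory() as spool:
    path = commit_message(spool, "msg-1", b"hello")
    with open(path, "rb") as f:
        stored = f.read()
    entries = os.listdir(spool)
```

If the rename moved the file between two directories, both directories would need the same treatment. None of this helps if the disk itself reports a write as done before it is on stable storage.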
> From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.

Of course. The OS can only be sure it delivered the data to the disk. If the disk lies about having actually stored it on the platters (as IDE disks do), there's still a window of vulnerability.

What I don't really get is how SCSI disks can not lie about writes and at the same time not show performance degradation on writes compared to their IDE cousins. How any disk mechanics can perform at the same speed as DRAM is beyond my understanding (even if those mechanics are 3 times as expensive as the IDE ones).

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it