Thread: disaster recovery
We are evaluating Postgres and would like some input about disaster recovery. I know in MsSQL they have a feature called transactional logs that would enable a database to be put back together based off those logs. Does Postgres do anything like this? I saw in the documentation transactional logging, but I don't know if it is the same. Where can I find info about disaster recovery in Postgres? Thank you in advance for any info given.

Jason Tesser
Web/Multimedia Programmer
Northland Ministries Inc.
(715) 324-6900 x3050
Jason Tesser wrote:
> We are evaluating Postgres and would like some input about disaster
> recovery.

I'm going to try to communicate what I understand, and other list members can correct me at their selected level of vehemence :) Please send corrections to the list - I may take days to post follow-ups.

> I know in MsSQL they have a feature called transactional logs that
> would enable a database to be put back together based off those logs.

A roughly parallel concept in PostgreSQL (what's the correct capitalisation and spelling?) is the "Write Ahead Log" (WAL). There is also a quite dissimilar concept called the query log - which is good to inspect for common queries to allow database tuning, but is not replayable.

The theory is that given a PostgreSQL database and the respective WAL, you can recreate the database to the time that the last entry of the WAL was written to disk. Some caveats though:

1) Under Linux, if you have the file system containing the WAL mounted with asynchronous writes, "all bets are off". The *BSD crowd (that I know of) take great pleasure in constantly reminding me that if the power fails, my file system will be in an indeterminate state - things could be half-written all over the file system.

2) If you're using IDE drives, under any operating system, and have write caching turned on in the IDE drives themselves, again "all bets are off".

3) If you're using IDE drives behind a RAID controller, YMMV.

So to play things safe, one recommendation to ensure database robustness is to:

1) Store the WAL on a separate physical drive.

2) Under Linux, mount that file system with synchronous writes (ie: fsync won't return until the data is actually, really, written to the interface).

3) If using IDE drives, turn off write caching on the WAL volume so that you know data is actually written to disk when the drive claims it is.

Note that disabling write caching will impact write performance significantly. Most people *want* write caching turned on for throughput-critical file systems, and turned off for mission-critical file systems.

Note too that SCSI systems tend to have no "write cache" as such, since they use "tagged command queues". The OS can say to the SCSI drive something that is effectively, "here are 15 blocks of data to write to disk, get back to me when the last one is actually written to the media", and continue on its way. On IDE, the OS can only have one command outstanding - the purpose of the write cache is to allow multiple commands to be received and "acknowledged" before any data is actually written to the media.

When the host is correctly configured, you can recover a PostgreSQL database from a hardware failure by recovering the database file itself and "replaying" the WAL to that database.

Read more about WAL here:
http://www.postgresql.org/docs/current/static/wal.html

Regards
Alex

PS: Please send corrections to the list

PPS: Don't forget to include "fire drills" as part of your disaster recovery plan - get plenty of practice at recovering a database from a crashed machine so that you don't make mistakes when the time comes that you actually need to do it!

PPPS: And follow your own advice ;)
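To make the fsync point above concrete, here is a minimal C sketch of the write-then-fsync pattern a WAL depends on (an illustration of the general technique only, not PostgreSQL's actual code; the file name is made up):

    /* Sketch of a durable log append: write the record, then force it
     * to stable storage before reporting the transaction committed.
     * "wal.log" is a hypothetical file name for illustration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char record[] = "one WAL record\n";
        int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, record, sizeof(record) - 1)
                != (ssize_t)(sizeof(record) - 1)) {
            perror("write");
            close(fd);
            return 1;
        }

        /* Only when fsync() returns may the commit be acknowledged.
         * Note caveat 2 above: with IDE write caching enabled, the
         * drive may acknowledge data it has merely cached, and then
         * fsync() returns too early. */
        if (fsync(fd) != 0) { perror("fsync"); close(fd); return 1; }

        close(fd);
        return 0;
    }

fsync() returning success is the only signal that the record is on stable storage, which is exactly why a drive that lies about completed writes breaks the recovery guarantee.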
Alex Satrapa <alex@lintelsys.com.au> writes:
> 1) Under Linux, if you have the file system containing the WAL mounted
> with asynchronous writes, "all bets are off". The *BSD crowd (that I
> know of) take great pleasure in constantly reminding me that if the
> power fails, my file system will be in an indeterminate state - things
> could be half-written all over the file system.

This is pretty out of date. If you use a journaling filesystem (there are four solid ones available, and modern distros use them), metadata is consistent and crash recovery is fast. Even with ext2, WAL files are preallocated and PG calls fsync() after writing, so in practice it's not likely to cause problems.

-Doug
Alex Satrapa wrote:
> Some caveats though:
> 1) Under Linux, if you have the file system containing the WAL mounted
> with asynchronous writes, "all bets are off". The *BSD crowd (that I
> know of) take great pleasure in constantly reminding me that if the
> power fails, my file system will be in an indeterminate state - things
> could be half-written all over the file system.

This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine, though you get better performance from them by mounting them 'writeback'.

--
Bruce Momjian                     |  http://candle.pha.pa.us
pgman@candle.pha.pa.us            |  (610) 359-1001
+ If your life is a hard drive,   |  13 Roberts Road
+ Christ can be your backup.      |  Newtown Square, Pennsylvania 19073
Doug McNaught <doug@mcnaught.org> writes:
> Alex Satrapa <alex@lintelsys.com.au> writes:
> > 1) Under Linux, if you have the file system containing the WAL mounted
> > with asynchronous writes, "all bets are off".
> ...
> Even with ext2, WAL files are preallocated and PG calls fsync() after
> writing, so in practice it's not likely to cause problems.

Um. I took the reference to "mounted with async write" to mean a soft-mounted NFS filesystem. It does not matter which OS you think is the one true OS --- running a database over NFS is the act of someone with a death wish. But, yeah, soft-mounted NFS is a particularly malevolent variety ...

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Doug McNaught <doug@mcnaught.org> writes:
> > Alex Satrapa <alex@lintelsys.com.au> writes:
> > > 1) Under Linux, if you have the file system containing the WAL mounted
> > > with asynchronous writes, "all bets are off".
> > ...
> > Even with ext2, WAL files are preallocated and PG calls fsync() after
> > writing, so in practice it's not likely to cause problems.
>
> Um. I took the reference to "mounted with async write" to mean a
> soft-mounted NFS filesystem. It does not matter which OS you think is
> the one true OS --- running a database over NFS is the act of someone
> with a death wish. But, yeah, soft-mounted NFS is a particularly
> malevolent variety ...

I took it as a garbled understanding of the "Linux does async metadata updates" criticism. Which is true for ext2, but was never the show-stopper some BSD-ers wanted it to be. :)

-Doug
Doug McNaught wrote:
> I took it as a garbled understanding of the "Linux does async metadata
> updates" criticism. Which is true for ext2, but was never the
> show-stopper some BSD-ers wanted it to be. :)

I have on several occasions demonstrated how "bad" asynchronous writes are to a BSD-bigot by pulling the plug on a mail server (having a terminal on another machine showing the results of tail -f /var/log/mail.log), then showing that when the machine comes back up the most we've ever lost is one message.

From the BSD-bigot's point of view, this is equivalent to the end of the world as we know it. From my point of view, it's just support for my demands to have each mission-critical server supported by a UPS, if not redundant power supplies and two UPSes.

Alex
> From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.

Never had a kernel panic? I've had a few. Probably flakey hardware. I feel safer since journalling file systems hit Linux.
Craig O'Shannessy wrote:
> Never had a kernel panic? I've had a few. Probably flakey hardware. I
> feel safer since journalling file systems hit Linux.

The only kernel panic I've ever had was when playing with a development version of the kernel (2.3.x). Never played with development kernels since then - I'm a user, not a developer.

All the outages I've experienced so far have been due to external factors such as (in order of frequency):

- Colocation facility technicians repatching panels and putting my connection "back" into the wrong port
- Colo facility power failure (we were told they had dual redundant diesel+battery UPS, but they only had one; the second was being installed "any time now")
- End users' machines crashing
- Client software crashing
- Colo facility techs ripping power cables or network cables while "cleaning up" cable trays
- Hard drive failure (hard, fast and very real - one revolution the drive was working, the next it was a charred blackened mess of fibreglass, silicon and aluminium)

I have to admit that in none of those cases would synchronous vs asynchronous, journalling vs non-journalling or *any* file system decision have made the slightest jot of a difference to the integrity of my data.

I've yet to experience a CPU failure (touch wood!).
On Fri, 28 Nov 2003, Marco Colombo wrote:
> On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
> > > From my point of view, it's just support for my demands to have each
> > > mission-critical server supported by a UPS, if not redundant power
> > > supplies and two UPSes.
> >
> > Never had a kernel panic? I've had a few. Probably flakey hardware. I
> > feel safer since journalling file systems hit Linux.
>
> On any hardware flakey enough to cause panics, no FS code will save
> you. The FS may "reliably" write total rubbish to disk. It may have been
> doing that for hours, trashing the whole FS structure, before something
> triggered the panic.
> You are no safer with a journal than you are with plain FAT (or any
> other FS technology). Journal files get corrupted themselves.

This isn't always true. For example, my most recent panic was due to an IDE CD-ROM driver on a fairly expensive Intel dual Xeon box, running 2.4.18. I mounted the CD-ROM and boom, panic. If I'd been running ext2, I would have had a very lengthy reboot and lots of pissed off users, but as it's ext3, the system was back up in a couple of minutes, and I just removed the CD-ROM drive from fstab (I've got other CD-ROM drives :)

I can't remember what the problem was, but it was known and unusual; I think it might have been the drive firmware, from memory.

Of course cosmic rays etc. can and do flip bits in memory, so any non-ECC system can panic if the wrong bit flips. Incredibly rare, but again, I'm glad I'm running a journalling file system, if just for the reboot time.
On Thu, 27 Nov 2003, Doug McNaught wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Doug McNaught <doug@mcnaught.org> writes:
> > > Alex Satrapa <alex@lintelsys.com.au> writes:
> > > > 1) Under Linux, if you have the file system containing the WAL mounted
> > > > with asynchronous writes, "all bets are off".
> > > ...
> > > Even with ext2, WAL files are preallocated and PG calls fsync() after
> > > writing, so in practice it's not likely to cause problems.
> >
> > Um. I took the reference to "mounted with async write" to mean a
> > soft-mounted NFS filesystem.
>
> I took it as a garbled understanding of the "Linux does async metadata
> updates" criticism. Which is true for ext2, but was never the
> show-stopper some BSD-ers wanted it to be. :)

And it's not file metadata, it's directory data. Metadata (inode data) is synced, even in ext2, AFAIK. Quoting the man page:

    fsync copies all in-core parts of a file to disk, and waits until
    the device reports that all parts are on stable storage. It also
    updates metadata stat information. It does not necessarily ensure
    that the entry in the directory containing the file has also
    reached disk. For that an explicit fsync on the file descriptor of
    the directory is also needed.

For WALs, this is perfectly fine. It can be a problem for those applications that do a lot of renames and rely on those as sync/locking mechanisms (think of mail spoolers).

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it
On Fri, 28 Nov 2003, Alex Satrapa wrote:
> Doug McNaught wrote:
> > I took it as a garbled understanding of the "Linux does async metadata
> > updates" criticism. Which is true for ext2, but was never the
> > show-stopper some BSD-ers wanted it to be. :)
>
> I have on several occasions demonstrated how "bad" asynchronous writes
> are to a BSD-bigot by pulling the plug on a mail server (having a
> terminal on another machine showing the results of tail -f
> /var/log/mail.log), then showing that when the machine comes back up
> the most we've ever lost is one message.

Sorry, I can't resist. Posting this on a PostgreSQL list is too funny. This is the last place they want to hear about a lost transaction. Even just one.

The problem is (was) with programs using _directory_ operations as synchronization primitives, and mail spoolers (say, MTAs) are typical in that. Beware that in the MTA world losing a single message is just as bad as losing a committed transaction is for DB people. MTAs are expected to return "OK, I received and stored the message." only _after_ they committed it to disk in a safe manner (that's because the other side is allowed to delete its copy after seeing the "OK"). The only acceptable behaviour in case of failure (which is of course unacceptable in the DB world) for MTAs is to deliver _two_ copies of a message, but _never_ zero (message lost). That's what might happen if something crashes (or the connection is lost) _after_ the MTA committed the message to disk and _before_ the peer received notification of that. Later the peer will try and send the message again (the receiving MTA has enough knowledge to detect the duplication, but usually real-world MTAs don't do that, AFAIK).

My understanding of the problem is: UNIX fsync(), historically, used to sync directory data (filename entries) too before returning. MTAs used to call rename()/fsync() or link()/unlink()/fsync() sequences to "commit" a message to disk. In Linux, fsync() is documented _not_ to sync directory data, "just" file data and metadata (inode). While the UNIX behaviour turned out to be very useful, personally I don't think Linux fsync() is broken/buggy. A file in UNIX is just that, data blocks and inode. Syncing directory data was just a (useful) side effect of one implementation. In Linux, an explicit fsync() on the directory itself is needed (and on each path component if you changed one of them too), if you want to commit changes to disk. Doing that is just as safe as on any filesystem, even on ext2 with async writes enabled (it doesn't mean "ignore fsync()" after all!).

AFAIK, but I might be wrong as I know little of this, PostgreSQL does not rely on directory operations for commits or WAL writes. It operates on file _data_ and uses fsync(). That works fine with ext2 in async writes mode, too, no wonder. No need to mount noasync or to use chattr -S.

BTW, there's no change in fsync() itself, AFAIK. Some journalled FSes (maybe _all_ of them) will update directory data with fsync() too, but that's an implementation detail. In my very personal opinion, any application relying on that is buggy. A directory and a file are different "objects" in UNIX, and if you need both synced to disk, you need to call fsync() twice. Note that syncing a file on most journalled FSes means syncing the journal - _all_ pending writes on that FS, even those not related to your file. How could the FS "partially" sync the journal, to sync just _your_ file data and metadata? That's why directory data gets synced, too. There's no magic in fsync().

> From the BSD-bigot's point of view, this is equivalent to the end of
> the world as we know it.

From anyone's point of view, losing track of a committed transaction (and an accepted message is just that) is the end of the world.

> From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.

Of course. The OS can only be sure it delivered the data to the disk. If the disk lies about having actually stored it on the platters (as IDE disks do), there's still a window of vulnerability. What I don't really get is how SCSI disks can avoid lying about writes and at the same time not show performance degradation on writes compared to their IDE cousins. How any disk mechanics can perform at the same speed as DRAM is beyond my understanding (even if that mechanics is three times as expensive as the IDE kind).

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it
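To make the "commit via rename" sequence concrete, here is a minimal C sketch of the Linux-safe version (my own illustration, not code from any real MTA; all names are hypothetical):

    /* Sketch of committing a message durably on Linux: fsync() the
     * file, rename() it into place, then fsync() the containing
     * directory, since fsync() on the file does not commit the
     * directory entry. Error handling is minimal for brevity. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int commit_message(const char *tmppath, const char *finalpath,
                       const char *dirpath, const char *data, size_t len)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }  /* data + inode */
        close(fd);

        if (rename(tmppath, finalpath) != 0) return -1;

        /* The extra step Linux requires: sync the directory itself. */
        int dfd = open(dirpath, O_RDONLY);
        if (dfd < 0) return -1;
        if (fsync(dfd) != 0) { close(dfd); return -1; }
        close(dfd);

        return 0;  /* only now is it safe to answer "OK, stored" */
    }

Only after the final fsync() on the directory returns is the rename itself on stable storage; answering "OK" to the peer before that point is where the lost-message window opens.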
On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
> > From my point of view, it's just support for my demands to have each
> > mission-critical server supported by a UPS, if not redundant power
> > supplies and two UPSes.
>
> Never had a kernel panic? I've had a few. Probably flakey hardware. I
> feel safer since journalling file systems hit Linux.

On any hardware flakey enough to cause panics, no FS code will save you. The FS may "reliably" write total rubbish to disk. It may have been doing that for hours, trashing the whole FS structure, before something triggered the panic. You are no safer with a journal than you are with plain FAT (or any other FS technology). Journal files get corrupted themselves.

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it
On Fri, 28 Nov 2003, Alex Satrapa wrote:
[...]
> I have to admit that in none of those cases would synchronous vs
> asynchronous, journalling vs non-journalling or *any* file system
> decision have made the slightest jot of a difference to the integrity
> of my data.
>
> I've yet to experience a CPU failure (touch wood!).

I have. I have seen memory failures, too. Bits getting flipped at random. CPUs going mad. Video cards whose text buffer gets overwritten by "something"... all were HW failures. There's little the SW can do when the HW fails, just report it, if it gets the chance. Your data is already (potentially) lost when that happens. Reliably saving the content of a memory-corrupted buffer to disk will just cause _more_ damage to your data. That's especially true when the "data" is filesystem metadata.

Horror stories: I still remember the day when /bin/chmod became of type ? and size +4GB on my home PC (that was Linux 0.98 on a 100MB HD - with a buggy IDE chipset).

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it
alex@lintelsys.com.au (Alex Satrapa) writes:
> Craig O'Shannessy wrote:
> > Never had a kernel panic? I've had a few. Probably flakey hardware. I
> > feel safer since journalling file systems hit Linux.
>
> The only kernel panic I've ever had was when playing with a
> development version of the kernel (2.3.x). Never played with
> development kernels since then - I'm a user, not a developer.

You apparently don't "get out enough"; while Linux is certainly a lot more reliable than systems that need to be rebooted every few days so that they don't spontaneously reboot, perfection is not to be had:

1. Flakey hardware can _always_ take things down. A buggy video card and/or X driver can and will take systems down in a flash. (And this problem shouldn't leave *BSD folk feeling comfortable; they have no "silver bullet" against this problem...)

2. Devices that pretend to be SCSI devices have a history of being troublesome. I have encountered kernel panics as a result of IDE CD-ROMs, USB memory card readers, and the USB Palm interface going 'flakey.'

3. There's an oft-heavily-loaded system that I have been working with that has occasionally kernel panicked. Haven't been able to get enough error messages out of it to track it down.

Note that none of these scenarios have anything to do with "development kernels"; in ALL these cases, I have experienced the problems when running "production" kernels. There have been times when I have tracked "bleeding edge" kernels; I never, in those times, experienced data loss, although there have, historically, been experimental versions which did break so badly as to trash filesystems.

I have seen a LOT more kernel panics in "production" versions than in "experimental" versions, personally; the notion that avoiding "dev" kernels will eliminate kernel panics is just fantasy. Production kernels can't prevent disk hardware from being flakey; that, alone, is point enough.

--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
On Sat, 29 Nov 2003, Craig O'Shannessy wrote:
> On Fri, 28 Nov 2003, Marco Colombo wrote:
> > On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
> > > Never had a kernel panic? I've had a few. Probably flakey hardware.
> > > I feel safer since journalling file systems hit Linux.
> >
> > On any hardware flakey enough to cause panics, no FS code will save
> > you. The FS may "reliably" write total rubbish to disk. It may have
> > been doing that for hours, trashing the whole FS structure, before
> > something triggered the panic.
> > You are no safer with a journal than you are with plain FAT (or any
> > other FS technology). Journal files get corrupted themselves.
>
> This isn't always true. For example, my most recent panic was due to
> an IDE CD-ROM driver on a fairly expensive Intel dual Xeon box, running
> 2.4.18. I mounted the CD-ROM and boom, panic. If I'd been running ext2,
> I would have had a very lengthy reboot and lots of pissed off users,
> but as it's ext3, the system was back up in a couple of minutes, and I
> just removed the CD-ROM drive from fstab (I've got other CD-ROM drives :)

Sure, I didn't mean it to be _always_ true, just true in general. And you've been lucky. You don't actually know what happened... a runaway pointer that tried to write to some protected location in kernel space? How can you be 100% sure it _did not_ write to some write-enabled pages, like, say, the in-core copy of the inode of some very important file of yours? Or the cached copy of some directory, orphaning a number of critical files? If ext3 wrote that to disk, the journal won't help you much (unless, maybe, it's mounted with data=journal). And what if that runaway pointer wrote some garbage (with Murphy's laws in action) to _the in-core copy of the journal_ itself?

And reboot time is another (lengthy) matter: some would advise doing a full fsck after a crash even with ext3 - Red Hat systems do ask you for that right after boot - so let's say ext3 gives you the option to boot fast, if you're not _that_ paranoid about your data. But all this is about being paranoid about our data, isn't it? B-)

> I can't remember what the problem was, but it was known and unusual; I
> think it might have been the drive firmware, from memory.
>
> Of course cosmic rays etc. can and do flip bits in memory, so any
> non-ECC system can panic if the wrong bit flips. Incredibly rare, but
> again, I'm glad I'm running a journalling file system, if just for the
> reboot time.

No need for cosmic rays. A faulty fan - on the CPU, in the case, or (as many MBs have nowadays) on the chipset - will do. Do you ever upgrade your RAM? I've seen faulty DIMMs. And what exactly happens when something overheats (CPU, RAM, MB, disks) in your system? Does your MB go into "protection" mode (i.e. it freezes, without giving any message to the OS)? Bit flipping is not "incredibly rare", believe me. I've seen all of them. Usually the system just crashes, and you'll get it back up pretty fast. However, random corruption is rare, but possible.

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it
On Fri, Nov 28, 2003 at 12:28:25 +0100, Marco Colombo <marco@esi.it> wrote:
> My understanding of the problem is: UNIX fsync(), historically,
> used to sync directory data (filename entries) too before returning.
> MTAs used to call rename()/fsync() or link()/unlink()/fsync()
> sequences to "commit" a message to disk. In Linux, fsync() is
> documented _not_ to sync directory data, "just" file data and metadata
> (inode). While the UNIX behaviour turned out to be very useful,
> personally I don't think Linux fsync() is broken/buggy. A file in
> UNIX is just that, data blocks and inode. Syncing directory data
> was just a (useful) side effect of one implementation. In Linux,
> an explicit fsync() on the directory itself is needed (and on each
> path component if you changed one of them too), if you want to
> commit changes to disk. Doing that is just as safe as on any
> filesystem, even on ext2 with async writes enabled (it doesn't mean
> "ignore fsync()" after all!).

A new function name should have been used to go along with the new semantics.
> This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine,
> though you get better performance from them by mounting them
> 'writeback'.

What does 'writeback' do exactly?
"Rick Gigger" <rick@alpinenetworking.com> writes: >> This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine, >> though you get better performance from them by mounting them >> 'writeback'. > > What does 'writeback' do exactly? AFAIK 'writeback' only applies to ext3. The 'data=writeback' setting journals metadata but not data, so it's faster but may lose file contents in case of a crash. For Postgres, which calls fsync() on the WAL, this is not an issue since when fsync() returns the file contents are commited to disk. AFAIK XFS and JFS are always in 'writeback' mode; I'm not sure about Reiser. -Doug
Marco Colombo wrote:
> On Fri, 28 Nov 2003, Alex Satrapa wrote:
> > From the BSD-bigot's point of view, this is equivalent to the end of
> > the world as we know it.
>
> From anyone's point of view, losing track of a committed transaction
> (and an accepted message is just that) is the end of the world.

When hardware fails, you'd be mad to trust the data stored on the hardware. You can't be sure that the data that's actually on disk is what was supposed to be there, the whole of what's supposed to be there, and nothing but what's supposed to be there. You just can't. This emphasis that some people have on "committing writes to disk" is misplaced.

If the data is really that important, you'd be sending it to three places at once (one or three, not two - ask any sailor about clocks) - async or not.

> What I don't really get is how SCSI disks can avoid lying about writes
> and at the same time not show performance degradation on writes
> compared to their IDE cousins.

SCSI disks have the advantage of "tagged command queues". A simplified version of the difference between IDE's single-transaction model and SCSI's tagged command queue is as follows (this is based on my vague understanding of SCSI magic).

On an IDE disk, you do this:

PC: Here, disk, store this data.
Disk: Okay, done.
PC: And here's a second block.
Disk: Okay, done.
... ad nauseam ...
PC: And here's a ninety-fifth block.
Disk: Okay, done.

On a SCSI disk, you do this:

PC: Disk, store these ninety-five blocks, and tell me when you've finished.
[time passes]
PC: Oh, can you fetch me some blocks from over there while you're at it?
[time passes]
Disk: Okay, all those writes are done!
[fetching continues]

> How any disk mechanics can perform at the same speed as DRAM is beyond
> my understanding (even if that mechanics is three times as expensive
> as the IDE kind).

It's not the mechanics that are faster; it's just that transferring stuff to the disk's buffers can be done "asynchronously" - you're not waiting for previous writes to complete before queuing new writes (or reads). At the same time, the SCSI disk isn't "lying" to you about having committed the data to media, since the two stages of request and confirmation can be separated in time. So at any time, the disk can have a number of read and write requests queued up, and it can decide which order to do them in. The OS can happily go on its way.

At least, that's my understanding.

Alex
On Tue, 2 Dec 2003, Alex Satrapa wrote:
> Marco Colombo wrote:
> > From anyone's point of view, losing track of a committed transaction
> > (and an accepted message is just that) is the end of the world.
>
> When hardware fails, you'd be mad to trust the data stored on the
> hardware. You can't be sure that the data that's actually on disk is
> what was supposed to be there, the whole of what's supposed to be
> there, and nothing but what's supposed to be there. You just can't.
> This emphasis that some people have on "committing writes to disk" is
> misplaced.
>
> If the data is really that important, you'd be sending it to three
> places at once (one or three, not two - ask any sailor about clocks) -
> async or not.

Sure, but we were discussing a 'pull the plug' scenario, not HW failures. Only RAID (which is a way of sending data to different places) saves you from a disk failure (if it can be _detected_!), and nothing saves you from a CPU/RAM failure on a conventional PC (though a second PC might, if you're lucky). The original problem was ext2 losing _only_ one message after reboot when someone pulls the plug.

The real problem is not the disk; it's the application returning "OK, COMMITTED" to the other side (which may be an SMTP client or a PostgreSQL client). IDE tricks these applications into returning OK _before_ the data hits safe storage (the platters). The FS may play a role too, especially for those applications that use fsync() on a file to sync directory data too. On many journalled FSes, fsync() triggers a (global) journal write (which sometimes can be a performance killer), so, as a side effect, a sync of directory data too.

AFAIK, ext2 is safe to use with PostgreSQL, since commits do not involve any directory operation (and if they did, I hope PostgreSQL would do an fsync() on the involved directory too). With heavy transaction loads, I guess it will outperform journalled filesystems, w/o _any_ loss in data safety. I have no data to back up such a statement, though.

[ok on the SCSI async behaviour]

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it