Re: disaster recovery - Mailing list pgsql-general

From Marco Colombo
Subject Re: disaster recovery
Msg-id Pine.LNX.4.44.0311281126250.25502-100000@Megathlon.ESI
In response to Re: disaster recovery  (Alex Satrapa <alex@lintelsys.com.au>)
Responses Re: disaster recovery  (Bruno Wolff III <bruno@wolff.to>)
Re: disaster recovery  (Alex Satrapa <alex@lintelsys.com.au>)
List pgsql-general
On Fri, 28 Nov 2003, Alex Satrapa wrote:

> Doug McNaught wrote:
> > I took it as a garbled understanding of the "Linux does async metadata
> > updates" criticism.  Which is true for ext2, but was never the
> > show-stopper some BSD-ers wanted it to be.  :)
>
> I have on several occasions demonstrated how "bad" asynchronous writes
> are to a BSD-bigot by pulling the plug on a mail server (having a
> terminal on another machine showing the results of tail -f
> /var/log/mail.log), then showing that when the machine comes back up the
> most we've ever lost is one message

Sorry, I can't resist. Posting this on a PostgreSQL list is too funny.
This is the last place they want to hear about a lost transaction. Even
just one.

The problem is (was) with programs using _directory_ operations as
synchronization primitives, and mail spoolers (say, MTAs) are typical
examples of that. Beware that in the MTA world losing a single message is
just as bad as losing a committed transaction for DB people. MTAs
are expected to return "OK, I received and stored the message."
only _after_ they committed it to disk in a safe manner (that's because
the other side is allowed to delete its copy after seeing the "OK").
The only acceptable behaviour in case of failure (which is of course
unacceptable in the DB world) for MTAs is to deliver _two_ copies
of a message, but _never_ zero (message lost). That's what might happen
if something crashes (or connection is lost) _after_ the MTA committed
the message to disk and _before_ the peer received notification of
that. Later the peer will try and send the message again (the receiving
MTA has enough knowledge to detect the duplication, but usually real
world MTAs don't do that AFAIK).
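
To make the "two copies, never zero" rule concrete, here is a minimal
in-memory sketch of the failure window described above. Everything in it
is hypothetical and for illustration only: the Receiver class, the
message IDs, and the lose_first_ack switch are not taken from any real
MTA, and a Python list stands in for the on-disk spool.

```python
class Receiver:
    """Toy receiving MTA: commits first, acknowledges afterwards."""

    def __init__(self, dedup=False):
        self.dedup = dedup
        self.stored = []        # stands in for the on-disk spool
        self.seen_ids = set()   # IDs of messages already committed

    def accept(self, message_id, body):
        if self.dedup and message_id in self.seen_ids:
            # Already committed: acknowledge again, store nothing new.
            return "OK"
        self.stored.append((message_id, body))  # "commit to disk" happens first
        self.seen_ids.add(message_id)
        return "OK"                             # only then is the OK sent


def send_with_retry(receiver, message_id, body, lose_first_ack=True):
    """Toy sending MTA: keeps its copy and retries until it sees an OK."""
    attempts = 0
    while True:
        attempts += 1
        reply = receiver.accept(message_id, body)
        if lose_first_ack and attempts == 1:
            # The message was committed, but the OK never reached the sender.
            continue
        if reply == "OK":
            return attempts


if __name__ == "__main__":
    plain = Receiver()
    send_with_retry(plain, "id-1", "hello")
    print(len(plain.stored))    # copies kept without duplicate detection

    careful = Receiver(dedup=True)
    send_with_retry(careful, "id-1", "hello")
    print(len(careful.stored))  # copies kept with duplicate detection
```

In both cases the message is never lost, because the sender deletes its
copy only after seeing the OK. The only difference is whether the retry
leaves a second copy behind.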

My understanding of the problem is: UNIX fsync(), historically,
also synced directory data (filename entries) before returning.
MTAs used to call rename()/fsync() or link()/unlink()/fsync()
sequences to "commit" a message to disk. In Linux, fsync() is
documented _not_ to sync directory data, "just" file data and metadata
(inode). While the UNIX behaviour turned out to be very useful,
personally I don't think Linux fsync() is broken/buggy. A file in
UNIX is just that, data blocks and inode. Syncing directory data
was just a (useful) side-effect of one implementation. In Linux,
an explicit fsync() on the directory itself is needed (and on each
path component, if you changed one of them too) if you want to
commit changes to disk. Doing that is just as safe as on any filesystem,
even on ext2 with async writes enabled (it doesn't mean "ignore fsync()"
after all!).
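
As a rough sketch of the sequence described above (fsync() the file,
rename() it, then fsync() the directory), something like the following
should work on Linux. The function names, the "tmp." prefix and the spool
layout are assumptions made for illustration, not what any particular MTA
does, and only the one directory that actually changed is synced here.

```python
import os
import tempfile


def write_all(fd, data):
    """Write every byte of data, since os.write() may write only part of it."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def fsync_dir(path):
    """Explicitly sync a directory so that its filename entries reach disk."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def commit_message(spool_dir, name, data):
    """Store data as spool_dir/name; return only after file and directory are synced."""
    tmp_path = os.path.join(spool_dir, "tmp." + name)   # hypothetical naming scheme
    final_path = os.path.join(spool_dir, name)

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        write_all(fd, data)
        os.fsync(fd)            # file data and inode
    finally:
        os.close(fd)

    os.rename(tmp_path, final_path)
    fsync_dir(spool_dir)        # the directory entry created by rename()
    return final_path


if __name__ == "__main__":
    spool = tempfile.mkdtemp()
    print(commit_message(spool, "msg-0001", b"Subject: test\n\nhello\n"))
```

Only after commit_message() returns would it be safe to send the "OK" to
the peer. Whether the disk itself has really written the data is a
separate question that fsync() alone cannot answer.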

AFAIK, but I might be wrong as I know little of this, PostgreSQL
does not rely on directory operations for commits or WAL writes.
It operates on file _data_ and uses fsync(). That works fine with
ext2 in async writes mode, too, which is no wonder. No need to mount
with the sync option or to use chattr +S.
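
For contrast, a commit that touches only file _data_ can be sketched as
below. This is not PostgreSQL's actual WAL code, just an illustration of
the idea under the assumption that the log file already exists, so that
no directory entry changes at commit time. The function name and the
record format are made up for the example.

```python
import os
import tempfile


def append_record(log_path, record):
    """Append record to an existing log file and sync it before returning."""
    # No O_CREAT: the file must already exist, so the directory is not modified.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND)
    try:
        view = memoryview(record)
        while len(view) > 0:            # os.write() may write only part of the data
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                    # file data and inode; no directory sync needed
    finally:
        os.close(fd)


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log_path = os.path.join(log_dir, "commit.log")
    open(log_path, "wb").close()        # created ahead of time, outside the commit path
    append_record(log_path, b"commit 1\n")
    append_record(log_path, b"commit 2\n")
    with open(log_path, "rb") as f:
        print(f.read())
```

Creating the file in the first place is a directory operation, so that
one-time step would still need its own directory fsync() if it had to
survive a crash.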

BTW, there's no change in the fsync() itself, AFAIK. Some journalled FS
(maybe _all_ of them) will update directory data with fsync() too,
but that's an implementation detail. In my very personal opinion,
any application relying on that is buggy. A directory and a file
are different "objects" in UNIX, and if you need both synced to disk,
you need to call fsync() two times. Note that syncing a file on
most journalled FS means syncing the journal, _all_ pending writes on
that FS, even those not related to your file.  How could the FS
"partially" sync the journal, to sync just _your_ file data and metadata?
That's why directory data gets synced, too. There's no magic in fsync().

>  From the BSD-bigot's point of view, this is equivalent to the end of
> the world as we know it.

From anyone's point of view, losing track of a committed transaction
(and an accepted message is just that) is the end of the world.

>  From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.

Of course. The OS can only be sure it delivered the data to the disk.
If the disk lies about having actually stored it on the platters (as IDE
disks do), there's still a window of vulnerability. What I don't
really get is how SCSI disks can avoid lying about writes and at the same
time not show performance degradation on writes compared to their
IDE cousins. How any disk mechanics can perform at the same speed as
DRAM is beyond my understanding (even if those mechanics are 3 times
as expensive as the IDE ones).

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it
