Thread: disaster recovery

disaster recovery

From
"Jason Tesser"
Date:
We are evaluating Postgres and would like some input about disaster
recovery.  I know in MsSQL they have a feature called transactional
logs that would enable a database to be put back together based off
those logs.  Does Postgres do anything like this?  I saw in the
documentation transactional logging, but I don't know if it is the
same.  Where can I find info about disaster recovery in Postgres?
Thank you in advance for any info given.

Jason Tesser
Web/Multimedia Programmer
Northland Ministries Inc.
(715)324-6900 x3050


Re: disaster recovery

From
Alex Satrapa
Date:
Jason Tesser wrote:
> We are evaluating Postgres and would like some input about disaster recovery.

I'm going to try to communicate what I understand, and other list
members can correct me at their selected level of vehemence :)
Please send corrections to the list - I may take days to post follow-ups.

 > I know in MsSQL they have a feature called transactional
> logs that would enable a database to be put back together based off those logs.

A roughly parallel concept in PostgreSQL (what's the correct
capitalisation and spelling?) is the "Write Ahead Log" (WAL). There is
also a quite dissimilar concept called the query log - which is good to
inspect for common queries to allow database tuning, but is not replay-able.

The theory is that given a PostgreSQL database and the respective WAL,
you can recreate the database to the time that the last entry of the WAL
was written to disk.
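
To make that concrete, here's a toy sketch in C (illustration only -
this is not PostgreSQL's actual code, and a real log record would also
carry the target file and offset so that replay knows where to apply
it):

    #include <sys/types.h>
    #include <unistd.h>

    /* The write-ahead rule: force the log record to stable storage
     * *before* touching the data file, so a crash can always be
     * repaired by replaying the log over the data file. */
    int log_then_apply(int log_fd, int data_fd, off_t off,
                       const void *buf, size_t len)
    {
        if (write(log_fd, buf, len) != (ssize_t) len)
            return -1;
        if (fsync(log_fd) != 0)         /* this is the commit point */
            return -1;
        /* The data write below may sit in the OS cache for a while;
         * the already-synced log is what protects it. */
        if (pwrite(data_fd, buf, len, off) != (ssize_t) len)
            return -1;
        return 0;
    }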

Some caveats though:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off". The *BSD crowd (that I
know of) take great pleasure in constantly reminding me that if the
power fails, my file system will be in an indeterminate state - things
could be half-written all over the file system.
2) If you're using IDE drives, under any operating system, and have
write-caching turned on in the IDE drives themselves, again "all bets
are off".
3) If you're using IDE drives behind a RAID controller, YMMV.

So to play things safe, one recommendation to ensure database robustness
is to:
1) Store the WAL on a separate physical drive
2) Under Linux, mount that file system with synchronous writes (ie:
each write() is pushed out to the drive before it returns, rather than
sitting in the buffer cache)
3) If using IDE drives, turn off write caching on the WAL volume so that
you know data is actually written to disk when the drive claims it is
(see the example commands below).
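
On a Linux box that might look something like this (the device name and
mount point are invented examples - adjust for your own setup):

    # /dev/hdb is a placeholder for the dedicated WAL drive
    hdparm -W0 /dev/hdb               # turn off the drive's write cache
    mount -o sync /dev/hdb1 /pg_wal   # mount the WAL volume synchronously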

Note that disabling write caching will impact write performance
significantly. Most people *want* write caching turned on for
throughput-critical file systems, and turned off for mission-critical
file systems.

Note too that SCSI systems tend to have no "write cache" as such, since
they use "tagged command queues". The OS can say to the SCSI drive
something that is effectively, "here are 15 blocks of data to write to
disk, get back to me when the last one is actually written to the
media", and continue on its way.  On IDE, the OS can only have one
command outstanding - the purpose of the write cache is to allow
multiple commands to be received and "acknowledged" before any data is
actually written to the media.

When the host is correctly configured, you can recover a PostgreSQL
database from a hardware failure by recovering the database file itself
and "replaying" the WAL to that database.

Read more about WAL here:
http://www.postgresql.org/docs/current/static/wal.html

Regards
Alex
PS: Please send corrections to the list
PPS: Don't forget to include "fire drills" as part of your disaster
recovery plan - get plenty of practice at recovering a database from a
crashed machine so that you don't make mistakes when the time comes that
you actually need to do it!
PPPS: And follow your own advice ;)


Re: disaster recovery

From
Doug McNaught
Date:
Alex Satrapa <alex@lintelsys.com.au> writes:

> 1) Under Linux, if you have the file system containing the WAL mounted
> with asynchronous writes, "all bets are off". The *BSD crowd (that I
> know of) take great pleasure in constantly reminding me that if the
> power fails, my file system will be in an indeterminate state - things
> could be half-written all over the file system.

This is pretty out of date.  If you use a journaling filesystem
(there are four solid ones available, and modern distros use them),
metadata is consistent and crash recovery is fast.

Even with ext2, WAL files are preallocated and PG calls fsync() after
writing, so in practice it's not likely to cause problems.

-Doug

Re: disaster recovery

From
Bruce Momjian
Date:
Alex Satrapa wrote:
> Some caveats though:
> 1) Under Linux, if you have the file system containing the WAL mounted
> with asynchronous writes, "all bets are off". The *BSD crowd (that I
> know of) take great pleasure in constantly reminding me that if the
> power fails, my file system will be in an indeterminate state - things
> could be half-written all over the file system.

This is only a problem for ext2.  Ext3, Reiser, XFS, JFS are all fine,
though you get better performance from them by mounting them
'writeback'.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: disaster recovery

From
Tom Lane
Date:
Doug McNaught <doug@mcnaught.org> writes:
> Alex Satrapa <alex@lintelsys.com.au> writes:
>> 1) Under Linux, if you have the file system containing the WAL mounted
>> with asynchronous writes, "all bets are off".
> ...
> Even with ext2, WAL files are preallocated and PG calls fsync() after
> writing, so in practice it's not likely to cause problems.

Um.  I took the reference to "mounted with async write" to mean a
soft-mounted NFS filesystem.  It does not matter which OS you think is
the one true OS --- running a database over NFS is the act of someone
with a death wish.  But, yeah, soft-mounted NFS is a particularly
malevolent variety ...

            regards, tom lane

Re: disaster recovery

From
Doug McNaught
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Doug McNaught <doug@mcnaught.org> writes:
>> Alex Satrapa <alex@lintelsys.com.au> writes:
>>> 1) Under Linux, if you have the file system containing the WAL mounted
>>> with asynchronous writes, "all bets are off".
>> ...
>> Even with ext2, WAL files are preallocated and PG calls fsync() after
>> writing, so in practice it's not likely to cause problems.
>
> Um.  I took the reference to "mounted with async write" to mean a
> soft-mounted NFS filesystem.  It does not matter which OS you think is
> the one true OS --- running a database over NFS is the act of someone
> with a death wish.  But, yeah, soft-mounted NFS is a particularly
> malevolent variety ...

I took it as a garbled understanding of the "Linux does async metadata
updates" criticism.  Which is true for ext2, but was never the
show-stopper some BSD-ers wanted it to be.  :)

-Doug

Re: disaster recovery

From
Alex Satrapa
Date:
Doug McNaught wrote:
> I took it as a garbled understanding of the "Linux does async metadata
> updates" criticism.  Which is true for ext2, but was never the
> show-stopper some BSD-ers wanted it to be.  :)

I have on several occasions demonstrated how "bad" asynchronous writes
are to a BSD-bigot by pulling the plug on a mail server (having a
terminal on another machine showing the results of tail -f
/var/log/mail.log), then showing that when the machine comes back up the
most we've ever lost is one message.

From the BSD-bigot's point of view, this is equivalent to the end of
the world as we know it.

From my point of view, it's just support for my demands to have each
mission-critical server supported by a UPS, if not redundant power
supplies and two UPSes.

Alex


Re: disaster recovery

From
"Craig O'Shannessy"
Date:
>
>  From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.
>

Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
feel safer since journalling file systems hit linux.


Re: disaster recovery

From
Alex Satrapa
Date:
Craig O'Shannessy wrote:
> Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
> feel safer since journalling file systems hit linux.

The only kernel panic I've ever had was when playing with a development
version of the kernel (2.3.x). Never played with development kernels
since then - I'm a user, not a developer.

All the outages I've experienced so far have been due to external
factors such as (in order of frequency):
  - Colocation facility technicians repatching panels and
    putting my connection "back" into the wrong port
  - Colo facility power failure (we were told they had dual
    redundant diesel+battery UPS, but they only had one, the
    second was being installed "any time now")
  - End users' machines crashing
  - Client software crashing
  - Colo facility techs ripping power cables or network
    cables while "cleaning up" cable trays
  - Hard drive failure (hard, fast and very real - one
    revolution the drive was working, the next it was a
    charred blackened mess of fibreglass, silicon and
    aluminium)

I have to admit that in none of those cases would synchronous vs
asynchronous, journalling vs non-journalling or *any* file system
decision have made the slightest jot of a difference to the integrity of
my data.

I've yet to experience a CPU failure (touch wood!).


Re: disaster recovery

From
"Craig O'Shannessy"
Date:
On Fri, 28 Nov 2003, Marco Colombo wrote:

> On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
>
> > >
> > >  From my point of view, it's just support for my demands to have each
> > > mission-critical server supported by a UPS, if not redundant power
> > > supplies and two UPSes.
> > >
> >
> > Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
> > feel safer since journalling file systems hit linux.
>
> On any hardware flakey enough to cause panics, no FS code will save
> you. The FS may "reliably" write total rubbish to disk. It may have been
> doing that for hours, thrashing the whole FS structure, before something
> triggered the panic.
> You are no safer with journal than you are with a plain FAT (or any
> other FS technology). Journal files get corrupted themselves.
>

This isn't always true.  For example, my most recent panic was due to an
IDE CD-ROM driver on a fairly expensive Intel dual Xeon box running
2.4.18.  I mounted the CD-ROM and boom, panic.  If I'd been running
ext2, I would have had a very lengthy reboot and lots of pissed off
users, but as it's ext3, the system was back up in a couple of minutes,
and I just removed the CD-ROM drive from fstab (I've got other CD-ROM
drives :)

I can't remember what the problem was, but it was known and unusual - I
think it might have been the drive firmware, from memory.

Of course cosmic rays etc. can and do flip bits in memory, so any
non-ECC system can panic if the wrong bit flips.  Incredibly rare, but
again, I'm glad I'm running a journalling file system, if just for the
reboot time.



Re: disaster recovery

From
Marco Colombo
Date:
On Thu, 27 Nov 2003, Doug McNaught wrote:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Doug McNaught <doug@mcnaught.org> writes:
> >> Alex Satrapa <alex@lintelsys.com.au> writes:
> >>> 1) Under Linux, if you have the file system containing the WAL mounted
> >>> with asynchronous writes, "all bets are off".
> >> ...
> >> Even with ext2, WAL files are preallocated and PG calls fsync() after
> >> writing, so in practice it's not likely to cause problems.
> >
> > Um.  I took the reference to "mounted with async write" to mean a
> > soft-mounted NFS filesystem.  It does not matter which OS you think is
> > the one true OS --- running a database over NFS is the act of someone
> > with a death wish.  But, yeah, soft-mounted NFS is a particularly
> > malevolent variety ...
>
> I took it as a garbled understanding of the "Linux does async metadata
> updates" criticism.  Which is true for ext2, but was never the
> show-stopper some BSD-ers wanted it to be.  :)

And it's not file metadata, it's directory data. Metadata (inode
data) is synced, even in ext2, AFAIK.

Quoting the man page:
       fsync copies all in-core parts of a file to disk, and
       waits until the device reports that all parts are on
       stable storage.  It also updates metadata stat information.
       It does not necessarily ensure that the entry in the
       directory containing the file has also reached disk.  For
       that an explicit fsync on the file descriptor of the
       directory is also needed.

For WALs, this is perfectly fine. It can be a problem for those
applications that do a lot of renames and rely on those as
sync/locking mechanisms (think of mail spoolers).
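
For the curious, the extra step such applications need on Linux looks
roughly like this in C (a sketch only; the path is made up and error
handling is trimmed):

    #include <fcntl.h>
    #include <unistd.h>

    /* After creating or renaming files in a directory, Linux requires
     * an explicit fsync() on the directory itself before the new
     * entries are known to be on disk. */
    void sync_dir(const char *path)     /* e.g. "/var/spool/mqueue" */
    {
        int dirfd = open(path, O_RDONLY);
        if (dirfd >= 0) {
            fsync(dirfd);               /* commit the directory entries */
            close(dirfd);
        }
    }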

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it


Re: disaster recovery

From
Marco Colombo
Date:
On Fri, 28 Nov 2003, Alex Satrapa wrote:

> Doug McNaught wrote:
> > I took it as a garbled understanding of the "Linux does async metadata
> > updates" criticism.  Which is true for ext2, but was never the
> > show-stopper some BSD-ers wanted it to be.  :)
>
> I have on several occasions demonstrated how "bad" asynchronous writes
> are to a BSD-bigot by pulling the plug on a mail server (having a
> terminal on another machine showing the results of tail -f
> /var/log/mail.log), then showing that when the machine comes back up the
> most we've ever lost is one message.

Sorry, I can't resist. Posting this on a PostgreSQL list is too funny.
This is the last place they want to hear about a lost transaction. Even
just one.

The problem is (was) with programs using _directory_ operations as
synchronization primitives, and mail spoolers (say, MTAs) are typical
in that. Beware that in the MTA world losing a single message is
just as bad as losing a committed transaction is for DB people. MTAs
are expected to return "OK, I received and stored the message."
only _after_ they have committed it to disk in a safe manner (that's
because the other side is allowed to delete its copy after seeing the
"OK"). The only acceptable failure behaviour for MTAs (which is of
course unacceptable in the DB world) is to deliver _two_ copies
of a message, but _never_ zero (a lost message). That's what may happen
if something crashes (or the connection is lost) _after_ the MTA
committed the message to disk and _before_ the peer received
notification of that. Later the peer will try to send the message again
(the receiving MTA has enough knowledge to detect the duplication, but
real-world MTAs usually don't do that, AFAIK).

My understanding of the problem is: UNIX fsync(), historically,
used to sync also directory data (filename entries) before returning.
MTAs used to call rename()/fsync() or link()/unlink()/fsync()
sequences to "commit" a message to disk. In Linux, fsync() is
documented _not_ to sync directory data, "just" file data and metadata
(inode). While the UNIX behaviour turned out to be very useful,
personally I don't think Linux fsync() is broken/buggy. A file in
UNIX is just that, data blocks and inode. Syncing directory data
was just a (useful) side-effect of one implementation. In Linux,
an explicit fsync() on the directory itself is needed (and in each
path component if you changed one of them too), if you want to
commit changes to disk. Doing that is just as safe as on any filesystem,
even on ext2 with async writes enabled (it doesn't mean "ignore fsync()"
after all!).
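
Spelled out in C, that "commit" sequence looks roughly like this (a
sketch with invented filenames, not any particular MTA's code; the
final fsync() on the directory is the step Linux makes explicit):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Write the message to a temp file, force the data to disk, rename
     * it into place, then fsync() the directory so the rename itself
     * is durable.  Only after all of that may the MTA answer "OK". */
    int commit_message(const char *buf, size_t len)
    {
        int fd = open("/var/spool/new/msg.tmp",
                      O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
            close(fd);                  /* message data not safely down */
            return -1;
        }
        close(fd);
        if (rename("/var/spool/new/msg.tmp",
                   "/var/spool/new/msg") != 0)   /* atomic visibility */
            return -1;
        int dirfd = open("/var/spool/new", O_RDONLY);
        if (dirfd < 0)
            return -1;
        int rc = fsync(dirfd);          /* now the rename is durable */
        close(dirfd);
        return rc;
    }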

AFAIK (but I might be wrong, as I know little of this) PostgreSQL
does not rely on directory operations for commits or WAL writes.
It operates on file _data_ and uses fsync(). No wonder that works
fine with ext2 in async mode, too. No need to mount with 'sync' or
to use chattr +S.

BTW, there's no change in the fsync() itself, AFAIK. Some journalled FS
(maybe _all_ of them) will update directory data with fsync() too,
but that's an implementation detail. In my very personal opinion,
any application relying on that is buggy. A directory and a file
are different "objects" in UNIX, and if you need both synced to disk,
you need to call fsync() two times. Note that syncing a file on
most journalled FS means syncing the journal: _all_ pending writes on
that FS, even those not related to your file.  How could the FS
"partially" sync the journal, to sync just _your_ file data and metadata?
That's why directory data gets synced, too. There's no magic in fsync().

>  From the BSD-bigot's point of view, this is equivalent to the end of
> the world as we know it.

From anyone's point of view, losing track of a committed transaction
(and an accepted message is just that) is the end of the world.

>  From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.

Of course. The OS can only be sure it delivered the data to the disk.
If the disk lies about having actually stored it on the platters (as
IDE disks do), there's still a window of vulnerability. What I don't
really get is how SCSI disks can avoid lying about writes and at the
same time not show performance degradation on writes compared to their
IDE cousins. How any disk mechanics can perform at the same speed as
DRAM is beyond my understanding (even if that mechanics is three times
as expensive as the IDE kind).

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it


Re: disaster recovery

From
Marco Colombo
Date:
On Fri, 28 Nov 2003, Craig O'Shannessy wrote:

> >
> >  From my point of view, it's just support for my demands to have each
> > mission-critical server supported by a UPS, if not redundant power
> > supplies and two UPSes.
> >
>
> Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
> feel safer since journalling file systems hit linux.

On any hardware flakey enough to cause panics, no FS code will save
you. The FS may "reliably" write total rubbish to disk. It may have been
doing that for hours, thrashing the whole FS structure, before something
triggered the panic.
You are no safer with journal than you are with a plain FAT (or any
other FS technology). Journal files get corrupted themselves.

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it


Re: disaster recovery

From
Marco Colombo
Date:
On Fri, 28 Nov 2003, Alex Satrapa wrote:

[...]
> I have to admit that in none of those cases would synchronous vs
> asynchronous, journalling vs non-journalling or *any* file system
> decision have made the slightest jot of a difference to the integrity of
> my data.
>
> I've yet to experience a CPU failure (touch wood!).

I have. I have seen memory failures, too. Bits getting flipped at random.
CPUs going mad. Video cards whose text buffer gets overwritten by
"something"... all were HW failures.  There's little the SW can do when
the HW fails, other than report it, if it gets any chance.
Your data is already (potentially) lost when that happens. Reliably
saving the content of a memory-corrupted buffer to disk will just cause
_more_ damage to your data. That's especially true when the "data" is
filesystem metadata. Horror stories. I still remember the day when
/bin/chmod became of type ? and size +4GB on my home PC (that was
Linux 0.98 on a 100MB HD - with a buggy IDE chipset).

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it


Re: disaster recovery

From
Christopher Browne
Date:
alex@lintelsys.com.au (Alex Satrapa) writes:
> Craig O'Shannessy wrote:
>> Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
>> feel safer since journalling file systems hit linux.
>
> The only kernel panic I've ever had was when playing with a
> development version of the kernel (2.3.x). Never played with
> development kernels since then - I'm a user, not a developer.

You apparently don't "get out enough;" while Linux is certainly a lot
more reliable than systems that need to be rebooted every few days so
that they don't spontaneously reboot, perfection is not to be had:

 1.  Flakey hardware can _always_ take things down.

     A buggy video card and/or X driver can and will take systems down
     in a flash.  (And this problem shouldn't leave *BSD folk feeling
     comfortable; they have no "silver bullet" against this
     problem...)

 2.  Devices that pretend to be SCSI devices have a history of being
     troublesome.  I have encountered kernel panics as a result of
     IDE-CDROMs, USB memory card readers, and the USB Palm interface
     going 'flakey.'

 3.  There's an oft-heavily-loaded system that I have been working with
     that has occasionally kernel panicked.  Haven't been able to get
     enough error messages out of it to track it down.

Note that none of these scenarios have anything to do with
"development kernels;" in ALL these cases, I have experienced the
problems when running "production" kernels.

There have been times when I have tracked "bleeding edge" kernels; I
never, in those times, experienced data loss, although there have,
historically, been experimental versions which did break so badly as
to trash filesystems.

I have seen a LOT more kernel panics in "production" versions than in
"experimental" versions, personally; the notion that avoiding "dev"
kernels will eliminate kernel panics is just fantasy.

Production kernels can't prevent disk hardware from being flakey;
that, alone, is point enough.
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: disaster recovery

From
Marco Colombo
Date:
On Sat, 29 Nov 2003, Craig O'Shannessy wrote:

> On Fri, 28 Nov 2003, Marco Colombo wrote:
>
> > On Fri, 28 Nov 2003, Craig O'Shannessy wrote:
> >
> > > >
> > > >  From my point of view, it's just support for my demands to have each
> > > > mission-critical server supported by a UPS, if not redundant power
> > > > supplies and two UPSes.
> > > >
> > >
> > > Never had a kernel panic?  I've had a few.  Probably flakey hardware. I
> > > feel safer since journalling file systems hit linux.
> >
> > On any hardware flakey enough to cause panics, no FS code will save
> > you. The FS may "reliably" write total rubbish to disk. It may have been
> > doing that for hours, thrashing the whole FS structure, before something
> > triggered the panic.
> > You are no safer with journal than you are with a plain FAT (or any
> > other FS technology). Journal files get corrupted themselves.
> >
>
> This isn't always true.  For example, my most recent panic was due to an
> IDE CD-ROM driver on a fairly expensive Intel dual Xeon box running
> 2.4.18.  I mounted the CD-ROM and boom, panic.  If I'd been running
> ext2, I would have had a very lengthy reboot and lots of pissed off
> users, but as it's ext3, the system was back up in a couple of minutes,
> and I just removed the CD-ROM drive from fstab (I've got other CD-ROM
> drives :)

Sure, I didn't mean it to be _always_ true, just true in general. And
you've been lucky. You don't actually know what happened... a runaway
pointer that tried to write to some protected location in kernel space?
How can you be 100% sure it _did not_ write to some write-enabled pages,
like, say, the in-core copy of the inode of some very important file
of yours? Or the cached copy of some directory, orphaning a number of
critical files? If ext3 wrote that to disk, the journal won't help
you much (unless, maybe, it was mounted with data=journal). And what if
that runaway pointer wrote some garbage (with Murphy's laws in action)
to _the in-core copy of the journal_ itself?

And reboot time is another (lengthy) matter: some would advise doing
a full fsck after a crash even with ext3 - Red Hat systems do ask
you for that right after boot - so let's say ext3 gives you the option
to boot fast, if you're not _that_ paranoid about your data. But all
this is about being paranoid about our data, isn't it? B-)

> I can't remember what the problem was, but it was known and unusual, I
> think it might have been the drive firmware from memory.
>
> Of course cosmic rays etc. can and do flip bits in memory, so any
> non-ECC system can panic if the wrong bit flips.  Incredibly rare, but
> again, I'm glad I'm running a journalling file system, if just for the
> reboot time.

No need for cosmic rays. A faulty fan, either on the CPU, or in the
case, or (many MBs have it nowadays) on the chipset will do. Do you ever
upgrade your RAM? I've seen faulty DIMMs. And what exactly happens
when something overtemps (CPU, RAM, MB, disks) in your system? Does
your MB go into "protection" mode (i.e. it freezes, without giving any
message to the OS)? Bit flipping is not "incredibly rare", believe me.
I've seen all of them. Usually the system just crashes, and you'll
get it back up pretty fast. However, random corruption is rare, but possible.

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it


Re: disaster recovery

From
Bruno Wolff III
Date:
On Fri, Nov 28, 2003 at 12:28:25 +0100,
  Marco Colombo <marco@esi.it> wrote:
>
> My understanding of the problem is: UNIX fsync(), historically,
> used to sync also directory data (filename entries) before returning.
> MTAs used to call rename()/fsync() or link()/unlink()/fsync()
> sequences to "commit" a message to disk. In Linux, fsync() is
> documented _not_ to sync directory data, "just" file data and metadata
> (inode). While the UNIX behaviour turned out to be very useful,
> personally I don't think Linux fsync() is broken/buggy. A file in
> UNIX is just that, data blocks and inode. Syncing directory data
> was just a (useful) side-effect of one implementation. In Linux,
> an explicit fsync() on the directory itself is needed (and in each
> path component if you changed one of them too), if you want to
> commit changes to disk. Doing that is just as safe as on any filesystem,
> even on ext2 with async writes enabled (it doesn't mean "ignore fsync()"
> after all!).

A new function name should have been used to go along with the new semantics.

Re: disaster recovery

From
"Rick Gigger"
Date:
> This is only a problem for ext2.  Ext3, Reiser, XFS, JFS are all fine,
> though you get better performance from them by mounting them
> 'writeback'.

What does 'writeback' do exactly?

Re: disaster recovery

From
Doug McNaught
Date:
"Rick Gigger" <rick@alpinenetworking.com> writes:

>> This is only a problem for ext2.  Ext3, Reiser, XFS, JFS are all fine,
>> though you get better performance from them by mounting them
>> 'writeback'.
>
> What does 'writeback' do exactly?

AFAIK 'writeback' only applies to ext3.  The 'data=writeback' setting
journals metadata but not data, so it's faster but may lose file
contents in case of a crash.  For Postgres, which calls fsync() on the
WAL, this is not an issue, since when fsync() returns the file contents
are committed to disk.

AFAIK XFS and JFS are always in 'writeback' mode; I'm not sure about
Reiser.
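
In fstab terms that would be something along these lines (the device
and mount point are invented examples):

    # example /etc/fstab entry - device and mount point are placeholders
    /dev/hdb1  /var/lib/pgsql  ext3  defaults,data=writeback  1 2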

-Doug

Re: disaster recovery

From
Alex Satrapa
Date:
Marco Colombo wrote:
> On Fri, 28 Nov 2003, Alex Satrapa wrote:
>> From the BSD-bigot's point of view, this is equivalent to the end of
>> the world as we know it.
>
> From anyone's point of view, losing track of a committed transaction
> (and an accepted message is just that) is the end of the world.

When hardware fails, you'd be mad to trust the data stored on the
hardware. You can't be sure that the data that's actually on disk is
what was supposed to be there, the whole of what's supposed to be there,
and nothing but what's supposed to be there. You just can't.  This
emphasis that some people have on "committing writes to disk" is misplaced.

If the data is really that important, you'd be sending it to three
places at once (one or three, not two - ask any sailor about clocks) -
async or not.

> What I don't
> really get is how SCSI disks can avoid lying about writes and at the
> same time not show performance degradation on writes compared to their
> IDE cousins.

SCSI disks have the advantage of "tagged command queues". A simplified
version of the difference between IDE's single-transaction model and
SCSI's tagged command queue is as follows (this is based on my vague
understanding of SCSI magic):

On an IDE disk, you do this:

PC: here, disk, store this data
Disk: Okay, done
PC: and here's a second block
Disk: Okay, done
... ad nauseam ...
PC: and here's the ninety-fifth block
Disk: Okay, done.

On a SCSI disk, you do this:
PC: Disk, store these ninety-five blocks, and tell me when you've finished
[time passes]
PC: Oh, can you fetch me some blocks from over there while you're at it?
[time passes]
Disk: Okay, all those writes are done!
[fetching continues]


> How any disk mechanics can perform at the same speed as
> DRAM is beyond my understanding (even if that mechanics is three times
> as expensive as the IDE kind).

It's not the mechanics that are faster, it's just that transferring
stuff to the disk's buffers can be done "asynchronously" - you're not
waiting for previous writes to complete before queuing new writes (or
reads). At the same time, the SCSI disk isn't "lying" to you about
having committed the data to media, since the two stages of request and
confirmation can be separated in time.

So at any time, the disk can have a number of read and write requests
queued up, and it can decide which order to do them in. The OS can
happily go on its way.

At least, that's my understanding.
Alex


Re: disaster recovery

From
Marco Colombo
Date:
On Tue, 2 Dec 2003, Alex Satrapa wrote:

> Marco Colombo wrote:
> > On Fri, 28 Nov 2003, Alex Satrapa wrote:
> >> From the BSD-bigot's point of view, this is equivalent to the end of
> >> the world as we know it.
> >
> > From anyone's point of view, losing track of a committed transaction
> > (and an accepted message is just that) is the end of the world.
>
> When hardware fails, you'd be mad to trust the data stored on the
> hardware. You can't be sure that the data that's actually on disk is
> what was supposed to be there, the whole of what's supposed to be there,
> and nothing but what's supposed to be there. You just can't.  This
> emphasis that some people have on "committing writes to disk" is misplaced.
>
> If the data is really that important, you'd be sending it to three
> places at once (one or three, not two - ask any sailor about clocks) -
> async or not.

Sure, but we were discussing a 'pull the plug' scenario, not HW failures.
Only RAID (which is a way of sending data to different places)
saves you from a disk failure (if it can be _detected_!), and nothing
saves you from a CPU/RAM failure on a conventional PC (but a second PC,
if you're lucky). The original problem was ext2 losing _only_ one
message after a reboot when someone pulls the plug. The real problem is
not the disk, it's the application returning "OK, COMMITTED" to the
other side (which may be an SMTP client or a PostgreSQL client). IDE
tricks these applications into returning OK _before_ the data hits safe
storage (platters). The FS may play a role too, especially for those
applications that use fsync() on a file to sync directory data too. On
many journalled FS, fsync() triggers a (global) journal write (which
can sometimes be a performance killer), so, as a side effect, a sync of
directory data too.

AFAIK, ext2 is safe to use with PostgreSQL, since commits do not involve
any directory operation (and if one ever does, I hope PostgreSQL does a
fsync() on the involved directory too). With heavy transaction loads, I
guess it will outperform journalled filesystems, w/o _any_ loss in data
safety. I have no data to back up such a statement, though.

[ ok on the SCSI async behavior ]

.TM.
--
      ____/  ____/   /
     /      /       /            Marco Colombo
    ___/  ___  /   /              Technical Manager
   /          /   /             ESI s.r.l.
 _____/ _____/  _/               Colombo@ESI.it