Re: FAQ Q - Mailing list pgsql-admin

From Tom Lane
Subject Re: FAQ Q
Date
Msg-id 10978.1023759032@sss.pgh.pa.us
Whole thread Raw
In response to Re: FAQ Q  (Tim Ellis <Tim.Ellis@gamet.com>)
Responses Re: FAQ Q
List pgsql-admin
Tim Ellis <Tim.Ellis@gamet.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> You do not have to trust Postgres itself: in all cases we push the log
>> entries out to the OS before declaring a transaction committed.

> In all cases while fsync mode is enabled, not in all cases (including
> no-fsync) right?

All cases is what I said, and all cases is what I meant.  fsync mode
only controls whether there's an fsync() after the write().

> So no-fsync simply means that "commit" != "sync'd to disk"? How is this
> different than the big boys do it? From Sybase internals classes and
> Oracle documentation, I know that when you say "COMMIT" you don't get any
> response back until the RDBMS has written that transaction TO DISK.

Or thinks it has.  One of the points that people don't much like to talk
about is the difficulty of knowing how far the data has really been
pushed.  In fsync mode we report commit when the kernel has told us
it's written the data; we have no way of knowing whether the kernel
lied, and even less way of knowing whether the disk drive has actually
written the information or only cached it on the controller board.
On modern disk drives the truth is likely to be that the bits have
only gone as far as the drive controller; so you may lose data if
you lose power and the power supply does not have enough capacity
to hold up the drive while it's finishing its pending writes.

However, the big boys running on the same hardware are going to have the
same problem.

If you can figure out how to configure your drives not to report write
complete until it's really complete, then you can feel secure with
either Postgres or the big boys.  Or you can buy a UPS and make sure you
are configured to shut down before the UPS runs out of steam.

In no-fsync mode we report commit after pushing the data out to the
kernel, but we don't try to force the kernel to push it out to disk.
So you are safe against a Postgres crash, but not against kernel or
hardware failure.  Your transaction could be lost if the system dies
before the kernel gets around to syncing it out to disk (typically
30 sec max, on most Unixen).

> Is the FAQ being too generous in speed and too unfair in recoverability to
> the commercial offerings here?

I didn't write the FAQ and won't take responsibility for its claims
about commercial databases.  I really don't know what tradeoffs they
offer in this area.

> Sybase at least just tries to avoid writing tranlogs to disk except in
> batches, so when you say COMMIT, it pauses until X amount of time passes
> or Y number of committed transactions have piled up, then it writes those
> trans to disk. But your client is blocked on the COMMIT until the
> transaction actually makes it to disk.

We have that too, although it's not on by default, and probably needs
further tuning work.

> In 7.1/7.2, when you COMMIT, you are guaranteed that WAL (and probably
> nothing but WAL) is written to disk before you get a return and that
> therefore if your instance immediately thereafter crashes, that after
> recovery (which is required), you will have a consistent (not-corrupted)
> database.

Modulo the above issues, yes.  Note also that WAL fsync protects you
against data corruption in the case of a mid-transaction system-level
failure: if all the WAL updates have made it to disk, then we will be
able to fix any incomplete or missing writes in the data files.  Without
fsync there is a distinct risk of corrupted data.  This has to do with
forcing fsync on the WAL files before we start to modify data pages
intra-transaction.

            regards, tom lane

pgsql-admin by date:

Previous
From: Tim Ellis
Date:
Subject: Re: FAQ Q
Next
From: Bruce Momjian
Date:
Subject: Re: FAQ Q