Re: Mount options for Ext3? - Mailing list pgsql-performance

From Tom Lane
Subject Re: Mount options for Ext3?
Msg-id 1260.1043463535@sss.pgh.pa.us
In response to Re: Mount options for Ext3?  (Kevin Brown <kevin@sysexperts.com>)
Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.

Not quite.  The marker says that all datafile updates described by
log entries before point X have been flushed to disk by the checkpoint
--- and, therefore, if we need to restart we need only replay log
entries occurring after the last checkpoint's point X.

This has nothing directly to do with which transactions are committed
or not committed.  If we based checkpoint behavior on that, we'd need
to maintain an indefinitely large amount of WAL to cope with
long-running transactions.
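
To make the replay rule concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL code: the WAL is a plain list, point X is a list index, and the record contents and the name records_to_replay are invented for illustration. Note that the function never looks at whether a transaction committed.

```python
def records_to_replay(wal, point_x):
    """Return the WAL records that recovery must replay.

    wal is a list of records in log order. point_x is the log position
    (here simply a list index) noted by the last completed checkpoint.
    Everything before point_x is already reflected in the data files.
    """
    return wal[point_x:]


# Hypothetical log contents, for illustration only.
wal = [
    ("txn 1", "update page 7"),
    ("txn 2", "update page 3"),  # txn 2 is still open at checkpoint time
    ("txn 1", "commit"),
    # The checkpoint noted the end of WAL here, so point X = 3.
    ("txn 2", "update page 9"),
    ("txn 3", "update page 7"),
]

to_replay = records_to_replay(wal, 3)
```

Transaction 2 is uncommitted on both sides of point X, yet its earlier record is not replayed, because the page change it describes was already flushed by the checkpoint.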

The actual checkpoint algorithm is

    take note of current logical end of WAL (this will be point X)
    write() all dirty buffers in shared buffer arena
    sync() to ensure that above writes, as well as previous ones,
        are on disk
    put checkpoint record referencing point X into WAL; write and
        fsync WAL
    update pg_control with new checkpoint record, fsync it

Since pg_control is what's examined after restart, the checkpoint is
effectively committed when the pg_control write hits disk.  At any
instant before that, a crash would result in replaying from the
prior checkpoint's point X.  The algorithm is correct if and only if
the pg_control write hits disk after all the other writes mentioned.
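
The five steps above can be sketched with the corresponding system calls as exposed by Python's os module on Unix. This is a toy under stated assumptions, not the server's implementation: the function name, the dirty_pages dictionary, the "CKPT" record layout, and storing point X directly in the control file are all simplifications made up for the example.

```python
import os
import struct


def checkpoint(data_fd, wal_fd, control_path, dirty_pages):
    """Toy checkpoint following the five steps listed above.

    dirty_pages maps a byte offset in the data file to the bytes that
    belong there. Returns point X, the WAL offset replay would start from.
    """
    # 1. Note the current logical end of WAL: this is point X.
    point_x = os.lseek(wal_fd, 0, os.SEEK_END)

    # 2. write() every dirty buffer. The kernel may keep them in its cache.
    for offset, page in sorted(dirty_pages.items()):
        os.pwrite(data_fd, page, offset)
    dirty_pages.clear()

    # 3. sync() so those writes, and any earlier ones, are scheduled to disk.
    os.sync()

    # 4. Append a checkpoint record referencing point X to WAL and fsync it.
    #    (The record format here is invented for the example.)
    os.write(wal_fd, b"CKPT" + struct.pack("<Q", point_x))
    os.fsync(wal_fd)

    # 5. Update the control file and fsync it. Once this write is on disk,
    #    a restart will replay from point_x instead of the older point.
    ctl_fd = os.open(control_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(ctl_fd, struct.pack("<Q", point_x), 0)
        os.fsync(ctl_fd)
    finally:
        os.close(ctl_fd)
    return point_x
```

Nothing in the sketch enforces that the step 3 writes reach disk before the step 5 write does; it relies on the filesystem for that, which is exactly the assumption under discussion.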

The key assumption we are making about the filesystem's behavior is that
writes scheduled by the sync() will occur before the pg_control write
that's issued after it.  People have occasionally faulted this algorithm
by quoting the sync() man page, which saith (in the Gospel According To
HP)

     The writing, although scheduled, is not necessarily complete upon
     return from sync.

This, however, is not a problem in itself.  What we need to know is
whether the filesystem will allow writes issued after the sync() to
complete before those "scheduled" by the sync().
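
The condition can be stated as a small predicate. The sketch below is only a thought experiment: a real filesystem does not report the order in which writes became durable, and the labels and the function name are made up for illustration.

```python
def checkpoint_survives_crash(durable_order, required, control="pg_control"):
    """True if every required write became durable before the control write.

    durable_order: labels of writes in the order they actually reached disk
                   up to the moment of the crash.
    required:      labels of the writes the new checkpoint depends on.
    """
    if control not in durable_order:
        # The control write never landed, so a restart simply replays
        # from the prior checkpoint's point X.
        return True
    landed_before = set(durable_order[:durable_order.index(control)])
    return set(required) <= landed_before


required = ["data page A", "data page B", "WAL checkpoint record"]

# sync() returned early, but the scheduled writes still finished first.
in_order = checkpoint_survives_crash(
    ["data page A", "data page B", "WAL checkpoint record", "pg_control"],
    required,
)

# The filesystem let the later control write overtake a scheduled one,
# and the crash came before "data page B" reached disk.
overtaken = checkpoint_survives_crash(
    ["data page A", "WAL checkpoint record", "pg_control"],
    required,
)
```

The first case is safe even though sync() may have returned before any of the writes completed. Only the second case, where a later write overtakes an earlier scheduled one, breaks the algorithm.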


> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files.  Then the system crashes.

I think that this analysis is not relevant to what we're doing.

            regards, tom lane
