
Re: Mount options for Ext3? - Mailing list pgsql-performance

From	Tom Lane
Subject	Re: Mount options for Ext3?
Date	January 24, 2003 21:58:54
Msg-id	1260.1043463535@sss.pgh.pa.us
In response to	Re: Mount options for Ext3? (Kevin Brown <kevin@sysexperts.com>)
Responses	Re: Mount options for Ext3? Re: Mount options for Ext3?
List	pgsql-performance


Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.

Not quite.  The marker says that all datafile updates described by
log entries before point X have been flushed to disk by the checkpoint
--- and, therefore, if we need to restart we need only replay log
entries occurring after the last checkpoint's point X.
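
A minimal sketch of that restart rule, not PostgreSQL's actual recovery code. The `LogEntry` type, the integer `lsn` offsets, and the `replay_after_checkpoint` name are all assumptions made for illustration; the convention assumed here is that an entry "occurs after point X" when its starting offset is at or beyond X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    lsn: int    # hypothetical: starting offset of the entry in the log
    page: str   # which data page the entry modifies
    value: str  # new contents for that page


def replay_after_checkpoint(pages, log, point_x):
    """Reapply every log entry starting at or after point_x.

    Entries before point_x are skipped: the checkpoint guarantees their
    datafile updates were already flushed to disk.
    """
    recovered = dict(pages)
    for entry in sorted(log, key=lambda e: e.lsn):
        if entry.lsn >= point_x:
            recovered[entry.page] = entry.value
    return recovered


# On-disk state reflects everything logged before point X = 20.
on_disk = {"A": "a1", "B": "b0"}
log = [LogEntry(10, "A", "a1"), LogEntry(20, "B", "b1"), LogEntry(30, "A", "a2")]
recovered = replay_after_checkpoint(on_disk, log, point_x=20)
```

Note that nothing in the sketch looks at whether the transaction that produced an entry committed; the cut-off is purely positional.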

This has nothing directly to do with which transactions are committed
or not committed.  If we based checkpoint behavior on that, we'd need
to maintain an indefinitely large amount of WAL to cope with
long-running transactions.

The actual checkpoint algorithm is

    take note of current logical end of WAL (this will be point X)
    write() all dirty buffers in shared buffer arena
    sync() to ensure that above writes, as well as previous ones,
        are on disk
    put checkpoint record referencing point X into WAL; write and
        fsync WAL
    update pg_control with new checkpoint record, fsync it

Since pg_control is what's examined after restart, the checkpoint is
effectively committed when the pg_control write hits disk.  At any
instant before that, a crash would result in replaying from the
prior checkpoint's point X.  The algorithm is correct if and only if
the pg_control write hits disk after all the other writes mentioned.
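
The five steps can be sketched against ordinary files as follows. This is an illustration under stated assumptions, not the server's implementation: the `checkpoint` function, the `wal` file name, the `CKPT` record layout, and storing point X directly in `pg_control` are all invented for the example. It relies on `os.sync()`, which Python provides on Unix only.

```python
import os
import struct


def checkpoint(data_dir, dirty_buffers, wal_end):
    """Run the five checkpoint steps against plain files in data_dir."""
    # 1. Take note of the current logical end of WAL: this is point X.
    point_x = wal_end

    # 2. write() all dirty buffers. No per-file fsync at this stage.
    for name, contents in dirty_buffers.items():
        fd = os.open(os.path.join(data_dir, name),
                     os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, contents)
        finally:
            os.close(fd)

    # 3. sync() to schedule the above writes, and any earlier ones.
    os.sync()

    # 4. Put a checkpoint record referencing point X into the WAL,
    #    then write and fsync the WAL.
    wal_fd = os.open(os.path.join(data_dir, "wal"),
                     os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(wal_fd, b"CKPT" + struct.pack("<Q", point_x))
        os.fsync(wal_fd)
    finally:
        os.close(wal_fd)

    # 5. Update the control file and fsync it. The checkpoint is
    #    effectively committed once this write is on disk.
    ctl_fd = os.open(os.path.join(data_dir, "pg_control"),
                     os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(ctl_fd, struct.pack("<Q", point_x))
        os.fsync(ctl_fd)
    finally:
        os.close(ctl_fd)
    return point_x
```

The order of the calls is the whole point: the control file is touched only after everything it depends on has been written and either synced or fsynced.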

The key assumption we are making about the filesystem's behavior is that
writes scheduled by the sync() will occur before the pg_control write
that's issued after it.  People have occasionally faulted this algorithm
by quoting the sync() man page, which saith (in the Gospel According To
HP)

     The writing, although scheduled, is not necessarily complete upon
     return from sync.

This, however, is not a problem in itself.  What we need to know is
whether the filesystem will allow writes issued after the sync() to
complete before those "scheduled" by the sync().
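
To make the ordering requirement concrete, here is a toy model that enumerates every order in which the checkpoint's writes could reach disk and checks a crash after each prefix. The names `crash_is_safe` and `ordering_is_safe` are hypothetical, and the model treats the WAL checkpoint record like any other write even though the algorithm fsyncs it explicitly; in practice only the sync()-scheduled data writes depend on the filesystem assumption.

```python
from itertools import permutations

# Writes that must be durable before the new pg_control contents are.
EARLIER = ("data pages", "WAL checkpoint record")
CONTROL = "pg_control"


def crash_is_safe(durable):
    """Is a crash recoverable, given the set of writes already on disk?"""
    if CONTROL not in durable:
        # Old pg_control: restart replays from the prior checkpoint's point X.
        return True
    # New pg_control: everything it vouches for must already be on disk.
    return all(write in durable for write in EARLIER)


def ordering_is_safe(order):
    """Check a crash after every prefix of the order writes reach disk."""
    return all(crash_is_safe(set(order[:n])) for n in range(len(order) + 1))


safe_orders = [order for order in permutations(EARLIER + (CONTROL,))
               if ordering_is_safe(order)]
```

In this model the only orders that survive are the ones in which the `pg_control` write lands last, which is the "if and only if" condition stated earlier.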

> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files.  Then the system crashes.

I think that this analysis is not relevant to what we're doing.

            regards, tom lane
