Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.
Not quite. The marker says that all datafile updates described by
log entries before point X have been flushed to disk by the checkpoint
--- and, therefore, if we need to restart we need only replay log
entries occurring after the last checkpoint's point X.
This has nothing directly to do with which transactions are committed
or not committed. If we based checkpoint behavior on that, we'd need
to maintain an indefinitely large amount of WAL to cope with
long-running transactions.
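
To make the replay rule concrete, here is a toy model in Python. It is only a sketch: the record layout, the replay() helper and the integer log positions are all invented for illustration, not the real WAL format. The point it shows is that recovery starts at point X and does not ask which transactions had committed.

```python
# Toy model: the WAL is a list of (position, key, value) records.
# All names here are hypothetical; this is not the real WAL format.

def replay(datafile, wal, redo_point):
    """Reapply every WAL record at or after redo_point (point X)."""
    state = dict(datafile)  # copy of what the data files held at the crash
    for position, key, value in wal:
        if position >= redo_point:
            # Replay is driven purely by log position, not by whether
            # the transaction that wrote the record had committed.
            state[key] = value
    return state

wal = [(1, "a", 10), (2, "b", 20), (3, "a", 11), (4, "c", 30)]

# The last checkpoint noted point X = 3, so the updates described by
# records 1 and 2 are known to be in the data files already.
on_disk = {"a": 10, "b": 20}

recovered = replay(on_disk, wal, redo_point=3)
```

Records 1 and 2 are never looked at again, which is why the amount of WAL that must be kept is bounded by the checkpoint interval rather than by the age of the oldest open transaction.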

The actual checkpoint algorithm is

    take note of current logical end of WAL (this will be point X)
    write() all dirty buffers in shared buffer arena
    sync() to ensure that above writes, as well as previous ones,
        are on disk
    put checkpoint record referencing point X into WAL; write and
        fsync WAL
    update pg_control with new checkpoint record, fsync it

Since pg_control is what's examined after restart, the checkpoint is
effectively committed when the pg_control write hits disk. At any
instant before that, a crash would result in replaying from the
prior checkpoint's point X. The algorithm is correct if and only if
the pg_control write hits disk after all the other writes mentioned.
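
The sequence of steps can be sketched with plain system calls. What follows is a minimal illustration in Python for a Unix system, not the backend code: the checkpoint() function, the file names, the "CKPT" record layout and the 8-byte control file are all assumptions made up for the example. Only the ordering of write, sync, fsync and the final control-file update is taken from the description above.

```python
import os
import struct
import tempfile

def checkpoint(wal_fd, data_fd, control_path, dirty_buffers):
    """Hypothetical sketch of the checkpoint ordering (Unix only)."""
    # 1. Note the current logical end of WAL: this is point X.
    redo_point = os.lseek(wal_fd, 0, os.SEEK_END)

    # 2. write() all dirty buffers to the data file.
    for offset, page in dirty_buffers:
        os.pwrite(data_fd, page, offset)

    # 3. sync() so that those writes, and earlier ones, are scheduled.
    os.sync()

    # 4. Append a checkpoint record that references point X; fsync WAL.
    ckpt_pos = os.lseek(wal_fd, 0, os.SEEK_END)
    os.write(wal_fd, b"CKPT" + struct.pack("<Q", redo_point))
    os.fsync(wal_fd)

    # 5. Point the control file at the new checkpoint record and fsync.
    #    Until this write is on disk, a restart uses the prior checkpoint.
    fd = os.open(control_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, struct.pack("<Q", ckpt_pos), 0)
        os.fsync(fd)
    finally:
        os.close(fd)
    return redo_point, ckpt_pos

with tempfile.TemporaryDirectory() as d:
    wal_fd = os.open(os.path.join(d, "wal"), os.O_RDWR | os.O_CREAT, 0o600)
    data_fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
    control = os.path.join(d, "control")
    os.write(wal_fd, b"x" * 100)  # pretend 100 bytes of WAL already exist
    redo_point, ckpt_pos = checkpoint(wal_fd, data_fd, control,
                                      [(0, b"page-0"), (8192, b"page-1")])
    with open(control, "rb") as f:
        stored = struct.unpack("<Q", f.read(8))[0]
    os.close(wal_fd)
    os.close(data_fd)
```

In this toy nothing else appends to the WAL while the checkpoint runs, so the checkpoint record lands exactly at point X. In a live system other backends keep writing, and the record can sit well past the point X it references.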
The key assumption we are making about the filesystem's behavior is that
writes scheduled by the sync() will occur before the pg_control write
that's issued after it. People have occasionally faulted this algorithm
by quoting the sync() man page, which saith (in the Gospel According To
HP)

    The writing, although scheduled, is not necessarily complete upon
    return from sync.

This, however, is not a problem in itself. What we need to know is
whether the filesystem will allow writes issued after the sync() to
complete before those "scheduled" by the sync().
> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files. Then the system crashes.
I think that this analysis is not relevant to what we're doing.
regards, tom lane