Re: Mount options for Ext3? - Mailing list pgsql-performance

From Kevin Brown
Subject Re: Mount options for Ext3?
Msg-id 20030125041319.GE28252@filer
In response to Re: Mount options for Ext3?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses WAL replay logic (was Re: Mount options for Ext3?)
List pgsql-performance

Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I was presuming that when a savepoint occurs, a marker is written to
> > the log indicating which transactions had been committed to the data
> > files, and that this marker was paid attention to during database
> > startup.
>
> Not quite.  The marker says that all datafile updates described by
> log entries before point X have been flushed to disk by the checkpoint
> --- and, therefore, if we need to restart we need only replay log
> entries occurring after the last checkpoint's point X.
>
> This has nothing directly to do with which transactions are committed
> or not committed.  If we based checkpoint behavior on that, we'd need
> to maintain an indefinitely large amount of WAL log to cope with
> long-running transactions.

Ah.  My apologies for my imprecise wording.  I should have said
"...indicating which transactions had been written to the data files"
instead of "...had been committed to the data files", and meant to say
"checkpoint" but instead said "savepoint".  I'll try to do better
here.

> The actual checkpoint algorithm is
>
>     take note of current logical end of WAL (this will be point X)
>     write() all dirty buffers in shared buffer arena
>     sync() to ensure that above writes, as well as previous ones,
>         are on disk
>     put checkpoint record referencing point X into WAL; write and
>         fsync WAL
>     update pg_control with new checkpoint record, fsync it
>
> Since pg_control is what's examined after restart, the checkpoint is
> effectively committed when the pg_control write hits disk.  At any
> instant before that, a crash would result in replaying from the
> prior checkpoint's point X.  The algorithm is correct if and only if
> the pg_control write hits disk after all the other writes mentioned.
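
For illustration only, the ordering in the quoted algorithm can be
sketched as a toy in Python.  This is not PostgreSQL's code: the file
names "wal" and "control", the helper fsync_write(), and the 8-byte
record layout are all hypothetical, and the sketch fsyncs each file
individually where the quoted algorithm uses a single sync().

```python
import os
import struct
import tempfile


def fsync_write(path, data, append=False):
    """Write data to path and fsync it before returning (hypothetical helper)."""
    mode = os.O_APPEND if append else os.O_TRUNC
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | mode, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # do not return until the kernel reports the data flushed
    finally:
        os.close(fd)


def checkpoint(datadir, dirty_pages, wal_end):
    """Toy checkpoint following the quoted ordering; returns point X."""
    point_x = wal_end                                    # note current end of WAL
    for name, page in dirty_pages.items():               # write dirty buffers...
        fsync_write(os.path.join(datadir, name), page)   # ...and force them to disk
    record = struct.pack("<Q", point_x)                  # assumed record layout
    fsync_write(os.path.join(datadir, "wal"), record, append=True)
    fsync_write(os.path.join(datadir, "control"), record)  # last: publish checkpoint
    return point_x


def read_control(datadir):
    """Return the point X that a restart would start replaying from."""
    with open(os.path.join(datadir, "control"), "rb") as f:
        return struct.unpack("<Q", f.read(8))[0]
```

The control file is deliberately written last: until that write is on
disk, read_control() still returns the previous checkpoint's point X.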

[...]

> > So suppose the marker makes it to the log but not all of the data the
> > marker refers to makes it to the data files.  Then the system crashes.
>
> I think that this analysis is not relevant to what we're doing.

Agreed.  That analysis applies only when synchronous writes by the
database are turned off and one is left to rely on the operating
system to do the right thing.  Clearly it doesn't apply when
synchronous writes are enabled.

With synchronous writes turned off, and as long as only one process
handles a checkpoint, two guarantees would be sufficient to preserve
the integrity of the database: an operating system that commits a
process's writes to disk in the same order in which they were
requested, and a journalling filesystem that at least writes all data
before committing the associated metadata transactions.  This would
hold even if the operating system reordered writes from different
processes.  Per-process write ordering is therefore an operating
system feature that could be considered highly desirable.  (This
relates to the discussion elsewhere about trading off shared buffers
against OS file cache: it's often better to rely on the abilities of
the OS than to roll your own mechanism.)
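
The ordering argument can be checked with a small toy model in Python.
It is only a sketch under simplifying assumptions: a checkpoint is
reduced to three writes with made-up labels, a crash is modelled as
"only a prefix of the writes reached disk", and nothing here reflects
how any real kernel or filesystem behaves.

```python
from itertools import permutations

# The three writes of one checkpoint, in the order the process issues them.
# The labels are hypothetical stand-ins, not real file names.
WRITES = ["data", "wal_checkpoint_record", "pg_control"]


def consistent(on_disk):
    """A checkpoint visible in pg_control needs all earlier writes on disk."""
    if "pg_control" not in on_disk:
        return True  # restart simply falls back to the prior checkpoint
    return {"data", "wal_checkpoint_record"} <= set(on_disk)


def safe_under_crash(order):
    """True if every crash point (every prefix of order) is consistent."""
    return all(consistent(order[:k]) for k in range(len(order) + 1))


# Writes reach disk in the order issued: every crash point is safe.
in_order = safe_under_crash(WRITES)

# Writes reordered arbitrarily: collect the orderings with an unsafe crash point.
unsafe_orders = [p for p in permutations(WRITES) if not safe_under_crash(p)]
```

In this model the unsafe orderings are exactly those in which
pg_control is not the last write to reach disk, which matches the "if
and only if" condition quoted above.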

One question I have is: in the event of a crash, why not simply replay
all the transactions found in the WAL?  Is the startup time of the
database that badly affected if pg_control is ignored?
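
To make the question concrete, here is a toy model in Python of the two
replay strategies.  Everything in it is an assumption for illustration:
the WAL is a list of (key, value) records, the data files are a dict,
and point X is a list index.  It does not model real WAL records, and
it assumes the whole log is still available, which the quoted remark
about not keeping an indefinitely large WAL suggests may not be true in
practice.

```python
def replay(data, wal, start=0):
    """Apply WAL records wal[start:] to a copy of data; return (state, count)."""
    state = dict(data)
    applied = 0
    for key, value in wal[start:]:
        state[key] = value  # toy "redo": overwrite with the logged value
        applied += 1
    return state, applied


wal = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
point_x = 2                  # the checkpoint covers the records before index 2
on_disk = {"a": 1, "b": 2}   # data files after the checkpoint flushed them

from_checkpoint = replay(on_disk, wal, start=point_x)
from_start = replay(on_disk, wal, start=0)
```

In this toy both strategies end in the same state, and the only
difference is how many records are applied (2 versus 4).  Whether that
carries over to the real system is exactly what I'm asking.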

If there exists somewhere a reasonably succinct description of the
reasoning behind the current transaction management scheme (including
an analysis of the pros and cons), I'd love to read it and quit
bugging you.  :-)


--
Kevin Brown                          kevin@sysexperts.com
