Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?) - Mailing list pgsql-hackers

From Kevin Brown
Subject Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)
Date
Msg-id 20030125101111.GB12957@filer
In response to WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
>
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
>
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.

> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list.  A
replay just reads the header info for all the segments, orders the
list as it sees fit, and discards all segments prior to any gap it
finds (though it may be that you simply have to bail out if you find
a gap).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.
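
To make the idea concrete, here is a rough sketch of content-based
ordering.  Everything in it is an assumption for illustration only:
the header layout (a magic string, a sequence number, and a CRC-32 of
the payload) and the names make_segment, parse_segment and
replayable_segments are hypothetical, not the actual xlog format or
code.  Contiguous sequence numbers stand in for "picks up where the
previous segment left off".

```python
import struct
import zlib
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical header: 4-byte magic, 8-byte sequence number, CRC-32 of payload.
HEADER = struct.Struct("<4sQI")
MAGIC = b"XLOG"


class Segment(NamedTuple):
    seqno: int
    payload: bytes


def make_segment(seqno: int, payload: bytes) -> bytes:
    """Build a segment image: header followed by the payload."""
    return HEADER.pack(MAGIC, seqno, zlib.crc32(payload)) + payload


def parse_segment(raw: bytes) -> Optional[Segment]:
    """Return the segment, or None if the header or checksum is bad."""
    if len(raw) < HEADER.size:
        return None
    magic, seqno, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if magic != MAGIC or zlib.crc32(payload) != crc:
        return None
    return Segment(seqno, payload)


def replayable_segments(raw_segments: Iterable[bytes]) -> List[Segment]:
    """Order valid segments by header contents, ignoring file names,
    and keep only the contiguous run that follows the last gap."""
    valid = sorted(
        (s for s in map(parse_segment, raw_segments) if s is not None),
        key=lambda s: s.seqno,
    )
    run: List[Segment] = []
    for seg in valid:
        if run and seg.seqno != run[-1].seqno + 1:
            run = []  # gap (or duplicate): discard everything before it
        run.append(seg)
    return run
```

A corrupt segment fails its checksum and is dropped, which shows up as
a gap, so everything before it is discarded as well.  A stricter
policy would refuse to replay at all once a gap is seen.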

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
>
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you.  :-)
>
> Not that I know of.  Would you care to prepare such a writeup?  There
> is a lot of material in the source-code comments, but no coherent
> presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

    1.  PID file locking for postmaster startup (doesn't strictly need
        to be the PID file but it may as well be, since we're already
        messing with it anyway).  I'm currently looking at how to do
        the autoconf tests, since I've never developed using autoconf
        before.

    2.  Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering, but based on your recent comments I think you have a much
better handle on that than I do.  I'll be very interested to see what
you do, to see if it's anything close to what I figure has to happen...


--
Kevin Brown                          kevin@sysexperts.com
