Re: WAL replay logic (was Re: Mount options for Ext3?) - Mailing list pgsql-performance

From Bruce Momjian
Subject Re: WAL replay logic (was Re: Mount options for Ext3?)
Date
Msg-id 200302141430.h1EEUUr06607@candle.pha.pa.us
Whole thread Raw
In response to Re: WAL replay logic (was Re: Mount options for Ext3?)  (Kevin Brown <kevin@sysexperts.com>)
Responses Re: [HACKERS] WAL replay logic (was Re: Mount options for
List pgsql-performance
Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL?  Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed.  Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed.  But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary.  See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?  Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync().  But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off.  The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.  Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds.  It may be that you simply have to bail out if you find a gap,
> though).  As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you.  :-)
> >
> > Not that I know of.  Would you care to prepare such a writeup?  There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to.  Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
>     1.  PID file locking for postmaster startup (doesn't strictly need
>     to be the PID file but it may as well be, since we're already
>     messing with it anyway).  I'm currently looking at how to do
>     the autoconf tests, since I've never developed using autoconf
>     before.
>
>     2.  Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I.  I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown                          kevin@sysexperts.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

pgsql-performance by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Terrible performance on wide selects
Next
From: Tom Lane
Date:
Subject: Re: Tuning scenarios (was Changing the default configuration)