Thread: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Tom Lane

Date:

25 January 2003, 01:46:03

Kevin Brown <kevin@sysexperts.com> writes:
> One question I have is: in the event of a crash, why not simply replay
> all the transactions found in the WAL?  Is the startup time of the
> database that badly affected if pg_control is ignored?

Interesting thought, indeed.  Since we truncate the WAL after each
checkpoint, seems like this approach would no more than double the time
for restart.  The win is it'd eliminate pg_control as a single point of
failure.  It's always bothered me that we have to update pg_control on
every checkpoint --- it should be a write-pretty-darn-seldom file,
considering how critical it is.

I think we'd have to make some changes in the code for deleting old
WAL segments --- right now it's not careful to delete them in order.
But surely that can be coped with.

OTOH, this might just move the locus for fatal failures out of
pg_control and into the OS' algorithms for writing directory updates.
We would have no cross-check that the set of WAL file names visible in
pg_xlog is sensible or aligned with the true state of the datafile area.
We'd have to take it on faith that we should replay the visible files
in their name order.  This might mean we'd have to abandon the current
hack of recycling xlog segments by renaming them --- which would be a
nontrivial performance hit.

Comments anyone?

> If there exists somewhere a reasonably succinct description of the
> reasoning behind the current transaction management scheme (including
> an analysis of the pros and cons), I'd love to read it and quit
> bugging you.  :-)

Not that I know of.  Would you care to prepare such a writeup?  There
is a lot of material in the source-code comments, but no coherent
presentation.

            regards, tom lane

Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Curt Sampson

Date:

25 January 2003, 03:00:03

On Sat, 25 Jan 2003, Tom Lane wrote:

> We'd have to take it on faith that we should replay the visible files
> in their name order.

Couldn't you just put timestamp information at the beginning of
each file (or perhaps use that of the first transaction), and read the
beginning of each file to find out what order to run them in?  Perhaps
you could even check the last transaction in each file as well to see if
there are "holes" between the available logs.

> This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

Rename and write a "this is an empty logfile" record at the beginning?
Though I don't see how you could do this in an atomic manner.... Maybe if
you included the filename in the WAL file header, you'd see that if the name
doesn't match the header, it's a recycled file....

(This response sent only to hackers.)

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Kevin Brown

Date:

25 January 2003, 05:11:22

Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
>
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
>
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.

> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list (a
replay just reads the header info for all the segments and orders the
list as it sees fit, and discards all segments prior to any gap it
finds.  It may be that you simply have to bail out if you find a gap,
though).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.
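
A minimal sketch of the content-based ordering described above, written in
Python for brevity. Everything here is an assumption made for illustration:
the Segment type, the seq_no and checksum header fields, and the use of
CRC-32 are hypothetical and are not PostgreSQL's actual WAL format. It orders
segments by header contents, ignores file names, and discards everything
prior to a gap.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    file_name: str  # name on disk; deliberately ignored for ordering
    seq_no: int     # hypothetical header field: logical position in the log
    checksum: int   # hypothetical header field: CRC-32 of the payload
    payload: bytes


def make_segment(file_name: str, seq_no: int, payload: bytes) -> Segment:
    """Build a segment whose header checksum matches its payload."""
    return Segment(file_name, seq_no, zlib.crc32(payload), payload)


def replayable_segments(segments: List[Segment]) -> List[Segment]:
    """Return the segments that are safe to replay, in logical order."""
    # Drop segments whose contents do not match the recorded checksum.
    valid = [s for s in segments if zlib.crc32(s.payload) == s.checksum]
    valid.sort(key=lambda s: s.seq_no)

    # Keep only the newest contiguous run: a gap discards all earlier segments.
    run: List[Segment] = []
    for seg in valid:
        if run and seg.seq_no != run[-1].seq_no + 1:
            run = []
        run.append(seg)
    return run
```

Note that a segment failing its checksum is treated the same as a missing
one, so it also creates a gap. A stricter policy would bail out on any gap
instead of discarding the older segments.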

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
>
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you.  :-)
>
> Not that I know of.  Would you care to prepare such a writeup?  There
> is a lot of material in the source-code comments, but no coherent
> presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

    1.  PID file locking for postmaster startup (doesn't strictly need
    to be the PID file but it may as well be, since we're already
    messing with it anyway).  I'm currently looking at how to do
    the autoconf tests, since I've never developed using autoconf
    before.

    2.  Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering but based on your recent comments I think you have a much
better handle on that than I.  I'll be very interested to see what you
do, to see if it's anything close to what I figure has to happen...

--
Kevin Brown                          kevin@sysexperts.com

Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Tom Lane

Date:

25 January 2003, 11:16:24

Curt Sampson <cjs@cynic.net> writes:
> On Sat, 25 Jan 2003, Tom Lane wrote:
>> We'd have to take it on faith that we should replay the visible files
>> in their name order.

> Couldn't you just put timestamp information at the beginning of
> each file,

Good thought --- there's already an xlp_pageaddr field on every page
of WAL, and you could examine that to be sure it matches the file name.
If not, the file can be ignored.

            regards, tom lane
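
A rough sketch of the file-name cross-check suggested above, in Python. It
assumes 16 MB segments (the size mentioned later in this thread), assumes a
page address can be expressed as a (log id, byte offset) pair, and assumes
segment files are named with two 8-digit hexadecimal fields. The function
names are hypothetical, and the real xlp_pageaddr layout should be checked
against the source before relying on any of this.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed segment size: 16 MB


def expected_segment_name(log_id: int, byte_offset: int) -> str:
    """Name of the file that a page at (log_id, byte_offset) should live in."""
    # Assumed naming scheme: 8 hex digits of log id, 8 hex digits of segment.
    return "%08X%08X" % (log_id, byte_offset // SEGMENT_SIZE)


def first_page_matches(file_name: str, log_id: int, byte_offset: int) -> bool:
    """True if the first page's address agrees with the file's name."""
    # The first page of a segment must start exactly on a segment boundary.
    if byte_offset % SEGMENT_SIZE != 0:
        return False
    return expected_segment_name(log_id, byte_offset) == file_name.upper()
```

Under these assumptions, a segment that was recycled by renaming but not yet
rewritten still carries its old page address, so the check fails and the
file can be skipped during replay.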

Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Bruce Momjian

Date:

27 January 2003, 15:28:52

Is there a TODO here?  I like the idea of not writing pg_control, or
at least allowing it not to be read, perhaps with a pg_resetxlog flag so
we can cleanly recover from a corrupt pg_control if the WAL files
are OK.

We don't want to get rid of the WAL file rename optimization because
those are 16MB files and keeping them from checkpoint to checkpoint is
probably a win.  I also like the idea of allowing something between our
"at the instant" recovery, and no recovery with fsync off.  A "recover
from last checkpoint time" option would be really valuable for some.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From

Bruce Momjian

Date:

14 February 2003, 09:30:36

Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL replay logic (was Re: [PERFORM] Mount options for

From

Curt Sampson

Date:

15 February 2003, 03:29:24

On Fri, 14 Feb 2003, Bruce Momjian wrote:

> Is there a TODO here, like "Allow recovery from corrupt pg_control via
> WAL"?

Isn't that already in section 12.2.1 of the documentation?

     Using pg_control to get the checkpoint position speeds up the
     recovery process, but to handle possible corruption of pg_control,
     we should actually implement the reading of existing log segments
     in reverse order -- newest to oldest -- in order to find the last
     checkpoint. This has not been implemented, yet.
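
The reverse scan described in that passage could look roughly like the
following Python sketch. The record shape (a kind string plus a log
position) and the function name are hypothetical, and each segment is
assumed to have been parsed into a list of records already, which glosses
over how records would actually be located on disk.

```python
from typing import List, Optional, Tuple

# Hypothetical parsed record: (kind, position in the log).
Record = Tuple[str, int]


def find_last_checkpoint(segments: List[List[Record]]) -> Optional[int]:
    """Return the position of the newest checkpoint record, or None.

    `segments` is ordered oldest to newest; each segment lists its
    records in the order they were written.
    """
    for segment in reversed(segments):            # newest segment first
        for kind, position in reversed(segment):  # newest record first
            if kind == "checkpoint":
                return position
    return None
```

Scanning newest to oldest means the search usually stops within the last
segment or two, so the cost of ignoring pg_control stays small in the
common case.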

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: WAL replay logic (was Re: [PERFORM] Mount options for

From

Bruce Momjian

Date:

18 February 2003, 00:16:04

Added to TODO:

    * Allow WAL information to recover corrupted pg_controldata

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL replay logic (was Re: [PERFORM] Mount options for

From

Curt Sampson

Date:

18 February 2003, 00:26:47

On Tue, 18 Feb 2003, Bruce Momjian wrote:

>
> Added to TODO:
>
>     * Allow WAL information to recover corrupted pg_controldata
>...
> >      Using pg_control to get the checkpoint position speeds up the
> >      recovery process, but to handle possible corruption of pg_control,
> >      we should actually implement the reading of existing log segments
> >      in reverse order -- newest to oldest -- in order to find the last
> >      checkpoint. This has not been implemented, yet.

So if you do this, do you still need to store that information in
pg_control at all?

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: WAL replay logic (was Re: [PERFORM] Mount options for

From

Bruce Momjian

Date:

18 February 2003, 14:09:35

Uh, not sure.  Does it guard against corrupt WAL records?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073