Thread: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)
Kevin Brown <kevin@sysexperts.com> writes:
> One question I have is: in the event of a crash, why not simply replay
> all the transactions found in the WAL?  Is the startup time of the
> database that badly affected if pg_control is ignored?

Interesting thought, indeed.  Since we truncate the WAL after each
checkpoint, seems like this approach would no more than double the time
for restart.  The win is it'd eliminate pg_control as a single point of
failure.  It's always bothered me that we have to update pg_control on
every checkpoint --- it should be a write-pretty-darn-seldom file,
considering how critical it is.

I think we'd have to make some changes in the code for deleting old
WAL segments --- right now it's not careful to delete them in order.
But surely that can be coped with.

OTOH, this might just move the locus for fatal failures out of
pg_control and into the OS' algorithms for writing directory updates.
We would have no cross-check that the set of WAL file names visible in
pg_xlog is sensible or aligned with the true state of the datafile
area.  We'd have to take it on faith that we should replay the visible
files in their name order.  This might mean we'd have to abandon the
current hack of recycling xlog segments by renaming them --- which
would be a nontrivial performance hit.

Comments anyone?

> If there exists somewhere a reasonably succinct description of the
> reasoning behind the current transaction management scheme (including
> an analysis of the pros and cons), I'd love to read it and quit
> bugging you. :-)

Not that I know of.  Would you care to prepare such a writeup?  There
is a lot of material in the source-code comments, but no coherent
presentation.

			regards, tom lane
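The "replay the visible files in their name order" idea boils down to
trusting a directory listing.  A minimal standalone sketch of that ordering
step (not PostgreSQL code; it assumes segment names are plain hex strings
under a pg_xlog directory):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Crude filter: treat any all-hex-digit file name as a WAL segment. */
static int
looks_like_wal_segment(const char *name)
{
    size_t i, len = strlen(name);

    if (len < 16)
        return 0;
    for (i = 0; i < len; i++)
        if (!isxdigit((unsigned char) name[i]))
            return 0;
    return 1;
}

int
main(int argc, char **argv)
{
    const char     *dir = (argc > 1) ? argv[1] : "pg_xlog";
    struct dirent **entries;
    int             n, i;

    /* scandir() with alphasort yields the directory entries in name order. */
    n = scandir(dir, &entries, NULL, alphasort);
    if (n < 0)
    {
        perror("scandir");
        return 1;
    }

    for (i = 0; i < n; i++)
    {
        if (looks_like_wal_segment(entries[i]->d_name))
            printf("would replay: %s\n", entries[i]->d_name);
        free(entries[i]);
    }
    free(entries);
    return 0;
}

The fragility Tom points out is visible here: the plan is only as good as
the set of names the filesystem happens to show us.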
On Sat, 25 Jan 2003, Tom Lane wrote:

> We'd have to take it on faith that we should replay the visible files
> in their name order.

Couldn't you just put timestamp information at the beginning of each
file (or perhaps use that of the first transaction), and read the
beginning of each file to find out what order to run them in?  Perhaps
you could even check the last transaction in each file as well, to see
if there are "holes" between the available logs.

> This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

Rename and write a "this is an empty logfile" record at the beginning?
Though I don't see how you could do this in an atomic manner....
Maybe if you included the filename in the WAL file header, you'd see
that if the name doesn't match the header, it's a recycled file....

(This response sent only to hackers.)

cjs
-- 
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
  Don't you know, in this new Dark Age, we're all light.  --XTC
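A sketch of the name-in-the-header check Curt suggests.  The SegmentHeader
layout is hypothetical, invented for illustration (real WAL pages carry
different fields); the point is only that a recycled-but-never-rewritten
segment betrays itself when the name stored inside it no longer matches the
name it was renamed to:

#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size segment header, written when a segment is first used. */
typedef struct SegmentHeader
{
    char        seg_name[25];   /* name the segment was created under, NUL-terminated */
    long long   start_time;     /* timestamp of first record, per Curt's suggestion */
} SegmentHeader;

/*
 * Decide whether a segment's contents belong to the file name it currently
 * has.  A mismatch means the file was recycled (renamed) but never rewritten,
 * so its contents must not be replayed under the new name.
 */
static int
segment_matches_name(const SegmentHeader *hdr, const char *current_name)
{
    return strcmp(hdr->seg_name, current_name) == 0;
}

int
main(void)
{
    SegmentHeader hdr;

    memset(&hdr, 0, sizeof(hdr));
    strcpy(hdr.seg_name, "000000010000000000000007");
    hdr.start_time = 1043500000;

    /* The file was later recycled by renaming it to ...00000000C. */
    if (!segment_matches_name(&hdr, "00000001000000000000000C"))
        printf("recycled segment: header says %s, skip it\n", hdr.seg_name);
    return 0;
}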
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
>
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
>
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.

> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list (a
replay just reads the header info for all the segments and orders the
list as it sees fit, and discards all segments prior to any gap it
finds.  It may be that you simply have to bail out if you find a gap,
though).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
>
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you. :-)
>
> Not that I know of.  Would you care to prepare such a writeup?  There
> is a lot of material in the source-code comments, but no coherent
> presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

1.  PID file locking for postmaster startup (doesn't strictly need
    to be the PID file but it may as well be, since we're already
    messing with it anyway).  I'm currently looking at how to do
    the autoconf tests, since I've never developed using autoconf
    before.

2.  Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering but based on your recent comments I think you have a much
better handle on that than I.  I'll be very interested to see what you
do, to see if it's anything close to what I figure has to happen...


-- 
Kevin Brown						  kevin@sysexperts.com
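A sketch of the content-driven ordering Kevin proposes: trust what the
segments say about themselves rather than their file names.  The SegHeader
layout and the replay_plan() helper are hypothetical, for illustration only;
real segment parsing and checksum verification are omitted:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-segment header: where the segment starts and a checksum. */
typedef struct SegHeader
{
    unsigned long long start_offset;    /* byte position in the overall WAL stream */
    unsigned long long end_offset;      /* position the segment runs up to */
    unsigned long      checksum;        /* covers the segment's contents */
} SegHeader;

static int
cmp_by_start(const void *a, const void *b)
{
    const SegHeader *sa = a;
    const SegHeader *sb = b;

    if (sa->start_offset < sb->start_offset) return -1;
    if (sa->start_offset > sb->start_offset) return 1;
    return 0;
}

/*
 * Order segments by their own headers and return how many form an unbroken
 * chain from the oldest one; replay stops at the first gap.  (Checksum
 * verification against each segment body is assumed to have happened already.)
 */
static int
replay_plan(SegHeader *segs, int nsegs)
{
    int i;

    if (nsegs <= 0)
        return 0;
    qsort(segs, nsegs, sizeof(SegHeader), cmp_by_start);
    for (i = 1; i < nsegs; i++)
    {
        if (segs[i].start_offset != segs[i - 1].end_offset)
            break;              /* hole in the WAL: don't replay past it */
    }
    return i;
}

int
main(void)
{
    SegHeader segs[] = {
        { 32, 48, 0 },
        {  0, 16, 0 },
        { 16, 32, 0 },
        { 64, 80, 0 },          /* unreachable: 48..64 is missing */
    };
    int usable = replay_plan(segs, 4);

    printf("replayable segments: %d of 4\n", usable);
    return 0;
}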
Curt Sampson <cjs@cynic.net> writes:
> On Sat, 25 Jan 2003, Tom Lane wrote:
>> We'd have to take it on faith that we should replay the visible files
>> in their name order.

> Couldn't you just put timestamp information at the beginning of
> each file,

Good thought --- there's already an xlp_pageaddr field on every page of
WAL, and you could examine that to be sure it matches the file name.
If not, the file can be ignored.

			regards, tom lane
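A sketch of the cross-check Tom describes: derive the address a segment's
first page ought to carry from its file name and compare it with the
xlp_pageaddr read from the page header.  The structures and naming scheme are
simplified here (two-part hex names, 16MB segments), so treat it as an
illustration rather than the backend's actual validation code:

#include <stdio.h>
#include <stdint.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)    /* 16MB segments */

/* Simplified WAL location: which 4GB log file, and the byte offset within it. */
typedef struct XLogRecPtrSketch
{
    uint32_t    xlogid;
    uint32_t    xrecoff;
} XLogRecPtrSketch;

/*
 * Given a segment file name like "0000002B00000004", compute the page address
 * its first page should have, and compare with the xlp_pageaddr actually read
 * from the file.  A recycled-by-rename segment that was never overwritten
 * fails this test and can be ignored during replay.
 */
static int
page_addr_matches_name(const char *fname, XLogRecPtrSketch pageaddr)
{
    unsigned int log, seg;

    if (sscanf(fname, "%8X%8X", &log, &seg) != 2)
        return 0;
    return pageaddr.xlogid == log &&
           pageaddr.xrecoff == (uint32_t) (seg * XLOG_SEG_SIZE);
}

int
main(void)
{
    XLogRecPtrSketch stale = { 0x2B, 3 * XLOG_SEG_SIZE };   /* written as segment 3 */

    /* The file has since been renamed to segment 4 for recycling. */
    if (!page_addr_matches_name("0000002B00000004", stale))
        printf("xlp_pageaddr does not match file name: recycled segment, skip\n");
    return 0;
}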
Is there a TODO here?  I like the idea of not writing pg_controldata,
or at least allowing it not to be read, perhaps with a pg_resetxlog
flag so we can cleanly recover from a corrupt pg_controldata if the
WAL files are OK.

We don't want to get rid of the WAL file rename optimization, because
those are 16MB files and keeping them from checkpoint to checkpoint is
probably a win.

I also like the idea of allowing something between our "at the
instant" recovery and no recovery with fsync off.  A "recover from
last checkpoint time" option would be really valuable for some.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.
> [...]
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
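A rough illustration of why the rename optimization Bruce wants to keep is
worth keeping: creating a fresh segment means pushing 16MB of zeroes through
the filesystem, while recycling an old one is a single rename().  The file
names and timing harness below are just for demonstration, and the numbers
are indicative only (no fsync is issued, so filesystem caching applies):

#include <stdio.h>
#include <sys/time.h>

#define SEG_SIZE (16 * 1024 * 1024)

static double
now_seconds(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Create a new zero-filled segment the expensive way. */
static int
create_segment(const char *path)
{
    static char zeroes[8192];
    FILE       *f = fopen(path, "wb");
    long        written = 0;

    if (f == NULL)
        return -1;
    while (written < SEG_SIZE)
    {
        if (fwrite(zeroes, 1, sizeof(zeroes), f) != sizeof(zeroes))
        {
            fclose(f);
            return -1;
        }
        written += (long) sizeof(zeroes);
    }
    return fclose(f);
}

int
main(void)
{
    double      t0 = now_seconds(), t1, t2;

    if (create_segment("seg_old") != 0)
    {
        perror("create_segment");
        return 1;
    }
    t1 = now_seconds();

    /* Recycling: the old segment simply becomes the next one via rename(). */
    if (rename("seg_old", "seg_new") != 0)
    {
        perror("rename");
        return 1;
    }
    t2 = now_seconds();

    printf("create 16MB segment: %.4f s\n", t1 - t0);
    printf("recycle by rename:   %.4f s\n", t2 - t1);
    remove("seg_new");
    return 0;
}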
Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> [...]
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 14 Feb 2003, Bruce Momjian wrote:

> Is there a TODO here, like "Allow recovery from corrupt pg_control via
> WAL"?

Isn't that already in section 12.2.1 of the documentation?

    Using pg_control to get the checkpoint position speeds up the
    recovery process, but to handle possible corruption of pg_control,
    we should actually implement the reading of existing log segments
    in reverse order -- newest to oldest -- in order to find the last
    checkpoint.  This has not been implemented, yet.

cjs
-- 
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
  Don't you know, in this new Dark Age, we're all light.  --XTC
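A sketch of the documented-but-unimplemented idea in that passage: ignore
pg_control and find the last checkpoint by walking the available segments
newest to oldest.  The record model here is a toy (an in-memory array of
tagged records); all the real work of parsing WAL is omitted:

#include <stdio.h>

typedef enum
{
    REC_HEAP_CHANGE,
    REC_COMMIT,
    REC_CHECKPOINT
} RecType;

typedef struct WalRecord
{
    RecType     type;
    int         segment;        /* which segment the record lives in */
    int         offset;         /* position within that segment */
} WalRecord;

/*
 * Scan segments newest-to-oldest; within the first segment that contains a
 * checkpoint, remember the last checkpoint seen.  Returns the index of that
 * record, or -1 if no checkpoint exists in the available WAL.
 */
static int
find_last_checkpoint(const WalRecord *recs, int nrecs, int newest_seg)
{
    int seg, i, found;

    for (seg = newest_seg; seg >= 0; seg--)
    {
        found = -1;
        for (i = 0; i < nrecs; i++)
            if (recs[i].segment == seg && recs[i].type == REC_CHECKPOINT)
                found = i;      /* keep the latest one in this segment */
        if (found >= 0)
            return found;
    }
    return -1;
}

int
main(void)
{
    WalRecord wal[] = {
        { REC_CHECKPOINT, 0, 0 },
        { REC_HEAP_CHANGE, 0, 100 },
        { REC_CHECKPOINT, 1, 0 },
        { REC_HEAP_CHANGE, 1, 80 },
        { REC_COMMIT, 2, 40 },          /* newest segment has no checkpoint */
    };
    int idx = find_last_checkpoint(wal, 5, 2);

    if (idx >= 0)
        printf("redo would start from checkpoint at seg %d, offset %d\n",
               wal[idx].segment, wal[idx].offset);
    else
        printf("no checkpoint found; cannot recover without pg_control\n");
    return 0;
}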
Added to TODO:

	* Allow WAL information to recover corrupted pg_controldata

---------------------------------------------------------------------------

Curt Sampson wrote:
> On Fri, 14 Feb 2003, Bruce Momjian wrote:
>
> > Is there a TODO here, like "Allow recovery from corrupt pg_control via
> > WAL"?
>
> Isn't that already in section 12.2.1 of the documentation?
>
>     Using pg_control to get the checkpoint position speeds up the
>     recovery process, but to handle possible corruption of pg_control,
>     we should actually implement the reading of existing log segments
>     in reverse order -- newest to oldest -- in order to find the last
>     checkpoint.  This has not been implemented, yet.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Tue, 18 Feb 2003, Bruce Momjian wrote:

> Added to TODO:
>
> 	* Allow WAL information to recover corrupted pg_controldata
> ...
> > Using pg_control to get the checkpoint position speeds up the
> > recovery process, but to handle possible corruption of pg_control,
> > we should actually implement the reading of existing log segments
> > in reverse order -- newest to oldest -- in order to find the last
> > checkpoint.  This has not been implemented, yet.

So if you do this, do you still need to store that information in
pg_control at all?

cjs
-- 
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
  Don't you know, in this new Dark Age, we're all light.  --XTC
Uh, not sure.  Does it guard against corrupt WAL records?

---------------------------------------------------------------------------

Curt Sampson wrote:
> On Tue, 18 Feb 2003, Bruce Momjian wrote:
>
> > Added to TODO:
> >
> > 	* Allow WAL information to recover corrupted pg_controldata
>
> So if you do this, do you still need to store that information in
> pg_control at all?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
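Bruce's question is about per-record validation: if each WAL record carries
its own CRC, replay can stop cleanly at the first torn or corrupt record
instead of relying on pg_control to say where valid WAL ends.  A hedged
sketch of that kind of check follows; the record layout and CRC routine are
simplified stand-ins, not PostgreSQL's actual WAL format:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Simple bitwise CRC-32 (IEEE polynomial); a stand-in for the real WAL CRC. */
static uint32_t
crc32_calc(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t   i;
    int      b;

    for (i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Toy WAL record: payload guarded by a CRC computed when it was written. */
typedef struct ToyRecord
{
    uint32_t        crc;
    unsigned char   payload[32];
} ToyRecord;

/* Replay stops at the first record whose stored CRC does not match. */
static int
record_is_valid(const ToyRecord *rec)
{
    return crc32_calc(rec->payload, sizeof(rec->payload)) == rec->crc;
}

int
main(void)
{
    ToyRecord rec;

    memset(&rec, 0, sizeof(rec));
    strcpy((char *) rec.payload, "INSERT INTO t VALUES (1)");
    rec.crc = crc32_calc(rec.payload, sizeof(rec.payload));

    printf("intact record valid: %s\n", record_is_valid(&rec) ? "yes" : "no");

    rec.payload[3] ^= 0xFF;     /* simulate on-disk corruption */
    printf("corrupt record valid: %s\n", record_is_valid(&rec) ? "yes" : "no");
    return 0;
}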