Re: WAL replay logic (was Re: Mount options for Ext3?) - Mailing list pgsql-performance
From | Bruce Momjian |
---|---|
Subject | Re: WAL replay logic (was Re: Mount options for Ext3?) |
Date | |
Msg-id | 200302141430.h1EEUUr06607@candle.pha.pa.us |
In response to | Re: WAL replay logic (was Re: Mount options for Ext3?) (Kevin Brown <kevin@sysexperts.com>) |
Responses |
Re: [HACKERS] WAL replay logic (was Re: Mount options for
|
List | pgsql-performance |
Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL? Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed. Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed. But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure. It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary. See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right? Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync(). But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off. The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order. This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files. Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs. Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds. It may be that you simply have to bail out if you find a gap,
> though). As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead? Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you. :-)
> >
> > Not that I know of. Would you care to prepare such a writeup? There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to. Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
> 1. PID file locking for postmaster startup (doesn't strictly need
>    to be the PID file but it may as well be, since we're already
>    messing with it anyway). I'm currently looking at how to do
>    the autoconf tests, since I've never developed using autoconf
>    before.
>
> 2. Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I. I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown kevin@sysexperts.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
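
For readers following the thread, Kevin's proposal (order replay by what each segment's header says rather than by filename, reject segments whose checksum does not match, and discard everything before a gap) might be sketched roughly as below. This is only an illustration under stated assumptions: the header layout (`MAGIC`, a sequence number, a CRC32 of the payload) and the names `make_segment`, `parse_segment`, and `replay_order` are hypothetical and are not PostgreSQL's actual xlog format or code.

```python
import struct
import zlib
from typing import List, NamedTuple, Optional

# Hypothetical header: 4-byte magic, 8-byte sequence number, 4-byte CRC32
# of the payload. Not the real PostgreSQL xlog segment layout.
HEADER = struct.Struct("<4sQI")
MAGIC = b"XLOG"


class Segment(NamedTuple):
    seq: int
    payload: bytes


def make_segment(seq: int, payload: bytes) -> bytes:
    """Build a segment image whose header records its position and checksum."""
    return HEADER.pack(MAGIC, seq, zlib.crc32(payload)) + payload


def parse_segment(raw: bytes) -> Optional[Segment]:
    """Return the segment if its header and checksum are consistent, else None."""
    if len(raw) < HEADER.size:
        return None
    magic, seq, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if magic != MAGIC or zlib.crc32(payload) != crc:
        return None
    return Segment(seq, payload)


def replay_order(raw_segments: List[bytes]) -> List[Segment]:
    """Order segments by header contents and drop everything before a gap.

    Filenames play no role here, so a renamed (recycled) segment file is
    still placed correctly as long as its header is intact.
    """
    valid = sorted(s for s in map(parse_segment, raw_segments) if s is not None)
    run: List[Segment] = []
    # Walk back from the newest segment and stop at the first gap.
    for seg in reversed(valid):
        if run and seg.seq != run[-1].seq - 1:
            break
        run.append(seg)
    run.reverse()
    return run
```

Given segments 3, 4, 6, 7 plus a corrupted segment 5, this sketch would replay only 6 and 7. As Kevin notes, a real implementation might instead have to bail out when it finds a gap, and it would still need some way to confirm that the surviving run reaches back far enough to cover the datafile area.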