Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?) - Mailing list pgsql-hackers
From | Kevin Brown |
---|---|
Subject | Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?) |
Date | |
Msg-id | 20030125101111.GB12957@filer |
In response to | WAL replay logic (was Re: [PERFORM] Mount options for Ext3?) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)
Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?) |
List | pgsql-hackers |
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
>
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
>
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.

> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list (a
replay just reads the header info for all the segments and orders the
list as it sees fit, and discards all segments prior to any gap it
finds.  It may be that you simply have to bail out if you find a gap,
though).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
>
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you.  :-)
>
> Not that I know of.  Would you care to prepare such a writeup?  There
> is a lot of material in the source-code comments, but no coherent
> presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

1. PID file locking for postmaster startup (doesn't strictly need
   to be the PID file but it may as well be, since we're already
   messing with it anyway).  I'm currently looking at how to do
   the autoconf tests, since I've never developed using autoconf
   before.

2. Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering but based on your recent comments I think you have a much
better handle on that than I.  I'll be very interested to see what you
do, to see if it's anything close to what I figure has to happen...

--
Kevin Brown					      kevin@sysexperts.com
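
The header-based replay ordering proposed in the message above can be
sketched roughly as follows.  This is a minimal illustration in Python,
not PostgreSQL code: the Segment fields (seq, checksum), the choice of
CRC-32 as the "not strict" checksum, and the function names are all
assumptions made for the example.  It implements the "discard everything
prior to a gap" variant; bailing out on a gap would be the stricter
alternative the message also mentions.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # hypothetical logical position, read from the segment header
    checksum: int   # hypothetical CRC-32 of the payload, stored in the header
    payload: bytes  # the WAL records themselves (opaque here)


def make_segment(seq: int, payload: bytes) -> Segment:
    """Build a segment whose header checksum matches its payload."""
    return Segment(seq, zlib.crc32(payload), payload)


def is_intact(seg: Segment) -> bool:
    """A segment is usable only if its stored checksum matches its contents."""
    return zlib.crc32(seg.payload) == seg.checksum


def replay_order(segments: list[Segment]) -> list[Segment]:
    """Order segments by header contents, never by file name.

    Corrupt segments are dropped, then only the newest contiguous run of
    sequence numbers is kept: everything prior to a gap is discarded.
    """
    valid = sorted((s for s in segments if is_intact(s)), key=lambda s: s.seq)
    run: list[Segment] = []
    for seg in valid:
        if run and seg.seq != run[-1].seq + 1:
            run = []  # gap found: discard all segments before it
        run.append(seg)
    return run


# Segments arrive in arbitrary (e.g. directory) order; 2 is a stale leftover.
segments = [make_segment(7, b"c"), make_segment(5, b"a"),
            make_segment(6, b"b"), make_segment(2, b"old")]
print([s.seq for s in replay_order(segments)])  # [5, 6, 7]
```

Because ordering depends only on what is inside each segment, a recycled
or renamed file cannot be replayed out of place; the cost is reading
every segment's header (and, here, its full payload for the checksum)
at startup.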