Thread: Mount options for Ext3?
Folks,

What mount options do people use for Ext3, particularly what do you
set "data=" to for a high-transaction database?  I'm used to ReiserFS
("noatime, notail") and am not really sure where to go with Ext3.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Josh Berkus wrote:
> Folks,
> 
> What mount options to people use for Ext3, particularly what do you set "data
> = " for a high-transaction database?  I'm used to ReiserFS ("noatime,
> notail") and am not really sure where to go with Ext3.

For ReiserFS, I can certainly understand using "noatime", but I'm not
sure why you use "notail" except to allow LILO to operate properly on
it.

The default for ext3 is to do ordered writes: data is written before
the associated metadata transaction commits, but the data itself isn't
journalled.  But because PostgreSQL synchronously writes the
transaction log (using fsync() by default, if I'm not mistaken) and
uses sync() during a savepoint, I would think that ordered writes at
the filesystem level would probably buy you very little in the way of
additional data integrity in the event of a crash.

So if I'm right about that, then you might consider using the
"data=writeback" option for the filesystem that contains the actual
data (usually /usr/local/pgsql/data), but I'd use the default
("data=ordered") at the very least for everything else (I suppose
there's no harm in using "data=journal" if you're willing to put up
with the performance hit, but it's not clear to me what benefit, if
any, there is).

I use ReiserFS also, so I'm basing the above on what knowledge I have
of the ext3 filesystem and the way PostgreSQL writes data.

The more interesting question in my mind is: if you use PostgreSQL on
an ext3 filesystem with "data=ordered" or "data=journal", can you get
away with turning off PostgreSQL's fsync altogether and still get the
same kind of data integrity that you'd get with fsync enabled?  If the
operating system is able to guarantee data integrity, is it still
necessary to worry about it at the database level?

I suspect the answer to that is that you can safely turn off fsync
only if the operating system will guarantee that write transactions
from a process are actually committed in the order they arrive from
that process.
Otherwise you'd have to worry about write transactions to the
transaction log committing before the writes to the data files during
a savepoint, which would leave the overall database in an inconsistent
state if the system were to crash after the transaction log write
(which marks the savepoint as completed) committed but before the data
file writes committed.  And my suspicion is that the operating system
rarely makes any such guarantee, journalled filesystem or not.

-- 
Kevin Brown						kevin@sysexperts.com
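The split suggested above would look something like this in /etc/fstab
(the device names and the second mount point are placeholders, not
taken from the thread):

```
# writeback for the PostgreSQL data area only; ordered mode elsewhere
/dev/sda5   /usr/local/pgsql/data   ext3   noatime,data=writeback   1 2
/dev/sda6   /var                    ext3   noatime,data=ordered     1 2
```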
Kevin Brown <kevin@sysexperts.com> writes:
> I suspect the answer to that is that you can safely turn off fsync
> only if the operating system will guarantee that write transactions
> from a process are actually committed in the order they arrive from
> that process.

Yeah.  We use fsync partly so that when we tell a client a transaction
is committed, it really is committed (ie, down to disk) --- but also
as a means of controlling write order.  I strongly doubt that any
modern filesystem will promise to execute writes exactly in the order
issued, unless prodded by means such as fsync.

> Otherwise you'd have to worry about write transactions
> to the transaction log committing before the writes to the data files
> during a savepoint,

Actually, the other way around is the problem.  The WAL algorithm
works so long as log writes hit disk before the data-file changes they
describe (that's why it's called write *ahead* log).

			regards, tom lane
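The write-ahead discipline described here is easy to sketch.  This is
an illustration of fsync() used as an ordering barrier, not
PostgreSQL's actual code; durable_append() and the record formats are
made up:

```python
import os
import tempfile

def durable_append(path, data):
    """Append data and don't return until it has hit stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # barrier: later writes cannot reach disk before this one
    finally:
        os.close(fd)

tmpdir = tempfile.mkdtemp()
wal = os.path.join(tmpdir, "wal")
table = os.path.join(tmpdir, "table")

# Write *ahead* log: the log record is forced to disk before the
# data-file change it describes is even issued.
durable_append(wal, b"UPDATE page 7: old=A new=B\n")
durable_append(table, b"B")
```

Without the fsync() in between, the OS would be free to commit the
data-file write before the log record, which is exactly the reordering
being worried about here.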
Tom Lane wrote:
> > Otherwise you'd have to worry about write transactions
> > to the transaction log committing before the writes to the data files
> > during a savepoint,
> 
> Actually, the other way around is the problem.  The WAL algorithm works
> so long as log writes hit disk before the data-file changes they
> describe (that's why it's called write *ahead* log).

Hmm...a case where the transaction data gets written to the files
before the transaction itself even manages to get written to the log?
True.  But I was thinking about the following:

I was presuming that when a savepoint occurs, a marker is written to
the log indicating which transactions had been committed to the data
files, and that this marker was paid attention to during database
startup.

So suppose the marker makes it to the log but not all of the data the
marker refers to makes it to the data files.  Then the system crashes.
When the database starts back up, the savepoint marker in the
transaction log shows that the transactions had already been committed
to disk.  But because the OS wrote the requested data (including the
savepoint marker) out of order, the savepoint marker made it to the
disk before some of the data made it to the data files.  And so, the
database is in an inconsistent state and it has no way to know about
it.

But then, I guess the easy way around the above problem is to always
commit all the transactions in the log to disk when the database comes
up, which renders the savepoint marker moot...and leads back to the
scenario you were referring to...

If the savepoint only commits the older transactions in the log (and
not all of them) to disk, the possibility of the situation you're
referring to would, I'd think, be reduced (possibly quite
considerably).

...or is my understanding of how all this works completely off?

-- 
Kevin Brown						kevin@sysexperts.com
Kevin,

> So if I'm right about that, then you might consider using the
> "data=writeback" option for the filesystem that contains the actual
> data (usually /usr/local/pgsql/data), but I'd use the default
> ("data=ordered") at the very least (I suppose there's no harm in using
> "data=journal" if you're willing to put up with the performance hit,
> but it's not clear to me what benefit, if any, there is) for
> everything else.

Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's
on restart after a crash.  So I'm interested in the data option that
gives the minimum performance hit, even if it means that I sacrifice
some reliability.  I'm running with fsync on, and the DB is on a
mirrored drive array, so I'm not too worried about filesystem-level
errors.

So would that be "data=writeback"?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Josh Berkus wrote:
> Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's on
> restart after a crash.  So I'm interested in the data option that gives the
> minimum performance hit, even if it means that I sacrifice some reliability.
> I'm running with fsynch on, and the DB is on a mirrored drive array, so I'm
> not too worried about filesystem-level errors.
> 
> So would that be "data=writeback"?

Yes.  That should give almost the same semantics as ext2 does by
default, except that metadata is journalled, so no fsck needed.  :-)

In fact, I believe that's exactly how ReiserFS works, if I'm not
mistaken (I saw someone claim that it does data journalling, but I've
never seen any references to how to get ReiserFS to journal data).

BTW, why exactly are you running ext3?  It has some nice journalling
features but it sounds like you don't want to use them.  But at the
same time, it uses pre-allocated inodes just like ext2 does, so it's
possible to run out of inodes on ext2/3 while AFAIK that's not
possible under ReiserFS.  That's not likely to be a problem unless
you're running a news server or something, though.  :-)

On the other hand, ext3 with data=writeback will probably be faster
than ReiserFS for a number of things.

No idea how stable ext3 is versus ReiserFS...

-- 
Kevin Brown						kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.

Not quite.  The marker says that all datafile updates described by
log entries before point X have been flushed to disk by the checkpoint
--- and, therefore, if we need to restart we need only replay log
entries occurring after the last checkpoint's point X.

This has nothing directly to do with which transactions are committed
or not committed.  If we based checkpoint behavior on that, we'd need
to maintain an indefinitely large amount of WAL log to cope with
long-running transactions.

The actual checkpoint algorithm is

	take note of current logical end of WAL (this will be point X)
	write() all dirty buffers in shared buffer arena
	sync() to ensure that above writes, as well as previous ones,
		are on disk
	put checkpoint record referencing point X into WAL; write and
		fsync WAL
	update pg_control with new checkpoint record, fsync it

Since pg_control is what's examined after restart, the checkpoint is
effectively committed when the pg_control write hits disk.  At any
instant before that, a crash would result in replaying from the
prior checkpoint's point X.  The algorithm is correct if and only if
the pg_control write hits disk after all the other writes mentioned.

The key assumption we are making about the filesystem's behavior is
that writes scheduled by the sync() will occur before the pg_control
write that's issued after it.  People have occasionally faulted this
algorithm by quoting the sync() man page, which saith (in the Gospel
According To HP)

	The writing, although scheduled, is not necessarily complete
	upon return from sync.

This, however, is not a problem in itself.  What we need to know is
whether the filesystem will allow writes issued after the sync() to
complete before those "scheduled" by the sync().
> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files.  Then the system crashes.

I think that this analysis is not relevant to what we're doing.

			regards, tom lane
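The five checkpoint steps listed above can be sketched directly.
Everything here (file layout, record formats, the checkpoint()
function itself) is invented for illustration; only the ordering of
the operations is the point:

```python
import os

def checkpoint(wal_fd, control_path, dirty_buffers):
    """Illustrative checkpoint sequence; not PostgreSQL's actual code.
    dirty_buffers is a list of (data_fd, offset, page_bytes) tuples."""
    # 1. take note of current logical end of WAL (this will be point X)
    point_x = os.lseek(wal_fd, 0, os.SEEK_END)

    # 2. write() all dirty buffers in the shared buffer arena
    for data_fd, offset, page in dirty_buffers:
        os.pwrite(data_fd, page, offset)

    # 3. sync() so those writes, as well as previous ones, go to disk
    os.sync()

    # 4. put a checkpoint record referencing point X into WAL; fsync WAL
    os.write(wal_fd, b"CHECKPOINT at %d\n" % point_x)
    os.fsync(wal_fd)

    # 5. update pg_control with the new checkpoint location, fsync it;
    #    only when this write hits disk is the checkpoint committed
    ctl_fd = os.open(control_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(ctl_fd, b"last_checkpoint_wal_offset=%d\n" % point_x)
        os.fsync(ctl_fd)
    finally:
        os.close(ctl_fd)
    return point_x
```

The correctness argument is entirely about ordering: step 5's write
must reach disk after everything in steps 2-4, which is exactly where
the sync()-vs-later-writes question above comes in.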
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I was presuming that when a savepoint occurs, a marker is written to
> > the log indicating which transactions had been committed to the data
> > files, and that this marker was paid attention to during database
> > startup.
> 
> Not quite.  The marker says that all datafile updates described by
> log entries before point X have been flushed to disk by the checkpoint
> --- and, therefore, if we need to restart we need only replay log
> entries occurring after the last checkpoint's point X.
> 
> This has nothing directly to do with which transactions are committed
> or not committed.  If we based checkpoint behavior on that, we'd need
> to maintain an indefinitely large amount of WAL log to cope with
> long-running transactions.

Ah.  My apologies for my imprecise wording.  I should have said
"...indicating which transactions had been written to the data files"
instead of "...had been committed to the data files", and meant to say
"checkpoint" but instead said "savepoint".  I'll try to do better
here.

> The actual checkpoint algorithm is
> 
> 	take note of current logical end of WAL (this will be point X)
> 	write() all dirty buffers in shared buffer arena
> 	sync() to ensure that above writes, as well as previous ones,
> 		are on disk
> 	put checkpoint record referencing point X into WAL; write and
> 		fsync WAL
> 	update pg_control with new checkpoint record, fsync it
> 
> Since pg_control is what's examined after restart, the checkpoint is
> effectively committed when the pg_control write hits disk.  At any
> instant before that, a crash would result in replaying from the
> prior checkpoint's point X.  The algorithm is correct if and only if
> the pg_control write hits disk after all the other writes mentioned.

[...]

> > So suppose the marker makes it to the log but not all of the data the
> > marker refers to makes it to the data files.  Then the system crashes.
> I think that this analysis is not relevant to what we're doing.

Agreed.  The context of that analysis is when synchronous writes by
the database are turned off and one is left to rely on the operating
system to do the right thing.  Clearly it doesn't apply when
synchronous writes are enabled.

As long as only one process handles a checkpoint, an operating system
that guarantees that a process' writes are committed to disk in the
same order that they were requested, combined with a journalling
filesystem that at least wrote all data prior to committing the
associated metadata transactions, would be sufficient to guarantee the
integrity of the database even if all synchronous writes by the
database were turned off.  This would hold even if the operating
system reordered writes from multiple processes.  It suggests an
operating system feature that could be considered highly desirable
(and relates to the discussion elsewhere about trading off shared
buffers against OS file cache: it's often better to rely on the
abilities of the OS rather than roll your own mechanism).

One question I have is: in the event of a crash, why not simply replay
all the transactions found in the WAL?  Is the startup time of the
database that badly affected if pg_control is ignored?

If there exists somewhere a reasonably succinct description of the
reasoning behind the current transaction management scheme (including
an analysis of the pros and cons), I'd love to read it and quit
bugging you.  :-)

-- 
Kevin Brown						kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> One question I have is: in the event of a crash, why not simply replay
> all the transactions found in the WAL?  Is the startup time of the
> database that badly affected if pg_control is ignored?

Interesting thought, indeed.  Since we truncate the WAL after each
checkpoint, seems like this approach would no more than double the
time for restart.  The win is it'd eliminate pg_control as a single
point of failure.  It's always bothered me that we have to update
pg_control on every checkpoint --- it should be a
write-pretty-darn-seldom file, considering how critical it is.

I think we'd have to make some changes in the code for deleting old
WAL segments --- right now it's not careful to delete them in order.
But surely that can be coped with.

OTOH, this might just move the locus for fatal failures out of
pg_control and into the OS' algorithms for writing directory updates.
We would have no cross-check that the set of WAL file names visible in
pg_xlog is sensible or aligned with the true state of the datafile
area.  We'd have to take it on faith that we should replay the visible
files in their name order.  This might mean we'd have to abandon the
current hack of recycling xlog segments by renaming them --- which
would be a nontrivial performance hit.

Comments anyone?

> If there exists somewhere a reasonably succinct description of the
> reasoning behind the current transaction management scheme (including
> an analysis of the pros and cons), I'd love to read it and quit
> bugging you.  :-)

Not that I know of.  Would you care to prepare such a writeup?  There
is a lot of material in the source-code comments, but no coherent
presentation.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
> 
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
> 
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.
> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list (a
replay just reads the header info for all the segments and orders the
list as it sees fit, and discards all segments prior to any gap it
finds.  It may be that you simply have to bail out if you find a gap,
though).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
> 
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you.  :-)
> 
> Not that I know of.  Would you care to prepare such a writeup?
> There is a lot of material in the source-code comments, but no
> coherent presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

	1.  PID file locking for postmaster startup (doesn't strictly
	    need to be the PID file but it may as well be, since we're
	    already messing with it anyway).  I'm currently looking at
	    how to do the autoconf tests, since I've never developed
	    using autoconf before.

	2.  Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering but based on your recent comments I think you have a much
better handle on that than I.  I'll be very interested to see what you
do, to see if it's anything close to what I figure has to happen...

-- 
Kevin Brown						kevin@sysexperts.com
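The header-plus-checksum scheme proposed above is simple to sketch.
The segment format below (a sequence number plus a CRC32 over the
payload) is purely illustrative, and the "stop at the first gap"
policy follows the bail-out reading of the proposal; none of this is
PostgreSQL's actual xlog format:

```python
import struct
import zlib

SEG_HEADER = struct.Struct("<QI")  # (sequence number, CRC32 of payload)

def make_segment(seq, payload):
    """Build a segment: header identifying its place in the WAL, then data."""
    return SEG_HEADER.pack(seq, zlib.crc32(payload)) + payload

def replayable_segments(raw_segments):
    """Order segments by the sequence number in their headers, drop any
    whose checksum doesn't match (likely corruption), and stop at the
    first gap, since nothing past a missing segment can be replayed
    safely."""
    good = []
    for raw in raw_segments:
        seq, crc = SEG_HEADER.unpack(raw[:SEG_HEADER.size])
        payload = raw[SEG_HEADER.size:]
        if zlib.crc32(payload) == crc:  # reasonably likely to catch corruption
            good.append((seq, payload))
    good.sort()
    out = []
    for seq, payload in good:
        if out and seq != out[-1][0] + 1:
            break  # gap found: bail out rather than replay past it
        out.append((seq, payload))
    return out
```

With this, replay never needs to trust filenames at all: recycled or
renamed files sort themselves out from their headers, which is the
point being argued above.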
On 2003-01-24 21:58:55 -0500, Tom Lane wrote:
> The key assumption we are making about the filesystem's behavior is that
> writes scheduled by the sync() will occur before the pg_control write
> that's issued after it.  People have occasionally faulted this algorithm
> by quoting the sync() man page, which saith (in the Gospel According To
> HP)
> 
> 	The writing, although scheduled, is not necessarily complete upon
> 	return from sync.
> 
> This, however, is not a problem in itself.  What we need to know is
> whether the filesystem will allow writes issued after the sync() to
> complete before those "scheduled" by the sync().

Certain Linux 2.4.* kernels (not sure which; newer ones don't seem to
have it) have the following kernel config option:

	Use the NOOP Elevator (WARNING)
	CONFIG_BLK_DEV_ELEVATOR_NOOP
	  If you are using a raid class top-level driver above the
	  ATA/IDE core, one may find a performance boost by preventing
	  a merging and re-sorting of the new requests.

	  If unsure, say N.

If one were certain his OS wouldn't do any re-ordering of writes,
would it be safe to run with fsync = off?  (Not that I'm going to try
this, but I'm just curious.)

Vincent van Leeuwen
Media Design
pgsql.spam@vinz.nl writes:
> If one were certain his OS wouldn't do any re-ordering of writes, would it be
> safe to run with fsync = off? (not that I'm going to try this, but I'm just
> curious)

I suppose so ... but if your OS doesn't do *any* re-ordering of
writes, I'd say you need a better OS.  Even in Postgres, we'd often
like the OS to collapse multiple writes of the same disk page into one
write.  And we certainly want the various writes forced by a sync() to
be done with some intelligence about disk layout, not blindly in order
of issuance.

			regards, tom lane
On Sat, 2003-01-25 at 23:34, Tom Lane wrote:
> pgsql.spam@vinz.nl writes:
> > If one were certain his OS wouldn't do any re-ordering of writes, would it
> > be safe to run with fsync = off? (not that I'm going to try this, but I'm
> > just curious)
> 
> I suppose so ... but if your OS doesn't do *any* re-ordering of writes,
> I'd say you need a better OS.  Even in Postgres, we'd often like the OS
> to collapse multiple writes of the same disk page into one write.  And
> we certainly want the various writes forced by a sync() to be done with
> some intelligence about disk layout, not blindly in order of issuance.

And anyway, wouldn't SCSI's Tagged Command Queueing override it all,
no matter if the OS did re-ordering or not?  But then, it really means
it when it says that fsync() succeeds, so does TCQ matter in this
case?

-- 
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| "Fear the Penguin!!"                                          |
+---------------------------------------------------------------+
Kevin,

> BTW, why exactly are you running ext3?  It has some nice journalling
> features but it sounds like you don't want to use them.

Because our RAID array controller, an Adaptec 2200S, is only
compatible with RedHat 8.0 without some fancy device driver hacking.
It certainly wasn't my first choice; I've been using Reiser for 4
years and am very happy with it.

Warning to anyone following this thread: the web site info for the
2200S says "Redhat and SuSE", but drivers are only available for
RedHat.  Adaptec's Linux guru, Brian, has been unable to get the web
site maintainers to correct the information on the site.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
On Mon, 2003-01-27 at 13:23, Josh Berkus wrote:
> Kevin,
> 
> > BTW, why exactly are you running ext3?  It has some nice journalling
> > features but it sounds like you don't want to use them.
> 
> Because our RAID array controller, an Adaptec 2200S, is only compatible with
> RedHat 8.0, without some fancy device driver hacking.  It certainly wasn't my

Binary-only, or OSS and just tuned to their kernels?

-- 
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| "Fear the Penguin!!"                                          |
+---------------------------------------------------------------+
Let me add that I have heard that on Linux XFS is better for
PostgreSQL than either ext3 or Reiser.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Josh Berkus wrote:
> > Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's on
> > restart after a crash.  So I'm interested in the data option that gives the
> > minimum performance hit, even if it means that I sacrifice some reliability.
> > I'm running with fsynch on, and the DB is on a mirrored drive array, so I'm
> > not too worried about filesystem-level errors.
> > 
> > So would that be "data=writeback"?
> 
> Yes.  That should give almost the same semantics as ext2 does by
> default, except that metadata is journalled, so no fsck needed.  :-)
> 
> In fact, I believe that's exactly how ReiserFS works, if I'm not
> mistaken (I saw someone claim that it does data journalling, but I've
> never seen any references to how to get ReiserFS to journal data).
> 
> BTW, why exactly are you running ext3?  It has some nice journalling
> features but it sounds like you don't want to use them.  But at the
> same time, it uses pre-allocated inodes just like ext2 does, so it's
> possible to run out of inodes on ext2/3 while AFAIK that's not
> possible under ReiserFS.  That's not likely to be a problem unless
> you're running a news server or something, though.  :-)
> 
> On the other hand, ext3 with data=writeback will probably be faster
> than ReiserFS for a number of things.
> 
> No idea how stable ext3 is versus ReiserFS...
> 
> -- 
> Kevin Brown						kevin@sysexperts.com

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Is there a TODO here?  I like the idea of not writing pg_controldata,
or at least allowing it not to be read, perhaps with a pg_resetxlog
flag so we can cleanly recover from a corrupt pg_controldata if the
WAL files are OK.

We don't want to get rid of the WAL file rename optimization because
those are 16MB files, and keeping them from checkpoint to checkpoint
is probably a win.

I also like the idea of allowing something between our "at the
instant" recovery and no recovery with fsync off.  A "recover from
last checkpoint time" option would be really valuable for some.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL?  Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> > 
> > Interesting thought, indeed.  Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
> 
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed.  But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
> 
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> > 
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
> 
> Even that might not be necessary.  See below.
> 
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
> 
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?  Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
> 
> Right now we do that by using fsync() and sync().  But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off.  The price for
> that might be too high, though.
> 
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
> 
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.  Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
> 
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds.  It may be that you simply have to bail out if you find a gap,
> though).  As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
> 
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
> 
> > Comments anyone?
> > 
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you.  :-)
> > 
> > Not that I know of.  Would you care to prepare such a writeup?  There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
> 
> Be happy to.  Just point me to any non-obvious source files.
> 
> Thus far on my plate:
> 
> 	1.  PID file locking for postmaster startup (doesn't strictly
> 	    need to be the PID file but it may as well be, since we're
> 	    already messing with it anyway).  I'm currently looking at
> 	    how to do the autoconf tests, since I've never developed
> 	    using autoconf before.
> 
> 	2.  Documenting the transaction management scheme.
> 
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I.  I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
> 
> -- 
> Kevin Brown						kevin@sysexperts.com

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL? Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed. Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed. But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure. It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary. See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?
Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync(). But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off. The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order. This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files. Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs. Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds. It may be that you simply have to bail out if you find a gap,
> though). As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?
Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
>
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you. :-)
> >
> > Not that I know of. Would you care to prepare such a writeup? There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to. Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
> 1. PID file locking for postmaster startup (doesn't strictly need
>    to be the PID file but it may as well be, since we're already
>    messing with it anyway). I'm currently looking at how to do
>    the autoconf tests, since I've never developed using autoconf
>    before.
>
> 2. Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I. I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
> --
> Kevin Brown kevin@sysexperts.com

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
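Kevin's proposal above — order segments by header contents rather than file names, verify a per-segment checksum, and discard everything before a gap — can be modeled with a small sketch. The segment layout here (a sequence number plus a CRC over the payload) is hypothetical, invented for illustration, not PostgreSQL's actual on-disk format:

```python
import zlib

def make_segment(seqno, payload):
    # Hypothetical segment: the header carries a sequence number, and a
    # CRC over the payload is stored inside the segment, not its name.
    return {"seqno": seqno, "payload": payload,
            "checksum": zlib.crc32(payload)}

def replayable_segments(segments):
    """Order segments by their embedded sequence numbers and keep only
    the contiguous, checksum-clean run ending at the newest segment;
    anything before a gap or a corrupt segment is discarded."""
    ordered = sorted(segments, key=lambda s: s["seqno"])
    run = []
    for seg in ordered:
        intact = zlib.crc32(seg["payload"]) == seg["checksum"]
        contiguous = not run or seg["seqno"] == run[-1]["seqno"] + 1
        if intact and contiguous:
            run.append(seg)
        else:
            # A gap or corrupt segment invalidates everything before it.
            run = [seg] if intact else []
    return run
```

This is the weaker of the two policies Kevin mentions (replay from after the last gap); the stricter alternative would be to bail out entirely whenever any gap is found.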
On Fri, 14 Feb 2003, Bruce Momjian wrote:

> Is there a TODO here, like "Allow recovery from corrupt pg_control via
> WAL"?

Isn't that already in section 12.2.1 of the documentation?

    Using pg_control to get the checkpoint position speeds up the
    recovery process, but to handle possible corruption of pg_control,
    we should actually implement the reading of existing log segments
    in reverse order -- newest to oldest -- in order to find the last
    checkpoint. This has not been implemented, yet.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
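The documentation passage Curt quotes describes scanning the log newest-to-oldest to find the last checkpoint when pg_control cannot be trusted. A toy model of that reverse scan, with segments as plain lists of record tags (the tags are hypothetical, purely for illustration):

```python
def find_last_checkpoint(segments):
    """Scan segments newest-to-oldest, and records within each segment
    back-to-front, returning (segment index, record index) of the most
    recent checkpoint record, or None if no checkpoint exists."""
    for segno in range(len(segments) - 1, -1, -1):
        for recno in range(len(segments[segno]) - 1, -1, -1):
            if segments[segno][recno] == "CHECKPOINT":
                return (segno, recno)
    return None
```

The point of the reverse order is that the scan can stop at the first checkpoint it meets; a forward scan would have to read every segment before knowing which checkpoint is last.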
Added to TODO:

* Allow WAL information to recover corrupted pg_controldata

---------------------------------------------------------------------------

Curt Sampson wrote:
> On Fri, 14 Feb 2003, Bruce Momjian wrote:
>
> > Is there a TODO here, like "Allow recovery from corrupt pg_control via
> > WAL"?
>
> Isn't that already in section 12.2.1 of the documentation?
>
>     Using pg_control to get the checkpoint position speeds up the
>     recovery process, but to handle possible corruption of pg_control,
>     we should actually implement the reading of existing log segments
>     in reverse order -- newest to oldest -- in order to find the last
>     checkpoint. This has not been implemented, yet.
>
> cjs
> --
> Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
> Don't you know, in this new Dark Age, we're all light. --XTC
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073