Thread: xlog loose ends, continued
There is another loose end that I forgot I needed to discuss with you. xlog.c's ReadRecord formerly contained code that would zero out the rest of the log segment (and delete the next log segment, if any) upon detecting a missing or corrupted xlog record. I removed that code because I considered it horribly dangerous where it was. If there is anything wrong with either the xlog or pg_control's pointers to it, that code was quite capable of wiping out all hope of recovery *and* all evidence of what went wrong. I think it's really bad to automatically destroy log data, especially when we do not yet know if we are capable of recovering. If we need this functionality, it should be invoked only at the completion of StartupXLOG, after we have finished the recovery phase. However, I'd be a lot happier if we could avoid wholesale zeroing at all. I presume the point of this code was that if we recover and then suffer a later crash at a point where we've just written an xlog record that exactly fills an xlog page, a subsequent scan of the log might continue on from that point and pick up xlog records from the prior (failed) system run. Is there a way to guard against that scenario without having to zero out data during recovery? One thought that comes to mind is to store StartUpID in XLOG page headers, and abort log scanning if we come to a page with StartUpID less than what came before. Is that secure/sufficient? Is there a better way? regards, tom lane
> I presume the point of this code was that if we recover and > then suffer > a later crash at a point where we've just written an xlog record that > exactly fills an xlog page, a subsequent scan of the log > might continue > on from that point and pick up xlog records from the prior (failed) > system run. Is there a way to guard against that scenario without > having to zero out data during recovery? > > One thought that comes to mind is to store StartUpID in XLOG page > headers, and abort log scanning if we come to a page with StartUpID > less than what came before. Is that secure/sufficient? Is there > a better way? This code was from the old days when there was no CRC in log records. Should we try to read log up to the *physical end* - ie end of last log file - regardless invalid CRC-s/zero pages with attempt to re-apply interim valid records? (Or do we already do this?) This way we'll know where is actual end of log (last valid record) to begin production from there. (Unfortunately, we'll have to read empty files pre-created by checkpointer -:(). Anyway I like idea of StartUpID in page headers - this will help if some log files disappeared. Should we add CRC to page header? Hm, maybe XLogFileInit should initialize files with StartUpID & CRC in pages? We would avoid reading empty files. Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > This code was from the old days when there was no CRC in log records. Ah, right. The CRC makes things safer ... but there's still a risk that old log pages could look like a valid continuation. > Should we try to read log up to the *physical end* - ie end of last > log file - regardless invalid CRC-s/zero pages with attempt to > re-apply interim valid records? (Or do we already do this?) That doesn't seem like a good idea --- once we fail to read an XLOG record, it's probably best to stop there rather than continue on. I think we want to try for a consistent recovery to a past point in time (ie, wherever the xlog gap is) not a partial recovery to a later time. > Anyway I like idea of StartUpID in page headers - this will help > if some log files disappeared. Should we add CRC to page header? That seems like overkill. I was hoping to keep the page header overhead at eight bytes. We could do that either by storing just the two LSBs of StartUpID (and doing the sequence checking mod 64k) or by reducing the magic number to two bytes so there's room for four bytes of StartUpID. I think I like the first alternative better --- comments? > Hm, maybe XLogFileInit should initialize files with StartUpID & CRC > in pages? We would avoid reading empty files. We already stop when we hit a zeroed page (because it's not got the right magic number). That seems sufficient. regards, tom lane
> > Should we try to read log up to the *physical end* - ie end of last > > log file - regardless invalid CRC-s/zero pages with attempt to > > re-apply interim valid records? (Or do we already do this?) > > That doesn't seem like a good idea --- once we fail to read an XLOG > record, it's probably best to stop there rather than continue on. > I think we want to try for a consistent recovery to a past point in > time (ie, wherever the xlog gap is) not a partial recovery to a later > time. No way for consistent recovery if there is gap in log due to disk write re-ordering anyway (and we can't know what was the reason of the gap). I thought that you wanted apply as much of log as we can. If you don't then I missed your point in first message: > xlog.c's ReadRecord formerly contained code that would zero > out the rest of the log segment (and delete the next log segment, > if any) upon detecting a missing or corrupted xlog record. > I removed that code because I considered it horribly dangerous > where it was. If there is anything wrong with either the xlog or > pg_control's pointers to it, that code was quite capable of wiping > out all hope of recovery *and* all evidence of what went wrong. ^^^^^^^^^^^^^^^^^^^^^^^^ So, if we are not going to re-apply as much valid records as we can read from log then zeroing is no more dangerous than SUI in headers. But I totaly agreed that SUI is much better. > > Anyway I like idea of StartUpID in page headers - this will help > > if some log files disappeared. Should we add CRC to page header? > > That seems like overkill. I was hoping to keep the page > header overhead at eight bytes. We could do that either by storing just > the two LSBs of StartUpID (and doing the sequence checking mod 64k) or > by reducing the magic number to two bytes so there's room for four bytes of > StartUpID. I think I like the first alternative better --- comments? I don't think a few additional bytes in header is a problem. BTW, why not use CRC32 in header instead of magic? Or just StartUpID instead of magic if you don't want to calculate CRC for header - xlp_magic doesn't seem to be more useful than SUI. > > Hm, maybe XLogFileInit should initialize files with StartUpID & CRC > > in pages? We would avoid reading empty files. > > We already stop when we hit a zeroed page (because it's not got the > right magic number). That seems sufficient. What if the next page after zeroed one is correct (due to write re-ordering)? (But I take back SUI+CRC in XLogFileInit - useless -:)) Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> That doesn't seem like a good idea --- once we fail to read an XLOG >> record, it's probably best to stop there rather than continue on. >> I think we want to try for a consistent recovery to a past point in >> time (ie, wherever the xlog gap is) not a partial recovery to a later >> time. > No way for consistent recovery if there is gap in log due to > disk write re-ordering anyway (and we can't know what was > the reason of the gap). I thought that you wanted apply as much of log > as we can. If you don't then I missed your point in first message: >> xlog.c's ReadRecord formerly contained code that would zero >> out the rest of the log segment (and delete the next log segment, >> if any) upon detecting a missing or corrupted xlog record. >> I removed that code because I considered it horribly dangerous >> where it was. If there is anything wrong with either the xlog or >> pg_control's pointers to it, that code was quite capable of wiping >> out all hope of recovery *and* all evidence of what went wrong. > ^^^^^^^^^^^^^^^^^^^^^^^^ What I was thinking about in that last paragraph was manual analysis and recovery. I don't think it's a good idea for automatic system startup to skip over gaps in the log. > So, if we are not going to re-apply as much valid records as > we can read from log then zeroing is no more dangerous than > SUI in headers. But I totaly agreed that SUI is much better. Okay, I will change the page headers to include SUI (or the low-order bits of it anyway), and make ReadRecord stop if it notices a backwards jump in SUI. > I don't think a few additional bytes in header is a problem. > BTW, why not use CRC32 in header instead of magic? There is so little changeable information in a page header that a CRC wouldn't be much more than an eight-byte magic number. And we don't need eight bytes worth of magic number (even four is more than enough, really). So I'd just as soon keep the headers simple and small. regards, tom lane
> >> xlog.c's ReadRecord formerly contained code that would zero > >> out the rest of the log segment (and delete the next log segment, > >> if any) upon detecting a missing or corrupted xlog record. > >> I removed that code because I considered it horribly dangerous > >> where it was. If there is anything wrong with either the xlog or > >> pg_control's pointers to it, that code was quite capable of wiping > >> out all hope of recovery *and* all evidence of what went wrong. > > ^^^^^^^^^^^^^^^^^^^^^^^^ > > What I was thinking about in that last paragraph was manual > analysis and recovery. I don't think it's a good idea for automatic > system startup to skip over gaps in the log. But if we'll not try to read after gap then after restart system will not notice gap and valid records after it and just rewrite log space with new records. Not much chance for manual analysis - ppl will not report any problems. Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> What I was thinking about in that last paragraph was manual >> analysis and recovery. I don't think it's a good idea for automatic >> system startup to skip over gaps in the log. > But if we'll not try to read after gap then after restart system will > not notice gap and valid records after it and just rewrite log space > with new records. Not much chance for manual analysis - ppl will > not report any problems. That'll be true in any case, unless we refuse to start up at all upon detecting xlog corruption (which doesn't seem like the way to fly). Not sure what we can do about that. regards, tom lane
> >> What I was thinking about in that last paragraph was manual > >> analysis and recovery. I don't think it's a good idea for automatic > >> system startup to skip over gaps in the log. > > > But if we'll not try to read after gap then after restart > > system will not notice gap and valid records after it and > > just rewrite log space with new records. Not much chance for > > manual analysis - ppl will not report any problems. > > That'll be true in any case, unless we refuse to start up at all upon > detecting xlog corruption (which doesn't seem like the way to fly). > Not sure what we can do about that. What I would refuse in the event of log corruption is continuing normal database operations. It's ok to dump such database for manual recovery, but continuing to use it is VERY BAD THING. The fact that users will use inconsistent DB may become big headache for us - just imagine compains about index scans returning incorrect results (index tuples pointing to free heap space was left and then that space was used for tuple with different keys). Failing to restart was bad but silent restart in the event of log corruption is bad too. In first case we had at least chance to discover original problem. Vadim
> > It will be sufficient if DB will not use all 2^32 XIDs > > without shutdown. > > I liked the xid wraparound idea, won't that be sufficient here too ? > I don't like the idea to reuse a xid sooner than absolutely necessary. We need it to reduce pg_log size requirements. > This would complicate the search for potentially inconsistent pages > after crash. There is no such search currently and I can't imagine how/when/for what to do such search. ? Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> That'll be true in any case, unless we refuse to start up at all upon >> detecting xlog corruption (which doesn't seem like the way to fly). >> Not sure what we can do about that. > What I would refuse in the event of log corruption is continuing > normal database operations. Hmm. We could do that if we had some notion of a read-only operating mode, perhaps. But we don't have one now and I don't want to add it for 7.1. Can we agree to look at this more for 7.2? If we did have that, it would make sense to scan the rest of the log (after the last valid XLOG record) to see if we find any more records. If we do then --- whether they're valid or not --- we have a corrupted DB and we should go into the read-only state. But for the moment I think it's best not to make such a scan. regards, tom lane
> > What I would refuse in the event of log corruption is continuing > > normal database operations. > > Hmm. We could do that if we had some notion of a read-only operating > mode, perhaps. But we don't have one now and I don't want to add it > for 7.1. Can we agree to look at this more for 7.2? We need not in full support of read-only mode - just set some flag in shmem and disallow write ops. I think 7.1.1 or so is good for that. Vadim
> >> That'll be true in any case, unless we refuse to start up at all upon > >> detecting xlog corruption (which doesn't seem like the way to fly). > >> Not sure what we can do about that. > > What I would refuse in the event of log corruption is continuing > > normal database operations. > Hmm. We could do that if we had some notion of a read-only operating > mode, perhaps. But we don't have one now and I don't want to add it > for 7.1. Can we agree to look at this more for 7.2? I'd like to have a readonly mode driven by integrity requirements for corruption recovery for database tables, for replication, and (in the future) for distributed databases, so perhaps we can do a trial implementation fairly soon. Not sure how it would impact the backend(s), but istm that we might be able to do a first implementation for 7.1.x. I'll bring it up again when appropriate... - Thomas
Maybe there should be an error message like : "PostgreSQL has detected severe xlog corruption. Please fix this with pg_recover (or similar) manually before restarting the database"? Guess I'm suggesting a separate xlog recovery tool for "bad cases" of xlog corruption, so decisions can by manually made by a DBA where necessary. Not everything has to be automatic I'm thinking. There are probably times where the dBA would prefer behaviour that doesn't seem intuitive anyway. Regards and best wishes, Justin Clift Tom Lane wrote: > > "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > >> What I was thinking about in that last paragraph was manual > >> analysis and recovery. I don't think it's a good idea for automatic > >> system startup to skip over gaps in the log. > > > But if we'll not try to read after gap then after restart system will > > not notice gap and valid records after it and just rewrite log space > > with new records. Not much chance for manual analysis - ppl will > > not report any problems. > > That'll be true in any case, unless we refuse to start up at all upon > detecting xlog corruption (which doesn't seem like the way to fly). > Not sure what we can do about that. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://www.postgresql.org/search.mpl
"Mikheev, Vadim" wrote: > > > >> What I was thinking about in that last paragraph was manual > > >> analysis and recovery. I don't think it's a good idea for automatic > > >> system startup to skip over gaps in the log. > > > > > But if we'll not try to read after gap then after restart > > > system will not notice gap and valid records after it and > > > just rewrite log space with new records. Not much chance for > > > manual analysis - ppl will not report any problems. > > > > That'll be true in any case, unless we refuse to start up at all upon > > detecting xlog corruption (which doesn't seem like the way to fly). > > Not sure what we can do about that. > > What I would refuse in the event of log corruption is continuing > normal database operations. Log corruption is never an unique cause of a recovery failure. If there's a bug in redo stuff the result would also be a recovery failure. Currently the redo stuff has to accomplish redo operations completely. No matter how trivial the bug may be, it's always serious unfortunately. > It's ok to dump such database for manual > recovery, but continuing to use it is VERY BAD THING. The fact that > users will use inconsistent DB may become big headache for us - just > imagine compains about index scans returning incorrect results > (index tuples pointing to free heap space was left and then that space > was used for tuple with different keys). > Hmm this seems nothing worse than 7.0. I would complain if postmaster couldn't restart due to this reason. IIRC few ppl mind the (even system) index corruption. > Failing to restart was bad but silent restart in the event of log > corruption is bad too. Agreed. regards, Hiroshi Inoue
Tom Lane wrote: > > "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > >> That'll be true in any case, unless we refuse to start up at all upon > >> detecting xlog corruption (which doesn't seem like the way to fly). > >> Not sure what we can do about that. > > > What I would refuse in the event of log corruption is continuing > > normal database operations. > > Hmm. We could do that if we had some notion of a read-only operating > mode, perhaps. But we don't have one now and I don't want to add it > for 7.1. Can we agree to look at this more for 7.2? I'd love to see PostgreSQL have a read-only mode of some kind that let enquiry function against a possibly otherwise corrupted database, without the stress of worrying that you might be making things worse. I know other DB servers that have this sort of thing, and it has been a life-saver for me on occasion to allow critical information to be extracted before you nuke it all and start over. Cheers, Andrew. -- _____________________________________________________________________ Andrew McMillan, e-mail: Andrew@catalyst.net.nz Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267