Thread: Detecting corrupted pages earlier
Postgres has a bad habit of becoming very confused if the page header of a page on disk has become corrupted. In particular, bogus values in the pd_lower field tend to make it look like there are many more tuples than there really are, and of course these "tuples" contain garbage. That leads to core dumps, weird complaints about out-of-range transaction numbers (the latter generally in the form of an abort referencing a nonexistent pg_clog file), and other un-fun stuff. I'm thinking of modifying ReadBuffer() so that it errors out if the page read in does not contain either zeroes or a valid-looking header. (The exception for zeroes seems to be needed for hash indexes, which tend to initialize pages out-of-order.) This would make it much easier for people to recognize situations where a page header has become corrupted on disk. Comments? Can anyone think of a scenario where this would be a bad idea? regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> Postgres has a bad habit of becoming very confused if the Tom> page header of a page on disk has become corrupted. In Tom> particular, bogus values in the pd_lower field tend to make I haven't read this piece of pgsql code very carefully so I apologize if what I suggest is already present. One "standard" solution to handle disk page corruption is the use of "consistency" bits. The idea is that the bit that starts every 256th byte of a page is a consistency bit. In a 8K page, you'd have 32 consistency bits. If the page is in a "consistent" state, then all 32 bits will be either 0 or 1. When a page is written to disk, the "actual" bit in each c-bit position is copied out and placed in the header (end/beginning) of the page. With a 8K page, there will be one word that contains the "actual" bit. Then the c-bits are all either set or reset depending on the state when the page was last read: if on read time the c-bits were set, then on write time they are reset. So when you read a page, if some of the consistency bits are set and some others are reset then you know that there was a corruption. This is of course based on the assumption that most disk arms manage to atomically write 256 bytes at a time. -- Pip-pip Sailesh http://www.cs.berkeley.edu/~sailesh
On Mon, 17 Feb 2003, Tom Lane wrote: > Postgres has a bad habit of becoming very confused if the page header of > a page on disk has become corrupted. What typically causes this corruption? If it's any kind of a serious problem, maybe it would be worth keeping a CRC of the header at the end of the page somewhere. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes: > On Mon, 17 Feb 2003, Tom Lane wrote: >> Postgres has a bad habit of becoming very confused if the page header of >> a page on disk has become corrupted. > What typically causes this corruption? Well, I'd like to know that too. I have seen some cases that were identified as hardware problems (disk wrote data to wrong sector, RAM dropped some bits, etc). I'm not convinced that that's the whole story, but I have nothing to chew on that could lead to identifying a software bug. > If it's any kind of a serious problem, maybe it would be worth keeping > a CRC of the header at the end of the page somewhere. See past discussions about keeping CRCs of page contents. Ultimately I think it's a significant expenditure of CPU for very marginal returns --- the layers underneath us are supposed to keep their own CRCs or other cross-checks, and a very substantial chunk of the problem seems to be bad RAM, against which occasional software CRC checks aren't especially useful. regards, tom lane
Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > On Mon, 17 Feb 2003, Tom Lane wrote: > >> Postgres has a bad habit of becoming very confused if the page header of > >> a page on disk has become corrupted. > > > What typically causes this corruption? > > Well, I'd like to know that too. I have seen some cases that were > identified as hardware problems (disk wrote data to wrong sector, RAM > dropped some bits, etc). I'm not convinced that that's the whole story, > but I have nothing to chew on that could lead to identifying a software > bug. > > > If it's any kind of a serious problem, maybe it would be worth keeping > > a CRC of the header at the end of the page somewhere. > > See past discussions about keeping CRCs of page contents. Ultimately > I think it's a significant expenditure of CPU for very marginal returns > --- the layers underneath us are supposed to keep their own CRCs or > other cross-checks, and a very substantial chunk of the problem seems > to be bad RAM, against which occasional software CRC checks aren't > especially useful. I believe the farthest we got was the idea of adding a CRC page check option in case you suspected bad hardware. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, 17 Feb 2003, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > > If it's any kind of a serious problem, maybe it would be worth keeping > > a CRC of the header at the end of the page somewhere. > > See past discussions about keeping CRCs of page contents. Ultimately > I think it's a significant expenditure of CPU for very marginal returns > --- the layers underneath us are supposed to keep their own CRCs or > other cross-checks, and a very substantial chunk of the problem seems > to be bad RAM, against which occasional software CRC checks aren't > especially useful. Well, I wasn't proposing the whole page, just the header. That would be significantly cheaper (in fact, there's no real need even for a CRC; probably just xoring all of the words in the header into one word would be fine) and would tell you if the page was torn during the write, which was what I was imagining the problem might be. But bad memory, well, not much you can do about that beyond saying, "buy ECC, dude." cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
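As a sketch of what Curt is suggesting, the XOR check could look roughly like this in C. The header layout and the pd_check field are invented for illustration; the real PostgreSQL page header has no such checksum word.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint16_t    pd_lower;               /* fields loosely modeled on a page header */
    uint16_t    pd_upper;
    uint16_t    pd_special;
    uint16_t    pd_pagesize_version;
    uint32_t    pd_check;               /* hypothetical XOR-of-header-words field */
} DemoPageHeader;

/* XOR together all header words that precede the check word itself. */
static uint32_t
header_xor(const DemoPageHeader *hdr)
{
    const uint32_t *w = (const uint32_t *) hdr;
    size_t          nwords = offsetof(DemoPageHeader, pd_check) / sizeof(uint32_t);
    uint32_t        sum = 0;

    for (size_t i = 0; i < nwords; i++)
        sum ^= w[i];
    return sum;
}

/* Stamp the check word just before the page is written out. */
static void
header_set_check(DemoPageHeader *hdr)
{
    hdr->pd_check = header_xor(hdr);
}

/* Verify it just after the page is read back in. */
static bool
header_check_ok(const DemoPageHeader *hdr)
{
    return hdr->pd_check == header_xor(hdr);
}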
Curt Sampson <cjs@cynic.net> writes: > Well, I wasn't proposing the whole page, just the header. That would be > significantly cheaper (in fact, there's no real need even for a CRC; > probably just xoring all of the words in the header into one word would > be fine) and would tell you if the page was torn during the write, which > was what I was imagining the problem might be. The header is only a dozen or two bytes long, so torn-page syndrome won't result in header corruption. The cases I've been able to study look like the header and a lot of the following page data have been overwritten with garbage --- when it made any sense at all, it looked like the contents of non-Postgres files (eg, plain text), which is why I mentioned the possibility of disks writing data to the wrong sector. Another recent report suggested that all bytes of the header had been replaced with 0x55, which sounds more like RAM or disk-controller malfeasance. You're right that we don't need a heck of a powerful check to catch this sort of thing. I was envisioning checks comparable to what's now in PageAddItem: valid pagesize, valid version, pd_lower and pd_upper and pd_special sane relative to each other and to the pagesize. I think this would be nearly as effective as an XOR sum --- and it has the major advantage of being compatible with the existing page layout. I'd like to think we're done munging the page layout for awhile. regards, tom lane
On Tue, 18 Feb 2003, Tom Lane wrote: > The header is only a dozen or two bytes long, so torn-page syndrome > won't result in header corruption. No. But the checksum would detect both header corruption and torn pages. Two for the price of one. But I don't think it's worth changing the page layout for, either. Maybe, if anybody still cares next time the page layout is changed, pop it in with whatever else is being changed. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Tom Lane wrote: > The cases I've been able to study look like the header and a lot of the > following page data have been overwritten with garbage --- when it made > any sense at all, it looked like the contents of non-Postgres files (eg, > plain text), which is why I mentioned the possibility of disks writing > data to the wrong sector. That also sounds suspiciously like the behavior of certain filesystems (Reiserfs, for one) after a crash when the filesystem prior to the crash was highly active with writes. Had the sites that reported this experienced OS crashes or power interruptions? -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > Tom Lane wrote: >> The cases I've been able to study look like the header and a lot of the >> following page data have been overwritten with garbage --- when it made >> any sense at all, it looked like the contents of non-Postgres files (eg, >> plain text), which is why I mentioned the possibility of disks writing >> data to the wrong sector. > That also sounds suspiciously like the behavior of certain filesystems > (Reiserfs, for one) after a crash when the filesystem prior to the > crash was highly active with writes. Isn't reiserfs supposed to be more crash-resistant than ext2, rather than less so? > Had the sites that reported this > experienced OS crashes or power interruptions? Can't recall whether they admitted to such or not. regards, tom lane
Tom Lane wrote on Tue, 18.02.2003 at 17:21: > Kevin Brown <kevin@sysexperts.com> writes: > > Tom Lane wrote: > >> The cases I've been able to study look like the header and a lot of the > >> following page data have been overwritten with garbage --- when it made > >> any sense at all, it looked like the contents of non-Postgres files (eg, > >> plain text), which is why I mentioned the possibility of disks writing > >> data to the wrong sector. > > > That also sounds suspiciously like the behavior of certain filesystems > > (Reiserfs, for one) after a crash when the filesystem prior to the > > crash was highly active with writes. I was bitten by it about a year ago as well. > Isn't reiserfs supposed to be more crash-resistant than ext2, rather > than less so? It's supposed to be, but when it is run in (default?) metadata-only-logging mode, then you can well get perfectly good metadata with unallocated (zero-filled) data pages. There have been some more severe errors as well. ----------------- Hannu
On Mon, 2003-02-17 at 22:04, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > On Mon, 17 Feb 2003, Tom Lane wrote: > >> Postgres has a bad habit of becoming very confused if the page header of > >> a page on disk has become corrupted. > > > What typically causes this corruption? > > Well, I'd like to know that too. I have seen some cases that were > identified as hardware problems (disk wrote data to wrong sector, RAM > dropped some bits, etc). I'm not convinced that that's the whole story, > but I have nothing to chew on that could lead to identifying a software > bug. > > > If it's any kind of a serious problem, maybe it would be worth keeping > > a CRC of the header at the end of the page somewhere. > > See past discussions about keeping CRCs of page contents. Ultimately > I think it's a significant expenditure of CPU for very marginal returns > --- the layers underneath us are supposed to keep their own CRCs or > other cross-checks, and a very substantial chunk of the problem seems > to be bad RAM, against which occasional software CRC checks aren't > especially useful. This is exactly why "magic numbers" or simple algorithmic bit patterns are commonly used. If the "magic number" or bit pattern doesn't match its page number, you know something is wrong. Storage cost tends to be slight and CPU overhead low. I agree with you that a CRC seems overkill for little return. Regards, -- Greg Copeland <greg@copelandconsulting.net> Copeland Computer Consulting
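A sketch of the magic-number/bit-pattern idea Greg describes: stamp each page, before it is written, with a value derived from its own block number, and verify the stamp after reading it back. The constant, the field, and the mixing formula below are assumptions for illustration only; none of this exists in the actual page layout.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_MAGIC  0x50474453u     /* arbitrary constant for the example */

typedef struct
{
    uint32_t    pd_stamp;           /* hypothetical field: magic mixed with block number */
    /* ... the rest of the page header and the page contents ... */
} DemoPage;

/* Stamp the page just before handing it to the storage manager for writing. */
static void
page_set_stamp(DemoPage *page, uint32_t blockNum)
{
    page->pd_stamp = PAGE_MAGIC ^ blockNum;
}

/*
 * Check the stamp just after reading the page back.  Because the expected
 * value depends on the block number, this catches not only random header
 * trash but also the "disk wrote the data to the wrong sector" case: a page
 * that landed at the wrong block carries the wrong stamp.
 */
static bool
page_stamp_ok(const DemoPage *page, uint32_t blockNum)
{
    return page->pd_stamp == (PAGE_MAGIC ^ blockNum);
}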
Tom Lane wrote: > > Postgres has a bad habit of becoming very confused if the page header of > a page on disk has become corrupted. In particular, bogus values in the > pd_lower field tend to make it look like there are many more tuples than > there really are, and of course these "tuples" contain garbage. That > leads to core dumps, weird complaints about out-of-range transaction > numbers (the latter generally in the form of an abort referencing a > nonexistent pg_clog file), and other un-fun stuff. > > I'm thinking of modifying ReadBuffer() so that it errors out if the What does the *error out* mean ? Is there a way to make our way around the pages ? > page read in does not contain either zeroes or a valid-looking header. > (The exception for zeroes seems to be needed for hash indexes, which > tend to initialize pages out-of-order.) This would make it much easier > for people to recognize situations where a page header has become > corrupted on disk. > > Comments? Can anyone think of a scenario where this would be a bad > idea? IIRC there was a similar thread long ago. IMHO CRC isn't sufficient because CRC could be calculated even for (silently) corrupted pages. regards, Hiroshi Inoue http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Tom Lane wrote: >> I'm thinking of modifying ReadBuffer() so that it errors out if the > What does the *error out* mean ? Mark the buffer as having an I/O error and then elog(ERROR). > Is there a way to make our way around the pages ? If the header is corrupt, I don't think so. You'd need at the very least to fix the bad header fields (particularly pd_lower) before you could safely try to examine tuples. (In the cases that I've seen, some or all of the line pointers are clobbered too, making it even less likely that any useful data can be extracted automatically.) Basically I'd rather have accesses to the clobbered page fail with elog(ERROR) than with more drastic errors. Right now, the least dangerous result you are likely to get is elog(FATAL) out of the clog code, and you can easily get a PANIC or backend coredump instead. > IMHO CRC isn't sufficient because CRC could be calculated > even for (silently) corrupted pages. Yeah, it seems a great expense for only marginal additional protection. regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > Tom Lane wrote: > >> I'm thinking of modifying ReadBuffer() so that it errors out if the > > > What does the *error out* mean ? > > Mark the buffer as having an I/O error and then elog(ERROR). > > > Is there a way to make our way around the pages ? > > If the header is corrupt, I don't think so. What I asked is how to read all other sane pages. Once pages are corrupted, users would want to copy out the sane data ASAP. regards, Hiroshi Inoue http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Tom Lane wrote: >> Hiroshi Inoue <Inoue@tpf.co.jp> writes: >>> Is there a way to make our way around the pages ? >> >> If the header is corrupt, I don't think so. > What I asked is how to read all other sane pages. Oh, I see. You can do "SELECT ... LIMIT n" to get the rows before the broken page, but there's no way to get the ones after it. My proposal won't make this worse, but it won't make it any better either. Do you have an idea how to get the rows after the broken page? regards, tom lane
Tom Lane wrote: > > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > Tom Lane wrote: > >> Hiroshi Inoue <Inoue@tpf.co.jp> writes: > >>> Is there a way to make our way around the pages ? > >> > >> If the header is corrupt, I don't think so. > > > What I asked is how to read all other sane pages. > > Oh, I see. You can do "SELECT ... LIMIT n" to get the rows before the > broken page, but there's no way to get the ones after it. My proposal > won't make this worse, but it won't make it any better either. Do you > have an idea how to get the rows after the broken page? How about adding a new option to skip corrupted pages ? regards, Hiroshi Inoue http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > How about adding a new option to skip corrupted pages ? I have committed changes to implement checking for damaged page headers, along the lines of last month's discussion. It includes a GUC variable to control the response as suggested by Hiroshi. Given the number of data-corruption reports we've seen recently, I am more than half tempted to commit the change into the 7.3.* branch too. However the GUC variable makes it seem a little like a new feature. I could backpatch the whole change, or backpatch it without the GUC variable (so the only possible response is elog(ERROR)), or leave well enough alone. Any votes? The patch is pretty small; I include the non-boilerplate parts below. regards, tom lane

*** /home/tgl/pgsql/src/backend/storage/buffer/bufmgr.c.orig	Tue Mar 25 09:06:11 2003
--- /home/tgl/pgsql/src/backend/storage/buffer/bufmgr.c	Fri Mar 28 11:54:48 2003
***************
*** 59,64 ****
--- 60,69 ----
  	(*((XLogRecPtr*) MAKE_PTR((bufHdr)->data)))
  
+ /* GUC variable */
+ bool		zero_damaged_pages = false;
+ 
+ 
  static void WaitIO(BufferDesc *buf);
  static void StartBufferIO(BufferDesc *buf, bool forInput);
  static void TerminateBufferIO(BufferDesc *buf);
***************
*** 217,222 ****
--- 222,241 ----
  	{
  		status = smgrread(DEFAULT_SMGR, reln, blockNum,
  						  (char *) MAKE_PTR(bufHdr->data));
+ 		/* check for garbage data */
+ 		if (status == SM_SUCCESS &&
+ 			!PageHeaderIsValid((PageHeader) MAKE_PTR(bufHdr->data)))
+ 		{
+ 			if (zero_damaged_pages)
+ 			{
+ 				elog(WARNING, "Invalid page header in block %u of %s; zeroing out page",
+ 					 blockNum, RelationGetRelationName(reln));
+ 				MemSet((char *) MAKE_PTR(bufHdr->data), 0, BLCKSZ);
+ 			}
+ 			else
+ 				elog(ERROR, "Invalid page header in block %u of %s",
+ 					 blockNum, RelationGetRelationName(reln));
+ 		}
  	}
  
  	if (isLocalBuf)

*** /home/tgl/pgsql/src/backend/storage/page/bufpage.c.orig	Tue Mar 25 09:06:12 2003
--- /home/tgl/pgsql/src/backend/storage/page/bufpage.c	Fri Mar 28 11:38:43 2003
***************
*** 48,53 ****
--- 46,96 ----
  }
  
+ /*
+  * PageHeaderIsValid
+  *		Check that the header fields of a page appear valid.
+  *
+  * This is called when a page has just been read in from disk.  The idea is
+  * to cheaply detect trashed pages before we go nuts following bogus item
+  * pointers, testing invalid transaction identifiers, etc.
+  *
+  * It turns out to be necessary to allow zeroed pages here too.  Even though
+  * this routine is *not* called when deliberately adding a page to a relation,
+  * there are scenarios in which a zeroed page might be found in a table.
+  * (Example: a backend extends a relation, then crashes before it can write
+  * any WAL entry about the new page.  The kernel will already have the
+  * zeroed page in the file, and it will stay that way after restart.)  So we
+  * allow zeroed pages here, and are careful that the page access macros
+  * treat such a page as empty and without free space.  Eventually, VACUUM
+  * will clean up such a page and make it usable.
+  */
+ bool
+ PageHeaderIsValid(PageHeader page)
+ {
+ 	char	   *pagebytes;
+ 	int			i;
+ 
+ 	/* Check normal case */
+ 	if (PageGetPageSize(page) == BLCKSZ &&
+ 		PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
+ 		page->pd_lower >= SizeOfPageHeaderData &&
+ 		page->pd_lower <= page->pd_upper &&
+ 		page->pd_upper <= page->pd_special &&
+ 		page->pd_special <= BLCKSZ &&
+ 		page->pd_special == MAXALIGN(page->pd_special))
+ 		return true;
+ 
+ 	/* Check all-zeroes case */
+ 	pagebytes = (char *) page;
+ 	for (i = 0; i < BLCKSZ; i++)
+ 	{
+ 		if (pagebytes[i] != 0)
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ 
  /* ----------------
   *	PageAddItem
   *
On Fri, 28 Mar 2003, Tom Lane wrote: > Hiroshi Inoue <Inoue@tpf.co.jp> writes: > > How about adding a new option to skip corrupted pages ? > > I have committed changes to implement checking for damaged page headers, > along the lines of last month's discussion. It includes a GUC variable > to control the response as suggested by Hiroshi. Is zeroing the pages the only / best option? Hiroshi suggested skipping the pages as I recall. Is there any chance of recovering data from a trashed page manually? If so perhaps the GUC variable should allow three options: error, zero, and skip. Kris Jurka
Kris Jurka <books@ejurka.com> writes: > Is zeroing the pages the only / best option? It's the only way to avoid a core dump when the system tries to process the page. And no, I don't want to propagate the notion that "this page is broken" beyond the buffer manager, so testing elsewhere isn't an acceptable answer. Basically, one should only turn this variable on after giving up on the possibility of getting any data out of the broken page itself. It would be folly to run with it turned on as a normal setting. regards, tom lane
Tom Lane wrote: > Kris Jurka <books@ejurka.com> writes: > > Is zeroing the pages the only / best option? > > It's the only way to avoid a core dump when the system tries to process > the page. And no, I don't want to propagate the notion that "this page > is broken" beyond the buffer manager, so testing elsewhere isn't an > acceptable answer. > > Basically, one should only turn this variable on after giving up on the > possibility of getting any data out of the broken page itself. It would > be folly to run with it turned on as a normal setting. This statement should *definitely* go into the documentation for the option, then... -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > Tom Lane wrote: >> Basically, one should only turn this variable on after giving up on the >> possibility of getting any data out of the broken page itself. It would >> be folly to run with it turned on as a normal setting. > This statement should *definitely* go into the documentation for the > option, then... Here's what I put in --- feel free to suggest better wording. ZERO_DAMAGED_PAGES (boolean) Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. Setting zero_damaged_pages to true causes the system to instead report a warning, zero out the damaged page, and continue processing. This behavior <emphasis>will lose data</>, namely all the rows on the damaged page. But it allows you to get past the error and retrieve rows from any undamaged pages that may be present in the table. So it is useful for recovering data if corruption has occurred due to hardware or software error. The default setting is off, and it can only be changed by a superuser. regards, tom lane
Tom Lane wrote: > Here's what I put in --- feel free to suggest better wording. > > ZERO_DAMAGED_PAGES (boolean) > > Detection of a damaged page header normally causes PostgreSQL to > report an error, aborting the current transaction. Setting > zero_damaged_pages to true causes the system to instead report a > warning, zero out the damaged page, and continue processing. This > behavior <emphasis>will lose data</>, namely all the rows on the > damaged page. But it allows you to get past the error and retrieve > rows from any undamaged pages that may be present in the table. So > it is useful for recovering data if corruption has occurred due to > hardware or software error. The default setting is off, and it can > only be changed by a superuser. I would change "will lose data" to "will destroy data" -- if only to emphasize the fact that with the option enabled you have no chance of recovering the data. Other than that, it looks good to me. I was going to suggest that perhaps a statement like "running with this option enabled for normal operation is probably not a good idea" should be put in there, but whether such a thing belongs in the documentation or not depends on whether or not you believe the documentation should be neutral regarding administrative decisions. The part of the documentation your modification applies to should probably remain neutral, IMHO, but that implies that perhaps a paragraph that talks about dealing with damaged pages (and which references this option) should be placed into the administrator's guide. -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > Tom Lane wrote: >> Basically, one should only turn this variable on after giving up on the >> possibility of getting any data out of the broken page itself. It would >> be folly to run with it turned on as a normal setting. > This statement should *definitely* go into the documentation for the > option, then... Andrew Sullivan expressed concern about this, too. The thing could be made a little more failsafe if we made it impossible to set ZERO_DAMAGED_PAGES to true in postgresql.conf, or by any means other than an actual SET command --- whose impact would then be limited to the current session. This is kind of an ugly wart on the GUC mechanism, but I think not difficult to do with an assign_hook (it just has to refuse non-interactive settings). Comments? regards, tom lane
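A very rough sketch of the kind of assign_hook being described, for illustration only. The hook name is invented, and the exact hook signature, the GucSource ordering, and the rejection behavior are assumptions about the GUC code of that era; the real interface is in src/backend/utils/misc/guc.c and src/include/utils/guc.h.

#include "postgres.h"
#include "utils/guc.h"      /* GucSource, PGC_S_* (backend header, assumed) */

static bool
assign_zero_damaged_pages(bool newval, bool doit, GucSource source)
{
    /*
     * Allow turning the option off from anywhere, but refuse to turn it on
     * unless the setting comes from an interactive source such as SET, so
     * that postgresql.conf or command-line options can never leave it
     * enabled installation-wide by accident.
     */
    if (newval && source < PGC_S_INTERACTIVE)
        return false;       /* reject; the GUC machinery reports the failure */

    return true;            /* accept the value (applied by the caller if doit) */
}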
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > Tom Lane wrote: > >> Basically, one should only turn this variable on after giving up on the > >> possibility of getting any data out of the broken page itself. It would > >> be folly to run with it turned on as a normal setting. > > > This statement should *definitely* go into the documentation for the > > option, then... > > Andrew Sullivan expressed concern about this, too. The thing could > be made a little more failsafe if we made it impossible to set > ZERO_DAMAGED_PAGES to true in postgresql.conf, or by any means other > than an actual SET command --- whose impact would then be limited to > the current session. This is kind of an ugly wart on the GUC mechanism, > but I think not difficult to do with an assign_hook (it just has to > refuse non-interactive settings). > > Comments? Perhaps better would be to throw a message any time it is turned on, reminding them it should not be left on. Is that cleaner? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Perhaps better would be to throw a message any time it is turned > on, reminding them it should not be left on. Is that cleaner? Where are you going to throw a message to, if it's in postgresql.conf? Bleating in the postmaster log probably won't draw the attention of a user clueless enough to be trying this ;-) regards, tom lane
On Wed, Apr 02, 2003 at 03:25:58PM -0500, Tom Lane wrote: > the current session. This is kind of an ugly wart on the GUC mechanism, > but I think not difficult to do with an assign_hook (it just has to > refuse non-interactive settings). It may be an ugly wart, but I think it's only prudent. I'd be willing to bet a fair amount that there is a significant overlap between the population which uses Bob's Discount Hardware for 10 million row, ultra-critical databases and the population which likes to twiddle postgresql.conf settings without reading the fine manual. Those folks are going to get burned by this setting unless it is very hard to turn on. (I think the setting is an excellent idea, though, for emergency cases.) I don't usually like making servers totally idiot-proof, but I can just imagine the Slashdot conversations after 7.4 comes out (and for at least 60 or 70 years thereafter): "PostgreSQL just randomly zeroes a page of data! It sucks! Use MySQL instead. It can recover from bad pages on your disk by randomly making up data for you, instead of just writing zeros." A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > Tom Lane wrote: > >> Basically, one should only turn this variable on after giving up on the > >> possibility of getting any data out of the broken page itself. It would > >> be folly to run with it turned on as a normal setting. > > > This statement should *definitely* go into the documentation for the > > option, then... > > Andrew Sullivan expressed concern about this, too. The thing could > be made a little more failsafe if we made it impossible to set > ZERO_DAMAGED_PAGES to true in postgresql.conf, or by any means other > than an actual SET command --- whose impact would then be limited to > the current session. This is kind of an ugly wart on the GUC mechanism, > but I think not difficult to do with an assign_hook (it just has to > refuse non-interactive settings). Hmm...I don't know that I'd want to go that far -- setting this variable could be regarded as a policy decision. Some shops may have very good reason for running with ZERO_DAMAGED_PAGES enabled all the time, but I don't know what those reasons might be. But the fact that I can't think of a good reason isn't sufficient cause to remove that as an option. I would definitely be in favor of issuing a warning ("Running with ZERO_DAMAGED_PAGES enabled will cause you to lose possibly recoverable data whenever a damaged page is encountered. Be sure you know what you're doing") whenever the variable is set, whether it be at startup or during a session. -- Kevin Brown kevin@sysexperts.com
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Perhaps better would be to throw a message message any time it is turned > > on, reminding them it should not be left on. Is that cleaner? > > Where are you going to throw a message to, if it's in postgresql.conf? > > Bleating in the postmaster log probably won't draw the attention of a > user clueless enough to be trying this ;-) Perhaps. I'd bank more on the user checking the postmaster log than being clueful enough not to use this, though, since the postmaster log is pretty much a universal thing that has been around a long time while this option hasn't. Cluelessness about the use of a rather esoteric option doesn't necessarily imply total cluelessness. :-) -- Kevin Brown kevin@sysexperts.com
Andrew Sullivan <andrew@libertyrms.info> writes: > On Wed, Apr 02, 2003 at 03:25:58PM -0500, Tom Lane wrote: > > > the current session. This is kind of an ugly wart on the GUC mechanism, > > but I think not difficult to do with an assign_hook (it just has to > > refuse non-interactive settings). > > It may be an ugly wart, but I think it's only prudent. I'd be > willing to bet a fair amount that there is a significant overlap > between the population which uses Bob's Discount Hardware for 10 > million row, ultra-critical databases and the population which likes > to twiddle postgresql.conf settings without reading the fine > manual. Those folks are going to get burned by this setting unless > it is very hard to turn on. (I think the setting is an excellent > idea, though, for emergency cases.) > > I don't usually like making servers totally idiot-proof, but I can > just imagine the Slashdot conversations after 7.4 comes out (and for > at least 60 or 70 years thereafter): "PostgreSQL just randomly zeroes > a page of data! It sucks! Use MySQL instead. It can recover from > bad pages on your disk by randomly making up data for you, instead of > just writing zeros." Perhaps the real answer is to use a more *descriptive* name for the variable than ZERO_DAMAGED_PAGES. Something along the lines of CLUELESS_DBA_ESCHEWS_BACKUPS, or ONLY_AN_IDIOT_SETS_THIS_VARIABLE. You don't have to read the manual to see that these variables are not to be trifled with. There are some cases where this particular feature would be useful. What needs to be done is to make the feature less dangerous to the newbie without making it less useful to the person who actually needs the functionality. Jason
Kevin Brown <kevin@sysexperts.com> writes: > Hmm...I don't know that I'd want to go that far -- setting this > variable could be regarded as a policy decision. Some shops may have > very good reason for running with ZERO_DAMAGED_PAGES enabled all the > time, but I don't know what those reasons might be. I would buy this argument if I could imagine even a faintly plausible reason for doing that ... but I can't. regards, tom lane
On Wed, Apr 02, 2003 at 02:07:02PM -0700, Jason Earl wrote: > There are some cases where this particular feature would be useful. > What needs to be done is to make the feature less dangerous to the > newbie without making it less useful to the person who actually needs > the functionality. I'll repeat what I said before (I think to Tom): "That's a pretty big foot-gun you've got there." I cannot possibly imagine what sort of data-recovery situation would warrant running with the option turned on all the time. You know you have big-trouble, oh-no, ISP ran over the tapes while they were busy pitching magnets through your cage, data corruption problems, and this is your best hope for recovery? Great. Log in, turn on this option, and start working. But across every back end? It's the doomsday device for databases. Emergency recovery options are _good_: everyone has his favourite UNIX file undeletion story. But it's sure not a good idea to give everyone root just in case she or he deletes some file. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan <andrew@libertyrms.info> writes: > You know you have big-trouble, oh-no, ISP ran over > the tapes while they were busy pitching magnets through your cage, > data corruption problems, and this is your best hope for recovery? > Great. Log in, turn on this option, and start working. But across > every back end? It's the doomsday device for databases. Yeah, it is. Actually, the big problem with it in my mind is this scenario: you get a page-header-corrupted error on page X, you investigate and decide there's no hope for page X, so you turn on zero_damaged_pages and try to dump the table. It comes to page X, complains, zeroes it, proceeds, ... and then comes to damaged page Y, complains, zeroes it, proceeds. Maybe you didn't know page Y had problems. Maybe you could have gotten something useful out of page Y if you'd looked first. Too late now. What I'd really prefer to see is not a ZERO_DAMAGED_PAGES setting, but an explicit command to "DESTROY PAGE n OF TABLE foo". That would make you manually admit defeat for each individual page before it'd drop data. But I don't presently have time to implement such a command (any volunteers out there?). Also, I could see where try-to-dump, fail, DESTROY, try again, lather, rinse, repeat, could get pretty tedious on a badly damaged table. regards, tom lane
Tom Lane wrote: > Andrew Sullivan <andrew@libertyrms.info> writes: > > You know you have big-trouble, oh-no, ISP ran over > > the tapes while they were busy pitching magnets through your cage, > > data corruption problems, and this is your best hope for recovery? > > Great. Log in, turn on this option, and start working. But across > > every back end? It's the doomsday device for databases. > > Yeah, it is. Actually, the big problem with it in my mind is this > scenario: you get a page-header-corrupted error on page X, you > investigate and decide there's no hope for page X, so you turn on > zero_damaged_pages and try to dump the table. It comes to page X, > complains, zeroes it, proceeds, ... and then comes to damaged page Y, > complains, zeroes it, proceeds. Maybe you didn't know page Y had > problems. Maybe you could have gotten something useful out of page Y > if you'd looked first. Too late now. > > What I'd really prefer to see is not a ZERO_DAMAGED_PAGES setting, > but an explicit command to "DESTROY PAGE n OF TABLE foo". That would > make you manually admit defeat for each individual page before it'd > drop data. But I don't presently have time to implement such a command > (any volunteers out there?). Also, I could see where try-to-dump, fail, > DESTROY, try again, lather, rinse, repeat, could get pretty tedious on a > badly damaged table. Can we make the GUC setting a numeric, so you can set it to 1 and it clears just one page, or 0 and it clears all pages? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> What I'd really prefer to see is not a ZERO_DAMAGED_PAGES setting, > but an explicit command to "DESTROY PAGE n OF TABLE foo". That would > make you manually admit defeat for each individual page before it'd > drop data. But I don't presently have time to implement such a command > (any volunteers out there?). Also, I could see where try-to-dump, fail, > DESTROY, try again, lather, rinse, repeat, could get pretty tedious on a > badly damaged table. I'm not volunteering, but this would be better: ALTER TABLE foo ZERO [ PAGE n | BAD PAGES ]; Chris
On Wed, Apr 02, 2003 at 11:10:13PM -0500, Tom Lane wrote: > What I'd really prefer to see is not a ZERO_DAMAGED_PAGES setting, > but an explicit command to "DESTROY PAGE n OF TABLE foo". That would > make you manually admit defeat for each individual page before it'd > drop data. But I don't presently have time to implement such a command > (any volunteers out there?). Also, I could see where try-to-dump, fail, > DESTROY, try again, lather, rinse, repeat, could get pretty tedious on a > badly damaged table. Huh, and what if I accidentally mistype the number and destroy a valid page? Maybe the command should only succeed if it confirms that the page is corrupted. There could also be "DESTROY ALL PAGES" and/or "DESTROY n PAGES" commands, where the latter zeroes the first n corrupted pages in the table, for the case with lots of corrupted pages. Maybe a command for getting a list of the corrupted pages is useful? -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "La verdad no siempre es bonita, pero el hambre de ella si"
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > Huh, and what if I accidentally mistype the number and destroy a valid > page? Maybe the command should only succeed if it confirms that the > page is corrupted. Good point ... but what if the corruption is subtle enough that the automatic tests don't notice it? It could be that only "SELECT * FROM ..." actually dumps core. This is an acute problem for tests that are generic enough to be implemented in just one or a few places --- for example, the tests I just added have no idea whether they are looking at a heap or index page, much less what kind of index page. > There could also be "DESTROY ALL PAGES" and/or "DESTROY n PAGES" > commands, where the latter zeroes the first n corrupted pages in the > table, for the case with lots of corrupted pages. Maybe a command for > getting a list of the corrupted pages is useful? We have "DESTROY ALL PAGES", it's called DROP TABLE. The other doesn't appeal to me either, because it doesn't say exactly which N pages you're willing to lose. regards, tom lane
Tom Lane writes: > Andrew Sullivan expressed concern about this, too. The thing could > be made a little more failsafe if we made it impossible to set > ZERO_DAMAGED_PAGES to true in postgresql.conf, or by any means other > than an actual SET command --- whose impact would then be limited to > the current session. This is kind of an ugly wart on the GUC mechanism, > but I think not difficult to do with an assign_hook (it just has to > refuse non-interactive settings). Fighting against people who randomly change settings without being informed about what they do is pointless. It's like trying to prevent 'rm -rf /*'. And it's not like you can easily set anything in postgresql.conf by accident. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > Tom Lane writes: >> Andrew Sullivan expressed concern about this, too. The thing could >> be made a little more failsafe if we made it impossible to set >> ZERO_DAMAGED_PAGES to true in postgresql.conf, or by any means other >> than an actual SET command --- whose impact would then be limited to >> the current session. This is kind of an ugly wart on the GUC mechanism, >> but I think not difficult to do with an assign_hook (it just has to >> refuse non-interactive settings). > Fighting against people who randomly change settings without being > informed about what they do is pointless. If you don't want an active defense, how about a passive one --- like just not listing zero_damaged_pages in postgresql.conf.sample? We already have several variables deliberately not listed there ... regards, tom lane
On Thu, Apr 03, 2003 at 02:39:17PM -0500, Tom Lane wrote: > just not listing zero_damaged_pages in postgresql.conf.sample? We > already have several variables deliberately not listed there ... Hey, that might be a good solution. Of course, it doesn't solve the "doomsday device" problem, but nobody who uses it can complain that they didn't know what the thing would do. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On 2003-04-02 16:18:33 -0500, Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > Hmm...I don't know that I'd want to go that far -- setting this > > variable could be regarded as a policy decision. Some shops may have > > very good reason for running with ZERO_DAMAGED_PAGES enabled all the > > time, but I don't know what those reasons might be. > > I would buy this argument if I could imagine even a faintly plausible > reason for doing that ... but I can't. > > regards, tom lane I've been following this discussion with great interest, because I actually have a situation where running with ZERO_DAMAGED_PAGES on all the time would be somewhat plausible. We use a PostgreSQL database purely for caching pages for a very busy website. A user changes some stuff which causes a page on the site to change; the HTML for the new page gets generated using the data from another database (containing all the actual data), and the generated HTML is inserted into this PG database. When a page is requested that isn't cached yet, it'll be generated and inserted too. This makes it possible to invalidate the cached version of a large number of pages by simply deleting the relevant rows and not spending the time to regenerate all that data immediately (and it makes crash recovery more robust). We can afford to lose all the data in the cache DB, because it's all generated by using other data anyway. But losing all data would be bad from a performance/uptime perspective, as all the cached data would need to be regenerated (which takes a few days). Making backups once a day and restoring such a backup when something goes wrong is also impractical, because in our situation old data is much worse than no data at all. I'm working on a script to detect old data and delete it so a new page will be generated, but that isn't finished yet. Two weeks ago the server running this database screwed up (it crashed pretty badly) and made some data unreadable. Although I was running with fsync on, on an ext3 partition (with data=writeback, linux 2.4.20, PG 7.2), some of the PG datafiles got damaged anyway (I blame IDE disks). The damage seemed light enough to keep running with this dataset (it occasionally borked with 'heap_delete: (am)invalid tid', but since our application attempts a delete followed by an insert of newly generated data in case of a db error it would repair itself most of the time). Two crashes later (weirdly patched kernels hooray) the errors got progressively worse ('missing chunk number 0 for toast value 79960605' and stuff like that) so we were forced to shut the website down, dump all the data we could dump (not everything), initdb and restore that dump. This cost us about 10 hours downtime. If I'd had the option I just would've set ZERO_DAMAGED_PAGES to true and let it run for a few days to sort itself out. Alternatively an ALTER TABLE foo ZERO DAMAGED PAGES; would've worked as well, although that would create a small downtime too. I know I'm doing a lot of weird things, and that I could avoid a lot of the problems listed here were I to do things differently, but the fact remains that I actually have a real-life situation where running (for a while at least) with ZERO_DAMAGED_PAGES on makes some kind of sense. Vincent van Leeuwen Media Design - http://www.mediadesign.nl/
Vincent van Leeuwen <pgsql.spam@vinz.nl> writes: > ... This cost us about 10 hours downtime. If I'd had the option I just > would've set ZERO_DAMAGED_PAGES to true and let it run for a few days to sort > itself out. Yikes. If I understand this correctly, you had both critical data and cache data in the same database. As for the cache stuff, a few quick TRUNCATE TABLE commands would have gotten you out of the woods. As for the critical data (the stuff you actually needed to dump and restore), do you really want ZERO_DAMAGED_PAGES on for that? It's a heck of a blunt tool. regards, tom lane
On 2003-04-03 18:40:54 -0500, Tom Lane wrote: > Vincent van Leeuwen <pgsql.spam@vinz.nl> writes: > > ... This cost us about 10 hours downtime. If I'd had the option I just > > would've set ZERO_DAMAGED_PAGES to true and let it run for a few days to sort > > itself out. > > Yikes. If I understand this correctly, you had both critical data and > cache data in the same database. As for the cache stuff, a few quick > TRUNCATE TABLE commands would have gotten you out of the woods. As for > the critical data (the stuff you actually needed to dump and restore), > do you really want ZERO_DAMAGED_PAGES on for that? It's a heck of a > blunt tool. > > regards, tom lane No, it wasn't that bad :) The REAL data is on a different server which hasn't let us down so far (and has reliable hardware and software, and backups :)). Only the cache database was hurt. The problem with truncating everything was that rebuilding the cache would cost about 48 hours downtime, as there is A LOT of data to rebuild. This really is an interim solution; things will be constructed much better and more reliably in the future, but for now it's there. Another reason we went for the dump/restore is that we upgraded to 7.3.2 at the same time, which we were postponing because we weren't looking forward to that downtime :) Vincent van Leeuwen Media Design - http://www.mediadesign.nl/
Andrew Sullivan wrote: > On Thu, Apr 03, 2003 at 02:39:17PM -0500, Tom Lane wrote: > > just not listing zero_damaged_pages in postgresql.conf.sample? We > > already have several variables deliberately not listed there ... > > Hey, that might be a good solution. Of course, it doesn't solve the > "doomsday device" problem, but nobody who uses it can complain that > they didn't know what the thing would do. Shouldn't each variable listed in postgresql.conf.sample have comments right above it explaining what it does anyway? A self-documenting configuration file is a really handy thing to have. If it's documented that way in postgresql.conf.sample (and adjacent to the variable itself, even) then nobody who changed it would have grounds to complain about not knowing what the variable did. I'm much more in favor of being lucid and upfront about everything than hiding things just because they might be dangerous. That said, all sorts of warnings and such should be in that bit of documentation in postgresql.conf.sample, so that it's made abundantly clear that this particular option is not one to be messing with except when you know exactly what you're doing... -- Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes: > Shouldn't each variable listed in postgresql.conf.sample have comments > right above it explaining what it does anyway? Not really --- if you can't be bothered to consult the Admin Guide when in doubt, you have no business editing the config file. A word or two of hints is one thing, but circles and arrows and a paragraph on the back of each one just ain't gonna fly. We went down that path previously with pg_hba.conf, and finally realized that it was a waste of time to maintain what amounted to two separate sets of documentation. regards, tom lane
On Fri, Apr 04, 2003 at 05:52:46PM -0800, Kevin Brown wrote: > Shouldn't each variable listed in postgresql.conf.sample have comments > right above it explaining what it does anyway? A self-documenting > configuration file is a really handy thing to have. That's no help. It makes the config file very long (some of the config values are complicated to use, and so you'd need to explain all of that to avoid danger), so people are likely to stop reading the comments. Given that the thing one is trying to prevent is handing a loaded foot-gun to the RTFM-averse, adding documentation doesn't seem to be a help. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110