Thread: Undetected corruption of table files
I am slightly worried that corruption of data files may remain
undetected in PostgreSQL.

As an experiment, I created a simple table with a primary key index and
inserted a couple of rows. The corresponding data file is 1 page = 8K
long. Now when I stop the server, zero out the data file with

    dd if=/dev/zero of=45810 bs=8192 count=1

and start the server again, the table is empty when I SELECT from it,
and no errors are reported. Only a VACUUM gives me

    WARNING: relation "test" page 0 is uninitialized --- fixing

and the file is truncated to length zero.

The next thing I tried was to scribble randomly into the 8K data file
with a hex editor at different locations. Some of these actions
provoked error messages, ranging from

    ERROR: invalid page header in block 0 of relation "test"

through

    ERROR: could not access status of transaction 1954047348

to

    LOG: server process (PID 28149) was terminated by signal 11

Frequently, though, the result was that some of the rows were
"missing": there was no error message when I SELECTed from the table,
but some of the rows were gone. I got no errors or warnings from
VACUUM either.

As far as I know, there is no tool to verify the integrity of a
PostgreSQL table.

- Shouldn't there be an error, some kind of 'missing magic number' or
  similar, when a table file consists of only zeros?
- Wouldn't it be desirable to have some means to verify the integrity
  of a table file or a whole database?

Yours,
Laurenz Albe
"Albe Laurenz" <all@adv.magwien.gv.at> writes: > - Shouldn't there be an error, some kind of 'missing magic > number' or similar, when a table file consists of only > zeros? The particular case of an all-zeroes page is specifically allowed, and has to be because it's a valid transient state in various scenarios. > - Wouldn't it be desirable to have some means to verify the > integrity of a table file or a whole database? SELECT * usually does reasonably well at that. regards, tom lane
Tom Lane wrote:
>> - Shouldn't there be an error, some kind of 'missing magic
>>   number' or similar, when a table file consists of only zeros?
>
> The particular case of an all-zeroes page is specifically allowed,
> and has to be because it's a valid transient state in various
> scenarios.

I see, that was a pathological case.

>> - Wouldn't it be desirable to have some means to verify the
>>   integrity of a table file or a whole database?
>
> SELECT * usually does reasonably well at that.

Would it be an option to have a checksum somewhere in each data block
that is verified upon read? Would this be a lot of work to add?

I'd be willing to try it if it is desirable and feasible.

Yours,
Laurenz Albe
"Albe Laurenz" <all@adv.magwien.gv.at> writes: > Would it be an option to have a checksum somewhere in each > data block that is verified upon read? That's been proposed before and rejected before. See the archives ... regards, tom lane
Tom Lane wrote:
>> Would it be an option to have a checksum somewhere in each
>> data block that is verified upon read?
>
> That's been proposed before and rejected before. See the archives ...

I searched for "checksum" and couldn't find it. Could someone give me
a pointer? I'm not talking about WAL files here.

Thanks,
Laurenz Albe
"Albe Laurenz" <all@adv.magwien.gv.at> writes: > Tom Lane wrote: >>> Would it be an option to have a checksum somewhere in each >>> data block that is verified upon read? >> That's been proposed before and rejected before. See the archives ... > I searched for "checksum" and couldn't find it. Could someone > give me a pointer? I'm not talking about WAL files here. "CRC" maybe? Also, make sure your search goes all the way back; I think the prior discussions were around the same time WAL was initially put in, and/or when we dropped the WAL CRC width from 64 to 32 bits. The very measurable overhead of WAL CRCs are the main thing that's discouraged us from having page CRCs. (Well, that and the lack of evidence that they'd actually gain anything.) regards, tom lane
On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> that and the lack of evidence that they'd actually gain anything

I find it somewhat ironic that PostgreSQL strives to be fairly
non-corruptible, yet has no way to detect a corrupted page. The only
reason for not having CRCs is that they would slow down performance...
which is exactly the opposite of conventional PostgreSQL wisdom (no
performance trade-off for durability).

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
On 8/27/07, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> that and the lack of evidence that they'd actually gain anything
>
> I find it somewhat ironic that PostgreSQL strives to be fairly
> non-corruptible, yet has no way to detect a corrupted page. The only
> reason for not having CRCs is that they would slow down
> performance... which is exactly the opposite of conventional
> PostgreSQL wisdom (no performance trade-off for durability).

But how does detecting a corrupted data page gain you any durability?
All it means is that the platform underneath screwed up, and you've
already *lost* durability. What do you do then?

It seems like the same idea as an application trying to detect RAM
errors.
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Albe Laurenz" <all@adv.magwien.gv.at> writes: >> Tom Lane wrote: >>>> Would it be an option to have a checksum somewhere in each >>>> data block that is verified upon read? > >>> That's been proposed before and rejected before. See the archives ... > >> I searched for "checksum" and couldn't find it. Could someone >> give me a pointer? I'm not talking about WAL files here. > > "CRC" maybe? Also, make sure your search goes all the way back; I think > the prior discussions were around the same time WAL was initially put > in, and/or when we dropped the WAL CRC width from 64 to 32 bits. > The very measurable overhead of WAL CRCs are the main thing that's > discouraged us from having page CRCs. (Well, that and the lack of > evidence that they'd actually gain anything.) I thought we determined the reason WAL CRCs are expensive is because we have to checksum each WAL record individually. I recall the last time this came up I ran some microbenchmarks and found that the cost to CRC an entire 8k block was on the order of tens of microseconds. The last time it came up was in the context of allowing turning off full_page_writes but offering a guarantee that torn pages would be detected on recovery and no later. I was a proponent of using writev to embed bytes in each 512 byte block and Jonah said it would be no faster than a CRC (and obviously considerably more complicated). My benchmarks showed that Jonah was right and the CRC was cheaper than a the added cost of using writev. I do agree the benefits of having a CRC are overstated. Most times corruption is caused by bad memory and a CRC will happily checksum the corrupted memory just fine. A checksum is no guarantee. But I've also seen data corruption caused by bad memory in an i/o controller, for example. There are always going to be cases where it could help. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote:
> On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> that and the lack of evidence that they'd actually gain anything
>
> I find it somewhat ironic that PostgreSQL strives to be fairly
> non-corruptible, yet has no way to detect a corrupted page. The only
> reason for not having CRCs is that they would slow down
> performance... which is exactly the opposite of conventional
> PostgreSQL wisdom (no performance trade-off for durability).

Why? I can't say I speak for the developers, but I think the reason is
that data corruption can (with the very rare exception of undetected
programming errors) only be caused by hardware problems.

If you have a "proper" production database server, your memory has
error checking, and your RAID controller has something of the kind as
well. If not, you would probably be running the database on a
filesystem that has reliable integrity verification mechanisms. In the
worst case (all the above mechanisms fail), you have backups.

IMHO the problem is covered quite adequately. The operating system and
the hardware cover for the database, as they should; it's _their_ job.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
"Trevor Talbot" <quension@gmail.com> writes: > On 8/27/07, Jonah H. Harris <jonah.harris@gmail.com> wrote: >> I find it somewhat ironic that PostgreSQL strives to be fairly >> non-corruptable, yet has no way to detect a corrupted page. > But how does detecting a corrupted data page gain you any durability? > All it means is that the platform underneath screwed up, and you've > already *lost* durability. What do you do then? Indeed. In fact, the most likely implementation of this (refuse to do anything with a page with a bad CRC) would be a net loss from that standpoint, because you couldn't get *any* data out of a page, even if only part of it had been zapped. regards, tom lane
On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Indeed. In fact, the most likely implementation of this (refuse to do
> anything with a page with a bad CRC) would be a net loss from that
> standpoint, because you couldn't get *any* data out of a page, even
> if only part of it had been zapped.

At least you would know it was corrupted, instead of getting funky
errors and/or crashes.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
On Mon, Aug 27, 2007 at 12:08:17PM -0400, Jonah H. Harris wrote:
> On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Indeed. In fact, the most likely implementation of this (refuse to
>> do anything with a page with a bad CRC) would be a net loss from
>> that standpoint, because you couldn't get *any* data out of a page,
>> even if only part of it had been zapped.

I think it'd be perfectly reasonable to have a mode where you could
bypass the check so that you could see what was in the corrupted page
(as well as deleting everything on the page so that you could "fix"
the corruption). Obviously, this should be restricted to superusers.

> At least you would know it was corrupted, instead of getting funky
> errors and/or crashes.

Or worse, getting what appears to be perfectly valid data, but isn't.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
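A sketch of what that read path might look like, with a bypass setting
of the kind suggested above. Every name here (ignore_checksum_failures,
pd_checksum, verify_page_on_read) is hypothetical rather than a real
PostgreSQL API, and the additive checksum is only a placeholder for a
real CRC.

    /* checkread.c: sketch of verify-on-read with a bypass setting. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLCKSZ 8192

    typedef struct {
        uint16_t      pd_checksum;   /* hypothetical per-page checksum */
        unsigned char data[BLCKSZ - sizeof(uint16_t)];
    } Page;

    /* hypothetical setting, off by default; only a superuser could
     * enable it, per the suggestion above */
    static bool ignore_checksum_failures = false;

    static uint16_t compute_page_checksum(const Page *page)
    {
        /* placeholder checksum; skips the checksum field itself */
        uint32_t sum = 0;
        for (size_t i = 0; i < sizeof(page->data); i++)
            sum += page->data[i];
        return (uint16_t) (sum & 0xFFFF);
    }

    static void verify_page_on_read(const Page *page, long blkno)
    {
        if (compute_page_checksum(page) == page->pd_checksum)
            return;
        if (ignore_checksum_failures) {
            /* let the superuser inspect or repair the damaged page */
            fprintf(stderr, "WARNING: checksum mismatch in block %ld, "
                            "returning page anyway\n", blkno);
            return;
        }
        fprintf(stderr, "ERROR: checksum mismatch in block %ld\n", blkno);
        exit(1);
    }

    int main(void)
    {
        static Page page;                  /* zero-initialized */
        page.data[0] = 42;
        page.pd_checksum = compute_page_checksum(&page);
        verify_page_on_read(&page, 0);     /* passes silently */

        page.data[1] ^= 0xFF;              /* simulate on-disk corruption */
        ignore_checksum_failures = true;   /* bypass: warn, don't error */
        verify_page_on_read(&page, 0);
        return 0;
    }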
At 11:48 PM 8/27/2007, Trevor Talbot wrote:
> On 8/27/07, Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> On 8/27/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> that and the lack of evidence that they'd actually gain anything
>>
>> I find it somewhat ironic that PostgreSQL strives to be fairly
>> non-corruptible, yet has no way to detect a corrupted page. The
>> only reason for not having CRCs is that they would slow down
>> performance... which is exactly the opposite of conventional
>> PostgreSQL wisdom (no performance trade-off for durability).
>
> But how does detecting a corrupted data page gain you any durability?
> All it means is that the platform underneath screwed up, and you've
> already *lost* durability. What do you do then?

The benefit I see is that you get to change the platform underneath
sooner rather than later. Whether that's worth it or not I don't know;
real-world stats/info would be good. Even my home PATA drives tend to
grumble about stuff first before they fail, so it might not be
worthwhile doing the extra work.

Regards,
Link.
* Alban Hertroys:

> If you have a "proper" production database server, your memory has
> error checking, and your RAID controller has something of the kind
> as well.

To my knowledge, no readily available controller performs validation
on reads (not even for RAID-1 or RAID-10, where it would be pretty
straightforward).

Something like an Adler32 checksum (not a full CRC) on each page might
be helpful. However, what I'd really like to see is something that
catches missed writes, but this is very difficult to implement AFAICT.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
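For illustration, an Adler-32 over one 8K page looks like this. This
is the textbook algorithm rather than anything in PostgreSQL; a
production version would defer the modulo operations for speed, as
zlib's adler32() does. Its appeal over a CRC is that it needs no table
lookups, at the cost of weaker error detection on short inputs.

    /* adler32_page.c: textbook Adler-32 over one 8K page. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ    8192
    #define ADLER_MOD 65521   /* largest prime below 2^16 */

    static uint32_t adler32_buf(const unsigned char *buf, size_t len)
    {
        uint32_t a = 1, b = 0;
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % ADLER_MOD;
            b = (b + a) % ADLER_MOD;
        }
        return (b << 16) | a;
    }

    int main(void)
    {
        unsigned char page[BLCKSZ];
        for (int i = 0; i < BLCKSZ; i++)
            page[i] = (unsigned char) (i * 7);
        printf("adler32 = %08x\n", (unsigned) adler32_buf(page, BLCKSZ));
        return 0;
    }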