Thread: Block-level CRC checks
A customer of ours has been having trouble with corrupted data for some time. Of course, we've almost always blamed hardware (and we've seen RAID controllers have their firmware upgraded, among other actions), but the useful thing to know is when corruption has happened, and where. So we've been tasked with adding CRCs to data files. The idea is that these CRCs are going to be checked just after reading files from disk, and calculated just before writing it. They are just a protection against the storage layer going mad; they are not intended to protect against faulty RAM, CPU or kernel. This code would be run-time or compile-time configurable. I'm not absolutely sure which yet; the problem with run-time is what to do if the user restarts the server with the setting flipped. It would have almost no impact on users who don't enable it. The implementation I'm envisioning requires the use of a new relation fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum for each block. FlushBuffer would calculate the checksum and store it in the CRC fork; ReadBuffer_common would read the page, calculate the checksum, and compare it to the one stored in the CRC fork. A buffer's io_in_progress lock protects the buffer's CRC. We read and pin the CRC page before acquiring the lock, to avoid having two buffer IO operations in flight. I'd like to submit this for 8.4, but I want to ensure that -hackers at large approve of this feature before starting serious coding. Opinions? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
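To make the proposed read/write hooks concrete, here is a small self-contained sketch of the idea (compile with -lz). It is not PostgreSQL code: crc_fork, checksum_on_write and verify_on_read are made-up names standing in for the CRC fork and for the work that would happen at the FlushBuffer and ReadBuffer_common call sites.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ  8192
#define NBLOCKS 1024                      /* toy "relation" size */

static uint32_t crc_fork[NBLOCKS];        /* stand-in for the per-block CRC fork */

static uint32_t block_crc(const unsigned char *page)
{
    return (uint32_t) crc32(0L, page, BLCKSZ);
}

/* roughly what FlushBuffer would do just before the write() */
static void checksum_on_write(const unsigned char *page, unsigned blkno)
{
    crc_fork[blkno] = block_crc(page);
}

/* roughly what ReadBuffer_common would do just after the read() */
static int verify_on_read(const unsigned char *page, unsigned blkno)
{
    uint32_t actual = block_crc(page);

    if (crc_fork[blkno] != actual)
    {
        /* the real thing would presumably raise an ERROR here */
        fprintf(stderr, "block %u: stored CRC %08x, computed %08x\n",
                blkno, crc_fork[blkno], actual);
        return 0;
    }
    return 1;
}

int main(void)
{
    static unsigned char page[BLCKSZ];

    memset(page, 0x5A, sizeof(page));
    checksum_on_write(page, 0);            /* block written out */
    page[100] ^= 0x01;                     /* simulate on-disk corruption */
    printf("verify: %s\n", verify_on_read(page, 0) ? "ok" : "CRC MISMATCH");
    return 0;
}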
On Tue, Sep 30, 2008 at 2:02 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > A customer of ours has been having trouble with corrupted data for some > time. Of course, we've almost always blamed hardware (and we've seen > RAID controllers have their firmware upgraded, among other actions), but > the useful thing to know is when corruption has happened, and where. Agreed. > So we've been tasked with adding CRCs to data files. Awesome. > The idea is that these CRCs are going to be checked just after reading > files from disk, and calculated just before writing it. They are > just a protection against the storage layer going mad; they are not > intended to protect against faulty RAM, CPU or kernel. This is the common case. > This code would be run-time or compile-time configurable. I'm not > absolutely sure which yet; the problem with run-time is what to do if > the user restarts the server with the setting flipped. It would have > almost no impact on users who don't enable it. I've supported this forever! > The implementation I'm envisioning requires the use of a new relation > fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum > for each block. FlushBuffer would calculate the checksum and store it > in the CRC fork; ReadBuffer_common would read the page, calculate the > checksum, and compare it to the one stored in the CRC fork. > > A buffer's io_in_progress lock protects the buffer's CRC. We read and > pin the CRC page before acquiring the lock, to avoid having two buffer > IO operations in flight. If the CRC gets written before the block, how is recovery going to handle it? I'm not too familiar with the new forks stuff, but recovery will pull the old block, compare it against the checksum, and consider the block invalid, correct? > I'd like to submit this for 8.4, but I want to ensure that -hackers at > large approve of this feature before starting serious coding. IMHO, this is a functionality that should be enabled by default (as it is on most other RDBMS). It would've prevented severe corruption in the 20 or so databases I've had to fix, and other than making it optional, I don't see the reasoning for a separate relation fork rather than storing it directly on the block (as everyone else does). Similarly, I think Greg Stark was playing with a patch for it (http://archives.postgresql.org/pgsql-hackers/2007-02/msg01850.php). -- Jonah H. Harris, Senior DBA myYearbook.com
Alvaro Herrera <alvherre@commandprompt.com> writes: > The implementation I'm envisioning requires the use of a new relation > fork to store the per-block CRCs. That seems bizarre, and expensive, and if you lose one block of the CRC fork you lose confidence in a LOT of data. Why not keep the CRCs in the page headers? > A buffer's io_in_progress lock protects the buffer's CRC. Unfortunately, it doesn't. See hint bits. regards, tom lane
Alvaro Herrera wrote: > A customer of ours has been having trouble with corrupted data for some > time. Of course, we've almost always blamed hardware (and we've seen > RAID controllers have their firmware upgraded, among other actions), but > the useful thing to know is when corruption has happened, and where. > > So we've been tasked with adding CRCs to data files. > > The idea is that these CRCs are going to be checked just after reading > files from disk, and calculated just before writing it. They are > just a protection against the storage layer going mad; they are not > intended to protect against faulty RAM, CPU or kernel. This has been suggested before, and the usual objection is precisely that it only protects from errors in the storage layer, giving a false sense of security. Don't some filesystems include a per-block CRC, which would achieve the same thing? ZFS? > This code would be run-time or compile-time configurable. I'm not > absolutely sure which yet; the problem with run-time is what to do if > the user restarts the server with the setting flipped. It would have > almost no impact on users who don't enable it. Yeah, seems like it would need to be compile-time or initdb-time configurable. > The implementation I'm envisioning requires the use of a new relation > fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum > for each block. FlushBuffer would calculate the checksum and store it > in the CRC fork; ReadBuffer_common would read the page, calculate the > checksum, and compare it to the one stored in the CRC fork. Surely it would be much simpler to just add a field to the page header. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 30 Sep 2008 14:33:04 -0400 "Jonah H. Harris" <jonah.harris@gmail.com> wrote: > > I'd like to submit this for 8.4, but I want to ensure that -hackers > > at large approve of this feature before starting serious coding. > > IMHO, this is a functionality that should be enabled by default (as it > is on most other RDBMS). It would've prevented severe corruption in What other RDBMS have it enabled by default? Sincerely, Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/
On Tue, Sep 30, 2008 at 2:49 PM, Joshua Drake <jd@commandprompt.com> wrote: > On Tue, 30 Sep 2008 14:33:04 -0400 > "Jonah H. Harris" <jonah.harris@gmail.com> wrote: > >> > I'd like to submit this for 8.4, but I want to ensure that -hackers >> > at large approve of this feature before starting serious coding. >> >> IMHO, this is a functionality that should be enabled by default (as it >> is on most other RDBMS). It would've prevented severe corruption in > > What other RDBMS have it enabled by default? Oracle and (I believe) SQL Server >= 2005 -- Jonah H. Harris, Senior DBA myYearbook.com
Hello Alvaro, some random thoughts while reading your proposal follow... Alvaro Herrera wrote: > So we've been tasked with adding CRCs to data files. Disks get larger and relative reliability shrinks, it seems. So I agree that this is a worthwhile thing to have. But shouldn't that be the job of the filesystem? Think of ZFS or the upcoming BTRFS. > The idea is that these CRCs are going to be checked just after reading > files from disk, and calculated just before writing it. They are > just a protection against the storage layer going mad; they are not > intended to protect against faulty RAM, CPU or kernel. That sounds reasonable if we do it from Postgres. > This code would be run-time or compile-time configurable. I'm not > absolutely sure which yet; the problem with run-time is what to do if > the user restarts the server with the setting flipped. It would have > almost no impact on users who don't enable it. I'd say calculating a CRC is close enough to be considered "no impact". A single core of a modern CPU easily reaches way above 200 MiB/s throughput for CRC32 today. See [1]. Maybe consider Adler-32 which is 3-4x faster [2], also part of zlib and AFAIK about equally safe for 8k blocks and above. > The implementation I'm envisioning requires the use of a new relation > fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum > for each block. FlushBuffer would calculate the checksum and store it > in the CRC fork; ReadBuffer_common would read the page, calculate the > checksum, and compare it to the one stored in the CRC fork. Huh? Aren't CRCs normally stored as part of the block they are supposed to protect? Or how do you expect to ensure the data from the CRC relation fork is correct? How about crash safety (a data block written, but not its CRC block or vice versa)? Wouldn't that double the amount of seeking required for writes? > I'd like to submit this for 8.4, but I want to ensure that -hackers at > large approve of this feature before starting serious coding. Very cool! Regards Markus Wanner [1]: Crypto++ benchmarks: http://www.cryptopp.com/benchmarks.html [2]: Wikipedia about hash functions: http://en.wikipedia.org/wiki/List_of_hash_functions#Computational_costs_of_CRCs_vs_Hashes
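For anyone who wants to check Markus's throughput numbers on their own hardware, here is a small, self-contained and purely illustrative benchmark of zlib's crc32() and adler32() over an 8 kB block (compile with -lz); the loop count and the printed MB/s figure are arbitrary choices for the example.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

#define BLCKSZ 8192
#define LOOPS  100000

int main(void)
{
    static unsigned char page[BLCKSZ];
    uLong crc = 0, adl = 0;
    const uLong adler_seed = adler32(0L, Z_NULL, 0);   /* standard seed (== 1) */
    clock_t t0, t1;

    memset(page, 0xAB, sizeof(page));

    t0 = clock();
    for (int i = 0; i < LOOPS; i++)
        crc = crc32(0L, page, BLCKSZ);
    t1 = clock();
    printf("crc32   = %08lx  (%.1f MB/s)\n", (unsigned long) crc,
           (double) BLCKSZ * LOOPS / 1e6 / ((double) (t1 - t0) / CLOCKS_PER_SEC));

    t0 = clock();
    for (int i = 0; i < LOOPS; i++)
        adl = adler32(adler_seed, page, BLCKSZ);
    t1 = clock();
    printf("adler32 = %08lx  (%.1f MB/s)\n", (unsigned long) adl,
           (double) BLCKSZ * LOOPS / 1e6 / ((double) (t1 - t0) / CLOCKS_PER_SEC));

    return 0;
}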
Alvaro Herrera wrote: > Initially I'm aiming at a CRC32 sum > for each block. FlushBuffer would calculate the checksum and store it > in the CRC fork; ReadBuffer_common would read the page, calculate the > checksum, and compare it to the one stored in the CRC fork. There's one fundamental problem with that, related to the way our hint bits are written. Currently, hint bit updates are not WAL-logged, and thus no full page write is done when only hint bits are changed. Imagine what happens if hint bits are updated on a page, but there's no other changes, and we crash so that only one half of the new page version makes it to disk (= torn page). The CRC would not match, even though the page is actually valid. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 30 Sep 2008, Heikki Linnakangas wrote: > Don't some filesystems include a per-block CRC, which would achieve the > same thing? ZFS? Yes, there is a popular advocacy piece for ZFS with a high-level view of why and how they implement that at http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data The guarantees are stronger than what you can get if you just put a CRC in the block itself. I'd never really thought too hard about putting this in the database knowing that ZFS is available for environments where this is a concern, but it certainly would be a nice addition. The best analysis I've ever seen that makes a case for OS or higher level disk checksums of some sort, by looking at the myriad ways that disks and disk arrays fail in the real world, is in http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf (there is a shorter version that hits the high points of that at http://www.usenix.org/publications/login/2008-06/openpdfs/bairavasundaram.pdf ) One really interesting bit in there I'd never seen before is that they find real data that supports the stance that enterprise drives are significantly more reliable than consumer ones. While general failure rates aren't that different, "SATA disks have an order of magnitude higher probability of developing checksum mismatches than Fibre Channel disks. We find that 0.66% of SATA disks develop at least one mismatch during the first 17 months in the field, whereas only 0.06% of Fibre Channel disks develop a mismatch during that time." -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> A customer of ours has been having trouble with corrupted data for some > time. Of course, we've almost always blamed hardware (and we've seen > RAID controllers have their firmware upgraded, among other actions), but > the useful thing to know is when corruption has happened, and where. That is an important statement: to know when it happens, not necessarily to be able to recover the block or where in the block it is corrupt. Is that correct? > > So we've been tasked with adding CRCs to data files. CRC or checksum? If the objective is merely general "detection" there should be some latitude in choosing the methodology for performance. > > The idea is that these CRCs are going to be checked just after reading > files from disk, and calculated just before writing it. They are > just a protection against the storage layer going mad; they are not > intended to protect against faulty RAM, CPU or kernel. It will actually find faults in all of it. If the CPU can't add and/or a RAM location lost a bit, this will blow up just as easily as a bad block. It may cause "false identification" of an error, but it will keep a bad system from hiding. > > This code would be run-time or compile-time configurable. I'm not > absolutely sure which yet; the problem with run-time is what to do if > the user restarts the server with the setting flipped. It would have > almost no impact on users who don't enable it. CPU capacity on modern hardware within a small area of RAM is practically infinite when compared to any sort of I/O. > > The implementation I'm envisioning requires the use of a new relation > fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum > for each block. FlushBuffer would calculate the checksum and store it > in the CRC fork; ReadBuffer_common would read the page, calculate the > checksum, and compare it to the one stored in the CRC fork. Hell, all that is needed is a long or a short checksum value in the block. I mean, if you just want a sanity test, it doesn't take much. Using a second relation creates confusion. If there is a CRC discrepancy between two different blocks, who's wrong? You need a third "control" to know. If the block knows its CRC or checksum and that is in error, the block is bad. > > A buffer's io_in_progress lock protects the buffer's CRC. We read and > pin the CRC page before acquiring the lock, to avoid having two buffer > IO operations in flight. > > I'd like to submit this for 8.4, but I want to ensure that -hackers at > large approve of this feature before starting serious coding. > > Opinions? If it's fast enough, it's a good idea. It could be very helpful in protecting users' data. > > -- > Alvaro Herrera > http://www.CommandPrompt.com/ > PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > A customer of ours has been having trouble with corrupted data for some > time. Of course, we've almost always blamed hardware (and we've seen > RAID controllers have their firmware upgraded, among other actions), but > the useful thing to know is when corruption has happened, and where. > > So we've been tasked with adding CRCs to data files. Maybe a stupid question, but what I/O subsystems corrupt data and fail to report it? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Sep 30, 2008 at 1:41 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Alvaro Herrera wrote:
>> A customer of ours has been having trouble with corrupted data for some
>> time. Of course, we've almost always blamed hardware (and we've seen
>> RAID controllers have their firmware upgraded, among other actions), but
>> the useful thing to know is when corruption has happened, and where.
>>
>> So we've been tasked with adding CRCs to data files.
>
> Maybe a stupid question, but what I/O subsystems corrupt data and fail
> to report it?

Practically all of them. Here is a good paper on various checksums, their failure rates, and practical applications.

"Parity Lost and Parity Regained"
http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html

-jwb
On Sep 30, 2008, at 2:17 PM, pgsql@mohawksoft.com wrote: >> A customer of ours has been having trouble with corrupted data for >> some >> time. Of course, we've almost always blamed hardware (and we've seen >> RAID controllers have their firmware upgraded, among other >> actions), but >> the useful thing to know is when corruption has happened, and where. > > That is an important statement, to know when it happens not > necessarily to > be able to recover the block or where in the block it is corrupt. > Is that > correct? Oh, correcting the corruption would be AWESOME beyond belief! But at this point I'd settle for just knowing it had happened. >> So we've been tasked with adding CRCs to data files. > > CRC or checksum? If the objective is merely general "detection" there > should be some latitude in choosing the methodology for performance. See above. Perhaps the best win would be a case where you could choose which method you wanted. We generally have extra CPU on the servers, so we could afford to burn some cycles with more complex algorithms. >> The idea is that these CRCs are going to be checked just after >> reading >> files from disk, and calculated just before writing it. They are >> just a protection against the storage layer going mad; they are not >> intended to protect against faulty RAM, CPU or kernel. > > It will actually find faults in all if it. If the CPU can't add and/ > or a > RAM location lost a bit, this will blow up just as easily as a bad > block. > It may cause "false identification" of an error, but it will keep a > bad > system from hiding. Well, very likely not, since the intention is to only compute the CRC when we write the block out, at least for now. In the future I would like to be able to detect when a CPU or memory goes bonkers and poops on something, because that's actually happened to us as well. >> The implementation I'm envisioning requires the use of a new relation >> fork to store the per-block CRCs. Initially I'm aiming at a CRC32 >> sum >> for each block. FlushBuffer would calculate the checksum and >> store it >> in the CRC fork; ReadBuffer_common would read the page, calculate the >> checksum, and compare it to the one stored in the CRC fork. > > Hell, all that is needed is a long or a short checksum value in the > block. > I mean, if you just want a sanity test, it doesn't take much. Using a > second relation creates confusion. If there is a CRC discrepancy > between > two different blocks, who's wrong? You need a third "control" to > know. If > the block knows its CRC or checksum and that is in error, the block is > bad. I believe the idea was to make this as non-invasive as possible. And it would be really nice if this could be enabled without a dump/ reload (maybe the upgrade stuff would make this possible?) -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
On Tue, 30 Sep 2008 13:48:52 -0700 "Jeffrey Baker" <jwbaker@gmail.com> wrote: > > Practically all of them. Here is a good paper on various checksums, > their failure rates, and practical applications. > > "Parity Lost and Parity Regained" > http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html > > -jwb In a related article published in Login called "Data Corruption in the storage stack: a closer look" [1] they say: During a 41-month period we observed more than 400,000 instances of checksum mismatches, 8% of which were discovered during RAID reconstruction, creating the possibility of real data loss. They also have a wonderful term they mention, "Silent Data corruptions". Joshua D. Drake [1] Login June 2008 -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/
On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
> This has been suggested before, and the usual objection is
> precisely that it only protects from errors in the storage layer,
> giving a false sense of security.

If you can come up with a mechanism for detecting non-storage errors as well, I'm all ears. :)

In the meantime, you're way, way more likely to experience corruption at the storage layer than anywhere else. We've had several corruption events, only one of which was memory related... and we *know* it was memory related because we actually got logs saying so. But with a SAN environment there's a lot of moving parts, all waiting to screw up your data:

filesystem
SAN device driver
SAN network
SAN BIOS
drive BIOS
drive

That's on top of the things that could hose your data outside of storage:

kernel
CPU
memory
motherboard

> Don't some filesystems include a per-block CRC, which would
> achieve the same thing? ZFS?

Sure, some do. We're on linux and can't run ZFS. And I'll argue that no linux FS is anywhere near as tested as ext3 is, which means that going to some other FS that offers you CRC means you're now exposing yourself to the possibility of issues with the FS itself. Not to mention that changing filesystems on a large production system is very painful.

-- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
> > I believe the idea was to make this as non-invasive as possible. And > it would be really nice if this could be enabled without a dump/ > reload (maybe the upgrade stuff would make this possible?) > -- It's all about the probability of a duplicate check being generated. If you use a 32 bit checksum, then you have a theoretical probability of 1 in 4 billion that a corrupt block will be missed (probably much lower depending on your algorithm). If you use a short, then a 1 in 65 thousand probability. If you use an 8 bit number, then 1 in 256. Why am I going on? Well, if there are any spare bits in a block header, they could be used for the check value.
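As a quick back-of-the-envelope check of those odds, here is a tiny, purely illustrative program (compile with -lm) that prints the chance of a corrupt block slipping past an n-bit check, assuming the check behaves like a uniform hash of the block's contents:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int widths[] = {8, 16, 32, 64};

    /* an ideal n-bit check misses a random corruption with probability 2^-n */
    for (int i = 0; i < 4; i++)
        printf("%2d-bit check: 1 in %.0f corruptions missed (p = %.2e)\n",
               widths[i], pow(2.0, widths[i]), pow(2.0, -widths[i]));
    return 0;
}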
On 30 Sep 2008, at 10:17 PM, Decibel! <decibel@decibel.org> wrote: > On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote: >> This has been suggested before, and the usual objection is >> precisely that it only protects from errors in the storage layer, >> giving a false sense of security. > > If you can come up with a mechanism for detecting non-storage errors > as well, I'm all ears. :) > > In the meantime, you're way, way more likely to experience > corruption at the storage layer than anywhere else. Fwiw this hasn't been my experience. Bad memory is extremely common and even the storage failures I've seen (excluding the drive crashes) turned out to actually be caused by bad memory. That said I've always been interested in doing this. The main use case in my mind has actually been for data that's been restored from old backups which have been lying around and floating between machines for a while with many opportunities for bit errors to show up. The main stumbling block I ran into was how to deal with turning the option off and on. I wanted it to be possible to turn off the option to have the database ignore any errors and to avoid the overhead. But that means including an escape hatch value which is always considered to be correct. But that dramatically reduces the effectiveness of the scheme. Another issue is it will make the space available on each page smaller, making it harder to do in-place upgrades. If you can deal with those issues and carefully deal with the contingencies so it's clear to people what to do when errors occur or they want to turn the feature on or off then I'm all for it. That's despite my experience of memory errors being a lot more common than undetected storage errors.
Joshua Drake wrote: > During a 41-month period we observed more than 400,000 instances of > checksum mismatches, 8% of which were discovered during RAID > reconstruction, creating the possibility of real data loss. > > They also have a wonderful term they mention, "Silent Data corruptions". Exactly! From my experience, the only assumption to be made about storage is that it can and will fail ... frequently! It is unreliable (not to mention slooow) and should not be trusted; regardless of the price tag or brand. This could help detect:

- fs corruption
- vfs bug
- raid controller firmware bug
- bad disk sector
- power crash
- weird martian-like raid rebuilds

Although, this idea won't prevent anything. Everything would still sinisterly fail on you. The difference is, no more silence. -- Andrew Chernow eSilo, LLC every bit counts http://www.esilo.com/
If you are concerned with data integrity (not caused by bugs in the code itself), you may be interested in utilizing ZFS; however, be aware that I found and reported a bug in their implementation of the Fletcher checksum algorithm they use by default to attempt to verify the integrity of the data stored in their file system, and be further aware that checksums/CRCs do not enable the correction of errors in general, therefore be prepared to make the decision of "what should be done in the event of a failure"; ZFS effectively locks up in certain circumstances rather than risk silently using suspect data with some form of persistent indication that the result may be corrupted. (Strong CRC's and FEC's are relatively inexpensive to compute.) So in summary, my two cents: a properly implemented 32/64 bit Fletcher checksum is likely adequate to detect most errors (and correct them if presumed to be a result of a single flipped bit within 128KB or so, as such a Fletcher checksum has a Hamming distance of 3 within blocks of this size, albeit fairly expensive to do so by trial and error); further presuming that this can not be relied upon, a strategy potentially utilizing the suspect data as if it were good likely needs to be adopted, accompanied somehow with a persistent indication that the query results (or specific sub-results) are themselves suspect, as it may often be a lesser evil than the alternative (but not always). Or use a file system like ZFS, and let it do its thing, and hope for the best.
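For reference, a self-contained sketch of the textbook Fletcher-32 (one's-complement sums of 16-bit words) follows; note that this is the classic variant, not ZFS's fletcher2/fletcher4, the 8 kB page size is just an assumption for the example, and the result depends on the host's byte order, which is fine for a per-host block check.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

static uint32_t fletcher32(const uint16_t *data, size_t words)
{
    uint32_t sum1 = 0xffff, sum2 = 0xffff;

    while (words)
    {
        size_t tlen = words > 359 ? 359 : words;   /* keep 32-bit sums from overflowing */

        words -= tlen;
        do
        {
            sum1 += *data++;
            sum2 += sum1;
        } while (--tlen);
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
    }
    /* second reduction so both sums fit in 16 bits */
    sum1 = (sum1 & 0xffff) + (sum1 >> 16);
    sum2 = (sum2 & 0xffff) + (sum2 >> 16);
    return (sum2 << 16) | sum1;
}

int main(void)
{
    static uint16_t page[BLCKSZ / 2];

    memset(page, 0xA5, sizeof(page));
    printf("fletcher32 = %08x\n", fletcher32(page, BLCKSZ / 2));
    return 0;
}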
Paul Schlie wrote: > this can not be relied upon, a strategy potentially utilizing the suspect > data as if it were good likely needs to be adopted, accompanied somehow with > a persistent indication that the query results (or specific sub-results) are > themselves suspect, as it may often be a lesser evil than the alternative > (but not always). Or use a file system like ZFS, and let it do its thing, > and hope for the best. Problem is, most people are running PostgreSQL on one of two operating systems. Linux or Win32. (/me has been amazed at how many people are running win32) ZFS is not an option; generally speaking. Joshua D. Drake
> Joshua D. Drake wrote:
> ...
> ZFS is not an option; generally speaking.

Then in general, if the corruption occurred within the:

- read path, try again and hope it takes care of itself.
- write path, the best that can be hoped for is a single bit error within the data itself, which can be both detected and corrected with a sufficiently strong check sum; or worst case, if address or control information was corrupted, god knows what happened to the data, and what other data may have been destroyed by having the data written to the wrong blocks, and it is typically unrecoverable.
- drive itself, this is most typically very unlikely, as strong FEC codes typically prevent the misidentification of unrecoverable data as being otherwise.

The simplest thing to do would seem to be: upon reading blocks, check the check sum; if bad, try the read again; if that doesn't fix the problem, assume a single bit error, and iteratively flip single bits until the check sum matches (hopefully not making the problem worse as may be the case if many bits were actually already in error) and write the data back, and proceed as normal, possibly logging the action; otherwise presume the data is unrecoverable and in error, somehow mark it as being so such that subsequent queries which may utilize any portion of it knows it may be corrupt (which I suspect may be best done not on file-system blocks, but actually on a logical rows or even individual entries if very large, as my best initial guess, and likely to measurably affect performance when enabled, and haven't a clue how resulting query should/could be identified as being potentially corrupt without confusing the client which requested it).
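A minimal sketch of that brute-force repair loop, using zlib's crc32() and assuming an 8 kB page; a real system would only attempt this after re-reading the block still failed, and would give up if no single-bit flip explains the mismatch.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

/* Flip each bit in turn and see whether the stored CRC matches again.
 * Returns the repaired bit index, or -1 if no single-bit flip explains
 * the mismatch (treat the page as unrecoverable in that case). */
static long try_single_bit_repair(unsigned char *page, uint32_t stored_crc)
{
    for (size_t byte = 0; byte < BLCKSZ; byte++)
        for (int bit = 0; bit < 8; bit++)
        {
            page[byte] ^= (unsigned char) (1u << bit);          /* flip */
            if ((uint32_t) crc32(0L, page, BLCKSZ) == stored_crc)
                return (long) (byte * 8 + bit);                 /* repaired */
            page[byte] ^= (unsigned char) (1u << bit);          /* undo */
        }
    return -1;
}

int main(void)
{
    static unsigned char page[BLCKSZ];
    uint32_t stored;

    memset(page, 0x3C, sizeof(page));
    stored = (uint32_t) crc32(0L, page, BLCKSZ);   /* CRC at write time */
    page[4242] ^= 0x10;                            /* simulate a single bit flip */
    printf("repaired bit: %ld\n", try_single_bit_repair(page, stored));
    return 0;
}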
Jonah H. Harris wrote: >>>> I'd like to submit this for 8.4, but I want to ensure that -hackers >>>> at large approve of this feature before starting serious coding. >>> >>> IMHO, this is a functionality that should be enabled by default (as it >>> is on most other RDBMS). It would've prevented severe corruption in >> >> What other RDBMS have it enabled by default? > > Oracle and (I believe) SQL Server >= 2005 References: http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14237/initparams040.htm#CHDDCEIC http://download.oracle.com/docs/cd/B28359_01/server.111/b28320/initparams046.htm#sthref130 Oracle claims that it introduces only 1% to 2% overhead. Actually I suggested the same feature for 8.3: http://archives.postgresql.org/pgsql-hackers/2007-08/msg01003.php It was eventually rejected because the majority felt that it would be a small benefit (only detects disk corruption and not software bugs) that would not justify the overhead and the additional code. Incidentally, Oracle also has a parameter DB_BLOCK_CHECKING that checks blocks for logical consistency. This is OFF by default, but Oracle recommends that you activate it if you can live with the performance impact. Yours, Laurenz Albe
Alvaro Herrera wrote: > This code would be run-time or compile-time configurable. I'm not > absolutely sure which yet; the problem with run-time is what to do if > the user restarts the server with the setting flipped. It would have > almost no impact on users who don't enable it. I prefer runtime configuration. Solaris has two filesystems, UFS and ZFS. ZFS offers strong error detection, and another CRC would be overhead. You need a mechanism to enable this feature on UFS and disable it on ZFS. I suggest having it as a tablespace property; for the default tablespace you could set it up in the initdb phase. Zdenek
CRC-checks will help to detect corrupt data. My question: WHAT should happen when corrupted data is detected?

a) PostgreSQL can end with some panic code
b) a log entry can be written, at some rather high level

a) has the benefit that it surely will be noticed. Which is a real benefit, as I suppose that many users of PostgreSQL on the low end do not have dedicated DBA-Admins who read the *.log every day. b) has the benefit that work goes on. BUT: What can somebody do AFTER PostgreSQL has detected "data is corrupted, CRC error in block xxxx"? My first thought of action would be: get new drive, pg_dump_all to a safe place, install new drive, pg_restore. BUT: how will pg_dump_all be enabled? As PostgreSQL must be forced to accept the corrupted data to do a pg_dump... Next step: for huge databases it is tempting to not pg_dump, but filecopy; with shut down database or pg_backup() etc. What way can that be supported while data is corrupted? Please do not misunderstand my question ... I really look forward to getting this kind of information; especially since corrupted data on hard drives is an early sign of BIG trouble to come. I am just asking to start thinking about "what to do after corruption has been detected". Harald -- GHUM Harald Massa persuadere et programmare Harald Armin Massa Spielberger Straße 49 70435 Stuttgart 0173/9409607 no fx, no carrier pigeon - EuroPython 2009 will take place in Birmingham - Stay tuned!
On Tue, 2008-09-30 at 17:13 -0400, pgsql@mohawksoft.com wrote: > > > > I believe the idea was to make this as non-invasive as possible. And > > it would be really nice if this could be enabled without a dump/ > > reload (maybe the upgrade stuff would make this possible?) > > -- > > It's all about the probability of a duplicate check being generated. If > you use a 32 bit checksum, then you have a theoretical probability of 1 in > 4 billion that a corrupt block will be missed (probably much lower > depending on your algorithm). If you use a short, then a 1 in 65 thousand > probability. If you use an 8 bit number, then 1 in 256. > > Why am I going on? Well, if there are any spare bits in a block header, > they could be used for the check value. Even a 64-bit integer is just 0.1% of an 8k page, and it is even less than 0.1% likely that a page will be 100% full and thus that those 64 bits waste any real space at all. So I don't think that this is a space issue. --------------- Hannu
Hannu Krosing <hannu@2ndQuadrant.com> writes: > So I don't think that this is a space issue. No, it's all about time penalties and loss of concurrency. regards, tom lane
"Harald Armin Massa" <haraldarminmassa@gmail.com> writes: > WHAT should happen when corrupted data is detected? Same thing that happens now, ie, query fails with an error. This would just be an extension of the existing validity checks done at page read time. regards, tom lane
On Wed, Oct 1, 2008 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Harald Armin Massa" <haraldarminmassa@gmail.com> writes: >> WHAT should happen when corrupted data is detected? > > Same thing that happens now, ie, query fails with an error. This would > just be an extension of the existing validity checks done at page read > time. Agreed. -- Jonah H. Harris, Senior DBA myYearbook.com
> On Tue, 2008-09-30 at 17:13 -0400, pgsql@mohawksoft.com wrote: >> > >> > I believe the idea was to make this as non-invasive as possible. And >> > it would be really nice if this could be enabled without a dump/ >> > reload (maybe the upgrade stuff would make this possible?) >> > -- >> >> It's all about the probability of a duplicate check being generated. If >> you use a 32 bit checksum, then you have a theoretical probability of 1 >> in >> 4 billion that a corrupt block will be missed (probably much lower >> depending on your algorithm). If you use a short, then a 1 in 65 >> thousand >> probability. If you use an 8 bit number, then 1 in 256. >> >> Why am I going on? Well, if there are any spare bits in a block header, >> they could be used for the check value. > > Even a 64-bit integer is just 0.1% of an 8k page, and it is even > less than 0.1% likely that a page will be 100% full and thus that > those 64 bits waste any real space at all. > > So I don't think that this is a space issue. > Oh, I don't think it is a space issue either; the question was whether there could be a way to do it without a dump and restore. The only way that occurs to me is if there are some unused bits in the block header. I haven't looked at that code in years. The numerics of it are just a description of the probability of a duplicate sum or crc, meaning a false OK. Also, regardless of whether or not the block is full, the block is read and written as a block, and the underlying data is unimportant.
Paul Schlie wrote: > > ... if that doesn't fix > the problem, assume a single bit error, and iteratively flip > single bits until the check sum matches ... This can actually be done much faster, if you're doing a CRC checksum (aka modulo over GF(2^n)). Basically, an error flipping bit n will always create the same xor between the computed CRC and the stored CRC. So you can just store a table- for all n, an error in bit n will create an xor of this value, sort the table in order of xor values, and then you can do a binary search on the table, and get exactly what bit was wrong. This is actually probably fairly safe- for an 8K page, there are only 65536 possible bit positions. Assuming a 32-bit CRC, that means that larger corrupts are much more likely to hit one of the other 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact. Brian > (hopefully not making the > problem worse as may be the case if many bits were actually already > in error) and write the data back, and proceed as normal, possibly > logging the action; otherwise presume the data is unrecoverable and > in error, somehow mark it as being so such that subsequent queries > which may utilize any portion of it knows it may be corrupt (which > I suspect may be best done not on file-system blocks, but actually > on a logical rows or even individual entries if very large, as my > best initial guess, and likely to measurably affect performance > when enabled, and haven't a clue how resulting query should/could > be identified as being potentially corrupt without confusing the > client which requested it). > > > >
> Hannu Krosing <hannu@2ndQuadrant.com> writes: >> So I don't think that this is a space issue. > > No, it's all about time penalties and loss of concurrency. I don't think that the amount of time it would take to calculate and test the sum is even important. It may be in older CPUs, but these days CPUs are so fast in RAM and a block is very small. On x86 systems, depending on page alignment, we are talking about two or three pages that will be "in memory" (They were used to read the block from disk or previously accessed).
Brian Hurt wrote: > Paul Schlie wrote: >> >> ... if that doesn't fix >> the problem, assume a single bit error, and iteratively flip >> single bits until the check sum matches ... > This can actually be done much faster, if you're doing a CRC checksum > (aka modulo over GF(2^n)). Basically, an error flipping bit n will > always create the same xor between the computed CRC and the stored > CRC. So you can just store a table- for all n, an error in bit n will > create an xor of this value, sort the table in order of xor values, > and then you can do a binary search on the table, and get exactly what > bit was wrong. > > This is actually probably fairly safe- for an 8K page, there are only > 65536 possible bit positions. Assuming a 32-bit CRC, that means that > larger corrupts are much more likely to hit one of the other > 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact. Actually, I think I'm going to take this back. Thinking about it, the table is going to be large-ish (~512K) and it assumes a fixed 8K page size. I think a better solution would be a tight loop, something like:

r = 1u;
for (i = 0; i < max_bits_per_page; ++i) {
    if (r == xor_difference) {
        return i;
    } else if ((r & 1u) == 1u) {
        r = (r >> 1) ^ CRC_POLY;
    } else {
        r >>= 1;
    }
}

Brian
pgsql@mohawksoft.com writes: >> No, it's all about time penalties and loss of concurrency. > I don't think that the amount of time it would take to calculate and test > the sum is even important. It may be in older CPUs, but these days CPUs > are so fast in RAM and a block is very small. On x86 systems, depending on > page alignment, we are talking about two or three pages that will be "in > memory" (They were used to read the block from disk or previously > accessed). Your optimism is showing ;-). XLogInsert routinely shows up as a major CPU hog in any update-intensive test, and AFAICT that's mostly from the CRC calculation for WAL records. We could possibly use something cheaper than a real CRC, though. A word-wide XOR (ie, effectively a parity calculation) would be sufficient to detect most problems. regards, tom lane
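For comparison, the cheap alternative Tom mentions amounts to something like the following (a one-pass, word-wide XOR of the page; it catches any single flipped bit and many gross failures, but misses patterns a real CRC would catch). The 8 kB page size is an assumption for the example, and memcpy is used for the loads so no alignment is assumed.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* one XOR per 8 bytes over the whole page */
static uint64_t page_parity(const unsigned char *page)
{
    uint64_t parity = 0;
    uint64_t word;

    for (size_t off = 0; off < BLCKSZ; off += sizeof(word))
    {
        memcpy(&word, page + off, sizeof(word));   /* alignment-safe load */
        parity ^= word;
    }
    return parity;
}

int main(void)
{
    static unsigned char page[BLCKSZ];

    memset(page, 0x42, sizeof(page));
    printf("parity = %016llx\n", (unsigned long long) page_parity(page));
    return 0;
}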
Brian Hurt wrote: > Paul Schlie wrote: >> >> ... if that doesn't fix >> the problem, assume a single bit error, and iteratively flip >> single bits until the check sum matches ... > This can actually be done much faster, if you're doing a CRC checksum > (aka modulo over GF(2^n)). Basically, an error flipping bit n will > always create the same xor between the computed CRC and the stored CRC. > So you can just store a table- for all n, an error in bit n will create > an xor of this value, sort the table in order of xor values, and then > you can do a binary search on the table, and get exactly what bit was > wrong. > > This is actually probably fairly safe- for an 8K page, there are only > 65536 possible bit positions. Assuming a 32-bit CRC, that means that > larger corrupts are much more likely to hit one of the other > 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact. - yes, if you're willing to compute true CRC's as opposed to simpler checksums, which may be worth the price if in fact many/most data check failures are truly caused by single bit errors somewhere in the chain, which I'd guess may typically be the case in the absence of an error manifesting itself in the data's storage control structure which would likely result in the catastrophic loss of the data itself along with potential corruption of other previously stored data if occurring in the write chain if not otherwise detected. (and all presuming the hopeful presence of reasonably healthy ECC main memory and processor, as otherwise data corruption may easily go unnoticed between its check/use/storage obviously). - however if such a storage block integrity check/correction mechanism is to be developed, I can't help but wonder if such a facility is truly best if architected as a shell around the file system calls in general (thereby usable by arbitrary programs, as opposed to being specific to any single one, thereby arguably predominantly outside of the scope of the db itself)? >> (hopefully not making the >> problem worse as may be the case if many bits were actually already >> in error) and write the data back, and proceed as normal, possibly >> logging the action; otherwise presume the data is unrecoverable and >> in error, somehow mark it as being so such that subsequent queries >> which may utilize any portion of it knows it may be corrupt (which >> I suspect may be best done not on file-system blocks, but actually >> on a logical rows or even individual entries if very large, as my >> best initial guess, and likely to measurably affect performance >> when enabled, and haven't a clue how resulting query should/could >> be identified as being potentially corrupt without confusing the >> client which requested it). >>
Brian Hurt wrote: > Brian Hurt wrote: >> Paul Schlie wrote: >>> >>> ... if that doesn't fix >>> the problem, assume a single bit error, and iteratively flip >>> single bits until the check sum matches ... >> This can actually be done much faster, if you're doing a CRC checksum >> (aka modulo over GF(2^n)). Basically, an error flipping bit n will >> always create the same xor between the computed CRC and the stored >> CRC. So you can just store a table- for all n, an error in bit n will >> create an xor of this value, sort the table in order of xor values, >> and then you can do a binary search on the table, and get exactly what >> bit was wrong. >> >> This is actually probably fairly safe- for an 8K page, there are only >> 65536 possible bit positions. Assuming a 32-bit CRC, that means that >> larger corrupts are much more likely to hit one of the other >> 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact. >> > > Actually, I think I'm going to take this back. Thinking about it, the > table is going to be large-ish (~512K) and it assumes a fixed 8K page > size. I think a better solution would be a tight loop, something like: > r = 1u; > for (i = 0; i < max_bits_per_page; ++i) { > if (r == xor_difference) { > return i; > } else if ((r & 1u) == 1u) { > r = (r >> 1) ^ CRC_POLY; > } else { > r >>= 1; > } > } - or used a hash table indexed by the xor diff between the computed and stored CRC (stored separately and CRC'd itself and possibly stored redundantly) which I thought was what you meant; either way correction performance isn't likely that important as its use should hopefully be rare.
On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't think that the amount of time it would take to calculate and test >> the sum is even important. It may be in older CPUs, but these days CPUs >> are so fast in RAM and a block is very small. On x86 systems, depending on >> page alignment, we are talking about two or three pages that will be "in >> memory" (They were used to read the block from disk or previously >> accessed). > > Your optimism is showing ;-). XLogInsert routinely shows up as a major > CPU hog in any update-intensive test, and AFAICT that's mostly from the > CRC calculation for WAL records. I probably wouldn't compare checksumming *every* WAL record to a single block-level checksum. -- Jonah H. Harris, Senior DBA myYearbook.com
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > A buffer's io_in_progress lock protects the buffer's CRC. > > Unfortunately, it doesn't. See hint bits. Hmm, so it seems we need to keep hold of the buffer header's spinlock while calculating the checksum, just after resetting BM_JUST_DIRTIED. Yuck. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> Unfortunately, it doesn't. See hint bits. > Hmm, so it seems we need to keep hold of the buffer header's spinlock while > calculating the checksum, just after resetting BM_JUST_DIRTIED. Yuck. No, holding a spinlock that long is entirely unacceptable, and it's the wrong thing anyway, because we don't hold the header lock while manipulating hint bits. What this would *actually* mean is that we'd need to hold exclusive not shared buffer lock on a buffer we are about to write, and that would have to be maintained while computing the checksum and until the write is completed. The JUST_DIRTIED business could go away, in fact. (Thinks for a bit...) I wonder if that could induce any deadlock problems? The concurrency hit might be the least of our worries. regards, tom lane
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Your optimism is showing ;-). XLogInsert routinely shows up as a major >> CPU hog in any update-intensive test, and AFAICT that's mostly from the >> CRC calculation for WAL records. > I probably wouldn't compare checksumming *every* WAL record to a > single block-level checksum. No, not at all. Block-level checksums would be an order of magnitude more expensive: they're on bigger chunks of data and they'd be done more often. regards, tom lane
On Wed, Oct 1, 2008 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I probably wouldn't compare checksumming *every* WAL record to a >> single block-level checksum. > > No, not at all. Block-level checksums would be an order of magnitude > more expensive: they're on bigger chunks of data and they'd be done more > often. That's debatable and would be dependent on cache and the workload. In our case however, because shared buffers doesn't scale, we would end up doing a lot more block-level checksums than the other vendors just pushing the block to/from the OS cache. -- Jonah H. Harris, Senior DBA myYearbook.com
Paul Schlie <schlie@comcast.net> writes: > - yes, if you're willing to compute true CRC's as opposed to simpler > checksums, which may be worth the price if in fact many/most data > check failures are truly caused by single bit errors somewhere in the > chain, FWIW, not one of the corrupted-data problems I've investigated has ever looked like a single-bit error. So the theoretical basis for using a CRC here seems pretty weak. I doubt we'd even consider automatic repair attempts anyway. regards, tom lane
Tom Lane wrote: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: > > I probably wouldn't compare checksumming *every* WAL record to a > > single block-level checksum. > > No, not at all. Block-level checksums would be an order of magnitude > more expensive: they're on bigger chunks of data and they'd be done more > often. More often? My intention is that they are checked when the buffer is read in, and calculated/stored when the buffer is written out. In-memory changers of the block do not check nor recalculate the sum. Is this not OK? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: >> On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Your optimism is showing ;-). XLogInsert routinely shows up as a major >>> CPU hog in any update-intensive test, and AFAICT that's mostly from the >>> CRC calculation for WAL records. > >> I probably wouldn't compare checksumming *every* WAL record to a >> single block-level checksum. > > No, not at all. Block-level checksums would be an order of magnitude > more expensive: they're on bigger chunks of data and they'd be done more > often. Yeah, it's not a single block, it's the total amount that matters and that's going to amount to the entire i/o bandwidth of the database. That said I think the reason WAL checksums are so expensive is the startup and finishing cost. I wonder if we could do something clever here though. Only one process is busy calculating the checksum -- it just has to know if anyone fiddles the hint bits while it's busy. If setting a hint bit cleared a flag on the buffer header then the checksumming process could set that flag, begin checksumming, and check that the flag is still set when he's finished. Actually I suppose that wouldn't actually be good enough. He would have to do the i/o and check that the checksum was still valid after the i/o. If not then he would have to recalculate the checksum and repeat the i/o. That might make the idea a loser since I think the only way it wins is if you rarely actually get someone setting the hint bits during i/o anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
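A purely conceptual sketch of that retry scheme, with made-up types and a toy checksum in place of the real buffer-manager machinery: whoever sets a hint bit would also set hint_bits_touched, and the writer redoes the checksum-and-write if it finds the flag set afterwards.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

typedef struct
{
    unsigned char  page[BLCKSZ];        /* the block itself */
    volatile bool  hint_bits_touched;   /* set by whoever fiddles hint bits */
} FakeBufferDesc;

/* stand-in checksum (a rotating XOR), just to make the sketch complete */
static uint64_t compute_checksum(const unsigned char *page)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < BLCKSZ; i++)
        sum = (sum << 1 | sum >> 63) ^ page[i];
    return sum;
}

/* stand-in for "store the checksum and issue the write()" */
static void store_checksum_and_write(FakeBufferDesc *buf, uint64_t sum)
{
    (void) buf;
    (void) sum;                          /* a real version would do the I/O here */
}

/* Flush with retry: if someone set a hint bit while we were busy,
 * recompute the checksum and repeat the write. */
static void flush_buffer_with_retry(FakeBufferDesc *buf)
{
    do
    {
        buf->hint_bits_touched = false;                 /* clear the flag first */
        store_checksum_and_write(buf, compute_checksum(buf->page));
    } while (buf->hint_bits_touched);                   /* raced: redo the work */
}

int main(void)
{
    static FakeBufferDesc buf;

    flush_buffer_with_retry(&buf);
    printf("flushed\n");
    return 0;
}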
On Wed, Oct 1, 2008 at 11:57 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Tom Lane escribió: >> No, not at all. Block-level checksums would be an order of magnitude >> more expensive: they're on bigger chunks of data and they'd be done more >> often. > > More often? My intention is that they are checked when the buffer is > read in, and calculated/stored when the buffer is written out. > In-memory changers of the block do not check nor recalculate the sum. > > Is this not OK? That is the way it should work, only on read-in/write-out. -- Jonah H. Harris, Senior DBA myYearbook.com
> Jonah H. Harris wrote: >> Tom Lane wrote: >> "Harald Armin Massa" writes: >>> WHAT should happen when corrupted data is detected? >> >> Same thing that happens now, ie, query fails with an error. This would >> just be an extension of the existing validity checks done at page read >> time. > > Agreed. - however it must be understood that this may prevent large portions of the database (not otherwise corrupted) from being legitimately queried, which may in practice simply not be acceptable; unless I misunderstand its significance. (so wonder if it may be necessary to identify only the logical data within the corrupted block as being in potential error, and attempt to limit its effect as much as possible?) - if ZFS's effective behavior is any indication, although folks strongly favor strong data integrity, many aren't ready to accept the consequences of only being able to access known good data, especially if it inhibits access to remaining most likely good data (and lacking any reasonable mechanism to otherwise recover access to it). Sometimes ignorance is bliss.
> pgsql@mohawksoft.com writes: >>> No, it's all about time penalties and loss of concurrency. > >> I don't think that the amount of time it would take to calculate and >> test >> the sum is even important. It may be in older CPUs, but these days CPUs >> are so fast in RAM and a block is very small. On x86 systems, depending >> on >> page alignment, we are talking about two or three pages that will be "in >> memory" (They were used to read the block from disk or previously >> accessed). > > Your optimism is showing ;-). XLogInsert routinely shows up as a major > CPU hog in any update-intensive test, and AFAICT that's mostly from the > CRC calculation for WAL records. > > We could possibly use something cheaper than a real CRC, though. A > word-wide XOR (ie, effectively a parity calculation) would be sufficient > to detect most problems. That was something that I mentioned in my first response. if the *only* purpose of the check is to generate a "pass" or "fail" status, and not something to be used to find where in the block it is corrupted or attempt to regenerate the data, then we could certainly optimize the check algorithm. A simple checksum may be good enough.
* Heikki Linnakangas: > Currently, hint bit updates are not WAL-logged, and thus no full page > write is done when only hint bits are changed. Imagine what happens if > hint bits are updated on a page, but there's no other changes, and we > crash so that only one half of the new page version makes it to disk > (= torn page). The CRC would not match, even though the page is > actually valid. The non-logged hint bit writes are somewhat dangerous anyway. Maybe it's time to get rid of this peculiarity, despite the performance impact? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> No, not at all. Block-level checksums would be an order of magnitude >> more expensive: they're on bigger chunks of data and they'd be done more >> often. > More often? My intention is that they are checked when the buffer is > read in, and calculated/stored when the buffer is written out. Right. My point is that the volume of data involved is more than the WAL traffic. Example: you update one tuple on a page, your WAL record is that tuple, but you had to checksum the whole page when you read it in and you'll have to do it again when you write it out. Not to mention that in a read-mostly query mix there's not much WAL traffic at all, but plenty of page reading (and maybe some writing too, if hint bit updates happen). "Order of magnitude" might be an overstatement, but I don't believe for a moment that the cost will be negligible. That's why I'm thinking about something cheaper than a full-blown CRC calculation. regards, tom lane
On Wed, 2008-10-01 at 16:57 +0100, Gregory Stark wrote: > I wonder if we could do something clever here though. Only one process > is busy > calculating the checksum -- it just has to know if anyone fiddles the hint > bits while it's busy. What if the hint bits are added at the very end to the checksum, with an exclusive lock to them ? Then the exclusive lock should be short enough... only it might be deadlock-prone as any lock upgrade... Cheers, Csaba.
* Tom Lane: > No, not at all. Block-level checksums would be an order of magnitude > more expensive: they're on bigger chunks of data and they'd be done more > often. For larger blocks, checksumming can be parallelized at the instruction level, especially if the block size is statically known. And for large blocks, Adler32 isn't that bad compared to CRC32 from a error detection POV, so maybe you could use that. I've seen faults which were uncovered by page-level checksumming, so I'd be willing to pay the performance cost. 8-/ -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
* Gregory Stark <stark@enterprisedb.com> [081001 11:59]: > If setting a hint bit cleared a flag on the buffer header then the > checksumming process could set that flag, begin checksumming, and check that > the flag is still set when he's finished. > > Actually I suppose that wouldn't actually be good enough. He would have to do > the i/o and check that the checksum was still valid after the i/o. If not then > he would have to recalculate the checksum and repeat the i/o. That might make > the idea a loser since I think the only way it wins is if you rarely actually > get someone setting the hint bits during i/o anyways. A doubled-write is essentially "free" with PostgreSQL because it's not doing direct IO, rather relying on the OS page cache to be efficient. So if you write block A and then call write on block A immediately (or, realistically, after a redo of the checksum), the first write is almost *never* going to take IO bandwidth to your spindles... But the problem is if something crashes (or interrupts PG) between those two writes, you've got a block of data in the pagecache (and possibly on the disks) that PG will no longer read in, because the CRC/checksum fails despite the actual content being valid... So if you're going to be making PG refuse to read in blocks with bad CRC/csum, you need to guarantee that nothing fiddles with the block between the start of the CRC and the completion of the write(). One possibility would be to "double-buffer" the write... i.e. as you calculate your CRC, you're doing it on a local copy of the block, which you hand to the OS to write... If you're touching the whole block of memory to CRC it, it isn't *ridiculously* more expensive to copy the memory somewhere else as you do it... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
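The double-buffering idea boils down to something like the following self-contained sketch (compile with -lz). Where the checksum ends up is left as a comment, because that depends on whether the CRC fork or a page-header field wins; the temp-file main() is just scaffolding so the sketch runs.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <zlib.h>

#define BLCKSZ 8192

/* Copy the shared block, checksum the copy, write the copy: nothing can
 * change between the crc32() call and the write hitting the page cache. */
static int write_block_copy(int fd, off_t offset, const unsigned char *shared_page)
{
    unsigned char copy[BLCKSZ];
    uint32_t      crc;

    memcpy(copy, shared_page, BLCKSZ);          /* snapshot the block */
    crc = (uint32_t) crc32(0L, copy, BLCKSZ);   /* checksum the snapshot */
    (void) crc;   /* ... store the CRC wherever the design puts it ... */

    return pwrite(fd, copy, BLCKSZ, offset) == BLCKSZ ? 0 : -1;
}

int main(void)
{
    static unsigned char shared_page[BLCKSZ];
    char path[] = "/tmp/crc_copy_demo_XXXXXX";
    int  fd = mkstemp(path);

    if (fd < 0)
        return 1;
    memset(shared_page, 0x11, sizeof(shared_page));
    printf("write_block_copy: %d\n", write_block_copy(fd, 0, shared_page));
    close(fd);
    unlink(path);
    return 0;
}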
Tom Lane wrote: > Paul Schlie writes: >> - yes, if you're willing to compute true CRC's as opposed to simpler >> checksums, which may be worth the price if in fact many/most data >> check failures are truly caused by single bit errors somewhere in the >> chain, > > FWIW, not one of the corrupted-data problems I've investigated has ever > looked like a single-bit error. So the theoretical basis for using a > CRC here seems pretty weak. I doubt we'd even consider automatic repair > attempts anyway. - Although I accept that you may be correct in your assessment that most errors are in fact multi-bit, I've never seen any hard data to corroborate either this or my suspicion that most errors are in fact single-bit in nature (if occurring within the read/processing/write paths from storage). I agree that errors occurring within an otherwise ECC'd memory subsystem would have to be multi-bit in nature; however, in systems which record very few corrected single-bit errors and little if any uncorrectable double-bit errors, it seems unlikely that multi-bit errors resulting from memory failure can account for the number of integrity check failures for data stored in file systems. So I strongly suspect that of the failures you've had occasion to investigate, they were predominantly so catastrophic that they were sufficiently obvious to catch your attention, with most of the more subtle integrity errors simply sneaking below the radar. (It seems clear that, statistically, hardware failure will most likely inject single-bit errors into data with greater frequency than multi-bit ones, and these will not be detected unless there is provision to at least detect them, if not correct them, at each communication boundary the data traverses.)
Aidan Van Dyk wrote: > One possibility would be to "double-buffer" the write... i.e. as you > calculate your CRC, you're doing it on a local copy of the block, which > you hand to the OS to write... If you're touching the whole block of > memory to CRC it, it isn't *ridiculously* more expensive to copy the > memory somewhere else as you do it... > Coming in to this late - so apologies if this makes no sense - but doesn't writev() provide the required capability? Also, what is the difference between the OS not writing the block at all, and writing the block but missing the checksum? This window seems to be small enough (most of the time being faster than the system can schedule the buffer to be dumped?) that the "additional risk" seems theoretical rather than real. Unless there is evidence that writev() performs poorly, I'd suggest that avoiding double-buffering by using writev() would be preferred. Cheers, mark -- Mark Mielke <mark@mielke.cc>
Tom Lane wrote: > Paul Schlie <schlie@comcast.net> writes: >> - yes, if you're willing to compute true CRC's as opposed to simpler >> checksums, which may be worth the price if in fact many/most data >> check failures are truly caused by single bit errors somewhere in the >> chain, > > FWIW, not one of the corrupted-data problems I've investigated has ever > looked like a single-bit error. So the theoretical basis for using a > CRC here seems pretty weak. I doubt we'd even consider automatic repair > attempts anyway. Single bit failures are probably the most common, but they are probably already handled by the hardware. I don't think I've ever seen a modern hard drive return a wrong bit - I get short reads first. By the time somebody notices a problem, it's probably more than a few bits that have accumulated. For example, if memory has a faulty cell in it, it will create a fault a percentage of the times it is accessed. One bit error easily turns into two, three, ... Then there is the fact that no hardware is perfect, and every single component in the computer has a chance, however small, of introducing bit errors... :-( Cheers, mark -- Mark Mielke <mark@mielke.cc>
Aidan Van Dyk <aidan@highrise.ca> writes: > * Gregory Stark <stark@enterprisedb.com> [081001 11:59]: > >> If setting a hint bit cleared a flag on the buffer header then the >> checksumming process could set that flag, begin checksumming, and check that >> the flag is still set when he's finished. >> >> Actually I suppose that wouldn't actually be good enough. He would have to do >> the i/o and check that the checksum was still valid after the i/o. If not then >> he would have to recalculate the checksum and repeat the i/o. That might make >> the idea a loser since I think the only way it wins is if you rarely actually >> get someone setting the hint bits during i/o anyways. > > A doubled-write is essentially "free" with PostgreSQL because it's not > doing direct IO, rather relying on the OS page cache to be efficient. All things are relative. What we're talking about here is all cpu and memory-bandwidth costs anyways so, yes, it'll be cheap compared to the disk i/o but it'll still represent doubling the memory bandwidth and cpu cost of these routines. That said you would only have to do it in cases where the hint bits actually get twiddled. That might not actually happen often. > But the problem is if something crashes (or interrupts PG) between those > two writes, you've got a block of data into the pagecache (and possibly > to the disks) that PG will no longer read in, because the CRC/checksum > fails despite the actual content being valid... I don't think this is a problem because we're still doing WAL logging. The i/o isn't allowed to happen until the page has been WAL logged and fsynced anyways. Incidentally I think the JUST_DIRTIED bit might actually be sufficient here. Hint bits already cause the buffer to be marked dirty. So the only case I see a real problem for is when we're writing a block as part of a checkpoint and find it's JUST_DIRTIED after writing it. In that case we would have to start over and write it again rather than leave it marked dirty. If we're writing the block as part of normal i/o then we could just decide to leave the possibly-bogus checksum in the table since it'll be overwritten by a full page write anyways. It'll be overwritten in normal use when the newly dirty buffer is eventually written out again. If you're not doing full page writes then you would have to restore from backup in cases where previously the page might actually have been valid though. That's kind of unfortunate. In theory it hasn't actually changed any of the risks of running without full page writes, but it has certainly increased the likelihood of actually having to deal with "corruption" in the form of a gratuitously invalid checksum. (Of course without checksums you don't ever actually know if you have corruption -- and real corruption). > One possibility would be to "double-buffer" the write... i.e. as you > calculate your CRC, you're doing it on a local copy of the block, which > you hand to the OS to write... If you're touching the whole block of > memory to CRC it, it isn't *ridiculously* more expensive to copy the > memory somewhere else as you do it... Hm. Well that might actually work. You can do the CRC at the same time as copying to the buffer, effectively doing it for the same cost as the CRC alone. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Paul Schlie <schlie@comcast.net> writes: > Tom Lane wrote: >> Paul Schlie writes: >>> - yes, if you're willing to compute true CRC's as opposed to simpler >>> checksums, which may be worth the price if in fact many/most data >>> check failures are truly caused by single bit errors somewhere in the >>> chain, >> >> FWIW, not one of the corrupted-data problems I've investigated has ever >> looked like a single-bit error. So the theoretical basis for using a >> CRC here seems pretty weak. I doubt we'd even consider automatic repair >> attempts anyway. > > - although I accept that you may be correct in your assessment that most > errors are in fact multi-bit; I've seen bad memory in a SCSI controller cause single-bit errors in storage. It was quite confusing since the symptom was syntax errors in the C code we were compiling on the server. The sysadmin actually caught it reliably corrupting a block of source text written out and read back. I've also seen single-bit errors caused by bad memory in a network interface. *Twice*. Particularly nasty since the CRC on TCP/IP packets is only 16-bit so a large enough ftp transfer would eventually finish despite the packet loss but with the occasional bits flipped. In these days of SAN/NAS and SCSI over IP that's pretty scary... Several cases on list have come down to "filesystem secretly replaces entire block of data with Folger's Crystals(tm) -- let's see if the database notices". Any checksum would help in that case but I wouldn't discount single bit errors either. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Wed, Oct 01, 2008 at 11:57:31AM -0400, Alvaro Herrera wrote: > Tom Lane escribió: > > "Jonah H. Harris" <jonah.harris@gmail.com> writes: > > > > I probably wouldn't compare checksumming *every* WAL record to a > > > single block-level checksum. > > > > No, not at all. Block-level checksums would be an order of magnitude > > more expensive: they're on bigger chunks of data and they'd be done more > > often. > > More often? My intention is that they are checked when the buffer is > read in, and calculated/stored when the buffer is written out. > In-memory changers of the block do not check nor recalculate the sum. I know you said detecting memory errors wasn't being attempted, but bad memory accounts for a reasonable number of reports of database corruption on -general, so I was wondering if moving the checks around could catch some of these. How about updating the block's checksum immediately whenever a tuple is modified within it? Checksums would be checked whenever data is read in and just before it's written out. Checksum failure on write would cause PG to abort noisily with complaints about bad memory? If some simple checksum could be found (probably a parity check) that would allow partial updates, PG wouldn't need to scan too much data when regenerating it. It would also cause the checksum to stay bad if data goes bad in memory between reading from disk and its eventual write out; a block remaining in cache for a large amount of time makes this case appear possible. Sam
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > Paul Schlie <schlie@comcast.net> writes: >> - yes, if you're willing to compute true CRC's as opposed to simpler >> checksums, which may be worth the price if in fact many/most data >> check failures are truly caused by single bit errors somewhere in the >> chain, > > FWIW, not one of the corrupted-data problems I've investigated has ever > looked like a single-bit error. So the theoretical basis for using a > CRC here seems pretty weak. I doubt we'd even consider automatic repair > attempts anyway. +1 The only single-bit errors I've seen have been the result of a buggy driver for a particular model of network card. The problem went away with the next update of the driver. I've never encountered a single-bit error in a disk sector. -Kevin
Kevin Grittner wrote: >>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Paul Schlie <schlie@comcast.net> writes: >>> - yes, if you're willing to compute true CRC's as opposed to >>> simpler checksums, which may be worth the price if in fact many/most >>> data check failures are truly caused by single bit errors somewhere >>> in the chain, >> >> FWIW, not one of the corrupted-data problems I've investigated has >> ever looked like a single-bit error. So the theoretical basis for >> using a CRC here seems pretty weak. I doubt we'd even consider >> automatic repair attempts anyway. > > +1 > > The only single-bit errors I've seen have been the result of a buggy > driver for a particular model of network card. The problem went away > with the next update of the driver. I've never encountered a > single-bit error in a disk sector. - Although I personally don't see how a buggy driver would be likely to generate single-bit errors within the data stored/retrieved, as most drivers have no business mucking with data beyond breaking it up or collating it into larger chunks, typically on octet boundaries, unless implementing a soft USART or something like that for some odd reason. - However, regardless, if some form of error detection ends up being implemented, it might be nice to actually log corrupted blocks of data along with their previously computed checksums for subsequent analysis, in an effort to ascertain if there's an opportunity to improve its implementation based on this more concrete real-world information.
Paul Schlie <schlie@comcast.net> writes: > - however regardless, if some form of error detection ends up being > implemented, it might be nice to actually log corrupted blocks of data > along with their previously computed checksums for subsequent analysis > in an effort to ascertain if there's an opportunity to improve its > implementation based on this more concrete real-world information. This feature is getting overdesigned, I think. It's already the case that we log an error complaining that thus-and-such a page is corrupt. Once PG has decided that, it won't have anything to do with the page at all --- it can't load it into shared buffers, so it won't write it either. So the user can go inspect the page at leisure with whatever tools seem handy. I don't see a need for more verbose logging. regards, tom lane
On Wed, Oct 1, 2008 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Paul Schlie <schlie@comcast.net> writes: >> - however regardless, if some form of error detection ends up being >> implemented, it might be nice to actually log corrupted blocks of data >> along with their previously computed checksums for subsequent analysis >> in an effort to ascertain if there's an opportunity to improve its >> implementation based on this more concrete real-world information. > > This feature is getting overdesigned, I think. It's already the case > that we log an error complaining that thus-and-such a page is corrupt. > Once PG has decided that it won't have anything to do with the page at > all --- it can't load it into shared buffers, so it won't write it > either. So the user can go inspect the page at leisure with whatever > tools seem handy. I don't see a need for more verbose logging. Agreed! -- Jonah H. Harris, Senior DBA myYearbook.com
Aidan Van Dyk <aidan@highrise.ca> writes: > One possibility would be to "double-buffer" the write... i.e. as you > calculate your CRC, you're doing it on a local copy of the block, which > you hand to the OS to write... If you're touching the whole block of > memory to CRC it, it isn't *ridiculously* more expensive to copy the > memory somewhere else as you do it... That actually seems like a really good idea. We don't have to increase the buffer locking requirements, or make much of any change at all in the existing logic. +1, especially if this is intended to be an optional feature (which I agree with). regards, tom lane
> Aidan Van Dyk <aidan@highrise.ca> writes: >> One possibility would be to "double-buffer" the write... i.e. as you >> calculate your CRC, you're doing it on a local copy of the block, which >> you hand to the OS to write... If you're touching the whole block of >> memory to CRC it, it isn't *ridiculously* more expensive to copy the >> memory somewhere else as you do it... > > That actually seems like a really good idea. We don't have to increase > the buffer locking requirements, or make much of any change at all in > the existing logic. +1, especially if this is intended to be an > optional feature (which I agree with). > I don't think it makes sense at all!!! If you are going to double buffer, one presumes that for some non-zero period of time, the block must be locked during which it is copied. You wouldn't want it changing "mid-copy", would you? How is this any less of a hit than just calculating the checksum?
pgsql@mohawksoft.com writes: >> That actually seems like a really good idea. > I don't think it make sense at all!!! > If you are going to double buffer, one presumes that for some non-zero > period of time, the block must be locked during which it is copied. You > wouldn't want it changing "mid-copy" would you? How is this any less of a > hit than just calculating the checksum? It only has to be share-locked. That locks out every change *but* hint bit updates, and we don't really care whether we catch a hint bit update or not, so long as it doesn't break our CRC. This is really exactly the same way it works now: there is no way to know whether the page image sent to the kernel includes hint bit updates made after the write() call starts. But we don't care. (The JUST_DIRTIED business ensures that we'll catch any such hint bit update next time.) Thought experiment: suppose that write() had some magic option that made it calculate a CRC on the data after it was pulled from userspace, while it's sitting in a kernel buffer. Then we'd have exactly the same guarantees as we do now about the consistency of data written to disk, except that the CRC magically got in there too. The double-buffer idea implements exactly that behavior, without any magic write option. regards, tom lane
pgsql@mohawksoft.com writes: > If you are going to double buffer, one presumes that for some non-zero > period of time, the block must be locked during which it is copied. You > wouldn't want it changing "mid-copy" would you? How is this any less of a > hit than just calculating the checksum? a) You wouldn't have to keep the lock while doing the I/O. Once the CRC+copy is done you can release the lock secure in the knowledge that nobody is going to modify your buffered copy before the kernel can grab its copy. b) You don't have to worry about hint bits being modified underneath you. As long as the CRC+copy is carefully written to copy whole atomic-sized words/bytes and only read the original once then it won't matter if it catches the hint bit before or after it's set. The CRC will reflect the value buffered and eventually written. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
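A minimal C sketch of the copy-and-CRC idea being discussed, assuming an 8K block and 8-byte words; the function names are hypothetical and the bit-at-a-time CRC helper merely stands in for PostgreSQL's table-driven pg_crc32 macros, so this is an illustration of the technique rather than the actual bufmgr/smgr code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Bit-at-a-time CRC-32 update (polynomial 0x04C11DB7); a stand-in for
 * PostgreSQL's table-driven pg_crc32 macros. */
static uint32_t
crc32_update(uint32_t crc, const unsigned char *data, size_t len)
{
    while (len--)
    {
        crc ^= (uint32_t) *data++ << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/*
 * Copy a shared page into a private bounce buffer and checksum it in the
 * same pass.  Each word of the shared page is read exactly once, so the
 * CRC always describes exactly the bytes that land in the bounce buffer,
 * even if someone flips a hint bit in shared memory while we run.
 */
static uint32_t
copy_and_checksum(char *dst, const char *shared_src)
{
    uint32_t    crc = 0;

    for (size_t off = 0; off < BLCKSZ; off += sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, shared_src + off, sizeof(word));      /* one read */
        memcpy(dst + off, &word, sizeof(word));             /* what hits disk */
        crc = crc32_update(crc, (const unsigned char *) &word, sizeof(word));
    }
    return crc;
}

Because the checksum and the on-disk copy are both derived from the same local word, a hint bit flipped concurrently in shared memory either appears in both or in neither; the CRC can never disagree with the bytes handed to write().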
Gregory Stark <stark@enterprisedb.com> writes: > a) You wouldn't have to keep the lock while doing the I/O. Hoo, yeah, so the period of holding the share-lock could well be *shorter* than it is now. Most especially so if the write() blocks instead of just transferring the data to kernel space and returning. I wonder whether that could mean that it's a win to double-buffer even if we aren't computing a checksum? Nah, probably not. regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [081001 19:42]: > Gregory Stark <stark@enterprisedb.com> writes: > > a) You wouldn't have to keep the lock while doing the I/O. > > Hoo, yeah, so the period of holding the share-lock could well be > *shorter* than it is now. Most especially so if the write() blocks > instead of just transferring the data to kernel space and returning. > > I wonder whether that could mean that it's a win to double-buffer > even if we aren't computing a checksum? Nah, probably not. That all depends on what you think is longer: copy 8K & release, or syscall(which obviously does a copy)+return & release... And whether you want to shorten the lock hold time (but it's only shared), or the time until write is done (isn't the goal to have writes being done in the background during checkpoint so write latency isn't a problem)... This might be an interesting experiment for someone to do with a very high concurrency, high-write load... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Rather than potentially letting this slide past 8.4, I threw together an extremely quick-hack patch at the smgr-layer for block-level checksums. There are some nasties in that the CRC is the first member of PageHeaderData (in order to guarantee inclusion of the LSN), and that it bumps the size of the page header from 24 to 32 bytes on MAXALIGN=8. I really think most people should use checksums and so I didn't make it optional storage within the page (which would of course increase the complexity). Second, I didn't bump page version numbers. Third, rather than a zero-filled page (which has been commonplace for as long as I can remember), I used a fixed magic number (0xdeadbeef) for pd_checksum as the default CRC; that way, if someone enables/disables it at runtime, they won't get invalid checksums for blocks which hadn't been checksummed previously. This may as well be zero (which means PageInit/PageHeaderIsValid wouldn't have to be touched), but I used it as a test. I ran the regressions and several concurrent benchmark tests which passed successfully, but I'm sure I'm missing quite a bit due to the fact that it's late, it's just a quick hack, and I haven't gone through the buffer manager locking code in awhile. I'll be happy to work on this or let Alvaro take it; just as long as it gets done for 8.4. -- Jonah H. Harris, Senior DBA myYearbook.com
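For readers without the patch handy, here is a rough guess at the page-header layout the description above implies: pd_checksum placed first so that the CRC can cover the LSN and everything after it, followed by the stock 8.3-era fields. The typedefs are simplified stand-ins, and none of this is taken from the patch itself:

#include <stdint.h>

/* Simplified stand-ins for the real PostgreSQL typedefs. */
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;
typedef uint16_t LocationIndex;
typedef uint32_t TransactionId;
typedef struct { unsigned lp_off:15, lp_flags:2, lp_len:15; } ItemIdData;

#define PD_NO_CHECKSUM 0xdeadbeefu      /* "not checksummed" sentinel */

typedef struct PageHeaderData
{
    uint32_t        pd_checksum;        /* CRC32 of the rest of the page,
                                         * or PD_NO_CHECKSUM */
    XLogRecPtr      pd_lsn;             /* covered by the CRC because it
                                         * follows the checksum field */
    uint16_t        pd_tli;             /* timeline ID of last change */
    uint16_t        pd_flags;           /* flag bits */
    LocationIndex   pd_lower;           /* offset to start of free space */
    LocationIndex   pd_upper;           /* offset to end of free space */
    LocationIndex   pd_special;         /* offset to start of special space */
    uint16_t        pd_pagesize_version;
    TransactionId   pd_prune_xid;       /* oldest prunable XID, or zero */
    ItemIdData      pd_linp[1];         /* line pointer array */
} PageHeaderData;

MAXALIGN'ing the fixed part on MAXALIGN=8 takes it from 24 to 32 bytes, consistent with the size change mentioned above.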
On Thu, Oct 2, 2008 at 1:29 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote: > I ran the regressions and several concurrent benchmark tests which > passed successfully, but I'm sure I'm missing quite a bit due to the > the fact that it's late, it's just a quick hack, and I haven't gone > through the buffer manager locking code in awhile. Don't know how I missed this obvious one... should not be coding this late @ night :( Patch updated. -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > Rather than potentially letting this slide past 8.4, I threw together > an extremely quick-hack patch at the smgr-layer for block-level > checksums. One hard problem is how to deal with torn pages with non-WAL-logged changes. Like heap hint bit updates, killing tuples in index pages during a scan, and the new FSM pages. Currently, a torn page when writing a hint-bit-updated page doesn't matter, but it would break the checksum. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
* Gregory Stark: > I've also seen single-bit errors caused by bad memory in a network interface. > *Twice*. Particularly nasty since the CRC on TCP/IP packets is only 16-bit so > a large enough ftp transfer would eventually finish despite the packet loss > but with the occasional bits flipped. In these days of SAN/NAS and SCSI over > IP that's pretty scary... I've seen double-bit errors in Internet routing which canceled each other out (the Internet checksum is just a sum, not a CRC, so this happens with some probability once you've got bit errors with a multiple-of-16 periodicity). -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
On Thu, 2008-10-02 at 09:35 +0300, Heikki Linnakangas wrote: > Jonah H. Harris wrote: > > Rather than potentially letting this slide past 8.4, I threw together > > an extremely quick-hack patch at the smgr-layer for block-level > > checksums. > > One hard problem is how to deal with torn pages with non-WAL-logged > changes. Like heap hint bit updates, killing tuples in index pages > during a scan, and the new FSM pages. Hint bit updates and killing tuples in index pages during a scan can probably be brute-forced quite cheaply after we find a CRC mismatch. Not sure about new FSM pages. > Currently, a torn page when writing a hint-bit-updated page doesn't > matter, but it would break the checksum. Another idea is to just ignore non-WAL-logged bits when calculating CRCs, by masking them out before adding the corresponding bytes to the CRC. This requires page-type-aware CRC functions and is more expensive to calculate. How much more expensive is something that only testing can tell. Probably not very much, as everything needed should be in L1 caches already. --------------- Hannu
I have a stupid question wrt hint bits and CRC checksums- it seems to me that it should be possible, if you change the hint bits, to be able to very easily calculate what the change in the CRC checksum should be. The basic idea of the CRC checksum is that, given a message x, the checksum is x mod p where p is some constant polynomial (all operations are in GF(2^n)). Now, the interesting thing about modulo is that it's distributable- that is to say, (x ^ y) mod p = (x mod p) ^ (y mod p), and that (x * y) mod p = ((x mod p) * (y mod p)) mod p (I'm using ^ instead of the more traditional + here to emphasize that it's xor, not addition, I'm doing). So let's assume we're updating a word a known n bytes from the end of the message- we calculate y = old_value ^ new_value, so our change is the equivalent of changing the original block m to (m ^ (y * x^{8n})). The new checksum is then (m ^ (y * x^{8n})) mod p = (m mod p) ^ (((y mod p) * (x^{8n} mod p)) mod p). Now, m mod p is the original checksum, and (x^{8n} mod p) is a constant for a given n, and the multiplication modulo p can be implemented as a set of table lookups, one per byte. The takeaway here is that, if we know ahead of time where the modifications are going to be, we can make updating the CRC checksum (relatively) cheap. So, for example, a change of the hint bits would only need 4 table lookups and a couple of xors to update the block's CRC checksum. We could extend this idea- break the 8K page up into, say, 32 256-byte "subblocks". Changing any given subblock would require only re-checksumming that subblock and then updating the CRC checksum. The reason for the subblocks would be to limit the number of tables necessary- each subblock requires its own set of 4 256-word tables, so having 32 subblocks means that the tables involved would be 32*4*256*4 = 128K in size. Going to, say, 64-byte subblocks means needing 128 sets of tables, or 512K of tables. If people are interested, I could bang out the code tonight, and post it to the list tomorrow. Brian
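A small, self-contained C demonstration of the algebra above, written against a plain non-reflected polynomial CRC-32 (zero initial value, no final XOR) so the linearity stays visible; PostgreSQL's pg_crc32 is a reflected, table-driven variant with an init/final XOR, so the real bookkeeping would differ slightly. The shift_bytes_mod() loop plays the role of the per-position table lookups described above:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy, non-reflected CRC-32: polynomial 0x04C11DB7, zero init, no final XOR. */
static uint32_t
crc32_poly(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;

    while (len--)
    {
        crc ^= (uint32_t) *buf++ << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* Multiply a CRC residue by x^(8*nbytes) modulo the CRC polynomial. */
static uint32_t
shift_bytes_mod(uint32_t v, size_t nbytes)
{
    for (size_t i = 0; i < 8 * nbytes; i++)
        v = (v & 0x80000000u) ? (v << 1) ^ 0x04C11DB7u : v << 1;
    return v;
}

int
main(void)
{
    uint8_t page[256];

    for (size_t i = 0; i < sizeof page; i++)
        page[i] = (uint8_t) (i * 37 + 11);

    uint32_t before = crc32_poly(page, sizeof page);

    /* Flip one "hint bit" somewhere in the middle of the page. */
    size_t  pos = 100;
    uint8_t diff = 0x10;
    page[pos] ^= diff;

    /* Incremental update: CRC of the one-byte delta, shifted to its
     * byte position, XORed into the old CRC. */
    uint32_t delta = shift_bytes_mod(crc32_poly(&diff, 1),
                                     sizeof page - 1 - pos);

    assert((before ^ delta) == crc32_poly(page, sizeof page));
    printf("incremental CRC update matches full recomputation\n");
    return 0;
}

In a real implementation shift_bytes_mod() would be replaced by precomputed tables, so updating the CRC for a changed word costs a handful of lookups and XORs instead of a full-page pass.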
On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote: > I have a stupid question wrt hint bits and CRC checksums- it seems to me > that it should be possible, if you change the hint bits, to be able to very > easily calculate what the change in the CRC checksum should be. Doesn't the problem still remain? The problem being that the buffer can be changed as it's written, yes? -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote: > >> I have a stupid question wrt hint bits and CRC checksums- it seems to me >> that it should be possible, if you change the hint bits, to be able to very >> easily calculate what the change in the CRC checksum should be. >> > > Doesn't the problem still remain? The problem being that the buffer > can be changed as it's written, yes? > > Another possibility is to just not checksum the hint bits... Brian
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote: >> I have a stupid question wrt hint bits and CRC checksums- it seems to me >> that it should be possible, if you change the hint bits, to be able to very >> easily calculate what the change in the CRC checksum should be. > > Doesn't the problem still remain? The problem being that the buffer > can be changed as it's written, yes? It's even worse than that. Two processes can both be fiddling hint bits on different tuples (or even the same tuple) at the same time. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Thu, Oct 2, 2008 at 9:36 AM, Brian Hurt <bhurt@janestcapital.com> wrote: > Another possibility is to just not checksum the hint bits... Seems like that would just complicate matters and prevent a viable checksum. -- Jonah H. Harris, Senior DBA myYearbook.com
On Thu, Oct 2, 2008 at 9:42 AM, Gregory Stark <stark@enterprisedb.com> wrote: > It's even worse than that. Two processes can both be fiddling hint bits on > different tuples (or even the same tuple) at the same time. Agreed. Back to the double-buffer idea, we could have a temporary BLCKSZ buffer we could use immediately before write() which we could copy the block to, perform the checksum on, and write out... is that what you were thinking Tom? -- Jonah H. Harris, Senior DBA myYearbook.com
Brian Hurt wrote: > Jonah H. Harris wrote: >> On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> >> wrote: >>> I have a stupid question wrt hint bits and CRC checksums- it seems to me >>> that it should be possible, if you change the hint bits, to be able >>> to very >>> easily calculate what the change in the CRC checksum should be. >> >> Doesn't the problem still remain? The problem being that the buffer >> can be changed as it's written, yes? >> > Another possibility is to just not checksum the hint bits... That would work. But I'm afraid it'd make the implementation a lot more invasive, and also slower. The buffer manager would have to know what kind of a page it's dealing with, heap or index or FSM or what, to know where the hint bits are. Then it would have to follow the line pointers to locate the hint bits, and mask them out for the CRC calculation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Brian Hurt wrote: >> Another possibility is to just not checksum the hint bits... > That would work. But I'm afraid it'd make the implementation a lot more > invasive, and also slower. The buffer manager would have to know what > kind of a page it's dealing with, heap or index or FSM or what, to know > where the hint bits are. Then it would have to follow the line pointers > to locate the hint bits, and mask them out for the CRC calculation. Right. The odds are that this'd actually be slower than the double-buffer method, because of all the added complexity. And it would really suck from a modularity standpoint to have bufmgr know about all that. The problem we still have to solve is torn pages when writing back a hint-bit update ... regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Brian Hurt wrote: >>> Another possibility is to just not checksum the hint bits... > >> That would work. But I'm afraid it'd make the implementation a lot more >> invasive, and also slower. The buffer manager would have to know what >> kind of a page it's dealing with, heap or index or FSM or what, to know >> where the hint bits are. Then it would have to follow the line pointers >> to locate the hint bits, and mask them out for the CRC calculation. > > Right. The odds are that this'd actually be slower than the > double-buffer method, because of all the added complexity. I was thinking that masking out the hint bits would be implemented by copying the page to the temporary buffer, ANDing out the hint bits there, and then calculating the CRC and issuing the write. So we'd still need to double-buffer. > The problem we still have to solve is torn pages when writing back a > hint-bit update ... Not checksumming the hint bits *is* a solution to the torn page problem. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote: > On Thu, Oct 2, 2008 at 1:29 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote: >> I ran the regressions and several concurrent benchmark tests which >> passed successfully, but I'm sure I'm missing quite a bit due to the >> fact that it's late, it's just a quick hack, and I haven't gone >> through the buffer manager locking code in awhile. > > Don't know how I missed this obvious one... should not be coding this > late @ night :( > > Patch updated. > I read through this patch and am curious why 0xdeadbeef was used as an uninitialized value for the page crc. Is this value somehow less likely to have collisions than zero (or any other arbitrary value)? Would it not be better to add a boolean bit or byte to indicate the crc state? -- Andrew Chernow eSilo, LLC every bit counts http://www.esilo.com/
On Thu, Oct 2, 2008 at 10:09 AM, Andrew Chernow <ac@esilo.com> wrote: > I read through this patch and am curious why 0xdeadbeef was used as an > uninitialized value for the page crc. Is this value somehow less likely to > have collisions than zero (or any other arbitrary value)? It was just an arbitrary value I chose to identify non-checksummed pages; I believe it would have the same collision rate as anything else. > Would it not be better to add a boolean bit or byte to indicate the crc > state? Ideally, though we don't have any spare bits to play with in MAXALIGN=4. -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > On Thu, Oct 2, 2008 at 10:09 AM, Andrew Chernow <ac@esilo.com> wrote: >> Would it not be better to add a boolean bit or byte to indicate the crc >> state? > > Ideally, though we don't have any spare bits to play with in MAXALIGN=4. In the page header? There are plenty of free bits in pd_flags. But isn't it a bit dangerous to have a single flag on the page indicating whether the CRC is valid or not? Any corruption that flips that bit would cause the CRC check to be skipped. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Oct 2, 2008 at 10:27 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Ideally, though we don't have any spare bits to play with in MAXALIGN=4. > > In the page header? There's plenty of free bits in pd_flags. Ahh, didn't see that. Good catch! > But isn't it a bit dangerous to have a single flag on the page indicating > whether the CRC is valid or not? Any corruption that flips that bit would > make the CRC check to be skipped. Agreed. -- Jonah H. Harris, Senior DBA myYearbook.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> The problem we still have to solve is torn pages when writing back a >> hint-bit update ... > Not checksumming the hint bits *is* a solution to the torn page problem. Yeah, but it has enough drawbacks that I'd like to keep looking for alternatives. One argument that I've not seen raised is that not checksumming the hint bits leaves you open to a single-bit error that incorrectly sets a hint bit. regards, tom lane
Andrew Chernow <ac@esilo.com> writes: > I read through this patch and am curious why 0xdeadbeef was used as an > uninitialized value for the page crc. Is this value somehow less likely > to have collisons than zero (or any other arbitrary value)? Actually, because that's a favorite bit pattern for programs to fill unused memory with, I'd venture that it has measurably HIGHER odds of being bogus than any other bit pattern. Consider the possibility that a database page got overwritten with someone's core dump. > Would it not be better to add a boolean bit or byte to inidcate the crc > state? No, as noted that would give you a one-in-two chance of incorrectly skipping the CRC check, not one-in-2^32 or so. If we're going to allow a silent skip of the CRC check then a special value of CRC is a good way to do it ... just not this particular one. regards, tom lane
On Thu, Oct 2, 2008 at 10:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not checksumming the hint bits *is* a solution to the torn page problem. > > Yeah, but it has enough drawbacks that I'd like to keep looking for > alternatives. Agreed. > One argument that I've not seen raised is that not checksumming the hint > bits leaves you open to a single-bit error that incorrectly sets a hint > bit. Agreed. -- Jonah H. Harris, Senior DBA myYearbook.com
So, it comes down to two possible designs, each with its own set of challenges. Just to see where to go from here... I want to make sure the options I've seen in this thread are laid out clearly: 1. Hold an exclusive lock on the buffer during the call to smgrwrite OR 2. Doublebuffer the write OR 3. Do some crufty magic to ignore hint-bit updates Because option 3 not only complicates the entire thing, but also makes corruption more difficult to detect, I don't consider it viable. Can anyone provide a reason that makes this option viable? Option 1 will prevent hint-bit updates during write, which means we can checksum the buffer and not worry about it. Also, is only the buffer content lock required? This could potentially slow down concurrent transactions reading the block and/or writing hint bits. Option #2 consists of copying the block to a temporary buffer, checksumming it, and pushing the checksummed block down to write() (at smgr/md/fd depending on where we want to perform the checksum). From my perspective, I prefer #2 and performing it at the sgmr layer, but I am open to suggestions. Tom, what are your thoughts? #1 isn't very difficult, but I can see it potentially causing a number of side-problems and it would require a fair amount of testing. -- Jonah H. Harris, Senior DBA myYearbook.com
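As a rough illustration of where option 2 could hook in, here is a hedged sketch of a flush path that double-buffers just before the write. All names are hypothetical, the checksum is assumed to sit at the start of the page, and a toy rotate-XOR sum stands in for a real CRC32; the point is only to show the lock scope, not the real FlushBuffer/smgrwrite code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ          8192
#define CHECKSUM_OFFSET 0           /* assumed location of pd_checksum */

/* Toy stand-in checksum; a real implementation would use CRC32. */
static uint32_t
toy_checksum(const char *buf, size_t len)
{
    uint32_t    sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = (sum << 1 | sum >> 31) ^ (unsigned char) buf[i];
    return sum;
}

/*
 * Hypothetical flush path for "option 2": copy the shared page under the
 * existing share lock, then checksum and write the private copy with no
 * buffer lock held at all.
 */
static ssize_t
flush_block_with_checksum(int fd, off_t offset, const char *shared_page)
{
    static char bounce[BLCKSZ];     /* backend-local bounce buffer */
    uint32_t    crc;

    /* caller holds the usual share lock for this memcpy ... */
    memcpy(bounce, shared_page, BLCKSZ);
    /* ... and could release it here, before any of the slow parts */

    crc = toy_checksum(bounce + sizeof(uint32_t), BLCKSZ - sizeof(uint32_t));
    memcpy(bounce + CHECKSUM_OFFSET, &crc, sizeof(crc));

    return pwrite(fd, bounce, BLCKSZ, offset);
}

Whether the copy is fused with the checksum calculation or done as a plain memcpy first is a detail; the property option 2 buys is that the buffer handed to the kernel can no longer change underneath us.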
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > Just to see where to go from here... I want to make sure the options > I've seen in this thread are laid out clearly: > 1. Hold an exclusive lock on the buffer during the call to smgrwrite > OR > 2. Doublebuffer the write > OR > 3. Do some crufty magic to ignore hint-bit updates Right, I think everyone agrees now that #2 seems like the most preferable option for writing the checksum. However, that still leaves us lacking a solution for torn pages during a write that follows a hint bit update. We may end up with some "crufty magic" anyway for dealing with that. regards, tom lane
On Tuesday 30 September 2008 17:17:10 Decibel! wrote: > On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote: > > Doesn't some filesystems include a per-block CRC, which would > > achieve the same thing? ZFS? > > Sure, some do. We're on linux and can't run ZFS. And I'll argue that > no linux FS is anywhere near as tested as ext3 is, which means that > going to some other FS that offers you CRC means you're now exposing > yourself to the possibility of issues with the FS itself. Not to > mention that changing filesystems on a large production system is > very painful. Actually we had someone on irc yesterday explaining how they were able to run zfs on debian linux, so that option might be closer than you think. On a side note, I believe there are a couple of companies that do postgresql consulting that have pretty good experience running it atop solaris... just in case you guys ever do want to do a migration or something ;-) -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
* Tom Lane <tgl@sss.pgh.pa.us> [081002 11:40]: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: > > Just to see where to go from here... I want to make sure the options > > I've seen in this thread are laid out clearly: > > > 1. Hold an exclusive lock on the buffer during the call to smgrwrite > > OR > > 2. Doublebuffer the write > > OR > > 3. Do some crufty magic to ignore hint-bit updates > > Right, I think everyone agrees now that #2 seems like the most > preferable option for writing the checksum. However, that still > leaves us lacking a solution for torn pages during a write that > follows a hint bit update. We may end up with some "crufty > magic" anyway for dealing with that. How does your current "write" strategy handle this situation? I mean, how do you currently guarantee that between when you call write() and the kernel copies the buffer internally, no hint bits are updated? #define write(fd, buf, count) buffer_crc_write(fd, buf, count) whatever protection you have on the regular write is sufficient. The time of the protection will need to start before the "buffer" period instead of just the write (and maybe not the write syscall anymore), but with CPU caches and speed, the buffer period should be <= the time of the write() syscall... Your fsync is your "on disk guarantee", not the write, and that won't change. But I thought you didn't really care about hint-bit updates, even in the current strategy... but I'm fully ignorant about the code, sorry... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 02 Oct 2008 11:57:30 -0400 Robert Treat <xzilla@users.sourceforge.net> wrote: > Actually we had someone on irc yesterday explaining how they were > able to run zfs on debian linux, so that option might be closer than > you think. It's user mode. Not sure I would suggest that from a production server perspective. > > On a side note, I believe there are a couple of companies that do > postgresql consulting that have pretty good experience running it > atop solaris... just in case you guys ever do want to do a migration > or something ;-) :P Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/
On Thu, Oct 2, 2008 at 12:05 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > How does your current "write" strategy handle this situation. I mean, > how do you currently guarnetee that between when you call write() and > the kernel copies the buffer internally, no hint-bit are updated? Working on the exact double-buffering technique now. > #define write(fd, buf, count) buffer_crc_write(fd, buf, count) I certainly wouldn't interpose the write() call itself; that's just asking for trouble. > whatever protection you have on the regular write is sufficient. The > time of the protection will need to start before the "buffer" period > instead of just the write, (and maybe not the write syscall anymore) but > with CPU caches and speed, the buffer period should be <= the time of > the write() syscall... Your fsync is your "on disk guarentee", not the > write, and that won't change. Agreed. > But I thought you didn't really care about hint-bit updates, even in the > current strategy... but I'm fully ignorant about the code, sorry... The current implementation does not take it into account. -- Jonah H. Harris, Senior DBA myYearbook.com
* Jonah H. Harris <jonah.harris@gmail.com> [081002 12:43]: > > #define write(fd, buf, count) buffer_crc_write(fd, buf, count) > > I certainly wouldn't interpose the write() call itself; that's just > asking for trouble. Of course not, that was only to show that whatever you currently protect "write()" with is valid for protecting the buffer+write. > > But I thought you didn't really care about hint-bit updates, even in the > > current strategy... but I'm fully ignorant about the code, sorry... > > The current implementation does not take it into account. So if PG currently doesn't care about the hint bits being updated during the write, then why should introducing a double-buffer introduce the torn-page problem Tom mentions? I admit, I'm fishing for information from those in the know, because I haven't been looking at the code long enough (or all of it enough) to know all the ins and outs... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, Oct 2, 2008 at 12:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: >> > But I thought you didn't really care about hint-bit updates, even in the >> > current strategy... but I'm fully ignorant about the code, sorry... >> >> The current implementation does not take it into account. > > So if PG currently doesn't care about the hint bits being updated during > the write, then why should introducing a double-buffer introduce the > torn-page problem Tom mentions? I admit, I'm fishing for information > from those in the know, because I haven't been looking at the code long > enough (or all of it enough) to know all the ins and outs... PG doesn't care because hint bits aren't WAL-logged and, during normal WAL replay, the old page will be pulled from the WAL. I believe what Tom is referring to is that the buffer PG sends to write() can still be modified by way of SetHintBits between the time smgrwrite is called and the time the actual write takes place, which is why we can't rely on a checksum of the buffer pointer passed to smgrwrite and friends. If we're double-buffering the write, I don't see where we could be introducing a torn page, as we'd actually be writing a copied version of the buffer. Will look into this. -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > PG doesn't care because during hint-bits aren't logged and during > normal WAL replay, the old page will be pulled from the WAL. I > believe what Tom is referring to is that the buffer PG sends to > write() can still be modified by way of SetHintBits between the time > smgrwrite is called and the time the actual write takes place, which > is why we can't rely on a checksum of the buffer pointer passed to > smgrwrite and friends. > > If we're double-buffering the write, I don't see where we could be > introducing a torn-page, as we'd actually be writing a copied version > of the buffer. Will look into this. The torn page is during kernel write to disk, I assume, so it is still possible. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
* Bruce Momjian <bruce@momjian.us> [081002 13:07]: > Jonah H. Harris wrote: > > PG doesn't care because during hint-bits aren't logged and during > > normal WAL replay, the old page will be pulled from the WAL. I > > believe what Tom is referring to is that the buffer PG sends to > > write() can still be modified by way of SetHintBits between the time > > smgrwrite is called and the time the actual write takes place, which > > is why we can't rely on a checksum of the buffer pointer passed to > > smgrwrite and friends. > > > > If we're double-buffering the write, I don't see where we could be > > introducing a torn-page, as we'd actually be writing a copied version > > of the buffer. Will look into this. > > The torn page is during kernel write to disk, I assume, so it is still > possible. Ah, I see... So your full-page-write in the WAL, protecting the torn page has to be aware of the need for a valid CRC32 as well... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, Oct 2, 2008 at 1:07 PM, Bruce Momjian <bruce@momjian.us> wrote: >> If we're double-buffering the write, I don't see where we could be >> introducing a torn-page, as we'd actually be writing a copied version >> of the buffer. Will look into this. > > The torn page is during kernel write to disk, I assume, so it is still > possible. Well, we can't really control too much of that. The most common solution to that I've seen is to double-write the page (which some OSes already do regardless). Or, are you meaning something else? -- Jonah H. Harris, Senior DBA myYearbook.com
On 2 Oct 2008, at 05:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > So if PG currently doesn't care about the hint bits being updated during > the write, then why should introducing a double-buffer introduce the > torn-page problem Tom mentions? I admit, I'm fishing for information > from those in the know, because I haven't been looking at the code long > enough (or all of it enough) to know all the ins and outs... It's not the buffering, it's the checksum. The problem arises if a page is read in but no WAL-logged modifications are done against it. If a hint bit is modified it won't be WAL-logged but the page is marked dirty. When we write the page there's a chance only part of the page actually makes it to disk if the system crashes before the whole page is flushed. WAL-logged changes are safe because of full_page_writes. Hint bits are safe because either the old or the new value will be on disk and we don't care which. It doesn't matter if some hint bits are set and some aren't. However the checksum won't match because the checksum will have been calculated on the whole block and part of it was never written. Writing this explanation did bring to mind one solution which we had already discussed for other reasons: not marking blocks dirty after hint bit setting. Alternatively, if we detect a block is dirty but the LSN is older than the last checkpoint, is that the only time we need to worry? Then we could either discard the writes or generate a no-op WAL record just for the full page write in that case.
Jonah H. Harris wrote: > On Thu, Oct 2, 2008 at 1:07 PM, Bruce Momjian <bruce@momjian.us> wrote: > >> If we're double-buffering the write, I don't see where we could be > >> introducing a torn-page, as we'd actually be writing a copied version > >> of the buffer. Will look into this. > > > > The torn page is during kernel write to disk, I assume, so it is still > > possible. > > Well, we can't really control too much of that. The most common > solution to that I've seen is to double-write the page (which some > OSes already do regardless). Or, are you meaning something else? I just don't see how writing a copy of the page (rather than the original) to the kernel affects issues about torn pages. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> It's not the buffering, it's the checksum. The problem arises if a page is > read in but no WAL-logged modifications are done against it. If a hint bit > is modified it won't be WAL-logged but the page is marked dirty. Ahhhhh. Thanks Greg. Let me look into this a bit before I respond :) -- Jonah H. Harris, Senior DBA myYearbook.com
Greg Stark escribió: > Writing this explanation did bring to mind one solution which we had > already discussed for other reasons: not marking blocks dirty after hint > bit setting. How about when a hint bit is set and the page is not already dirty, set the checksum to the "always valid" value? The problem I have with this idea is that there would be lots of pages excluded from the CRC checks, a non-trivial percentage of the time. Maybe we could mix this with Simon's approach to counting hint bit setting, and calculate a valid CRC on the page every n-th non-logged change. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
* Greg Stark <greg.stark@enterprisedb.com> [081002 13:37]: > It's not the buffering, it's the checksum. The problem arises if a page > is read in but no WAL-logged modifications are done against it. If a > hint bit is modified it won't be WAL-logged but the page is marked > dirty. > > When we write the page there's a chance only part of the page actually > makes it to disk if the system crashes before the whole page is flushed. Yup, Bruce's message pointed me in the right direction... > WAL-logged changes are safe because of full_page_writes. Hint bits are > safe because either the old or the new value will be on disk and we > don't care which. It doesn't matter if some hint bits are set and some > aren't. > > However the checksum won't match because the checksum will have been > calculated on the whole block and part of it was never written. Correct. But now don't full-page writes give us the same protection here against a half-write as they did for the previous case? On recovery after a torn-page write, won't the recovery of the full_page_write WAL + WAL changes get us back to the page as it was before the buffer+checksum+write? The checksum on that block *now in memory* is irrelevant, because it wasn't "read from disk", it was completely constructed from WAL records, which are themselves individually protected by checksums..., and it is now ready to be "written to disk", which will force a valid checksum onto it. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes: >> Wal logged changes are safe because of full_page_writes. Hint bits are >> safe because either the old or the new value will be on disk and we >> don't care which. It doesn't matter if some hint bits are set and some >> aren't. >> >> However the checksum won't match because the checksum will have been >> calculated on the whole block and part of it was never written. > > Correct. But now doesn't full-page-writes give us the same protection > here against a half-write as it did for the previous case? > > On recovery after a torn-page write, won't the recovery of the > full_page_write WAL + WAL changes get us back to the page as it was > before the buffer+checksum+write? Hint bit setting doesn't trigger a WAL record. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Thu, Oct 2, 2008 at 1:58 PM, Gregory Stark <stark@enterprisedb.com> wrote: >> On recovery after a torn-page write, won't the recovery of the >> full_page_write WAL + WAL changes get us back to the page as it was >> before the buffer+checksum+write? > > Hint bit setting doesn't trigger a WAL record. Hence, no page image is written to WAL for later use in recovery. -- Jonah H. Harris, Senior DBA myYearbook.com
On Thu, Oct 2, 2008 at 1:44 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > How about when a hint bit is set and the page is not already dirty, set > the checksum to the "always valid" value? The problem I have with this > idea is that there would be lots of pages excluded from the CRC checks, > a non-trivial percentage of the time. I don't like that because it trades-off corruption detection (the whole point of this feature) for a slight performance improvement. > Maybe we could mix this with Simon's approach to counting hint bit > setting, and calculate a valid CRC on the page every n-th non-logged > change. I still think we should only calculate checksums on the actual write. And, this still seems to have an issue with WAL, unless Simon's original idea somehow included recording hint bit settings/dirtying the page in WAL. -- Jonah H. Harris, Senior DBA myYearbook.com
* Jonah H. Harris <jonah.harris@gmail.com> [081002 14:01]: > On Thu, Oct 2, 2008 at 1:58 PM, Gregory Stark <stark@enterprisedb.com> wrote: > >> On recovery after a torn-page write, won't the recovery of the > >> full_page_write WAL + WAL changes get us back to the page as it was > >> before the buffer+checksum+write? > > > > Hint bit setting doesn't trigger a WAL record. > > Hence, no page image is written to WAL for later use in recovery. OK. Got it... The block is dirty (only because of hint bits). The write starts, crash, torn page; recovery doesn't "fix" the torn page... because it's never been changed (according to WAL), so on next read... Without the CRC it doesn't matter, because the only change was hint bits, so the page is half-old + half-new, but new == old + only hint bits... Because there's no WAL for it, the torn page will be read the next time that buffer is needed... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wednesday 01 October 2008 10:27:52 Tom Lane wrote: > pgsql@mohawksoft.com writes: > >> No, it's all about time penalties and loss of concurrency. > > > > I don't think that the amount of time it would take to calculate and test > > the sum is even important. It may be in older CPUs, but these days CPUs > > are so fast in RAM and a block is very small. On x86 systems, depending > > on page alignment, we are talking about two or three pages that will be > > "in memory" (They were used to read the block from disk or previously > > accessed). > > Your optimism is showing ;-). XLogInsert routinely shows up as a major > CPU hog in any update-intensive test, and AFAICT that's mostly from the > CRC calculation for WAL records. > Yeah... for those who run on filesystems that do checksumming for you, I'd bet they'd much rather see time spent in turning that off rather than checksumming everything else. (just guessing) -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Robert Treat wrote: > On Wednesday 01 October 2008 10:27:52 Tom Lane wrote: > > Your optimism is showing ;-). XLogInsert routinely shows up as a major > > CPU hog in any update-intensive test, and AFAICT that's mostly from the > > CRC calculation for WAL records. > > Yeah... for those who run on filesystems that do checksumming for you, I'd bet > they'd much rather see time spent in turning that off rather than > checksumming everything else. (just guessing) I don't think it can be turned off, because ISTR a failed checksum is used to detect end of the WAL stream to be recovered. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Jonah H. Harris escribió: > On Thu, Oct 2, 2008 at 1:44 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > How about when a hint bit is set and the page is not already dirty, set > > the checksum to the "always valid" value? The problem I have with this > > idea is that there would be lots of pages excluded from the CRC checks, > > a non-trivial percentage of the time. > > I don't like that because it trades-off corruption detection (the > whole point of this feature) for a slight performance improvement. I agree that giving up corruption detection is not such a hot idea, but what I'm intending to get back is not performance but correctness (in this case protection from the torn page problem) > > Maybe we could mix this with Simon's approach to counting hint bit > > setting, and calculate a valid CRC on the page every n-th non-logged > > change. > > I still think we should only calculate checksums on the actual write. Well, if we could trade off a bit of performance for correctness, I would give up on that :-) However, you're right that this tradeoff is not what we're having here. > And, this still seems to have an issue with WAL, unless Simon's > original idea somehow included recording hint bit settings/dirtying > the page in WAL. I have to admit I don't remember exactly how it worked :-) I think the idea was avoiding setting the page dirty until a certain number of hint bit setting operations had been done (which I think means it's not useful for the present purpose). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
* Alvaro Herrera <alvherre@commandprompt.com> [081002 16:18]: > > And, this still seems to have an issue with WAL, unless Simon's > > original idea somehow included recording hint bit settings/dirtying > > the page in WAL. > > I have to admit I don't remember exactly how it worked :-) I think the > idea was avoiding setting the page dirty until a certain number of hint > bit setting operations had been done (which I think means it's not > useful for the present purpose). How crazy would I be to wonder about the performance impact of doing the full_page_write xlog block backups at buffer dirtying time (MarkBufferDirty and SetBufferCommitInfoNeedsSave), instead of at XLogInsert? A few thoughts, quite possibly not true:
* The xlog backup block records don't need to be synced to disk at the time of the dirtying; they can be synced along with stuff behind them, although they *need* to be synced by the time the buffer write() comes along, otherwise we haven't fixed our torn page problem. So practically, we may need to sync them for guarantees.
* OLAP workloads that handle bulk insert/update/delete are probably running with full_page_writes off, so they don't pay the penalty of extra xlog writing on all the hint bits being set.
* OLTP workloads with full_page_writes on would have some extra full page writes, but I have a feeling that most dirtied buffers in OLTP systems are going to get changed by more than just hint bits soon enough anyway, so it's not a huge net increase.
* Slow systems that aren't high-velocity can probably spare a bit more xlog bandwidth anyway...
But my experience is limited to my small-scale databases... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, 2008-10-02 at 16:18 -0400, Alvaro Herrera wrote: > > > Maybe we could mix this with Simon's approach to counting hint bit > > > setting, and calculate a valid CRC on the page every n-th non-logged > > > change. > > > > I still think we should only calculate checksums on the actual write. > > Well, if we could trade off a bit of performance for correctness, I > would give up on that :-) However, you're right that this tradeoff is > not what we're having here. > > > And, this still seems to have an issue with WAL, unless Simon's > > original idea somehow included recording hint bit settings/dirtying > > the page in WAL. > > I have to admit I don't remember exactly how it worked :-) I think the > idea was avoiding setting the page dirty until a certain number of hint > bit setting operations had been done (which I think means it's not > useful for the present purpose). Having read everything so far, the only way I can see to solve the problem does seem to be to make hint bit setting write WAL. When, is the question. Every time is definitely the wrong answer. Hint bit setting approach so far is in two parts: we add code to separate the concept of "dirty" from "has hint bits set". We already have some logic that says when to write dirty pages. So we just add some slightly different logic that says when to write hinted pages. The main correctness of the idea has been worked out. The difficult part is the "when to write hinted pages" because it's just a heuristic, subject to argument and lots of performance testing. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, Oct 2, 2008 at 7:42 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote: >> It's not the buffering, it's the checksum. The problem arises if a page is >> read in but no wal logged modifications are done against it. If a hint bit >> is modified it won't be wal logged but the page is marked dirty. > > Ahhhhh. Thanks Greg. Let me look into this a bit before I respond :) Hmm, how about, when reading a page:

    read the page
    if checksum mismatch {
        flip the hint bits [1]
        if checksum mismatch {
            ERROR
        } else {
            emit a warning, 'found a torn page'
        }
    }

...that is assuming we know which bit to flip and that we accept the check will be a bit weaker. :) OTOH this shouldn't happen too often, so performance shouldn't matter much. My 0.02 Best regards, Dawid Kuroczko [1]: Of course it would be more efficient to flip the checksum, but it would be tricky. :) -- .................. ``The essence of real creativity is a certain: *Dawid Kuroczko* : playfulness, a flitting from idea to idea: qnex42@gmail.com : without getting bogged down by fixated demands.''`..................' Sherkaner Underhill, A Deepness in the Sky, V. Vinge
On Oct 1, 2008, at 2:03 PM, Sam Mason wrote: > I know you said detecting memory errors wasn't being attempted, but > bad memory accounts for a reasonable number of reports of database > corruption on -general so I was wondering if moving the checks around > could catch some of these. Down the road, I would also like to have a sanity check for data modification that occur while the data is in a buffer, to guard against memory or CPU errors. But the huge issue there is how to do it without killing performance. Because there's no obvious solution to that, I don't want to try and get it in for 8.4. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
"Dawid Kuroczko" <qnex42@gmail.com> writes: > On Thu, Oct 2, 2008 at 7:42 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote: > > if checksum mismatch { > flip the hint bits [1] I did try to make something like that work. But I didn't get anywhere. There could easily be dozens of bits to flip. The MaxHeapTuplesPerPage is over 200 and you could easily have half of them that don't match the checksum if the writes happen in 4k chunks. If they happen in 512b chunks then you could have a lot more. And yes they could easily have all been set at around the same time because that's often just what a sequential scan does. And you can't even just set the bits to their "correct" values either before the checksum or before checking the checksum since the "correct" value changes over time. By the time you compare the checksum more bits will be settable than when the page was stored. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Oct 2, 2008, at 3:18 PM, Alvaro Herrera wrote: > I have to admit I don't remember exactly how it worked :-) I think > the > idea was avoiding setting the page dirty until a certain number of > hint > bit setting operations had been done (which I think means it's not > useful for the present purpose). Well, it would be useful if whenever we magically decided it was time to write out a page that had only hint-bit updates we generated WAL, right? Even if it was just a no-op WAL record to ensure we had the page image in the WAL. BTW, speaking of torn pages... I've heard that there are some serious gains to be had by turning full_page_writes off, but I've never even dreamed of doing that because I've never seen any real sure-fire way to check that your hardware can't write torn pages. But if we have checksums enabled and checked the checksums on a block the first time we touched it during recovery, we'd be able to detect torn pages, yet still recover. That would help show that torn pages aren't possible in a particular environment (though unfortunately I don't think there's any way to actually prove that they're not). -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
OK, I have a stupid question- torn pages are a problem, but only during recovery. Recovery is (I assume) a fairly rare condition- if data corruption is going to happen, it's most likely to happen during normal operation. So why not just turn off CRC checksumming during recovery, or at least treat it as a much less critical error? During recovery, if the CRC checksum matches, we can assume the page is good- not only not corrupt, but not torn either. If the CRC checksum doesn't match, we don't panic, but maybe we do more careful analysis of the page to make sure that only the hint bits are wrong. Or maybe not. It's only during normal operation that a CRC checksum failure would be considered critical. Feel free to explain to me why I'm an idiot. Brian
On Fri, Oct 3, 2008 at 3:36 PM, Brian Hurt <bhurt@janestcapital.com> wrote: > OK, I have a stupid question- torn pages are a problem, but only during > recovery. Recovery is (I assume) a fairly rare condition- if data > corruption is going to happen, it's most likely to happen during normal > operation. So why not just turn off CRC checksumming during recovery, or at > least treat it as a much less critical error? During recovery, if the CRC > checksum matches, we can assume the page is good- not only not corrupt, but > not torn either. If the CRC checksum doesn't match, we don't panic, but > maybe we do more careful analysis of the page to make sure that only the > hint bits are wrong. Or maybe not. It's only during normal operation that > a CRC checksum failure would be considered critical. Well:
1. The database half-writes page X to disk, and there is a power outage.
2. We regain power.
3. During recovery the database replays all WAL-logged pages. Page X was not WAL-logged, thus it is not replayed.
4. When replaying is finished, everything looks OK at this point.
5. A user runs a SELECT which hits page X. Oops, we have a checksum mismatch.
Best regards, Dawid Kuroczko -- .................. ``The essence of real creativity is a certain: *Dawid Kuroczko* : playfulness, a flitting from idea to idea: qnex42@gmail.com : without getting bogged down by fixated demands.''`..................' Sherkaner Underhill, A Deepness in the Sky, V. Vinge
Brian Hurt wrote: > OK, I have a stupid question- torn pages are a problem, but only during > recovery. Recovery is (I assume) a fairly rare condition- if data > corruption is going to happen, it's most likely to happen during normal > operation. So why not just turn off CRC checksumming during recovery, > or at least treat it as a much less critical error? During recovery, if > the CRC checksum matches, we can assume the page is good- not only not > corrupt, but not torn either. If the CRC checksum doesn't match, we > don't panic, but maybe we do more careful analysis of the page to make > sure that only the hint bits are wrong. Or maybe not. It's only during > normal operation that a CRC checksum failure would be considered critical. Interesting question. The problem is that we don't read all pages in during recovery. One idea would be to WAL log the page numbers that might be torn and recompute the checksums on those pages during recovery. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
* Decibel! <decibel@decibel.org> [081002 19:18]: > Well, it would be useful if whenever we magically decided it was time > to write out a page that had only hint-bit updates we generated WAL, > right? Even if it was just a no-op WAL record to ensure we had the > page image in the WAL. Well, I'm by no means an expert in the code, but from my looking around bufmgr and transam yesterday, it really looks like it would be a modularity nightmare... But I think that would have the same "total IO" effect as a no-op WAL record being generated at the time the page is dirtied, which would seem to fit the code a bit better... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
So this discussion died with no solution arising to the hint-bit-setting-invalidates-the-CRC problem. Apparently the only solution in sight is to WAL-log hint bits. Simon opines it would be horrible from a performance standpoint to WAL-log every hint bit set, and I think we all agree with that. So we need to find an alternative mechanism to WAL log hint bits. I thought about causing a process that's about to write a page to check a flag that says "this page has been dirtied by someone who didn't bother to generate WAL". If the flag is set, then the writer process is forced to write a WAL record containing all hint bits in the page, and only then is it allowed to write the page (and thus calculate the new CRC). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
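(A rough standalone sketch of the ordering being proposed, using the PD_UNLOGGED_CHANGE name that comes up later in this thread; the structures and helpers are invented for illustration and are not the actual patch. The point is simply: notice the flag, emit the hint-bit WAL record, clear the flag, then checksum and write.)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE          8192
    #define PD_UNLOGGED_CHANGE 0x0008       /* hypothetical pd_flags bit */

    typedef struct
    {
        uint16_t      pd_flags;
        uint32_t      pd_checksum;          /* assumed in-page checksum slot */
        unsigned char body[PAGE_SIZE - 8];  /* toy layout, not the real page header */
    } ToyPage;

    /* Generic bitwise CRC-32; stand-in for the real checksum code. */
    static uint32_t crc32_buf(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++)
        {
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /* Stand-in for XLogInsert(): a record covering the page's hint bits is
     * emitted (and must be flushed) before the data page may reach disk, so a
     * torn write can be repaired from the log at recovery. */
    static void xlog_hint_bits(const ToyPage *page)
    {
        (void) page;
        puts("WAL: hint-bit record emitted for page");
    }

    static void flush_page(ToyPage *page, FILE *datafile)
    {
        if (page->pd_flags & PD_UNLOGGED_CHANGE)
        {
            xlog_hint_bits(page);                 /* WAL before data */
            page->pd_flags &= ~PD_UNLOGGED_CHANGE;
        }
        page->pd_checksum = 0;                    /* checksum computed with the slot zeroed */
        page->pd_checksum = crc32_buf((const unsigned char *) page, sizeof(ToyPage));
        fwrite(page, sizeof(ToyPage), 1, datafile);
    }

    int main(void)
    {
        ToyPage page;
        memset(&page, 0, sizeof(page));
        page.pd_flags |= PD_UNLOGGED_CHANGE;      /* someone set a hint bit without WAL */

        FILE *f = tmpfile();
        if (f == NULL)
            return 1;
        flush_page(&page, f);
        fclose(f);
        return 0;
    }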
On Fri, Oct 17, 2008 at 11:26 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > So this discussion died with no solution arising to the > hint-bit-setting-invalidates-the-CRC problem. I've been busy. > Apparently the only solution in sight is to WAL-log hint bits. Simon > opines it would be horrible from a performance standpoint to WAL-log > every hint bit set, and I think we all agree with that. So we need to > find an alternative mechanism to WAL log hint bits. Agreed. > I thought about causing a process that's about to write a page check a > flag that says "this page has been dirtied by someone who didn't bother > to generate WAL". If the flag is set, then the writer process is forced > to write a WAL record containing all hint bits in the page, and only > then it is allowed to write the page (and thus calculate the new CRC). Interesting idea... let me ponder it for a bit. -- Jonah H. Harris, Senior DBA myYearbook.com
I'm far from convinced wal logging hint bits is a non starter. In fact I doubt the wal log itself is a problem. Having to take the buffer lock does suck though. Heikki had a clever idea earlier which was to have two crc checks: one which skips the hint bits and one dedicated to hint bits. If the second doesn't match we clear all the hint bits. The problem with that is that skipping the hint bits for the main crc would slow it down severely. It would make a lot of sense if the hint bits were all in a contiguous block of memory but I can't see how to make that add up. greg On 17 Oct 2008, at 05:42 PM, "Jonah H. Harris" <jonah.harris@gmail.com> wrote: > On Fri, Oct 17, 2008 at 11:26 AM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> So this discussion died with no solution arising to the >> hint-bit-setting-invalidates-the-CRC problem. > > I've been busy. > >> Apparently the only solution in sight is to WAL-log hint bits. Simon >> opines it would be horrible from a performance standpoint to WAL-log >> every hint bit set, and I think we all agree with that. So we need >> to >> find an alternative mechanism to WAL log hint bits. > > Agreed. > >> I thought about causing a process that's about to write a page >> check a >> flag that says "this page has been dirtied by someone who didn't >> bother >> to generate WAL". If the flag is set, then the writer process is >> forced >> to write a WAL record containing all hint bits in the page, and only >> then it is allowed to write the page (and thus calculate the new >> CRC). > > Interesting idea... let me ponder it for a bit. > > -- > Jonah H. Harris, Senior DBA > myYearbook.com
On Fri, Oct 17, 2008 at 12:05 PM, Greg Stark <greg.stark@enterprisedb.com> wrote: > Heikki had a clever idea earlier which was to have two crc checks- one which > skips the hint bits and one dedicated to hint bits. If the second doesn't > match we clear all the hint bits. Sounds overcomplicated to me. > The problem with that is that skipping the hint bits for the main crc would > slow it down severely. It would make a lot of sense if the hint bits were > all in a contiguous block of memory but I can't see how to make that add up. Agreed. -- Jonah H. Harris, Senior DBA myYearbook.com
* Greg Stark <greg.stark@enterprisedb.com> [081017 12:05]: > I'm far from convinced wal logging hint bits is a non starter. In fact > I doubt the wal log itself is a problem. Having to take the buffer lock > does suck though. And remember, you don't even need to WAL-log all the hint-bit settings... You only *need* to get a WAL backup of the block at *any* point before the block's written, to keep a torn page at recovery from leading to an inconsistent CRC. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote: > Apparently the only solution in sight is to WAL-log hint bits. Simon > opines it would be horrible from a performance standpoint to WAL-log > every hint bit set, and I think we all agree with that. So we need to > find an alternative mechanism to WAL log hint bits. Yes, it's clearly not acceptable bit by bit. But perhaps writing a single WAL record if you scan whole page and set all bits at once. Then it makes sense in some cases. It might be possible to have a partial solution where some blocks have CRC checks, some not. Most databases have static portions. Any block not touched for X amount of time (~= to a distance between current LSN and LSN on block) could have CRC checks added. Or maybe just make it a table-level option and let users choose if they want the hit or not. Or maybe have a new command that you can run whenever you want to set CRC checks. That way you get to choose. CHECK TABLE? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > > On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote: > > > Apparently the only solution in sight is to WAL-log hint bits. Simon > > opines it would be horrible from a performance standpoint to WAL-log > > every hint bit set, and I think we all agree with that. So we need to > > find an alternative mechanism to WAL log hint bits. > > Yes, it's clearly not acceptable bit by bit. > > But perhaps writing a single WAL record if you scan whole page and set > all bits at once. Then it makes sense in some cases. Yeah, I thought about that too -- and perhaps give the scan some slop, so that it will also update some more hint bits that would be updated in the next, say, 100 transactions. However this seems more messy than the other idea. > It might be possible to have a partial solution where some blocks have > CRC checks, some not. That's another idea but it reduces the effectiveness of the check. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, 2008-10-17 at 13:59 -0300, Alvaro Herrera wrote: > > It might be possible to have a partial solution where some blocks have > > CRC checks, some not. > > That's another idea but it reduces the effectiveness of the check. You could put in a GUC to control the check, block by block: 0 = check every time, with full impact; other values delay the use of CRC checks. Kind of like freezing parameters. Let people choose. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
I don't think that works anyways. No matter how thoroughly you update all the hint bits there's still a chance someone else comes along and sets one you missed or is setting hint bits on the same tuple at the same time and your update gets lost. greg On 17 Oct 2008, at 06:59 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Simon Riggs wrote: >> >> On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote: >> >>> Apparently the only solution in sight is to WAL-log hint bits. >>> Simon >>> opines it would be horrible from a performance standpoint to WAL-log >>> every hint bit set, and I think we all agree with that. So we >>> need to >>> find an alternative mechanism to WAL log hint bits. >> >> Yes, it's clearly not acceptable bit by bit. >> >> But perhaps writing a single WAL record if you scan whole page and >> set >> all bits at once. Then it makes sense in some cases. > > Yeah, I thought about that too -- and perhaps give the scan some slop, > so that it will also updates some more hint bits that would be updated > in the next, say, 100 transactions. However this seems more messy > than > the other idea. > >> It might be possible to have a partial solution where some blocks >> have >> CRC checks, some not. > > That's another idea but it reduces the effectiveness of the check. > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc.
Hi, Alvaro Herrera wrote: > So this discussion died with no solution arising to the > hint-bit-setting-invalidates-the-CRC problem. Isn't double-buffering solving this issue? Has somebody checked if it even helps performance due to being able to release the lock on the buffer *before* the syscall? Regards Markus Wanner
Markus Wanner wrote: > Alvaro Herrera wrote: >> So this discussion died with no solution arising to the >> hint-bit-setting-invalidates-the-CRC problem. > > Isn't double-buffering solving this issue? Has somebody checked if it > even helps performance due to being able to release the lock on the > buffer *before* the syscall? Double-buffering helps with the hint bit issues within shared buffers, but doesn't help with the torn page and hint bits problem. The torn page problem seems to be the show-stopper. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote: > So this discussion died with no solution arising to the > hint-bit-setting-invalidates-the-CRC problem. Is there no point at which a page is logically committed to storage, past which no mutating access may be performed?
Simon Riggs wrote: > But perhaps writing a single WAL record if you scan whole page and set > all bits at once. Then it makes sense in some cases. So this is what I ended up doing; attached. There are some gotchas in this patch:
1. it does not consider hint bits other than the ones defined in htup.h. Some index AMs use hint bits to "kill" tuples (LP_DEAD mostly, I think). This means that CRCs will be broken for such pages when pages are torn.
2. some parts of the code could be considered modularity violations. For example, tqual.c is setting a bit in a Page structure; bufmgr.c is later checking that bit to determine when to log.
3. the bgwriter is seen writing WAL entries at checkpoint. At shutdown, this might cause an error to be reported on how there was not supposed to be activity on the log. I didn't save the exact error report and I can't find it in the source :-(
So it "mostly works" at this time. I very much welcome opinions to improve the weak points. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > So this is what I ended up doing; attached. Oh, another thing. The contents of the WAL log message here are very simplistic; just store all the relevant t_infomask and t_infomask2 bits, for all the tuples on the page. A possible optimization to reduce the WAL traffic is to add another infomask bit which indicates whether a hint bit has been set since the last time we visited the page. I'm unsure if this is worth the pain. (Another possibility, even more painful, is to choose at runtime between the two formats, depending on the number of tuples that need hint bits logged.) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
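(Purely to make the "simplistic" format concrete, here is a guess at what such a record body could look like in C; the names and layout are invented for illustration and are not taken from the patch. Relation identification would live in the normal WAL record header and is omitted here.)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    /* Hypothetical body of a "hint bits" WAL record: the block number plus the
     * two infomask words for every tuple on the page.  The optimization
     * discussed above would log only tuples flagged as having had a hint bit
     * set since the last write, at the cost of also storing their offsets. */
    typedef struct xl_page_hintbits
    {
        BlockNumber block;      /* which block of the relation */
        uint16_t    ntuples;    /* number of entries that follow */
        struct
        {
            uint16_t t_infomask;
            uint16_t t_infomask2;
        } bits[1];              /* actually variable-length */
    } xl_page_hintbits;

    int main(void)
    {
        /* e.g. a page with 60 tuples: fixed part plus 60 * 4 bytes of masks */
        unsigned ntuples = 60;
        size_t size = offsetof(xl_page_hintbits, bits)
                      + ntuples * sizeof(((xl_page_hintbits *) 0)->bits[0]);
        printf("record body for %u tuples: %zu bytes\n", ntuples, size);
        return 0;
    }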
Alvaro Herrera wrote: > There are some gotchas in this patch: > > 1. it does not consider hint bits other than the ones defined in htup.h. > Some index AMs use hint bits to "kill" tuples (LP_DEAD mostly, I think). > This means that CRCs will be broken for such pages when pages are torn. The "other hint bits" are:
- LP_DEAD as used by the various callers of ItemIdMarkDead.
- PD_PAGE_FULL
- BTPageOpaque->btpo_flags and btpo_cycleid
All of them are changed with only SetBufferCommitInfoNeedsSave being called afterwards. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > 3. the bgwriter is seen writing WAL entries at checkpoint. At shutdown, > this might cause an error to be reported on how there was not supposed > to be activity on the log. I didn't save the exact error report and I > can't find it in the source :-(

LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: INSERT @ 0/67F05F0: prev 0/67F05C0; xid 0: Heap2 - hintbits: rel 1663/16384/1259; blk 4
CONTEXT: writing block 4 of relation 1663/16384/1259
LOG: xlog flush request 0/67F06C0; write 0/0; flush 0/0
CONTEXT: writing block 4 of relation 1663/16384/1259
LOG: INSERT @ 0/67F06C0: prev 0/67F05F0; xid 0: Heap2 - hintbits: rel 1663/16384/2608; blk 40
CONTEXT: writing block 40 of relation 1663/16384/2608
LOG: xlog flush request 0/67F0708; write 0/67F06C0; flush 0/67F06C0
CONTEXT: writing block 40 of relation 1663/16384/2608
LOG: INSERT @ 0/67F0708: prev 0/67F06C0; xid 0: Heap2 - hintbits: rel 1663/16384/1249; blk 29
CONTEXT: writing block 29 of relation 1663/16384/1249
LOG: xlog flush request 0/67F0808; write 0/67F0708; flush 0/67F0708
CONTEXT: writing block 29 of relation 1663/16384/1249
LOG: INSERT @ 0/67F0808: prev 0/67F0708; xid 0: XLOG - checkpoint: redo 0/67F05F0; tli 1; xid 0/9093; oid 90132; multi 1; offset 0; shutdown
LOG: xlog flush request 0/67F0850; write 0/67F0808; flush 0/67F0808
PANIC: concurrent transaction log activity while database system is shutting down
LOG: background writer process (PID 17411) was terminated by signal 6: Aborted

I am completely at a loss what to do here. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > The "other hint bits" are: > > - LP_DEAD as used by the various callers of ItemIdMarkDead. > - PD_PAGE_FULL > - BTPageOpaque->btpo_flags and btpo_cycleid > > All of them are changed with only SetBufferCommitInfoNeedsSave being > called afterwards. I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead similar to what is done to SetHintBits in the posted patch, and cope with the rest by marking the page with the invalid checksum; they are not so frequent anyway so the reliability loss is low. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Alvaro Herrera wrote: > >> The "other hint bits" are: >> >> - LP_DEAD as used by the various callers of ItemIdMarkDead. >> - PD_PAGE_FULL >> - BTPageOpaque->btpo_flags and btpo_cycleid >> >> All of them are changed with only SetBufferCommitInfoNeedsSave being >> called afterwards. > > I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead > similar to what is done to SetHintBits in the posted patch, and cope > with the rest by marking the page with the invalid checksum; they are > not so frequent anyway so the reliability loss is low. If PD_PAGE_FULL is set and that causes the crc to be set to the invalid sum will we ever get another chance to set it? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Alvaro Herrera wrote: > Simon Riggs wrote: > >> But perhaps writing a single WAL record if you scan whole page and set >> all bits at once. Then it makes sense in some cases. > > So this is what I ended up doing; attached. > > There are some gotchas in this patch: > Please, DO NOT MOVE position of page version in PageHeader structure! And PG_PAGE_LAYOUT_VERSION should be bump to 5. Thanks Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: > Please, DO NOT MOVE position of page version in PageHeader structure! And > PG_PAGE_LAYOUT_VERSION should be bump to 5. Umm, any in-place upgrade should be capable of handling changes to the page header. Or, did I miss something significant in the in-place upgrade design? -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> Please, DO NOT MOVE position of page version in PageHeader structure! And >> PG_PAGE_LAYOUT_VERSION should be bump to 5. > > Umm, any in-place upgrade should be capable of handling changes to the > page header. Or, did I miss something significant in the in-place I thought that was kind of the point of in-place upgrade. Joshua D. Drake
Zdenek Kotala wrote: > Alvaro Herrera wrote: >> Simon Riggs wrote: >> >>> But perhaps writing a single WAL record if you scan whole page and set >>> all bits at once. Then it makes sense in some cases. >> >> So this is what I ended up doing; attached. > > Please, DO NOT MOVE position of page version in PageHeader structure! Hmm. The only way I see we could do that is to set the checksum struct member to a predefined value before calculating the page's checksum. Ah, actually there's another alternative -- leave the checksum in its current position (start of struct) and move other members below pd_pagesize_version (leaning towards pd_tli and pd_flags). That'd leave the page version in the same position. (Hmm, maybe it's better to move pd_lower and pd_upper?) > And PG_PAGE_LAYOUT_VERSION should be bump to 5. Easily done; thanks for the reminder. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
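(For concreteness only, a sketch in C of one arrangement that satisfies the constraint being discussed: pd_pagesize_version keeps its historical offset and the four checksum bytes are simply inserted elsewhere. This is an illustration of the idea, not the layout used in the posted patch.)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t xlogid;
        uint32_t xrecoff;
    } ToyXLogRecPtr;                        /* stand-in for XLogRecPtr */

    typedef struct ToyPageHeaderData
    {
        ToyXLogRecPtr pd_lsn;               /* LSN of last change to the page */
        uint16_t      pd_tli;               /* timeline ID */
        uint16_t      pd_flags;             /* flag bits */
        uint16_t      pd_lower;             /* offset to start of free space */
        uint16_t      pd_upper;             /* offset to end of free space */
        uint16_t      pd_special;           /* offset to start of special space */
        uint16_t      pd_pagesize_version;  /* must stay at this offset across layouts */
        uint32_t      pd_checksum;          /* new: block CRC, space always reserved */
        uint32_t      pd_prune_xid;         /* oldest prunable XID, as in 8.3+ */
        /* line pointer array follows */
    } ToyPageHeaderData;

    int main(void)
    {
        /* The upgrade code only needs pd_pagesize_version to be findable. */
        printf("pd_pagesize_version offset: %zu\n",
               offsetof(ToyPageHeaderData, pd_pagesize_version));
        printf("header size: %zu bytes\n", sizeof(ToyPageHeaderData));
        return 0;
    }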
"Joshua D. Drake" <jd@commandprompt.com> writes: > Jonah H. Harris wrote: >> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >>> Please, DO NOT MOVE position of page version in PageHeader structure! And >>> PG_PAGE_LAYOUT_VERSION should be bump to 5. >> >> Umm, any in-place upgrade should be capable of handling changes to the >> page header. Of, did I miss something significant in the in-place > > I thought that was kind of the point of in place upgrade. Sure, but he has to have a reliable way to tell what version of the page header he's looking at... What I'm wondering though -- are we going to make CRCs mandatory? Or set aside the 4 bytes even if you're not using them? Because if the size of the page header varies depending on whether you're using CRCs that sounds like it would be quite a pain. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> Please, DO NOT MOVE position of page version in PageHeader structure! And >> PG_PAGE_LAYOUT_VERSION should be bump to 5. > Umm, any in-place upgrade should be capable of handling changes to the > page header. Well, yeah, but it has to be able to tell which version it's dealing with. I quite agree with Zdenek that keeping the version indicator in a fixed location is appropriate. regards, tom lane
On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: >> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >>> Please, DO NOT MOVE position of page version in PageHeader structure! And >>> PG_PAGE_LAYOUT_VERSION should be bump to 5. > >> Umm, any in-place upgrade should be capable of handling changes to the >> page header. > > Well, yeah, but it has to be able to tell which version it's dealing > with. I quite agree with Zdenek that keeping the version indicator > in a fixed location is appropriate. Most of the other databases I've worked on, which don't have different types of pages, put the page version as the first element of the page. That would let us put the crc right after it. Thoughts? -- Jonah H. Harris, Senior DBA myYearbook.com
Gregory Stark wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > Alvaro Herrera wrote: > > > >> The "other hint bits" are: > >> > >> - LP_DEAD as used by the various callers of ItemIdMarkDead. > >> - PD_PAGE_FULL > >> - BTPageOpaque->btpo_flags and btpo_cycleid > >> > >> All of them are changed with only SetBufferCommitInfoNeedsSave being > >> called afterwards. > > > > I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead > > similar to what is done to SetHintBits in the posted patch, and cope > > with the rest by marking the page with the invalid checksum; they are > > not so frequent anyway so the reliability loss is low. > > If PD_PAGE_FULL is set and that causes the crc to be set to the invalid sum > will we ever get another chance to set it? I should have qualified that a bit more. It's not setting PD_FULL that's not logged, but clearing it (heap_prune_page, line 282). It's set in heap_update. Hmm, oh I see another problem here -- the bit is not restored when replayed heap_update's WAL record. I'm now wondering what other bits are set without much care about correctly restoring them during replay. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, yeah, but it has to be able to tell which version it's dealing >> with. I quite agree with Zdenek that keeping the version indicator >> in a fixed location is appropriate. > Most of the other databases I've worked, which don't have different > types of pages, put the page version as the first element of the page. > That would let us put the crc right after it. Thoughts? "Fixed location" does not mean "let's move it". regards, tom lane
Jonah H. Harris wrote: > On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote: >> Please, DO NOT MOVE position of page version in PageHeader structure! And >> PG_PAGE_LAYOUT_VERSION should be bump to 5. > > Umm, any in-place upgrade should be capable of handling changes to the > page header. Or, did I miss something significant in the in-place > upgrade design? Not any change. If you move the page header version field to another position, it will require some kind of magic to detect which version it is. Other fields you can place anywhere :-), but do not touch the page version. It will bring a lot of problems... Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Thu, Oct 30, 2008 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: >> On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, yeah, but it has to be able to tell which version it's dealing >>> with. I quite agree with Zdenek that keeping the version indicator >>> in a fixed location is appropriate. > >> Most of the other databases I've worked, which don't have different >> types of pages, put the page version as the first element of the page. >> That would let us put the crc right after it. Thoughts? > > "Fixed location" does not mean "let's move it". Just trying to be helpful. Just thought I might give some insight as to what others, who had implemented in-place upgrade functionality years before Postgres' existence, had done. -- Jonah H. Harris, Senior DBA myYearbook.com
Alvaro Herrera <alvherre@commandprompt.com> writes: > Ah, actually there's another alternative -- leave the checksum in its > current position (start of struct) and move other members below > pd_pagesize_version (leaning towards pd_tli and pd_flags). That'd leave > the page version in the same position. I don't understand why the position of anything matters here. Look at TCP packets, for instance: the checksum is not at the beginning or end of anything. The CRC is chosen such that if you CRC the resulting packet including the CRC you get a CRC of 0. That can be done for whatever offset the CRC appears at, I believe. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
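(A standalone sanity check of the property being described, using a generic bitwise CRC-32 rather than PostgreSQL's pg_crc code: appending a message's CRC and re-running the CRC over the whole thing yields a constant that does not depend on the message, so in principle the verification works no matter where in the block the CRC bytes live.)

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Generic bitwise CRC-32 (poly 0xEDB88320, init/final-XOR 0xFFFFFFFF). */
    static uint32_t crc32_buf(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++)
        {
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /* CRC of (msg || crc(msg)), with the CRC appended least-significant byte first. */
    static uint32_t residue(const char *msg)
    {
        unsigned char buf[256];
        size_t        n = strlen(msg);
        uint32_t      c;

        memcpy(buf, msg, n);
        c = crc32_buf(buf, n);
        for (int i = 0; i < 4; i++)
            buf[n + i] = (unsigned char) (c >> (8 * i));
        return crc32_buf(buf, n + 4);
    }

    int main(void)
    {
        /* The residue is the same for unrelated inputs (commonly quoted as
         * 0x2144DF1C for this CRC-32 variant); with zero init and no final
         * XOR it would come out as exactly zero, as described above. */
        uint32_t r1 = residue("some page contents");
        uint32_t r2 = residue("completely different data here");

        assert(r1 == r2);
        printf("residue = %08x\n", (unsigned) r1);
        return 0;
    }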
Alvaro Herrera wrote: > Zdenek Kotala wrote: >> Alvaro Herrera wrote: >>> Simon Riggs wrote: >>> >>>> But perhaps writing a single WAL record if you scan whole page and set >>>> all bits at once. Then it makes sense in some cases. >>> So this is what I ended up doing; attached. >> Please, DO NOT MOVE position of page version in PageHeader structure! > > Hmm. The only way I see we could do that is to set the checksum > struct member to a predefined value before calculating the page's > checksum. > > Ah, actually there's another alternative -- leave the checksum in its > current position (start of struct) and move other members below > pd_pagesize_version (leaning towards pd_tli and pd_flags). That'd leave > the page version in the same position. > > (Hmm, maybe it's better to move pd_lower and pd_upper?) No, please, keep pd_lower and pd_upper in the same position. They are accessed more often than pd_tli and pd_flags. It is better for optimization. By the way, do you need the CRC as the first page member? Is it for future development like CLOG integration into buffers? Why not put it at the end and mark it as special? It will reduce the space requirement when CRC is not enabled. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > By the way, do you need CRC as a first page member? Is it for future development > like CLOG integration into buffers? Why not put it on the end as and mark it as > a special? It will reduce space requirement when CRC is not enabled. ... and make life tremendously more complex for indexes, plus turning CRC checking on or off on-the-fly would be problematic. I think Alvaro has the right idea: just put the field there all the time. regards, tom lane
Tom Lane wrote: > Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: >> By the way, do you need CRC as a first page member? Is it for future development >> like CLOG integration into buffers? Why not put it on the end as and mark it as >> a special? It will reduce space requirement when CRC is not enabled. > > ... and make life tremendously more complex for indexes, Indexes use the PageGetSpecial macro and live with it, and PageInit could do the correct placement. The only problems are the assert macros and the extra check which verifies the correct size of the special area. > plus turning > CRC checking on or off on-the-fly would be problematic. Yeah, it is a problem. > I think Alvaro > has the right idea: just put the field there all the time. Agree. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Gregory Stark escribió: > What I'm wondering though -- are we going to make CRCs mandatory? Or set aside > the 4 bytes even if you're not using them? Because if the size of the page > header varies depending on whether you're using CRCs that sounds like it would > be quite a pain. Not mandatory, but the space needs to be set aside. (Otherwise you couldn't turn it on after running with it turned off, which would rule out using the database after initdb). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Oct 30, 2008 at 12:14 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Gregory Stark escribió: > >> What I'm wondering though -- are we going to make CRCs mandatory? Or set aside >> the 4 bytes even if you're not using them? Because if the size of the page >> header varies depending on whether you're using CRCs that sounds like it would >> be quite a pain. > > Not mandatory, but the space needs to be set aside. (Otherwise you > couldn't turn it on after running with it turned off, which would rule > out using the database after initdb). Agreed. -- Jonah H. Harris, Senior DBA myYearbook.com
Alvaro Herrera wrote: > Hmm, oh I see another problem here -- the bit is not restored when > replayed heap_update's WAL record. I'm now wondering what other bits > are set without much care about correctly restoring them during replay. I'm now wondering whether it'd be easier to just ignore pd_flags in calculating the checksum. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Oct 30, 2008 at 03:41:17PM +0000, Gregory Stark wrote: > The CRC is chosen such that if you CRC the resulting packet including the CRC > you get a CRC of 0. That can be done for whatever offset the CRC appears at I > believe. IIRC, you calculate the CRC-32 of the page, then XOR it over where it's supposed to end up. No need to preseed (or more accurately, it doesn't matter how you preseed, the result is the same). For checking it doesn't matter either, just checksum the page and if you get zero it's correct. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
* Greg Stark: > Wal logged changes are safe because of full_page_writes. Hint bits are > safe because either the old or the new value will be on disk and we > don't care which. Is this really true with all disks? IBM's DTLA disks didn't behave that way (an interrupted write could zero a sector), and I think the text book algorithms don't assume this behavior, either. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Alvaro Herrera wrote: > Alvaro Herrera wrote: > > > Hmm, oh I see another problem here -- the bit is not restored when > > replayed heap_update's WAL record. I'm now wondering what other bits > > are set without much care about correctly restoring them during replay. > > I'm now wondering whether it'd be easier to just ignore pd_flags in > calculating the checksum. Okay, so this is what I've done. pd_flags is skipped. Also, the WAL routine logs both HeapTupleHeader infomasks and ItemId->lp_flags. On the latter point I'm not 100% sure of the cases where lp_flags must be logged; right now I'm only logging if the item is marked as "having storage" (the logic being that if an item does not have storage, then making it have storage requires a WAL entry, and vice versa). (This version has some debugging log entries which are obviously only WIP material.) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Alvaro Herrera wrote: > >Alvaro Herrera wrote: > > > > Hmm, oh I see another problem here -- the bit is not restored when > > replayed heap_update's WAL record. I'm now wondering what other bits > > are set without much care about correctly restoring them during replay. > > > > I'm now wondering whether it'd be easier to just ignore pd_flags in > > calculating the checksum. > > Okay, so this is what I've done. pd_flags is skipped. Also the WAL > routine logs both HeapTupleHeader infomasks and ItemId->lp_flags. On > the latter point I'm not 100% sure of the cases where lp_flags must be > logged; right now I'm only logging if the item is marked as "having > storage" (the logic being that if an item does not have storage, then > making it have requires a WAL entry, and vice versa). Might it make sense to move such flags to another data structure which may or may not need to be logged, thereby maintaining the crc integrity of the data pages themselves? (I pre-apologize if this is a silly question, as I honestly don't understand how once a page has been logically committed to storage, it can ever be subsequently validly modified unless first removed as being committed to storage; as if its write were interrupted prior to being completed, it seems most correct to simply consider the page as not having been stored and simply resume the process from the beginning if a partial store is suspected; thereby implying that any buffers storing the logical page are not released until the page as a whole is known to have been successfully stored; thereby retaining the entire page to either remain committed for storage, or possibly alternatively be made re-available for mutation with its crc marked as invalid if ever mutated prior to being re-committed to storage, it seems.)
On Fri, Oct 17, 2008 at 12:26:11PM -0300, Alvaro Herrera wrote: > So this discussion died with no solution arising to the > hint-bit-setting-invalidates-the-CRC problem. > > Apparently the only solution in sight is to WAL-log hint bits. Simon > opines it would be horrible from a performance standpoint to WAL-log > every hint bit set, and I think we all agree with that. So we need to > find an alternative mechanism to WAL log hint bits. There is another option I haven't seen mentioned anywhere yet: a single bit change in a page has a predictable change on the CRC, dependent only on the position of the bit. So in theory it would be possible for the process changing the hint bit to update the CRC with a single XOR operation. Working out what to XOR it with is the hard part. Worst case you're talking about a BLOCK_SIZE*8*4 byte = 256K lookup table but CRC has nice mathematical properties which could probably get that down to a few KB. Although, maybe locking of the hint bits would be a problem? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
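(The property being relied on here is the linearity of CRC: for blocks of a fixed length, flipping a given bit changes the CRC by an XOR delta that depends only on the bit position, not on the rest of the block, so the deltas could in principle be precomputed per hint-bit position. A small standalone check of that claim with a generic CRC-32; nothing here is PostgreSQL code.)

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192

    /* Generic bitwise CRC-32 (poly 0xEDB88320, init/final-XOR 0xFFFFFFFF). */
    static uint32_t crc32_buf(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++)
        {
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /* XOR difference in the CRC caused by flipping one bit of the page. */
    static uint32_t bitflip_delta(unsigned char *page, size_t bitpos)
    {
        unsigned char mask   = (unsigned char) (1u << (bitpos % 8));
        uint32_t      before = crc32_buf(page, PAGE_SIZE);
        uint32_t      after;

        page[bitpos / 8] ^= mask;
        after = crc32_buf(page, PAGE_SIZE);
        page[bitpos / 8] ^= mask;              /* restore the page */
        return before ^ after;
    }

    int main(void)
    {
        unsigned char pg1[PAGE_SIZE], pg2[PAGE_SIZE];
        size_t        hintbit = 12345;         /* an arbitrary bit position */

        for (size_t i = 0; i < PAGE_SIZE; i++)
        {
            pg1[i] = (unsigned char) (i * 131 + 7);    /* two unrelated pages */
            pg2[i] = (unsigned char) (i * 251 + 89);
        }

        /* Same delta on both pages: it depends only on the bit position. */
        assert(bitflip_delta(pg1, hintbit) == bitflip_delta(pg2, hintbit));
        printf("delta for bit %u = %08x\n",
               (unsigned) hintbit, (unsigned) bitflip_delta(pg1, hintbit));
        return 0;
    }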
Martijn van Oosterhout <kleptog@svana.org> writes: > There is another option I havn't seen mentioned anywhere yet: a single > bit change in a page has a predictable change on the CRC, dependant > only on the position of the bit. So in theory it would be possible for > the process changing the hint bit to update the CRC with a single XOR > operation. Working out what to XOR it with is the hard part. > Although, maybe locking of the hint bits would be a problem? Yes it would :-(. Also, this scheme would point us towards maintaining the CRCs *continually* while the page is in memory, rather than only recalculating them upon write. So every tuple insert/update/delete would require a recalculation of the entire page CRC. What happened to the plan to double-buffer the writes to avoid this issue? regards, tom lane
On Sun, Nov 09, 2008 at 11:02:32AM -0500, Tom Lane wrote: > Yes it would :-(. Also, this scheme would point us towards maintaining > the CRCs *continually* while the page is in memory, rather than only > recalculating them upon write. So every tuple insert/update/delete > would require a recalculation of the entire page CRC. I wasn't thinking of that. I was thinking more of the situation where a seq scan reads in a page, updates a few hint bits and then goes on to the next page. For these just doing a few XORs might be cheaper. > What happened to the plan to double-buffer the writes to avoid this > issue? Might be better anyway. A single copy-and-checksum would probably be quite cheap (pulling the page into L2 cache). Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
I think double buffering solves the torn page problem but not the lack of wal logging. Alvaro solved the wal logging by deferring the wal logs. But I'm not sure how confident we are that it's logging enough. I'm beginning to think just excluding the hint bits would be simpler and safer. If we're double buffering then it might be possible to do that pretty cheaply. Copy the whole buffer with memcpy then loop through the line pointers unsetting the hint bits. Then do the crc. Though that would prevent us from doing "zero-copy" crc by doing it in the copy. greg On 9 Nov 2008, at 04:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: >> There is another option I havn't seen mentioned anywhere yet: a >> single >> bit change in a page has a predictable change on the CRC, dependant >> only on the position of the bit. So in theory it would be possible >> for >> the process changing the hint bit to update the CRC with a single XOR >> operation. Working out what to XOR it with is the hard part. > >> Although, maybe locking of the hint bits would be a problem? > > Yes it would :-(. Also, this scheme would point us towards > maintaining > the CRCs *continually* while the page is in memory, rather than only > recalculating them upon write. So every tuple insert/update/delete > would require a recalculation of the entire page CRC. > > What happened to the plan to double-buffer the writes to avoid this > issue? > > regards, tom lane
Greg Stark wrote: > I think double buffering solves the torn page problem but not the lack > of wal logging. Alvaro solved the wal logging by deferring the wal > logs. But I'm not sure how confident we are that it's logging enough. > Right now, it's WAL-logging HeapTupleHeader hint bits (infomask and infomask2), and ItemId (line pointer) flags. Page pd_flags are skipped in the CRC checksum -- this is easy to do because they are at a constant offset in the page and I'm just skipping those bytes in CRC_COMP(). So what I'm missing is:
- btree hint bits
- bgwriter calls XLogInsert during shutdown, to WAL-log the hint bits of unwritten pages. This causes a PANIC to trigger about concurrent WAL activity during checkpoint. (The easy solution to this problem is just to remove the check; another idea is to flush the buffers before grabbing the final address to watch for at shutdown.)
> I'm beginning to think just excluding the hint bits would be simpler and > safer. If we're double buffering then it might be possible to do that > pretty cheaply. Copy the whole buffer with memcpy then loop through the > line pointers unsetting the hint bits. Then do the crc. Though that would > prevent us from doing "zero-copy" crc by doing it in the copy. This can probably be made to work, and it solves the problem that bgwriter calls XLogInsert during shutdown. I would create new routines to clear hint bits in all involved modules (heap_resethintbits, btree_%, item_%, page_%), and call them on a copy of the page. The downside to this idea is that we need to create a copy of the page and call those routines when we read the page in, too. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Greg Stark wrote: >> I'm beginning to think just excluding the hint bits would be simpler and >> safer. If we're double buffering then it might be possible to do that >> pretty cheaply. Copy the whole buffer with memcpy then loop through the >> line pointers unsetting the hint bits. Then do the crc. Though that would >> prevent us from doing "zero-copy" crc by doing it in the copy. > The downside to this idea is that we need to create a copy of the page > and call those routines when we read the page in, too. Ugh. The cost on write was bad enough, but paying it on read is a lot worse ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> Greg Stark wrote: >>> I'm beginning to think just excluding the hint bits would be simpler and >>> safer. If we're double buffering then it might be possible to do that >>> pretty cheaply. Copy the whole buffer with memcpy then loop through the >>> line pointers unsetting the hint bits. Then do the crc. Though that would >>> prevent us from doing "zero-copy" crc by doing it in the copy. > >> The downside to this idea is that we need to create a copy of the page >> and call those routines when we read the page in, too. oh, good point. > Ugh. The cost on write was bad enough, but paying it on read is a lot > worse ... I think you could checksum the block including the hint bits, then go back and remove them from the checksum. I didn't realize you were handling more than just the heap transaction hint bits though. It would be hard to do it in any kind of abstract way like you were describing. How happy are you with the wal logging entries? Have you done any tests to see how much extra wal traffic it is? Are you sure you always generate enough logs soon enough? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark wrote: > I think you could checksum the block including the hint bits, then go back and > remove them from the checksum. I'm not sure what you're proposing here. It sounds to me like you are saying that we can read the page, make it available to other users, and then check the CRC. I don't think this works though, because if you do that the possibly-invalid buffer is available to the other readers. > I didn't realize you were handling more than just the heap transaction > hint bits though. It would be hard to do it in any kind of abstract > way like you were describing. Yeah, I also initially thought that there was only a single set of hint bits, but that turned out not to be the case. Right now the nbtree hint bits are the ones missing :-( It's hard to see how to handle those. > How happy are you with the wal logging entries? Have you done any tests to see > how much extra wal traffic it is? Are you sure you always generate enough logs > soon enough? I haven't measured the amount of traffic. They are always generated "soon enough": just before calling smgrwrite on the page in FlushBuffer, i.e. just before the page hits disk. I admit it feels a bit dirty to be calling XLogInsert at such a low level. Right now we log all bits for all tuples, even if a single bit changed. It could be more efficient if I could only log tuples whose hint bits had changed since the last write. This would require setting a bit on every tuple saying "this tuple has an unlogged hint bit" (right now there's a bit at the page level). I haven't tried implementing that. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Gregory Stark wrote: > >> I think you could checksum the block including the hint bits then go back and >> remove them from the checksum. > > I'm not sure what you're proposing here. It sounds to me like you are > saying that we can read the page, make it available to other users, and > then check the CRC. I don't think this works though, because if you do > that the possibly-invalid buffer is available to the other readers. No, I just meant that you could calculate the CRC by scanning the whole buffer efficiently using one of the good word-wise CRC algorithms, then look at the line pointers to find the hint bits and subtract them out of the CRC. The result should be zero after adjusting for the hint bits. It doesn't solve much though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Mon, Nov 10, 2008 at 11:31:33PM +0000, Gregory Stark wrote: > No, I just meant that you could calculate the CRC by scanning the whole buffer > efficiently using one of the good word-wise CRC algorithms, then look at the > line pointers to find the hint bits and subtract them out of the CRC. The > result should be zero after adjusting for the hint bits. If you're going to look at the line pointers anyway, couldn't you just do it in one pass, like:

    n = 0
    next = &tuple[n].hintbits
    pos = 0
    while pos < BLOCK_SIZE:
        if pos == next:
            CRC_ADD( block[pos] & mask )
            n++
            next = &tuple[n].hintbits   # If n == numtups, next = BLOCK_SIZE
        else:
            CRC_ADD( block[pos] )
        pos++

This only handles one byte of hintbits but can easily be extended. No need to actually *store* the hintbit-free version anywhere... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout wrote:
> If you're going to look at the line pointers anyway, couldn't you just
> do it in one pass, like:
>
>     n = 0
>     next = &tuple[n].hintbits
>     pos = 0
>     while pos < BLOCK_SIZE:
>         if pos == next:
>             CRC_ADD( block[pos] & mask )
>             n++
>             next = &tuple[n].hintbits   # If n == numtups, next = BLOCK_SIZE
>         else:
>             CRC_ADD( block[pos] )
>         pos++

For this to work, we would have to create two (or more) versions of the calculate-checksum macro, one for heap pages and another for other pages. I'm not sure how bad that is. The bit that's worse is that we'd need to have external knowledge of what kind of page we're talking about (i.e. FlushBuffer would need to know whether a page is heap or another kind). However, your idea suggests something else that we could do to improve the patch: skip the ItemId->lp_flags during the CRC calculation. This would mean we wouldn't need to WAL-log those. The problem with that is that lp_flags are only 2 bits, so we would need to iterate zeroing them and restore them after CRC_COMP() instead of simply skipping. The immediately useful information arising from your note is that I noticed I'm calling a heap routine on non-heap pages, because of setting PD_UNLOGGED_CHANGE for ItemId flags on index pages. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > However, your idea suggests something else that we could do to improve > the patch: skip the ItemId->lp_flags during the CRC calculation. This > would mean we wouldn't need to WAL-log those. What!? In most cases those bits are critical data, not hints. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > However, your idea suggests something else that we could do to improve > > the patch: skip the ItemId->lp_flags during the CRC calculation. This > > would mean we wouldn't need to WAL-log those. > > What!? In most cases those bits are critical data, not hints. In most cases; but LP_DEAD is used as a hint sometimes which is causing me some grief ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Hmm, I can get around the btree problem by not summing the "special space". This loses a bit of reliability because some of the most critical bits of the page would not be protected by the CRC, but the bulk of the data would be. And this allows me to get away from page type specific tricks (like btpo_cycleid which is used as a hint bit). The reason I noticed this is that I started wondering about only summing the part of the page that's actually used, i.e. the header, the line pointers, and the area beyond pd_upper. I then noticed that if I only include the area between pd_upper and pd_special then I don't need to care about those bits. So far, the only other idea I've had is to keep a list of page types (gin, gist, btree, hash, heap; am I missing something else?) and each module would provide a routine to do the summing. (Or perhaps better: the routine they provide says how to sum the special area of the page. That would allow having a single routine to check the bulk of the page, and the type-specific routine sums the summable parts of the special area.) Thoughts? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Wed, Nov 12, 2008 at 11:08:13AM -0300, Alvaro Herrera wrote: > For this to work, we would have to create two (or more) versions of the > calculate checksum macro, one for heap pages and other for other pages. > I'm not sure how bad is that. The bit that's worse is that we'd need to > have external knowledge of what kind of page we're talking about (i.e. > FlushBuffer would need to know whether a page is heap or another kind). I think all you need is a macro (say COMP_CRC32_ONE) that adds a single byte to the checksum, then use COMP_CRC32 for the bulk of the work. Yes, you'd need to distinguish different kinds of pages. Seems to me the xlog code already has plenty of examples on how to do acrobatics with CRCs. > However, your idea suggests something else that we could do to improve > the patch: skip the ItemId->lp_flags during the CRC calculation. This > would mean we wouldn't need to WAL-log those. The problem with that is > that lp_flags are only 2 bits, so we would need to iterate zeroing them > and restore them after CRC_COMP() instead of simply skipping. Not sure why you're so intent on actually changing memory just so you can use COMP_CRC32, which is just a for loop around the COMP_CRC32_ONE I mentioned. Actually changing the memory probably means locking so why bother. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
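A possible shape for the split Martijn suggests: a single-byte step macro plus a bulk loop built on top of it. This is a self-contained sketch with its own table initializer, not the real pg_crc.h definitions (the bulk macro is named COMP_CRC32_BULK here to avoid pretending it is the server's COMP_CRC32):

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32_table[256];

    /* Build the usual reflected CRC-32 lookup table; call once at startup. */
    static void
    crc32_init(void)
    {
        for (uint32_t i = 0; i < 256; i++)
        {
            uint32_t c = i;

            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0xEDB88320u & -(c & 1));
            crc32_table[i] = c;
        }
    }

    /* Hypothetical single-byte step; the caller starts crc at 0xFFFFFFFF and
     * inverts it at the end, as with a normal CRC-32. */
    #define COMP_CRC32_ONE(crc, byte) \
        ((crc) = crc32_table[((crc) ^ (uint8_t) (byte)) & 0xFF] ^ ((crc) >> 8))

    /* The bulk version is just a loop around the single-byte step. */
    #define COMP_CRC32_BULK(crc, data, len) \
        do { \
            const uint8_t *_d = (const uint8_t *) (data); \
            for (size_t _i = 0; _i < (size_t) (len); _i++) \
                COMP_CRC32_ONE(crc, _d[_i]); \
        } while (0)

With something like this, the checksum loop can mask individual bytes (or a 2-bit lp_flags field copied into a local variable) without ever writing to the shared buffer.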
Martijn van Oosterhout wrote: > On Wed, Nov 12, 2008 at 11:08:13AM -0300, Alvaro Herrera wrote: > > However, your idea suggests something else that we could do to improve > > the patch: skip the ItemId->lp_flags during the CRC calculation. This > > would mean we wouldn't need to WAL-log those. The problem with that is > > that lp_flags are only 2 bits, so we would need to iterate zeroing them > > and restore them after CRC_COMP() instead of simply skipping. > > Not sure why you're so intent on actually changing memory just so you can use > COMP_CRC32, which is just a for loop around the COMP_CRC32_ONE I > mentioned. Actually changing the memory probably means locking so why > bother. Well, that's one of the problems -- memory is being changed without holding a lock. The other problem is that of pages being changed, their CRCs calculated, and then a crash occurring. On recovery, the CRC is restored but some of those changed bits are not. The other thing that maybe you didn't notice is that lp_flags are 2 bits, not a full byte. A byte-at-a-time CRC calculation is no help there. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > The other thing that maybe you didn't notice is that lp_flags are 2 > bits, not a full byte. A byte-at-a-time CRC calculation is no help > there. I think we're talking past each other. Martijn and I are talking about doing something like:

    for (...)
        ...
        crc(word including hint bits)
        ...
    for (each line pointer)
        crc-negated(word & LP_DEAD<<15)

Because CRC is a cyclic checksum it's possible to add or remove bits incrementally. This only works if the data is already copied to someplace so you can be sure nobody will set or clear the bits behind your back. But when you're reading the data back in you don't have to worry about that. I'm a bit surprised to hear our CRC implementation is a bytewise loop. I thought it was much faster to process CRC checks word-wise. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
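The "add or remove bits incrementally" property Greg is relying on is that a CRC is linear over XOR. Here is a small self-contained demonstration using a plain zero-initialized, non-finalized CRC-32; with a nonzero initializer and final inversion the same relation holds but needs an extra term equal to the CRC of an all-zero buffer of the same length. The byte offsets and the 0x30 hint mask below are made up for the example:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define POLY 0x04C11DB7u        /* CRC-32 polynomial, MSB-first form */
    #define LEN  64                 /* toy "page" size */

    /* Pure polynomial-remainder CRC: zero init, no final xor, so it is
     * exactly linear in the data:  crc(a ^ b) == crc(a) ^ crc(b). */
    static uint32_t
    crc32_linear(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0;

        for (size_t i = 0; i < len; i++)
        {
            crc ^= (uint32_t) buf[i] << 24;
            for (int k = 0; k < 8; k++)
                crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY : crc << 1;
        }
        return crc;
    }

    int
    main(void)
    {
        uint8_t page[LEN], hints[LEN], clean[LEN];

        /* Fabricate a page and pretend bytes 10 and 40 carry hint bits 0x30. */
        for (int i = 0; i < LEN; i++)
            page[i] = (uint8_t) (i * 37 + 11);

        memset(hints, 0, sizeof hints);
        hints[10] = page[10] & 0x30;        /* the hint bits actually set */
        hints[40] = page[40] & 0x30;

        for (int i = 0; i < LEN; i++)
            clean[i] = page[i] ^ hints[i];  /* the page with hint bits cleared */

        /* The hint bits can be "subtracted" from the full-page CRC without
         * rescanning the cleaned page. */
        assert(crc32_linear(clean, LEN) ==
               (crc32_linear(page, LEN) ^ crc32_linear(hints, LEN)));
        puts("hint bits removed from the CRC incrementally");
        return 0;
    }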
Gregory Stark wrote: > I think we're talking past each other. Martin and I are talking about doing > something like: > > for (...) > ... > crc(word including hint bits) > ... > for (each line pointer) > crc-negated(word & LP_DEAD<<15) > > Because CRC is a cyclic checksum it's possible to add or remove bits > incrementally. I see. Since our CRC implementation is a simple byte loop, and since ItemIdData fits in a uint32, the attached patch should do mostly the same by copying the line pointer into a uint32, turning off the lp_flags, and summing the modified copy. This patch is also skipping pd_special and the unused area of the page. I'm still testing this; please beware that this likely has an even higher bug density than my regular patches (and some debugging printouts as well). While reading the pg_filedump code I noticed that there's a way to tell the different index pages apart, so perhaps we can use that to be able to checksum the special space as well. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > I'm still testing this; please beware that this likely has an even > higher bug density than my regular patches (and some debugging printouts > as well). This seems impossibly fragile ... and the non-modular assumptions about what is in a disk page aren't even the worst part :-(. The worst part is the race conditions. In particular, the code added to FlushBuffer effectively assumes that the PD_UNLOGGED_CHANGE bit is set sooner than the actual hint bit change occurs. Even if the tqual.c code did that in the right order, which it doesn't, you can't assume that the updates will become visible to other CPUs in the expected order. This might be fixable with introduction of some memory barrier operations but it's certainly broken as-is. Also, if you do make tqual.c set the bits in that order, it's not clear how you can ever *clear* PD_UNLOGGED_CHANGE without introducing a race condition at that end. (The patch actually neglects to do this anywhere, which means that it won't be long till every page in the DB has got that bit set all the time, which I don't think we want.) I also don't like that you've got different CPUs trying to set or clear the same PD_UNLOGGED_CHANGE bit with no locking. We can tolerate that for ordinary hint bits because it's not critical if an update gets lost. But in this scheme PD_UNLOGGED_CHANGE is not an optional hint bit: you *will* mess up if it fails to get set. Even more insidiously, the scheme will certainly fail if someone ever tries to add another asynchronously-updated hint bit in pd_flags, since an update of one of the bits might overwrite a concurrent update of the other. Also, it's not inconceivable (depending on how wide the processor/memory bus is) that one processor updating PD_UNLOGGED_CHANGE could overwrite some other processor's change to the nearby pd_checksum or pd_lsn or pd_tli fields. Basically, you can't make any critical changes to a shared buffer if you haven't got exclusive lock on it. But that's exactly what this patch is assuming it can do. regards, tom lane
I think I'm missing something... In this patch, I see you writing WAL records for hint-bits (bufmgr.c FlushBuffer). But doesn't XLogInsert then make a "backup block" record (unless it's already got one since last checkpoint)? Once there's a backup block record, doesn't that take care of the torn-page problem that would cause the CRC to not validate? On crash/recovery, you won't read this torn block because the WAL log will have the old backup + any possible updates to it... Sorry if I'm missing something very obvious... a. * Alvaro Herrera <alvherre@commandprompt.com> [081113 13:08]: > I see. > > Since our CRC implementation is a simple byte loop, and since ItemIdData > fits in a uint32, the attached patch should do mostly the same by > copying the line pointer into a uint32, turning off the lp_flags, and > summing the modified copy. > > This patch is also skipping pd_special and the unused area of the page. > > I'm still testing this; please beware that this likely has an even > higher bug density than my regular patches (and some debugging printouts > as well). > > While reading the pg_filedump code I noticed that there's a way to tell > the different index pages apart, so perhaps we can use that to be able > to checksum the special space as well. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Tom Lane wrote: > Basically, you can't make any critical changes to a shared buffer > if you haven't got exclusive lock on it. But that's exactly what > this patch is assuming it can do. It seems to me that the only possible way to close this hole is to acquire an exclusive lock before calling FlushBuffers, not shared. This lock would be held until the flag has been examined and reset; the actual WAL record and write would continue with a shared lock, as now. I'm wary of this "solution" because it's likely to reduce concurrency tremendously ... thoughts? (The alternative seems to be to abandon this idea for hint bit logging; we'll need something else.) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Aidan Van Dyk wrote: > > I think I'm missing something... > > In this patch, I see you writing WAL records for hint-bits (bufmgr.c > FlushBuffer). But doesn't XLogInsert then make a "backup block" record (unless > it's already got one since last checkpoint)? I'm not causing a backup block to be written with that WAL record. The rationale is that it's not needed -- if there was a critical write to the page, then there's already a backup block. If the only write was a hint bit being set, then the page cannot possibly be torn. Now that I think about this, I wonder if this can cause problems in some filesystems. XFS, for example, zeroes out during recovery any block that was written to but not fsync'ed before a crash. This means that if we change a hint bit after a checkpoint and mark the page dirty, the system can write the page. Suppose we crash at this point. On recovery, XFS will zero out the block, but there will be nothing with which to recover it, because there's no backup block ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> Basically, you can't make any critical changes to a shared buffer >> if you haven't got exclusive lock on it. But that's exactly what >> this patch is assuming it can do. > It seems to me that the only possible way to close this hole is to > acquire an exclusive lock before calling FlushBuffers, not shared. > This lock would be held until the flag has been examined and reset; the > actual WAL record and write would continue with a shared lock, as now. Well, if we adopt the double buffering approach then the ex-lock would only need to be held for long enough to copy the page contents to local memory. So maybe this would be acceptable. It would certainly be a heck of a lot simpler than any workable variant of the current patch is likely to be; and we could simplify some existing code too (no more need for the BM_JUST_DIRTIED flag for instance). > (The alternative seems to be to abandon this idea for hint bit logging; > we'll need something else.) I'm feeling dissatisfied too --- seems like we're one idea short of a good solution. In the larger scheme of things, this patch shouldn't go in anyway as long as there is some chance that we could have upgrade-in-place for 8.4 at the price of not increasing the page header size. So I think there's time to keep thinking about it. regards, tom lane
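A rough sketch of the double-buffering approach Tom describes, with pthreads standing in for the buffer content lock and a toy CRC; the real code would live in FlushBuffer and use LWLocks, so treat every name here as hypothetical:

    #include <inttypes.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    typedef struct
    {
        pthread_rwlock_t content_lock;    /* stand-in for the buffer content lock */
        uint8_t          data[BLCKSZ];
    } FakeBuffer;

    /* Tiny bit-at-a-time CRC-32, just to keep the sketch self-contained. */
    static uint32_t
    crc32_buf(const uint8_t *p, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;

        for (size_t i = 0; i < len; i++)
        {
            crc ^= p[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /*
     * Flush one buffer: hold the lock only long enough to snapshot the page,
     * then checksum and write the private copy with no lock held.
     */
    static int
    flush_buffer(FakeBuffer *buf, int fd, off_t offset)
    {
        uint8_t  copy[BLCKSZ];
        uint32_t crc;

        pthread_rwlock_wrlock(&buf->content_lock);  /* exclusive just for the copy;
                                                     * a share lock may be enough,
                                                     * see downthread */
        memcpy(copy, buf->data, BLCKSZ);
        pthread_rwlock_unlock(&buf->content_lock);

        crc = crc32_buf(copy, BLCKSZ);              /* would be stored in the CRC fork */
        printf("flushing block at offset %lld, crc %08" PRIx32 "\n",
               (long long) offset, crc);

        return pwrite(fd, copy, BLCKSZ, offset) == (ssize_t) BLCKSZ ? 0 : -1;
    }

The 8 kB local copy is the cost; in exchange the CRC is computed over exactly the bytes that reach the kernel.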
Alvaro Herrera wrote: > Tom Lane wrote: > > > Basically, you can't make any critical changes to a shared buffer > > if you haven't got exclusive lock on it. But that's exactly what > > this patch is assuming it can do. > > It seems to me that the only possible way to close this hole is to > acquire an exclusive lock before calling FlushBuffers, not shared. > This lock would be held until the flag has been examined and reset; the > actual WAL record and write would continue with a shared lock, as now. We don't seem to have an API for reducing LWLock strength though ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > XFS, for example, zeroes out during recovery any block > that was written to but not fsync'ed before a crash. This means that if > we change a hint bit after a checkpoing and mark the page dirty, the > system can write the page. Suppose we crash at this point. On > recovery, XFS will zero out the block, but there will be nothing with > which to recovery it, because there's no backup block ... Really? That would mean that you're prone to lose data if you run PostgreSQL on XFS, even without the CRC patch. I doubt that's true, though. Google found this: http://marc.info/?l=linux-xfs&m=122549156102504&w=2 See the bottom of that mail. Although, Florian Weimer suggested earlier in this thread that IBM DTLA disks have exactly that problem; a sector could be zero-filled if the write is interrupted. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Alvaro Herrera wrote: >> XFS, for example, zeroes out during recovery any block >> that was written to but not fsync'ed before a crash. This means that if >> we change a hint bit after a checkpoing and mark the page dirty, the >> system can write the page. Suppose we crash at this point. On >> recovery, XFS will zero out the block, but there will be nothing with >> which to recovery it, because there's no backup block ... > > Really? That would mean that you're prone to lose data if you run > PostgreSQL on XFS, even without the CRC patch. > > I doubt that's true, though. Google found this: > > http://marc.info/?l=linux-xfs&m=122549156102504&w=2 Ah, there's no problem here then. This email mentions another one by "Eric" which is this one: http://marc.info/?l=linux-xfs&m=122546510218150&w=2 It contains more information about the problem. > Although, Florian Weimer suggested earlier in this thread that IBM DTLA > disks have exactly that problem; a sector could be zero-filled if the > write is interrupted. Hmm. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
* Tom Lane <tgl@sss.pgh.pa.us> [081113 14:43]: > Well, if we adopt the double buffering approach then the ex-lock would > only need to be held for long enough to copy the page contents to local > memory. So maybe this would be acceptable. It would certainly be a > heck of a lot simpler than any workable variant of the current patch > is likely to be; and we could simplify some existing code too (no more > need for the BM_JUST_DIRTIED flag for instance). Well, can we get rid of PD_UNLOGGED_CHANGE completely? I think that if the buffer is dirty (FlushBuffer was called, and you've gotten through the StartBufferIO and gotten the lock), you can just WAL-log the hint bits from the *local double-buffered* "page" (I don't know if the current code allows it easily). If I understand Tom's objections, it's that with the shared lock, other hint bits may still change... But we don't really care if we get all the hint bits to WAL in our write; what we care about is that we get the hint-bits *that we checksummed* to WAL. You'll need to throw the CRC in the WAL as well for the really paranoid. That way, if the write is torn, on recovery the correct hint bits and matching CRC will be available. This means you're chewing up more WAL. You get the WAL record for all the hint bits on every page write. For that you get:

1) Simplified locking (and maybe, with releasing the lock before the write, shorter lock hold-times)

2) Simplified CRC/checksum (don't have to try and skip hint-bits)

3) Hint bits WAL-logged even for blocks written that aren't hint-bit-only

You trade WAL for simplicity and verifiable integrity. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Thu, Nov 13, 2008 at 01:45:52PM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > I'm still testing this; please beware that this likely has an even > > higher bug density than my regular patches (and some debugging printouts > > as well). > > This seems impossibly fragile ... and the non-modular assumptions about > what is in a disk page aren't even the worst part :-(. The worst part > is the race conditions. Actually, the real problem to me seems to be that to check the checksum when you read the page in, you need to look at the contents of the page and "assume" some of the values in there are correct, before you can even calculate the checksum. If the page really is corrupted, chances are the item pointers are going to be bogus, but you need to read them to calculate the checksum... Double-buffering allows you to simply checksum the whole page, so creating a COMP_CRC32_WITH_COPY() macro would do it. Just allocate a block on the stack, copy/checksum it there, do the write() syscall and forget it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
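One way the suggested COMP_CRC32_WITH_COPY() could look: a single loop that copies the page into the local buffer and folds each byte into the CRC as it goes, so the checksum describes exactly the bytes that will be handed to write(). Again a self-contained toy CRC, not the pg_crc.h macros:

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Copy src into dst while accumulating a CRC-32 over the bytes copied.
     * The caller stack-allocates dst (BLCKSZ bytes), calls this while holding
     * the buffer lock, then releases the lock and writes dst to disk.
     */
    static uint32_t
    crc32_with_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;

        for (size_t i = 0; i < len; i++)
        {
            uint8_t b = src[i];

            dst[i] = b;
            crc ^= b;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return crc ^ 0xFFFFFFFFu;
    }

A table-driven or word-wise CRC would make the combined loop considerably cheaper than this bitwise version.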
Martijn van Oosterhout <kleptog@svana.org> writes: > Actually, the real problem to me seems to be that to check the checksum > when you read the page in, you need to look at the contents of the page > and "assume" some of the values in there are correct, before you can > even calculate the checksum. If the page really is corrupted, chances > are the item pointers are going to be bogus, but you need to read them > to calculate the checksum... Hmm. You could verify the values closely enough to ensure you don't crash while redoing the CRC calculation, which ought to be sufficient. Still, I agree that the whole thing looks too Rube Goldbergian to count as a reliability enhancer, which is what the point is after all. > Double-buffering allows you to simply checksum the whole page, so > creating a COMP_CRC32_WITH_COPY() macro would do it. Just allocate a > block on the stack, copy/checksum it there, do the write() syscall and > forget it. I think the argument is about whether we increase our vulnerability to torn-page problems if we just add a CRC and don't do anything else to the overall writing process. Right now, a partial write on a hint-bit-only update merely results in some hint bits getting lost (as long as you discount the scenario where the disk fails to read a partially-written sector at all --- maybe we're fooling ourselves to ignore that?). With a CRC added, that suddenly becomes a corrupted-page situation, and it's not easy to tell that no real harm was done. Again, the real bottom line here is whether there will be a *net* gain in reliability. If a CRC adds too many false-positive reports of bad data, it's not going to be a win. regards, tom lane
Tom Lane wrote: > Still, I agree that the whole thing looks too Rube Goldbergian to count > as a reliability enhancer, which is what the point is after all. Agreed. > I think the argument is about whether we increase our vulnerability to > torn-page problems if we just add a CRC and don't do anything else to > the overall writing process. Right now, a partial write on a > hint-bit-only update merely results in some hint bits getting lost > (as long as you discount the scenario where the disk fails to read a > partially-written sector at all --- maybe we're fooling ourselves to > ignore that?). With a CRC added, that suddenly becomes a corrupted-page > situation, and it's not easy to tell that no real harm was done. The first idea that comes to mind is skipping hint bits in the CRC too. That does away with a lot of the trouble (PD_UNLOGGED_CHANGE, the necessity of WAL-logging hint bits, etc). The problem, again, is that the checksumming process becomes page type-specific; but then maybe that's the only workable approach. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Thu, Nov 13, 2008 at 09:03:41PM -0300, Alvaro Herrera wrote: > The first idea that comes to mind is skipping hint bits in the CRC too. > That does away with a lot of the trouble (PD_UNLOGGED_CHANGE, the > necessity of WAL-logging hint bits, etc). The problem, again, is that > the checksumming process becomes page type-specific; but then maybe > that's the only workable approach. Which brings back the problem of having to decode the page to checksum it, so your checksumming code needs to have all sorts of failsafes in it to stop it going crazy on bad data. But I understand the problem is that you want to continue in the face of torn pages, something which is AFAICS ambitious. At least MS-SQL just blows up on a torn page; I haven't found results for other databases... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout <kleptog@svana.org> writes: > But I understand the problem is that you want to continue in the face > of torn pages, something which is AFAICS ambitious. At least MS-SQL > just blows up on a torn page, havn't found results for other > databases... I don't think it's too "ambitious" to demand that this patch preserve a behavior we have today. In fact, if the patch were to break torn-page handling, it would be 100% likely to be a net *decrease* in system reliability. It would add detection of a situation that is not supposed to happen (ie, storage system fails to return the same data it stored) at the cost of breaking one's database when the storage system acts as it's expected and documented to in a routine power-loss situation. So no, I don't care that MSSQL is unable to handle this. This patch must, or it doesn't go in. regards, tom lane
On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote: > In fact, if the patch were to break torn-page handling, it would be > 100% likely to be a net *decrease* in system reliability. It would add > detection of a situation that is not supposed to happen (ie, storage > system fails to return the same data it stored) at the cost of breaking > one's database when the storage system acts as it's expected and > documented to in a routine power-loss situation. Ok, I see it's a problem because the hint changes are not WAL logged, so torn pages are expected to work in normal operation. But simply skipping the hint bits during checksumming is a terrible solution, since then any errors in those bits will go undetected. To not be able to say in the documentation that you'll detect 100% of single-bit errors is pretty darn terrible, since that's kind of the goal of the exercise. Unfortunately, there aren't a lot of easy solutions here. You could do two checksums, one with and one without hint bits. The overall checksum tells you if there's a problem. If it doesn't match, the second checksum will tell you if it's the hint bits or not (torn page problem). If it's the hint bits, you can reset them all and continue. The checksums need not be of equal strength. The extreme case is an ECC where you explicitly can set it so you can alter N bits before you need to recalculate the checksum. Computationally though, that sucks. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
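A sketch of the two-checksum verification Martijn describes, assuming the reader already knows which byte offsets hold hint bits (that per-page-type knowledge is the part being argued about); the names and keep_mask are made up for the example:

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { PAGE_OK, PAGE_HINT_BITS_TORN, PAGE_CORRUPT } PageVerdict;

    /* Bit-at-a-time CRC-32 over a buffer, optionally masking known hint bytes. */
    static uint32_t
    crc32_masked(const uint8_t *buf, size_t len,
                 const size_t *hint_off, size_t nhints, uint8_t keep_mask)
    {
        uint32_t crc = 0xFFFFFFFFu;
        size_t   n = 0;

        for (size_t i = 0; i < len; i++)
        {
            uint8_t b = buf[i];

            if (hint_off != NULL && n < nhints && i == hint_off[n])
            {
                b &= keep_mask;
                n++;
            }
            crc ^= b;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /* Compare both stored checksums against the page as read from disk. */
    static PageVerdict
    verify_page(const uint8_t *page, size_t len,
                const size_t *hint_off, size_t nhints, uint8_t keep_mask,
                uint32_t stored_full, uint32_t stored_nohints)
    {
        if (crc32_masked(page, len, NULL, 0, 0xFF) == stored_full)
            return PAGE_OK;                 /* everything matches, hints included */
        if (crc32_masked(page, len, hint_off, nhints, keep_mask) == stored_nohints)
            return PAGE_HINT_BITS_TORN;     /* only hint bits differ: reset and go on */
        return PAGE_CORRUPT;                /* real damage */
    }

Both sums can of course be computed in a single pass over the page at write time.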
Martijn van Oosterhout wrote: > On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote: >> In fact, if the patch were to break torn-page handling, it would be >> 100% likely to be a net *decrease* in system reliability. It would add >> detection of a situation that is not supposed to happen (ie, storage >> system fails to return the same data it stored) at the cost of breaking >> one's database when the storage system acts as it's expected and >> documented to in a routine power-loss situation. > > Ok, I see it's a problem because the hint changes are not WAL logged, > so torn pages are expected to work in normal operation. But simply > skipping the hint bits during checksumming is a terrible solution, > since then any errors in those bits will go undetected. To not be able > to say in the documentation that you'll detect 100% of single-bit > errors is pretty darn terrible, since that's kind of the goal of the > exercise. Agreed, trying to explain that in the documentation would look like making excuses. The requirement that all hint bit changes are WAL-logged seems like a pretty big change. I don't like doing that, just for CRCing. There has been discussion before about not writing out pages to disk that only have hint-bit updates on them. That means that the next time the page is read, the reader needs to do the clog lookups and set the hint bits again. It's a tradeoff, making the first SELECT after modifying a page cheaper, I/O-wise, at the cost of making all subsequent SELECTs that need to read the page from disk or kernel cache more expensive, CPU-wise. I'm not sure if I like that idea or not, but it would also solve the CRC problem with torn pages. FWIW, it would also solve the problem suggested with IBM DTLA disks and others that might zero-out a sector in case of an interrupted write. I'm not totally convinced that's a problem, as there's apparently other software that makes the same assumption as we do, and we haven't heard of any torn-page corruption in real life, but still. If we made the behavior configurable, that would be pretty hard to explain in the docs. We'd have three options with dependencies:

- CRC on/off
- write pages with only hint bit changes on/off
- full_page_writes on/off

If you disable full_page_writes, you're vulnerable to torn pages. If you enable it, you're not. Except if you also turn CRC on. Except if you also turn "write pages with only hint bit changes" off. > Unfortunately, there aren't a lot of easy solutions here. You could do > two checksums, one with and one without hint bits. The overall checksum > tells you if there's a problem. If it doesn't match, the second checksum > will tell you if it's the hint bits or not (torn page problem). If it's > the hint bits, you can reset them all and continue. The checksums need > not be of equal strength. Hmm, that would work I guess. > The extreme case is an ECC where you explicitly can set it so you can > alter N bits before you need to recalculate the checksum. > Computationally though, that sucks. Yep. Also, in case of a torn page, you're very likely going to have several hint bits from the old image and several from the new image. An error-correcting code would need to be unfeasibly long to cope with that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
[sorry for top-posting - damn phone] I thought of saying that too but it doesn't really solve the problem. Think of what happens if someone sets a hint bit on a dirty page. greg On 17 Nov 2008, at 08:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com > wrote: > Martijn van Oosterhout wrote: >> On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote: >>> In fact, if the patch were to break torn-page handling, it would be >>> 100% likely to be a net *decrease* in system reliability. It >>> would add >>> detection of a situation that is not supposed to happen (ie, storage >>> system fails to return the same data it stored) at the cost of >>> breaking >>> one's database when the storage system acts as it's expected and >>> documented to in a routine power-loss situation. >> Ok, I see it's a problem because the hint changes are not WAL logged, >> so torn pages are expected to work in normal operation. But simply >> skipping the hint bits during checksumming is a terrible solution, >> since then any errors in those bits will go undetected. To not be >> able >> to say in the documentation that you'll detect 100% of single-bit >> errors is pretty darn terrible, since that's kind of the goal of the >> exercise. > > Agreed, trying to explain that in the documentation would look like > making excuses. > > The requirement that all hint bit changes are WAL-logged seems like > a pretty big change. I don't like doing that, just for CRCing. > > There has been discussion before about not writing out pages to disk > that only have hint-bit updates on them. That means that the next > time the page is read, the reader needs to do the clog lookups and > set the hint bits again. It's a tradeoff, making the first SELECT > after modifying a page cheaper, I/O-wise, at the cost of making all > subsequent SELECTs that need to read the page from disk or kernel > cache more expensive, CPU-wise. > > I'm not sure if I like that idea or not, but it would also solve the > CRC problem with torn pages. FWIW, it would also solve the problem > suggested with IBM DTLA disks and others that might zero-out a > sector in case of an interrupted write. I'm not totally convinced > that's a problem, as there's apparently other software that make the > same assumption as we do, and we haven't heard of any torn-page > corruption in real life, but still. > > If we made the behavior configurable, that would be pretty hard to > explain in the docs. We'd have three options with dependencies > > - CRC on/off > - write pages with only hint bit changes on/off > - full_page_writes on/off > > If disable full_page_writes, you're vulnerable to torn pages. If you > enable it, you're not. Except if you also turn CRC on. Except if you > also turn "write pages with only hint bit changes" off. > >> Unfortunatly, there's not a lot of easy solutions here. You could do >> two checksums, one with and one without hint bits. The overall >> checksum >> tells you if there's a problem. If it doesn't match the second >> checksum >> will tell you if it's the hint bits or not (torn page problem). If >> it's >> the hint bits you can reset them all and continue. The checksums need >> not be of equal strength. > > Hmm, that would work I guess. > >> The extreme case is an ECC where you explicitly can set it so you can >> alter N bits before you need to recalculate the checksum. >> Computationally though, that sucks. > > Yep. 
> Also, in case of a torn page, you're very likely going to have > several hint bits from the old image and several from the new image. > An error-correcting code would need to be unfeasibly long to cope > with that. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com
* Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]: > [sorry for top-posting - damn phone] > > I thought of saying that too but it doesn't really solve the problem. > Think of what happens if someone sets a hint bit on a dirty page. If the page is dirty from a "real change", then it has a WAL backup block record already, so the torn-page on disk is going to be fixed with the wal replay ... *because* of the torn-page problem already being "solved" in PG. You don't get the hint-bits back, but that's no different from the current state. But nobody's previously cared if hint-bits weren't set on WAL replay. The tradeoff for CRC is:

1) Are hint-bits "worth saving"? (noting that with CRC the goal is the ability of detecting blocks that aren't *exactly* as we wrote them)

2) Are hint-bits "nice, but not worth IO"? (noting that this case can be mitigated to only pages which are hint-bit *only* changed, not dirty with already-wal-logged changes)

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Mon, Nov 17, 2008 at 08:41:20AM -0500, Aidan Van Dyk wrote: > * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]: > > [sorry for top-posting - damn phone] > > > > I thought of saying that too but it doesn't really solve the problem. > > Think of what happens if someone sets a hint bit on a dirty page. > > If the page is dirty from a "real change", then it has a WAL backup block > record already, so the torn-page on disk is going to be fixed with the wal > replay ... *because* of the torn-page problem already being "solved" in PG. Aah, I thought the problem was that someone updating a tuple won't write out the whole page to WAL, only a delta. Then again, maybe I understood it wrong. > The tradeoff for CRC is: > > 1) Are hint-bits "worth saving"? (noting that with CRC the goal is the > ability of detecting blocks that aren't *exactly* as we wrote them) Worth saving? No. Does it matter if they're wrong? Yes. If the XMIN_COMMITTED bit gets set incorrectly the tuple may appear even when it shouldn't. Put another way, accidentally having a one converted to a zero is not a problem. Having a zero become a one, though, is probably bad. > 2) Are hint-bits "nice, but not worth IO" (noting that this case can be > mitigated to only pages which are hint-bit *only* changed, not dirty with > already-wal-logged changes) That's a long-running debate. Hint bits do save I/O, the question is the tradeoff. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
* Martijn van Oosterhout <kleptog@svana.org> [081117 10:15]: > Aah, I thought the problem was that someone updating a tuple won't > write out the whole page to WAL, only a delta. Then again, maybe I > understood it wrong. I'm no expert, but from my understanding of xlog.c, the WAL record is a "delta" type record (i.e. with MVCC, a new tuple, but HOT complicates this), but the WAL makes a "backup block record" if the block the WAL record modifies hasn't been written since the last checkpoint. This "backup block" is what saves postgres from the torn-page problem on crash/recovery. On pages with "only hint-bit updates", the torn-page problem was never an issue because the page is valid on recovery whether you get the old or the new hint bits; you just may need to re-calculate a hint-bit again. > > The tradeoff for CRC is: > > > > 1) Are hint-bits "worth saving"? (noting that with CRC the goal is the > > ability of detecting blocks that aren't *exactly* as we wrote them) > > Worth saving? No. Does it matter if they're wrong? Yes. If the > XMIN_COMMITTED bit gets set incorrectly the tuple may appear even when > it shouldn't. Put another way, accidentally having a one converted to a > zero is not a problem. Having a zero become a one, though, is probably > bad. Yes, and this difference is why not WAL-logging hint bits (and allowing torn pages to possibly appear) has been safe and never been a problem. But if you're doing a CRC on the page, then they are suddenly just as important as a "real change". > > 2) Are hint-bits "nice, but not worth IO" (noting that this case can be > > mitigated to only pages which are hint-bit *only* changed, not dirty with > > already-wal-logged changes) > > That's a long-running debate. Hint bits do save I/O, the question is > the tradeoff. And I don't think anyone's going to have a good answer either way unless we get real numbers. But I don't know of any way to get at these numbers right now.

1) How many writes happen on buffer pages that are "hint dirty" but not "really dirty"?

2) How much IO would writing the WAL records for hint bits on every page write take up?

-- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes: > * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]: >> [sorry for top-posting - damn phone] >> >> I thought of saying that too but it doesn't really solve the problem. >> Think of what happens if someone sets a hint bit on a dirty page. > > If the page is dirty from a "real change", then it has a WAL backup block > record already, so the torn-page on disk is going to be fixed with the wal > replay ... *because* of the torn-page problem already being "solved" in PG. > You don't get the hint-bits back, but that's no different from the current > state. But nobody's previously cared if hint-bits wern't set on WAL replay. Hum. Actually I think you're right. However you still have a problem that someone could come along and set the hint bit between calculating the CRC and actually calling write. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark wrote: > However you still have a problem that someone could come along and set the > hint bit between calculating the CRC and actually calling write. The double-buffering will solve that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Aidan Van Dyk wrote: > * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]: >> I thought of saying that too but it doesn't really solve the problem. >> Think of what happens if someone sets a hint bit on a dirty page. > > If the page is dirty from a "real change", then it has a WAL backup block > record already, so the torn-page on disk is going to be fixed with the wal > replay ... *because* of the torn-page problem already being "solved" in PG. > You don't get the hint-bits back, but that's no different from the current > state. But nobody's previously cared if hint-bits weren't set on WAL replay. What if all changes to a page (even hint bits) are WAL-logged when running with block-level CRC checks enabled? Does that make things easier? I'm sure it would result in some performance loss, but anyone enabling block-level CRCs is already trading some performance for safety. Thoughts?
* Matthew T. O'Connor <matthew@zeut.net> [081117 15:19]: > Aidan Van Dyk wrote: >> * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]: >>> I thought of saying that too but it doesn't really solve the problem. >>> Think of what happens if someone sets a hint bit on a dirty page. >> >> If the page is dirty from a "real change", then it has a WAL backup block >> record already, so the torn-page on disk is going to be fixed with the wal >> replay ... *because* of the torn-page problem already being "solved" in PG. >> You don't get the hint-bits back, but that's no different from the current >> state. But nobody's previously cared if hint-bits weren't set on WAL replay. > > > What if all changes to a page (even hint bits) are WAL-logged when > running with block-level CRC checks enabled? Does that make things > easier? I'm sure it would result in some performance loss, but anyone > enabling block-level CRCs is already trading some performance for safety. > > Thoughts? *I'd* be more than happy for that trade-off, because:

1) I run PostgreSQL on old, crappy hardware

2) I run small databases

3) I've never had a situation where PG was already too slow

4) I'd like to know when I really *should* dump old hardware...

But I'm not going to lose money if my DB is down either, so I'd hardly consider myself one to cater to ;-) -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Heikki Linnakangas wrote: >> Gregory Stark wrote: >> However you still have a problem that someone could come along and set the >> hint bit between calculating the CRC and actually calling write. > > The double-buffering will solve that. Or simply require that hint bit writes acquire a write lock on the page (which should be available if not being critically updated or flushed); thereby to write/flush, simply do the same, calc its CRC, write/flush to disk, then release its lock; and which seems like the most reliable thing to do (although no expert on pg's implementation by any means). As I would guess although there may be occasional lock contention, it's likely minor, and would greatly simplify the whole process it would seem? (unless I misunderstand, even double buffering requires a lock, as if multiple hint bits may be updated during the copy, the resulting copy may be inconsistent if only partially reflecting the updates in progress)
Paul Schlie <schlie@comcast.net> writes: > Heikki Linnakangas wrote: >>> Gregory Stark wrote: >>> However you still have a problem that someone could come along and set the >>> hint bit between calculating the CRC and actually calling write. >> >> The double-buffering will solve that. > > Or simply require that hint bit writes acquire a write lock on the page > (which should be available if not being critically updated or flushed); > thereby to write/flush, simply do the same, calc its CRC, write/flush to > disk, then release its lock; and which seems like the most reliable thing > to do (although no expert on pg's implementation by any means). > > As I would guess although there may be occasional lock contention, it's > likely minor, and would greatly simplify the whole process it would seem? Well, it would be a lot more locking than now. You're talking about locking potentially hundreds of times per page scanned as well as locking when doing a write, which is potentially a long time since the write can block. It would be the simplest option. Perhaps we should test whether it's actually a problem. > (unless I misunderstand, even double buffering requires a lock, as if > multiple hint bits may be updated during the copy, the resulting copy may be > inconsistent if only partially reflecting the updates in progress) No, you only need a share lock to do the copy since there's nothing wrong with "inconsistent" sets of hint bits as long as you're checksumming the same copy you're putting on disk. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark wrote: > Paul Schlie writes: > >> Heikki Linnakangas wrote: >>>> Gregory Stark wrote: >>>> However you still have a problem that someone could come along and set the >>>> hint bit between calculating the CRC and actually calling write. >>> >>> The double-buffering will solve that. >> >> Or simply require that hint bit writes acquire a write lock on the page >> (which should be available if not being critically updated or flushed); >> thereby to write/flush, simply do the same, calc its CRC, write/flush to >> disk, then release its lock; and which seems like the most reliable thing >> to do (although no expert on pg's implementation by any means). >> >> As I would guess although there may be occasional lock contention, it's >> likely minor, and would greatly simplify the whole process it would seem? > > Well, it would be a lot more locking than now. You're talking about locking > potentially hundreds of times per page scanned as well as locking when doing > a write, which is potentially a long time since the write can block. - I guess one could define another lock, specifically for hint bits; thereby a page scan would not need that lock, but hint bit updates could use it in combination with a page flush requiring both a share lock and the hint bit lock. (but I don't know if it's overall better than copying) > It would be the simplest option. Perhaps we should test whether it's actually > a problem. > >> (unless I misunderstand, even double buffering requires a lock, as if >> multiple hint bits may be updated during the copy, the resulting copy may be >> inconsistent if only partially reflecting the updates in progress) > > No, you only need a share lock to do the copy since there's nothing wrong with > "inconsistent" sets of hint bits as long as you're checksumming the same copy > you're putting on disk. Understood, thanks.
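For what Paul is sketching, a dedicated hint-bit lock might look roughly like this in pthread terms; it is purely illustrative (the hypothetical FakeBuffer stands in for a shared buffer, and real code would use LWLocks inside the buffer manager):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    typedef struct
    {
        pthread_rwlock_t content_lock;   /* existing share/exclusive content lock */
        pthread_mutex_t  hint_lock;      /* proposed extra lock, only for hint bits */
        uint8_t          data[BLCKSZ];
    } FakeBuffer;

    /* Scanners already hold the content lock in share mode; they additionally
     * take hint_lock for the brief moment they flip a hint bit. */
    static void
    set_hint_bit(FakeBuffer *buf, size_t byte_off, uint8_t bit)
    {
        pthread_mutex_lock(&buf->hint_lock);
        buf->data[byte_off] |= bit;
        pthread_mutex_unlock(&buf->hint_lock);
    }

    /* The flusher snapshots the page with both locks held, so no hint bit can
     * change between checksumming and writing the copy. */
    static void
    snapshot_for_write(FakeBuffer *buf, uint8_t *copy)
    {
        pthread_rwlock_rdlock(&buf->content_lock);
        pthread_mutex_lock(&buf->hint_lock);
        memcpy(copy, buf->data, BLCKSZ);
        pthread_mutex_unlock(&buf->hint_lock);
        pthread_rwlock_unlock(&buf->content_lock);
    }

Whether the extra mutex traffic on every visibility check is cheaper than simply copying the page under the existing share lock is exactly the open question in this subthread.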
Aidan Van Dyk wrote: > And I don't think anyone's going to have a good answer either way unless we get > real numbers. But I don't know of any way to get at these numbers right now. > > 1) How many writes happen on buffer pages that are "hint dirty" but not "really > dirty"? > > 2) How much IO would writing the WAL records for hint bits on every page write > take up? I don't think it's a matter of how many writes or how much IO. The question is locks. Right now we flip hint bits without taking any kind of lock on the page. If we're going to WAL-log each hint bit change, then we will need to lock the page to update the LSN. This will make changing a hint bit a very expensive operation, and maybe a possible cause for deadlocks. What my patch did was log hint bits in bulk. The problem of that approach was precisely that it was not locking the logged page enough (locking before setting the "this page needs hint bits logged" bit). Of course, the trivial solution is just to lock the page before flipping hint bits, but I don't know (and I doubt) whether it would really work at all. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Right now we flip hint bits without taking any kind > of lock on the page. That's not quite true. You need to hold a shared lock on heap page to examine the visibility of a tuple, and that's when the hint bits are set. So we always hold at least a shared lock on the page while hint bits are set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
* Alvaro Herrera <alvherre@commandprompt.com> [081118 12:25]: > I don't think it's a matter of how many writes or how much IO. The > question is locks. Right now we flip hint bits without taking any kind > of lock on the page. If we're going to WAL-log each hint bit change, > then we will need to lock the page to update the LSN. This will make > changing a hint bit a very expensive operation, and maybe a possible > cause for deadlocks. Ya, that's obviously the worst option. > What my patch did was log hint bits in bulk. The problem of that > approach was precisely that it was not locking the logged page enough > (locking before setting the "this page needs hint bits logged" bit). Of > course, the trivial solution is just to lock the page before flipping > hint bits, but I don't know (and I doubt) whether it would really work > at all. But why can't you WAL-log the hint bits from the "buffered" page? Then you're consistent. At least as consistent as the original write was. So your CRC ends up being:

    Buffer the page
    Calculate CRC on the buffered page
    WAL (in bulk) the hint bits (and maybe CRC?)
    Write the buffered page

-- aidan van dyk create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Alvaro Herrera wrote: >> Right now we flip hint bits without taking any kind >> of lock on the page. > That's not quite true. You need to hold a shared lock on heap page to > examine the visibility of a tuple, and that's when the hint bits are > set. So we always hold at least a shared lock on the page while hint > bits are set. Right, but we couldn't let hint-bit-setters update the page LSN with only shared lock. Too much chance of ending up with a scrambled LSN value. Could we arrange for the actual LSN-updating to be done while still holding WALInsertLock? Then we'd be depending on that lock, not the page-level locks, to serialize. It's not great to be pushing more work inside that global lock, but it's not very much more work ... regards, tom lane
Aidan Van Dyk <aidan@highrise.ca> writes: > But why can't you wal-log the hint bits from the "buffered" page. then your > consitent. At least as consistent as the original write was. > So you're CRC ends up being: > Buffer the page > Calculate CRC on the buffered page > WAL (in bulk) the hint bits (and maybe CRC?) > write buffered page The trouble here is to avoid repeated WAL-logging of the same hint bits. (Alvaro's patch tried to do that by depending on another hint bit in the page header, but that seems unsafe if hint bit setters aren't taking exclusive lock.) regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [081118 12:43]: > Aidan Van Dyk <aidan@highrise.ca> writes: > > The trouble here is to avoid repeated WAL-logging of the same hint bits. > > > > (Alvaro's patch tried to do that by depending on another hint bit in the > > page header, but that seems unsafe if hint bit setters aren't taking > > exclusive lock.) And I know it's extra IO. That's why I started the whole thing with a question along the lines of "how much extra IO are people going to take" for the sake of "guaranteeing" we read exactly what we wrote. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Tue, 2008-11-18 at 12:54 -0500, Aidan Van Dyk wrote: > * Tom Lane <tgl@sss.pgh.pa.us> [081118 12:43]: > > Aidan Van Dyk <aidan@highrise.ca> writes: > > The trouble here is to avoid repeated WAL-logging of the same hint bits. > > > > (Alvaro's patch tried to do that by depending on another hint bit in the > > page header, but that seems unsafe if hint bit setters aren't taking > > exclusive lock.) > > And I know it's extra IO. That's why I started the whole thing with a question > along the lines of "how much extra IO are people going to take" for the sake of > "guarenteeing" we read exactly what we wrote. Those that need it will turn it on, those that don't won't. IO is cheap for those that are going to actually need this feature. Joshua D. Drake > > a. > --
On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > This patch is also skipping pd_special and the unused area of the page. v11 doesn't apply to cvs head anymore. -- Sincerely, Jaime Casanova PostgreSQL support and training Systems consulting and development Guayaquil - Ecuador Cel. +59387171157
Jaime Casanova wrote: > On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com> > > This patch is also skipping pd_special and the unused area of the page. > > v11 doesn't apply to cvs head anymore I'm not currently working on this patch, sorry. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Jaime Casanova wrote: >> On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com> >>> This patch is also skipping pd_special and the unused area of the page. >> v11 doesn't apply to cvs head anymore > > I'm not currently working on this patch, sorry. > Should we pull it from 8.4, then? --Josh
On Sun, Dec 14, 2008 at 4:51 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> v11 doesn't apply to cvs head anymore >> >> I'm not currently working on this patch, sorry. >> > > Should we pull it from 8.4, then? Here's an updated patch against head. NOTE, it appears that this (and the previous) patch PANIC with "concurrent transaction log activity while database system is shutting down" on shutdown if checksumming is enabled. This appears to be due to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown. Other than that, I haven't looked into what needs to be done to fix it. Similarly, I ran a pgbench, performed a manual checkpoint, and corrupted the tellers table myself using hexedit but the system didn't pick up the corruption at all :( Alvaro, have you given up on the patch or are you just busy on something else at the moment? -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > On Sun, Dec 14, 2008 at 4:51 PM, Josh Berkus <josh@agliodbs.com> wrote: > >>> v11 doesn't apply to cvs head anymore > >> > >> I'm not currently working on this patch, sorry. > >> > > > > Should we pull it from 8.4, then? > > Here's an updated patch against head. Thanks. > NOTE, it appears that this (and the previous) patch PANIC with > "concurrent transaction log activity while database system is shutting > down" on shutdown if checksumming is enabled. This appears to be due > to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown. Yeah, I reported this issue several times. > Similarly, I ran a pgbench, performed a manual checkpoint, and > corrupted the tellers table myself using hexedit but the system didn't > pick up the corruption at all :( Heh :-) > Alvaro, have you given up on the patch or are you just busy on > something else at the moment? I've given up until we find a good way to handle hint bits. Various schemes have been proposed but they all have more or less fatal flaws. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Mon, Dec 15, 2008 at 7:24 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> Here's an updated patch against head. > > Thanks. No problemo. >> NOTE, it appears that this (and the previous) patch PANIC with >> "concurrent transaction log activity while database system is shutting >> down" on shutdown if checksumming is enabled. This appears to be due >> to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown. > > Yeah, I reported this issue several times. Hmm. Well, the easiest thing would be to add a !shutdown check for logging the hint bits during the shutdown checkpoint :) Of course, that would break the page for recovery, which was the whole point of putting that in place. I'd have to look at xlog and see whether that check can be deferred or changed. Or, did you already research this issue? >> Similarly, I ran a pgbench, performed a manual checkpoint, and >> corrupted the tellers table myself using hexedit but the system didn't >> pick up the corruption at all :( > > Heh :-) :( >> Alvaro, have you given up on the patch or are you just busy on >> something else at the moment? > > I've given up until we find a good way to handle hint bits. Various > schemes have been proposed but they all have more or less fatal flaws. Agreed. Though, I don't want to see this patch get dropped from 8.4. ALL, Alvaro has tried a couple different methods, does anyone have any other ideas? -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris wrote: > >> Alvaro, have you given up on the patch or are you just busy on > >> something else at the moment? > > > > I've given up until we find a good way to handle hint bits. Various > > schemes have been proposed but they all have more or less fatal flaws. > > Agreed. Though, I don't want to see this patch get dropped from 8.4. > > ALL, Alvaro has tried a couple different methods, does anyone have any > other ideas? Feature freeze is not the time to be looking for new ideas. I suggest we save this for 8.5. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote: > Jonah H. Harris wrote: >> >> Alvaro, have you given up on the patch or are you just busy on >> >> something else at the moment? >> > >> > I've given up until we find a good way to handle hint bits. Various >> > schemes have been proposed but they all have more or less fatal flaws. >> >> Agreed. Though, I don't want to see this patch get dropped from 8.4. >> >> ALL, Alvaro has tried a couple different methods, does anyone have any >> other ideas? > > Feature freeze is not the time to be looking for new ideas. I suggest > we save this for 8.5. Well, we may not need a new idea. Currently, the problem I see with the checkpoint-at-shutdown looks like it could possibly be easily solved. Though, there may be other issues I'm not familiar with. Has anyone reviewed this yet? -- Jonah H. Harris, Senior DBA myYearbook.com
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote: >> Feature freeze is not the time to be looking for new ideas. I suggest >> we save this for 8.5. > Well, we may not need a new idea. We don't really have an acceptable solution for the conflict with hint bit behavior. The shutdown issue is minor, agreed, but that's not the stumbling block. regards, tom lane
Jonah H. Harris escribió: > Well, we may not need a new idea. Currently, the problem I see with > the checkpoint-at-shutdown looks like it could possibly be easily > solved. Though, there may be other issues I'm not familiar with. Has > anyone reviewed this yet? I didn't investigate the shutdown checkpoint issue a lot (I was aware of it), because the really hard problem is hint bits. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Mon, Dec 15, 2008 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonah H. Harris" <jonah.harris@gmail.com> writes: >> On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote: >>> Feature freeze is not the time to be looking for new ideas. I suggest >>> we save this for 8.5. > >> Well, we may not need a new idea. > > We don't really have an acceptable solution for the conflict with hint > bit behavior. The shutdown issue is minor, agreed, but that's not the > stumbling block. Agreed on the shutdown issue. But, didn't this patch address the hint bit setting as discussed? After performing a cursory look at the patch, it appears that hint-bit changes are detected and a WAL entry is written on buffer flush if hint bits had been changed. I don't see anything wrong with this in theory. Am I missing something? Now, in the case where hint bits have been updated and a WAL record is required because the buffer is being flushed, requiring the WAL to be flushed up to that point may be a killer on performance. Has anyone tested it? -- Jonah H. Harris, Senior DBA myYearbook.com
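For concreteness, a minimal standalone sketch of the flush ordering being described above: a buffer dirtied only by hint-bit updates gets a WAL record emitted before the checksum is computed and the page is written, so replay can reproduce an image consistent with the stored CRC. The names (wal_log_page, crc32_page) and the checksum itself are stand-ins, not the patch or PostgreSQL code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

typedef struct
{
    uint8_t page[BLCKSZ];
    bool    dirty;      /* modified since the last write */
    bool    hint_only;  /* modified only by hint-bit updates */
} SketchBuffer;

/* Pretend to emit a full-page WAL record; returns a fake LSN. */
static uint64_t
wal_log_page(const uint8_t *page)
{
    (void) page;
    puts("WAL: full page image logged");
    return 1;
}

/* Stand-in checksum; a real implementation would use CRC-32. */
static uint32_t
crc32_page(const uint8_t *page)
{
    uint32_t c = 0;

    for (int i = 0; i < BLCKSZ; i++)
        c = c * 31 + page[i];
    return c;
}

static void
flush_buffer(SketchBuffer *buf)
{
    if (!buf->dirty)
        return;
    if (buf->hint_only)
        wal_log_page(buf->page);    /* the extra WAL traffic is the feared cost */
    printf("write page, checksum=%08x\n", (unsigned) crc32_page(buf->page));
    buf->dirty = buf->hint_only = false;
}

int
main(void)
{
    SketchBuffer buf = {{0}, false, false};

    buf.page[100] |= 0x01;          /* pretend a hint bit was set */
    buf.dirty = buf.hint_only = true;
    flush_buffer(&buf);
    return 0;
}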
Jonah H. Harris escribió: > On Mon, Dec 15, 2008 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > We don't really have an acceptable solution for the conflict with hint > > bit behavior. The shutdown issue is minor, agreed, but that's not the > > stumbling block. > > Agreed on the shutdown issue. But, didn't this patch address the hint > bit setting as discussed? After performing a cursory look at the > patch, it appears that hint-bit changes are detected and a WAL entry > is written on buffer flush if hint bits had been changed. I don't see > anything wrong with this in theory. Am I missing something? That only does heap hint bits, but it does nothing about pd_flags, the btree flags (btpo_cycleid I think), and something else I don't recall at the moment. This was all solvable however. The big problem with it was that it was using a new bit in pd_flags in unsafe ways. To make it safe you'd have to grab a lock on the page, which is very probably problematic. > Now, in the case where hint bits have been updated and a WAL record is > required because the buffer is being flushed, requiring the WAL to be > flushed up to that point may be a killer on performance. Has anyone > tested it? I didn't measure it but I'm sure it'll be plenty slow. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, Dec 15, 2008 at 11:50 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > That only does heap hint bits, but it does nothing about pd_flags, the > btree flags (btpo_cycleid I think), and something else I don't recall at > the moment. This was all solvable however. The big problem with it was > that it was using a new bit in pd_flags in unsafe ways. To make it safe > you'd have to grab a lock on the page, which is very probably problematic. :( >> Now, in the case where hint bits have been updated and a WAL record is >> required because the buffer is being flushed, requiring the WAL to be >> flushed up to that point may be a killer on performance. Has anyone >> tested it? > > I didn't measure it but I'm sure it'll be plenty slow. Yeah. What really sucks is that it would be fairly unpredictable and could easily result in unexpected production performance issues. It is pretty late in the process to continue with this design-related discussion, but I really wanted to see it in 8.4. -- Jonah H. Harris, Senior DBA myYearbook.com
Jonah H. Harris escribió: > It is pretty late in the process to continue with this design-related > discussion, but I really wanted to see it in 8.4. Well, it's hard to blame anyone but me, because I started working on this barely two weeks before the final commitfest IIRC. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Jonah H. Harris escribió: >> Now, in the case where hint bits have been updated and a WAL record is >> required because the buffer is being flushed, requiring the WAL to be >> flushed up to that point may be a killer on performance. Has anyone >> tested it? > > I didn't measure it but I'm sure it'll be plenty slow. How hard would it be to just take an exclusive lock on the page when setting all these hint bits? It might be a big performance hit but it would only affect running with CRC enabled and we can document that. And it wouldn't involve contorting the existing code much. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Mon, 2008-12-15 at 10:13 -0500, Bruce Momjian wrote: > Jonah H. Harris wrote: > > >> Alvaro, have you given up on the patch or are you just busy on > > >> something else at the moment? > > > > > > I've given up until we find a good way to handle hint bits. Various > > > schemes have been proposed but they all have more or less fatal flaws. > > > > Agreed. Though, I don't want to see this patch get dropped from 8.4. > > > > ALL, Alvaro has tried a couple different methods, does anyone have any > > other ideas? > > Feature freeze is not the time to be looking for new ideas. I suggest > we save this for 8.5. Agreed, shall we remove the replication and se postgres patches too :P. If we can't fix the issue, then yeah let's rip it out but as it sits we have a hurdle that needs to be overcome not a new feature that needs to be implemented. Sincerely, Joshua D. Drake > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company,serving since 1997
Joshua D. Drake escribió: > If we can't fix the issue, then yeah let's rip it out but as it sits we > have a hurdle that needs to be overcome not a new feature that needs to > be implemented. Ideas for solving the hurdle are welcome. > Agreed, shall we remove the replication and se postgres patches too :P. There are plenty of ideas for those patches, and lively discussion. They look successful to me, which this patch does not. Please do not troll. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Gregory Stark escribió: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > Jonah H. Harris escribió: > >> Now, in the case where hint bits have been updated and a WAL record is > >> required because the buffer is being flushed, requiring the WAL to be > >> flushed up to that point may be a killer on performance. Has anyone > >> tested it? > > > > I didn't measure it but I'm sure it'll be plenty slow. > > How hard would it be to just take an exclusive lock on the page when setting > all these hint bits? I guess it will be intolerably slow then. If we were to say "we have CRC now, but if you enable it you have 1% of the performance" we will get laughed at. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, 2008-12-15 at 14:29 -0300, Alvaro Herrera wrote: > Joshua D. Drake escribió: > > > If we can't fix the issue, then yeah let's rip it out but as it sits we > > have a hurdle that needs to be overcome not a new feature that needs to > > be implemented. > > Ideas for solving the hurdle are welcome. > > > Agreed, shall we remove the replication and se postgres patches too :P. > > There are plenty of ideas for those patches, and lively discussion. > They look successful to me, which this patch does not. > > Please do not troll. I wasn't trolling. I was making a point. Joshua D. Drake > -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company,serving since 1997
On Mon, Dec 15, 2008 at 12:30 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> How hard would it be to just take an exclusive lock on the page when setting >> all these hint bits? > > I guess it will be intolerably slow then. If we were to say "we have > CRC now, but if you enable it you have 1% of the performance" we will > get laughed at. Well, Oracle does tell users that enabling full CRC checking will cost ~5% performance overhead, which is reasonable to me. I'm not pessimistic enough to think we'd be down to 1% the performance of a non-CRC enabled system, but the locking overhead would probably be fairly high. The problem is, at this point, we don't really know what the impact would be either way :( -- Jonah H. Harris, Senior DBA myYearbook.com
Joshua D. Drake escribió: > On Mon, 2008-12-15 at 14:29 -0300, Alvaro Herrera wrote: > > Joshua D. Drake escribió: > > > > > If we can't fix the issue, then yeah let's rip it out but as it sits we > > > have a hurdle that needs to be overcome not a new feature that needs to > > > be implemented. > > > > Ideas for solving the hurdle are welcome. > > > > > Agreed, shall we remove the replication and se postgres patches too :P. > > > > There are plenty of ideas for those patches, and lively discussion. > > They look successful to me, which this patch does not. > > > > Please do not troll. > > I wasn't trolling. I was making a point. Okay. Sorry. I was defeating your point. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote: > So this discussion died with no solution arising to the > hint-bit-setting-invalidates-the-CRC problem. > > Apparently the only solution in sight is to WAL-log hint bits. Simon > opines it would be horrible from a performance standpoint to WAL-log > every hint bit set, and I think we all agree with that. So we need to > find an alternative mechanism to WAL log hint bits. It occurred to me that maybe we don't need to WAL-log the CRC checks. Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. * In normal running we re-check the CRC when we read the block back into shared_buffers. * In recovery we will overwrite the last image of a block from WAL, so we ignore the block CRC check, since the WAL record was already CRC checked. If full_page_writes = off, we ignore and zero the block's CRC for any block touched during recovery. We do those things because the block CRC in the WAL is likely to be different to that on disk, due to hints. * We also re-check the CRC on a block immediately before we dirty the block (for any reason). This minimises the possibility of in-memory data corruption for blocks. So in the typical case all blocks moving from disk <-> memory and from clean -> dirty are CRC checked. So in the case where we have full_page_writes = on then we have a good CRC every time. In the full_page_writes = off case we are exposed only on the blocks that changed during last checkpoint cycle and only if we crash. That seems good because most databases are up 99% of the time, so any corruptions are likely to occur in normal running, not as a result of crashes. This would be a run-time option. Like it? -- Simon Riggs www.2ndQuadrant.com
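For readers following along, a minimal standalone sketch of the block-level CRC mechanics in this proposal, assuming the checksum lives in a reserved field at a fixed offset in the block (the offset, field layout, and function names are illustrative, not an on-disk format):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ     8192
#define CRC_OFFSET (BLCKSZ - sizeof(uint32_t))   /* assumed reserved space */

/* Plain bit-at-a-time CRC-32 (IEEE polynomial, reflected); slow but simple. */
static uint32_t
crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Just before the block goes to disk: stamp the CRC into the reserved field. */
static void
block_set_crc(uint8_t *block)
{
    uint32_t crc = crc32_buf(block, CRC_OFFSET);   /* exclude the CRC field itself */

    memcpy(block + CRC_OFFSET, &crc, sizeof(crc));
}

/* Just after the block is read back: true if the stored CRC matches. */
static bool
block_check_crc(const uint8_t *block, bool in_recovery)
{
    uint32_t stored, actual;

    if (in_recovery)
        return true;            /* the proposal skips the check during WAL replay */
    actual = crc32_buf(block, CRC_OFFSET);
    memcpy(&stored, block + CRC_OFFSET, sizeof(stored));
    return stored == actual;
}

int
main(void)
{
    uint8_t block[BLCKSZ] = {0};

    block_set_crc(block);
    printf("clean read passes: %d\n", block_check_crc(block, false));
    block[123] ^= 0x04;         /* simulate a bit flipped by the storage layer */
    printf("corrupted read passes: %d\n", block_check_crc(block, false));
    return 0;
}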
On Mon, 2009-11-30 at 13:21 +0000, Simon Riggs wrote: > On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote: > > So this discussion died with no solution arising to the > > hint-bit-setting-invalidates-the-CRC problem. > > > > Apparently the only solution in sight is to WAL-log hint bits. Simon > > opines it would be horrible from a performance standpoint to WAL-log > > every hint bit set, and I think we all agree with that. So we need to > > find an alternative mechanism to WAL log hint bits. > > It occurred to me that maybe we don't need to WAL-log the CRC checks. > > Proposal > > * We reserve enough space on a disk block for a CRC check. When a dirty > block is written to disk we calculate and annotate the CRC value, though > this is *not* WAL logged. > > * In normal running we re-check the CRC when we read the block back into > shared_buffers. > > * In recovery we will overwrite the last image of a block from WAL, so > we ignore the block CRC check, since the WAL record was already CRC > checked. If full_page_writes = off, we ignore and zero the block's CRC > for any block touched during recovery. We do those things because the > block CRC in the WAL is likely to be different to that on disk, due to > hints. > > * We also re-check the CRC on a block immediately before we dirty the > block (for any reason). This minimises the possibility of in-memory data > corruption for blocks. > > So in the typical case all blocks moving from disk <-> memory and from > clean -> dirty are CRC checked. So in the case where we have > full_page_writes = on then we have a good CRC every time. In the > full_page_writes = off case we are exposed only on the blocks that > changed during last checkpoint cycle and only if we crash. That seems > good because most databases are up 99% of the time, so any corruptions > are likely to occur in normal running, not as a result of crashes. > > This would be a run-time option. > > Like it? > Just FYI, Alvaro is out of town and out of email access (almost exclusively). It may take him another week or so to get back to this. Joshua D. Drake > -- > Simon Riggs www.2ndQuadrant.com > > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Simon Riggs wrote: > Proposal > > * We reserve enough space on a disk block for a CRC check. When a dirty > block is written to disk we calculate and annotate the CRC value, though > this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
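A small standalone simulation of that sequence, assuming 512-byte sectors and a checksum field at the start of the page (the layout is invented; checksum() is a stand-in for a real CRC-32, and the "old" on-disk page image is all zeros for simplicity):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192
#define SECTSZ 512

/* Stand-in checksum; any real CRC-32 would behave the same for this purpose. */
static uint32_t
checksum(const uint8_t *p, size_t len)
{
    uint32_t s = 0;

    for (size_t i = 0; i < len; i++)
        s = s * 31 + p[i];
    return s;
}

int
main(void)
{
    uint8_t mem[BLCKSZ] = {0};      /* page in shared buffers */
    uint8_t disk[BLCKSZ] = {0};     /* old page image on disk */
    uint32_t crc, stored, actual;

    /* 1. a hint bit is set; it lands in the last sector of the page */
    mem[BLCKSZ - 10] |= 0x01;

    /* 2. the flush computes a new CRC and stores it at the front of the page */
    crc = checksum(mem + 4, BLCKSZ - 4);
    memcpy(mem, &crc, 4);

    /* 3. power fails mid-write: only the first sector reaches disk */
    memcpy(disk, mem, SECTSZ);      /* the new CRC arrives, the hint bit does not */

    actual = checksum(disk + 4, BLCKSZ - 4);
    memcpy(&stored, disk, 4);
    printf("CRC still matches after the torn write: %s\n",
           stored == actual ? "yes" : "no (reported as corruption)");
    return 0;
}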
On Mon, 2009-11-30 at 22:27 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Proposal > > > > * We reserve enough space on a disk block for a CRC check. When a dirty > > block is written to disk we calculate and annotate the CRC value, though > > this is *not* WAL logged. > > Imagine this: > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > calculated and stored on the page. > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > made it to disk but the flipped hint bit didn't. > > You now have a page with incorrect CRC on disk. You've written that as if you are spotting a problem. It sounds to me that this is exactly the situation we would like to detect and this is a perfect way of doing that. What do you see is the purpose here apart from spotting corruptions? Do we think error rates are so low we can recover the corruption by doing something clever with the CRC? I envisage most corruptions as being unrecoverable except from backup/WAL/replicated servers. It's been a long day, so perhaps I've misunderstood. -- Simon Riggs www.2ndQuadrant.com
* Simon Riggs <simon@2ndQuadrant.com> [091130 16:28]: > > You've written that as if you are spotting a problem. It sounds to me > that this is exactly the situation we would like to detect and this is a > perfect way of doing that. > > What do you see is the purpose here apart from spotting corruptions? > > Do we think error rates are so low we can recover the corruption by > doing something clever with the CRC? I envisage most corruptions as > being unrecoverable except from backup/WAL/replicated servers. > > It's been a long day, so perhaps I've misunderstood. No, I believe the torn-page problem is exactly the thing that made the checksum talks stall out last time... The torn page isn't currently a problem on only-hint-bit-dirty writes, because if you get half-old/half-new, the only changes is the hint bit - no big loss, the data is still the same. But, with a form of check-sums, when you read it the next time, is it corrupt? According to the check-sum, yes, but in reality, the *data* is still valid, just that the check sum is/isn't correctly matching the half-changed hint bits... And then many not-so-really-attractive workarounds were thrown around, with nothing nice falling into place... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote: > * Simon Riggs <simon@2ndQuadrant.com> [091130 16:28]: > > > > You've written that as if you are spotting a problem. It sounds to me > > that this is exactly the situation we would like to detect and this is a > > perfect way of doing that. > > > > What do you see is the purpose here apart from spotting corruptions? > > > > Do we think error rates are so low we can recover the corruption by > > doing something clever with the CRC? I envisage most corruptions as > > being unrecoverable except from backup/WAL/replicated servers. > > > > It's been a long day, so perhaps I've misunderstood. > > No, I believe the torn-page problem is exactly the thing that made the > checksum talks stall out last time... The torn page isn't currently a > problem on only-hint-bit-dirty writes, because if you get > half-old/half-new, the only changes is the hint bit - no big loss, the > data is still the same. > > But, with a form of check-sums, when you read it it next time, is it > corrupt? According to the check-sum, yes, but in reality, the *data* is > still valid, just that the check sum is/isn't correctly matching the > half-changed hint bits... A good argument, but we're missing some proportion. There are at most 240 hint bits in an 8192 byte block. So that is less than 0.5% of the data block where a single bit error would not corrupt data, and 0% of the data block where a 2+ bit error would not corrupt data. Put it another way, more than 99.5% of possible errors would cause data loss, so I would at least like the option of being told about them. The other perspective is that these errors are unlikely to be caused by cosmic rays and other quantum effects, they are more likely to be caused by hardware errors. Hardware errors are frequently repeatable, so one bank of memory or one section of DRAM is damaged and will give errors. If we don't report an error, the next error from that piece of hardware is almost certain to cause data loss, so even a false positive result should be treated as a good indicator of a true positive detection result in the future. If protection against data loss really does need to be so invasive that we need to WAL-log all changes, then lets make it a table-level option. If people want to pay the price, we should at least give them the option of doing so. We can think of ways of optimising it later. Since I was the one who opposed this on the basis of performance, I want to rescind that objection and say lets make it an option for those that wish to trade performance for some visibility of possible data loss errors. -- Simon Riggs www.2ndQuadrant.com
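As a quick sanity check of the 0.5% figure, using the numbers given above (240 hint bits in an 8192-byte block):

#include <stdio.h>

int
main(void)
{
    const double hint_bits  = 240.0;        /* "at most 240 hint bits" */
    const double block_bits = 8192.0 * 8;   /* bits in an 8192-byte block */

    /* prints ~0.37%, i.e. below the 0.5% bound used in the argument */
    printf("%.2f%% of the block\n", 100.0 * hint_bits / block_bits);
    return 0;
}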
Simon Riggs <simon@2ndQuadrant.com> writes: > On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote: >> No, I believe the torn-page problem is exactly the thing that made the >> checksum talks stall out last time... The torn page isn't currently a >> problem on only-hint-bit-dirty writes, because if you get >> half-old/half-new, the only changes is the hint bit - no big loss, the >> data is still the same. > A good argument, but we're missing some proportion. No, I think you are. The problem with the described behavior is exactly that it converts a non-problem into a problem --- a big problem, in fact: uncorrectable data loss. Loss of hint bits is expected and tolerated in the current system design. But a block with bad CRC is not going to have any automated recovery path. So the difficulty is that in the name of improving system reliability by detecting infrequent corruption events, we'd be decreasing system reliability by *creating* infrequent corruption events, added onto whatever events we were hoping to detect. There is no strong argument you can make that this isn't a net loss --- you'd need to pull some error-rate numbers out of the air to even try to make the argument, and in any case the fact remains that more data gets lost with the CRC than without it. The only thing the CRC is really buying is giving the PG project a more plausible argument for blaming data loss on somebody else; it's not helping the user whose data got lost. It's hard to justify the amount of work and performance hit we'd take to obtain a "feature" like that. regards, tom lane
On Mon, 2009-11-30 at 20:02 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote: > >> No, I believe the torn-page problem is exactly the thing that made the > >> checksum talks stall out last time... The torn page isn't currently a > >> problem on only-hint-bit-dirty writes, because if you get > >> half-old/half-new, the only changes is the hint bit - no big loss, the > >> data is still the same. > > > A good argument, but we're missing some proportion. > > No, I think you are. The problem with the described behavior is exactly > that it converts a non-problem into a problem --- a big problem, in > fact: uncorrectable data loss. Loss of hint bits is expected and > tolerated in the current system design. But a block with bad CRC is not > going to have any automated recovery path. > > So the difficulty is that in the name of improving system reliability > by detecting infrequent corruption events, we'd be decreasing system > reliability by *creating* infrequent corruption events, added onto > whatever events we were hoping to detect. There is no strong argument > you can make that this isn't a net loss --- you'd need to pull some > error-rate numbers out of the air to even try to make the argument, > and in any case the fact remains that more data gets lost with the CRC > than without it. The only thing the CRC is really buying is giving > the PG project a more plausible argument for blaming data loss on > somebody else; it's not helping the user whose data got lost. > > It's hard to justify the amount of work and performance hit we'd take > to obtain a "feature" like that. I think there is a clear justification for an additional option. There is no "creation" of corruption events. This scheme detects corruption events that *have* occurred. Now I understand that we previously would have recovered seamlessly from such events, but they were corruption events nonetheless and I think they need to be reported. (For why, see Conclusion #2, below). The frequency of such events against other corruption events is important here. You are right that there is effectively one new *type* of corruption event but without error-rate numbers you can't say that this shows substantially "more data gets lost with the CRC than without it". So let me say this again: the argument that inaction is a safe response here relies upon error-rate numbers going in your favour. You don't persuade us of one argument purely by observing that the alternate proposition requires a certain threshold error-rate - both propositions do. So it's a straight "what is the error-rate?" discussion and ISTM that there is good evidence of what that is. --- So, what is the probability of single-bit errors affecting hint bits? The hint bits can occupy any portion of the block, so their positions are random. They occupy less than 0.5% of the block, so they must account for a very small proportion of hardware-induced errors. Since most reasonable servers use Error Correcting Memory, I would expect not to see a high level of single bit errors, even though we know they are occurring in the underlying hardware (Conclusion #1, Schroeder et al, 2009). What is the chance that a correctable corruption event is in no way linked to another non-correctable event later? We would need to argue that corruptions are a purely stochastic process in all cases, yet again there is evidence of both a clear and strong linkage from correctable to non-correctable errors (Conclusion #2 and Conclusion #7, Schroeder et al, 2009). Schroeder et al http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf (thanks Greg!) Based on that paper, ISTM that ignorable hint bit corruptions are likely to account for a very small proportion of all corruptions, and of those, "70-80%" would show up as non-ignorable corruptions within a month anyway. So the immediate effect on reliability is tiny, if any. The effect on detection is huge, which eventually produces significantly higher reliability overall. > The only thing the CRC is really buying is giving > the PG project a more plausible argument for blaming data loss on > somebody else; it's not helping the user whose data got lost. This isn't about blame, it's about detection. If we know something has happened we can do something about it. Experienced people know that hardware goes wrong, they just want to be told so they can fix it. I blocked development of a particular proposal earlier for performance reasons, but did not intend to block progress completely. It seems likely the checks will cause a performance hit. So make them an option. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > There is no "creation" of corruption events. This scheme detects > corruption events that *have* occurred. Now I understand that we > previously would have recovered seamlessly from such events, but they > were corruption events nonetheless and I think they need to be reported. > (For why, see Conclusion #2, below). No, you're still missing the point. The point is *not* random bit errors affecting hint bits, but the torn page problem. Today, a torn page is a completely valid and expected behavior from the OS and storage subsystem. We handle it with full_page_writes, and by relying on the fact that it's OK for a hint bit set to get lost. With your scheme, a torn page would become a corrupt page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-12-01 at 10:04 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > There is no "creation" of corruption events. This scheme detects > > corruption events that *have* occurred. Now I understand that we > > previously would have recovered seamlessly from such events, but they > > were corruption events nonetheless and I think they need to be reported. > > (For why, see Conclusion #2, below). > > No, you're still missing the point. The point is *not* random bit errors > affecting hint bits, but the torn page problem. Today, a torn page is a > completely valid and expected behavior from the OS and storage > subsystem. We handle it with full_page_writes, and by relying on the > fact that it's OK for a hint bit set to get lost. With your scheme, a > torn page would become a corrupt page. Well, it's easy to keep going on about how much you think I misunderstand. But I think that's just misdirection. The way we handle torn page corruptions *hides* actual corruptions from us. The frequency of true positives and false positives is important here. If the false positive ratio is very small, then reporting them is not a problem because of the benefit we get from having spotted the true positives. Some convicted murderers didn't do it, but that is not an argument for letting them all go free (without knowing the details). So we need to know what the false positive ratio is before we evaluate the benefit of either reporting or non-reporting possible corruption events. When do you think torn pages happen? Only at crash, or other times also? Do they always happen at crash? Are there ways to re-check a block that has suffered a hint-related torn page issue? Are there ways to isolate and minimise the reporting of false positives? Those are important questions and this is not black and white. If the *only* answer really is we-must-WAL-log everything, then that is the answer, as an option. I suspect that there is a less strict possibility, if we question our assumptions and look at the frequencies. We know that I have no time to work on this; I am just trying to hold open the door to a few possibilities that we have not fully considered in a balanced way. And I myself am guilty of having slammed the door previously. I encourage development of a way forward based upon a balance of utility. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > The way we handle torn page corruptions *hides* actual corruptions from > us. The frequency of true positives and false positives is important > here. If the false positive ratio is very small, then reporting them is > not a problem because of the benefit we get from having spotted the true > positives. Some convicted murderers didn't do it, but that is not an > argument for letting them all go free (without knowing the details). So > we need to know what the false positive ratio is before we evaluate the > benefit of either reporting or non-reporting possible corruption events. > > When do you think torn pages happen? Only at crash, or other times also? > Do they always happen at crash? Are there ways to re-check a block that > has suffered a hint-related torn page issue? Are there ways to isolate > and minimise the reporting of false positives? Those are important > questions and this is not black and white. > > If the *only* answer really is we-must-WAL-log everything, then that is > the answer, as an option. I suspect that there is a less strict > possibility, if we question our assumptions and look at the frequencies. > > We know that I have no time to work on this; I am just trying to hold > open the door to a few possibilities that we have not fully considered > in a balanced way. And I myself am guilty of having slammed the door > previously. I encourage development of a way forward based upon a > balance of utility. I think the problem boils down to what the user response should be to a corruption report. If it is a torn page, it would be corrected and the user doesn't have to do anything. If it is something that is not correctable, then the user has corruption and/or bad hardware. I think the problem is that the existing proposal can't distinguish between these two cases so the user has no idea how to respond to the report. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote: > Simon Riggs wrote: > > The way we handle torn page corruptions *hides* actual corruptions from > > us. The frequency of true positives and false positives is important > > here. If the false positive ratio is very small, then reporting them is > > not a problem because of the benefit we get from having spotted the true > > positives. Some convicted murderers didn't do it, but that is not an > > argument for letting them all go free (without knowing the details). So > > we need to know what the false positive ratio is before we evaluate the > > benefit of either reporting or non-reporting possible corruption events. > > > > When do you think torn pages happen? Only at crash, or other times also? > > Do they always happen at crash? Are there ways to re-check a block that > > has suffered a hint-related torn page issue? Are there ways to isolate > > and minimise the reporting of false positives? Those are important > > questions and this is not black and white. > > > > If the *only* answer really is we-must-WAL-log everything, then that is > > the answer, as an option. I suspect that there is a less strict > > possibility, if we question our assumptions and look at the frequencies. > > > > We know that I have no time to work on this; I am just trying to hold > > open the door to a few possibilities that we have not fully considered > > in a balanced way. And I myself am guilty of having slammed the door > > previously. I encourage development of a way forward based upon a > > balance of utility. > > I think the problem boils down to what the user response should be to a > corruption report. If it is a torn page, it would be corrected and the > user doesn't have to do anything. If it is something that is not > correctable, then the user has corruption and/or bad hardware. > I think > the problem is that the existing proposal can't distinguish between > these two cases so the user has no idea how to respond to the report. If 99.5% of cases are real corruption then there is little need to distinguish between the cases, nor much value in doing so. The prevalence of the different error types is critical to understanding how to respond. If a man pulls a gun on you, your first thought isn't "some people remove guns from their jacket to polish them, so perhaps he intends to polish it now" because the prevalence of shootings is high, when faced by people with guns, and the risk of dying is also high. You make a judgement based upon the prevalence and the risk. That is all I am asking for us to do here, make a balanced call. These recent comments are a change in my own position, based upon evaluating the prevalence and the risk. I ask others to consider the same line of thought rather than a black/white assessment. All useful detection mechanisms have non-zero false positives because we would rather sometimes ring the bell for no reason than to let bad things through silently, as we do now. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > > I think > > the problem is that the existing proposal can't distinguish between > > these two cases so the user has no idea how to respond to the report. > > If 99.5% of cases are real corruption then there is little need to > distinguish between the cases, nor much value in doing so. The > prevalence of the different error types is critical to understanding how > to respond. > > If a man pulls a gun on you, your first thought isn't "some people > remove guns from their jacket to polish them, so perhaps he intends to > polish it now" because the prevalence of shootings is high, when faced > by people with guns, and the risk of dying is also high. You make a > judgement based upon the prevalence and the risk. > > That is all I am asking for us to do here, make a balanced call. These > recent comments are a change in my own position, based upon evaluating > the prevalence and the risk. I ask others to consider the same line of > thought rather than a black/white assessment. > > All useful detection mechanisms have non-zero false positives because we > would rather sometimes ring the bell for no reason than to let bad > things through silently, as we do now. OK, but what happens if someone gets the failure report, assumes their hardware is faulty and replaces it, and then gets a failure report again? I assume torn pages are 99% of the reported problem, which are expected and are fixed, and bad hardware 1%, quite the opposite of your numbers above. What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Simon Riggs wrote: >> Proposal >> >> * We reserve enough space on a disk block for a CRC check. When a dirty >> block is written to disk we calculate and annotate the CRC value, though >> this is *not* WAL logged. > > Imagine this: > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > calculated and stored on the page. > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > made it to disk but the flipped hint bit didn't. > > You now have a page with incorrect CRC on disk. This is probably a stupid question, but why doesn't the other half of the page make it to disk? Somebody pulls the plug first? ...Robert
Robert Haas wrote: > On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > Simon Riggs wrote: > >> Proposal > >> > >> * We reserve enough space on a disk block for a CRC check. When a dirty > >> block is written to disk we calculate and annotate the CRC value, though > >> this is *not* WAL logged. > > > > Imagine this: > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > > calculated and stored on the page. > > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > > made it to disk but the flipped hint bit didn't. > > > > You now have a page with incorrect CRC on disk. > > This is probably a stupid question, but why doesn't the other half of > the page make it to disk? Somebody pulls the plug first? Yep, disk writes happen in 512-byte sectors, so you might get only some of the 16 sectors of an 8K page to disk, or a sector might be partially written. Full page writes fix these on recovery. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: > I assume torn pages are 99% of the reported problem, which are > expected and are fixed, and bad hardware 1%, quite the opposite of your > numbers above. On what basis do you make that assumption? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: > > > I assume torn pages are 99% of the reported problem, which are > > > expected and are fixed, and bad hardware 1%, quite the opposite of your > > > numbers above. > > > > On what basis do you make that assumption? Because we added full page write protection to fix the reported problem of torn pages, which we had on occasion; now we don't. Bad hardware reports are less frequent. And we know we can reproduce torn pages by shutting off power to a server without battery-backed cache. We don't know how to produce I/O failures on demand. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
bruce wrote: > What might be interesting is to report CRC mismatches if the database > was shut down cleanly previously; I think in those cases we shouldn't > have torn pages. Sorry, stupid idea on my part. We don't WAL log hint bit changes, so there is no guarantee the page is in WAL on recovery. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2009-12-01 at 07:58 -0500, Bruce Momjian wrote: > bruce wrote: > > What might be interesting is to report CRC mismatches if the database > > was shut down cleanly previously; I think in those cases we shouldn't > > have torn pages. > > Sorry, stupid idea on my part. We don't WAL log hit bit changes so > there is no guarantee the page is in WAL on recovery. I thought it was a reasonable idea. We would need to re-check CRCs after a crash and zero any that mismatched. Then we can start checking them again as we run. In any case, it seems strange to do nothing to protect the database in normal running just because there is one type of problem that occurs when we crash. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-12-01 at 07:42 -0500, Bruce Momjian wrote: > Simon Riggs wrote: > > On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: > > > > > I assume torn pages are 99% of the reported problem, which are > > > expected and are fixed, and bad hardware 1%, quite the opposite of your > > > numbers above. > > > > On what basis do you make that assumption? > > Because we added full page write protection to fix the reported problem > of torn pages, which we had on occasion; now we don't. Bad hardware > reports are less frequent. Bad hardware reports are infrequent because we lack a detection system for them, which is the topic of this thread. It would be circular to argue that as a case against. It's also an argument that only effects crashes. -- Simon Riggs www.2ndQuadrant.com
Bruce Momjian wrote: > What might be interesting is to report CRC mismatches if the database > was shut down cleanly previously; I think in those cases we shouldn't > have torn pages. Unfortunately that's not true. You can crash, leading to a torn page, and then start up the database and shut it down cleanly. The torn page is still there, even though the last shutdown was a clean one. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Simon Riggs wrote: >> Proposal >> >> * We reserve enough space on a disk block for a CRC check. When a dirty >> block is written to disk we calculate and annotate the CRC value, though >> this is *not* WAL logged. > > Imagine this: > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > calculated and stored on the page. > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > made it to disk but the flipped hint bit didn't. > > You now have a page with incorrect CRC on disk. > What if we treated the hint bits as all-zeros for the purpose of CRC calculation? This would exclude them from the checksum. Greetings Marcin Mańk
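A rough sketch of what that could look like on the write side, assuming the caller can enumerate the byte offset and bit mask of every hint bit on the page (the offsets below are invented and checksum() is a stand-in for a real CRC-32). Note the full page copy and the page-layout knowledge it requires, which is what the replies object to:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct
{
    uint16_t off;   /* byte offset of a hint bit on the page */
    uint8_t  mask;  /* bit(s) within that byte */
} HintLoc;

/* Stand-in checksum; a real implementation would use CRC-32. */
static uint32_t
checksum(const uint8_t *p, size_t len)
{
    uint32_t s = 0;

    for (size_t i = 0; i < len; i++)
        s = s * 31 + p[i];
    return s;
}

/* Checksum the page as if every listed hint bit were zero. */
static uint32_t
checksum_ignoring_hints(const uint8_t *page, const HintLoc *hints, int nhints)
{
    uint8_t scratch[BLCKSZ];

    memcpy(scratch, page, BLCKSZ);              /* the extra page copy */
    for (int i = 0; i < nhints; i++)
        scratch[hints[i].off] &= (uint8_t) ~hints[i].mask;
    return checksum(scratch, BLCKSZ);
}

int
main(void)
{
    uint8_t page[BLCKSZ] = {0};
    HintLoc hints[] = {{40, 0x03}, {72, 0x0C}};  /* invented positions */

    uint32_t before = checksum_ignoring_hints(page, hints, 2);

    page[40] |= 0x01;                            /* hint bit set, not WAL-logged */
    uint32_t after = checksum_ignoring_hints(page, hints, 2);

    printf("checksum unchanged by the hint bit: %s\n",
           before == after ? "yes" : "no");
    return 0;
}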
On Tuesday 01 December 2009 14:38:26 marcin mank wrote: > On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas > > <heikki.linnakangas@enterprisedb.com> wrote: > > Simon Riggs wrote: > >> Proposal > >> > >> * We reserve enough space on a disk block for a CRC check. When a dirty > >> block is written to disk we calculate and annotate the CRC value, though > >> this is *not* WAL logged. > > > > Imagine this: > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > > calculated and stored on the page. > > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > > made it to disk but the flipped hint bit didn't. > > > > You now have a page with incorrect CRC on disk. > > What if we treated the hint bits as all-zeros for the purpose of CRC > calculation? This would exclude them from the checksum. That sounds like doing a complete copy of the wal page zeroing specific fields and then doing wal - rather expensive I would say. Both, during computing the checksum and checking it... Andres
* Andres Freund <andres@anarazel.de> [091201 08:42]: > On Tuesday 01 December 2009 14:38:26 marcin mank wrote: > > On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas > > > > <heikki.linnakangas@enterprisedb.com> wrote: > > > Simon Riggs wrote: > > >> Proposal > > >> > > >> * We reserve enough space on a disk block for a CRC check. When a dirty > > >> block is written to disk we calculate and annotate the CRC value, though > > >> this is *not* WAL logged. > > > > > > Imagine this: > > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is > > > calculated and stored on the page. > > > 3. Half of the page is flushed to disk (aka torn page problem). The CRC > > > made it to disk but the flipped hint bit didn't. > > > > > > You now have a page with incorrect CRC on disk. > > > > What if we treated the hint bits as all-zeros for the purpose of CRC > > calculation? This would exclude them from the checksum. > That sounds like doing a complete copy of the wal page zeroing specific fields > and then doing wal - rather expensive I would say. Both, during computing the > checksum and checking it... No, it has nothing to do with WAL, it has to do with when writing "pages" out... You already double-buffer them (to avoid the page changing while you checksum it) before calling write, but the code writing (and then reading) pages doesn't currently have to know all the internal "stuff" needed decide what's a hint bit and what's not... And adding that information into the buffer in/out would be a huge wart on the modularity of the PG code... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Bruce Momjian wrote: >> What might be interesting is to report CRC mismatches if the database >> was shut down cleanly previously; I think in those cases we shouldn't >> have torn pages. > > Unfortunately that's not true. You can crash, leading to a torn page, > and then start up the database and shut it down cleanly. The torn page > is still there, even though the last shutdown was a clean one. Thinking through this, as I understand it, in order to prevent this problem, you'd need to be able to predict at recovery time which pages might have been torn by the unclean shutdown. In order to do that, you'd need to know which pages were waiting to be written to disk at the time of the shutdown. For ordinary page modifications, that's not a problem, because there will be WAL records for those pages that need to be replayed, and we could recompute the CRC at the same time. But for hint bit changes, there's no persistent state that would tell us which hint bits were in the midst of being flipped when the system went down, so the only way to make sure all the CRCs are correct would be to rescan every page in the entire cluster and recompute every CRC. Is that right? ...Robert
On Tuesday 01 December 2009 15:26:21 Aidan Van Dyk wrote: > * Andres Freund <andres@anarazel.de> [091201 08:42]: > > On Tuesday 01 December 2009 14:38:26 marcin mank wrote: > > > On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas > > > > > > <heikki.linnakangas@enterprisedb.com> wrote: > > > > Simon Riggs wrote: > > > >> Proposal > > > >> > > > >> * We reserve enough space on a disk block for a CRC check. When a > > > >> dirty block is written to disk we calculate and annotate the CRC > > > >> value, though this is *not* WAL logged. > > > > > > > > Imagine this: > > > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. > > > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC > > > > is calculated and stored on the page. > > > > 3. Half of the page is flushed to disk (aka torn page problem). The > > > > CRC made it to disk but the flipped hint bit didn't. > > > > > > > > You now have a page with incorrect CRC on disk. > > > > > > What if we treated the hint bits as all-zeros for the purpose of CRC > > > calculation? This would exclude them from the checksum. > > > > That sounds like doing a complete copy of the wal page zeroing specific > > fields and then doing wal - rather expensive I would say. Both, during > > computing the checksum and checking it... > No, it has nothing to do with WAL, it has to do with when writing > "pages" out... You already double-buffer them (to avoid the page > changing while you checksum it) before calling write, but the code > writing (and then reading) pages doesn't currently have to know all the > internal "stuff" needed decide what's a hint bit and what's not... err, yes. That "WAL" slipped in, sorry. But it would still either mean a third copy of the page or a rather complex jumping around on the page... Andres
Robert Haas wrote: > On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Bruce Momjian wrote: >>> What might be interesting is to report CRC mismatches if the database >>> was shut down cleanly previously; I think in those cases we shouldn't >>> have torn pages. >> Unfortunately that's not true. You can crash, leading to a torn page, >> and then start up the database and shut it down cleanly. The torn page >> is still there, even though the last shutdown was a clean one. > > Thinking through this, as I understand it, in order to prevent this > problem, you'd need to be able to predict at recovery time which pages > might have been torn by the unclean shutdown. In order to do that, > you'd need to know which pages were waiting to be written to disk at > the time of the shutdown. For ordinary page modifications, that's not > a problem, because there will be WAL records for those pages that need > to be replayed, and we could recompute the CRC at the same time. But > for hint bit changes, there's no persistent state that would tell us > which hint bits were in the midst of being flipped when the system > went down, so the only way to make sure all the CRCs are correct would > be to rescan every page in the entire cluster and recompute every CRC. > > Is that right? Yep. Even if rescanning every page in the cluster was feasible from a performance point-of-view, it would make the CRC checking a lot less useful. It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-12-01 at 15:30 +0200, Heikki Linnakangas wrote: > Bruce Momjian wrote: > > What might be interesting is to report CRC mismatches if the database > > was shut down cleanly previously; I think in those cases we shouldn't > > have torn pages. > > Unfortunately that's not true. You can crash, leading to a torn page, > and then start up the database and shut it down cleanly. The torn page > is still there, even though the last shutdown was a clean one. There seems to be two ways forwards: journalling or fsck. We can either * WAL-log all changes to a page (journalling) (8-byte overhead) * After a crash disable CRC checks until a full database scan has either re-checked CRC or found CRC mismatch, report it in the LOG and then reset the CRC. (fsck) (8-byte overhead) Both of which can be optimised in various ways. Also, we might * Put all hint bits in the block header to allow them to be excluded more easily from CRC checking. If we used 3 more bits from ItemIdData.lp_len (limiting tuple length to 4096) then we could store some hints in the item pointer. HEAP_XMIN_INVALID can be stored as LP_DEAD, since that will happen very quickly anyway. -- Simon Riggs www.2ndQuadrant.com
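To make the line-pointer idea concrete, a sketch with the same shape as ItemIdData but not the real definition: three bits are taken from lp_len for hints, so the length field tops out at 4095 bytes:

#include <stdio.h>

/* Same shape as ItemIdData (4 bytes), but with 3 bits moved out of lp_len. */
typedef struct
{
    unsigned lp_off   : 15;     /* offset of the tuple on the page */
    unsigned lp_flags : 2;      /* LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD */
    unsigned lp_len   : 12;     /* tuple length, now at most 4095 bytes */
    unsigned lp_hints : 3;      /* relocated hint bits, excluded from the CRC */
} ItemIdDataSketch;

int
main(void)
{
    printf("line pointer is %zu bytes, max lp_len = %u\n",
           sizeof(ItemIdDataSketch), (1u << 12) - 1);
    return 0;
}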
On Tue, Dec 1, 2009 at 9:40 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Even if rescanning every page in the cluster was feasible from a > performance point-of-view, it would make the CRC checking a lot less > useful. It's not hard to imagine that when a hardware glitch happens > causing corruption, it also causes the system to crash. Recalculating > the CRCs after crash would mask the corruption. Yeah. Thanks for the explanation - I think I understand the problem now. ...Robert
* Simon Riggs: > * Put all hint bits in the block header to allow them to be excluded > more easily from CRC checking. If we used 3 more bits from > ItemIdData.lp_len (limiting tuple length to 4096) then we could store > some hints in the item pointer. HEAP_XMIN_INVALID can be stored as > LP_DEAD, since that will happen very quickly anyway. What about putting the whole visibility information out-of-line, into its own B-tree, indexed by page number? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Florian Weimer <fweimer@bfk.de> writes: > What about putting the whole visibility information out-of-line, into > its own B-tree, indexed by page number? Hint bits need to be *cheap* to examine. Otherwise there's little point in having them at all. regards, tom lane
On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: > It's not hard to imagine that when a hardware glitch happens > causing corruption, it also causes the system to crash. Recalculating > the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. If we are saying that 99% of page corruptions are caused at crash time because of torn pages on hint bits, then only WAL logging can help us find the 1%. I'm not convinced that is an accurate or safe assumption and I'd at least like to see LOG entries showing what happened. ISTM we could go for two levels of protection. CRC checks and scanner for Level 1 protection, then full WAL logging for Level 2 protection. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: >> It's not hard to imagine that when a hardware glitch happens >> causing corruption, it also causes the system to crash. Recalculating >> the CRCs after crash would mask the corruption. > They are already masked from us, so continuing to mask those errors > would not put us in a worse position. No, it would just destroy a large part of the argument for why this is worth doing. "We detect disk errors ... except for ones that happen during a database crash." "Say what?" The fundamental problem with this is the same as it's been all along: the tradeoff between implementation work expended, performance overhead added, and net number of real problems detected (with a suitably large demerit for actually *introducing* problems) just doesn't look attractive. You can make various compromises that improve one or two of these factors at the cost of making the others worse, but at the end of the day I've still not seen a combination that seems worth doing. regards, tom lane
On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: > >> It's not hard to imagine that when a hardware glitch happens >> causing corruption, it also causes the system to crash. Recalculating >> the CRCs after crash would mask the corruption. > > They are already masked from us, so continuing to mask those errors > would not put us in a worse position. > > If we are saying that 99% of page corruptions are caused at crash time > because of torn pages on hint bits, then only WAL logging can help us > find the 1%. I'm not convinced that is an accurate or safe assumption > and I'd at least like to see LOG entries showing what happened. It may or may not be true that most page corruptions happen at crash time, but it's certainly false that they are caused at crash time *because of torn pages on hint bits*. If only part of a block is written to disk and the unwritten parts contain hint-bit changes - that's not corruption. That's design behavior. Any CRC system needs to avoid complaining about errors when that happens because otherwise people will think that their database is corrupted and their hardware is faulty when in reality it is not. If we could find a way to put the hint bits in the same 512-byte block as the CRC, that might do it, but I'm not sure whether that is possible. Ignoring CRC errors after a crash until we've re-CRC'd the entire database will certainly eliminate the bogus error reports, but it seems likely to mask a large percentage of legitimate errors. For example, suppose that I write 1MB of data out to disk and then don't access it for a year. During that time the data is corrupted. Then the system crashes. Upon recovery, since there's no way of knowing whether hint bits on those pages were being updated at the time of the crash, the system re-CRC's the corrupted data and declares it known good. Six months later, I try to access the data and find out that it's bad. Sucks to be me. Now consider the following alternative scenario: I write the block to disk. Five minutes later, without an intervening crash, I read it back in and it's bad. Yeah, the system detects it. Which is more likely? I'm not an expert on disk failure modes, but my intuition is that the first one will happen often enough to make us look silly. Is it 10%? 20%? 50%? I don't know. But ISTM that a CRC system that has no ability to determine whether a system is still "ok" post-crash is not a compelling proposition, even though it might still be able to detect some problems. ...Robert
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: > > > All useful detection mechanisms have non-zero false positives because we > > would rather sometimes ring the bell for no reason than to let bad > > things through silently, as we do now. > > OK, but what happens if someone gets the failure report, assumes their > hardware is faulty and replaces it, and then gets a failure report > again? They are stupid? Nobody just replaces hardware. You test it. We can't fix stupid. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: > >> It's not hard to imagine that when a hardware glitch happens > >> causing corruption, it also causes the system to crash. Recalculating > >> the CRCs after crash would mask the corruption. > > > They are already masked from us, so continuing to mask those errors > > would not put us in a worse position. > > No, it would just destroy a large part of the argument for why this > is worth doing. "We detect disk errors ... except for ones that happen > during a database crash." "Say what?" I know what I said sounds ridiculous, I'm just trying to keep my mind open about the tradeoffs. The way to detect 100% of corruptions is to WAL-log 100% of writes to blocks and we know that sucks performance - 'twas me that said it in the original discussion. I'm trying to explore whether we can detect <100% of other errors at some intermediate percentage of WAL-logging. If we decide that there isn't an intermediate position worth taking, I'm happy, as long as it was a fact-based decision. > The fundamental problem with this is the same as it's been all along: > the tradeoff between implementation work expended, performance overhead > added, and net number of real problems detected (with a suitably large > demerit for actually *introducing* problems) just doesn't look > attractive. You can make various compromises that improve one or two of > these factors at the cost of making the others worse, but at the end of > the day I've still not seen a combination that seems worth doing. I agree. But also I do believe there are people that care enough about this to absorb a performance hit, and the new features in 8.5 will bring in a new crop of people that care about those things very much. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: > >> It's not hard to imagine that when a hardware glitch happens > >> causing corruption, it also causes the system to crash. Recalculating > >> the CRCs after crash would mask the corruption. > > > They are already masked from us, so continuing to mask those errors > > would not put us in a worse position. > > No, it would just destroy a large part of the argument for why this > is worth doing. "We detect disk errors ... except for ones that happen > during a database crash." "Say what?" > > The fundamental problem with this is the same as it's been all along: > the tradeoff between implementation work expended, performance overhead > added, and net number of real problems detected (with a suitably large > demerit for actually *introducing* problems) just doesn't look > attractive. You can make various compromises that improve one or two of > these factors at the cost of making the others worse, but at the end of > the day I've still not seen a combination that seems worth doing. Let me try a different but similar perspective. The problem we are trying to solve here only matters to a very small subset of the people actually using PostgreSQL. Specifically, a percentage that is using PostgreSQL in a situation where they can lose many thousands of dollars per minute or hour should an outage occur. On the other hand it is those very people that are *paying* people to try and implement these features. Kind of a catch-22. The hard core reality is this. *IF* it is one of the goals of this project to ensure that the software can be safely, effectively, and responsibly operated in a manner that is acceptable to C* level people in a Fortune level company then we *must* solve this problem. If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant to fork it because it will eventually happen. Our customers are demanding these features. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Simon Riggs wrote: > Also, we might > > * Put all hint bits in the block header to allow them to be excluded > more easily from CRC checking. If we used 3 more bits from > ItemIdData.lp_len (limiting tuple length to 4096) then we could store > some hints in the item pointer. HEAP_XMIN_INVALID can be stored as > LP_DEAD, since that will happen very quickly anyway. OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. If there are no missing hint bits and the CRC doesn't match, we know the page is corrupt. If two hint bits are missing, we can try setting one and both of them and see if we can get a matching CRC. If we can, the page is OK, if not, it is corrupt. Now if 32 hint bits are missing, but could be set based on transaction status, then we would need 2^32 possible hint bit combinations, so we can't do the test and we just assume the page is valid. I have no idea what percentage of corruption this would detect, but it might have minimal overhead because the overhead only happens when we detect a non-matching CRC due to a crash of some sort. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
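Just to pin down the combinatorics of Bruce's idea, a rough sketch of the guess-and-recheck loop is below. The page_crc32() helper and the list of candidate hint-bit positions are assumed to exist; as the follow-ups point out, reliably finding those positions on a possibly-corrupt page is the hard part, and a torn write could also have *cleared* bits, which this does not attempt.

    /*
     * Sketch of Bruce's idea: on a CRC mismatch, try every combination of
     * the currently-unset hint bits and see whether some combination makes
     * the stored CRC match.  Illustrative only.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct HintBitPos
    {
        uint32      offset;         /* byte offset of the flag byte in the page */
        uint8       mask;           /* hint bit within that byte */
    } HintBitPos;

    extern uint32 page_crc32(const char *page, size_t len);   /* assumed helper */

    #define MAX_GUESSED_BITS 16     /* give up beyond 2^16 combinations */

    static bool
    crc_matches_with_some_hint_bits(const char *page, size_t len,  /* len <= 8192 */
                                    uint32 stored_crc,
                                    const HintBitPos *unset, int nunset)
    {
        char        scratch[8192];

        if (nunset > MAX_GUESSED_BITS)
            return true;            /* too many combinations: assume the page is OK */

        for (uint32 combo = 0; combo < ((uint32) 1 << nunset); combo++)
        {
            memcpy(scratch, page, len);
            for (int i = 0; i < nunset; i++)
                if (combo & ((uint32) 1 << i))
                    scratch[unset[i].offset] |= unset[i].mask;
            if (page_crc32(scratch, len) == stored_crc)
                return true;        /* some plausible hint-bit state matches */
        }
        return false;               /* no combination matches: treat as corrupt */
    }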
On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > The hard core reality is this. *IF* it is one of the goals of this > project to insure that the software can be safely, effectively, and > responsibly operated in a manner that is acceptable to C* level people > in a Fortune level company then we *must* solve this problem. > > If it is not the goal of the project, leave it to EDB/CMD/2ndQuandrant > to fork it because it will eventually happen. Our customers are > demanding these features. OK, and when you fork it, how do you plan to implement it? The problem AFAICS is not that anyone hugely dislikes the feature; it's that nobody is really clear on how to implement it in a way that's actually useful. So far the only somewhat reasonable suggestions I've seen seem to be: 1. WAL-log setting the hint bits. If you don't like the resulting performance, shut off the feature. 2. Rearrange the page so that all the hint bits are in the first 512 bytes along with the CRC, so that there can be no torn pages. AFAICS, no one has rendered judgment on whether this is a feasible solution. Does $COMPETITOR offer this feature? ...Robert
On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote: > Simon Riggs wrote: > > Also, we might > > > > * Put all hint bits in the block header to allow them to be excluded > > more easily from CRC checking. If we used 3 more bits from > > ItemIdData.lp_len (limiting tuple length to 4096) then we could store > > some hints in the item pointer. HEAP_XMIN_INVALID can be stored as > > LP_DEAD, since that will happen very quickly anyway. > > OK, here is another idea, maybe crazy: When there's nothing else left, crazy wins. > When we read in a page that has an invalid CRC, we check the page to see > which hint bits are _not_ set, and we try setting them to see if can get > a matching CRC. If there no missing hint bits and the CRC doesn't > match, we know the page is corrupt. If two hint bits are missing, we > can try setting one and both of them and see if can get a matching CRC. > If we can, the page is OK, if not, it is corrupt. > > Now if 32 hint bits are missing, but could be based on transaction > status, then we would need 2^32 possible hint bit combinations, so we > can't do the test and we just assume the page is valid. > > I have no idea what percentage of corruption this would detect, but it > might have minimal overhead because the overhead only happens when we > detect a non-matching CRC due to a crash of some sort. Perhaps we could store a sector-based parity bit for each 512 bytes in the block. If there are an even number of hint bits set in the sector we set the parity bit; if odd, we unset the parity bit. So whenever we set a hint bit we flip the parity bit for that sector. That way we could detect which sectors are potentially missing a hint-bit update, in an effort to minimize the number of combinations we need to test. That would require only 16 bits for an 8192 byte block; we store it next to the CRC, so we know that was never altered separately. So total 6-byte overhead. -- Simon Riggs www.2ndQuadrant.com
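A sketch of the bookkeeping Simon describes: 16 parity bits for an 8 kB block, recomputed at read time from the hint bits we can see and compared against the stored word. count_hint_bits_in_sector() is an assumed helper, and knowing where the hint bits live is again the hard part.

    #include <stdint.h>

    #define BLCKSZ      8192
    #define SECTOR_SZ   512
    #define NSECTORS    (BLCKSZ / SECTOR_SZ)    /* 16 */

    /* assumed helper: how many hint bits are currently set in this sector */
    extern int count_hint_bits_in_sector(const char *page, int sector);

    /*
     * Returns a bitmask: 1-bits mark sectors whose hint-bit parity disagrees
     * with the stored parity word, i.e. sectors where a hint-bit write may
     * have been torn and where any guessing would have to concentrate.
     */
    static uint16
    suspect_sectors(const char *page, uint16 stored_parity)
    {
        uint16      computed = 0;

        for (int s = 0; s < NSECTORS; s++)
            if (count_hint_bits_in_sector(page, s) & 1)
                computed |= (uint16) (1 << s);

        return computed ^ stored_parity;
    }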
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote: > On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > The hard core reality is this. *IF* it is one of the goals of this > > project to insure that the software can be safely, effectively, and > > responsibly operated in a manner that is acceptable to C* level people > > in a Fortune level company then we *must* solve this problem. > > > > If it is not the goal of the project, leave it to EDB/CMD/2ndQuandrant > > to fork it because it will eventually happen. Our customers are > > demanding these features. > > OK, and when you fork it, how do you plan to implement it? Hey man, I am not an engineer :P. You know that. I am just speaking to the pressures that some of us are having in the marketplace about these types of features. > rendered judgment on whether this is a feasible solution. > > Does $COMPETITOR offer this feature? > My understanding is that MSSQL does. I am not sure about Oracle. Those are the only two I run into (I don't run into MySQL at all). I know others likely compete in the DB2 space. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Bruce Momjian <bruce@momjian.us> writes: > OK, here is another idea, maybe crazy: > When we read in a page that has an invalid CRC, we check the page to see > which hint bits are _not_ set, and we try setting them to see if can get > a matching CRC. If there no missing hint bits and the CRC doesn't > match, we know the page is corrupt. If two hint bits are missing, we > can try setting one and both of them and see if can get a matching CRC. > If we can, the page is OK, if not, it is corrupt. > Now if 32 hint bits are missing, but could be based on transaction > status, then we would need 2^32 possible hint bit combinations, so we > can't do the test and we just assume the page is valid. A typical page is going to have something like 100 tuples, so potentially 2^400 combinations to try. I don't see this being realistic from that standpoint. What's much worse is that to even find the potentially missing hint bits, you need to make very strong assumptions about the validity of the rest of the page. The suggestions that were made upthread about moving the hint bits could resolve the second objection, but once you do that you might as well just exclude them from the CRC and eliminate the guessing. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote: >> When we read in a page that has an invalid CRC, we check the page to see >> which hint bits are _not_ set, and we try setting them to see if can get >> a matching CRC. > Perhaps we could store a sector-based parity bit each 512 bytes in the > block. If there are an even number of hint bits set, if odd we unset the > parity bit. So whenever we set a hint bit we flip the parity bit for > that sector. That way we could detect which sectors are potentially > missing in an effort to minimize the number of combinations we need to > test. Actually, the killer problem with *any* scheme involving "guessing" is that each bit you guess translates directly to removing one bit of confidence from the CRC value. If you try to guess at as many as 32 bits, it is practically guaranteed that you will find a combination that makes a 32-bit CRC appear to match. Well before that, you have degraded the reliability of the error detection to the point that there's no point. The bottom line here seems to be that the only practical way to do anything like this is to move the hint bits into their own area of the page, and then exclude them from the CRC. Are we prepared to once again blow off any hope of in-place update for another release cycle? regards, tom lane
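A back-of-the-envelope version of Tom's point, treating the 32-bit CRC as a uniform hash over the 2^k page variants produced by guessing k hint bits (an idealization, but close enough for the argument):

    P(\text{spurious match}) \;=\; 1 - \bigl(1 - 2^{-32}\bigr)^{2^{k}}
        \;\approx\; 2^{\,k-32} \quad \text{for } k \ll 32,
    \qquad
    P \;\approx\; 1 - e^{-1} \;\approx\; 0.63 \quad \text{at } k = 32.

So each guessed bit roughly doubles the odds of accepting a corrupted page, and by around 32 guessed bits a false match is more likely than not.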
* Tom Lane <tgl@sss.pgh.pa.us> [091201 13:58]: > Actually, the killer problem with *any* scheme involving "guessing" > is that each bit you guess translates directly to removing one bit > of confidence from the CRC value. If you try to guess at as many > as 32 bits, it is practically guaranteed that you will find a > combination that makes a 32-bit CRC appear to match. Well before > that, you have degraded the reliability of the error detection to > the point that there's no point. Exactly. > The bottom line here seems to be that the only practical way to do > anything like this is to move the hint bits into their own area of > the page, and then exclude them from the CRC. Are we prepared to > once again blow off any hope of in-place update for another release > cycle? Well, *I* think if we're ever going to have really reliable "in-place upgrades" that we can expect to function release after release, we're going to need to be able to read in "old version" pages, and convert them to current version pages, for some set of "old version" (I'd be happy with $VERSION-1)... But I don't see that happening any time soon... But I'm not loading TB of data either, my largest clusters are a couple of gigs, so I acknowledge my priorities are probably quite different than some of the companies driving a lot of the heavy development. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruce Momjian <bruce@momjian.us> writes: >> OK, here is another idea, maybe crazy: > >> When we read in a page that has an invalid CRC, we check the page to see >> which hint bits are _not_ set, and we try setting them to see if can get >> a matching CRC. Unfortunately you would also have to try *unsetting* every hint bit as well since the updated hint bits might have made it to disk but not the CRC, leaving the old CRC for the block with the unset bits. I actually independently had the same thought today that Simon had of moving the hint bits to the line pointer. We can obtain more free bits in the line pointers by dividing the item offsets and sizes by maxalign if we need it. That should give at least 4 spare bits which is all we need for the four VALID/INVALID hint bits. It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. There's still the all-tuples-in-page-are-visible hint bit and the hint bits in btree pages. I'm not sure if those are easier or harder to solve. We might be able to assume the all-visible flag will not be torn from the crc as long as they're within the same 512 byte sector. And iirc the btree hint bits are in the line pointers themselves as well? Another thought is that we could use the MSSQL-style torn page detection of including a counter (or even a bit?) in every 512-byte chunk which gets incremented every time the page is written. If they don't all match when read in then the page was torn and we can't check the CRC. That gets us the advantage that we can inform the user that a torn page was detected so they know that they must absolutely use full_page_writes on their system. Currently users are in the dark about whether their system is susceptible to them or not and have no idea with what frequency. Even here there are quite divergent opinions about their frequency and which systems are susceptible to them or immune. -- greg
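A sketch of the masking Greg describes, under the assumption of a hypothetical line-pointer layout in which each 4-byte entry reserves a fixed group of bits for hints (LP_HINT_MASK below is invented; today's ItemIdData has no such field). crc32_update() is an assumed incremental-CRC helper, and the caller is assumed to know where the line pointer array starts and ends.

    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ        8192
    #define LP_HINT_MASK  0x00007000u   /* hypothetical hint bits in a line pointer */

    /* assumed helper: feed more bytes into a running CRC-32 */
    extern uint32 crc32_update(uint32 crc, const void *data, size_t len);

    static uint32
    page_crc_skipping_lp_hints(const char *page, size_t lp_start, size_t lp_end)
    {
        uint32      crc = 0xFFFFFFFF;
        uint32      lp;

        /* everything before the line pointer array */
        crc = crc32_update(crc, page, lp_start);

        /* the line pointers, with their hint bits masked to zero */
        for (size_t off = lp_start; off + sizeof(lp) <= lp_end; off += sizeof(lp))
        {
            memcpy(&lp, page + off, sizeof(lp));
            lp &= ~LP_HINT_MASK;
            crc = crc32_update(crc, &lp, sizeof(lp));
        }

        /* everything after the line pointer array */
        crc = crc32_update(crc, page + lp_end, BLCKSZ - lp_end);

        return crc ^ 0xFFFFFFFF;
    }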
Greg Stark <gsstark@mit.edu> writes: > Another thought is that would could use the MSSQL-style torn page > detection of including a counter (or even a bit?) in every 512-byte > chunk which gets incremented every time the page is written. I think we can dismiss that idea, or any idea involving a per-page status value, out of hand. The implications for tuple layout are just too messy. I'm not especially thrilled with making assumptions about the underlying device block size anyway. regards, tom lane
All, I feel strongly that we should be verifying pages on write, or at least providing the option to do so, because hardware is simply not reliable. And a lot of our biggest users are having issues; it seems pretty much guaranteed that if you have more than 20 postgres servers, at least one of them will have bad memory, bad RAID and/or a bad driver. (and yes, InnoDB, DB2 and Oracle do provide tools to detect hardware corruption when it happens. Oracle even provides correction tools. We are *way* behind them in this regard) There are two primary conditions we are testing for: (a) bad RAM, which happens as frequently as 8% of the time on commodity servers, and given a sufficient amount of RAM happens 99% of the time due to quantum effects, and (b) bad I/O, in the form of bad drivers, bad RAID, and/or bad disks. Our users want to potentially take two degrees of action on this: 1. detect the corruption immediately when it happens, so that they can effectively troubleshoot the cause of the corruption, and potentially shut down the database before further corruption occurs and while they still have clean backups. 2. make an attempt to fix the corrupted page before/immediately after it is written. Further, based on talking to some of these users who are having chronic and not-debuggable issues on their sets of 100's of PostgreSQL servers, there are some other specs: -- Many users would be willing to sacrifice significant performance (up to 20%) as a start-time option in order to be "corruption-proof". -- Even more users would only be interested in using the anti-corruption options after they know they have a problem, in order to troubleshoot it, and would then turn the corruption detection back off. So, based on my conversations with users, what we really want is a solution which does (1) for both (a) and (b) as a start-time option, and having significant performance overhead for this option is OK. Now, do block-level CRCs qualify? The problem I have with CRC checks is that they only detect bad I/O, and are completely unable to detect data corruption due to bad memory. This means that really we want a different solution which can detect both bad RAM and bad I/O, and should only fall back on CRC checks if we're unable to devise one. One of the things Simon and I talked about in Japan is that most of the time, data corruption makes the data page and/or tuple unreadable. So, checking data format for readable pages and tuples (and index nodes) both before and after write to disk (the latter would presumably be handled by the bgwriter and/or checkpointer) would catch a lot of kinds of corruption before they had a chance to spread. However, that solution would not detect subtle corruption, like single-bit-flipping issues caused by quantum errors. Also, it would require reading back each page as it's written to disk, which is OK for a bunch of single-row writes, but for bulk data loads a significant problem. So, what I'm saying is that I think we really want a better solution, and am throwing this out there to see if anyone is clever enough. --Josh Berkus
Josh Berkus <josh@agliodbs.com> wrote: > And a lot of our biggest users are having issues; it seems pretty > much guarenteed that if you have more than 20 postgres servers, at > least one of them will have bad memory, bad RAID and/or a bad > driver. Huh?!? We have about 200 clusters running on about 100 boxes, and we see that very rarely. On about 100 older boxes, relegated to less critical tasks, we see a failure maybe three or four times per year. It's usually not subtle, and a sane backup and redundant server policy has kept us from suffering much pain from these. I'm not questioning the value of adding features to detect corruption, but your numbers are hard to believe. > The problem I have with CRC checks is that it only detects bad > I/O, and is completely unable to detect data corruption due to bad > memory. This means that really we want a different solution which > can detect both bad RAM and bad I/O, and should only fall back on > CRC checks if we're unable to devise one. md5sum of each tuple? As an optional system column (a la oid)? > checking data format for readable pages and tuples (and index > nodes) both before and after write to disk Given that PostgreSQL goes through the OS, and many of us are using RAID controllers with BBU RAM, how do you do a read with any confidence that it came from the disk? (I mean, I know how to do that for a performance test, but as a routine step during production use?) -Kevin
On Tue, Dec 1, 2009 at 7:19 PM, Josh Berkus <josh@agliodbs.com> wrote: > However, that solution would not detect subtle corruption, like > single-bit-flipping issues caused by quantum errors. Well, there is a solution for this: ECC RAM. There's *no* software solution for it. The corruption can just as easily happen the moment you write the value before you calculate any checksum or in the register holding the value before you even write it. Or it could occur the moment after you finish checking the checksum. Also you're not going to be able to be sure you're checking the actual dram and not the L2 cache or the processor's L1/L0 caches. ECC RAM solves this problem properly and it does work. There's not much point in paying a much bigger cost for an ineffective solution. > Also, it would > require reading back each page as it's written to disk, which is OK for > a bunch of single-row writes, but for bulk data loads a significant problem. Not sure what that really means for Postgres. It would just mean reading back the same page of memory from the filesystem cache that we just read. It sounds like you're describing fsyncing every single page to disk and then waiting 1min/7200 or even 1min/15k to do a direct read for every single page -- that's not a 20% performance hit though. We would have to change our mascot from the elephant to a snail I think. You could imagine a more complex solution where you have a separate process wait until the next checkpoint and then do direct reads for all the blocks written since the previous checkpoint (which have now been fsynced) and verify that the block on disk has a verifiable CRC. I'm not sure even direct reads let you get the block on disk if someone else has written the block into cache though. If you could then this sounds like it could be made to work efficiently (with sequential bitmap-style scans) and could be quite handy. What I like about that is you could deprioritize this process's i/o so that it didn't impact the main processing. As things stand this wouldn't detect pages written because they were dirtied by hint bit updates but that could be addressed a few different ways. -- greg
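For what it's worth, the control flow of the checkpoint-lagged verifier Greg sketches might look roughly like this. Every helper below is assumed (nothing like them exists in the tree), and it ignores the hint-bit caveat in his last paragraph.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BlockRef { uint32 relid; uint32 blkno; } BlockRef;

    /* all assumed helpers */
    extern void wait_for_checkpoint(void);
    extern int  blocks_written_since_prev_checkpoint(BlockRef *buf, int max);
    extern bool direct_read_block(BlockRef ref, char *page);    /* O_DIRECT read */
    extern bool page_crc_ok(const char *page);
    extern void report_crc_failure(BlockRef ref);

    static void
    crc_verifier_main(void)
    {
        BlockRef    todo[1024];
        char        page[8192];

        for (;;)
        {
            /* blocks dirtied before the previous checkpoint are now fsynced */
            wait_for_checkpoint();

            int n = blocks_written_since_prev_checkpoint(todo, 1024);
            for (int i = 0; i < n; i++)
            {
                if (!direct_read_block(todo[i], page))
                    continue;               /* relation dropped, truncated, etc. */
                if (!page_crc_ok(page))
                    report_crc_failure(todo[i]);
            }
        }
    }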
On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > Well, *I* think if we're ever going to have really reliable "in-place > upgrades" that we can expect to function release after release, we're > going to need to be able to read in "old version" pages, and convert > them to current version pages, for some set of "old version" (I'ld be > happy with $VERSION-1)... But I don't see that happening any time > soon... I agree. I've attempted to make this point before - as has Zdenek - and been scorned for it, but I don't think it's become any less true for all of that. I don't think you have to look much further than the limitations on upgrading from 8.3 to 8.4 to conclude that the current strategy is always going to be pretty hit or miss. http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/~checkout~/pg-migrator/pg_migrator/README?rev=1.59&content-type=text/plain ...Robert
On Tue, Dec 1, 2009 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: >> Well, *I* think if we're ever going to have really reliable "in-place >> upgrades" that we can expect to function release after release, we're >> going to need to be able to read in "old version" pages, and convert >> them to current version pages, for some set of "old version" (I'ld be >> happy with $VERSION-1)... But I don't see that happening any time >> soon... > > I agree. I've attempted to make this point before - as has Zdenek - > and been scorned for it, but I don't think it's become any less true > for all of that. I don't think you have to look much further than the > limitations on upgrading from 8.3 to 8.4 to conclude that the current > strategy is always going to be pretty hit or miss. I find that hard to understand. I believe the consensus is that an on-demand page-level migration strategy like Aidan described is precisely the plan when it's necessary to handle page format changes. There were no page format changes for 8.3->8.4 however so there's no point writing dead code until it actually has anything to do. And there was no point writing it for previous releases because there was pg_migrator anyways. Zdenek's plan was basically the same but he wanted the backend to be able to handle any version page directly without conversion any time. Pointing at the 8.3 pg_migrator limitations is irrelevant -- every single one of those limitations would not be addressed by a page-level migration code path. They are all data-type redefinitions that can't be fixed without understanding the table structure and definition. These limitations would all require adding code to the new version to handle the old data types and their behaviour and to convert them to the new datatypes when a tuple is rewritten. In some cases this is really not easy at all. -- greg
On Tue, Dec 1, 2009 at 8:04 PM, Greg Stark <gsstark@mit.edu> wrote: > And there was no point writing it for previously releases because there > was **no** pg_migrator anyways. oops -- greg
On Tue, Dec 1, 2009 at 3:04 PM, Greg Stark <gsstark@mit.edu> wrote: > I find that hard to understand. I believe the consensus is that an > on-demand page-level migration strategy like Aidan described is > precisely the plan when it's necessary to handle page format changes. > There were no page format changes for 8.3->8.4 however so there's no > point writing dead code until it actually has anything to do. And > there was no point writing it for previous releases because there > was pg_migrator anyways. Zdenek's plan was basically the same but he > wanted the backend to be able to handle any version page directly > without conversion any time. > > Pointing at the 8.3 pg_migrator limitations is irrelevant -- every > single one of those limitations would not be addressed by a page-level > migration code path. They are all data-type redefinitions that can't > be fixed without understanding the table structure and definition. > These limitations would all require adding code to the new version to > handle the old data types and their behaviour and to convert them to > the new datatypes when a tuple is rewritten. In some cases this is > really not easy at all. OK, fair enough. My implication that only page formats were at issue was off-base. My underlying point was that I think we have to be prepared to write code that can understand old binary formats (on the tuple, page, or relation level) if we want this to work and work reliably. I believe that there has been much resistance to that idea. If I am wrong, great! ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > OK, fair enough. My implication that only page formats were at issue > was off-base. My underlying point was that I think we have to be > prepared to write code that can understand old binary formats (on the > tuple, page, or relation level) if we want this to work and work > reliably. I believe that there has been much resistance to that idea. We keep looking for cheaper alternatives. There may not be any... regards, tom lane
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > >> OK, fair enough. My implication that only page formats were at issue >> was off-base. My underlying point was that I think we have to be >> prepared to write code that can understand old binary formats (on the >> tuple, page, or relation level) if we want this to work and work >> reliably. I believe that there has been much resistance to that idea. >> > > We keep looking for cheaper alternatives. There may not be any... > > > Yeah. I think we might need to bite the bullet on that and start thinking more about different strategies for handling page versioning to satisfy various needs. I've been convinced for a while that some sort of versioning scheme is inevitable, but I do understand the reluctance. cheers andrew
Andrew Dunstan wrote: > > > Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > >> OK, fair enough. My implication that only page formats were at issue > >> was off-base. My underlying point was that I think we have to be > >> prepared to write code that can understand old binary formats (on the > >> tuple, page, or relation level) if we want this to work and work > >> reliably. I believe that there has been much resistance to that idea. > >> > > > > We keep looking for cheaper alternatives. There may not be any... > > > > > > > > Yeah. I think we might need to bite the bullet on that and start > thinking more about different strategies for handling page versioning to > satisfy various needs. I've been convinced for a while that some sort of > versioning scheme is inevitable, but I do understand the reluctance. I always felt our final solution would be a combination of pg_migrator for system catalog changes and page format conversion for page changes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > OK, here is another idea, maybe crazy: > > > When we read in a page that has an invalid CRC, we check the page to see > > which hint bits are _not_ set, and we try setting them to see if can get > > a matching CRC. If there no missing hint bits and the CRC doesn't > > match, we know the page is corrupt. If two hint bits are missing, we > > can try setting one and both of them and see if can get a matching CRC. > > If we can, the page is OK, if not, it is corrupt. > > > Now if 32 hint bits are missing, but could be based on transaction > > status, then we would need 2^32 possible hint bit combinations, so we > > can't do the test and we just assume the page is valid. > > A typical page is going to have something like 100 tuples, so > potentially 2^400 combinations to try. I don't see this being > realistic from that standpoint. What's much worse is that to even > find the potentially missing hint bits, you need to make very strong > assumptions about the validity of the rest of the page. > > The suggestions that were made upthread about moving the hint bits > could resolve the second objection, but once you do that you might > as well just exclude them from the CRC and eliminate the guessing. OK, crazy idea #3. What if we had a per-page counter of the number of hint bits set --- that way, we would only consider a CRC check failure to be corruption if the count matched the hint bit count on the page. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > OK, crazy idea #3. What if we had a per-page counter of the number of > hint bits set --- that way, we would only consider a CRC check failure > to be corruption if the count matched the hint bit count on the page. Seems like rather a large hole in the ability to detect corruption. In particular, this again assumes that you can accurately locate all the hint bits in a page whose condition is questionable. Pick up the wrong bits, you'll come to the wrong conclusion --- and the default behavior you propose here is the wrong result. regards, tom lane
Bruce Momjian wrote: > Tom Lane wrote: >> >> The suggestions that were made upthread about moving the hint bits >> could resolve the second objection, but once you do that you might >> as well just exclude them from the CRC and eliminate the guessing. > > OK, crazy idea #3. What if we had a per-page counter of the number of > hint bits set --- that way, we would only consider a CRC check failure > to be corruption if the count matched the hint bit count on the page. Can I piggy-back on Bruce's crazy idea and ask a stupid question? Why are we writing out the hint bits to disk anyway? Is it really so slow to calculate them on read + cache them that it's worth all this trouble? Are they not also to blame for the "write my import data twice" feature? -- Richard Huxton Archonet Ltd
On Dec 1, 2009, at 12:58 PM, Tom Lane wrote: > The bottom line here seems to be that the only practical way to do > anything like this is to move the hint bits into their own area of > the page, and then exclude them from the CRC. Are we prepared to > once again blow off any hope of in-place update for another release > cycle? What happened to the work that was being done to allow a page to be upgraded on the fly when it was read in from disk? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > OK, crazy idea #3. What if we had a per-page counter of the number of > > hint bits set --- that way, we would only consider a CRC check failure > > to be corruption if the count matched the hint bit count on the page. > > Seems like rather a large hole in the ability to detect corruption. > In particular, this again assumes that you can accurately locate all > the hint bits in a page whose condition is questionable. Pick up the > wrong bits, you'll come to the wrong conclusion --- and the default > behavior you propose here is the wrong result. I was assuming any update of hint bits would update the per-page counter so it would always be accurate. However, I seem to remember we don't lock the page when updating hint bits, so that wouldn't work. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com> wrote: > Why are we writing out the hint bits to disk anyway? Is it really so > slow to calculate them on read + cache them that it's worth all this > trouble? Are they not also to blame for the "write my import data twice" > feature? It would be interesting to experiment with different strategies. But the results would depend a lot on workloads and I doubt one strategy is best for everyone. It has often been suggested that we could set the hint bits but not dirty the page, so they would never be written out unless some other update hit the page. In most use cases that would probably result in the right thing happening where we avoid half the writes but still stop doing transaction status lookups relatively promptly. The scary thing is that there might be use cases such as static data loaded where the hint bits never get set and every scan of the page has to recheck those statuses until the tuples are frozen. (Not dirtying the page almost gets us out of the CRC problems -- it doesn't in our current setup because we don't take a lock when setting the hint bits, so you could set it on a page someone is in the middle of CRC checking and writing. There were other solutions proposed for that, including just making hint bits require locking the page or double buffering the write.) There does need to be something like the hint bits which does eventually have to be set because we can't keep transaction information around forever. Even if you keep the transaction information all the way back to the last freeze date (up to about 1GB and change I think) then the data has to be written twice, the second time is to freeze the transactions. In the worst case then reading a page requires a random page access (or two) from anywhere in that 1GB+ file for each tuple on the page (whether visible to us or not). -- greg
On Dec 1, 2009, at 1:39 PM, Kevin Grittner wrote: > Josh Berkus <josh@agliodbs.com> wrote: > >> And a lot of our biggest users are having issues; it seems pretty >> much guarenteed that if you have more than 20 postgres servers, at >> least one of them will have bad memory, bad RAID and/or a bad >> driver. > > Huh?!? We have about 200 clusters running on about 100 boxes, and > we see that very rarely. On about 100 older boxes, relegated to > less critical tasks, we see a failure maybe three or four times per > year. It's usually not subtle, and a sane backup and redundant > server policy has kept us from suffering much pain from these. I'm > not questioning the value of adding features to detect corruption, > but your numbers are hard to believe. That's just your experience. Others have had different experiences. And honestly, bickering about exact numbers misses Josh's point completely. Postgres is seriously lacking in its ability to detect hardware problems, and hardware *does fail*. And you can't just assume that when it fails it blows up completely. We really do need some capability for detecting errors. >> The problem I have with CRC checks is that it only detects bad >> I/O, and is completely unable to detect data corruption due to bad >> memory. This means that really we want a different solution which >> can detect both bad RAM and bad I/O, and should only fall back on >> CRC checks if we're unable to devise one. > > md5sum of each tuple? As an optional system column (a la oid) That's a possibility. As Josh mentioned, some people will pay a serious performance hit to ensure that their data is safe and correct. The CRC proposal was intended as a middle of the road approach that would at least tell you that your hardware was probably OK. There's certainly more that could be done. Also, I think some means of detecting torn pages would be very welcome. If this was done at the storage manager level it would probably be fairly transparent to the rest of the code. >> checking data format for readable pages and tuples (and index >> nodes) both before and after write to disk > > Given that PostgreSQL goes through the OS, and many of us are using > RAID controllers with BBU RAM, how do you do a read with any > confidence that it came from the disk? (I mean, I know how to do > that for a performance test, but as a routine step during production > use?) You'd probably need to go to some kind of stand-alone or background process that slowly reads and verifies the entire database. Unfortunately at that point you could only detect corruption and not correct it, but it'd still be better than nothing. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote: > What happened to the work that was being done to allow a page to be upgraded > on the fly when it was read in from disk? There were no page level changes between 8.3 and 8.4. -- greg
Greg Stark wrote: > On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Bruce Momjian <bruce@momjian.us> writes: > >> OK, here is another idea, maybe crazy: > > > >> When we read in a page that has an invalid CRC, we check the page to see > >> which hint bits are _not_ set, and we try setting them to see if can get > >> a matching CRC. > > Unfortunately you would also have to try *unsetting* every hint bit as > well since the updated hint bits might have made it to disk but not > the CRC leaving the old CRC for the block with the unset bits. > > I actually independently had the same thought today that Simon had of > moving the hint bits to the line pointer. We can obtain more free bits > in the line pointers by dividing the item offsets and sizes by > maxalign if we need it. That should give at least 4 spare bits which > is all we need for the four VALID/INVALID hint bits. > > It should be relatively cheap to skip the hint bits in the line > pointers since they'll be the same bits of every 16-bit value for a > whole range. Alternatively we could just CRC the tuples and assume a > corrupted line pointer will show itself quickly. That would actually > make it faster than a straight CRC of the whole block -- making > lemonade out of lemons as it were. Yea, I am thinking we would have to have the hint bits in the line pointers --- if not, we would have to reserve a lot of free space to hold the maximum number of tuple hint bits --- seems like a waste. I also like the idea that we don't need to CRC check the line pointers because any corruption there is going to appear immediately. However, the bad news is that we wouldn't find the corruption until we try to access bad data and might crash. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote: > > What happened to the work that was being done to allow a page to be upgraded > > on the fly when it was read in from disk? > > There were no page level changes between 8.3 and 8.4. Yea, we have the idea of how to do it (in cases where the page size doesn't increase), but no need to implement it in 8.3 to 8.4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com> wrote: >> Why are we writing out the hint bits to disk anyway? Is it really so >> slow to calculate them on read + cache them that it's worth all this >> trouble? Are they not also to blame for the "write my import data twice" >> feature? > > It would be interesting to experiment with different strategies. But > the results would depend a lot on workloads and I doubt one strategy > is best for everyone. > > It has often been suggested that we could set the hint bits but not > dirty the page, so they would never be written out unless some other > update hit the page. In most use cases that would probably result in > the right thing happening where we avoid half the writes but still > stop doing transaction status lookups relatively promptly. The scary > thing is that there might be use cases such as static data loaded > where the hint bits never get set and every scan of the page has to > recheck those statuses until the tuples are frozen. And how scary is that? Assuming we cache the hints... 1. With the page itself, so same lifespan 2. Separately, perhaps with a different (longer) lifespan. Separately would then let you trade complexity for compactness - "all of block B is deleted", "all of table T is visible". So what is the cost of calculating the hint-bits for a whole block of tuples in one go vs reading that block from actual spinning disk? > There does need to be something like the hint bits which does > eventually have to be set because we can't keep transaction > information around forever. Even if you keep the transaction > information all the way back to the last freeze date (up to about 1GB > and change I think) then the data has to be written twice, the second > time is to freeze the transactions. In the worst case then reading a > page requires a random page access (or two) from anywhere in that 1GB+ > file for each tuple on the page (whether visible to us or not). While on that topic - I'm assuming freezing requires substantially more effort than updating hint bits? -- Richard Huxton Archonet Ltd
Bruce Momjian <bruce@momjian.us> writes: > Greg Stark wrote: >> It should be relatively cheap to skip the hint bits in the line >> pointers since they'll be the same bits of every 16-bit value for a >> whole range. Alternatively we could just CRC the tuples and assume a >> corrupted line pointer will show itself quickly. That would actually >> make it faster than a straight CRC of the whole block -- making >> lemonade out of lemons as it were. I don't think "relatively cheap" is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard. > Yea, I am thinking we would have to have the hint bits in the line > pointers --- if not, we would have to reserve a lot of free space to > hold the maximum number of tuple hint bits --- seems like a waste. Not if you're willing to move the line pointers around. I'd envision an extra pointer in the page header, with a layout along the lines of
    fixed-size page header
    hint bits
    line pointers
    free space
    tuples proper
    special space
with the CRC covering everything except the hint bits and perhaps the free space (depending on whether you wanted to depend on two more pointers to be right). We would have to move the line pointers anytime we needed to grow the hint-bit space, and there would be a straightforward tradeoff between how often to move the pointers versus how much potentially-wasted space we leave at the end of the hint area. Or we could put the hint bits after the pointers, which might be better because the hints would be smaller == cheaper to move. > I also like the idea that we don't need to CRC check the line pointers > because any corruption there is going to appear immediately. However, > the bad news is that we wouldn't find the corruption until we try to > access bad data and might crash. That sounds exactly like the corruption detection system we have now. If you think that behavior is acceptable, we can skip this whole discussion. regards, tom lane
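To pin down what the checksum would cover under that layout, here is a sketch with a made-up header (the real PageHeaderData has more fields, and pd_hints_end is invented); crc32_update() is again an assumed incremental-CRC helper. The hint-bit area between the header and pd_hints_end, and the free space between pd_lower and pd_upper, are simply left out of the CRC, i.e. the "and perhaps the free space" variant.

    #include <stddef.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    typedef struct HintedPageHeader    /* illustrative, not the real PageHeaderData */
    {
        uint32      pd_crc;            /* checksum of the covered regions */
        uint16      pd_hints_end;      /* end of the hint-bit area (invented field) */
        uint16      pd_lower;          /* end of line pointers, as today */
        uint16      pd_upper;          /* start of tuple data, as today */
        uint16      pd_special;        /* start of special space, as today */
    } HintedPageHeader;

    /* assumed helper: feed more bytes into a running CRC-32 */
    extern uint32 crc32_update(uint32 crc, const void *data, size_t len);

    static uint32
    compute_page_crc(const char *page)
    {
        const HintedPageHeader *hdr = (const HintedPageHeader *) page;
        uint32      crc = 0xFFFFFFFF;

        /* the header itself, minus the CRC field */
        crc = crc32_update(crc, page + sizeof(uint32),
                           sizeof(HintedPageHeader) - sizeof(uint32));
        /* line pointers: the hint area between the header and pd_hints_end
         * is skipped */
        crc = crc32_update(crc, page + hdr->pd_hints_end,
                           hdr->pd_lower - hdr->pd_hints_end);
        /* tuples and special space: free space between pd_lower and
         * pd_upper is skipped */
        crc = crc32_update(crc, page + hdr->pd_upper, BLCKSZ - hdr->pd_upper);

        return crc ^ 0xFFFFFFFF;
    }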
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Greg Stark wrote: > >> It should be relatively cheap to skip the hint bits in the line > >> pointers since they'll be the same bits of every 16-bit value for a > >> whole range. Alternatively we could just CRC the tuples and assume a > >> corrupted line pointer will show itself quickly. That would actually > >> make it faster than a straight CRC of the whole block -- making > >> lemonade out of lemons as it were. > > I don't think "relatively cheap" is the right criterion here --- the > question to me is how many assumptions are you making in order to > compute the page's CRC. Each assumption degrades the reliability > of the check, not to mention creating another maintenance hazard. > > > Yea, I am thinking we would have to have the hint bits in the line > > pointers --- if not, we would have to reserve a lot of free space to > > hold the maximum number of tuple hint bits --- seems like a waste. > > Not if you're willing to move the line pointers around. I'd envision > an extra pointer in the page header, with a layout along the lines of > > fixed-size page header > hint bits > line pointers > free space > tuples proper > special space > > with the CRC covering everything except the hint bits and perhaps the > free space (depending on whether you wanted to depend on two more > pointers to be right). We would have to move the line pointers anytime > we needed to grow the hint-bit space, and there would be a > straightforward tradeoff between how often to move the pointers versus > how much potentially-wasted space we leave at the end of the hint area. I assume you don't want the hint bits in the line pointers because we would need to lock the page? > Or we could put the hint bits after the pointers, which might be better > because the hints would be smaller == cheaper to move. I don't see the value there because you would need to move the hint bits every time you added a new line pointer. The bigger problem is that you would need to lock the page to update the hint bits if they move around on the page. > > I also like the idea that we don't need to CRC check the line pointers > > because any corruption there is going to appear immediately. However, > > the bad news is that we wouldn't find the corruption until we try to > > access bad data and might crash. > > That sounds exactly like the corruption detection system we have now. > If you think that behavior is acceptable, we can skip this whole > discussion. Agreed, hence the "bad" part. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Richard Huxton <dev@archonet.com> writes: > So what is the cost of calculating the hint-bits for a whole block of > tuples in one go vs reading that block from actual spinning disk? Potentially a couple of hundred times worse, if you're unlucky and each XID on the page requires visiting a different block of clog that's also not in memory. The average case probably isn't that bad, but I think we'd likely be talking at least a factor of two penalty --- you'd be hopelessly optimistic to assume you didn't need at least one clog visit per page. Also, if you want to assume that you're lucky and the XIDs mostly fall within a fairly recent range of clog pages, you're still not out of the woods. In that situation what you are talking about is a spectacular increase in the hit rate for cached clog pages --- which are already a known contention bottleneck in many scenarios. > While on that topic - I'm assuming freezing requires substantially more > effort than updating hint bits? It's a WAL-logged page change, so at minimum double the cost. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > Tom Lane wrote: >> I don't think "relatively cheap" is the right criterion here --- the >> question to me is how many assumptions are you making in order to >> compute the page's CRC. Each assumption degrades the reliability >> of the check, not to mention creating another maintenance hazard. > I assume you don't want the hint bits in the line pointers because we > would need to lock the page? No, I don't want them there because I don't want the CRC check to know so much about the page layout. >> Or we could put the hint bits after the pointers, which might be better >> because the hints would be smaller == cheaper to move. > I don't see the value there because you would need to move the hint bits > every time you added a new line pointer. No, we could add unused line pointers in multiples, exactly the same as we would add unused hint bits in multiples if we did it the other way. I don't know offhand which would be more efficient, but you can't just dismiss one without analysis. > The bigger problem is that you > would need to lock the page to update the hint bits if they move around > on the page. We are already assuming that things aren't moving around when we update a hint bit now. That's what the requirement of shared buffer lock when calling tqual.c is for. regards, tom lane
On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruce Momjian <bruce@momjian.us> writes: >> Greg Stark wrote: >>> It should be relatively cheap to skip the hint bits in the line >>> pointers since they'll be the same bits of every 16-bit value for a >>> whole range. Alternatively we could just CRC the tuples and assume a >>> corrupted line pointer will show itself quickly. That would actually >>> make it faster than a straight CRC of the whole block -- making >>> lemonade out of lemons as it were. > > I don't think "relatively cheap" is the right criterion here --- the > question to me is how many assumptions are you making in order to > compute the page's CRC. Each assumption degrades the reliability > of the check, not to mention creating another maintenance hazard. Well the only assumption here is that we know where the line pointers start and end. That sounds like the same level of assumption as your structure with the line pointers moving around. I agree with your general point though -- trying to skip the hint bits strewn around in the tuples means that every line pointer had better be correct or you'll be in trouble before you even get to the CRC check. Skipping them in the line pointers just means applying a hard-coded mask against each word in that region. It seems to me adding a third structure on the page and then requiring tqual to be able to find that doesn't significantly reduce the complexity over having tqual be able to find the line pointers. And it significantly increases the complexity of every other part of the system which has to deal with a third structure on the page. And adding and compacting the page becomes a lot more complex. I'm also a bit leery about adding more line pointers than necessary because even a small number of extra line pointers will mean you're likely to often fit one fewer tuple on the page. -- greg
On Dec 1, 2009, at 4:13 PM, Greg Stark wrote: > On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com> > wrote: >> Why are we writing out the hint bits to disk anyway? Is it really so >> slow to calculate them on read + cache them that it's worth all this >> trouble? Are they not also to blame for the "write my import data >> twice" >> feature? > > It would be interesting to experiment with different strategies. But > the results would depend a lot on workloads and I doubt one strategy > is best for everyone. I agree that we'll always have the issue with freezing. But I also think it's time to revisit the whole idea of hint bits. AFAIK we only keep at maximum 2B transactions, and each one takes 2 bits in CLOG. So worst-case scenario, we're looking at about 500MB of clog. On modern hardware, that's not a lot. And that's also assuming that we don't do any kind of compression on that data (obviously we couldn't use just any old compression algorithm, but there are certainly tricks that could be used to reduce the size of this information). I know this is something that folks at EnterpriseDB have looked at, perhaps there's data they can share. > It has often been suggested that we could set the hint bits but not > dirty the page, so they would never be written out unless some other > update hit the page. In most use cases that would probably result in > the right thing happening where we avoid half the writes but still > stop doing transaction status lookups relatively promptly. The scary > thing is that there might be use cases such as static data loaded > where the hint bits never get set and every scan of the page has to > recheck those statuses until the tuples are frozen. > > (Not dirtying the page almost gets us out of the CRC problems -- it > doesn't in our current setup because we don't take a lock when setting > the hint bits, so you could set it on a page someone is in the middle > of CRC checking and writing. There were other solutions proposed for > that, including just making hint bits require locking the page or > double buffering the write.) > > There does need to be something like the hint bits which does > eventually have to be set because we can't keep transaction > information around forever. Even if you keep the transaction > information all the way back to the last freeze date (up to about 1GB > and change I think) then the data has to be written twice, the second > time is to freeze the transactions. In the worst case then reading a > page requires a random page access (or two) from anywhere in that 1GB+ > file for each tuple on the page (whether visible to us or not). > -- > greg -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
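As a back-of-the-envelope check on the clog sizing (a standalone calculation, not PostgreSQL code): two status bits per transaction over the roughly two billion usable XIDs works out to a bit under 500MB, and the full four-billion XID space doubles that to 1GB, which lines up with the "1GB and change" figure quoted above.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long usable_xids  = 2000000000ULL; /* ~2^31 */
        const unsigned long long full_xids    = 4294967296ULL; /* 2^32 */
        const unsigned long long bits_per_xid = 2;             /* CLOG status bits */

        printf("usable XID range: ~%llu MB\n",
               usable_xids * bits_per_xid / 8 / (1024 * 1024)); /* ~476 MB */
        printf("full XID space:   ~%llu MB\n",
               full_xids * bits_per_xid / 8 / (1024 * 1024));   /* 1024 MB */
        return 0;
    }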
Greg Stark <gsstark@mit.edu> writes: > On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't think "relatively cheap" is the right criterion here --- the >> question to me is how many assumptions are you making in order to >> compute the page's CRC. Each assumption degrades the reliability >> of the check, not to mention creating another maintenance hazard. > Well the only assumption here is that we know where the line pointers > start and end. ... and what they contain. To CRC a subset of the page at all, we have to put some amount of faith into the page header's pointers. We can do weak checks on those, but only weak ones. If we process different parts of the page differently, we're increasing our trust in those pointers and reducing the quality of the CRC check. > It seems to me adding a third structure on the page and then requiring > tqual to be able to find that doesn't significantly reduce the > complexity over having tqual be able to find the line pointers. And it > significantly increases the complexity of every other part of the > system which has to deal with a third structure on the page. And > adding and compacting the page becomes a lot more complex. The page compaction logic amounts to a grand total of two not-very-long routines. The vast majority of the code impact from this would be from the problem of finding the out-of-line hint bits for a tuple, which as you say appears about equivalently hard either way. So I think keeping the CRC logic as simple as possible is good from both a reliability and performance standpoint. regards, tom lane
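To make the "weak checks" concrete, this is about all a subset CRC can verify before it has to trust the header pointers; a sketch only, with an assumed default 8K block size and an illustrative header size rather than the real sizeof:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLCKSZ            8192   /* assumed default block size */
    #define PAGE_HEADER_SIZE  24     /* illustrative header size */

    /*
     * A corrupted pointer that still satisfies these ordering and range
     * checks silently changes which bytes get CRC'd, which is the
     * reliability cost of checking only a subset of the page.
     */
    static bool
    header_pointers_sane(uint16_t pd_lower, uint16_t pd_upper, uint16_t pd_special)
    {
        return pd_lower >= PAGE_HEADER_SIZE &&
               pd_lower <= pd_upper &&
               pd_upper <= pd_special &&
               pd_special <= BLCKSZ;
    }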
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: > > > All useful detection mechanisms have non-zero false positives because we > > would rather sometimes ring the bell for no reason than to let bad > > things through silently, as we do now. > > OK, but what happens if someone gets the failure report, assumes their > hardware is faulty and replaces it, and then gets a failure report > again? They are stupid? Nobody just replaces hardware. You test it. We can't fix stupid. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: > >> It's not hard to imagine that when a hardware glitch happens > >> causing corruption, it also causes the system to crash. Recalculating > >> the CRCs after crash would mask the corruption. > > > They are already masked from us, so continuing to mask those errors > > would not put us in a worse position. > > No, it would just destroy a large part of the argument for why this > is worth doing. "We detect disk errors ... except for ones that happen > during a database crash." "Say what?" > > The fundamental problem with this is the same as it's been all along: > the tradeoff between implementation work expended, performance overhead > added, and net number of real problems detected (with a suitably large > demerit for actually *introducing* problems) just doesn't look > attractive. You can make various compromises that improve one or two of > these factors at the cost of making the others worse, but at the end of > the day I've still not seen a combination that seems worth doing. Let me try a different but similar perspective. The problem we are trying to solve here only matters to a very small subset of the people actually using PostgreSQL. Specifically, a percentage that is using PostgreSQL in a situation where they can lose many thousands of dollars per minute or hour should an outage occur. On the other hand it is those very people that are *paying* people to try and implement these features. Kind of a catch-22. The hard core reality is this. *IF* it is one of the goals of this project to ensure that the software can be safely, effectively, and responsibly operated in a manner that is acceptable to C* level people in a Fortune level company then we *must* solve this problem. If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant to fork it because it will eventually happen. Our customers are demanding these features. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
On Wed, Dec 2, 2009 at 12:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't think "relatively cheap" is the right criterion here --- the >>> question to me is how many assumptions are you making in order to >>> compute the page's CRC. Each assumption degrades the reliability >>> of the check, not to mention creating another maintenance hazard. > >> Well the only assumption here is that we know where the line pointers >> start and end. > > ... and what they contain. To CRC a subset of the page at all, we have > to put some amount of faith into the page header's pointers. We can do > weak checks on those, but only weak ones. If we process different parts > of the page differently, we're increasing our trust in those pointers > and reducing the quality of the CRC check. I'm not sure we're on the same page. As I understand it there are three proposals on the table now: 1) set aside a section of the page to contain only non-checksummed hint bits. That section has to be relocatable so the crc check would have to read the start and end address of it from the page header. 2) store the hint bits in the line pointers and skip checking the line pointers. In that case the crc check would skip any bytes between the start of the line pointer array and pd_lower (or pd_upper? no point in crc checking unused bytes is there?) 3) store the hint bits in the line pointers and apply a mask which masks out the 4 hint bits in each 32-bit word in the region between the start of the line pointers and pd_lower (or pd_upper again) These three options all seem to have the same level of interdependence for the crc check, namely they all depend on one or two values in the page header to specify a range of bytes in the block. None of them depend on the contents of the line pointers themselves being correct, only the one or two fields in the header specifying which range of bytes the hint bits lie in. For what it's worth I don't think "decreasing the quality of the crc check" is actually valid. The bottom line is that in all of the above options if any pointer is invalid we'll be CRCing a different set of data from the set that originally went into calculating the stored CRC so we'll be effectively computing a random value which will have a 1/2^32 chance of being the value stored in the CRC field regardless of anything else. -- greg
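A sketch of option 3, assuming purely for illustration that the hint bits sit in a fixed 4-bit mask within each 32-bit word of the line-pointer region; the CRC routine is a plain bitwise CRC-32 stand-in for whatever implementation would actually be used, and the only layout knowledge consumed is where the line-pointer array starts and ends:

    #include <stddef.h>
    #include <stdint.h>

    #define HINT_MASK 0x0000000FU   /* assumed position of the 4 hint bits */

    /* Plain reflected CRC-32, one byte at a time (stand-in implementation). */
    static uint32_t
    crc32_byte(uint32_t crc, uint8_t b)
    {
        crc ^= b;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
        return crc;
    }

    static uint32_t
    crc32_word(uint32_t crc, uint32_t w)
    {
        for (int i = 0; i < 4; i++)
            crc = crc32_byte(crc, (uint8_t) (w >> (8 * i)));
        return crc;
    }

    /*
     * CRC the page as 32-bit words, zeroing the hint bits in the words that
     * fall inside the line-pointer array [lp_start_word, lp_end_word).
     */
    static uint32_t
    page_crc(const uint32_t *words, size_t nwords,
             size_t lp_start_word, size_t lp_end_word)
    {
        uint32_t crc = 0xFFFFFFFFU;

        for (size_t i = 0; i < nwords; i++)
        {
            uint32_t w = words[i];

            if (i >= lp_start_word && i < lp_end_word)
                w &= ~HINT_MASK;
            crc = crc32_word(crc, w);
        }
        return ~crc;
    }

As Greg notes, if either boundary in the header is corrupt this just computes a CRC over the wrong bytes, which still mismatches the stored value with probability 1 - 1/2^32.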
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote: > On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > The hard core reality is this. *IF* it is one of the goals of this > > project to ensure that the software can be safely, effectively, and > > responsibly operated in a manner that is acceptable to C* level people > > in a Fortune level company then we *must* solve this problem. > > > > If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant > > to fork it because it will eventually happen. Our customers are > > demanding these features. > > OK, and when you fork it, how do you plan to implement it? Hey man, I am not an engineer :P. You know that. I am just speaking to the pressures that some of us are having in the marketplace about these types of features. > red judgment on whether this is a feasible solution. > > Does $COMPETITOR offer this feature? > My understanding is that MSSQL does. I am not sure about Oracle. Those are the only two I run into (I don't run into MySQL at all). I know others likely compete in the DB2 space. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote: > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote: >> What happened to the work that was being done to allow a page to be upgraded >> on the fly when it was read in from disk? > > There were no page level changes between 8.3 and 8.4. That's true, but I don't think it's the full and complete answer to the question. Zdenek submitted a patch for CF 2008-11 which attempted to add support for multiple page versions. I guess we're on v4 right now, and he was attempting to add support for v3 pages, which would have allowed reading in pages from old PG versions. To put it bluntly, the code wasn't anything I would have wanted to deploy, but the reason why Zdenek gave up on fixing it was because several community members considerably senior to myself provided negative feedback on the concept. ...Robert
Robert Haas wrote: > On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote: > > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote: > >> What happened to the work that was being done to allow a page to be upgraded > >> on the fly when it was read in from disk? > > > > There were no page level changes between 8.3 and 8.4. > > That's true, but I don't think it's the full and complete answer to > the question. Zdenek submitted a patch for CF 2008-11 which attempted > to add support for multiple page versions. I guess we're on v4 right > now, and he was attempting to add support for v3 pages, which would > have allowed reading in pages from old PG versions. To put it > bluntly, the code wasn't anything I would have wanted to deploy, but > the reason why Zdenek gave up on fixing it was because several > community members considerably senior to myself provided negative > feedback on the concept. Well, there were quite a number of open issues relating to page conversion: o Do we write the old version or just convert on read? o How do we write pages that get larger on conversion to the new format? As I remember, the patch allowed read/write of old versions, which greatly increased its code impact. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Dec 1, 2009 at 9:31 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote: >> > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote: >> >> What happened to the work that was being done to allow a page to be upgraded >> >> on the fly when it was read in from disk? >> > >> > There were no page level changes between 8.3 and 8.4. >> >> That's true, but I don't think it's the full and complete answer to >> the question. Zdenek submitted a patch for CF 2008-11 which attempted >> to add support for multiple page versions. I guess we're on v4 right >> now, and he was attempting to add support for v3 pages, which would >> have allowed reading in pages from old PG versions. To put it >> bluntly, the code wasn't anything I would have wanted to deploy, but >> the reason why Zdenek gave up on fixing it was because several >> community members considerably senior to myself provided negative >> feedback on the concept. > > Well, there were quite a number of open issues relating to page > conversion: > > o Do we write the old version or just convert on read? > o How do we write pages that get larger on conversion to the > new format? > > As I remember, the patch allowed read/write of old versions, which greatly > increased its code impact. Oh, for sure there were plenty of issues with the patch, starting with the fact that the way it was set up led to unacceptable performance and code complexity trade-offs. Some of my comments from the time: http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php But the point is that the concept, I think, is basically the right one: you have to be able to read and make sense of the contents of old page versions. There is room, at least in my book, for debate about which operations we should support on old pages. Totally read only? Set hint bits? Kill old tuples? Add new tuples? The key issue, as I think Heikki identified at the time, is to figure out how you're eventually going to get rid of the old pages. He proposed running a pre-upgrade utility on each page to reserve the right amount of free space. http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php I don't like that solution. If the pre-upgrade utility is something that has to be run while the database is off-line, then it defeats the point of an in-place upgrade. If it can be run while the database is up, I fear it will need to be deeply integrated into the server. And since we can't know the requirements for how much space to reserve (and it needn't be a constant) until we design the new feature, this will likely mean backpatching a rather large chunk of complex code, which to put it mildly, is not the sort of thing we normally would even consider. I think a better approach is to support reading tuples from old pages, but to write all new tuples into new pages. A full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be used to propel everything to the new version, with the usual tricks for people who need to rewrite the table a piece at a time. But, this is not religion for me. I'm fine with some other design; I just can't presently see how to make it work. I think the present discussion of CRC checks is an excellent test-case for any and all ideas about how to solve this problem. 
If someone can get a patch committed that can convert the 8.4 page format to an 8.5 format with the hint bits shuffled around and a (hopefully optional) CRC added, I think that'll become the de facto standard for how to handle page format upgrades. ...Robert
Robert Haas wrote: > > Well, there were quite a number of open issues relating to page > > conversion: > > > > o Do we write the old version or just convert on read? > > o How do we write pages that get larger on conversion to the > > new format? > > > > As I remember, the patch allowed read/write of old versions, which greatly > > increased its code impact. > > Oh, for sure there were plenty of issues with the patch, starting with > the fact that the way it was set up led to unacceptable performance > and code complexity trade-offs. Some of my comments from the time: > > http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php > http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php > > But the point is that the concept, I think, is basically the right > one: you have to be able to read and make sense of the contents of old > page versions. There is room, at least in my book, for debate about > which operations we should support on old pages. Totally read only? > Set hint bits? Kill old tuples? Add new tuples? I think part of the problem is there was no agreement before the patch was coded and submitted, and there didn't seem to be much desire from the patch author to adjust it, nor demand from the community because we didn't need it yet. > The key issue, as I think Heikki identified at the time, is to figure > out how you're eventually going to get rid of the old pages. He > proposed running a pre-upgrade utility on each page to reserve the > right amount of free space. > > http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php Right. There were two basic approaches to handling a page that would expand when upgraded to the new version --- either allow the system to write the old format, or have a pre-upgrade script that moved tuples so there was guaranteed enough free space in every page for the new format. I think we agreed that the latter was better than the former, and it was easy because we don't have any need for that at this time. Plus the script would not rewrite every page, just certain pages that required it. > I don't like that solution. If the pre-upgrade utility is something > that has to be run while the database is off-line, then it defeats the > point of an in-place upgrade. If it can be run while the database is > up, I fear it will need to be deeply integrated into the server. And > since we can't know the requirements for how much space to reserve > (and it needn't be a constant) until we design the new feature, this > will likely mean backpatching a rather large chunk of complex code, > which to put it mildly, is not the sort of thing we normally would > even consider. I think a better approach is to support reading tuples > from old pages, but to write all new tuples into new pages. A > full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be > used to propel everything to the new version, with the usual tricks > for people who need to rewrite the table a piece at a time. But, this > is not religion for me. I'm fine with some other design; I just can't > presently see how to make it work. Well, perhaps the text I wrote above will clarify that the upgrade script is only for page expansion --- it is not to rewrite every page into the new format. > I think the present discussion of CRC checks is an excellent test-case > for any and all ideas about how to solve this problem. 
If someone can > get a patch committed that can convert the 8.4 page format to an 8.5 > format with the hint bits shuffled around and a (hopefully optional) CRC > added, I think that'll become the de facto standard for how to handle > page format upgrades. Well, yea, the idea would be that the 8.5 server would either convert the page to the new format on read (assuming there is enough free space, perhaps requiring a pre-upgrade script), or have the server write the page in the old 8.4 format and not do CRC checks on the page. My guess is the former. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Robert Haas wrote: > If the pre-upgrade utility is something > that has to be run while the database is off-line, then it defeats the > point of an in-place upgrade. If it can be run while the database is > up, I fear it will need to be deeply integrated into the server. And > since we can't know the requirements for how much space to reserve > (and it needn't be a constant) until we design the new feature, this > will likely mean backpatching a rather large chunk of complex code, > which to put it mildly, is not the sort of thing we normally would > even consider. You're wandering into the sort of overdesign that isn't really needed yet. For now, presume it's a constant amount of overhead, and that the release notes for the new version will say "configure the pre-upgrade utility and tell it you need <x> bytes of space reserved". That's sufficient for the CRC case, right? Needs a few more bytes per page, 8.5 release notes could say exactly how much. Solve that before making things more complicated by presuming you need to solve the variable-size increase problem, too. We'll be lucky to get the absolute simplest approach committed, you really need to have a big smoking gun to justify feature creep in this area. (If I had to shoot from the hip and design for the variable case, why not just make the thing that determines how much space a given page needs reserved a function the user can re-install with a smarter version?) > I think a better approach is to support reading tuples > from old pages, but to write all new tuples into new pages. A > full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be > used to propel everything to the new version, with the usual tricks > for people who need to rewrite the table a piece at a time. I think you're oversimplifying the operational difficulty of "the usual tricks". This is a painful approach for the exact people who need this the most: people with a live multi-TB installation they can't really afford to add too much load to. The beauty of the in-place upgrade tool just converting pages as it scans through looking for them is that you can dial up its intensity to exactly how much overhead you can stand, and let it loose until it's done. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
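A sketch of the replaceable-function idea from the last paragraph, assuming the pre-upgrade utility calls a per-page hook to learn how many bytes to keep free; every name here is invented for illustration:

    #include <stddef.h>

    /* Hook type: given a page image, report how many bytes to keep free. */
    typedef size_t (*reserve_space_fn)(const void *page, size_t page_size);

    /* Default: a constant reservation, e.g. room for a 4-byte CRC. */
    static size_t
    default_reserve_space(const void *page, size_t page_size)
    {
        (void) page;
        (void) page_size;
        return 4;
    }

    static reserve_space_fn reserve_space_hook = default_reserve_space;

    /* A smarter, release-specific version could be installed later. */
    static void
    install_reserve_space_hook(reserve_space_fn fn)
    {
        reserve_space_hook = fn;
    }

    /* The utility would call this for each page while scanning a relation. */
    static size_t
    reserved_bytes_for_page(const void *page, size_t page_size)
    {
        return reserve_space_hook(page, page_size);
    }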
On Tue, Dec 1, 2009 at 10:34 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> > Well, there were quite a number of open issues relating to page >> > conversion: >> > >> > o Do we write the old version or just convert on read? >> > o How do we write pages that get larger on conversion to the >> > new format? >> > >> > As I remember, the patch allowed read/write of old versions, which greatly >> > increased its code impact. >> >> Oh, for sure there were plenty of issues with the patch, starting with >> the fact that the way it was set up led to unacceptable performance >> and code complexity trade-offs. Some of my comments from the time: >> >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php >> >> But the point is that the concept, I think, is basically the right >> one: you have to be able to read and make sense of the contents of old >> page versions. There is room, at least in my book, for debate about >> which operations we should support on old pages. Totally read only? >> Set hint bits? Kill old tuples? Add new tuples? > > I think part of the problem is there was no agreement before the patch > was coded and submitted, and there didn't seem to be much desire from > the patch author to adjust it, nor demand from the community because we > didn't need it yet. Could be. It's water under the bridge at this point. >> The key issue, as I think Heikki identified at the time, is to figure >> out how you're eventually going to get rid of the old pages. He >> proposed running a pre-upgrade utility on each page to reserve the >> right amount of free space. >> >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php > > Right. There were two basic approaches to handling a page that would > expand when upgraded to the new version --- either allow the system to > write the old format, or have a pre-upgrade script that moved tuples so > there was guaranteed enough free space in every page for the new format. > I think we agreed that the latter was better than the former, and it was > easy because we don't have any need for that at this time. Plus the > script would not rewrite every page, just certain pages that required > it. While I'm always willing to be proven wrong, I think it's a complete dead-end to believe that it's going to be easier to reserve space for page expansion using the upgrade-from version rather than the upgrade-to version. I am firmly of the belief that the NEW pg version must be able to operate on an unmodified heap migrated from the OLD pg version. After this set of patches was rejected, Zdenek actually proposed an alternate patch that would have allowed space reservation, and it was rejected precisely because there was no clear certainty that it would solve any hypothetical future problem. ...Robert
On Tue, Dec 1, 2009 at 10:45 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Robert Haas wrote: >> >> If the pre-upgrade utility is something >> that has to be run while the database is off-line, then it defeats the >> point of an in-place upgrade. If it can be run while the database is >> up, I fear it will need to be deeply integrated into the server. And >> since we can't know the requirements for how much space to reserve >> (and it needn't be a constant) until we design the new feature, this >> will likely mean backpatching a rather large chunk of complex code, >> which to put it mildly, is not the sort of thing we normally would >> even consider. > > You're wandering into the sort of overdesign that isn't really needed yet. > For now, presume it's a constant amount of overhead, and that the release > notes for the new version will say "configure the pre-upgrade utility and > tell it you need <x> bytes of space reserved". That's sufficient for the > CRC case, right? Needs a few more bytes per page, 8.5 release notes could > say exactly how much. Solve that before making things more complicated by > presuming you need to solve the variable-size increase problem, too. We'll > be lucky to get the absolute simplest approach committed, you really need to > have a big smoking gun to justify feature creep in this area. Well, I think the best way to solve the problem is to design the system in a way that makes it unnecessary to have a pre-upgrade tool at all, by making the new PG version capable of handling page expansion where needed. I don't understand how putting that functionality into the OLD PG version can be better. But I may be misunderstanding something. > (If I had to shoot from the hip and design for the variable case, why not > just make the thing that determines how much space a given page needs > reserved a function the user can re-install with a smarter version?) That's a pretty good idea. I have no love of this pre-upgrade concept, but if we're going to do it that way, then allowing someone to load in a function to compute the required amount of free space to reserve is a good thought. >> I think a better approach is to support reading tuples >> from old pages, but to write all new tuples into new pages. A >> full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be >> used to propel everything to the new version, with the usual tricks >> for people who need to rewrite the table a piece at a time. > > I think you're oversimplifying the operational difficulty of "the usual > tricks". This is a painful approach for the exact people who need this the > most: people with a live multi-TB installation they can't really afford to > add too much load to. The beauty of the in-place upgrade tool just > converting pages as it scans through looking for them is that you can dial > up its intensity to exactly how much overhead you can stand, and let it > loose until it's done. Fair enough. ...Robert
* Greg Stark <gsstark@mit.edu> [091201 20:14]: > I'm not sure we're on the same page. As I understand it there are > three proposals on the table now: > > 1) set aside a section of the page to contain only non-checksummed > hint bits. That section has to be relocatable so the crc check would > have to read the start and end address of it from the page header. > > 2) store the hint bits in the line pointers and skip checking the line > pointers. In that case the crc check would skip any bytes between the > start of the line pointer array and pd_lower (or pd_upper? no point in > crc checking unused bytes is there?) > > 3) store the hint bits in the line pointers and apply a mask which > masks out the 4 hint bits in each 32-bit word in the region between > the start of the line pointers and pd_lower (or pd_upper again) I'm not intimately familiar with the innards of the pages, but I had *thought* that the original suggestion of moving the hint bits was purely to make sure that they are in the same filesystem block/disk sector as the CRC. That may not be possible, but *if* that's the case, you avoid the torn-page problem, with only 1 minimal assumption: - the FS-block/disk-sector will write whole "blocks" at a time, or likely be corrupt anyways With my understanding of disks and platters, I'd assume that if you got a partial sector written, and something prevented it from being completely written, I'd guess the part missing would be smeared with corruption.... And that would seem to hold with flash/SSD's too... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Robert Haas wrote: > >> The key issue, as I think Heikki identified at the time, is to figure > >> out how you're eventually going to get rid of the old pages. He > >> proposed running a pre-upgrade utility on each page to reserve the > >> right amount of free space. > >> > >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php > > > > Right. There were two basic approaches to handling a page that would > > expand when upgraded to the new version --- either allow the system to > > write the old format, or have a pre-upgrade script that moved tuples so > > there was guaranteed enough free space in every page for the new format. > > I think we agreed that the latter was better than the former, and it was > > easy because we don't have any need for that at this time. Plus the > > script would not rewrite every page, just certain pages that required > > it. > > While I'm always willing to be proven wrong, I think it's a complete > dead-end to believe that it's going to be easier to reserve space for > page expansion using the upgrade-from version rather than the > upgrade-to version. I am firmly of the belief that the NEW pg version > must be able to operate on an unmodified heap migrated from the OLD pg > version. After this set of patches was rejected, Zdenek actually Does it need to write the old version, and if it does, does it have to carry around the old format structures all over the backend? That was the unclear part. > proposed an alternate patch that would have allowed space reservation, > and it was rejected precisely because there was no clear certainty > that it would solve any hypothetical future problem. True. It was solving a problem we didn't have, yet. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Dec 01, 2009 at 10:34:11PM -0500, Bruce Momjian wrote: > Robert Haas wrote: > > The key issue, as I think Heikki identified at the time, is to > > figure out how you're eventually going to get rid of the old > > pages. He proposed running a pre-upgrade utility on each page to > > reserve the right amount of free space. > > > > http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php > > Right. There were two basic approaches to handling a patch that > would expand when upgraded to the new version --- either allow the > system to write the old format, or have a pre-upgrade script that > moved tuples so there was guaranteed enough free space in every page > for the new format. I think we agreed that the later was better > than the former, and it was easy because we don't have any need for > that at this time. Plus the script would not rewrite every page, > just certain pages that required it. Please forgive me for barging in here, but that approach simply is untenable if it requires that the database be down while those pages are being found, marked, moved around, etc. The data volumes that really concern people who need an in-place upgrade are such that even dd if=$PGDATA of=/dev/null bs=8192 # (or whatever the optimal block size would be) would require *much* more time than such people would accept as a down time window, and while that's a lower bound, it's not a reasonable lower bound on the time. If this re-jiggering could kick off in the background at start and work on a running PostgreSQL, the whole objection goes away. A problem that arises for any in-place upgrade system we do is that if someone's at 99% storage capacity, we can pretty well guarantee some kind of catastrophic failure. Could we create some way to get an estimate of space needed, given that the system needs to stay up while that's happening? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi, As we're talking about crazy ideas... Bruce Momjian <bruce@momjian.us> writes: > Well, yea, the idea would be that the 8.5 server would either convert > the page to the new format on read (assuming there is enough free space, > perhaps requiring a pre-upgrade script), or have the server write the > page in the old 8.4 format and not do CRC checks on the page. My guess > is the former. We already have had demand for read only tables (some on-disk format optimisation would then be possible). What about having page level read-only restriction, thus allowing the newer server version to operate in read-only mode on the older server version pages, and convert on write by allocating whole new page(s)? Then we go even crazier, with a special recovery mode on the new version able to read older version WAL format, producing older version pages. That sounds like code maintenance hell, but would allow for a $new WAL standby to restore from an $old WAL stream, and be read only. Then you switch over to the slave and it goes out of recovery and creates new pages on writes. How about going this crazy? Regards, -- dim
On Wed, Dec 2, 2009 at 11:26 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > We already have had demand for read only tables (some on-disk format > optimisation would then be possible). What about having page level > read-only restriction, thus allowing the newer server version to operate > in read-only mode on the older server version pages, and convert on > write by allocating whole new page(s)? I'm a bit confused. Read-only tables are tables that the user has said they don't intend to modify. We can throw an error if they try. What you're proposing are pages that the system treats as read-only but what do you propose to do if the user actually does try to update or delete (or lock) a record in those pages? If we want to avoid converting them to new pages we need to be able to at least store an xmax and set the ctid on those tuples. And probably we would need to do other things like set hint bits or set fields in the page header. -- greg
David Fetter wrote: > > Right. There were two basic approaches to handling a patch that > > would expand when upgraded to the new version --- either allow the > > system to write the old format, or have a pre-upgrade script that > > moved tuples so there was guaranteed enough free space in every page > > for the new format. I think we agreed that the later was better > > than the former, and it was easy because we don't have any need for > > that at this time. Plus the script would not rewrite every page, > > just certain pages that required it. > > Please forgive me for barging in here, but that approach simply is > untenable if it requires that the database be down while those pages > are being found, marked, moved around, etc. > > The data volumes that really concern people who need an in-place > upgrade are such that even > > dd if=$PGDATA of=/dev/null bs=8192 # (or whatever the optimal block size would be) > > would require *much* more time than such people would accept as a down > time window, and while that's a lower bound, it's not a reasonable > lower bound on the time. Well, you can say it is unacceptable, but if there are no other options then that is all we can offer. My main point is that we should consider writing old format pages only when we have no choice (page size might expand), and even then, we might decide to have a pre-migration script because the code impact of writing the old format would be too great. This is all hypothetical until we have a real use-case. > If this re-jiggering could kick off in the background at start and > work on a running PostgreSQL, the whole objection goes away. > > A problem that arises for any in-place upgrade system we do is that if > someone's at 99% storage capacity, we can pretty well guarantee some > kind of catastrophic failure. Could we create some way to get an > estimate of space needed, given that the system needs to stay up while > that's happening? Yea, the database would expand and hopefully have full transaction semantics. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Stark <gsstark@mit.edu> writes: > On Wed, Dec 2, 2009 at 11:26 AM, Dimitri Fontaine > <dfontaine@hi-media.com> wrote: >> We already have had demand for read only tables (some on-disk format >> optimisation would then be possible). What about having page level >> read-only restriction, thus allowing the newer server version to operate >> in read-only mode on the older server version pages, and convert on >> write by allocating whole new page(s)? > > I'm a bit confused. Read-only tables are tables that the user has said > they don't intend to modify. We can throw an error if they try. What > you're proposing are pages that the system treats as read-only but > what do you propose to do if the user actually does try to update or > delete (or lock) a record in those pages? Well it's still a pretty rough idea, so I'll need help from this forum to get to something concrete enough for someone to be able to implement it... and there you go: > If we want to avoid > converting them to new pages we need to be able to at least store an > xmax and set the ctid on those tuples. And probably we would need to > do other things like set hint bits or set fields in the page header. My idea was more that any non read-only access to the page forces a rewrite in the new format, and a deprecation of the ancient page. Maybe like what vacuum would be doing on it as soon as it realises the page contains no visible tuples anymore, but done by the backend at the time of the modification. That makes the first modifications of the page quite costly but allow to somewhat choose when that happens. And still have read only access, so you could test parts of your application on a hot standby running next version. Maybe there's just too much craziness in there now. -- dim
On Tue, Dec 1, 2009 at 11:45 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> >> The key issue, as I think Heikki identified at the time, is to figure >> >> out how you're eventually going to get rid of the old pages. He >> >> proposed running a pre-upgrade utility on each page to reserve the >> >> right amount of free space. >> >> >> >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php >> > >> > Right. There were two basic approaches to handling a page that would >> > expand when upgraded to the new version --- either allow the system to >> > write the old format, or have a pre-upgrade script that moved tuples so >> > there was guaranteed enough free space in every page for the new format. >> > I think we agreed that the latter was better than the former, and it was >> > easy because we don't have any need for that at this time. Plus the >> > script would not rewrite every page, just certain pages that required >> > it. >> >> While I'm always willing to be proven wrong, I think it's a complete >> dead-end to believe that it's going to be easier to reserve space for >> page expansion using the upgrade-from version rather than the >> upgrade-to version. I am firmly of the belief that the NEW pg version >> must be able to operate on an unmodified heap migrated from the OLD pg >> version. After this set of patches was rejected, Zdenek actually > > Does it need to write the old version, and if it does, does it have to carry > around the old format structures all over the backend? That was the > unclear part. I think it needs partial write support for the old version. If the page is not expanding, then you can probably just replace pages in place. But if the page is expanding, then you need to be able to move individual tuples[1]. Since you want to be up and running while that's happening, I think you probably need to be able to update xmax and probably set hint bits. But you don't need to be able to add tuples to the old page format, and I don't think you need complete vacuum support, since you don't plan to reuse the dead space - you'll just recycle the whole page once the tuples are all dead. As for carrying it around the whole backend, I'm not sure how much of the backend really needs to know. It would only be anything that looks at pages, rather than, say, tuples, but I don't really know how much code that touches. I suppose that's one of the things we need to figure out. [1] Unless, of course, you use a pre-upgrade utility. But this is about how to make it work WITHOUT a pre-upgrade utility. >> proposed an alternate patch that would have allowed space reservation, >> and it was rejected precisely because there was no clear certainty >> that it would solve any hypothetical future problem. > > True. It was solving a problem we didn't have, yet. Well, that's sort of a circular argument. If you're going to reserve space with a pre-upgrade utility, you're going to need to put the pre-upgrade utility into the version you want to upgrade FROM. If we wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we would have had to put the utility into 8.4. The problem I'm referring to is that there is no guarantee that you would be able to predict how much space to reserve. In a case like CRCs, it may be as simple as "4 bytes". But what if, say, we switch to a different compression algorithm for inline toast? 
Some pages will contract, others will expand, but there's no saying by how much - and therefore no fixed amount of reserved space is guaranteed to be adequate. It's true that we might never want to do that particular thing, but I don't think we can say categorically that we'll NEVER want to do anything that expands pages by an unpredictable amount. So it might be quite complex to figure out how much space to reserve on any given page. If we can find a way to make that the NEW PG version's problem, it's still complicated, but at least it's not complicated stuff that has to be backpatched. Another problem with a pre-upgrade utility is - how do you verify, when you fire up the new cluster, that the pre-upgrade utility has done its thing? If the new PG version requires 4 bytes of space reserved on each page, what happens when you get halfway through upgrading your 1TB database and find a page with only 2 bytes available? There aren't a lot of good options. The old PG version could try to mark the DB in some way to indicate whether it successfully completed, but what if there's a bug and something was missed? Then you have this scenario: 1. Run the pre-upgrade script. 2. pg_migrator. 3. Fire up new version. 4. Discover that pre-upgrade script forgot to reserve enough space on some page. 5. Report a bug. 6. Bug fixed, new version of pre-upgrade script is now available. 7. ??? If all the logic is in the new server, you may still be in hot water when you discover that it can't deal with a particular case. But hopefully the problem would be confined to that page, or that relation, and you could use the rest of your database. And even if not, when the bug is fixed, you are patching the version that you're still running and not the version that you've already left behind and can't easily go back to. Of course if the bug is bad enough it can fry your database under any design, but in the pre-upgrade script design you have to be really, really confident that the pre-upgrade script doesn't have any holes that will only be found after it's too late. ...Robert
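To make the "partial write support" idea concrete, a sketch of how the new server might gate operations on the on-page layout version (the enum and function are invented; the only inputs taken from the thread are that the current format is v4 and the older one v3):

    #include <stdbool.h>
    #include <stdint.h>

    /* Operations a backend might attempt against a heap page. */
    typedef enum
    {
        PAGE_OP_READ_TUPLE,
        PAGE_OP_SET_HINT_BITS,
        PAGE_OP_SET_XMAX,        /* delete or lock a tuple in place */
        PAGE_OP_ADD_TUPLE,       /* new tuples go only to current-format pages */
        PAGE_OP_PRUNE            /* dead-space reuse, current format only */
    } PageOp;

    /*
     * Old-format pages get read access plus the minimal in-place updates
     * (hint bits, xmax); anything that needs free space forces the tuple
     * to be written out to a new-format page instead.
     */
    static bool
    page_op_allowed(uint16_t page_layout_version, PageOp op)
    {
        if (page_layout_version == 4)           /* current format */
            return true;

        return op == PAGE_OP_READ_TUPLE ||
               op == PAGE_OP_SET_HINT_BITS ||
               op == PAGE_OP_SET_XMAX;
    }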
On Wed, 2009-12-02 at 10:48 -0500, Robert Haas wrote: > Well, that's sort of a circular argument. If you're going to reserve > space with a pre-upgrade utility, you're going to need to put the > pre-upgrade utility into the version you want to upgrade FROM. If we > wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we > would have had to put the utility into 8.4. Don't see any need to reserve space at all. If this is really needed, we first run a script to prepare the 8.4 database for conversion to 8.5. The script would move things around if it finds a block that would have difficulty after upgrade. We may be able to do that simple, using fillfactor, or it may need to be more complex. Either way, its still easy to do this when required. -- Simon Riggs www.2ndQuadrant.com
On Wed, Dec 2, 2009 at 11:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2009-12-02 at 10:48 -0500, Robert Haas wrote: >> Well, that's sort of a circular argument. If you're going to reserve >> space with a pre-upgrade utility, you're going to need to put the >> pre-upgrade utility into the version you want to upgrade FROM. If we >> wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we >> would have had to put the utility into 8.4. > > Don't see any need to reserve space at all. > > If this is really needed, we first run a script to prepare the 8.4 > database for conversion to 8.5. The script would move things around if > it finds a block that would have difficulty after upgrade. We may be > able to do that simple, using fillfactor, or it may need to be more > complex. Either way, its still easy to do this when required. I discussed the problems with this, as I see them, in the same email you just quoted. You don't have to agree with my analysis, of course. ...Robert
Robert Haas wrote: > The problem I'm referring to is that there is no guarantee that you > would be able predict how much space to reserve. In a case like CRCs, > it may be as simple as "4 bytes". But what if, say, we switch to a > different compression algorithm for inline toast? Upthread, you made a perfectly sensible suggestion: use the CRC addition as a test case to confirm you can build something useful that allowed slightly more complicated in-place upgrades than are supported now. This requires some new code to do tuple shuffling, communicate reserved space, etc. All things that seem quite sensible to have available, useful steps toward a more comprehensive solution, and an achievable goal you wouldn't even have to argue about. Now, you're wandering us back down the path where we have to solve a "migrate TOAST changes" level problem in order to make progress. Starting with presuming you have to solve the hardest possible issue around is the documented path to failure here. We've seen multiple such solutions before, and they all had trade-offs deemed unacceptable: either a performance loss for everyone (not just people upgrading), or unbearable code complexity. There's every reason to believe your reinvention of the same techniques will suffer the same fate. When someone has such a change to be made, maybe you could bring this back up again and gain some traction. One of the big lessons I took from the 8.4 development's lack of progress on this class of problem: no work to make upgrades easier will get accepted unless there is such an upgrade on the table that requires it. You need a test case to make sure the upgrade approach a) works as expected, and b) is code you must commit now or in-place upgrade is lost. Anything else will be deferred; I don't think there's any interest in solving a speculative future problem left at this point, given that it will be code we can't even prove will work. > Another problem with a pre-upgrade utility is - how do you verify, > when you fire up the new cluster, that the pre-upgrade utility has > done its thing? Some additional catalog support was suggested to mark what the pre-upgrade utility had processed. I'm sure I could find the messages about again if I had to. > If all the logic is in the new server, you may still be in hot water > when you discover that it can't deal with a particular case. If you can't design a pre-upgrade script without showstopper bugs, what makes you think the much more complicated code in the new server (which will be carrying around an ugly mess of old and new engine parts) will work as advertised? I think we'll be lucky to get the simplest possible scheme implemented, and that any of these more complicated ones will die under their own weight of their complexity. Also, your logic seems to presume that no backports are possible to the old server. A bug-fix to the pre-upgrade script is a completely reasonable and expected candidate for backporting, because it will be such a targeted piece of code that adjusting it shouldn't impact anything else. The same will not be even remotely true if there's a bug fix needed in a more complicated system that lives in a regularly traversed code path. Having such a tightly targeted chunk of code makes pre-upgrade *more* likely to get bug-fix backports, because you won't be touching code executed by regular users at all. The potential code impact of backporting fixes to the more complicated approaches here is another major obstacle to adopting one of them. 
That's an issue that we didn't even get to the last time, because showstopper issues popped up first. That problem would have loomed had work continued down that path, though. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Dec 2, 2009 at 1:08 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Robert Haas wrote: >> >> The problem I'm referring to is that there is no guarantee that you >> would be able to predict how much space to reserve. In a case like CRCs, >> it may be as simple as "4 bytes". But what if, say, we switch to a >> different compression algorithm for inline toast? > > Upthread, you made a perfectly sensible suggestion: use the CRC addition as > a test case to confirm you can build something useful that allowed slightly > more complicated in-place upgrades than are supported now. This requires > some new code to do tuple shuffling, communicate reserved space, etc. All > things that seem quite sensible to have available, useful steps toward a > more comprehensive solution, and an achievable goal you wouldn't even have > to argue about. > > Now, you're wandering us back down the path where we have to solve a > "migrate TOAST changes" level problem in order to make progress. Starting > with presuming you have to solve the hardest possible issue around is the > documented path to failure here. We've seen multiple such solutions before, > and they all had trade-offs deemed unacceptable: either a performance loss > for everyone (not just people upgrading), or unbearable code complexity. > There's every reason to believe your reinvention of the same techniques > will suffer the same fate. Just to set the record straight, I don't intend to work on this problem at all (unless paid, of course). And I'm perfectly happy to go with whatever workable solution someone else comes up with. I'm just offering opinions on what I see as the advantages and disadvantages of different approaches, and anyone who is working on this is more than free to ignore them. > Some additional catalog support was suggested to mark what the pre-upgrade > utility had processed. I'm sure I could find the messages about again if I > had to. And that's a perfectly sensible solution, except that adding a catalog column to 8.4 at this point would force initdb, so that's a non-starter. I suppose we could shoehorn it into the reloptions. > Also, your logic seems to presume that no backports are possible to the old > server. The problem on the table at the moment is that the proposed CRC feature will expand every page by a uniform amount - so in this case a fixed-space-per-page reservation utility would be completely adequate. Does anyone think this is a realistic thing to backport to 8.4? ...Robert
On tis, 2009-12-01 at 19:41 +0000, Greg Stark wrote: > > Also, it would > > require reading back each page as it's written to disk, which is OK > for > > a bunch of single-row writes, but for bulk data loads a significant > problem. > > Not sure what that really means for Postgres. It would just mean > reading back the same page of memory from the filesystem cache that we > just read. Surely the file system ought to be the place where to solve this. After all, we don't put link-level corruption detection into the libpq protocol either.
On tis, 2009-12-01 at 17:47 -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I also like the idea that we don't need to CRC check the line pointers > > because any corruption there is going to appear immediately. However, > > the bad news is that we wouldn't find the corruption until we try to > > access bad data and might crash. > > That sounds exactly like the corruption detection system we have now. > If you think that behavior is acceptable, we can skip this whole > discussion. I think one of the motivations for this CRC business was to detect corruption in the user data. As you say, we already handle corruption in the metadata.
Robert Haas wrote: >> Some additional catalog support was suggested to mark what the pre-upgrade >> utility had processed. I'm sure I could find the messages about again if I >> had to. >> > And that's a perfectly sensible solution, except that adding a catalog > column to 8.4 at this point would force initdb, so that's a > non-starter. I suppose we could shoehorn it into the reloptions. > There's no reason the associated catalog support had to ship with the old version. You can always modify the catalog after initdb, but before running the pre-upgrade utility. pg_migrator might make that change for you. > The problem on the table at the moment is that the proposed CRC > feature will expand every page by a uniform amount - so in this case a > fixed-space-per-page reservation utility would be completely adequate. > Does anyone think this is a realistic thing to backport to 8.4? > I believe the main problem here is making sure that the server doesn't turn around and fill pages right back up again. The logic that needs to show up here has two parts: 1) Don't fill new pages completely up, save the space that will be needed in the new version 2) Find old pages that are filled and free some space on them The pre-upgrade utility we've been talking about does (2), and that's easy to imagine implementing as an add-on module rather than a backport. I don't know how (1) can be done in a way such that it's easily backported to 8.4. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
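To make part (1) above concrete, here is a minimal standalone sketch of a reservation-aware free-space test. It is a simplified model rather than actual PostgreSQL code: the page header struct is faked, and the upgrade_reserved_space setting is hypothetical, standing in for whatever knob the backported logic would read.

/*
 * Minimal standalone sketch of part (1): hold back a few bytes per page in
 * the "does this tuple fit?" test so newly filled pages keep room for the
 * next version's needs.  Simplified model only; not PostgreSQL code.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct
{
    uint16_t pd_lower;          /* offset to start of free space */
    uint16_t pd_upper;          /* offset to end of free space */
} FakePageHeader;

static int upgrade_reserved_space = 4;  /* hypothetical setting: bytes reserved per page */

/* Free space available to new tuples, after subtracting the reservation. */
static size_t
page_free_space(const FakePageHeader *hdr)
{
    size_t raw = (hdr->pd_upper > hdr->pd_lower) ? (size_t) (hdr->pd_upper - hdr->pd_lower) : 0;

    return raw > (size_t) upgrade_reserved_space ? raw - (size_t) upgrade_reserved_space : 0;
}

/* The insert path asks the usual question, but against the reduced figure. */
static int
tuple_fits(const FakePageHeader *hdr, size_t tuple_len)
{
    return tuple_len <= page_free_space(hdr);
}

int
main(void)
{
    FakePageHeader hdr = { .pd_lower = 64, .pd_upper = 128 };

    printf("usable free space = %zu, 62-byte tuple fits: %s\n",
           page_free_space(&hdr), tuple_fits(&hdr, 62) ? "yes" : "no");
    return 0;
}

Without the reservation that 62-byte tuple would have fit in the 64 raw bytes; with it, new pages stop just short of completely full, which is all part (1) asks for.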
On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Robert Haas wrote: >>> Some additional catalog support was suggested to mark what the >>> pre-upgrade >>> utility had processed. I'm sure I could find the messages about again >>> if I >>> had to. >> And that's a perfectly sensible solution, except that adding a catalog >> column to 8.4 at this point would force initdb, so that's a >> non-starter. I suppose we could shoehorn it into the reloptions. > There's no reason the associated catalog support had to ship with the old > version. You can always modify the catalog after initdb, but before running > the pre-upgrade utility. pg_migrator might make that change for you. Uh, really? I don't think that's possible at all. >> The problem on the table at the moment is that the proposed CRC >> feature will expand every page by a uniform amount - so in this case a >> fixed-space-per-page reservation utility would be completely adequate. >> Does anyone think this is a realistic thing to backport to 8.4? > > I believe the main problem here is making sure that the server doesn't turn > around and fill pages right back up again. The logic that needs to show up > here has two parts: > > 1) Don't fill new pages completely up, save the space that will be needed in > the new version > 2) Find old pages that are filled and free some space on them > > The pre-upgrade utility we've been talking about does (2), and that's easy > to imagine implementing as an add-on module rather than a backport. I don't > know how (1) can be done in a way such that it's easily backported to 8.4. Me neither. ...Robert
Robert Haas wrote: > On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> There's no reason the associated catalog support had to ship with the old version. You can always modify the catalog after initdb, but before running the pre-upgrade utility. pg_migrator might make that change for you. > Uh, really? I don't think that's possible at all. Worst case just to get this bootstrapped: you install a new table with the added bits. Old version page upgrader accounts for itself there. pg_migrator dumps that data and then loads it into its new, correct home on the newer version. There's already stuff like that being done anyway--dumping things from the old catalog and inserting into the new one--and if the origin is actually an add-on rather than an original catalog page it doesn't really matter. As long as the new version can see the info it needs in its catalog it doesn't matter how it got to there; that's the one that needs to check the migration status before it can access things outside of the catalog. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Dec 2, 2009 at 2:27 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Robert Haas wrote: > > On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > > There's no reason the associated catalog support had to ship with the old > version. You can always modify the catalog after initdb, but before running > the pre-upgrade utility. pg_migrator might make that change for you. > > > Uh, really? I don't think that's possible at all. > > > Worst case just to get this bootstrapped: you install a new table with the > added bits. Old version page upgrader accounts for itself there. > pg_migrator dumps that data and then loads it into its new, correct home on > the newer version. There's already stuff like that being done > anyway--dumping things from the old catalog and inserting into the new > one--and if the origin is actually an add-on rather than an original catalog > page it doesn't really matter. As long as the new version can see the info > it needs in its catalog it doesn't matter how it got to there; that's the > one that needs to check the migration status before it can access things > outside of the catalog. That might work. I think that in order to get a fixed OID for the new catalog you would need to run a backend in bootstrap mode, which might (not sure) require shutting down the database first. But it sounds doable. There remains the issue of whether it is reasonable to think about backpatching such a thing, and whether doing so is easier/better than dealing with page expansion in the new server. ...Robert
On Wed, Dec 2, 2009 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Also, your logic seems to presume that no backports are possible to the old >> server. > > The problem on the table at the moment is that the proposed CRC > feature will expand every page by a uniform amount - so in this case a > fixed-space-per-page reservation utility would be completely adequate. > Does anyone think this is a realistic thing to backport to 8.4? This whole discussion is based on assumptions which do not match my recollection of the old discussion. I would suggest people go back and read the emails, but it's clear at least some people have, so it seems people get different things out of those old emails. My recollection of Tom and Heikki's suggestions for Zdenek were as follows: 1) When 8.9.0 comes out we also release an 8.8.x which contains a new guc which says to prepare for an 8.9 update. If that guc is set then any new pages are guaranteed to have enough space for 8.9.0, which could be as simple as guaranteeing there are x bytes of free space. In the case of the CRC it's actually *not* a uniform amount of free space if we go with Tom's design of having a variable chunk which moves around, but it's still just simple arithmetic to determine whether there's enough free space on the page for a new tuple, so it would be simple enough to backport. 2) When you want to prepare a database for upgrade you run the precheck script which first of all makes sure you're running 8.8.x and that the flag is set. Then it checks the free space on every page to ensure it's satisfactory. If not, it can do a noop update to any tuple on the page, which the new free space calculation would guarantee would go to a new page. Then you have to wait long enough and vacuum. 3) Then you run pg_migrator which swaps in the new catalog files. 4) Then you shut down and bring up 8.9.0 which, on reading any page, *immediately* converts it to 8.9.0 format. 5) You would eventually also need some program which processes every page and guarantees to write it back out in the new format. Otherwise there will be pages that you never stop reconverting every time they're read. -- greg
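A standalone sketch of the shape of step 4 in that list, the convert-on-first-read part. The version numbers, field names, and the empty convert_page() body are placeholders rather than real page-layout code; an actual conversion would do the tuple shuffling using the space reserved in steps 1 and 2.

/*
 * Sketch of step 4: when an old-format page is read in, convert it on the
 * spot so everything downstream only ever sees the new layout.  Purely
 * illustrative; version numbers and struct fields are placeholders.
 */
#include <stdint.h>
#include <stdio.h>

#define OLD_LAYOUT_VERSION 4
#define NEW_LAYOUT_VERSION 5

typedef struct
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint8_t  pd_version;
} FakePageHeader;

/* Stand-in for the real work: shuffle tuples into the reserved space, etc. */
static void
convert_page(FakePageHeader *hdr)
{
    hdr->pd_version = NEW_LAYOUT_VERSION;
}

/* What a ReadBuffer-style routine might do right after a page comes in. */
static void
page_read_hook(FakePageHeader *hdr)
{
    if (hdr->pd_version == OLD_LAYOUT_VERSION)
        convert_page(hdr);      /* page is now dirty and gets rewritten in the new format */
}

int
main(void)
{
    FakePageHeader hdr = { 24, 8192, OLD_LAYOUT_VERSION };

    page_read_hook(&hdr);
    printf("page layout version is now %u\n", (unsigned) hdr.pd_version);
    return 0;
}

Step 5 then exists only because a page that is never dirtied again would otherwise be reconverted on every read.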
Greg Stark <gsstark@mit.edu> writes: > This whole discussion is based on assumptions which do not match my > recollection of the old discussion. I would suggest people go back and > read the emails but it's clear at least some people have so it seems > people get different things out of those old emails. My recollection > of Tom and Heikki's suggestions for Zdenek were as follows: > 1) When 8.9.0 comes out we also release an 8.8.x which contains a new > guc which says to prepare for an 8.9 update. Yeah, I think the critical point is not to assume that the behavior of the old system is completely set in stone. We can insist that you must update to at least point release .N before beginning the migration process. That gives us a chance to backpatch code that makes adjustments to the behavior of the old server, so long as the backpatch isn't invasive enough to raise stability concerns. regards, tom lane
On Wed, Dec 2, 2009 at 3:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> This whole discussion is based on assumptions which do not match my >> recollection of the old discussion. I would suggest people go back and >> read the emails but it's clear at least some people have so it seems >> people get different things out of those old emails. My recollection >> of Tom and Heikki's suggestions for Zdenek were as follows: > >> 1) When 8.9.0 comes out we also release an 8.8.x which contains a new >> guc which says to prepare for an 8.9 update. > > Yeah, I think the critical point is not to assume that the behavior of > the old system is completely set in stone. We can insist that you must > update to at least point release .N before beginning the migration > process. That gives us a chance to backpatch code that makes > adjustments to the behavior of the old server, so long as the backpatch > isn't invasive enough to raise stability concerns. If we have consensus on that approach, I'm fine with it. I just don't want one of the people who wants this CRC feature to go to a lot of trouble to develop a space reservation system that has to be backpatched to 8.4, and then have the patch rejected as too potentially destabilizing. ...Robert
On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> > Does $COMPETITOR offer this feature?
> >
> My understanding is that MSSQL does. I am not sure about Oracle. Those
> are the only two I run into (I don't run into MySQL at all). I know
> others likely compete in the DB2 space.
To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote: > On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake > <jd@commandprompt.com> wrote: > On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote: > > Does $COMPETITOR offer this feature? > > > > My understanding is that MSSQL does. I am not sure about Oracle. Those > are the only two I run into (I don't run into MySQL at all). I know > others likely compete in the DB2 space. > > To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL > Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages. So... now that the upgrade discussion seems to have died down... was any consensus reached on how to do said checksumming? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Fri, 2009-12-04 at 03:32 -0600, decibel wrote: > So... now that the upgrade discussion seems to have died down... was > any consensus reached on how to do said checksumming? Possibly. Please can you go through the discussion and pull out a balanced summary of how to proceed? I lost track a while back and I'm sure many others did also. -- Simon Riggs www.2ndQuadrant.com
Kevin, > md5sum of each tuple? As an optional system column (a la oid)? I am mainly an application programmer working with PostgreSQL. And I want to point out an additional usefulness of an md5sum of each tuple: it makes comparing table-contents in replicated / related databases MUCH more feasible. I am in the process of adding a user-space "myhash" column to all my applications' tables, filled by a trigger on insert / update. It really speeds up table comparison across databases; and it is very helpful in debugging replications. Harald -- GHUM Harald Massa persuadere et programmare Harald Armin Massa Spielberger Straße 49 70435 Stuttgart 0173/9409607 no fx, no carrier pigeon - %s is too gigantic of an industry to bend to the whims of reality
On Fri, Dec 4, 2009 at 9:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Possibly. Please can you go through the discussion and pull out a > balanced summary of how to proceed? I lost track a while back and I'm > sure many others did also. I summarized the three feasible plans I think I saw; <407d949e0912011713j63045989j67b7b343ef00c192@mail.gmail.com> -- greg
decibel wrote: > On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote: > > On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake > > <jd@commandprompt.com> wrote: > > On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote: > > > Does $COMPETITOR offer this feature? > > > > > > > My understanding is that MSSQL does. I am not sure about Oracle. Those > > are the only two I run into (I don't run into MySQL at all). I know > > others likely compete in the DB2 space. > > > > To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL > > Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages. > > > So... now that the upgrade discussion seems to have died down... was > any consensus reached on how to do said checksumming? I think the hint bit has to be added to the item pointer, by using the offset bits that are already zero, according to Greg Stark. That solution leads to easy programming, no expanding hint bit array, and it is backward compatible so doesn't cause problems for pg_migrator. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote: > I think the hint bit has to be added to the item pointer, by using the > offset bits that are already zero, according to Greg Stark. That > solution leads to easy programming, no expanding hint bit array, and it > is backward compatible so doesn't cause problems for pg_migrator. Seems like a reasonable way forward. As I pointed out here http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php we only need to use 3 bits not 4, but it does limit tuple length to 4096 for all block sizes. (Two different options there for doing that). An added advantage of this approach is that the cachelines for the item pointer array will already be in CPU cache, so there is no additional access time when we set the hint bits when they are moved to their new position. I should also point out that removing 4 bits from the tuple header would allow us to get rid of t_infomask2, reducing tuple length by a further 2 bytes. -- Simon Riggs www.2ndQuadrant.com
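As a standalone illustration of the item-pointer idea: tuple offsets within a page are MAXALIGN'd (8 bytes on most platforms), so the low three bits of lp_off are always zero and can carry hint bits, masked back out whenever the offset is actually used. The field widths below mirror the usual line pointer layout, but the macro names and the particular bit values stored are made up for the example.

/*
 * Standalone sketch: stash hint bits in the always-zero low bits of the
 * line pointer's offset field.  Illustrative only; macro names and bit
 * assignments are invented for the example.
 */
#include <assert.h>
#include <stdio.h>

typedef struct
{
    unsigned lp_off:15,         /* offset of tuple within page */
             lp_flags:2,
             lp_len:15;
} FakeItemId;

#define HINT_BIT_MASK 0x7       /* low 3 bits, zero for 8-byte-aligned offsets */

#define ItemIdGetRealOffset(lp) ((lp)->lp_off & ~HINT_BIT_MASK)
#define ItemIdGetHintBits(lp)   ((lp)->lp_off & HINT_BIT_MASK)
#define ItemIdSetHintBits(lp, h) \
    ((lp)->lp_off = ItemIdGetRealOffset(lp) | ((h) & HINT_BIT_MASK))

int
main(void)
{
    FakeItemId lp = { .lp_off = 7992, .lp_flags = 1, .lp_len = 128 };

    assert(lp.lp_off % 8 == 0);         /* precondition: MAXALIGN'd offset */
    ItemIdSetHintBits(&lp, 0x5);        /* made-up encoding of two hint bits */
    printf("real offset = %u, hint bits = %u\n",
           (unsigned) ItemIdGetRealOffset(&lp), (unsigned) ItemIdGetHintBits(&lp));
    return 0;
}

The read side only pays an extra mask when fetching the offset, which fits the cache-line argument above: the bits being set live in an array that is already being touched.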
BTW with VACUUM FULL removed I assume we're going to get rid of HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Simon Riggs wrote: > On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote: > > > I think the hint bit has to be added to the item pointer, by using the > > offset bits that are already zero, according to Greg Stark. That > > solution leads to easy programming, no expanding hint bit array, and it > > is backward compatible so doesn't cause problems for pg_migrator. > > Seems like a reasonable way forward. > > As I pointed out here > http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php > we only need to use 3 bits not 4, but it does limit tuple length to 4096 > for all block sizes. (Two different options there for doing that). > > An added advantage of this approach is that the cachelines for the item > pointer array will already be in CPU cache, so there is no additional > access time when we set the hint bits when they are moved to their new > position. > > I should also point out that removing 4 bits from the tuple header would > allow us to get rid of t_infomask2, reducing tuple length by a further 2 > bytes. Wow, that is a nice win. Does alignment allow us to actually use that space? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote: > > I should also point out that removing 4 bits from the tuple header would > > allow us to get rid of t_infomask2, reducing tuple length by a further 2 > > bytes. > > Wow, that is a nice win. Does alignment allow us to actually use that > space? It would mean that tables up to 24 columns wide would still be 24 bytes wide, whereas >8 columns now has to fit in 32 bytes. So in practical terms most tables would benefit in your average database. -- Simon Riggs www.2ndQuadrant.com
On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: > BTW with VACUUM FULL removed I assume we're going to get rid of > HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? Much as I would like to see those go, no. VF code should remain for some time yet, IMHO. We could remove it, but doing so is not a priority because it buys us nothing in terms of features and it's the type of thing we should do at the start of a release cycle, not the end. I certainly don't have time to do it, at least. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: > >> BTW with VACUUM FULL removed I assume we're going to get rid of >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? > > Much as I would like to see those go, no. VF code should remain for some > time yet, IMHO. I don't think we need to keep VF code otherwise, but I would leave HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise we need a pre-upgrade script or something to scrub them off. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Dec 4, 2009 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote: > >> > I should also point out that removing 4 bits from the tuple header would >> > allow us to get rid of t_infomask2, reducing tuple length by a further 2 >> > bytes. >> >> Wow, that is a nice win. Does alignment allow us to actually use that >> space? > > It would mean that tables up to 24 columns wide would still be 24 bytes > wide, whereas >8 columns now has to fit in 32 bytes. So in practical > terms most tables would benefit in your average database. I don't think getting rid of infomask2 wins us 2 bytes so fast. The rest of those two bytes is natts which of course we still need. If we lose vacuum full then the table's open for reducing the width of command id too if we need more bits. If we do that and we moved everything we could to the line pointers including ctid we might just be able to squeeze the tuple overhead down to 16 bytes. That would win 8 bytes per tuple for people with no null columns or with nulls and a total of 9-64 columns but if they have 1-8 columns and any are null it would actually consume more space. But it looks to me like it would be very very tight and require drastic measures -- I think we would be left with something like 11 bits for commandid and no spare bits in the tuple header at all. -- greg
On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark <gsstark@mit.edu> wrote: > If we lose vacuum full then the table's open for reducing the width of > command id too if we need more bits. If we do that and we moved > everything we could to the line pointers including ctid we might just > be able to squeeze the tuple overhead down to 16 bytes. I'm not sure why I said "including ctid". We would have to move everything transactional to the line pointer, including xmin, xmax, ctid, all the hint bits, the updated flags, hot flags, etc. The only things left in the tuple header would be things that have to be there such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty drastic change, though a fairly logical one. I recall someone actually submitted a patch to separate out the transactional bits anyways a while back, just to save a few bytes in in-memory tuples. If we could save on disk-space usage it would be a lot more compelling. But it doesn't look to me like it really saves enough often enough to be worth so much code churn. -- greg
Greg Stark escribió: > On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark <gsstark@mit.edu> wrote: > > If we lose vacuum full then the table's open for reducing the width of > > command id too if we need more bits. If we do that and we moved > > everything we could to the line pointers including ctid we might just > > be able to squeeze the tuple overhead down to 16 bytes. > > I'm not sure why I said "including ctid". We would have to move > everything transactional to the line pointer, including xmin, xmax, > ctid, all the hint bits, the updated flags, hot flags, etc. The only > things left in the tuple header would be things that have to be there > such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty > drastic change, though a fairly logical one. Do we need XMAX_EXCL_LOCK and XMAX_SHARED_LOCK to be moved? It seems to me that they can stay with the tuple header because they are set by wal-logged operations. Same for XMAX_IS_MULTI. The HASfoo bits are all set on tuple creation, never touched later, so they can stay in the header too. We only need XMIN_COMMITTED, XMIN_INVALID, XMAX_COMMITTED, XMAX_INVALID, HEAP_COMBOCID on the line pointer AFAICS ... oh, and HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE, not sure. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Heikki Linnakangas escribió: > Simon Riggs wrote: > > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: > > > >> BTW with VACUUM FULL removed I assume we're going to get rid of > >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? > > > > Much as I would like to see those go, no. VF code should remain for some > > time yet, IMHO. > > I don't think we need to keep VF code otherwise, but I would leave > HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise > we need a pre-upgrade script or something to scrub them off. CRCs are going to need scrubbing anyway, no? Oh, but you're assuming that CRCs are optional, so not everybody would need that, right? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Heikki Linnakangas escribió: >> Simon Riggs wrote: >> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: >> > >> >> BTW with VACUUM FULL removed I assume we're going to get rid of >> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? >> > >> > Much as I would like to see those go, no. VF code should remain for some >> > time yet, IMHO. >> >> I don't think we need to keep VF code otherwise, but I would leave >> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise >> we need a pre-upgrade script or something to scrub them off. > > CRCs are going to need scrubbing anyway, no? Oh, but you're assuming > that CRCs are optional, so not everybody would need that, right? If we can make not only the validity but also the presence of the CRC field optional, it will simplify things greatly for in-place upgrade, I think, because the upgrade won't itself require expanding the page. Turning on the CRC functionality for a particular table may require expanding the page, but that's a different problem. :-) Have we thought about what other things have changed between 8.4 and 8.5 that might cause problems for in-place upgrade? ...Robert
Simon Riggs <simon@2ndQuadrant.com> writes: > As I pointed out here > http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php > we only need to use 3 bits not 4, but it does limit tuple length to 4096 > for all block sizes. (Two different options there for doing that). Limiting the tuple length is a deal-breaker. regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > I'm not sure why I said "including ctid". We would have to move > everything transactional to the line pointer, including xmin, xmax, > ctid, all the hint bits, the updated flags, hot flags, etc. The only > things left in the tuple header would be things that have to be there > such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty > drastic change, though a fairly logical one. I recall someone actually > submitted a patch to separate out the transactional bits anyways a > while back, just to save a few bytes in in-memory tuples. If we could > save on disk-space usage it would be a lot more compelling. But it > doesn't look to me like it really saves enough often enough to be > worth so much code churn. It would also break things for indexes, which don't need all that stuff in their line pointers. More to the point, moving the same bits to someplace else on the page doesn't save anything at all. regards, tom lane
On Fri, 2009-12-04 at 10:43 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > As I pointed out here > > http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php > > we only need to use 3 bits not 4, but it does limit tuple length to 4096 > > for all block sizes. (Two different options there for doing that). > > Limiting the tuple length is a deal-breaker. If people who use 32kB block sizes exist in practice, I note that because tuples are at least 4-byte aligned, the bottom 2 bits of the length are always unused. So they're available for those with strangely long tuples, and can be used to signify high order bytes, and so max tuple length could be 16384. With tuples that long, it would be better to assume 8-byte minimum alignment, which would put max tuple length back up to 32KB again. None of that need affect people with a standard 8192 byte blocksize. -- Simon Riggs www.2ndQuadrant.com
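One possible reading of that arithmetic, offered only as a sketch: if the stored length is kept in alignment-sized units, the low bits carry no information, so a narrower field still covers roughly 16kB at 4-byte units or 32kB at 8-byte units. The field width and helper names below are illustrative, not a proposal, and note that the decoded value is the aligned length rather than the exact one.

/*
 * Sketch only: store tuple lengths in units of the alignment, so a 12-bit
 * field (15 bits minus the 3 reclaimed ones) still reaches ~16kB at 4-byte
 * units.  Widths and names are illustrative.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LEN_ALIGN 4             /* assumed minimum tuple alignment */
#define LEN_BITS  12            /* hypothetical width left for the length */

/* Round up to the alignment boundary, then drop the implied zero bits. */
static uint16_t
encode_len(uint32_t len)
{
    uint32_t aligned = (len + LEN_ALIGN - 1) & ~(uint32_t) (LEN_ALIGN - 1);

    assert(aligned / LEN_ALIGN < (1u << LEN_BITS));
    return (uint16_t) (aligned / LEN_ALIGN);
}

/* Multiply the implied zero bits back in. */
static uint32_t
decode_len(uint16_t stored)
{
    return (uint32_t) stored * LEN_ALIGN;
}

int
main(void)
{
    printf("2054 round-trips as %u; max representable = %u bytes\n",
           (unsigned) decode_len(encode_len(2054)),
           (unsigned) decode_len((1u << LEN_BITS) - 1));   /* 2056 and 16380 */
    return 0;
}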
Robert Haas <robertmhaas@gmail.com> writes: > Have we thought about what other things have changed between 8.4 and > 8.5 that might cause problems for in-place upgrade? So far, nothing. We even made Andrew Gierth jump through hoops to keep hstore's on-disk representation upwards compatible. regards, tom lane
On Fri, 2009-12-04 at 13:35 +0000, Greg Stark wrote: > I don't think getting rid of infomask2 wins us 2 bytes so fast. The > rest of those two bytes is natts which of course we still need. err, yes, OK. -- Simon Riggs www.2ndQuadrant.com
Robert Haas wrote: > On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > Heikki Linnakangas escribió: > >> Simon Riggs wrote: > >> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: > >> > > >> >> BTW with VACUUM FULL removed I assume we're going to get rid of > >> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? > >> > > >> > Much as I would like to see those go, no. VF code should remain for some > >> > time yet, IMHO. > >> > >> I don't think we need to keep VF code otherwise, but I would leave > >> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise > >> we need a pre-upgrade script or something to scrub them off. > > > > CRCs are going to need scrubbing anyway, no? Oh, but you're assuming > > that CRCs are optional, so not everybody would need that, right? > > If we can make not only the validity but also the presence of the CRC > field optional, it will simplify things greatly for in-place upgrade, > I think, because the upgrade won't itself require expanding the page. > Turning on the CRC functionality for a particular table may require > expanding the page, but that's a different problem. :-) Well, I am not sure how we would turn the _space_ used for CRC on and off because you would have to rewrite the entire table/database to turn it on, which seems unfortunate. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Dec 4, 2009 at 2:04 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >> > Heikki Linnakangas escribió: >> >> Simon Riggs wrote: >> >> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote: >> >> > >> >> >> BTW with VACUUM FULL removed I assume we're going to get rid of >> >> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right? >> >> > >> >> > Much as I would like to see those go, no. VF code should remain for some >> >> > time yet, IMHO. >> >> >> >> I don't think we need to keep VF code otherwise, but I would leave >> >> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise >> >> we need a pre-upgrade script or something to scrub them off. >> > >> > CRCs are going to need scrubbing anyway, no? Oh, but you're assuming >> > that CRCs are optional, so not everybody would need that, right? >> >> If we can make not only the validity but also the presence of the CRC >> field optional, it will simplify things greatly for in-place upgrade, >> I think, because the upgrade won't itself require expanding the page. >> Turning on the CRC functionality for a particular table may require >> expanding the page, but that's a different problem. :-) > > Well, I am not sure how we would turn the _space_ used for CRC on and > off because you would have to rewrite the entire table/database to turn > it on, which seems unfortunate. Well, presumably you're going to have to do some of that work anyway, because even if the space is set aside you're still going to have to read the page in, CRC it, and write it back out. However if the space is not pre-allocated then you also have to deal with moving tuples to other pages. But that problem is going to have to be dealt with somewhere along the line no matter what we do, because if you're upgrading an 8.3 or 8.4 system to 8.5, you need to add that space sometime: either before migration (with a pre-upgrade utility), or after migration (by some sort of page converter/tuple mover), or only when/if enabling the CRC feature. One nice thing about making it the CRC feature's problem to make space on each page is that people who don't want to use CRCs can still use those extra 4 bytes/page for data. That might not be worth the code complexity if we were starting from scratch, but I'm thinking that most of the code complexity is a given if we want to also support in-place upgrade. ...Robert
Massa, Harald Armin wrote: > I am in the process of adding a user-space "myhash" column to all my > applications' tables, filled by a trigger on insert / update. It really > speeds up table comparison across databases; and it is very helpful > in debugging replications. Have you seen pg_comparator? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Robert Haas wrote: > > Well, I am not sure how we would turn the _space_ used for CRC on and > > off because you would have to rewrite the entire table/database to turn > > it on, which seems unfortunate. > > Well, presumably you're going to have to do some of that work anyway, > because even if the space is set aside you're still going to have to > read the page in, CRC it, and write it back out. However if the space > is not pre-allocated then you also have to deal with moving tuples to > other pages. But that problem is going to have to be dealt with > somewhere along the line no matter what we do, because if you're > upgrading an 8.3 or 8.4 system to 8.5, you need to add that space > sometime: either before migration (with a pre-upgrade utility), or > after migration (by some sort of page converter/tuple mover), or only > when/if enabling the CRC feature. > > One nice thing about making it the CRC feature's problem to make space > on each page is that people who don't want to use CRCs can still use > those extra 4 bytes/page for data. That might not be worth the code > complexity if we were starting from scratch, but I'm thinking that > most of the code complexity is a given if we want to also support > in-place upgrade. My guess is we can find somewhere on an 8.4 heap/index page to add four bytes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
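For reference, a standalone sketch of the per-page CRC itself, independent of where those four bytes end up living: checksum the block with the CRC slot treated as zero, so the stored value never feeds into its own computation. CHECKSUM_OFFSET is a placeholder for whatever location gets picked, and the bitwise CRC-32 here is the plain IEEE polynomial rather than necessarily what the server's checksum code would use.

/*
 * Standalone sketch: CRC an 8kB block with the 4-byte checksum slot zeroed
 * out.  The slot location and the choice of CRC are placeholders.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ          8192
#define CHECKSUM_OFFSET 8       /* hypothetical location of the 4-byte CRC */

/* Plain bitwise CRC-32 (IEEE polynomial, reflected). */
static uint32_t
crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* CRC of the page contents with the checksum slot treated as zero. */
static uint32_t
page_checksum(const unsigned char *page)
{
    unsigned char copy[BLCKSZ];

    memcpy(copy, page, BLCKSZ);
    memset(copy + CHECKSUM_OFFSET, 0, sizeof(uint32_t));
    return crc32_buf(copy, BLCKSZ);
}

int
main(void)
{
    unsigned char page[BLCKSZ] = {0};
    uint32_t      crc = page_checksum(page);

    /* write path stamps the CRC; read path recomputes and compares */
    memcpy(page + CHECKSUM_OFFSET, &crc, sizeof(crc));
    printf("stored CRC %08x, verifies: %s\n",
           (unsigned) crc, page_checksum(page) == crc ? "yes" : "no");
    return 0;
}

FlushBuffer-style code would stamp the value just before the block goes out, and ReadBuffer-style code would recompute and compare just after it comes in, which is what the check at the end of main() simulates.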
A curiosity question regarding torn pages: How does this work on file systems that don't write in-place, but instead always do copy-on-write? My example would be Sun's ZFS file system (In Solaris & BSD). Because of its "snapshot & rollback" functionality, it never writes a page in-place, but instead always copies it to another place on disk. How does this affect the corruption caused by a torn write? Can we end up with horrible corruption on this type of filesystem where we wouldn't on normal file systems, where we are writing to a previously zeroed area on disk? Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue.
On Fri, 2009-12-04 at 14:47 -0800, Chuck McDevitt wrote: > A curiosity question regarding torn pages: How does this work on file > systems that don't write in-place, but instead always do > copy-on-write? > > My example would be Sun's ZFS file system (In Solaris & BSD). Because > of its "snapshot & rollback" functionality, it never writes a page > in-place, but instead always copies it to another place on disk. How > does this affect the corruption caused by a torn write? > > Can we end up with horrible corruption on this type of filesystem > where we wouldn't on normal file systems, where we are writing to a > previously zeroed area on disk? > > Sorry if this is a stupid question... Hopefully somebody can reassure > me that this isn't an issue. Think we're still good. Not a stupid question. Hint bits are set while the block is in shared_buffers and setting a hint bit dirties the page, but does not write WAL. Because the page is dirty we re-write the whole block at checkpoint, by bgwriter cleaning or via dirty page eviction. So ZFS is OK, but we do more writing than we want to, sometimes. -- Simon Riggs www.2ndQuadrant.com
>> I am in the process of adding a user-space "myhash" column to all my >> applications' tables, filled by a trigger on insert / update. It really >> speeds up table comparison across databases; and it is very helpful >> in debugging replications. > > Have you seen pg_comparator? Yes, I saw the lightning talk at pgday.eu. It also uses md5 hashes, just in a schema of its own. Guess pg_comparator would profit from an integrated MD5 hash. Harald -- GHUM Harald Massa persuadere et programmare Harald Armin Massa Spielberger Straße 49 70435 Stuttgart 0173/9409607 no fx, no carrier pigeon - %s is too gigantic of an industry to bend to the whims of reality
It can save space because the line pointers have less strict alignment requirements. But I don't see any point in the current state. -- Greg On 2009-12-04, at 3:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> I'm not sure why I said "including ctid". We would have to move >> everything transactional to the line pointer, including xmin, xmax, >> ctid, all the hint bits, the updated flags, hot flags, etc. The only >> things left in the tuple header would be things that have to be there >> such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty >> drastic change, though a fairly logical one. I recall someone >> actually >> submitted a patch to separate out the transactional bits anyways a >> while back, just to save a few bytes in in-memory tuples. If we could >> save on disk-space usage it would be a lot more compelling. But it >> doesn't look to me like it really saves enough often enough to be >> worth so much code churn. > > It would also break things for indexes, which don't need all that > stuff > in their line pointers. > > More to the point, moving the same bits to someplace else on the page > doesn't save anything at all. > > regards, tom lane
On Fri, Dec 4, 2009 at 10:47 PM, Chuck McDevitt <cmcdevitt@greenplum.com> wrote: > A curiosity question regarding torn pages: How does this work on file systems that don't write in-place, but instead always do copy-on-write? > > My example would be Sun's ZFS file system (In Solaris & BSD). Because of its "snapshot & rollback" functionality, it never writes a page in-place, but instead always copies it to another place on disk. How does this affect the corruption caused by a torn write? > > Can we end up with horrible corruption on this type of filesystem where we wouldn't on normal file systems, where we are writing to a previously zeroed area on disk? > > Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue. It's not a stupid question; we're not 100% sure, but we believe ZFS doesn't need full page writes because it's immune to torn pages. I think the idea of ZFS is that the new partially written page isn't visible because it's not linked into the tree until it's been completely written. To me it appears this would depend on the drive system ordering writes very strictly, which seems hard to be sure is happening. Perhaps this is tied to the tricks they do to avoid contention on the root; if they do a write barrier before every root update, that seems like it should be sufficient to me, but I don't know at that level of detail. -- greg