Thread: 16-bit page checksums for 9.2
After the various recent discussions on list, I present what I believe to be a working patch implementing 16-bit checksums on all buffer pages. page_checksums = on | off (default) There are no required block changes; checksums are optional and some blocks may have a checksum, others not. This means that the patch will allow pg_upgrade. That capability also limits us to 16-bit checksums. Fletcher's 16 is used in this patch and seems rather quick, though that is easily replaceable/tuneable if desired, perhaps even as a parameter enum. This patch is a step on the way to 32-bit checksums in a future redesign of the page layout, though that is not a required future change, nor does this prevent that. Checksum is set whenever the buffer is flushed to disk, and checked when the page is read in from disk. It is not set at other times, and for much of the time may not be accurate. This follows earlier discussions from 2010-12-22, and is discussed in detail in patch comments. Note it works with buffer manager pages, which includes shared and local data buffers, but not SLRU pages (yet? an easy addition but needs other discussion around contention). Note that all this does is detect bit errors on the page; it doesn't identify where the error is, how bad it is, and definitely not what caused it or when it happened. The main body of the patch involves changes to bufpage.c/.h so this differs completely from the VMware patch, for technical reasons. Also included are facilities to LockBufferForHints() with usage in various AMs, to avoid the case where hints are set during calculation of the checksum. In my view this is a fully working, committable patch but I'm not in a hurry to commit it given the holiday season. Hopefully it's a gift, not a turkey, and therefore a challenge for some to prove that wrong. Enjoy either way, Merry Christmas, -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
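For readers who haven't met it before, Fletcher-16 keeps two running 8-bit sums over the input and packs them into a 16-bit value. The sketch below is illustrative only: a generic byte-buffer version, not the code in the attached patch, and the function name and layout are assumptions.

#include <stddef.h>
#include <stdint.h>

/*
 * Minimal sketch of a Fletcher-16 checksum over a byte buffer.
 * Illustrative only; not the code from the attached patch.
 */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;
    size_t   i;

    for (i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;  /* running byte sum */
        sum2 = (sum2 + sum1) % 255;     /* position-weighted sum */
    }

    /* pack the two 8-bit sums into the 16-bit result */
    return (uint16_t) ((sum2 << 8) | sum1);
}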
Simon Riggs <simon@2ndQuadrant.com> writes: > After the various recent discussions on list, I present what I believe > to be a working patch implementing 16-bit checksums on all buffer > pages. I think locking around hint-bit-setting is likely to be unworkable from a performance standpoint. I also wonder whether it might not result in deadlocks. Also, as far as I can see this patch usurps the page version field, which I find unacceptably short-sighted. Do you really think this is the last page layout change we'll ever make? regards, tom lane
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > After the various recent discussions on list, I present what I believe > > to be a working patch implementing 16-bit checksums on all buffer > > pages. > > I think locking around hint-bit-setting is likely to be unworkable from > a performance standpoint. I also wonder whether it might not result in > deadlocks. Why don't you use the same tricks as the former patch and copy the buffer, compute the checksum on that, and then write out that copy (you can even do both at the same time). I have a hard time believing that the additional copy is more expensive than the locking. Andres
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> After the various recent discussions on list, I present what I believe >> to be a working patch implementing 16-bit checksums on all buffer >> pages. > > I think locking around hint-bit-setting is likely to be unworkable from > a performance standpoint. Anyone choosing page_checksums = on has already made a performance-reducing decision in favour of reliability. So they understand and accept the impact. There is no locking when the parameter is off. A safe alternative is to use LockBuffer, which has a much greater performance impact. I did think about optimistically checking after the write, but if we crash at that point we will then see a block that has an invalid checksum. It's faster but you may get a checksum failure if you crash - but then one important aspect of this is to spot problems in case of a crash, so that seems unacceptable. > I also wonder whether it might not result in > deadlocks. If you can see how, please say. I can't see any ways for that myself. > Also, as far as I can see this patch usurps the page version field, > which I find unacceptably short-sighted. Do you really think this is > the last page layout change we'll ever make? No, I don't. I hope and expect the next page layout change to reintroduce such a field. But since we're agreed now that upgrading is important, changing page format isn't likely to be happening until we get an online upgrade process. So future changes are much less likely. If they do happen, we have some flag bits spare that can be used to indicate later versions. It's not the prettiest thing in the world, but it's a small ugliness in return for an important feature. If there was a way without that, I would have chosen it. pg_filedump will need to be changed more than normal, but the version isn't used anywhere else in the server code. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote: > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >> > After the various recent discussions on list, I present what I believe >> > to be a working patch implementing 16-bit checksums on all buffer >> > pages. >> >> I think locking around hint-bit-setting is likely to be unworkable from >> a performance standpoint. I also wonder whether it might not result in >> deadlocks. > Why don't you use the same tricks as the former patch and copy the buffer, > compute the checksum on that, and then write out that copy (you can even do > both at the same time). I have a hard time believing that the additional copy > is more expensive than the locking. We would copy every time we write, yet lock only every time we set hint bits. If that option is favoured, I'll write another version after Christmas. ISTM we can't write and copy at the same time because the checksum is not a trailer field. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > Not an expert here, but after reading through the patch quickly, I > don't see anything that changes the torn-page problem though, right? > > Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only > dirty, right? Checksums merely detect a problem, whereas FPWs correct a problem if it happens, but only in crash situations. So this does nothing to remove the need for FPWs, though checksum detection could be used for double write buffers also. Checksums work even when there is no crash, so if your disk goes bad and corrupts data then you'll know about it as soon as it happens. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote: > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote: > > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote: > >> Simon Riggs <simon@2ndQuadrant.com> writes: > >> > After the various recent discussions on list, I present what I believe > >> > to be a working patch implementing 16-bit checksums on all buffer > >> > pages. > >> > >> I think locking around hint-bit-setting is likely to be unworkable from > >> a performance standpoint. I also wonder whether it might not result in > >> deadlocks. > > > > Why don't you use the same tricks as the former patch and copy the > > buffer, compute the checksum on that, and then write out that copy (you > > can even do both at the same time). I have a hard time believing that > > the additional copy is more expensive than the locking. > > We would copy every time we write, yet lock only every time we set hint > bits. Isn't setting hint bits also a rather frequent operation? At least in a well-cached workload where most writeout happens due to checkpoints. > If that option is favoured, I'll write another version after Christmas. Seems less complicated (wrt deadlocking et al) to me. But I haven't read your patch, so I will shut up now ;) Andres
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Checksums merely detect a problem, whereas FPWs correct a problem if > it happens, but only in crash situations. > > So this does nothing to remove the need for FPWs, though checksum > detection could be used for double write buffers also. This is missing the point. If you have a torn page on a page that is only dirty due to hint bits then the checksum will show a spurious checksum failure. It will "detect" a problem that isn't there. The problem is that there is no WAL indicating the hint bit change. And if the torn page includes the new checksum but not the new hint bit or vice versa it will be a checksum mismatch. The strategy discussed in the past was moving all the hint bits to a common area and skipping them in the checksum. No amount of double writing or buffering or locking will avoid this problem. -- greg
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote: > On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Checksums merely detect a problem, whereas FPWs correct a problem if >> it happens, but only in crash situations. >> >> So this does nothing to remove the need for FPWs, though checksum >> detection could be used for double write buffers also. > > This is missing the point. If you have a torn page on a page that is > only dirty due to hint bits then the checksum will show a spurious > checksum failure. It will "detect" a problem that isn't there. It will detect a problem that *is* there, but one you are classifying as a non-problem because it is a correctable or acceptable bit error. Given that acceptable bit errors on hints cover no more than 1% of a block, the great likelihood is that the bit error is unacceptable in any case, so false positive page errors are in fact very rare. Any bit error is an indicator of problems on the external device, so many would regard any bit error as unacceptable. > The problem is that there is no WAL indicating the hint bit change. > And if the torn page includes the new checksum but not the new hint > bit or vice versa it will be a checksum mismatch. > > The strategy discussed in the past was moving all the hint bits to a > common area and skipping them in the checksum. No amount of double > writing or buffering or locking will avoid this problem. I completely agree we should do this, but we are unable to do it now, so this patch is a stop-gap and provides a much requested feature *now*. In the future, we will be able to tell the difference between an acceptable and an unacceptable bit error. Right now, all we have is the ability to detect a bit error and as I point out above that is 99% of the problem solved, at least. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Simon Riggs wrote: > On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark wrote: >> The problem is that there is no WAL indicating the hint bit >> change. And if the torn page includes the new checksum but not the >> new hint bit or vice versa it will be a checksum mismatch. With *just* this patch, true. An OS crash or hardware failure could sometimes create an invalid page. >> The strategy discussed in the past was moving all the hint bits to >> a common area and skipping them in the checksum. No amount of >> double writing or buffering or locking will avoid this problem. I don't believe that. Double-writing is a technique to avoid torn pages, but it requires a checksum to work. This chicken-and-egg problem requires the checksum to be implemented first. > I completely agree we should do this, but we are unable to do it > now, so this patch is a stop-gap and provides a much requested > feature *now*. Yes, for people who trust their environment to prevent torn pages, or who are willing to tolerate one bad page per OS crash in return for quick reporting of data corruption from unreliable file systems, this is a good feature even without double-writes. > In the future, we will be able to tell the difference between an > acceptable and an unacceptable bit error. A double-write patch would provide that, and it sounds like VMware has a working patch for that which is being polished for submission. It would need to wait until we have some consensus on the checksum patch before it can be finalized. I'll try to review the patch from this thread today, to do what I can to move that along. -Kevin
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote: > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote: > > Why don't you use the same tricks as the former patch and copy the buffer, > > compute the checksum on that, and then write out that copy (you can even do > > both at the same time). I have a hard time believing that the additional copy > > is more expensive than the locking. > > ISTM we can't write and copy at the same time because the checksum is > not a trailer field. Of course you can. If the checksum is in the trailer field you get the nice property that the whole block has a constant checksum. However, if you store the checksum elsewhere you just need to change the checking algorithm to copy the checksum out, zero those bytes and run the checksum and compare with the extracted checksum. Not pretty, but I don't think it makes a difference in performance. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
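A sketch of the check Martijn describes, assuming the checksum is stored somewhere inside the page rather than as a trailer. The offset parameter is hypothetical and the fletcher16() sketch from earlier is reused; this is not the patch's actual page layout or API.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch: verify a checksum stored inside the page.  Copy the stored
 * value out, zero those bytes, checksum the whole block, then compare.
 * cksum_off is a hypothetical offset of the stored field.
 */
static bool
page_checksum_matches(uint8_t *page, size_t blcksz, size_t cksum_off)
{
    uint16_t stored;
    uint16_t computed;

    memcpy(&stored, page + cksum_off, sizeof(stored));
    memset(page + cksum_off, 0, sizeof(stored));

    computed = fletcher16(page, blcksz);    /* sketch shown earlier */

    /* put the stored value back so the caller sees the page unchanged */
    memcpy(page + cksum_off, &stored, sizeof(stored));

    return stored == computed;
}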
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote: >> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> Checksums merely detect a problem, whereas FPWs correct a problem if >>> it happens, but only in crash situations. >>> >>> So this does nothing to remove the need for FPWs, though checksum >>> detection could be used for double write buffers also. >> >> This is missing the point. If you have a torn page on a page that is >> only dirty due to hint bits then the checksum will show a spurious >> checksum failure. It will "detect" a problem that isn't there. > > It will detect a problem that *is* there, but one you are classifying > it as a non-problem because it is a correctable or acceptable bit > error. I don't agree with this. We don't WAL-log hint bit changes precisely because it's OK if they make it to disk and it's OK if they don't. Given that, I don't see how we can say that writing out only half of a page that has had hint bit changes is a problem. It's not. (And if it is, then we ought to WAL-log all such changes regardless of whether CRCs are in use.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 25.12.2011 15:01, Kevin Grittner wrote: > I don't believe that. Double-writing is a technique to avoid torn > pages, but it requires a checksum to work. This chicken-and-egg > problem requires the checksum to be implemented first. I don't think double-writes require checksums on the data pages themselves, just on the copies in the double-write buffers. In the double-write buffer, you'll need some extra information per-page anyway, like a relfilenode and block number that indicates which page it is in the buffer. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 25.12.2011 15:01, Kevin Grittner wrote: >> >> I don't believe that. Double-writing is a technique to avoid torn >> pages, but it requires a checksum to work. This chicken-and-egg >> problem requires the checksum to be implemented first. > > > I don't think double-writes require checksums on the data pages themselves, > just on the copies in the double-write buffers. In the double-write buffer, > you'll need some extra information per-page anyway, like a relfilenode and > block number that indicates which page it is in the buffer. How would you know when to look in the double write buffer? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Dec 25, 2011 at 1:01 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > This chicken-and-egg > problem requires the checksum to be implemented first. v2 of checksum patch, using a conditional copy if checksumming is enabled, so locking is removed. Thanks to Andres for thwacking me with the cluestick, though I have used a simple copy rather than a copy & calc. Tested using make installcheck with parameter on/off, then restart and vacuumdb to validate all pages. Reviews, objections, user interface tweaks all welcome. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 28.12.2011 01:39, Simon Riggs wrote: > On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 25.12.2011 15:01, Kevin Grittner wrote: >>> >>> I don't believe that. Double-writing is a technique to avoid torn >>> pages, but it requires a checksum to work. This chicken-and-egg >>> problem requires the checksum to be implemented first. >> >> >> I don't think double-writes require checksums on the data pages themselves, >> just on the copies in the double-write buffers. In the double-write buffer, >> you'll need some extra information per-page anyway, like a relfilenode and >> block number that indicates which page it is in the buffer. > > How would you know when to look in the double write buffer? You scan the double-write buffer, and every page in the double write buffer that has a valid checksum, you copy to the main storage. There's no need to check validity of pages in the main storage. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> How would you know when to look in the double write buffer? > > > You scan the double-write buffer, and every page in the double write buffer > that has a valid checksum, you copy to the main storage. There's no need to > check validity of pages in the main storage. OK, then we are talking at cross purposes. Double write buffers, in the way you explain them allow us to remove full page writes. They clearly don't do anything to check page validity on read. Torn pages are not the only fault we wish to correct against... and the double writes idea is orthogonal to the idea of checksums. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 28.12.2011 11:22, Simon Riggs wrote: > On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >>> How would you know when to look in the double write buffer? >> >> >> You scan the double-write buffer, and every page in the double write buffer >> that has a valid checksum, you copy to the main storage. There's no need to >> check validity of pages in the main storage. > > OK, then we are talking at cross purposes. Double write buffers, in > the way you explain them allow us to remove full page writes. They > clearly don't do anything to check page validity on read. Torn pages > are not the only fault we wish to correct against... and the double > writes idea is orthogonal to the idea of checksums. The reason we're talking about double write buffers in this thread is that double write buffers can be used to solve the problem with hint bits and checksums. You're right, though, that it's academical whether double write buffers can be used without checksums on data pages, if the whole point of the exercise is to make it possible to have checksums on data pages.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 5:45 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 28.12.2011 11:22, Simon Riggs wrote: >> >> On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >> >>>> How would you know when to look in the double write buffer? >>> >>> >>> >>> You scan the double-write buffer, and every page in the double write >>> buffer >>> that has a valid checksum, you copy to the main storage. There's no need >>> to >>> check validity of pages in the main storage. >> >> >> OK, then we are talking at cross purposes. Double write buffers, in >> the way you explain them allow us to remove full page writes. They >> clearly don't do anything to check page validity on read. Torn pages >> are not the only fault we wish to correct against... and the double >> writes idea is orthogonal to the idea of checksums. > > > The reason we're talking about double write buffers in this thread is that > double write buffers can be used to solve the problem with hint bits and > checksums. Torn pages are not the only problem we need to detect. You said "You scan the double write buffer...". When exactly would you do that? Please explain how a double write buffer detects problems that do not occur as the result of a crash. We don't have much time, so please be clear and lucid. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Heikki Linnakangas wrote: > On 28.12.2011 01:39, Simon Riggs wrote: >> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas >> wrote: >>> On 25.12.2011 15:01, Kevin Grittner wrote: >>>> >>>> I don't believe that. Double-writing is a technique to avoid >>>> torn pages, but it requires a checksum to work. This chicken- >>>> and-egg problem requires the checksum to be implemented first. >>> >>> I don't think double-writes require checksums on the data pages >>> themselves, just on the copies in the double-write buffers. In >>> the double-write buffer, you'll need some extra information per- >>> page anyway, like a relfilenode and block number that indicates >>> which page it is in the buffer. You are clearly right -- if there is no checksum in the page itself, you can put one in the double-write metadata. I've never seen that discussed before, but I'm embarrassed that it never occurred to me. >> How would you know when to look in the double write buffer? > > You scan the double-write buffer, and every page in the double > write buffer that has a valid checksum, you copy to the main > storage. There's no need to check validity of pages in the main > storage. Right. I'll recap my understanding of double-write (from memory -- if there's a material error or omission, I hope someone will correct me). The write-ups I've seen on double-write techniques have all the writes go first to the double-write buffer (a single, sequential file that stays around). This is done as sequential writing to a file which is overwritten pretty frequently, making the writes to a controller very fast, and a BBU write-back cache unlikely to actually write to disk very often. On good server-quality hardware, it should be blasting RAM-to-RAM very efficiently. The file is fsync'd (like I said, hopefully to BBU cache), then each page in the double-write buffer is written to the normal page location, and that is fsync'd. Once that is done, the database writes have no risk of being torn, and the double-write buffer is marked as empty. This all happens at the point when you would be writing the page to the database, after the WAL-logging. On crash recovery you read through the double-write buffer from the start and write the pages which look good (including a good checksum) to the database before replaying WAL. If you find a checksum error in processing the double-write buffer, you assume that you never got as far as the fsync of the double-write buffer, which means you never started writing the buffer contents to the database, which means there can't be any torn pages there. If you get to the end and fsync, you can be sure any torn pages from a previous attempt to write to the database itself have been overwritten with the good copy in the double-write buffer. Either way, you move on to WAL processing. You wind up with a database free of torn pages before you apply WAL. full_page_writes to the WAL are not needed as long as double-write is used for any pages which would have been written to the WAL. If checksums were written to the double-buffer metadata instead of adding them to the page itself, this could be implemented alone. It would probably allow a modest speed improvement over using full_page_writes and would eliminate those full-page images from the WAL files, making them smaller. If we do add a checksum to the page header, that could be used for testing for torn pages in the double-write buffer without needing a redundant calculation for double-write.
With no torn pages in the actual database, checksum failures there would never be false positives. To get this right for a checksum in the page header, double-write would need to be used for all cases where full_page_writes now are used (i.e., the first write of a page after a checkpoint), and for all unlogged writes (e.g., hint-bit-only writes). There would be no correctness problem for always using double-write, but it would be unnecessary overhead for other page writes, which I think we can avoid. -Kevin
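A compressed sketch of the write cycle Kevin recaps above. Everything here is an assumption for illustration: the metadata layout, an 8 kB page size, a single relation file, and omitted error handling. It is not the VMware implementation.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192

/* hypothetical per-page metadata kept in the double-write file */
typedef struct DWPageHeader
{
    uint32_t relfilenode;
    uint32_t blocknum;
    uint16_t checksum;      /* checksum of the page image that follows */
} DWPageHeader;

/*
 * One double-write cycle: every page goes to the double-write file and is
 * synced there before any page touches its real location.  Error handling
 * omitted; all pages assumed to belong to one relation file.
 */
static void
double_write_pages(int dw_fd, int rel_fd, DWPageHeader *hdrs,
                   char (*pages)[PAGE_SZ], int npages)
{
    int i;

    /* 1. Append every page, with its metadata, to the DW buffer file. */
    for (i = 0; i < npages; i++)
    {
        write(dw_fd, &hdrs[i], sizeof(DWPageHeader));
        write(dw_fd, pages[i], PAGE_SZ);
    }
    fsync(dw_fd);               /* hopefully absorbed by a BBU cache */

    /* 2. Only now write each page to its normal location, and sync that. */
    for (i = 0; i < npages; i++)
        pwrite(rel_fd, pages[i], PAGE_SZ, (off_t) hdrs[i].blocknum * PAGE_SZ);
    fsync(rel_fd);

    /* 3. No write can be torn any more; mark the DW buffer empty for reuse. */
    ftruncate(dw_fd, 0);
}

On crash recovery, the same metadata lets you scan this file and copy back only the entries whose checksum verifies, exactly as described above.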
> Heikki Linnakangas wrote: > Simon Riggs wrote: >> OK, then we are talking at cross purposes. Double write buffers, >> in the way you explain them allow us to remove full page writes. >> They clearly don't do anything to check page validity on read. >> Torn pages are not the only fault we wish to correct against... >> and the double writes idea is orthogonal to the idea of checksums. > > The reason we're talking about double write buffers in this thread > is that double write buffers can be used to solve the problem with > hint bits and checksums. Exactly. Every time the issue of page checksums is raised, there are objections because OS or hardware crashes could cause torn pages for hint-bit-only writes which would be treated as serious errors (potentially indicating hardware failure) when they are in fact expected and benign. Some time before the thread dies, someone generally points out that double-write technology would be a graceful way to handle that, with the side benefit of smaller WAL files. All available evidence suggests it would also allow a small performance improvement, although I hesitate to emphasize that aspect of it; the other benefits fully justify the effort without that. I do feel there is value in a page checksum patch even without torn page protection. The discussion on the list has convinced me that a failed checksum should be treated as seriously as other page format errors, rather than as a warning, even though (in the absence of torn page protection) torn hint-bit-only page writes would be benign. As an example of how this might be useful, consider our central databases which contain all the detail replicated from the circuit court databases in all the counties. These are mission-critical, so we have redundant servers in separate buildings. At one point, one of them experienced hardware problems and we started seeing invalid pages. Since we can shift the load between these servers without down time, we moved all applications to other servers, and investigated. Now, it's possible that for some time before we got errors on the bad pages, there could have been subtle corruption which didn't generate errors but presented bad data on our web site. A page checksum would help prevent that sort of problem, and a post-crash false positive might waste a little time in investigation, but that cost would be far outweighed by the benefit of better accuracy guarantees. Of course, it will be a big plus if we can roll this out in 9.2 in conjunction with a double-write feature. Not only will double-write probably be a bit faster than full_page_writes in the WAL log, but it will allow protection against torn pages on hint-bit-only writes without adding those writes to the WAL or doing any major rearrangement of where they sit that would break pg_upgrade. It would be nice not to have to put all sorts of caveats and explanations into the docs about how a checksum error might be benign due to hint bit writes. -Kevin
On Thu, Dec 29, 2011 at 11:08:43AM -0600, Kevin Grittner wrote: > > Heikki Linnakangas wrote: > > Simon Riggs wrote: > > >> OK, then we are talking at cross purposes. Double write buffers, > >> in the way you explain them allow us to remove full page writes. > >> They clearly don't do anything to check page validity on read. > >> Torn pages are not the only fault we wish to correct against... > >> and the double writes idea is orthogonal to the idea of checksums. > > > > The reason we're talking about double write buffers in this thread > > is that double write buffers can be used to solve the problem with > > hint bits and checksums. > Of course, it will be a big plus if we can roll this out in 9.2 in > conjunction with a double-write feature. Not only will double-write > probably be a bit faster than full_page_writes in the WAL log, but it > will allow protection against torn pages on hint-bit-only writes > without adding those writes to the WAL or doing any major > rearrangement of where they sit that would break pg_upgrade. [Thanks for your recent thread summaries.] A double-write buffer, like a WAL-logged full-page image, is a technique for performing atomic writes wider than those automatically provided by components further down the storage stack. The two strategies have different performance characteristics, and we're told that a double-write buffer would better serve us overall. However, its benefits would not be *greater* for hint-only writes than for any other write. For that reason, I think we should consider these changes independently. With page checksums enabled, remove the hazard of torn hint-only writes by ensuring that a WAL FPI has flushed since the last checkpoint. When necessary, emit an FPI-only record. Separately, optimize first-since-checkpoint writes by replacing FPIs with double-write buffers. The double-write patch will reduce the added WAL of the checksum/safe-hint-updates patch to zero. If the double-writes patch founders, we'll just have more-costly, yet equally reliable, page checksums. nm
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > positives. To get this right for a checksum in the page header, > double-write would need to be used for all cases where > full_page_writes now are used (i.e., the first write of a page after > a checkpoint), and for all unlogged writes (e.g., hint-bit-only > writes). There would be no correctness problem for always using > double-write, but it would be unnecessary overhead for other page > writes, which I think we can avoid. Unless I'm missing something, double-writes are needed for all writes, not only the first page after a checkpoint. Consider this sequence of events: 1. Checkpoint 2. Double-write of page A (DW buffer write, sync, heap write) 3. Sync of heap, releasing DW buffer for new writes.... some time goes by 4. Regular write of page A 5. OS writes one part of page A 6. Crash! Now recovery comes along, page A is broken in the heap with no double-write buffer backup nor anything to recover it by in the WAL. -- Ants Aasma
2011/12/30 Ants Aasma <ants.aasma@eesti.ee>: > On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: > >> positives. To get this right for a checksum in the page header, >> double-write would need to be used for all cases where >> full_page_writes now are used (i.e., the first write of a page after >> a checkpoint), and for all unlogged writes (e.g., hint-bit-only >> writes). There would be no correctness problem for always using >> double-write, but it would be unnecessary overhead for other page >> writes, which I think we can avoid. > > Unless I'm missing something, double-writes are needed for all writes, > not only the first page after a checkpoint. Consider this sequence of > events: > > 1. Checkpoint > 2. Double-write of page A (DW buffer write, sync, heap write) > 3. Sync of heap, releasing DW buffer for new writes. > ... some time goes by > 4. Regular write of page A > 5. OS writes one part of page A > 6. Crash! > > Now recovery comes along, page A is broken in the heap with no > double-write buffer backup nor anything to recover it by in the WAL. I guess the assumption is that the write in (4) is either backed by the WAL, or made safe by double writing. ISTM that such reasoning is only correct if the change that is expressed by the WAL record can be applied in the context of inconsistent (i.e., partially written) pages, which I assume is not the case (excuse my ignorance regarding such basic facts). So I think you are right. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad?
On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> Heikki Linnakangas wrote: >> On 28.12.2011 01:39, Simon Riggs wrote: >>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas >>> wrote: >>>> On 25.12.2011 15:01, Kevin Grittner wrote: >>>>> >>>>> I don't believe that. Double-writing is a technique to avoid >>>>> torn pages, but it requires a checksum to work. This chicken- >>>>> and-egg problem requires the checksum to be implemented first. >>>> >>>> I don't think double-writes require checksums on the data pages >>>> themselves, just on the copies in the double-write buffers. In >>>> the double-write buffer, you'll need some extra information per- >>>> page anyway, like a relfilenode and block number that indicates >>>> which page it is in the buffer. > > You are clearly right -- if there is no checksum in the page itself, > you can put one in the double-write metadata. I've never seen that > discussed before, but I'm embarrassed that it never occurred to me. Heikki's idea for double writes works well. It solves the problems of torn pages in a way that would make FPW redundant. However, I don't see that it provides protection across non-crash write problems. We know we have these since many systems have run without a crash for years and yet still experience corrupt data. Double writes do not require page checksums but neither do they replace page checksums. So I think we need page checksums plus either FPWs or double writes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Nicolas Barbier wrote: > 2011/12/30 Ants Aasma : >> Kevin Grittner wrote: >> >>> positives. To get this right for a checksum in the page header, >>> double-write would need to be used for all cases where >>> full_page_writes now are used (i.e., the first write of a page >>> after a checkpoint), and for all unlogged writes (e.g., >>> hint-bit-only writes). There would be no correctness problem for >>> always using double-write, but it would be unnecessary overhead >>> for other page writes, which I think we can avoid. >> >> Unless I'm missing something, double-writes are needed for all >> writes, not only the first page after a checkpoint. Consider this >> sequence of events: >> >> 1. Checkpoint >> 2. Double-write of page A (DW buffer write, sync, heap write) >> 3. Sync of heap, releasing DW buffer for new writes. >> ... some time goes by >> 4. Regular write of page A >> 5. OS writes one part of page A >> 6. Crash! >> >> Now recovery comes along, page A is broken in the heap with no >> double-write buffer backup nor anything to recover it by in the >> WAL. > > I guess the assumption is that the write in (4) is either backed by > the WAL, or made safe by double writing. ISTM that such reasoning > is only correct if the change that is expressed by the WAL record > can be applied in the context of inconsistent (i.e., partially > written) pages, which I assume is not the case (excuse my ignorance > regarding such basic facts). > > So I think you are right. Hmm. It appears that I didn't think that through all the way. I see two alternatives. (1) We don't eliminate full_page_writes and we only need to use double-writes for unlogged writes. (2) We double-write all writes and on recovery we only apply WAL to a page from pd_lsn onward. We would start from the same point and follow the same rules except that when we read a page and find a pd_lsn past the location of the record we are applying, we do nothing because we are 100% sure everything to that point is safely written and not torn. full_page_writes to WAL would not be needed. -Kevin
> Simon Riggs wrote: > Kevin Grittner wrote: >> if there is no checksum in the page itself, you can put one in the >> double-write metadata. > However, I don't see that it provides protection across non-crash > write problems. We know we have these since many systems have run > without a crash for years and yet still experience corrupt data. Agreed. I don't think anyone has tried to assert it solves the same problems that checksums solve -- it is a high-performance way to solve some of the problems that an in-page checksum *creates* without breaking pg_upgrade. > Double writes do not require page checksums but neither do they > replace page checksums. To nit-pick: double writes require a page checksum, but (as Heikki pointed out) they don't require it to be stored in the page. If there *is* one stored in the page, it probably makes sense to use it. > So I think we need page checksums plus either FPWs or double > writes. Adding checksums by themselves creates a risk of false positive corrupted page indications following an OS or hardware crash. Additional FPWs or a new double-write mechanism are two of myriad possible solutions to that. If it is going to be addressed for 9.2, I believe they're the two most reasonable, especially from the POV of pg_upgrade. So, while they should be separate patches, they complement each other; each makes the other perform better, and they should share some code. -Kevin
On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > You wind up with a database free of torn pages before you apply WAL. > full_page_writes to the WAL are not needed as long as double-write is > used for any pages which would have been written to the WAL. If > checksums were written to the double-buffer metadata instead of > adding them to the page itself, this could be implemented alone. It > would probably allow a modest speed improvement over using > full_page_writes and would eliminate those full-page images from the > WAL files, making them smaller. Correct. So now lots of people seem to be jumping on the double-write bandwagon and looking at some of the things it promises: All writes are durable. This solves 2 big issues: - Remove torn-page problem - Remove FPW from WAL That up front looks pretty attractive. But we need to look at the tradeoffs, and then decide (benchmark, anyone?). Remember, postgresql is a double-write system right now. The 1st, checksummed write is the FPW in WAL. It's fsynced. And the 2nd synced write is when the file is synced during checkpoint. So, postgresql currently has an optimization that not every write has *requirements* for atomic, instant durability. And so postgresql gets to do lots of writes to the OS cache and *not* request them to be instantly synced. And then at some point, when it's ready to clear the 1st checksummed write, make sure every write is synced. And lots of work went into PG recently to get even better at the collection of writes/syncs that happen at checkpoint time to take even bigger advantage of the fact that it's better to write everything in a file first, then call a single sync. So moving to this new double-write-area bandwagon, we move from a "WAL FPW synced at the commit, collect as many other writes, then final sync" type system to a system where *EVERY* write requires syncs of 2 separate 8K writes at buffer write-out time. So we avoid the FPW at commit (yes, that's nice for latency), and we guarantee every buffer written is consistent (that keeps our hint-bit-only dirty writes from being torn). And we do that at a cost of every buffer write requiring 2 fsyncs, in a serial fashion. Come checkpoint, I'm wondering.... Again, all that to avoid a single "optimization" that postgresql currently has: 1) writes for hint-bit-only buffers don't need to be durable And the problem that optimization introduces: 1) Since they aren't guaranteed durable, we can't believe a checksum -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote: > Unless I'm missing something, double-writes are needed for all writes, > not only the first page after a checkpoint. Consider this sequence of > events: > > 1. Checkpoint > 2. Double-write of page A (DW buffer write, sync, heap write) > 3. Sync of heap, releasing DW buffer for new writes. > ... some time goes by > 4. Regular write of page A > 5. OS writes one part of page A > 6. Crash! > > Now recovery comes along, page A is broken in the heap with no > double-write buffer backup nor anything to recover it by in the WAL. Isn't 3 the very definition of a checkpoint, meaning that 4 is not really a regular write as it is the first one after a checkpoint? But it doesn't seem safe to me to replace a page from the DW buffer and then apply WAL to that replaced page which preceded the age of the page in the buffer. Cheers, Jeff
Simon Riggs <simon@2ndQuadrant.com> wrote: > v2 of checksum patch, using a conditional copy if checksumming is > enabled, so locking is removed. > > Thanks to Andres for thwacking me with the cluestick, though I > have used a simple copy rather than a copy & calc. > > Tested using make installcheck with parameter on/off, then restart > and vacuumdb to validate all pages. > > Reviews, objections, user interface tweaks all welcome. I'm happy with how this looks, except (as noted in a code comment) that there seems to be room for optimization of the calculation itself. Details below: (1) I like the choice of Fletcher-16. It should be very good at detecting problems while being a lot less expensive than an official CRC calculation. The fact that it was developed at Lawrence Livermore Labs and has been subjected to peer review in the IEEE Transactions on Communications generates confidence in the technique. According to what I've read, though, the technique is conceptually based around the modulus of the sums. Doing the modulus calculation for each addition is often done to prevent overflow, but if we define sum1 and sum2 as uint I don't see how we can get an overflow with 8k bytes, so I suggest we declare both of these local variables as uint and leave the modulus to the end. It can't hurt to leave off 16000 division operations per checksum generation. (2) I'm not sure about doing this in three parts, to skip the checksum itself and the hole in the middle of the page. Is this because the hole might not have predictable data? Why would that matter, as long as it is read back the same? Is it expected that this will reduce the cost by checking fewer bytes? That we could tolerate the read of an actual corrupted disk sector in the middle of a page if it didn't contain any data? If we can set the checksum to zero before starting, might it be faster to load four bytes at a time, and iterate fewer times? (Like I said, I'm not sure about any of this, but it seemed worth asking the questions.) (3) Rather than having PageSetVerificationInfo() use memcpy, followed by a pass through the copied data to calculate the checksum, might it be better to have a "copy and calculate" version of the function (as in VMware's original patch) to save an extra pass over the page image? Other than these performance tweaks around the calculation phase, I didn't spot any problems. I beat up on it a bit on a couple machines without hitting any bugs or seeing any regression test failures. -Kevin
On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote: > (2) I'm not sure about doing this in three parts, to skip the > checksum itself and the hole in the middle of the page. Is this > because the hole might not have predictable data? Why would that > matter, as long as it is read back the same? IMO not checksumming the free space would be a really bad idea. It's entirely possible to have your hardware crapping on your free space, and I'd still want to know that that was happening. Now, it would be interesting if the free space could be checksummed separately, since there's no reason to refuse to read the page if only the free space is screwed up... But given the choice, I'd rather get an error when the free space is "corrupted" and be forced to look into things rather than blissfully ignore corrupted free space only to be hit later with real data loss. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote: >> Unless I'm missing something, double-writes are needed for all writes, >> not only the first page after a checkpoint. Consider this sequence of >> events: >> >> 1. Checkpoint >> 2. Double-write of page A (DW buffer write, sync, heap write) >> 3. Sync of heap, releasing DW buffer for new writes. >> ... some time goes by >> 4. Regular write of page A >> 5. OS writes one part of page A >> 6. Crash! >> >> Now recovery comes along, page A is broken in the heap with no >> double-write buffer backup nor anything to recover it by in the WAL. > > Isn't 3 the very definition of a checkpoint, meaning that 4 is not > really a regular write as it is the first one after a checkpoint? I think you nailed it. > But it doesn't seem safe to me to replace a page from the DW buffer and > then apply WAL to that replaced page which preceded the age of the > page in the buffer. That's what LSNs are for. If we write the page to the double-write buffer just once per checkpoint, recovery can restore the double-written versions of the pages and then begin WAL replay, which will restore all the subsequent changes made to the page. Recovery may also need to do additional double-writes if it encounters pages for which we wrote WAL but never flushed the buffer, because a crash during recovery can also create torn pages. When we reach a restartpoint, we fsync everything down to disk and then nuke the double-write buffer. Similarly, in normal running, we can nuke the double-write buffer at checkpoint time, once the fsyncs are complete. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"Kevin Grittner" wrote: > if we define sum1 and sum2 as uint I don't see how we can get an > overflow with 8k bytes I feel the need to amend that opinion. While sum1 only needs to hold a maximum of (BLCKSZ * 255), which would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned isn't even good for an 8 KB block size. A uint64 for sum2 can handle BLCKSZ up to 256 MB. So we could define sum1 as uint (which is presumably either uint32 or uint64) and sum2 as uint64 and for a BLCKSZ up to 16 MB we still don't need to find the modulo 255 numbers until we're done adding things up. Does anyone think we need to support anything bigger than that? Making them both uint64 would support a BLCKSZ up to 256 MB without needing to do the division inside the loop, but it might be slightly slower on 32-bit builds. -Kevin
On Tue, Jan 3, 2012 at 11:00 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote: >> (2) I'm not sure about doing this in three parts, to skip the >> checksum itself and the hole in the middle of the page. Is this >> because the hole might not have predictable data? Why would that >> matter, as long as it is read back the same? > > IMO not checksumming the free space would be a really bad idea. It's entirely possible to have your hardware crapping on your free space, and I'd still want to know that that was happening. Now, it would be interesting if the free space could be checksummed separately, since there's no reason to refuse to read the page if only the free space is screwed up... But given the choice, I'd rather get an error when the free space is "corrupted" and be forced to look into things rather than blissfully ignore corrupted free space only to be hit later with real data loss. I see that argument. We don't have space for 2 checksums. We can either (1) report all errors on a page, including errors that don't change PostgreSQL data. This involves checksumming long strings of zeroes, which the checksum algorithm can't tell apart from long strings of ones. (2) report only errors that changed PostgreSQL data. We already do (1) for WAL CRCs so doing the same thing for page checksums makes sense and is much faster. If enough people think we should do (2) then it's a simple change to the patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
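To make the two options concrete, here is a sketch of what option (2) might look like: skip the stored checksum field and the free-space hole, where option (1) would simply run the same loop over every byte of the block. The parameter names, including the pd_lower/pd_upper bounds of the hole, are illustrative and not the v2 patch's actual signature.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of option (2): checksum only the bytes that carry PostgreSQL
 * data, skipping the stored checksum field and the hole between
 * pd_lower and pd_upper.  Option (1) would checksum all blcksz bytes.
 */
static uint16_t
checksum_skip_hole(const uint8_t *page, size_t blcksz,
                   size_t cksum_off, size_t cksum_len,
                   size_t pd_lower, size_t pd_upper)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;
    size_t   i;

    for (i = 0; i < blcksz; i++)
    {
        if (i >= cksum_off && i < cksum_off + cksum_len)
            continue;           /* skip the checksum field itself */
        if (i >= pd_lower && i < pd_upper)
            continue;           /* skip the free-space hole */

        sum1 = (sum1 + page[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }

    return (uint16_t) ((sum2 << 8) | sum1);
}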
On Tue, Jan 3, 2012 at 10:21 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > I'm happy with how this looks, except (as noted in a code comment) > that there seems to be room for optimization of the calculation > itself. Details below: ... > (3) Rather than having PageSetVerificationInfo() use memcpy, > followed by pass through the copied data to calculate the checksum, > might it be better to have a "copy and calculate" version of the > function (as in VMware's original patch) to save an extra pass over > the page image? > Other than these performance tweaks around the calculation phase, I > didn't spot any problems. I beat up on it a bit on a couple > machines without hitting any bugs or seeing any regression test > failures. My focus was on getting something working first, then tuning. If we're agreed that we have everything apart from the tuning then we can proceed with tests to see which works better. The copy and calculate approach might get in the way of hardware prefetch since in my understanding the memory fetch time exceeds the calculation time. As discussed elsewhere using that code or not would not stop that work being credited. David, please can you rework the VMware calc patch to produce an additional 16-bit checksum mechanism in a way compatible with the 16bit patch, so we can test the two versions of the calculation? We can make the GUC an enum so that the page checksum is selectable (for testing). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote: > (1) I like the choice of Fletcher-16. It should be very good at > detecting problems while being a lot less expensive than an official > CRC calculation. I wonder if CRC32c wouldn't be a good alternative given more and more cpus (it's in SSE 4.2) support calculating it in silicon. Andres
On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote: > On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote: >> (1) I like the choice of Fletcher-16. It should be very good at >> detecting problems while being a lot less expensive than an official >> CRC calculation. > I wonder if CRC32c wouldn't be a good alternative given more and more cpus > (it's in SSE 4.2) support calculating it in silicon. We're trying to get something that fits in 16 bits for this release. I'm guessing CRC32c doesn't? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 3:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote: >>> Unless I'm missing something, double-writes are needed for all writes, >>> not only the first page after a checkpoint. Consider this sequence of >>> events: >>> >>> 1. Checkpoint >>> 2. Double-write of page A (DW buffer write, sync, heap write) >>> 3. Sync of heap, releasing DW buffer for new writes. >>> ... some time goes by >>> 4. Regular write of page A >>> 5. OS writes one part of page A >>> 6. Crash! >>> >>> Now recovery comes along, page A is broken in the heap with no >>> double-write buffer backup nor anything to recover it by in the WAL. >> >> Isn't 3 the very definition of a checkpoint, meaning that 4 is not >> really a regular write as it is the first one after a checkpoint? > > I think you nailed it. No, I should have explicitly stated that no checkpoint happens in between. I think the confusion here is because I assumed Kevin described a fixed size d-w buffer in this message: On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > ... The file is fsync'd (like I said, > hopefully to BBU cache), then each page in the double-write buffer is > written to the normal page location, and that is fsync'd. Once that > is done, the database writes have no risk of being torn, and the > double-write buffer is marked as empty. ... If the double-write buffer survives until the next checkpoint, double-writing only the first write should work just fine. The advantage over current full-page writes is that the write is not into the WAL stream and is done (hopefully) by the bgwriter/checkpointer in the background. -- Ants Aasma
2012/1/4 Simon Riggs <simon@2ndquadrant.com>: > On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote: > >> I wonder if CRC32c wouldn't be a good alternative given more and more cpus >> (its in SSE 4.2) support calculating it in silicon. > > We're trying to get something that fits in 16bits for this release. > I'm guessing CRC32c doesn't? What happens to the problem-detecting performance of a 16 bit part of a CRC32c vs. a real 16 bit checksum? If it is still as good, it might make sense to use the former, assuming that there is a way to easily trigger the silicon support and enough CPUs support it. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad?
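For reference, the mechanics of the hardware-assisted idea might look like the sketch below (built with -msse4.2). Whether 16 bits taken from a CRC32C detect errors as well as a purpose-built 16-bit checksum is exactly Nicolas' open question, and real code would also need a runtime CPU check plus a software fallback; both are omitted here.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>      /* SSE 4.2 CRC32C intrinsics */

/*
 * Sketch: compute CRC32C with the SSE 4.2 instruction and keep only
 * 16 bits of the result.  Illustrative only.
 */
static uint16_t
crc32c_low16(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;
    size_t   i = 0;

    /* process 8 bytes per instruction where possible */
    for (; i + 8 <= len; i += 8)
    {
        uint64_t chunk;

        memcpy(&chunk, data + i, sizeof(chunk));
        crc = (uint32_t) _mm_crc32_u64(crc, chunk);
    }
    for (; i < len; i++)
        crc = _mm_crc32_u8(crc, data[i]);

    crc ^= 0xFFFFFFFF;

    return (uint16_t) (crc & 0xFFFF);   /* keep the low 16 bits */
}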
Simon Riggs wrote: > We can either > > (1) report all errors on a page, including errors that don't change > PostgreSQL data. This involves checksumming long strings of zeroes, > which the checksum algorithm can't tell apart from long strings of > ones. What do you mean? Each byte which goes into the checksum, and the position of that byte, influences the outcome once you've got a non-zero value in sum1. The number of leading zero bytes would not affect the outcome unless you seed the calculation with something non-zero, but including the page header in the calculation seems to cover that OK. > (2) report only errors that changed PostgreSQL data. > > We already do (1) for WAL CRCs so doing the same thing for page > checksums makes sense and is much faster. > > If enough people think we should do (2) then it's a simple change to > the patch. To me, (1) makes more sense, but it seems to me you're currently doing (2) because you check in three parts, skipping the free space in the middle of the page. -Kevin
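A tiny demonstration of the position sensitivity Kevin describes, reusing the fletcher16() sketch from earlier in the thread: swapping two different bytes leaves sum1 unchanged but changes sum2, so the packed result differs. It is worth noting, though, that because both sums are reduced mod 255, a 0xFF byte contributes exactly the same as a 0x00 byte, which seems to be the zeros-versus-ones ambiguity Simon refers to.

#include <stdint.h>
#include <stdio.h>

/* reuses the fletcher16() sketch shown earlier in the thread */
int
main(void)
{
    const uint8_t a[] = {0, 0, 1, 2};
    const uint8_t b[] = {0, 0, 2, 1};   /* same bytes, last two swapped */

    /* prints 0403 0503: sum1 is 3 for both, sum2 differs */
    printf("%04x %04x\n",
           fletcher16(a, sizeof(a)),
           fletcher16(b, sizeof(b)));
    return 0;
}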
Excerpts from Kevin Grittner's message of mié ene 04 04:12:43 -0300 2012: > "Kevin Grittner" wrote: > > > if we define sum1 and sum2 as uint I don't see how we can get an > > overflow with 8k bytes > > I feel the need to amend that opinion. > > While sum1 only needs to hold a maximum of (BLCKSZ * 255), which > would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to > a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned > isn't even good for an 8 KB block size. A uint64 for sum2 can handle > BLCKSZ up to 256 MB. So we could define sum1 as uint (which is > presumably either uint32 or uint64) and sum2 as uint64 and for a > BLCKSZ up to 16 MB we still don't need to find the modulo 255 > numbers until we're done adding things up. > > Does anyone think we need to support anything bigger than that? > Making them both uint64 would support a BLCKSZ up to 256 MB without > needing to do the division inside the loop, but it might be slightly > slower on 32-bit builds. We don't support BLCKSZ higher than 32k anyway ... saith pg_config.h: /* Size of a disk block --- this also limits the size of a tuple. You can set it bigger if you need bigger tuples (although TOAST should reduce the need to have large tuples, since fields can be spread across multiple tuples). BLCKSZ must be a power of 2. The maximum possible value of BLCKSZ is currently 2^15 (32768). This is determined by the 15-bit widths of the lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h). Changing BLCKSZ requires an initdb. */ #define BLCKSZ 8192 -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas wrote: > Jeff Janes wrote: >> But it doesn't seem safe to me replace a page from the DW buffer >> and then apply WAL to that replaced page which preceded the age of >> the page in the buffer. > > That's what LSNs are for. Agreed. > If we write the page to the checkpoint buffer just once per > checkpoint, recovery can restore the double-written versions of the > pages and then begin WAL replay, which will restore all the > subsequent changes made to the page. Recovery may also need to do > additional double-writes if it encounters pages that for which we > wrote WAL but never flushed the buffer, because a crash during > recovery can also create torn pages. That's a good point. I think WAL application does need to use double-write. As usual, it doesn't affect *when* a page must be written, but *how*. > When we reach a restartpoint, we fsync everything down to disk and > then nuke the double-write buffer. I think we add to the double-write buffer as we write pages from the buffer to disk. I don't think it makes sense to do potentially repeated writes of the same page with different contents to the double-write buffer as we go; nor is it a good idea to leave the page unsynced and let the double-write buffer grow for a long time. > Similarly, in normal running, we can nuke the double-write buffer > at checkpoint time, once the fsyncs are complete. Well, we should nuke it for re-use as soon as all pages in the buffer are written and fsynced. I'm not at all sure that the best performance is hit by waiting for checkpoint for that versus doing it at page eviction time. The whole reason that double-write techniques don't double the write time is that it is relatively small and the multiple writes to the same disk sectors get absorbed by the BBU write-back without actually hitting the disk all the time. Letting the double-write buffer grow to a large size seems likely to me to be a performance killer. The whole double-write, including fsyncs to buffer and the actual page location should just be considered part of the page write process, I think. -Kevin
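The ordering Kevin describes -- copy the page into the double-write area, sync that, then write and sync the real location -- looks roughly like this in bare POSIX terms. This is a standalone illustration, not PostgreSQL code; the function and file descriptors are invented, and error handling is reduced to return codes.

#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/*
 * Write one page torn-page-safely: the copy in the double-write file is
 * durable before the in-place write starts, so a torn in-place write can
 * be repaired from the copy during recovery.
 */
static int
dw_write_page(int dw_fd, int data_fd, off_t blkoff, const char *page)
{
	if (pwrite(dw_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)		/* 1. copy */
		return -1;
	if (fsync(dw_fd) != 0)						/* 2. sync the copy */
		return -1;
	if (pwrite(data_fd, page, PAGE_SIZE, blkoff) != PAGE_SIZE)	/* 3. real location */
		return -1;
	return fsync(data_fd);						/* 4. sync the data file */
}

Recovery would scan the double-write file first: any page whose copy there is intact but whose in-place version is damaged gets rewritten from the copy before WAL replay begins, matching Robert's point that the double-written versions are restored and then WAL replay takes over.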
Simon, all, * Simon Riggs (simon@2ndQuadrant.com) wrote: > (1) report all errors on a page, including errors that don't change > PostgreSQL data. This involves checksumming long strings of zeroes, > which the checksum algorithm can't tell apart from long strings of > ones. Do we actually know when/where it's supposed to be all zeros, and hence could we check for that explicitly? If we know what it's supposed to be, in order to be consistent with other information, I could certainly see value in actually checking that. I don't think that's valuable enough to go breaking abstraction layers or bending over backwards to do it though. If we don't have the knowledge, at the right level, that the data should all be zeros then including those pieces in the CRC certainly makes sense to me. Just my 2c. Thanks, Stephen
Simon Riggs wrote: > My focus was on getting something working first, then tuning. If > we're agreed that we have everything apart from the tuning then we > can proceed with tests to see which works better. Sure. I just think you are there already except for what I got into. FWIW, moving the modulus application out of the loop is a very trivial change and has no effect on the results; it's strictly a performance issue. -Kevin
Alvaro Herrera wrote: > We don't support BLCKSZ higher than 32k anyway Thanks for pointing that out. Then I think we should declare sum1 to be uint and sum2 to be uint64. We can take the "% 255" out from where it sits in the v2 patch, and just add something like this after the sums are generated: sum1 %= 255; sum2 %= 255; Or, of course, we can just do it on the line where we combine the two sums for the final result, if that's not too hard to read. -Kevin
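Put together, the loop Kevin is sketching would look something like this (an illustration, not the patch's PageSetVerificationInfo(); the function name is invented, and it relies on the input being at most 16 MB so the deferred modulo cannot overflow, per the bounds above):

#include <stddef.h>
#include <stdint.h>

static uint16_t
fletcher16_deferred(const uint8_t *data, size_t len)
{
	uint32_t	sum1 = 0;		/* fits: len * 255 < 2^32 for len <= 16 MB */
	uint64_t	sum2 = 0;		/* fits: 255 * len * (len + 1) / 2 < 2^64 */
	size_t		i;

	for (i = 0; i < len; i++)
	{
		sum1 += data[i];
		sum2 += sum1;
	}

	/* the modulo 255 reductions happen once, after the loop */
	sum1 %= 255;
	sum2 %= 255;

	return (uint16_t) ((sum2 << 8) | sum1);
}

Deferring the reduction is equivalent to reducing on every iteration because addition commutes with the modulo; only the overflow bounds matter, which is why the sum widths discussed above are the whole story.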
On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote: > Simon, all, > > * Simon Riggs (simon@2ndQuadrant.com) wrote: >> (1) report all errors on a page, including errors that don't change >> PostgreSQL data. This involves checksumming long strings of zeroes, >> which the checksum algorithm can't tell apart from long strings of >> ones. > > Do we actually know when/where it's supposed to be all zeros, and hence > could we check for that explicitly? If we know what it's supposed to > be, in order to be consistent with other information, I could certainly > see value in actually checking that. Yes, we can. Excellent suggestion, will implement. That means we can keep the CRC calc fast as well as check the whole of the page inbound. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 8:31 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> When we reach a restartpoint, we fsync everything down to disk and >> then nuke the double-write buffer. > > I think we add to the double-write buffer as we write pages from the > buffer to disk. I don't think it makes sense to do potentially > repeated writes of the same page with different contents to the > double-write buffer as we go; nor is it a good idea to leave the page > unsynced and let the double-write buffer grow for a long time. You may be right. Currently, though, we only fsync() at end-of-checkpoint. So we'd have to think about what to fsync, and how often, to keep the double-write buffer to a manageable size. I can't help thinking that any extra fsyncs are pretty expensive, though, especially if you have to fsync() every file that's been double-written before clearing the buffer. Possibly we could have 2^N separate buffers based on an N-bit hash of the relfilenode and segment number, so that we could just fsync 1/(2^N)-th of the open files at a time. But even that sounds expensive: writing back lots of dirty data isn't cheap. One of the systems I've been doing performance testing on can sometimes take >15 seconds to write a shutdown checkpoint, and I'm sure that other people have similar (and worse) problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
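Robert's 2^N-bucket idea might look something like the following; the names, the hash, and N are all invented here for illustration, and nothing like this exists in the patches under discussion.

#include <stdint.h>

#define DW_BUCKET_BITS	3
#define DW_NUM_BUCKETS	(1 << DW_BUCKET_BITS)

/*
 * Pick one of 2^N double-write buffers from the relfilenode and segment
 * number, so that flushing a single bucket only requires fsyncing the
 * files that hashed into it.
 */
static unsigned
dw_bucket(uint32_t relfilenode, uint32_t segno)
{
	uint32_t	h = relfilenode * 2654435761u;	/* multiplicative hash */

	h ^= segno * 0x9E3779B9u;
	return h & (DW_NUM_BUCKETS - 1);
}

Flushing one bucket then only requires fsyncing the files whose pages hashed into it, at the cost of more, smaller fsync batches.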
Robert Haas <robertmhaas@gmail.com> wrote: > we only fsync() at end-of-checkpoint. So we'd have to think about > what to fsync, and how often, to keep the double-write buffer to a > manageable size. I think this is the big tuning challenge with this technology. > I can't help thinking that any extra fsyncs are pretty expensive, > though, especially if you have to fsync() every file that's been > double-written before clearing the buffer. Possibly we could have > 2^N separate buffers based on an N-bit hash of the relfilenode and > segment number, so that we could just fsync 1/(2^N)-th of the open > files at a time. I'm not sure I'm following -- we would just be fsyncing those files we actually wrote pages into, right? Not all segments for the table involved? > But even that sounds expensive: writing back lots of dirty data > isn't cheap. One of the systems I've been doing performance > testing on can sometimes take >15 seconds to write a shutdown > checkpoint, Consider the relation-file fsyncs for double-write as a form of checkpoint spreading, and maybe it won't seem so bad. It should make that shutdown checkpoint less painful. Now, I have been thinking that on a write-heavy system you had better have a BBU write-back cache, but that's my recommendation, anyway. > and I'm sure that other people have similar (and worse) problems. Well, I have no doubt that this feature should be optional. Those who prefer can continue to do full-page writes to the WAL, instead. Or take the "running with scissors" approach. -Kevin
On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> we only fsync() at end-of-checkpoint. So we'd have to think about >> what to fsync, and how often, to keep the double-write buffer to a >> manageable size. > > I think this is the big tuning challenge with this technology. One of them, anyway. I think it may also be tricky to make sure that a backend that needs to write a dirty buffer doesn't end up having to wait for a double-write to be fsync'd. >> I can't help thinking that any extra fsyncs are pretty expensive, >> though, especially if you have to fsync() every file that's been >> double-written before clearing the buffer. Possibly we could have >> 2^N separate buffers based on an N-bit hash of the relfilenode and >> segment number, so that we could just fsync 1/(2^N)-th of the open >> files at a time. > > I'm not sure I'm following -- we would just be fsyncing those files > we actually wrote pages into, right? Not all segments for the table > involved? Yes. >> But even that sounds expensive: writing back lots of dirty data >> isn't cheap. One of the systems I've been doing performance >> testing on can sometimes take >15 seconds to write a shutdown >> checkpoint, > > Consider the relation-file fsyncs for double-write as a form of > checkpoint spreading, and maybe it won't seem so bad. It should > make that shutdown checkpoint less painful. Now, I have been > thinking that on a write-heavy system you had better have a BBU > write-back cache, but that's my recommendation, anyway. I think this point has possibly been beaten to death, but at the risk of belaboring the point I'll bring it up again: the frequency with which we fsync() is basically a trade-off between latency and throughput. If you fsync a lot, then each one will be small, so you shouldn't experience much latency, but throughput might suck. If you don't fsync very much, then you maximize the chances for write-combining (because inserting an fsync between two writes to the same block forces that block to be physically written twice rather than just once) thus improving throughput, but when you do get around to calling fsync() there may be a lot of data to write all at once, and you may get a gigantic latency spike. As far as I can tell, one fsync per checkpoint is the theoretical minimum, and that's what we do now. So our current system is optimized for throughput. The decision to put full-page images into WAL rather than a separate buffer is essentially turning the dial in the same direction, because, in effect, the double-write fsync piggybacks on the WAL fsync which we must do anyway. So both the decision to use a double-write buffer AT ALL and the decision to fsync more frequently to keep that buffer to a manageable size are going to result in turning that dial in the opposite direction. It seems to me inevitable that, even with the best possible implementation, throughput will get worse. With a good implementation but not a bad one, latency should improve. Now, this is not necessarily a reason to reject the idea. I believe that several people have proposed that our current implementation is *overly* optimized for throughput *at the expense of* latency, and that we might want to provide some options that, in one way or another, fsync more frequently, so that checkpoint spikes aren't as bad. But when it comes time to benchmark, we might need to think somewhat carefully about what we're testing... 
Another thought here is that double-writes may not be the best solution, and are almost certainly not the easiest-to-implement solution. We could instead do something like this: when an unlogged change is made to a buffer (e.g. a hint bit is set), we set a flag on the buffer header. When we evict such a buffer, we emit a WAL record that just overwrites the whole buffer with a new FPI. There are some pretty obvious usage patterns where this is likely to be painful (e.g. load a big table without setting hint bits, and then seq-scan it). But there are also many use cases where the working set fits inside shared buffers and data pages don't get written very often, apart from checkpoint time, and those cases might work just fine. Also, the cases that are problems for this implementation are likely to also be problems for a double-write based implementation, for exactly the same reasons: if you discover at buffer eviction time that you need to fsync something (whether it's WAL or DW), it's going to hurt. Checksums aren't free even when using double-writes: if you don't have checksums, pages that have only hint bit-changes don't need to be double-written. If double writes aren't going to give us anything "for free", maybe that's not the right place to be focusing our efforts... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > I think it may also be tricky to make sure that a backend that > needs to write a dirty buffer doesn't end up having to wait for a > double-write to be fsync'd. This and other parts of your post seem to ignore the BBU write-back cache. Multiple fsyncs of a single page can still be collapsed at that level to a single actual disk write. In fact, I rather doubt this technology will look very good on machines without write-back caching. I'm not as sure as you are that this is a net loss in throughput, either. When the fsync storm clogs the RAID controller, even reads stall, so something which more evenly pushes writes to disk might avoid these non-productive pauses. I think that could improve throughput enough to balance or exceed the other effects. Maybe. I agree we need to be careful to craft a good set of benchmarks here. > Checksums aren't free even when using double-writes: if you don't > have checksums, pages that have only hint bit-changes don't need > to be double-written. Agreed. Checksums aren't expected to be free under any circumstances. I'm expecting DW to be slightly faster than FPW in general, with or without in-page checksums. > If double writes aren't going to give us anything "for free", > maybe that's not the right place to be focusing our > efforts... I'm not sure why it's not enough that they improve performance over the alternative. Making some other feature with obvious overhead "free" seems an odd requirement to hang on this. (Maybe I'm misunderstanding you on that point?) -Kevin
On Wed, Jan 4, 2012 at 3:51 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> If double writes aren't going to give us anything "for free", >> maybe that's not the right place to be focusing our >> efforts... > > I'm not sure why it's not enough that they improve performance over > the alternative. Making some other feature with obvious overhead > "free" seems an odd requirement to hang on this. (Maybe I'm > misunderstanding you on that point?) Well, this thread is nominally about checksums, but here we are talking about double writes, so I thought we were connecting those features in some way? Certainly, life is easier if we can develop them completely separately - but checksums really ought to come with some sort of solution to the problem of a torn-page with hint bit changes, IMO, and I thought that's why were thinking so hard about DW just now. Maybe I'm confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 4, 2012 at 3:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote: >> Simon, all, >> >> * Simon Riggs (simon@2ndQuadrant.com) wrote: >>> (1) report all errors on a page, including errors that don't change >>> PostgreSQL data. This involves checksumming long strings of zeroes, >>> which the checksum algorithm can't tell apart from long strings of >>> ones. >> >> Do we actually know when/where it's supposed to be all zeros, and hence >> could we check for that explicitly? If we know what it's supposed to >> be, in order to be consistent with other information, I could certainly >> see value in actually checking that. > > Yes, we can. Excellent suggestion, will implement. No, we can't. I discover that non-all-zeroes holes are fairly common, just not very frequent. That may or may not be a problem, but not something to be dealt with here and now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon Riggs wrote: > >> My focus was on getting something working first, then tuning. If >> we're agreed that we have everything apart from the tuning then we >> can proceed with tests to see which works better. > > Sure. I just think you are there already except for what I got into. > > FWIW, moving the modulus application out of the loop is a very > trivial change and has no affect on the results; it's strictly a > performance issue. New version attached, with your suggested changes included. Hole check code is there as well, but ifdef'd out since it isn't a valid check in all cases. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
* Simon Riggs (simon@2ndQuadrant.com) wrote: > I discover that non-all-zeroes holes are fairly common, just not very frequent. Curious, might be interesting to find out why. > That may or may not be a problem, but not something to be dealt with > here and now. But I agree that it's not the job of this patch/effort. It sounds like we have clear indication, however, that those areas, as they are not necessarily all zeros, should be included in the checksum. Thanks, Stephen
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Simon Riggs (simon@2ndQuadrant.com) wrote: >> I discover that non-all-zeroes holes are fairly common, just not very frequent. > > Curious, might be interesting to find out why. > >> That may or may not be a problem, but not something to be dealt with >> here and now. > > But I agree that it's not the job of this patch/effort. It sounds like > we have clear indication, however, that those areas, as they are not > necessarily all zeros, should be included in the checksum. Disagree. Full page writes ignore the hole, so it's appropriate to do so here also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote: > On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote: > > * Simon Riggs (simon@2ndQuadrant.com) wrote: > >> I discover that non-all-zeroes holes are fairly common, just not very > >> frequent. > > > > Curious, might be interesting to find out why. > > > >> That may or may not be a problem, but not something to be dealt with > >> here and now. > > > > But I agree that it's not the job of this patch/effort. It sounds like > > we have clear indication, however, that those areas, as they are not > > necessarily all zeros, should be included in the checksum. > > Disagree. Full page writes ignore the hole, so it's appropriate to do > so here also. Well, ignoring them in fpw has clear space benefits. Ignoring them while checksumming doesn't have that much of a benefit. Andres
On Thu, Jan 5, 2012 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Sure. I just think you are there already except for what I got into. > New version attached, with your suggested changes included. Hole check > code is there as well, but ifdef'd out since it isn't a valid check in > all cases. Following private discussions, Kevin showed me the patch at v3 was inadequate because it didn't cater for torn pages as a result of hint bit writes. The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). Dan Scales also observed some cases where private writes aren't checksummed, e.g. index build. Those suggestions are still outstanding from me, but the v4 is worthy of more serious review. I'll now turn my attention more fully onto the DW patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote: > The following patch (v4) introduces a new WAL record type that writes > backup blocks for the first hint on a block in any checkpoint that has > not previously been changed. IMHO this fixes the torn page problem > correctly, though at some additional loss of performance but not the > total catastrophe some people had imagined. Specifically we don't need > to log anywhere near 100% of hint bit settings, much more like 20-30% > (estimated not measured). Well, but it will hurt in those cases where hint bits already hurt hugely. Which is for example when accessing huge amounts of data the first time after loading it. Which is not exactly uncommon... Andres
On 06.01.2012 20:26, Simon Riggs wrote: > The following patch (v4) introduces a new WAL record type that writes > backup blocks for the first hint on a block in any checkpoint that has > not previously been changed. IMHO this fixes the torn page problem > correctly, though at some additional loss of performance but not the > total catastrophe some people had imagined. Specifically we don't need > to log anywhere near 100% of hint bit settings, much more like 20-30% > (estimated not measured). How's that going to work during recovery? Like in hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: > On 06.01.2012 20:26, Simon Riggs wrote: > > The following patch (v4) introduces a new WAL record type that writes > > backup blocks for the first hint on a block in any checkpoint that has > > not previously been changed. IMHO this fixes the torn page problem > > correctly, though at some additional loss of performance but not the > > total catastrophe some people had imagined. Specifically we don't need > > to log anywhere near 100% of hint bit settings, much more like 20-30% > > (estimated not measured). > > How's that going to work during recovery? Like in hot standby. How's recovery a problem? Unless I miss something that doesn't actually introduce a new possibility to transport hint bits to the standby (think fpw's). A new transport will obviously increase traffic but ... Andres
On Fri, Jan 6, 2012 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote: > On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote: >> The following patch (v4) introduces a new WAL record type that writes >> backup blocks for the first hint on a block in any checkpoint that has >> not previously been changed. IMHO this fixes the torn page problem >> correctly, though at some additional loss of performance but not the >> total catastrophe some people had imagined. Specifically we don't need >> to log anywhere near 100% of hint bit settings, much more like 20-30% >> (estimated not measured). > Well, but it will hurt in those cases where hint bits already hurt hugely. > Which is for example when accessing huge amounts of data the first time after > loading it. Which is not exactly uncommon... No, but I'd rather see us commit a checksum patch that sometimes carries a major performance penalty and then work to reduce that penalty later than never commit anything at all. I think it's unrealistic to assume we're going to get this perfectly right in one try. I am not sure whether this particular patch is close enough for government work or still needs more hacking, and it may well be the latter, but there is no use holding our breath and waiting for absolute perfection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote: > On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: >> On 06.01.2012 20:26, Simon Riggs wrote: >> > The following patch (v4) introduces a new WAL record type that writes >> > backup blocks for the first hint on a block in any checkpoint that has >> > not previously been changed. IMHO this fixes the torn page problem >> > correctly, though at some additional loss of performance but not the >> > total catastrophe some people had imagined. Specifically we don't need >> > to log anywhere near 100% of hint bit settings, much more like 20-30% >> > (estimated not measured). >> >> How's that going to work during recovery? Like in hot standby. > How's recovery a problem? Unless I miss something that doesn't actually > introduce a new possibility to transport hint bits to the standby (think > fpw's). A new transport will obviously increase traffic but ... The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, January 06, 2012 08:53:38 PM Robert Haas wrote: > On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote: > > On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: > >> On 06.01.2012 20:26, Simon Riggs wrote: > >> > The following patch (v4) introduces a new WAL record type that writes > >> > backup blocks for the first hint on a block in any checkpoint that has > >> > not previously been changed. IMHO this fixes the torn page problem > >> > correctly, though at some additional loss of performance but not the > >> > total catastrophe some people had imagined. Specifically we don't need > >> > to log anywhere near 100% of hint bit settings, much more like 20-30% > >> > (estimated not measured). > >> > >> How's that going to work during recovery? Like in hot standby. > > > > How's recovery a problem? Unless I miss something that doesn't actually > > introduce a new possibility to transport hint bits to the standby (think > > fpw's). A new transport will obviously increase traffic but ... > > The standby can set hint bits locally that weren't set on the data it > received from the master. This will require rechecksumming and > rewriting the page, but obviously we can't write the WAL records > needed to protect those writes during recovery. So a crash could > create a torn page, invalidating the checksum. Err. Stupid me, thanks. > Ignoring checksum errors during Hot Standby operation doesn't fix it, > either, because eventually you might want to promote the standby, and > the checksum will still be invalid. Its funny. I have the feeling we all are missing a very obvious brilliant solution to this... Andres
On Fri, Jan 6, 2012 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > No, but I'd rather see us commit a checksum patch that sometimes > carries a major performance penalty and then work to reduce that > penalty later than never commit anything at all. I think it's > unrealistic to assume we're going to get this perfectly right in one > try. I am not sure whether this particular patch is close enough for > government work or still needs more hacking, and it may well be the > latter, but there is no use holding our breath and waiting for > absolute perfection. Well said. My view completely. I do want jam tomorrow, I just want bread and butter today as well. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 3:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Since we ignore hints during HS anyway, No, we don't. We need to ignore visibility map bits, but we need not and do not ignore HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, etc. > not setting them seems OK if > checksums defined. Or we can recommend that you don't use checksums on > a standby. Whichever fits. > > Writing pages during recovery doesn't need WAL. If we crash, we replay > using the already generated WAL. Which is all fine, except when you start making changes that are not WAL-logged. Then, you have the same torn page problem that exists when you do it in normal running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: >> The standby can set hint bits locally that weren't set on the data it >> received from the master. This will require rechecksumming and >> rewriting the page, but obviously we can't write the WAL records >> needed to protect those writes during recovery. So a crash could >> create a torn page, invalidating the checksum. > Err. Stupid me, thanks. > >> Ignoring checksum errors during Hot Standby operation doesn't fix it, >> either, because eventually you might want to promote the standby, and >> the checksum will still be invalid. > Its funny. I have the feeling we all are missing a very obvious brilliant > solution to this... Like getting rid of hint bits? merlin
On Fri, Jan 6, 2012 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote: >> On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: >>> On 06.01.2012 20:26, Simon Riggs wrote: >>> > The following patch (v4) introduces a new WAL record type that writes >>> > backup blocks for the first hint on a block in any checkpoint that has >>> > not previously been changed. IMHO this fixes the torn page problem >>> > correctly, though at some additional loss of performance but not the >>> > total catastrophe some people had imagined. Specifically we don't need >>> > to log anywhere near 100% of hint bit settings, much more like 20-30% >>> > (estimated not measured). >>> >>> How's that going to work during recovery? Like in hot standby. >> How's recovery a problem? Unless I miss something that doesn't actually >> introduce a new possibility to transport hint bits to the standby (think >> fpw's). A new transport will obviously increase traffic but ... > > The standby can set hint bits locally that weren't set on the data it > received from the master. This will require rechecksumming and > rewriting the page, but obviously we can't write the WAL records > needed to protect those writes during recovery. So a crash could > create a torn page, invalidating the checksum. > > Ignoring checksum errors during Hot Standby operation doesn't fix it, > either, because eventually you might want to promote the standby, and > the checksum will still be invalid. Since we ignore hints during HS anyway, not setting them seems OK if checksums defined. Or we can recommend that you don't use checksums on a standby. Whichever fits. Writing pages during recovery doesn't need WAL. If we crash, we replay using the already generated WAL. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote: >>> The standby can set hint bits locally that weren't set on the data it >>> received from the master. This will require rechecksumming and >>> rewriting the page, but obviously we can't write the WAL records >>> needed to protect those writes during recovery. So a crash could >>> create a torn page, invalidating the checksum. >> Err. Stupid me, thanks. >> >>> Ignoring checksum errors during Hot Standby operation doesn't fix it, >>> either, because eventually you might want to promote the standby, and >>> the checksum will still be invalid. >> Its funny. I have the feeling we all are missing a very obvious brilliant >> solution to this... > > Like getting rid of hint bits? Or even just not bothering to consider them as making buffers dirty, so the only writes are already protected by the double-write (WAL, or if they get some DW outside of WAL). I think I've said it before, but I'm guessing OLTP style database rarely have pages written that are dirty that aren't covered by real changes (so have the FPW anyways) and OLAP type generally freeze after loads to avoid the hint-bit-write penalty too... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Its funny. I have the feeling we all are missing a very obvious brilliant >> solution to this... > > Like getting rid of hint bits? No. First, that solution has been proposed before, and it crashed and burned before. You may as well propose getting rid of VACUUM. In each case, the supposed problem is in fact a cure for a much more severe problem that would quickly make you long for the days when you had the first one. I think you (or someone else?) had a fairly well-developed patch for blunting the impact of hint bit setting, and that we might be able to do if you (or someone) wants to finish it. But getting rid of them altogether isn't likely, not because PostgreSQL developers are an intractable herd of mule-headed idiots (although I have been known to be all of those things on occasion) but because the performance characteristics demonstrably suck, as mostly recently demonstrated by commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b, which significantly improved performance with synchronous_commit=off by shaving about an average of one-tenth of one second off the time before hint bits could be set on any given tuple. Second, even if you did it, it wouldn't fix this problem, because hint bits aren't the only changes we make without WAL logging. In particular, when an index scan runs across a tuple that turns out to be dead, we kill the index tuple so that future scans don't need to revisit it. I haven't actually done any testing to measure how significant this is, but I wouldn't bet on it not mattering. I think it would be wonderful to come up with a design that either blunted the effect of hint bits or eliminated them altogether, but the idea that there's an obvious way forward there that we've simply refused to pursue is, AFAICT, flat wrong. Let's not get sidetracked into talking about things that aren't going to change in time for 9.2 (and probably not 9.3, either, given that no one seems to be putting any time into it) and aren't even feasible solutions to the problem at hand (checksums) anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 6, 2012 at 6:48 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > I think I've said it before, but I'm guessing OLTP style database > rarely have pages written that are dirty that aren't covered by real > changes (so have the FPW anyways) and OLAP type generally freeze after > loads to avoid the hint-bit-write penalty too... But ya, again, I've never measured ;-) -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, Jan 6, 2012 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Writing pages during recovery doesn't need WAL. If we crash, we replay >> using the already generated WAL. > > Which is all fine, except when you start making changes that are not > WAL-logged. Then, you have the same torn page problem that exists > when you do it in normal running. Yes, of course. My point is that this is not a blocker to using checksums, only that some actions cannot occur on the standby but the replay of changes is not in danger. page_checksums is an optional parameter, so you can turn it on or off on the standby as you wish. People frequently have a standby dedicated to HA and other standbys for queries. So this is all normal and natural. page_checksums will default to 'off' in the final patch anyway, in my understanding. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 07.01.2012 12:14, Simon Riggs wrote: > page_checksums is an optional parameter, so you can turn it on or off > on the standby as you wish. People frequently have a standby dedicated > to HA and other standbys for queries. So this is all normal and > natural. There needs to be a well-documented way of turning it on/off. In particular, from off to on. If you ever flip it off even for a minute, pages with invalid checksums start to appear in the data directory, and if you then turn it on again, you start getting errors. Perhaps there needs to be a third setting, calculate-but-dont-verify, where CRCs are updated but existing CRCs are not expected to be correct. And a utility to scan through your database and fix any incorrect CRCs, so that after that's run in calculate-but-dont-verify mode, you can safely turn checksums on. Even better would be a way to make that process robust so that you can't do a pilot error and turn page_checksums 'on' when it's not safe to do so. Otherwise when you get a checksum error, your first thought is going to be "hmm, I wonder if anyone ever turned page_checksums 'off' by accident anytime in the past, or if this is a genuine error". > page_checksums will default to 'off' in the final patch anyway, in my > understanding. That makes the need for such a facility even more important. Otherwise there's no safe way to ever switch it from off to on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
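Heikki's calculate-but-don't-verify suggestion amounts to making the setting three-valued rather than boolean, roughly as below. This is purely illustrative; the patch as posted has only a boolean page_checksums, and the enum name is invented.

typedef enum
{
	PAGE_CHECKSUMS_OFF,			/* neither calculate nor verify */
	PAGE_CHECKSUMS_CALCULATE,	/* stamp checksums on write, do not verify on read */
	PAGE_CHECKSUMS_ON			/* stamp on write and verify on read */
} PageChecksumMode;

A scrubbing utility of the kind discussed later in the thread could then run while the middle state is active, rewriting pages until every checksum is valid, after which the setting can safely be moved to 'on'.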
On Sat, Jan 7, 2012 at 10:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > There needs to be a well-documented way of turning it on/off. In particular, > from off to on. There is.... in the patch. The checksum field is optional, as is the parameter. If page_checksums is on, we write a checksum and it is correct. We also validate any checksums we see. If page_checksums is off we clear the checksum on write, so an incorrect checksum is never written. So there isn't any problem with there being incorrect checksums on blocks and you can turn the parameter on and off as often and as easily as you want. I think it can be USERSET but I wouldn't want to encourage users to see turning it off as a performance tuning feature. If the admin turns it on for the server, its on, so its SIGHUP. Any holes in that I haven't noticed? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
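The rule Simon spells out -- stamp the checksum when the parameter is on, clear it when the parameter is off, and verify on read only when a checksum is actually present -- can be modelled in miniature like this. It is a toy, not PostgreSQL's page layout or API: the struct, names, and flag bit are invented, and the placeholder sum stands in for the Fletcher-16 used by the patch.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HAS_CHECKSUM 0x8000

typedef struct
{
	uint16_t	flags;
	uint16_t	checksum;
	uint8_t		data[8188];
} ToyPage;

/* placeholder checksum over the payload; the real patch uses Fletcher-16 */
static uint16_t
toy_calc(const ToyPage *p)
{
	uint32_t	s = 0;
	size_t		i;

	for (i = 0; i < sizeof(p->data); i++)
		s = (s * 31 + p->data[i]) & 0xFFFF;
	return (uint16_t) s;
}

/* called just before the page goes to disk */
static void
toy_prepare_for_write(ToyPage *p, bool page_checksums)
{
	if (page_checksums)
	{
		p->flags |= TOY_HAS_CHECKSUM;
		p->checksum = toy_calc(p);
	}
	else
	{
		p->flags &= ~TOY_HAS_CHECKSUM;	/* never leave a stale checksum behind */
		p->checksum = 0;
	}
}

/* called when the page is read back in */
static bool
toy_verify_on_read(const ToyPage *p)
{
	if (!(p->flags & TOY_HAS_CHECKSUM))
		return true;			/* page carries no checksum; nothing to check */
	return p->checksum == toy_calc(p);
}

Because the flag and value are cleared together whenever the parameter is off, a block written in either mode never fails verification later, which is the property Simon is relying on.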
On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > So there isn't any problem with there being incorrect checksums on > blocks and you can turn the parameter on and off as often and as > easily as you want. I think it can be USERSET but I wouldn't want to > encourage users to see turning it off as a performance tuning feature. > If the admin turns it on for the server, its on, so its SIGHUP. > > Any holes in that I haven't noticed? And of course, as soon as I wrote that I thought of the problem. We mustn't make a write that hasn't been covered by a FPW, so we must know ahead of time whether to WAL log hints or not. We can't simply turn it on/off any longer, now that we have to WAL log hint bits also. So thanks for making me think of that. We *could* make it turn on/off at each checkpoint, but its easier just to say that it can be turned on/off at server start. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> So there isn't any problem with there being incorrect checksums on >> blocks and you can turn the parameter on and off as often and as >> easily as you want. I think it can be USERSET but I wouldn't want to >> encourage users to see turning it off as a performance tuning feature. >> If the admin turns it on for the server, its on, so its SIGHUP. >> >> Any holes in that I haven't noticed? > > And of course, as soon as I wrote that I thought of the problem. We > mustn't make a write that hasn't been covered by a FPW, so we must > know ahead of time whether to WAL log hints or not. We can't simply > turn it on/off any longer, now that we have to WAL log hint bits also. > So thanks for making me think of that. > > We *could* make it turn on/off at each checkpoint, but its easier just > to say that it can be turned on/off at server start. Attached patch v6 now handles hint bits and checksums correctly, following Heikki's comments. In recovery, setting a hint doesn't dirty a block if it wasn't already dirty. So we can write some hints, and we can set others but not write them. Lots of comments in the code. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Jan 6, 2012, at 4:36 AM, Andres Freund wrote: > On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote: >> On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote: >>> * Simon Riggs (simon@2ndQuadrant.com) wrote: >>>> I discover that non-all-zeroes holes are fairly common, just not very >>>> frequent. >>> >>> Curious, might be interesting to find out why. >>> >>>> That may or may not be a problem, but not something to be dealt with >>>> here and now. >>> >>> But I agree that it's not the job of this patch/effort. It sounds like >>> we have clear indication, however, that those areas, as they are not >>> necessarily all zeros, should be included in the checksum. >> >> Disagree. Full page writes ignore the hole, so it's appropriate to do >> so here also. > Well, ignoring them in fpw has clear space benefits. Ignoring them while > checksumming doesn't have that much of a benefit. I agree with Andres... we should checksum zero bytes, because if they're screwed up then something is wrong with your system, even if you got lucky with what data got trashed. As I mentioned before, 2 separate checksums would be nice, but if we can't have that I think we need to fail on any checksum error. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 1/7/12 5:26 AM, Heikki Linnakangas wrote: > Perhaps there > needs to be a third setting, calculate-but-dont-verify, where CRCs are > updated but existing CRCs are not expected to be correct. And a utility > to scan through your database and fix any incorrect CRCs, so that after > that's run in calculate-but-dont-verify mode, you can safely turn > checksums on. The users of any feature like this are eventually going to demand a "scrubbing" utility that validates the checksums in a background task. Doing something like pg_dump just to validate your copy of the data is still clean is not a great approach. If you accept that sort of utility is inevitable, you might as well presume it could handle this sort of task eventually too. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 12/30/11 9:44 AM, Aidan Van Dyk wrote: > So moving to this new double-write-area bandwagon, we move from a "WAL > FPW synced at the commit, collect as many other writes, then final > sync" type system to a system where *EVERY* write requires syncs of 2 > separate 8K writes at buffer write-out time. It's not quite that bad. The double-write area is going to be a small chunk of re-used sequential I/O, like the current WAL. And if this approach shifts some of the full-page writes out of the WAL and toward the new area instead, that's not a real doubling either. Could probably put both on the same disk, and in situations where you don't have a battery-backed write cache it's possible to get a write to both per rotation. This idea has been tested pretty extensively as part of MySQL's Innodb engine. Results there suggest the overhead is in the 5% to 30% range; some examples mentioning both extremes of that: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ http://www.bigdbahead.com/?p=392 Makes me wish I knew off the top of my head how expensive WAL logging hint bits would be, for comparison sake. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >>> So there isn't any problem with there being incorrect checksums on >>> blocks and you can turn the parameter on and off as often and as >>> easily as you want. I think it can be USERSET but I wouldn't want to >>> encourage users to see turning it off as a performance tuning feature. >>> If the admin turns it on for the server, its on, so its SIGHUP. >>> >>> Any holes in that I haven't noticed? >> >> And of course, as soon as I wrote that I thought of the problem. We >> mustn't make a write that hasn't been covered by a FPW, so we must >> know ahead of time whether to WAL log hints or not. We can't simply >> turn it on/off any longer, now that we have to WAL log hint bits also. >> So thanks for making me think of that. >> >> We *could* make it turn on/off at each checkpoint, but its easier just >> to say that it can be turned on/off at server start. > > Attached patch v6 now handles hint bits and checksums correctly, > following Heikki's comments. > > In recovery, setting a hint doesn't dirty a block if it wasn't already > dirty. So we can write some hints, and we can set others but not write > them. > > Lots of comments in the code. v7 * Fixes merge conflict * Minor patch cleanups * Adds checksum of complete page including hole * Calcs checksum in mdwrite() so we pickup all non-shared buffer writes also Robert mentioned to me there were outstanding concerns on this patch. I know of none, and have double checked the thread to confirm all concerns are fully addressed. Adding to CF. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the checksums are disabled. One could not simply condition them on the page_checksums setting, though. Suppose page_checksums=off and we're hinting a page previously written with page_checksums=on. If that write tears, leaving the checksum intact, that block will now fail verification. A couple of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. Whenever the cluster starts with checksums disabled, set the field to InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum failure occurs in a page with LSN older than the stored one, emit either a softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty the buffer. Here are other sites I noticed that do MarkBufferDirty() without bumping the page LSN:

	heap_xlog_visible()
	visibilitymap_clear()
	visibilitymap_truncate()
	lazy_scan_heap()
	XLogRecordPageWithFreeSpace()
	FreeSpaceMapTruncateRel()
	fsm_set_and_search()
	fsm_vacuum_page()
	fsm_search_avail()

Though I have not investigated each of these in detail, I suspect most or all represent continued torn-page hazards. Since the FSM is just a hint, we do not WAL-log it at all; it would be nicest to always skip checksums for it, too. The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on, we still verify its checksum. If a mostly-insert/read site uses checksums for some time and eventually wishes to shed the overhead, disabling the feature will not stop the costs for reads of old data. They would need a dump/reload to get the performance of a never-checksummed database. Let's either unconditionally skip checksum validation under page_checksums=off or add a new state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> + <para>
> + Turning this parameter off speeds normal operation, but
> + might allow data corruption to go unnoticed. The checksum uses
> + 16-bit checksums, using the fast Fletcher 16 algorithm. With this
> + parameter enabled there is still a non-zero probability that an error
> + could go undetected, as well as a non-zero probability of false
> + positives.
> + </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
> #include "postgres.h"
>
> #include "access/visibilitymap.h"
> +#include "access/transam.h"
> #include "access/xact.h"
> #include "access/xlogutils.h"
> #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
> /* XLOG gives us high 4 bits */
> #define XLOG_SMGR_CREATE 0x10
> #define XLOG_SMGR_TRUNCATE 0x20
> +#define XLOG_SMGR_HINT 0x40
>
> typedef struct xl_smgr_create
> {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
> smgrDoPendingDeletes(false);
> }
>
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define SMGR_HINT_WATERMARK 13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> + /*
> + * Make an XLOG entry reporting the hint
> + */
> + XLogRecPtr lsn;
> + XLogRecData rdata[2];
> + int watermark = SMGR_HINT_WATERMARK;
> +
> + /*
> + * Not allowed to have zero-length records, so use a small watermark
> + */
> + rdata[0].data = (char *) (&watermark);
> + rdata[0].len = sizeof(int);
> + rdata[0].buffer = InvalidBuffer;
> + rdata[0].buffer_std = false;
> + rdata[0].next = &(rdata[1]);
> +
> + rdata[1].data = NULL;
> + rdata[1].len = 0;
> + rdata[1].buffer = buffer;
> + rdata[1].buffer_std = true;
> + rdata[1].next = NULL;
> +
> + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> + /*
> + * Set the page LSN if we wrote a backup block.
> + */
> + if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> + {
> + Page page = BufferGetPage(buffer);
> + PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding is insufficient. Need a PageSetTLI() here, too.

> + elog(LOG, "inserted backup block for hint bit");
> + }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
> (BM_DIRTY | BM_JUST_DIRTIED))
> {
> + /*
> + * If we're writing checksums and we care about torn pages then we
> + * cannot dirty a page during recovery as a result of a hint.
> + * We can set the hint, just not dirty the page as a result.
> + *
> + * See long discussion in bufpage.c
> + */
> + if (HintsMustNotDirtyPage())
> + return;

This expands to

	if (page_checksums && fullPageWrites && RecoveryInProgress())

If only page_checksums is false, smgr_buffer_hint() can attempt to insert a WAL record during recovery.
> + * > + * XXX rename PGPROC variable later; keep it same now for clarity > + */ > + MyPgXact->inCommit = true; > + smgr_buffer_hint(buffer); > + > LockBufHdr(bufHdr); > Assert(bufHdr->refcount > 0); > if (!(bufHdr->flags & BM_DIRTY)) > @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) > } > bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); > UnlockBufHdr(bufHdr); > + MyPgXact->inCommit = false; > } > } > > diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c > index 096d36a..a220310 100644 > --- a/src/backend/storage/buffer/localbuf.c > +++ b/src/backend/storage/buffer/localbuf.c > @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > /* Find smgr relation for buffer */ > oreln = smgropen(bufHdr->tag.rnode, MyBackendId); > > + /* XXX do we want to write checksums for local buffers? An option? */ > + Don't we have that, now that it happens in mdwrite()? I can see value in an option to exclude local buffers, since corruption there may be less exciting. It doesn't seem important for an initial patch, though. > --- a/src/backend/storage/page/bufpage.c > +++ b/src/backend/storage/page/bufpage.c > + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22. I get a 404 for that link. > --- a/src/include/storage/bufpage.h > +++ b/src/include/storage/bufpage.h > -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ > +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */ The comment is not consistent with the character of the value. Thanks, nm
Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were ever another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()?

You may have already noticed this, but a bunch of the comments are incorrect, now that you have moved the checksum calculation to mdwrite().

config.sgml - says writes via temp_buffers (which I think means local buffers) are not checksummed -- that's no longer true, right?

bufmgr.c, line 1914 - bufToWrite no longer exists. You could revert changes from 1912-1920

localbuf.c, line 203 - as mentioned above, this comment is no longer relevant, you are checksumming local buffers now

Dan
The checksum uses > + 16-bit checksums, using the fast Fletcher 16 algorithm. With this > + parameter enabled there is still a non-zero probability that an error > + could go undetected, as well as a non-zero probability of false > + positives. > + </para> What sources of false positives do you intend to retain? > --- a/src/backend/catalog/storage.c > +++ b/src/backend/catalog/storage.c > @@ -20,6 +20,7 @@ > #include "postgres.h" > > #include "access/visibilitymap.h" > +#include "access/transam.h" > #include "access/xact.h" > #include "access/xlogutils.h" > #include "catalog/catalog.h" > @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ > /* XLOG gives us high 4 bits */ > #define XLOG_SMGR_CREATE 0x10 > #define XLOG_SMGR_TRUNCATE 0x20 > +#define XLOG_SMGR_HINT 0x40 > > typedef struct xl_smgr_create > { > @@ -477,19 +479,74 @@ AtSubAbort_smgr(void) > smgrDoPendingDeletes(false); > } > > +/* > + * Write a backup block if needed when we are setting a hint. > + * > + * Deciding the "if needed" bit is delicate and requires us to either > + * grab WALInsertLock or check the info_lck spinlock. If we check the > + * spinlock and it says Yes then we will need to get WALInsertLock as well, > + * so the design choice here is to just go straight for the WALInsertLock > + * and trust that calls to this function are minimised elsewhere. > + * > + * Callable while holding share lock on the buffer content. > + * > + * Possible that multiple concurrent backends could attempt to write > + * WAL records. In that case, more than one backup block may be recorded > + * though that isn't important to the outcome and the backup blocks are > + * likely to be identical anyway. > + */ > +#define SMGR_HINT_WATERMARK 13579 > +void > +smgr_buffer_hint(Buffer buffer) > +{ > + /* > + * Make an XLOG entry reporting the hint > + */ > + XLogRecPtr lsn; > + XLogRecData rdata[2]; > + int watermark = SMGR_HINT_WATERMARK; > + > + /* > + * Not allowed to have zero-length records, so use a small watermark > + */ > + rdata[0].data = (char *) (&watermark); > + rdata[0].len = sizeof(int); > + rdata[0].buffer = InvalidBuffer; > + rdata[0].buffer_std = false; > + rdata[0].next = &(rdata[1]); > + > + rdata[1].data = NULL; > + rdata[1].len = 0; > + rdata[1].buffer = buffer; > + rdata[1].buffer_std = true; > + rdata[1].next = NULL; > + > + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata); > + > + /* > + * Set the page LSN if we wrote a backup block. > + */ > + if (!XLByteEQ(InvalidXLogRecPtr, lsn)) > + { > + Page page = BufferGetPage(buffer); > + PageSetLSN(page, lsn); PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding is insufficient. Need a PageSetTLI() here, too. > + elog(LOG, "inserted backup block for hint bit"); > + } > +} > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) > if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != > (BM_DIRTY | BM_JUST_DIRTIED)) > { > + /* > + * If we're writing checksums and we care about torn pages then we > + * cannot dirty a page during recovery as a result of a hint. > + * We can set the hint, just not dirty the page as a result. > + * > + * See long discussion in bufpage.c > + */ > + if (HintsMustNotDirtyPage()) > + return; This expands toif (page_checksums && fullPageWrites && RecoveryInProgress()) If only page_checksums is false, smgr_buffer_hint() can attempt to insert a WAL record during recovery. 
> + > + /* > + * Write a full page into WAL iff this is the first change on the > + * block since the last checkpoint. That will never be the case > + * if the block is already dirty because we either made a change > + * or set a hint already. Note that aggressive cleaning of blocks > + * dirtied by hint bit setting would increase the call rate. > + * Bulk setting of hint bits would reduce the call rate... > + * > + * We must issue the WAL record before we mark the buffer dirty. > + * Otherwise we might write the page before we write the WAL. > + * That causes a race condition, since a checkpoint might > + * occur between writing the WAL record and marking the buffer dirty. > + * We solve that with a kluge, but one that is already in use > + * during transaction commit to prevent race conditions. > + * Basically, we simply prevent the checkpoint WAL record from > + * being written until we have marked the buffer dirty. We don't > + * start the checkpoint flush until we have marked dirty, so our > + * checkpoint must flush the change to disk successfully or the > + * checkpoint never gets written, so crash recovery will set us right. > + * > + * XXX rename PGPROC variable later; keep it same now for clarity > + */ > + MyPgXact->inCommit = true; > + smgr_buffer_hint(buffer); > + > LockBufHdr(bufHdr); > Assert(bufHdr->refcount > 0); > if (!(bufHdr->flags & BM_DIRTY)) > @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) > } > bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); > UnlockBufHdr(bufHdr); > + MyPgXact->inCommit = false; > } > } > > diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c > index 096d36a..a220310 100644 > --- a/src/backend/storage/buffer/localbuf.c > +++ b/src/backend/storage/buffer/localbuf.c > @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > /* Find smgr relation for buffer */ > oreln = smgropen(bufHdr->tag.rnode, MyBackendId); > > + /* XXX do we want to write checksums for local buffers? An option? */ > + Don't we have that, now that it happens in mdwrite()? I can see value in an option to exclude local buffers, since corruption there may be less exciting. It doesn't seem important for an initial patch, though. > --- a/src/backend/storage/page/bufpage.c > +++ b/src/backend/storage/page/bufpage.c > + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22. I get a 404 for that link. > --- a/src/include/storage/bufpage.h > +++ b/src/include/storage/bufpage.h > -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ > +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */ The comment is not consistent with the character of the value. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales@vmware.com> wrote: > I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were ever another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()? The smgr and md layers don't currently know anything about the page format, and I imagine we want to keep it that way. It seems like the right place for this is in some higher layer, like bufmgr. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c). Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages. Dan ----- Original Message ----- From: "Robert Haas" <robertmhaas@gmail.com> To: "Dan Scales" <scales@vmware.com> Cc: "Noah Misch" <noah@leadboat.com>, "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Andres Freund" <andres@anarazel.de>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, david@fetter.org, aidan@highrise.ca, stark@mit.edu, pgsql-hackers@postgresql.org, "Simon Riggs" <simon@2ndquadrant.com> Sent: Friday, January 27, 2012 5:19:32 AM Subject: Re: [HACKERS] 16-bit page checksums for 9.2 On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales@vmware.com> wrote: > I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were ever another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()? The smgr and md layers don't currently know anything about the page format, and I imagine we want to keep it that way. It seems like the right place for this is in some higher layer, like bufmgr. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales <scales@vmware.com> wrote: > The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c) I'll have another look at that. Seems like we can make it various ways, we just need to decide the code placement. > Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages. You don't need to checksum the extend because no data is written at that point. It creates a new block that will become dirty at some point and then be written out, which will trigger the checksum. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 28.01.2012 15:49, Simon Riggs wrote: > On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales <scales@vmware.com> wrote: > >> Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages. > > You don't need to checksum the extend because no data is written at > that point. That's not correct. smgrextend writes a block of data just like smgrwrite does. When a relation is extended by the buffer manager, it calls smgrextend with an all-zeros block, but not all callers do that. See rewriteheap.c and nbtsort.c for counter-examples. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales <scales@vmware.com> wrote: > The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c) Maybe we should have some sort of wrapper function, then. I really dislike the idea that the smgr layer knows anything about the page format, and if md has to know that's even worse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
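To make the wrapper idea concrete, here is a minimal sketch of what such a bufmgr-level helper could look like; it is illustrative only, not the patch's code. FlushBufferChecksummed is an invented name, page_checksums stands in for the patch's GUC variable, and PageSetVerificationInfo is the patch's page-level routine, so smgr and md stay ignorant of the page format.

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

/* Hypothetical bufmgr-side helper: stamp the verification info into the
 * page image just before handing it to the storage manager. */
static void
FlushBufferChecksummed(SMgrRelation reln, ForkNumber forkNum,
                       BlockNumber blockNum, char *bufToWrite)
{
    if (page_checksums)
        PageSetVerificationInfo((Page) bufToWrite);   /* patch's function */

    smgrwrite(reln, forkNum, blockNum, bufToWrite, false);
}

Callers that currently issue raw smgrwrite()/smgrextend() calls (nbtsort.c, spginsert.c, rewriteheap.c) could go through the same helper, which would address Dan's bypass concern without teaching smgr anything about page layout.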
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote: > On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote: > > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote: > > > Why don't you use the same tricks as the former patch and copy the buffer, > > > compute the checksum on that, and then write out that copy (you can even do > > > both at the same time). I have a hard time believing that the additional copy > > > is more expensive than the locking. > > > > ISTM we can't write and copy at the same time because the cheksum is > > not a trailer field. > > Ofcourse you can. If the checksum is in the trailer field you get the > nice property that the whole block has a constant checksum. However, if > you store the checksum elsewhere you just need to change the checking > algorithm to copy the checksum out, zero those bytes and run the > checksum and compare with the extracted checksum. > > Not pretty, but I don't think it makes a difference in performence. Sorry to be late replying to this, but an interesting idea would be to zero the _hint_ bits before doing the CRC checksum. That would avoid the problem of WAL-logging the hint bits. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
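As a standalone illustration of the two ideas above (copy the stored checksum out and zero its bytes before summing, which is also where hint bits could be masked), here is a minimal sketch; the Fletcher-16 routine, the field offset and the helper names are assumptions, not the patch's code.

#include <stdint.h>
#include <string.h>

#define BLCKSZ      8192    /* assumed block size */
#define CHKSUM_OFF  8       /* assumed offset of the 16-bit checksum field */

/* Plain Fletcher-16 over a byte buffer; the patch's variant may differ. */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0, sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}

/* Verify a page whose checksum lives mid-header rather than in a trailer:
 * copy the stored value out, zero that field, and checksum the rest. */
static int
page_checksum_ok(const uint8_t *page)
{
    uint8_t  copy[BLCKSZ];
    uint16_t stored;

    memcpy(&stored, page + CHKSUM_OFF, sizeof(stored));
    memcpy(copy, page, BLCKSZ);
    memset(copy + CHKSUM_OFF, 0, sizeof(uint16_t));

    return fletcher16(copy, BLCKSZ) == stored;
}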
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote: > > Also, as far as I can see this patch usurps the page version field, > > which I find unacceptably short-sighted. Do you really think this is > > the last page layout change we'll ever make? > > No, I don't. I hope and expect the next page layout change to > reintroduce such a field. > > But since we're agreed now that upgrading is important, changing page > format isn't likely to be happening until we get an online upgrade > process. So future changes are much less likely. If they do happen, we > have some flag bits spare that can be used to indicate later versions. > It's not the prettiest thing in the world, but it's a small ugliness > in return for an important feature. If there was a way without that, I > would have chosen it. Have you considered the CRC might match a valid page version number? Is that safe? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote: >> > Also, as far as I can see this patch usurps the page version field, >> > which I find unacceptably short-sighted. Do you really think this is >> > the last page layout change we'll ever make? >> >> No, I don't. I hope and expect the next page layout change to >> reintroduce such a field. >> >> But since we're agreed now that upgrading is important, changing page >> format isn't likely to be happening until we get an online upgrade >> process. So future changes are much less likely. If they do happen, we >> have some flag bits spare that can be used to indicate later versions. >> It's not the prettiest thing in the world, but it's a small ugliness >> in return for an important feature. If there was a way without that, I >> would have chosen it. > > Have you considered the CRC might match a valuid page version number? > Is that safe? In the proposed scheme there are two flag bits set on the page to indicate whether the field is used as a checksum rather than a version number. So its possible the checksum could look like a valid page version number, but we'd still be able to tell the difference. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote: > On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote: > >> > Also, as far as I can see this patch usurps the page version field, > >> > which I find unacceptably short-sighted. Do you really think this is > >> > the last page layout change we'll ever make? > >> > >> No, I don't. I hope and expect the next page layout change to > >> reintroduce such a field. > >> > >> But since we're agreed now that upgrading is important, changing page > >> format isn't likely to be happening until we get an online upgrade > >> process. So future changes are much less likely. If they do happen, we > >> have some flag bits spare that can be used to indicate later versions. > >> It's not the prettiest thing in the world, but it's a small ugliness > >> in return for an important feature. If there was a way without that, I > >> would have chosen it. > > > > Have you considered the CRC might match a valuid page version number? > > Is that safe? > > In the proposed scheme there are two flag bits set on the page to > indicate whether the field is used as a checksum rather than a version > number. So its possible the checksum could look like a valid page > version number, but we'd still be able to tell the difference. Thanks. Clearly we don't need 16 bits to represent our page version number because we rarely change it. The other good thing is I don't remember anyone wanting additional per-page storage in the past few years except for a checksum. I wonder if we should just dedicate 3 page header bits, call that the page version number, and set this new version number to 1, and assume all previous versions were zero, and have them look in the old page version location if the new version number is zero. I am basically thinking of how we can plan ahead to move the version number to a new location and have a defined way of finding the page version number using old and new schemes. I don't want to get to a point where we need a bit per page number version. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 06.02.2012 05:48, Bruce Momjian wrote: > On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote: >> In the proposed scheme there are two flag bits set on the page to >> indicate whether the field is used as a checksum rather than a version >> number. So its possible the checksum could look like a valid page >> version number, but we'd still be able to tell the difference. > > Thanks. Clearly we don't need 16 bits to represent our page version > number because we rarely change it. The other good thing is I don't > remember anyone wanting additional per-page storage in the past few > years except for a checksum. There's this idea that I'd like to see implemented: http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php In any case, I think it's a very bad idea to remove the page version field. How are we supposed to ever be able to change the page format again if we don't even have a version number on the page? I strongly oppose removing it. I'm also not very comfortable with the idea of having flags on the page indicating whether it has a checksum or not. It's not hard to imagine a software or firmware bug or hardware failure that would cause pd_flags field to be zeroed out altogether. It would be more robust if the expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that at upgrade somehow. And it still feels a bit whacky anyway. > I wonder if we should just dedicate 3 page header bits, call that the > page version number, and set this new version number to 1, and assume > all previous versions were zero, and have them look in the old page > version location if the new version number is zero. I am basically > thinking of how we can plan ahead to move the version number to a new > location and have a defined way of finding the page version number using > old and new schemes. Three bits seems short-sighted, but yeah, something like 6-8 bits should be enough. On the whole, though, I think we should bite the bullet and invent a way to extend the page header at upgrade. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote: >> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote: >> >> > Also, as far as I can see this patch usurps the page version field, >> >> > which I find unacceptably short-sighted. Do you really think this is >> >> > the last page layout change we'll ever make? >> >> >> >> No, I don't. I hope and expect the next page layout change to >> >> reintroduce such a field. >> >> >> >> But since we're agreed now that upgrading is important, changing page >> >> format isn't likely to be happening until we get an online upgrade >> >> process. So future changes are much less likely. If they do happen, we >> >> have some flag bits spare that can be used to indicate later versions. >> >> It's not the prettiest thing in the world, but it's a small ugliness >> >> in return for an important feature. If there was a way without that, I >> >> would have chosen it. >> > >> > Have you considered the CRC might match a valuid page version number? >> > Is that safe? >> >> In the proposed scheme there are two flag bits set on the page to >> indicate whether the field is used as a checksum rather than a version >> number. So its possible the checksum could look like a valid page >> version number, but we'd still be able to tell the difference. > > Thanks. Clearly we don't need 16 bits to represent our page version > number because we rarely change it. The other good thing is I don't > remember anyone wanting additional per-page storage in the past few > years except for a checksum. > > I wonder if we should just dedicate 3 page header bits, call that the > page version number, and set this new version number to 1, and assume > all previous versions were zero, and have them look in the old page > version location if the new version number is zero. I am basically > thinking of how we can plan ahead to move the version number to a new > location and have a defined way of finding the page version number using > old and new schemes. > > I don't want to get to a point where we need a bit per page number > version. Agreed, though I think we need at least 2 bits set if we are using the checksum, so we should start at version 2 to ensure that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 06.02.2012 05:48, Bruce Momjian wrote: >> >> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote: >>> >>> In the proposed scheme there are two flag bits set on the page to >>> indicate whether the field is used as a checksum rather than a version >>> number. So its possible the checksum could look like a valid page >>> version number, but we'd still be able to tell the difference. >> >> >> Thanks. Clearly we don't need 16 bits to represent our page version >> number because we rarely change it. The other good thing is I don't >> remember anyone wanting additional per-page storage in the past few >> years except for a checksum. > > > There's this idea that I'd like to see implemented: > http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php As you'll note, adding that field will change the page format and is therefore ruled out for 9.2. ISTM there is a different way to handle that anyway. All we need to do is to record the LSN of the last wraparound in shared memory/control file. Any block with page LSN older than that has all-frozen rows. That doesn't use any space nor does it require another field to be set. That leaves the same problem that if someone writes to the page you need to freeze the rows first. > In any case, I think it's a very bad idea to remove the page version field. > How are we supposed to ever be able to change the page format again if we > don't even have a version number on the page? I strongly oppose removing it. Nobody is removing the version field, nor is anybody suggesting not being able to tell which page version we are looking at. > I'm also not very comfortable with the idea of having flags on the page > indicating whether it has a checksum or not. It's not hard to imagine a > software or firmware bug or hardware failure that would cause pd_flags field > to be zeroed out altogether. It would be more robust if the expected > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that > at upgrade somehow. And it still feels a bit whacky anyway. Good idea. Let's use
0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksum
So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator. >> I wonder if we should just dedicate 3 page header bits, call that the >> page version number, and set this new version number to 1, and assume >> all previous versions were zero, and have them look in the old page >> version location if the new version number is zero. I am basically >> thinking of how we can plan ahead to move the version number to a new >> location and have a defined way of finding the page version number using >> old and new schemes. > > > Three bits seems short-sighted, but yeah, something like 6-8 bits should be > enough. On the whole, though, I think we should bite the bullet and invent a > way to extend the page header at upgrade. There are currently many spare bits. I don't see any need to allocate them to this specific use ahead of time - especially since that is the exact decision we took last time when we reserved 16 bits for the version. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
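For illustration, the three-bit scheme Simon sketches could read roughly as follows in bufpage.h terms; the bit positions and names here are invented, not taken from the patch.

#include "postgres.h"

/* Hypothetical pd_flags bits for the 0-0-0 / 0-0-1 / 1-1-1 scheme. */
#define PD_PAGE_VERSION_PLUS1  0x0008   /* page uses new-style versioning */
#define PD_CHECKSUM1           0x0010   /* first checksum-present bit */
#define PD_CHECKSUM2           0x0020   /* second; both must be set */

static inline bool
PageFlagsHaveChecksum(uint16 pd_flags)
{
    /* Requiring both bits guards against a single flipped bit. */
    return (pd_flags & (PD_CHECKSUM1 | PD_CHECKSUM2)) ==
           (PD_CHECKSUM1 | PD_CHECKSUM2);
}

static inline bool
PageFlagsAreNewVersion(uint16 pd_flags)
{
    /* 0-0-0 means an un-upgraded page from a previous release. */
    return (pd_flags & PD_PAGE_VERSION_PLUS1) != 0;
}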
On 06.02.2012 10:05, Simon Riggs wrote: > On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 06.02.2012 05:48, Bruce Momjian wrote: >>> >>> Thanks. Clearly we don't need 16 bits to represent our page version >>> number because we rarely change it. The other good thing is I don't >>> remember anyone wanting additional per-page storage in the past few >>> years except for a checksum. >> >> There's this idea that I'd like to see implemented: >> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php > > As you'll note, adding that field will change the page format and is > therefore ruled out for 9.2. > > ISTM there is a different way to handle that anyway. All we need to do > is to record the LSN of the last wraparound in shared memory/control > file. Any block with page LSN older than that has all-frozen rows. > That doesn't use any space nor does it require another field to be > set. Good idea. However, the followup idea to that discussion was to not only avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound altogether, by allowing clog to grow indefinitely. You do want to freeze at some point of course, to truncate the clog, but it would be nice to not have a hard limit. The way to do that is to store an xid "epoch" in the page header, so that Xids are effectively 64-bits wide, even though the xid fields on the tuple header are only 32-bits wide. That does require a new field in the page header. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
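For concreteness, the epoch idea amounts to interpreting each 32-bit on-page xid against a per-page epoch, along these lines; this is a toy sketch with invented names, not a design.

#include <stdint.h>

typedef uint32_t TransactionId;     /* matches the on-tuple xid width */

/* With a 32-bit epoch stored once per page, xmin/xmax become effectively
 * 64-bit, so clog could be kept around instead of forcing anti-wraparound
 * freezing at a hard limit. */
static inline uint64_t
PageFullTransactionId(uint32_t pd_xid_epoch, TransactionId xid)
{
    return ((uint64_t) pd_xid_epoch << 32) | xid;
}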
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 06.02.2012 10:05, Simon Riggs wrote: >> >> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>> On 06.02.2012 05:48, Bruce Momjian wrote: >>>> >>>> >>>> Thanks. Clearly we don't need 16 bits to represent our page version >>>> number because we rarely change it. The other good thing is I don't >>>> remember anyone wanting additional per-page storage in the past few >>>> years except for a checksum. >>> >>> >>> There's this idea that I'd like to see implemented: >>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php >> >> >> As you'll note, adding that field will change the page format and is >> therefore ruled out for 9.2. >> >> ISTM there is a different way to handle that anyway. All we need to do >> is to record the LSN of the last wraparound in shared memory/control >> file. Any block with page LSN older than that has all-frozen rows. >> That doesn't use any space nor does it require another field to be >> set. > > > Good idea. However, the followup idea to that discussion was to not only > avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound > altogether, by allowing clog to grow indefinitely. You do want to freeze at > some point of course, to truncate the clog, but it would be nice to not have > a hard limit. The way to do that is to store an xid "epoch" in the page > header, so that Xids are effectively 64-bits wide, even though the xid > fields on the tuple header are only 32-bits wide. That does require a new > field in the page header. We wouldn't need to do that would we? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06.02.2012 11:25, Simon Riggs wrote: > On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Good idea. However, the followup idea to that discussion was to not only >> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound >> altogether, by allowing clog to grow indefinitely. You do want to freeze at >> some point of course, to truncate the clog, but it would be nice to not have >> a hard limit. The way to do that is to store an xid "epoch" in the page >> header, so that Xids are effectively 64-bits wide, even though the xid >> fields on the tuple header are only 32-bits wide. That does require a new >> field in the page header. > > We wouldn't need to do that would we? Huh? Do you mean that we wouldn't need to implement that feature? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 06, 2012 at 09:05:19AM +0000, Simon Riggs wrote: > > In any case, I think it's a very bad idea to remove the page version field. > > How are we supposed to ever be able to change the page format again if we > > don't even have a version number on the page? I strongly oppose removing it. > > Nobody is removing the version field, nor is anybody suggesting not > being able to tell which page version we are looking at. Agreed. I thought the idea was that we have a 16-bit page _version_ number and 8+ free page _flag_ bits, which are all currently zero. The idea was to move the version number from the 16-bit field into the unused flag bits, and use the 16-bit field for the checksum. I would like to have some logic that would allow tools inspecting the page to tell if they should look for the version number in the bits at the beginning of the page or at the end. Specifically, this becomes the checksum:

    uint16      pd_pagesize_version;

and this holds the page version, if we have updated the page to the new format:

    uint16      pd_flags;       /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
#define PD_PAGE_FULL        0x0002  /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to everyone */
#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */

> > I'm also not very comfortable with the idea of having flags on the page > > indicating whether it has a checksum or not. It's not hard to imagine a > > software or firmware bug or hardware failure that would cause pd_flags field > > to be zeroed out altogether. It would be more robust if the expected > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that > > at upgrade somehow. And it still feels a bit whacky anyway. > Good idea. Let's use > > 0-0-0 to represent upgraded from previous version, needs a bit set > 0-0-1 to represent new version number of page, no checksum > 1-1-1 to represent new version number of page, with checksum > > So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator Interesting point that we would not be guarding against a bit flip from 1 to 0 for the checksum bit; I agree using two bits is the way to go. I don't see how upgrade figures into this. However, I am unclear how Simon's idea above actually works. We need two bits for redundancy, both 1, to mark a page as having a checksum. I don't think mixing the idea of a new page version and checksum enabled really makes sense, especially since we have to plan for future page version changes. I think we dedicate 2 bits to say we have computed a checksum, and 3 bits to mark up to 8 possible page versions, so the logic is, in pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored on the page, and we use 0x32 and later for the page version number. We can assume all the remaining bits are for the page version number until we need to define new bits, and we can start storing them at the end first, and work forward. If all the version bits are zero, it means the page version number is still stored in pd_pagesize_version. > >> I wonder if we should just dedicate 3 page header bits, call that the > >> page version number, and set this new version number to 1, and assume > >> all previous versions were zero, and have them look in the old page > >> version location if the new version number is zero.
I am basically > >> thinking of how we can plan ahead to move the version number to a new > >> location and have a defined way of finding the page version number using > >> old and new schemes. > > > > > > Three bits seems short-sighted, but yeah, something like 6-8 bits should be > > enough. On the whole, though. I think we should bite the bullet and invent a > > way to extend the page header at upgrade. > > There are currently many spare bits. I don't see any need to allocate > them to this specific use ahead of time - especially since that is the > exact decision we took last time when we reserved 16 bits for the > version. Right, but I am thinking we should set things up so we can grow the page version number into the unused bit, rather than box it between bits we are already using. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote: > >I wonder if we should just dedicate 3 page header bits, call that the > >page version number, and set this new version number to 1, and assume > >all previous versions were zero, and have them look in the old page > >version location if the new version number is zero. I am basically > >thinking of how we can plan ahead to move the version number to a new > >location and have a defined way of finding the page version number using > >old and new schemes. > > Three bits seems short-sighted, but yeah, something like 6-8 bits > should be enough. On the whole, though. I think we should bite the > bullet and invent a way to extend the page header at upgrade. I just emailed a possible layout that allows for future page version number expansion. I don't think there is any magic bullet that will allow for page header extension by pg_upgrade. If it is going to be done, it would have to happen in the backend while the system is running. Anything that requires pg_upgrade to do lots of reads or writes basically makes pg_upgrade useless. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote: > > > I'm also not very comfortable with the idea of having flags on the page > > > indicating whether it has a checksum or not. It's not hard to imagine a > > > software of firmware bug or hardware failure that would cause pd_flags field > > > to be zeroed out altogether. It would be more robust if the expected > > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that > > > at upgrade somehow. And it still feels a bit whacky anyway. > > > Good idea. Lets use > > > > 0-0-0 to represent upgraded from previous version, needs a bit set > > 0-0-1 to represent new version number of page, no checksum > > 1-1-1 to represent new version number of page, with checksum > > > > So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator > > Interesting point that we would not be guarding against a bit flip from > 1 to 0 for the checksum bit; I agree using two bits is the way to go. I > don't see how upgrade figures into this. > > However, I am unclear how Simon's idea above actually works. We need > two bits for redundancy, both 1, to mark a page as having a checksum. I > don't think mixing the idea of a new page version and checksum enabled > really makes sense, especially since we have to plan for future page > version changes. > > I think we dedicate 2 bits to say we have computed a checksum, and 3 > bits to mark up to 8 possible page versions, so the logic is, in > pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored > on the page, and we use 0x32 and later for the page version number. We > can assume all the remaining bits are for the page version number until > we need to define new bits, and we can start storing them at the end > first, and work forward. If all the version bits are zero, it means the > page version number is still stored in pd_pagesize_version. A simpler solution would be to place two bits for checksum after the existing page bit usage, and place the page version on the last four bits of the 16-bit field --- that still leaves us with 7 unused bits. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
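Expressed as code, the layout Bruce describes might look roughly like this; the masks, shift and names are illustrative assumptions only, not a proposal in the patch.

#include "postgres.h"

/* Existing pd_flags bits occupy 0x0001..0x0004; the rest is hypothetical. */
#define PD_CHECKSUM_BITS   0x0018   /* 0x0008 | 0x0010: both set => checksummed */
#define PD_VERSION_MASK    0xF000   /* page version packed into the top 4 bits */
#define PD_VERSION_SHIFT   12

static inline int
PageLayoutVersion(uint16 pd_flags, uint16 pd_pagesize_version)
{
    int     v = (pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;

    /* All version bits zero: an old-style page, so the version still lives
     * in the low byte of pd_pagesize_version. */
    return (v != 0) ? v : (int) (pd_pagesize_version & 0x00FF);
}

That accounts for 3 + 2 + 4 = 9 of the 16 bits, leaving the 7 spare bits Bruce counts.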
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote: >> v7 Thanks very much for the review. Just realised I hadn't actually replied... > This patch uses FPIs to guard against torn hint writes, even when the > checksums are disabled. One could not simply condition them on the > page_checksums setting, though. Suppose page_checksums=off and we're hinting > a page previously written with page_checksums=on. If that write tears, > leaving the checksum intact, that block will now fail verification. A couple > of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the > old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. > Whenever the cluster starts with checksums disabled, set the field to > InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the > field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum > failure occurs in a page with LSN older than the stored one, emit either a > softer warning or no message at all. We can only change page_checksums at restart (now) so the above seems moot. If we crash with FPWs enabled we repair any torn pages. > Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty > the buffer. Here are other sites I noticed that do MarkBufferDirty() without > bumping the page LSN: > heap_xlog_visible() > visibilitymap_clear() > visibilitymap_truncate() > lazy_scan_heap() > XLogRecordPageWithFreeSpace() > FreeSpaceMapTruncateRel() > fsm_set_and_search() > fsm_vacuum_page() > fsm_search_avail() > Though I have not investigated each of these in detail, I suspect most or all > represent continued torn-page hazards. Since the FSM is just a hint, we do > not WAL-log it at all; it would be nicest to always skip checksums for it, > too. The visibility map shortcuts are more nuanced. Still checking all, but not as bad as the list looks. I'm removing chksum code from smgr and putting back into bufmgr and other locations. Patch footprint now looks like this.

 doc/src/sgml/config.sgml                      |  40 +++
 src/backend/access/hash/hashpage.c            |   1
 src/backend/access/heap/rewriteheap.c         |   4
 src/backend/access/heap/visibilitymap.c       |   1
 src/backend/access/nbtree/nbtree.c            |   1
 src/backend/access/nbtree/nbtsort.c           |   3
 src/backend/access/spgist/spginsert.c         |   2
 src/backend/access/transam/twophase.c         |  16 -
 src/backend/access/transam/xact.c             |   8
 src/backend/access/transam/xlog.c             | 114 ++++++++
 src/backend/commands/tablecmds.c              |   2
 src/backend/storage/buffer/bufmgr.c           |  75 +++++
 src/backend/storage/buffer/localbuf.c         |   2
 src/backend/storage/ipc/procarray.c           |  63 +++-
 src/backend/storage/lmgr/proc.c               |   4
 src/backend/storage/page/bufpage.c            | 331 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  14 +
 src/backend/utils/misc/postgresql.conf.sample |  15 -
 src/include/access/xlog.h                     |   1
 src/include/catalog/pg_control.h              |   3
 src/include/catalog/storage.h                 |   1
 src/include/storage/bufpage.h                 | 107 ++++++--
 src/include/storage/proc.h                    |   3
 src/include/storage/procarray.h               |   4
 24 files changed, 743 insertions(+), 72 deletions(-)

Patch is *not* attached here, yet. Still working on it. > When page_checksums=off and we read a page last written by page_checksums=on, > we still verify its checksum. If a mostly-insert/read site uses checksums for > some time and eventually wishes to shed the overhead, disabling the feature > will not stop the costs for reads of old data. They would need a dump/reload > to get the performance of a never-checksummed database.
Let's either > unconditionally skip checksum validation under page_checksums=off or add a new > state like page_checksums=ignore for that even-weaker condition. Agreed, changed. >> --- a/doc/src/sgml/config.sgml >> +++ b/doc/src/sgml/config.sgml > >> + <para> >> + Turning this parameter off speeds normal operation, but >> + might allow data corruption to go unnoticed. The checksum uses >> + 16-bit checksums, using the fast Fletcher 16 algorithm. With this >> + parameter enabled there is still a non-zero probability that an error >> + could go undetected, as well as a non-zero probability of false >> + positives. >> + </para> > > What sources of false positives do you intend to retain? I see none. Will reword. >> --- a/src/backend/catalog/storage.c >> +++ b/src/backend/catalog/storage.c >> @@ -20,6 +20,7 @@ >> #include "postgres.h" >> >> #include "access/visibilitymap.h" >> +#include "access/transam.h" >> #include "access/xact.h" >> #include "access/xlogutils.h" >> #include "catalog/catalog.h" >> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ >> /* XLOG gives us high 4 bits */ >> #define XLOG_SMGR_CREATE 0x10 >> #define XLOG_SMGR_TRUNCATE 0x20 >> +#define XLOG_SMGR_HINT 0x40 >> >> typedef struct xl_smgr_create >> { >> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void) >> smgrDoPendingDeletes(false); >> } >> >> +/* >> + * Write a backup block if needed when we are setting a hint. >> + * >> + * Deciding the "if needed" bit is delicate and requires us to either >> + * grab WALInsertLock or check the info_lck spinlock. If we check the >> + * spinlock and it says Yes then we will need to get WALInsertLock as well, >> + * so the design choice here is to just go straight for the WALInsertLock >> + * and trust that calls to this function are minimised elsewhere. >> + * >> + * Callable while holding share lock on the buffer content. >> + * >> + * Possible that multiple concurrent backends could attempt to write >> + * WAL records. In that case, more than one backup block may be recorded >> + * though that isn't important to the outcome and the backup blocks are >> + * likely to be identical anyway. >> + */ >> +#define SMGR_HINT_WATERMARK 13579 >> +void >> +smgr_buffer_hint(Buffer buffer) >> +{ >> + /* >> + * Make an XLOG entry reporting the hint >> + */ >> + XLogRecPtr lsn; >> + XLogRecData rdata[2]; >> + int watermark = SMGR_HINT_WATERMARK; >> + >> + /* >> + * Not allowed to have zero-length records, so use a small watermark >> + */ >> + rdata[0].data = (char *) (&watermark); >> + rdata[0].len = sizeof(int); >> + rdata[0].buffer = InvalidBuffer; >> + rdata[0].buffer_std = false; >> + rdata[0].next = &(rdata[1]); >> + >> + rdata[1].data = NULL; >> + rdata[1].len = 0; >> + rdata[1].buffer = buffer; >> + rdata[1].buffer_std = true; >> + rdata[1].next = NULL; >> + >> + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata); >> + >> + /* >> + * Set the page LSN if we wrote a backup block. >> + */ >> + if (!XLByteEQ(InvalidXLogRecPtr, lsn)) >> + { >> + Page page = BufferGetPage(buffer); >> + PageSetLSN(page, lsn); > > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding > is insufficient. Am serialising this by only writing PageLSN while holding buf hdr lock. > Need a PageSetTLI() here, too. 
Agreed >> + elog(LOG, "inserted backup block for hint bit"); >> + } >> +} > >> --- a/src/backend/storage/buffer/bufmgr.c >> +++ b/src/backend/storage/buffer/bufmgr.c > >> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) >> if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != >> (BM_DIRTY | BM_JUST_DIRTIED)) >> { >> + /* >> + * If we're writing checksums and we care about torn pages then we >> + * cannot dirty a page during recovery as a result of a hint. >> + * We can set the hint, just not dirty the page as a result. >> + * >> + * See long discussion in bufpage.c >> + */ >> + if (HintsMustNotDirtyPage()) >> + return; > > This expands to > if (page_checksums && fullPageWrites && RecoveryInProgress()) > If only page_checksums is false, smgr_buffer_hint() can attempt to insert a > WAL record during recovery. Yes, logic corrected. >> + >> + /* >> + * Write a full page into WAL iff this is the first change on the >> + * block since the last checkpoint. That will never be the case >> + * if the block is already dirty because we either made a change >> + * or set a hint already. Note that aggressive cleaning of blocks >> + * dirtied by hint bit setting would increase the call rate. >> + * Bulk setting of hint bits would reduce the call rate... >> + * >> + * We must issue the WAL record before we mark the buffer dirty. >> + * Otherwise we might write the page before we write the WAL. >> + * That causes a race condition, since a checkpoint might >> + * occur between writing the WAL record and marking the buffer dirty. >> + * We solve that with a kluge, but one that is already in use >> + * during transaction commit to prevent race conditions. >> + * Basically, we simply prevent the checkpoint WAL record from >> + * being written until we have marked the buffer dirty. We don't >> + * start the checkpoint flush until we have marked dirty, so our >> + * checkpoint must flush the change to disk successfully or the >> + * checkpoint never gets written, so crash recovery will set us right. >> + * >> + * XXX rename PGPROC variable later; keep it same now for clarity >> + */ >> + MyPgXact->inCommit = true; >> + smgr_buffer_hint(buffer); >> + >> LockBufHdr(bufHdr); >> Assert(bufHdr->refcount > 0); >> if (!(bufHdr->flags & BM_DIRTY)) >> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer) >> } >> bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); >> UnlockBufHdr(bufHdr); >> + MyPgXact->inCommit = false; >> } >> } Found another bug in that section also, relating to inCommit handling. >> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c >> index 096d36a..a220310 100644 >> --- a/src/backend/storage/buffer/localbuf.c >> +++ b/src/backend/storage/buffer/localbuf.c >> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, >> /* Find smgr relation for buffer */ >> oreln = smgropen(bufHdr->tag.rnode, MyBackendId); >> >> + /* XXX do we want to write checksums for local buffers? An option? */ >> + > > Don't we have that, now that it happens in mdwrite()? > > I can see value in an option to exclude local buffers, since corruption there > may be less exciting. It doesn't seem important for an initial patch, though. I'm continuing to exclude local buffers. Let me know if that should change. >> --- a/src/backend/storage/page/bufpage.c >> +++ b/src/backend/storage/page/bufpage.c > >> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22. > > I get a 404 for that link. 
Changed >> --- a/src/include/storage/bufpage.h >> +++ b/src/include/storage/bufpage.h > >> -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ >> +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */ > > The comment is not consistent with the character of the value. OK -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
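As a rough sketch of the serialisation described above (setting the page LSN only while holding the buffer header spinlock, so a reader under a shared content lock cannot see a half-written value), the update could be wrapped like this; the helper name is invented and this is not the patch's code.

#include "postgres.h"
#include "access/xlog.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/* Hypothetical helper: update the page LSN/TLI under the buffer header
 * spinlock rather than requiring an exclusive content lock. */
static void
SetPageLSNLocked(volatile BufferDesc *bufHdr, Buffer buffer, XLogRecPtr lsn)
{
    Page    page = BufferGetPage(buffer);

    LockBufHdr(bufHdr);
    PageSetLSN(page, lsn);
    PageSetTLI(page, ThisTimeLineID);
    UnlockBufHdr(bufHdr);
}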
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: > On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: > > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote: > > This patch uses FPIs to guard against torn hint writes, even when the > > checksums are disabled. One could not simply condition them on the > > page_checksums setting, though. Suppose page_checksums=off and we're hinting > > a page previously written with page_checksums=on. If that write tears, > > leaving the checksum intact, that block will now fail verification. A couple > > of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the > > old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. > > Whenever the cluster starts with checksums disabled, set the field to > > InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the > > field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum > > failure occurs in a page with LSN older than the stored one, emit either a > > softer warning or no message at all. > > We can only change page_checksums at restart (now) so the above seems moot. > > If we crash with FPWs enabled we repair any torn pages. There's no live bug, but that comes at a high cost: the patch has us emit full-page images for hint bit writes regardless of the page_checksums setting. > > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding > > is insufficient. > > Am serialising this by only writing PageLSN while holding buf hdr lock. That means also taking the buffer header spinlock in every PageGetLSN() caller holding only a shared buffer content lock. Do you think that will pay off, versus the settled pattern of trading here your shared buffer content lock for an exclusive one? > > I can see value in an option to exclude local buffers, since corruption there > > may be less exciting. It doesn't seem important for an initial patch, though. > > I'm continuing to exclude local buffers. Let me know if that should change. Seems reasonable. Thanks, nm
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote: > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: >> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote: > >> > This patch uses FPIs to guard against torn hint writes, even when the >> > checksums are disabled. ?One could not simply condition them on the >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting >> > a page previously written with page_checksums=on. ?If that write tears, >> > leaving the checksum intact, that block will now fail verification. ?A couple >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control. >> > Whenever the cluster starts with checksums disabled, set the field to >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum >> > failure occurs in a page with LSN older than the stored one, emit either a >> > softer warning or no message at all. >> >> We can only change page_checksums at restart (now) so the above seems moot. >> >> If we crash with FPWs enabled we repair any torn pages. > > There's no live bug, but that comes at a high cost: the patch has us emit > full-page images for hint bit writes regardless of the page_checksums setting. Sorry, I don't understand what you mean. I don't see any failure cases that require that. page_checksums can only change at a shutdown checkpoint, The statement "If that write tears, >> > leaving the checksum intact, that block will now fail verification." cannot happen, ISTM. If we write out a block we update the checksum if page_checksums is set, or we clear it. If we experience a torn page at crash, the FPI corrects that, so the checksum never does fail verification. We only need to write a FPI when we write with checksums. If that's wrong, please explain a failure case in detail. >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding >> > is insufficient. >> >> Am serialising this by only writing PageLSN while holding buf hdr lock. > > That means also taking the buffer header spinlock in every PageGetLSN() caller > holding only a shared buffer content lock. Do you think that will pay off, > versus the settled pattern of trading here your shared buffer content lock for > an exclusive one? Yes, I think it will pay off. This is the only code location where we do that, and we are already taking the buffer header lock, so there is effectively no overhead. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote: > On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote: > > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: > >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: > >> > This patch uses FPIs to guard against torn hint writes, even when the > >> > checksums are disabled. One could not simply condition them on the > >> > page_checksums setting, though. Suppose page_checksums=off and we're hinting > >> > a page previously written with page_checksums=on. If that write tears, > >> > leaving the checksum intact, that block will now fail verification. A couple > >> > of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the > >> > old page had a checksum. (b) Add a checksumEnableLSN field to pg_control. > >> > Whenever the cluster starts with checksums disabled, set the field to > >> > InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the > >> > field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum > >> > failure occurs in a page with LSN older than the stored one, emit either a > >> > softer warning or no message at all. > >> > >> We can only change page_checksums at restart (now) so the above seems moot. > >> > >> If we crash with FPWs enabled we repair any torn pages. > > > > There's no live bug, but that comes at a high cost: the patch has us emit > > full-page images for hint bit writes regardless of the page_checksums setting. > > Sorry, I don't understand what you mean. I don't see any failure cases > that require that. > > page_checksums can only change at a shutdown checkpoint, > > The statement "If that write tears, > >> > leaving the checksum intact, that block will now fail verification." > cannot happen, ISTM. > > If we write out a block we update the checksum if page_checksums is > set, or we clear it. If we experience a torn page at crash, the FPI > corrects that, so the checksum never does fail verification. We only > need to write a FPI when we write with checksums. > > If that's wrong, please explain a failure case in detail. In the following description, I will treat a heap page as having two facts. The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT. A torn page write lacking the protection of a full-page image can update one or both facts despite the transaction having logically updated both. (This is just the normal torn-page scenario.) Given that, consider the following sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown

5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT

9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P. checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch, because step 6 _does_ write WAL. That ought to change for the sake of page_checksums=off performance, at which point we must protect the above scenario by other means.
> > That means also taking the buffer header spinlock in every PageGetLSN() caller > > holding only a shared buffer content lock. Do you think that will pay off, > > versus the settled pattern of trading here your shared buffer content lock for > > an exclusive one? > > Yes, I think it will pay off. This is the only code location where we > do that, and we are already taking the buffer header lock, so there is > effectively no overhead. The sites I had in mind are the calls to PageGetLSN() in FlushBuffer() (via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header spinlock at either of those locations. If they remain safe under the new rules, we'll need comments explaining why. I think they may indeed be safe, but it's far from obvious. Thanks, nm
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote: >> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote: >> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: >> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: >> >> > This patch uses FPIs to guard against torn hint writes, even when the >> >> > checksums are disabled. ?One could not simply condition them on the >> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting >> >> > a page previously written with page_checksums=on. ?If that write tears, >> >> > leaving the checksum intact, that block will now fail verification. ?A couple >> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the >> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control. >> >> > Whenever the cluster starts with checksums disabled, set the field to >> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the >> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum >> >> > failure occurs in a page with LSN older than the stored one, emit either a >> >> > softer warning or no message at all. >> >> >> >> We can only change page_checksums at restart (now) so the above seems moot. >> >> >> >> If we crash with FPWs enabled we repair any torn pages. >> > >> > There's no live bug, but that comes at a high cost: the patch has us emit >> > full-page images for hint bit writes regardless of the page_checksums setting. >> >> Sorry, I don't understand what you mean. I don't see any failure cases >> that require that. >> >> page_checksums can only change at a shutdown checkpoint, >> >> The statement "If that write tears, >> >> > leaving the checksum intact, that block will now fail verification." >> cannot happen, ISTM. >> >> If we write out a block we update the checksum if page_checksums is >> set, or we clear it. If we experience a torn page at crash, the FPI >> corrects that, so the checksum never does fail verification. We only >> need to write a FPI when we write with checksums. >> >> If that's wrong, please explain a failure case in detail. > > In the following description, I will treat a heap page as having two facts. > The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT. > A torn page write lacking the protection of a full-page image can update one > or both facts despite the transaction having logically updated both. (This is > just the normal torn-page scenario.) Given that, consider the following > sequence of events: > > 1. startup with page_checksums = on > 2. update page P with facts CHKSUM, !HINT > 3. clean write of P > 4. clean shutdown > > 5. startup with page_checksums = off > 6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off > 7. begin write of P > 8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT > > 9. startup with page_checksums = off > 10. crash recovery. does not touch P, because step 6 generated no WAL > 11. clean shutdown > > 12. startup with page_checksums = on > 13. read P. checksum failure > > Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch, > because step 6 _does_ write WAL. That ought to change for the sake of > page_checksums=off performance, at which point we must protect the above > scenario by other means. Thanks for explaining. Logic fixed. 
>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding >> >> > is insufficient. >> >> >> >> Am serialising this by only writing PageLSN while holding buf hdr lock. >> > >> > That means also taking the buffer header spinlock in every PageGetLSN() caller >> > holding only a shared buffer content lock. ?Do you think that will pay off, >> > versus the settled pattern of trading here your shared buffer content lock for >> > an exclusive one? >> >> Yes, I think it will pay off. This is the only code location where we >> do that, and we are already taking the buffer header lock, so there is >> effectively no overhead. > > The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer() > (via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header > spinlock at either of those locations. If they remain safe under the new > rules, we'll need comments explaining why. I think they may indeed be safe, > but it's far from obvious. OK, some slight rework required to do that, no problems. I checked all other call sites and have updated README to explain. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
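For readers tracking option (b) from the quoted list, a rough sketch of the softer-reporting idea follows; checksumEnableLSN stands in for a hypothetical new pg_control field, and the function name and message wording are illustrative only.

#include "postgres.h"
#include "storage/block.h"
#include "storage/bufpage.h"

/*
 * Sketch of the quoted idea (b): remember the LSN at which checksums were
 * first enabled and report verification failures on older pages softly,
 * since such pages may carry a stale checksum through no fault of the
 * hardware.  checksumEnableLSN would come from pg_control in a real design.
 */
static void
ReportChecksumFailure(Page page, BlockNumber blkno, XLogRecPtr checksumEnableLSN)
{
	if (XLByteLT(PageGetLSN(page), checksumEnableLSN))
		ereport(WARNING,
				(errmsg("page verification skipped for block %u written before checksums were enabled",
						blkno)));
	else
		ereport(ERROR,
				(errmsg("invalid page checksum in block %u", blkno)));
}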
On Thu, Feb 9, 2012 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote: >> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote: >>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote: >>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote: >>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote: >>> >> > This patch uses FPIs to guard against torn hint writes, even when the >>> >> > checksums are disabled. ?One could not simply condition them on the >>> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting >>> >> > a page previously written with page_checksums=on. ?If that write tears, >>> >> > leaving the checksum intact, that block will now fail verification. ?A couple >>> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the >>> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control. >>> >> > Whenever the cluster starts with checksums disabled, set the field to >>> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the >>> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum >>> >> > failure occurs in a page with LSN older than the stored one, emit either a >>> >> > softer warning or no message at all. >>> >> >>> >> We can only change page_checksums at restart (now) so the above seems moot. >>> >> >>> >> If we crash with FPWs enabled we repair any torn pages. >>> > >>> > There's no live bug, but that comes at a high cost: the patch has us emit >>> > full-page images for hint bit writes regardless of the page_checksums setting. >>> >>> Sorry, I don't understand what you mean. I don't see any failure cases >>> that require that. >>> >>> page_checksums can only change at a shutdown checkpoint, >>> >>> The statement "If that write tears, >>> >> > leaving the checksum intact, that block will now fail verification." >>> cannot happen, ISTM. >>> >>> If we write out a block we update the checksum if page_checksums is >>> set, or we clear it. If we experience a torn page at crash, the FPI >>> corrects that, so the checksum never does fail verification. We only >>> need to write a FPI when we write with checksums. >>> >>> If that's wrong, please explain a failure case in detail. >> >> In the following description, I will treat a heap page as having two facts. >> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT. >> A torn page write lacking the protection of a full-page image can update one >> or both facts despite the transaction having logically updated both. (This is >> just the normal torn-page scenario.) Given that, consider the following >> sequence of events: >> >> 1. startup with page_checksums = on >> 2. update page P with facts CHKSUM, !HINT >> 3. clean write of P >> 4. clean shutdown >> >> 5. startup with page_checksums = off >> 6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off >> 7. begin write of P >> 8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT >> >> 9. startup with page_checksums = off >> 10. crash recovery. does not touch P, because step 6 generated no WAL >> 11. clean shutdown >> >> 12. startup with page_checksums = on >> 13. read P. checksum failure >> >> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch, >> because step 6 _does_ write WAL. 
That ought to change for the sake of >> page_checksums=off performance, at which point we must protect the above >> scenario by other means. > > Thanks for explaining. > > Logic fixed. > >>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding >>> >> > is insufficient. >>> >> >>> >> Am serialising this by only writing PageLSN while holding buf hdr lock. >>> > >>> > That means also taking the buffer header spinlock in every PageGetLSN() caller >>> > holding only a shared buffer content lock. ?Do you think that will pay off, >>> > versus the settled pattern of trading here your shared buffer content lock for >>> > an exclusive one? >>> >>> Yes, I think it will pay off. This is the only code location where we >>> do that, and we are already taking the buffer header lock, so there is >>> effectively no overhead. >> >> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer() >> (via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header >> spinlock at either of those locations. If they remain safe under the new >> rules, we'll need comments explaining why. I think they may indeed be safe, >> but it's far from obvious. > > OK, some slight rework required to do that, no problems. > > I checked all other call sites and have updated README to explain. v8 attached -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thursday, 16 February 2012 12:16:31, Simon Riggs wrote: > v8 attached Maybe the docs should include what will happen if the checksum is not correct? -- Albert Cervera i Areny http://www.NaN-tic.com Tel: +34 93 553 18 03 http://twitter.com/albertnan http://www.nan-tic.com/blog
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > v8 attached It's hard to believe that this version has been tested terribly thoroughly, because it doesn't compile. + LockBufHdr(buf); + + /* + * Run PageGetLSN while holding header lock. + */ + recptr = BufferGetLSN(buf); + + /* To check if block content changes while flushing. - vadim 01/17/97 */ + buf->flags &= ~BM_JUST_DIRTIED; + UnlockBufHdr(buf); + This doesn't seem very helpful. It's obvious even without the comment that we're running PageGetLSN while holding the header lock. What's not obvious, and what the comment should be explaining, is why we're doing that. + /* + * If we're in recovery we cannot dirty a page because of a hint. + * We can set the hint, just not dirty the page as a result so + * the hint is lost when we evict the page or shutdown. + * + * See long discussion in bufpage.c + */ + if (RecoveryInProgress()) + return; Doesn't this seem awfully bad for performance on Hot Standby servers? I agree that it fixes the problem with un-WAL-logged pages there, but I seem to recall some recent complaining about performance features that work on the master but not the standby. Durable hint bits are one such feature. + * Basically, we simply prevent the checkpoint WAL record from + * being written until we have marked the buffer dirty. We don't + * start the checkpoint flush until we have marked dirty, so our + * checkpoint must flush the change to disk successfully or the + * checkpoint never gets written, so crash recovery will fix. + * + * It's possible we may enter here without an xid, so it is + * essential that CreateCheckpoint waits for virtual transactions + * rather than full transactionids. + */ + MyPgXact->delayChkpt = delayChkpt = true; I am slightly worried that this expansion in the use of this mechanism (formerly called inCommit, for those following along at home) could lead to checkpoint starvation. Suppose we've got one or two large table scans wandering along, setting hint bits, and now suddenly it's time to checkpoint. How long will it take the checkpoint process to find a time when nobody's got delayChkpt set? + #define PageSetChecksum(page) \ + do \ + { \ + PageHeader p = (PageHeader) page; \ + p->pd_flags |= PD_PAGE_VERSION_PLUS1; \ + p->pd_flags |= PD_CHECKSUM1; \ + p->pd_flags &= ~PD_CHECKSUM2; \ + p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \ + } while (0); + + /* ensure any older checksum info is overwritten with watermark */ + #define PageResetVersion(page) \ + do \ + { \ + PageHeader p = (PageHeader) page; \ + if (!PageHasNoChecksum(p)) \ + { \ + p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \ + p->pd_flags &= ~PD_CHECKSUM1; \ + p->pd_flags &= ~PD_CHECKSUM2; \ + PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \ + } \ + } while (0); So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that do? * PageGetPageSize * Returns the page size of a page. * ! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ. ! * This can be called on any page, initialised or not, in or out of buffers. ! * You might think this can vary at runtime but you'd be wrong, since pages ! * frequently need to occupy buffers and pages are copied from one to another ! * so there are many hidden assumptions that this simple definition is true. */ ! 
#define PageGetPageSize(page) (BLCKSZ) I think I agree that the current definition of PageGetPageSize seems unlikely to come anywhere close to allowing us to cope with multiple page sizes, but I think this method of disabling it is a hack. The callers that want to know how big the page really is should just use BLCKSZ instead of this macro, and those that want to know how big the page THINKS it is (notably contrib/pageinspect) need a way to get that information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
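To make the checkpoint-delay hunks quoted above easier to follow, here is a rough sketch of the pattern they imply. DirtyBufferForHint is a hypothetical wrapper; MyPgXact->delayChkpt, RecoveryInProgress() and the BM_* flags are the real primitives, and the elided full-page-image step is where the patch's actual complexity lives.

#include "postgres.h"
#include "access/xlog.h"
#include "storage/buf_internals.h"
#include "storage/proc.h"

/*
 * Rough sketch only: dirty a buffer because of a hint-bit update while
 * preventing a checkpoint from placing its redo pointer between the
 * (possible) full-page image and the moment the buffer is marked dirty.
 */
static void
DirtyBufferForHint(volatile BufferDesc *bufHdr)
{
	/* Cannot WAL-log an FPI during recovery, so do not dirty the page. */
	if (RecoveryInProgress())
		return;

	/* The checkpointer waits for us to leave this state. */
	MyPgXact->delayChkpt = true;

	/*
	 * A real implementation would emit a full-page image here when the
	 * hint write could tear; that call is deliberately elided.
	 */

	LockBufHdr(bufHdr);
	bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
	UnlockBufHdr(bufHdr);

	MyPgXact->delayChkpt = false;
}

The starvation question then reduces to how long the checkpointer must wait for the backends that had delayChkpt set at the instant it looked.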
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> v8 attached > > It's hard to believe that this version has been tested terribly > thoroughly, because it doesn't compile. I'm just back home from a few days off grid. It's possible it doesn't compile against current HEAD, though it certainly does compile and work against my last git pull. I will look into your comments in detail tomorrow morning. Thank you for looking at the patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 17, 2012 at 10:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> v8 attached >> >> It's hard to believe that this version has been tested terribly >> thoroughly, because it doesn't compile. > > I'm just back home from a few days off grid. > > It's possible it doesn't compile against current HEAD, though it > certainly does compile and work against my last git pull. > > I will look into your comments in detail tomorrow morning. Thank you > for looking at the patch. It does appear there is a chunk of code introduced somewhere between my testing and sending the patch, so you're right. Looking at the chunk below it looks like something hit some keys, adding a "]" and CR, which are adjacent on my keyboard. Trying to develop with 3 kids and a dog around isn't easy, I guess. With this chunk removed we're back to the version I've tested - I think - will carry on checking. @@ -465,7 +466,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, { /* Only need to adjust flags */ bufHdr->flags |= BM_VALID; - } + [ + +} else { /* Set BM_VALID, terminate IO, and wake up any waiters */ For testing, I've been running server with shared_buffers = 16, to force blocks in and out of cache frequently to stress test the checksumming. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > + /* > + * If we're in recovery we cannot dirty a page because of a hint. > + * We can set the hint, just not dirty the page as a result so > + * the hint is lost when we evict the page or shutdown. > + * > + * See long discussion in bufpage.c > + */ > + if (RecoveryInProgress()) > + return; > > Doesn't this seem awfully bad for performance on Hot Standby servers? > I agree that it fixes the problem with un-WAL-logged pages there, but > I seem to recall some recent complaining about performance features > that work on the master but not the standby. Durable hint bits are > one such feature. It's impossible for it to work, in this case, since we cannot write new WAL to prevent torn pages. Note that hint bit setting on a dirty block is allowed, so many hints will still be set in Hot Standby. > + * Basically, we simply prevent the checkpoint WAL record from > + * being written until we have marked the buffer dirty. We don't > + * start the checkpoint flush until we have marked dirty, so our > + * checkpoint must flush the change to disk successfully or the > + * checkpoint never gets written, so crash recovery will fix. > + * > + * It's possible we may enter here without an xid, so it is > + * essential that CreateCheckpoint waits for virtual transactions > + * rather than full transactionids. > + */ > + MyPgXact->delayChkpt = delayChkpt = true; > > I am slightly worried that this expansion in the use of this mechanism > (formerly called inCommit, for those following along at home) could > lead to checkpoint starvation. Suppose we've got one or two large > table scans wandering along, setting hint bits, and now suddenly it's > time to checkpoint. How long will it take the checkpoint process to > find a time when nobody's got delayChkpt set? We don't need to wait until nobody has it set, we just need to wait for the people that had it set when we first checked to be out of that state momentarily. > + #define PageSetChecksum(page) \ > + do \ > + { \ > + PageHeader p = (PageHeader) page; \ > + p->pd_flags |= PD_PAGE_VERSION_PLUS1; \ > + p->pd_flags |= PD_CHECKSUM1; \ > + p->pd_flags &= ~PD_CHECKSUM2; \ > + p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \ > + } while (0); > + > + /* ensure any older checksum info is overwritten with watermark */ > + #define PageResetVersion(page) \ > + do \ > + { \ > + PageHeader p = (PageHeader) page; \ > + if (!PageHasNoChecksum(p)) \ > + { \ > + p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \ > + p->pd_flags &= ~PD_CHECKSUM1; \ > + p->pd_flags &= ~PD_CHECKSUM2; \ > + PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \ > + } \ > + } while (0); > > So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it > doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that > do? As explained in detailed comments, the purpose of this is to implement Heikki's suggestion that we have a bit set to zero so we can detect failures that cause a run of 1s. > * PageGetPageSize > * Returns the page size of a page. > * > ! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ. > ! * This can be called on any page, initialised or not, in or out of buffers. > ! * You might think this can vary at runtime but you'd be wrong, since pages > ! * frequently need to occupy buffers and pages are copied from one to another > ! * so there are many hidden assumptions that this simple definition is true. > */ > ! 
#define PageGetPageSize(page) (BLCKSZ) > > I think I agree that the current definition of PageGetPageSize seems > unlikely to come anywhere close to allowing us to cope with multiple > page sizes, but I think this method of disabling it is a hack. The > callers that want to know how big the page really is should just use > BLCKSZ instead of this macro, and those that want to know how big the > page THINKS it is (notably contrib/pageinspect) need a way to get that > information. Fair comment. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
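For readers following the three-bit watermark argument, a sketch of the test a page reader might apply is below. The bit values are assumptions for illustration (the patch defines the real ones); the point is only that the three flags must hold the exact pattern 1,1,0 before a page is treated as checksummed, so a run of all ones or all zeroes across them fails.

#include "postgres.h"
#include "storage/bufpage.h"

/* Illustrative bit values only; the patch defines the real ones. */
#define PD_PAGE_VERSION_PLUS1	0x0008
#define PD_CHECKSUM1			0x0010
#define PD_CHECKSUM2			0x0020

/*
 * A page claims to carry a checksum only when the version-plus-one bit and
 * the first checksum bit are set while the second checksum bit is clear.
 */
static bool
PageClaimsChecksum(PageHeader p)
{
	return (p->pd_flags & PD_PAGE_VERSION_PLUS1) != 0 &&
		   (p->pd_flags & PD_CHECKSUM1) != 0 &&
		   (p->pd_flags & PD_CHECKSUM2) == 0;
}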
On Sun, Feb 19, 2012 at 4:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > We don't need to wait until nobody has it set, we just need to wait > for the people that had it set when we first checked to be out of that > state momentarily. I've just finished doing some performance analysis on various aspects of hint bit setting for this patch. I've seen as high as 14% of transactions writing full pages during a pgbench run. That sounds quite bad, but on pgbench at least all of those are associated with UPDATEs, which would dirty the page anyway, so there aren't any more full page writes overall. Checkpoints would be delayed only until a virtual transaction ends or a virtual transaction comes out of DelayCkpt state. If a virtual transaction was long running it wouldn't spend much time in the delaying state, especially if we take into account I/O requirements. So although I'm not exactly happy with the overheads, they don't seem to be as big a problem as they sound. Plus this is all optional and avoidable. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > v8 attached v10 attached. This patch covers all the valid concerns discussed and has been extensively tested. I expect to commit this in about 24 hours, so last call for comments please or say if you need more time to review. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Doesn't this seem awfully bad for performance on Hot Standby servers? >> I agree that it fixes the problem with un-WAL-logged pages there, but >> I seem to recall some recent complaining about performance features >> that work on the master but not the standby. Durable hint bits are >> one such feature. > > It's impossible for it to work, in this case, since we cannot write > new WAL to prevent torn pages. > > Note that hint bit setting on a dirty block is allowed, so many hints > will still be set in Hot Standby. To me, it seems that you are applying a double standard. You have twice attempted to insist that I do extra work to make major features that I worked on - unlogged tables and index-only scans - work in Hot Standby mode, despite the existence of significant technological obstacles. But when it comes to your own feature, you simply state that it cannot be done, and therefore we need not do it. Of course, this feature, like those, CAN be made to work. It just involves solving difficult problems that have little to do with the primary purpose of the patch. To be honest, I don't use Hot Standby enough to care very much about this particular issue, and I'm not saying we should reject it on these grounds. But I do think it's a mistake to dismiss it entirely, since every limitation of Hot Standby narrows the set of cases in which it can be deployed. And at any rate, I want the same standard applied to my work as to yours. >> I am slightly worried that this expansion in the use of this mechanism >> (formerly called inCommit, for those following along at home) could >> lead to checkpoint starvation. Suppose we've got one or two large >> table scans wandering along, setting hint bits, and now suddenly it's >> time to checkpoint. How long will it take the checkpoint process to >> find a time when nobody's got delayChkpt set? > > We don't need to wait until nobody has it set, we just need to wait > for the people that had it set when we first checked to be out of that > state momentarily. Ah... good point. >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it >> doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that >> do? > > As explained in detailed comments, the purpose of this is to implement > Heikki's suggestion that we have a bit set to zero so we can detect > failures that cause a run of 1s. I think it's nonsensical to pretend that there's anything special about that particular bit. If we want to validate the page header before trusting the lack of a checksum bit, we can do that far more thoroughly than just checking that one bit. There are a whole bunch of bits that ought to always be zero, and there are other things we can validate as well (e.g. LSN not in future). If we're concerned about the checksum-enabled bit getting flipped (and I agree that we should be), we can check some combination of that stuff in the hope of catching it, and that'll be a far better guard than just checking one arbitrarily selected bit. That having been said, I don't feel very good about the idea of relying on the contents of the page to tell us whether or not the page has a checksum. There's no guarantee that an error that flips the has-checksum bit will flip any other bit on the page, or that it won't flip everything else we're relying on as a backstop in exactly the manner that foils whatever algorithm we put in place. 
Random corruption is, perhaps, unlikely to do that, but somehow I feel like it'll happen more often than random chance suggests. Things never fail the way you want them to. Another disadvantage of the current scheme is that there's no particularly easy way to know that your whole cluster has checksums. No matter how we implement checksums, you'll have to rewrite every table in the cluster in order to get them fully turned on. But with the current design, there's no easy way to know how much of the cluster is actually checksummed. If you shut checksums off, they'll linger until those pages are rewritten, and there's no easy way to find the relations from which they need to be removed, either. I'm tempted to suggest a relation-level switch: when you want checksums, you use ALTER TABLE to turn them on, and when you don't want them any more you use ALTER TABLE to shut them off again, in each case rewriting the table. That way, there's never any ambiguity about what's in the data pages in a given relation: either they're either all checksummed, or none of them are. This moves the decision about whether checksums are enabled or disabled quite a but further away from the data itself, and also allows you to determine (by catalog inspection) which parts of the cluster do in fact have checksums. It might be kind of a pain to implement, though: you'd have to pass the information about how any given relation was configured down to the place where we validate page sanity. I'm not sure whether that's practical. I also think that the question of where exactly the checksum ought to be put might bear some more thought. Tom expressed some concern about stealing the page version field, and it seems to me we could work around that by instead stealing something less valuable. The top eight bits of the pd_pagesize_version field probably meet that criteria, since in practice basically everybody uses 8K blocks, and the number of errors that will be caught by checksums is probably much larger than the number of errors that will be caught by the page-size cross-check. But maybe the other 8 bits should come from somewhere else, maybe pd_tli or pd_flags. For almost all practical purposes, storing only the low-order 8 bits of the TLI ought to provide just as much of a safety check as storing the low-order 16 bits, so I think that might be the way to go. We could set it up so that we always check only the low 8 bits against the TLI, regardless of whether checksums are enabled; then we basically free up that bit space at no cost in code complexity. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
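The broader header validation alluded to above could look roughly like the sketch below. The exact set of checks is an assumption; each relies only on fields every page already carries and on the era's XLByteLT() comparison macro.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * Sketch of a broader sanity test on a page header before trusting its
 * claim about checksums: undefined flag bits must be zero (PD_VALID_FLAG_BITS
 * would grow to cover any new checksum bits), the LSN must not lie in the
 * future, and the page offsets must be ordered and inside the block.
 */
static bool
PageHeaderLooksSane(PageHeader p, XLogRecPtr currentInsertLSN)
{
	if (p->pd_flags & ~PD_VALID_FLAG_BITS)
		return false;
	if (XLByteLT(currentInsertLSN, p->pd_lsn))
		return false;
	if (p->pd_lower < SizeOfPageHeaderData ||
		p->pd_lower > p->pd_upper ||
		p->pd_upper > p->pd_special ||
		p->pd_special > BLCKSZ)
		return false;
	return true;
}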
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > To me, it seems that you are applying a double standard. You have > twice attempted to insist that I do extra work to make major features > that I worked on - unlogged tables and index-only scans - work in Hot > Standby mode, despite the existence of significant technological > obstacles. But when it comes to your own feature, you simply state > that it cannot be done, and therefore we need not do it. Of course, > this feature, like those, CAN be made to work. Vitriol aside, if you would be so kind as to explain how it is possible, as you claim, I'll look into making it work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> As explained in detailed comments, the purpose of this is to implement >> Heikki's suggestion that we have a bit set to zero so we can detect >> failures that cause a run of 1s. > > I think it's nonsensical to pretend that there's anything special > about that particular bit. If we want to validate the page header > before trusting the lack of a checksum bit, we can do that far more > thoroughly than just checking that one bit. There are a whole bunch > of bits that ought to always be zero, and there are other things we > can validate as well (e.g. LSN not in future). If we're concerned > about the checksum-enabled bit getting flipped (and I agree that we > should be), we can check some combination of that stuff in the hope of > catching it, and that'll be a far better guard than just checking one > arbitrarily selected bit. I thought it was a reasonable and practical idea from Heikki. The bit is not selected arbitrarily, it is by design adjacent to one of the other bits. So overall, 3 bits need to be set to a precise value and a run of 1s or 0s will throw and error. > That having been said, I don't feel very good about the idea of > relying on the contents of the page to tell us whether or not the page > has a checksum. There's no guarantee that an error that flips the > has-checksum bit will flip any other bit on the page, or that it won't > flip everything else we're relying on as a backstop in exactly the > manner that foils whatever algorithm we put in place. Random > corruption is, perhaps, unlikely to do that, but somehow I feel like > it'll happen more often than random chance suggests. Things never > fail the way you want them to. You're right. This patch is not the best possible world, given a clean slate. But we don't have a clean slate. The fact is this patch will detect corruptions pretty well and that's what Postgres users want. While developing this, many obstacles could have been blockers. I think we're fairly lucky that I managed to find a way through the minefield of obstacles. > Another disadvantage of the current scheme is that there's no > particularly easy way to know that your whole cluster has checksums. > No matter how we implement checksums, you'll have to rewrite every > table in the cluster in order to get them fully turned on. But with > the current design, there's no easy way to know how much of the > cluster is actually checksummed. You can read every block and check. > If you shut checksums off, they'll > linger until those pages are rewritten, and there's no easy way to > find the relations from which they need to be removed, either. We can't have it both ways. Either we have an easy upgrade, or we don't. I'm told that an easy upgrade is essential. > I'm tempted to suggest a relation-level switch: when you want > checksums, you use ALTER TABLE to turn them on, and when you don't > want them any more you use ALTER TABLE to shut them off again, in each > case rewriting the table. That way, there's never any ambiguity about > what's in the data pages in a given relation: either they're either > all checksummed, or none of them are. This moves the decision about > whether checksums are enabled or disabled quite a but further away > from the data itself, and also allows you to determine (by catalog > inspection) which parts of the cluster do in fact have checksums. 
It > might be kind of a pain to implement, though: you'd have to pass the > information about how any given relation was configured down to the > place where we validate page sanity. I'm not sure whether that's > practical. It's not practical as the only mechanism, given the requirement for upgrade. As I mention in the docs, if you want that, use VACUUM FULL. So there is a mechanism if you want it. > I also think that the question of where exactly the checksum ought to > be put might bear some more thought. Tom expressed some concern about > stealing the page version field, and it seems to me we could work > around that by instead stealing something less valuable. The top > eight bits of the pd_pagesize_version field probably meet that > criteria, since in practice basically everybody uses 8K blocks, and > the number of errors that will be caught by checksums is probably much > larger than the number of errors that will be caught by the page-size > cross-check. But maybe the other 8 bits should come from somewhere > else, maybe pd_tli or pd_flags. For almost all practical purposes, > storing only the low-order 8 bits of the TLI ought to provide just as > much of a safety check as storing the low-order 16 bits, so I think > that might be the way to go. We could set it up so that we always > check only the low 8 bits against the TLI, regardless of whether > checksums are enabled; then we basically free up that bit space at no > cost in code complexity. The version stuff is catered for by the current patch. TLI isn't something I want to mess with. It does actually mean something, unlike the page version field which carries no appreciable information. People want checksums. This is the best way I've come up with of giving it to them, given all of the constraints and requests imposed upon me. I have no doubt that a later release will redesign the page format and do a better job, while at the same time allowing an online upgrade mechanism to allow block changes. But that is at least 1 year away and experience shows that is often much longer than that. I'll buy you a $DRINK if that happens in the next release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 6:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> To me, it seems that you are applying a double standard. You have >> twice attempted to insist that I do extra work to make major features >> that I worked on - unlogged tables and index-only scans - work in Hot >> Standby mode, despite the existence of significant technological >> obstacles. But when it comes to your own feature, you simply state >> that it cannot be done, and therefore we need not do it. Of course, >> this feature, like those, CAN be made to work. > > Vitriol aside, If you would be so kind as to explain how it is > possible, as you claim, I'll look into making it work. It would require a double-write buffer of some kind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I thought it was a reasonable and practical idea from Heikki. The bit > is not selected arbitrarily, it is by design adjacent to one of the > other bits. So overall, 3 bits need to be set to a precise value and a > run of 1s or 0s will throw and error. Sure, but who is to say that a typical error will look anything like that? Anyway, you could check even more bits just as easily; we know exactly which ones have a plausible reason for being non-zero. >> That having been said, I don't feel very good about the idea of >> relying on the contents of the page to tell us whether or not the page >> has a checksum. There's no guarantee that an error that flips the >> has-checksum bit will flip any other bit on the page, or that it won't >> flip everything else we're relying on as a backstop in exactly the >> manner that foils whatever algorithm we put in place. Random >> corruption is, perhaps, unlikely to do that, but somehow I feel like >> it'll happen more often than random chance suggests. Things never >> fail the way you want them to. > > You're right. This patch is not the best possible world, given a clean > slate. But we don't have a clean slate. > > The fact is this patch will detect corruptions pretty well and that's > what Postgres users want. > > While developing this, many obstacles could have been blockers. I > think we're fairly lucky that I managed to find a way through the > minefield of obstacles. I think we could do worse than this patch, but I don't really believe it's ready for commit. We don't have a single performance number showing how much of a performance regression this causes, either on the master or on the standby, on any workload, much less those where a problem is reasonably forseeable from the design you've chosen. Many controversial aspects of the patch, such as the way you're using the buffer header spinlocks as a surrogate for x-locking the buffer, or the right way of obtaining the bit-space the patch requires, haven't really been discussed, and to the extent that they have been discussed, they have not been agreed. On the former point, you haven't updated src/backend/storage/buffer/README at all; but updating is not by itself sufficient. Before we change the buffer-locking rules, we ought to have some discussion of whether it's OK to do that, and why it's necessary. "Why it's necessary" would presumably include demonstrating that the performance of the straightforward implementation stinks, and that with this change is better. You haven't made any effort to do that whatsoever, or if you have, you haven't posted the results here. I'm pretty un-excited by that change, first because I think it's a modularity violation and possibly unsafe, and second because I believe we've already got a problem with buffer header spinlock contention which this might exacerbate. >> Another disadvantage of the current scheme is that there's no >> particularly easy way to know that your whole cluster has checksums. >> No matter how we implement checksums, you'll have to rewrite every >> table in the cluster in order to get them fully turned on. But with >> the current design, there's no easy way to know how much of the >> cluster is actually checksummed. > > You can read every block and check. Using what tool? >> If you shut checksums off, they'll >> linger until those pages are rewritten, and there's no easy way to >> find the relations from which they need to be removed, either. 
> > We can't have it both ways. Either we have an easy upgrade, or we > don't. I'm told that an easy upgrade is essential. Easy upgrade and removal of checksums are unrelated issues AFAICS. >> I'm tempted to suggest a relation-level switch: when you want >> checksums, you use ALTER TABLE to turn them on, and when you don't >> want them any more you use ALTER TABLE to shut them off again, in each >> case rewriting the table. That way, there's never any ambiguity about >> what's in the data pages in a given relation: either they're either >> all checksummed, or none of them are. This moves the decision about >> whether checksums are enabled or disabled quite a but further away >> from the data itself, and also allows you to determine (by catalog >> inspection) which parts of the cluster do in fact have checksums. It >> might be kind of a pain to implement, though: you'd have to pass the >> information about how any given relation was configured down to the >> place where we validate page sanity. I'm not sure whether that's >> practical. > > It's not practical as the only mechanism, given the requirement for upgrade. Why not? > The version stuff is catered for by the current patch. Your patch reuses the version number field for an unrelated purpose and includes some vague hints of how we might do versioning using some of the page-level flag bits. Since bit-space is fungible, I think it makes more sense to keep the version number where it is and carve the checksum out of whatever bit space you would have moved the version number into. Whether that's pd_tli or pd_flags is somewhat secondary; the point is that you can certainly arrange to leave the 8 bits that currently contain the version number as they are. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
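To make the "carve the checksum out of other bit space" alternative concrete, here is a hedged sketch. The field choices (the high byte of pd_pagesize_version plus the high byte of pd_tli, keeping only the low 8 TLI bits for cross-checks) and the macro names are assumptions drawn from the suggestion above, not anything in the posted patch.

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: store a 16-bit checksum split across the high byte of
 * pd_pagesize_version (the page size is effectively always BLCKSZ) and the
 * high byte of pd_tli (only the low 8 TLI bits are kept for cross-checks),
 * leaving the low-byte layout version untouched.
 */
#define PageGetSplitChecksum(p) \
	((uint16) (((p)->pd_pagesize_version & 0xFF00) | \
			   (((p)->pd_tli >> 8) & 0x00FF)))

#define PageSetSplitChecksum(p, sum) \
	do { \
		(p)->pd_pagesize_version = \
			((p)->pd_pagesize_version & 0x00FF) | ((sum) & 0xFF00); \
		(p)->pd_tli = ((p)->pd_tli & 0x00FF) | (((sum) & 0x00FF) << 8); \
	} while (0)

The trade is explicit: the page-size cross-check and half of the stored TLI are given up, while the layout version byte stays exactly where it is.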
On Mon, Feb 20, 2012 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think we could do worse than this patch, but I don't really believe > it's ready for commit. We don't have a single performance number > showing how much of a performance regression this causes, either on > the master or on the standby, on any workload, much less those where a > problem is reasonably forseeable from the design you've chosen. Many > controversial aspects of the patch, such as the way you're using the > buffer header spinlocks as a surrogate for x-locking the buffer, or > the right way of obtaining the bit-space the patch requires, haven't > really been discussed, and to the extent that they have been > discussed, they have not been agreed. These points are being discussed here. If you have rational objections, say what they are. > On the former point, you haven't updated > src/backend/storage/buffer/README at all; I've updated the appropriate README file which relates to page LSNs to explain. > but updating is not by > itself sufficient. Before we change the buffer-locking rules, we > ought to have some discussion of whether it's OK to do that, and why > it's necessary. "Why it's necessary" would presumably include > demonstrating that the performance of the straightforward > implementation stinks, and that with this change is better. What straightforward implementation is that?? This *is* the straightforward one. God knows what else we'd break if we drop the lock, reacquire as an exclusive, then drop it again and reacquire in shared mode. Code tends to assume that when you take a lock you hold it until you release; doing otherwise would allow all manner of race conditions to emerge. And do you really think that would be faster? > You > haven't made any effort to do that whatsoever, or if you have, you > haven't posted the results here. I'm pretty un-excited by that > change, first because I think it's a modularity violation and possibly > unsafe, and second because I believe we've already got a problem with > buffer header spinlock contention which this might exacerbate. > >>> Another disadvantage of the current scheme is that there's no >>> particularly easy way to know that your whole cluster has checksums. >>> No matter how we implement checksums, you'll have to rewrite every >>> table in the cluster in order to get them fully turned on. But with >>> the current design, there's no easy way to know how much of the >>> cluster is actually checksummed. >> >> You can read every block and check. > > Using what tool? Pageinspect? >>> If you shut checksums off, they'll >>> linger until those pages are rewritten, and there's no easy way to >>> find the relations from which they need to be removed, either. >> >> We can't have it both ways. Either we have an easy upgrade, or we >> don't. I'm told that an easy upgrade is essential. > > Easy upgrade and removal of checksums are unrelated issues AFAICS. > >>> I'm tempted to suggest a relation-level switch: when you want >>> checksums, you use ALTER TABLE to turn them on, and when you don't >>> want them any more you use ALTER TABLE to shut them off again, in each >>> case rewriting the table. That way, there's never any ambiguity about >>> what's in the data pages in a given relation: either they're either >>> all checksummed, or none of them are. 
This moves the decision about >>> whether checksums are enabled or disabled quite a but further away >>> from the data itself, and also allows you to determine (by catalog >>> inspection) which parts of the cluster do in fact have checksums. It >>> might be kind of a pain to implement, though: you'd have to pass the >>> information about how any given relation was configured down to the >>> place where we validate page sanity. I'm not sure whether that's >>> practical. >> >> It's not practical as the only mechanism, given the requirement for upgrade. > > Why not? > >> The version stuff is catered for by the current patch. > > Your patch reuses the version number field for an unrelated purpose > and includes some vague hints of how we might do versioning using some > of the page-level flag bits. Since bit-space is fungible, I think it > makes more sense to keep the version number where it is and carve the > checksum out of whatever bit space you would have moved the version > number into. Whether that's pd_tli or pd_flags is somewhat secondary; > the point is that you can certainly arrange to leave the 8 bits that > currently contain the version number as they are. If you've got some rational objections, please raise them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > What straightforward implementation is that?? This *is* the straightforward one. > > God knows what else we'd break if we drop the lock, reacquire as an > exclusive, then drop it again and reacquire in shared mode. Code tends > to assume that when you take a lock you hold it until you release; > doing otherwise would allow all manner of race conditions to emerge. > > And do you really think that would be faster? I don't know, but neither do you, because you apparently haven't tried it. Games where we drop the shared lock and get an exclusive lock are used in numerous places in the source code; see, for example, heap_update(), so I don't believe that there is any reason to reject that a priori. Indeed, I can think of at least one pretty good reason to do it exactly that way: it's the way we've handled all previous problems of this type, and in general it's better to make new code look like existing code than to invent something new, especially when you haven't made any effort to quantify the problem or the extent to which the proposed solution mitigates it. > If you've got some rational objections, please raise them. Look, I think that the objections I have been raising are rational. YMMV, and obviously does. The bottom line here is that I can't stop you from committing this patch, and we both know that. And, really, at the end of the day, I have no desire to keep you from committing this patch, even if I had the power to do so, because I agree that the feature has merit. But I can object to it and, as the patch stands now, I do object to it, for the following specific reasons: 1. You haven't posted any meaningful performance test results, of either normal cases or worst cases. 2. You're fiddling with the buffer locking in a way that I'm very doubtful about without having tried any of the alternatives. 3. You're rearranging the page header in a way that I find ugly and baroque. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/20/12 5:57 AM, Robert Haas wrote: > 3. You're rearranging the page header in a way that I find ugly and baroque. Guys, are we talking about an on-disk format change? If so, this may need to be kicked out of 9.2 ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 2/20/12 5:57 AM, Robert Haas wrote: >> 3. You're rearranging the page header in a way that I find ugly and baroque. > > Guys, are we talking about an on-disk format change? If so, this may > need to be kicked out of 9.2 ... Yes, we are. Simon's gone to some pains to make it backward compatible, but IMHO it's a lot messier and less future-proof than it could be with some more work, and if we commit this patch then we WILL be stuck with this for a very long time. The fact is that, thus far, no real effort has been made by anyone to evict anything from the current CommitFest. I did that for the last two cycles, but as the minutes for last year's development meeting succinctly observed "Haas ... doesn't like being the schedule jerk". My resolve to be less of a schedule jerk is being sorely tested, though: aside from this patch, which has its share of issues, there's also Alvaro's foreign key stuff, which probably needs a lot more review than it has so far gotten, and several large patches from Dimitri, which do cool stuff but need a lot more work, and Peter Geoghegan's pg_stat_statements patch, which is awesome functionality but sounds like it's still a little green, and parallel pg_dump, which is waiting on a rebase after some refactoring I did to simplify things, and pgsql_fdw, and Heikki's xlog insert scaling patch, which fortunately seems to be in the capable hands of Fujii Masao and Jeff Janes, but certainly isn't ready to go today. Any of these individually could probably eat up a solid week of my time, and then there are the other 40 as-yet-undealt-with patches. I said five weeks ago that it probably wouldn't hurt anything to let the CommitFest run six weeks, and Tom opined that it would probably require two months. But the sad reality is that, after five weeks, we're not even half done with this CommitFest. That's partly because most people did not review as much code as they submitted, partly because a bunch of people played fast and loose with the submission deadline, and partly because we didn't return any patches that still needed major rehashing after the first round of review, leading to a situation where supposedly code-complete patches are showing up for the first time four or five weeks after the submission deadline. Now none of that is the fault of this patch specifically: honestly, if I had to pick just two more patches to get committed before beta, this would probably be one of them. But that doesn't make me happy with where it's at today, not to mention the fact that there are people who submitted their code a lot earlier who still haven't gotten the attention this patch has, which is still less than the attention a patch of this magnitude probably needs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> Guys, are we talking about an on-disk format change? If so, this may >> need to be kicked out of 9.2 ... > > Yes, we are. Simon's gone to some pains to make it backward > compatible, but IMHO it's a lot messier and less future-proof than it > could be with some more work, and if we commit this patch than we WILL > be stuck with this for a very long time. Yeah. I'd personally prefer to boot this to 9.3, then. It's not like there's not enough features for 9.2, and I really don't want this feature to cause 5 others to be delayed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: > Another disadvantage of the current scheme is that there's no > particularly easy way to know that your whole cluster has checksums. > No matter how we implement checksums, you'll have to rewrite every > table in the cluster in order to get them fully turned on. But with > the current design, there's no easy way to know how much of the > cluster is actually checksummed. If you shut checksums off, they'll > linger until those pages are rewritten, and there's no easy way to > find the relations from which they need to be removed, either. Yes, pg_upgrade makes this hard. If you are using pg_dump to restore, and set the checksum GUC before you do the restore, and never turn it off, then you will have a fully checksum'ed database. If you use pg_upgrade, and enable the checksum GUC, your database will become progressively checksummed, and as Simon pointed out, the only clean way is VACUUM FULL. It is quite hard to estimate the checksum coverage of a database with mixed checksumming --- one cool idea would be for ANALYZE to report how many of the pages it saw were checksummed. Yeah, crazy, but it might be enough. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
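A sketch of the "let ANALYZE report coverage" idea might look like the following; ReportChecksumCoverage is hypothetical, and PageClaimsChecksum() is the flag test sketched earlier in the thread.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Sketch: while visiting a sample of blocks (much as ANALYZE already does),
 * count how many claim a checksum and report the fraction.
 * PageClaimsChecksum() is the hypothetical test sketched earlier.
 */
static void
ReportChecksumCoverage(Relation rel, BlockNumber *sample, int nblocks)
{
	int			i;
	int			nchecksummed = 0;

	for (i = 0; i < nblocks; i++)
	{
		Buffer		buf = ReadBuffer(rel, sample[i]);

		LockBuffer(buf, BUFFER_LOCK_SHARE);
		if (PageClaimsChecksum((PageHeader) BufferGetPage(buf)))
			nchecksummed++;
		UnlockReleaseBuffer(buf);
	}

	ereport(INFO,
			(errmsg("%d of %d sampled pages carry checksums",
					nchecksummed, nblocks)));
}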
On Mon, Feb 20, 2012 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 2/20/12 5:57 AM, Robert Haas wrote: >>> 3. You're rearranging the page header in a way that I find ugly and baroque. >> >> Guys, are we talking about an on-disk format change? If so, this may >> need to be kicked out of 9.2 ... > > Yes, we are. Simon's gone to some pains to make it backward > compatible, but IMHO it's a lot messier and less future-proof than it > could be with some more work, and if we commit this patch than we WILL > be stuck with this for a very long time. No, we aren't. The format is not changed, though there are 3 new bits added. The format is designed to be extensible and we have room for 10 more bits, which allows 1024 new page formats, which at current usage rate should last us for approximately 5000 years of further development. Previously, we were limited to 251 more page formats, so this new mechanism is more future proof than the last. Alternatively you might say that we have dropped from 1271 page formats to only 1024, which is hardly limiting. This has already been discussed and includes points made by Bruce and Heikki in the final design. The "more work" required is not described clearly, nor has anyone but RH requested it. In terms of messier, RH's only alternate suggestion is to split the checksum into multiple pieces, which I don't think yer average reader would call less ugly, less messy or more future proof. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 20, 2012 at 11:09 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: >> Another disadvantage of the current scheme is that there's no >> particularly easy way to know that your whole cluster has checksums. >> No matter how we implement checksums, you'll have to rewrite every >> table in the cluster in order to get them fully turned on. But with >> the current design, there's no easy way to know how much of the >> cluster is actually checksummed. If you shut checksums off, they'll >> linger until those pages are rewritten, and there's no easy way to >> find the relations from which they need to be removed, either. > > Yes, pg_upgrade makes this hard. If you are using pg_dump to restore, > and set the checksum GUC before you do the restore, and never turn it > off, then you will have a fully checksum'ed database. > > If you use pg_upgrade, and enable the checksum GUC, your database will > become progressively checksummed, and as Simon pointed out, the only > clean way is VACUUM FULL. It is quite hard to estimate the checksum > coverage of a database with mixed checksumming --- one cool idea would > be for ANALYZE to report how many of the pages it saw were checksummed. > Yeah, crazy, but it might be enough. Well, I didn't say VACUUM FULL was the only clean way of knowing whether every block is checksummed, its a very intrusive way. If you want a fast upgrade with pg_upgrade, rewriting every block is not really a grand plan, but if you want it... If we did that, I think I would prefer to do it with these commands VACUUM ENABLE CHECKSUM; //whole database only VACUUM DISABLE CHECKSUM; rather than use a GUC. We can then add an option to pg_upgrade. That way, we scan whole database, adding checksums and then record it in pg_database When we remove it, we do same thing in reverse then record it. So there's no worries about turning on/off GUCs and we know for certain where our towel is. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 20, 2012 at 11:23:42PM +0000, Simon Riggs wrote: > > If you use pg_upgrade, and enable the checksum GUC, your database will > > become progressively checksummed, and as Simon pointed out, the only > > clean way is VACUUM FULL. It is quite hard to estimate the checksum > > coverage of a database with mixed checksumming --- one cool idea would > > be for ANALYZE to report how many of the pages it saw were checksummed. > > Yeah, crazy, but it might be enough. > > Well, I didn't say VACUUM FULL was the only clean way of knowing > whether every block is checksummed, its a very intrusive way. > > If you want a fast upgrade with pg_upgrade, rewriting every block is > not really a grand plan, but if you want it... > > If we did that, I think I would prefer to do it with these commands > > VACUUM ENABLE CHECKSUM; //whole database only > VACUUM DISABLE CHECKSUM; > > rather than use a GUC. We can then add an option to pg_upgrade. > > That way, we scan whole database, adding checksums and then record it > in pg_database > > When we remove it, we do same thing in reverse then record it. > > So there's no worries about turning on/off GUCs and we know for > certain where our towel is. Yes, that would work, but I suspect most users are going to want to enable checksums when they are seeing some odd behavior and want to test the hardware. Requiring them to rewrite the database to do that is making the feature less useful, and once they have the problem fixed, they might want to turn it off again. Now, one argument is that they could enable checksums, see no checksum errors, but still be getting checksum failures from pages that were written before checksum was enabled. I think we just need to document that checksum only works for writes performed _after_ the feature is enabled. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote: > On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > What straightforward implementation is that?? This *is* the straightforward one. > > > > God knows what else we'd break if we drop the lock, reacquire as an > > exclusive, then drop it again and reacquire in shared mode. Code tends > > to assume that when you take a lock you hold it until you release; > > doing otherwise would allow all manner of race conditions to emerge. > > > > And do you really think that would be faster? > > I don't know, but neither do you, because you apparently haven't tried > it. Games where we drop the shared lock and get an exclusive lock > are used in numerous places in the source code; see, for example, > heap_update(), so I don't believe that there is any reason to reject > that a priori. Indeed, I can think of at least one pretty good reason > to do it exactly that way: it's the way we've handled all previous > problems of this type, and in general it's better to make new code > look like existing code than to invent something new, especially when > you haven't made any effort to quantify the problem or the extent to > which the proposed solution mitigates it. We do, in numerous places, drop a shared buffer content lock and reacquire it in exclusive mode. However, the existing users of that pattern tend to trade the lock, complete subsequent work, and unlock the buffer all within the same function. So they must, because several of them recheck facts that can change during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers do not adhere to that pattern; they can be quite distant from the original lock acquisition and eventual lock release. Take the prototypical case of SetHintBits(). Our shared lock originated in a place like heapgettup(), which expects to hold it continually until finished. We would need to somehow push up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer lock, then have heapgettup() reorient itself as needed. Every caller of code that can reach SetBufferCommitInfoNeedsSave() would need to do the same. Perhaps there's a different way to apply the lock-trade technique that avoids this mess, but I'm not seeing it. Consequently, I find the idea of requiring a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE to be a sensible one. Within that umbrella, some details need attention: - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the spinlock for heap buffers, letting gistScanPage() off the hook.) We need a public API, perhaps LockBufferForLSN(). - The use of some spinlock need not imply using the buffer header spinlock. We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual trade-off of splitting a lock: less contention at the cost of more acquisitions. I have no intuition on which approach would perform better. I agree that src/backend/storage/buffer/README is the place to document the new locking rules. I do share your general unease about adding new twists to the buffer access rules. Some of our hairiest code is also the code manipulating buffer locks most extensively, and I would not wish to see that code get even more difficult to understand. However, I'm not seeing a credible alternative that retains the same high-level behaviors. 
A possible compromise is to leave the page clean after setting a hint bit, much like the patch already has us do under hot standby. Then there's no new WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin benchmarked when he was looking at hint bits? Does anyone recall the result? Thanks, nm [1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The comment is true today, but the same patch makes it false by having XLogSaveBufferForHint() call XLogInsert() under a share lock.
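A minimal standalone sketch of the locking rule proposed above: any reader or writer of pd_lsn/pd_tli that holds only a shared content lock goes through a small dedicated lock first. POSIX spinlocks stand in for PostgreSQL's s_lock, and the struct, field, and function names (including the LockBufferForLSN-style helpers) are hypothetical names taken from the discussion, not existing APIs:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Build with: cc sketch.c -pthread */

/* Simplified stand-in for a buffer descriptor plus the page header fields
 * under discussion; pd_lsn_tli_lock is the hypothetical dedicated spinlock. */
typedef struct
{
    uint64_t            pd_lsn;     /* stand-in for XLogRecPtr */
    uint16_t            pd_tli;
    pthread_spinlock_t  pd_lsn_tli_lock;
} ToyBufferDesc;

/* Hypothetical LockBufferForLSN()-style helpers: anyone reading or writing
 * pd_lsn/pd_tli while holding only a shared content lock goes through these. */
static uint64_t
page_get_lsn_locked(ToyBufferDesc *buf)
{
    pthread_spin_lock(&buf->pd_lsn_tli_lock);
    uint64_t lsn = buf->pd_lsn;
    pthread_spin_unlock(&buf->pd_lsn_tli_lock);
    return lsn;
}

static void
page_set_lsn_locked(ToyBufferDesc *buf, uint64_t lsn, uint16_t tli)
{
    pthread_spin_lock(&buf->pd_lsn_tli_lock);
    buf->pd_lsn = lsn;
    buf->pd_tli = tli;
    pthread_spin_unlock(&buf->pd_lsn_tli_lock);
}

int
main(void)
{
    ToyBufferDesc buf;

    buf.pd_lsn = 0;
    buf.pd_tli = 1;
    pthread_spin_init(&buf.pd_lsn_tli_lock, PTHREAD_PROCESS_PRIVATE);

    page_set_lsn_locked(&buf, 0x1000028, 1);
    printf("lsn = %llx, tli = %u\n",
           (unsigned long long) page_get_lsn_locked(&buf), buf.pd_tli);

    pthread_spin_destroy(&buf.pd_lsn_tli_lock);
    return 0;
}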
On Tue, Feb 21, 2012 at 10:07 AM, Noah Misch <noah@leadboat.com> wrote: > Consequently, I find the idea of requiring > a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE > to be a sensible one. OK > Within that umbrella, some details need attention: > > - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note > gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the > spinlock for heap buffers, letting gistScanPage() off the hook.) We need a > public API, perhaps LockBufferForLSN(). Yep, I checked all the call points previously. gistScanPage() and also gistbulkdelete() would need changing, but since GIST doesn't use hints, there is no need for locking. I'll document that better. > - The use of some spinlock need not imply using the buffer header spinlock. > We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual > trade-off of splitting a lock: less contention at the cost of more > acquisitions. I have no intuition on which approach would perform better. Will think about that and try a few ideas. > I agree that src/backend/storage/buffer/README is the place to document the > new locking rules. OK, I'll move the comments. > I do share your general unease about adding new twists to the buffer access > rules. Some of our hairiest code is also the code manipulating buffer locks > most extensively, and I would not wish to see that code get even more > difficult to understand. However, I'm not seeing a credible alternative that > retains the same high-level behaviors. Yes, those changes were not made lightly, and with much checking. > A possible compromise is to leave the page clean after setting a hint bit, > much like the patch already has us do under hot standby. Then there's no new > WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin > benchmarked when he was looking at hint bits? Does anyone recall the result? I was thinking exactly that myself. I'll add a GUC to test. > [1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The > comment is true today, but the same patch makes it false by having > XLogSaveBufferForHint() call XLogInsert() under a share lock. OK, that's an issue, well spotted. Will fix. Thanks for the thorough review. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: > On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it > >> doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that > >> do? > > > > As explained in detailed comments, the purpose of this is to implement > > Heikki's suggestion that we have a bit set to zero so we can detect > > failures that cause a run of 1s. > > I think it's nonsensical to pretend that there's anything special > about that particular bit. If we want to validate the page header > before trusting the lack of a checksum bit, we can do that far more > thoroughly than just checking that one bit. There are a whole bunch > of bits that ought to always be zero, and there are other things we > can validate as well (e.g. LSN not in future). If we're concerned > about the checksum-enabled bit getting flipped (and I agree that we > should be), we can check some combination of that stuff in the hope of > catching it, and that'll be a far better guard than just checking one > arbitrarily selected bit. PageHeaderIsValid() (being renamed to PageIsVerified()) already checks "(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0". Explicitly naming another bit and keeping it unset is redundant with that existing check. It would cease to be redundant if we ever allocate all the flag bits, but then we also wouldn't have a bit to spare as PD_CHECKSUM2. I agree with you on this point. > That having been said, I don't feel very good about the idea of > relying on the contents of the page to tell us whether or not the page > has a checksum. There's no guarantee that an error that flips the > has-checksum bit will flip any other bit on the page, or that it won't > flip everything else we're relying on as a backstop in exactly the > manner that foils whatever algorithm we put in place. Random > corruption is, perhaps, unlikely to do that, but somehow I feel like > it'll happen more often than random chance suggests. Things never > fail the way you want them to. > > Another disadvantage of the current scheme is that there's no > particularly easy way to know that your whole cluster has checksums. > No matter how we implement checksums, you'll have to rewrite every > table in the cluster in order to get them fully turned on. But with > the current design, there's no easy way to know how much of the > cluster is actually checksummed. If you shut checksums off, they'll > linger until those pages are rewritten, and there's no easy way to > find the relations from which they need to be removed, either. I'm not seeing value in rewriting pages to remove checksums, as opposed to just ignoring those checksums going forward. Did you have a particular scenario in mind? > I'm tempted to suggest a relation-level switch: when you want > checksums, you use ALTER TABLE to turn them on, and when you don't > want them any more you use ALTER TABLE to shut them off again, in each > case rewriting the table. That way, there's never any ambiguity about > what's in the data pages in a given relation: either they're either > all checksummed, or none of them are. This moves the decision about > whether checksums are enabled or disabled quite a but further away > from the data itself, and also allows you to determine (by catalog > inspection) which parts of the cluster do in fact have checksums. 
It > might be kind of a pain to implement, though: you'd have to pass the > information about how any given relation was configured down to the > place where we validate page sanity. I'm not sure whether that's > practical. This patch implies future opportunities to flesh out its UI, and I don't see it locking us out of implementing the above. We'll want a weak-lock command that adds checksums to pages lacking them. We'll want to expose whether a given relation has full checksum coverage. With that, we could produce an error when an apparently-non-checksummed page appears in a relation previously known to have full checksum coverage. Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only tool for enabling or disabling them, it's helpful to have each page declare whether it has a checksum. That way, the rewrite can take as little as an AccessShareLock, and any failure need not lose work already done. If you rely on anticipating the presence of a checksum based on a pg_class column, that ALTER needs to perform an atomic rewrite.
On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch <noah@leadboat.com> wrote: > On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: >> On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it >> >> doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that >> >> do? >> > >> > As explained in detailed comments, the purpose of this is to implement >> > Heikki's suggestion that we have a bit set to zero so we can detect >> > failures that cause a run of 1s. >> >> I think it's nonsensical to pretend that there's anything special >> about that particular bit. If we want to validate the page header >> before trusting the lack of a checksum bit, we can do that far more >> thoroughly than just checking that one bit. There are a whole bunch >> of bits that ought to always be zero, and there are other things we >> can validate as well (e.g. LSN not in future). If we're concerned >> about the checksum-enabled bit getting flipped (and I agree that we >> should be), we can check some combination of that stuff in the hope of >> catching it, and that'll be a far better guard than just checking one >> arbitrarily selected bit. > > PageHeaderIsValid() (being renamed to PageIsVerified()) already checks > "(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0". Explicitly naming another bit > and keeping it unset is redundant with that existing check. It would cease to > be redundant if we ever allocate all the flag bits, but then we also wouldn't > have a bit to spare as PD_CHECKSUM2. I agree with you on this point. No problem to remove that. It was there at Heikki's request. >> That having been said, I don't feel very good about the idea of >> relying on the contents of the page to tell us whether or not the page >> has a checksum. There's no guarantee that an error that flips the >> has-checksum bit will flip any other bit on the page, or that it won't >> flip everything else we're relying on as a backstop in exactly the >> manner that foils whatever algorithm we put in place. Random >> corruption is, perhaps, unlikely to do that, but somehow I feel like >> it'll happen more often than random chance suggests. Things never >> fail the way you want them to. >> >> Another disadvantage of the current scheme is that there's no >> particularly easy way to know that your whole cluster has checksums. >> No matter how we implement checksums, you'll have to rewrite every >> table in the cluster in order to get them fully turned on. But with >> the current design, there's no easy way to know how much of the >> cluster is actually checksummed. If you shut checksums off, they'll >> linger until those pages are rewritten, and there's no easy way to >> find the relations from which they need to be removed, either. > > I'm not seeing value in rewriting pages to remove checksums, as opposed to > just ignoring those checksums going forward. Did you have a particular > scenario in mind? Agreed. No reason to change a checksum unless we rewrite the block, no matter whether page_checksums is on or off. >> I'm tempted to suggest a relation-level switch: when you want >> checksums, you use ALTER TABLE to turn them on, and when you don't >> want them any more you use ALTER TABLE to shut them off again, in each >> case rewriting the table. That way, there's never any ambiguity about >> what's in the data pages in a given relation: either they're either >> all checksummed, or none of them are. 
This moves the decision about >> whether checksums are enabled or disabled quite a but further away >> from the data itself, and also allows you to determine (by catalog >> inspection) which parts of the cluster do in fact have checksums. It >> might be kind of a pain to implement, though: you'd have to pass the >> information about how any given relation was configured down to the >> place where we validate page sanity. I'm not sure whether that's >> practical. > > This patch implies future opportunities to flesh out its UI, and I don't see > it locking us out of implementing the above. Agreed > We'll want a weak-lock command > that adds checksums to pages lacking them. VACUUM FREEZE will do this. > We'll want to expose whether a > given relation has full checksum coverage. Not sure I understand why we'd want that. If you really want that, it's a simple function to do it. > With that, we could produce an > error when an apparently-non-checksummed page appears in a relation previously > known to have full checksum coverage. Additional checking at relation level is just going to slow things down. Don't see the value of having a relation-level checksum option. It would be an easy thing to skip checksums on UNLOGGED relations, and possibly desirable, but I don't see the utility of a per-relation checksum setting in other cases. There's just no need for fine tuning like that. > Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only > tool for enabling or disabling them, it's helpful to have each page declare > whether it has a checksum. That way, the rewrite can take as little as an > AccessShareLock, and any failure need not lose work already done. If you rely > on anticipating the presence of a checksum based on a pg_class column, that > ALTER needs to perform an atomic rewrite. That will pretty much kill the usability of this feature. We must be able to add checksums without taking a full lock on the table and/or database. Making it an ALTER TABLE would require a huge maintenance effort to add checksums to a database. -1 to the idea of having an ALTER TABLE variant to enable/disable checksums -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <noah@leadboat.com> wrote: > We do, in numerous places, drop a shared buffer content lock and reacquire it > in exclusive mode. However, the existing users of that pattern tend to trade > the lock, complete subsequent work, and unlock the buffer all within the same > function. So they must, because several of them recheck facts that can change > during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers > do not adhere to that pattern; they can be quite distant from the original > lock acquisition and eventual lock release. Take the prototypical case of > SetHintBits(). Our shared lock originated in a place like heapgettup(), which > expects to hold it continually until finished. We would need to somehow push > up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer > lock, then have heapgettup() reorient itself as needed. Every caller of code > that can reach SetBufferCommitInfoNeedsSave() would need to do the same. > Perhaps there's a different way to apply the lock-trade technique that avoids > this mess, but I'm not seeing it. Consequently, I find the idea of requiring > a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE > to be a sensible one. Within that umbrella, some details need attention: Fair analysis. > - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note > gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the > spinlock for heap buffers, letting gistScanPage() off the hook.) We need a > public API, perhaps LockBufferForLSN(). That seems like a good idea. I am a little worried about Simon's plan to do the additional locking only for AMs that have no unlogged-type updates. It's a reasonable performance optimization, and skipping the locking when checksums are shut off also seems reasonable, but it seems a bit fragile: suppose that, in the future, someone fixes GiST to do something intelligent with kill_prior_tuple. Now suddenly it needs LSN locking that it didn't need before, but this fact might not be very obvious. It might be a good idea to design LockBufferForLSN to take an AM OID as an argument, and use the AM OID to decide whether the locking is required. That way, we can call it from everywhere that reads an LSN, and it can simply skip the locking in places where it isn't presently needed. Or maybe there's a better design, but I agree with you that some kind of public API is essential. > - The use of some spinlock need not imply using the buffer header spinlock. > We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual > trade-off of splitting a lock: less contention at the cost of more > acquisitions. I have no intuition on which approach would perform better. I think I like the idea of a separate lock, but I agree it could stand to be tested both ways. Another thought is that we might add SpinLockConditionalAcquire(), that just tries to TAS the lock and returns false if the lock is already held. Then, we could grab the spinlock while writing the page, in lieu of copying it, and anyone who wants to set a hint bit can conditionally acquire the spinlock long enough to set the bits. If the spinlock isn't immediately free then instead of spinning they just don't set the hint bits after all. That way you can be sure that no hint bits are set during the write, but any attempt to set a hint bit only costs you a single TAS - no loop, as with a regular spinlock acquisition - and of course the hint itself. 
> A possible compromise is to leave the page clean after setting a hint bit, > much like the patch already has us do under hot standby. Then there's no new > WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin > benchmarked when he was looking at hint bits? Does anyone recall the result? It slows things down substantially in the case of repeated scans of the same unchanging table. In fact, I tried a much more nuanced approach: set BM_UNTIDY on the page when SetBufferCommitInfoNeedsSave() is called. BM_UNTIDY pages are written by the background writer and during checkpoints, and 5% of the time by backends. All of that has the salutary effect of making the first sequential scan less ungodly slow, but it also has the less-desirable effect of making the next 19 of them still kinda-sorta slow. It was unclear to me (and perhaps to others) whether it really worked out to a win. However, we might need/want to revisit some of that logic in connection with this patch, because it seems to me that a large sequential scan of an unhinted table could be ferociously slow, with the backend writing out its own pages and WAL-logging them as it goes. It would be good to test that to figure out whether some mitigation measures are needed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
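A minimal standalone C11 sketch of the SpinLockConditionalAcquire() idea from earlier in this message: the writer holds a TAS-based lock across checksum-and-write, while a hint-bit setter does a single test-and-set and simply skips the hint if the lock is busy. All names here are hypothetical; atomic_flag stands in for PostgreSQL's TAS:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a buffer header: one flag protects the page image while it
 * is being checksummed and written out. */
typedef struct
{
    atomic_flag write_in_progress;  /* the conditional "spinlock" */
    unsigned    infomask;           /* where a hint bit would live */
} ToyBuffer;

#define HINT_XMIN_COMMITTED 0x0100

/* Hypothetical SpinLockConditionalAcquire(): a single TAS, no spin loop. */
static bool
try_lock(ToyBuffer *buf)
{
    return !atomic_flag_test_and_set(&buf->write_in_progress);
}

static void
unlock(ToyBuffer *buf)
{
    atomic_flag_clear(&buf->write_in_progress);
}

/* Hint-bit setter: if the writer holds the lock right now, just give up;
 * losing a hint bit is harmless, it will be set again on a later visit. */
static void
set_hint_bit(ToyBuffer *buf, unsigned bit)
{
    if (try_lock(buf))
    {
        buf->infomask |= bit;
        unlock(buf);
    }
}

/* Writer: hold the lock across checksum + write so no hint bit can change
 * the page image underneath the checksum calculation. */
static void
write_page(ToyBuffer *buf)
{
    while (!try_lock(buf))
        ;                           /* the writer may spin; setters never do */
    /* ... compute the checksum over the page and issue the write here ... */
    unlock(buf);
}

int
main(void)
{
    ToyBuffer buf = { ATOMIC_FLAG_INIT, 0 };

    set_hint_bit(&buf, HINT_XMIN_COMMITTED);
    write_page(&buf);
    printf("infomask = 0x%x\n", buf.infomask);
    return 0;
}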
I decided that it would be worth benchmarking this patch. Specifically, I tested: master, as a basis of comparison checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on' checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off' This test was performed using pgbench-tools. At different client counts and scaling factors "1,10,100", performance of an update.sql workload was tested. This benchmark used Greg Smith's "toy" server. As he put it recently: "The main change to the 8 hyperthreaded core test server (Intel i7-870) for this year is bumping it from 8GB to 16GB of RAM, which effectively doubles the scale I can reach before things slow dramatically." However, while Greg used Scientific Linux for his recent batch of performance numbers, the box was booted into Debian for this, which used Kernel 2.6.32, FWIW. Didn't bother with a separate disk for WAL. I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the additional precaution of initdb'ing for each set, lest there be some kind of contamination between sets, which necessitated doing some additional work since I couldn't very well expect the "results" database to persist. Different sets of figures from different runs were dumped and restored, before finally being combined so that pgbench-tools could produce its regular report. I have attached a config for pgbench-tools, so that others may recreate my work here. I also attach the most relevant image, comparing each test set across scaling levels. I'll make a PDF of the full report available if that would be useful. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Attachment
On Wed, Feb 22, 2012 at 11:17:53PM +0000, Peter Geoghegan wrote: > I decided that it would be worth benchmarking this patch. > Specifically, I tested: > > master, as a basis of comparison > > checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on' > > checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off' > > This test was performed using pgbench-tools. At different client > counts and scaling factors "1,10,100", performance of an update.sql > workload was tested. Looks interesting. Could you get some error bars around the numbers plotted, and possibly some scaling factors between 10 and 100? For the former, I'm looking for whether those changes are within ordinary variation, and in the latter, some better idea of what the curve looks like. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Feb 22, 2012 at 6:17 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > I decided that it would be worth benchmarking this patch. > Specifically, I tested: > > master, as a basis of comparison > > checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on' > > checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off' > > This test was performed using pgbench-tools. At different client > counts and scaling factors "1,10,100", performance of an update.sql > workload was tested. > > This benchmark used Greg Smith's "toy" server. As he put it recently: > > "The main change to the 8 hyperthreaded core test server (Intel > i7-870) for this year is bumping it from 8GB to 16GB of RAM, which > effectively doubles the scale I can reach before things slow > dramatically." However, while Greg used scientific Linux for his > recent batch of performance numbers, the box was booted into Debian > for this, which used Kernel 2.6.32, FWIW. Didn't bother with a > separate disk for WAL. > > I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the > additional precaution of initdb'ing for each set, lest there be some > kind of contamination between sets, which necessitated doing some > additional work since I couldn't very well expect the "results" > database to persist. Different sets of figures from different runs > where dumped and restored, before finally being combined so that > pgbench-tools could produce it's regular report. > > I have attached a config for pgbench-tools, so that others may > recreate my work here. I also attach the most relevant image, > comparing each test set across scaling levels. I'll make a pdf of the > full report available if that would be useful. Thanks for testing this. The graph obscures a bit how much percentage change we're talking about here - could you post the raw tps numbers? I think we also need to test the case of seq-scanning a large, unhinted table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22.02.2012 14:30, Simon Riggs wrote: > On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah@leadboat.com> wrote: >> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: >>> Another disadvantage of the current scheme is that there's no >>> particularly easy way to know that your whole cluster has checksums. >>> No matter how we implement checksums, you'll have to rewrite every >>> table in the cluster in order to get them fully turned on. But with >>> the current design, there's no easy way to know how much of the >>> cluster is actually checksummed. If you shut checksums off, they'll >>> linger until those pages are rewritten, and there's no easy way to >>> find the relations from which they need to be removed, either. >> >> I'm not seeing value in rewriting pages to remove checksums, as opposed to >> just ignoring those checksums going forward. Did you have a particular >> scenario in mind? > > Agreed. No reason to change a checksum unless we rewrite the block, no > matter whether page_checksums is on or off. This can happen: 1. checksums are initially enabled. A page is written, with a correct checksum. 2. checksums turned off. 3. A hint bit is set on the page. 4. While the page is being written out, someone pulls the power cord, and you get a torn write. The hint bit change made it to disk, but the clearing of the checksum in the page header did not. 5. Sometime after restart, checksums are turned back on. The page now has an incorrect checksum on it. The next time it's read, you get a checksum error. I'm pretty uncomfortable with this idea of having a flag on the page itself to indicate whether it has a checksum or not. No matter how many bits we use for that flag. You can never be quite sure that all your data is covered by the checksum, and there's a lot of room for subtle bugs like the above, where a page is reported as corrupt when it isn't, or vice versa. This thing needs to be reliable and robust. The purpose of a checksum is to have an extra sanity check, to detect faulty hardware. If it's complicated, whenever you get a checksum mismatch, you'll be wondering if you have broken hardware or if you just bumped on a PostgreSQL bug. I think you need a flag in pg_control or somewhere to indicate whether checksums are currently enabled or disabled, and a mechanism to scan and rewrite all the pages with checksums, before they are verified. I've said this before, but I still don't like the hacks with the version number in the page header. Even if it works, I would much prefer the straightforward option of extending the page header for the new field. Yes, it means you have to deal with pg_upgrade, but it's a hurdle we'll have to jump at some point anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
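A toy standalone C program that walks through the five steps above and shows how a torn write can leave a page that fails verification even though the hardware did nothing wrong. The sector size, field offsets, and checksum function are assumptions chosen for illustration, not the patch's actual layout:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ    8192
#define SECTOR    512              /* disks promise atomicity per sector, not per page */
#define CKSUM_OFF 8                /* assumed 16-bit checksum slot in the header */
#define FLAGS_OFF 10               /* assumed "page has a checksum" flag byte */
#define HINT_OFF  4000             /* some tuple's hint bit, far from the header */

/* Stand-in checksum (the patch uses Fletcher-16); skips its own field. */
static uint16_t
toy_checksum(const unsigned char *p)
{
    uint32_t s = 1;
    for (int i = 0; i < BLCKSZ; i++)
        if (i != CKSUM_OFF && i != CKSUM_OFF + 1)
            s = (s * 31 + p[i]) & 0xFFFF;
    return (uint16_t) s;
}

int
main(void)
{
    unsigned char on_disk[BLCKSZ] = {0};
    unsigned char in_mem[BLCKSZ];
    uint16_t      c, stored, actual;

    /* 1. checksums on: page written with flag set and a correct checksum */
    on_disk[FLAGS_OFF] = 1;
    c = toy_checksum(on_disk);
    memcpy(on_disk + CKSUM_OFF, &c, sizeof(c));

    /* 2-3. checksums turned off; a hint bit is set and the header's
     * checksum flag is cleared, but only in the in-memory copy so far */
    memcpy(in_mem, on_disk, BLCKSZ);
    in_mem[HINT_OFF] |= 0x01;
    in_mem[FLAGS_OFF] = 0;

    /* 4. torn write: power fails after the sector holding the hint bit
     * reaches disk but before the header sector does */
    memcpy(on_disk + (HINT_OFF / SECTOR) * SECTOR,
           in_mem + (HINT_OFF / SECTOR) * SECTOR, SECTOR);

    /* 5. checksums turned back on: the page still claims to be checksummed,
     * but its contents changed, so verification fails with no bad hardware */
    memcpy(&stored, on_disk + CKSUM_OFF, sizeof(stored));
    actual = toy_checksum(on_disk);
    printf("flag=%d stored=0x%04x actual=0x%04x -> %s\n",
           on_disk[FLAGS_OFF], stored, actual,
           stored == actual ? "ok" : "false checksum failure");
    return 0;
}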
On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 22.02.2012 14:30, Simon Riggs wrote: >> >> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah@leadboat.com> wrote: >>> >>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote: >>>> >>>> Another disadvantage of the current scheme is that there's no >>>> particularly easy way to know that your whole cluster has checksums. >>>> No matter how we implement checksums, you'll have to rewrite every >>>> table in the cluster in order to get them fully turned on. But with >>>> the current design, there's no easy way to know how much of the >>>> cluster is actually checksummed. If you shut checksums off, they'll >>>> linger until those pages are rewritten, and there's no easy way to >>>> find the relations from which they need to be removed, either. >>> >>> >>> I'm not seeing value in rewriting pages to remove checksums, as opposed >>> to >>> just ignoring those checksums going forward. Did you have a particular >>> scenario in mind? >> >> >> Agreed. No reason to change a checksum unless we rewrite the block, no >> matter whether page_checksums is on or off. > > > This can happen: > > 1. checksums are initially enabled. A page is written, with a correct > checksum. > 2. checksums turned off. > 3. A hint bit is set on the page. > 4. While the page is being written out, someone pulls the power cord, and > you get a torn write. The hint bit change made it to disk, but the clearing > of the checksum in the page header did not. > 5. Sometime after restart, checksums are turned back on. > > The page now has an incorrect checksum on it. The next time it's read, you > get a checksum error. Yes, you will. And you'll get a checksum error because the block no longer passes. So an error should be reported. We can and should document that turning this on/off/on can cause problems. Hopefully crashing isn't that common a situation. The production default would be "off". The default in the patch is "on" only for testing. > I'm pretty uncomfortable with this idea of having a flag on the page itself > to indicate whether it has a checksum or not. No matter how many bits we use > for that flag. You can never be quite sure that all your data is covered by > the checksum, and there's a lot of room for subtle bugs like the above, > where a page is reported as corrupt when it isn't, or vice versa. That is necessary to allow upgrade. It's not there for any other reason. > This thing needs to be reliable and robust. The purpose of a checksum is to > have an extra sanity check, to detect faulty hardware. If it's complicated, > whenever you get a checksum mismatch, you'll be wondering if you have broken > hardware or if you just bumped on a PostgreSQL bug. I think you need a flag > in pg_control or somewhere to indicate whether checksums are currently > enabled or disabled, and a mechanism to scan and rewrite all the pages with > checksums, before they are verified. That would require massive downtime, so again, it has been ruled out for practicality. > I've said this before, but I still don't like the hacks with the version > number in the page header. Even if it works, I would much prefer the > straightforward option of extending the page header for the new field. Yes, > it means you have to deal with pg_upgrade, but it's a hurdle we'll have to > jump at some point anyway. What you suggest might happen in the next release, or maybe longer. There may be things that block it completely, so it might never happen. 
My personal opinion is that it is not possible to make further block format changes until we have a fully online upgrade process, otherwise we block people from upgrading - not everybody can take their site down to run pg_upgrade. I plan to work on that, but it may not happen for 9.3; perhaps you will object to that also when it comes. So we simply cannot rely on this "jam tomorrow" vision. This patch is very specifically something that makes the best of the situation, now, for those that want and need it. If you don't want it, you don't have to use it. But that shouldn't stop us giving it to the people that do want it. I'm hearing general interest and support for this feature from people that run their business on PostgreSQL. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 29.02.2012 17:01, Simon Riggs wrote: > On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 22.02.2012 14:30, Simon Riggs wrote: >>> >>> Agreed. No reason to change a checksum unless we rewrite the block, no >>> matter whether page_checksums is on or off. >> >> This can happen: >> >> 1. checksums are initially enabled. A page is written, with a correct >> checksum. >> 2. checksums turned off. >> 3. A hint bit is set on the page. >> 4. While the page is being written out, someone pulls the power cord, and >> you get a torn write. The hint bit change made it to disk, but the clearing >> of the checksum in the page header did not. >> 5. Sometime after restart, checksums are turned back on. >> >> The page now has an incorrect checksum on it. The next time it's read, you >> get a checksum error. > > Yes, you will. And you'll get a checksum error because the block no > longer passes. So an error should be reported. > > We can and should document that turning this on/off/on can cause > problems. Hopefully crashing isn't that common a situation. Hopefully not, but then again, you get interested in fiddling with this setting, when you do experience crashes. This feature needs to be trustworthy in the face of crashes. >> I'm pretty uncomfortable with this idea of having a flag on the page itself >> to indicate whether it has a checksum or not. No matter how many bits we use >> for that flag. You can never be quite sure that all your data is covered by >> the checksum, and there's a lot of room for subtle bugs like the above, >> where a page is reported as corrupt when it isn't, or vice versa. > > That is necessary to allow upgrade. It's not their for any other reason. Understood. I'm uncomfortable with it regardless of why it's there. >> This thing needs to be reliable and robust. The purpose of a checksum is to >> have an extra sanity check, to detect faulty hardware. If it's complicated, >> whenever you get a checksum mismatch, you'll be wondering if you have broken >> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag >> in pg_control or somewhere to indicate whether checksums are currently >> enabled or disabled, and a mechanism to scan and rewrite all the pages with >> checksums, before they are verified. > > That would require massive downtime, so again, it has been ruled out > for practicality. Surely it can be done online. You'll just need a third state between off and on, where checksums are written but not verified, while the cluster is scanned. >> I've said this before, but I still don't like the hacks with the version >> number in the page header. Even if it works, I would much prefer the >> straightforward option of extending the page header for the new field. Yes, >> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to >> jump at some point anyway. > > What you suggest might happen in the next release, or maybe longer. > There may be things that block it completely, so it might never > happen. My personal opinion is that it is not possible to make further > block format changes until we have a fully online upgrade process, > otherwise we block people from upgrading - not everybody can take > their site down to run pg_upgrade. I plan to work on that, but it may > not happen for 9.3; Yep, we should bite the bullet and work on that. > perhaps you will object to that also when it comes. Heh, quite possible :-). But only if there's a reason. 
I do want to see a solution to this, I have a few page-format changing ideas myself that I'd like to implement at some point. > This patch is very specifically something that makes the best of the > situation, now, for those that want and need it. If you don't want it, > you don't have to use it. Whether I use it or not, I'll have to live with it in the source tree. > But that shouldn't stop us giving it to the people that do want it. > > I'm hearing general interest and support for this feature from people > that run their business on PostgreSQL. If you ask someone "would you like to have checksums in PostgreSQL?", he'll say "sure". If you ask him "would you like to keep the PostgreSQL source tree as simple as possible, to make it easy for new developers to join the effort?", he'll say "yes". It's all about how you frame the question. Even if you want to have checksums on data pages, it doesn't necessarily mean you want them so badly you can't wait another release or two for a cleaner solution, or that you'd be satisfied with the implementation proposed here. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Surely it can be done online. You'll just need a third state between off and > on, where checksums are written but not verified, while the cluster is > scanned. Are you saying you would accept the patch if we had this? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 29.02.2012 17:42, Simon Riggs wrote: > On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >> Surely it can be done online. You'll just need a third state between off and >> on, where checksums are written but not verified, while the cluster is >> scanned. > > Are you saying you would accept the patch if we had this? I think I would still be uncomfortable with the hacks in the page header. Less so than in the current form - you wouldn't need a flag to indicate whether the page has a valid checksum or not, which would clean it up quite a bit - but still. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Are you saying you would accept the patch if we had this? > I think I would still be uncomfortable with the hacks in the page header. There are no "hacks". There are some carefully designed changes with input from multiple people, including yourself, and it copes as gracefully as it can with backwards compatibility requirements. > Less so than in the current form - you wouldn't need a flag to indicate > whether the page has a valid checksum or not, which would clean it up quite > a bit - but still. I agree this would remove the reliance on a bit in the page header and so make it even more robust. I'll add the 2 phase enabling feature, making it happen at database level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
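A minimal sketch of the two-phase (three-state) control logic being agreed here: off, then a "write but don't verify yet" state while the cluster is scanned, then fully on. The enum and function names are assumptions, not code from the patch:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cluster-wide checksum state, roughly what a two-phase
 * enabling scheme recorded in pg_control or pg_database might track. */
typedef enum
{
    CHECKSUMS_OFF,          /* neither written nor verified */
    CHECKSUMS_INPROGRESS,   /* written on every flush, not verified yet:
                             * a scan is still stamping pre-existing pages */
    CHECKSUMS_ON            /* written on flush and verified on read */
} ChecksumState;

static bool
checksums_written(ChecksumState s)
{
    return s != CHECKSUMS_OFF;
}

static bool
checksums_verified(ChecksumState s)
{
    return s == CHECKSUMS_ON;
}

int
main(void)
{
    ChecksumState s = CHECKSUMS_OFF;

    /* phase 1: start writing checksums while the whole cluster is scanned */
    s = CHECKSUMS_INPROGRESS;
    printf("during scan: write=%d verify=%d\n",
           checksums_written(s), checksums_verified(s));

    /* phase 2: once every page has been rewritten, start verifying too */
    s = CHECKSUMS_ON;
    printf("after scan:  write=%d verify=%d\n",
           checksums_written(s), checksums_verified(s));
    return 0;
}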
On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > >>> Are you saying you would accept the patch if we had this? > >> I think I would still be uncomfortable with the hacks in the page header. > > There are no "hacks". There are some carefully designed changes with > input from multiple people, including yourself, and it copes as > gracefully as it can with backwards compatibility requirements. You have comments from three different people, all experienced hackers, disagreeing with this position; Heikki and I have both proposed alternate approaches. I'm not sure that we're at a point where we can say that we know what the best solution is, but I think it is clear that there's enough concern about this that you ought not to be denying that there is a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 29, 2012 at 12:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas >>> <heikki.linnakangas@enterprisedb.com> wrote: >>> >>>>> Are you saying you would accept the patch if we had this? >>> >>>> I think I would still be uncomfortable with the hacks in the page header. >>> >>> There are no "hacks". There are some carefully designed changes with >>> input from multiple people, including yourself, and it copes as >>> gracefully as it can with backwards compatibility requirements. >> >> You have comments from three different people, all experienced >> hackers, disagreeing with this position; > > Who is the third person you speak of? Perhaps they will speak again if > they wish to be heard. Tom Lane. It was the very first email posted in response to the very first version of this patch you ever posted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 29, 2012 at 5:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> You have comments from three different people, all experienced >>> hackers, disagreeing with this position; >> >> Who is the third person you speak of? Perhaps they will speak again if >> they wish to be heard. > > Tom Lane. It was the very first email posted in response to the very > first version of this patch you ever posted. Tom objected to not being able to tell which version a data block was. At length, we have discussed this on list and there is no issue. It is clear what page format a block has. Please ping him if you believe he has other rational objections to committing something and he isn't listening. I'm beginning to lose faith that objections are being raised at a rational level. It's not a panel game with points for clever answers, it's an engineering debate about how to add features real users want. And they do want, so let me solve the problems by agreeing something early enough to allow it to be implemented, rather than just discussing it until we run out of time. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >> >>>> Are you saying you would accept the patch if we had this? >> >>> I think I would still be uncomfortable with the hacks in the page header. >> >> There are no "hacks". There are some carefully designed changes with >> input from multiple people, including yourself, and it copes as >> gracefully as it can with backwards compatibility requirements. > > You have comments from three different people, all experienced > hackers, disagreeing with this position; Who is the third person you speak of? Perhaps they will speak again if they wish to be heard. > Heikki and I have both > proposed alternate approaches. No, you've proposed rejecting the patch in favour of some future dream. We all agree on the future dream but it clearly happens in the future and that could easily be more than 1 release ahead. If you have comments that would allow a patch in this release, I am happy to hear them. I hear only two people seeking to reject a patch that adds value for users. Quite frankly, the comments about flag bits are ridiculous and bogus. > I'm not sure that we're at a point > where we can say that we know what the best solution is, but I think > it is clear that there's enough concern about this that you ought not > to be denying that there is a problem. There are some weird cases that can cause problems. My wish is to resolve those so everyone is happy. If those weird cases are simply used as an excuse to reject, then I don't accept that and nor will our users. Of course, if there was a significant issue, it would be immediate rejection but there isn't. I'm trying to commit a useful patch in this release. If you'd like to work with me to find a solution acceptable to all, I'd be happy to include suggestions, just as I have already included comments from Heikki, Bruce, Noah and others. I do accept that Heikki now says that the code I added at his request is just a hack. Assuming I'm going to commit something in this release, what should it look like? At Heikki's request/idea I will be working on a 2-phase database level parameter that will give us checksumming on the whole database after a scan to enable it. That sounds like it will resolve the corner cases relatively cleanly. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 29.02.2012 19:54, Simon Riggs wrote: > I'm beginning to lose faith that objections are being raised at a > rational level. It's not a panel game with points for clever answers, > its an engineering debate about how to add features real users want. > And they do want, so let me solve the problems by agreeing something > early enough to allow it to be implemented, rather than just > discussing it until we run out of time. I thought my view on how this should be done was already clear, but just in case it isn't, let me restate: Enlarge the page header to make room for the checksum. To handle upgrades, put code in the backend to change the page format from old version to new one on-the-fly, as pages are read in. Because we're making the header larger, we need to ensure that there's room on every page. To do that, write a utility that you run on the cluster before running pg_upgrade, which moves tuples to ensure that. To ensure that the space doesn't get used again before upgrading, change the old version so that it reserves those N bytes in all new insertions and updates (I believe that approach has been discussed before and everyone is comfortable with backpatching such a change). All of this in 9.3. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
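A toy model of why the pre-upgrade utility is needed in this plan: growing the page header by N bytes means the line-pointer array has to slide, so every page needs at least N free bytes between pd_lower and pd_upper before conversion. The structs here are a drastic simplification of the real page layout, and N = 4 is just an assumed figure:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Drastically simplified page header: line pointers grow up from the
 * header to pd_lower, tuple data grows down from the end to pd_upper. */
typedef struct
{
    uint16_t pd_lower;   /* end of the line-pointer array */
    uint16_t pd_upper;   /* start of tuple data */
} ToyHeader;

/* Can this page absorb n extra header bytes without moving any tuples?
 * The line pointers must slide up by n, so we need n free bytes in the
 * hole between pd_lower and pd_upper. */
static bool
has_room_for_header_growth(const ToyHeader *h, int n)
{
    return (int) h->pd_upper - (int) h->pd_lower >= n;
}

int
main(void)
{
    ToyHeader full  = { .pd_lower = 4000, .pd_upper = 4002 };   /* 2 bytes free */
    ToyHeader roomy = { .pd_lower = 1200, .pd_upper = 6400 };

    int n = 4;   /* assumed growth: a 16-bit checksum plus alignment padding */

    /* This is the per-page question the pre-upgrade utility has to answer;
     * pages without room are the ones whose tuples it must move first. */
    printf("full page:  %s\n", has_room_for_header_growth(&full, n)
           ? "convertible in place" : "needs tuples moved first");
    printf("roomy page: %s\n", has_room_for_header_growth(&roomy, n)
           ? "convertible in place" : "needs tuples moved first");
    return 0;
}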
Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012: > On 29.02.2012 19:54, Simon Riggs wrote: > > I'm beginning to lose faith that objections are being raised at a > > rational level. It's not a panel game with points for clever answers, > > its an engineering debate about how to add features real users want. > > And they do want, so let me solve the problems by agreeing something > > early enough to allow it to be implemented, rather than just > > discussing it until we run out of time. > > I thought my view on how this should be done was already clear, but just > in case it isn't, let me restate: Enlarge the page header to make room > for the checksum. To handle upgrades, put code in the backend to change > the page format from old version to new one on-the-fly, as pages are > read in. Because we're making the header larger, we need to ensure that > there's room on every page. To do that, write a utility that you run on > the cluster before running pg_upgrade, which moves tuples to ensure > that. To ensure that the space doesn't get used again before upgrading, > change the old version so that it reserves those N bytes in all new > insertions and updates (I believe that approach has been discussed > before and everyone is comfortable with backpatching such a change). All > of this in 9.3. Note that if we want such a utility to walk and transform pages, we probably need a marker in the catalogs somewhere so that pg_upgrade can make sure that it was done in all candidate tables -- which is something that we should get in 9.2 so that it can be used in 9.3. Such a marker would also allow us to get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 29, 2012 at 2:09 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 29.02.2012 19:54, Simon Riggs wrote: >> I'm beginning to lose faith that objections are being raised at a >> rational level. It's not a panel game with points for clever answers, >> its an engineering debate about how to add features real users want. >> And they do want, so let me solve the problems by agreeing something >> early enough to allow it to be implemented, rather than just >> discussing it until we run out of time. > > I thought my view on how this should be done was already clear, but just in > case it isn't, let me restate: Enlarge the page header to make room for the > checksum. To handle upgrades, put code in the backend to change the page > format from old version to new one on-the-fly, as pages are read in. Because > we're making the header larger, we need to ensure that there's room on every > page. To do that, write a utility that you run on the cluster before running > pg_upgrade, which moves tuples to ensure that. To ensure that the space > doesn't get used again before upgrading, change the old version so that it > reserves those N bytes in all new insertions and updates (I believe that > approach has been discussed before and everyone is comfortable with > backpatching such a change). All of this in 9.3. This could all be done on-line if we had a per-table flag indicating which page version is in use. You can use some variant of CLUSTER or VACUUM or ALTER TABLE to rewrite the table using the page version of your choice. This is also more granular, in that it allows checksums (or other features) to be turned on and off on a per-table basis. When you read the table, the same flag tells you which page version to expect, and you can error out if you haven't got that (or if the CRCs don't match). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 29.02.2012 21:18, Alvaro Herrera wrote: > Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012: >> I thought my view on how this should be done was already clear, but just >> in case it isn't, let me restate: Enlarge the page header to make room >> for the checksum. To handle upgrades, put code in the backend to change >> the page format from old version to new one on-the-fly, as pages are >> read in. Because we're making the header larger, we need to ensure that >> there's room on every page. To do that, write a utility that you run on >> the cluster before running pg_upgrade, which moves tuples to ensure >> that. To ensure that the space doesn't get used again before upgrading, >> change the old version so that it reserves those N bytes in all new >> insertions and updates (I believe that approach has been discussed >> before and everyone is comfortable with backpatching such a change). All >> of this in 9.3. > > Note that if we want such an utility to walk and transform pages, we > probably need a marker in the catalogs somewhere so that pg_upgrade can > make sure that it was done in all candidate tables -- which is something > that we should get in 9.2 so that it can be used in 9.3. In the simplest form, the utility could just create a magic file in the data directory to indicate that it has run. All we need is a boolean flag, unless you want to be fancy and make the utility restartable. Implemented that way, there's no need to have anything in the catalogs. > Such a marker would also allow us get rid of HEAP_MOVED_IN and > HEAP_MOVED_OUT. True. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Note that if we want such an utility to walk and transform pages, we > probably need a marker in the catalogs somewhere so that pg_upgrade can > make sure that it was done in all candidate tables -- which is something > that we should get in 9.2 so that it can be used in 9.3. Such a marker > would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT. Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice, but I don't see why we need to squeeze anything into 9.2 for any of this. pg_upgrade can certainly handle the addition of a new pg_class column, and can arrange for in-place upgrades from pre-9.3 versions to 9.3 to set the flag to the appropriate value. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 29.02.2012 21:30, Robert Haas wrote: > On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> Note that if we want such an utility to walk and transform pages, we >> probably need a marker in the catalogs somewhere so that pg_upgrade can >> make sure that it was done in all candidate tables -- which is something >> that we should get in 9.2 so that it can be used in 9.3. Such a marker >> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT. > > Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice, > but I don't see why we need to squeeze anything into 9.2 for any of > this. pg_upgrade can certainly handle the addition of a new pg_class > column, and can arrange for in-place upgrades from pre-9.3 versions to > 9.3 to set the flag to the appropriate value. The utility would run in the old cluster before upgrading, so the flag would have to be present in the old version. pg_upgrade would check that the flag is set, refusing to upgrade if it isn't, with an error like "please run pre-upgrade utility first". -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Excerpts from Heikki Linnakangas's message of mié feb 29 16:29:26 -0300 2012: > On 29.02.2012 21:18, Alvaro Herrera wrote: > > Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012: > >> I thought my view on how this should be done was already clear, but just > >> in case it isn't, let me restate: Enlarge the page header to make room > >> for the checksum. To handle upgrades, put code in the backend to change > >> the page format from old version to new one on-the-fly, as pages are > >> read in. Because we're making the header larger, we need to ensure that > >> there's room on every page. To do that, write a utility that you run on > >> the cluster before running pg_upgrade, which moves tuples to ensure > >> that. To ensure that the space doesn't get used again before upgrading, > >> change the old version so that it reserves those N bytes in all new > >> insertions and updates (I believe that approach has been discussed > >> before and everyone is comfortable with backpatching such a change). All > >> of this in 9.3. > > > > Note that if we want such an utility to walk and transform pages, we > > probably need a marker in the catalogs somewhere so that pg_upgrade can > > make sure that it was done in all candidate tables -- which is something > > that we should get in 9.2 so that it can be used in 9.3. > > In the simplest form, the utility could just create a magic file in the > data directory to indicate that it has run. All we need is a boolean > flag, unless you want to be fancy and make the utility restartable. > Implemented that way, there's no need to have anything in the catalogs. Well, I find it likely that people with huge and/or high-velocity databases would want to get fancy about it, so that they can schedule it as they see fit. What I wouldn't like is a utility that is optional, so that after the upgrade we end up with some pages that have the new page format and others that don't. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 29.02.2012 21:30, Robert Haas wrote: >> >> On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >>> >>> Note that if we want such an utility to walk and transform pages, we >>> probably need a marker in the catalogs somewhere so that pg_upgrade can >>> make sure that it was done in all candidate tables -- which is something >>> that we should get in 9.2 so that it can be used in 9.3. Such a marker >>> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT. >> >> >> Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice, >> but I don't see why we need to squeeze anything into 9.2 for any of >> this. pg_upgrade can certainly handle the addition of a new pg_class >> column, and can arrange for in-place upgrades from pre-9.3 versions to >> 9.3 to set the flag to the appropriate value. > > The utility would run in the old cluster before upgrading, so the the flag > would have to be present in the old version. pg_upgrade would check that the > flag is set, refusing to upgrade if it isn't, with an error like "please run > pre-upgrade utility first". I find that a pretty unappealing design; it seems to me it'd be much easier to make the new cluster cope with everything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> The utility would run in the old cluster before upgrading, so the the flag >> would have to be present in the old version. pg_upgrade would check that the >> flag is set, refusing to upgrade if it isn't, with an error like "please run >> pre-upgrade utility first". > I find that a pretty unappealing design; it seems to me it'd be much > easier to make the new cluster cope with everything. Easier for who? I don't care for the idea of code that has to cope with two page formats, or before long N page formats, because if we don't have some mechanism like this then we will never be able to decide that an old data format is safely dead. regards, tom lane
Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012: > > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas > > <heikki.linnakangas@enterprisedb.com> wrote: > >> The utility would run in the old cluster before upgrading, so the the flag > >> would have to be present in the old version. pg_upgrade would check that the > >> flag is set, refusing to upgrade if it isn't, with an error like "please run > >> pre-upgrade utility first". > > > I find that a pretty unappealing design; it seems to me it'd be much > > easier to make the new cluster cope with everything. > > Easier for who? I don't care for the idea of code that has to cope with > two page formats, or before long N page formats, because if we don't > have some mechanism like this then we will never be able to decide that > an old data format is safely dead. .. in fact this is precisely what killed Zdenek Kotala's idea of upgrading. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> The utility would run in the old cluster before upgrading, so the the flag >>> would have to be present in the old version. pg_upgrade would check that the >>> flag is set, refusing to upgrade if it isn't, with an error like "please run >>> pre-upgrade utility first". > >> I find that a pretty unappealing design; it seems to me it'd be much >> easier to make the new cluster cope with everything. > > Easier for who? I don't care for the idea of code that has to cope with > two page formats, or before long N page formats, because if we don't > have some mechanism like this then we will never be able to decide that > an old data format is safely dead. Huh? You can drop support for an old page format any time you like. You just decree that version X+1 no longer supports it, and you can't pg_upgrade to it until all traces of the old page format are gone. If you're going to require an offline rewrite of the whole old cluster before doing the upgrade, how much better is it than just breaking the page format and telling pg_upgrade users they're out of luck? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Easier for who? I don't care for the idea of code that has to cope with >> two page formats, or before long N page formats, because if we don't >> have some mechanism like this then we will never be able to decide that >> an old data format is safely dead. > Huh? You can drop support for a new page format any time you like. > You just decree that version X+1 no longer supports it, and you can't > pg_upgrade to it until all traces of the old page format are gone. And how would a DBA know that? > If you're going to require an offline rewrite of the whole old cluster > before doing the upgrade, how much better is it than just breaking the > page format and telling pg_upgrade users they're out of luck? The difference is that people aren't playing Russian roulette with their data when they upgrade. The point of the mechanisms being discussed here is to know, for certain, that a large database no longer contains pages of the old format. regards, tom lane
On 2/29/12 3:53 PM, Alvaro Herrera wrote: > > Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012: >> >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas >>> <heikki.linnakangas@enterprisedb.com> wrote: >>>> The utility would run in the old cluster before upgrading, so the the flag >>>> would have to be present in the old version. pg_upgrade would check that the >>>> flag is set, refusing to upgrade if it isn't, with an error like "please run >>>> pre-upgrade utility first". >> >>> I find that a pretty unappealing design; it seems to me it'd be much >>> easier to make the new cluster cope with everything. >> >> Easier for who? I don't care for the idea of code that has to cope with >> two page formats, or before long N page formats, because if we don't >> have some mechanism like this then we will never be able to decide that >> an old data format is safely dead. > > .. in fact this is precisely what killed Zdenek Kotala's idea of > upgrading. This is also why Simon has avoided the whole upgrade thing with his 16 bit checksum idea (otherwise presumably we'd be looking at bigger checksums anyway). I get that fussing around with the version field is ugly. If there was another way to do this without breaking pg_upgrade then it would be silly to mess with the version field. Unfortunately, there is no other way. Page checksumming is something a lot of people (myself included) want to see. Being able to get it in 9.2 would be a big win over crossing our fingers that at some point in the future (who knows when) we'll maybe figure out the page format upgrade issue. While we should definitely be looking into that issue it's definitely not going to get fixed in 9.2. ISTM that checksums are actually ready to go if people can just swallow the bitter pill of a screwed-up page version field until we can actually get an upgrade utility in place (and until we get that utility in place I don't see that the version field would do us any good anyway...) -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> writes: > On 2/29/12 3:53 PM, Alvaro Herrera wrote: >> .. in fact this is precisely what killed Zdenek Kotala's idea of >> upgrading. > This is also why Simon has avoided the whole upgrade thing with his 16 bit checksum idea (otherwise presumably we'd be looking at bigger checksums anyway). > I get that fussing around with the version field is ugly. If there was another way to do this without breaking pg_upgrade then it would be silly to mess with the version field. Unfortunately, there is no other way. Fundamentally, what is going on here is that several of us think that we need a page format upgrade infrastructure first, and then we can think about adding checksums. Simon would like to cram checksums in without building such infrastructure, regardless of the consequences in code ugliness or future maintainability. Personally, I think that is a bad tradeoff. Eventually we are going to have to build that infrastructure, and the more we've kluged the page header format in the meanwhile, the more unpleasant it's going to be. regards, tom lane
On Wed, Feb 29, 2012 at 5:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Easier for who? I don't care for the idea of code that has to cope with >>> two page formats, or before long N page formats, because if we don't >>> have some mechanism like this then we will never be able to decide that >>> an old data format is safely dead. > >> Huh? You can drop support for a new page format any time you like. >> You just decree that version X+1 no longer supports it, and you can't >> pg_upgrade to it until all traces of the old page format are gone. > > And how would a DBA know that? We'd add a column to pg_class that tracks which page version is in use for each relation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> And how would a DBA know that? > > We'd add a column to pg_class that tracks which page version is in use > for each relation. So a relation can't have some pages in Version 9.2, and other pages in version 9.3? How will this work for 2TB tables? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Mar 1, 2012 at 12:42 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> And how would a DBA know that? >> >> We'd add a column to pg_class that tracks which page version is in use >> for each relation. > > So a relation can't have some pages in Version 9.2, and other pages in > version 9.3? How will this work for 2TB tables? Not very well, but better than Tom's proposal to require upgrading the entire cluster in a single off-line operation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> So a relation can't have some pages in Version 9.2, and other pages in >> version 9.3? How will this work for 2TB tables? > > Not very well, but better than Tom's proposal to require upgrading the > entire cluster in a single off-line operation. Yes, but the result will be that anyone with a 2TB table will *never* convert it to the new format. Which means we can never deprecate that format, because lots of people will still be using it. I continue to assert that all of this sounds like 9.3 work to me. I'm really not keen on pushing through a hack which will make pushing in a long-term solution harder. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Robert Haas <robertmhaas@gmail.com> writes: >> So a relation can't have some pages in Version 9.2, and other pages in >> version 9.3? How will this work for 2TB tables? > Not very well, but better than Tom's proposal to require upgrading the > entire cluster in a single off-line operation. WTF? That was most certainly not what *I* was proposing; it's obviously unworkable. We need a process that can incrementally up-version a live database and keep track of the minimum version present, at some granularity smaller than "whole database". All of this was discussed and hashed out about two years ago, IIRC. We just haven't made any progress towards actually implementing those concepts. regards, tom lane
On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >>> So a relation can't have some pages in Version 9.2, and other pages in >>> version 9.3? How will this work for 2TB tables? > >> Not very well, but better than Tom's proposal to require upgrading the >> entire cluster in a single off-line operation. > > WTF? That was most certainly not what *I* was proposing; it's obviously > unworkable. We need a process that can incrementally up-version a live > database and keep track of the minimum version present, at some > granularity smaller than "whole database". > > All of this was discussed and hashed out about two years ago, IIRC. > We just haven't made any progress towards actually implementing those > concepts. I am now officially confused. I thought you and Heikki were arguing about 24 hours ago that we needed to shut down the old database, run a pre-upgrade utility to convert all the pages, run pg_upgrade, and then bring the new database on-line; and that, further, you were arguing that we should not support multiple page versions. Now you seem to be arguing the exact opposite - namely, that we should support multiple page versions, and that the conversion should be done on-line. So, to reiterate, I'm lost. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012: > On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >>> So a relation can't have some pages in Version 9.2, and other pages in > >>> version 9.3? How will this work for 2TB tables? > > > >> Not very well, but better than Tom's proposal to require upgrading the > >> entire cluster in a single off-line operation. > > > > WTF? That was most certainly not what *I* was proposing; it's obviously > > unworkable. We need a process that can incrementally up-version a live > > database and keep track of the minimum version present, at some > > granularity smaller than "whole database". > > > > All of this was discussed and hashed out about two years ago, IIRC. > > We just haven't made any progress towards actually implementing those > > concepts. > > I am now officially confused. I thought you and Heikki were arguing > about 24 hours ago that we needed to shut down the old database, run a > pre-upgrade utility to convert all the pages, run pg_upgrade, and then > bring the new database on-line; Actually, nobody mentioned shutting the database down. This utility does not necessarily have to be something that runs independently from a running postmaster; in fact what I envisioned was that we would have it be run in a backend -- I've always imagined it's just a function to which you give a table OID, and it converts pages of that table into the new format. Maybe a user-friendly utility would connect to each database and call this function for each table. > and that, further, you were arguing that we should not support > multiple page versions. I don't think we need to support multiple page versions, if multiple means > 2. Obviously we need the server to support reading *two* page versions; though we can probably get away with having support for *writing* only the newest page version. (The pg_class column would mean "the oldest page version to be found in this table", so the table can actually contain a mixture of old pages and new pages.) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
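As a reading aid for the "oldest page version to be found in this table" bookkeeping described above, the following standalone sketch scans one relation segment file and reports the oldest page layout version it sees, which is roughly the value the proposed pg_class column would cache. It assumes the stock 8 kB block size, a little-endian host, and that pd_pagesize_version sits at byte offset 18 with the layout version in its low byte; those are stated assumptions for illustration, not code from the thread.

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192
#define PD_PAGESIZE_VERSION_OFF 18   /* assumed offset of pd_pagesize_version */

int
main(int argc, char **argv)
{
    FILE          *fp;
    unsigned char  buf[BLCKSZ];
    unsigned       min_version = ~0u;
    long           blkno = 0;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <relation segment file>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "rb");
    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }
    while (fread(buf, 1, BLCKSZ, fp) == BLCKSZ)
    {
        uint16_t psv = (uint16_t) (buf[PD_PAGESIZE_VERSION_OFF] |
                                   (buf[PD_PAGESIZE_VERSION_OFF + 1] << 8));
        unsigned version = psv & 0x00FF;     /* low byte = layout version */

        if (version < min_version)
            min_version = version;
        blkno++;
    }
    fclose(fp);
    if (blkno == 0)
        printf("no complete blocks read\n");
    else
        printf("%ld blocks scanned, oldest page layout version = %u\n",
               blkno, min_version);
    return 0;
}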
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012: >> and that, further, you were arguing that we should not support >> multiple page versions. > I don't think we need to support multiple page versions, if multiple > means > 2. That's exactly the point here. We clearly cannot support on-line upgrade unless, somewhere along the line, we are willing to cope with two page formats more or less concurrently. What I don't want is for that requirement to balloon to supporting N formats forever. If we do not have a mechanism that allows certifying that you have no remaining pages of format N-1 before you upgrade to a server that supports (only) versions N and N+1, then we're going to be in the business of indefinite backwards compatibility instead. I'm not entirely sure, but I think we may all be in violent agreement about where this needs to end up. regards, tom lane
On Thu, Mar 1, 2012 at 8:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012: >>> and that, further, you were arguing that we should not support >>> multiple page versions. > >> I don't think we need to support multiple page versions, if multiple >> means > 2. > > That's exactly the point here. We clearly cannot support on-line > upgrade unless, somewhere along the line, we are willing to cope with > two page formats more or less concurrently. What I don't want is for > that requirement to balloon to supporting N formats forever. If we > do not have a mechanism that allows certifying that you have no > remaining pages of format N-1 before you upgrade to a server that > supports (only) versions N and N+1, then we're going to be in the > business of indefinite backwards compatibility instead. > > I'm not entirely sure, but I think we may all be in violent agreement > about where this needs to end up. I was using multiple to mean N>1, so yeah, it's sounding like we're more or less on the same page here. One thing I'm not too sure about is how to extend the page format to handle optional features. For example, suppose we want to add 2 bytes to the page header for a checksum (or 4 bytes, or any other number). Ideally, I'd like to not use up those 2 bytes on pages that aren't checksummed. What do we do about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > One thing I'm not too sure about is how to extend the page format to > handle optional features. For example, suppose we want to add 2 bytes > to the page header for a checksum (or 4 bytes, or any other number). > Ideally, I'd like to not use up those 2 bytes on pages that aren't > checksummed. What do we do about that? I'm having a hard time getting excited about adding that requirement on top of all the others that we have for this feature. In particular, if we insist on that, there is exactly zero chance of turning checksums on on-the-fly, because you won't be able to do it if the page is full. The scheme that seems to me to make the most sense for checksums, if we grant Simon's postulate that a 2-byte checksum is enough, is (1) repurpose the top N bits of pd_flags as a page version number, for some N; (2) repurpose pd_pagesize_version as the checksum, with the understanding that you can't have a checksum in version-0 pages. (Simon's current patch seems to be an overengineered version of that. Possibly we should also ask ourselves how much we really need pd_tli and whether that couldn't be commandeered as the checksum field.) I see the page versioning stuff as mainly being of interest for changes that are more invasive than this, for instance tuple-header or data changes. regards, tom lane
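Tom's two-point scheme could look roughly like the sketch below, written against a simplified stand-in for the page header rather than the real PageHeaderData. The struct, the accessor names, and the arbitrary choice of three version bits are all illustrative assumptions; only the field names mirror the real header.

#include <stdint.h>
#include <stdio.h>

typedef struct SketchPageHeader
{
    uint64_t pd_lsn;               /* stand-in for the real LSN type */
    uint16_t pd_tli;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_pagesize_version;  /* becomes the checksum once version > 0 */
    uint32_t pd_prune_xid;
} SketchPageHeader;

#define PD_VERSION_SHIFT 13                      /* top 3 bits of pd_flags */
#define PD_VERSION_MASK  (0x7u << PD_VERSION_SHIFT)

static unsigned
page_get_version(const SketchPageHeader *p)
{
    return (p->pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;
}

static void
page_set_version(SketchPageHeader *p, unsigned v)
{
    p->pd_flags = (uint16_t) ((p->pd_flags & ~PD_VERSION_MASK) |
                              ((v << PD_VERSION_SHIFT) & PD_VERSION_MASK));
}

static int
page_has_checksum(const SketchPageHeader *p)
{
    /* version-0 pages still keep the page size + layout version here */
    return page_get_version(p) > 0;
}

static uint16_t
page_get_checksum(const SketchPageHeader *p)
{
    return page_has_checksum(p) ? p->pd_pagesize_version : 0;
}

int
main(void)
{
    SketchPageHeader ph = {0};

    ph.pd_pagesize_version = 8192 | 4;   /* classic: page size plus layout version */
    printf("version bits before: %u\n", page_get_version(&ph));

    page_set_version(&ph, 1);            /* flag the page as "new format" */
    ph.pd_pagesize_version = 0xBEEF;     /* field now carries the checksum */
    printf("version bits after: %u, checksum 0x%04X\n",
           page_get_version(&ph), (unsigned) page_get_checksum(&ph));
    return 0;
}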
On Thu, Mar 1, 2012 at 9:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> One thing I'm not too sure about is how to extend the page format to >> handle optional features. For example, suppose we want to add 2 bytes >> to the page header for a checksum (or 4 bytes, or any other number). >> Ideally, I'd like to not use up those 2 bytes on pages that aren't >> checksummed. What do we do about that? > > I'm having a hard time getting excited about adding that requirement on > top of all the others that we have for this feature. In particular, if > we insist on that, there is exactly zero chance of turning checksums on > on-the-fly, because you won't be able to do it if the page is full. > > The scheme that seems to me to make the most sense for checksums, > if we grant Simon's postulate that a 2-byte checksum is enough, is > (1) repurpose the top N bits of pd_flags as a page version number, > for some N; > (2) repurpose pd_pagesize_version as the checksum, with the > understanding that you can't have a checksum in version-0 pages. > > (Simon's current patch seems to be an overengineered version of that. > Possibly we should also ask ourselves how much we really need pd_tli > and whether that couldn't be commandeered as the checksum field.) > > I see the page versioning stuff as mainly being of interest for changes > that are more invasive than this, for instance tuple-header or data > changes. Well, OK, so... it wouldn't bother me a bit to steal pd_tli for this, although Simon previously objected to stealing even half of it, when I suggested that upthread. But I don't see the point of your first proposal: keeping the page version right where it is is a good idea, and we should do it. We could steal some *high* order bits from that field without breaking anything, but moving it around seems like it will complicate life to no good purpose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 23 February 2012 01:39, Robert Haas <robertmhaas@gmail.com> wrote: > Thanks for testing this. The graph obscures a bit how much percentage > change we're talking about here - could you post the raw tps numbers? Sorry for not following up on this until now. The report is available from: http://pagechecksum.staticloud.com/ Note that detailed latency figures for each test are not available in this report, so only the main page is available, but that gets you the raw tps numbers. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Mar 2, 2012 at 2:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> One thing I'm not too sure about is how to extend the page format to >> handle optional features. For example, suppose we want to add 2 bytes >> to the page header for a checksum (or 4 bytes, or any other number). >> Ideally, I'd like to not use up those 2 bytes on pages that aren't >> checksummed. What do we do about that? > > I'm having a hard time getting excited about adding that requirement on > top of all the others that we have for this feature. In particular, if > we insist on that, there is exactly zero chance of turning checksums on > on-the-fly, because you won't be able to do it if the page is full. > > The scheme that seems to me to make the most sense for checksums, > if we grant Simon's postulate that a 2-byte checksum is enough, is > (1) repurpose the top N bits of pd_flags as a page version number, > for some N; > (2) repurpose pd_pagesize_version as the checksum, with the > understanding that you can't have a checksum in version-0 pages. I'd say N = 1, for now. N > 1 in future, as needed. We can document the intention to reserve the high-order bits for that purpose. We can defer the decision about what the bits are used for until they are needed. > (Simon's current patch seems to be an overengineered version of that. Happy to make changes. Having 3 bits instead of 1 is just for robustness, but I would accept the opinion that this is not worth having. One suggestion from Heikki is that we don't change the bits at all; we just have a database-level setting that says whether checksums are enabled. We use VACUUM FREEZE to enable checksumming, rewrite the blocks and then enable at db level. Once fully enabled we check the checksums on read. If we do that, we don't need to record that the page format has changed and we could dispense with page-level flags entirely, but of course that then means you can't inspect a block and know whether it's actually got a checksum on it or maybe it's just a version field. If we want to know the difference between page formats we need to use flag bits to indicate that. Which in some ways could be called fragile. Personally, I don't mind what we do there. > Possibly we should also ask ourselves how much we really need pd_tli > and whether that couldn't be commandeered as the checksum field.) pd_tli has slightly more utility than pd_pagesize_version, but we could easily live without either. The question is whether a 32-bit value, split into 2 pieces, is a faster and better way than the current proposal. Splitting the value in two doesn't sound like it would lead to good performance or sensible code, but someone may have a reason why 32-bit checksums are important. IMHO we have a viable mechanism for checksums; it's all down to what user interface we would prefer and related details. I'm happy to implement whatever we decide. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
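The database-level-setting variant relayed above boils down to a set-on-write / verify-on-read flow. The sketch below shows only that flow with a toy 16-bit checksum; the algorithm, the reuse of the pd_pagesize_version bytes, and the page_checksums variable standing in for the GUC are placeholders, not the patch's actual code.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192
#define CHECKSUM_OFF 18          /* assumed: reuse of the pd_pagesize_version bytes */

static int page_checksums = 1;   /* stand-in for the db-level setting */

/* toy 16-bit checksum; skips the checksum field itself */
static uint16_t
page_checksum(const unsigned char *page)
{
    uint32_t s1 = 0, s2 = 0;
    size_t   i;

    for (i = 0; i < BLCKSZ; i++)
    {
        unsigned char b = (i == CHECKSUM_OFF || i == CHECKSUM_OFF + 1) ? 0 : page[i];

        s1 = (s1 + b) % 255;
        s2 = (s2 + s1) % 255;
    }
    return (uint16_t) ((s2 << 8) | s1);
}

/* would happen when the buffer is flushed to disk */
static void
page_set_checksum(unsigned char *page)
{
    uint16_t c;

    if (!page_checksums)
        return;                  /* setting off: leave the field alone */
    c = page_checksum(page);
    memcpy(page + CHECKSUM_OFF, &c, sizeof(c));
}

/* would happen when the page is read in from disk */
static int
page_verify_checksum(const unsigned char *page)
{
    uint16_t stored;

    if (!page_checksums)
        return 1;
    memcpy(&stored, page + CHECKSUM_OFF, sizeof(stored));
    return stored == page_checksum(page);
}

int
main(void)
{
    unsigned char page[BLCKSZ] = {0};

    page[100] = 42;                      /* pretend there is data on the page */
    page_set_checksum(page);
    printf("clean read ok: %d\n", page_verify_checksum(page));

    page[200] ^= 1;                      /* simulate a single bit flip on disk */
    printf("after bit flip ok: %d\n", page_verify_checksum(page));
    return 0;
}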
On Sun, Feb 19, 2012 at 1:49 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> v8 attached > > v10 attached. > > This patch covers all the valid concerns discussed and has been > extensively tested. If I turn wal_level to anything other than minimal, this fails to pass make installcheck with multiple instances of ERROR: invalid page header in block 0 of relation.... I've looked around in the patch docs and this thread, and this doesn't seem to be either expected or already known. This occurs even when page_checksums is off. Cheers, Jeff
On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote: > On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > v8 attached > > v10 attached. > > This patch covers all the valid concerns discussed and has been > extensively tested. > > I expect to commit this in about 24 hours, so last call for comments > please or say if you need more time to review. I have made an attempt to merge this with master. This patch is not meant to be the next version, and I have not made a real effort to be sure this was a safe merge. I am just hoping this saves Simon some of the tedious portions of the merge, and offers a reference point to see where his merge differs from mine. In the meantime, I'm looking at the original v10 patch against master as it existed when Simon posted the patch. I'd like to see checksums take part in the June commitfest. Regards, Jeff Davis
Attachment
On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote: > On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > v8 attached > > v10 attached. > > This patch covers all the valid concerns discussed and has been > extensively tested. Is there something I can do to help get this ready for the next commitfest? I am willing to rebase it, but I was worried that might cause confusion. I am also willing to review it after the rebase, of course. There are still a couple open issues, including: * Store the checksum in the page version field or the TLI field? * What mechanism to guarantee to the user that all pages are properly protected by checksums (rather than just new pages)? In other words, there are more than the two states represented by the GUC. * What specific data is included in the checksum? I suggested that we include the block number, and maybe the OID. Regards, Jeff Davis
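On the last open question, folding the block number (and perhaps the relation OID) into the checksum input would make a page written to the wrong place fail verification even though its bytes are internally consistent. A toy illustration follows, with an invented function name and the caveat that the real patch may compute its checksum quite differently.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* mix the page's identity into the running sums before the page bytes,
 * so a correct page stored at the wrong block number still fails */
static uint16_t
page_checksum_tagged(const unsigned char *page, size_t len,
                     uint32_t blkno, uint32_t reloid)
{
    uint32_t s1, s2;
    size_t   i;

    s1 = (blkno ^ reloid) % 255;         /* seed with the page's identity */
    s2 = s1;

    for (i = 0; i < len; i++)
    {
        s1 = (s1 + page[i]) % 255;
        s2 = (s2 + s1) % 255;
    }
    return (uint16_t) ((s2 << 8) | s1);
}

int
main(void)
{
    unsigned char page[8192] = {0};

    page[64] = 7;                        /* some page content */
    /* identical bytes, different block numbers, different checksums */
    printf("blk 10: 0x%04X\n",
           (unsigned) page_checksum_tagged(page, sizeof(page), 10, 16384));
    printf("blk 11: 0x%04X\n",
           (unsigned) page_checksum_tagged(page, sizeof(page), 11, 16384));
    return 0;
}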