Thread: Page Checksums + Double Writes
Folks, One of the things VMware is working on is double writes, per previous discussions of how, for example, InnoDB does things. I'd initially thought that introducing just one of the features in $Subject at a time would help, but I'm starting to see a mutual dependency. The issue is that double writes needs a checksum to work by itself, and page checksums more broadly work better when there are double writes, obviating the need to have full_page_writes on. If submitting these things together seems like a better idea than having them arrive separately, I'll work with my team here to make that happen soonest.

There's a separate issue we'd like to get clear on, which is whether it would be OK to make a new PG_PAGE_LAYOUT_VERSION. If so, there's less to do, but pg_upgrade as it currently stands is broken. If not, we'll have to do some extra work on the patch as described below. Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page. We're using 3/16 of the available bits as described in src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the page. One way to do this would be to have an optional field at the end of the special space based on the new bit. Rows from pg_upgrade would have the bit clear, and would have the shorter special structure without the checksum.

Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
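For concreteness, here is a minimal standalone sketch of the flag-bit scheme described above. PD_HAS_CHECKSUM, the trailer struct, and the helper function are hypothetical names for illustration, not definitions from bufpage.h:

#include <stddef.h>
#include <stdint.h>

#define PD_HAS_CHECKSUM 0x0008          /* assumed pd_flags bit */

typedef struct PageChecksumTrailer      /* assumed tail of the special space */
{
    uint32_t checksum;
} PageChecksumTrailer;

/*
 * Return where the page's checksum lives, or NULL for pages carried over
 * by pg_upgrade (bit clear, shorter special structure, no checksum).
 */
static uint32_t *
page_checksum_location(char *page, uint16_t pd_flags, size_t page_size)
{
    if ((pd_flags & PD_HAS_CHECKSUM) == 0)
        return NULL;
    return &((PageChecksumTrailer *)
             (page + page_size - sizeof(PageChecksumTrailer)))->checksum;
}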
Excerpts from David Fetter's message of Wed Dec 21 18:59:13 -0300 2011: > If not, we'll have to do some extra work on the patch as described > below. Thanks to Kevin Grittner for coming up with this :) > > - Use a header bit to say whether we've got a checksum on the page. > We're using 3/16 of the available bits as described in > src/include/storage/bufpage.h. > > - When that bit is set, place the checksum somewhere convenient on the > page. One way to do this would be to have an optional field at the > end of the special space based on the new bit. Rows from pg_upgrade > would have the bit clear, and would have the shorter special > structure without the checksum. If you get away with a new page format, let's make sure and coordinate so that we can add more info into the header. One thing I wanted was to have an ID struct on each file, so that you know what DB/relation/segment the file corresponds to. So the first page's special space would be a bit larger than the others. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> wrote: > If you get away with a new page format, let's make sure and > coordinate so that we can add more info into the header. One > thing I wanted was to have an ID struct on each file, so that you > know what DB/relation/segment the file corresponds to. So the > first page's special space would be a bit larger than the others. Couldn't that also be done by burning a bit in the page header flags, without a page layout version bump? If that were done, you wouldn't have the additional information on tables converted by pg_upgrade, but you would get it on new tables, including those created by pg_dump/psql conversions. Adding it could even be made conditional, although I don't know whether that's a good idea.... -Kevin
On Wed, Dec 21, 2011 at 10:19 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Alvaro Herrera <alvherre@commandprompt.com> wrote: > >> If you get away with a new page format, let's make sure and >> coordinate so that we can add more info into the header. One >> thing I wanted was to have an ID struct on each file, so that you >> know what DB/relation/segment the file corresponds to. So the >> first page's special space would be a bit larger than the others. > > Couldn't that also be done by burning a bit in the page header > flags, without a page layout version bump? If that were done, you > wouldn't have the additional information on tables converted by > pg_upgrade, but you would get it on new tables, including those > created by pg_dump/psql conversions. Adding it could even be made > conditional, although I don't know whether that's a good idea.... These are good thoughts because they overcome the major objection to doing *anything* here for 9.2. We don't need to use any flag bits at all. We add PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking becomes an initdb option. All new pages can be created with PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must be either the layout version from this release (4) or the next version (5). Page validity then becomes version dependent. pg_upgrade still works. Layout 5 is where we add CRCs, so it's basically optional. We can also have a utility that allows you to bump the page version for all new pages, even after you've upgraded, so we may end up with a mix of page layout versions in the same relation. That's more questionable but I see no problem with it. Do we need CRCs as a table level option? I hope not. That complicates many things. All of this allows us to have another more efficient page version (6) in future without problems, so it's good infrastructure. I'm now personally game on to make something work here for 9.2. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
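A hedged sketch of what "page validity then becomes version dependent" could mean in code; the constant names are assumptions, and the cluster-wide value would be read from pg_control at startup:

#include <stdbool.h>

#define PG_PAGE_LAYOUT_V4 4     /* current layout, no checksum */
#define PG_PAGE_LAYOUT_V5 5     /* proposed layout with checksum */

static int cluster_layout_version = PG_PAGE_LAYOUT_V5;  /* from pg_control */

static bool
page_layout_acceptable(unsigned page_version)
{
    /* pg_upgrade'd pages stay at v4; new pages use the initdb-time choice */
    return page_version == PG_PAGE_LAYOUT_V4 ||
           page_version == (unsigned) cluster_layout_version;
}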
David Fetter <david@fetter.org> writes: > There's a separate issue we'd like to get clear on, which is whether > it would be OK to make a new PG_PAGE_LAYOUT_VERSION. If you're not going to provide pg_upgrade support, I think there is no chance of getting a new page layout accepted. The people who might want CRC support are pretty much exactly the same people who would find lack of pg_upgrade a showstopper. Now, given the hint bit issues, I rather doubt that you can make this work without a page format change anyway. So maybe you ought to just bite the bullet and start working on the pg_upgrade problem, rather than imagining you will find an end-run around it. > The issue is that double writes needs a checksum to work by itself, > and page checksums more broadly work better when there are double > writes, obviating the need to have full_page_writes on. Um. So how is that going to work if checksums are optional? regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > We don't need to use any flag bits at all. We add > PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking > becomes an initdb option. All new pages can be created with > PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must > be either the layout version from this release (4) or the next version > (5). Page validity then becomes version dependent. > We can also have a utility that allows you to bump the page version > for all new pages, even after you've upgraded, so we may end with a > mix of page layout versions in the same relation. That's more > questionable but I see no problem with it. It seems like you've forgotten all of the previous discussion of how we'd manage a page format version change. Having two different page formats running around in the system at the same time is far from free; in the worst case it means that every single piece of code that touches pages has to know about and be prepared to cope with both versions. That's a rather daunting prospect, from a coding perspective and even more from a testing perspective. Maybe the issues can be kept localized, but I've seen no analysis done of what the impact would be or how we could minimize it. I do know that we considered the idea and mostly rejected it a year or two back. A "utility to bump the page version" is equally a whole lot easier said than done, given that the new version has more overhead space and thus less payload space than the old. What does it do when the old page is too full to be converted? "Move some data somewhere else" might be workable for heap pages, but I'm less sanguine about rearranging indexes like that. At the very least it would imply that the utility has full knowledge about every index type in the system. > I'm now personally game on to make something work here for 9.2. If we're going to freeze 9.2 in the spring, I think it's a bit late for this sort of work to be just starting. What you've just described sounds to me like possibly a year's worth of work. regards, tom lane
On Wed, Dec 21, 2011 at 11:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It seems like you've forgotten all of the previous discussion of how > we'd manage a page format version change. Maybe I've had too much caffeine. It's certainly late here. > Having two different page formats running around in the system at the > same time is far from free; in the worst case it means that every single > piece of code that touches pages has to know about and be prepared to > cope with both versions. That's a rather daunting prospect, from a > coding perspective and even more from a testing perspective. Maybe > the issues can be kept localized, but I've seen no analysis done of > what the impact would be or how we could minimize it. I do know that > we considered the idea and mostly rejected it a year or two back. I'm looking at that now. My feeling is it probably depends upon how different the formats are, so given we are discussing a 4 byte addition to the header, it might be doable. I'm investing some time on the required analysis. > A "utility to bump the page version" is equally a whole lot easier said > than done, given that the new version has more overhead space and thus > less payload space than the old. What does it do when the old page is > too full to be converted? "Move some data somewhere else" might be > workable for heap pages, but I'm less sanguine about rearranging indexes > like that. At the very least it would imply that the utility has full > knowledge about every index type in the system. I agree, rewriting every page is completely out and I never even considered it. >> I'm now personally game on to make something work here for 9.2. > > If we're going to freeze 9.2 in the spring, I think it's a bit late > for this sort of work to be just starting. I agree with that. If this goes adrift it will have to be killed for 9.2. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote: > One of the things VMware is working on is double writes, per previous > discussions of how, for example, InnoDB does things. The world is moving to flash, and the lifetime of flash is measured in writes. Potentially doubling the number of writes is potentially halving the life of the flash. Something to think about... -- Rob Wultsch wultsch@gmail.com
On Wed, Dec 21, 2011 at 7:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > My feeling is it probably depends upon how different the formats are, > so given we are discussing a 4 byte addition to the header, it might > be doable. I agree. When thinking back on Zoltan's patches, it's worth remembering that he had a number of pretty bad ideas mixed in with the good stuff - such as taking a bunch of things that are written as macros for speed, and converting them to function calls. Also, he didn't make any attempt to isolate the places that needed to know about both page versions; everybody knew about everything, everywhere, and so everything needed to branch in places where it had not needed to do so before. I don't think we should infer from the failure of those patches that no one can do any better. On the other hand, I also agree with Tom that the chances of getting this done in time for 9.2 are virtually zero, assuming that (1) we wish to ship 9.2 in 2012 and (2) we don't wish to be making destabilizing changes beyond the end of the last CommitFest. There is a lot of work here, and I would be astonished if we could wrap it all up in the next month. Or even the next four months. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 21, 2011 at 04:18:33PM -0800, Rob Wultsch wrote: > On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote: > > One of the things VMware is working on is double writes, per > > previous discussions of how, for example, InnoDB does things. > > The world is moving to flash, and the lifetime of flash is measured > writes. Potentially doubling the number of writes is potentially > halving the life of the flash. > > Something to think about... Modern flash drives let you have more write cycles than modern spinning rust, so while yes, there is something happening, it's also happening to spinning rust, too. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Having two different page formats running around in the system at the >> same time is far from free; in the worst case it means that every single >> piece of code that touches pages has to know about and be prepared to >> cope with both versions. That's a rather daunting prospect, from a >> coding perspective and even more from a testing perspective. Maybe >> the issues can be kept localized, but I've seen no analysis done of >> what the impact would be or how we could minimize it. I do know that >> we considered the idea and mostly rejected it a year or two back. > > I'm looking at that now. > > My feeling is it probably depends upon how different the formats are, > so given we are discussing a 4 byte addition to the header, it might > be doable. > > I'm investing some time on the required analysis.

We've assumed until now that adding a CRC to the page header would add 4 bytes, on the assumption that we would use a CRC-32 check field. This would change the size of the header and thus break pg_upgrade in a straightforward implementation. Breaking pg_upgrade is not acceptable. We could get around this by making code dependent upon page version, allowing mixed page versions in one executable. That causes the PageGetItemId() macro to be page version dependent. After review, altering the speed of PageGetItemId() is not acceptable either (show me microbenchmarks if you doubt that). In a large minority of cases the line pointer and the page header will be in separate cache lines.

As Kevin points out, we have 13 bits spare on the pd_flags of PageHeader, so we have a little wiggle room there. In addition to that, I notice that the version occupies only 8 bits of pd_pagesize_version (the page size is packed into the other 8 bits), yet we currently use just one value of it, since the version is 4. Version 3 was last seen in Postgres 8.2, now de-supported. Since we don't care too much about backwards compatibility with data in Postgres 8.2 and below, we can just assume that all pages are version 4, unless marked otherwise with additional flags.

We then use two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000), and we completely replace the 16-bit version field with a 16-bit CRC value, rather than a 32-bit value. Why two flag bits? If either CRC bit is set, we assume the page's CRC is supposed to be valid. This ensures that a single bit error doesn't switch off CRC checking when it was supposed to be active. I suggest we remove the page size data completely; if we need to keep that, we should mark 8192 bytes as the default and set bits for 16kB and 32kB respectively.

With those changes, we are able to re-organise the page header so that we can add a 16-bit checksum (CRC), yet retain the same size of header. Thus, we don't need to change PageGetItemId(). We would require changes to PageHeaderIsValid() and PageInit() only. Making these changes means we are reducing the number of bits used to validate the page header, but we are providing a much better way of detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit problem, so making the value 16 bits long gives performance where it is needed. Note that we now have a separation of bgwriter and checkpointer, so we have more CPU bandwidth to address the problem. Adding multiple bgwriters is also possible.
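To make the "same size of header" point concrete, here is a hedged sketch of the implied layout. Only pd_checksum is new (replacing pd_pagesize_version); the other fields mirror PageHeaderData in bufpage.h, and the bit names are taken from the proposal above, not from any committed source:

#include <stdint.h>

typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;                       /* as in 9.1-era xlogdefs.h */

#define PD_HAS_CRC_A 0x0008         /* the two proposed CRC-valid bits */
#define PD_HAS_CRC_B 0x8000

typedef struct PageHeaderSketch
{
    XLogRecPtr pd_lsn;              /* unchanged */
    uint16_t   pd_tli;              /* unchanged */
    uint16_t   pd_flags;            /* gains the two CRC bits */
    uint16_t   pd_lower;            /* unchanged */
    uint16_t   pd_upper;            /* unchanged */
    uint16_t   pd_special;          /* unchanged */
    uint16_t   pd_checksum;         /* replaces pd_pagesize_version */
    uint32_t   pd_prune_xid;        /* unchanged */
} PageHeaderSketch;

/* CRC considered "on" if either bit survives, guarding single-bit errors */
#define PageHasCRC(flags) (((flags) & (PD_HAS_CRC_A | PD_HAS_CRC_B)) != 0)

Because the struct stays 24 bytes, the line pointer array starts at the same offset as before and PageGetItemId() is untouched.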
Notably, this proposal makes CRC checking optional, so if performance is a concern it can be disabled completely.

Which CRC algorithm to choose? "A study of error detection capabilities for random independent bit errors and burst errors reveals that XOR, two's complement addition, and Adler checksums are suboptimal for typical network use. Instead, one's complement addition should be used for networks willing to sacrifice error detection effectiveness to reduce compute cost, Fletcher checksum for networks looking for a balance of error detection and compute cost, and CRCs for networks willing to pay a higher compute cost for significantly improved error detection." The Effectiveness of Checksums for Embedded Control Networks, Maxino, T.C., Koopman, P.J., IEEE Transactions on Dependable and Secure Computing, Jan.-March 2009. Available here: http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall concept is not sensitive to the choice of checksum algorithm, however, and the algorithm itself could be another option: F16 or CRC. My poor understanding of the difference is that F16 is about 20 times cheaper to calculate, at the expense of about 1000 times worse error detection (but still pretty good). 16-bit CRCs are not the strongest available, but still support excellent error detection rates - better than 1 failure in a million, possibly much better depending on which algorithm and block size. That's easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4-byte CRC-32 added in the future, since we still have 11 bits spare for use as future page version indicators. (If we did that, it is clear that we should add the checksum as a *trailer*, not as part of the header.)

So overall, I do now think it's still possible to add an optional checksum in the 9.2 release and am willing to pursue it unless there are technical objections. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
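For reference, Fletcher-16 is just two running sums modulo 255, which is where the speed advantage comes from. A minimal sketch follows; a real patch would presumably use an optimized block variant, and would skip or zero the page's own checksum field while summing:

#include <stddef.h>
#include <stdint.h>

static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;              /* simple sum of bytes, mod 255 */
    uint32_t sum2 = 0;              /* sum of the running sum1, mod 255 */

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}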
On 22.12.2011 01:43, Tom Lane wrote: > A "utility to bump the page version" is equally a whole lot easier said > than done, given that the new version has more overhead space and thus > less payload space than the old. What does it do when the old page is > too full to be converted? "Move some data somewhere else" might be > workable for heap pages, but I'm less sanguine about rearranging indexes > like that. At the very least it would imply that the utility has full > knowledge about every index type in the system. Remembering back the old discussions, my favorite scheme was to have an online pre-upgrade utility that runs on the old cluster, moving things around so that there is enough spare room on every page. It would do normal heap updates to make room on heap pages (possibly causing transient serialization failures, like all updates do), and split index pages to make room on them. Yes, it would need to know about all index types. And it would set a global variable to indicate that X bytes must be kept free on all future updates, too. Once the pre-upgrade utility has scanned through the whole cluster, you can run pg_upgrade. After the upgrade, old page versions are converted to new format as pages are read in. The conversion is straightforward, as the pre-upgrade utility ensured that there is enough spare room on every page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
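A hedged sketch of what that read-time conversion could look like, assuming the v5 header grows by 4 bytes just before the line pointer array; all names and sizes here are hypothetical. Tuple bodies are addressed by offsets from the page start, so only the line pointer array has to move:

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OLD_HEADER_SIZE 24   /* v4 fixed header size, per bufpage.h */
#define NEW_HEADER_SIZE 28   /* assumed v5 header with a 4-byte checksum */

typedef struct
{
    char     pd_fixed[12];   /* pd_lsn, pd_tli, pd_flags: not needed here */
    uint16_t pd_lower;       /* end of the line pointer array */
    uint16_t pd_upper;       /* start of the tuple area */
} HeaderSketch;

static void
convert_v4_page(char *page)
{
    HeaderSketch *hdr = (HeaderSketch *) page;
    uint16_t grow = NEW_HEADER_SIZE - OLD_HEADER_SIZE;

    /* the pre-upgrade utility guaranteed this much slack on every page */
    assert(hdr->pd_upper - hdr->pd_lower >= grow);

    /* slide the line pointer array toward the free space; tuple bodies,
     * addressed from the page start, do not move */
    memmove(page + OLD_HEADER_SIZE + grow,
            page + OLD_HEADER_SIZE,
            hdr->pd_lower - OLD_HEADER_SIZE);
    hdr->pd_lower += grow;
    /* ...then stamp the new layout version and fill in the checksum */
}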
* David Fetter: > The issue is that double writes needs a checksum to work by itself, > and page checksums more broadly work better when there are double > writes, obviating the need to have full_page_writes on. How desirable is it to disable full_page_writes? Doesn't it cut down recovery time significantly because it avoids read-modify-write cycles with a cold cache? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
On Thu, Dec 22, 2011 at 7:44 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 22.12.2011 01:43, Tom Lane wrote: >> >> A "utility to bump the page version" is equally a whole lot easier said >> than done, given that the new version has more overhead space and thus >> less payload space than the old. What does it do when the old page is >> too full to be converted? "Move some data somewhere else" might be >> workable for heap pages, but I'm less sanguine about rearranging indexes >> like that. At the very least it would imply that the utility has full >> knowledge about every index type in the system. > > > Remembering back the old discussions, my favorite scheme was to have an > online pre-upgrade utility that runs on the old cluster, moving things > around so that there is enough spare room on every page. It would do normal > heap updates to make room on heap pages (possibly causing transient > serialization failures, like all updates do), and split index pages to make > room on them. Yes, it would need to know about all index types. And it would > set a global variable to indicate that X bytes must be kept free on all > future updates, too. > > Once the pre-upgrade utility has scanned through the whole cluster, you can > run pg_upgrade. After the upgrade, old page versions are converted to new > format as pages are read in. The conversion is straightforward, as the > pre-upgrade utility ensured that there is enough spare room on every page. That certainly works, but we're still faced with pg_upgrade rewriting every page, which will take a significant amount of time, with no backout plan or rollback facility. I don't like that at all, which is why I think we need an online upgrade facility if we do have to alter page headers. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 22, 2011 at 8:42 AM, Florian Weimer <fweimer@bfk.de> wrote: > * David Fetter: > >> The issue is that double writes needs a checksum to work by itself, >> and page checksums more broadly work better when there are double >> writes, obviating the need to have full_page_writes on. > > How desirable is it to disable full_page_writes? Doesn't it cut down > recovery time significantly because it avoids read-modify-write cycles > with a cold cache? It's way too late in the cycle to suggest removing full page writes, or to start coding that. We're looking to add protections, not swap out existing ones. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2011-12-22 09:42, Florian Weimer wrote: > * David Fetter: > >> The issue is that double writes needs a checksum to work by itself, >> and page checksums more broadly work better when there are double >> writes, obviating the need to have full_page_writes on. > How desirable is it to disable full_page_writes? Doesn't it cut down > recovery time significantly because it avoids read-modify-write cycles > with a cold cache? What are the downsides of having full_page_writes enabled, except for log volume? The manual mentions something about speed, but it is a bit unclear where that would come from, since the full pages must be somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:

archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'

It brings somewhere between a 50 and 75% reduction in log volume, with "no cost" on the production system (since gzip just occupies one of the many cores on the system), and it can easily keep up even during quite heavy writes. Recovery is a bit more tricky, because hooking gunzip into the command there will cause the system to cycle through replay log, gunzip, read data, replay log, whereas the gunzip could easily be done on the other logfiles while replay is being done on one. So a "straightforward" recovery will cost in recovery time, but that can be dealt with. Jesper -- Jesper
Simon Riggs wrote: > So overall, I do now think its still possible to add an optional > checksum in the 9.2 release and am willing to pursue it unless > there are technical objections. Just to restate Simon's proposal, to make sure I'm understanding it, we would support a new page header format number and the old one in 9.2, both to be the same size and carefully engineered to minimize what code would need to be aware of the version. PageHeaderIsValid() and PageInit() certainly would, and we would need some way to set, clear (maybe), and validate a CRC. We would need a GUC to indicate whether to write the CRC, and if present we would always test it on read and treat it as a damaged page if it didn't match. (Perhaps other options could be added later, to support recovery attempts, but let's not complicate a first cut.) This whole idea would depend on either (1) trusting your storage system never to tear a page on write or (2) getting the double-write feature added, too. I see some big advantages to this over what I suggested to David. For starters, using a flag bit and putting the CRC somewhere other than the page header would require that each AM deal with the CRC, exposing some function(s) for that. Simon's idea doesn't require that. I was also a bit concerned about shifting tuple images to convert non-protected pages to protected pages. No need to do that, either. With the bit flags, I think there might be some cases where we would be unable to add a CRC to a converted page because space was too tight; that's not an issue with Simon's proposal. Heikki was talking about a pre-convert tool. Neither approach really needs that, although with Simon's approach it would be possible to have a background *post*-conversion tool to add CRCs, if desired. Things would continue to function if it wasn't run; you just wouldn't have CRC protection on pages not updated since pg_upgrade was run. Simon, does it sound like I understand your proposal? Now, on to the separate-but-related topic of double-write. That absolutely requires some form of checksum or CRC to detect torn pages, in order for the technique to work at all. Adding a CRC without double-write would work fine if you have a storage stack which prevents torn pages in the file system or hardware driver. If you don't have that, it could create a damaged page indication after a hardware or OS crash, although I suspect that would be the exception, not the typical case. Given all that, and the fact that it would be cleaner to deal with these as two separate patches, it seems the CRC patch should go in first. (And, if this is headed for 9.2, *very soon*, so there is time for the double-write patch to follow.) It seems to me that the full_page_writes GUC could become an enumeration, with "off" having the current meaning, "wal" meaning what "on" now does, and "double" meaning that the new double-write technique would be used. (It doesn't seem to make any sense to do both at the same time.) I don't think we need a separate GUC to tell us *what* to protect against torn pages -- if not "off" we should always protect the first write of a page after checkpoint, and if "double" and write_page_crc (or whatever we call it) is "on", then we protect hint-bit-only writes. I think. I can see room to argue that with CRCs on we should do a full-page write to the WAL for a hint-bit-only change, or that we should add another GUC to control when we do this. 
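A hedged sketch of the three-valued GUC described above, using the config_enum_entry shape from utils/guc.h (re-declared here so the fragment stands alone); the "double" mode and the enum names are assumptions from this discussion, not committed code:

#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    FULL_PAGE_WRITES_OFF,       /* current "off": no torn-page protection */
    FULL_PAGE_WRITES_WAL,       /* current "on": full page images in WAL */
    FULL_PAGE_WRITES_DOUBLE     /* proposed: use the double-write buffer */
} FullPageWritesMode;

struct config_enum_entry        /* shape as in utils/guc.h */
{
    const char *name;
    int         val;
    bool        hidden;
};

static const struct config_enum_entry full_page_writes_options[] = {
    {"off", FULL_PAGE_WRITES_OFF, false},
    {"wal", FULL_PAGE_WRITES_WAL, false},
    {"double", FULL_PAGE_WRITES_DOUBLE, false},
    {NULL, 0, false}
};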
I'm going to take a shot at writing a patch for background hinting over the holidays, which I think has benefit alone but also boosts the value of these patches, since it would reduce the double-write activity otherwise needed to prevent spurious errors when using CRCs. This whole area has some overlap with spreading writes, I think. The double-write approach seems to count on writing a bunch of pages (potentially from different disk files) sequentially to the double-write buffer, fsyncing that, and then writing the actual pages -- which must be fsynced before the related portion of the double-write buffer can be reused. The simple implementation would be to fsync the files just written to if they required a prior write to the double-write buffer, although fancier techniques could be used to try to optimize that. Again, setting hint bits before the write when possible would help reduce the impact of that. -Kevin
On Thu, Dec 22, 2011 at 4:00 AM, Jesper Krogh <jesper@krogh.cc> wrote: > On 2011-12-22 09:42, Florian Weimer wrote: >> >> * David Fetter: >> >>> The issue is that double writes needs a checksum to work by itself, >>> and page checksums more broadly work better when there are double >>> writes, obviating the need to have full_page_writes on. >> >> How desirable is it to disable full_page_writes? Doesn't it cut down >> recovery time significantly because it avoids read-modify-write cycles >> with a cold cache? > > What are the downsides of having full_page_writes enabled, except for > log volume? The manual mentions something about speed, but it is > a bit unclear where that would come from, since the full pages must > be somewhere in memory when being worked on anyway.

I thought I would share some of my perspective on this checksum + doublewrite proposal from a performance point of view. Currently, what I see in our tests based on dbt2, DVDStore, etc., is that checksums do not impact scalability or total measured throughput. They do increase CPU cycles depending on the algorithm used, but not by enough to cause problems.

The doublewrite change will be the big performance win compared to full_page_writes. For example, compared to other databases our WAL traffic is one of the highest, and most of it is attributed to full_page_writes. The reason full_page_writes is necessary in production (at least without worrying about replication impact) is that if a write fails, we can recover that whole page from the WAL logs as-is and just put it back out there. (In fact, I believe that's what recovery does.) The net impact, however, is that during heavy OLTP the runtime load on WAL is high due to that traffic, and compared to other databases the utilization is correspondingly higher. It also has a huge impact on transaction response time the first time a page is changed, which matters in all OLTP environments because the transactions are by nature on random pages.

When we use doublewrite with checksums, we can safely disable full_page_writes, causing a HUGE reduction in WAL traffic without loss of reliability due to a write fault, since there are two writes always. (Implementation details discussable.) Since the double writes themselves are sequential, bundling multiple such writes further reduces the write time. The biggest improvement is that these writes are done not at TRANSACTION COMMIT but at CHECKPOINT WRITE time, which improves transaction performance drastically for OLTP applications while still giving the reliability that is needed.

Typically, performance in terms of throughput is:

tps (full_page_writes on) << tps (no full page writes)

With doublewrite and CRC we see:

tps (full_page_writes on) << tps (doublewrite) < tps (no full page writes)

which is a big win for production systems that want the reliability of full_page_writes. A side effect is that response times are more level, unlike with full_page_writes, where response time varies from roughly 0.5ms to 5ms depending on whether the same transaction needs to write a full page to WAL or not. With doublewrite it can stay around 0.5ms rather than showing a huge deviation in transaction performance. Folks measuring the 90th-percentile response time will see huge relief in trying to meet their SLAs. Also, from a WAL perspective, I like to put the WAL on its own LUN/spindle/VMDK, etc.
The net result is that with the reduced WAL traffic my utilization drops, which means the same hardware can now handle more WAL traffic in terms of IOPS, so the WAL itself becomes less of a bottleneck. Typically this is observed as a reduction in transaction response times and an increase in tps until some other bottleneck becomes the gating factor. So overall this is a big win. Regards, Jignesh
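A hedged sketch, in plain POSIX I/O rather than PostgreSQL source, of the write ordering Jignesh describes: pages go sequentially to the double-write file and are made durable there before any in-place write, so a torn in-place write can always be repaired from the double-write copy (error handling elided):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct
{
    int   data_fd;          /* file holding the page's real location */
    off_t offset;           /* byte offset of the page in that file */
    char  buf[BLCKSZ];      /* page image, checksum already computed */
} DirtyPage;

static void
flush_batch(int dw_fd, DirtyPage *pages, int n)
{
    /* 1. sequential append to the double-write file, then make it durable */
    for (int i = 0; i < n; i++)
        write(dw_fd, pages[i].buf, BLCKSZ);
    fsync(dw_fd);

    /* 2. only now write each page to its home location */
    for (int i = 0; i < n; i++)
        pwrite(pages[i].data_fd, pages[i].buf, BLCKSZ, pages[i].offset);

    /* 3. the data files must be fsync'd before these double-write
     *    slots can be reused for the next batch */
}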
Jignesh Shah <jkshah@gmail.com> wrote: > When we use doublewrite with checksums, we can safely disable > full_page_writes, causing a HUGE reduction in WAL traffic > without loss of reliability due to a write fault, since there are > two writes always. (Implementation details discussable.) The "always" there surprised me. It seemed to me that we only need to do the double-write where we currently do full page writes or unlogged writes. In thinking about your message, it finally struck me that this might require a WAL record to be written with the checksum (or CRC; whatever we use). Still, writing a WAL record with a CRC prior to the page write would be less data than the full page. Doing double-writes instead for situations without the torn page risk seems likely to be a net performance loss, although I have no benchmarks to back that up (not having a double-write implementation to test). And if we can get correct behavior without doing either (the checksum WAL record or the double-write), that's got to be a clear win. -Kevin
On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Jignesh Shah <jkshah@gmail.com> wrote: > >> When we use doublewrite with checksums, we can safely disable >> full_page_writes, causing a HUGE reduction in WAL traffic >> without loss of reliability due to a write fault, since there are >> two writes always. (Implementation details discussable.) > > The "always" there surprised me. It seemed to me that we only need > to do the double-write where we currently do full page writes or > unlogged writes. In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that makes a page dirty after a checkpoint. This scheme works because all subsequent changes are relative to that first full page image: if a checkpoint write fails, the page can be recreated from the full page write plus all the delta changes in WAL. In the doublewrite implementation, every checkpoint write is double written, so if the write to the doublewrite area fails, the original page is not corrupted, and if the second write to the actual data page fails, it can be recovered from the earlier write. Now, while it is true that this means 2X writes during checkpoint, I can argue that there are the same 2X writes right now, except that 1X of them goes to WAL DURING TRANSACTION COMMIT. Also, since the doublewrite area is generally a separate file, writes to it are essentially sequential and don't incur the same write latencies as the actual checkpoint writes. So if you look at the net amount of writes, it is the same. For unlogged tables, doublewriting them is not much of a penalty, even though they were not logged in WAL before. Doing the double write for them is still safe and gives those tables resilience even though it is not required. The net result is that the underlying page is never "irrecoverable" due to failed writes.

> me that this might require a WAL record to be written with the > checksum (or CRC; whatever we use). Still, writing a WAL record > with a CRC prior to the page write would be less data than the full > page. Doing double-writes instead for situations without the torn > page risk seems likely to be a net performance loss, although I have > no benchmarks to back that up (not having a double-write > implementation to test). And if we can get correct behavior without > doing either (the checksum WAL record or the double-write), that's > got to be a clear win.

I am not sure why one would want to write the checksum to WAL. As for the double writes, there is in fact no net loss, because (a) the writes to the doublewrite area are sequential, so the write calls are relatively fast and in fact cause no latency increase to transactions, unlike full_page_writes; and (b) the doublewrite area can be moved to a different location so there is no stress on the default tablespace, if you are worried about that spindle handling 2X writes (much as full_page_writes is mitigated by moving pg_xlog to a different spindle). My own tests support that the net result is almost as fast as full_page_writes=off, not quite the same due to the extra write (which buys you the desired reliability), but way better than full_page_writes=on. Regards, Jignesh > -Kevin
On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah@gmail.com> wrote: > In the double write implementation, every checkpoint write is double > written, Unless I'm quite thoroughly confused, which is possible, the double write will need to happen the first time a buffer is written following each checkpoint. Which might mean the next checkpoint, but it could also be sooner if the background writer kicks in, or in the worst case a buffer has to do its own write. Furthermore, we can't *actually* write any pages until they are written *and fsync'd* to the double-write buffer. So the penalty for the background writer failing to do the right thing is going to go up enormously. Think about VACUUM or COPY IN, using a ring buffer and kicking out its own pages. Every time it evicts a page, it is going to have to doublewrite the buffer, fsync it, and then write it for real. That is going to make PostgreSQL 6.5 look like a speed demon. The background writer or checkpointer can conceivably dump a bunch of pages into the doublewrite area and then fsync the whole thing in bulk, but a backend that needs to evict a page only wants one page, so it's pretty much screwed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 22, 2011 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah@gmail.com> wrote: >> In the double write implementation, every checkpoint write is double >> written, > > Unless I'm quite thoroughly confused, which is possible, the double > write will need to happen the first time a buffer is written following > each checkpoint. Which might mean the next checkpoint, but it could > also be sooner if the background writer kicks in, or in the worst case > a buffer has to do its own write. > Logically the double write happens for every checkpoint write and it gets fsynced. Implementation-wise, you can do a chunk of those pages, like we do in sets of pages, and sync them once; and yes, it still performs better than full_page_writes. As long as you compare with full_page_writes=on, the scheme is always much better. If you compare it with full_page_writes=off, performance is slightly lower, but then you lose the reliability. So performance testers like me, who always turn off full_page_writes during benchmark runs anyway, will not see any impact. However, folks in production who are rightly scared to turn off full_page_writes will have a way to increase performance without fearing failed writes. > Furthermore, we can't *actually* write any pages until they are > written *and fsync'd* to the double-write buffer. So the penalty for > the background writer failing to do the right thing is going to go up > enormously. Think about VACUUM or COPY IN, using a ring buffer and > kicking out its own pages. Every time it evicts a page, it is going > to have to doublewrite the buffer, fsync it, and then write it for > real. That is going to make PostgreSQL 6.5 look like a speed demon. Like I said, implementation-wise it depends on how many such pages you sync simultaneously, and real tests show it is actually much faster than one expects. > The background writer or checkpointer can conceivably dump a bunch of > pages into the doublewrite area and then fsync the whole thing in > bulk, but a backend that needs to evict a page only wants one page, so > it's pretty much screwed. > Generally, at what point you pay the penalty is a trade-off. I would argue that you are making me pay for the full page write at the first transaction commit that changes the page, which I can never avoid, and the result is a transaction response time that is unacceptable, since the deviation from a similar transaction that modifies an already-dirty page is a lot less. However, I can avoid page evictions by selecting a bigger bufferpool (not that I necessarily want to do that, but I have a choice, without losing reliability). Regards, Jignesh > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon, does it sound like I understand your proposal? Yes, thanks for restating. > Now, on to the separate-but-related topic of double-write. That > absolutely requires some form of checksum or CRC to detect torn > pages, in order for the technique to work at all. Adding a CRC > without double-write would work fine if you have a storage stack > which prevents torn pages in the file system or hardware driver. If > you don't have that, it could create a damaged page indication after > a hardware or OS crash, although I suspect that would be the > exception, not the typical case. Given all that, and the fact that > it would be cleaner to deal with these as two separate patches, it > seems the CRC patch should go in first. (And, if this is headed for > 9.2, *very soon*, so there is time for the double-write patch to > follow.) It could work that way, but I seriously doubt that a technique only mentioned in dispatches one month before the last CF is likely to become trustable code within one month. We've been discussing CRCs for years, so assembling the puzzle seems much easier, when all the parts are available. > It seems to me that the full_page_writes GUC could become an > enumeration, with "off" having the current meaning, "wal" meaning > what "on" now does, and "double" meaning that the new double-write > technique would be used. (It doesn't seem to make any sense to do > both at the same time.) I don't think we need a separate GUC to tell > us *what* to protect against torn pages -- if not "off" we should > always protect the first write of a page after checkpoint, and if > "double" and write_page_crc (or whatever we call it) is "on", then we > protect hint-bit-only writes. I think. I can see room to argue that > with CRCs on we should do a full-page write to the WAL for a > hint-bit-only change, or that we should add another GUC to control > when we do this. > > I'm going to take a shot at writing a patch for background hinting > over the holidays, which I think has benefit alone but also boosts > the value of these patches, since it would reduce double-write > activity otherwise needed to prevent spurious error when using CRCs. I would suggest you examine how to have an array of N bgwriters, then just slot the code for hinting into the bgwriter. That way a bgwriter can set hints, calc CRC and write pages in sequence on a particular block. The hinting needs to be synchronised with the writing to give good benefit. If we want page checksums in 9.2, I'll need your help, so the hinting may be a sidetrack. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > It could work that way, but I seriously doubt that a technique > only mentioned in dispatches one month before the last CF is > likely to become trustable code within one month. We've been > discussing CRCs for years, so assembling the puzzle seems much > easier, when all the parts are available. Well, double-write has been mentioned on the lists for years, sometimes in conjunction with CRCs, and I get the impression this is one of those things which has been worked on out of the community's view for a while and is just being posted now. That's often not viewed as the ideal way for development to proceed from a community standpoint, but it's been done before with some degree of success -- particularly when a feature has been bikeshedded to a standstill. ;-) > I would suggest you examine how to have an array of N bgwriters, > then just slot the code for hinting into the bgwriter. That way a > bgwriter can set hints, calc CRC and write pages in sequence on a > particular block. The hinting needs to be synchronised with the > writing to give good benefit. I'll think about that. I see pros and cons, and I'll have to see how those balance out after I mull them over. > If we want page checksums in 9.2, I'll need your help, so the > hinting may be a sidetrack. Well, VMware posted the initial patch, and that was the first I heard of it. I just had some off-line discussions with them after they posted it. Perhaps the engineers who wrote it should take your comments as a review and post a modified patch? It didn't seem like that pot of broth needed any more cooks, so I was going to go work on a nice dessert; but I agree that any way I can help along either of the $Subject patches should take priority. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: >> I would suggest you examine how to have an array of N bgwriters, >> then just slot the code for hinting into the bgwriter. That way a >> bgwriter can set hints, calc CRC and write pages in sequence on a >> particular block. The hinting needs to be synchronised with the >> writing to give good benefit. > > I'll think about that. I see pros and cons, and I'll have to see > how those balance out after I mull them over. I think maybe the best solution is to create some common code to use from both. The problem with *just* doing it in bgwriter is that it would not help much with workloads like Robert has been using for most of his performance testing -- a database which fits entirely in shared buffers and starts thrashing on CLOG. For a background hinter process my goal would be to deal with xids as they are passed by the global xmin value, so that you have a cheap way to know that they are ripe for hinting, and you can frequently hint a bunch of transactions that are all in the same CLOG page which is recent enough to likely be already loaded. Now, a background hinter isn't going to be a net win if it has to grovel through every tuple on every dirty page every time it sweeps through the buffers, so the idea depends on having a sufficiently efficient was to identify interesting buffers. I'm hoping to improve on this, but my best idea so far is to add a field to the buffer header for "earliest unhinted xid" for the page. Whenever this background process wakes up and is scanning through the buffers (probably just in buffer number order), it does a quick check, without any pin or lock, to see if the buffer is dirty and the earliest unhinted xid is below the global xmin. If it passes both of those tests, there is definitely useful work which can be done if the page doesn't get evicted before we can do it. We pin the page, recheck those conditions, and then we look at each tuple and hint where possible. As we go, we remember the earliest xid that we see which is *not* being hinted, to store back into the buffer header when we're done. Of course, we would also update the buffer header for new tuples or when an xmax is set if the xid involved precedes what we have in the buffer header. This would not only help avoid multiple page writes as unhinted tuples on the page are read, it would minimize thrashing on CLOG and move some of the hinting work from the critical path of reading a tuple into a background process. Thoughts? -Kevin
On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Thoughts? Those are good thoughts. Here's another random idea, which might be completely nuts. Maybe we could consider some kind of summarization of CLOG data, based on the idea that most transactions commit. We introduce the idea of a CLOG rollup page. On a CLOG rollup page, each bit represents the status of N consecutive XIDs. If the bit is set, that means all XIDs in that group are known to have committed. If it's clear, then we don't know, and must fall through to a regular CLOG lookup. If you let N = 1024, then 8K of CLOG rollup data is enough to represent the status of 64 million transactions, which means that just a couple of pages could cover as much of the XID space as you probably need to care about. Also, you would need to replace CLOG summary pages in memory only very infrequently. Backends could test the bit without any lock. If it's set, they do pg_read_barrier(), and then check the buffer label to make sure it's still the summary page they were expecting. If so, no CLOG lookup is needed. If the page has changed under us or the bit is clear, then we fall through to a regular CLOG lookup. An obvious problem is that, if the abort rate is significantly different from zero, and especially if the aborts are randomly mixed in with commits rather than clustered together in small portions of the XID space, the CLOG rollup data would become useless. On the other hand, if you're doing 10k tps, you only need to have a window of a tenth of a second or so where everything commits in order to start getting some benefit, which doesn't seem like a stretch. Perhaps the CLOG rollup data wouldn't even need to be kept on disk. We could simply have bgwriter (or bghinter) set the rollup bits in shared memory for new transactions, as it becomes possible to do so, and let lookups for XIDs prior to the last shutdown fall through to CLOG. Or, if that's not appealing, we could reconstruct the data in memory by groveling through the CLOG pages - or maybe just set summary bits only for CLOG pages that actually get faulted in. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
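A hedged sketch of the lock-free check Robert outlines, with N = 1024; the structure and function names are made up for illustration, and __sync_synchronize() stands in for pg_read_barrier():

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define XIDS_PER_BIT  1024               /* N in the example above */
#define BITS_PER_PAGE (8192 * 8)         /* one 8K rollup page = 64M xids */
#define N_SLOTS       2

/* read barrier; stand-in for PostgreSQL's barrier primitive */
#define read_barrier() __sync_synchronize()

typedef struct
{
    uint32_t page_number;                /* which rollup page is loaded */
    uint8_t  bits[8192];                 /* 1 = all xids in group committed */
} RollupSlot;

static RollupSlot rollup[N_SLOTS];

static bool
xid_group_known_committed(TransactionId xid)
{
    uint32_t group = xid / XIDS_PER_BIT;
    uint32_t page = group / BITS_PER_PAGE;
    uint32_t bit = group % BITS_PER_PAGE;
    RollupSlot *s = &rollup[page % N_SLOTS];

    if (s->bits[bit / 8] & (1u << (bit % 8)))
    {
        read_barrier();               /* order the bit read vs. label check */
        if (s->page_number == page)   /* slot not replaced underneath us */
            return true;              /* no CLOG lookup needed */
    }
    return false;                     /* fall through to a regular lookup */
}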
Robert Haas <robertmhaas@gmail.com> writes: > An obvious problem is that, if the abort rate is significantly > different from zero, and especially if the aborts are randomly mixed > in with commits rather than clustered together in small portions of > the XID space, the CLOG rollup data would become useless. Yeah, I'm afraid that with N large enough to provide useful acceleration, the cases where you'd actually get a win would be too thin on the ground to make it worth the trouble. regards, tom lane
On Fri, Dec 23, 2011 at 12:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> An obvious problem is that, if the abort rate is significantly >> different from zero, and especially if the aborts are randomly mixed >> in with commits rather than clustered together in small portions of >> the XID space, the CLOG rollup data would become useless. > > Yeah, I'm afraid that with N large enough to provide useful > acceleration, the cases where you'd actually get a win would be too thin > on the ground to make it worth the trouble. Well, I don't know: something like pgbench is certainly going to benefit, because all the transactions commit. I suspect that's true for many benchmarks. Whether it's true of real-life workloads is more arguable, of course, but if the benchmarks aren't measuring things that people really do with the database, then why are they designed the way they are? I've certainly written applications that relied on the database for integrity checking, so rollbacks were an expected occurrence, but then again those were very low-velocity systems where there wasn't going to be enough CLOG contention to matter anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/23/11, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Thoughts? > > Those are good thoughts. > > Here's another random idea, which might be completely nuts. Maybe we > could consider some kind of summarization of CLOG data, based on the > idea that most transactions commit. I had a perhaps crazier idea. Aren't CLOG pages older than global xmin effectively read only? Could backends that need these bypass locking and shared memory altogether? > An obvious problem is that, if the abort rate is significantly > different from zero, and especially if the aborts are randomly mixed > in with commits rather than clustered together in small portions of > the XID space, the CLOG rollup data would become useless. On the > other hand, if you're doing 10k tps, you only need to have a window of > a tenth of a second or so where everything commits in order to start > getting some benefit, which doesn't seem like a stretch. Could we get some major OLTP users to post their CLOG for analysis? I wouldn't think there would be many security/proprietary issues with CLOG data. Cheers, Jeff
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> An obvious problem is that, if the abort rate is significantly >> different from zero, and especially if the aborts are randomly >> mixed in with commits rather than clustered together in small >> portions of the XID space, the CLOG rollup data would become >> useless. > > Yeah, I'm afraid that with N large enough to provide useful > acceleration, the cases where you'd actually get a win would be > too thin on the ground to make it worth the trouble. Just to get a real-life data point, I checked the pg_clog directory for Milwaukee County Circuit Courts. They have about 300 OLTP users, plus replication feeds to the central servers. Looking at the now-present files, there are 19,104 blocks of 256 bytes (at two status bits per xid, each such block covers 1,024 xids, matching N of 1024 in Robert's example). Of those, 12,644 (just over 66%) contain 256 bytes of hex 55; since 0x55 packs four consecutive "committed" statuses into one byte, each such block represents 1,024 consecutive committed transactions. "Last modified" dates on the files go back to the 4th of October, so this represents roughly three months worth of real-life transactions. -Kevin
Jeff Janes <jeff.janes@gmail.com> wrote: > Could we get some major OLTP users to post their CLOG for > analysis? I wouldn't think there would be many > security/proprietary issues with CLOG data. FWIW, I got the raw numbers to do my quick check using this Ruby script (put together for me by Peter Brant). If it is of any use to anyone else, feel free to use it and/or post any enhanced versions of it.

#!/usr/bin/env ruby

Dir.glob("*") do |file_name|
  contents = File.read(file_name)
  total = contents.enum_for(:each_byte).enum_for(:each_slice, 256).inject(0) do |count, chunk|
    if chunk.all? { |b| b == 0x55 }
      count + 1
    else
      count
    end
  end
  printf "%s %d\n", file_name, total
end

-Kevin
Jeff Janes <jeff.janes@gmail.com> writes: > I had a perhaps crazier idea. Aren't CLOG pages older than global xmin > effectively read only? Could backends that need these bypass locking > and shared memory altogether? Hmm ... once they've been written out from the SLRU arena, yes. In fact you don't need to go back as far as global xmin --- *any* valid xmin is a sufficient boundary point. The only real problem is to know whether the data's been written out from the shared area yet. This idea has potential. I like it better than Robert's, mainly because I do not want to see us put something in place that would lead people to try to avoid rollbacks. regards, tom lane
On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: > >> Simon, does it sound like I understand your proposal? > > Yes, thanks for restating. I've implemented that proposal, posting patch on a separate thread. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2011-12-22 at 03:50 -0600, Kevin Grittner wrote: > Now, on to the separate-but-related topic of double-write. That > absolutely requires some form of checksum or CRC to detect torn > pages, in order for the technique to work at all. Adding a CRC > without double-write would work fine if you have a storage stack > which prevents torn pages in the file system or hardware driver. If > you don't have that, it could create a damaged page indication after > a hardware or OS crash, although I suspect that would be the > exception, not the typical case. Given all that, and the fact that > it would be cleaner to deal with these as two separate patches, it > seems the CRC patch should go in first.

I think it could be broken down further. Taking a step back, there are several types of HW-induced corruption, and checksums only catch some of them. For instance, the disk losing data completely and just returning zeros won't be caught, because we assume that a zero page is just fine. From a development standpoint, I think a better approach would be:

1. Investigate if there are reasonable ways to ensure that (outside of recovery) pages are always initialized, so that zero pages can be treated as corruption.

2. Make some room in the page header for checksums and maybe some other simple sanity information (like file and page number). It will be a big project to sort out the pg_upgrade issues (as Tom and others have pointed out).

3. Attack hint bits problem.

If (1) and (2) were complete, we would catch many common types of corruption, and we'd be in a much better position to think clearly about hint bits, double writes, etc.

Regards, Jeff Davis
On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> 3. Attack the hint bits problem.

A large number of problems would go away if the current hint bit system could be replaced with something that did not require writing to the tuple itself. FWIW, moving the bits around seems like a non-starter -- you're trading one problem for a much bigger one (locking, wal logging, etc). But perhaps a clog caching strategy would be a win. You get a full nibble back in the tuple header, significant i/o reduction for some workloads, crc becomes relatively trivial, etc etc.

My first attempt at a process-local cache for hint bits wasn't perfect, but it proved (at least to me) that you can sneak a tight cache in there without significantly impacting the general case. Maybe the angle of attack was wrong anyways -- I bet if you kept a judicious number of clog pages in each local process with some smart invalidation you could cover enough cases that scribbling the bits down would become unnecessary. Proving that is a tall order of course, but IMO it merits another attempt.

merlin
On Tue, 2011-12-27 at 16:43 -0600, Merlin Moncure wrote:
> On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > 3. Attack the hint bits problem.
>
> A large number of problems would go away if the current hint bit
> system could be replaced with something that did not require writing
> to the tuple itself.

My point was that neither the zero page problem nor the upgrade problem is solved by addressing the hint bits problem. They can be solved independently, and in my opinion, it seems to make sense to solve those problems before the hint bits problem (in the context of detecting hardware corruption).

Of course, don't let that stop you from trying to get rid of hint bits; that has numerous potential benefits.

Regards,
Jeff Davis
On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I bet if you kept a judicious number of
> clog pages in each local process with some smart invalidation you
> could cover enough cases that scribbling the bits down would become
> unnecessary.

I don't understand how any cache can completely remove the need for hint bits. Without hint bits the xids in the tuples will be "in-doubt" forever. No matter how large your cache, you'll always come across tuples that are arbitrarily old and are drawn from a set of xids of unbounded size.

We could replace the xids with a frozen xid sooner, but that just amounts to nearly the same thing as the hint bits, only with page locking and wal records.

-- 
greg
On Wed, Dec 28, 2011 at 8:45 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> I bet if you kept a judicious number of
>> clog pages in each local process with some smart invalidation you
>> could cover enough cases that scribbling the bits down would become
>> unnecessary.
>
> I don't understand how any cache can completely remove the need for
> hint bits. Without hint bits the xids in the tuples will be "in-doubt"
> forever. No matter how large your cache, you'll always come across
> tuples that are arbitrarily old and are drawn from a set of xids of
> unbounded size.

Well, strictly speaking hint bits aren't needed; they are an optimization to guard against clog lookups. But is marking bits on the tuple the only way to get that effect? I'm conjecturing that some process-local memory could be laid on top of the clog slru that would be fast enough that it could take the place of the tuple bits in the visibility check. Maybe this could reduce clog contention as well -- or maybe the idea is unworkable. That said, it shouldn't be that much work to put together a proof of concept to test the idea.

> We could replace the xids with a frozen xid sooner, but that just
> amounts to nearly the same thing as the hint bits, only with page
> locking and wal records.

Right -- I don't think that helps.

merlin
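A rough sketch of the backend-local cache Merlin is conjecturing, all names invented for illustration: a small direct-mapped array of CLOG pages kept in process-local memory, consulted before touching the shared SLRU. The hard part he alludes to -- knowing when a cached page can still change and must not be trusted -- is waved away behind ClogPageIsFrozen(); ReadClogPageFromSlru() stands in for the existing locked read path.

    #define LOCAL_CLOG_PAGES    16
    #define CLOG_PAGE_SIZE      8192

    typedef struct
    {
        int     pageno;                 /* cached CLOG page, or -1 */
        char    data[CLOG_PAGE_SIZE];
    } LocalClogSlot;

    static LocalClogSlot local_clog[LOCAL_CLOG_PAGES];

    static bool
    LocalClogGetByte(int pageno, int byteno, unsigned char *out)
    {
        LocalClogSlot *slot = &local_clog[pageno % LOCAL_CLOG_PAGES];

        if (slot->pageno == pageno)
        {
            *out = slot->data[byteno];  /* hit: no shared memory touched */
            return true;
        }

        if (ClogPageIsFrozen(pageno))
        {
            /* miss on an immutable page: copy it in once, serve locally */
            ReadClogPageFromSlru(pageno, slot->data);
            slot->pageno = pageno;
            *out = slot->data[byteno];
            return true;
        }

        return false;                   /* caller uses the locked path */
    }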
On Dec 23, 2011, at 2:23 PM, Kevin Grittner wrote:
> Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> Could we get some major OLTP users to post their CLOG for
>> analysis? I wouldn't think there would be many
>> security/proprietary issues with CLOG data.
>
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant). If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.

Here's output from our largest OLTP system... not sure exactly how to interpret it, so I'm just providing the raw data. This spans almost exactly 1 month. I have a number of other systems I can profile if anyone's interested.

063A 379  063B 143  063C 94  063D 94  063E 326  063F 113  0640 122  0641 270
0642 81  0643 390  0644 183  0645 76  0646 61  0647 50  0648 275  0649 288
064A 126  064B 53  064C 59  064D 125  064E 357  064F 92  0650 54  0651 83
0652 267  0653 328  0654 118  0655 75  0656 104  0657 280  0658 414  0659 105
065A 74  065B 153  065C 303  065D 63  065E 216  065F 169  0660 113  0661 405
0662 85  0663 52  0664 44  0665 78  0666 412  0667 116  0668 48  0669 61
066A 66  066B 364  066C 104  066D 48  066E 68  066F 104  0670 465  0671 158
0672 64  0673 62  0674 115  0675 452  0676 296  0677 65  0678 80  0679 177
067A 316  067B 86  067C 87  067D 270  067E 84  067F 295  0680 299  0681 88
0682 35  0683 67  0684 66  0685 456  0686 146  0687 52  0688 33  0689 73
068A 147  068B 345  068C 107  068D 67  068E 50  068F 97  0690 473  0691 156
0692 47  0693 57  0694 97  0695 550  0696 224  0697 51  0698 80  0699 280
069A 115  069B 426  069C 241  069D 395  069E 98  069F 130  06A0 523  06A1 296
06A2 92  06A3 97  06A4 122  06A5 524  06A6 256  06A7 118  06A8 111  06A9 157
06AA 553  06AB 166  06AC 106  06AD 103  06AE 200  06AF 621  06B0 288  06B1 95
06B2 107  06B3 227  06B4 92  06B5 447  06B6 210  06B7 364  06B8 119  06B9 113
06BA 384  06BB 319  06BC 45  06BD 68  06BE 2

-- 
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
Jim Nasby <jim@nasby.net> wrote:
> Here's output from our largest OLTP system... not sure exactly how
> to interpret it, so I'm just providing the raw data. This spans
> almost exactly 1 month.

Those numbers wind up meaning that 18% of the 256-byte blocks (1024 transactions each) were all commits. Yikes. That pretty much shoots down Robert's idea of summarized CLOG data, I think.

-Kevin
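For anyone reproducing Kevin's figure from a listing like Jim's, the arithmetic is straightforward, assuming full segments (each CLOG segment file is 256 kB, i.e. 1024 blocks of 256 bytes, each block covering 1024 xids):

    all-commit fraction = sum(per-file counts) / (number of files * 1024)

Jim's listing runs from 063A through 06BE, i.e. 133 segment files, so the denominator is 133 * 1024 = 136,192 blocks; an 18% rate puts the summed counts in the neighborhood of 24,500.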
On Wed, Jan 4, 2012 at 3:02 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Jim Nasby <jim@nasby.net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those numbers wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits. Yikes. That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

I'm not *totally* certain of that... another way to look at it is that I have to be able to show a win even if only 18% of the probes into the summarized data are successful, which doesn't seem totally out of the question given how cheap I think lookups could be. But I'll admit it's not real encouraging.

I think the first thing we need to look at is increasing the number of CLOG buffers. Even if hypothetical summarized CLOG data had a 60% hit rate rather than 18%, 8 CLOG buffers is probably still not going to be enough for a 32-core system, let alone anything larger. I am aware of two concerns here:

1. Unconditionally adding more CLOG buffers will increase PostgreSQL's minimum memory footprint, which is bad for people suffering under default shared memory limits or running a database on a device with less memory than a low-end cell phone.

2. The CLOG code isn't designed to manage a large number of buffers, so adding more might cause a performance regression on small systems.

On Nate Boley's 32-core system, running pgbench at scale factor 100, the optimal number of buffers seems to be around 32. I'd like to get some test results from smaller systems - any chance you (or anyone) have, say, an 8-core box you could test on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> 2. The CLOG code isn't designed to manage a large number of
> buffers, so adding more might cause a performance regression on
> small systems.
>
> On Nate Boley's 32-core system, running pgbench at scale factor
> 100, the optimal number of buffers seems to be around 32. I'd
> like to get some test results from smaller systems - any chance
> you (or anyone) have, say, an 8-core box you could test on?

Hmm. I can think of a lot of 4-core servers I could test on. (We have a few poised to go into production where it would be relatively easy to do benchmarking without distorting factors right now.) After that we jump to 16 cores, unless I'm forgetting something. These are currently all in production, but some of them are redundant machines which could be pulled for a few hours here and there for benchmarks. If either of those seems worthwhile, please spec the useful tests so I can capture the right information.

-Kevin
On Jan 4, 2012, at 2:02 PM, Kevin Grittner wrote:
> Jim Nasby <jim@nasby.net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those numbers wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits. Yikes. That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

Here's another data point. This is for a londiste slave of what I posted earlier. Note that this slave has no users on it.

054A 654  054B 835  054C 973  054D 1020  054E 1012  054F 1022  0550 284

And these clog files are from Sep 15-30... I believe that's the period when we were building this slave, but I'm not 100% certain.

04F0 194  04F1 253  04F2 585  04F3 243  04F4 176  04F5 164  04F6 358  04F7 505
04F8 168  04F9 180  04FA 369  04FB 318  04FC 236  04FD 437  04FE 242  04FF 625
0500 222  0501 139  0502 174  0503 91  0504 546  0505 220  0506 187  0507 151
0508 199  0509 491  050A 232  050B 170  050C 191  050D 414  050E 557  050F 231
0510 173  0511 159  0512 436  0513 789  0514 354  0515 157  0516 187  0517 333
0518 599  0519 483  051A 300  051B 512  051C 713  051D 422  051E 291  051F 596
0520 785  0521 825  0522 484  0523 238  0524 151  0525 190  0526 256  0527 403
0528 551  0529 757  052A 837  052B 418  052C 256  052D 161  052E 254  052F 423
0530 469  0531 757  0532 627  0533 325  0534 224  0535 295  0536 290  0537 352
0538 561  0539 565  053A 833  053B 756  053C 485  053D 276  053E 241  053F 270
0540 334  0541 306  0542 700  0543 821  0544 402  0545 199  0546 226  0547 250
0548 354  0549 587

This is for a slave of that database that does have user activity:

054A 654  054B 835  054C 420  054D 432  054E 852  054F 666  0550 302  0551 243
0552 600  0553 295  0554 617  0555 504  0556 232  0557 304  0558 580  0559 156

-- 
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
On Wed, Jan 4, 2012 at 4:02 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> 2. The CLOG code isn't designed to manage a large number of
>> buffers, so adding more might cause a performance regression on
>> small systems.
>>
>> On Nate Boley's 32-core system, running pgbench at scale factor
>> 100, the optimal number of buffers seems to be around 32. I'd
>> like to get some test results from smaller systems - any chance
>> you (or anyone) have, say, an 8-core box you could test on?
>
> Hmm. I can think of a lot of 4-core servers I could test on. (We
> have a few poised to go into production where it would be relatively
> easy to do benchmarking without distorting factors right now.)
> After that we jump to 16 cores, unless I'm forgetting something.
> These are currently all in production, but some of them are
> redundant machines which could be pulled for a few hours here and
> there for benchmarks. If either of those seems worthwhile, please
> spec the useful tests so I can capture the right information.

Yes, both of those seem useful. To compile, I do this:

./configure --prefix=$HOME/install/$BRANCHNAME --enable-depend --enable-debug ${EXTRA_OPTIONS}
make
make -C contrib/pgbench
make check
make install
make -C contrib/pgbench install

In this case, the relevant builds would probably be (1) master, (2) master with NUM_CLOG_BUFFERS = 16, (3) master with NUM_CLOG_BUFFERS = 32, and (4) master with NUM_CLOG_BUFFERS = 48. (You could also try intermediate numbers if it seems warranted.)

Basic test setup:

rm -rf $PGDATA
~/install/master/bin/initdb
cat >> $PGDATA/postgresql.conf <<EOM
shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
EOM

I'm attaching a driver script you can modify to taste.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
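For context, NUM_CLOG_BUFFERS was at the time a plain compile-time constant (in src/include/access/clog.h, if memory serves, with a default of 8), so each of the builds above amounts to a one-line edit plus a rebuild:

    /* src/include/access/clog.h -- vary to 16, 32, or 48 for the test builds */
    #define NUM_CLOG_BUFFERS    32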
On Jan 4, 2012, at 21:27, Robert Haas wrote:
> I think the first thing we need to look at is increasing the number of
> CLOG buffers.

What became of the idea to treat the stable (i.e. earlier than the oldest active xid) and the unstable (i.e. the rest) parts of the CLOG differently?

On 64-bit machines at least, we could simply mmap() the stable parts of the CLOG into the backend address space, and access them without any locking at all.

I believe that we could also compress the stable part by 50% if we use one instead of two bits per txid. AFAIK, we need two bits because we

a) distinguish between transactions which were ABORTED and those which never completed (due to, e.g., a backend crash), and

b) mark transactions as SUBCOMMITTED to achieve atomic commits.

Neither of which is strictly necessary for the stable parts of the clog. Note that we could still keep the uncompressed CLOG around for debugging purposes - the additional compressed version would require only 2^32/8 bytes = 512 MB in the worst case, which people who're serious about performance can very probably spare.

The fly in the ointment are 32-bit machines, of course - but then, those could still fall back to the current way of doing things.

best regards,
Florian Pflug
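A sketch of the scheme Florian describes, with invented names and the error handling pared down: the stable region, recoded at one bit per xid (1 = committed, 0 = aborted; the in-progress and subcommitted states cannot occur there), is mapped read-only once per backend and then consulted without any locks at all.

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/mman.h>

    static const uint8_t *stable_clog;  /* 1 bit per xid, mapped read-only */

    /* Map the compressed stable CLOG; done once at backend startup. */
    static int
    MapStableClog(int fd, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        if (p == MAP_FAILED)
            return -1;
        stable_clog = (const uint8_t *) p;
        return 0;
    }

    /* No lock needed: the mapped region is immutable by construction. */
    static bool
    StableXidCommitted(uint32_t xid)
    {
        return (stable_clog[xid / 8] >> (xid % 8)) & 1;
    }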
On Thu, Jan 5, 2012 at 5:15 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan 4, 2012, at 21:27, Robert Haas wrote:
>> I think the first thing we need to look at is increasing the number of
>> CLOG buffers.
>
> What became of the idea to treat the stable (i.e. earlier than the oldest
> active xid) and the unstable (i.e. the rest) parts of the CLOG differently?

I'm curious -- does anyone happen to have an idea how big the unstable CLOG xid space is in the "typical" case? What would be the main driver of making it bigger? What are the main tradeoffs in terms of trying to keep the unstable area compact?

merlin
On Thu, Jan 5, 2012 at 6:15 AM, Florian Pflug <fgp@phlo.org> wrote:
> On 64-bit machines at least, we could simply mmap() the stable parts of the
> CLOG into the backend address space, and access them without any locking at all.

True. I think this could be done, but it would take some fairly careful thought and testing, because (1) we don't currently use mmap() anywhere else in the backend AFAIK, so we might run into portability issues (think: Windows) and perhaps unexpected failure modes (e.g. mmap() fails because there are too many mappings already), and (2) it's not completely guaranteed to be a win. Sure, you save on locking, but now you are doing an mmap() call in every backend instead of just one read() into shared memory. If concurrency isn't a problem, that might be more expensive on net. Or maybe not, but I'm kind of inclined to steer clear of this whole area, at least for 9.2.

So far, the only test results I have support the notion that we run into trouble when NUM_CPUS > NUM_CLOG_BUFFERS and backends have to wait for a buffer before they can even start their I/Os. That can be fixed with a pretty modest amount of reengineering. I'm sure there is a second-order effect from the cost of repeated I/Os per se, which a backend-private cache of one form or another might well help with, but it may not be very big. Test results are welcome, of course.

> I believe that we could also compress the stable part by 50% if we use one
> instead of two bits per txid. AFAIK, we need two bits because we
>
> a) distinguish between transactions which were ABORTED and those which never
> completed (due to, e.g., a backend crash), and
>
> b) mark transactions as SUBCOMMITTED to achieve atomic commits.
>
> Neither of which is strictly necessary for the stable parts of the clog.

Well, if we're going to do compression at all, I'm inclined to think that we should compress by more than a factor of two. Jim Nasby's numbers (the worst we've seen so far) show that 18% of 1k blocks of XIDs were all commits. Presumably if we reduced the chunk size to, say, 8 transactions, that percentage would go up, and even that would be enough to get 16x compression rather than 2x. Of course, then keeping the uncompressed CLOG files becomes required rather than optional, but that's OK.

What bothers me about compressing by only 2x is that the act of compressing is not free. You have to read all the chunks and then write out new chunks, and those chunks then compete with each other in cache. Who is to say that we're not better off just reading the uncompressed data at that point? At least then we have only one copy of it.

> Note that
> we could still keep the uncompressed CLOG around for debugging purposes - the
> additional compressed version would require only 2^32/8 bytes = 512 MB in the
> worst case, which people who're serious about performance can very probably
> spare.

I don't think it'd be even that much, because we only ever use half the XID space at a time, and often probably much less: the default value of vacuum_freeze_table_age is only 150 million transactions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
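Incidentally, this is also why hex 55 keeps turning up in this thread: CLOG stores two status bits per xid and a committed xid is encoded as 01, so a byte holding four committed xids reads 01010101 = 0x55. Under the 8-xid chunks Robert describes, the all-commit test and the 16x ratio fall out directly; a sketch (names invented, first_xid assumed to be a multiple of 8):

    /*
     * Two status bits per xid, so four committed xids per byte = 0x55.
     * An 8-xid chunk spans two bytes (16 bits); summarizing it with a
     * single "all commits" bit is the 16x compression mentioned above.
     */
    static bool
    ChunkIsAllCommits(const unsigned char *clog_bytes, uint32_t first_xid)
    {
        const unsigned char *p = clog_bytes + first_xid / 4;

        return p[0] == 0x55 && p[1] == 0x55;
    }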
For what it's worth, here are the numbers on one of our biggest databases (same system as I posted about separately wrt seq_scan_cost vs random_page_cost).

0053 1001  00BA 1009  0055 1001  00B9 1020  0054 983  00BB 1010  0056 1001  00BC 1019
0069 0  00BD 1009  006A 224  00BE 1018  006B 1009  00BF 1008  006C 1008  00C0 1006
006D 1004  00C1 1014  006E 1016  00C2 1023  006F 1003  00C3 1012  0070 1011  00C4 1000
0071 1011  00C5 1002  0072 1005  00C6 982  0073 1009  00C7 996  0074 1013  00C8 973
0075 1002  00D1 987  0076 997  00D2 968  0077 1007  00D3 974  0078 1012  00D4 964
0079 994  00D5 981  007A 1013  00D6 964  007B 999  00D7 966  007C 1000  00D8 971
007D 1000  00D9 956  007E 1008  00DA 976  007F 1010  00DB 950  0080 1001  00DC 967
0081 1009  00DD 983  0082 1008  00DE 970  0083 988  00DF 965  0084 1007  00E0 984
0085 1012  00E1 1004  0086 1004  00E2 976  0087 996  00E3 941  0088 1008  00E4 960
0089 1003  00E5 948  008A 995  00E6 851  008B 1001  00E7 971  008C 1003  00E8 954
008D 982  00E9 938  008E 1000  00EA 931  008F 1008  00EB 956  0090 1009  00EC 960
0091 1013  00ED 962  0092 1006  00EE 933  0093 1012  00EF 956  0094 994  00F0 978
0095 1017  00F1 292  0096 1004  0097 1005  0098 1014  0099 1012  009A 994  0035 1003
009B 1007  0036 1004  009C 1010  0037 981  009D 1024  0038 1002  009E 1009  0039 998
009F 1011  003A 995  00A0 1015  003B 996  00A1 1018  003C 1013  00A5 1007  003D 1008
00A3 1016  003E 1007  00A4 1020  003F 989  00A7 375  0040 989  00A6 1010  0041 975
00A9 3  0042 994  00A8 0  0043 1010  00AA 1  0044 1007  00AB 1  0045 1008
00AC 0  0046 991  00AF 4  0047 1010  00AD 0  0048 997  00AE 0  0049 1002
00B0 5  004A 1004  00B1 0  004B 1012  00B2 0  004C 999  00B3 0  004D 1008
00B4 0  004E 1007  00B5 807  004F 1010  00B6 1007  0050 1004  00B7 1007  0051 1009
00B8 1006  0052 1005  0057 1008  00C9 994  0058 991  00CA 977  0059 1000  00CB 978
005A 998  00CD 944  005B 971  00CC 972  005C 1005  00CF 969  005D 1010  00CE 988
005E 1006  00D0 975  005F 1015  0060 989  0061 998  0062 1014  0063 1000  0064 991
0065 990  0066 1000  0067 947  0068 377  00A2 1011

On 23/12/11 14:23, Kevin Grittner wrote:
> Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> Could we get some major OLTP users to post their CLOG for
>> analysis? I wouldn't think there would be many
>> security/proprietary issues with CLOG data.
>
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant). If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.
>
> [Ruby script snipped -- quoted in full upthread]
>
> -Kevin
Benedikt Grundmann <bgrundmann@janestreet.com> wrote:
> For what it's worth, here are the numbers on one of our biggest
> databases (same system as I posted about separately wrt
> seq_scan_cost vs random_page_cost).

That would be an 88.4% hit rate on the summarized data.

-Kevin