Thread: should we set hint bits without dirtying the page?
In a sleepy email late last night on the crash-safe visibility map thread, I proposed introducing a new buffer state BM_UNTIDY. When a page is dirtied by a hint bit update, we mark it untidy but not dirty. Untidy buffers would be treated as dirty by the background writer cleaning scan, but as clean by checkpoints and by backends doing emergency buffer cleaning to feed new allocations.

This would have the effect of rate-limiting the number of buffers that we write just for hint-bit updates. With default settings, we'd write at most bgwriter_lru_maxpages * (1000 ms / bgwriter_delay) untidy pages per second, which works out to roughly 4MB/second of write traffic. That seems like it might be enough to prevent the "bulk load followed by SELECT" access pattern from totally swamping the machine with write traffic, while still ensuring that all the hint bits eventually do get set.

I then got to wondering whether we should even go a step further, and simply decree that a page with only hint bit updates is not dirty and won't be written, period. If your working set fits in RAM, this isn't really a big deal: you'll read the pages in once, set the hint bits, and those pages will just stick around. Where it's a problem is when you have a huge table that you're scanning over and over again, especially if data in that table was loaded by many different, widely spaced XIDs that require looking at many different CLOG pages.

But maybe we could ameliorate that problem by freezing more aggressively. As soon as all tuples on the page are all-visible, VACUUM will freeze every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than actually overwriting XMIN, to preserve forensic information) and mark it all-visible in a single WAL-logged operation. Also, we could have the background writer (!) try to perform this same operation on pages evicted during the cleaning scan.
This would impose the same sort of I/O cap as the previous idea, although it would generate not only page writes but also WAL activity. The result would be not only to reduce the number of times we write the page (which, right now, can be as much as 3 * number_of_tuples, if we insert, hint-bit update, and then freeze each tuple separately), but also to make the freezing happen gradually over time rather than in a sudden spike when the XID age cut-off is reached. This would also be advantageous for index-only scans, because a large insert-only table would gradually accumulate frozen pages without ever being vacuumed.

The gradual freezing wouldn't apply in all cases - in particular, if you have a large insert-only table that you never actually read anything out of, you'd still get a spike when the XID age cut-off is reached. I'm inclined to think it would still be a big improvement over the status quo - you'd write the table twice instead of three times, and the second write would often be spread out rather than all at once.

I foresee various objections. One is that freezing will force full-page images (FPIs), so you'll still be writing the data three times. Of course, if you count FPIs, we're now writing the data four times, but under this scheme much more data would stick around long enough to get frozen, so the objection has merit. However, I think we can avoid this too, by allocating an additional bit in pd_flags, PD_FPI. Instead of emitting an FPI only when the old LSN precedes the redo pointer, we'll emit an FPI when the PD_FPI bit is set (in which case we'll also clear the bit) OR when the old LSN precedes the redo pointer. Upon emitting a WAL record that is torn-page safe (such as a freeze or all-visible record), we'll pass a flag to XLogInsert that arranges to suppress FPIs, bump the LSN, and set PD_FPI. That way, if the page is touched again before the next checkpoint by an operation that does NOT suppress FPIs, one will be emitted then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/2/10 4:00 PM, Robert Haas wrote:
> As soon as all tuples on the page are all-visible, VACUUM will freeze
> every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
> actually overwriting XMIN, to preserve forensic information) and mark
> it all-visible in a single WAL-logged operation. Also, we could have
> the background writer (!) try to perform this same operation on pages
> evicted during the cleaning scan. This would impose the same sort of
> I/O cap as the previous idea, although it would generate not only page
> writes but also WAL activity.

I would love this. It would also help considerably with the "freezing already cold data" problem ... if we were allowed to treat the frozen bit as canonical and not update any of the tuples. While never needing to touch pages at all for freezing is my preference, updating them while they're in memory anyway is a close second.

Hmm. That doesn't work, though; the page can contain tuples which are attached to rolled-back XIDs. Also, autovacuum would have no way of knowing which pages are frozen without reading them.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Thu, Dec 2, 2010 at 7:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/2/10 4:00 PM, Robert Haas wrote:
>> As soon as all tuples on the page are all-visible, VACUUM will freeze
>> every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
>> actually overwriting XMIN, to preserve forensic information) and mark
>> it all-visible in a single WAL-logged operation. Also, we could have
>> the background writer (!) try to perform this same operation on pages
>> evicted during the cleaning scan. This would impose the same sort of
>> I/O cap as the previous idea, although it would generate not only page
>> writes but also WAL activity.
>
> I would love this. It would also help considerably with the "freezing
> already cold data" problem ... if we were allowed to treat the frozen
> bit as canonical and not update any of the tuples. While never needing
> to touch pages at all for freezing is my preference, updating them while
> they're in memory anyway is a close second.
>
> Hmm. That doesn't work, though; the page can contain tuples which are
> attached to rolled-back XIDs.

Sure, well, any pages that are not all-visible will need to get vacuumed before they get marked all-visible. I can't fix that problem. But the more we freeze opportunistically before vacuum, the less painful vacuum will be when it finally kicks in. I don't anticipate this is going to be perfect; I'd be happy if we could achieve "better".

> Also, autovacuum would have no way of
> knowing which pages are frozen without reading them.

Well, reading them is still better than reading them and then writing them. But in the long term I imagine we can avoid even doing that much. If we have a crash-safe visibility map and an aggressive freezing policy that freezes all tuples on the page before marking it all-visible, then even an anti-wraparound vacuum needn't scan all-visible pages. We might not feel confident enough to rely on that right away, but I think over the long term we can hope to get there.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I then got to wondering whether we should even go a step further, and
> simply decree that a page with only hint bit updates is not dirty and
> won't be written, period.

This sort of thing has been discussed before. It seems fairly clear to me that any of these variations represents a performance tradeoff: some cases will get better and some will get worse. I think we are not going to get far unless we can agree on a set of benchmark cases that we'll use to decide whether the tradeoff is a win or not. How can we arrive at that?

regards, tom lane
On 03.12.2010 04:54, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I then got to wondering whether we should even go a step further, and
>> simply decree that a page with only hint bit updates is not dirty and
>> won't be written, period.
>
> This sort of thing has been discussed before. It seems fairly clear to
> me that any of these variations represents a performance tradeoff: some
> cases will get better and some will get worse. I think we are not going
> to get far unless we can agree on a set of benchmark cases that we'll
> use to decide whether the tradeoff is a win or not. How can we arrive
> at that?

It's pretty easy to come up with a test case where that would be a win. I'd like to see some benchmark results of the worst case, to see how much loss we're talking about at most. Robert described the worst case:

> Where it's a problem is
> when you have a huge table that you're scanning over and over again,
> especially if data in that table was loaded by many different, widely
> spaced XIDs that require looking at many different CLOG pages.

I'd like to add to that: "and the table is big enough to not fit in shared_buffers, but small enough to fit in OS cache".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Dec 2, 2010 at 7:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But maybe we could ameliorate that problem by freezing more aggressively.

I realized as I was falling asleep last night that any sort of more aggressive freezing is going to be a huge bummer for Hot Standby users, for whom freezing generates a conflict. Argh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2010-12-02 at 19:00 -0500, Robert Haas wrote:
> Untidy buffers would be treated as dirty by the background writer
> cleaning scan, but as clean by checkpoints and by backends doing
> emergency buffer cleaning to feed new allocations.

Differentiating between a backend write and a bgwriter write sounds like a good heuristic to me. Of course, only numbers can tell, but it sounds promising.

> I then got to wondering whether we should even go a step further, and
> simply decree that a page with only hint bit updates is not dirty and
> won't be written, period.

Sounds reasonable. Just to throw another idea out there, perhaps we could change the behavior based on whether the page is already dirty or not. I haven't thought this through, but it might be an interesting approach.

Regards,
Jeff Davis