Thread: Removing PD_ALL_VISIBLE
Follow up to discussion: http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php I worked out a patch that replaces PD_ALL_VISIBLE with calls to visibilitymap_test. It rips out a lot of complexity, with a net drop of about 300 lines (not a lot, but some of that code is pretty complex). The patch is quite rough, and I haven't yet added the code to keep the VM buffer pins around longer, but I still don't see any major problems. I'm fairly certain that scans will not be adversely affected, and I think that inserts/updates/deletes should be fine as well. The main worry I have with inserts/updates/deletes is that there will be less spatial locality for allocating new buffers or for modifying existing buffers. So keeping a pinned VM buffer might not help as much, because it might need to pin a different one, anyway. Adding to January commitfest, as it's past 11/15. Regards, Jeff Davis
Attachment
On Wed, 2012-11-21 at 18:25 -0800, Jeff Davis wrote: > Follow up to discussion: > http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php > > I worked out a patch that replaces PD_ALL_VISIBLE with calls to > visibilitymap_test. It rips out a lot of complexity, with a net drop of > about 300 lines (not a lot, but some of that code is pretty complex). Updated patch attached. Now it tries to keep the VM buffer pinned during scans, inserts, updates, and deletes. This should avoid increased contention pinning the VM pages, but performance tests are required. For updates, it currently only tries to hold a pin on the VM buffer for the page of the original tuple. For HOT updates, that's always the same as the new buffer anyway. For cold updates, we could also try to keep a pin on the buffer for the new tuple, but right now I don't see an obvious need for that complexity. It may plausibly be a problem when doing a bulk update on a freshly-loaded table. It occurred to me that it might be difficult to test this patch without a fairly large test case. A big assumption of my patch is that there will be locality of access (and the VM page you already have a pin on is likely to be needed the next time), which is obvious during a scan but not so obvious during I/U/D. But a single 8K VM page represents some 60K pages, or about 500MB of data. So anything less than that means that there is only one VM page, and locality is trivial... it seems like any test on a table less than 5GB would not be fair. Then again, if a 5GB table is being randomly accessed, an extra pin is unlikely to matter. Also, without locality, the contention would not be nearly as bad either. I'm still pretty unclear what the "worst case" for this patch is supposed to look like. Regards, Jeff Davis
Attachment
Jeff Davis <pgsql@j-davis.com> writes: > Now it tries to keep the VM buffer pinned during scans, inserts, > updates, and deletes. This should avoid increased contention pinning the > VM pages, but performance tests are required. > ... > Then again, if a 5GB table is being randomly accessed, an extra pin is > unlikely to matter. Also, without locality, the contention would not be > nearly as bad either. I'm still pretty unclear what the "worst case" for > this patch is supposed to look like. I'd be worried about the case of a lot of sessions touching a lot of unrelated tables. This change implies doubling the number of buffers that are held pinned by any given query, and the distributed overhead from that (eg, adding cycles to searches for free buffers) is what you ought to be afraid of. Another possibly important point is that reducing the number of pin/unpin cycles for a given VM page might actually hurt the chances of it being found in shared buffers, because IIRC the usage_count is bumped once per pin/unpin. That algorithm is based on the assumption that buffer pins are not drastically different in lifespan, but I think you just broke that for pins on VM pages. I'm not particularly concerned about devising solutions for these problems, though, because I think this idea is a loser from the get-go; the increase in contention for VM pages is alone going to destroy any possible benefit. regards, tom lane
On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote: > I'd be worried about the case of a lot of sessions touching a lot of > unrelated tables. This change implies doubling the number of buffers > that are held pinned by any given query, and the distributed overhead > from that (eg, adding cycles to searches for free buffers) is what you > ought to be afraid of. That's a good point. "Doubling" might be an exaggeration if indexes are involved, but it's still a concern. The cost of this might be difficult to measure though. > Another possibly important point is that reducing the number of > pin/unpin cycles for a given VM page might actually hurt the chances of > it being found in shared buffers, because IIRC the usage_count is bumped > once per pin/unpin. That algorithm is based on the assumption that > buffer pins are not drastically different in lifespan, but I think you > just broke that for pins on VM pages. If doing a bunch of simple key lookups using an index, then the root of the index page is only pinned once per query, but we expect that to stay in shared buffers. I see the VM page as about the same: one pin per query (or maybe a couple for large tables). I don't see how the lifetime of the pin matters a whole lot in this case; it's more about the rate of pins/unpins, right? > I'm not particularly concerned about devising solutions for these > problems, though, because I think this idea is a loser from the get-go; > the increase in contention for VM pages is alone going to destroy any > possible benefit. Your intuition here is better than mine, but I am still missing something here. If we keep the buffer pinned, then there will be very few pin/unpin cycles here, so I don't see where the contention would come from (any more than there is contention pinning the root of an index). Do you still think I need a shared lock here or something? If so, then index-only scans have a pretty big problem right now, too. 
I'll try to quantify some of these effects you've mentioned, and see how the numbers turn out. I'm worried that I'll need more than 4 cores to show anything though, so perhaps someone with a many-core box would be interested to test this out? Regards, Jeff Davis
On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Your intuition here is better than mine, but I am still missing > something here. If we keep the buffer pinned, then there will be very > few pin/unpin cycles here, so I don't see where the contention would > come from (any more than there is contention pinning the root of an > index). Based on previous measurements, I think there *is* contention pinning the root of an index. Currently, I believe it's largely overwhelmed by contention from other sources, such as the buffer manager lwlocks and the very-evil ProcArrayLock. However, I believe that as we fix those problems, this will start to percolate up towards the top of the heap. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Jeff Davis <pgsql@j-davis.com> writes: > On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote: >> Another possibly important point is that reducing the number of >> pin/unpin cycles for a given VM page might actually hurt the chances of >> it being found in shared buffers, because IIRC the usage_count is bumped >> once per pin/unpin. > If doing a bunch of simple key lookups using an index, then the root of > the index page is only pinned once per query, but we expect that to stay > in shared buffers. I see the VM page as about the same: one pin per > query (or maybe a couple for large tables). Hmmm ... that seems like a valid analogy. I may be worried about nothing as far as this point goes. > Do you still think I need a shared lock here or something? If so, then > index-only scans have a pretty big problem right now, too. There's still the issue of whether the IOS code is safe in machines with weak memory ordering. I think that it probably is safe, but the argument for it in the current code comment is wrong; most likely, a correct argument has to depend on read/write barriers associated with taking snapshots. I think what you ought to do is work through that, fix the existing comment, and then see whether the same argument works for what you want to do. regards, tom lane
On Mon, 2012-11-26 at 16:10 -0500, Tom Lane wrote: > There's still the issue of whether the IOS code is safe in machines with > weak memory ordering. I think that it probably is safe, but the > argument for it in the current code comment is wrong; most likely, a > correct argument has to depend on read/write barriers associated with > taking snapshots. I think what you ought to do is work through that, > fix the existing comment, and then see whether the same argument works > for what you want to do. As a part of the patch, I did change the comment, and here's what I came up with:

 * Note on Memory Ordering Effects: visibilitymap_test does not lock
 * the visibility map buffer, and therefore the result we read here
 * could be slightly stale. However, it can't be stale enough to
 * matter.
 *
 * We need to detect clearing a VM bit due to an insert right away,
 * because the tuple is present in the index page but not visible. The
 * reading of the TID by this scan (using a shared lock on the index
 * buffer) is serialized with the insert of the TID into the index
 * (using an exclusive lock on the index buffer). Because the VM bit is
 * cleared before updating the index, and locking/unlocking of the
 * index page acts as a full memory barrier, we are sure to see the
 * cleared bit if we see a recently-inserted TID.
 *
 * Deletes do not update the index page (only VACUUM will clear out the
 * TID), so the clearing of the VM bit by a delete is not serialized
 * with this test below, and we may see a value that is significantly
 * stale. However, we don't care about the delete right away, because
 * the tuple is still visible until the deleting transaction commits or
 * the statement ends (if it's our transaction). In either case, the
 * lock on the VM buffer will have been released (acting as a write
 * barrier) after clearing the bit.
 * And for us to have a snapshot that includes the deleting transaction
 * (making the tuple invisible), we must have acquired ProcArrayLock
 * after that time, acting as a read barrier.
 *
 * It's worth going through this complexity to avoid needing to lock
 * the VM buffer, which could cause significant contention.

And I updated the comment in visibilitymap.c as well (reformatted for this email): "To test a bit in the visibility map, most callers should have a pin on the VM buffer, and at least a shared lock on the data buffer. Any process that clears the VM bit must have an exclusive lock on the data buffer, so that will serialize access to the appropriate bit. Because lock acquisition and release are full memory barriers, then there is no danger of seeing the state of the bit before it was last cleared. Callers that don't have the data buffer yet, such as an index only scan or a VACUUM that is skipping pages, must handle the concurrency as appropriate." Regards, Jeff Davis
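For readers following the memory-ordering argument in the comment above: the rule it relies on is that releasing a lock is a release (write) barrier and acquiring one is an acquire (read) barrier, so any store made before the release is visible after a later acquire. Here is a C11 toy model of that pairing; it is an illustration of the memory-model rule only, not PostgreSQL's buffer or spinlock code, and all names are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Invented toy model: a "VM bit" protected by a spinlock-style flag. */
static _Atomic bool locked = false;
static bool vm_bit = true;          /* protected by "locked" */

/* Clear the bit, then unlock. The release store guarantees the cleared
 * bit is visible to anyone who later acquires the lock. */
void clear_bit_then_unlock(void)
{
    vm_bit = false;
    atomic_store_explicit(&locked, false, memory_order_release);
}

/* Lock, read the bit, unlock. The acquire on lock pairs with the
 * release above, so a stale "true" cannot be observed here. */
bool lock_then_read_bit(void)
{
    while (atomic_exchange_explicit(&locked, true, memory_order_acquire))
        ;                            /* spin */
    bool b = vm_bit;
    atomic_store_explicit(&locked, false, memory_order_release);
    return b;
}
```

The delete case in the comment is exactly this shape: the deleter clears the bit and then releases the VM buffer lock, so any reader that later acquires a lock on the same buffer must see the cleared bit.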
On Mon, Nov 26, 2012 at 3:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> Your intuition here is better than mine, but I am still missing >> something here. If we keep the buffer pinned, then there will be very >> few pin/unpin cycles here, so I don't see where the contention would >> come from (any more than there is contention pinning the root of an >> index). > > Based on previous measurements, I think there *is* contention pinning > the root of an index. Currently, I believe it's largely overwhelmed > by contention from other sources, such as the buffer manager lwlocks > and the very-evil ProcArrayLock. However, I believe that as we fix > those problems, this will start to percolate up towards the top of the > heap. Yup -- it (buffer pin contention on high traffic buffers) has been caught in the wild -- just maintaining the pin count was enough to do it in at least one documented case. Pathological workloads demonstrate contention today, and there's no good reason to assume it's limited to index root nodes -- I'm strongly suspicious that buffer spinlock issues are behind some other malfeasance we've seen recently. merlin
On Mon, 2012-11-26 at 16:55 -0600, Merlin Moncure wrote: > > Based on previous measurements, I think there *is* contention pinning > > the root of an index. Currently, I believe it's largely overwhelmed > > by contention from other sources, such as the buffer manager lwlocks > > and the very-evil ProcArrayLock. However, I believe that as we fix > > those problems, this will start to percolate up towards the top of the > > heap. > > Yup -- it (buffer pin contention on high traffic buffers) has been caught > in the wild -- just maintaining the pin count was enough to do it in > at least one documented case. Pathological workloads demonstrate > contention today, and there's no good reason to assume it's limited to > index root nodes -- I'm strongly suspicious that buffer spinlock issues are > behind some other malfeasance we've seen recently. I tried for quite a while to show any kind of performance difference between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24 if you count HT).

Three patches in question:
1. Current unpatched master
2. patch that naively always checks the VM page, pinning and unpinning each time
3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in this patch though -- new version forthcoming)

I tested from 1 to 64 concurrent connections. One test was just concurrent scans of a ~400MB table. The other test was a DELETE FROM foo WHERE ctid BETWEEN '(N,0)' AND '(N,256)' where N is the worker number in the test program. The table here is only about 2 MB. The idea is that the scan will happen quickly, leading to many quick deletes, but the deletes won't actually touch the same pages (aside from the VM page). So, this was designed to be uncontended except for pinning the VM page. On the scan test, it was really hard to see any difference in the test noise, but I may have seen about a 3-4% degradation between patch #1 and patch #2 at higher concurrencies.
It was difficult for me to reproduce this result -- it usually wouldn't show up. I didn't see any difference between patch #1 and patch #3. On the delete test I detected no difference between #1 and #2 at all. I think someone with access to a larger box may need to test this. Or, if someone has a more specific suggestion about how I can create a worst case, then let me know. Regards, Jeff Davis
On Fri, 2012-11-30 at 13:16 -0800, Jeff Davis wrote: > I tried for quite a while to show any kind of performance difference > between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24 > if you count HT). > > Three patches in question: > 1. Current unpatched master > 2. patch that naively always checks the VM page, pinning and unpinning > each time > 3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in > this patch though -- new version forthcoming) New patch attached. Nathan Boley kindly lent me access to a 64-core box, and that shows a much more interesting result. The previous test (on the 12-core) basically showed no difference between any of the patches. Now, I see why on the 64-core box: the interesting region seems to be around 32 concurrent connections. The left column is the concurrency, and the right is the runtime. This test was for concurrent scans of a 350MB table (each process did 4 scans and quit). Test program attached.

Patch 1 (scan test):
001 004.299533
002 004.434378
004 004.708533
008 004.518470
012 004.487033
016 004.513915
024 004.765459
032 006.425780
048 007.089146
064 007.908850
072 009.461419
096 013.098646
108 015.278592
128 019.797206

Patch 2 (scan test):
001 004.385206
002 004.596340
004 004.616684
008 004.832248
012 004.858336
016 004.689959
024 005.016797
032 006.857642
048 012.049407
064 025.774772
072 032.680710
096 059.147500
108 083.654806
128 120.350200

Patch 3 (scan test):
001 004.464991
002 004.555595
004 004.562364
008 004.649633
012 004.628159
016 004.518748
024 004.768348
032 004.834177
048 007.003305
064 008.242714
072 009.732261
096 013.231056
108 014.996977
128 020.488570

As you can see, patch #2 starts to show a difference at around 32 and completely falls over by 48 connections. This is expected because it's the naive approach that pins the VM page every time it needs it.
Patch #1 and #3 are effectively the same, subsequent runs (and with more measurements around concurrency 32) show that the differences are just noise (which seems to be greater around the inflection point of 32). All of the numbers that seem to show any difference can end up with patch #1 better or patch #3 better, depending on the run. I tried the delete test, too, but I still couldn't see any difference. (I neglected to mention in my last email: I aborted after each delete so that it would be repeatable). The inflection point there is significantly lower, so I assume it must be contending over something else. I tried making the table unlogged to see if that would change things, but it didn't change much. This test only scales linearly to about 8 or so. Or, there could be something wrong with my test. So, I conclude that contention is certainly a problem for scans for patch #2, but patch #3 seems to fix that completely by holding the buffer pins. The deletes are somewhat inconclusive, but I just can't see a difference. Holding more pins does have a distributed cost in theory, as Tom points out, but I don't know where to begin testing that. We'll have to make a decision between (a) maintaining the extra complexity and doing the extra page writes involved with PD_ALL_VISIBLE; or (b) holding onto one extra pin per table being scanned. Right now, if PD_ALL_VISIBLE did not exist, it would be pretty hard to justify putting it in as far as I can tell. Regards, Jeff Davis
Attachment
Rebased patch attached. No significant changes. Regards, Jeff Davis
Attachment
On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote: > Rebased patch attached. No significant changes. Jeff, can you summarise/collate why we're doing this, what concerns it raises and how you've dealt with them? That will help decide whether to commit. Thanks -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote: > > > Rebased patch attached. No significant changes. > > Jeff, can you summarise/collate why we're doing this, what concerns it > raises and how you've dealt with them? That will help decide whether > to commit. > +1. On another thread, "Set visibility map bit after HOT prune", Robert mentioned that it's not such a good idea to remove the PD_ALL_VISIBLE flag because it helps us to reduce the contention on the VM page, especially when we need to reset the VM bit. Here is an excerpt from Robert's comment on that thread: "Sure, but you're zipping rather blithely past the disadvantages of such an approach. Jeff Davis recently proposed getting rid of PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism about that; this proposal has the same problems. One of the major benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts, updates, and deletes to the page can ignore the visibility map. That means that a server under heavy concurrency is much less likely to encounter contention on the visibility map blocks. Now, maybe that's not really a problem, but I sure haven't seen enough evidence to make me believe it. If it's really true that PD_ALL_VISIBLE needn't fill this role, then Heikki wasted an awful lot of time implementing it, and I wasted an awful lot of time keeping it working when I made the visibility map crash-safe for IOS. That could be true, but I tend to think it isn't." May be you've already addressed that concern with the proven performance numbers, but I'm not sure. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
At 2013-01-17 08:41:37 +0000, simon@2ndQuadrant.com wrote: > > Jeff, can you summarise/collate why we're doing this, what concerns it > raises and how you've dealt with them? Since I was just looking at the original patch and discussion, and since Pavan has posted an excerpt from one objection to it, here's an excerpt from Jeff's original post titled "Do we need so many hint bits?" http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025

Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced in the visibility map patch, apparently as a way to know when to clear the VM bit when doing an update. It was then also used for scans, which showed a significant speedup. But I wonder: why not just use the visibility map directly from those places? It can be used for the scan because it is crash safe now (not possible before). And since it's only one lookup per scanned page, then I don't think it would be a measurable performance loss there. Inserts/updates/deletes also do a significant amount of work, so again, I doubt it's a big drop in performance there -- maybe under a lot of concurrency or something. The benefit of removing PD_ALL_VISIBLE would be significantly higher. It's quite common to load a lot of data, and then do some reads for a while (setting hint bits and flushing them to disk), and then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the pages again. Also, if I remember correctly, Robert went to significant effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed before?

There was considerable discussion after this (accessible through the archives link above), which I won't attempt to summarise. -- Abhijit
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > There was considerable discussion after this (accessible through the > archives link above), which I won't attempt to summarise. I thought Robert made those comments after considerable discussion of Jeff's approach. So he probably still stands by his objections, or at least hasn't been satisfied by (or seen) the numbers.

Now that I look at the patch, I wonder if there is another fundamental issue with it. Since the patch removes WAL logging for the VM set operation, this can happen:

1. Vacuum kicks in, clears all dead tuples in a page, and decides that it's all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to disk. The heap page is not yet written, nor is the WAL record corresponding to the cleanup operation
5. CRASH

After recovery, the VM bit will remain set because the VM page got written before the crash. But since the heap page's cleanup WAL did not make it to disk, those operations won't be replayed. The heap page will be left with not-all-visible tuples in that case, and that's not a good state to be in.

The original code does not have this problem because the VM set WAL record gets written after the heap page cleanup WAL record. So it's guaranteed that the VM bit will be set during recovery only if the cleanup WAL is replayed too (there is more magic than what meets the eye, and I think it's not fully documented).

Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > May be you've already addressed that concern with the proven > performance numbers, but I'm not sure. It would be nice to hear what Heikki's reasons were for adding PD_ALL_VISIBLE in the first place. Jeff's approach of holding the VM pins for longer certainly mitigates some of the damage, in the sense that it reduces buffer lookups and pin/unpin cycles - and might be worth doing independently of the rest of the patch if we think it's a win. Index-only scans already use a similar optimization, so extending it to inserts, updates, and deletes is surely worth considering. The main question in my mind is whether there are any negative consequences to holding a VM buffer pin for that long without interruption. The usual consideration - namely, blocking vacuum - doesn't apply here because vacuum does not take a cleanup lock on the visibility map page, only the heap page, but I'm not sure if there are others. The other part of the issue is cache pressure. I don't think I can say it better than what Tom already wrote: # I'd be worried about the case of a lot of sessions touching a lot of # unrelated tables. This change implies doubling the number of buffers # that are held pinned by any given query, and the distributed overhead # from that (eg, adding cycles to searches for free buffers) is what you # ought to be afraid of. I agree that we ought to be afraid of that. A pgbench test isn't going to find a problem in this area because there you have a bunch of sessions all accessing the same small group of tables. To find a problem of the type above, you'd need lots of backends accessing lots of different, small tables. That's not the use case we usually benchmark, but I think there are a fair number of people doing things like that - for example, imagine a hosting provider or web application with many databases or schemas on a single instance. AFAICS, Jeff hasn't tried to test this scenario. 
Now, on the flip side, we should also be thinking about what we would gain from this patch, and what other ways there might be to achieve the same goals. As far as I can see, the main gain is that if you bulk-load a table, don't vacuum it right away, get all the hint bits set by some other mechanism, and then vacuum the table, you'll only read the whole table instead of rewriting the whole table. So ISTM that, for example, if we adopted the idea of making HOT prune set visibility-map bits, most of the benefit of this change evaporates, but whatever costs it may have will remain. There are other possible ways of getting the same benefit as well - for example, I've been thinking for a while now that we should try to find some judicious way of vacuuming insert-only tables, perhaps only in small chunks when there's nothing else going on. That wouldn't as clearly obsolete this patch, but it would address a very similar use case and would also preset hint bits, which would help a lot of people. Some of the ideas we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if we can do an early freeze without losing forensic information, we can roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the page into a single write. All of which is to say that I think this patch is premature. If we adopt something like this, we're likely never going to revert back to the way we do it now, and whatever cache-pressure or other costs this approach carries will be here to stay - so we had better think awfully carefully before we do that. And, FWIW, I don't believe that there is sufficient time in this release cycle to carefully test this patch and the other alternative designs that aim toward the same ends. Even if there were, this is exactly the sort of thing that should be committed at the beginning of a release cycle, not the end, so as to allow adequate time for discovery of unforeseen consequences before the code ends up out in the wild.
Of course, there's another issue here too, which is that as Pavan points out, the patch throws crash-safety out the window, which breaks index-only scans (if you have a crash). HEAP_XLOG_VISIBLE is intended principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch rips it out even though its purpose is to remove the latter, not the former. Removing PD_ALL_VISIBLE eliminates the need to keep the visibility-map bit consistent with PD_ALL_VISIBLE, but it does not eliminate the need to keep the visibility-map bit consistent with the page contents. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 17.01.2013 16:53, Robert Haas wrote: > On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee > <pavan.deolasee@gmail.com> wrote: >> May be you've already addressed that concern with the proven >> performance numbers, but I'm not sure. > > It would be nice to hear what Heikki's reasons were for adding > PD_ALL_VISIBLE in the first place. The idea was to avoid clearing the bit in the VM page on every update, when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set. I assumed the traffic and contention on the VM page would be a killer otherwise. I don't remember if I ever actually tested that though. Maybe I was worrying about nothing and hitting the VM page on every update is ok. - Heikki
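Heikki's rationale can be modeled concretely: with a page-local all-visible flag, every update after the first one skips the shared VM page entirely; without it, every update would have to pin the VM page just to discover the bit is already clear. This is an invented toy model (names and structure are not PostgreSQL's), sketching only the counting argument.

```c
#include <stdbool.h>

/* Invented toy model of the PD_ALL_VISIBLE optimization. */
typedef struct
{
    bool pd_all_visible;    /* page-local copy of "is the VM bit set?" */
} Page;

static int vm_page_touches = 0;     /* accesses to the shared VM page */

static void vm_clear_bit(void)
{
    vm_page_touches++;              /* stand-in for pin + clear + unpin */
}

/* An update only touches the shared VM when the local flag says it must,
 * so repeated updates to the same page pay the VM cost once. */
void on_update(Page *page)
{
    if (page->pd_all_visible)
    {
        page->pd_all_visible = false;
        vm_clear_bit();
    }
}

int vm_touches(void) { return vm_page_touches; }
```

Jeff's alternative in the patch, keeping the VM buffer pinned across operations, aims to make the skipped VM touches cheap rather than avoiding them.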
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote: > Now that I look at the patch, I wonder if there is another fundamental > issue with the patch. Since the patch removes WAL logging for the VM > set operation, this can happen: > Thank you. I think I was confused by this comment here: "When we *set* a visibility map during VACUUM, we must write WAL. This may seem counterintuitive, since the bit is basically a hint: if it is clear, it may still be the case that every tuple on the page is visible to all transactions; we just don't know that for certain. The difficulty is that there are two bits which are typically set together: the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit. If a crash occurs after the visibility map page makes it to disk and before the updated heap page makes it to disk, redo must set the bit on the heap page. Otherwise, the next insert, update, or delete on the heap page will fail to realize that the visibility map bit must be cleared, possibly causing index-only scans to return wrong answers." Which led me to believe that I could just rip out the WAL-related code if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness doesn't have to do with the WAL directly, it's because the VM page's LSN is not bumped past the LSN of the related heap page cleanup, so it can be written too early. Of course, the way to bump the LSN is to write WAL for the visibilitymap_set operation. But that would be a very simple WAL routine, rather than the complex one that exists without the patch. I suppose we could even try to bump the LSN without writing WAL somehow, but it doesn't seem worth reasoning through that (setting a VM bit is rare enough). Regards, Jeff Davis
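The LSN interlock Jeff describes is the standard WAL-before-data rule: a dirty page may be written out only after WAL up to that page's LSN has been flushed. A minimal invented model of that rule (not PostgreSQL code; names are made up):

```c
#include <stdbool.h>

typedef unsigned long Lsn;          /* toy log sequence number */

static Lsn flushed_wal = 0;         /* how far WAL has reached disk */

/* WAL-before-data: a page may be written out only once the WAL record
 * recorded in its LSN has been flushed. */
bool page_may_be_written(Lsn page_lsn)
{
    return page_lsn <= flushed_wal;
}

void flush_wal_through(Lsn lsn)
{
    if (lsn > flushed_wal)
        flushed_wal = lsn;
}
```

If visibilitymap_set bumps the VM page's LSN to (at least) the cleanup record's LSN, the VM page cannot reach disk before the cleanup WAL does; if the LSN is left behind, the VM page can be written first, which is exactly the crash scenario Pavan described.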
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote: > On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote: > > Now that I look at the patch, I wonder if there is another fundamental > > issue with the patch. Since the patch removes WAL logging for the VM > > set operation, this can happen: > > > Thank you. New patch attached with simple WAL logging. Regards, Jeff Davis
Attachment
On 17 January 2013 15:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 17.01.2013 16:53, Robert Haas wrote: >> >> On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee >> <pavan.deolasee@gmail.com> wrote: >>> >>> May be you've already addressed that concern with the proven >>> performance numbers, but I'm not sure. >> >> >> It would be nice to hear what Heikki's reasons were for adding >> PD_ALL_VISIBLE in the first place. > > > The idea was to avoid clearing the bit in the VM page on every update, when > the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set. > I assumed the traffic and contention on the VM page would be a killer > otherwise. I don't remember if I ever actually tested that though. Maybe I > was worrying about nothing and hitting the VM page on every update is ok. Presumably we remember the state of the VM so we can skip the re-visit after every write? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote: > I don't remember if I ever actually tested that > though. Maybe I was worrying about nothing and hitting the VM page on > every update is ok. I tried, but was unable to show really anything at all, even without keeping the VM page pinned. I think the bottleneck is elsewhere; although I am keeping the page pinned in this patch to prevent it from becoming a bottleneck. Regards, Jeff Davis
On Thu, 2013-01-17 at 19:58 +0000, Simon Riggs wrote: > Presumably we remember the state of the VM so we can skip the re-visit > after every write? That was not a part of my patch, although I remember that you mentioned that previously and I thought it could be a good way to mitigate a problem if it ever came up. However, the tests I did didn't show any problem there. The tests were somewhat noisy, so perhaps I was doing something wrong, but it didn't appear that looking at the VM page for every update was a bottleneck. Regards, Jeff Davis
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote: > The main question in my mind is whether > there are any negative consequences to holding a VM buffer pin for > that long without interruption. The usual consideration - namely, > blocking vacuum - doesn't apply here because vacuum does not take a > cleanup lock on the visibility map page, only the heap page, but I'm > not sure if there are others. If the "without interruption" part becomes a practical problem, it seems fairly easy to fix: drop the pin and pick it up again once every K pages. Unless I'm missing something, this is a minor concern. > The other part of the issue is cache pressure. I don't think I can say > it better than what Tom already wrote: > > # I'd be worried about the case of a lot of sessions touching a lot of > # unrelated tables. This change implies doubling the number of buffers > # that are held pinned by any given query, and the distributed overhead > # from that (eg, adding cycles to searches for free buffers) is what you > # ought to be afraid of. > > I agree that we ought to be afraid of that. It's a legitimate concern, but I think being "afraid" goes too far (more below). > A pgbench test isn't > going to find a problem in this area because there you have a bunch of > sessions all accessing the same small group of tables. To find a > problem of the type above, you'd need lots of backends accessing lots > of different, small tables. That's not the use case we usually > benchmark, but I think there are a fair number of people doing things > like that - for example, imagine a hosting provider or web application > with many databases or schemas on a single instance. AFAICS, Jeff > hasn't tried to test this scenario. The concern here is over a lot of different, small tables (e.g. multi-tenancy or something similar) as you say. If we're talking about nearly empty tables, that's easy to fix: just don't use the VM on tables with fewer than N pages.
You could say that "small" tables are really 1-10MB each, and you could have a zillion of those. I will try to create a worst-case here and see what numbers come out. Perhaps the extra time to look for a free buffer does add up.

Test plan:
1. Take current patch (without "skip VM check for small tables" optimization mentioned above).
2. Create 500 tables each about 1MB.
3. VACUUM them all.
4. Start 500 connections (one for each table).
5. Time the running of a loop that executes a COUNT(*) on that connection's table 100 times.

I think shared_buffers=64MB is probably appropriate. We want some memory pressure so that it has to find and evict pages to satisfy the queries. But we don't want it to be totally exhausted and unable to even pin a new page; that really doesn't tell us much except that max_connections is too high. Sound reasonable? > Now, on the flip side, we should also be thinking about what we would > gain from this patch, and what other ways there might be to achieve > the same goals. One thing to keep in mind is that the current code to maintain a crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play by the normal rules. If you want to talk about distributed costs, that has some very real distributed costs in terms of development effort. For instance, my checksums patch took me extra time to write (despite the patch being the simplest checksums design on the table) and will take others extra time to review. So, if the only things keeping that code in place are theoretical fears, let's take them one by one and see if they are real problems or not. > All of which is to say that I think this patch is premature. If we > adopt something like this, we're likely never going to revert back to > the way we do it now, and whatever cache-pressure or other costs this > approach carries will be here to stay - so we had better think awfully > carefully before we do that. What about this patch makes it hard to undo/rework in the future?
> Even if > there were, this is exactly the sort of thing that should be committed > at the beginning of a release cycle, not the end, so as to allow > adequate time for discovery of unforeseen consequences before the code > ends up out in the wild. I'm concerned that such a comment at this stage will cut review short, which could also prevent it from being committed early in 9.4. > Of course, there's another issue here too, which is that as Pavan > points out, the patch throws crash-safety out the window My mistake. I believe that is already fixed, and certainly not a fundamental issue. Regards, Jeff Davis
On Thu, 2013-01-17 at 14:53 -0800, Jeff Davis wrote: > Test plan: > > 1. Take current patch (without "skip VM check for small tables" > optimization mentioned above). > 2. Create 500 tables each about 1MB. > 3. VACUUM them all. > 4. Start 500 connections (one for each table) > 5. Time the running of a loop that executes a COUNT(*) on that > connection's table 100 times. Done, with a few extra variables. Again, thanks to Nathan Boley for lending me the 64-core box. Test program attached. I did both 1MB tables and 1-tuple tables, but I ended up throwing out the 1-tuple table results. First of all, as I said, that's a pretty easy problem to solve, so not really what I want to test. Second, I had to do so many iterations that I don't think I was testing anything useful. I did see what might have been a couple differences, but I would need to explore in more detail and I don't think it's worth it, so I'm just reporting on the 1MB tables. For each test, each of 500 connections runs 10 iterations of a COUNT(*) on its own 1MB table (which is vacuumed and has the VM bit set). The query is prepared once. The table has only an int column. The variable is shared_buffers, going from 32MB (near exhaustion for 500 connections) to 2048MB (everything fits). The last column is the time range in seconds. I included the range this time, because there was more variance in the runs but I still think they are good test results.

master:
  32MB:   16.4 - 18.9
  64MB:   16.9 - 17.3
  128MB:  17.5 - 17.9
  256MB:  14.7 - 15.8
  384MB:  8.1 - 9.3
  448MB:  4.3 - 9.2
  512MB:  1.7 - 2.2
  576MB:  0.6 - 0.6
  1024MB: 0.6 - 0.6
  2048MB: 0.6 - 0.6

patch:
  32MB:   16.8 - 17.6
  64MB:   17.1 - 17.5
  128MB:  17.2 - 18.0
  256MB:  14.8 - 16.2
  384MB:  8.0 - 10.1
  448MB:  4.6 - 7.2
  512MB:  2.0 - 2.6
  576MB:  0.6 - 0.6
  1024MB: 0.6 - 0.6
  2048MB: 0.6 - 0.6

Conclusion: I see about what I expect: a precipitous drop in runtime after everything fits in shared_buffers (500 1MB tables means the inflection point around 512MB makes a lot of sense).
There does seem to be a measurable difference right around that inflection point, but it's not much. Considering that this is the worst case that I could devise, I am not too concerned about this. However, it is interesting to see that there really is a lot of maintenance work being done when we need to move pages in and out of shared buffers. I'm not sure that it's related to the freelists though. For the extra pins to really be a problem, I think a much higher percentage of the buffers would need to be pinned. Since the case we are worried about involves scans (if it involved indexes, that would already be using more than one pin per scan), then that means the only way to get to a high percentage of pinned buffers is by having very small tables. But we don't really need to use the VM when scanning very small tables (the overhead would be elsewhere), so I think we're OK. So, I attached a new version of the patch that doesn't look at the VM for tables with fewer than 32 pages. That's the only change. Regards, Jeff Davis
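The round numbers used in the thread, and the 32-page cutoff above, can be checked with a few lines of arithmetic. A sketch under the assumption of an 8KB block size and one all-visible bit per heap page, ignoring the VM page header (as the round figures in the thread do); `use_vm` is a restatement of the rule described above, not the patch's actual code:

```python
BLCKSZ = 8192                         # assumed 8KB block size
HEAP_PAGES_PER_VM_PAGE = BLCKSZ * 8   # one bit per heap page, header ignored

# One VM page covers ~64K heap pages, i.e. 512MB of heap -- matching the
# "about 500MB of data" figure from earlier in the thread.
assert HEAP_PAGES_PER_VM_PAGE == 65536
assert HEAP_PAGES_PER_VM_PAGE * BLCKSZ == 512 * 1024 * 1024

VM_THRESHOLD = 32  # pages; below this, the patch skips the VM entirely

def use_vm(rel_pages, threshold=VM_THRESHOLD):
    """Restated rule: very small tables never consult the visibility map."""
    return rel_pages >= threshold

assert not use_vm(31)  # a 31-page table never touches the VM
assert use_vm(33)      # a 33-page table does
```

This is why the later 31-page vs. 33-page benchmark brackets the threshold: the two table sizes fall on opposite sides of the cutoff.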
Attachment
On Thu, Jan 17, 2013 at 5:50 PM, Jeff Davis <pgsql@j-davis.com> wrote: > If the "without interruption" part becomes a practical problem, it seems > fairly easy to fix: drop the pin and pick it up again once every K > pages. Unless I'm missing something, this is a minor concern. I think probably so. > Test plan: > > 1. Take current patch (without "skip VM check for small tables" > optimization mentioned above). > 2. Create 500 tables each about 1MB. > 3. VACUUM them all. > 4. Start 500 connections (one for each table) > 5. Time the running of a loop that executes a COUNT(*) on that > connection's table 100 times. > > I think shared_buffers=64MB is probably appropriate. We want some memory > pressure so that it has to find and evict pages to satisfy the queries. > But we don't want it to be totally exhausted and unable to even pin a > new page; that really doesn't tell us much except that max_connections > is too high. > > Sound reasonable? Well, it's certainly a data point, but each table in that case has 128 pages, so the effect is still pretty small. The place where you're going to run into trouble is when many of those tables have 4 pages each, or 2 pages each, or 1 page each. >> All of which is to say that I think this patch is premature. If we >> adopt something like this, we're likely never going to revert back to >> the way we do it now, and whatever cache-pressure or other costs this >> approach carries will be here to stay - so we had better think awfully >> carefully before we do that. > > What about this patch makes it hard to undo/rework in the future? Well, if you have a bunch of code, and it preserves property X at all times, it is pretty easy to decide that new code need no longer preserve property X. It is essentially a no-op. The reverse, going through a bunch of existing code that does not consistently preserve property X and making it do so, is always much harder, because you've got to audit all the code you've already got.
I don't want to undo the work that's been done on this over the last four years without a really good reason, and I'm not seeing one. > My mistake. I believe that is already fixed, and certainly not a > fundamental issue. It at least calls for a repetition of any performance testing that has already been done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote: > So, I attached a new version of the patch that doesn't look at the VM > for tables with fewer than 32 pages. That's the only change. That certainly seems worthwhile, but I still don't want to get rid of this code. I'm just not seeing a reason why that's something that desperately needs to be done. I don't think this is a barrier to anything else we want to do, and it might well be that the situations where this patch would hurt us are currently masked by other bottlenecks, but won't be always. Right now, the vast majority of heap updates don't need to pin the visibility map page; with this change, all of them do. Now, I accept that your test results show that that doesn't matter, but how can that not be an artifact of some kind? Can we really credit that accessing two pages costs no more than accessing one? To what countervailing factor could we plausibly attribute that? Now, even if it costs more in some narrow sense, the difference might not be enough to matter. But without some big gain on the other side, why tinker with it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote: >> So, I attached a new version of the patch that doesn't look at the VM >> for tables with fewer than 32 pages. That's the only change. > That certainly seems worthwhile, but I still don't want to get rid of > this code. I'm just not seeing a reason why that's something that > desperately needs to be done. Yeah, I'm having the same problem. Despite Jeff's test results, I can't help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some workloads, and it's not obvious to me what benefit we get from dropping it. regards, tom lane
On Mon, Jan 21, 2013 at 8:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote: >>> So, I attached a new version of the patch that doesn't look at the VM >>> for tables with fewer than 32 pages. That's the only change. > >> That certainly seems worthwhile, but I still don't want to get rid of >> this code. I'm just not seeing a reason why that's something that >> desperately needs to be done. > > Yeah, I'm having the same problem. Despite Jeff's test results, I can't > help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some > workloads, and it's not obvious to me what benefit we get from dropping > it. I tend to agree. When I looked at the patch, I thought since it's removing a WAL record (and associated redo logic), it has some additional value. But that was kind of broken (sorry, I haven't looked at the latest patch to see if Jeff fixed it without requiring to reinstate the WAL logging). I also thought that bug invalidates the performance numbers reported. Of course, there is an argument that this patch will simplify the code, but I'm not sure if it's enough to justify the additional contention which may or may not show up in the benchmarks we are running, but we know it's there. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Sun, 2013-01-20 at 22:19 -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Jan 18, 2013 at 3:31 AM, Jeff Davis <pgsql@j-davis.com> wrote: > >> So, I attached a new version of the patch that doesn't look at the VM > >> for tables with fewer than 32 pages. That's the only change. > > > That certainly seems worthwhile, but I still don't want to get rid of > > this code. I'm just not seeing a reason why that's something that > > desperately needs to be done. > > Yeah, I'm having the same problem. Despite Jeff's test results, I can't > help thinking that lack of PD_ALL_VISIBLE *will* hurt us under some > workloads, and it's not obvious to me what benefit we get from dropping > it. OK. I respectfully disagree with the arguments I've seen so far, but we can all move on. I already had some more test results (again, thanks to Nathan Boley), so I finished them up and attached them to the bottom of this email for the archives (there's always hope, right?). Regards, Jeff Davis

Test goal: is 32 an appropriate threshold for using the VM after we remove PD_ALL_VISIBLE?

Test setup: 500 31-page tables and 500 33-page tables. Test recent build against patched version, with varying shared_buffers. The first test is 500 connections each doing 20 iterations of COUNT(*) against the 500 31-page tables. The next test is the same, except against the 33-page tables.

Test results: (There were a few outliers I discarded as being too fast. It always happened in the first run, and I strongly suspect the connections began unevenly, leading to lower concurrency. They didn't seem to favor one build over another.)
master, 31-page (first column is shared_buffers, second is range of seconds):
  32MB:  5.8 - 6.1
  64MB:  3.1 - 3.7
  96MB:  1.7 - 2.2
  112MB: 0.6 - 1.1
  128MB: 0.4 - 0.4
  256MB: 0.4 - 0.4

patch, 31-page (doesn't use VM because 31 is below threshold):
  32MB:  5.1 - 5.9
  64MB:  3.4 - 3.8
  96MB:  1.7 - 2.0
  112MB: 0.7 - 1.1
  128MB: 0.4 - 0.5
  256MB: 0.4 - 0.5

master, 33-page:
  32MB:  5.9 - 7.0
  64MB:  4.2 - 4.7
  96MB:  2.4 - 2.8
  112MB: 1.2 - 1.6
  128MB: 0.5 - 0.5
  256MB: 0.4 - 0.5

patch, 33-page (does use VM because 33 is above threshold):
  32MB:  6.2 - 7.2
  64MB:  4.4 - 4.7
  96MB:  2.8 - 3.0
  112MB: 1.7 - 1.8
  128MB: 0.5 - 1.0
  256MB: 0.4 - 0.5

Conclusion: 32 pages is a low enough threshold that skipping the VM is insignificant. When the 500 tables are 33 pages, and it does use the VM, we do see a measurable cost under cache pressure. The penalty is fairly small (10% ballpark), and this is a worst-case, so I don't think it's a problem. From the last test results, we know it gets back to even by the time the tables are 128 pages (1MB). So it could be that there is a slightly higher threshold (between 32 and 128) that would be slightly better. But given how specific this case is and how small the penalty is, I think 32 is a fine threshold. Also, to reiterate, I focused my tests almost entirely on scans, though some of the concern was around inserts/updates/deletes. I tried, but was unable to show any difference on those tests. I suspect that the bottleneck is elsewhere.
On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote: > I tend to agree. When I looked at the patch, I thought since its > removing a WAL record (and associated redo logic), it has some > additional value. But that was kind of broken (sorry, I haven't looked > at the latest patch if Jeff fixed it without requiring to reinstate > the WAL logging). I also thought that bug invalidates the performance > numbers reported. I revalidated the performance numbers after reinstating the WAL logging. No difference (which is unsurprising, since my tests were read-only). > Of course, there is an argument that this patch will > simplify the code, but I'm not sure if its enough to justify the > additional contention which may or may not show up in the benchmarks > we are running, but we know its there. What additional contention? How do you know it's there? The design is to keep the VM page pinned, and to read it without requiring locks (like index-only scans already do). So I don't see any obvious additional contention there, unless you're talking about the work the CPU does for cache coherency (which did not show up in any of my tests). I understand that my patch is probably not going in, but I would like to be clear about what is a known practical problem, what is a theoretical problem that has eluded my testing capabilities, and what is no problem at all. Regards, Jeff Davis
On Mon, Jan 21, 2013 at 12:22 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2013-01-21 at 11:27 +0530, Pavan Deolasee wrote: >> Of course, there is an argument that this patch will >> simplify the code, but I'm not sure if it's enough to justify the >> additional contention which may or may not show up in the benchmarks >> we are running, but we know it's there. > > What additional contention? How do you know it's there? At the minimum your patch will need to have one additional buffer pinned for every K < 8192 * 8 heap pages. For most cases, the value of K will be large enough to ignore the overhead, but for small values of K, I'm not sure if we can say that with confidence. Of course, for very small tables the real contention might be somewhere else and so this change may not show up anything on the benchmarks. Don't your own tests for 33-page tables confirm that there is indeed contention for this case? I agree it's not huge, but I don't know if there are workloads where it will matter. > I understand that my patch is probably not going in, Sorry about that. I know how that feels. But we need some more reasons to justify this change, especially because the performance numbers themselves are not showing any signs either way. One could have argued that this saves us a valuable page-level bit, but with pg_upgrade etc. it has become hard to re-utilize page-level bits for other purposes and we don't yet have a pressing need for more bits. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Mon, 2013-01-21 at 12:49 +0530, Pavan Deolasee wrote: > At the minimum your patch will need to have one additional buffer > pinned for every K < 8192 * 8 heap pages. I assume it's the same K I referred to when responding to Robert: the max number of heap buffers we read before we unpin and repin the VM buffer. Right now it's unlimited, because there is currently no need to have it at all -- I was just offering the solution in case we did want to do VM page cleanup in the future or something. > For most cases, the value of K will be large enough to ignore the > overhead, but for small values of K, I'm not sure if we can say that > with confidence. It's a constant, and we can easily set it high enough that it wouldn't matter. There's no reason at all to choose a small value for K. Consider that an index root page pin is held for the entire scan, and we don't have a problem with that. > Of course, for very small tables the real contention might be > somewhere else and so this change may not show up anything on the > benchmarks. Don't your own tests for 33-page tables confirm that > there is indeed contention for this case? I agree it's not huge, but I > don't know if there are workloads where it will matter. That appears to happen because of the fraction of pinned pages being too high (aside: it was fairly challenging to construct a test where that happened). I believe it was mostly solved by adding a threshold to use the VM, and I can probably solve it completely by doing some more experiments and finding a better threshold value. > > I understand that my patch is probably not going in, > > Sorry about that. I know how that feels. But we need some more reasons > to justify this change, especially because the performance numbers > themselves are not showing any signs either way. That confuses me. The testing was to show it didn't hurt other workloads (like scans or inserts/updates/deletes); so the best possible result is that they don't show signs either way.
But yes, I see that others are not interested in the benefits offered by the patch, which is why I'm giving up on it. If people are concerned about the costs, then I can fix those; but there's nothing I can do to increase the benefits. Regards, Jeff Davis
On 21.01.2013 11:10, Jeff Davis wrote: > That confuses me. The testing was to show it didn't hurt other workloads > (like scans or inserts/updates/deletes); so the best possible result is > that they don't show signs either way. I went back to look at the initial test results that demonstrated that keeping the pin on the VM buffer mitigated the overhead of pinning the vm page. The obvious next question is, what is the impact when that's inefficient, i.e. when you update pages across different 512 MB sections, so that the vm pin has to be dropped and reacquired repeatedly. I tried to construct a worst-case scenario for that:

create unlogged table testtable (i int4);
insert into testtable select generate_series(1, 15000000);
insert into testtable select generate_series(1, 15000000);
create index testtable_index on testtable (i);

When you traverse tuples using that index, the tuples will come alternating from low-numbered pages and high-numbered pages, which defeats keeping the last vm page pinned. To test, I ran this:

set enable_bitmapscan=off;
set enable_seqscan=off;
begin;
delete from testtable where i >= 0;
rollback;

I repeated a few times with and without the patch (rm-pd-all-visible-0118.patch). According to \timing, the delete takes about 12.6 seconds without the patch, and 15.3 seconds with it. This is a worst-case scenario, and the slowdown isn't huge, so maybe it's a worthwhile tradeoff. But it shows that removing PD_ALL_VISIBLE is not completely free. - Heikki
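The reason this access pattern hurts can be illustrated by counting how often a scan that keeps one VM buffer pinned has to swap that pin. A toy model only (same 8192 * 8 heap-pages-per-VM-page figure as elsewhere in the thread; this counts pin bookkeeping, not real buffer management):

```python
HEAP_PAGES_PER_VM_PAGE = 8192 * 8  # assumed: 8KB VM page, one bit per heap page

def count_vm_repins(heap_pages_visited):
    """Count how many times the pinned VM buffer must be swapped."""
    repins = 0
    pinned_vm_block = None
    for heap_blk in heap_pages_visited:
        vm_block = heap_blk // HEAP_PAGES_PER_VM_PAGE
        if vm_block != pinned_vm_block:
            repins += 1  # drop the old pin, acquire a new one
            pinned_vm_block = vm_block
    return repins

n = 2 * HEAP_PAGES_PER_VM_PAGE  # a heap spanning exactly two VM pages

# Sequential scan: the pin changes only once, at the 512MB boundary.
assert count_vm_repins(range(n)) == 2

# Heikki's pattern: the index returns tuples alternating between
# low-numbered and high-numbered heap pages, so every visit lands on
# the other VM page and the kept pin is useless.
alternating = [p for i in range(n // 2) for p in (i, n - 1 - i)]
assert count_vm_repins(alternating) == n
```

In this pattern the patch gains nothing from holding the pin, while master's PD_ALL_VISIBLE check avoids touching the VM page at all on updates, which is consistent with the measured 12.6s vs. 15.3s gap.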
On Mon, Jan 21, 2013 at 1:52 AM, Jeff Davis <pgsql@j-davis.com> wrote: >> Of course, there is an argument that this patch will >> simplify the code, but I'm not sure if it's enough to justify the >> additional contention which may or may not show up in the benchmarks >> we are running, but we know it's there. > > What additional contention? How do you know it's there? We know it's there because every additional page that you access has to be looked up and locked and the memory that contains it brought into the CPU cache. If you look up and lock more pages, you WILL have more contention - both for memory, and for locks - and more runtime cost. Whether or not you can currently detect that contention and/or runtime cost experimentally is another matter. Failure to do so could indicate either that you haven't got the right test case - Heikki seems to have found one that works - or that it's being masked by other things, but might not always be. To raise an example which I believe is similar to this one, consider Kevin Grittner's work on SSI. He set himself a goal that SSI should impose a performance regression of no more than 2% - and he met that goal, at the time the code was committed. But then, my read scalability work during the 9.2 cycle blew the lid off of read performance for certain workloads, and now SSI is FAR slower on those workloads. His code always had a scalability problem, but you couldn't measure it experimentally before I did that work, because there were equally bad scalability problems elsewhere. Now there aren't, and you can, and I have. We might well have refused to commit that patch with the locking scheme it uses today if my scalability work had been done first - or perhaps we would have decided that the case where the difference is large is too narrow to be worth worrying about, but I think it would have at least provoked discussion. My biggest concern about the visibility map is that it will act as a focal point for contention.
Map half a gigabyte of heap down to 1 page of visibility map and it's easy to think that you could have some situation in which that page gets very, very hot. We know that cache line stealing is a significant cost of ProcArray manipulation, which is why the PGXACT patch makes a significant difference in write throughput. Now your argument seems to be that we can't measure any such effect here, but maybe we're just not measuring the right thing. Heikki's example reminded me that I was concerned about the cost of repeatedly pinning different VM buffers during an index-only scan, if the relation were very large. That seems easy to justify on the grounds that you're saving so much I/O and memory traffic that the pins will seem cheap by comparison ... but they don't, always. IIRC, an index-only scan on a fully-cached relation is not necessarily faster than a regular index-scan, even if the heap is entirely all-visible. I realize these examples aren't directly applicable to this case, so I'm not sure if they help to advance the discussion or not. I advance them only as evidence that simple tests don't always manage to capture the real costs and benefits of a feature or optimization, and that predicting the system-wide effects of changes in this area is hard. For a large benefit I think we would perhaps bite the bullet and take our chances, but in my (human and imperfect) judgement the benefit here is not large enough to justify it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, 2013-01-21 at 12:00 +0200, Heikki Linnakangas wrote: > On 21.01.2013 11:10, Jeff Davis wrote: > > That confuses me. The testing was to show it didn't hurt other workloads > > (like scans or inserts/updates/deletes); so the best possible result is > > that they don't show signs either way. > > I went back to look at the initial test results that demonstrated that > keeping the pin on the VM buffer mitigated the overhead of pinning the > vm page. The obvious next question is, what is the impact when that's > inefficient, ie. when you update pages across different 512 MB sections, > so that the vm pin has to be dropped and reacquired repeatedly. > > I tried to construct a worst case scenario for that: I confirmed this result in a single connection (no concurrency). I used a shared_buffers of 2GB so that the whole table would fit. Also, I fixed a bug that I noticed along the way, which was an uninitialized variable. New version attached. FWIW, I'm considering this patch to be rejected; I just didn't want to leave a patch with a bug in it. Regards, Jeff Davis