Re: index-only scans vs. Hot Standby, round two - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: index-only scans vs. Hot Standby, round two |
Date | |
Msg-id | 20120416070256.GC22182@tornado.leadboat.com Whole thread Raw |
In response to | index-only scans vs. Hot Standby, round two (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: index-only scans vs. Hot Standby, round two
Re: index-only scans vs. Hot Standby, round two |
List | pgsql-hackers |
On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote: > In the department of query cancellations, I believe Noah argued > previously that this wasn't really going to cause a problem. And, > indeed, if the master has a mix of inserts, updates, and deletes, then > it seems likely that any recovery conflicts generated this way would > be hitting about the same place in the XID space as the ones caused by > pruning away dead tuples. What will be different, if we go this > route, is that an insert-only master, which right now only generates > conflicts at freezing time, will instead generate conflicts much > sooner, as soon as the relation is vacuumed. I do not use Hot Standby > enough to know whether this is a problem, and I'm not trying to block > this approach, but I do want to make sure that everyone agrees that > it's acceptable before we go do it, because inevitably somebody is > going to get bit. Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible. A standby with the GUC "off" would ignore all-visible indicators and also decline to generate conflicts when flipping them on. > As to correctness, after further review of lazy_scan_heap(), I think > there are some residual bugs here that need to be fixed whatever we > decide to do about the Hot Standby case: > > 1. I noticed that when we do PageSetAllVisible() today we follow it up > with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a > hint, so I think these should be changed to MarkBufferDirty(), which > shouldn't make much difference given current code, but proposals to > handle hint changes differently than non-hint changes abound, so it's > probably not good to rely on those meaning the same thing forever. Do you refer to PD_ALL_VISIBLE as "not merely a hint" due to the requirement to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a positive visibility map bit? That is to say, PD_ALL_VISIBLE is fully a hint in its role as a witness of tuple visibility, but it is not a hint in its role as a witness of the visibility map status? In any event, I agree that those call sites should use MarkBufferDirty(). The large comment in SetBufferCommitInfoNeedsSave() seems incorrect. On systems with weaker memory ordering, the FlushBuffer() process's clearing of BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave() process until some time after the former retrieved buffer contents from shared memory. Memory barriers could make the comment true, but we should probably just update the comment to explain that a race condition may eat the update entirely. Incidentally, is there a good reason for MarkBufferDirty() to increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave() not to do so? Looks like an oversight. > 2. More seriously, lazy_scan_heap() releases the buffer lock before > setting the all-visible bit on the heap. This looks just plain wrong > (and it's my fault, to be clear). Somebody else can come along after > we've released the buffer lock and mark the page not-all-visible. > That concurrent process will check the visibility map bit, find it > already clear, and go on its merry way. Then, we set the visibility > map bit and go on our merry way. Now we have a not-all-visible page > with the all-visible bit set in the visibility map. Oops. Hmm, indeed. This deserves its own open item. > I think > this probably needs to be rewritten so that lazy_scan_heap() acquires > a pin on the visibility map page before locking the buffer for cleanup > and holds onto that pin until either we exit the range of heap pages > covered by that visibility map page or trigger index vac due to > maintenance_work_mem exhaustion. With that infrastructure in place, > we can set the visibility map bit at the same time we set the > page-level bit without risking holding the buffer lock across a buffer > I/O (we might have to hold the buffer lock across a WAL flush in the > worst case, but that hazard exists in a number of other places as well > and fixing it here is well out of scope). Looks reasonable at a glance. > 1. vacuum on master sets the page-level bit and the visibility map bit > 2. the heap page with the bit is written to disk BEFORE the WAL entry > generated in (1) gets flushed; this is allowable, since it's not an > error for the page-level bit to be set while the visibility-map bit is > unset, only the other way around > 3. crash (still before the WAL entry is flushed) > 4. some operation on the master emits an FPW for the page without > first rendering it not-all-visible > > At present, I'm not sure there's any real problem here, since normally > an all-visible heap page is only going to get another WAL entry if > somebody does an insert, update, or delete on it, in which case the > visibility map bit is going to get cleared anyway. Freezing the page > might emit a new FPW, but that's going to generate conflicts anyway, > so no problem there. So I think there's no real problem here, but it > doesn't seem totally future-proof - any future type of WAL record that > might modify the page without rendering it not-all-visible would > create a very subtle hazard for Hot Standby. We could dodge the whole > issue by leaving the hack in heapgetpage() intact and just be happy > with making index-only scans work, or maybe somebody has a more clever > idea. Good point. We could finagle things so the standby notices end-of-recovery checkpoints and blocks the optimization for older snapshots. For example, maintain an integer count of end-of-recovery checkpoints seen and store that in each Snapshot instead of takenDuringRecovery; use 0 on a master. When the global value is greater than the snapshot's value, disable the optimization for that snapshot. I don't know whether the optimization is powerful enough to justify that level of trouble, but it's an option to consider. For a different approach, targeting the fragility, we could add assertions to detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a full-page image. Thanks, nm
pgsql-hackers by date: