index-only scans vs. Hot Standby, round two - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | index-only scans vs. Hot Standby, round two |
Date | |
Msg-id | CA+TgmoaRhoYP=13PhkevEXhUDUscu7hjmN4GsRhH=tzGorKNRQ@mail.gmail.com Whole thread Raw |
Responses |
Re: index-only scans vs. Hot Standby, round two
Re: index-only scans vs. Hot Standby, round two |
List | pgsql-hackers |
Currently, we have a problem with index-only scans in Hot Standby mode: the xmin horizon on the standby might lag the master, and thus an index-only scan might mistakenly conclude that no heap fetch is needed when in fact it is. I suggested that we handle this by suppressing generation of index-only scan plans in Hot Standby mode, but Simon, Noah, and Dimitri were arguing that we should instead do the following, which is now on the open items list: * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts so that IndexOnlyScans work on Hot Standby Ideally, this would also allow us to remove the following hack from heapgetpage(), which would improve sequential-scan performance for Hot Standby. /* * If the all-visible flag indicates that all tuples on the page are * visible to everyone, we can skip the per-tuplevisibility tests. But * not in hot standby mode. A tuple that's already visible to all * transactions in themaster might still be invisible to a read-only * transaction in the standby. */ all_visible = PageIsAllVisible(dp)&& !snapshot->takenDuringRecovery; As far as I can see, to make this work, we'd need to have lazy_scan_heap() compute the latest xmin on any potentially all-visible page and treat that as a cutoff XID, cancelling queries with snapshots whose xmins equal or precede that value, just as we currently do when freezing tuples (cf xlog_heap_freeze). There are two things to worry about: correctness, and more query cancellations. In the department of query cancellations, I believe Noah argued previously that this wasn't really going to cause a problem. And, indeed, if the master has a mix of inserts, updates, and deletes, then it seems likely that any recovery conflicts generated this way would be hitting about the same place in the XID space as the ones caused by pruning away dead tuples. What will be different, if we go this route, is that an insert-only master, which right now only generates conflicts at freezing time, will instead generate conflicts much sooner, as soon as the relation is vacuumed. I do not use Hot Standby enough to know whether this is a problem, and I'm not trying to block this approach, but I do want to make sure that everyone agrees that it's acceptable before we go do it, because inevitably somebody is going to get bit. In making our decision, I think we should keep in mind that we are currently pretty laid-back about marking pages all-visible, and that's probably something we're going to want to change over time: for example, we recently discussed the fact that a page with one dead tuple in it currently needs *two* vacuums to become all-visible, because only the first vacuum pass is smart enough to mark things all-visible, and the first heap pass only truncates the dead tuple to a dead line pointer, so the page isn't all-visible a that point. When we fix that, which I think we're almost certainly going to want to do at some point, then that means these conflicts will occur that much sooner. I think we will likely also want to consider marking things all-visible on the fly at some point in the future; for example, if we pull a buffer in for an index-only scan and dirty it by setting a hint bit, we might want to take the plunge and mark it all-visible before evicting it, to save effort the next time. Again, not trying to dissuade anyone, just making sure we've thought through it enough to avoid being unhappy later. It's worth noting that none of this will normally matter if hot_standby_feedback=on, so part of the analysis here may depend on how many people we think have flipped that switch. As to correctness, after further review of lazy_scan_heap(), I think there are some residual bugs here that need to be fixed whatever we decide to do about the Hot Standby case: 1. I noticed that when we do PageSetAllVisible() today we follow it up with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a hint, so I think these should be changed to MarkBufferDirty(), which shouldn't make much difference given current code, but proposals to handle hint changes differently than non-hint changes abound, so it's probably not good to rely on those meaning the same thing forever. 2. More seriously, lazy_scan_heap() releases the buffer lock before setting the all-visible bit on the heap. This looks just plain wrong (and it's my fault, to be clear). Somebody else can come along after we've released the buffer lock and mark the page not-all-visible. That concurrent process will check the visibility map bit, find it already clear, and go on its merry way. Then, we set the visibility map bit and go on our merry way. Now we have a not-all-visible page with the all-visible bit set in the visibility map. Oops. I think this probably needs to be rewritten so that lazy_scan_heap() acquires a pin on the visibility map page before locking the buffer for cleanup and holds onto that pin until either we exit the range of heap pages covered by that visibility map page or trigger index vac due to maintenance_work_mem exhaustion. With that infrastructure in place, we can set the visibility map bit at the same time we set the page-level bit without risking holding the buffer lock across a buffer I/O (we might have to hold the buffer lock across a WAL flush in the worst case, but that hazard exists in a number of other places as well and fixing it here is well out of scope). Now, suppose we fix the above bugs and also add logic to generate query-conflicts as described above. How far can the standby now trust the visibility information? Setting the visibility map bit is a fully WAL-logged operation, so anyone for whom seeing the bit as set would potentially be a problem is certain to be killed before they get the chance to become confused. And when we first open for read-only connections after a new base backup, the initial snapshot had better be able to see all XIDs which have committed prior to that point, and only such things should be marked all-visible, so AFAICT the door is nailed shut pretty tight here. I am a little less certain about the page-level bit. On the master, it's possible for the PD_ALL_VISIBLE bit to make it to disk before the WAL record that sets the visibility map bit; the LSN interlock only protects the visibility map page itself, as part of a complicated strategy to avoid emitting FPWs for the entire heap when we vacuum an insert-only table. On the standby, those same bits are going to get set when the xlog records covering the setting of visibility map bit get replayed (which would be OK, in the sense that the page-level bit would be trustworthy in this case), or when a copy of the buffer makes it from master to standby by some other means (which is the potentially problematic case). This could happen either as part of a base backup, or via a FPW. I don't think the base backup case is a problem for the same reasons that it's OK for the visibility map bit case. If an FPW turns the bit on, then somehow the page-level bit got set without the visibility map bit getting set. I think the only for that to happen is: 1. vacuum on master sets the page-level bit and the visibility map bit 2. the heap page with the bit is written to disk BEFORE the WAL entry generated in (1) gets flushed; this is allowable, since it's not an error for the page-level bit to be set while the visibility-map bit is unset, only the other way around 3. crash (still before the WAL entry is flushed) 4. some operation on the master emits an FPW for the page without first rendering it not-all-visible At present, I'm not sure there's any real problem here, since normally an all-visible heap page is only going to get another WAL entry if somebody does an insert, update, or delete on it, in which case the visibility map bit is going to get cleared anyway. Freezing the page might emit a new FPW, but that's going to generate conflicts anyway, so no problem there. So I think there's no real problem here, but it doesn't seem totally future-proof - any future type of WAL record that might modify the page without rendering it not-all-visible would create a very subtle hazard for Hot Standby. We could dodge the whole issue by leaving the hack in heapgetpage() intact and just be happy with making index-only scans work, or maybe somebody has a more clever idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: