Thread: Re: Incorrect result of bitmap heap scan.
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> The running theory is that bitmap executor nodes incorrectly assume
> that the rows contained in the bitmap all are still present in the
> index, and thus assume they're allowed to only check the visibility
> map to see if the reference contained in the bitmap is visible.
> However, this seems incorrect: Note that index AMs must hold at least
> pins on the index pages that contain their results when those results
> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
> from the index (and later, heap) that are still present in the bitmap
> used in the scan.

This theory seems very believable. We hold onto a leaf page buffer pin
for index-only scans as an interlock against concurrent TID recycling.

If we assume for the sake of argument that the optimization from commit
7c70996e is correct, then why do we even bother with holding onto the
pin during index-only scans? In theory we should either do the "buffer
pin interlock against TID recycling" thing everywhere, or nowhere -- how
could bitmap scans possibly be different here?

> I think this might be an oversight when the feature was originally
> committed in 7c70996e (PG11): we don't know when the VM bit was set,
> and the bitmap we're scanning may thus be out-of-date (and should've
> had TIDs removed if it had been an index), so I propose disabling this
> optimization for now, as attached.

I have a hard time imagining any alternative fix that is suitable for
backpatch.

Can we save the optimization on the master branch? Clearly it would be
wildly impractical to do the "buffer pin interlock against TID
recycling" thing in the context of bitmap scans. The only thing that I
can think of that might work is a scheme that establishes a "safe LSN"
for a given MVCC snapshot. If the VM page's LSN is later than the "safe
LSN", it's not okay to trust its all-visible bits. At least not in the
context of bitmap index scans that use the optimization from 7c70996e.
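For concreteness, a rough sketch of that check -- purely hypothetical,
since nothing like a per-snapshot "safe LSN" exists today, and the names
here are made up:

/*
 * Hypothetical sketch only.  "safe_lsn" would be some LSN established
 * for the scan's MVCC snapshot; no such thing exists today.
 */
static bool
vm_bit_trustworthy(Buffer vmbuffer, XLogRecPtr safe_lsn)
{
	/*
	 * If the VM page was modified after the snapshot's safe LSN, a
	 * concurrent VACUUM may have set the all-visible bit after
	 * recycling TIDs that our bitmap still references, so don't
	 * trust it.
	 */
	return PageGetLSN(BufferGetPage(vmbuffer)) <= safe_lsn;
}

-- 
Peter Geoghegan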
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> The running theory is that bitmap executor nodes incorrectly assume
>> that the rows contained in the bitmap all are still present in the
>> index, and thus assume they're allowed to only check the visibility
>> map to see if the reference contained in the bitmap is visible.
>> However, this seems incorrect: Note that index AMs must hold at least
>> pins on the index pages that contain their results when those results
>> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
>> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
>> from the index (and later, heap) that are still present in the bitmap
>> used in the scan.

> This theory seems very believable.

I'm not convinced. I think there are two assumptions underlying bitmap
scan:

1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see. So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).

2. If the page's all-visible bit is set, there has been no recent
change in its contents, so we don't have to look at the page. "Recent"
is a bit squishily defined, but again it should surely cover outdating
or removal of a tuple that an open transaction could see.

If this is not working, I am suspicious that somebody made page
freezing too aggressive somewhere along the line.

Whether that's true or not, it seems like it'd be worth bisecting to
see if we can finger a commit where the behavior changed (and the same
goes for the question of why-isnt-it-an-IOS-scan). However, the
reproducer seems to have quite a low failure probability for me, which
makes it hard to do bisection testing with much confidence. Can we do
anything to make the test more reliable? If I'm right to suspect
autovacuum, maybe triggering lots of manual vacuums would improve the
odds?

			regards, tom lane
Hi,

On 2024-12-02 11:07:38 -0500, Peter Geoghegan wrote:
> Clearly it would be wildly impractical to do the "buffer pin interlock
> against TID recycling" thing in the context of bitmap scans.

That's certainly true if we do the VM check at the time of the bitmap
heap scan. What if we did it whenever we first enter a block into the
TID bitmap? There's sufficient padding space in PagetableEntry to store
such state without increasing memory usage.

That'd probably need some API evolution, because we'd only know after
entering into the tidbitmap that we need to check the VM; we'd need to
hold index pins during the VM checking and then update PagetableEntry
with the result of the VM probe.

I think, but am not certain, that it'd be sufficient to do the VM check
the first time an index scan encounters a block. If a concurrent vacuum
were to mark the heap page all-visible after we first encountered it,
we'd do "real" visibility checks, because the first VM lookup won't have
seen it as all-visible. And in the opposite situation, where we find a
page all-visible in the VM, which then subsequently gets marked
not-all-visible, normal snapshot semantics would make it safe to still
consider the contents all-visible, because the update/delete can't be
visible to "us".
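Very roughly, the shape I have in mind -- the extra fields and the
helper are just assumptions for illustration; only
visibilitymap_get_status() and the general outline of PagetableEntry
are real:

typedef struct PagetableEntrySketch
{
	BlockNumber blockno;		/* heap block number (hashtable key) */
	bool		vm_checked;		/* new: VM probed when block first added */
	bool		all_visible;	/* new: result of that probe */
	/* ... existing fields: bitmap words, recheck flag, etc. ... */
} PagetableEntrySketch;

/*
 * Would be called the first time a heap block is entered into the
 * bitmap, while the index AM still holds a pin on the index page
 * that produced the TIDs.
 */
static void
record_vm_status(Relation heaprel, PagetableEntrySketch *entry,
				 Buffer *vmbuffer)
{
	uint8		status = visibilitymap_get_status(heaprel, entry->blockno,
												  vmbuffer);

	entry->vm_checked = true;
	entry->all_visible = (status & VISIBILITYMAP_ALL_VISIBLE) != 0;
}

The bitmap heap scan would then only skip the heap fetch if
all_visible had already been set at that point.

Greetings,

Andres Freund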
Hi,

On 2024-12-02 11:43:42 -0500, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > This theory seems very believable.
>
> I'm not convinced. I think there are two assumptions underlying
> bitmap scan:
>
> 1. Regardless of index contents, it's not okay for vacuum to remove
> tuples that an open transaction could potentially see. So the heap
> tuple will be there if we look, unless it was already dead (in which
> case it could have been replaced, so we have to check visibility of
> whatever we find).

I think the problematic scenario involves tuples that *nobody* can see.
During the bitmap index scan we don't know that though. Thus the tid
gets inserted into the bitmap. Then, before we visit the heap, a
concurrent vacuum removes the tuple from the indexes and then the heap
and marks the page as all-visible, as the deleted row version has been
removed. Then, during the bitmap heap scan, we do a VM lookup and see
the page being all-visible and thus assume there's a visible tuple
pointed to by the tid.

No snapshot semantics protect against this, as the tuple is *not*
visible to anyone.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though. Thus the tid gets inserted
> into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> the tuple from the indexes and then the heap and marks the page as
> all-visible, as the deleted row version has been removed.

Yup. I am saying that that qualifies as too-aggressive setting of the
all-visible bit. I'm not sure what rule we should adopt instead of the
current one, but I'd much rather slow down page freezing than institute
new page locking rules.

			regards, tom lane
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup. I am saying that that qualifies as too-aggressive setting of the
> all-visible bit. I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

Freezing a page, and setting a page all-visible are orthogonal.

Nothing has changed about when or how we set pages all-visible in a long
time -- VACUUM has always done that to the maximum extent that its
OldestXmin cutoff will allow. (The changes to freezing made freezing
work a little bit more like that, the general idea being to batch up
work.)

-- 
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yup. I am saying that that qualifies as too-aggressive setting of the
>> all-visible bit. I'm not sure what rule we should adopt instead of
>> the current one, but I'd much rather slow down page freezing than
>> institute new page locking rules.

> Freezing a page, and setting a page all-visible are orthogonal.

Sorry, sloppy wording on my part.

			regards, tom lane
On Mon, Dec 2, 2024 at 11:52 AM Andres Freund <andres@anarazel.de> wrote:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though.

Right, exactly.

FWIW, this same issue is why it is safe for nbtree to drop its pin early
during plain index scans, but not during index-only scans -- see
_bt_drop_lock_and_maybe_pin, and the nbtree/README section on making
concurrent TID recycling safe.

Weirdly, nbtree is specifically aware that it needs to *not* drop its
pin in the context of index-only scans (to make sure that VACUUM cannot
do unsafe concurrent TID recycling) -- even though an equivalent index
scan would be able to drop its pin like this.

The underlying reason why nbtree can discriminate like this is that it
"knows" that plain index scans will always visit the heap proper. If a
TID points to an LP_UNUSED item, then it is considered dead to the scan
(even though in general the heap page itself might be marked
all-visible). If some completely unrelated, newly inserted heap tuple is
found instead, then it cannot be visible to the plain index scan's MVCC
snapshot (has to be an MVCC snapshot for the leaf page pin to get
dropped like this).

-- 
Peter Geoghegan
Hi,

On 2024-12-02 12:02:39 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup. I am saying that that qualifies as too-aggressive setting of the
> all-visible bit. I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

How? This basically would mean we could never set all-visible if there
is *any* concurrent scan on the current relation, because any concurrent
scan could have an outdated view of all-visible.

Afaict this isn't an issue of "too-aggressive setting of the all-visible
bit", it's an issue of setting it at all.

Greetings,

Andres Freund
On Mon, Dec 2, 2024 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Freezing a page, and setting a page all-visible are orthogonal.
>
> Sorry, sloppy wording on my part.

Freezing doesn't affect the contents of the visibility map in any way
that seems relevant. The executor only cares about the all-visible bit
(and never the all-frozen bit), and the rules around when and how VACUUM
sets the all-visible bit (and how everybody else unsets the all-visible
bit) haven't changed in forever. So I just can't see it.

I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But I think the race condition that I suspect is at
play here is very narrow. It's pretty unlikely that there'll be a
dead-to-all TID returned to a scan (not just dead to our MVCC snapshot,
dead to everybody's) that is subsequently concurrently removed from the
index, and then set LP_UNUSED in the heap. It's probably impossible if
you don't have a small table -- VACUUM just isn't going to be fast
enough to get to the leaf page after the bitmap index scan, but still be
able to get to the heap before its corresponding bitmap heap scan (that
uses the VM as an optimization) can do the relevant visibility checks
(while it could happen with a large table and a slow bitmap scan, the
chances of the VACUUM being precisely aligned with the bitmap scan, in
just the wrong way, seem remote in the extreme). Finally, none of this
will happen if some other factor hinders VACUUM from setting the
relevant heap page all-visible.

AFAICT this is only a problem because of the involvement of the VM,
specifically -- an MVCC snapshot *is* generally sufficient to make
bitmap index scans safe from the dangers of concurrent TID recycling, as
explained in "62.4. Index Locking Considerations". That only ceases to
be true when the visibility map becomes involved (the VM lacks the
granular visibility information required to make all this safe). This is
essentially the same VM race issue that nbtree's
_bt_drop_lock_and_maybe_pin protects against during conventional
index-only scans.

-- 
Peter Geoghegan
On Mon, Dec 2, 2024 at 11:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Whether that's true or not, it seems like it'd be worth bisecting
> to see if we can finger a commit where the behavior changed (and
> the same goes for the question of why-isnt-it-an-IOS-scan). However,
> the reproducer seems to have quite a low failure probability for me,
> which makes it hard to do bisection testing with much confidence.
> Can we do anything to make the test more reliable? If I'm right
> to suspect autovacuum, maybe triggering lots of manual vacuums
> would improve the odds?

I agree that autovacuum (actually, VACUUM) is important here.

I find that the test becomes much more reliable if I create the test
table "with (autovacuum_analyze_scale_factor=0.99, vacuum_truncate=off)".
More importantly, rather than relying on autovacuum, I just run VACUUM
manually from psql. I find it convenient to use "\watch 0.01" to run
VACUUM repeatedly.
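Roughly like this (the table here is just a stand-in for the
reproducer's actual schema):

-- Stand-in table; the real reproducer's table and index may differ.
CREATE TABLE test_table (a int)
  WITH (autovacuum_analyze_scale_factor = 0.99, vacuum_truncate = off);
CREATE INDEX ON test_table (a);

-- Then, in a separate psql session, run VACUUM in a tight loop:
VACUUM test_table;
\watch 0.01

-- 
Peter Geoghegan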
Hi,

On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:
> I guess it's natural to suspect more recent work -- commit 7c70996e is
> about 6 years old. But I think the race condition that I suspect is at
> play here is very narrow.

FWIW, the test I just posted shows the issue down to 11 (although for 11
one has to remove the "(TRUNCATE false)"). 10 returns correct results.

I don't think the race is particularly narrow. Having a vacuum complete
between the start of the bitmap indexscan and the end of the bitmap
heapscan leaves a lot of time with an expensive query.

I suspect one contributor to this having avoided attention till now is
that the optimization is fairly narrow:

		/*
		 * We can potentially skip fetching heap pages if we do not need
		 * any columns of the table, either for checking non-indexable
		 * quals or for returning data.  This test is a bit simplistic, as
		 * it checks the stronger condition that there's no qual or return
		 * tlist at all. But in most cases it's probably not worth working
		 * harder than that.
		 */
		need_tuples = (node->ss.ps.plan->qual != NIL ||
					   node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row
disables the optimization (see the example at the end of this mail). Due
to not being able to return any content for those rows, it's also
somewhat hard to actually notice that the results are wrong...

> It's pretty unlikely that there'll be a dead-to-all TID returned to a
> scan (not just dead to our MVCC snapshot, dead to everybody's) that is
> subsequently concurrently removed from the index, and then set
> LP_UNUSED in the heap. It's probably impossible if you don't have a
> small table -- VACUUM just isn't going to be fast enough to get to the
> leaf page after the bitmap index scan, but still be able to get to the
> heap before its corresponding bitmap heap scan (that uses the VM as an
> optimization) can do the relevant visibility checks (while it could
> happen with a large table and a slow bitmap scan, the chances of the
> VACUUM being precisely aligned with the bitmap scan, in just the wrong
> way, seem remote in the extreme).

A cursor, waiting for IO, waiting for other parts of the query, ... can
all make this window almost arbitrarily large.
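For example (hypothetical table "tab" with an index on column "a"): the
first query shape should be able to use the skip-fetch path, the second
shouldn't, even though neither needs any heap columns:

-- Bitmap heap scan node has no qual and an empty targetlist; the
-- count is computed above the scan, so skip-fetch is possible.
SELECT count(*) FROM tab WHERE a BETWEEN 1 AND 1000;

-- The constant in the targetlist sets need_tuples and disables the
-- optimization, even though it doesn't depend on the current row.
SELECT 1 FROM tab WHERE a BETWEEN 1 AND 1000;

Greetings,

Andres Freund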
On Mon, Dec 2, 2024 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> I suspect one contributor to this having avoided attention till now is
> that the optimization is fairly narrow:
>
>		/*
>		 * We can potentially skip fetching heap pages if we do not need
>		 * any columns of the table, either for checking non-indexable
>		 * quals or for returning data.  This test is a bit simplistic, as
>		 * it checks the stronger condition that there's no qual or return
>		 * tlist at all. But in most cases it's probably not worth working
>		 * harder than that.
>		 */
>		need_tuples = (node->ss.ps.plan->qual != NIL ||
>					   node->ss.ps.plan->targetlist != NIL);
>
> Even an entry in the targetlist that does not depend on the current row
> disables the optimization.

Good point. I agree that that factor is likely to have masked the
problem over the past 6 years.

-- 
Peter Geoghegan
On Mon, Dec 2, 2024 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The underlying reason why nbtree can discriminate like this is that it
> "knows" that plain index scans will always visit the heap proper. If a
> TID points to an LP_UNUSED item, then it is considered dead to the
> scan (even though in general the heap page itself might be marked
> all-visible). If some completely unrelated, newly inserted heap tuple
> is found instead, then it cannot be visible to the plain index scan's
> MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
> dropped like this).

If I add "SET enable_indexonlyscan = false" to the "setup" section of
the v1-0001-isolationtester-showing-broken-index-only-scans-w.patch
isolationtester test case I posted earlier today, the test will pass.
There is no reason to think that plain GiST index scans are broken. The
fact that GiST VACUUM won't acquire cleanup locks is a problem *only*
because GiST supports index-only scans (AFAIK no other thing within GiST
is broken).

My point is that index-only scans are generally distinct from plain
index scans from an interlocking point of view -- they're not equivalent
(not quite). And yet the "62.4. Index Locking Considerations" docs
nevertheless say nothing about index-only scans in particular (only
about bitmap scans). The docs do say that an MVCC snapshot is
protective, though -- which makes it sound like GiST can't be doing
anything wrong here (GiST only uses MVCC snapshots).

Actually, I don't think that GiST would be broken at all if we'd simply
never added support for index-only scans to GiST. I'm fairly sure that
index-only scans didn't exist when the bulk of this part of the docs was
originally written. ISTM that we ought to do something about these
obsolete docs, after we've fixed the bugs themselves.

-- 
Peter Geoghegan
On Mon, 2 Dec 2024 at 18:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup. I am saying that that qualifies as too-aggressive setting of the
> all-visible bit. I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

I don't think this calls for new page locking rules; rather, it's a
feature that forgot to apply the existing page locking rules after
bypassing MVCC's snapshot rules. Index-only scans are only allowed
exactly when they comply with the index AM's locking rule of "only
return a TID that's on a page that can't concurrently be processed by
vacuum" -- why would this be different for the bitmap equivalent?

By saying we're too aggressive with setting the all-visible bit, you
seem to suggest adding a rule to vacuum that, to remove LP_DEAD items,
we don't just have to wait for the tuples to be dead to all
transactions, but also for all transactions that might've gotten those
all_dead TIDs from indexes to have committed or rolled back, so that no
references to those TIDs exist that might consider them "possibly
visible".

We could achieve that by adding a waitpoint to vacuum (between the index
scan and the second heap scan for LP cleanup) which waits for all
concurrent transactions accessing the table to commit (thus all bitmaps
would be dropped), similar to REINDEX CONCURRENTLY's wait phase. But
that would slow down vacuum's ability to clean up old data
significantly, and change overall vacuum behaviour in a fundamental way.
I'm quite opposed to such a change.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)