Thread: Re: Incorrect result of bitmap heap scan.

Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> The running theory is that bitmap executor nodes incorrectly assume
> that the rows contained in the bitmap all are still present in the
> index, and thus assume they're allowed to only check the visibility
> map to see if the reference contained in the bitmap is visible.
> However, this seems incorrect: Note that index AMs must hold at least
> pins on the index pages that contain their results when those results
> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
> from the index (and later, heap) that are still present in the bitmap
> used in the scan.

This theory seems very believable.

We hold onto a leaf page buffer pin for index-only scans as an
interlock against concurrent TID recycling. If we assume for the sake
of argument that the optimization from commit 7c70996e is correct,
then why do we even bother with holding onto the pin during index-only
scans?

In theory we should either do the "buffer pin interlock against TID
recycling" thing everywhere, or nowhere -- how could bitmap scans
possibly be different here?

> I think this might be an oversight when the feature was originally
> committed in 7c70996e (PG11): we don't know when the VM bit was set,
> and the bitmap we're scanning may thus be out-of-date (and should've
> had TIDs removed if it had been an index), so I propose disabling this
> optimization for now, as attached.

I have a hard time imagining any alternative fix that is suitable for
backpatch. Can we save the optimization on the master branch?

Clearly it would be wildly impractical to do the "buffer pin interlock
against TID recycling" thing in the context of bitmap scans. The only
thing that I can think of that might work is a scheme that establishes
a "safe LSN" for a given MVCC snapshot. If the VM page's LSN is later
than the "safe LSN", it's not okay to trust its all-visible bits. At
least not in the context of bitmap index scans that use the
optimization from 7c70996e.
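
Roughly, the check might look something like this -- purely a sketch,
since nothing like a snapshot-level "safe LSN" exists today, and buffer
locking details are elided:

    /*
     * Hypothetical: if the VM page was modified after the snapshot's
     * "safe LSN", its all-visible bits may reflect tuple removals that
     * the bitmap never saw, so fall back to fetching the heap page.
     */
    static bool
    bitmap_can_trust_vm_bit(Buffer vmbuffer, XLogRecPtr safe_lsn)
    {
        Page    vmpage = BufferGetPage(vmbuffer);

        return PageGetLSN(vmpage) <= safe_lsn;
    }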

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Dec 2, 2024 at 10:15 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> The running theory is that bitmap executor nodes incorrectly assume
>> that the rows contained in the bitmap all are still present in the
>> index, and thus assume they're allowed to only check the visibility
>> map to see if the reference contained in the bitmap is visible.
>> However, this seems incorrect: Note that index AMs must hold at least
>> pins on the index pages that contain their results when those results
>> are returned by amgettuple() [0], and that amgetbitmap() doesn't do
>> that for all TIDs in the bitmap; thus allowing vacuum to remove TIDs
>> from the index (and later, heap) that are still present in the bitmap
>> used in the scan.

> This theory seems very believable.

I'm not convinced.  I think there are two assumptions underlying
bitmap scan:

1. Regardless of index contents, it's not okay for vacuum to remove
tuples that an open transaction could potentially see.  So the heap
tuple will be there if we look, unless it was already dead (in which
case it could have been replaced, so we have to check visibility of
whatever we find).

2. If the page's all-visible bit is set, there has been no recent
change in its contents, so we don't have to look at the page.
"Recent" is a bit squishily defined, but again it should surely
cover outdating or removal of a tuple that an open transaction
could see.

If this is not working, I am suspicious that somebody made page
freezing too aggressive somewhere along the line.

Whether that's true or not, it seems like it'd be worth bisecting
to see if we can finger a commit where the behavior changed (and
the same goes for the question of why-isnt-it-an-IOS-scan).  However,
the reproducer seems to have quite a low failure probability for me,
which makes it hard to do bisection testing with much confidence.
Can we do anything to make the test more reliable?  If I'm right
to suspect autovacuum, maybe triggering lots of manual vacuums
would improve the odds?

            regards, tom lane



Re: Incorrect result of bitmap heap scan.

From: Andres Freund
Hi,

On 2024-12-02 11:07:38 -0500, Peter Geoghegan wrote:
> Clearly it would be wildly impractical to do the "buffer pin interlock
> against TID recycling" thing in the context of bitmap scans.

That's certainly true if we do the VM check at the time of the bitmap heap
scan.

What if we did it whenever we first enter a block into the TID bitmap? There's
sufficient padding space in PagetableEntry to store such state without
increasing memory usage.

That'd probably need some API evolution: we'd only know that we need to check
the VM after entering the block into the tidbitmap, we'd need to hold index
pins during the VM check, and we'd then update PagetableEntry with the result
of the VM probe.

I think, but am not certain, that it'd be sufficient to do the VM check the
first time an index scan encounters a block. If a concurrent vacuum were to
mark the heap page all-visible after we first encountered it, we'd do "real"
visibility checks, because the first VM lookup won't have shown it as
all-visible. And in the opposite situation, where we find a page all-visible
in the VM and it subsequently gets marked not-all-visible, normal snapshot
semantics would make it safe to still consider the contents all-visible,
because the update/delete can't be visible to "us".
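
Very roughly, something like this -- the new field and helper names are
purely illustrative, none of them exist (the real PagetableEntry lives
in tidbitmap.c):

    typedef struct PagetableEntry
    {
        BlockNumber blockno;        /* page number (hashtable key) */
        char        status;         /* hash entry status */
        bool        ischunk;        /* T = lossy storage, F = exact */
        bool        recheck;        /* should the tuples be rechecked? */
        bool        vm_all_visible; /* NEW: VM probe result from when the
                                     * block was first entered; fits in
                                     * the existing padding */
        bitmapword  words[Max(WORDS_PER_PAGE, WORDS_PER_CHUNK)];
    } PagetableEntry;

    /*
     * Called while the index AM still holds its pin, the first time
     * blockno is added to the bitmap; the bitmap heap scan would later
     * trust entry->vm_all_visible instead of probing the VM itself.
     */
    static void
    tbm_record_vm_status(PagetableEntry *entry, Relation heaprel,
                         Buffer *vmbuf)
    {
        entry->vm_all_visible = VM_ALL_VISIBLE(heaprel, entry->blockno, vmbuf);
    }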

Greetings,

Andres Freund



Re: Incorrect result of bitmap heap scan.

From: Andres Freund
Hi,

On 2024-12-02 11:43:42 -0500, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > This theory seems very believable.
> 
> I'm not convinced.  I think there are two assumptions underlying
> bitmap scan:
> 
> 1. Regardless of index contents, it's not okay for vacuum to remove
> tuples that an open transaction could potentially see.  So the heap
> tuple will be there if we look, unless it was already dead (in which
> case it could have been replaced, so we have to check visibility of
> whatever we find).

I think the problematic scenario involves tuples that *nobody* can see. During
the bitmap index scan we don't know that though. Thus the tid gets inserted
into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
the tuple from the indexes and then the heap and marks the page as
all-visible, as the deleted row version has been removed.  Then, during the
bitmap heap scan, we do a VM lookup and see the page being all-visible and
thus assume there's a visible tuple pointed to by the tid.

No snapshot semantics protect against this, as the tuple is *not* visible to
anyone.

Greetings,

Andres Freund



Re: Incorrect result of bitmap heap scan.

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though. Thus the tid gets inserted
> into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> the tuple from the indexes and then the heap and marks the page as
> all-visible, as the deleted row version has been removed.

Yup.  I am saying that that qualifies as too-aggressive setting of the
all-visible bit.  I'm not sure what rule we should adopt instead of
the current one, but I'd much rather slow down page freezing than
institute new page locking rules.

            regards, tom lane



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup.  I am saying that that qualifies as too-aggressive setting of the
> all-visible bit.  I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

Freezing a page, and setting a page all-visible are orthogonal.
Nothing has changed about when or how we set pages all-visible in a
long time -- VACUUM has always done that to the maximum extent that
its OldestXmin cutoff will allow. (The changes to freezing made
freezing work a little bit more like that, the general idea being to
batch up work.)

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Dec 2, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yup.  I am saying that that qualifies as too-aggressive setting of the
>> all-visible bit.  I'm not sure what rule we should adopt instead of
>> the current one, but I'd much rather slow down page freezing than
>> institute new page locking rules.

> Freezing a page, and setting a page all-visible are orthogonal.

Sorry, sloppy wording on my part.

            regards, tom lane



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 11:52 AM Andres Freund <andres@anarazel.de> wrote:
> I think the problematic scenario involves tuples that *nobody* can see. During
> the bitmap index scan we don't know that though.

Right, exactly.

FWIW, this same issue is why it is safe for nbtree to drop its pin
early during plain index scans, but not during index-only scans -- see
_bt_drop_lock_and_maybe_pin, and the nbtree/README section on making
concurrent TID recycling safe. Weirdly, nbtree is specifically aware
that it needs to *not* drop its pin in the context of index-only scans
(to make sure that VACUUM cannot do unsafe concurrent TID recycling)
-- even though an equivalent index scan would be able to drop its pin
like this.

The underlying reason why nbtree can discriminate like this is that it
"knows" that plain index scans will always visit the heap proper. If a
TID points to an LP_UNUSED item, then it is considered dead to the
scan (even though in general the heap page itself might be marked
all-visible). If some completely unrelated, newly inserted heap tuple
is found instead, then it cannot be visible to the plain index scan's
MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
dropped like this).

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Andres Freund
Hi,

On 2024-12-02 12:02:39 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
> 
> Yup.  I am saying that that qualifies as too-aggressive setting of the
> all-visible bit.  I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

How? This basically would mean we could never set all-visible if there is
*any* concurrent scan on the current relation, because any concurrent scan
could have an outdated view of all-visible.  Afaict this isn't an issue of
"too-aggressive setting of the all-visible bit", it's an issue of setting it
at all.

Greetings,

Andres Freund



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 12:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Freezing a page, and setting a page all-visible are orthogonal.
>
> Sorry, sloppy wording on my part.

Freezing doesn't affect the contents of the visibility map in any way
that seems relevant. The executor only cares about the all-visible bit
(and never the all-frozen bit), and the rules around when and how
VACUUM sets the all-visible bit (and how everybody else unsets the
all-visible bit) haven't changed in forever. So I just can't see it.

I guess it's natural to suspect more recent work -- commit 7c70996e is
about 6 years old. But I think the race condition that I suspect is at play
here is very narrow.

It's pretty unlikely that there'll be a dead-to-all TID returned to a
scan (not just dead to our MVCC snapshot, dead to everybody's) that is
subsequently concurrently removed from the index, and then set
LP_UNUSED in the heap. It's probably impossible if you don't have a
small table -- VACUUM just isn't going to be fast enough to get to the
leaf page after the bitmap index scan, but still be able to get to the
heap before its corresponding bitmap heap scan (that uses the VM as an
optimization) can do the relevant visibility checks (while it could
happen with a large table and a slow bitmap scan, the chances of the
VACUUM being precisely aligned with the bitmap scan, in just the wrong
way, seem remote in the extreme). Finally, none of this will happen if
some other factor hinders VACUUM from setting the relevant heap page
all-visible.

AFAICT this is only a problem because of the involvement of the VM,
specifically -- an MVCC snapshot *is* generally sufficient to make
bitmap index scans safe from the dangers of concurrent TID recycling,
as explained in "62.4. Index Locking Considerations". That only ceases
to be true when the visibility map becomes involved (the VM lacks the
granular visibility information required to make all this safe). This
is essentially the same VM race issue that nbtree's
_bt_drop_lock_and_maybe_pin protects against during conventional
index-only scans.

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 11:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Whether that's true or not, it seems like it'd be worth bisecting
> to see if we can finger a commit where the behavior changed (and
> the same goes for the question of why-isnt-it-an-IOS-scan).  However,
> the reproducer seems to have quite a low failure probability for me,
> which makes it hard to do bisection testing with much confidence.
> Can we do anything to make the test more reliable?  If I'm right
> to suspect autovacuum, maybe triggering lots of manual vacuums
> would improve the odds?

I agree that autovacuum (actually, VACUUM) is important here.

I find that the test becomes much more reliable if I create the test
table "with (autovacuum_analyze_scale_factor=0.99,
vacuum_truncate=off)". More importantly, rather than relying on
autovacuum, I just run VACUUM manually from psql. I find it convenient
to use "\watch 0.01" to run VACUUM repeatedly.

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Andres Freund
Hi,

On 2024-12-02 13:39:43 -0500, Peter Geoghegan wrote:
> I guess it's natural to suspect more recent work -- commit 7c70996e is
> about 6 years old. But I think the race condition that I suspect is at play
> here is very narrow.

FWIW, the test I just posted shows the issue down to 11 (although for 11 one
has to remove the (TRUNCATE false) option). 10 returns correct results.

I don't think the race is particularly narrow. The vacuum only has to complete
somewhere between the start of the bitmap index scan and the end of the bitmap
heap scan -- with an expensive query, that's a lot of time.


I suspect one contributor to this having avoided attention until now was that
the optimization is fairly narrow:

            /*
             * We can potentially skip fetching heap pages if we do not need
             * any columns of the table, either for checking non-indexable
             * quals or for returning data.  This test is a bit simplistic, as
             * it checks the stronger condition that there's no qual or return
             * tlist at all. But in most cases it's probably not worth working
             * harder than that.
             */
            need_tuples = (node->ss.ps.plan->qual != NIL ||
                           node->ss.ps.plan->targetlist != NIL);

Even an entry in the targetlist that does not depend on the current row
disables the optimization.

Because the scan doesn't return any column values for those rows, it's also
somewhat hard to actually notice that the results are wrong...
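
For example (hypothetical queries, assuming an index on tbl.a and a
bitmap plan being chosen):

    -- No scan-level qual and an empty targetlist: the skip-fetch path can
    -- apply, and only the row count can come out wrong.
    SELECT count(*) FROM tbl WHERE a BETWEEN 1 AND 1000;

    -- Returning a column puts it in the scan's targetlist, so the
    -- optimization is disabled and the heap is fetched as usual.
    SELECT a FROM tbl WHERE a BETWEEN 1 AND 1000;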



> It's pretty unlikely that there'll be a dead-to-all TID returned to a
> scan (not just dead to our MVCC snapshot, dead to everybody's) that is
> subsequently concurrently removed from the index, and then set
> LP_UNUSED in the heap. It's probably impossible if you don't have a
> small table -- VACUUM just isn't going to be fast enough to get to the
> leaf page after the bitmap index scan, but still be able to get to the
> heap before its corresponding bitmap heap scan (that uses the VM as an
> optimization) can do the relevant visibility checks (while it could
> happen with a large table and a slow bitmap scan, the chances of the
> VACUUM being precisely aligned with the bitmap scan, in just the wrong
> way, seem remote in the extreme).

A cursor, waiting for IO, waiting for other parts of the query, ... can all
make this window almost arbitrarily large.

Greetings,

Andres Freund



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> I suspect one contributor to this avoiding attention till now was that the
> optimization is fairly narrow:
>
>                         /*
>                          * We can potentially skip fetching heap pages if we do not need
>                          * any columns of the table, either for checking non-indexable
>                          * quals or for returning data.  This test is a bit simplistic, as
>                          * it checks the stronger condition that there's no qual or return
>                          * tlist at all. But in most cases it's probably not worth working
>                          * harder than that.
>                          */
>                         need_tuples = (node->ss.ps.plan->qual != NIL ||
>                                                    node->ss.ps.plan->targetlist != NIL);
>
> Even an entry in the targetlist that does not depend on the current row
> disables the optimization.

Good point. I agree that that factor is likely to have masked the
problem over the past 6 years.

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Peter Geoghegan
On Mon, Dec 2, 2024 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The underlying reason why nbtree can discriminate like this is that it
> "knows" that plain index scans will always visit the heap proper. If a
> TID points to an LP_UNUSED item, then it is considered dead to the
> scan (even though in general the heap page itself might be marked
> all-visible). If some completely unrelated, newly inserted heap tuple
> is found instead, then it cannot be visible to the plain index scan's
> MVCC snapshot (has to be an MVCC snapshot for the leaf page pin to get
> dropped like this).

If I add "SET enable_indexonlyscan = false" to the "setup" section of
the v1-0001-isolationtester-showing-broken-index-only-scans-w.patch
isolationtester test case I posted earlier today, the test will pass.
There is no reason to think that plain GiST index scans are broken.
The fact that GiST VACUUM won't acquire cleanup locks is a problem
*only* because GiST supports index-only scans (AFAIK no other thing
within GiST is broken).

My point is that index-only scans are generally distinct from plain
index scans from an interlocking point of view -- they're not
equivalent (not quite). And yet the "62.4. Index Locking
Considerations" docs nevertheless say nothing about index-only scans
in particular (only about bitmap scans). The docs do say that an MVCC
snapshot is protective, though -- which makes it sound like GiST can't
be doing anything wrong here (GiST only uses MVCC snapshots).

Actually, I don't think that GiST would be broken at all if we'd
simply never added support for index-only scans to GiST. I'm fairly
sure that index-only scans didn't exist when the bulk of this part of
the docs was originally written. ISTM that we ought to do something
about these obsolete docs, after we've fixed the bugs themselves.

--
Peter Geoghegan



Re: Incorrect result of bitmap heap scan.

From: Matthias van de Meent
On Mon, 2 Dec 2024 at 18:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > I think the problematic scenario involves tuples that *nobody* can see. During
> > the bitmap index scan we don't know that though. Thus the tid gets inserted
> > into the bitmap. Then, before we visit the heap, a concurrent vacuum removes
> > the tuple from the indexes and then the heap and marks the page as
> > all-visible, as the deleted row version has been removed.
>
> Yup.  I am saying that that qualifies as too-aggressive setting of the
> all-visible bit.  I'm not sure what rule we should adopt instead of
> the current one, but I'd much rather slow down page freezing than
> institute new page locking rules.

I don't think it's new page locking rules, but rather a feature that
forgot to apply the existing page locking rules after bypassing MVCC's
snapshot rules. IOS is only allowed exactly when the index AM complies
with the locking rule "only return a TID that's on a page that can't
concurrently be processed by vacuum" - why would this be different for
the bitmap equivalent?

By saying we're too aggressive with setting the all-visible bit, you
seem to suggest we should add a rule to vacuum that, to remove LP_DEAD
items, we don't just have to wait for the tuples to be dead to all
transactions, but also for all transactions that might've gotten those
all_dead TIDs from indexes to have committed or rolled back, so that
no references to those TIDs exist that might consider them "possibly
visible".
We could achieve that by adding a waitpoint to vacuum (between the
index scan and the second heap scan for LP cleanup) that waits for all
concurrent transactions accessing the table to commit (so that all
bitmaps would be dropped), similar to REINDEX CONCURRENTLY's wait
phase. But that would significantly slow down vacuum's ability to
clean up old data, and change overall vacuum behaviour in a
fundamental way. I'm quite opposed to such a change.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)