Thread: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Peter Geoghegan
Date:
The code in gistvacuum.c is closely based on similar code in nbtree.c,
except that it only acquires an exclusive lock -- not a
super-exclusive lock. I suspect that that's because it seemed
unnecessary; nbtree plain index scans have their own special reasons
for this, that don't apply to GiST. Namely: nbtree plain index scans
that don't use an MVCC snapshot clearly need some other interlock to
protect against concurrent recycling of pointed-to-by-leaf-page TIDs.
And so as a general rule nbtree VACUUM needs a full
super-exclusive/cleanup lock, just in case there is a plain index scan
that uses some other kind of snapshot (logical replication, say).

To say the same thing another way: nbtree follows "the third rule"
described by "62.4. Index Locking Considerations" in the docs [1], but
GiST does not. The idea that GiST's behavior is okay here does seem
consistent with what the same docs go on to say about it: "When using
an MVCC-compliant snapshot, there is no problem because the new
occupant of the slot is certain to be too new to pass the snapshot
test".

But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.

The (admittedly undocumented) reason why we can never drop the leaf
page pin for an index-only scan in nbtree (but can do so for plain
index scans) also relates to heap interlocking -- but with a twist.
Here's the twist: the second heap pass by VACUUM can set visibility
map bits independently of the first (once LP_DEAD items from the first
pass over the heap are set to LP_UNUSED, which renders the page
all-visible) -- this all happens at the end of
lazy_vacuum_heap_page(). That's why index-only scans can't just assume
that VACUUM won't have deleted the TID from the leaf page they're
scanning immediately after they're done reading it. VACUUM could even
manage to set the visibility map bit for a relevant heap page inside
lazy_vacuum_heap_page(), before the index-only scan can read the
visibility map. If that is allowed to happen, the index-only would
give wrong answers if one of the TID references held in local memory
by the index-only scan happens to be marked LP_UNUSED inside
lazy_vacuum_heap_page(). IOW, it looks like we run the risk of a
concurrently recycled dead-to-everybody TID becoming visible during
GiST index-only scans, just because we have no interlock.

In summary:

UUIC this is only safe for nbtree because 1.) It acquires a
super-exclusive lock when vacuuming leaf pages, and 2.) Index-only
scans never drop their pin on the leaf page when accessing the
visibility map "in-sync" with the scan (of course we hope not to
access the heap proper at all for index-only scans). These precautions
are both necessary to make the race condition I describe impossible,
because they ensure that VACUUM cannot reach lazy_vacuum_heap_page()
until after our index-only scan reads the visibility map (and then has
to read the heap for at least that one dead-to-all TID, discovering
that the TID is dead to its snapshot). Why wouldn't GiST need to take
the same precautions, though?

[1] https://www.postgresql.org/docs/devel/index-locking.html
--
Peter Geoghegan



Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Peter Geoghegan
Date:
On Thu, Nov 4, 2021 at 8:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Let's enumerate steps how things can go wrong.
>
> Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
>
> Backend2: Remove index_tuple1 and heap_tuple1
>
> Backend3: Mark page1 all-visible
> Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
>
> To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do
LockBufferForCleanup(index_page1).Do I get things right?
 

Almost. Backend3 is actually Backend2 here (there is no 3) -- it runs
VACUUM throughout.

Note that it's not particularly likely that Backend2/VACUUM will "win"
this race, because it typically has to do much more work than
Backend1. It has to actually remove the index tuples from the leaf
page, then any other index work (for this and other indexes). Then it
has to arrive back in vacuumlazy.c to set the VM bit in
lazy_vacuum_heap_page(). That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?

-- 
Peter Geoghegan



Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Andrey Borodin
Date:

> 4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
> That's a pretty unlikely scenario. And even
> if it happened it would only happen once (until the next time we get
> unlucky). What are the chances of somebody noticing a more or less
> once-off, slightly wrong answer?

I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.

Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer
too?

Best regards, Andrey Borodin.


Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Peter Geoghegan
Date:
On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > 4 нояб. 2021 г., в 20:58, Peter Geoghegan <pg@bowt.ie> написал(а):
> > That's a pretty unlikely scenario. And even
> > if it happened it would only happen once (until the next time we get
> > unlucky). What are the chances of somebody noticing a more or less
> > once-off, slightly wrong answer?
>
> I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.

I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.

> Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer
too?

No, because there is no heap vacuuming involved (because that doesn't
happen outside lazyvacuum.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work, and index-only scans. Making LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.

I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated. But
AFAICT it's always some kind of "interlock with heapam VACUUM" issue,
with TID recycling, with no protection from our MVCC snapshot. I would
say that that's the "real problem" here, when I get to first principles.

Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.

We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan

Attachment

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Peter Geoghegan
Date:
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I believe that there have been several historic reasons why we need a
> cleanup lock during nbtree VACUUM, and that there is only one
> remaining reason for it today. So the history is unusually complicated.

Minor correction: we actually also have to worry about plain index
scans that don't use an MVCC snapshot, which is possible within
nbtree. It's quite likely when using logical replication, actually.
See the patch for more.

Like with the index-only scan case, a non-MVCC snapshot + plain nbtree
index scan cannot rely on heap access within the index scan node -- it
won't reliably notice that any newer heap tuples (that are really the
result of concurrent TID recycling) are not actually visible to its
MVCC snapshot -- because there isn't an MVCC snapshot. The only
difference in the index-only scan scenario is that we use the
visibility map (not the heap) -- which is racey in a way that makes
our MVCC snapshot (IOSs always have an MVCC snapshot) an ineffective
protection.

In summary, to be safe against confusion from concurrent TID recycling
during index/index-only scans, we can do either of the following
things:

1. Hold a pin of our leaf page while accessing the heap -- that'll
definitely conflict with the cleanup lock that nbtree VACUUM will
inevitably try to acquire on our leaf page.

OR:

2. Hold an MVCC snapshot, AND do an actual heap page access during the
plain index scan -- do both together.

With approach 2, our plain index scan must determine visibility using
real XIDs (against something like a dirty snapshot), rather than using
a visibility map bit. That is also necessary because the VM might
become invalid or ambiguous, in a way that's clearly not possible when
looking at full heap tuple headers with XIDs -- concurrent recycling
becomes safe if we know that we'll reliably notice it and not give
wrong answers.

Does that make sense?

-- 
Peter Geoghegan



Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From
Peter Geoghegan
Date:
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached draft patch attempts to explain things in this area within
> the nbtree README. There is a much shorter comment about it within
> vacuumlazy.c. I am concerned about GiST index-only scans themselves,
> of course, but I discovered this issue when thinking carefully about
> the concurrency rules for VACUUM -- I think it's valuable to formalize
> and justify the general rules that index access methods must follow.

I pushed a commit that described how this works for nbtree, in the README file.

I think that there might be an even more subtle race condition in
nbtree itself, though, during recovery. We no longer do a "pin scan"
during recovery these days (see commits 9f83468b, 3e4b7d87, and
687f2cd7 for full information). I think that it might be necessary to
do that, just for the benefit of index-only scans -- if it's necessary
during original execution, then why not during recovery?

The work to remove "pin scans" was justified by pointing out that we
no longer use various kinds of snapshots during recovery, but it said
nothing about index-only scans, which need the TID recycling interlock
(i.e. need to hold onto a leaf page while accessing the heap in sync)
even with an MVCC snapshot. It's easy to imagine how it might have
been missed: nobody ever documented the general issue with index-only
scans, until now. Commit 2ed5b87f recognized they were unsafe for the
optimization that it added (to avoid blocking VACUUM), but never
explained why they were unsafe.

Going back to doing pin scans during recovery seems deeply
unappealing, especially to fix a totally narrow race condition.

-- 
Peter Geoghegan



On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is a refined version of a test case I posted earlier on [2],
> decisively proving that GiST index-only scans are in fact subtly
> broken. Right now it fails, showing a wrong answer to a query. The
> patch adds an isolationtest test case to btree_gist, based on a test
> case of Andres'.

I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.

I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.

--
Peter Geoghegan