Thread: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

03 November 2021, 23:33:08

The code in gistvacuum.c is closely based on similar code in nbtree.c,
except that it only acquires an exclusive lock -- not a
super-exclusive lock. I suspect that that's because it seemed
unnecessary; nbtree plain index scans have their own special reasons
for this, that don't apply to GiST. Namely: nbtree plain index scans
that don't use an MVCC snapshot clearly need some other interlock to
protect against concurrent recycling of pointed-to-by-leaf-page TIDs.
And so as a general rule nbtree VACUUM needs a full
super-exclusive/cleanup lock, just in case there is a plain index scan
that uses some other kind of snapshot (logical replication, say).

To say the same thing another way: nbtree follows "the third rule"
described by "62.4. Index Locking Considerations" in the docs [1], but
GiST does not. The idea that GiST's behavior is okay here does seem
consistent with what the same docs go on to say about it: "When using
an MVCC-compliant snapshot, there is no problem because the new
occupant of the slot is certain to be too new to pass the snapshot
test".

But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.

The (admittedly undocumented) reason why we can never drop the leaf
page pin for an index-only scan in nbtree (but can do so for plain
index scans) also relates to heap interlocking -- but with a twist.
Here's the twist: the second heap pass by VACUUM can set visibility
map bits independently of the first (once LP_DEAD items from the first
pass over the heap are set to LP_UNUSED, which renders the page
all-visible) -- this all happens at the end of
lazy_vacuum_heap_page(). That's why index-only scans can't just assume
that VACUUM won't have deleted the TID from the leaf page they're
scanning immediately after they're done reading it. VACUUM could even
manage to set the visibility map bit for a relevant heap page inside
lazy_vacuum_heap_page(), before the index-only scan can read the
visibility map. If that is allowed to happen, the index-only scan would
give wrong answers if one of the TID references held in local memory
by the index-only scan happens to be marked LP_UNUSED inside
lazy_vacuum_heap_page(). IOW, it looks like we run the risk of a
concurrently recycled dead-to-everybody TID becoming visible during
GiST index-only scans, just because we have no interlock.
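
A toy model can make the ordering problem concrete. The sketch below is
an illustration under simplifying assumptions, not PostgreSQL code: the
names ToyTable, vacuum() and index_only_scan() are hypothetical, one
heap page and one leaf page stand in for the whole relation, and every
LP_NORMAL item is assumed to be visible to all snapshots.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # Heap line pointers for one heap page: offset -> "NORMAL" | "DEAD" | "UNUSED"
    line_pointers: dict
    # Visibility map bit for that heap page
    all_visible: bool = False
    # TIDs (heap offsets) currently stored on one index leaf page
    leaf_tids: set = field(default_factory=set)


def vacuum(table):
    """Toy VACUUM: index vacuuming first, then the second heap pass."""
    dead = {off for off, st in table.line_pointers.items() if st == "DEAD"}
    table.leaf_tids -= dead                   # remove index tuples
    for off in dead:
        table.line_pointers[off] = "UNUSED"   # LP_DEAD -> LP_UNUSED
    # No dead items remain, so the page is now all-visible in this model.
    table.all_visible = all(st != "DEAD" for st in table.line_pointers.values())


def index_only_scan(table, vacuum_runs_between):
    """Copy TIDs from the leaf page, then consult the VM with no interlock."""
    tids_in_local_memory = sorted(table.leaf_tids)  # leaf read, pin dropped
    if vacuum_runs_between:
        vacuum(table)                               # VACUUM wins the race
    returned = []
    for off in tids_in_local_memory:
        if table.all_visible:
            returned.append(off)                    # trusted, no heap fetch
        elif table.line_pointers[off] == "NORMAL":
            returned.append(off)                    # heap fetch found a live tuple
    return returned
```

Starting from line pointers {1: "NORMAL", 2: "DEAD"} with both TIDs on
the leaf page, the scan returns only offset 1 when VACUUM stays out of
the way, but returns offsets 1 and 2 when VACUUM runs between the leaf
page read and the visibility map check.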

In summary:

IIUC this is only safe for nbtree because 1.) It acquires a
super-exclusive lock when vacuuming leaf pages, and 2.) Index-only
scans never drop their pin on the leaf page when accessing the
visibility map "in-sync" with the scan (of course we hope not to
access the heap proper at all for index-only scans). These precautions
are both necessary to make the race condition I describe impossible,
because they ensure that VACUUM cannot reach lazy_vacuum_heap_page()
until after our index-only scan reads the visibility map (and then has
to read the heap for at least that one dead-to-all TID, discovering
that the TID is dead to its snapshot). Why wouldn't GiST need to take
the same precautions, though?
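
The way the two precautions combine can be sketched with a buffer that
tracks nothing but its pin count. This is a minimal model, assuming
that a cleanup lock is obtainable only by the holder of the sole pin;
ToyBuffer and vacuum_can_process_leaf() are hypothetical names, and
the try-style check only loosely mirrors
ConditionalLockBufferForCleanup().

```python
class ToyBuffer:
    """Models only the pin count of one shared buffer (an index leaf page)."""

    def __init__(self):
        self.pins = 0

    def pin(self):
        self.pins += 1

    def unpin(self):
        assert self.pins > 0
        self.pins -= 1

    def try_cleanup_lock(self):
        # A cleanup (super-exclusive) lock needs the caller to hold the only pin.
        return self.pins == 1


def vacuum_can_process_leaf(buf):
    """Toy VACUUM step: pin the leaf page, then ask for a cleanup lock."""
    buf.pin()                       # VACUUM's own pin
    ok = buf.try_cleanup_lock()     # fails while any scan still holds a pin
    buf.unpin()
    return ok
```

While an index-only scan keeps its pin, vacuum_can_process_leaf()
returns False, so VACUUM cannot get past this leaf page and therefore
cannot reach the second heap pass. Real nbtree VACUUM waits with
LockBufferForCleanup() rather than giving up; the conditional form is
used here only to keep the sketch non-blocking.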

[1] https://www.postgresql.org/docs/devel/index-locking.html
--
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

04 November 2021, 15:58:29

On Thu, Nov 4, 2021 at 8:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Let's enumerate steps how things can go wrong.
>
> Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on index_page1 pointing to heap_tuple1 on page1
>
> Backend2: Remove index_tuple1 and heap_tuple1
>
> Backend3: Mark page1 all-visible
> Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
>
> To avoid this Backend1 must hold pin on index_page1 until it's done with checking visibility, and Backend2 must do
> LockBufferForCleanup(index_page1). Do I get things right?

Almost. Backend3 is actually Backend2 here (there is no 3) -- it runs
VACUUM throughout.

Note that it's not particularly likely that Backend2/VACUUM will "win"
this race, because it typically has to do much more work than
Backend1. It has to actually remove the index tuples from the leaf
page, then any other index work (for this and other indexes). Then it
has to arrive back in vacuumlazy.c to set the VM bit in
lazy_vacuum_heap_page(). That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?

-- 
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Andrey Borodin

Date:

05 November 2021, 10:26:09

> On 4 Nov 2021, at 20:58, Peter Geoghegan <pg@bowt.ie> wrote:
> That's a pretty unlikely scenario. And even
> if it happened it would only happen once (until the next time we get
> unlucky). What are the chances of somebody noticing a more or less
> once-off, slightly wrong answer?

I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.

Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer
too?

Best regards, Andrey Borodin.

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

01 December 2021, 01:09:14

On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 4 Nov 2021, at 20:58, Peter Geoghegan <pg@bowt.ie> wrote:
> > That's a pretty unlikely scenario. And even
> > if it happened it would only happen once (until the next time we get
> > unlucky). What are the chances of somebody noticing a more or less
> > once-off, slightly wrong answer?
>
> I'd say next to impossible, yet not impossible. Or, perhaps, I do not see protection from this.

I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.

> Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU it should take cleanup lock on buffer
> too?

No, because there is no heap vacuuming involved (because that doesn't
happen outside vacuumlazy.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work, and index-only scans. Making LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.
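
The distinction can be sketched as a small transition table for heap
line pointers. This is a deliberately simplified model (it ignores
heap-only tuples, among other details), and the actor names and helper
functions are hypothetical labels chosen for illustration.

```python
from enum import Enum


class LP(Enum):
    NORMAL = "LP_NORMAL"
    DEAD = "LP_DEAD"
    UNUSED = "LP_UNUSED"


# For each actor, the heap line pointer transitions it may perform.
HEAP_TRANSITIONS = {
    "prune": {LP.NORMAL: LP.DEAD},
    "vacuum": {LP.NORMAL: LP.DEAD, LP.DEAD: LP.UNUSED},
    "index_microvacuum": {},   # works on index pages only, never on the heap
}


def apply(actor, state):
    """Return the heap line pointer state after the actor has run."""
    return HEAP_TRANSITIONS[actor].get(state, state)


def tid_reusable(state):
    """A TID can be handed to a new tuple only once its item is LP_UNUSED."""
    return state is LP.UNUSED
```

In this model only "vacuum" can ever produce a reusable TID, so a stale
TID held by a scan can be recycled underneath it only while VACUUM is
running.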

I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that there is only one
remaining reason for it today. So the history is unusually complicated. But
AFAICT it's always some kind of "interlock with heapam VACUUM" issue,
with TID recycling, with no protection from our MVCC snapshot. I would
say that that's the "real problem" here, when I get to first principles.

Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.

We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan

Attachment

v1-0001-nbtree-README-Improve-VACUUM-interlock-section.patch

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

01 December 2021, 01:39:19

On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I believe that there have been several historic reasons why we need a
> cleanup lock during nbtree VACUUM, and that there is only one
> remaining reason for it today. So the history is unusually complicated.

Minor correction: we actually also have to worry about plain index
scans that don't use an MVCC snapshot, which is possible within
nbtree. It's quite likely when using logical replication, actually.
See the patch for more.

Like with the index-only scan case, a non-MVCC snapshot + plain nbtree
index scan cannot rely on heap access within the index scan node -- it
won't reliably notice that any newer heap tuples (that are really the
result of concurrent TID recycling) are not actually visible to its
MVCC snapshot -- because there isn't an MVCC snapshot. The only
difference in the index-only scan scenario is that we use the
visibility map (not the heap) -- which is racy in a way that makes
our MVCC snapshot (IOSs always have an MVCC snapshot) an ineffective
protection.

In summary, to be safe against confusion from concurrent TID recycling
during index/index-only scans, we can do either of the following
things:

1. Hold a pin of our leaf page while accessing the heap -- that'll
definitely conflict with the cleanup lock that nbtree VACUUM will
inevitably try to acquire on our leaf page.

OR:

2. Hold an MVCC snapshot, AND do an actual heap page access during the
plain index scan -- do both together.

With approach 2, our plain index scan must determine visibility using
real XIDs (against something like a dirty snapshot), rather than using
a visibility map bit. That is also necessary because the VM might
become invalid or ambiguous, in a way that's clearly not possible when
looking at full heap tuple headers with XIDs -- concurrent recycling
becomes safe if we know that we'll reliably notice it and not give
wrong answers.
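
The two options can be written down as a single predicate. This is only
a restatement of the rule above in code, with hypothetical parameter
names, and option 1 assumes that the index AM's VACUUM really does take
a cleanup lock on each leaf page.

```python
def safe_against_tid_recycling(holds_leaf_pin, has_mvcc_snapshot, checks_heap_tuple):
    """Return True if the scan cannot be confused by concurrent TID recycling."""
    # Option 1: the leaf page pin conflicts with VACUUM's cleanup lock.
    option_1 = holds_leaf_pin
    # Option 2: an MVCC snapshot AND a real heap tuple check, both together.
    option_2 = has_mvcc_snapshot and checks_heap_tuple
    return option_1 or option_2
```

For example, an index-only scan that drops its leaf pin and relies on
the visibility map (False, True, False) comes out unsafe, while a plain
index scan with an MVCC snapshot that drops its pin but always visits
the heap (False, True, True) comes out safe.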

Does that make sense?

-- 
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

09 December 2021, 01:40:37

On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached draft patch attempts to explain things in this area within
> the nbtree README. There is a much shorter comment about it within
> vacuumlazy.c. I am concerned about GiST index-only scans themselves,
> of course, but I discovered this issue when thinking carefully about
> the concurrency rules for VACUUM -- I think it's valuable to formalize
> and justify the general rules that index access methods must follow.

I pushed a commit that described how this works for nbtree, in the README file.

I think that there might be an even more subtle race condition in
nbtree itself, though, during recovery. We no longer do a "pin scan"
during recovery these days (see commits 9f83468b, 3e4b7d87, and
687f2cd7 for full information). I think that it might be necessary to
do that, just for the benefit of index-only scans -- if it's necessary
during original execution, then why not during recovery?

The work to remove "pin scans" was justified by pointing out that we
no longer use various kinds of snapshots during recovery, but it said
nothing about index-only scans, which need the TID recycling interlock
(i.e. need to hold onto a leaf page while accessing the heap in sync)
even with an MVCC snapshot. It's easy to imagine how it might have
been missed: nobody ever documented the general issue with index-only
scans, until now. Commit 2ed5b87f recognized they were unsafe for the
optimization that it added (to avoid blocking VACUUM), but never
explained why they were unsafe.

Going back to doing pin scans during recovery seems deeply
unappealing, especially to fix a totally narrow race condition.

-- 
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

03 December 2024, 19:21:04

On Mon, Dec 2, 2024 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is a refined version of a test case I posted earlier on [2],
> decisively proving that GiST index-only scans are in fact subtly
> broken. Right now it fails, showing a wrong answer to a query. The
> patch adds an isolationtest test case to btree_gist, based on a test
> case of Andres'.

I can confirm that the same bug affects SP-GiST. I modified the
original failing GiST isolation test to make it use SP-GiST instead,
proving what I already strongly suspected.

I have no reason to believe that there are any similar problems in
core index AMs other than GiST and SP-GiST, though. Let's go through
them all now: nbtree already does everything correctly, and all
remaining core index AMs don't support index-only scans *and* don't
support scans that don't just use an MVCC snapshot.

--
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Peter Geoghegan

Date:

28 February, 04:53:19

On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> Just some commit messages + few cleanups.

I'm worried about this:

+These longer pin lifetimes can cause buffer exhaustion with messages like "no
+unpinned buffers available" when the index has many pages that have similar
+ordering; but future work can figure out how to best work that out.

I think that we should have some kind of upper bound on the number of
pins that can be acquired at any one time, in order to completely
avoid these problems. Solving that problem will probably require GiST
expertise that I don't have right now.

--
Peter Geoghegan

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Heikki Linnakangas

Date:

05 March, 12:04:25

On 28/02/2025 03:53, Peter Geoghegan wrote:
> On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
> <michail.nikolaev@gmail.com> wrote:
>> Just some commit messages + few cleanups.
> 
> I'm worried about this:
> 
> +These longer pin lifetimes can cause buffer exhaustion with messages like "no
> +unpinned buffers available" when the index has many pages that have similar
> +ordering; but future work can figure out how to best work that out.
> 
> I think that we should have some kind of upper bound on the number of
> pins that can be acquired at any one time, in order to completely
> avoid these problems. Solving that problem will probably require GiST
> expertise that I don't have right now.

+1. With no limit, it seems pretty easy to hold thousands of buffer pins 
with this.

The index can set IndexScanDesc->xs_recheck to indicate that the quals 
must be rechecked. Perhaps we should have a similar flag to indicate 
that the visibility must be rechecked.

Matthias's earlier patch 
(https://www.postgresql.org/message-id/CAEze2Wg1kbpo_Q1%3D9X68JRsgfkyPCk4T0QN%2BqKz10%2BFVzCAoGA%40mail.gmail.com) 
had a more complicated mechanism to track the pinned buffers. Later 
patch got rid of that, which simplified things a lot. I wonder if we 
need something like that, after all.

Here's a completely different line of attack: Instead of holding buffer 
pins for longer, what if we checked the visibility map earlier? We could 
check the visibility map already when we construct the 
GISTSearchHeapItem, and set a flag in IndexScanDesc to tell 
IndexOnlyNext() that we have already done that. IndexOnlyNext() would 
have three cases:

1. The index AM has not checked the visibility map. Check it in 
IndexOnlyNext(), and fetch the tuple if it's not set. This is what it 
always does today.
2. The index AM has checked the visibility map, and the VM bit was set. 
IndexOnlyNext() can skip the VM check and use the tuple directly.
3. The index AM has checked the visibility map, and the VM bit was not 
set. IndexOnlyNext() will fetch the tuple to check its visibility.
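
The three cases can be sketched as follows. This is a rough model, not
the executor code: VMCheck, index_only_next() and the vm_lookup
callback are hypothetical names, and the function only reports whether
a heap fetch would be needed.

```python
from enum import Enum


class VMCheck(Enum):
    UNCHECKED = 0         # case 1: the index AM did not look at the VM
    ALL_VISIBLE = 1       # case 2: the AM looked, and the VM bit was set
    NOT_ALL_VISIBLE = 2   # case 3: the AM looked, and the VM bit was not set


def index_only_next(am_check, vm_lookup):
    """Return True if the heap tuple must be fetched to check visibility.

    vm_lookup is a zero-argument callable that returns the current VM bit
    for the TID's heap page; it is called only in case 1.
    """
    if am_check is VMCheck.UNCHECKED:
        return not vm_lookup()     # today's behavior
    if am_check is VMCheck.ALL_VISIBLE:
        return False               # use the index tuple directly
    return True                    # fetch the heap tuple
```

The point of the design is that in cases 2 and 3 the answer was
determined while the index AM still had the leaf page pinned, so the
executor never consults the visibility map after the pin is gone.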

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Matthias van de Meent

Date:

05 March, 21:19:06

On Wed, 5 Mar 2025 at 10:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 28/02/2025 03:53, Peter Geoghegan wrote:
> > On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
> > <michail.nikolaev@gmail.com> wrote:
> >> Just some commit messages + few cleanups.
> >
> > I'm worried about this:
> >
> > +These longer pin lifetimes can cause buffer exhaustion with messages like "no
> > +unpinned buffers available" when the index has many pages that have similar
> > +ordering; but future work can figure out how to best work that out.
> >
> > I think that we should have some kind of upper bound on the number of
> > pins that can be acquired at any one time, in order to completely
> > avoid these problems. Solving that problem will probably require GiST
> > expertise that I don't have right now.
>
> +1. With no limit, it seems pretty easy to hold thousands of buffer pins
> with this.
>
> The index can set IndexScanDesc->xs_recheck to indicate that the quals
> must be rechecked. Perhaps we should have a similar flag to indicate
> that the visibility must be rechecked.
>
> Matthias's earlier patch
> (https://www.postgresql.org/message-id/CAEze2Wg1kbpo_Q1%3D9X68JRsgfkyPCk4T0QN%2BqKz10%2BFVzCAoGA%40mail.gmail.com)
> had a more complicated mechanism to track the pinned buffers. Later
> patch got rid of that, which simplified things a lot. I wonder if we
> need something like that, after all.

I dropped that because it effectively duplicates the current
per-backend pin tracking system. Adding it back in will probably
complicate matters by a lot again.

> Here's a completely different line of attack: Instead of holding buffer
> pins for longer, what if we checked the visibility map earlier? We could
> check the visibility map already when we construct the
> GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
> IndexOnlyNext() that we have already done that. IndexOnlyNext() would
> have three cases:

I don't like integrating a heap-specific thing like VM_ALL_VISIBLE()
to indexes, but given that IOS code already uses that exact code my
dislike is not to the point of a -1. I'd like it better if we had a
TableAM API for higher-level visibility checks (e.g.
table_tids_could_be_invisible?()) which gives us those responses
instead; dropping the requirement to maintain VM in pg's preferred
format to support efficient IOS.

I am a bit worried about even more random IO happening before we've
returned even a single tuple, but that's probably much less of an
issue than "unlimited pins".

With VM-checking in the index, we would potentially have another
benefit: By checking all tids on the page at once, we can deduplicate
and reduce the VM lookups. The gains might not be all that impressive,
but could be significant in certain hot cases.
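
A minimal sketch of that deduplication, assuming TIDs are (block,
offset) pairs and that vm_lookup(block) is some callback returning the
VM bit for a heap block; both the function name and the callback are
hypothetical.

```python
def check_page_tids(tids, vm_lookup):
    """Map each TID on an index page to its heap block's VM bit.

    tids is a list of (block, offset) pairs. vm_lookup is called at most
    once per distinct heap block, however many TIDs point into it.
    """
    per_block = {}
    result = {}
    for block, offset in tids:
        if block not in per_block:
            per_block[block] = vm_lookup(block)   # one lookup per block
        result[(block, offset)] = per_block[block]
    return result
```

With four TIDs spread over two heap blocks, this performs two lookups
instead of four. How much that matters in practice is untested here.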

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Matthias van de Meent

Date:

08 March, 05:36:31

On Wed, 5 Mar 2025 at 19:19, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 5 Mar 2025 at 10:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 28/02/2025 03:53, Peter Geoghegan wrote:
> > > On Sat, Feb 8, 2025 at 8:47 AM Michail Nikolaev
> > > <michail.nikolaev@gmail.com> wrote:
> > >> Just some commit messages + few cleanups.
> > >
> > > I'm worried about this:
> > >
> > > +These longer pin lifetimes can cause buffer exhaustion with messages like "no
> > > +unpinned buffers available" when the index has many pages that have similar
> > > +ordering; but future work can figure out how to best work that out.
> > >
> > > I think that we should have some kind of upper bound on the number of
> > > pins that can be acquired at any one time, in order to completely
> > > avoid these problems. Solving that problem will probably require GiST
> > > expertise that I don't have right now.
> >
> > +1. With no limit, it seems pretty easy to hold thousands of buffer pins
> > with this.
> >
> > The index can set IndexScanDesc->xs_recheck to indicate that the quals
> > must be rechecked. Perhaps we should have a similar flag to indicate
> > that the visibility must be rechecked.

Added as xs_visrecheck in 0001.

> > Here's a completely different line of attack: Instead of holding buffer
> > pins for longer, what if we checked the visibility map earlier? We could
> > check the visibility map already when we construct the
> > GISTSearchHeapItem, and set a flag in IndexScanDesc to tell
> > IndexOnlyNext() that we have already done that. IndexOnlyNext() would
> > have three cases:
>
> I don't like integrating a heap-specific thing like VM_ALL_VISIBLE()
> to indexes, but given that IOS code already uses that exact code my
> dislike is not to the point of a -1. I'd like it better if we had a
> TableAM API for higher-level visibility checks (e.g.
> table_tids_could_be_invisible?()) which gives us those responses
> instead; dropping the requirement to maintain VM in pg's preferred
> format to support efficient IOS.

Here's a patchset that uses that approach. Naming of functions, types,
fields and arguments TBD. The patch works and passes the new
VACUUM-conflict tests, though I suspect the SP-GIST tests to have
bugs, as an intermediate version of my 0003 patch didn't trigger the
tests to fail, even though it did not hold a pin on (all) sorted
items' data when it was being checked for visibility and/or returned
from the scan.

Patch 0001 details the important changes, while 0002/0003 use this new
API to make GIST and SP-GIST's IOS work correctly when concurrent
VACUUM is/was running.
0004 is the existing patch with tests (v8-0001).

> With VM-checking in the index, we would potentially have another
> benefit: By checking all tids on the page at once, we can deduplicate
> and reduce the VM lookups. The gains might not be all that impressive,
> but could be significant in certain hot cases.

That is also included in this, but any performance impact hasn't been
tested nor validated.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Michail Nikolaev

Date:

09 March, 15:44:00

Hello, Matthias!

> though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail

It all fails on master - could you please detail what is "intermediate" in that case? Also, I think it is a good idea to add the same type of test to btree.

> * XXX: In the future we should probably reorder these operations so
> * we can apply the checks in block order, rather than index order.

I think it is already done in your patch, no?

Should we then use that mechanism for btree as well? It seems to be straightforward and non-invasive. In that case, "Unchecked" goes away, and it is each AM's responsibility to call the check while holding the pin.

Best regards,

Mikhail.

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

vignesh C

Date:

16 March, 15:58:17

On Sat, 8 Mar 2025 at 08:06, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> Here's a patchset that uses that approach. Naming of functions, types,
> fields and arguments TBD. The patch works and passes the new
> VACUUM-conflict tests, though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail, even though it did not hold a pin on (all) sorted
> items' data when it was being checked for visibility and/or returned
> from the scan.
>
> Patch 0001 details the important changes, while 0002/0003 use this new
> API to make GIST and SP-GIST's IOS work correctly when concurrent
> VACUUM is/was running.
> 0004 is the existing patch with tests (v8-0001).

I noticed that Mikhail's feedback from [1] is not yet addressed. I
have changed the status of the commitfest entry to Waiting on Author,
kindly address them and update the status to Needs review.
[1] - https://www.postgresql.org/message-id/CANtu0ojz0apXnVia0reTL28eL2=__ev8aLsiH=1XfD_Z3dnkTw@mail.gmail.com

Regards,
Vignesh

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

From

Matthias van de Meent

Date:

21 March, 19:14:12

On Sun, 16 Mar 2025 at 13:58, vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 8 Mar 2025 at 08:06, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > Here's a patchset that uses that approach. Naming of functions, types,
> > fields and arguments TBD. The patch works and passes the new
> > VACUUM-conflict tests, though I suspect the SP-GIST tests to have
> > bugs, as an intermediate version of my 0003 patch didn't trigger the
> > tests to fail, even though it did not hold a pin on (all) sorted
> > items' data when it was being checked for visibility and/or returned
> > from the scan.
> >
> > Patch 0001 details the important changes, while 0002/0003 use this new
> > API to make GIST and SP-GIST's IOS work correctly when concurrent
> > VACUUM is/was running.
> > 0004 is the existing patch with tests (v8-0001).
>
> I noticed that Mikhail's feedback from [1] is not yet addressed. I
> have changed the status of the commitfest entry to Waiting on Author,
> kindly address them and update the status to Needs review.
> [1] - https://www.postgresql.org/message-id/CANtu0ojz0apXnVia0reTL28eL2=__ev8aLsiH=1XfD_Z3dnkTw@mail.gmail.com

While there has indeed been some feedback, so far I've been looking
for architectural feedback about how the bug would be solved, not per
se the names of variables, or the exact details of the comments on the
new code: I usually prefer to hold off on polishing my patches until after
we've made sure it doesn't need a full rewrite due to architectural
issues (like what happened in the previous two iterations).

Attached is v10, which polishes the previous patches, and adds a patch
for nbtree to use the new visibility checking strategy so that it too
can release its index pages much earlier, and adds a similar
visibility check test to nbtree.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)