Think I see a btree vacuuming bug - Mailing list pgsql-hackers

From Tom Lane
Subject Think I see a btree vacuuming bug
Msg-id 27838.1022350912@sss.pgh.pa.us
Responses Re: Think I see a btree vacuuming bug  (Manfred Koizar <mkoi-pg@aon.at>)
Re: Think I see a btree vacuuming bug  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers

If a VACUUM running concurrently with someone else's indexscan were to
delete the index tuple that the indexscan is currently stopped on, then
we'd get a failure when the indexscan resumes and tries to re-find its
place.  (This is the infamous "my bits moved right off the end of the
world" error condition.)  What is supposed to prevent that from
happening is that the indexscan retains a buffer pin (but not a read
lock) on the index page containing the tuple it's stopped on.  VACUUM
will not delete any tuple until it can get a "super exclusive" lock on
the page (cf. LockBufferForCleanup), and the pin prevents it from doing
so.
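
The pin-versus-cleanup-lock rule can be sketched with a toy model. This is not PostgreSQL code: `Page`, `pin`, `unpin`, `try_cleanup_lock` and `vacuum_delete` are hypothetical names, and the model returns a failure where a real VACUUM would wait. The only thing it is meant to show is that a "super exclusive" lock is unavailable while anyone else holds a pin.

```python
# Toy model only; none of these names are PostgreSQL APIs.
class Page:
    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.pins = 0

    def pin(self):
        self.pins += 1

    def unpin(self):
        assert self.pins > 0
        self.pins -= 1

    def try_cleanup_lock(self):
        # The caller is assumed to hold one pin itself, so the lock is
        # available only when that pin is the sole pin on the page.
        return self.pins == 1


def vacuum_delete(page, dead):
    """Delete `dead` from `page` if a cleanup lock can be had; report success."""
    page.pin()                      # vacuum's own pin
    try:
        if not page.try_cleanup_lock():
            return False            # a real VACUUM would wait here instead
        page.tuples.remove(dead)
        return True
    finally:
        page.unpin()


leaf = Page(["a", "b", "c"])
leaf.pin()                          # indexscan stopped on "b" keeps its pin
blocked = vacuum_delete(leaf, "b")  # the scan's pin blocks the cleanup lock
leaf.unpin()                        # the scan moves on
deleted = vacuum_delete(leaf, "b")  # now the deletion can proceed
```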

However: suppose that some other activity causes the index page to be
split while the indexscan is stopped, and that the tuple it's stopped
on gets relocated into the new righthand page of the pair.  Then the
indexscan is holding a pin on the wrong page --- not the one its tuple
is in.  It would then be possible for the VACUUM to arrive at the tuple
and delete it before the indexscan is resumed.
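
A minimal sketch of that failure, again as a toy model with hypothetical names (`Page`, `split`, `can_cleanup`) rather than anything from the nbtree code. It assumes a split moves the upper half of the tuples to a new right sibling and that a cleanup lock is possible whenever no scan holds a pin on the page.

```python
# Toy model only; none of these names are PostgreSQL APIs.
class Page:
    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.pins = 0
        self.right = None


def split(page):
    """Move the upper half of the tuples to a new right sibling."""
    mid = len(page.tuples) // 2
    new_right = Page(page.tuples[mid:])
    page.tuples = page.tuples[:mid]
    new_right.right, page.right = page.right, new_right
    return new_right


def can_cleanup(page):
    # Simplification: a cleanup lock is possible when no scan holds a pin.
    return page.pins == 0


left = Page(["a", "b", "c", "d"])
left.pins += 1                 # the scan stops on "c" and pins its page
right = split(left)            # a concurrent split relocates "c" to the right

scan_pin_protects_c = "c" in left.tuples   # the pinned page no longer holds "c"
vacuum_can_reach_c = can_cleanup(right)    # nothing pins the new right page
if vacuum_can_reach_c:
    right.tuples.remove("c")   # vacuum deletes the tuple the scan is stopped on
```

The scan still holds its pin on the left page throughout, which is exactly why the pin no longer helps.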

This is a pretty low-probability scenario, especially given the new
index-tuple-killing mechanism (which renders it less likely that an
indexscan will stop on a vacuum-able tuple).  But it could happen.

The only solution I've thought of is to make btbulkdelete acquire
"super exclusive" lock on *every* leaf page of the index as it scans,
rather than only locking the pages it actually needs to delete something
from.  And we'd need to tweak _bt_restscan to chain its pins (pin the
next page to the right before releasing pin on the previous page).
This would prevent a btbulkdelete scan from overtaking ordinary
indexscans, and thereby ensure that it couldn't arrive at the tuple
on which an indexscan is stopped, even with splitting.
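
The proposed ordering argument can be illustrated the same way. The sketch below is a toy model under stated assumptions, not an implementation of btbulkdelete or _bt_restscan: `bulkdelete` and `step_right` are hypothetical stand-ins, the vacuum pass reports the page it is stuck on where a real VACUUM would wait, and the leaf chain is assumed to be already split.

```python
# Toy model only; none of these names are PostgreSQL APIs.
class Page:
    def __init__(self, tuples, right=None):
        self.tuples = list(tuples)
        self.pins = 0
        self.right = right


def bulkdelete(first_leaf, dead):
    """Visit leaves left to right; return the page vacuum is stuck on, or None.

    Every page needs a cleanup lock, even if it has nothing to delete.
    """
    page = first_leaf
    while page is not None:
        if page.pins > 0:
            return page            # a real VACUUM would wait here
        page.tuples = [t for t in page.tuples if t not in dead]
        page = page.right
    return None


def step_right(page):
    """Chain pins: pin the right sibling before unpinning the current page."""
    nxt = page.right
    nxt.pins += 1
    page.pins -= 1
    return nxt


right = Page(["c", "d"])
left = Page(["a", "b"], right)
left.pins += 1                      # scan stopped on "c" before the split

stuck_on = bulkdelete(left, {"c"})  # blocked at `left`, so "c" is untouched
c_survived = "c" in right.tuples

scan_page = step_right(left)        # scan resumes and re-finds "c" to the right
stuck_again = bulkdelete(left, {"c"})  # passes `left`, then blocks at `right`
```

Because the scan never releases one pin before taking the next, there is no moment at which the vacuum pass can slip past it.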

I'm somewhat concerned that the more stringent locking will slow down
VACUUM a good deal when there's lots of concurrent activity, but I don't
see another answer.  Ideas anyone?
        regards, tom lane

