Re: Page at a time index scan - Mailing list pgsql-patches

From Tom Lane
Subject Re: Page at a time index scan
Date
Msg-id 25829.1146846553@sss.pgh.pa.us
Whole thread Raw
In response to Re: Page at a time index scan  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Page at a time index scan  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
I've just realized that the locking considerations in this patch are
considerably more complicated than we thought.  The discussion so far
considered only half of what the deletion interlock needs to accomplish.
We have to ensure that indexscans don't lose their place, which is
solved in the patch by stopping between pages rather than on a specific
item.  But we also have to ensure that indexscans don't return pointers
to heap tuples that have already been deleted by VACUUM --- at least in
the case where the caller is using a non-MVCC snapshot.  Else, if the
tuple is replaced by a newer tuple before the caller is able to visit
it, the caller might mistakenly accept that tuple as the one it wanted.

The problematic scenario looks like this:

1. indexscan slurps a bunch of heap TIDs into local memory.  It keeps
pin, but not lock, on the index page it got the TIDs from.

2. someone else splits the index page, which is allowed since the page
is only pinned not locked.  Now, some of the index items are on a new
page that the indexscan does not have pinned.

3. vacuum comes along and removes some of the moved items.  Since they
are on a page that no one has pinned, even the super-exclusive lock
that vacuum takes won't prevent this.

4. vacuum removes the corresponding heap tuples.

5. someone else installs new, unrelated, tuples into the freed heap
slots.

6. indexscan finally visits the heap.  It finds tuples that are valid
per its snapshot, but aren't what it thought it was finding (they don't
necessarily match the index key).

While we could prevent this by rechecking whether fetched heap tuples
match the index search condition, that costs an awful lot of cycles
for such an obviously rare problem.  We need a locking-based solution
instead.

I believe an appropriate fix looks like this:

1. Indexscan is required to keep pin on the page it read the items
from (even though we realize they may not still be on that page).
The pin can only be dropped when we are no longer going to return any
more items read from the page.

2. btbulkdelete is required to get super-exclusive lock on *every leaf
page in the index*.  It must not try to optimize this down to just the
pages containing deletable items.

With these rules, even though vacuum might have physically deleted the
index item that the indexscan is reporting to its caller, the
btbulkdelete call will not have been able to complete.  (If bulkdelete
arrived at the original page before the indexscan did, then of course
it'd already have deleted the item and there's no problem.  If it
arrives after, then it'll be blocked by not being able to get
super-exclusive lock.)  Hence vacuum will not have removed the heap
tuple yet, even though the index item might be gone.

We could improve concurrency if we extend the API so that btgettuple
knows whether it is fetching for an MVCC-safe snapshot or not.  When the
scan is using an MVCC-safe snapshot it's OK to release pin immediately.
However I'm not sure if this is worth the trouble --- it probably makes
sense to hold onto the pin anyway, in case we're told to mark some of
the tuples killed.

The patch as given gets point 2 right, but I think it doesn't
consistently do point 1, and in any case it's certainly not documenting
the issue properly.

BTW, I just realized another bug in the patch: btbulkdelete fails to
guarantee that it visits every page in the index.  It was OK for
btvacuumcleanup to ignore pages added to the index after it starts,
but btbulkdelete has to deal with such pages.

One thing I'm kind of wondering is whether amgetmulti should still exist
as a separate API.  Seems like if amgettuple is doing the equivalent
thing under-the-hood, then amgetmulti isn't saving us anything except a
few trips through FunctionCallN.  It's not at all clear that it's worth
the API complexity of having two different fetch functions.

            regards, tom lane

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Have configure complain about unknown options
Next
From: "Joshua D. Drake"
Date:
Subject: Re: [BUGS] BUG #2401: spinlocks not available on amd64