Re: We need index-only scans - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: We need index-only scans
Date
Msg-id 1289569462-sup-9948@alvh.no-ip.org
Whole thread Raw
In response to Re: We need index-only scans  (Greg Stark <gsstark@mit.edu>)
Responses Re: We need index-only scans
List pgsql-hackers
Excerpts from Greg Stark's message of vie nov 12 10:33:28 -0300 2010:

> In Postgres, aside from the visibility issues we have a separate
> problem. In order to achieve high concurrency we allow splits to occur
> without locking the index. And the new pages can be found anywhere in
> the index, even to the left of the existing page. So a sequential scan
> could miss some data if the page it's on is split and some of the data
> is moved to be to the left of where our scan is.

Eh?  If a transaction splits a page and put some new tuples to the
"left" of another transaction's scan (assuming a forward scan), then
necessarily the second transaction cannot see those tuples anyway -- the
inserter must have done the split after you got your snapshot.

A transaction cannot split a page that other transaction has a scan
stopped in, because there are buffer locks involved.  Therefore the scan
cannot miss any tuples.  Saith src/backend/access/nbtree/README:


Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared.  Postgres shares in-memory buffers
among backends.  As a result, we do page-level read locking on btree
pages in order to guarantee that no record is modified while we are
examining it.  This reduces concurrency but guaranteees correct
behavior.  An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.

We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups.  A scan in the forward
direction is no problem, we just use the right-sibling pointers that
L&Y require anyway.  (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.)  To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling".  (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link.  This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.)  A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it.  So, we have to
move right until we find a page whose right-link matches the page we
came from.  (Actually, it's even harder than that; see deletion discussion
below.)

Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage.  The heap tuple IDs are then processed while
not holding any page lock within the index.  We do continue to hold a pin
on the leaf page, to protect against concurrent deletions (see below).
In this state the scan is effectively stopped "between" pages, either
before or after the page it has pinned.  This is safe in the presence of
concurrent insertions and even page splits, because items are never moved
across pre-existing page boundaries --- so the scan cannot miss any items
it should have seen, nor accidentally return the same item twice.  The scan
must remember the page's right-link at the time it was scanned, since that
is the page to move right to; if we move right to the current right-link
then we'd re-scan any items moved by a page split.  We don't similarly
remember the left-link, since it's best to use the most up-to-date
left-link when trying to move left (see detailed move-left algorithm
below).


-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: multi-platform, multi-locale regression tests
Next
From: Heikki Linnakangas
Date:
Subject: Re: We need index-only scans