Re: We need index-only scans - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: We need index-only scans |
Date | |
Msg-id | 1289569462-sup-9948@alvh.no-ip.org Whole thread Raw |
In response to | Re: We need index-only scans (Greg Stark <gsstark@mit.edu>) |
Responses |
Re: We need index-only scans
|
List | pgsql-hackers |
Excerpts from Greg Stark's message of vie nov 12 10:33:28 -0300 2010: > In Postgres, aside from the visibility issues we have a separate > problem. In order to achieve high concurrency we allow splits to occur > without locking the index. And the new pages can be found anywhere in > the index, even to the left of the existing page. So a sequential scan > could miss some data if the page it's on is split and some of the data > is moved to be to the left of where our scan is. Eh? If a transaction splits a page and put some new tuples to the "left" of another transaction's scan (assuming a forward scan), then necessarily the second transaction cannot see those tuples anyway -- the inserter must have done the split after you got your snapshot. A transaction cannot split a page that other transaction has a scan stopped in, because there are buffer locks involved. Therefore the scan cannot miss any tuples. Saith src/backend/access/nbtree/README: Lehman and Yao don't require read locks, but assume that in-memory copies of tree pages are unshared. Postgres shares in-memory buffers among backends. As a result, we do page-level read locking on btree pages in order to guarantee that no record is modified while we are examining it. This reduces concurrency but guaranteees correct behavior. An advantage is that when trading in a read lock for a write lock, we need not re-read the page after getting the write lock. Since we're also holding a pin on the shared buffer containing the page, we know that buffer still contains the page and is up-to-date. We support the notion of an ordered "scan" of an index as well as insertions, deletions, and simple lookups. A scan in the forward direction is no problem, we just use the right-sibling pointers that L&Y require anyway. (Thus, once we have descended the tree to the correct start point for the scan, the scan looks only at leaf pages and never at higher tree levels.) To support scans in the backward direction, we also store a "left sibling" link much like the "right sibling". (This adds an extra step to the L&Y split algorithm: while holding the write lock on the page being split, we also lock its former right sibling to update that page's left-link. This is safe since no writer of that page can be interested in acquiring a write lock on our page.) A backwards scan has one additional bit of complexity: after following the left-link we must account for the possibility that the left sibling page got split before we could read it. So, we have to move right until we find a page whose right-link matches the page we came from. (Actually, it's even harder than that; see deletion discussion below.) Page read locks are held only for as long as a scan is examining a page. To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once, copying their heap tuple IDs into backend-local storage. The heap tuple IDs are then processed while not holding any page lock within the index. We do continue to hold a pin on the leaf page, to protect against concurrent deletions (see below). In this state the scan is effectively stopped "between" pages, either before or after the page it has pinned. This is safe in the presence of concurrent insertions and even page splits, because items are never moved across pre-existing page boundaries --- so the scan cannot miss any items it should have seen, nor accidentally return the same item twice. The scan must remember the page's right-link at the time it was scanned, since that is the page to move right to; if we move right to the current right-link then we'd re-scan any items moved by a page split. We don't similarly remember the left-link, since it's best to use the most up-to-date left-link when trying to move left (see detailed move-left algorithm below). -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
pgsql-hackers by date: