Re: [RFC] Minmax indexes - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: [RFC] Minmax indexes |
Date | |
Msg-id | 20130719053957.GD4139@eldon.alvh.no-ip.org Whole thread Raw |
In response to | [RFC] Minmax indexes (Alvaro Herrera <alvherre@2ndquadrant.com>) |
List | pgsql-hackers |
Alvaro Herrera wrote: > This is a preliminary proposal for Minmax indexes. I'm experimenting > with the code, but it's too crude to post yet, so here's a document > explaining what they are and how they work, so that reviewers can poke > holes to have the design improved. After going further with the implementation, I have added a new concept, the reverse range map. Reverse Range Map ----------------- The reverse range map is a separate fork each Minmax index has. Its purpose is to let a way to quickly find the index tuple corresponding to a page range; for a given heap page number, there's an O(1) way to obtain the TID of the corresponding index tuple. To scan the index, we first obtain the TID of index tuple for page 0. If this returns a valid TID, we read that tuple to determine the min/max bounds for this page range. If it returns invalid, then the range is unsummarized, and the scan must return the whole range as needing scan. After this index entry has been processed, we obtain the TID of the index tuple for page 0+pagesPerRange (currently this is a compile-time constant, but there's no reason this cannot be a per-index property). Continue adding pagesPerRange until we reach the end of the heap. To read the TID during an index scan, we follow this protocol: * read revmap page * obtain share lock on the buffer * read the TID * LockTuple that TID (using the index as relation). A shared lock is sufficient. We need the LockTuple to prevent VACUUMfrom recycling the index tuple; see below. * release buffer lock * read the index tuple * release the tuple lock On heap insert/update, it is normally cheaper to update the summary tuple (grab the old tuple, expand the min/max range according to the new value being inserted, insert the new index tuple, update the reverse range map) than letting the range be unsummarized, which would require scanning the pages in the range. [However, when many updates on a range are going to be done, it'd be better to mark it as unsummarized (i.e. set the revmap TID to Invalid) and do a resummarization later, to prevent the index from bloating. Also, it'd be sensible to allow postponing of the index update, if many tuples are inserted; something like GIN's "pending list". We would need to keep track the TIDs of the inserted heap tuples, so that we can figure out the new min/max values, without having to scan the whole range.] To update the summary tuple for a page range, we use this protocol: * insert a new index tuple anywhere; note its TID * read revmap page * obtain exclusive lock on buffer * write the TID * release lock This ensures no concurrent reader can obtain a partially-written TID. Note we don't need a tuple lock here. Concurrent scans don't have to worry about whether they got the old or new index tuple: if they get the old one, the tighter values are okay from a correctness standpoint because due to MVCC they can't possibly see the just-inserted heap tuples anyway. For vacuuming, we need to figure out which index tuples are no longer referenced from the reverse range map. This requires some brute force, but is simple: 1) scan the complete index, store each existing TIDs in a dynahash. Hash key is the TID, hash value is a boolean initiallyset to false. 2) scan the complete revmap sequentially, read the TIDs on each page. Share lock on each page is sufficient. For eachTID so obtained, grab the element from the hash and update the boolean to true. 3) Scan the index again; for each tuple found, search the hash table. If the tuple is not present, it must have been addedafter our initial scan; ignore it. If the hash flag is true, then the tuple is referenced; ignore it. If the hashflag is false, then the index tuple is no longer referenced by the revmap; but they could be about to be accessed bya concurrent scan. Do ConditionalLockTuple. If this fails, ignore the tuple, it will be deleted by a future vacuum. If lock is acquired, then we can safely remove the index tuple. 4) Index pages with free space can be detected by this second scan. Register those with the FSM. Note this doesn't require scanning the heap at all, or being involved in the heap's cleanup procedure. (In particular, tables with only minmax indexes do not require the removed tuples' TIDs to be collected during the heap cleanup pass.) XXX I think this is wrong in detail; we probably need to be using LockBufferForCleanup. With the reverse range map it is very easy to see which page ranges in the heap need resummarization; it's the ones marked with InvalidTid. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: