Re: Minmax indexes - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: Minmax indexes |
Date | |
Msg-id | 20131108203602.GW5809@eldon.alvh.no-ip.org Whole thread Raw |
In response to | Re: Minmax indexes (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Minmax indexes
|
List | pgsql-hackers |
Robert Haas escribió: > On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Here's an updated version of this patch, with fixes to all the bugs > > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > > Amit Kapila for the reports. > > I'm not very happy with the use of a separate relation fork for > storing this data. I have been playing with having the revmap in the main fork of the index rather than a separate one. On the surface many things stay just what they are; I only had to add a layer beneath the revmap that maps its logical block numbers to physical block numbers. The problem with this is that it needs more disk access, because revmap block numbers cannot be hardcoded. After doing some quick math, what I ended up with was to keep an array of BlockNumbers in the metapage. Each element in this array points to array pages; each array page is, in turn, filled with more BlockNumbers, which this time correspond to the logical revmap pages we used to have in the revmap fork. (I initially feared that this design would not allow me to address enough revmap pages for the largest of tables; but fortunately this is sufficient unless you configure very small pages, say BLCKSZ 2kB, use small page ranges, and use small datatypes, say "char". I have no problem with saying that that scenario is not supported if you want to have minmax indexes on 32 TB tables. I mean, who uses BLCKSZ smaller than 8kB anyway?). The advantage of this design is that in order to find any particular logical revmap page, you always have to do a constant number of page accesses. You read the metapage, then read the array page, then read the revmap page; done. Another idea I considered was chaining revmap pages (so each would have a pointer-to-next), or chaining array pages; but this would have meant that to locate an individual page to the end of the revmap, you might need to do many accesses. Not good. As an optimization for relatively small indexes, we hardcode the page number for the first revmap page: it's always the page right after the metapage (so BlockNumber 1). A revmap page can store, with the default page size, about 1350 item pointers; so with an index built for page ranges of 1000 pages per range, you can point to enough index entries for a ~10 GB table without having the need to examine the first array page. This seems pretty acceptable; people with larger tables can likely spare one extra page accessed every now and then. (For comparison, each regular minmax page can store about 500 index tuples, if it's built for a single 4-byte column; this means that the 10 GB table requires a 5-page index.) This is not complete yet; although I have a proof-of-concept working, I still need to write XLog support code and update the pageinspect code to match. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: