Re: Minmax indexes - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Minmax indexes
Date
Msg-id 20131108203602.GW5809@eldon.alvh.no-ip.org
Whole thread Raw
In response to Re: Minmax indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Minmax indexes
List pgsql-hackers
Robert Haas escribió:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
> 
> I'm not very happy with the use of a separate relation fork for
> storing this data.

I have been playing with having the revmap in the main fork of the index
rather than a separate one.  On the surface many things stay just what
they are; I only had to add a layer beneath the revmap that maps its
logical block numbers to physical block numbers.  The problem with this
is that it needs more disk access, because revmap block numbers cannot be
hardcoded.

After doing some quick math, what I ended up with was to keep an array
of BlockNumbers in the metapage.  Each element in this array points to
array pages; each array page is, in turn, filled with more BlockNumbers,
which this time correspond to the logical revmap pages we used to have
in the revmap fork.  (I initially feared that this design would not
allow me to address enough revmap pages for the largest of tables; but
fortunately this is sufficient unless you configure very small pages,
say BLCKSZ 2kB, use small page ranges, and use small datatypes, say
"char".  I have no problem with saying that that scenario is not
supported if you want to have minmax indexes on 32 TB tables.  I mean,
who uses BLCKSZ smaller than 8kB anyway?).

The advantage of this design is that in order to find any particular
logical revmap page, you always have to do a constant number of page
accesses.  You read the metapage, then read the array page, then read
the revmap page; done.  Another idea I considered was chaining revmap
pages (so each would have a pointer-to-next), or chaining array pages;
but this would have meant that to locate an individual page to the end
of the revmap, you might need to do many accesses.  Not good.

As an optimization for relatively small indexes, we hardcode the page
number for the first revmap page: it's always the page right after the
metapage (so BlockNumber 1).  A revmap page can store, with the default
page size, about 1350 item pointers; so with an index built for page
ranges of 1000 pages per range, you can point to enough index entries
for a ~10 GB table without having the need to examine the first array
page.  This seems pretty acceptable; people with larger tables can
likely spare one extra page accessed every now and then.
(For comparison, each regular minmax page can store about 500 index
tuples, if it's built for a single 4-byte column; this means that the 10
GB table requires a 5-page index.)

This is not complete yet; although I have a proof-of-concept working, I
still need to write XLog support code and update the pageinspect code to
match.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Gin page deletion bug
Next
From: Alvaro Herrera
Date:
Subject: Re: Minmax indexes