Re: Minmax indexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Minmax indexes
Date
Msg-id 53A844CD.8030601@vmware.com
Whole thread Raw
In response to Re: Minmax indexes  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: Minmax indexes  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Some comments, aside from the design wrt. bounding boxes etc. :

On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Here's an updated version of this patch, with fixes to all the bugs
>>> reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
>>> Amit Kapila for the reports.
>>
>> I'm not very happy with the use of a separate relation fork for
>> storing this data.
>
> Here's a new version of this patch.  Now the revmap is not stored in a
> separate fork, but together with all the regular data, as explained
> elsewhere in the thread.

Thanks! Please update the README accordingly.

If I understand the code correctly, the revmap is a three-level deep 
structure. The bottom level consists of "regular revmap pages", and each 
regular revmap page is filled with ItemPointerDatas, which point to the 
index tuples. The middle level consists of "array revmap pages", and 
each array revmap page contains an array of BlockNumbers of the "regular 
revmap" pages. The top level is an array of BlockNumbers of the array 
revmap pages, and it is stored in the metapage.

With 8k block size, that's just enough to cover the full range of 2^32-1 
blocks that you'll need if you set mm_pages_per_range=1. Each regular 
revmap page can store about 8192/6 = 1365 item pointers, each array 
revmap page can store about 8192/4 = 2048 block references, and the size 
of the top array is 8192/4. That's just enough; to store the required 
number of array pages in the top array, the array needs to be 
(2^32/1365)/2048)=1536 elements large.

But with 4k or smaller blocks, it's not enough.

I wonder if it would be simpler to just always store the revmap pages in 
the beginning of the index, before any other pages. Finding the revmap 
page would then be just as easy as with a separate fork. When the 
table/index is extended so that a new revmap page is needed, move the 
existing page at that block out of the way. Locking needs some 
consideration, but I think it would be feasible and simpler than you 
have now.

> I have followed the suggestion by Amit to overwrite the index tuple when
> a new heap tuple is inserted, instead of creating a separate index
> tuple.  This saves a lot of index bloat.  This required a new entry
> point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
> function PageIndexDeleteNoCompact which is similar in spirit to
> PageIndexMultiDelete except that item pointers do not change.  This is
> necessary because the revmap stores item pointers, and such reference
> would break if we were to renumber items in index pages.

ISTM that when the old tuple cannot be updated in-place, the new index 
tuple is inserted with mm_doinsert(), but the old tuple is never deleted.

- Heikki

-- 
- Heikki



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: please review source(SQLServer compatible)‏
Next
From: Andres Freund
Date:
Subject: Re: replication identifier format