Re: Minmax indexes - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Minmax indexes
Date
Msg-id 20140623170751.GA5032@eldon.alvh.no-ip.org
In response to Re: Minmax indexes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Minmax indexes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
Heikki Linnakangas wrote:
> Some comments, aside from the design wrt. bounding boxes etc. :

Thanks.  I haven't commented on that sub-thread because I think it's
possible to come up with a reasonable design that solves the issue by
adding a couple of amprocs.  I need to do some more thinking to ensure
it is really workable, and then I'll post my ideas.

> On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> >Robert Haas wrote:

> If I understand the code correctly, the revmap is a three-level deep
> structure. The bottom level consists of "regular revmap pages", and
> each regular revmap page is filled with ItemPointerDatas, which
> point to the index tuples. The middle level consists of "array
> revmap pages", and each array revmap page contains an array of
> BlockNumbers of the "regular revmap" pages. The top level is an
> array of BlockNumbers of the array revmap pages, and it is stored in
> the metapage.

Yep, that's correct.  Essentially, we still have the revmap as a linear
space (containing TIDs); the other two levels on top of that are only
there to enable locating the physical page numbers for each revmap
logical page.  The one exception is that the first logical revmap page
is always stored in page 1, to optimize the case of a smallish table
(~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per
range, or 170 megabytes at 16 pages per range.)

Each page has a page header (24 bytes) and special space (4 bytes),
leaving 8192-28 = 8164 bytes available for data, enough for 1360 item
pointers.
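The addressing through the linear logical revmap space can be sketched
roughly as below.  This is an illustration only, not the patch's actual
code; the function and constant names are made up, and the 1360-items-
per-page figure assumes the 8kB layout described above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative constant: an 8192-byte page minus the 24-byte page
 * header and 4-byte special space leaves 8164 bytes, enough for
 * 1360 six-byte item pointers.
 */
#define REVMAP_PAGE_MAXITEMS 1360

/*
 * Given a heap block number and the pages_per_range setting, compute
 * the logical revmap page and the slot within it that holds the TID
 * of the summary index tuple for that block's range.  (The two upper
 * levels then map the logical page number to a physical block.)
 */
static void
revmap_locate(uint32_t heap_blk, uint32_t pages_per_range,
              uint32_t *logical_page, uint32_t *offset)
{
    uint32_t range_no = heap_blk / pages_per_range;

    *logical_page = range_no / REVMAP_PAGE_MAXITEMS;
    *offset = range_no % REVMAP_PAGE_MAXITEMS;
}
```

With pages_per_range=128, the first 1360 ranges (about 1.3 GB of heap)
all land on logical revmap page 0, which is why pinning that page to
block 1 covers small tables without touching the upper levels.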

> With 8k block size, that's just enough to cover the full range of
> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
> regular revmap page can store about 8192/6 = 1365 item pointers,
> each array revmap page can store about 8192/4 = 2048 block
> references, and the size of the top array is 8192/4. That's just
> enough; to store the required number of array pages in the top
> array, the array needs to be (2^32/1365)/2048 = 1536 elements large.
> 
> But with 4k or smaller blocks, it's not enough.

Yeah.  As I said elsewhere, actually useful values are likely to be
close to the read-ahead setting of the underlying disk; by default
that'd be 16 pages (128kB), but it's common advice to increase the
kernel setting to improve performance.  I don't think we need to
prevent minmax indexes with pages_per_range=1, but I don't think we
need to ensure that that setting works with the largest tables either,
because it doesn't make any sense to set it up like that.

Also, while there are some recommendations to set up a system with a
larger page size (32kB), I have never seen any recommendation to set it
lower.  It wouldn't make sense to build a system with very large tables
and a smaller page size.

So in other words, yes, you're correct that the mechanism doesn't work
in some cases (small page size and index configured for highest level of
detail), but the conditions are such that I don't think it matters.

ISTM the thing to do here is to do the math at index creation time, and
if we find that we don't have enough space in the metapage for all array
revmap pointers we need, bail out and require the index to be created
with a larger pages_per_range setting.
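That creation-time check is essentially the capacity math from the
quoted message.  A sketch of what it might look like, with made-up
names and the per-page overhead (28 bytes) hard-coded for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Can the metapage's array of array-revmap page numbers address every
 * possible page range of a maximally large (2^32-block) table, given
 * this block size and pages_per_range?  If not, index creation would
 * bail out and require a larger pages_per_range.
 */
static bool
revmap_setting_addressable(uint32_t blcksz, uint32_t pages_per_range)
{
    uint32_t usable = blcksz - 28;             /* header + special space */
    uint64_t tids_per_revmap_page = usable / 6;    /* ItemPointerData */
    uint64_t blkrefs_per_array_page = usable / 4;  /* BlockNumber */
    uint64_t top_array_slots = usable / 4;         /* in the metapage */

    uint64_t max_ranges = (UINT64_C(1) << 32) / pages_per_range;
    uint64_t revmap_pages =
        (max_ranges + tids_per_revmap_page - 1) / tids_per_revmap_page;
    uint64_t array_pages =
        (revmap_pages + blkrefs_per_array_page - 1) / blkrefs_per_array_page;

    return array_pages <= top_array_slots;
}
```

This reproduces the thread's conclusion: with 8kB blocks even
pages_per_range=1 just barely fits, while with 4kB blocks it does not
(though modest settings like pages_per_range=16 still do).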

> I wonder if it would be simpler to just always store the revmap
> pages in the beginning of the index, before any other pages. Finding
> the revmap page would then be just as easy as with a separate fork.
> When the table/index is extended so that a new revmap page is
> needed, move the existing page at that block out of the way. Locking
> needs some consideration, but I think it would be feasible and
> simpler than you have now.

Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.

> >I have followed the suggestion by Amit to overwrite the index tuple when
> >a new heap tuple is inserted, instead of creating a separate index
> >tuple.  This saves a lot of index bloat.  This required a new entry
> >point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
> >function PageIndexDeleteNoCompact which is similar in spirit to
> >PageIndexMultiDelete except that item pointers do not change.  This is
> >necessary because the revmap stores item pointers, and such reference
> >would break if we were to renumber items in index pages.
> 
> ISTM that when the old tuple cannot be updated in-place, the new
> index tuple is inserted with mm_doinsert(), but the old tuple is
> never deleted.

It's deleted by the next vacuum.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


