Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation
Date
Msg-id CAH2-WznT=8ibNy68RK-Crwg5qRa6sJzaP5V_GNcSjUNmYFLArw@mail.gmail.com
Whole thread Raw
In response to Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Oct 30, 2019 at 1:25 PM Andres Freund <andres@anarazel.de> wrote:
> I assume you mean that the index would dynamically recognize when it
> needs the wider tids ("for the higher portion")? If so, yea, that makes
> sense to me. Would that need to encode the 6/8byteness of a tid on a
> per-element basis? Or are you thinking of being able to upconvert in a
> general way?

I think that index access methods should continue to control the
layout of TIDs/item pointers, while mostly just treating them as
fixed-width integers. Maybe you just have 6 and 8 byte structs, or
maybe you also have a 16 byte struct, but it isn't really variable
length to the index AM (it's more like a union). It becomes the index
AM's responsibility to remember which TID width applies at a per tuple
level (or per posting list level, etc). In general, index AMs have to
"supply their own status bits", which should be fine (the alternative
is that they refuse to support anything other than the traditional 6
byte TID format).

Table AMs don't get to supply their own operator class for sorting
these integers -- they had better be happy with the TID sort order
that is baked in (ascending int order) in the context of index scans
that return duplicates, and things like that. There is probably
generic infrastructure for up-converting, too, but the index AM is
fundamentally in the driving seat with this design.

> If we had variable width tids varying by more than 2 bytes, it might be
> reasonable for cases like this to store all tids padded to the width of
> the widest tid. I think that'd still be pretty OK, because you'd only
> pay the price if you actually have long tids, and only on pages where
> such tids are referenced. Obviously that means that such a posting list
> could grow more than by the size of the inserted tid upon insertion, but
> that might still be OK?  That'd require storing the width of the posting
> list elements somewhere, unfortunately - not sure if that's a problem?

My solution is to require the index AM to look after itself. The index
AM is responsible for not mixing them up. For nbtree with
deduplication, this means that having different width TIDs makes it
impossible to merge together posting lists/tuples. For GIN, this means
that we can expand the width of the TIDs in the posting list to match
the widest TID. We can then make it into a posting tree if necessary
-- GIN has the benefit of always being able to fall back on the option
of making a new posting tree (unlike nbtree with deduplication). GIN's
B-Tree is very primitive in some ways (no deletion of items in the
entry tree, no deletion of pages in the entry tree), which gives it
the ability to safely fall back on creating a new posting tree when it
runs out of space.

> ISTM if we had varint style tids, we'd probably still save space on
> average for today's heap that way. How important is it for you to be
> able to split out the "block offset" and "page offset" bits?

Pretty important. The nbtree deduplication patch is very compelling
because it almost offers the best of both worlds -- the concurrency
characteristics of today's nbtree, combined with very much improved
space efficiency. Keeping the space accounting as simple as possible
seems like a big part of why this is possible at all. There is only
one new type of WAL record required for deduplication, and they're
pretty small. (Existing WAL records are changed to support posting
list splits, but these are small, low-overhead changes.)

> I'm somewhat inclined to think that tids should just be a varint (or
> maybe two, if we want to make it slightly simpler to keep compat to how
> they look today), and that the AM internally makes sense of that.

I am opposed to adding anything that is truly varint. The index access
method ought to be able to have fine control over the layout, without
being burdened by an overly complicated TID representation.

> > I can support varwidth TIDs in the future pretty well if the TID
> > doesn't have to be *arbitrarily* wide.
>
> I think it'd be perfectly reasonable to have a relatively low upper
> limit for tid width. Not 8 bytes, but also not 200 bytes.

My point is that we should find a way to make TIDs continue to be an
array of fixed width integers in any given context. Lots of index
access method code can be ported in a relatively straightforward
fashion this way. This has some downsides, but I believe that they're
worth it.

> > Individual posting lists can themselves either use 6 byte or 8 byte
> > TIDs, preserving the ability to access a posting list entry at random
> > using simple pointer arithmetic.  This makes converting over index AMs
> > a lot less painful -- it'll be pretty easy to avoid mixing together
> > the 6 byte and 8 byte structs.
>
> With varint style tids as I suggested, that ought to be fairly simple?

nbtree probably won't be able to tolerate having to widen every TID in
the posting list all at once when new tuples are inserted that have
TIDs that are one byte wider, that go in the same posting list (as I
said, keeping the space accounting simple is particularly important
for nbtree). This even seems hard for GIN, which thinks of TIDs as an
array of fixed width ints in many contexts. Also, BRIN revmap pages
are also mostly just arrays of 6 byte item pointers, that rely on
simple pointer arithmetic to do random access.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: v12.0: ERROR: could not find pathkey item to sort
Next
From: David Rowley
Date:
Subject: Re: Creating foreign key on partitioned table is too slow