Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation |
Date | |
Msg-id | CAH2-WznT=8ibNy68RK-Crwg5qRa6sJzaP5V_GNcSjUNmYFLArw@mail.gmail.com Whole thread Raw |
In response to | Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Wed, Oct 30, 2019 at 1:25 PM Andres Freund <andres@anarazel.de> wrote: > I assume you mean that the index would dynamically recognize when it > needs the wider tids ("for the higher portion")? If so, yea, that makes > sense to me. Would that need to encode the 6/8byteness of a tid on a > per-element basis? Or are you thinking of being able to upconvert in a > general way? I think that index access methods should continue to control the layout of TIDs/item pointers, while mostly just treating them as fixed-width integers. Maybe you just have 6 and 8 byte structs, or maybe you also have a 16 byte struct, but it isn't really variable length to the index AM (it's more like a union). It becomes the index AM's responsibility to remember which TID width applies at a per tuple level (or per posting list level, etc). In general, index AMs have to "supply their own status bits", which should be fine (the alternative is that they refuse to support anything other than the traditional 6 byte TID format). Table AMs don't get to supply their own operator class for sorting these integers -- they had better be happy with the TID sort order that is baked in (ascending int order) in the context of index scans that return duplicates, and things like that. There is probably generic infrastructure for up-converting, too, but the index AM is fundamentally in the driving seat with this design. > If we had variable width tids varying by more than 2 bytes, it might be > reasonable for cases like this to store all tids padded to the width of > the widest tid. I think that'd still be pretty OK, because you'd only > pay the price if you actually have long tids, and only on pages where > such tids are referenced. Obviously that means that such a posting list > could grow more than by the size of the inserted tid upon insertion, but > that might still be OK? That'd require storing the width of the posting > list elements somewhere, unfortunately - not sure if that's a problem? My solution is to require the index AM to look after itself. The index AM is responsible for not mixing them up. For nbtree with deduplication, this means that having different width TIDs makes it impossible to merge together posting lists/tuples. For GIN, this means that we can expand the width of the TIDs in the posting list to match the widest TID. We can then make it into a posting tree if necessary -- GIN has the benefit of always being able to fall back on the option of making a new posting tree (unlike nbtree with deduplication). GIN's B-Tree is very primitive in some ways (no deletion of items in the entry tree, no deletion of pages in the entry tree), which gives it the ability to safely fall back on creating a new posting tree when it runs out of space. > ISTM if we had varint style tids, we'd probably still save space on > average for today's heap that way. How important is it for you to be > able to split out the "block offset" and "page offset" bits? Pretty important. The nbtree deduplication patch is very compelling because it almost offers the best of both worlds -- the concurrency characteristics of today's nbtree, combined with very much improved space efficiency. Keeping the space accounting as simple as possible seems like a big part of why this is possible at all. There is only one new type of WAL record required for deduplication, and they're pretty small. (Existing WAL records are changed to support posting list splits, but these are small, low-overhead changes.) > I'm somewhat inclined to think that tids should just be a varint (or > maybe two, if we want to make it slightly simpler to keep compat to how > they look today), and that the AM internally makes sense of that. I am opposed to adding anything that is truly varint. The index access method ought to be able to have fine control over the layout, without being burdened by an overly complicated TID representation. > > I can support varwidth TIDs in the future pretty well if the TID > > doesn't have to be *arbitrarily* wide. > > I think it'd be perfectly reasonable to have a relatively low upper > limit for tid width. Not 8 bytes, but also not 200 bytes. My point is that we should find a way to make TIDs continue to be an array of fixed width integers in any given context. Lots of index access method code can be ported in a relatively straightforward fashion this way. This has some downsides, but I believe that they're worth it. > > Individual posting lists can themselves either use 6 byte or 8 byte > > TIDs, preserving the ability to access a posting list entry at random > > using simple pointer arithmetic. This makes converting over index AMs > > a lot less painful -- it'll be pretty easy to avoid mixing together > > the 6 byte and 8 byte structs. > > With varint style tids as I suggested, that ought to be fairly simple? nbtree probably won't be able to tolerate having to widen every TID in the posting list all at once when new tuples are inserted that have TIDs that are one byte wider, that go in the same posting list (as I said, keeping the space accounting simple is particularly important for nbtree). This even seems hard for GIN, which thinks of TIDs as an array of fixed width ints in many contexts. Also, BRIN revmap pages are also mostly just arrays of 6 byte item pointers, that rely on simple pointer arithmetic to do random access. -- Peter Geoghegan
pgsql-hackers by date: