Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation - Mailing list pgsql-hackers

From | Andres Freund |
---|---|
Subject | Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation |
Date | |
Msg-id | 20191030202514.cfqx2he7q6ooj5kl@alap3.anarazel.de |
In response to | Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation (Peter Geoghegan <pg@bowt.ie>) |
Responses | Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation |
List | pgsql-hackers |

Hi,

On 2019-10-30 12:37:50 -0700, Peter Geoghegan wrote:
> On Wed, Oct 30, 2019 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd much rather not entrench this further, even leaving global indexes
> > aside. The 4 byte block number is a significant limitation for heap
> > tables too, and we should lift that at some point not too far away.
> > Then there's also other AMs that could really use a wider tid space.
>
> I agree that that limitation is a problem that should be fixed before
> too long. But the solution probably shouldn't be a radical departure
> from what we have today. The vast majority of tables are not affected
> by the TID space limitation. Those tables that are can tolerate
> supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
> used for the higher portion of the heap TID space alone.
>
> The idea here is that TID is varwidth, but actually uses the existing
> heap TID format most of the time. For larger tables it uses a wider
> fixed width struct that largely works the same as the old 6 byte
> struct.

I assume you mean that the index would dynamically recognize when it
needs the wider tids ("for the higher portion")? If so, yea, that makes
sense to me. Would that need to encode the 6/8byteness of a tid on a
per-element basis? Or are you thinking of being able to upconvert in a
general way?

> > > Though I suppose a posting list almost has to have fixed width TIDs to
> > > perform acceptably.
> >
> > Hm. It's not clear to me why that is?
>
> Random access matters for things like determining the correct offset
> to split a posting list at. This is needed in the event of an
> overlapping insertion of a new duplicate tuple whose heap TID falls
> within the range of the posting list. Also, I want to be able to scan
> posting lists backwards for backwards scans. In general, fixed-width
> TIDs make the page space accounting fairly simple, which matters a lot
> in nbtree.

If we had variable width tids varying by more than 2 bytes, it might be
reasonable for cases like this to store all tids padded to the width of
the widest tid. I think that'd still be pretty OK, because you'd only
pay the price if you actually have long tids, and only on pages where
such tids are referenced. Obviously that means that such a posting list
could grow by more than the size of the inserted tid upon insertion, but
that might still be OK? That'd require storing the width of the posting
list elements somewhere, unfortunately - not sure if that's a problem?

ISTM if we had varint style tids, we'd probably still save space on
average for today's heap that way.

How important is it for you to be able to split out the "block offset"
and "page offset" bits? I'm somewhat inclined to think that tids should
just be a varint (or maybe two, if we want to make it slightly simpler
to keep compat to how they look today), and that the AM internally
makes sense of that.

> I can support varwidth TIDs in the future pretty well if the TID
> doesn't have to be *arbitrarily* wide.

I think it'd be perfectly reasonable to have a relatively low upper
limit for tid width. Not 8 bytes, but also not 200 bytes.

> Individual posting lists can themselves either use 6 byte or 8 byte
> TIDs, preserving the ability to access a posting list entry at random
> using simple pointer arithmetic. This makes converting over index AMs
> a lot less painful -- it'll be pretty easy to avoid mixing together
> the 6 byte and 8 byte structs.

With varint style tids as I suggested, that ought to be fairly simple?
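To make that concrete, here's a rough sketch of what an LEB128-style
varint TID encoding could look like. This is purely illustrative -- the
function names and the (block << 16) | offset packing are just one
possible mapping, not existing PostgreSQL code:

```c
/*
 * Sketch only: encode a TID as a single unsigned varint (LEB128-style),
 * 7 payload bits per byte, high bit set on all but the last byte.
 * Small block numbers take few bytes; identifiers wider than today's
 * 48-bit (block, offset) pairs still fit, up to 10 bytes for a uint64.
 */
#include <stdint.h>
#include <stddef.h>

#define VARINT_TID_MAX_BYTES 10	/* enough for any uint64 value */

/* Pack today's (block, offset) pair into one integer key. */
static inline uint64_t
tid_to_uint64(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 16) | offset;
}

/* Returns the number of bytes written (1..VARINT_TID_MAX_BYTES). */
static size_t
varint_tid_encode(uint64_t val, uint8_t *out)
{
	size_t		n = 0;

	do
	{
		uint8_t		b = val & 0x7F;

		val >>= 7;
		if (val != 0)
			b |= 0x80;			/* continuation bit */
		out[n++] = b;
	} while (val != 0);

	return n;
}

/* Returns the number of bytes consumed; no bounds checking in a sketch. */
static size_t
varint_tid_decode(const uint8_t *in, uint64_t *val)
{
	uint64_t	result = 0;
	size_t		n = 0;
	int			shift = 0;

	for (;;)
	{
		uint8_t		b = in[n++];

		result |= (uint64_t) (b & 0x7F) << shift;
		if ((b & 0x80) == 0)
			break;
		shift += 7;
	}

	*val = result;
	return n;
}
```

With that kind of encoding a TID on block 0 fits in 3 bytes, heaps up to
roughly 512GB never need more than today's 6, the full 48-bit (block,
offset) space tops out at 7, and a full 64-bit identifier would still
fit in 10.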
> I don't think "indirect" indexes are a realistic goal for
> Postgres. VACUUM is just too messy there (as is any other garbage
> collection mechanism). Zedstore and Zheap don't change this.

Hm, I don't think there's a fundamental problem, but let's leave that
aside; there are enough other reasons to improve this.

Greetings,

Andres Freund