Thread: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
Andres has suggested that I work on teaching nbtree to accommodate variable-width, logical table identifiers, such as those required for indirect indexes, or clustered indexes, where secondary indexes must use a logical primary key value instead of a heap TID. I'm not currently committed to working on this as a project, but I definitely don't want to make it any harder. This has caused me to think about the problem as it relates to the new on-disk representation for v4 nbtree indexes in Postgres 12. I do have a minor misgiving about one particular aspect of what I came up with: the precise details of how we represent heap TID in pivot tuples seem like they might make things harder than they need to be for a future logical/varwidth table identifier project. This probably isn't worth doing anything about now, but it seems worth discussing, just in case. The macro BTreeTupleGetHeapTID() can be used to get a pointer to an ItemPointerData (an ItemPointer) for the heap TID column if any is available, regardless of whether the tuple is a non-pivot tuple (points to the heap) or a pivot tuple (belongs in internal/branch pages, and points to a block in the index, but needs to store heap TID as well). In the non-pivot case the ItemPointer points to the start of the tuple (raw IndexTuple field), while in the pivot case it points to itup + IndexTupleSize() - sizeof(ItemPointerData). This interface seems like the right thing to me; it's simple, low-context, works just as well with INCLUDE indexes, and makes it fast to determine if there are any truncated suffix attributes. However, I don't like the way the alignment padding works -- there is often "extra" padding *between* the last untruncated suffix attribute and the heap TID. 
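To make the two cases concrete, here is a standalone sketch of the pointer arithmetic described above. The structs, macro, and function names here are simplified stand-ins for illustration, not the actual PostgreSQL definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins (assumptions, not the real structs): a 6-byte
 * ItemPointerData-like TID, and an index tuple whose first field is
 * the TID, as in nbtree. */
typedef struct
{
    uint16_t bi_hi;
    uint16_t bi_lo;
    uint16_t offset;
} ItemPointerDataS;

typedef struct
{
    ItemPointerDataS t_tid;     /* non-pivot tuples: heap TID lives here */
    uint16_t t_info;            /* tuple size in the low bits, flags above */
} IndexTupleS;

#define INDEX_SIZE_MASK 0x1FFF
#define IndexTupleSizeS(itup) ((size_t) ((itup)->t_info & INDEX_SIZE_MASK))

/* Hypothetical analogue of BTreeTupleGetHeapTID(): in a non-pivot
 * tuple the heap TID is the leading t_tid field; in a pivot tuple that
 * retains a heap TID, it occupies the last sizeof(ItemPointerDataS)
 * bytes of the tuple. */
static ItemPointerDataS *
tuple_get_heap_tid(IndexTupleS *itup, int is_pivot, int pivot_has_tid)
{
    if (!is_pivot)
        return &itup->t_tid;
    if (!pivot_has_tid)
        return NULL;            /* heap TID was truncated away */
    return (ItemPointerDataS *)
        ((char *) itup + IndexTupleSizeS(itup) - sizeof(ItemPointerDataS));
}
```

The appeal of the interface is visible here: both cases are resolved from the tuple header alone, with no need to walk the attributes in between.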
It seems like any MAXALIGN() padding should all be at the end -- the only padding within tuples should be based on the *general* alignment requirement of the underlying data types, regardless of whether or not we're dealing with the special heap TID representation in pivot tuples. We should eliminate what could be viewed as a special case. This approach is probably going to be easier to generalize later. There can be a design where the logical/varwidth attribute can be accessed either by using the usual index_getattr() stuff, or using an interface like BTreeTupleGetHeapTID() to get to it quickly. We'd have to store an offset to the final/identifier attribute in the header to make that work, because we couldn't simply assume a particular width (like 6 bytes), but that seems straightforward. (I imagine that there'd be less difference between pivot and non-pivot tuples with varwidth identifiers than there is currently with heap TID, since we won't have to worry about pg_upgrade.) nbtinsert.c is very MAXALIGN()-heavy, and currently always represents index tuple sizes as MAXALIGN()'d, but that doesn't seem necessary or desirable to me. After all, we don't do that within heapam -- we can just rely on the bufpage.c routines to allocate a MAXALIGN()'d space for the whole tuple, while still making the lp_len field in the line pointer use the original size (i.e. the size with an un-MAXALIGN()'ed tuple data area). I've found that it's quite possible to get the nbtree code to store the tuple size (lp_len and the redundant IndexTupleSize() representation) this way, just like heapam always has. This has some useful consequences: BTreeTupleGetHeapTID() continues to work with the special pivot tuple representation, while _bt_truncate() never "puts padding in the wrong place" when it must add a heap TID because there are many duplicates and no split point can be found that avoids doing so (i.e. one that "truncates the heap TID attribute"). 
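The two schemes can be sketched as plain size arithmetic. This is an illustration under assumed sizes (8-byte maximum alignment, 8-byte IndexTuple header, 6-byte heap TID), not the actual nbtree code:

```c
#include <assert.h>
#include <stddef.h>

/* MAXALIGN as in PostgreSQL's c.h, assuming 8-byte maximum alignment */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Current scheme (sketch): the key area is padded to a MAXALIGN()
 * boundary before the heap TID is appended, and the total is
 * MAXALIGN()'d again -- so padding ends up *between* the last key
 * attribute and the TID. */
static size_t
pivot_size_current(size_t header, size_t keydata, size_t tidlen)
{
    return MAXALIGN(MAXALIGN(header + keydata) + tidlen);
}

/* Proposed scheme (sketch): lp_len keeps the real, un-MAXALIGN()'ed
 * tuple size, as heapam has always done; only the space the page
 * allocator hands out is rounded up, so all padding sits at the end. */
static size_t
pivot_lp_len_proposed(size_t header, size_t keydata, size_t tidlen)
{
    return header + keydata + tidlen;
}
```

With one 1-byte boolean key the current scheme yields MAXALIGN(MAXALIGN(8 + 1) + 6) = 24 bytes, while the proposed real size is 8 + 1 + 6 = 15, occupying MAXALIGN(15) = 16 bytes on the page.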
I could make this work without breaking the regression tests in about 10 minutes, which is at least encouraging (it was a bit tricky, though). This also results in an immediate though small benefit for v4 nbtree indexes: _bt_truncate() produces smaller pivot tuples in a few cases. For example, indexes with one or two boolean fields will have pivot tuples that are 15 bytes and 16 bytes in length respectively, occupying 16 bytes of tuple space on internal pages. The saving comes because we can use the alignment padding hole that was left empty in the original non-pivot index tuple from which the new pivot tuple is formed. Currently, the size of these pivot tuples would be 24 bytes, so we're occasionally saving a MAXALIGN() quantum of space this way. It is unlikely that anyone would actually care very much about these kinds of space savings, but at the same time it feels more elegant to me. The heap TID may not have a pg_attribute entry, but ISTM that the on-disk representation should not have padding "in the wrong place", on general principle. Thoughts? -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Stephen Frost
Date:
Greetings, * Peter Geoghegan (pg@bowt.ie) wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I'm not > currently committed to working on this as a project, but I definitely > don't want to make it any harder. This has caused me to think about > the problem as it relates to the new on-disk representation for v4 > nbtree indexes in Postgres 12. I do have a minor misgiving about one > particular aspect of what I came up with: The precise details of how > we represent heap TID in pivot tuples seems like it might make things > harder than they need to be for a future logical/varwidth table > identifier project. This probably isn't worth doing anything about > now, but it seems worth discussing now, just in case. This seems like it would be helpful for global indexes as well, wouldn't it? > This also results in an immediate though small benefit for v4 nbtree > indexes: _bt_truncate() produces smaller pivot tuples in a few cases. > For example, indexes with one or two boolean fields will have pivot > tuples that are 15 bytes and 16 bytes in length respectively, > occupying 16 bytes of tuple space on internal pages. The saving comes > because we can use the alignment padding hole, that was empty in the > original non-pivot index tuple that the new pivot tuple is to be > formed from. Currently, the size of these pivot tuples would be 24 > bytes, so we're occasionally saving a MAXALIGN() quantum in space this > way. It is unlikely that anyone would actually care very much about > these kinds of space savings, but at the same time it feels more > elegant to me. The heap TID may not have a pg_attribute entry, but > ISTM that the on-disk representation should not have padding "in the > wrong place", on general principle. > > Thoughts? 
I agree with trying to avoid having padding 'in the wrong place' and if it makes some indexes smaller, great, even if they're unlikely to be interesting in the vast majority of cases, they may still exist out there. Of course, this is provided that it doesn't overly complicate the code, but it sounds like it wouldn't be too bad in this case. Thanks! Stephen
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Andres Freund
Date:
Hi, On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I think it's two more cases: - table AMs that want to support tables that are bigger than 32TB. That used to be unrealistic, but it's not anymore. Especially when the need to VACUUM etc is largely removed / reduced. - global indexes (for cross-partition unique constraints and such), which need a partition identifier as part of the tid (or as part of the index key, but I think that actually makes interaction with indexam from other layers more complicated - the inside of the index maybe may want to represent it as a column, but to the outside that ought not to be visible) > Thoughts? Seems reasonable to me. I, more generally, wonder if there's not a case to squeeze out more padding than "just" what you describe (since we IIRC don't frequently keep pointers into such tuples anyway, and definitely don't for byval attrs). But that's very likely better done separately. Greetings, Andres Freund
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost <sfrost@snowman.net> wrote: > This seems like it would be helpful for global indexes as well, wouldn't > it? Yes, though that should probably work by reusing what we already do with heap TID (use standard IndexTuple fields on the leaf level for heap TID), plus an additional identifier for the partition number that is located at the physical end of the tuple. IOW, I think that this might benefit from a design that is half way between what we already do with heap TIDs and what we would be required to do to make varwidth logical row identifiers in tables work -- the partition number is varwidth, though often only a single byte. > I agree with trying to avoid having padding 'in the wrong place' and if > it makes some indexes smaller, great, even if they're unlikely to be > interesting in the vast majority of cases, they may still exist out > there. Of course, this is provided that it doesn't overly complicate > the code, but it sounds like it wouldn't be too bad in this case. Here is what it took: * Removed the "conservative" MAXALIGN() within index_form_tuple(), bringing it in line with heap_form_tuple(), which only MAXALIGN()s so that the first attribute in the tuple's data area can safely be accessed on alignment-picky platforms, but doesn't do the same with data_len. * Removed most of the MAXALIGN()s from nbtinsert.c, except one that considers if a page split is required. * Didn't change the nbtsplitloc.c code, because we need to assume MAXALIGN()'d space quantities there. We continue to not trust the reported tuple length to be MAXALIGN()'d, which is now essential rather than just defensive. * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the whole tuple size in the case where the new pivot tuple requires a heap TID representation. We access TIDs as three 2-byte integers, so this is necessary for alignment-picky platforms. I will pursue this as a project for PostgreSQL 13. 
It doesn't affect on-disk compatibility, because BTreeTupleGetHeapTID() works just as well with either the existing scheme, or this new one. Having the "real" tuple length available will make it easier to implement "true" suffix truncation, where we truncate *within* a text attribute (i.e. generate a new, shorter value using new opclass infrastructure). -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote: > I, more generally, wonder if there's not a case to squeeze out more > padding than "just" what you describe (since we IIRC don't frequently > keep pointers into such tuples anyway, and definitely don't for byval > attrs). But that's very likely better done separately. There is one way that that is kind of relevant here. The current requirement seems to be that *any* sort of tuple header be MAXALIGN()'d, because in the worst case the first attribute needs to be accessed at a MAXALIGN()'d boundary on an alignment-picky platform. That isn't so bad today, since we usually find a reasonably good way to put those 8 bytes (or 23/24 bytes in the case of heap tuples) to use. However, with varwidth table identifiers, the only use of those 8 bytes that I can think of is the offset to the identifier (or perhaps its length), plus the usual t_info stuff. We'd almost invariably waste 4 or 5 bytes, which seems like a problem to me. -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Stephen Frost
Date:
Greetings, * Peter Geoghegan (pg@bowt.ie) wrote: > On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost <sfrost@snowman.net> wrote: > > This seems like it would be helpful for global indexes as well, wouldn't > > it? > > Yes, though that should probably work by reusing what we already do > with heap TID (use standard IndexTuple fields on the leaf level for > heap TID), plus an additional identifier for the partition number that > is located at the physical end of the tuple. IOW, I think that this > might benefit from a design that is half way between what we already > do with heap TIDs and what we would be required to do to make varwidth > logical row identifiers in tables work -- the partition number is > varwidth, though often only a single byte. Yes, global indexes for partitioned tables could potentially be simpler than the logical row identifiers, but maybe it'd be useful to just have one implementation based around logical row identifiers which ends up working for global indexes as well as the other types of indexes and table access methods. If we thought that the 'single-byte' partition number covered enough use-cases then we could possibly consider supporting them for partitions by just 'stealing' a byte from BlockIdData and having the per-partition size be limited to 4TB when a global index exists on the partitioned table. That's certainly not an ideal limitation but it might be appealing to some users who really would like global indexes and could possibly require less to implement, though there's a lot of other things that would have to be done to have global indexes. Anyhow, just some random thoughts that I figured I'd share in case there might be something there worth thinking about. > > I agree with trying to avoid having padding 'in the wrong place' and if > > it makes some indexes smaller, great, even if they're unlikely to be > > interesting in the vast majority of cases, they may still exist out > > there. 
Of course, this is provided that it doesn't overly complicate > > the code, but it sounds like it wouldn't be too bad in this case. > > Here is what it took: > > * Removed the "conservative" MAXALIGN() within index_form_tuple(), > bringing it in line with heap_form_tuple(), which only MAXALIGN()s so > that the first attribute in the tuple's data area can safely be accessed > on alignment-picky platforms, but doesn't do the same with data_len. > > * Removed most of the MAXALIGN()s from nbtinsert.c, except one that > considers if a page split is required. > > * Didn't change the nbtsplitloc.c code, because we need to assume > MAXALIGN()'d space quantities there. We continue to not trust the > reported tuple length to be MAXALIGN()'d, which is now essential > rather than just defensive. > > * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the > whole tuple size in the case where the new pivot tuple requires a heap TID > representation. We access TIDs as three 2-byte integers, so this is > necessary for alignment-picky platforms. > > I will pursue this as a project for PostgreSQL 13. It doesn't affect > on-disk compatibility, because BTreeTupleGetHeapTID() works just as > well with either the existing scheme, or this new one. Having the > "real" tuple length available will make it easier to implement "true" > suffix truncation, where we truncate *within* a text attribute (i.e. > generate a new, shorter value using new opclass infrastructure). This sounds pretty good to me, though I'm not nearly as close to the code there as you are. Thanks! Stephen
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2019 at 10:32 AM Stephen Frost <sfrost@snowman.net> wrote: > Yes, global indexes for partitioned tables could potentially be simpler > than the logical row identifiers, but maybe it'd be useful to just have > one implementation based around logical row identifiers which ends up > working for global indexes as well as the other types of indexes and > table access methods. Maybe so. I think that we'd have to actually try it out to know for sure. > If we thought that the 'single-byte' partition number covered enough > use-cases then we could possibly consider supporting them for partitions > by just 'stealing' a byte from BlockIdData and having the per-partition > size be limited to 4TB when a global index exists on the partitioned > table. I don't think that that would make it any easier to implement. > This sounds pretty good to me, though I'm not nearly as close to the > code there as you are. I'm slightly concerned that I may have broken an index_form_tuple() caller from some other access method, but they all seem to not trust index_form_tuple() to have MAXALIGN()'d on their behalf, just like nbtree (nbtree won't when I'm done, though, because it will actively try to preserve the "real" tuple size). It's convenient to me that no caller seems to rely on the index_form_tuple() MAXALIGN() that I want to remove. -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Robert Haas
Date:
On Mon, Apr 22, 2019 at 1:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > Yes, though that should probably work by reusing what we already do > with heap TID (use standard IndexTuple fields on the leaf level for > heap TID), plus an additional identifier for the partition number that > is located at the physical end of the tuple. IOW, I think that this > might benefit from a design that is half way between what we already > do with heap TIDs and what we would be required to do to make varwidth > logical row identifiers in tables work -- the partition number is > varwidth, though often only a single byte. I think we're likely to have a problem with global indexes + DETACH PARTITION that is similar to the problem we now have with DROP COLUMN. If you drop or detach a partition, you can either (a) perform, as part of that operation, a scan of every global index to remove all references to the former partition, or (b) tell each global index that all references to that partition number ought to be regarded as dead index tuples. (b) makes detaching partitions faster and (a) seems hard to make rollback-safe, so I'm guessing we'll end up with (b). But that means that if someone repeatedly attaches and detaches partitions, the partition numbers could get quite big. And even without that somebody could have a lot of partitions. So while I do not disagree that the partition number could be variable-width and sometimes only 1 payload byte, I think we had better make sure to design the system in such a way that it scales to at least 4 payload bytes, because I have no faith that anything less will be sufficient for our demanding user base. We don't want people to be able to exhaust the supply of partition numbers the way they can exhaust the supply of attribute numbers by adding and dropping columns repeatedly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Wed, Apr 24, 2019 at 5:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > If you drop or detach a partition, you can either (a) perform, as part > of that operation, a scan of every global index to remove all > references to the former partition, or (b) tell each global indexes > that all references to that partition number ought to be regarded as > dead index tuples. (b) makes detaching partitions faster and (a) > seems hard to make rollback-safe, so I'm guessing we'll end up with > (b). I agree that (b) is the way to go. > We don't want people to be able to exhaust the supply of partition > numbers the way they can exhaust the supply of attribute numbers by > adding and dropping columns repeatedly. I agree that a partition numbering system needs to be able to accommodate arbitrarily-many partitions over time. It wouldn't have occurred to me to do it any other way. It is far far easier to make this work than it would be to retrofit varwidth attribute numbers. We won't have to worry about the HeapTupleHeaderGetNatts() representation. At the same time, nothing stops us from representing partition numbers in a simpler though less space efficient way in system catalogs. The main point of having global indexes is to be able to push down the partition number and use it during index scans. We can store the partition number at the end of the tuple on leaf pages, so that it's easily accessible (important for VACUUM), while continuing to use the IndexTuple fields for heap TID. On internal pages, the IndexTuple fields must be used for the downlink (block number of child), so both partition number and heap TID have to go towards the end of the tuples (this happens just with heap TID on Postgres 12). Of course, suffix truncation will manage to consistently get rid of both in most cases, especially when the global index is a unique index. 
The hard part is how to do varwidth encoding for space-efficient partition numbers while continuing to use IndexTuple fields for heap TID on the leaf level, *and* also having a BTreeTupleGetHeapTID()-style macro to get partition number without walking the entire index tuple. I suppose you could make the byte at the end of the tuple indicate that there are in fact 31 bits total when its high bit is set -- otherwise it's a 7 bit integer. Something like that may be the way to go. The alignment rules seem to make it worthwhile to keep the heap TID in the tuple header; it seems inherently necessary to have a MAXALIGN()'d tuple header, so finding a way to consistently put the first MAXALIGN() quantum to good use seems wise. -- Peter Geoghegan
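A minimal sketch of the flag-byte encoding floated above (all byte-layout details here are assumptions made for illustration, not a worked-out on-disk format): the byte at the physical end of the tuple is read first; with the high bit clear it is itself a 7-bit partition number, and with the high bit set it is the top byte of a 4-byte, 31-bit partition number.

```c
#include <assert.h>
#include <stdint.h>

/* Write partnum into the bytes ending at 'end'; returns width used.
 * Partition numbers below 0x80 take a single byte. */
static int
encode_partnum(uint8_t *end, uint32_t partnum)
{
    if (partnum < 0x80)
    {
        end[-1] = (uint8_t) partnum;            /* 7-bit form */
        return 1;
    }
    /* high bit set in the final byte marks the 4-byte, 31-bit form */
    end[-4] = (uint8_t) (partnum & 0xFF);
    end[-3] = (uint8_t) ((partnum >> 8) & 0xFF);
    end[-2] = (uint8_t) ((partnum >> 16) & 0xFF);
    end[-1] = (uint8_t) (0x80 | ((partnum >> 24) & 0x7F));
    return 4;
}

/* Jump straight to the physical end of the tuple and decode, without
 * walking the preceding attributes */
static uint32_t
decode_partnum(const uint8_t *end)
{
    if ((end[-1] & 0x80) == 0)
        return end[-1];                         /* 7-bit form */
    return (uint32_t) end[-4] |
           ((uint32_t) end[-3] << 8) |
           ((uint32_t) end[-2] << 16) |
           ((uint32_t) (end[-1] & 0x7F) << 24);
}
```

Note that this only supports reading from the tuple's end; walking the same bytes from the front, as index_getattr() would, is a separate problem, which is exactly the difficulty raised in the follow-up message.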
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Wed, Apr 24, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > The hard part is how to do varwidth encoding for space-efficient > partition numbers while continuing to use IndexTuple fields for heap > TID on the leaf level, *and* also having a > BTreeTupleGetHeapTID()-style macro to get partition number without > walking the entire index tuple. I suppose you could make the byte at > the end of the tuple indicate that there are in fact 31 bits total > when its high bit is set -- otherwise it's a 7 bit integer. Something > like that may be the way to go. The alignment rules seem to make it > worthwhile to keep the heap TID in the tuple header; it seems > inherently necessary to have a MAXALIGN()'d tuple header, so finding a > way to consistently put the first MAXALIGN() quantum to good use seems > wise. It's even harder than that if you want to make it possible to walk the tuple from either direction, which also seems useful. You want to be able to jump straight to the end of the tuple to get the partition number, while at the same time being able to access it in the usual way, as if it was just another attribute. Ugh, this is a mess. It would be so much easier if we had a tuple representation that stored attribute offsets right in the header. -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > > Andres has suggested that I work on teaching nbtree to accommodate > > variable-width, logical table identifiers, such as those required for > > indirect indexes, or clustered indexes, where secondary indexes must > > use a logical primary key value instead of a heap TID. I'm revisiting this thread now because it may have relevance to the nbtree deduplication patch. If nothing else, the patch further commits us to the current heap TID format by making assumptions about the width of posting lists with 6 byte TIDs. Though I suppose a posting list almost has to have fixed width TIDs to perform acceptably. Database systems with a varwidth TID format probably don't support anything like posting lists. > I think it's two more cases: > > - table AMs that want to support tables that are bigger than 32TB. That > used to be unrealistic, but it's not anymore. Especially when the need > to VACUUM etc is largely removed / reduced. Can we steal some bits that are currently used for offset number instead? 16 bits is far more than we ever need to use for heap offset numbers in practice. (I wonder if this would also have benefits for the representation of in-memory bitmaps?) > - global indexes (for cross-partition unique constraints and such), > which need a partition identifier as part of the tid (or as part of > the index key, but I think that actually makes interaction with > indexam from other layers more complicated - the inside of the index > maybe may want to represent it as a column, but to the outside that > ought not to be visible) Can we just use an implementation level attribute for this? Would it be so bad if we weren't able to jump straight to the partition number without walking through the tuple when the tuple has varwidth attributes? 
(If that isn't acceptable, then we can probably make it work for global indexes without having to generalize everything.) In general, Postgres heap TIDs are not stable identifiers of rows (they only stably identify HOT chains). This is not the case in other database systems, which may go to great trouble to make it possible to assume that TIDs are stable over time (Andy Pavlo says that our TIDs are physical while Oracle's are logical -- I don't like that terminology, but know what he means). Generalizing the nbtree AM to be able to work with an arbitrary type of table row identifier that isn't at all like a TID raises tricky definitional questions. It would have to work in a way that made the new variety of table row identifier stable, which is a significant new requirement (and one that zheap is clearly not interested in). If we're willing to support something like a clustered index in nbtree, I wonder what we need to do to make things like numeric display scale still work. For bonus points, describe how unique checking works with a secondary unique index. I am not suggesting that these issues are totally insurmountable. What I am saying is this: If we already had "stable logical" TIDs instead of "mostly physical" TIDs, then generalizing nbtree index tuples to store arbitrary table row identifiers would more or less be all about the data structure managed by nbtree. But that isn't the case, and that strongly discourages me from working on this -- we shouldn't talk about the problem as if it is mostly just a matter of settling on the best index tuple format. Frankly, I am not very enthusiastic about working on a project that has unclear scope and unclear benefits for users. -- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Andres Freund
Date:
Hi, On 2019-10-30 11:33:21 -0700, Peter Geoghegan wrote: > On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote: > > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > > > Andres has suggested that I work on teaching nbtree to accommodate > > > variable-width, logical table identifiers, such as those required for > > > indirect indexes, or clustered indexes, where secondary indexes must > > > use a logical primary key value instead of a heap TID. > > I'm revisiting this thread now because it may have relevance to the > nbtree deduplication patch. If nothing else, the patch further commits > us to the current heap TID format by making assumptions about the > width of posting lists with 6 byte TIDs. I'd much rather not entrench this further, even leaving global indexes aside. The 4 byte block number is a significant limitation for heap tables too, and we should lift that at some point not too far away. Then there's also other AMs that could really use a wider tid space. > Though I suppose a posting list almost has to have fixed width TIDs to > perform acceptably. Hm. It's not clear to me why that is? > > I think it's two more cases: > > > > - table AMs that want to support tables that are bigger than 32TB. That > > used to be unrealistic, but it's not anymore. Especially when the need > > to VACUUM etc is largely removed / reduced. > > Can we steal some bits that are currently used for offset number > instead? 16 bits is far more than we ever need to use for heap offset > numbers in practice. I think that's a terrible idea. For one, some AMs will have significantly higher limits, especially taking compression and larger block sizes into account. Also not all AMs need identifiers tied so closely to a disk position, e.g. zedstore does not. We shouldn't hack ever more information into the offset, given that background. > (I wonder if this would also have benefits for the representation of > in-memory bitmaps?) Hm. Not sure how? 
> > - global indexes (for cross-partition unique constraints and such), > > which need a partition identifier as part of the tid (or as part of > > the index key, but I think that actually makes interaction with > > indexam from other layers more complicated - the inside of the index > > maybe may want to represent it as a column, but to the outside that > > ought not to be visible) > > Can we just use an implementation level attribute for this? Would it > be so bad if we weren't able to jump straight to the partition number > without walking through the tuple when the tuple has varwidth > attributes? (If that isn't acceptable, then we can probably make it > work for global indexes without having to generalize everything.) Having to walk through the index tuple might be acceptable - in all likelihood we'll have to do so anyway. It does however not *really* resolve the issue that we still need to pass something TID-like back from the indexam, so we can fetch the associated tuple from the heap, or add the tid to a bitmap. But that could be done separately from the index internal data structures. > Generalizing the nbtree AM to be able to work with an arbitrary type > of table row identifier that isn't at all like a TID raises tricky > definitional questions. It would have to work in a way that made the > new variety of table row identifier stable, which is a significant new > requirement (and one that zheap is clearly not interested in). Hm. I don't see why a different type of TID would imply them being stable? 
But that isn't the case, and > that strongly discourages me from working on this -- we shouldn't talk > about the problem as if it is mostly just a matter of settling of the > best index tuple format. > Frankly I am not very enthusiastic about working on a project that has > unclear scope and unclear benefits for users. Why would properly supporting AMs like zedstore, global indexes, "indirect" indexes etc benefit users? Greetings, Andres Freund
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Wed, Oct 30, 2019 at 12:03 PM Andres Freund <andres@anarazel.de> wrote: > I'd much rather not entrench this further, even leaving global indexes > aside. The 4 byte block number is a significant limitation for heap > tables too, and we should lift that at some point not too far away. > Then there's also other AMs that could really use a wider tid space. I agree that that limitation is a problem that should be fixed before too long. But the solution probably shouldn't be a radical departure from what we have today. The vast majority of tables are not affected by the TID space limitation. Those tables that are affected can tolerate supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are used for the higher portion of the heap TID space alone. The idea here is that TID is varwidth, but actually uses the existing heap TID format most of the time. For larger tables it uses a wider fixed width struct that largely works the same as the old 6 byte struct. > > Though I suppose a posting list almost has to have fixed width TIDs to > > perform acceptably. > > Hm. It's not clear to me why that is? Random access matters for things like determining the correct offset to split a posting list at. This is needed in the event of an overlapping insertion of a new duplicate tuple whose heap TID falls within the range of the posting list. Also, I want to be able to scan posting lists backwards for backwards scans. In general, fixed-width TIDs make the page space accounting fairly simple, which matters a lot in nbtree. I can support varwidth TIDs in the future pretty well if the TID doesn't have to be *arbitrarily* wide. Individual posting lists can themselves either use 6 byte or 8 byte TIDs, preserving the ability to access a posting list entry at random using simple pointer arithmetic. This makes converting over index AMs a lot less painful -- it'll be pretty easy to avoid mixing together the 6 byte and 8 byte structs. 
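The posting-list split case above can be sketched as a binary search over fixed-width entries. This is an illustration only: the struct is a simplified stand-in for ItemPointerData (the real one is 6 bytes), and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a fixed-width posting-list entry */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} TidS;

static int
tid_cmp(const TidS *a, const TidS *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Find where an incoming duplicate's TID lands inside a sorted posting
 * list: the first entry >= newtid.  The plist[mid] access in the loop
 * is the random access that fixed-width TIDs make cheap; with
 * arbitrarily wide varwidth TIDs there is no O(1) way to locate the
 * middle entry. */
static int
posting_split_offset(const TidS *plist, int ntids, const TidS *newtid)
{
    int lo = 0, hi = ntids;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (tid_cmp(&plist[mid], newtid) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

The same property is what makes backwards scans and page space accounting simple: entry n is always at a fixed byte offset, whether entries are 6 or 8 bytes wide.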
> > Can we steal some bits that are currently used for offset number
> > instead? 16 bits is far more than we ever need to use for heap offset
> > numbers in practice.
>
> I think that's a terrible idea. For one, some AMs will have significantly
> higher limits, especially taking compression and larger block sizes into
> account. Also not all AMs need identifiers tied so closely to a disk
> position, e.g. zedstore does not. We shouldn't hack evermore
> information into the offset, given that background.

Fair enough, but somebody needs to cut some scope here.

> Having to walk through the index tuple might be acceptable - in all
> likelihood we'll have to do so anyway. It does however not *really*
> resolve the issue that we still need to pass something TID-like back
> from the indexam, so we can fetch the associated tuple from the heap,
> or add the tid to a bitmap. But that could be done separately from the
> index internal data structures.

I agree.

> > Generalizing the nbtree AM to be able to work with an arbitrary type
> > of table row identifier that isn't at all like a TID raises tricky
> > definitional questions.
>
> Hm. I don't see why different types of TID would imply them being
> stable?

It's unclear what that would even mean. I would like to see a sketch of a design for varwidth TIDs that balances everybody's concerns.

I don't think "indirect" indexes are a realistic goal for Postgres. VACUUM is just too messy there (as is any other garbage collection mechanism). Zedstore and Zheap don't change this.

> > Frankly I am not very enthusiastic about working on a project that has
> > unclear scope and unclear benefits for users.
>
> Why would properly supporting AMs like zedstore, global indexes,
> "indirect" indexes etc benefit users?

Global indexes seem doable. I don't see how "indirect" indexes can ever work in Postgres. I don't know exactly what zedstore needs here, but maybe it can work well with a less ambitious design for varwidth TIDs along the lines I've sketched.
-- Peter Geoghegan
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Andres Freund
Date:
Hi,

On 2019-10-30 12:37:50 -0700, Peter Geoghegan wrote:
> On Wed, Oct 30, 2019 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd much rather not entrench this further, even leaving global indexes
> > aside. The 4 byte block number is a significant limitation for heap
> > tables too, and we should lift that at some point not too far away.
> > Then there's also other AMs that could really use a wider tid space.
>
> I agree that that limitation is a problem that should be fixed before
> too long. But the solution probably shouldn't be a radical departure
> from what we have today. The vast majority of tables are not affected
> by the TID space limitation. Those tables that are can tolerate
> supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
> used for the higher portion of the heap TID space alone.
>
> The idea here is that TID is varwidth, but actually uses the existing
> heap TID format most of the time. For larger tables it uses a wider
> fixed width struct that largely works the same as the old 6 byte
> struct.

I assume you mean that the index would dynamically recognize when it needs the wider tids ("for the higher portion")? If so, yea, that makes sense to me. Would that need to encode the 6/8byteness of a tid on a per-element basis? Or are you thinking of being able to upconvert in a general way?

> > > Though I suppose a posting list almost has to have fixed width TIDs to
> > > perform acceptably.
> >
> > Hm. It's not clear to me why that is?
>
> Random access matters for things like determining the correct offset
> to split a posting list at. This is needed in the event of an
> overlapping insertion of a new duplicate tuple whose heap TID falls
> within the range of the posting list. Also, I want to be able to scan
> posting lists backwards for backwards scans. In general, fixed-width
> TIDs make the page space accounting fairly simple, which matters a lot
> in nbtree.
If we had variable width tids varying by more than 2 bytes, it might be reasonable for cases like this to store all tids padded to the width of the widest tid. I think that'd still be pretty OK, because you'd only pay the price if you actually have long tids, and only on pages where such tids are referenced. Obviously that means that such a posting list could grow by more than the size of the inserted tid upon insertion, but that might still be OK? That'd require storing the width of the posting list elements somewhere, unfortunately - not sure if that's a problem?

ISTM if we had varint style tids, we'd probably still save space on average for today's heap that way. How important is it for you to be able to split out the "block offset" and "page offset" bits? I'm somewhat inclined to think that tids should just be a varint (or maybe two, if we want to make it slightly simpler to keep compat to how they look today), and that the AM internally makes sense of that.

> I can support varwidth TIDs in the future pretty well if the TID
> doesn't have to be *arbitrarily* wide.

I think it'd be perfectly reasonable to have a relatively low upper limit for tid width. Not 8 bytes, but also not 200 bytes.

> Individual posting lists can themselves either use 6 byte or 8 byte
> TIDs, preserving the ability to access a posting list entry at random
> using simple pointer arithmetic. This makes converting over index AMs
> a lot less painful -- it'll be pretty easy to avoid mixing together
> the 6 byte and 8 byte structs.

With varint style tids as I suggested, that ought to be fairly simple?

> I don't think "indirect" indexes are a realistic goal for
> Postgres. VACUUM is just too messy there (as is any other garbage
> collection mechanism). Zedstore and Zheap don't change this.

Hm, I don't think there's a fundamental problem, but let's leave that aside, there's enough other reasons to improve this.

Greetings,

Andres Freund
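[For readers unfamiliar with varint encodings: the "varint style tids" suggestion could be sketched as a standard LEB128-style encoding of a TID-as-integer, 7 payload bits per byte with the high bit as a continuation flag. This is an illustrative assumption about what "varint" means here, not a format proposed in the thread.]

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a TID-as-integer as a varint: 7 bits per byte, high bit set
 * on every byte except the last.  Small TIDs take 1-2 bytes; a full
 * 64-bit value takes at most 10.  Returns the number of bytes written. */
static size_t
varint_tid_encode(uint64_t tid, uint8_t *buf)
{
    size_t len = 0;

    while (tid >= 0x80)
    {
        buf[len++] = (uint8_t) (tid | 0x80);
        tid >>= 7;
    }
    buf[len++] = (uint8_t) tid;
    return len;
}

/* Decode one varint TID, reporting how many bytes it occupied. */
static uint64_t
varint_tid_decode(const uint8_t *buf, size_t *consumed)
{
    uint64_t tid = 0;
    int      shift = 0;
    size_t   len = 0;

    for (;;)
    {
        uint8_t b = buf[len++];

        tid |= (uint64_t) (b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }
    *consumed = len;
    return tid;
}
```

Note the trade-off Peter raises in his reply: because each element's width varies, finding the i'th TID in a varint posting list requires decoding from the start, giving up O(1) random access.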
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
From
Peter Geoghegan
Date:
On Wed, Oct 30, 2019 at 1:25 PM Andres Freund <andres@anarazel.de> wrote:
> I assume you mean that the index would dynamically recognize when it
> needs the wider tids ("for the higher portion")? If so, yea, that makes
> sense to me. Would that need to encode the 6/8byteness of a tid on a
> per-element basis? Or are you thinking of being able to upconvert in a
> general way?

I think that index access methods should continue to control the layout of TIDs/item pointers, while mostly just treating them as fixed-width integers. Maybe you just have 6 and 8 byte structs, or maybe you also have a 16 byte struct, but it isn't really variable length to the index AM (it's more like a union). It becomes the index AM's responsibility to remember which TID width applies at a per tuple level (or per posting list level, etc). In general, index AMs have to "supply their own status bits", which should be fine (the alternative is that they refuse to support anything other than the traditional 6 byte TID format).

Table AMs don't get to supply their own operator class for sorting these integers -- they had better be happy with the TID sort order that is baked in (ascending int order) in the context of index scans that return duplicates, and things like that. There is probably generic infrastructure for up-converting, too, but the index AM is fundamentally in the driving seat with this design.

> If we had variable width tids varying by more than 2 bytes, it might be
> reasonable for cases like this to store all tids padded to the width of
> the widest tid. I think that'd still be pretty OK, because you'd only
> pay the price if you actually have long tids, and only on pages where
> such tids are referenced. Obviously that means that such a posting list
> could grow more than by the size of the inserted tid upon insertion, but
> that might still be OK? That'd require storing the width of the posting
> list elements somewhere, unfortunately - not sure if that's a problem?
My solution is to require the index AM to look after itself. The index AM is responsible for not mixing them up. For nbtree with deduplication, this means that having different width TIDs makes it impossible to merge together posting lists/tuples. For GIN, this means that we can expand the width of the TIDs in the posting list to match the widest TID. We can then make it into a posting tree if necessary -- GIN has the benefit of always being able to fall back on the option of making a new posting tree (unlike nbtree with deduplication). GIN's B-Tree is very primitive in some ways (no deletion of items in the entry tree, no deletion of pages in the entry tree), which gives it the ability to safely fall back on creating a new posting tree when it runs out of space.

> ISTM if we had varint style tids, we'd probably still save space on
> average for today's heap that way. How important is it for you to be
> able to split out the "block offset" and "page offset" bits?

Pretty important. The nbtree deduplication patch is very compelling because it almost offers the best of both worlds -- the concurrency characteristics of today's nbtree, combined with very much improved space efficiency. Keeping the space accounting as simple as possible seems like a big part of why this is possible at all. There is only one new type of WAL record required for deduplication, and they're pretty small. (Existing WAL records are changed to support posting list splits, but these are small, low-overhead changes.)

> I'm somewhat inclined to think that tids should just be a varint (or
> maybe two, if we want to make it slightly simpler to keep compat to how
> they look today), and that the AM internally makes sense of that.

I am opposed to adding anything that is truly varint. The index access method ought to be able to have fine control over the layout, without being burdened by an overly complicated TID representation.
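[The GIN fallback described above -- expanding a posting list so every entry matches the widest TID -- amounts to a one-shot rewrite of the whole list. A minimal sketch, using simplified stand-in structs rather than GIN's actual compressed posting list format:]

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for narrow and wide fixed-width TID formats. */
typedef struct TID6 { uint32_t block; uint16_t offset; } TID6;
typedef struct TID8 { uint64_t block; uint16_t offset; } TID8;

/* Rewrite an entire posting list in the wide format, so that a new
 * wide TID can then be merged in.  Every element in the resulting
 * list has the same width, so random access by index still works. */
static void
posting_list_widen(const TID6 *src, TID8 *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dst[i].block = src[i].block;
        dst[i].offset = src[i].offset;
    }
}
```

If the widened list no longer fits on the page, GIN can fall back on moving it into a posting tree; nbtree deduplication has no equivalent escape hatch, which is why it simply refuses to merge tuples of different TID widths.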
> > I can support varwidth TIDs in the future pretty well if the TID
> > doesn't have to be *arbitrarily* wide.
>
> I think it'd be perfectly reasonable to have a relatively low upper
> limit for tid width. Not 8 bytes, but also not 200 bytes.

My point is that we should find a way to make TIDs continue to be an array of fixed width integers in any given context. Lots of index access method code can be ported in a relatively straightforward fashion this way. This has some downsides, but I believe that they're worth it.

> > Individual posting lists can themselves either use 6 byte or 8 byte
> > TIDs, preserving the ability to access a posting list entry at random
> > using simple pointer arithmetic. This makes converting over index AMs
> > a lot less painful -- it'll be pretty easy to avoid mixing together
> > the 6 byte and 8 byte structs.
>
> With varint style tids as I suggested, that ought to be fairly simple?

nbtree probably won't be able to tolerate having to widen every TID in a posting list all at once when new tuples are inserted whose wider TIDs go in the same posting list (as I said, keeping the space accounting simple is particularly important for nbtree). This even seems hard for GIN, which thinks of TIDs as an array of fixed width ints in many contexts. BRIN revmap pages are also mostly just arrays of 6 byte item pointers that rely on simple pointer arithmetic to do random access.

-- Peter Geoghegan
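[The "simple pointer arithmetic" that posting lists and BRIN revmap pages rely on is just fixed-stride indexing into a byte buffer. A sketch, assuming all entries in a given list share one width per the scheme discussed above:]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fetch the i'th TID from a posting list whose entries are all
 * tid_width bytes wide (e.g. 6 or 8).  Because every entry in the
 * list has the same width, random access (and therefore backwards
 * scans, and binary search for a split point) is O(1) per element.
 * With varwidth entries this would instead require decoding the list
 * from the start. */
static void
posting_list_fetch(const uint8_t *list, size_t tid_width, size_t i,
                   uint8_t *tid_out)
{
    memcpy(tid_out, list + i * tid_width, tid_width);
}
```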