Thread: MaxOffsetNumber for Table AMs
The notion of TID is based on pages and line pointers, which makes sense for heapam, but that's not likely to make sense for a custom table AM.

The obvious answer is to make a simple mapping between a TID and whatever makes sense to the AM (for the sake of discussion, let's say a plain row number).

The most natural thing would be to say that we have 48 bits, so it can just be a 48-bit number. Of course, there are some restrictions on valid values that complicate this:

* InvalidBlockNumber of 0xFFFFFFFF. Not a problem.

* InvalidOffsetNumber of 0. Not a problem.

* MaxOffsetNumber of 2048. Does this limit really apply to table AMs? It just seems like it's used when scanning heap or index pages for stack-allocated arrays. For a table AM it appears to waste 5 bits.

* ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max offset number. Is this a bug?

As a table AM author, I'd like to know what the real limits are so that I can use whatever bits are available to map from TID to row number and back, without worrying that something will break in the future. A few possibilities:

1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().

2. Change MaxOffsetNumber to 2047. This seems likely to break extensions that rely on it.

3. Define MaxOffsetNumber as 65536 and introduce a new MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need to fix itemptr_to_uint64().

Thoughts?

Regards,
	Jeff Davis
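For concreteness, here is a minimal sketch of the kind of TID-to-row-number mapping being discussed. The ROWS_PER_FAKE_BLOCK divisor is an assumed placeholder, not a documented limit -- which is exactly the question being asked above.

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Assumed safe upper bound for ip_posid; offsets used are 1..2000. */
    #define ROWS_PER_FAKE_BLOCK 2000

    /* Map a logical row number onto a TID, avoiding offset 0 entirely. */
    static inline void
    rownum_to_tid(uint64 rownum, ItemPointer tid)
    {
        BlockNumber  blk = (BlockNumber) (rownum / ROWS_PER_FAKE_BLOCK);
        OffsetNumber off = (OffsetNumber) ((rownum % ROWS_PER_FAKE_BLOCK) + 1);

        Assert(blk != InvalidBlockNumber);  /* caller must stay below the cap */
        ItemPointerSet(tid, blk, off);
    }

    /* Recover the row number from a TID produced by rownum_to_tid(). */
    static inline uint64
    tid_to_rownum(ItemPointer tid)
    {
        return (uint64) ItemPointerGetBlockNumber(tid) * ROWS_PER_FAKE_BLOCK +
               (ItemPointerGetOffsetNumber(tid) - 1);
    }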
On Fri, 30 Apr 2021, 09:46 Jeff Davis, <pgsql@j-davis.com> wrote:
> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.
> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).
> The most natural thing would be to say that we have 48 bits, so it can
> just be a 48-bit number. Of course, there are some restrictions on
> valid values that complicate this:
> * InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
> * InvalidOffsetNumber of 0. Not a problem.
> * MaxOffsetNumber of 2048. Does this limit really apply to table AMs?
MaxOffsetNumber is not per se 2048. It is BLCKSZ / sizeof(ItemIdData), which is only 2048 for an 8 kiB BLCKSZ. As we support BLCKSZ up to 32 kiB, MaxOffsetNumber can be as large as 8192.
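For reference, the definition in src/include/storage/off.h is roughly the following (sizeof(ItemIdData) being 4 bytes):

    #define MaxOffsetNumber ((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))
    /*  8 kiB BLCKSZ:  8192 / 4 = 2048 */
    /* 32 kiB BLCKSZ: 32768 / 4 = 8192 */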
Other than that, I believe you've also missed the special offset numbers specified in itemptr.h (SpecTokenOffsetNumber and MovedPartitionsOffsetNumber). I am not well enough aware of how these OffsetNumber values are used, but they might also limit the values any tableAM can use for its TIDs.
> It just seems like it's used when scanning heap or index pages for
> stack-allocated arrays. For a table AM it appears to waste 5 bits.
MaxOffsetNumber belongs to postgres' Page layout: it is defined as the maximum number of item pointers that could exist on a page, and AFAIK it should be used for postgres' Page layout only. Nothing can or should change that. If any code asserts limitations on the ip_posid of table tuples that might not be heapam tuples, then I believe that is probably a mistake in postgres, and it should be fixed.
> * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
> offset number. Is this a bug?
No. The documentation for that function explicitly mentions that these item pointers are optimized for storage when using the heap tableam, and that that code will be changed once there exist tableAMs with different TID / ip_posid constraints (see the comment on lines 32 and 33 of that file).
Note that the limiting number that itemptr_to_uint64 should mention for bit calculations is actually MaxHeapTuplesPerPage, which is about one seventh of MaxOffsetNumber. The resulting number of bits reserved is not a miscalculation though, because MaxHeapTuplesPerPage (for 32 kiB BLCKSZ) requires the mentioned 11 bits, and adapting the bit swizzling for multiple page sizes was apparently not considered worth the effort.
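Paraphrased, the packing in that file works roughly like this (a sketch, not the verbatim source; the 11-bit constant is the one mentioned above):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* 2^11 = 2048, so offsets 1..2047 fit below the block-number bits. */
    #define MaxHeapTuplesPerPageBits 11

    static uint64
    itemptr_to_uint64_sketch(BlockNumber blk, OffsetNumber off)
    {
        uint64 val;

        val = blk;
        val <<= MaxHeapTuplesPerPageBits; /* reserve the low 11 bits for ip_posid */
        val |= off;                       /* an offset above 2047 would spill into
                                           * the block-number bits */
        return val;
    }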
> As a table AM author, I'd like to know what the real limits are so that
> I can use whatever bits are available to map from TID to row number and
> back, without worrying that something will break in the future. A few
> possibilities:
> 1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
I believe that this is the right way when there exist tableAMs that use those upper 5 bits.
> 2. Change MaxOffsetNumber to 2047. This seems likely to break
> extensions that rely on it.
If you're going to change MaxOffsetNumber, I believe that it's better to change it to ((BLCKSZ - sizeof(PageHeaderData)) / sizeof(ItemIdData)), as that is the maximum number of ItemIds you could put on a Page that has no page opaque.
> 3. Define MaxOffsetNumber as 65536 and introduce a new
> MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
> to fix itemptr_to_uint64().
I believe that that breaks more things than otherwise required. ip_posid is already limited to uint16, so I see no reason to add a constant that would assert that the value of any uint16 is less than its max value plus one.
With regards,
Matthias van de Meent
Jeff Davis <pgsql@j-davis.com> writes:
> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.
> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).

I'm inclined to think that when we get around to doing something about this, we need to make a much bigger change than just poking at the margins of type tid. My thought at the moment is that all APIs above the AM level ought to be redefined to use uint64 for tuple identifiers. heapam and related index AMs could map block + offset into that in some convenient way, and other AMs could do what they like.

Andres seems to feel that we should try to allow variable-width tupleids, but I'm afraid that the cost/benefit ratio for that would be pretty poor.

			regards, tom lane
On Fri, Apr 30, 2021 at 8:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers. heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

I agree. It'll be easier for a new table AM to be developed with that constraint than it will be to retrofit it to every index AM.

It probably wouldn't be that hard to make nbtree deduplication and GIN posting list compression work with uint64 TIDs. But variable-width TIDs are a very different matter.

Compatibility with index AMs is more than a matter of switching out the tuple identifier -- if you invent something that has totally different performance characteristics for index AMs, then it's likely to break tacit assumptions about the cost of certain things. For example, index tuple deletion probably relies on the fact that there just aren't that many table blocks to visit (to get an XID for recovery conflict purposes) in practice due to various locality-related effects. Plus deduplication ameliorates problems with version churn in indexes -- presumably the same problems will exist when any new table AM is used, and so it'll be useful to address the same problems in the same way.

I agree that it might well be useful to make TIDs fully logical (not "physiological" -- physical across blocks, logical within blocks) for some new table AM. Even then, it would still definitely be a good idea to make these logical TID values correlate with the physical table structure in some way. Plus TIDs should still be fixed size. If a new table AM can't do it that way then that certainly needs to be justified -- it's unreasonable to imagine that it simply isn't the table AM's problem to solve.

--
Peter Geoghegan
On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.

One challenge might be reliance on InvalidOffsetNumber as a special value in a number of places (e.g. bitmap index scans). That doesn't seem like a major problem though.

> heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.

Do you mean that indexes would be expected to hold a uint64, a 48-bit int (that directly maps to a uint64), or still hold an ItemPointerData?

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
>> My thought at the moment is that all APIs above the AM level ought
>> to be redefined to use uint64 for tuple identifiers.

> Do you mean that indexes would be expected to hold a uint64, a 48-bit
> int (that directly maps to a uint64), or still hold an ItemPointerData?

ISTM that would be up to the index AM. We'd need some interlocks on which index AMs could be used with which table AMs in any case, I think. It'd likely not be hard for existing index AMs to be repurposed to store "any 48-bit TID", but extending them to full 64-bit TIDs may be impractical.

I think the hard part may really be in places like tidbitmap.c, which one would wish to be AM-independent, but right now it's quite specific to heap-style TIDs. Maybe we can think of a way to parameterize it.

			regards, tom lane
On Fri, 2021-04-30 at 12:04 +0200, Matthias van de Meent wrote:
> Other than that, I believe you've also missed the special offset
> numbers specified in itemptr.h (SpecTokenOffsetNumber and
> MovedPartitionsOffsetNumber). I am not well enough aware of the usage
> of these OffsetNumber values, but those might also be limiting the
> values any tableAM can use for their TIDs.

Yes, thank you. If it is treated specially in a heap tuple, it can't be a regular TID.

> > It just seems like it's used when scanning heap or index pages for
> > stack-allocated arrays. For a table AM it appears to waste 5 bits.
>
> MaxOffsetNumber is used for postgres' Page layout, of which the
> MaxOffsetNumber is defined as how many item pointers could exist on a
> page, and AFAIK should be used for postgres' Page layout only. No
> thing can or should change that. If any code asserts limitations to
> the ip_posid of table tuples that could also not be tableam tuples,
> then I believe that is probably a mistake in postgres, and that
> should be fixed.

A name that would better fit your definition would be something like "MaxItemsPerPage". The name "MaxOffsetNumber" implies that any number past that must be either invalid or special. But it seems like you are saying that if I use an offset number of 5000 in my table AM, then that's fine and should be treated like a normal TID.

> No. The documentation for that function explicitly mentions that
> these item pointers are optimized for storage when using the heap
> tableam, and that that code will be changed once there exist tableAMs
> with different TID / ip_posid constraints (see the comment on lines
> 32 and 33 of that file).

Thank you. I'm a table AM author, and I'd like to use whatever the real range of TIDs is. Does that mean it's time to change that code in ginpostinglist.c now?

> > 1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
>
> I believe that this is the right way when there exist tableAMs that
> use those upper 5 bits.

Does that mean we should declare the valid range of offsets to be between 1 and 0xfffc (inclusive)?

I'm trying to use some mapping now that's somewhat stable so that I don't have to worry that something will break later, and then require reindexing all tables with my table AM.

Regards,
	Jeff Davis
On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers. heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

There are two major reasons why I want variable-width tuple IDs. One is global indexes, where you need as many bits as the AMs implementing the partitions need, plus some extra bits to identify which partition is relevant for a particular tuple. No fixed number of bits that you make available can ever be sufficient here, because a global index always needs to have extra bits compared to a partition-local index; if you let the partition-local index use more bits, the global index now needs even more space. The other is indirect indexes, work Álvaro proposed a few years ago, where the index entry points to the primary key value rather than a TID. The space needs for that are based on the type of the primary key column. This proposal solves neither of those problems.

Another problem in this general area is that there is quite a bit of code that thinks a TID is specifically a block number and an offset, like the Bitmap Index/Heap Scan code, for example. But making tuple IDs wider doesn't help with that problem either.

What problem do you think this proposal does solve?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> Compatibility with index AMs is more than a matter of switching out
> the tuple identifier -- if you invent something that has totally
> different performance characteristics for index AMs, then it's likely
> to break tacit assumptions about the cost of certain things.

I think it would be reasonable to document and expect that table AMs offer some locality of access for tuples with similar IDs. Do you think we need something stronger than that?

> Plus deduplication ameliorates problems with version churn in
> indexes -- presumably the same problems will exist when any new table
> AM is used, and so it'll be useful to address the same problems in the
> same way.

I got lost after "presumably the same problems", can you explain?

> I agree that it might well be useful to make TIDs fully logical (not
> "physiological" -- physical across blocks, logical within blocks) for
> some new table AM. Even then, it would still definitely be a good idea
> to make these logical TID values correlate with the physical table
> structure in some way.

Agreed.

Regards,
	Jeff Davis
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres seems to feel that we should try to allow variable-width
>> tupleids, but I'm afraid that the cost/benefit ratio for that
>> would be pretty poor.

> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here,

I agree that global indexes need more bits, but it doesn't necessarily follow that we must have variable-width TIDs. We could for example say that "real" TIDs are only 48 bits and index AMs that want to be usable as global indexes must be capable of handling 64-bit TIDs, leaving 16 bits for partition ID. A more forward-looking definition would require global index AMs to store 96 bits (partition OID plus 64-bit TID). Either way would be far simpler for every moving part involved than going over to full varlena TIDs.

> Another problem in this general area is that there is quite a bit of
> code that thinks a TID is specifically a block number and an offset,
> like the Bitmap Index/Heap Scan code, for example. But making tuple
> IDs wider doesn't help with that problem either.

Agreed, that's an area that will need a lot of thought for anything that we do here. But varlena TIDs surely do not make that easier to fix.

> What problem do you think this proposal does solve?

Accommodating table AMs that want more than 48 bits for a TID. We're already starting to run up against the fact that that's not enough bits for plausible use-cases. 64 bits may someday in the far future not be enough either, but I think that's a very long way off.

			regards, tom lane
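As an illustration of the packing described above (the field widths are the ones from the mail; none of these helpers exist in Postgres), a 64-bit "global TID" could be assembled like this:

    #include <stdint.h>

    #define LOCAL_TID_BITS 48
    #define LOCAL_TID_MASK ((UINT64_C(1) << LOCAL_TID_BITS) - 1)

    /* Pack a 16-bit partition ID together with a 48-bit per-partition TID. */
    static inline uint64_t
    make_global_tid(uint16_t partition_id, uint64_t local_tid48)
    {
        return ((uint64_t) partition_id << LOCAL_TID_BITS) |
               (local_tid48 & LOCAL_TID_MASK);
    }

    /* Extract the 16-bit partition ID from the high bits. */
    static inline uint16_t
    global_tid_partition(uint64_t gtid)
    {
        return (uint16_t) (gtid >> LOCAL_TID_BITS);
    }

    /* Extract the 48-bit per-partition TID from the low bits. */
    static inline uint64_t
    global_tid_local(uint64_t gtid)
    {
        return gtid & LOCAL_TID_MASK;
    }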
On Fri, Apr 30, 2021 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There are two major reasons why I want variable-width tuple IDs. One
> > is global indexes, where you need as many bits as the AMs implementing
> > the partitions need, plus some extra bits to identify which partition
> > is relevant for a particular tuple. No fixed number of bits that you
> > make available can ever be sufficient here,
>
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs. We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID. A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID). Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

The question of how the on-disk format of indexes needs to be changed to accommodate global indexes seems like an entirely separate question from how we go about expanding or redefining TIDs.

Global indexes should work by adding an extra column that is somewhat like a TID, that may even have its own pg_attribute entry. It's much more natural to make the partition number a separate column IMV -- nbtree suffix truncation and deduplication can work in about the same way as before. Plus you'll need to do predicate pushdown using the partition identifier in some scenarios anyway. You can make the partition identifier variable-width without imposing the cost and complexity of variable-width TIDs on index AMs.

I believe that the main reason why there have been so few problems with any of the nbtree work in the past few releases is that it avoided certain kinds of special cases. Any special cases in the on-disk format and in the space accounting used when choosing a split point ought to be avoided at all costs. We can probably afford to add a lot of complexity to make global indexes work, but it ought to be contained to cases that actually use global indexes in an obvious way.

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs. We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID. A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID). Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

16 bits is not much for a partition identifier. We've already had complaints about INNER_VAR being too small, so apparently there are people who want to use really large numbers of partitions. But even if we imagine a hypothetical world where nobody uses more than a couple thousand partitions at once, it's very reasonable to want to avoid recycling partition identifiers so that detaching a partition can be O(1), and there's no way that's going to be viable if the whole address space is only 16 bits, because with time series data people are going to be continually creating new partitions and dropping old ones. I would guess that it probably is viable with 32 bits, but we'd have to have a mapping layer rather than using the OID directly to avoid wraparound collisions.

Now this problem can be avoided by just requiring the AM to store more bits, exactly as you say. I suspect 96 bits is large enough for all of the practical use cases people have, or at least within spitting distance. But it strikes me as relatively inefficient to say that we're always going to store 96 bits for every TID. I certainly don't think we want to break on-disk compatibility and widen every existing btree index by changing all the 6-byte TIDs they're storing now to 12-byte TIDs that are at least half zero bytes, so I think we're bound to end up with at least two options: 6 and 12. But variable-width would be a lot nicer. You could store small TIDs and small partition identifiers very compactly, and only use the full number of bytes when the situation demands it.

> > What problem do you think this proposal does solve?
>
> Accommodating table AMs that want more than 48 bits for a TID.
> We're already starting to run up against the fact that that's not
> enough bits for plausible use-cases. 64 bits may someday in the far
> future not be enough either, but I think that's a very long way off.

Do people actually want to store more than 2^48 rows in a table, or is this more about the division of a TID into a block number and an item number?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2021-04-30 at 12:51 -0400, Robert Haas wrote:
> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here, because a global index
> always needs to have extra bits compared to a partition-local index;
> if you let the partition-local index use more bits, the global index
> now needs even more space. The other is indirect indexes, work Álvaro
> proposed a few years ago, where the index entry points to the primary
> key value rather than a TID. The space needs for that are based on the
> type of the primary key column. This proposal solves neither of those
> problems.

The particular problem I have now is that Table AMs seem to support indexes just fine, but TIDs are under-specified so I don't know what I really have to work with. BlockNumber seems well-specified as 0..0XFFFFFFFE (inclusive), but I don't know what the valid range of OffsetNumber is for the purposes of a TableAM. Part of changing to uint64 would be specifying the TIDs in a way that I could rely on in the future.

The problems you mention above are above the table AM layer, so they seem orthogonal. There would still need to be an ID that table AMs can use to fetch a tuple from a particular physical table.

In the future we may support primary unique indexes at the table AM layer, which would get more interesting. I can see an argument for a TID being an arbitrary datum in that case, but I haven't really considered the design implications. Is this what you are suggesting?

Regards,
	Jeff Davis
On Fri, Apr 30, 2021 at 1:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Global indexes should work by adding an extra column that is somewhat
> like a TID, that may even have its own pg_attribute entry. It's much
> more natural to make the partition number a separate column IMV --
> nbtree suffix truncation and deduplication can work in about the same
> way as before. Plus you'll need to do predicate pushdown using the
> partition identifier in some scenarios anyway. You can make the
> partition identifier variable-width without imposing the cost and
> complexity of variable-width TIDs on index AMs.

I agree up to a point but ... are you imagining that the TID continues to have its own special place in the page, while the partition identifier is stored more like a regular tuple column? Because it seems to me that it would be better to try to eliminate the special-storage case, just like we did for OIDs. If you want a 6-byte TID, put a 6-byte column in the tuple for it. If you also want a partition identifier, put an extra column in the tuple for that. If you want a wider TID or a varlena TID, well, put columns for that into the tuple instead of the 6-byte column you'd normally put. This seems extremely flexible and a lot more aesthetically appealing than what we have today.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 30, 2021 at 10:04 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> > Compatibility with index AMs is more than a matter of switching out
> > the tuple identifier -- if you invent something that has totally
> > different performance characteristics for index AMs, then it's likely
> > to break tacit assumptions about the cost of certain things.
>
> I think it would be reasonable to document and expect that table AMs
> offer some locality of access for tuples with similar IDs. Do you think
> we need something stronger than that?

I don't know. This conversation is still too abstract for me to be able to take a firm position. ISTM that we tend to talk about the table AM in excessively abstract terms. It would be a lot easier if we had clear fixed goals for a small number of additional table AMs.

> > Plus deduplication ameliorates problems with version churn in
> > indexes -- presumably the same problems will exist when any new table
> > AM is used, and so it'll be useful to address the same problems in the
> > same way.
>
> I got lost after "presumably the same problems", can you explain?

Well, there are now two things in nbtree that specifically help with version churn caused by non-HOT updates: deduplication, and bottom-up index deletion (especially the latter). Presumably any new table AM will have something like non-HOT updates. Though they may rarely be a problem (say because the new table AM isn't really for OLTP), whenever they are a problem they'll be a very big problem. It seems like a good idea to have the same index AM level protections against accumulating version-churn index tuples in an unbounded way.

More generally, it seems like a good idea to try to make new table AMs reasonably close to heapam insofar as possible. The reality is that everything evolved around heapam, and that that's likely to matter in all kinds of ways that nobody fully understands just yet. We have a somewhat abstract table AM interface, which is good, but that doesn't mean that table AMs can be designed as if it was a green field situation.

--
Peter Geoghegan
On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> ISTM that would be up to the index AM. We'd need some interlocks on
> which index AMs could be used with which table AMs in any case, I
> think.

I'm not sure why? It seems like we should be able to come up with something that's generic enough.

> I think the hard part may really be in places like tidbitmap.c, which
> one would wish to be AM-independent, but right now it's quite specific
> to heap-style TIDs. Maybe we can think of a way to parameterize it.

For my particular AM, being able to have a parameterized granularity might be nice, but not required.

Regards,
	Jeff Davis
On Fri, Apr 30, 2021 at 1:37 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The particular problem I have now is that Table AMs seem to support
> indexes just fine, but TIDs are under-specified so I don't know what I
> really have to work with. BlockNumber seems well-specified as
> 0..0XFFFFFFFE (inclusive), but I don't know what the valid range of
> OffsetNumber is for the purposes of a TableAM.

I agree that this is a problem.

> Part of changing to uint64 would be specifying the TIDs in a way that I
> could rely on in the future.

I mean, from my perspective, the problem here is that the abstraction layer is leaky and things outside of the table AM layer know what heap is doing under the hood, and rely on it. If we could refactor the abstraction to be less leaky, it would be clearer what assumptions table AM authors can make. If we can't, any specification doesn't seem worth much.

> In the future we may support primary unique indexes at the table AM
> layer, which would get more interesting. I can see an argument for a
> TID being an arbitrary datum in that case, but I haven't really
> considered the design implications. Is this what you are suggesting?

I think that would be the best long-term plan. I guess I have two distinguishable concerns. One is that I want to be able to have indexes with a payload that's not just a 6-byte TID; e.g. adding a partition identifier to support global indexes, or replacing the 6-byte TID with a primary key reference to support indirect indexes. The other concern is to be able to have table AMs that use arbitrary methods to identify a tuple. For example, if somebody implemented an index-organized table, the "TID" would really be the primary key.

Even though these are distinguishable concerns, they basically point in the same direction as far as index layout is concerned. The implications for the table AM layer are somewhat different in the two cases, but both argue that some places that are now talking about TIDs should be changed to talk about Datums or something of that sort.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 30, 2021 at 10:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I agree up to a point but ... are you imagining that the TID continues
> to have its own special place in the page, while the partition
> identifier is stored more like a regular tuple column? Because it
> seems to me that it would be better to try to eliminate the
> special-storage case, just like we did for OIDs.

I agree in principle, but making that work well is very hard in practice because of the format of IndexTuple -- which bleeds into everything. That TID is special is probably a natural consequence of the fact that we don't have an offset-based format of the kind you see in other DB systems -- systems that don't emphasize extensibility. We cannot jump to a hypothetical TID attribute inexpensively inside code like _bt_compare() because we don't have a cheap way to jump straight to the datum for any attribute. So we just store TID in IndexTuple directly instead. Imagine how much more expensive VACUUM would be if it had to grovel through the IndexTuple format.

I wonder how the same useful performance characteristics can be maintained with a variable-width TID design. If you solve the problem by changing IndexTuple, then you are kind of obligated to not use varlena headers to keep the on-disk size manageable. Who knows where it all ends?

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 10:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think that would be the best long-term plan. I guess I have two
> distinguishable concerns. One is that I want to be able to have
> indexes with a payload that's not just a 6-byte TID; e.g. adding a
> partition identifier to support global indexes, or replacing the
> 6-byte TID with a primary key reference to support indirect indexes.
> The other concern is to be able to have table AMs that use arbitrary
> methods to identify a tuple. For example, if somebody implemented an
> index-organized table, the "TID" would really be the primary key.
>
> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about TIDs
> should be changed to talk about Datums or something of that sort.

I don't know how it's possible to do any of this without first addressing what the table AM does in cases where heapam currently does a non-HOT update. You obviously cannot have the equivalent of duplicate TIDs when your new table AM runs into these scenarios. So what do you do instead? How do you make your clustered index/IoT style identifiers (i.e. your strictly logical TID-like identifiers) deal with that case?

ISTM that this is by far the biggest issue with generalizing the table AM for use by a table AM that isn't already very much like heapam. I am always surprised to be the one that has to point it out during these discussions. It's a huge issue.

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree in principle, but making that work well is very hard in
> practice because of the format of IndexTuple -- which bleeds into
> everything. That TID is special is probably a natural consequence of
> the fact that we don't have an offset-based format of the kind you see
> in other DB systems -- systems that don't emphasize extensibility. We
> cannot jump to a hypothetical TID attribute inexpensively inside code
> like _bt_compare() because we don't have a cheap way to jump straight
> to the datum for any attribute. So we just store TID in IndexTuple
> directly instead. Imagine how much more expensive VACUUM would be if
> it had to grovel through the IndexTuple format.

I can't imagine that, so maybe you want to enlighten me? I see that there's a potential problem there, and I'm glad you pointed it out because I hadn't thought about it previously ... but if you always put the column or columns that VACUUM would need first, it's not obvious to me that it would be all that expensive. Deforming the tuple to a sufficient degree to extract the first column, which would even be fixed-width, shouldn't take much work.

> I wonder how the same useful performance characteristics can be
> maintained with a variable-width TID design. If you solve the problem
> by changing IndexTuple, then you are kind of obligated to not use
> varlena headers to keep the on-disk size manageable. Who knows where
> it all ends?

What's wrong with varlena headers? It would end up being a 1-byte header in practically every case, and no variable-width representation can do without a length word of some sort. I'm not saying varlena is as efficient as some new design could hypothetically be, but it doesn't seem like it'd be a big enough problem to stress about. If you used a variable-width representation for integers, you might actually save bytes in a lot of cases. An awful lot of the TIDs people store in practice probably contain several zero bytes, and if we make them wider, that's going to be even more true.

--
Robert Haas
EDB: http://www.enterprisedb.com
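To illustrate the kind of variable-width integer representation alluded to here, a base-128 varint encoder could look like the sketch below. This is purely illustrative and not an existing Postgres facility; small TIDs encode in one or two bytes, and larger ones grow only as needed.

    #include <stdint.h>

    /* Encode val as a base-128 varint into out; returns the byte count (1..10). */
    static int
    encode_varint_u64(uint64_t val, uint8_t *out)
    {
        int len = 0;

        do
        {
            uint8_t byte = val & 0x7F;

            val >>= 7;
            if (val != 0)
                byte |= 0x80;   /* continuation bit: more bytes follow */
            out[len++] = byte;
        } while (val != 0);

        return len;
    }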
On Fri, 2021-04-30 at 10:50 -0700, Peter Geoghegan wrote:
> I don't know. This conversation is still too abstract for me to be
> able to take a firm position. ISTM that we tend to talk about the
> table AM in excessively abstract terms. It would be a lot easier if we
> had clear fixed goals for a small number of additional table AMs.

https://github.com/citusdata/citus/tree/master/src/backend/columnar

My colleagues and I have been working on a "columnar" table AM. It doesn't currently support indexes, but it would be useful to support them.

The basic idea is we have "stripes" of ~150000 tuples that are rearranged and compressed, and stored in an smgr-controlled file that goes through the buffer cache and uses generic WAL.

To support indexes, we could do our own lookups from a "row number" to a particular offset where we can find and decompress the stripe that holds that row number, and then scan forward in the stripe to find the particular row. This will be terrible for random access, but [*waves hands*] we will keep state and use a few optimizations so that this is not terribly slow for clustered access.

Granted, TID lookup on columnar will be much slower than for a heap (and we can use a CustomScan so that the costs reflect that). But it will satisfy important use cases:

1. Indexes defined on partition parent tables. Even if the index is never used for queries, we don't want to throw an error when defining the partitioned parent index.
2. Unique indexes and exclusion constraints.
3. Clustered index scans can still be reasonably fast.
4. Could be used for UPDATE/DELETE as well.

> More generally, it seems like a good idea to try to make new table AMs
> reasonably close to heapam insofar as possible. The reality is that
> everything evolved around heapam, and that that's likely to matter in
> all kinds of ways that nobody fully understands just yet.

Agreed. I think of this as an evolving situation where we take steps toward a better abstraction. One (hopefully reasonable) step I'd like to take is a well-specified TID.

Regards,
	Jeff Davis
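To make the row-number-to-stripe lookup described above a bit more concrete, here is a hypothetical sketch. StripeMetadata and find_stripe_for_row are invented names for illustration only, not the actual columnar code.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct StripeMetadata
    {
        uint64_t first_row_number; /* first logical row stored in the stripe */
        uint64_t row_count;        /* number of rows in the stripe */
        uint64_t file_offset;      /* where the compressed stripe starts */
    } StripeMetadata;

    /* Binary-search a sorted stripe list for the stripe holding rownum. */
    static StripeMetadata *
    find_stripe_for_row(StripeMetadata *stripes, int nstripes, uint64_t rownum)
    {
        int lo = 0;
        int hi = nstripes - 1;

        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            StripeMetadata *s = &stripes[mid];

            if (rownum < s->first_row_number)
                hi = mid - 1;
            else if (rownum >= s->first_row_number + s->row_count)
                lo = mid + 1;
            else
                return s;   /* decompress from s->file_offset, then scan forward
                             * (rownum - first_row_number) rows within the stripe */
        }
        return NULL;
    }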
On Fri, 2021-04-30 at 13:56 -0400, Robert Haas wrote:
> I think that would be the best long-term plan.

We should still have *some* answer in the short term for table AM authors like me.

If I use offset numbers as high as MaxOffsetNumber, then itemptr_to_uint64 will fail. If I base my calculations for the TID to row number mapping on MaxOffsetNumber at all, then it will break if we change MaxOffsetNumber (as was suggested[1]).

My takeaway so far is that the only safe thing to do is hard code it to 2000. I suppose I can do that until we settle on something better (at which point I can force a reindex, I suppose).

[1] https://postgr.es/m/CAEze2Wit1EtHHBHJ+CYvBPthrWUzu2Vqc-BmzU3ApK3iotHriw@mail.gmail.com

> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about TIDs
> should be changed to talk about Datums or something of that sort.

Logically, that makes a lot of sense to me. Peter seems to have quite a few practical implementation concerns though, so it could be a long road.

Regards,
	Jeff Davis
On Fri, Apr 30, 2021 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I agree in principle, but making that work well is very hard in
> > practice because of the format of IndexTuple -- which bleeds into
> > everything. That TID is special is probably a natural consequence of
> > the fact that we don't have an offset-based format of the kind you see
> > in other DB systems -- systems that don't emphasize extensibility. We
> > cannot jump to a hypothetical TID attribute inexpensively inside code
> > like _bt_compare() because we don't have a cheap way to jump straight
> > to the datum for any attribute. So we just store TID in IndexTuple
> > directly instead. Imagine how much more expensive VACUUM would be if
> > it had to grovel through the IndexTuple format.
>
> I can't imagine that, so maybe you want to enlighten me? I see that
> there's a potential problem there, and I'm glad you pointed it out
> because I hadn't thought about it previously ... but if you always put
> the column or columns that VACUUM would need first, it's not obvious
> to me that it would be all that expensive.

Maybe. The point is that it is a problem that needs to be solved.

> Deforming the tuple to a
> sufficient degree to extract the first column, which would even be
> fixed-width, shouldn't take much work.

I think that it's reasonable to impose some cost on index AMs here, but that needs to be bounded sensibly and unambiguously. For example, it would probably be okay if you had either 6 byte or 8 byte TIDs, but no other variations. You could require index AMs (the subset of index AMs that are ever able to store 8 byte TIDs) to directly encode which width they're dealing with at the level of each IndexTuple. That would create some problems for nbtree deduplication, especially in boundary cases, but ISTM that you can manage the complexity by sensibly restricting how the TIDs work across the board. For example, the TIDs should always work like unsigned integers -- the table AM must be willing to work with that restriction.

You'd then have posting list tuples in nbtree whose TIDs were all either 6 bytes or 8 bytes wide, with a mix of each possible (though not particularly likely) on the same leaf page. Say when you have a table that exceeds the current MaxBlockNumber restrictions. It would be relatively straightforward for nbtree deduplication to simply refuse to mix 6 byte and 8 byte datums together to avoid complexity in boundary cases. The deduplication pass logic has the flexibility that this requires already.

> > I wonder how the same useful performance characteristics can be
> > maintained with a variable-width TID design. If you solve the problem
> > by changing IndexTuple, then you are kind of obligated to not use
> > varlena headers to keep the on-disk size manageable. Who knows where
> > it all ends?
>
> What's wrong with varlena headers? It would end up being a 1-byte
> header in practically every case, and no variable-width representation
> can do without a length word of some sort. I'm not saying varlena is
> as efficient as some new design could hypothetically be, but it
> doesn't seem like it'd be a big enough problem to stress about. If you
> used a variable-width representation for integers, you might actually
> save bytes in a lot of cases. An awful lot of the TIDs people store in
> practice probably contain several zero bytes, and if we make them
> wider, that's going to be even more true.

Maybe all of this is true, and maybe it works out to be the best path forward in the long term, all things considered. But whether or not that's true is crucially dependent on what real practical table AMs (of which there will only ever be a tiny number) actually need to do.

Why should we assume that the table AM cannot accept some restrictions? What good does it do to legalistically define the problem as a problem for index AMs to solve?

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I don't know how it's possible to do any of this without first
> addressing what the table AM does in cases where heapam currently does
> a non-HOT update.

Why can't it do what it does already? It's not broken for heap, so why should it be broken for anything else? And why are non-HOT updates specifically a problem?

> You obviously cannot have the equivalent of
> duplicate TIDs when your new table AM runs into these scenarios. So
> what do you do instead? How do you make your clustered index/IoT style
> identifiers (i.e. your strictly logical TID-like identifiers) deal
> with that case?

Is the problem you're worried about here that, with something like an index-organized table, you can have multiple row versions that have the same logical tuple ID, i.e. primary key value? And that the interfaces aren't well-suited to that? Because that's a problem I have thought about and can comment on, even though I think the question of having multiple versions with the same TID is distinguishable from the question of how *wide* TIDs should be. But maybe that's not what you are talking about here, in which case I guess I need a clearer explanation of the concern.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 30, 2021 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Why can't it do what it does already? It's not broken for heap, so why
> should it be broken for anything else? And why are non-HOT updates
> specifically a problem?

No reason.

> > You obviously cannot have the equivalent of
> > duplicate TIDs when your new table AM runs into these scenarios. So
> > what do you do instead? How do you make your clustered index/IoT style
> > identifiers (i.e. your strictly logical TID-like identifiers) deal
> > with that case?
>
> Is the problem you're worried about here that, with something like an
> index-organized table, you can have multiple row versions that have
> the same logical tuple ID, i.e. primary key value? And that the
> interfaces aren't well-suited to that? Because that's a problem I have
> thought about and can comment on, even though I think the question of
> having multiple versions with the same TID is distinguishable from the
> question of how *wide* TIDs should be. But maybe that's not what you
> are talking about here, in which case I guess I need a clearer
> explanation of the concern.

That's what I'm talking about. I'd like to hear what you think about it. It's not exactly a narrow concern. For one thing, it is enough to totally validate my suggestion about how we might widen TIDs and still have nbtree deduplication.

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 3:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Is the problem you're worried about here that, with something like an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? And that the
> > interfaces aren't well-suited to that? Because that's a problem I have
> > thought about and can comment on, even though I think the question of
> > having multiple versions with the same TID is distinguishable from the
> > question of how *wide* TIDs should be. But maybe that's not what you
> > are talking about here, in which case I guess I need a clearer
> > explanation of the concern.
>
> That's what I'm talking about. I'd like to hear what you think about it.

OK. I thought about this in regards to zheap, which has this exact problem, because it wants to do so-called "in place" updates where the new version of the row goes right on top of the old one in the table page, and the old version of the row gets written into the undo log. Just to keep things simple, we said that initially we'd only use this in-place update strategy when no indexed columns were changed, so that there's only ever one set of index entries for a given TID. In that model, the index AMs don't really need to care that there are actually multiple tuples for the same TID, because those tuples differ only in columns that the index doesn't care about anyway. An index scan has to be careful to fetch the correct version of the tuple, but it has a Snapshot available, so it can do that.

However, there's no easy and efficient way to handle updates and deletes. Suppose for example that a tuple has been updated 5 times, creating versions t1..t5. t5 is now in the zheap page, and the other versions are in the undo. t5 points to t4 which points to t3 and so forth. Now an updater comes along and let's say that the updater's snapshot sees t2. It may be that t3..t5 are *uncommitted* updates, in which case the attempt to update t2 may succeed if the transaction that performed them aborts, or it may be that the updating transactions have committed, in which case we're going to have to fail. But that decision isn't made by the scan that sees t2; it happens when the TID reaches the ModifyTable node. So what zheap ends up doing is finding the right tuple version during the scan, by making use of the snapshot, and then having to go repeat that work when it's time to try to perform the update. It would be nice to avoid this. If we could feed system columns from the scan through to the update, we could pass along an undo pointer and avoid the extra overhead. So it seems to me, so far anyway, that there's no very fundamental problem here, but there is an efficiency issue which we could address if we had a bit more planner and executor infrastructure to help us out.

Now in the long run the vision for zheap was that we'd eventually want to do in-place updates even when indexed columns have been modified, and this gets a whole lot trickier, because now there can be multiple sets of index entries pointing at the same TID which don't agree on the values of the indexed columns. As old row versions die off, some of those pointers need to be cleaned out, and others do not. I thought we might solve this problem by something akin to retail index deletion: have an update or delete on a zheap tuple go re-find the associated index entries and mark them for possible cleanup, and then have vacuum ignore all unmarked tuples. There might be some efficiency problems with this idea I hadn't considered, based on your remarks today. But regardless of the wisdom or folly of this approach, the broader point is that we can't assume that all heap types are going to have the same maintenance requirements.

I think most of them are going to have some kind of maintenance operation that needs to, or at least can optionally, be performed from time to time, but it might be triggered by completely different criteria than vacuum. New table AMs might well choose to use 64-bit XIDs, avoiding the need for wraparound processing altogether. Maybe they have such good opportunistic cleanup mechanisms that periodic vacuum for bloat isn't even really needed. Maybe they bloat when updates and deletes commit but not when inserts and updates abort, because those cases are handled via some other mechanism. Who knows, really? It's hard to predict what not-yet-written AMs might care about, and even if we knew, it seems crazy to try to rewrite the way vacuum works to cater to those needs before we actually have some working AMs to use as a testbed.

It strikes me that certain interesting cases might not really need anything very in-depth here. For example, consider indirect indexes, where the index references the primary key value rather than the TID. Well, the indirect index should probably be vacuumed periodically to prevent bloat, but it doesn't need to be vacuumed to recycle TIDs because it doesn't contain TIDs. BRIN indexes, BTW, also don't contain TIDs. Either could, therefore, be optionally vacuumed after vacuum has done absolutely everything else, even truncate the table, or they could be vacuumed on a completely separate schedule that doesn't have anything to do with table vacuuming. I suppose we'd have to come up with some solution, but I don't think it would need to be fully general; it could just be good enough for that particular feature, since fully general seems rather impossible anyway. So I feel like it's pretty fair to just defer this question. Without some solution you can't entirely finish a project like indirect indexes, but without variable-width index payloads you can't even start it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2021-04-30 at 12:29 -0700, Peter Geoghegan wrote:
> > Is the problem you're worried about here that, with something like an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value?
>
> That's what I'm talking about. I'd like to hear what you think about
> it.

FWIW, this is not a problem in my table AM. I am fine having different TIDs for each version, just like heapam.

For index-organized tables it does seem like an interesting problem.

Regards,
	Jeff Davis
On Fri, Apr 30, 2021 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> FWIW, this is not a problem in my table AM. I am fine having different
> TIDs for each version, just like heapam.

This means that we are largely in agreement about the general nature of the problem. That seems like a good basis to redefine TID-like identifiers so that they can accommodate what you want to do.

> For index-organized tables it does seem like an interesting problem.

I strongly suspect that index-organized tables (or indirect indexes, or anything else that assumes that TID-like identifiers map directly to logical rows as opposed to physical versions) are going to break too many assumptions to ever be tractable. Assuming I have that right, it would advance the discussion if we could all agree on that being a non-goal for the tableam interface in general.

This would allow us to clearly discuss how to solve the remaining problem of accommodating column stores and suchlike. That seems hard, but much more tractable. The fact that the tableam has almost no non-goals has always bothered me a bit. Especially on this particular point about purely logical TID-like identifiers.

--
Peter Geoghegan
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > OK. I thought about this in regards to zheap, which has this exact > problem, because it wants to do so-called "in place" updates where the > new version of the row goes right on top of the old one in the table > page, and the old version of the row gets written into the undo log. > Just to keep things simple, we said that initially we'd only use this > in-place update strategy when no indexed columns were changed, so that > there's only ever one set of index entries for a given TID. In that > model, the index AMs don't really need to care that there are actually > multiple tuples for the same TID, because those tuples differ only in > columns that the index doesn't care about anyway. An index scan has to > be careful to fetch the correct version of the tuple, but it has a > Snapshot available, so it can do that. Right. So zheap (in the current prototype implementation) is like heapam with its HOT optimization, except that it isn't subject to the same limitations with regard to fitting heap tuples on the same heap page to keep the same HOT chain going over time. You kind of have the moral equivalent of a HOT chain that can largely live in UNDO. That seems like a very useful thing on its own. A lot of the problems with HOT are in this area -- maybe the vast majority, even. A remaining problem is that we must generate a new round of index tuples for each and every index when only one indexed column is logically modified by an UPDATE statement. I think that this is much less of a problem now due to bottom-up index deletion. Sure, it sucks that we still have to dirty the page at all. But it's nevertheless true that it all but eliminates version-driven page splits, which are where almost all of the remaining downside is. It's very reasonable to now wonder if this particular all-indexes problem is worth solving at all in light of that. (Modern hardware characteristics also make a comprehensive fix less valuable in practice.) > However, there's no easy and > efficient way to handle updates and deletes. Suppose for example that > a tuple has been updated 5 times, creating versions t1..t5. t5 is now > in the zheap page, and the other versions are in the undo. t5 points > to t4 which points to t3 and so forth. Now an updater comes along and > let's say that the updater's snapshot sees t2. It may be that t3..t5 > are *uncommitted* updates in which case the attempt to update t2 may > succeed if the transaction that performed then aborts, or it may be > that the updating transactions have committed, in which case we're > going to have to fail. But that decision isn't made by the scan that > sees t3; it happens when the TID reaches the ModifyTable node. So what > zheap ends up doing is finding the right tuple version during the > scan, by making use of the snapshot, and then having to go repeat that > work when it's time to try to perform the update. It would be nice to > avoid this. I believe that this is another consequence of the fact that Postgres versions tuples, not pages. This is not a minor theoretical point. It's very different to what Oracle does. It's almost a necessary consequence of our basic approach to extensibility, because you can have things like index tuples whose values are equal but visibly distinct (e.g., the numeric value '5.0' is equal to but distinct from '5'). It also has a lot to do with how crash recovery works. 
> If we could feed system columns from the scan through to > the update, we could pass along an undo pointer and avoid the extra > overhead. So it seems to me, so far anyway, that there's no very > fundamental problem here, but there is an efficiency issue which we > could address if we had a bit more planner and executor infrastructure > to help us out. FWIW you don't necessarily have to do the EPQ stuff. You could in theory do a statement-level rollback, and repeat. The EPQ mechanism is unique to Postgres. Maybe it doesn't matter, but I don't think that it's essential to follow this in other table AMs. > Now in the long run the vision for zheap was that we'd eventually want > to do in-place updates even when indexed columns have been modified, > and this gets a whole lot trickier, because now there can be multiple > sets of index entries pointing at the same TID which don't agree on > the values of the indexed columns. It's much easier when you have a very simple type system that doesn't allow differences like my "numeric '5.0' vs '5'" example -- a system that is built for this from the ground up. If there are meaningful semantic differences among opclass-equal index tuples, then we can never assume that index tuples will always be locatable after an update affecting indexed columns (if only because we need to preserve the '5.0' and '5' variants in an index on a numeric column). If we could at least be sure that two index tuples that point to the same stable/logical zheap TID (in a world where TIDs were stable identifiers of logical rows) were nevertheless unique, then we'd be able to uniquely identify each index tuple during retail index tuple deletion -- they'd still have distinct key values in the index tuple overall. That assumption isn't workable in Postgres, though. I'm pretty sure that there will be a bunch of tacit assumptions like this that would shake out all over the place. You'd have to actually pursue this design to figure out what they were, but I'm pretty sure many more exist. In any case this one example seems sufficient to make me doubt the whole enterprise. > As old row versions die off, some > of those pointers need to be cleaned out, and others do not. I thought > we might solve this problem by something akin to retail index > deletion: have an update or delete on a zheap tuple go re-find the > associated index entries and mark them for possible cleanup, and then > vacuum can ignore all unmarked tuples. There might be some efficiency > problems with this idea I hadn't considered, based on your remarks > today. But regardless of the wisdom or folly of this approach, the > broader point is that we can't assume that all heap types are going to > have the same maintenance requirements. No, we can't. But we had better have a generalized definition that accounts for what variation is acceptable, and (most importantly) what variation *isn't* acceptable. > I think most of them are going > to have some kind of maintenance operation that need to or at least > can optionally be performed from time to time, but it might be > triggered by completely different criteria than vacuum. New table AMs > might well choose to use 64-bit XIDs, avoiding the need for wraparound > processing altogether. Maybe they have such good opportunistic cleanup > mechanisms that periodic vacuum for bloat isn't even really needed. > Maybe they bloat when updates and deletes commit but not when inserts > and updates abort, because those cases are handled via some other > mechanism. Who knows, really? 
It's hard to predict what > not-yet-written AMs might care about, and even if we knew, it seems > crazy to try to rewrite the way vacuum works to cater to those needs > before we actually have some working AMs to use as a testbed. Nothing is certain, but frankly I just don't believe that anybody is ever going to solve this problem in Postgres. The fundamental assumption that TIDs are not stable identifiers of logical rows (they point to versions) is just too baked into everything. And the downsides of that design can be fixed in a localized way. On the other hand, Jeff and I agree about the parameters of the discussion here. I can see myself doing work inside nbtree to facilitate his work. But that is made a lot less likely by the general lack of agreement about what ought to ultimately be possible. There is a very real cost to indefinitely deferring making a hard choice about what we can rule out for table AMs. It's *not* free. How long should that situation be allowed to continue for? This is not a rhetorical question -- maybe it should be timeboxed in some way. Right now the tableam can in theory do anything. Nobody knows how, but anything is possible! > It strikes me that certain interesting cases might not really need > anything very in-depth here. For example, consider indirect indexes, > where the index references the primary key value rather than the TID. > Well, the indirect index should probably be vacuumed periodically to > prevent bloat, but it doesn't need to be vacuumed to recycle TIDs > because it doesn't contain TIDs. Why is that okay, though? How can you get away with not having version-based TID-like identifiers here? I would be willing to accept an answer like "it's unclear, but it must be possible" if there was no downside. But as I said, there is a downside. > BRIN indexes, BTW, also don't contain > TIDs. BRIN indexes aren't suitable as indirect indexes, though. > Either could, therefore, be optionally vacuumed after vacuum has > done absolutely everything else, even truncate the table, or they > could be vacuumed on a completely separate schedule that doesn't have > anything to do with table vacuuming. I think that the question of how TID-like identifiers work across table AMs is fundamentally distinct from how VACUUM works. I think that we'll probably always have something like VACUUM. That doesn't mean that we cannot ultimately teach VACUUM to run far less often. There is a problem with things being overly coupled inside VACUUM, so our practical experience with VACUUM isn't necessarily a reliable indicator of how much of a problem VACUUM is long term. -- Peter Geoghegan
On Fri, Apr 30, 2021 at 5:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > I strongly suspect that index-organized tables (or indirect indexes, > or anything else that assumes that TID-like identifiers map directly > to logical rows as opposed to physical versions) are going to break > too many assumptions to ever be tractable. Assuming I have that right, > it would advance the discussion if we could all agree on that being a > non-goal for the tableam interface in general. I *emphatically* disagree with the idea of ruling such things out categorically. This is just as naive as the TODO's statement that we do not want "All backends running as threads in a single process". Does anyone really believe that we don't want that any more? I believed it 10 years ago, but not any more. It's costing us very substantially not only in that it makes parallel query more complicated and fragile, but more importantly in that we can't scale up to connection counts that other databases can handle because we use up too many operating system resources. Supporting threading in PostgreSQL isn't a project that someone will pull off over a long weekend and it's not something that has to be done tomorrow, but it's pretty clearly the future. So here. The complexity of getting a table AM that does anything non-trivial working is formidable, and I don't expect it to happen right away. Picking one that is essentially block-based and can use 48-bit TIDs is very likely the right initial target because that's the closest we have now, and there's no sense attacking the hardest variant of the problem first. However, as with the threads-vs-processes example, I strongly suspect that having only one table AM is leaving vast amounts of performance on the table. To say that we're never going to pursue the parts of that space that require a different kind of tuple identifier is to permanently write off tons of ideas that have produced promising results in other systems. Let's not do that. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 30, 2021 at 6:19 PM Peter Geoghegan <pg@bowt.ie> wrote: > A remaining problem is that we must generate a new round of index > tuples for each and every index when only one indexed column is > logically modified by an UPDATE statement. I think that this is much > less of a problem now due to bottom-up index deletion. Sure, it sucks > that we still have to dirty the page at all. But it's nevertheless > true that it all but eliminates version-driven page splits, which are > where almost all of the remaining downside is. It's very reasonable to > now wonder if this particular all-indexes problem is worth solving at > all in light of that. (Modern hardware characteristics also make a > comprehensive fix less valuable in practice.) It's reasonable to wonder. I think it depends on whether the problem is bloat or just general slowness. To the extent that the problem is bloat, bottom-up index deletion will help a lot, but it's not going to help with slowness because, as you say, we still have to dirty the pages. And I am pretty confident that slowness is a very significant part of the problem here. It's pretty common for people migrating from another database system to have, for example, a table with 10 indexes and then repeatedly update a column that is covered by only one of those indexes. Now, with bottom-up index deletion, this should cause a lot less bloat, and that's good. But you still have to update all 10 indexes in the foreground, and that's bad, because the alternative is to find just the one affected index and update it twice -- once to insert the new tuple, and a second time to delete-mark the old tuple. 10 is a lot more than 2, and that's even ignoring the cost of deferred cleanup on the other 9 indexes. So I don't really expect this to get us out of the woods. Somebody whose workload runs five times slower on a pristine data load is quite likely to give up on using PostgreSQL before bloat even enters the picture. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, May 3, 2021 at 7:41 AM Robert Haas <robertmhaas@gmail.com> wrote: > So here. The complexity of getting a table AM that does anything > non-trivial working is formidable, and I don't expect it to happen > right away. Picking one that is essentially block-based and can use > 48-bit TIDs is very likely the right initial target because that's the > closest we have now, and there's no sense attacking the hardest > variant of the problem first. It doesn't have to be block-based -- that's not what Jeff is proposing. It just has to be able to accept the restriction that indexes must have a unique TID-like identifier for each version (not quite a version actually -- whatever the equivalent of a HOT chain is). This is a restriction that Jeff had pretty much planned on working within before starting this thread (I know this because we spoke about it privately). It's quite possible to rule out an index-organized table design without ruling out a column store with logical TID-like identifiers, that aren't block-based. It's fair to wonder if not tightening up the rules for TID-like identifiers is actually helping table AM authors in practice. I think it's actually making things harder. -- Peter Geoghegan
On Mon, May 3, 2021 at 8:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > It's reasonable to wonder. I think it depends on whether the problem > is bloat or just general slowness. To the extent that the problem is > bloat, bottom-index deletion will help a lot, but it's not going to > help with slowness because, as you say, we still have to dirty the > pages. And I am pretty confident that slowness is a very significant > part of the problem here. It's all going to depend on workload of course -- we'll need to wait and see what users still complain about with Postgres 14 to really have some idea. You only freshly dirty those leaf pages that weren't already dirty, and the costs will be much more linear, so it's a complicated question. Here is a more modest statement that might be more convincing: The *relative* importance of making something like HOT more robust to things like long-running xacts has increased now that we have bottom-up index deletion. We could improve things here by adding something like zheap, which allows a HOT chain to mostly live in UNDO, and therefore pretty much become arbitrarily long. This seems plausible because users will accept that UPDATEs that modify one or more indexed columns kinda suck, as long as there is never any truly pathological performance. Whereas users will not easily accept that HOT (or something like it) doesn't quite work well enough to make relation sizes stable when they avoid updating indexed columns. I don't think that even the absence of UPDATEs that logically modify indexes and the absence of long running transactions (that hold up cleanup) is sufficient to make HOT work well enough to keep table sizes stable over time. Minor inefficiencies (e.g. LP_DEAD line pointer bloat) will tend to aggregate over time, leading to heap fragmentation. -- Peter Geoghegan
On Mon, May 3, 2021 at 11:26 AM Peter Geoghegan <pg@bowt.ie> wrote: > It just has to be able to accept the restriction that > indexes must have a unique TID-like identifier for each version (not > quite a version actually -- whatever the equivalent of a HOT chain > is). This is a restriction that Jeff had pretty much planned on > working within before starting this thread (I know this because we > spoke about it privately). Well, I think what I'm saying is that I'm not on board with such a restriction. If you're just saying that it has to be possible to identify rows somehow, I am in full agreement, and I think the universe is on board as well. But if you're saying those identifiers have to be fixed-width and 48 (or even 64) bits, I disagree that we wish to have such a requirement in perpetuity. That'd be like going around to automobile manufacturers in 1925 and asking them to agree that all future cars ever manufactured must have a clutch. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, May 3, 2021 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > But if you're saying those identifiers have to be fixed-width and 48 > (or even 64) bits, I disagree that we wish to have such a requirement > in perpetuity. Once you require that TID-like identifiers must point to particular versions (as opposed to particular logical rows), you also virtually require that the identifiers must always be integer-like (though not necessarily block-based and not necessarily 6 bytes). You've practically ensured that clustered index tables (and indirect indexes) will never be possible by accepting this. Those designs are the only real reason to have truly variable-length TID-like identifiers IMV (as opposed to 2 or perhaps even 3 standard TID widths). You don't accept any of that, though. Fair enough. I predict that avoiding making a hard choice will make Jeff's work here a lot harder, though. -- Peter Geoghegan
On Mon, May 3, 2021 at 1:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > You don't accept any of that, though. Fair enough. I predict that > avoiding making a hard choice will make Jeff's work here a lot harder, > though. I don't really think so, or at least I don't see a reason why it should. As things stand today, I don't think it's possible for a table AM author to make any other choice than to assume that their TIDs have to look and work like heap TIDs; that is, there had better be a block number portion and an item number portion, and the item number had better be smaller than MaxOffsetNumber, and if you want bitmap scans to run reasonably quickly, the block number had also better correspond to physical locality to some degree. It's not clear to me how exactly someone would go about fixing all of that, but I think it would be great if they did. Even if that person wanted to assume for purposes of their own patch that fixed-width, integer-like TIDs are the only thing we care about, that would be fine with me. Getting to a point where the available 48 bits can be used in whatever way the table AM author wants is clearly better than what we have now. Now I'm personally of the opinion that we shouldn't be content to stop there, but so what? I'm not insisting that Jeff or anyone else has to work on this problem, or that they have to fix more of it rather than less. I hope that nobody's going to try to back us into a corner by making design decisions that deliberately complicate the possibility of future improvements in that area, and that's about it. I don't really understand why you think that's unreasonable, or even problematic. I can't see any way in which the assumption that we will NEVER want to further generalize the TID concept simplifies anything anyone wants to do today. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 3 May 2021 at 19:00, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, May 3, 2021 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > > But if you're saying those identifiers have to be fixed-width and 48 > > (or even 64) bits, I disagree that we wish to have such a requirement > > in perpetuity. > > Once you require that TID-like identifiers must point to particular > versions (as opposed to particular logical rows), you also virtually > require that the identifiers must always be integer-like (though not > necessarily block-based and not necessarily 6 bytes). You've > practically ensured that clustered index tables (and indirect indexes) > will never be possible by accepting this. For IoT, as far as I know, one of the constraints is that there exists some unique constraint on the table, which also defines the ordering. Assuming that that is the case, we can use <unique key> + <inserting transaction id> to identify tuple versions. With regards, Matthias van de Meent.
On Mon, May 3, 2021 at 10:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't really think so, or at least I don't see a reason why it > should. As things stand today, I don't think it's possible for a table > AM author to make any other choice than to assume that their TIDs have > to look and work like heap TIDs; that is, there had better be a block > number portion and an item number portion, and the item number had > better be smaller than MaxOffsetNumber, and if you want bitmap scans > to run reasonably quickly, the block number had also better correspond > to physical locality to some degree. It's not clear to me how exactly > someone would go about fixing all of that, but I think it would be > great if they did. Even if that person wanted to assume for purposes > of their own patch that fixed-width, integer-like TIDs are the only > thing we care about, that would be fine with me. Getting to a point > where the available 48 bits can be used in whatever way the table AM > author wants is clearly better than what we have now. I don't think it's much good to just do that. You probably need a full 64 bits for something like a column store. But that's all you need. > Now I'm personally of the opinion that we shouldn't be content to stop > there, but so what? I'm not insisting that Jeff or anyone else has to > work on this problem, or that they have to fix more of it rather than > less. I hope that nobody's going to try to back us into a corner by > making design decisions that deliberately complicate the possibility > of future improvements in that area, and that's about it. I don't > really understand why you think that's unreasonable, or even > problematic. I can't see any way in which the assumption that we > will NEVER want to further generalize the TID concept simplifies > anything anyone wants to do today. It creates ambiguity of the kind that deters related improvements. I for one am not comfortable with (say) working on generalizing TID to the extent required to facilitate Jeff's work if that obligates me to make some legalistic and wholly insincere statement about future improvements to the definition of TID still being quite possible (to facilitate indirect indexes, or whatever). The truth is that I cannot possibly know if facilitating Jeff's work in the short term blocks off other things in the long term -- because I don't actually have a clue how these other things could ever really be implemented sensibly in any case. -- Peter Geoghegan
On Mon, 2021-05-03 at 09:59 -0700, Peter Geoghegan wrote: > You don't accept any of that, though. Fair enough. I predict that > avoiding making a hard choice will make Jeff's work here a lot > harder, > though. For the purposes of this discussion, what's making my life difficult is that we don't have a good definition for TID, leaving me with two options: 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a bunch of address space; or 2. risk the mapping between TID and row number could break at any time And compounding that, it seems that there's a bug in GIN that doesn't honor MaxOffsetNumber, so actually neither of the rules above work either. Instead, I need to use 2047 as the max offset number, which has no real basis in the postgres design, but I'd be stuck with it for a long time. What I'm looking for is: * A declaration of what the actual maximum valid offset number is, and that it will be stable enough to use for table AMs for now. (This maximum valid offset number may or may not be called MaxOffsetNumber, and may or may not be tied to the maximum number of items that fit on a page.) * A confirmation that this GIN behavior is a bug that should be fixed, now that there are table AMs in existence that need it fixed. Even if we fix this in v15, we still need some guidance for what table AMs should do in earlier versions. If we change the way tuple IDs are represented or the table AM in v15 or beyond, that may require a REINDEX for indexes on some table AMs. As long as we have some robust way to check that a REINDEX is necessary, that's fine with me. Regards, Jeff Davis
On Mon, May 3, 2021 at 10:22 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > For IoT, as far as I know, one of the constraints is that there exists > some unique constraint on the table, which also defines the ordering. > Assuming that that is the case, we can use <unique key> + <inserting > transaction id> to identify tuple versions. Perhaps that's true in theory, but the resulting design seems likely to be useless in the end. In any case I believe that systems that generally use a heap but give you the choice of using an IoT (I'm really thinking of Oracle) tend to not have many users that actually avail of IoTs. On modern flash storage the trade-off made by an IoT or clustered index design seems like the wrong one on average. You're saving about 1 I/O on average with a PK lookup, which just isn't that much of an upside compared to the many downsides. -- Peter Geoghegan
On Mon, 2021-05-03 at 13:22 -0400, Robert Haas wrote: > to look and work like heap TIDs; that is, there had better be a block > number portion and an item number portion, Right (at least for now). > and the item number had > better be smaller than MaxOffsetNumber, That's not clear to me at all, and is the whole reason I began this thread. a. You say "smaller than MaxOffsetNumber", but that's a little weird. If an offset can't be MaxOffsetNumber, it's not really the maximum, is it? b. If you actually meant "less than or equal to MaxOffsetNumber", that will fail with the GIN posting list issue raised in my first email. Do you agree that's a bug? c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1? Right now, MaxOffsetNumber is poorly named, because it actually represents a number slightly higher than the maximum number of items that can fit on a page. That essentially wastes 5 bits of address space for no obvious reason. > and if you want bitmap scans > to run reasonably quickly, the block number had also better > correspond > to physical locality to some degree. Right (at least for now). Regards, Jeff Davis
On Mon, 2021-05-03 at 10:38 -0700, Peter Geoghegan wrote: > I don't think it's much good to just do that. You probably need a > full > 64-bits for something like a column store. But that's all you need. I would definitely like that for citus columnar, and it would definitely make it easier to manage the address space, but I won't demand it today. 48 bits is a workable tuple address space for many purposes, especially when you factor in logical partitioning. I will be dealing with gaps though, so wasting 5 bits of address space (2^16 / MaxOffsetNumber = 32) to bring it down to 43 bits is not great. Regards, Jeff Davis
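For concreteness, the address-space arithmetic can be checked with a few lines of plain C. This is not PostgreSQL code; the 2048 figure assumes the default 8kB BLCKSZ, and the special offset values mentioned earlier in the thread are ignored:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t usable_blocks = 0xFFFFFFFF;	/* blocks 0..0xFFFFFFFE; InvalidBlockNumber excluded */
	const uint64_t full_offsets = 65535;		/* offsets 1..65535, if nothing else limits us */
	const uint64_t capped_offsets = 2048;		/* capped at MaxOffsetNumber for 8kB BLCKSZ */

	printf("full offset range:   %llu addressable tuples\n",
		   (unsigned long long) (usable_blocks * full_offsets));
	printf("MaxOffsetNumber cap: %llu addressable tuples (65536 / 2048 = 32, i.e. ~5 bits lost)\n",
		   (unsigned long long) (usable_blocks * capped_offsets));
	return 0;
}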
On Mon, May 3, 2021 at 10:57 AM Jeff Davis <pgsql@j-davis.com> wrote: > For the purposes of this discussion, what's making my life difficult is > that we don't have a good definition for TID, leaving me with two > options: > > 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a > bunch of address space; or tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, which is much lower than MaxOffsetNumber (it's almost 10x lower). I wonder what that means for your design. > 2. risk the mapping between TID and row number could break at any > time Though this clearly is the immediate problem for you, I think that the real problem is that the table AM kind of tacitly assumes that there is a universality to item pointer TIDs -- which is obviously not true. It might be useful for you to know what assumptions index AMs can make about how TIDs work in general, but I think that you really need an index-AM level infrastructure that advertises the capabilities of each index AM with respect to handling each possible variation (I suppose you have heapam, 6 byte uint, and maybe 8 byte uint). The easiest reasonable short term design for you is probably to find a way to make 6 byte TIDs into 48-bit unsigned integers (perhaps only conceptually), at least in contexts where the columnar table AM is used. You'll still need the index AM for that. This at least makes 64-bit TID-like identifiers a relatively simple conceptual shift. -- Peter Geoghegan
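A conceptual sketch of that 48-bit treatment, using the ordinary ItemPointer accessors. The helper names are invented for illustration and are not part of any existing API; offsets start at 1, so the value 0 is never produced, and InvalidBlockNumber (0xFFFFFFFF) is assumed to be excluded:

#include "postgres.h"
#include "storage/itemptr.h"

/* Pack a 6-byte TID into a 48-bit unsigned integer value. */
static inline uint64
tid_to_uint48(ItemPointer tid)
{
	return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
		(uint64) ItemPointerGetOffsetNumber(tid);
}

/* Unpack a 48-bit value back into an ItemPointer. */
static inline void
uint48_to_tid(uint64 val, ItemPointer tid)
{
	ItemPointerSet(tid, (BlockNumber) (val >> 16), (OffsetNumber) (val & 0xFFFF));
}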
On Mon, 3 May 2021 at 20:43, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, May 3, 2021 at 10:57 AM Jeff Davis <pgsql@j-davis.com> wrote: > > For the purposes of this discussion, what's making my life difficult is > > that we don't have a good definition for TID, leaving me with two > > options: > > > > 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a > > bunch of address space; or > > tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, > which is much lower than MaxOffsetNumber (it's almost 10x lower). I > wonder what that means for your design. One could relatively easily disable bitmap scans on the table AM by not installing the relevant bitmap support functions on the registered TableAM structure, and thus not touch that problem. Some indexes will then never be accessed due to the bitmap scan requirement of their IndexAM (gin, brin, bloom, to name a few), and as such won't make sense to create on that table, but that's about it I think. We might want to add some safeguards that bitmapscan-only indexams aren't used on tableams that don't support it, but I believe that's a nice-to-have and not critical, on a similar level to the deduplication of constraint indexes. With regards, Matthias van de Meent
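As a concrete illustration of the optional callbacks Matthias refers to, here is a hypothetical helper (not an existing function) that checks whether a relation's table AM installed the bitmap-scan callbacks, using the field names from tableam.h:

#include "postgres.h"
#include "access/tableam.h"
#include "utils/rel.h"

/*
 * Does this relation's table AM install the optional bitmap-scan callbacks?
 * Per tableam.h, the two callbacks are expected to be provided together or
 * not at all.
 */
static bool
table_am_supports_bitmap_scans(Relation rel)
{
	const TableAmRoutine *tableam = rel->rd_tableam;

	return tableam->scan_bitmap_next_block != NULL &&
		tableam->scan_bitmap_next_tuple != NULL;
}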
On Mon, May 3, 2021 at 12:06 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > One could relatively easily disable bitmap scans on the table AM by > not installing the relevant bitmap support functions on the registered > TableAM structure, and thus not touch that problem. I have no idea how much it'll hurt things if the column store table AM supports no analogue of bitmap scans. > Some indexes will > then never be accessed due to the bitmap scan requirement of their > IndexAM (gin, brin, bloom, to name a few), and as such won't make > sense to create on that table, but that's about it I think. Right. More formally: if this restriction is accepted by a table AM (say the column store table AM), then any index AM with amgettuple set to NULL cannot ever be used (it should probably be treated as an error condition at CREATE INDEX time). If this really is the best path forward (again, no idea if that's true) then that would conveniently make it pretty easy to solve the GIN posting list issue raised by Jeff. It just wouldn't matter -- GIN indexes cannot be used with the column store anyway. -- Peter Geoghegan
On Fri, 2021-04-30 at 10:55 -0700, Jeff Davis wrote: > On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote: > > ISTM that would be up to the index AM. We'd need some interlocks > > on > > which index AMs could be used with which table AMs in any case, I > > think. > > I'm not sure why? It seems like we should be able to come up with > something that's generic enough. Another point: the idea of supporting only some kinds of indexes doesn't mix well with partitioning. If you declare an index on the parent, we should do something reasonable if one partition's table AM doesn't support that index AM. Regards, Jeff Davis
On Mon, May 3, 2021 at 2:03 PM Jeff Davis <pgsql@j-davis.com> wrote: > Another point: the idea of supporting only some kinds of indexes > doesn't mix well with partitioning. If you declare an index on the > parent, we should do something reasonable if one partition's table AM > doesn't support that index AM. Sure, but it either makes sense for the columnar table AM to support bitmap scans (or some analogous type of scan that works only slightly differently) or it doesn't. It's not at all clear which it is right now. If it makes sense then it will of course be necessary to describe what "bitmap scan" actually means with the columnar storage table AM (plus you'll still need to make some in-core changes to places like tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's going to be a bit annoying in the partitioning scenario you describe, but some things are bound to be *inherently* impossible, so it can't be helped. It seems senseless to *require* table AMs to support something like a bitmap scan. I don't think it's a coincidence that GIN is the index AM that looks like it presents at least 2 problems for the columnar table AM. To me this suggests that this will need a much higher level discussion. -- Peter Geoghegan
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote: > Sure, but it either makes sense for the columnar table AM to support > bitmap scans (or some analogous type of scan that works only slightly > differently) or it doesn't. It's not at all clear which it is right > now. It makes sense for my columnar table AM -- there's TID locality. > If it makes sense then it will of course be necessary to describe > what > "bitmap scan" actually means with the columnar storage table AM (plus > you'll still need to make some in-core changes to places like > tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's > going to be a bit annoying in the partitioning scenario you describe, > but some things are bound to be *inherently* impossible, so it can't > be > helped. I don't see why in-core changes are a strict requirement. It doesn't make too much difference if a lossy TID doesn't correspond exactly to the columnar layout -- it should be fine as long as there's locality, right? > It seems senseless to *require* table AMs to support something like a > bitmap scan. I am not yet convinced that it's "senseless", but it is optional and there's probably a reason that it's not required. We still need to address the fact that two features have had a minor collision: indexes on a partitioned table and table AMs that don't necessarily support all index types. It's not good to just throw an error, because we could be forcing the user to manually manage the indexes on hundreds of partitions just because some tables have a different AM and it doesn't support the index type. We probably want to do something about that, but as far as I can tell, it's not a problem for columnar right now. > I don't think it's a coincidence that GIN is the index AM > that looks like it presents at least 2 problems for the columnar > table > AM. To me this suggests that this will need a much higher level > discussion. One problem is that ginpostinglist.c restricts the use of offset numbers higher than MaxOffsetNumber - 1. At best, that's a confusing and unnecessary off-by-one error that we happen to be stuck with because it affects the on-disk format. Now that I'm past that particular confusion, I can live with a workaround until we do something better. What is the other problem with GIN? Regards, Jeff Davis
On Mon, May 3, 2021 at 5:15 PM Jeff Davis <pgsql@j-davis.com> wrote: > I don't see why in-core changes are a strict requirement. It doesn't > make too much difference if a lossy TID doesn't correspond exactly to > the columnar layout -- it should be fine as long as there's locality, > right? But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with default BLCKSZ. I doubt that that restriction is something that you can afford to live with, even just for the time being. > > It seems senseless to *require* table AMs to support something like a > > bitmap scan. > > I am not yet convinced that it's "senseless", but it is optional and > there's probably a reason that it's not required. I mean it's senseless to require it in the general case. > We still need to address the fact that two features have had a minor > collision: indexes on a partitioned table and table AMs that don't > necessarily support all index types. It's not good to just throw an > error, because we could be forcing the user to manually manage the > indexes on hundreds of partitions just because some tables have a > different AM and it doesn't support the index type. I don't see why that's necessarily a problem. Why, in general, should every table AM be able to support every index AM? I find it puzzling that nobody can find one single thing that the table AM interface *can't* do. What are the chances that the abstraction really is perfect? > > I don't think it's a coincidence that GIN is the index AM > > that looks like it presents at least 2 problems for the columnar > > table > > AM. To me this suggests that this will need a much higher level > > discussion. > > One problem is that ginpostinglist.c restricts the use of offset > numbers higher than MaxOffsetNumber - 1. At best, that's a confusing > and unnecessary off-by-one error that we happen to be stuck with > because it affects the on-disk format. Now that I'm past that > particular confusion, I can live with a workaround until we do > something better. > > What is the other problem with GIN? I just meant the tidbitmap.c stuff, and so on. There is really one big problem: GIN leverages the fact that bitmap scans are all that it supports in many different ways. The reality is that it was designed to work with heapam -- that's how it evolved. It seems rather unlikely that problems are confined to this ginpostinglist.c representational issue -- which is very surface-level. The only way to figure it out is to try to make it work and see what happens, though, so perhaps it isn't worth discussing any further until that happens. -- Peter Geoghegan
On Mon, 2021-05-03 at 18:12 -0700, Peter Geoghegan wrote: > But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its > MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with > default BLCKSZ. I doubt that that restriction is something that you > can afford to live with, even just for the time being. Oh, you're right. I missed that MaxHeapTuplesPerPage was an order of magnitude smaller. > I don't see why that's necessarily a problem. Why, in general, should > every table AM be able to support every index AM? I didn't propose that every table AM needs to support every index type, just that we should do something or at least document something. It's pretty frustrating to have to fall back to manually managing the indexes for dozens or hundreds of partitions when you make use of multiple table AMs. We might be conflating support for index AMs with support for features like bitmap scans. If a certain kind of index fails at CREATE INDEX time, that's painful for the partitioning case. But here it's more like the CREATE INDEX would succeed but it would just never be used, which is a different kind of frustrating. Whatever we do or don't do, we should try to avoid surprises. I expect table AMs to be used heavily with partitioning. Regards, Jeff Davis
Hi, On 2021-04-30 11:51:07 -0700, Peter Geoghegan wrote: > I think that it's reasonable to impose some cost on index AMs here, > but that needs to be bounded sensibly and unambiguously. For example, > it would probably be okay if you had either 6 byte or 8 byte TIDs, but > no other variations. You could require index AMs (the subset of index > AMs that are ever able to store 8 byte TIDs) to directly encode which > width they're dealing with at the level of each IndexTuple. That would > create some problems for nbtree deduplication, especially in boundary > cases, but ISTM that you can manage the complexity by sensibly > restricting how the TIDs work across the board. > For example, the TIDs should always work like unsigned integers -- the > table AM must be willing to work with that restriction. Isn't that more a question of the encoding than the concrete representation? > You'd then have posting lists tuples in nbtree whose TIDs were all > either 6 bytes or 8 bytes wide, with a mix of each possible (though > not particularly likely) on the same leaf page. Say when you have a > table that exceeds the current MaxBlockNumber restrictions. It would > be relatively straightforward for nbtree deduplication to simply > refuse to mix 6 byte and 8 byte datums together to avoid complexity in > boundary cases. The deduplication pass logic has the flexibility that > this requires already. Which nbtree cases do you think would have an easier time supporting switching between 6 or 8 byte tids than supporting fully variable width tids? Given that IndexTupleData already is variable-width, it's not clear to me why supporting two distinct sizes would be harder than a fully variable size? I assume it's things like BTDedupState->htids? > > What's wrong with varlena headers? It would end up being a 1-byte > > header in practically every case, and no variable-width representation > > can do without a length word of some sort. I'm not saying varlena is > > as efficient as some new design could hypothetically be, but it > > doesn't seem like it'd be a big enough problem to stress about. If you > > used a variable-width representation for integers, you might actually > > save bytes in a lot of cases. An awful lot of the TIDs people store in > > practice probably contain several zero bytes, and if we make them > > wider, that's going to be even more true. > > Maybe all of this is true, and maybe it works out to be the best path > forward in the long term, all things considered. But whether or not > that's true is crucially dependent on what real practical table AMs > (of which there will only ever be a tiny number) actually need to do. > Why should we assume that the table AM cannot accept some > restrictions? What good does it do to legalistically define the > problem as a problem for index AMs to solve? I don't think anybody is arguing that AMs cannot accept any restrictions? I do think it's pretty clear that it's not entirely obvious what the concrete set of proper restrictions would be, where we won't end up needing to re-evaluate limits in a few years. If you add to that the fact that variable-width tids will often end up considerably smaller than our current tids, it's not obvious why we should use bitspace somewhere to indicate an 8 byte tid instead of a variable-width tid? Greetings, Andres Freund
On Mon, May 3, 2021 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote: > That's not clear to me at all, and is the whole reason I began this > thread. > > a. You say "smaller than MaxOffsetNumber", but that's a little weird. > If an offset can't be MaxOffsetNumber, it's not really the maximum, is > it? I wasn't trying to be that precise. I see that OffsetNumberIsValid() returns true if the offset is <= MaxOffsetNumber, so therefore I agree that using exactly MaxOffsetNumber ought to work. > b. If you actually meant "less than or equal to MaxOffsetNumber", > that will fail with the GIN posting list issue raised in my first > email. Do you agree that's a bug? Given the above, yes. > c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1? > Right now, MaxOffsetNumber is poorly named, because it actually > represents the a number slightly higher than the maximum number of > items that can fit on a page. That essentially wastes 5 bits of address > space for no obvious reason. Because of stuff like this: [rhaas EDBAS]$ git grep -F '[MaxOffsetNumber' src/backend/access/gist/gistvacuum.c: OffsetNumber todelete[MaxOffsetNumber]; src/backend/access/gist/gistvacuum.c: OffsetNumber todelete[MaxOffsetNumber]; src/backend/access/gist/gistvacuum.c: BlockNumber leafs_to_delete[MaxOffsetNumber]; src/backend/access/hash/hash.c: OffsetNumber deletable[MaxOffsetNumber]; src/backend/access/hash/hashinsert.c: OffsetNumber deletable[MaxOffsetNumber]; src/backend/access/hash/hashovfl.c: OffsetNumber deletable[MaxOffsetNumber]; Maybe changing those places to use dynamic allocation wouldn't hurt anything in terms of performance, but I'm not sure. Making them 32 times larger categorically does not seem like a good idea. There might be other dependencies on this value in other parts of the code; I'm not sure. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote: > It seems senseless to *require* table AMs to support something like a > bitmap scan. I thought about this some more, and this framing is backwards. ItemPointers are fundamental to the table AM API: they are passed in to required methods, and expected to be returned[1]. Bitmap scans are optional, but that should be determined by whether the author wants to implement the bitmap scan methods of their table AM. The fine details of ItemPointer representation should not be making the decision for them. We still need to answer the core question that started this thread: what the heck is an ItemPointer, anyway? After looking at itemptr.h, off.h, ginpostinglist.c and tidbitmap.c, it seems that an ItemPointer is a block number from [0, 0xFFFFFFFe]; and an offset number from [1, MaxHeapTuplesPerPage] which is by default [1, 291]. Attached is a patch that clarifies what I've found so far and gives clear guidance to table AM authors. Before I commit this I'll make sure that following the guidance actually works for the columnar AM. Regards, Jeff Davis [1] Even for the current version of columnar, which doesn't support indexes or updates, we implemented a hack to provide dummy TIDs because some places expect them (see analyze.c:compare_rows()).
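One minimal sketch of a mapping that stays within the definition Jeff gives above -- block number in [0, 0xFFFFFFFE], offset number in [1, MaxHeapTuplesPerPage]. The helper names are invented for illustration and are not taken from the patch mentioned above; a real table AM may well choose a different scheme:

#include "postgres.h"
#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "storage/itemptr.h"

/* Map a 0-based row number to a TID whose offset stays within the heap limit. */
static void
rownumber_to_tid(uint64 rownum, ItemPointer tid)
{
	BlockNumber blkno = (BlockNumber) (rownum / MaxHeapTuplesPerPage);
	OffsetNumber offnum = (OffsetNumber) ((rownum % MaxHeapTuplesPerPage) + 1);

	Assert(blkno != InvalidBlockNumber);
	ItemPointerSet(tid, blkno, offnum);
}

/* Inverse mapping: recover the 0-based row number from a TID. */
static uint64
tid_to_rownumber(ItemPointer tid)
{
	return (uint64) ItemPointerGetBlockNumber(tid) * MaxHeapTuplesPerPage +
		(uint64) (ItemPointerGetOffsetNumber(tid) - 1);
}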
On Tue, 2021-05-04 at 12:56 -0400, Robert Haas wrote: > b. If you actually meant "less than or equal to MaxOffsetNumber", > > that will fail with the GIN posting list issue raised in my first > > email. Do you agree that's a bug? > > Given the above, yes. If we just subtracted one, it would fit in 11 bits, and that would be fine because zero is invalid anyway. Unfortunately, it's on disk, so I think we are stuck with it. Regardless, the other limitation in tidbitmap.c is more strict anyway (MaxHeapTuplesPerPage=291). > > Because of stuff like this: > > [rhaas EDBAS]$ git grep -F '[MaxOffsetNumber' > src/backend/access/gist/gistvacuum.c: OffsetNumber > todelete[MaxOffsetNumber]; > src/backend/access/gist/gistvacuum.c: OffsetNumber > todelete[MaxOffsetNumber]; > src/backend/access/gist/gistvacuum.c: BlockNumber > leafs_to_delete[MaxOffsetNumber]; > src/backend/access/hash/hash.c: OffsetNumber > deletable[MaxOffsetNumber]; > src/backend/access/hash/hashinsert.c: OffsetNumber > deletable[MaxOffsetNumber]; > src/backend/access/hash/hashovfl.c: OffsetNumber > deletable[MaxOffsetNumber]; I don't think those are problems because they represent items on an *index* page, not ItemPointers coming from a table. Regards, Jeff Davis
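For reference, here is roughly the packing that produces the 2047 ceiling under discussion -- a paraphrase of ginpostinglist.c's itemptr_to_uint64(), with an invented constant name; see the real source for the authoritative version. The offset number is squeezed into 11 bits next to the block number, so offsets above 2^11 - 1 = 2047 cannot be represented:

#include "postgres.h"
#include "storage/itemptr.h"

#define POSTING_OFFSET_BITS 11		/* assumed to match the GIN limit */

static uint64
tid_to_posting_value(ItemPointer iptr)
{
	uint64		val;

	/* Offsets of 2048 and above do not fit in 11 bits. */
	Assert(ItemPointerGetOffsetNumber(iptr) < (1 << POSTING_OFFSET_BITS));

	val = ItemPointerGetBlockNumber(iptr);
	val <<= POSTING_OFFSET_BITS;
	val |= ItemPointerGetOffsetNumber(iptr);
	return val;
}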
On Tue, May 4, 2021 at 11:52 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote: > > It seems senseless to *require* table AMs to support something like a > > bitmap scan. > > I thought about this some more, and this framing is backwards. > ItemPointers are fundamental to the table AM API: they are passed in to > required methods, and expected to be returned[1]. I prefer my framing, but okay, let's go with yours. What difference does it make? The fact that we're starting with the table AM API doesn't change the fundamental fact that quite a few implementation details that are local to code like the GIN AM and tidbitmap.c were (rightly or wrongly) simply built with heapam in mind. The fact that that's true is hardly surprising, and hardly argues against the idea of having a table AM to begin with. There is no getting around the need to talk about the first principles here, and to talk about the specific implications for your particular table AM (perhaps others too). Abstractions are only useful when they serve concrete implementations. Of course they should be as general and abstract as possible -- but no more. > Bitmap scans are optional, but that should be determined by whether the > author wants to implement the bitmap scan methods of their table AM. > The fine details of ItemPointer representation should not be making the > decision for them. A distinction without a difference. If bitmap scans are optional and some index AMs are 100% built from the ground up to work only with bitmap scans, then those index AMs are effectively optional (or optional to the extent that bitmap scans themselves are optional). I have absolutely no idea how it would be possible to make GIN work without having index scans. It would be so different that it wouldn't be GIN anymore. I think maybe it is possible for GIN to work with your column store table AM in particular. Why aren't we talking about that concrete issue, or something like that? We're talking about this abstraction as if it must already be perfect, and therefore the standard by which every other thing needs to be measured. But why? > We still need to answer the core question that started this thread: > what the heck is an ItemPointer, anyway? > > After looking at itemptr.h, off.h, ginpostinglist.c and tidbitmap.c, it > seems that an ItemPointer is a block number from [0, 0xFFFFFFFe]; and > an offset number from [1, MaxHeapTuplesPerPage] which is by default [1, > 291]. > > Attached is a patch that clarifies what I've found so far and gives > clear guidance to table AM authors. Before I commit this I'll make sure > that following the guidance actually works for the columnar AM. I don't get what the point of this patch is. Obviously all of the particulars here are just accidents of history that we ought to change sooner or later anyway. I don't have any objection to writing them all down someplace official. But what difference does it make if there is no underlying *general* set of principles behind any of it? This definition of a TID can break at any time because it just isn't useful or general. This is self-evident -- your definition includes MaxHeapTuplesPerPage! How could that possibly be anything other than an accident whose details are completely arbitrary and therefore subject to change at any time? This is not necessarily a big deal! We can fix it by reconciling things in a pragmatic, bottom-up way. That's what I expected would happen all along. 
The table AM is not the Ark of the Covenant (just like tidbitmap.c, or anything else). -- Peter Geoghegan
On Mon, May 3, 2021 at 10:01 PM Andres Freund <andres@anarazel.de> wrote: > > For example, the TIDs should always work like unsigned integers -- the > > table AM must be willing to work with that restriction. > > Isn't that more a question of the encoding than the concrete representation? I don't think so, no. How does B-Tree deduplication work without something like that? The fact of the matter is that things are very tightly coupled in all kinds of ways. I'm all for decoupling them to the extent required to facilitate a new and useful table AM. But I am unlikely to commit to months of work based on abstract arguments and future work. I think that you'll find that I'm not the only one that sees it that way. > > You'd then have posting lists tuples in nbtree whose TIDs were all > > either 6 bytes or 8 bytes wide, with a mix of each possible (though > > not particularly likely) on the same leaf page. Say when you have a > > table that exceeds the current MaxBlockNumber restrictions. It would > > be relatively straightforward for nbtree deduplication to simply > > refuse to mix 6 byte and 8 byte datums together to avoid complexity in > > boundary cases. The deduplication pass logic has the flexibility that > > this requires already. > > Which nbtree cases do you think would have an easier time supporting > switching between 6 or 8 byte tids than supporting fully variable width > tids? Given that IndexTupleData already is variable-width, it's not > clear to me why supporting two distinct sizes would be harder than a > fully variable size? I assume it's things like BTDedupState->htids? Stuff like that, yeah. The space utilization stuff inside nbtsplitloc.c and nbtdedup.c pretty much rests on the assumption that TIDs are fixed width. Obviously there are some ways in which that could be revised if there was a really good reason to do so -- like an actual concrete reason with some clear basis in reality. You have no obligation to make me happy, but FYI I find arguments like "but why wouldn't you just allow arbitrary-width TIDs?" to be deeply unconvincing. Do you really expect me to do a huge amount of work and risk a lot of new bugs, just to facilitate something that may or may not ever happen? Would you do that if you were in my position? > I don't think anybody is arguing that AMs cannot accept any restrictions? I do > think it's pretty clear that it's not entirely obvious what the concrete set > of proper restrictions would be, where we won't end up needing to re-evaluate > limits in a few years are. I'm absolutely fine with the fact that the table AM has these issues -- I would expect it. I would like to help! I just find these wildly abstract discussions to be close to a total waste of time. The idea that we should let a thousand table AM flowers bloom and then review what to do seems divorced from reality. Even if the table AM becomes wildly successful there will still only have been maybe 2 - 4 table AMs that ever really had a chance. Supposing that we have no idea what they could possibly look like just yet is just navel gazing. > If you add to that the fact that variable-width tids will often end up > considerably smaller than our current tids, it's not obvious why we should use > bitspace somewhere to indicate an 8 byte tid instead of a a variable-width > tid? It's not really the space overhead. It's the considerable complexity that it would add. -- Peter Geoghegan
Hi, On 2021-05-04 14:13:36 -0700, Peter Geoghegan wrote: > On Mon, May 3, 2021 at 10:01 PM Andres Freund <andres@anarazel.de> wrote: > > > For example, the TIDs should always work like unsigned integers -- the > > > table AM must be willing to work with that restriction. > > > > Isn't that more a question of the encoding than the concrete representation? > > I don't think so, no. How does B-Tree deduplication work without > something like that? The fact of the matter is that things are very > tightly coupled in all kinds of ways. What does the deduplication actually require from tids? Isn't it just that you need to be able to compare tids? > > > You'd then have posting lists tuples in nbtree whose TIDs were all > > > either 6 bytes or 8 bytes wide, with a mix of each possible (though > > > not particularly likely) on the same leaf page. Say when you have a > > > table that exceeds the current MaxBlockNumber restrictions. It would > > > be relatively straightforward for nbtree deduplication to simply > > > refuse to mix 6 byte and 8 byte datums together to avoid complexity in > > > boundary cases. The deduplication pass logic has the flexibility that > > > this requires already. > > > > Which nbtree cases do you think would have an easier time supporting > > switching between 6 or 8 byte tids than supporting fully variable width > > tids? Given that IndexTupleData already is variable-width, it's not > > clear to me why supporting two distinct sizes would be harder than a > > fully variable size? I assume it's things like BTDedupState->htids? > > Stuff like that, yeah. The space utilization stuff inside > nbtsplitloc.c and nbtdedup.c pretty much rests on the assumption that > TIDs are fixed width. Hm. It doesn't seem like that'd be all that hard to adjust / that it'd be meaningfully easier to support only one other type of tid width. > Obviously there are some ways in which that could be revised if there > was a really good reason to do so -- like an actual concrete reason > with some clear basis in reality. The example of indirect indexes has been brought up repeatedly - you just didn't respond to it? > You have no obligation to make me happy, but FYI I find arguments like > "but why wouldn't you just allow arbitrary-width TIDs?" to be deeply > unconvincing. Do you really expect me to do a huge amount of work and > risk a lot of new bugs, just to facilitate something that may or may > not ever happen? Would you do that if you were in my position? So far nobody has expressed any expectation of you doing specific work in this thread as far as I can see? I certainly didn't intend to. I think it's perfectly normal to discuss tradeoffs and disagree about them? Greetings, Andres Freund
On Tue, 2021-05-04 at 13:51 -0700, Peter Geoghegan wrote: > I think maybe it is possible for GIN to work with your column store > table AM in particular. Why aren't we talking about that concrete > issue, or something like that? Happy to. At this point I'd rather obey the constraint that the offset number falls in the range [1, MaxHeapTuplesPerPage] so that columnar will have bitmap index scans and GIN. If you see a way to work around this limitation and still have GIN and bitmap index support, so much the better. The cost of obeying this limitation is that, in a workload involving lots of small transactions, columnar might run out of TID space and force a VACUUM FULL. In that case, VACUUM FULL is probably a good idea anyway (to coalesce the tuples for better compression), but forcing it is obviously not ideal. The reason columnar will run out of TID space more quickly for small operations is because small inserts might reserve more TIDs than they actually use, leaving gaps; and small updates/deletes will fragment the TID space. The benefit of obeying this limitation is that I expect that bitmap index scans will work well with columnar because they avoid random access. And it seems like a nice benefit if we can support the full range of index AMs for columnar. > I don't get what the point of this patch is. Obviously all of the > particulars here are just accidents of history that we ought to > change > sooner or later anyway. The point is if "sooner" turns into "later" then we at least have some guidance for table AM authors in the interim. But if nobody else thinks that's useful, then so be it. Regards, Jeff Davis
On Tue, May 4, 2021 at 5:40 PM Andres Freund <andres@anarazel.de> wrote: > What does the deduplication actually require from tids? Isn't it just > that you need to be able to compare tids? It's hard to know for sure what is essential to the design, and what can be discarded. Though I can say for sure that it depends on TIDs being cheap to compare. > Hm. It doesn't seem like that'd be all that hard to adjust / that > it'd be meaningfully easier to support only one other type of tid width. It depends on equi-sized TIDs within a posting list, too -- see "Posting list splits" from the nbtree README for some idea of what I mean. The reason that nbtree deduplication could be enabled by default (the reason why it has very little downside) is because there are virtually no special cases, and because the WAL overhead for posting list splits is so low (it's only a tiny bit higher than a simple btinsert()). Anyway, deduplication could probably still work in about the same way if there were some sensible limits on how generalized TIDs could work. > > Obviously there are some ways in which that could be revised if there > > was a really good reason to do so -- like an actual concrete reason > > with some clear basis in reality. > > The example of indirect indexes has been brought up repeatedly - you > just didn't respond to it? I did respond to it -- in detail. I don't see how it will ever be possible to make indirect indexes work -- at least within anything like the current framework. And even if I was wrong there, it still wouldn't mean that the project was particularly worth pursuing. Whether or not you find my arguments about indirect indexes convincing is ultimately beside the point. The fact is that I just don't believe that that's ever going to happen (it is a fact that that's what I believe). This will discourage me from becoming involved in anything that touches on how the table AM thinks about TIDs, particularly in index AMs. As far as I'm concerned this TID stuff is up in the air (except maybe for something like zheap which is sufficiently close to heapam that it doesn't matter). > So far nobody has expressed any expectation of you doing specific work > in this thread as far as I can see? I certainly didn't intend to. I > think it's perfectly normal to discuss tradeoffs and disagree about > them? That's why I was careful to say "You have no obligation to make me happy". Of course it's true that I could easily just not get involved -- that was never my concern. Here is my concern: I have an obligation to make it clear that I think that you really ought to straighten out this business with generalizing TIDs before too long. Not because I say so, but because it's holding up progress in general. If you aren't getting cooperation from people who work on indexing (could be somebody else), then consider the possibility that this business with TIDs and bitmap scans has a lot to do with it. Most people are not as outspoken as I am. I'm not at all surprised that this happened, simply because of the history -- it makes sense. I'm glad that the table AM interface exists, and it was always going to be something that required refinement over time. I want the table AM to be successful. If I didn't care then I would say nothing at all. -- Peter Geoghegan
On Tue, May 4, 2021 at 9:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Here is my concern: I have an obligation to make it clear that I think > that you really ought to straighten out this business with > generalizing TIDs before too long. Not because I say so, but because > it's holding up progress in general. If you aren't getting cooperation > from people who work on indexing (could be somebody else), then > consider the possibility that this business with TIDs and bitmap scans > has a lot to do with it. Most people are not as outspoken as I am. It seems to me that we're doing a lot of disagreeing given that, as I see it, there are only relatively minor differences between the positions of the various people here. Andres and I are, I think, relatively convinced that variable-width TIDs would let us do things that aren't otherwise possible, and that those things are potentially useful and I would even venture to say cool. I don't believe you disagree with that, but you think it's going to be too much work to implement. Fair enough; anyone can try it who is interested and see how far they get. Anyone who thinks it's going to be impossibly hard probably will prefer not to try, and that's OK too. But if we take that off the table, what about a less-ambitious generalization of the TID mechanism? I can't really see anyone putting up a serious argument against allowing all 48 bits of space available in the existing TID format to be used which, as Jeff points out, is not currently the case. So anyone who wants to try to write that patch is free to do so. I don't have a clear idea how to make that work, to be honest, but my limited supply of ideas need not prevent anyone else from trying theirs. There might be some slight disagreement about whether it's useful to generalize TIDs from a 48-bit address space to a 64-bit address space without making it fully general. Like Andres, I am unconvinced that's meaningfully easier, and I am convinced that it's meaningfully less good, but other people can disagree and that's fine. I'm perfectly willing to change my opinion if somebody shows up with a patch that demonstrates the value of this approach. The main point here is one that I think you made a few emails back: the limitations of the current system are defined by what will actually work with the code as it exists today, not some mailing list discussion. It's too early for the project to commit to stability in this area; we have not managed to get a single AM apart from heapam into core, and that situation doesn't appear likely to change in the near future. If and when we have say 5 of those we can probably articulate some intelligent ideas about what we think the patterns that need to hold for future AMs are, but it's reckless to extrapolate from 1 working example, and right now that's all we have. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, May 5, 2021 at 7:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > It seems to me that we're doing a lot of disagreeing given that, as I > see it, there are only relatively minor differences between the > positions of the various people here. I'm being very vocal here because I'm concerned that we're going about generalizing TIDs in the wrong way. To me it feels like there is a loss of perspective about what really matters. There just isn't that many table AM TID designs that could ever work, and even among those schemes that could ever work there is a pretty clear hierarchy. This blue sky thinking about generalizing TIDs 2 years in seems *weird* to me. Nobody is obligated to reassure me. I felt that this had to be said. > Andres and I are, I think, > relatively convinced that variable-width TIDs would let us do things > that aren't otherwise possible, and that those things are potentially > useful and I would even venture to say cool. I don't believe you > disagree with that, but you think it's going to be too much work to > implement. Fair enough; anyone can try it who is interested and see > how far they get. Anyone who thinks it's going to be impossibly hard > probably will prefer not to try, and that's OK too. I think that's accurate. But it's easy to not disagree with the idea that variable-width TIDs might lead to something interesting. Talk is cheap. No other database system has something like indirect indexes. They have clustered indexes, but that's rather different. I think that indirect indexes were a design that was concerned about the issue of write amplification from non-HOT updates. But do I even remember the details correctly? We're talking about indirect indexes as if that was an idea whose high level user-visible goals were clear, but I don't even think that that much is true. This kind of thing concerns me. It very much feels like failing to see the forest for the trees. > But if we take that off the table, what about a less-ambitious > generalization of the TID mechanism? I can't really see anyone putting > up a serious argument against allowing all 48 bits of space available > in the existing TID format to be used which, as Jeff points out, is > not currently the case. So anyone who wants to try to write that patch > is free to do so. I don't have a clear idea how to make that work, to > be honest, but my limited supply of ideas need not prevent anyone else > from trying theirs. I agree that we should focus on what we can agree on. It seems as if we all more or less agree on this much. > There might be some slight disagreement about whether it's useful to > generalize TIDs from a 48-bit address space to a 64-bit address space > without making it fully general. Like Andres, I am unconvinced that's > meaningfully easier, and I am convinced that it's meaningfully less > good, but other people can disagree and that's fine. I'm perfectly > willing to change my opinion if somebody shows up with a patch that > demonstrates the value of this approach. It's going to be hard if not impossible to provide empirical evidence for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are the way to go. Same with any other scheme. We're talking way too much about TIDs themselves and way too little about table AM use cases, the way the data structures might work in new table AMs, and so on. 
> The main point here is one that I think you made a few emails back: > the limitations of the current system are defined by what will > actually work with the code as it exists today, not some mailing list > discussion. Right. > It's too early for the project to commit to stability in > this area; we have not managed to get a single AM apart from heapam > into core, and that situation doesn't appear likely to change in the > near future. I would be happy if we could commit to committing to stability. I really don't think that it's *that* hard to move significantly closer to a design that describes just how close to heapam a table AM should be. It doesn't commit the table AM to all that many details. > If and when we have say 5 of those we can probably > articulate some intelligent ideas about what we think the patterns > that need to hold for future AMs are, but it's reckless to extrapolate > from 1 working example, and right now that's all we have. I don't think that there will be 5 table AMs that are credible to users at any point in the future. In any case there only needs to be 1 or 2 good ones for the table AM to have been a resounding success. -- Peter Geoghegan
On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > I'm being very vocal here because I'm concerned that we're going about > generalizing TIDs in the wrong way. To me it feels like there is a > loss of perspective about what really matters. Well, which things matter is a question of opinion, not fact. > No other database system has something like indirect indexes. They > have clustered indexes, but that's rather different. I don't think this is true at all. If you have a clustered index - i.e. the table is physically arranged according to the index ordering - then your secondary indexes all pretty much have to be what we're calling indirect indexes. They can hardly point to a physical identifier if rows are being moved around. I believe InnoDB works this way, and I think Oracle's index-organized tables do too. I suspect there are other examples. > > There might be some slight disagreement about whether it's useful to > > generalize TIDs from a 48-bit address space to a 64-bit address space > > without making it fully general. Like Andres, I am unconvinced that's > > meaningfully easier, and I am convinced that it's meaningfully less > > good, but other people can disagree and that's fine. I'm perfectly > > willing to change my opinion if somebody shows up with a patch that > > demonstrates the value of this approach. > > It's going to be hard if not impossible to provide empirical evidence > for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are > the way to go. Same with any other scheme. We're talking way too much > about TIDs themselves and way too little about table AM use cases, the > way the data structures might work in new table AMs, and so on. I didn't mean that it has to be a test result showing that 64-bit TIDs outperform 56-bit TIDs or something. I just meant there has to be a reason to believe it's good, which could be based on a discussion of use cases or whatever. If we *don't* have a reason to believe it's good, we shouldn't do it. My point is that so far I am not seeing a whole lot of value of this proposed approach. For a 64-bit TID to be valuable to you, one of two things has to be true: you either don't care about having indexes that store TIDs on your new table type, or the index types you want to use can store those 64-bit TIDs. Now, I have not yet heard of anyone working on a table AM who does not want to be able to support adding btree indexes. There may be someone that I don't know about, and if so, fine. But otherwise, we need a way to store them. And that requires changing the page format for btree indexes. But surely we do not want to make all TIDs everywhere wider in future btree versions, so at least two TID widths - 6 bytes and 8 bytes - would have to be supported. And if we're at all going to do that, I think it's certainly worth asking whether supporting varlena TIDs would really be all that much harder. You seem to think it is, and you might be right, but I'm not ready to give up, because I do not see how we are ever going to get global indexes or indirect indexes without doing it, and those would be good features to have. If we can't ever get them, so be it, but you seem to kind of be saying that things like global indexes and indirect indexes are hard, and therefore they don't count as reasons why we might want variable-width TIDs. But one very large reason why those things are hard is that they require variable-width TIDs, so AFAICS this boils down to saying that we don't want the feature because it's hard to implement. 
But we should not conflate feasibility with desirability. I am quite sure that lots of people want global indexes. The number of people who want indirect indexes is in my estimation much smaller, but it's probably not zero, or else Alvaro wouldn't have tried his hand at writing a patch. Whether we can *get* those things is in doubt; whether it will happen in the near future is very much in doubt. But I at least am not in doubt about whether people want it, because I hear complaints about the lack of global indexes on an almost-daily basis. If those complaints are all from people hoping to fake me out into spending time on something that is worthless to them, my colleagues are very good actors. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2021-05-05 at 10:27 -0400, Robert Haas wrote: > It's too early for the project to commit to stability in > this area; we have not managed to get a single AM apart from heapam > into core "In core" shouldn't matter. In fact, if it's in core, stability of the APIs is much less important. > If and when we have say 5 of those That seems like a standard that we won't reach in any reasonable amount of time. > we can probably > articulate some intelligent ideas about what we think the patterns > that need to hold for future AMs are, but it's reckless to > extrapolate > from 1 working example, and right now that's all we have. We should count columnar as a second example. While it doesn't support everything that heap does, we are actively working on it and it's gaining features quickly. It's also showing some impressive real-world results. Regards, Jeff Davis
On Wed, May 5, 2021 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I'm being very vocal here because I'm concerned that we're going about > > generalizing TIDs in the wrong way. To me it feels like there is a > > loss of perspective about what really matters. > > Well, which things matter is a question of opinion, not fact. I'm not trying to win an argument here. I am giving an opinion in the hopes that it leads to some kind of useful synthesis, based on all of our opinions. > > No other database system has something like indirect indexes. They > > have clustered indexes, but that's rather different. > > I don't think this is true at all. If you have a clustered index - > i.e. the table is physically arranged according to the index ordering > - then your secondary indexes all pretty much have to be what we're > calling indirect indexes. They can hardly point to a physical > identifier if rows are being moved around. I believe InnoDB works this > way, and I think Oracle's index-organized tables do too. I suspect > there are other examples. But these systems don't have indirect indexes *on a heap table*! Why would they ever do it that way? They already have rowid/TID as a stable identifier of logical rows, so having indirect indexes that point to a heap table's rows would be strictly worse than the generic approach for indexes on a heap table. What we call indirect indexes seem to me to be a failed attempt to solve the "TID is not a stable identifier of logical row" issue that is baked-in to Postgres. If I thought it was worth solving that problem then I suppose I'd solve it directly. The "indirection" of indirect indexes actually buys you nothing! It just moves some of the problem somewhere else, at the cost of even more complexity. Indirect indexes (without a clustered index) are a muddled idea. Of course I accept that clustered indexes make sense in general (though less and less these days). But the fact that these systems "use indirect indexes" for secondary indexes is precisely why clustered indexes don't seem like a great design with modern hardware! Should we invest a huge amount of work in order to have all of the disadvantages, and none of the advantages? > My point is that so far I am not seeing a whole lot of value of this > proposed approach. For a 64-bit TID to be valuable to you, one of two > things has to be true: you either don't care about having indexes that > store TIDs on your new table type, or the index types you want to use > can store those 64-bit TIDs. Now, I have not yet heard of anyone > working on a table AM who does not want to be able to support adding > btree indexes. There may be someone that I don't know about, and if > so, fine. But otherwise, we need a way to store them. And that > requires changing the page format for btree indexes. But surely we do > not want to make all TIDs everywhere wider in future btree versions, > so at least two TID widths - 6 bytes and 8 bytes - would have to be > supported. I agree that we don't want a performance/space overhead for simple cases that are quite happy with the current format. > And if we're at all going to do that, I think it's > certainly worth asking whether supporting varlena TIDs would really be > all that much harder. 
> You seem to think it is, and you might be right, > but I'm not ready to give up, because I do not see how we are ever > going to get global indexes or indirect indexes without doing it, and > those would be good features to have. I think that global indexes are well worth having, and should be solved some completely different way. The partition key can be an additive thing. I strongly suspect that indirect indexes (without a clustered index) are 100% useless in both theory and practice, so naturally I have little to no interest. The difficulty of supporting (say) 6 byte and 8 byte TIDs together is vastly lower than variable-width TIDs, for all kinds of reasons. See my remarks to Andres upthread about deduplication. > If we can't ever get them, so be it, but you seem to kind of be saying > that things like global indexes and indirect indexes are hard, and > therefore they don't count as reasons why we might want variable-width > TIDs. But one very large reason why those things are hard is that they > require variable-width TIDs, so AFAICS this boils down to saying that > we don't want the feature because it's hard to implement. More like very hard to implement for a very low benefit. > But we > should not conflate feasibility with desirability. I am quite sure > that lots of people want global indexes. I do too! -- Peter Geoghegan
On Wed, 2021-05-05 at 08:50 -0700, Peter Geoghegan wrote: > There just isn't that > many table AM TID designs that could ever work, and even among those > schemes that could ever work there is a pretty clear hierarchy. This > blue sky thinking about generalizing TIDs 2 years in seems *weird* to > me. I am happy to keep table AM discussions concrete, as I have plenty of concrete problems which I would like to turn into proposals. Regards, Jeff Davis
On Wed, May 5, 2021 at 1:13 PM Jeff Davis <pgsql@j-davis.com> wrote: > "In core" shouldn't matter. In fact, if it's in core, stability of the > APIs is much less important. I don't know what to say here. I think it's unrealistic to believe that a very new API that has only 1 in-core user is going to be fully stable, or that we can know how it might evolve. I can understand why you and probably other people want that, but if somebody figures out a way to make some part of core significantly better and it requires changing that API, they're going to change the API, not give up on the idea. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, May 5, 2021 at 10:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't know what to say here. I think it's unrealistic to believe > that a very new API that has only 1 in-core user is going to be fully > stable, or that we can know how it might evolve. I can understand why > you and probably other people want that, but if somebody figures out a > way to make some part of core significantly better and it requires > changing that API, they're going to change the API, not give up on the > idea. I strongly agree. More generally, we need to decide what downsides we're willing to live with. What we have right now has little chance of failing. It also has little chance of succeeding (except for something like zheap, which can presumably get by with the heapam's idea of TID). -- Peter Geoghegan
On Wed, May 5, 2021 at 1:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I don't think this is true at all. If you have a clustered index - > > i.e. the table is physically arranged according to the index ordering > > - then your secondary indexes all pretty much have to be what we're > > calling indirect indexes. They can hardly point to a physical > > identifier if rows are being moved around. I believe InnoDB works this > > way, and I think Oracle's index-organized tables do too. I suspect > > there are other examples. > > But these systems don't have indirect indexes *on a heap table*! Why > would they ever do it that way? They already have rowid/TID as a > stable identifier of logical rows, so having indirect indexes that > point to a heap table's rows would be strictly worse than the generic > approach for indexes on a heap table. One advantage of indirect indexes is that you can potentially avoid a lot of writes to the index. If a non-HOT update is performed, but the primary key is not updated, the index does not need to be touched. I think that's a potentially significant savings, even if bottom-up index deletion would have prevented the page splits. Similarly, you can mark a dead line pointer unused without having to scan the indirect index, because the index isn't pointing to that dead line pointer anyway. Hmm, but I guess you have another cleanup problem. What prevents someone from inserting a new row with the same primary key as a previously-deleted row but different values in some indirectly-indexed column? Then the old index entries, if still present, could mistakenly refer to the new row. I don't know whether Alvaro thought of that problem when he was working on this previously, or whether he solved it somehow. Possibly that's a big enough problem that the whole idea is dead in the water, but it's not obvious to me that this is so. And, anyway, this whole argument is predicated on the fact that the only table AM we have right now is heapam. If we had a table AM that organized the data by primary key value, we'd still want to be able to have secondary indexes, and they'd have to use the primary key value as the TID. > I think that global indexes are well worth having, and should be > solved some completely different way. The partition key can be an > additive thing. I agree that the partition identifier should be an additive thing, but where would we add it? It seems to me that the obvious answer is to make it a column of the index tuple. And if we can do that, why can't we put whatever kind of TID-like stuff people want in the index tuple, too? Maybe part of the problem here is that I don't actually understand how posting lists are represented... -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote: > What we have right now has little chance of failing. It also has > little chance of succeeding (except for something like zheap, which > can presumably get by with the heapam's idea of TID). What has little chance of succeeding? Table AMs? And why isn't columnar an example of something that can "get by with heapam's idea of TID"? I mean, it's not a perfect fit, but my primary complaint in this whole thread is that it's undefined, not that it's completely unworkable. Regards, Jeff Davis
On Wed, May 5, 2021 at 10:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > One advantage of indirect indexes is that you can potentially avoid a > lot of writes to the index. If a non-HOT update is performed, but the > primary key is not updated, the index does not need to be touched. I > think that's a potentially significant savings, even if bottom-up > index deletion would have prevented the page splits. Similarly, you > can mark a dead line pointer unused without having to scan the > indirect index, because the index isn't pointing to that dead line > pointer anyway. As I said, this is equivalent to solving the "TID is a stable identifier of logical row" issue (an exceptionally hard problem that I don't think is worth solving), except that you make the secondary indexes have potentially larger keys for no benefit. Sure, you can consistently refer to a logical row using its PK value (assuming you have this whole two-phase locking infrastructure), but why wouldn't you "just" solve the problem with TID directly instead? What does involving PK values actually buy you? I am pretty sure that the answer is "less than nothing". It is still true that I'm arguing against ever having a clustered index table AM, which would be somewhat useful to users (that much I'll own). The main reason for that is because we'd still be required to solve the "TID is a stable identifier of logical row" issue, except it's not a physiological TID/rowid (it's a fully logical row identifier). So everything seems to lead back to that hard problem seeming insoluble. > Hmm, but I guess you have another cleanup problem. What prevents > someone from inserting a new row with the same primary key as a > previously-deleted row but different values in some indirectly-indexed > column? Two-phase locking in indexes stops it. Note that this is pretty much what happens in Oracle -- it's not just SQL Server. This is why we have rich extensibility indexing -- indexes are strictly physical data structures in Postgres. > And, anyway, this whole argument is predicated on the fact that the > only table AM we have right now is heapam. If we had a table AM that > organized the data by primary key value, we'd still want to be able to > have secondary indexes, and they'd have to use the primary key value > as the TID. But Jeff has a design for the columnstore table AM where TIDs are essentially logical (not physiological like those of heapam), that nevertheless will work with the design around TIDs that I have in mind. "Logical identifiers" versus "Logical identifiers that stably identify logical rows" seems like a subtle but important distinction here. Of course I cannot yet rule out the possibility that this approach to TIDs will always be good enough. But it sure seems like it might be, and starting with the assumption that it is and working backwards seems like a good way to attack the problem as a practical matter. > > I think that global indexes are well worth having, and should be > > solved some completely different way. The partition key can be an > > additive thing. > > I agree that the partition identifier should be an additive thing, but > where would we add it? It seems to me that the obvious answer is to > make it a column of the index tuple. Right. > And if we can do that, why can't > we put whatever kind of TID-like stuff people want in the index tuple, > too? Maybe part of the problem here is that I don't actually > understand how posting lists are represented... 
You want to use the partition identifier for predicate push-down and stuff anyway, so making it part of the TID doesn't seem particularly natural to me. "Posting list splits" from the nbtree README will give you some idea of why I care about making TIDs integer-like and equi-sized within any given index tuple. There will be similar considerations for GIN. Though I think that nbtree deduplication is important enough on its own to try to preserve across table AMs. -- Peter Geoghegan
Hi, On 2021-05-05 13:32:57 -0400, Robert Haas wrote: > I don't know what to say here. I think it's unrealistic to believe > that a very new API that has only 1 in-core user is going to be fully > stable, or that we can know how it might evolve. I can understand why > you and probably other people want that, but if somebody figures out a > way to make some part of core significantly better and it requires > changing that API, they're going to change the API, not give up on the > idea. Yea. I think it would be actively *bad* if tableam were too stable. tableam is at best an 80% solution to the abstraction needs (those 80% were pretty painful to achieve already, I don't think we could have gotten much more initially). If we get cornered into not evolving the API because of 2-3 external users, we're a) going to live with a leaky abstraction for much longer b) getting more hesitant to work incrementally. Both would be bad. Greetings, Andres Freund
Hi, On 2021-05-05 10:56:56 -0700, Jeff Davis wrote: > On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote: > > What we have right now has little chance of failing. It also has > > little chance of succeeding (except for something like zheap, which > > can presumably get by with the heapam's idea of TID). > > What has little chance of succeeding? Table AMs? > > And why isn't columnar an example of someting that can "get by with > heapam's idea of TID"? I mean, it's not a perfect fit, but my primary > complaint this whole thread is that it's undefined, not that it's > completely unworkable. Agreed. And we can increase the fit a good bit without needing invasive all-over changes. With those changes often even helping heap. E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem even for heap - we waste a lot of space that's not commonly used. A better data structure would help (something radix-tree-like, I'd say, but several tree-shaped approaches seem possible). Greetings, Andres Freund
On Wed, May 5, 2021 at 10:56 AM Jeff Davis <pgsql@j-davis.com> wrote: > What has little chance of succeeding? Table AMs? > > And why isn't columnar an example of someting that can "get by with > heapam's idea of TID"? I mean, it's not a perfect fit, but my primary > complaint this whole thread is that it's undefined, not that it's > completely unworkable. I think that it could be fairly workable with moderate effort (maybe even without that effort, but that doesn't seem so appealing). To do it well we have to actually generalize TIDs sensibly. And I think that that means admitting that we'll never solve the "TID is a stable identifier of a logical row, not a physical version" problem. ISTM that that's the problem that is at the root of everything here. -- Peter Geoghegan
On Wed, May 5, 2021 at 11:25 AM Andres Freund <andres@anarazel.de> wrote: > Agreed. And we can increase the fit a good bit without needing invasive > all-over changes. With those changes often even helping heap. > > E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem > even for heap - we waste a lot of space that's not commonly used. A > better datastructure (radix tree like I'd say, but several tree shaped > approaches seem possible). Agreed -- even if we only cared about heapam we still ought to do something about tidbitmap.c's use of MaxHeapTuplesPerPage. -- Peter Geoghegan
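For readers unfamiliar with the tidbitmap.c issue being referenced, the following is a simplified, hypothetical illustration -- PageEntrySketch and the constants here are not the real tidbitmap.c definitions -- of how sizing a per-page bitmap by MaxHeapTuplesPerPage wastes space when real pages hold far fewer tuples than the theoretical maximum:

/*
 * Simplified illustration only -- not the real tidbitmap.c definitions.
 * If every per-page entry reserves a bitmap sized for the theoretical
 * maximum number of line pointers (291 for an 8kB heap page), each entry
 * pays for the worst case even though typical pages hold far fewer
 * tuples, and the limit is hardwired regardless of the table AM.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_TUPLES_PER_PAGE   291   /* heapam's MaxHeapTuplesPerPage, 8kB */
#define BITS_PER_BITMAPWORD   64
#define WORDS_PER_PAGE  ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)

typedef struct PageEntrySketch
{
    uint32_t blockno;
    uint64_t words[WORDS_PER_PAGE];   /* fixed-size bitmap of offsets */
} PageEntrySketch;

int
main(void)
{
    PageEntrySketch e;

    /* 5 words x 8 bytes = 40 bitmap bytes per tracked page, even when the
     * pages in question never hold more than 64 tuples (one word). */
    printf("bitmap bytes per page entry: %zu\n", sizeof(e.words));
    return 0;
}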
On Wed, 2021-05-05 at 11:22 -0700, Andres Freund wrote: > Yea. I think it would be actively *bad* if tableam were too > stable. tableam is at best an 80% solution to the abstraction needs > (those 80% were pretty painful to achieve already, I don't think we > could have gotten much more initially). If we get cornered into not > evolving the API because of 2-3 external users, we're a) going to > live > with a leaky abstraction for much longer b) getting more hesitant to > work incrementally. Both would be bad. Like anything, we make the decision at the time we have a reason to break something. But why are extensions disfavored in this calculation vs. in-core? Isn't it a lot easier to update in-core code to new APIs? Evolving the API is one thing, but we should be more careful about things that could affect on-disk state like ItemPointer representations. By "more careful", I don't mean that we reject all proposals; I mean that we don't casually impose new limits in other parts of the system that happen to work for heapam but will cause table AM extensions to break. Regards, Jeff Davis
On Wed, May 5, 2021 at 12:09 PM Jeff Davis <pgsql@j-davis.com> wrote: > Like anything, we make the decision at the time we have a reason to > break something. But why are are exensions disfavored in this > calculation vs. in-core? Isn't it a lot easier to update in-core code > to new APIs? We don't really have an API for how TIDs behave (unless you happen to want to emulate heapam, which is reasonable and was expected). It's unspecified because nobody knows what it is (or what it should be) just yet. AFAICT there is no TID API to break. -- Peter Geoghegan
On Wed, 5 May 2021 at 19:15, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 5, 2021 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > I'm being very vocal here because I'm concerned that we're going about > > > generalizing TIDs in the wrong way. To me it feels like there is a > > > loss of perspective about what really matters. > > > > Well, which things matter is a question of opinion, not fact. > > I'm not trying to win an argument here. I am giving an opinion in the > hopes that it leads to some kind of useful synthesis, based on all of > our opinions. > > > > No other database system has something like indirect indexes. They > > > have clustered indexes, but that's rather different. > > > > I don't think this is true at all. If you have a clustered index - > > i.e. the table is physically arranged according to the index ordering > > - then your secondary indexes all pretty much have to be what we're > > calling indirect indexes. They can hardly point to a physical > > identifier if rows are being moved around. I believe InnoDB works this > > way, and I think Oracle's index-organized tables do too. I suspect > > there are other examples. > > But these systems don't have indirect indexes *on a heap table*! Why > would they ever do it that way? They already have rowid/TID as a > stable identifier of logical rows, so having indirect indexes that > point to a heap table's rows would be strictly worse than the generic > approach for indexes on a heap table. > > What we call indirect indexes seem to me to be a failed attempt to > solve the "TID is not a stable identifier of logical row" issue that > is baked-in to Postgres. If I thought it was worth solving that > problem then I suppose I'd solve it directly. The "indirection" of > indirect indexes actuallys buys you nothing! It just moves some of the > problem somewhere else, at the cost of even more complexity. Indirect > indexes (without a clustered index) are a muddled idea. > > Of course I accept that clustered indexes make sense in general > (though less and less these days). But the fact that these systems > "use indirect indexes" for secondary indexes is precisely why > clustered indexes don't seem like a great design with modern hardware! > Should we invest a huge amount of work in order to have all of the > disadvantages, and none of the advantages? > > > My point is that so far I am not seeing a whole lot of value of this > > proposed approach. For a 64-bit TID to be valuable to you, one of two > > things has to be true: you either don't care about having indexes that > > store TIDs on your new table type, or the index types you want to use > > can store those 64-bit TIDs. Now, I have not yet heard of anyone > > working on a table AM who does not want to be able to support adding > > btree indexes. There may be someone that I don't know about, and if > > so, fine. But otherwise, we need a way to store them. And that > > requires changing the page format for btree indexes. But surely we do > > not want to make all TIDs everywhere wider in future btree versions, > > so at least two TID widths - 6 bytes and 8 bytes - would have to be > > supported. > > I agree that we don't want a performance/space overhead for simple > cases that are quite happy with the current format. > > > And if we're at all going to do that, I think it's > > certainly worth asking whether supporting varlena TIDs would really be > > all that much harder. 
> > You seem to think it is, and you might be right, > > but I'm not ready to give up, because I do not see how we are ever > > going to get global indexes or indirect indexes without doing it, and > > those would be good features to have. > > I think that global indexes are well worth having, and should be > solved some completely different way. The partition key can be an > additive thing. I believe that it cannot be "just" an additive thing, at least not through a normal INCLUDEd column, as you'd get duplicate TIDs in the index, with its related problems. You also cannot add it as a key column, as this would disable UNIQUE indexes; one of the largest use cases of global indexes. So, you must create specialized infrastructure for this identifier. And when we're already adding specialized infrastructure, then this should probably be part of a new TID infrastructure. And if we're going to change TID infrastructure to allow for more sizes (as we'd need normal TableAM TIDs, and global index partition-identifying TIDs), I'd argue that it should not be too much more difficult to create an infrastructure for 'new TID' in which the table AM supplies type, size and strict ordering information for these 'new TID's. And if this 'new TID' size is not going to be defined by the index AM but by the indexed object (be it a table or a 'global' or whatever we'll build indexes on), I see no reason why this 'new TID' infrastructure couldn't eventually support variable length TIDs; or constant sized usertype TIDs (e.g. the 3 int columns of the primary key of a clustered table). The only requirements that I believe to be fundamental for any kind of TID are 1.) Uniqueness during the lifecycle of the tuple, from creation to life to dead to fully dereferenced from all indexes; 2.) There exists a strict ordering of all TIDs of that type; And maybe to supply some form of efficiency to the underlying tableAM: 3.) There should be an equivalent of bitmap for that TID type. For the nbtree deduplication subsystem, and for gin posting lists to be able to work efficiently, the following must also hold: 4.) The TID type has a fixed size, preferably efficiently packable. Only the last requirement cannot be met with varlena TID types. But, as I also believe that not all indexes can be expected to work (well) for all kinds of TableAM, I don't see how this would be a blocking issue. > I strongly suspect that indirect indexes (without a > clustered index) are 100% useless in both theory and practice, so > naturally I have little to no interest. > > The difficulty of supporting (say) 6 byte and 8 byte TIDs together is > vastly lower than variable-width TIDs, for all kinds of reasons. See > my remarks to Andres upthread about deduplication. I believe that deduplication is amazing when it works, but it should not be a blocker for new TID infrastructure (assuming it still works by default for nbtree+heap). As an example: numeric columns cannot be deduplicated, and that wasn't considered a blocker for deduplication to be merged. The only reasons I can think of why varlena TIDs cannot be efficiently deduplicated are storage and search efficiency: I cannot binary search in a packed array of varlena attributes, but I can binary search through packed fixed-size attributes. Any fixed-size TID can realistically be deduplicated, assuming enough effort is put into the patch implementing the new TID infrastructure. 
> > If we can't ever get them, so be it, but you seem to kind of be saying > > that things like global indexes and indirect indexes are hard, and > > therefore they don't count as reasons why we might want variable-width > > TIDs. But one very large reason why those things are hard is that they > > require variable-width TIDs, so AFAICS this boils down to saying that > > we don't want the feature because it's hard to implement. > > More like very hard to implement for a very low benefit. Complicated optimizations have been merged which had only a small gain. Storage gains for index-oriented tables can become as large as the size of the primary key by not having to store all primary key values in both the index and the table, which can thus be around 100% of a table in the least efficient cases of having a PK over all columns. Yes, this might indeed be only a 'small gain' for access latency, but not needing to store another copy of your data (and keeping it in cache, etc.) is a significant win in my book. Regarding losing deduplication in btrees when we have varlena TIDs: This loss in [storage] efficiency can be partially mitigated by implementing prefix truncation/prefix deduplication, and as such that loss would not necessarily be too problematic when PT/PD is implemented. With regards, Matthias van de Meent
On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I believe that it cannot be "just" an additive thing, at least not > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > index, with its related problems. You also cannot add it as a key > column, as this would disable UNIQUE indexes; one of the largest use > cases of global indexes. So, you must create specialized > infrastructure for this identifier. You're just quibbling about the precise words that I used. Of course it is true that there must be some sense in which a global index partition key attribute will need to be special to the implementation -- how else could a global index enforce uniqueness? That was clearly implied. > And when we're already adding specialized infrastructure, then this > should probably be part of a new TID infrastructure. This is a non-sequitur. > And if we're going to change TID infrastructure to allow for more > sizes (as we'd need normal TableAM TIDs, and global index > partition-identifying TIDs), I'd argue that it should not be too much > more difficult to create an infrastructure for 'new TID' in which the > table AM supplies type, size and strict ordering information for these > 'new TID's. > > And if this 'new TID' size is not going to be defined by the index AM > but by the indexed object (be it a table or a 'global' or whatever > we'll build indexes on), I see no reason why this 'new TID' > infrastructure couldn't eventually support variable length TIDs; or > constant sized usertype TIDs (e.g. the 3 int columns of the primary > key of a clustered table). You're not considering the big picture. It's not self-evident that anybody will ever have much use for a variable-width TID in their table AM, at least beyond some fairly simple scheme -- because of the fundamental issue of TID not working as a stable identifier of logical rows in Postgres. If it was very clear that there would be *some* significant benefit then the costs might start to look reasonable. But there isn't. "Build it and they will come" is not at all convincing to me. -- Peter Geoghegan
On Wed, 5 May 2021 at 22:09, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I believe that it cannot be "just" an additive thing, at least not > > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > > index, with its related problems. You also cannot add it as a key > > column, as this would disable UNIQUE indexes; one of the largest use > > cases of global indexes. So, you must create specialized > > infrastructure for this identifier. > > You're just quibbling about the precise words that I used. Of course > it is true that there must be some sense in which a global index > partition key attribute will need to be special to the implementation > -- how else could a global index enforce uniqueness? That was clearly > implied. This implication was not 100% clear to me, and the last thread on global indexes that implemented it through INCLUDEd columns didn't mention this. As such, I wanted to explicitly mention that this partition/table identifier would need to be part of the keyspace. > > And when we're already adding specialized infrastructure, then this > > should probably be part of a new TID infrastructure. > > This is a non-sequitur. I may have skipped some reasoning: I believe that the TID is the unique identifier of that tuple, within context. For normal indexes, the TID as supplied directly by the TableAM is sufficient, as the context is that table. For global indexes, this TID must include enough information to relate it to the table the tuple originated from. In the whole database, that would be the OID of the table + the TID as supplied by the table. As such, the identifier of the logical row (which can be called the TID), as stored in index tuples in global indexes, would need to consist of the TableAM supplied TID + the (local) id of the table containing the tuple. Assuming we're in agreement on that part, I would think it would be consistent to put this in TID infrastructure, such that all indexes that use such new TID infrastructure can be defined to be global with only minimal effort. > > And if we're going to change TID infrastructure to allow for more > > sizes (as we'd need normal TableAM TIDs, and global index > > partition-identifying TIDs), I'd argue that it should not be too much > > more difficult to create an infrastructure for 'new TID' in which the > > table AM supplies type, size and strict ordering information for these > > 'new TID's. > > > > And if this 'new TID' size is not going to be defined by the index AM > > but by the indexed object (be it a table or a 'global' or whatever > > we'll build indexes on), I see no reason why this 'new TID' > > infrastructure couldn't eventually support variable length TIDs; or > > constant sized usertype TIDs (e.g. the 3 int columns of the primary > > key of a clustered table). > > You're not considering the big picture. It's not self-evident that > anybody will ever have much use for a variable-width TID in their > table AM, at least beyond some fairly simple scheme -- because of the > fundamental issue of TID not working as a stable identifier of logical > rows in Postgres. ZHeap states that it can implement stable TIDs within limits, as IIRC it requires retail index deletion support for all indexes on the updated columns of that table. 
I fail to see why this same infrastructure could not be used for supporting clustered tables, while enforcing the limits that are only softly enforced in ZHeap (that is, only allowing index AMs that support retail index tuple deletion). > If it was very clear that there would be *some* > significant benefit then the costs might start to look reasonable. But > there isn't. "Build it and they will come" is not at all convincing to > me. Clustered tables / Index-oriented Tables are very useful for tables of which most columns are contained in the PK, or otherwise are often ordered by their PK. I don't know of any way that would allow us to build a clustered table _without_ including the primary key in some form into the TID, or otherwise introducing a layer of indirection that would undo the clustered access implied by the clustered table. Additionally, compacting/re-clustering a table would be _much_ cheaper for clustered tables, as the indexes attached to that table would not need rebuilding: all TIDs will stay valid across the clustering operation. With regards, Matthias van de Meent
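As a concrete, purely hypothetical illustration of the composite identifier described above -- none of these type or field names exist in PostgreSQL -- a global-index entry's "TID" would be the table-local TID plus a partition/table identifier, sorted on the partition identifier first:

/*
 * Hypothetical sketch only -- these types and names are not PostgreSQL
 * symbols.  A global index entry would carry the table-local TID plus a
 * partition identifier, and sort on the partition identifier first so
 * that all entries for one partition are contiguous (convenient for
 * cleanup after DETACH/DROP and for pushing down partition predicates).
 */
#include <stdint.h>
#include <stdio.h>

typedef struct LocalTidSketch      /* stand-in for ItemPointerData */
{
    uint32_t block;
    uint16_t offset;
} LocalTidSketch;

typedef struct GlobalTidSketch
{
    uint32_t       partition_id;   /* identifies the partition/table */
    LocalTidSketch tid;            /* TID as supplied by that table's AM */
} GlobalTidSketch;

static int
global_tid_cmp(const GlobalTidSketch *a, const GlobalTidSketch *b)
{
    if (a->partition_id != b->partition_id)
        return (a->partition_id < b->partition_id) ? -1 : 1;
    if (a->tid.block != b->tid.block)
        return (a->tid.block < b->tid.block) ? -1 : 1;
    if (a->tid.offset != b->tid.offset)
        return (a->tid.offset < b->tid.offset) ? -1 : 1;
    return 0;
}

int
main(void)
{
    GlobalTidSketch a = {7, {10, 2}};
    GlobalTidSketch b = {7, {10, 3}};

    printf("cmp = %d\n", global_tid_cmp(&a, &b));   /* -1: same partition */
    return 0;
}

Whether such a composite key is "just another key column" or a new kind of TID is exactly the disagreement in the surrounding messages; the sketch only shows the shape of the key space being discussed.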
On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I believe that the TID is the unique identifier of that tuple, within context. > > For normal indexes, the TID as supplied directly by the TableAM is > sufficient, as the context is that table. > For global indexes, this TID must include enough information to relate > it to the table the tuple originated from. Clearly something like a partition identifier column is sometimes just like a regular user-visible column, though occasionally not like one -- whichever is useful to the implementation in each context. For example, we probably want to do predicate pushdown, maybe with real cataloged operators that access the column like any other user-created column (the optimizer knows about the column, which even has a pg_attribute entry). Note that we only ever access the TID column using an insertion scankey today -- so there are several ways in which the partition identifier really would be much more like a user column than tid/scantid ever was. The TID is a key column for most purposes as of Postgres 12 (at least internally). That didn't break all unique indexes due to the existence of non-unique TIDs across duplicates! Insertions that must call _bt_check_unique() can deal with the issue directly, by temporarily unsetting scantid. We can easily do roughly the same thing here: be slightly creative about how we interpret whether or not the partition identifier is "just another key column" across each context. This is also similar to the way the implementation is slightly creative about NULL values, which are not equal to any other value to the user, but are nevertheless just another value from the domain of indexed values to the nbtree implementation. Cleverly defining the semantics of keys to get better performance and to avoid the need for special case code is more or less a standard technique. > In the whole database, that would be the OID of the table + the TID as > supplied by the table. > > As such, the identifier of the logical row (which can be called the > TID), as stored in index tuples in global indexes, would need to > consist of the TableAM supplied TID + the (local) id of the table > containing the tuple. 2 points: 1. Clearly you need to use the partition identifier with the TID in order to look up the version in the table -- you need to use both together in global indexes. But it can still work in much the same way as it would in a standard index -- it's just that you handle that extra detail as well. That's what I meant by additive. 2. If a TID points to a version of a row (or whatever you want to call the generalized version of a HOT chain -- almost the same thing), then of course you can always map it back to the logical row. That must always be true. It is equally true within a global index. Points 1 and 2 above seem obvious to me...so I think we agree on that much. I just don't know how you go from here to "we need variable-width TIDs". In all sincerity, I am confused because to me it just seems as if you're asserting that it must be necessary to have variable width TIDs again and again, without ever getting around to justifying it. Or even trying to. > Assuming we're in agreement on that part, I > would think it would be consistent to put this in TID infrastructure, > such that all indexes that use such new TID infrastructure can be > defined to be global with only minimal effort. Abstract definitions can be very useful, but ultimately they're just tools. 
They're seldom useful as a starting point in my experience. I try to start with the reality on the ground, and perhaps arrive at some kind of abstract model or idea much later. > ZHeap states that it can implement stable TIDs within limits, as IIRC > it requires retail index deletion support for all indexes on the > updated columns of that table. Whether or not that's true is not at all clear. What is true is that the prototype version of zheap that we have as of today is notable in that it more or less allows the moral equivalent of a HOT chain to be arbitrarily long (or much longer, at least). To the best of my knowledge there is nothing about retail index tuple deletion in the design, except perhaps something vague and aspirational. > I fail to see why this same > infrastructure could not be used for supporting clustered tables, > while enforcing these limits only soft enforced in ZHeap (that is, > only allowing index AMs that support retail index tuple deletion). You're ignoring an ocean of complexity here. Principally the need to implement something like two-phase locking (key value locking) in indexes to make this work, but also the need to account for how fundamentally redefining TID breaks things. To say nothing of how this might affect crash recovery. > > If it was very clear that there would be *some* > > significant benefit then the costs might start to look reasonable. But > > there isn't. "Build it and they will come" is not at all convincing to > > me. > > Clustered tables / Index-oriented Tables are very useful for tables of > which most columns are contained in the PK, or otherwise are often > ordered by their PK. I'm well aware of the fact that clustered index based tables are sometimes more useful than heap-based tables. -- Peter Geoghegan
On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I believe that it cannot be "just" an additive thing, at least not > through a normal INCLUDEd column, as you'd get duplicate TIDs in the > index, with its related problems. You also cannot add it as a key > column, as this would disable UNIQUE indexes; one of the largest use > cases of global indexes. So, you must create specialized > infrastructure for this identifier. > > And when we're already adding specialized infrastructure, then this > should probably be part of a new TID infrastructure. > > And if we're going to change TID infrastructure to allow for more > sizes (as we'd need normal TableAM TIDs, and global index > partition-identifying TIDs), I'd argue that it should not be too much > more difficult to create an infrastructure for 'new TID' in which the > table AM supplies type, size and strict ordering information for these > 'new TID's. > > And if this 'new TID' size is not going to be defined by the index AM > but by the indexed object (be it a table or a 'global' or whatever > we'll build indexes on), I see no reason why this 'new TID' > infrastructure couldn't eventually support variable length TIDs; or > constant sized usertype TIDs (e.g. the 3 int columns of the primary > key of a clustered table). > > The only requirements that I believe to be fundamental for any kind of TID are > > 1.) Uniqueness during the lifecycle of the tuple, from creation to > life to dead to fully dereferenced from all indexes; > 2.) There exists a strict ordering of all TIDs of that type; > > And maybe to supply some form of efficiency to the underlying tableAM: > > 3.) There should be an equivalent of bitmap for that TID type. > > For the nbtree deduplication subsystem, and for gin posting lists to > be able to work efficiently, the following must also hold: > > 4.) The TID type has a fixed size, preferably efficiently packable. > > Only the last requirement cannot be met with varlena TID types. But, > as I also believe that not all indexes can be expected to work (well) > for all kinds of TableAM, I don't see how this would be a blocking > issue. +1 to all of that. > Storage gains for index-oriented tables can become as large as the > size of the primary key by not having to store all primary key values > in both the index and the table; which can thus be around 100% of a > table in the least efficient cases of having a PK over all columns. > > Yes, this might be indeed only a 'small gain' for access latency, but > not needing to store another copy of your data (and keeping it in > cache, etc.) is a significant win in my book. This is a really good point. Also, if the table is ordered by a synthetic logical TID, range scans on the primary key will be less efficient than if the primary key is itself the TID. We have the ability to CLUSTER on an index for good reasons, and "Automatically maintain clustering on a table" has been on the todo list forever. It's hard to imagine this will ever be achieved with the current heap, though: the way to get there is to have a table AM for which this is an explicit goal. -- Robert Haas EDB: http://www.enterprisedb.com
How hard would it be to declare TID as the current ItemPointerData with some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff?). And then commit to fixing usage outside access/heap/ which depends on a small value for MaxHeapTuplesPerPage, currently only nodes/tidbitmap.c, access/gin/ginpostinglist.c and access/brin/brin_*.c. There is also MaxHeapTuplesPerPage usage in ./backend/storage/page/bufpage.c, but it seems to be all in heapam-dependent functions (PageRepairFragmentation(), PageGetHeapFreeSpace() and a few others) which most likely should be moved to access/heap/. Doing it this way would leave us with some manageable complexity in mapping from a TID to a 48-bit integer and/or the 3 wanted positions in 2^32. ------------ Hannu Krosing On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 5, 2021 at 11:25 AM Andres Freund <andres@anarazel.de> wrote: > > Agreed. And we can increase the fit a good bit without needing invasive > > all-over changes. With those changes often even helping heap. > > > > E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem > > even for heap - we waste a lot of space that's not commonly used. A > > better datastructure (radix tree like I'd say, but several tree shaped > > approaches seem possible). > > Agreed -- even if we only cared about heapam we still ought to do > something about tidbitmap.c's use of MaxHeapTuplesPerPage. > > -- > Peter Geoghegan > >
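To spell out the kind of mapping this proposal implies, here is a sketch -- not existing PostgreSQL code, and the function names are illustrative only -- that packs a table AM's plain row number into an ItemPointer-shaped (block, offset) pair while skipping InvalidOffsetNumber (0), the special offsets 0xfffd/0xfffe/0xffff, and InvalidBlockNumber (0xFFFFFFFF):

/*
 * Sketch only -- not existing PostgreSQL code.  Maps a table AM's plain
 * row number to a (block, offset) pair and back, skipping offset 0
 * (InvalidOffsetNumber), the special offsets 0xfffd/0xfffe/0xffff, and
 * block 0xFFFFFFFF (InvalidBlockNumber).
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define USABLE_OFFSETS  0xFFFCu       /* offsets 1 .. 0xFFFC are usable */
#define USABLE_BLOCKS   0xFFFFFFFFu   /* blocks  0 .. 0xFFFFFFFE are usable */

typedef struct TidSketch
{
    uint32_t block;
    uint16_t offset;
} TidSketch;

static TidSketch
rownumber_to_tid(uint64_t rownum)
{
    TidSketch tid;

    assert(rownum < (uint64_t) USABLE_BLOCKS * USABLE_OFFSETS);
    tid.block = (uint32_t) (rownum / USABLE_OFFSETS);
    tid.offset = (uint16_t) ((rownum % USABLE_OFFSETS) + 1);  /* skip 0 */
    return tid;
}

static uint64_t
tid_to_rownumber(TidSketch tid)
{
    return (uint64_t) tid.block * USABLE_OFFSETS + (tid.offset - 1);
}

int
main(void)
{
    uint64_t rownum = 123456789012ULL;
    TidSketch tid = rownumber_to_tid(rownum);

    printf("block %u, offset %u, round trip %llu\n",
           (unsigned) tid.block, (unsigned) tid.offset,
           (unsigned long long) tid_to_rownumber(tid));
    return 0;
}

The "manageable complexity" referred to above is essentially this division/modulo pair; the open question in the rest of the thread is which offset values the rest of the system actually lets a table AM use.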
On Thu, May 6, 2021 at 3:07 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent > > Storage gains for index-oriented tables can become as large as the > > size of the primary key by not having to store all primary key values > > in both the index and the table; which can thus be around 100% of a > > table in the least efficient cases of having a PK over all columns. > > > > Yes, this might be indeed only a 'small gain' for access latency, but > > not needing to store another copy of your data (and keeping it in > > cache, etc.) is a significant win in my book. > > This is a really good point. Also, if the table is ordered by a > synthetic logical TID, range scans on the primary key will be less > efficient than if the primary key is itself the TID. We have the > ability to CLUSTER on an index for good reasons, and "Automatically > maintain clustering on a table" has been on the todo list forever. > It's hard to imagine this will ever be achieved with the current heap, > though: the way to get there is to have a table AM for which this is > an explicit goal. But would this not have the downside that all the secondary indexes will blow up as they now need to have the full table row as the TID ? ----- Hannu Krosing
On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > How hard would it be to declare TID as current ItemPointerData with > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). I don't think there's consensus in this thread that we want to do that, but I'd be fine with it. It's possible but not trivial. tidbitmap.c would be the biggest challenge, I think. Regards, Jeff Davis
----- Hannu Krosing On Thu, May 6, 2021 at 4:53 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote: > > How hard would it be to declare TID as current ItemPointerData with > > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe, > > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?). > > I don't think there's consensus in this thread that we want to do that, > but I'd be fine with it. Sure. I just proposed this as a Minimal Viable Change. Just hoping that we can agree on an interim solution which addresses the immediate problem and then continue arguing about the ideal way for the rest of the v15 cycle (and the v16 and v17 ;) ) > > It's possible but not trivial. tidbitmap.c would be the biggest > challenge, I think. I think all these places (tidbitmap, gin, brin) rely on "relatively small" MaxHeapTuplesPerPage as an upper bound for some allocations and then still allocate a lot more than needed. One can get to 291 tids / page only when you create a table with no columns, or fewer than 8 columns which are all set to NULL. A table with a single non-null boolean column already can fit only 226 tuples per page. It is definitely a non-trivial amount of work to rewrite these three, but going to (almost) full 48 bits from the current theoretically-a-little-over-40-effective-bits would expand the number of addressable tuples 225 times. Of course it would be extra nice to also somehow encode the 3 special ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get an option of uninterrupted 48-bit address space for non-heap table AMs, but this would likely be much more disruptive, if possible at all. We could still check if they are used outside of heapam and maybe just fix these places. > > Regards, > Jeff Davis > > >
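For anyone checking the 291 and 226 figures, the arithmetic works out as follows, using heapam's layout constants for the default 8kB block size (an illustration of the calculation, not PostgreSQL source):

/*
 * Back-of-the-envelope check of the 291 and 226 figures above, using
 * heapam's layout constants for the default 8kB block size and 8-byte
 * MAXALIGN.  Illustration only -- not PostgreSQL source.
 */
#include <stdio.h>

#define BLCKSZ        8192
#define PAGE_HEADER     24    /* SizeOfPageHeaderData */
#define TUPLE_HEADER    23    /* SizeofHeapTupleHeader */
#define ITEMID           4    /* sizeof(ItemIdData), one line pointer */
#define MAXALIGN8(x)  (((x) + 7) & ~7)

int
main(void)
{
    int space = BLCKSZ - PAGE_HEADER;

    /* MaxHeapTuplesPerPage assumes bare tuple headers with no data at all */
    printf("max tuples/page:       %d\n",
           space / (MAXALIGN8(TUPLE_HEADER) + ITEMID));                /* 291 */

    /* One non-null boolean: aligned header, 1 data byte, tuple re-aligned */
    printf("bool-only tuples/page: %d\n",
           space / (MAXALIGN8(MAXALIGN8(TUPLE_HEADER) + 1) + ITEMID)); /* 226 */

    return 0;
}

That is, 8168 / 28 = 291 in the degenerate no-data case, and 8168 / 36 = 226 once even a single byte of user data forces the tuple up to the next alignment boundary, which is why the theoretical per-page limit overshoots real pages so badly.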
On Wed, May 5, 2021 at 10:53 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> > How hard would it be to declare TID as current ItemPointerData with
> > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?).
>
> I don't think there's consensus in this thread that we want to do that,
> but I'd be fine with it.
>
> It's possible but not trivial. tidbitmap.c would be the biggest
> challenge, I think.

I think that would be fine, too. I don't think it's the ideal
situation, but it seems like a clear improvement over what we have now.
We might want to reserve a few values for future projects that might
need distinguished values like SpecTokenOffsetNumber or
MovedPartitionsOffsetNumber, though, so we don't completely box
ourselves into a corner.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 6 May 2021 at 01:22, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I believe that the TID is the unique identifier of that tuple, within context. > > > > For normal indexes, the TID as supplied directly by the TableAM is > > sufficient, as the context is that table. > > For global indexes, this TID must include enough information to relate > > it to the table the tuple originated from. > > Clearly something like a partition identifier column is sometimes just > like a regular user-visible column, though occasionally not like one > -- whichever is useful to the implementation in each context. For > example, we probably want to do predicate pushdown, maybe with real > cataloged operators that access the column like any other user-created > column (the optimizer knows about the column, which even has a > pg_attribute entry). Note that we only ever access the TID column > using an insertion scankey today -- so there are several ways in which > the partition identifier really would be much more like a user column > than tid/scantid ever was. > > The TID is a key column for most purposes as of Postgres 12 (at least > internally). That didn't break all unique indexes due to the existence > of non-unique TIDs across duplicates! Insertions that must call > _bt_check_unique() can deal with the issue directly, by temporarily > unsetting scantid. > > We can easily do roughly the same thing here: be slightly creative > about how we interpret whether or not the partition identifier is > "just another key column" across each context. This is also similar to > the way the implementation is slightly creative about NULL values, > which are not equal to any other value to the user, but are > nevertheless just another value from the domain of indexed values to > the nbtree implementation. Cleverly defining the semantics of keys to > get better performance and to avoid the need for special case code is > more or less a standard technique. > > > In the whole database, that would be the OID of the table + the TID as > > supplied by the table. > > > > As such, the identifier of the logical row (which can be called the > > TID), as stored in index tuples in global indexes, would need to > > consist of the TableAM supplied TID + the (local) id of the table > > containing the tuple. > > 2 points: > > 1. Clearly you need to use the partition identifier with the TID in > order to look up the version in the table -- you need to use both > together in global indexes. But it can still work in much the same way > as it would in a standard index -- it's just that you handle that > extra detail as well. That's what I meant by additive. > > 2. If a TID points to a version of a row (or whatever you want to call > the generalized version of a HOT chain -- almost the same thing), then > of course you can always map it back to the logical row. That must > always be true. It is equally true within a global index. > > Points 1 and 2 above seem obvious to me...so I think we agree on that > much. Yes. > I just don't know how you go from here to "we need > variable-width TIDs". In all sincerity, I am confused because to me it > just seems as if you're asserting that it must be necessary to have > variable width TIDs again and again, without ever getting around to > justifying it. Or even trying to. See below. 
I'm not saying we need it _right now_, but at the very least I'd like to argue that we should not close the door on varlena TIDs, because there _are_ reasons for those TID types. See also below. > > Assuming we're in agreement on that part, I > > would think it would be consistent to put this in TID infrastructure, > > such that all indexes that use such new TID infrastructure can be > > defined to be global with only minimal effort. > > Abstract definitions can be very useful, but ultimately they're just > tools. They're seldom useful as a starting point in my experience. I > try to start with the reality on the ground, and perhaps arrive at > some kind of abstract model or idea much later. Although I agree that abstract definitions are tools, I disagree that they're seldom useful as a starting point. Many have implemented b (plus) trees based on the original paper exactly due to the guarantees that are provided by the abstract definition as defined in the paper. When trying to create completely new things I would agree that starting with the abstract is probably not the right idea, but we're not in a green field. There are many examples for storage engines / table AMs outside PostgreSQL, and I believe that we can take learnings from those for the potential extendability of PostgreSQL. > > ZHeap states that it can implement stable TIDs within limits, as IIRC > > it requires retail index deletion support for all indexes on the > > updated columns of that table. > > Whether or not that's true is not at all clear. What is true is that > the prototype version of zheap that we have as of today is notable in > that it more or less allows the moral equivalent of a HOT chain to be > arbitrarily long (or much longer, at least). To the best of my > knowledge there is nothing about retail index tuple deletion in the > design, except perhaps something vague and aspirational. I'm fairly certain that ZHeap's todo list item no. 1 details retail index tuple deletion as I understand it: "Delete marking in indexes: This will allow inplace updates even when index columns are updated and additionally with this we can avoid the need for a dedicated vacuum process to perform retail deletes." If I read this correctly, this means asking the index AM to delete the specific index tuple corresponding with the deleted item AKA retail index tuple deletion. > > I fail to see why this same > > infrastructure could not be used for supporting clustered tables, > > while enforcing these limits only soft enforced in ZHeap (that is, > > only allowing index AMs that support retail index tuple deletion). > > You're ignoring an ocean of complexity here. Principally the need to > implement something like two-phase locking (key value locking) in > indexes to make this work, but also the need to account for how > fundamentally redefining TID breaks things. To say nothing of how this > might affect crash recovery. An ocean of complexity that is (to the best of my knowledge) to be mostly delegated to the TableAM that implements these varlena TIDs. I'm not telling you that varlena TIDs are a one-size-fits-all, but I want to make clear that varlena TIDs are required for certain table AMs that would otherwise be impossible. > > > If it was very clear that there would be *some* > > > significant benefit then the costs might start to look reasonable. But > > > there isn't. "Build it and they will come" is not at all convincing to > > > me. 
> >
> > Clustered tables / Index-oriented Tables are very useful for tables of
> > which most columns are contained in the PK, or otherwise are often
> > ordered by their PK.
>
> I'm well aware of the fact that clustered index based tables are
> sometimes more useful than heap-based tables.

Then would you also agree that for persistently clustered tables to
work with any kind of efficiency, they would need to support
variable-length identifiers? Let me explain my reasoning:

If the TableAM TID contains information about its physical location,
we cannot implement persistent clustered tables, due to the
instability of that physical location of the base tuple (see also how
we cannot guarantee the physical location of an index tuple in the
index, only its logical location). That being the case, the TID must
be fully logical. I hope you agree at least up to here.

The TID must order in the same manner as the table's clustering, as
you would want to be able to find the tuple in the table, and the only
guarantee you have is that the table has an ordering. If the TID does
not have the same ordering, you cannot find the correct tuple (or
you'd need a TID -> key mapping, which would mean you'd lose a lot of
the benefits you got from the clustering).

If the TID must share an ordering with the clustering of the table,
then it should also have a similar key space as the clustering: I
could cluster a table on a text column, but no fixed-size TID can ever
hope to correctly identify the position of a given text string within
an arbitrary sorted list of text strings. You'd have to use the text
string itself to find its location in the list. As such, I believe
that the tuple identifier (that is, the identifier with which we can
locate the tuple data corresponding to it) for clustered tables /
index-oriented tables is at the very least the set of columns that the
table is clustered on.

Do note that this is all about the TID supplied by a TableAM that
would implement persistent clustering, which is how MySQL has
implemented its tables. I am unaware of a scheme that would maintain
persistent clustering/ordering of the table on key columns without
also leaking those key columns into the identifier that is used to
find the tuple in the table.

With regards,

Matthias van de Meent
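To illustrate the kind of identifier being argued for, here is a purely
hypothetical sketch -- nothing like this exists in PostgreSQL today, and
every name in it is invented. The point is only that for a table
clustered on, say, a text column, the stable "address" of a row is the
clustering key itself, so the identifier is inherently variable-length:

#include "postgres.h"

/*
 * Hypothetical variable-length row identifier for a clustered table AM.
 * The payload is the clustering key in some binary-comparable encoding,
 * so its size is bounded by the key, not by a fixed-width ItemPointerData.
 */
typedef struct VarTid
{
	uint16		vt_len;			/* length of vt_key in bytes */
	char		vt_key[FLEXIBLE_ARRAY_MEMBER];	/* clustering key, encoded so
												 * that memcmp() order matches
												 * the key order */
} VarTid;

Every secondary index tuple would then have to carry a whole VarTid in
place of today's fixed 6-byte ItemPointerData -- which is Hannu's
"indexes blow up" concern upthread, and the price InnoDB-style clustered
layouts pay for stable row identifiers.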
On Thu, May 6, 2021 at 4:10 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> See below. I'm not saying we need it _right now_, but at the very
> least I'd like to argue that we should not close the door on varlena
> TIDs, because there _are_ reasons for those TID types. See also below.

Perhaps I was a bit too strident. It's a question of trade-offs.

> Although I agree that abstract definitions are tools, I disagree that
> they're seldom useful as a starting point. Many have implemented b
> (plus) trees based on the original paper exactly due to the guarantees
> that are provided by the abstract definition as defined in the paper.

I was unclear. I agree with this much. I don't think it applies in this
situation, though. There are many competing considerations besides
that. What I'm concerned about is avoiding a strategically ruinous
direction that works against our strengths -- we should play to our
strengths.

For example, it's no coincidence that Postgres is the only system that
has B-Tree deduplication and enables it by default, with next to no
downside. This is more or less a consequence of the fact that indexes
in Postgres don't need their own locks, unlike any relational DB system
that uses a design like ARIES -- indexes are just physical data
structures that back the logical database. They are not part of the
logical database per se.

Consider a 2015 paper from Goetz Graefe about index locking, entitled
"Orthogonal key-value locking":

https://subs.emis.de/LNI/Proceedings/Proceedings241/237.pdf

Some kind of locking in indexes (something like ARIES/IM or ARIES/KVL)
is needed to make TIDs stable identifiers of logical rows -- even in a
system like Oracle (but most obviously in a system like InnoDB or SQL
Server). According to Graefe, "further improvements [to index
concurrency control] are possible despite multiple available techniques
and decades with little progress". This is why terms like "ruinously
expensive" don't feel like hyperbole when talking about pursuing TID
stability/clustered indexes/whatever -- it is a radically different
design. And maybe even a basically inferior design these days, all
things considered.

I think that the simple approach to storage taken by Postgres has aged
incredibly well -- there are many ways in which it suits modern
hardware better than traditional designs do. Modern hardware is
generally much more sensitive to the cost of any kind of
synchronization [1], but is otherwise very fast in the ways that matter
here -- rather a reversal of the environment that ARIES was created in.
While there are certainly downsides to the Postgres approach to
storage, concurrency control and recovery, these downsides can be fixed
directly -- including in the ways we've been discussing on other
threads these past few months. I think that we can push the idea of
teaching heapam (and maybe nbtree) just enough about the logical
database to not make really dumb decisions in some cases. Why not see
how far that can go first?

Graefe's perspective in the key-value locking paper is roughly the
Microsoft SQL Server perspective. It's not hard for me to imagine why
he undertook to research index locking used for concurrency control
(not to be confused with what they call latches and what we call
LWLocks) while still working for Microsoft. The TPC-E benchmark is
something that Microsoft alone targets with SQL Server (no other DB
system has an official entry on the TPC website). Pity TPC-E isn't more
successful, since it seems to have a lot more real world relevance than
TPC-C.
I bet we could learn a lot from it because it's actually realistic (TPC-C isn't that realistic, and TPC-B/pgbench is the furthest possible thing from reality). The best analysis I've been able to find about TPC-E is probably "From A to E: Analyzing TPC’s OLTP Benchmarks" [2]. It concludes that the big bottleneck for TPC-E is "logical lock contention", which is the overwhelming bottleneck at high client counts. Though I haven't run TPC-E properly myself (because of issues with bitrot), I'd speculate that this gives Postgres an edge in TPC-E. I've heard quite a few reports of Postgres having much faster writes compared to any of the big proprietary systems. Some of these reports are many years old. And yet we routinely seem to talk about our approach as if it's obviously inferior. The main weakness that Postgres storage has seems to me to be largely in the area of stability, perhaps only in theoretically rare or extreme cases that nevertheless cause real problems in the real world. Of course I cannot prove that adopting stable TIDs necessarily means accepting a similar burden from lock contention in an affected table AM, or in general. I cannot prove much of anything when we're considering such an abstract question. But perhaps this gives you a better idea about where my skepticism comes from. > When trying to create completely new things I would agree that > starting with the abstract is probably not the right idea, but we're > not in a green field. There are many examples for storage engines / > table AMs outside PostgreSQL, and I believe that we can take learnings > from those for the potential extendability of PostgreSQL. I agree, up to a point. Perhaps the reason I came up with bottom-up index deletion was that my reading of the DB research literature led me to consider the logical database in the context of index AMs. Turns out we can use a dirt cheap technique to give B-Tree indexes some basic ideas about the logical database, without having to go to the expense of making nbtree truly have authoritative information about it. A little appreciation of other designs can go a long way. We can afford to have weaknesses, because we clearly also have strengths. We only need to not have an Achilles' heel. > "Delete marking in indexes: This will allow inplace updates even when > index columns are updated and additionally with this we can avoid the > need for a dedicated vacuum process to perform retail deletes." > > If I read this correctly, this means asking the index AM to delete the > specific index tuple corresponding with the deleted item AKA retail > index tuple deletion. Dealing with this at the level of nbtree/the physical data structure is very easy. The problem is with concurrency control, especially within index AMs. This is about the logical contents of the database. Being able to think of indexes as basically just data structures is a huge luxury for us. > Then would you also agree that for persistently clustered tables to > work in any form of efficiency, they would need to support variable > length identifiers? Let me explain my reasoning: Yes, I definitely accept that. It must be true that clustered indexes (not to be confused with indirect indexes + a heap) have *some* merit -- that much has to be true. I just doubt that it's a promising area for us. The cost/risks are very high, and the benefits are probably quite low. 
We don't have to tackle the fundamental problem in order to add a
column store table AM with logical-ish TID-like identifiers -- at least
not according to Jeff. Why not consider a low cost solution like that
for each table AM?

> If the TableAM TID contains information about its physical location,
> we cannot implement persistent clustered tables, due to the
> instability of that physical location of the base tuple (see also how
> we cannot guarantee the physical location of an index tuple in the
> index, only its logical location).
> That being the case, the TID must be fully logical. I hope you agree
> at least up to here.

Absolutely. Agreed.

> If the TID must share an ordering with the clustering of the table,
> then it should also have a similar key space as the clustering: I
> could cluster a table on a text column, but no fixed-size TID can ever
> hope to correctly identify the position of a given text string within
> an arbitrary sorted list of text strings. You'd have to use the text
> string itself to find its location in the list. As such, I believe
> that the tuple identifier (that is, the identifier with which we can
> locate the tuple data corresponding to it) for clustered tables /
> index-oriented tables is at the very least the set of columns that the
> table is clustered on.

I agree. I think that looking at MySQL/InnoDB here might be very
useful. InnoDB secondary indexes use stable identifiers of the logical
row in the table -- which is just the primary key value for the
relevant row in all cases. Systems like Oracle and SQL Server can use
TIDs/RIDs with a heap-based table, which are stable in the same way
(sometimes at great cost elsewhere). All three systems are more or less
the same as each other as far as the fundamental nature of TID-like
identifiers is concerned (and rather unlike Postgres) --
TIDs/RIDs/pointers in secondary indexes are all stable identifiers of
logical rows.

Andy Pavlo has said that InnoDB/Oracle TIDs are logical identifiers,
while Postgres TIDs are physical identifiers. (I don't find that
particular terminology useful because it confuses issues elsewhere, but
I know what he meant.)

No less an authority than Mark Callaghan (well known in the MySQL
community for his work at Google and later Facebook) did a comparative
benchmark between Postgres and MySQL (InnoDB) back in January. I'm a
big fan of his analysis of LSM/storage economics stuff, so naturally I
read his report on this with great interest. The high level outcome was
that Postgres 13 was mostly faster than MySQL 8, and Postgres 11 was
mostly slower than MySQL 5.6:

https://smalldatum.blogspot.com/2021/01/sysbench-postgres-vs-mysql-and-impact.html

But the details are what really interest me. One very notable
performance gap was between MySQL 8 + InnoDB and Postgres 13 with
updates that affect an indexed column. Apparently one system could be
as much as 4x slower than the other on the sysbench index update test,
which was notably the straggler among the tests performed. Reportedly
the dreaded write amplification from updates is to blame. Just like
that tawdry Uber blog post from 2016 warned of!

But guess what? *Postgres* was the system that was faster and had
*significantly less* write amplification from updates to secondary
indexes -- to the extent that it really stood out as an *advantage*
(for this particular set of tests).
The pertinent information about write amplification with index updates is under the "Write-heavy" section: "The largest difference is for the 2nd test (update-index) that requires index maintenance." ... "Write-amp (wKB/o) is worse for MySQL especially for the update-index test". So...what do you think of that? This analysis was based on Postgres 13, not Postgres 14, so presumably we'd do better now that we have bottom-up index deletion. I don't mean to suggest that clearly we're living in opposite land, and Postgres was actually the one that did better with index updates all along -- as Callaghan says, it's one benchmark that shouldn't be overinterpreted and is mostly useful to get the right high level intuitions about how each system performs. I believe that sysbench has 2 or 3 indexes here, including the PK, and I wouldn't be surprised if the final result flipped in favor of InnoDB with more indexes -- even with bottom-up index deletion. All I'm saying is that it seems extremely premature to assume that there are very significant advantages to generalizing TID-like identifiers to be stable, be it in a clustered index table AM design like InnoDB or some other kind of table AM. Maybe there are huge advantages, *and* maybe it's even worth the enormous trouble and risk (to say nothing of the opportunity cost of not improving what already seems to work quite well most of the time). But I tend to doubt it, and I'd prefer to work on simple incremental approaches that already seem to be working. [1] https://pathelland.substack.com/p/i-am-so-glad-im-uncoordinated [2] https://www.researchgate.net/publication/262275971_From_A_to_E_Analyzing_TPC's_OLTP_Benchmarks_--_The_obsolete_the_ubiquitous_the_unexplored -- Peter Geoghegan