Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: MaxOffsetNumber for Table AMs |
| Date | |
| Msg-id | CAH2-Wzmee=VKQ7DbgZY4AcacWSoWTw0yyLdnvtn6kyH0nzg+pw@mail.gmail.com |
| In response to | Re: MaxOffsetNumber for Table AMs (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: MaxOffsetNumber for Table AMs |
| List | pgsql-hackers |
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> OK. I thought about this in regards to zheap, which has this exact
> problem, because it wants to do so-called "in place" updates where the
> new version of the row goes right on top of the old one in the table
> page, and the old version of the row gets written into the undo log.
> Just to keep things simple, we said that initially we'd only use this
> in-place update strategy when no indexed columns were changed, so that
> there's only ever one set of index entries for a given TID. In that
> model, the index AMs don't really need to care that there are actually
> multiple tuples for the same TID, because those tuples differ only in
> columns that the index doesn't care about anyway. An index scan has to
> be careful to fetch the correct version of the tuple, but it has a
> Snapshot available, so it can do that.

Right. So zheap (in the current prototype implementation) is like
heapam with its HOT optimization, except that it isn't subject to the
same limitations with regard to fitting heap tuples on the same heap
page to keep the same HOT chain going over time. You kind of have the
moral equivalent of a HOT chain that can largely live in UNDO. That
seems like a very useful thing on its own. A lot of the problems with
HOT are in this area -- maybe the vast majority, even.

A remaining problem is that we must generate a new round of index
tuples for each and every index when only one indexed column is
logically modified by an UPDATE statement. I think that this is much
less of a problem now due to bottom-up index deletion. Sure, it sucks
that we still have to dirty the page at all. But it's nevertheless
true that it all but eliminates version-driven page splits, which are
where almost all of the remaining downside is. It's very reasonable to
now wonder if this particular all-indexes problem is worth solving at
all in light of that. (Modern hardware characteristics also make a
comprehensive fix less valuable in practice.)

> However, there's no easy and
> efficient way to handle updates and deletes. Suppose for example that
> a tuple has been updated 5 times, creating versions t1..t5. t5 is now
> in the zheap page, and the other versions are in the undo. t5 points
> to t4 which points to t3 and so forth. Now an updater comes along and
> let's say that the updater's snapshot sees t2. It may be that t3..t5
> are *uncommitted* updates in which case the attempt to update t2 may
> succeed if the transactions that performed them abort, or it may be
> that the updating transactions have committed, in which case we're
> going to have to fail. But that decision isn't made by the scan that
> sees t3; it happens when the TID reaches the ModifyTable node. So what
> zheap ends up doing is finding the right tuple version during the
> scan, by making use of the snapshot, and then having to go repeat that
> work when it's time to try to perform the update. It would be nice to
> avoid this.

I believe that this is another consequence of the fact that Postgres
versions tuples, not pages. This is not a minor theoretical point.
It's very different to what Oracle does. It's almost a necessary
consequence of our basic approach to extensibility, because you can
have things like index tuples whose values are equal but visibly
distinct (e.g., the numeric value '5.0' is equal to but distinct from
'5'). It also has a lot to do with how crash recovery works.
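To make the "equal but visibly distinct" point concrete, here is a
minimal stand-alone sketch -- plain C, not PostgreSQL code, with the
value comparison simulated using strtod(): two spellings of the same
numeric value compare as equal under an opclass-style ordering, yet
their stored representations differ, so an index has to preserve
whichever spelling was originally inserted.

```c
/*
 * Stand-alone sketch (not PostgreSQL code): two datums can be "equal"
 * according to the operator class's comparison function while their
 * stored representations still differ, as with numeric '5.0' vs '5'.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Opclass-style comparison: order by numeric value, not by bytes. */
static int
numericish_cmp(const char *a, const char *b)
{
    double      da = strtod(a, NULL);
    double      db = strtod(b, NULL);

    if (da < db)
        return -1;
    if (da > db)
        return 1;
    return 0;                   /* equal as values */
}

int
main(void)
{
    const char *stored = "5.0"; /* what the index tuple happens to contain */
    const char *search = "5";   /* what a later lookup/delete passes in */

    printf("opclass equal?   %s\n",
           numericish_cmp(stored, search) == 0 ? "yes" : "no"); /* yes */
    printf("bytes identical? %s\n",
           strcmp(stored, search) == 0 ? "yes" : "no");         /* no */
    return 0;
}
```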
> If we could feed system columns from the scan through to
> the update, we could pass along an undo pointer and avoid the extra
> overhead. So it seems to me, so far anyway, that there's no very
> fundamental problem here, but there is an efficiency issue which we
> could address if we had a bit more planner and executor infrastructure
> to help us out.

FWIW you don't necessarily have to do the EPQ stuff. You could in
theory do a statement-level rollback, and repeat. The EPQ mechanism is
unique to Postgres. Maybe it doesn't matter, but I don't think that
it's essential to follow this in other table AMs.

> Now in the long run the vision for zheap was that we'd eventually want
> to do in-place updates even when indexed columns have been modified,
> and this gets a whole lot trickier, because now there can be multiple
> sets of index entries pointing at the same TID which don't agree on
> the values of the indexed columns.

It's much easier when you have a very simple type system that doesn't
allow differences like my "numeric '5.0' vs '5'" example -- a system
that is built for this from the ground up. If there are meaningful
semantic differences among opclass-equal index tuples, then we can
never assume that index tuples will always be locatable after an
update affecting indexed columns (if only because we need to preserve
the '5.0' and '5' variants in an index on a numeric column).

If we could at least be sure that two index tuples that point to the
same stable/logical zheap TID (in a world where TIDs were stable
identifiers of logical rows) were nevertheless unique, then we'd be
able to uniquely identify each index tuple during retail index tuple
deletion -- they'd still have distinct key values in the index tuple
overall. That assumption isn't workable in Postgres, though.

I'm pretty sure that there will be a bunch of tacit assumptions like
this that would shake out all over the place. You'd have to actually
pursue this design to figure out what they were, but I'm pretty sure
many more exist. In any case this one example seems sufficient to make
me doubt the whole enterprise.
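For what it's worth, the mechanism being ruled out here can be shown
in a few lines. The following is an illustrative stand-alone model
with invented types -- it is not nbtree code: if (key values, heap
TID) uniquely identified every index entry, an update or delete could
locate and kill exactly the entry it is responsible for, which is the
essence of retail index tuple deletion. The '5.0' vs '5' problem is
roughly that the deleter cannot always reproduce the key bytes it
originally inserted, so this exact-match lookup is not guaranteed to
find the entry in Postgres.

```c
/*
 * Stand-alone sketch of "retail index tuple deletion": every entry is
 * assumed to be uniquely identified by (key, heap TID), so a deleter can
 * go straight to the one entry it created rather than leaving it for
 * VACUUM to find.  All types here are invented for illustration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct IndexEntry
{
    int32_t     key;            /* single integer key column, for simplicity */
    uint32_t    tid_block;      /* heap TID: block number... */
    uint16_t    tid_offset;     /* ...and offset number */
    bool        dead;
} IndexEntry;

/* Order by key, then by heap TID as a tiebreaker, so duplicates are unique. */
static int
entry_cmp(const IndexEntry *a, const IndexEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->tid_block != b->tid_block)
        return (a->tid_block < b->tid_block) ? -1 : 1;
    if (a->tid_offset != b->tid_offset)
        return (a->tid_offset < b->tid_offset) ? -1 : 1;
    return 0;
}

/* Retail deletion: binary-search the one entry for (key, TID) and mark it. */
static bool
retail_delete(IndexEntry *entries, int nentries, IndexEntry target)
{
    int         lo = 0,
                hi = nentries - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;
        int         c = entry_cmp(&entries[mid], &target);

        if (c == 0)
        {
            entries[mid].dead = true;
            return true;
        }
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;               /* no such entry */
}

int
main(void)
{
    /* Two "duplicate" entries for key 42, distinguished only by heap TID. */
    IndexEntry  entries[] = {
        {42, 7, 1, false},
        {42, 7, 4, false},
        {43, 2, 9, false},
    };
    IndexEntry  victim = {42, 7, 4, false};

    printf("deleted: %d\n", retail_delete(entries, 3, victim));   /* 1 */
    printf("other duplicate still live: %d\n", !entries[0].dead); /* 1 */
    return 0;
}
```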
> As old row versions die off, some
> of those pointers need to be cleaned out, and others do not. I thought
> we might solve this problem by something akin to retail index
> deletion: have an update or delete on a zheap tuple go re-find the
> associated index entries and mark them for possible cleanup, and then
> vacuum can ignore all unmarked tuples. There might be some efficiency
> problems with this idea I hadn't considered, based on your remarks
> today. But regardless of the wisdom or folly of this approach, the
> broader point is that we can't assume that all heap types are going to
> have the same maintenance requirements.

No, we can't. But we had better have a generalized definition that
accounts for what variation is acceptable, and (most importantly) what
variation *isn't* acceptable.

> I think most of them are going
> to have some kind of maintenance operation that needs to or at least
> can optionally be performed from time to time, but it might be
> triggered by completely different criteria than vacuum. New table AMs
> might well choose to use 64-bit XIDs, avoiding the need for wraparound
> processing altogether. Maybe they have such good opportunistic cleanup
> mechanisms that periodic vacuum for bloat isn't even really needed.
> Maybe they bloat when updates and deletes commit but not when inserts
> and updates abort, because those cases are handled via some other
> mechanism. Who knows, really? It's hard to predict what
> not-yet-written AMs might care about, and even if we knew, it seems
> crazy to try to rewrite the way vacuum works to cater to those needs
> before we actually have some working AMs to use as a testbed.

Nothing is certain, but frankly I just don't believe that anybody is
ever going to solve this problem in Postgres. The fundamental
assumption that TIDs are not stable identifiers of logical rows (they
point to versions) is just too baked into everything. And the
downsides of that design can be fixed in a localized way.

On the other hand, Jeff and I agree about the parameters of the
discussion here. I can see myself doing work inside nbtree to
facilitate his work. But that is made a lot less likely by the general
lack of agreement about what ought to ultimately be possible.

There is a very real cost to indefinitely deferring making a hard
choice about what we can rule out for table AMs. It's *not* free. How
long should that situation be allowed to continue for? This is not a
rhetorical question -- maybe it should be timeboxed in some way. Right
now the tableam can in theory do anything. Nobody knows how, but
anything is possible!

> It strikes me that certain interesting cases might not really need
> anything very in-depth here. For example, consider indirect indexes,
> where the index references the primary key value rather than the TID.
> Well, the indirect index should probably be vacuumed periodically to
> prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
> because it doesn't contain TIDs.

Why is that okay, though? How can you get away with not having
version-based TID-like identifiers here? I would be willing to accept
an answer like "it's unclear, but it must be possible" if there was no
downside. But as I said, there is a downside.
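For concreteness, here is a rough stand-alone sketch of the two-step
lookup that an indirect index implies; all names and types are
invented for illustration. The secondary index maps its key to a
primary key value rather than to a TID, and the primary key index then
resolves that value to the current TID. Only the primary key index
ever holds TIDs, which is why the indirect index has no TIDs to
recycle.

```c
/*
 * Stand-alone sketch of an indirect-index fetch: secondary key -> PK
 * value, then PK value -> TID via the primary key index.  Invented
 * types; linear search keeps the illustration short.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t block; uint16_t offset; } Tid;

/* Primary key index: PK value -> current TID (this one does hold TIDs). */
typedef struct { int32_t pk; Tid tid; } PkIndexEntry;

/* Indirect (secondary) index: secondary key -> PK value, no TID at all. */
typedef struct { int32_t skey; int32_t pk; } IndirectEntry;

static const PkIndexEntry pk_index[] = {
    {1001, {3, 7}}, {1002, {9, 2}}, {1003, {9, 5}},
};
static const IndirectEntry indirect_index[] = {
    {55, 1002}, {70, 1001}, {70, 1003},
};

/* Fetch: find the secondary key, then resolve its PK through the PK index. */
static int
indirect_fetch(int32_t skey, Tid *out)
{
    for (size_t i = 0; i < sizeof(indirect_index) / sizeof(indirect_index[0]); i++)
    {
        if (indirect_index[i].skey != skey)
            continue;
        for (size_t j = 0; j < sizeof(pk_index) / sizeof(pk_index[0]); j++)
        {
            if (pk_index[j].pk == indirect_index[i].pk)
            {
                *out = pk_index[j].tid;
                return 1;
            }
        }
    }
    return 0;
}

int
main(void)
{
    Tid         tid;

    if (indirect_fetch(55, &tid))
        printf("skey 55 -> heap (%u,%u)\n",
               (unsigned) tid.block, (unsigned) tid.offset); /* (9,2) */
    return 0;
}
```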
> BRIN indexes, BTW, also don't contain
> TIDs.

BRIN indexes aren't suitable as indirect indexes, though.

> Either could, therefore, be optionally vacuumed after vacuum has
> done absolutely everything else, even truncate the table, or they
> could be vacuumed on a completely separate schedule that doesn't have
> anything to do with table vacuuming.

I think that the question of how TID-like identifiers work across
table AMs is fundamentally distinct from how VACUUM works. I think
that we'll probably always have something like VACUUM. That doesn't
mean that we cannot ultimately teach VACUUM to run far less often.
There is a problem with things being overly coupled inside VACUUM, so
our practical experience with VACUUM isn't necessarily a reliable
indicator of how much of a problem VACUUM is long term.

--
Peter Geoghegan