
From: Peter Geoghegan
Subject: Re: MaxOffsetNumber for Table AMs
Date:
Msg-id: CAH2-Wzmee=VKQ7DbgZY4AcacWSoWTw0yyLdnvtn6kyH0nzg+pw@mail.gmail.com
In response to: Re: MaxOffsetNumber for Table AMs (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> OK. I thought about this in regards to zheap, which has this exact
> problem, because it wants to do so-called "in place" updates where the
> new version of the row goes right on top of the old one in the table
> page, and the old version of the row gets written into the undo log.
> Just to keep things simple, we said that initially we'd only use this
> in-place update strategy when no indexed columns were changed, so that
> there's only ever one set of index entries for a given TID. In that
> model, the index AMs don't really need to care that there are actually
> multiple tuples for the same TID, because those tuples differ only in
> columns that the index doesn't care about anyway. An index scan has to
> be careful to fetch the correct version of the tuple, but it has a
> Snapshot available, so it can do that.

Right. So zheap (in the current prototype implementation) is like
heapam with its HOT optimization, except that it isn't subject to the
same limitations with regard to fitting heap tuples on the same heap
page to keep the same HOT chain going over time. You kind of have the
moral equivalent of a HOT chain that can largely live in UNDO. That
seems like a very useful thing on its own. A lot of the problems with
HOT are in this area -- maybe the vast majority, even.
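
To make the shape of that concrete, here's a loose sketch (the struct
is hypothetical, not actual zheap code):

#include "postgres.h"

typedef uint64 UndoRecPtr;          /* location of an undo record */

/*
 * Hypothetical sketch of a version chain that mostly lives in undo:
 * the heap page holds only the newest version, and each version
 * points into undo at its predecessor, so chain length is no longer
 * limited by free space on the heap page.
 */
typedef struct ZheapChainSketch
{
    TransactionId xid;              /* xact that wrote this version */
    UndoRecPtr    prev_version;     /* undo location of older version */
    /* ... tuple payload follows ... */
} ZheapChainSketch;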

A remaining problem is that we must generate a new round of index
tuples for each and every index when only one indexed column is
logically modified by an UPDATE statement. I think that this is much
less of a problem now due to bottom-up index deletion. Sure, it sucks
that we still have to dirty the page at all. But it's nevertheless
true that it all but eliminates version-driven page splits, which are
where almost all of the remaining downside is. It's very reasonable to
now wonder if this particular all-indexes problem is worth solving at
all in light of that. (Modern hardware characteristics also make a
comprehensive fix less valuable in practice.)
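
For clarity, the test involved is essentially this (the sketch uses
the real Bitmapset API, though the function itself is made up --
heapam does an equivalent bms_overlap() check in heap_update()):

#include "postgres.h"
#include "nodes/bitmapset.h"

/*
 * Sketch: an UPDATE can only avoid generating new index entries when
 * none of the modified columns are covered by any index.
 */
static bool
update_needs_new_index_entries(Bitmapset *indexed_attrs,
                               Bitmapset *modified_attrs)
{
    /* any overlap forces a new round of entries in every index */
    return bms_overlap(indexed_attrs, modified_attrs);
}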

> However, there's no easy and
> efficient way to handle updates and deletes. Suppose for example that
> a tuple has been updated 5 times, creating versions t1..t5. t5 is now
> in the zheap page, and the other versions are in the undo. t5 points
> to t4 which points to t3 and so forth. Now an updater comes along and
> let's say that the updater's snapshot sees t2. It may be that t3..t5
> are *uncommitted* updates in which case the attempt to update t2 may
> succeed if the transaction that performed them aborts, or it may be
> that the updating transactions have committed, in which case we're
> going to have to fail. But that decision isn't made by the scan that
> sees t2; it happens when the TID reaches the ModifyTable node. So what
> zheap ends up doing is finding the right tuple version during the
> scan, by making use of the snapshot, and then having to go repeat that
> work when it's time to try to perform the update. It would be nice to
> avoid this.

I believe that this is another consequence of the fact that Postgres
versions tuples, not pages. This is not a minor theoretical point.
It's very different to what Oracle does. It's almost a necessary
consequence of our basic approach to extensibility, because you can
have things like index tuples whose values are equal but visibly
distinct (e.g., the numeric value '5.0' is equal to but distinct from
'5'). It also has a lot to do with how crash recovery works.
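
Core already has the vocabulary for this distinction, incidentally.
A sketch (the wrapper function is made up; numeric_eq() and
datum_image_eq() are the real routines):

#include "postgres.h"
#include "fmgr.h"
#include "utils/datum.h"
#include "utils/fmgrprotos.h"

/* assume d1 holds numeric '5.0' and d2 holds numeric '5' */
static void
equal_but_distinct(Datum d1, Datum d2)
{
    /* opclass equality says they are the same value */
    bool opclass_equal = DatumGetBool(DirectFunctionCall2(numeric_eq,
                                                          d1, d2));

    /*
     * ...but their stored representations differ (display scale), so
     * binary "image" equality says otherwise
     */
    bool image_equal = datum_image_eq(d1, d2, false, -1);

    Assert(opclass_equal && !image_equal);
}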

> If we could feed system columns from the scan through to
> the update, we could pass along an undo pointer and avoid the extra
> overhead. So it seems to me, so far anyway, that there's no very
> fundamental problem here, but there is an efficiency issue which we
> could address if we had a bit more planner and executor infrastructure
> to help us out.

FWIW you don't necessarily have to do the EPQ (EvalPlanQual) stuff.
You could in theory do a statement-level rollback, and repeat. The
EPQ mechanism is unique to Postgres. Maybe it doesn't matter, but I
don't think that it's essential for other table AMs to follow the
same approach.
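
Roughly something like this (the executor entry point is imagined;
the subtransaction calls are the real ones that PL/pgSQL exception
blocks use):

#include "postgres.h"
#include "access/xact.h"

extern bool execute_statement_once(void);   /* imagined */

/*
 * Sketch: instead of EPQ-rechecking just the latest row version,
 * roll the whole statement back on a concurrent-update conflict and
 * re-run it (a fresh snapshot would be taken before each retry;
 * snapshot handling is elided here).
 */
static void
execute_with_statement_retry(void)
{
    for (;;)
    {
        BeginInternalSubTransaction(NULL);

        if (execute_statement_once())
        {
            ReleaseCurrentSubTransaction();         /* keep the work */
            break;
        }

        RollbackAndReleaseCurrentSubTransaction();  /* throw it away */
    }
}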

> Now in the long run the vision for zheap was that we'd eventually want
> to do in-place updates even when indexed columns have been modified,
> and this gets a whole lot trickier, because now there can be multiple
> sets of index entries pointing at the same TID which don't agree on
> the values of the indexed columns.

It's much easier when you have a very simple type system that doesn't
allow differences like my "numeric '5.0' vs '5'" example -- a system
that is built for this from the ground up. If there are meaningful
semantic differences among opclass-equal index tuples, then we can
never assume that index tuples will always be locatable after an
update affecting indexed columns (if only because we need to preserve
the '5.0' and '5' variants in an index on a numeric column).

If we could at least be sure that two index tuples that point to the
same stable/logical zheap TID (in a world where TIDs were stable
identifiers of logical rows) were nevertheless unique, then we'd be
able to uniquely identify each index tuple during retail index tuple
deletion -- they'd still have distinct key values in the index tuple
overall. That assumption isn't workable in Postgres, though.
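
That's the property that nbtree actually relies on since Postgres 12:
the heap TID works as an explicit final tiebreaker key column, so
every index tuple is uniquely locatable. A sketch of the idea (the
helper is illustrative; ItemPointerCompare() is the real routine):

#include "postgres.h"
#include "storage/itemptr.h"

static int
index_tuple_tiebreak(int keycmp, ItemPointer tid1, ItemPointer tid2)
{
    if (keycmp != 0)
        return keycmp;          /* user key columns decide first */

    /*
     * Equal keys: the heap TID distinguishes the tuples -- but only
     * because each TID names one version. With stable logical-row
     * TIDs, two versions' index tuples could collide here.
     */
    return ItemPointerCompare(tid1, tid2);
}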

I'm pretty sure that a bunch of tacit assumptions like this one would
shake out all over the place. You'd have to actually pursue this
design to figure out what they all were, but many more surely exist.
In any case, this one example seems sufficient to make me doubt the
whole enterprise.

> As old row versions die off, some
> of those pointers need to be cleaned out, and others do not. I thought
> we might solve this problem by something akin to retail index
> deletion: have an update or delete on a zheap tuple go re-find the
> associated index entries and mark them for possible cleanup, and then
> vacuum can ignore all unmarked tuples. There might be some efficiency
> problems with this idea I hadn't considered, based on your remarks
> today. But regardless of the wisdom or folly of this approach, the
> broader point is that we can't assume that all heap types are going to
> have the same maintenance requirements.

No, we can't. But we had better have a generalized definition that
accounts for what variation is acceptable, and (most importantly) what
variation *isn't* acceptable.

> I think most of them are going
> to have some kind of maintenance operation that needs to or at least
> can optionally be performed from time to time, but it might be
> triggered by completely different criteria than vacuum. New table AMs
> might well choose to use 64-bit XIDs, avoiding the need for wraparound
> processing altogether. Maybe they have such good opportunistic cleanup
> mechanisms that periodic vacuum for bloat isn't even really needed.
> Maybe they bloat when updates and deletes commit but not when inserts
> and updates abort, because those cases are handled via some other
> mechanism. Who knows, really? It's hard to predict what
> not-yet-written AMs might care about, and even if we knew, it seems
> crazy to try to rewrite the way vacuum works to cater to those needs
> before we actually have some working AMs to use as a testbed.

Nothing is certain, but frankly I just don't believe that anybody is
ever going to solve this problem in Postgres. The fundamental
assumption that TIDs are not stable identifiers of logical rows (they
point to versions) is just too baked into everything. And the
downsides of that design can be fixed in a localized way.

On the other hand, Jeff and I agree about the parameters of the
discussion here. I can see myself doing work inside nbtree to
facilitate his work. But that is made a lot less likely by the general
lack of agreement about what ought to ultimately be possible. There is
a very real cost to indefinitely deferring making a hard choice about
what we can rule out for table AMs. It's *not* free.

How long should that situation be allowed to continue? This is not
a rhetorical question -- maybe it should be timeboxed in some way.
Right now the tableam interface can in theory do anything. Nobody
knows how, but anything is possible!

> It strikes me that certain interesting cases might not really need
> anything very in-depth here. For example, consider indirect indexes,
> where the index references the primary key value rather than the TID.
> Well, the indirect index should probably be vacuumed periodically to
> prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
> because it doesn't contain TIDs.

Why is that okay, though? How can you get away with not having
version-based TID-like identifiers here? I would be willing to accept
an answer like "it's unclear, but it must be possible" if there were no
downside. But as I said, there is a downside.
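
Just to spell out the shape in question (all names here are
hypothetical):

#include "postgres.h"
#include "storage/itemptr.h"

/* imagined: descend the primary-key index to translate PK -> TID */
extern bool pk_index_lookup(Datum pk_value, ItemPointer tid_out);

/*
 * Sketch: the leaf tuple carries the primary-key value instead of a
 * TID, so reaching the row always costs a second index descent --
 * and nothing here plays the TID's role as a cheap, stable locator
 * for a particular version.
 */
typedef struct IndirectEntrySketch
{
    /* secondary key columns ... */
    Datum       pk_value;       /* primary key, not a TID */
} IndirectEntrySketch;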

> BRIN indexes, BTW, also don't contain
> TIDs.

BRIN indexes aren't suitable as indirect indexes, though.

> Either could, therefore, be optionally vacuumed after vacuum has
> done absolutely everything else, even truncate the table, or they
> could be vacuumed on a completely separate schedule that doesn't have
> anything to do with table vacuuming.

I think that the question of how TID-like identifiers work across
table AMs is fundamentally distinct from how VACUUM works. I think
that we'll probably always have something like VACUUM. That doesn't
mean that we cannot ultimately teach VACUUM to run far less often.
There is a problem with things being overly coupled inside VACUUM, so
our practical experience with VACUUM isn't necessarily a reliable
indicator of how much of a problem VACUUM is long term.

-- 
Peter Geoghegan


