From: Robert Haas
Subject: Re: MaxOffsetNumber for Table AMs
Msg-id: CA+Tgmob0=38ALOwWEvwbwKkGyRp0+P0k288r1DmNwqXpvVpXjA@mail.gmail.com
In response to: Re: MaxOffsetNumber for Table AMs (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On Fri, Apr 30, 2021 at 3:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Is the problem you're worried about here that, with something like an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? And that the
> > interfaces aren't well-suited to that? Because that's a problem I have
> > thought about and can comment on, even though I think the question of
> > having multiple versions with the same TID is distinguishable from the
> > question of how *wide* TIDs should be. But maybe that's not what you
> > are talking about here, in which case I guess I need a clearer
> > explanation of the concern.
>
> That's what I'm talking about. I'd like to hear what you think about it.

OK. I thought about this with regard to zheap, which has this exact
problem, because it wants to do so-called "in place" updates where the
new version of the row goes right on top of the old one in the table
page, and the old version of the row gets written into the undo log.
Just to keep things simple, we said that initially we'd only use this
in-place update strategy when no indexed columns were changed, so that
there's only ever one set of index entries for a given TID. In that
model, the index AMs don't really need to care that there are actually
multiple tuples for the same TID, because those tuples differ only in
columns that the index doesn't care about anyway. An index scan has to
be careful to fetch the correct version of the tuple, but it has a
Snapshot available, so it can do that. However, there's no easy and
efficient way to handle updates and deletes. Suppose for example that
a tuple has been updated 5 times, creating versions t1..t5. t5 is now
in the zheap page, and the other versions are in the undo. t5 points
to t4 which points to t3 and so forth. Now an updater comes along and
let's say that the updater's snapshot sees t2. It may be that t3..t5
are *uncommitted* updates, in which case the attempt to update t2 may
succeed if the transactions that performed them abort, or it may be
that the updating transactions have committed, in which case we're
going to have to fail. But that decision isn't made by the scan that
sees t2; it happens when the TID reaches the ModifyTable node. So what
zheap ends up doing is finding the right tuple version during the
scan, by making use of the snapshot, and then having to go repeat that
work when it's time to try to perform the update. It would be nice to
avoid this. If we could feed system columns from the scan through to
the update, we could pass along an undo pointer and avoid the extra
overhead. So it seems to me, so far anyway, that there's no very
fundamental problem here, but there is an efficiency issue which we
could address if we had a bit more planner and executor infrastructure
to help us out.
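
To make the double lookup concrete, here's a toy sketch of the
version-chain walk. None of these types or names are the real zheap
code; they're just meant to show the shape of the work the scan does,
which ModifyTable then has to repeat because only a TID travels between
them:

#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TransactionId;

/* One row version; the newest lives in the zheap page, older ones in undo. */
typedef struct TupleVersion
{
    TransactionId xmin;         /* xid that created this version */
    bool        committed;      /* whether that xid committed */
    struct TupleVersion *older; /* next link in the undo chain, or NULL */
} TupleVersion;

/* Grossly simplified snapshot: everything from xmax onward is invisible. */
typedef struct Snapshot
{
    TransactionId xmax;
} Snapshot;

/*
 * Walk backward from the in-page version (t5 in the example above) until
 * we reach a version the snapshot can see (t2, say).  If the scan could
 * hand this pointer - effectively an undo pointer - through to the update,
 * the second walk would be unnecessary.
 */
static TupleVersion *
find_visible_version(TupleVersion *newest, Snapshot *snap)
{
    TupleVersion *v;

    for (v = newest; v != NULL; v = v->older)
    {
        if (v->committed && v->xmin < snap->xmax)
            return v;
    }
    return NULL;                /* nothing visible to this snapshot */
}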

Now in the long run the vision for zheap was that we'd eventually want
to do in-place updates even when indexed columns have been modified,
and this gets a whole lot trickier, because now there can be multiple
sets of index entries pointing at the same TID which don't agree on
the values of the indexed columns. As old row versions die off, some
of those pointers need to be cleaned out, and others do not. I thought
we might solve this problem by something akin to retail index
deletion: have an update or delete on a zheap tuple go re-find the
associated index entries and mark them for possible cleanup, and then
vacuum can ignore all unmarked tuples. There might be some efficiency
problems with this idea I hadn't considered, based on your remarks
today. But regardless of the wisdom or folly of this approach, the
broader point is that we can't assume that all heap types are going to
have the same maintenance requirements. I think most of them are going
to have some kind of maintenance operation that need to or at least
can optionally be performed from time to time, but it might be
triggered by completely different criteria than vacuum. New table AMs
might well choose to use 64-bit XIDs, avoiding the need for wraparound
processing altogether. Maybe they have such good opportunistic cleanup
mechanisms that periodic vacuum for bloat isn't even really needed.
Maybe they bloat when updates and deletes commit but not when inserts
and updates abort, because those cases are handled via some other
mechanism. Who knows, really? It's hard to predict what
not-yet-written AMs might care about, and even if we knew, it seems
crazy to try to rewrite the way vacuum works to cater to those needs
before we actually have some working AMs to use as a testbed.
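
Just to spell out the marking idea in miniature (invented structures,
not real index AM code): an update or delete that retires a row version
would flag the matching index entries, and a later cleanup pass would
only need to look at the flagged ones:

#include <stdbool.h>
#include <stddef.h>

typedef struct IndexEntry
{
    int         key;            /* indexed column value */
    unsigned    tid;            /* table row this entry points at */
    bool        maybe_dead;     /* set by update/delete, checked by cleanup */
} IndexEntry;

/*
 * Called when an update or delete retires the row version at 'tid' whose
 * old indexed value was 'oldkey': flag the matching entries as candidates.
 */
static void
mark_entries_for_cleanup(IndexEntry *entries, size_t n,
                         int oldkey, unsigned tid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].key == oldkey && entries[i].tid == tid)
            entries[i].maybe_dead = true;
    }
}

/*
 * Cleanup can skip every unflagged entry outright; only flagged entries
 * need to be re-checked against the table.  (In this sketch that check is
 * assumed to have already said "dead", so flagged entries are dropped.)
 */
static size_t
cleanup_pass(IndexEntry *entries, size_t n)
{
    size_t      kept = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!entries[i].maybe_dead)
            entries[kept++] = entries[i];
    }
    return kept;
}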

It strikes me that certain interesting cases might not really need
anything very in-depth here. For example, consider indirect indexes,
where the index references the primary key value rather than the TID.
Well, the indirect index should probably be vacuumed periodically to
prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
because it doesn't contain TIDs. BRIN indexes, BTW, also don't contain
TIDs. Either could, therefore, optionally be vacuumed after vacuum has
done absolutely everything else, even truncated the table, or it
could be vacuumed on a completely separate schedule that doesn't have
anything to do with table vacuuming. I suppose we'd have to come up
with some solution, but I don't think it would need to be fully
general; it could just be good enough for that particular feature,
since fully general seems rather impossible anyway. So I feel like
it's pretty fair to just defer this question. Without some solution
you can't entirely finish a project like indirect indexes, but without
variable-width index payloads you can't even start it.
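
For clarity, the two-step lookup I have in mind for an indirect index,
again with made-up names: the index maps a secondary key to a primary
key value, and only the primary key index knows about TIDs, which is why
recycling TIDs never touches the indirect index at all:

#include <stddef.h>

typedef unsigned long Tid;

typedef struct { int skey; int pk; }  IndirectEntry;  /* secondary key -> PK */
typedef struct { int pk;   Tid tid; } PkEntry;        /* PK -> heap TID */

/*
 * Step one: find the primary key via the indirect index.  Step two:
 * resolve the PK to a TID through the primary key index.  If the row
 * moves and gets a new TID, only the PK index changes; the indirect
 * index is untouched.
 */
static Tid
fetch_via_indirect_index(const IndirectEntry *ind, size_t nind,
                         const PkEntry *pkidx, size_t npk,
                         int secondary_key)
{
    for (size_t i = 0; i < nind; i++)
    {
        if (ind[i].skey != secondary_key)
            continue;

        for (size_t j = 0; j < npk; j++)
        {
            if (pkidx[j].pk == ind[i].pk)
                return pkidx[j].tid;
        }
    }
    return 0;                   /* not found */
}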

-- 
Robert Haas
EDB: http://www.enterprisedb.com


