Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: MaxOffsetNumber for Table AMs
Date
Msg-id CAH2-WzmKfo58B1Jg-NOucmDZvPbFayArWrQeVd=_=wKwZvJeTw@mail.gmail.com
Whole thread Raw
In response to Re: MaxOffsetNumber for Table AMs  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: MaxOffsetNumber for Table AMs  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Apr 30, 2021 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I agree in principle, but making that work well is very hard in
> > practice because of the format of IndexTuple -- which bleeds into
> > everything. That TID is special is probably a natural consequence of
> > the fact that we don't have an offset-based format of the kind you see
> > in other DB systems -- systems that don't emphasize extensibility. We
> > cannot jump to a hypothetical TID attribute inexpensively inside code
> > like _bt_compare() because we don't have a cheap way to jump straight
> > to the datum for any attribute. So we just store TID in IndexTuple
> > directly instead. Imagine how much more expensive VACUUM would be if
> > it had to grovel through the IndexTuple format.
>
> I can't imagine that, so maybe you want to enlighten me? I see that
> there's a potential problem there, and I'm glad you pointed it out
> because I hadn't thought about it previously ... but if you always put
> the column or columns that VACUUM would need first, it's not obvious
> to me that it would be all that expensive.

Maybe. The point is that it is a problem that needs to be solved.

> Deforming the tuple to a
> sufficient degree to extract the first column, which would even be
> fixed-width, shouldn't take much work.

I think that it's reasonable to impose some cost on index AMs here,
but that needs to be bounded sensibly and unambiguously. For example,
it would probably be okay if you had either 6 byte or 8 byte TIDs, but
no other variations. You could require index AMs (the subset of index
AMs that are ever able to store 8 byte TIDs) to directly encode which
width they're dealing with at the level of each IndexTuple. That would
create some problems for nbtree deduplication, especially in boundary
cases, but ISTM that you can manage the complexity by sensibly
restricting how the TIDs work across the board. For example, the TIDs
should always work like unsigned integers -- the table AM must be
willing to work with that restriction.

You'd then have posting lists tuples in nbtree whose TIDs were all
either 6 bytes or 8 bytes wide, with a mix of each possible (though
not particularly likely) on the same leaf page. Say when you have a
table that exceeds the current MaxBlockNumber restrictions. It would
be relatively straightforward for nbtree deduplication to simply
refuse to mix 6 byte and 8 byte datums together to avoid complexity in
boundary cases. The deduplication pass logic has the flexibility that
this requires already.

> > I wonder how the same useful performance characteristics can be
> > maintained with a variable-width TID design. If you solve the problem
> > by changing IndexTuple, then you are kind of obligated to not use
> > varlena headers to keep the on-disk size manageable. Who knows where
> > it all ends?
>
> What's wrong with varlena headers? It would end up being a 1-byte
> header in practically every case, and no variable-width representation
> can do without a length word of some sort. I'm not saying varlena is
> as efficient as some new design could hypothetically be, but it
> doesn't seem like it'd be a big enough problem to stress about. If you
> used a variable-width representation for integers, you might actually
> save bytes in a lot of cases. An awful lot of the TIDs people store in
> practice probably contain several zero bytes, and if we make them
> wider, that's going to be even more true.

Maybe all of this is true, and maybe it works out to be the best path
forward in the long term, all things considered. But whether or not
that's true is crucially dependent on what real practical table AMs
(of which there will only ever be a tiny number) actually need to do.
Why should we assume that the table AM cannot accept some
restrictions? What good does it do to legalistically define the
problem as a problem for index AMs to solve?

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: MaxOffsetNumber for Table AMs
Next
From: Robert Haas
Date:
Subject: Re: pg_amcheck contrib application