Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: MaxOffsetNumber for Table AMs
Date
Msg-id CA+Tgmoarb24ZJUJSMZ_3jOAxL-XCvV-tjNF2+vfBGnV0jvKXPA@mail.gmail.com
In response to Re: MaxOffsetNumber for Table AMs  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Apr 30, 2021 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs.  We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID.  A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID).  Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

16 bits is not much for a partition identifier. We've already had
complaints about INNER_VAR being too small, so apparently there are
people who want to use really large numbers of partitions. But even if
we imagine a hypothetical world where nobody uses more than a couple
thousand partitions at once, it's very reasonable to want to avoid
recycling partition identifiers so that detaching a partition can be
O(1), and there's no way that's going to be viable if the whole
address space is only 16 bits, because with time series data people
are going to be continually creating new partitions and dropping old
ones. I would guess that it probably is viable with 32 bits, but we'd
have to have a mapping layer rather than using the OID directly to
avoid wraparound collisions.
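
To put rough numbers on the exhaustion argument above, here is a minimal sketch. The function name and the rate of 64 new partitions per day are illustrative assumptions, not figures from this thread; the only point is how quickly a never-recycled identifier space runs out at a steady creation rate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of days until an identifier space of
 * `bits` bits is used up, assuming identifiers are never recycled and
 * new partitions arrive at a steady (assumed) daily rate.
 */
static double
id_space_lifetime_days(unsigned bits, double new_partitions_per_day)
{
    assert(bits < 64);                      /* the shift below needs bits < 64 */
    assert(new_partitions_per_day > 0.0);

    double ids = (double) (UINT64_C(1) << bits);

    return ids / new_partitions_per_day;
}
```

Under that assumed rate, `id_space_lifetime_days(16, 64.0)` is 1024 days, a bit under three years, while `id_space_lifetime_days(32, 64.0)` is 67108864 days. The real rate depends entirely on the workload.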

Now this problem can be avoided by just requiring the AM to store more
bits, exactly as you say. I suspect 96 bits is large enough for all of
the practical use cases people have, or at least within spitting
distance. But it strikes me as relatively inefficient to say that
we're always going to store 96 bits for every TID. I certainly don't
think we want to break on-disk compatibility and widen every existing
btree index by changing all the 6-byte TIDs they're storing now into
12-byte TIDs that are at least half zero bytes, so I think
we're bound to end up with at least two options: 6 and 12. But
variable-width would be a lot nicer. You could store small TIDs and
small partition identifiers very compactly, and only use the full
number of bytes when the situation demands it.
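
As an illustration of the variable-width idea, here is a minimal sketch using a base-128 varint for each field. Everything in it (the `WideTid` struct, the function names, the choice of varints) is an assumption made for the example; it is not a proposed on-disk format and not how any existing index AM stores TIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wide TID: a partition identifier plus a 64-bit row locator. */
typedef struct WideTid
{
    uint32_t    partition_id;
    uint64_t    tid;
} WideTid;

/* Write v as a base-128 varint (7 data bits per byte); returns 1..10. */
static size_t
varint_encode(uint64_t v, uint8_t *out)
{
    size_t      n = 0;

    while (v >= 0x80)
    {
        out[n++] = (uint8_t) (v | 0x80);    /* high bit: more bytes follow */
        v >>= 7;
    }
    out[n++] = (uint8_t) v;
    return n;
}

/* Read one varint from in; returns the number of bytes consumed. */
static size_t
varint_decode(const uint8_t *in, uint64_t *v)
{
    size_t      n = 0;
    int         shift = 0;
    uint64_t    result = 0;

    for (;;)
    {
        uint8_t     b = in[n++];

        result |= (uint64_t) (b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
        assert(shift < 64);                 /* malformed input guard */
    }
    *v = result;
    return n;
}

/* Encode both fields back to back; out needs room for up to 15 bytes. */
static size_t
wide_tid_encode(WideTid t, uint8_t *out)
{
    size_t      n = varint_encode(t.partition_id, out);

    return n + varint_encode(t.tid, out + n);
}

/* Decode a WideTid written by wide_tid_encode; returns bytes consumed. */
static size_t
wide_tid_decode(const uint8_t *in, WideTid *t)
{
    uint64_t    part;
    size_t      n = varint_decode(in, &part);

    t->partition_id = (uint32_t) part;
    return n + varint_decode(in + n, &t->tid);
}
```

With this scheme, partition 3 and TID 1000000 take 4 bytes instead of a fixed 12. The trade-off is that the worst case (all bits set in both fields) takes 15 bytes, more than the fixed 96-bit layout.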

> > What problem do you think this proposal does solve?
>
> Accommodating table AMs that want more than 48 bits for a TID.
> We're already starting to run up against the fact that that's not
> enough bits for plausible use-cases.  64 bits may someday in the far
> future not be enough either, but I think that's a very long way off.

Do people actually want to store more than 2^48 rows in a table, or is
this more about the division of a TID into a block number and an item
number?
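
For context on that question, a small sketch of the arithmetic. A 6-byte heap TID is a 32-bit block number plus a 16-bit offset number. The helper names below are made up for the example, and the figure of 291 tuples per page is the heap's limit with the default 8kB block size; other table AMs may pack rows very differently.

```c
#include <stdint.h>

/* Pack a 32-bit block number and a 16-bit offset into one 48-bit value. */
static uint64_t
tid_pack(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << 16) | offset;
}

/*
 * Hypothetical helper: rows addressable when only max_offsets_per_block
 * of the offset values can actually be used on each block.
 */
static uint64_t
addressable_rows(uint64_t blocks, uint64_t max_offsets_per_block)
{
    return blocks * max_offsets_per_block;
}
```

`addressable_rows(UINT64_C(1) << 32, 291)` is 1249835483136, roughly 2^40, against 2^48 = 281474976710656 if every offset value were usable. So a table AM can run out of room in the item-number half long before anyone has 2^48 rows.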

-- 
Robert Haas
EDB: http://www.enterprisedb.com


