Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem) - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
Msg-id aTn4aS32O4lmaPnj@paquier.xyz
In response to Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Dec 08, 2025 at 04:19:54PM -0500, Robert Haas wrote:
> I took a brief look at this today, looking at parts of v8-0005 and
> v8-0015.

Thanks for the input!

> Although I don't dislike the idea of an abstraction layer in
> concept, it's unclear to me how much this particular abstraction layer
> is really buying you. It's basically abstracting away two things: the
> difference between the current varatt_external and
> varatt_external_oid8, and the unbundling of compression_method from
> extsize. That's not a lot. This thread has a bunch of ideas already on
> other things that people want to do to the TOAST system or think you
> should be doing to the TOAST system, and while I'm -1 on letting those
> discussions hijack the thread, it's worth paying attention to whether
> the abstraction layer makes any of them easier. As far as I can see,
> it doesn't. I suspect, for example, that direct TID access to toast
> values is a much worse idea than people here seem to be supposing, but
> whether that's true or false, toast_external_data doesn't help anyone
> who is trying to implement it.
> I don't think it even really helps with zstd compression, even
> though that's going to need a ToastCompressionId, so I kind of
> find myself wondering what the point is.

The idea for supporting zstd compression with this set of APIs
would be to allocate a new vartag_external, and perhaps just allow an
8-byte OID in this case to keep the code simple.  It is true that I
have not attempted to rework the code behind the compression and
decompression of slices, but I could not see a direct reason to do
that, as it also relates to the in-memory representation of
compressed datums.
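For context on why zstd and the unbundling of compression_method from extsize come up together: in the current in-tree varatt_external, va_extinfo packs the external size into the low 30 bits (VARLENA_EXTSIZE_BITS) and the compression method into the top 2 bits, while ToastCompressionId already uses 0 (pglz), 1 (lz4) and 2 (invalid). A minimal sketch of that packing follows. The helper names pack_extinfo(), extinfo_get_extsize(), extinfo_get_method() and max_method_ids() are hypothetical stand-ins for the upstream VARATT_EXTERNAL_GET_EXTSIZE and VARATT_EXTERNAL_GET_COMPRESS_METHOD macros, not code from the patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the in-tree layout: 30 bits of size, 2 bits of method. */
#define VARLENA_EXTSIZE_BITS 30
#define VARLENA_EXTSIZE_MASK ((1U << VARLENA_EXTSIZE_BITS) - 1)

/* Existing method IDs; a zstd ID would need a new value. */
typedef enum ToastCompressionId
{
    TOAST_PGLZ_COMPRESSION_ID = 0,
    TOAST_LZ4_COMPRESSION_ID = 1,
    TOAST_INVALID_COMPRESSION_ID = 2
} ToastCompressionId;

/* Hypothetical helper: pack size and method into one 32-bit va_extinfo. */
static uint32_t
pack_extinfo(uint32_t extsize, uint32_t cmid)
{
    assert(extsize <= VARLENA_EXTSIZE_MASK);                /* fits in 30 bits */
    assert(cmid < (1U << (32 - VARLENA_EXTSIZE_BITS)));     /* fits in 2 bits */
    return extsize | (cmid << VARLENA_EXTSIZE_BITS);
}

/* Hypothetical helper: low 30 bits hold the external (stored) size. */
static uint32_t
extinfo_get_extsize(uint32_t extinfo)
{
    return extinfo & VARLENA_EXTSIZE_MASK;
}

/* Hypothetical helper: top 2 bits hold the compression method. */
static uint32_t
extinfo_get_method(uint32_t extinfo)
{
    return extinfo >> VARLENA_EXTSIZE_BITS;
}

/* How many distinct method IDs the spare bits can encode at all. */
static uint32_t
max_method_ids(void)
{
    return 1U << (32 - VARLENA_EXTSIZE_BITS);
}
```

With only four encodable values and three already assigned, a new method such as zstd would take the last free code point, which is why the compression ID tends to come up whenever the on-disk pointer format is reworked.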

> One alternative idea that I had is just provide a way to turn a 4-byte
> TOAST pointer into an 8-byte TOAST pointer. If you inserted that
> surgically at certain places, maybe you could make 4-byte TOAST
> pointers invisible to certain parts of the system. But I'm also not
> sure that's any better than what you call the "brutal" approach, which
> I'm actually not sure is that brutal. I mean, how bad is it if we just
> deal with one more possibility at each place where we currently deal
> with VARTAG_ONDISK? To be clear, I am not trying to say "absolutely
> don't do this abstraction layer".

To be honest, I don't think the brutal method is that bad.  For the
in-tree places, my patch touches all the areas that matter, so I have
identified what needs to be touched.  Also, please note that the
approach used in this patch is based on a remark made over the last
few years (perhaps by you, actually; I'd need to find the
reference), where folks worried that introducing a new
vartag_external could be a deadly trap if we don't patch all the
areas that need to handle external on-disk datums.
The brutal method was actually the first thing I tried, and I noticed
that I was refactoring all these areas of the code the same way,
which led to the patch attached at the end; that refactoring would be
a win if we add more vartags in the future.  Anyway, as things stand
today on this thread, I have not been able to get a single +1 for
this design.  I mean, that's fine; it still counts as a consensus,
and that's how development happens.
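To make the two options concrete: Robert's alternative is a single widening step at a few choke points, so that downstream code only sees the 8-byte form, whereas the "brutal" approach adds one more case next to VARTAG_ONDISK at every site that handles on-disk pointers. The sketch below shows the widening step as a switch over the tag. The VARTAG_* values 1, 2, 3 and 18 and the varatt_external fields match upstream; the layout of varatt_external_oid8, the tag value VARTAG_ONDISK_OID8 = 19 and the function toast_pointer_widen() are hypothetical and are not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;

/* Upstream tag values, plus one hypothetical tag for 8-byte value IDs. */
typedef enum vartag_external
{
    VARTAG_INDIRECT = 1,
    VARTAG_EXPANDED_RO = 2,
    VARTAG_EXPANDED_RW = 3,
    VARTAG_ONDISK = 18,
    VARTAG_ONDISK_OID8 = 19     /* hypothetical */
} vartag_external;

/* Same fields as the upstream varatt_external. */
typedef struct varatt_external
{
    int32_t     va_rawsize;     /* original data size, including header */
    uint32_t    va_extinfo;     /* external size and compression method */
    Oid         va_valueid;     /* 4-byte value ID within the TOAST table */
    Oid         va_toastrelid;  /* OID of the TOAST table */
} varatt_external;

/* Hypothetical layout of the 8-byte-value-ID variant. */
typedef struct varatt_external_oid8
{
    int32_t     va_rawsize;
    uint32_t    va_extinfo;
    uint64_t    va_valueid;     /* 8-byte value ID */
    Oid         va_toastrelid;
} varatt_external_oid8;

/*
 * Hypothetical: normalize any on-disk pointer payload to the 8-byte form.
 * Returns 0 on success, -1 for tags that are not on-disk pointers.
 * Assumes the OID8 payload was written as a plain struct copy.
 */
static int
toast_pointer_widen(vartag_external tag, const void *payload,
                    varatt_external_oid8 *out)
{
    switch (tag)
    {
        case VARTAG_ONDISK:
            {
                varatt_external narrow;

                /* On-disk pointers may be unaligned: copy before reading. */
                memcpy(&narrow, payload, sizeof(narrow));
                out->va_rawsize = narrow.va_rawsize;
                out->va_extinfo = narrow.va_extinfo;
                out->va_valueid = narrow.va_valueid;   /* zero-extended */
                out->va_toastrelid = narrow.va_toastrelid;
                return 0;
            }
        case VARTAG_ONDISK_OID8:
            memcpy(out, payload, sizeof(*out));
            return 0;
        default:
            /* Indirect and expanded pointers exist only in memory. */
            return -1;
    }
}
```

Under the brutal approach, the same two cases would instead be spelled out at each caller, which is simpler to read locally but, as noted in the thread, easy to miss at one site when yet another tag is added.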

> However, imagine a future where we
> have 10 vartags that can appear on disk: the current VARTAG_ONDISK and
> 9 others. If, when that time comes, your abstraction layer handles 5
> or 6 or more of those, I'd call that a win. If the only two it ever
> handles are OID and OID8, I'd say that's a loss: it's not really an
> abstraction layer at all at that point. In that situation you'd rather
> just admit that what you're really trying to do is smooth over the
> distinction between OID and OID8 and keep any other goals (including
> unbundling the compression ID, IMHO) separate.

I won't deny this argument.  As far as the core backend engine goes,
most of the complaints I hear from users regarding TOAST are about
the 4-byte OID limitation on TOAST values.  The addition of zstd
comes second, but the 32-bit problem still takes priority because
backends suddenly get stuck.

> I am somewhat bothered by the idea of just bluntly changing everything
> over to 8-byte TOAST OIDs.

I agree that forcing that everywhere is a bad idea, and I am not
suggesting it; nor does the patch set enforce it.  The first
design of the patch made 8-byte enforcement possible through a
GUC, with the default being 4 bytes.  The first feedback I received
in August was to use a reloption rather than a GUC, making 8-byte
OIDs an option that one has to pass with CREATE TABLE.
The latest versions of the patch use a reloption.

> Widening the TOAST table column doesn't
> concern me; it eats up 4 bytes, but the tuples are ~2k so it's not
> that much of a difference. Widening the TOAST pointer seems like a
> bigger concern.
> Every single toasted value is getting 4 bytes wider. A
> single tuple could potentially have a significant number of such
> columns, and as a result, get significantly wider. We could still use
> the 4-byte toast pointer format if the value ID happens to be small,
> but if we just assign the value ID using a 4-byte counter, after the
> first 4B allocations, it will never be small again. After that,
> relations that never would have needed anything like 4B toast pointers
> will still pay the cost for those that do, which seems quite sad.
> Whether this patch should be responsible for doing something about
> that sadness is unclear to me: a sequence-per-relation to allow
> separate counter allocation for each relation would be great, but
> bottlenecking this patch set behind such a large change might well be
> the wrong idea.

As for the user cases I know of, the 4-byte limit is usually reached
by a small subset of tables in one or more schemas.  The cost of an
8-byte OID also depends on how often the TOAST blobs are updated.  If
they are updated a lot, we'll need a new value anyway.  If the blobs
are mostly static content while other attributes are updated a lot,
the cost of 8-byte OIDs becomes much higher.  If I were a user dealing
with a set of tables that already have 4 billion TOAST blobs of at
least 2kB in more than one relation, the extra 4 bytes when updating
the other attributes would not be my main worry.  :)
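The counter concern quoted above can be made concrete. If the narrow format is used whenever the value ID fits in 32 bits, then with one shared counter every relation gets wide pointers once that counter passes 2^32, whereas a per-relation counter (the "sequence-per-relation" idea) keeps a small relation on narrow pointers. In this minimal sketch, 18 bytes is the upstream on-disk TOAST_POINTER_SIZE (a 2-byte header plus the 16-byte varatt_external); the 4 extra bytes for the wide format follow the thread's statement; ValueIdCounter, counter_next() and toast_pointer_size() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Upstream TOAST_POINTER_SIZE: 2-byte header + 16-byte varatt_external. */
#define TOAST_POINTER_SIZE_OID4 18
/* Assumption: the 8-byte-ID format is 4 bytes wider, per the thread. */
#define TOAST_POINTER_SIZE_OID8 (TOAST_POINTER_SIZE_OID4 + 4)

/* Pick the narrowest on-disk format able to represent the value ID. */
static int
toast_pointer_size(uint64_t value_id)
{
    return value_id <= UINT32_MAX ? TOAST_POINTER_SIZE_OID4
                                  : TOAST_POINTER_SIZE_OID8;
}

/* Hypothetical 64-bit value-ID counter, shared or per relation. */
typedef struct ValueIdCounter
{
    uint64_t    next;
} ValueIdCounter;

/* Hand out the next value ID and advance the counter. */
static uint64_t
counter_next(ValueIdCounter *c)
{
    return c->next++;
}
```

For example, a relation that draws from a shared counter that has already handed out 2^32 IDs gets a 22-byte pointer for its very first value, while the same relation with its own counter gets an 18-byte one.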

> I am also uncertain how much loss we're really talking about: if for
> example a wide table contains 5 long text blobs per row (which seems
> fairly high) then that's "only" 20 bytes/row, and probably those rows
> don't fit into 8kB blocks that well anyway, so maybe the loss of
> efficiency isn't much. Maybe the biggest concern is that, IIUC, some
> tuples that currently can be stored on disk wouldn't be storable at
> all any more with the wider pointers, because the row couldn't be
> squeezed into the block no matter how much pressure the TOAST code
> applies. If that's correct, I think there's a pretty good chance that
> this patch will make a few people who are already desperately unhappy
> with the TOAST system even more unhappy. There's nothing you can
> really do if you're up against that hard limit. I'm not sure what, if
> anything, we can or should do about that, but it seems like something
> we ought to at least discuss.

Yeah, perhaps.  Again, for users who fight against the hard limit,
what I'm hearing is that rewriting a large table as a partitioned one
to bypass the limit is sometimes not acceptable.  With the recent
planner changes that have improved the planning time for many
partitions, things are better, but it's basically impossible to
predict every user behavior that could be out there.  I won't deny
that enlarging the TOAST values to 8 bytes for some relations could
make some folks unhappy, because it makes the storage of on-disk
external pointers less efficient with respect to alignment.
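A back-of-the-envelope check of both points in the quoted paragraph (the 20 bytes per row, and rows that would no longer fit at all) follows. It assumes the default 8kB block, for which MaxHeapTupleSize is 8192 - MAXALIGN(24 + 4) = 8160 bytes, a 23-byte tuple header MAXALIGN'd to 24 with no null bitmap, every column already pushed out of line, and a wide pointer 4 bytes larger than the 18-byte TOAST_POINTER_SIZE. External pointers are stored without alignment padding, so they pack back to back. The function names are hypothetical.

```c
#include <assert.h>

/* Default 8kB block: 8192 - MAXALIGN(page header 24 + line pointer 4). */
#define MAX_HEAP_TUPLE_SIZE 8160
/* 23-byte heap tuple header, MAXALIGN'd to 24; assumes no null bitmap. */
#define TUPLE_HEADER_SIZE 24
#define TOAST_POINTER_SIZE_OID4 18      /* upstream TOAST_POINTER_SIZE */
#define TOAST_POINTER_SIZE_OID8 22      /* assumption: 4 bytes wider */

/* Extra bytes per row when n toasted columns use the wider pointer. */
static int
row_overhead(int n_toasted)
{
    return n_toasted * (TOAST_POINTER_SIZE_OID8 - TOAST_POINTER_SIZE_OID4);
}

/*
 * Upper bound on external TOAST pointers in one heap tuple once TOAST
 * has moved every column out of line, i.e. the hard limit.
 */
static int
max_toasted_columns(int pointer_size)
{
    return (MAX_HEAP_TUPLE_SIZE - TUPLE_HEADER_SIZE) / pointer_size;
}
```

Under these assumptions, five toasted columns cost 20 extra bytes per row, and the ceiling drops from 452 fully toasted columns per row to 369. A row with between 370 and 452 such columns (tables allow up to 1600 columns) would fit today but not with the wider pointers, no matter how hard TOAST pushes.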

At this point, I am mostly interested in a possible consensus about
what people would like to see in the timeframe that remains for v19,
and the timing is what it is because everybody is busy.  My priority
is giving users a way to bypass the 4-byte limit without reworking a
schema.  If this is opt-in, users at least have a choice.  A
reloption has the disadvantage of not being something that users are
aware of by default.  A rewrite of the TOAST table is required
anyway.  If the consensus is that this design is not good enough,
that's fine.  If the consensus is to use the
"brutal-still-not-so-brutal" approach, that's fine.  The only thing I
can do is commit my resources to making something happen for the
4-byte limitation, whatever approach folks would be OK with.
--
Michael

Attachment
