Support for 8-byte TOAST values (aka the TOAST infinite loop problem) - Mailing list pgsql-hackers

From Michael Paquier
Subject Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
Date
Msg-id aFOnKHG7Wn-Srnpv@paquier.xyz
Whole thread Raw
List pgsql-hackers
Hi all,

I have been looking at $subject and the many past reviews recently,
also related to some of the work related to the potential support for
zstandard compression in TOAST values, and found myself pondering
about the following message from Tom, to be reminded that nothing has
been done regarding the fact that the backend may finish in an
infinite loop once a TOAST table reaches 4 billion values:
https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us

Spoiler: I have heard of users that are in this case, and the best
thing we can do currently except raising shoulders is to use
workarounds with data externalization AFAIK, which is not nice,
usually, and users notice the problem once they see some backends
stuck in the infinite loop.  I have spent some time looking at the
problem, and looked at all the proposals in this area like these ones
(I hope so at least):
https://commitfest.postgresql.org/patch/4296/
https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru

Anyway, it seems like nothing goes in a direction that I think would
be suited to fix the two following problems (some of the proposed
patches broke backward-compatibility, as well, and that's critical):
- The limit of TOAST values to 4 billions, because external TOAST
pointers want OIDs.
- New compression methods, see the recent proposal about zstandard at
[1].  ToastCompressionId is currently limited to 4 values because the
extinfo field of varatt_external has only this much data remaining.
Spoiler: I want to propose a new varatt_external dedicated to
zstandard-compressed external pointers, but that's not for this
thread.

Please find attached a patch set I have finished with while poking at
the problem, to address points 1) and 2) in the first email mentioned
at the top of this message.  It is not yet ready for prime day yet
(there are a couple of things that would need adjustments), but I have
reached the point where I am going to need a consensus about what
people would be OK to have in terms of design to be able to support
multiple types of varatt_external to address these issues.  And I'm OK
to consume time on that for the v19 cycle.

While hacking at (playing with?) the whole toasting and detoasting
code to understand the blast radius that this would involve, I have
quickly found that it is very annoying to have to touch at many places
of varatt.h to make variants of the existing varatt_external structure
(what we store on-disk as varlena Datum for external TOAST pointers).
Spoiler: it's my first time touching the internals of this code area
so deeply.  Most of the code churns happen because we need to make the
[de]toast code aware of what to do depending on the vartags of the
external varlenas.  It would be simple to hardcode a bunch of new
VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
While it is efficient, this has a cost for out-of-core code and in
core because all the places that touch external TOAST pointers need to
be adjusted.  Point is: it can be done.  But if we introduce more
types of external TOAST pointers we need to always patch all these
areas, and there's a cost in that each time one or more new vartags
are added.

So, I have invented a new interface aimed at manipulating on-disk
external TOAST pointers, called toast_external, that is an on-memory
structure that services as an interface between on-disk external TOAST
pointers and what the backend wants to look at when retrieving chunks
of data from the TOAST relations.  That's the main proposal of this
patch set, with a structure looking like that:
typedef struct toast_external_data
{
    /* Original data size (includes header) */
    int32       rawsize;
    /* External saved size (without header) */
    uint32      extsize;
    /* compression method */
    ToastCompressionId compression_method;
    /* Relation OID of TOAST table containing the value */
    Oid            toastrelid;
    /*
     * Unique ID of value within TOAST table.  This could be an OID or an
     * int8 value.  This field is large enough to be able to store any of
     * them.
     */
    uint64        value;
} toast_external_data;

This is a bit similar to what the code does for R/W and R/O vartags,
only applying to the on-disk external pointers.  Then, the [de]toast
code and extension code is updated so as varlenas are changed into
this structure if we need to retrieve some of its data, and these
areas of the code do not need to know about the details of the
external TOAST pointers.  When saving an external set of chunks, this
structure is filled with information depending on what
toast_save_datum() deals with, be it a short varlena, a non-compressed
external value, or a compressed external value, then builds a varlena
with the vartag we want.

External TOAST pointers have three properties that are hardcoded in
the tree, bringing some challenges of their own:
- The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
to 2k to make 4 chunks fit on a page.  This depends on the size of the
external pointer.  This one was actually easy to refactor.
- The varlena header size, based on VARTAG_SIZE(), which is kind of
tricky to refactor out in the new toast_external.c, but that seems OK
even if this knowledge stays in varatt.h.
- The toast pointer size, aka TOAST_POINTER_SIZE.  This one is
actually very interesting (tricky): we use it in one place,
toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
attribute should be toastable or not.  I've refactored the code to use
a "best" guess depending on the value type in the TOAST relation, but
that's not 100% waterproof.  That needs more thoughts.

Anyway, the patch set is able to demonstrate how much needs to be done
in the tree to support multiple chunk_id types, and the CI is happy
with the attached.  Some of the things done:
- Introduction of a user-settable GUC called default_toast_type, that
can be switched between two modes "oid" and "int8", to force the
creation of a TOAST relation using one type or the other.
- Dump, restore and upgrade support are integrated, relying on a GUC
makes the logic a breeze.
- 64b values are retrieved from a single counter in the control file,
named a "TOAST counter", which has the same reliability and properties
as an OID, with checkpoint support, WAL records, etc.
- Rewrites are soft, so I have kicked the can down the toast on this
point to not make the proposal more complicated than it should be: a
VACUUM FULL retains the same TOAST value type as the original.  We
could extend rewrites so as the type of TOAST value is changed.  It is
possible to setup a new cluster with default_toast_type = int8 set
after an upgrade, with the existing tables still using the OID mode.
This relates to the recent proposal with a concurrent VACUUM FULL
(REPACK discussion).

The patch set keeps the existing vartag_external with OID values for
backward-compatibility, and adds a second vartag_external that can
store 8-byte values.  This model is the simplest one, and
optimizations are possible, where the Datum TOAST pointer could be
shorter depending on the ID type (OID or int8), the compression method
and the actual value to divide in chunks.  For example, if you know
that a chunk of data to save has a value less than UINT32_MAX, we
could store 4 bytes worth of data instead of 8.  This design has the
advantage to allow plugging in new TOAST external structures easily.
Now I've not spent extra time in this tuning, because there's no point
in spending more time without an actual agreement about three things,
and *that's what I'm looking for* as feedback for this upcoming CF:
- The structures of the external TOAST pointers.  Variable-sized
pointers could be one possibility, across multiple vartags.  Ideas are
welcome.
- How many vartag_external types we want.
- If people are actually OK with this translation layer or not, and I
don't disagree that there may be some paths hot enough where the
translation between the on-disk varlenas and this on-memory
toast_external_data hurts.  Again, it is possible to hardcode more
types of vartags in the tree, or just bypass the translation in the
paths that are too hot.  That's doable still brutal, but if that's the
final consensus reached I'm OK with that as well.  (See for example
the changes in amcheck to see how simpler things get.)

The patch set has been divided into multiple pieces to ease its
review.  Again, I'm not completely happy with everything in it, but
it's a start.  Each patch has its own commit message, so feel free to
refer to them for more details:
- 0001 introduces the GUC default_toast_type.  It is just defined, not
used in the tree at this stage.
- 0002 adds support for catcache lookups for int8 values, required to
allow TOAST values with int8 and its indexes.  Potentially useful for
extensions.
- 0003 introduces the "TOAST counter", 8 bytes in the control file to
allocate values for the int8 chunk_id.  That's cheap, reliable.
- 0004 is a mechanical change, that enlarges a couple of TOAST
interfaces to use values of uint64 instead of OID.
- 0005, again a mechanical change, reducing a bit the footprint of
TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
values.
- 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.

Then comes the "main" patches:
- 0007 adds support for int8 chunk_id in TOAST tables.  This is mostly
a mechanical change.  If applying the patches up to this point,
external Datums are applied to both OID and int8 values.  Note that
there is one tweak I'm unhappy with: the toast counter generation
would need to be smarter to avoid concurrent values because we don't
cross-check the TOAST index for existing values.  (Sorry, got slightly
lazy here).
- 0008 adds tests for external compressed and uncompressed TOAST
values for int8 TOAST types.
- 0009 adds support for dump, restore, upgrades of the TOAST table
types.
- 0010 is the main dish: refactoring of the TOAST code to use
toast_external_data, with OID vartags as the only type defined.
- 0011 adds a second vartag_external: the one with int8 values stored
in the external TOAST pointer.
- 0012 is a bonus for amcheck: what needs to be done in its TAP tests
to allow the corruption cases to work when supporting a new vartag.

That was a long message.  Thank you for reading if you have reached
this point.

Regards,

[1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
Next
From: Tatsuo Ishii
Date:
Subject: Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options