Support for 8-byte TOAST values (aka the TOAST infinite loop problem) - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Support for 8-byte TOAST values (aka the TOAST infinite loop problem) |
Date | |
Msg-id | aFOnKHG7Wn-Srnpv@paquier.xyz Whole thread Raw |
List | pgsql-hackers |
Hi all, I have been looking at $subject and the many past reviews recently, also related to some of the work related to the potential support for zstandard compression in TOAST values, and found myself pondering about the following message from Tom, to be reminded that nothing has been done regarding the fact that the backend may finish in an infinite loop once a TOAST table reaches 4 billion values: https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us Spoiler: I have heard of users that are in this case, and the best thing we can do currently except raising shoulders is to use workarounds with data externalization AFAIK, which is not nice, usually, and users notice the problem once they see some backends stuck in the infinite loop. I have spent some time looking at the problem, and looked at all the proposals in this area like these ones (I hope so at least): https://commitfest.postgresql.org/patch/4296/ https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru Anyway, it seems like nothing goes in a direction that I think would be suited to fix the two following problems (some of the proposed patches broke backward-compatibility, as well, and that's critical): - The limit of TOAST values to 4 billions, because external TOAST pointers want OIDs. - New compression methods, see the recent proposal about zstandard at [1]. ToastCompressionId is currently limited to 4 values because the extinfo field of varatt_external has only this much data remaining. Spoiler: I want to propose a new varatt_external dedicated to zstandard-compressed external pointers, but that's not for this thread. Please find attached a patch set I have finished with while poking at the problem, to address points 1) and 2) in the first email mentioned at the top of this message. It is not yet ready for prime day yet (there are a couple of things that would need adjustments), but I have reached the point where I am going to need a consensus about what people would be OK to have in terms of design to be able to support multiple types of varatt_external to address these issues. And I'm OK to consume time on that for the v19 cycle. While hacking at (playing with?) the whole toasting and detoasting code to understand the blast radius that this would involve, I have quickly found that it is very annoying to have to touch at many places of varatt.h to make variants of the existing varatt_external structure (what we store on-disk as varlena Datum for external TOAST pointers). Spoiler: it's my first time touching the internals of this code area so deeply. Most of the code churns happen because we need to make the [de]toast code aware of what to do depending on the vartags of the external varlenas. It would be simple to hardcode a bunch of new VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures. While it is efficient, this has a cost for out-of-core code and in core because all the places that touch external TOAST pointers need to be adjusted. Point is: it can be done. But if we introduce more types of external TOAST pointers we need to always patch all these areas, and there's a cost in that each time one or more new vartags are added. So, I have invented a new interface aimed at manipulating on-disk external TOAST pointers, called toast_external, that is an on-memory structure that services as an interface between on-disk external TOAST pointers and what the backend wants to look at when retrieving chunks of data from the TOAST relations. That's the main proposal of this patch set, with a structure looking like that: typedef struct toast_external_data { /* Original data size (includes header) */ int32 rawsize; /* External saved size (without header) */ uint32 extsize; /* compression method */ ToastCompressionId compression_method; /* Relation OID of TOAST table containing the value */ Oid toastrelid; /* * Unique ID of value within TOAST table. This could be an OID or an * int8 value. This field is large enough to be able to store any of * them. */ uint64 value; } toast_external_data; This is a bit similar to what the code does for R/W and R/O vartags, only applying to the on-disk external pointers. Then, the [de]toast code and extension code is updated so as varlenas are changed into this structure if we need to retrieve some of its data, and these areas of the code do not need to know about the details of the external TOAST pointers. When saving an external set of chunks, this structure is filled with information depending on what toast_save_datum() deals with, be it a short varlena, a non-compressed external value, or a compressed external value, then builds a varlena with the vartag we want. External TOAST pointers have three properties that are hardcoded in the tree, bringing some challenges of their own: - The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close to 2k to make 4 chunks fit on a page. This depends on the size of the external pointer. This one was actually easy to refactor. - The varlena header size, based on VARTAG_SIZE(), which is kind of tricky to refactor out in the new toast_external.c, but that seems OK even if this knowledge stays in varatt.h. - The toast pointer size, aka TOAST_POINTER_SIZE. This one is actually very interesting (tricky): we use it in one place, toast_tuple_find_biggest_attribute(), as a lower bound to decide if an attribute should be toastable or not. I've refactored the code to use a "best" guess depending on the value type in the TOAST relation, but that's not 100% waterproof. That needs more thoughts. Anyway, the patch set is able to demonstrate how much needs to be done in the tree to support multiple chunk_id types, and the CI is happy with the attached. Some of the things done: - Introduction of a user-settable GUC called default_toast_type, that can be switched between two modes "oid" and "int8", to force the creation of a TOAST relation using one type or the other. - Dump, restore and upgrade support are integrated, relying on a GUC makes the logic a breeze. - 64b values are retrieved from a single counter in the control file, named a "TOAST counter", which has the same reliability and properties as an OID, with checkpoint support, WAL records, etc. - Rewrites are soft, so I have kicked the can down the toast on this point to not make the proposal more complicated than it should be: a VACUUM FULL retains the same TOAST value type as the original. We could extend rewrites so as the type of TOAST value is changed. It is possible to setup a new cluster with default_toast_type = int8 set after an upgrade, with the existing tables still using the OID mode. This relates to the recent proposal with a concurrent VACUUM FULL (REPACK discussion). The patch set keeps the existing vartag_external with OID values for backward-compatibility, and adds a second vartag_external that can store 8-byte values. This model is the simplest one, and optimizations are possible, where the Datum TOAST pointer could be shorter depending on the ID type (OID or int8), the compression method and the actual value to divide in chunks. For example, if you know that a chunk of data to save has a value less than UINT32_MAX, we could store 4 bytes worth of data instead of 8. This design has the advantage to allow plugging in new TOAST external structures easily. Now I've not spent extra time in this tuning, because there's no point in spending more time without an actual agreement about three things, and *that's what I'm looking for* as feedback for this upcoming CF: - The structures of the external TOAST pointers. Variable-sized pointers could be one possibility, across multiple vartags. Ideas are welcome. - How many vartag_external types we want. - If people are actually OK with this translation layer or not, and I don't disagree that there may be some paths hot enough where the translation between the on-disk varlenas and this on-memory toast_external_data hurts. Again, it is possible to hardcode more types of vartags in the tree, or just bypass the translation in the paths that are too hot. That's doable still brutal, but if that's the final consensus reached I'm OK with that as well. (See for example the changes in amcheck to see how simpler things get.) The patch set has been divided into multiple pieces to ease its review. Again, I'm not completely happy with everything in it, but it's a start. Each patch has its own commit message, so feel free to refer to them for more details: - 0001 introduces the GUC default_toast_type. It is just defined, not used in the tree at this stage. - 0002 adds support for catcache lookups for int8 values, required to allow TOAST values with int8 and its indexes. Potentially useful for extensions. - 0003 introduces the "TOAST counter", 8 bytes in the control file to allocate values for the int8 chunk_id. That's cheap, reliable. - 0004 is a mechanical change, that enlarges a couple of TOAST interfaces to use values of uint64 instead of OID. - 0005, again a mechanical change, reducing a bit the footprint of TOAST_MAX_CHUNK_SIZE because OID and int8 values need different values. - 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type. Then comes the "main" patches: - 0007 adds support for int8 chunk_id in TOAST tables. This is mostly a mechanical change. If applying the patches up to this point, external Datums are applied to both OID and int8 values. Note that there is one tweak I'm unhappy with: the toast counter generation would need to be smarter to avoid concurrent values because we don't cross-check the TOAST index for existing values. (Sorry, got slightly lazy here). - 0008 adds tests for external compressed and uncompressed TOAST values for int8 TOAST types. - 0009 adds support for dump, restore, upgrades of the TOAST table types. - 0010 is the main dish: refactoring of the TOAST code to use toast_external_data, with OID vartags as the only type defined. - 0011 adds a second vartag_external: the one with int8 values stored in the external TOAST pointer. - 0012 is a bonus for amcheck: what needs to be done in its TAP tests to allow the corruption cases to work when supporting a new vartag. That was a long message. Thank you for reading if you have reached this point. Regards, [1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com -- Michael
Attachment
- v1-0001-Add-GUC-default_toast_type.patch
- v1-0002-Add-catcache-support-for-INT8OID.patch
- v1-0003-Introduce-global-64-bit-TOAST-ID-counter-in-contr.patch
- v1-0004-Refactor-some-TOAST-value-ID-code-to-use-uint64-i.patch
- v1-0005-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v1-0006-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v1-0007-Add-support-for-bigint-TOAST-values.patch
- v1-0008-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v1-0009-Add-support-for-TOAST-table-types-in-pg_dump-and-.patch
- v1-0010-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v1-0011-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- v1-0012-amcheck-Add-test-cases-for-8-byte-TOAST-values.patch
- signature.asc
pgsql-hackers by date: