Re: Pluggable toaster - Mailing list pgsql-hackers

From Nikita Malakhov
Subject Re: Pluggable toaster
Date
Msg-id CAN-LCVOmwiwxy_Cy20E0Ntevcw8tmcVd8r_MMRX9-Qn0pcHpbw@mail.gmail.com
In response to Re: Pluggable toaster  (Simon Riggs <simon.riggs@enterprisedb.com>)
List pgsql-hackers
Hi all!

Simon, thank you for your review.
I'll try to give a brief explanation of some of the topics you've mentioned.
My colleagues will correct me if I miss the point, and provide more details.

>Agreed, Oleg has made some very clear analysis of the value of having
>a higher degree of control over toasting from within the datatype.
Currently, we see the biggest flaw in TOAST functionality as the lack
of any means of extension or modification short of changing the core
code itself. It is not possible to use any TOAST strategy other than
those existing in core, and the same applies to assigning different
TOAST methods to columns and datatypes.
The main point of this patch is to provide an open API and syntax for
creating new toasters as pluggable extensions, and to make the
existing (default) toaster work via this API without affecting its
behavior. Also, the default toaster is tightly coupled with heap
access, with somewhat unclear code relations (headers, function
locations and calls, etc.) that are not well structured logically and
need to be straightened out.

>In my understanding, we want to be able to
>1. Access data from a toasted object one slice at a time, by using
>knowledge of the structure
>2. If toasted data is updated, then update a minimum number of
>slices(s), without rewriting the existing slices
>3. If toasted data is expanded, then allow new slices to be appended to
>the object without rewriting the existing slices
There are two main ideas behind the Pluggable Toaster patch.
First, to provide an extensible API for all Postgres developers, so
that custom toasters can be developed and plugged in as independent
extensions for different data types and columns, using different
toast strategies, access and compression methods, and so on.
Second, to refactor the current TOAST functionality: to make the
code more logically structured and understandable, to 'detach' the
default toaster ('generic', as it is currently named, though 'heap'
may be a better name) from the DBMS core code, to route it through
the new API, and to hide all existing toaster-specific internals
behind that API.
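
To make the first idea more concrete, here is a rough model of what
"the default toaster is just one implementation behind a common API"
could look like. It is a Python illustration, not code from the patch:
the names Toaster, DefaultToaster, ReversingToaster, register_toaster,
set_column_toaster and toaster_for are all hypothetical, and zlib only
stands in for the real compression methods.

```python
import zlib
from abc import ABC, abstractmethod


class Toaster(ABC):
    """Hypothetical interface that every pluggable toaster implements."""

    @abstractmethod
    def toast(self, value: bytes) -> bytes:
        """Turn a large value into its stored form."""

    @abstractmethod
    def detoast(self, stored: bytes) -> bytes:
        """Rebuild the original value from its stored form."""


class DefaultToaster(Toaster):
    """Stand-in for the existing toaster, routed through the same interface."""

    def toast(self, value: bytes) -> bytes:
        return zlib.compress(value)  # assumption: zlib replaces pglz/lz4 here

    def detoast(self, stored: bytes) -> bytes:
        return zlib.decompress(stored)


class ReversingToaster(Toaster):
    """Toy custom toaster; it only shows that a second strategy plugs in."""

    def toast(self, value: bytes) -> bytes:
        return value[::-1]

    def detoast(self, stored: bytes) -> bytes:
        return stored[::-1]


# Registry of known toasters, plus a per-column assignment.
_toasters = {"default": DefaultToaster()}
_column_toaster = {}


def register_toaster(name: str, toaster: Toaster) -> None:
    if name in _toasters:
        raise ValueError(f"toaster {name!r} already registered")
    _toasters[name] = toaster


def set_column_toaster(table: str, column: str, name: str) -> None:
    if name not in _toasters:
        raise KeyError(name)
    _column_toaster[(table, column)] = name


def toaster_for(table: str, column: str) -> Toaster:
    # Columns without an explicit assignment fall back to the default toaster.
    return _toasters[_column_toaster.get((table, column), "default")]


# Usage: one column gets a custom toaster, every other column keeps the default.
register_toaster("reversing", ReversingToaster())
set_column_toaster("docs", "body", "reversing")
```

Callers only ever go through toaster_for(), so existing behavior stays
the same for every column that has no custom toaster assigned.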

All the points you mentioned become possible to develop with this
patch (and some are in fact being developed, in the
bytea_appendable_toaster part of this patch and in the JSONB toaster
by Nikita Glukhov, who could explain this topic much better).
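
As a toy model of the three points above (slice-wise access, minimal
update, append without rewriting), consider the sketch below. It is an
illustration only and is not taken from bytea_appendable_toaster: the
SlicedValue class, its methods and the bytes_written counter are
hypothetical, and real TOAST chunks are of course much larger than 4
bytes.

```python
class SlicedValue:
    """Toy out-of-line value stored as fixed-size slices."""

    def __init__(self, slice_size: int = 4):
        self.slice_size = slice_size
        self.slices = []
        self.bytes_written = 0  # bytes physically written so far

    def __len__(self) -> int:
        return sum(len(s) for s in self.slices)

    def _write(self, index: int, data: bytes) -> None:
        # Write one slice, either replacing an existing one or adding a new one.
        if index == len(self.slices):
            self.slices.append(data)
        else:
            self.slices[index] = data
        self.bytes_written += len(data)

    def read(self, offset: int, length: int) -> bytes:
        """Point 1: fetch only the slices that cover the requested range."""
        first = offset // self.slice_size
        last = (offset + length - 1) // self.slice_size
        blob = b"".join(self.slices[first:last + 1])
        start = offset - first * self.slice_size
        return blob[start:start + length]

    def update(self, offset: int, data: bytes) -> None:
        """Point 2: overwrite in place, rewriting only the affected slices."""
        if offset < 0 or offset + len(data) > len(self):
            raise ValueError("update must stay within the existing value")
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, self.slice_size)
            old = self.slices[index]
            n = min(len(data) - pos, len(old) - start)
            self._write(index, old[:start] + data[pos:pos + n] + old[start + n:])
            pos += n

    def append(self, data: bytes) -> None:
        """Point 3: extend the value; full existing slices are never rewritten."""
        if not data:
            return
        if self.slices and len(self.slices[-1]) < self.slice_size:
            tail = self.slices[-1]
            room = self.slice_size - len(tail)
            self._write(len(self.slices) - 1, tail + data[:room])
            data = data[room:]
        for off in range(0, len(data), self.slice_size):
            self._write(len(self.slices), data[off:off + self.slice_size])


value = SlicedValue(slice_size=4)
value.append(b"abcdefgh")   # writes two full slices
value.append(b"ij")         # adds a third slice, the first two are untouched
```

In this model the write cost of an append depends on the size of the
appended data (plus at most one partial tail slice), not on the size of
the value already stored.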

>> Modification of current toaster for all tasks and cases looks too
>> complex, moreover, it  will not works for  custom data types. Postgres
>> is an extensible database,  why not to extent its extensibility even
>> further, to have pluggable TOAST! We  propose an idea to separate
>> toaster from  heap using  toaster API similar to table AM API etc.
>> Following patches are applicable over patch in [1]

>ISTM that we would want the toast algorithm to be associated with the
>datatype, not the column?
>Can you explain your thinking?
This possibility is considered for future development.

>We already have Expanded toast format, in-memory, which was designed
>specifically to allow us to access sub-structure of the datatype
>in-memory. So I was expecting to see an Expanded, on-disk, toast
>format that roughly matched that concept, since Tom has already shown
>us the way. (varatt_expanded). This would be usable by both JSON and
>PostGIS.
The main disadvantage is that it allows neither other toasting
strategies nor compression methods other than pglz and lz4.

>Some other thoughts:

>I imagine the data type might want to keep some kind of dictionary
>inside the main toast pointer, so we could make allowance for some
>optional datatype-specific private area in the toast pointer itself,
>allowing a mix of inline and out-of-line data, and/or a table of
>contents to the slices.
It is partly implemented in the JSONB custom toaster, as I mentioned
above, and could also be considered as a future improvement of the
existing toaster, delivered as an extension.

>I'm thinking could also tackle these things at the same time:
>* We want to expand TOAST to 64-bit pointers, so we can have more
>pointers in a table
This issue is being discussed but is not currently implemented; it is
considered one of the possible future improvements.

>* We want to avoid putting the data length into the toast pointer, so
>we can allow the toasted data to be expanded without rewriting
>everything (to avoid O(N^2) cost)
May I correct you: the actual relation is O(N), not O(N^2).
Currently, the data length is stored outside the toaster-specific
data, in the varatt_custom structure that is meant to be used by all
custom (extended) toasters. Data that is specific to a particular
custom toaster is stored inside the va_toasterdata structure.
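
To illustrate the split between the common part and the
toaster-specific part, here is a minimal sketch of such a pointer.
Only the names varatt_custom and va_toasterdata come from the patch;
the concrete fields, their order and their widths below are
assumptions made for the example, and the real layout may differ.

```python
import struct

# Hypothetical common header shared by all custom toasters:
# toaster id, raw data length, and length of the private area.
HEADER = struct.Struct("<III")


def pack_custom_pointer(toaster_id: int, raw_size: int, toaster_data: bytes) -> bytes:
    """Build a pointer: common header followed by opaque toaster-specific data."""
    return HEADER.pack(toaster_id, raw_size, len(toaster_data)) + toaster_data


def unpack_custom_pointer(buf: bytes):
    """Split a pointer back into (toaster_id, raw_size, toaster_data)."""
    toaster_id, raw_size, data_len = HEADER.unpack_from(buf)
    toaster_data = buf[HEADER.size:HEADER.size + data_len]
    if len(toaster_data) != data_len:
        raise ValueError("truncated toaster data")
    return toaster_id, raw_size, toaster_data


# The core can read the header without understanding the private area,
# which only the owning toaster knows how to interpret.
pointer = pack_custom_pointer(42, 1_000_000, b"private-bytes")
```

The point of the sketch is only that the length lives in the common
header, while everything after it is interpreted by the toaster alone.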

Looking forward to your thoughts on our work.

--
Best regards,
Nikita A. Malakhov

On Wed, Jan 5, 2022 at 5:46 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
On Thu, 30 Dec 2021 at 16:40, Teodor Sigaev <teodor@sigaev.ru> wrote:

> We are working on custom toaster for JSONB [1], because current TOAST is
> universal for any data type and because of that it has some disadvantages:
>     - "one toast fits all"  may be not the best solution for particular
>       type or/and use cases
>     - it doesn't know the internal structure of data type, so it  cannot
>       choose an optimal toast strategy
>     - it can't  share common parts between different rows and even
>       versions of rows

Agreed, Oleg has made some very clear analysis of the value of having
a higher degree of control over toasting from within the datatype.

In my understanding, we want to be able to
1. Access data from a toasted object one slice at a time, by using
knowledge of the structure
2. If toasted data is updated, then update a minimum number of
slices(s), without rewriting the existing slices
3. If toasted data is expanded, then allow new slices to be appended to
the object without rewriting the existing slices

> Modification of current toaster for all tasks and cases looks too
> complex, moreover, it  will not works for  custom data types. Postgres
> is an extensible database,  why not to extent its extensibility even
> further, to have pluggable TOAST! We  propose an idea to separate
> toaster from  heap using  toaster API similar to table AM API etc.
> Following patches are applicable over patch in [1]

ISTM that we would want the toast algorithm to be associated with the
datatype, not the column?
Can you explain your thinking?

We already have Expanded toast format, in-memory, which was designed
specifically to allow us to access sub-structure of the datatype
in-memory. So I was expecting to see an Expanded, on-disk, toast
format that roughly matched that concept, since Tom has already shown
us the way. (varatt_expanded). This would be usable by both JSON and
PostGIS.


Some other thoughts:

I imagine the data type might want to keep some kind of dictionary
inside the main toast pointer, so we could make allowance for some
optional datatype-specific private area in the toast pointer itself,
allowing a mix of inline and out-of-line data, and/or a table of
contents to the slices.

I'm thinking could also tackle these things at the same time:
* We want to expand TOAST to 64-bit pointers, so we can have more
pointers in a table
* We want to avoid putting the data length into the toast pointer, so
we can allow the toasted data to be expanded without rewriting
everything (to avoid O(N^2) cost)

--
Simon Riggs                http://www.EnterpriseDB.com/

