Thread: [RFC] indirect toast tuple support

[RFC] indirect toast tuple support

From
Andres Freund
Date:
Hi,

During logical decoding toast tuples are decoded separately from the
main tuple. Works nicely. To make the main table's HeapTuple actually
easily useable the tuple needs to be "reconstructed" to not point to
disk anymore but to the separately reconstructed tuples.

There are two ways to do this:
a) build a new HeapTuple that contains all formerly toasted datums
   inline (i.e. !VARATT_IS_EXTERNAL)
b) add support for external toast tuples that point into separately
   allocated memory instead of a toast table

a) has the problem that that the flattened HeapTuple can be bigger than
our maximal allocation size which seems like an awkward restriction...

Given that there have been wishes to support something like b) for quite
some time, independent from logical decoding, it seems like a good idea
to add support for it. Its e.g. useful for avoiding repeated detoasting
or decompression of tuples.

The problem with b) is that there is no space in varlena's flag bits to
directly denote that a varlena points into memory instead of either
directly containing the data or a varattrib_1b_e containing a
varatt_external pointing to an on-disk toasted tuple.

I propose extending the EXTERNAL varlenas to be able to point to memory
instead just to disk. It seem apt to use EXTERNAL for this as they
aren't stored in the normal heap tuple but somewhere else.
Unfortunately there is no backward-compatible flag space in
varattrib_1b_e either to nicely denote this and we sure don't want to
break on-disk compatibility for this. Thus I propose to distinguish
on-disk and in-memory tuples via the varattrib_1b_e.va_len_1be.

The attached (RFC, not fully ready!) patch adds the following stuff to
the public interfaces:
typedef struct varatt_indirect
{
    struct varlena *pointer;    /* Pointer to in-memory varlena */
} varatt_indirect;

...

#define VARATT_IS_EXTERNAL(PTR)                VARATT_IS_1B_E(PTR)
#define VARATT_IS_EXTERNAL_TOAST(PTR) \
    (VARATT_IS_EXTERNAL(PTR) && VARSIZE_EXTERNAL(PTR) == TOAST_POINTER_SIZE)
#define VARATT_IS_EXTERNAL_INDIRECT(PTR) \
    (VARATT_IS_EXTERNAL(PTR) && VARSIZE_EXTERNAL(PTR) == INDIRECT_POINTER_SIZE)

I don't like to make the distinction through the size but I don't have a
better idea. And hey, its toast/varlena stuff, who expects cleanliness ;)

Existing code doesn't need to care whether a EXTERNAL datum is TOAST or
INDIRECT, that's handled transparently in tuptoaster.c. All EXTERNAL
tuples need to go through there anyway, so that seems fine.

Currently toast_fetch_datum() in tuptoaster.c does part of the gruntwork
for this as it was the easiest location but I think it might be better
to spread the work to some more callsites of it for clarity's sake.

Opinions?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [RFC] indirect toast tuple support

From
Andres Freund
Date:
On 2013-02-16 17:42:31 +0100, Andres Freund wrote:
> +/* ----------
> + * toast_datum_differs -
> + *
> + *  Determine whether two toasted datums are the same and don't have to be
> + *  stored again.
> + * ----------
> + */
> +static bool
> +toast_datum_differs(struct varlena *old_value, struct varlena *new_value)
> +{
> +    Assert(VARATT_IS_EXTERNAL(old_value));
> +    Assert(VARATT_IS_EXTERNAL(new_value));
> +
> +    /* fast path for the common case where we have the toast oid available */
> +    if (VARATT_IS_EXTERNAL_TOAST(old_value) &&
> +        VARATT_IS_EXTERNAL_TOAST(new_value))
> +        return memcmp((char *) old_value, (char *) new_value,
> +                      VARSIZE_EXTERNAL(old_value)) == 0;
> ...
> +    /* compare payload, we're fine with unaligned data */
> +    return memcmp(VARDATA_ANY(old_value), VARDATA_ANY(new_value),
> +                  VARSIZE_ANY_EXHDR(old_value)) == 0;
> +}

Those == need to be !=. Comes from changing the meaning of a function
last minute without an easy way to test (it just uselessly emits a new
toast tuple when nothing changed).

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [RFC] indirect toast tuple support

From
Greg Stark
Date:
On Sat, Feb 16, 2013 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I propose extending the EXTERNAL varlenas to be able to point to memory
> instead just to disk. It seem apt to use EXTERNAL for this as they
> aren't stored in the normal heap tuple but somewhere else.
> Unfortunately there is no backward-compatible flag space in
> varattrib_1b_e either to nicely denote this and we sure don't want to
> break on-disk compatibility for this. Thus I propose to distinguish
> on-disk and in-memory tuples via the varattrib_1b_e.va_len_1be.

The intention was to use va_len_1be to allow extensibility with
different external toast types. We were originally not going to have
it at all and just before committing we became concerned that we
wanted to avoid painting ourselves into a corner where we wouldn't be
able to come up with any new formats for external toast types. So we
added this one byte. i suggest thinking of it more as a "type" field
that happens to be the length of the toast pointer by convention than
an actual length header.

Just as an example I have at various times proposed a column
compression method that would store a dictionary of common values and
the toast pointer would be a pointer to this dictionary and an index
into it.

I have no problem using it for this case since it's using up only one
particular value for this field. As long as you don't want to use up
all the possible values for a single type of external pointer it seems
in line with the plan.


-- 
greg



Re: [RFC] indirect toast tuple support

From
Robert Haas
Date:
On Sat, Feb 16, 2013 at 11:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Given that there have been wishes to support something like b) for quite
> some time, independent from logical decoding, it seems like a good idea
> to add support for it. Its e.g. useful for avoiding repeated detoasting
> or decompression of tuples.
>
> The problem with b) is that there is no space in varlena's flag bits to
> directly denote that a varlena points into memory instead of either
> directly containing the data or a varattrib_1b_e containing a
> varatt_external pointing to an on-disk toasted tuple.

So the other way that we could do this is to use something that's the
same size as a TOAST pointer but has different content - the
seemingly-obvious choice being  va_toastrelid == 0.  I'd be a little
reluctant to do it the way you propose because we might, at some
point, want to try to reduce the size of toast pointers.   If you have
a tuple with many attributes, the size of the TOAST pointers
themselves starts to add up.  It would be nice to be able to have 8
byte or even 4 byte toast pointers to handle those situations.  If we
steal one or both of those lengths to mean "the data is cached in
memory somewhere" then we can't use those lengths in a smaller on-disk
representation, which would seem a shame.

But having said that, +1 on the general idea of getting something like
this done.  We really need a better infrastructure to avoid copying
large values around repeatedly in memory - a gigabyte is a lot of data
to be slinging around.

Of course, you will not be surprised to hear that I think this is 9.4 material.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [RFC] indirect toast tuple support

From
Andres Freund
Date:
On 2013-02-19 08:48:05 -0500, Robert Haas wrote:
> On Sat, Feb 16, 2013 at 11:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Given that there have been wishes to support something like b) for quite
> > some time, independent from logical decoding, it seems like a good idea
> > to add support for it. Its e.g. useful for avoiding repeated detoasting
> > or decompression of tuples.
> >
> > The problem with b) is that there is no space in varlena's flag bits to
> > directly denote that a varlena points into memory instead of either
> > directly containing the data or a varattrib_1b_e containing a
> > varatt_external pointing to an on-disk toasted tuple.
> 
> So the other way that we could do this is to use something that's the
> same size as a TOAST pointer but has different content - the
> seemingly-obvious choice being  va_toastrelid == 0.

Unfortunately that would mean you need to copy the varatt_external (or
whatever it would be called) to aligned storage to check what it
is. Thats why I went the other way.

Its a bit sad that varatt_1b_e only contains a length and not a type
byte. I would like to change the storage of existing toast types but
thats not going to work for pg_upgrade reasons...


>  I'd be a little
> reluctant to do it the way you propose because we might, at some
> point, want to try to reduce the size of toast pointers.   If you have
> a tuple with many attributes, the size of the TOAST pointers
> themselves starts to add up.  It would be nice to be able to have 8
> byte or even 4 byte toast pointers to handle those situations.  If we
> steal one or both of those lengths to mean "the data is cached in
> memory somewhere" then we can't use those lengths in a smaller on-disk
> representation, which would seem a shame.

I agree. As I said above, having the type overlayed into the lenght was
and is a bad idea, I just haven't found a better one thats compatible
yet.
Except inventing typlen=-3 aka "toast2" or something. But even that
wouldn't help getting rid of existing pg_upgraded tables. Besides being
a maintenance nightmare.

The only reasonable thing I can see us doing is renaming
varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
switch that maps types into lengths. But I think I would put this off,
except placing a comment somewhere, until its gets necessary.

> But having said that, +1 on the general idea of getting something like
> this done.  We really need a better infrastructure to avoid copying
> large values around repeatedly in memory - a gigabyte is a lot of data
> to be slinging around.
> 
> Of course, you will not be surprised to hear that I think this is 9.4 material.

Yes, obviously. But I need time to actually propose a working patch (I
already found 2 bugs in what I had submitted), thats why I brought it up
now. No point in wasting time if there's an oviously better idea around.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [RFC] indirect toast tuple support

From
Robert Haas
Date:
On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> So the other way that we could do this is to use something that's the
>> same size as a TOAST pointer but has different content - the
>> seemingly-obvious choice being  va_toastrelid == 0.
>
> Unfortunately that would mean you need to copy the varatt_external (or
> whatever it would be called) to aligned storage to check what it
> is. Thats why I went the other way.

How big a problem is that, though?

>>  I'd be a little
>> reluctant to do it the way you propose because we might, at some
>> point, want to try to reduce the size of toast pointers.   If you have
>> a tuple with many attributes, the size of the TOAST pointers
>> themselves starts to add up.  It would be nice to be able to have 8
>> byte or even 4 byte toast pointers to handle those situations.  If we
>> steal one or both of those lengths to mean "the data is cached in
>> memory somewhere" then we can't use those lengths in a smaller on-disk
>> representation, which would seem a shame.
>
> I agree. As I said above, having the type overlayed into the lenght was
> and is a bad idea, I just haven't found a better one thats compatible
> yet.
> Except inventing typlen=-3 aka "toast2" or something. But even that
> wouldn't help getting rid of existing pg_upgraded tables. Besides being
> a maintenance nightmare.
>
> The only reasonable thing I can see us doing is renaming
> varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
> switch that maps types into lengths. But I think I would put this off,
> except placing a comment somewhere, until its gets necessary.

I guess I wonder how hard we think it would be to insert such a thing
when it becomes necessary.  How much stuff is there out there that
cares about the fact that that length is a byte?

>> But having said that, +1 on the general idea of getting something like
>> this done.  We really need a better infrastructure to avoid copying
>> large values around repeatedly in memory - a gigabyte is a lot of data
>> to be slinging around.
>>
>> Of course, you will not be surprised to hear that I think this is 9.4 material.
>
> Yes, obviously. But I need time to actually propose a working patch (I
> already found 2 bugs in what I had submitted), thats why I brought it up
> now. No point in wasting time if there's an oviously better idea around.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [RFC] indirect toast tuple support

From
Andres Freund
Date:
On 2013-02-19 09:12:02 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> So the other way that we could do this is to use something that's the
> >> same size as a TOAST pointer but has different content - the
> >> seemingly-obvious choice being  va_toastrelid == 0.
> >
> > Unfortunately that would mean you need to copy the varatt_external (or
> > whatever it would be called) to aligned storage to check what it
> > is. Thats why I went the other way.
> 
> How big a problem is that, though?

There are quite some places where we test the actual type of a Datum
inside tuptoaster.c. Copying it to local storage everytime might
actually be noticeable performancewise. Besides the ugliness of needing
a local variable, copying the data and only testing afterwards...

> >>  I'd be a little
> >> reluctant to do it the way you propose because we might, at some
> >> point, want to try to reduce the size of toast pointers.   If you have
> >> a tuple with many attributes, the size of the TOAST pointers
> >> themselves starts to add up.  It would be nice to be able to have 8
> >> byte or even 4 byte toast pointers to handle those situations.  If we
> >> steal one or both of those lengths to mean "the data is cached in
> >> memory somewhere" then we can't use those lengths in a smaller on-disk
> >> representation, which would seem a shame.
> >
> > I agree. As I said above, having the type overlayed into the lenght was
> > and is a bad idea, I just haven't found a better one thats compatible
> > yet.
> > Except inventing typlen=-3 aka "toast2" or something. But even that
> > wouldn't help getting rid of existing pg_upgraded tables. Besides being
> > a maintenance nightmare.
> >
> > The only reasonable thing I can see us doing is renaming
> > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
> > switch that maps types into lengths. But I think I would put this off,
> > except placing a comment somewhere, until its gets necessary.
> 
> I guess I wonder how hard we think it would be to insert such a thing
> when it becomes necessary.  How much stuff is there out there that
> cares about the fact that that length is a byte?

You mean whether we could store the length in 6 bytes and use two for
the type? That should probably work as well. But I don't see much
advantage in that given that all those sizes ought to be static.
Redefining VARSIZE_1B_E as indicated above should be fairly easy, there
aren't many callsites that touch stuff at such low level.

Greetings,

Andres Freund



Re: [RFC] indirect toast tuple support

From
Robert Haas
Date:
On Tue, Feb 19, 2013 at 9:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-19 09:12:02 -0500, Robert Haas wrote:
>> On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> So the other way that we could do this is to use something that's the
>> >> same size as a TOAST pointer but has different content - the
>> >> seemingly-obvious choice being  va_toastrelid == 0.
>> >
>> > Unfortunately that would mean you need to copy the varatt_external (or
>> > whatever it would be called) to aligned storage to check what it
>> > is. Thats why I went the other way.
>>
>> How big a problem is that, though?
>
> There are quite some places where we test the actual type of a Datum
> inside tuptoaster.c. Copying it to local storage everytime might
> actually be noticeable performancewise. Besides the ugliness of needing
> a local variable, copying the data and only testing afterwards...

Hrm, OK.

>> >>  I'd be a little
>> >> reluctant to do it the way you propose because we might, at some
>> >> point, want to try to reduce the size of toast pointers.   If you have
>> >> a tuple with many attributes, the size of the TOAST pointers
>> >> themselves starts to add up.  It would be nice to be able to have 8
>> >> byte or even 4 byte toast pointers to handle those situations.  If we
>> >> steal one or both of those lengths to mean "the data is cached in
>> >> memory somewhere" then we can't use those lengths in a smaller on-disk
>> >> representation, which would seem a shame.
>> >
>> > I agree. As I said above, having the type overlayed into the lenght was
>> > and is a bad idea, I just haven't found a better one thats compatible
>> > yet.
>> > Except inventing typlen=-3 aka "toast2" or something. But even that
>> > wouldn't help getting rid of existing pg_upgraded tables. Besides being
>> > a maintenance nightmare.
>> >
>> > The only reasonable thing I can see us doing is renaming
>> > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
>> > switch that maps types into lengths. But I think I would put this off,
>> > except placing a comment somewhere, until its gets necessary.
>>
>> I guess I wonder how hard we think it would be to insert such a thing
>> when it becomes necessary.  How much stuff is there out there that
>> cares about the fact that that length is a byte?
>
> You mean whether we could store the length in 6 bytes and use two for
> the type? That should probably work as well. But I don't see much
> advantage in that given that all those sizes ought to be static.
> Redefining VARSIZE_1B_E as indicated above should be fairly easy, there
> aren't many callsites that touch stuff at such low level.

/me blinks.

No, that's not what I meant.  I meant: how hard it would be to
redefine VARSIZE_1B_E along the lines you suggest?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [RFC] indirect toast tuple support

From
Andres Freund
Date:
On 2013-02-20 10:16:45 -0500, Robert Haas wrote:
> On Tue, Feb 19, 2013 at 9:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-02-19 09:12:02 -0500, Robert Haas wrote:
> >> On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> >>  I'd be a little
> >> >> reluctant to do it the way you propose because we might, at some
> >> >> point, want to try to reduce the size of toast pointers.   If you have
> >> >> a tuple with many attributes, the size of the TOAST pointers
> >> >> themselves starts to add up.  It would be nice to be able to have 8
> >> >> byte or even 4 byte toast pointers to handle those situations.  If we
> >> >> steal one or both of those lengths to mean "the data is cached in
> >> >> memory somewhere" then we can't use those lengths in a smaller on-disk
> >> >> representation, which would seem a shame.
> >> >
> >> > I agree. As I said above, having the type overlayed into the lenght was
> >> > and is a bad idea, I just haven't found a better one thats compatible
> >> > yet.
> >> > Except inventing typlen=-3 aka "toast2" or something. But even that
> >> > wouldn't help getting rid of existing pg_upgraded tables. Besides being
> >> > a maintenance nightmare.
> >> >
> >> > The only reasonable thing I can see us doing is renaming
> >> > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
> >> > switch that maps types into lengths. But I think I would put this off,
> >> > except placing a comment somewhere, until its gets necessary.
> >>
> >> I guess I wonder how hard we think it would be to insert such a thing
> >> when it becomes necessary.  How much stuff is there out there that
> >> cares about the fact that that length is a byte?
> >
> > You mean whether we could store the length in 6 bytes and use two for
> > the type? That should probably work as well. But I don't see much
> > advantage in that given that all those sizes ought to be static.
> > Redefining VARSIZE_1B_E as indicated above should be fairly easy, there
> > aren't many callsites that touch stuff at such low level.
> 
> /me blinks.
> 
> No, that's not what I meant.  I meant: how hard it would be to
> redefine VARSIZE_1B_E along the lines you suggest?

Should be pretty easy. Will do so for the next revision.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [RFC] indirect toast tuple support

From
Greg Stark
Date:
On Tue, Feb 19, 2013 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The only reasonable thing I can see us doing is renaming
> varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
> switch that maps types into lengths. But I think I would put this off,
> except placing a comment somewhere, until its gets necessary.

Is there any reason to make it a switch before we actually have two
types that happen to have the same length?

It might make the code clearer if there was an enum with the (one)
type listed but as long as all the enum values happen to have the
value of the length of the struct then it makes heap_form_tuple and
heap_deform_tuple marginally faster. (unless gcc can optimize away the
whole switch statement which might be plausible, especially if it's
just a few ?: expressions)

-- 
greg



Re: [RFC] indirect toast tuple support

From
Greg Stark
Date:
On Thu, Feb 21, 2013 at 2:32 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Feb 19, 2013 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The only reasonable thing I can see us doing is renaming
>> varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a
>> switch that maps types into lengths. But I think I would put this off,
>> except placing a comment somewhere, until its gets necessary.
>
> Is there any reason to make it a switch before we actually have two
> types that happen to have the same length?
>
> It might make the code clearer if there was an enum with the (one)
> type listed but as long as all the enum values happen to have the
> value of the length of the struct then it makes heap_form_tuple and
> heap_deform_tuple marginally faster. (unless gcc can optimize away the
> whole switch statement which might be plausible, especially if it's
> just a few ?: expressions)

For what it's worth much of this was discussed at the time. I
originally wrote it as an enum and Tom changed it to a length byte,
specifically for performance reasons, and said we could always change
it back to an enum where some of the values just happened to be equal
to their length if we needed it:

http://www.postgresql.org/message-id/flat/82tzp7bbbh.fsf@mid.bfk.de#82tzp7bbbh.fsf@mid.bfk.de


-- 
greg