Thread: [RFC] indirect toast tuple support
Hi, During logical decoding toast tuples are decoded separately from the main tuple. Works nicely. To make the main table's HeapTuple actually easily useable the tuple needs to be "reconstructed" to not point to disk anymore but to the separately reconstructed tuples. There are two ways to do this: a) build a new HeapTuple that contains all formerly toasted datums inline (i.e. !VARATT_IS_EXTERNAL) b) add support for external toast tuples that point into separately allocated memory instead of a toast table a) has the problem that that the flattened HeapTuple can be bigger than our maximal allocation size which seems like an awkward restriction... Given that there have been wishes to support something like b) for quite some time, independent from logical decoding, it seems like a good idea to add support for it. Its e.g. useful for avoiding repeated detoasting or decompression of tuples. The problem with b) is that there is no space in varlena's flag bits to directly denote that a varlena points into memory instead of either directly containing the data or a varattrib_1b_e containing a varatt_external pointing to an on-disk toasted tuple. I propose extending the EXTERNAL varlenas to be able to point to memory instead just to disk. It seem apt to use EXTERNAL for this as they aren't stored in the normal heap tuple but somewhere else. Unfortunately there is no backward-compatible flag space in varattrib_1b_e either to nicely denote this and we sure don't want to break on-disk compatibility for this. Thus I propose to distinguish on-disk and in-memory tuples via the varattrib_1b_e.va_len_1be. The attached (RFC, not fully ready!) patch adds the following stuff to the public interfaces: typedef struct varatt_indirect { struct varlena *pointer; /* Pointer to in-memory varlena */ } varatt_indirect; ... #define VARATT_IS_EXTERNAL(PTR) VARATT_IS_1B_E(PTR) #define VARATT_IS_EXTERNAL_TOAST(PTR) \ (VARATT_IS_EXTERNAL(PTR) && VARSIZE_EXTERNAL(PTR) == TOAST_POINTER_SIZE) #define VARATT_IS_EXTERNAL_INDIRECT(PTR) \ (VARATT_IS_EXTERNAL(PTR) && VARSIZE_EXTERNAL(PTR) == INDIRECT_POINTER_SIZE) I don't like to make the distinction through the size but I don't have a better idea. And hey, its toast/varlena stuff, who expects cleanliness ;) Existing code doesn't need to care whether a EXTERNAL datum is TOAST or INDIRECT, that's handled transparently in tuptoaster.c. All EXTERNAL tuples need to go through there anyway, so that seems fine. Currently toast_fetch_datum() in tuptoaster.c does part of the gruntwork for this as it was the easiest location but I think it might be better to spread the work to some more callsites of it for clarity's sake. Opinions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-02-16 17:42:31 +0100, Andres Freund wrote: > +/* ---------- > + * toast_datum_differs - > + * > + * Determine whether two toasted datums are the same and don't have to be > + * stored again. > + * ---------- > + */ > +static bool > +toast_datum_differs(struct varlena *old_value, struct varlena *new_value) > +{ > + Assert(VARATT_IS_EXTERNAL(old_value)); > + Assert(VARATT_IS_EXTERNAL(new_value)); > + > + /* fast path for the common case where we have the toast oid available */ > + if (VARATT_IS_EXTERNAL_TOAST(old_value) && > + VARATT_IS_EXTERNAL_TOAST(new_value)) > + return memcmp((char *) old_value, (char *) new_value, > + VARSIZE_EXTERNAL(old_value)) == 0; > ... > + /* compare payload, we're fine with unaligned data */ > + return memcmp(VARDATA_ANY(old_value), VARDATA_ANY(new_value), > + VARSIZE_ANY_EXHDR(old_value)) == 0; > +} Those == need to be !=. Comes from changing the meaning of a function last minute without an easy way to test (it just uselessly emits a new toast tuple when nothing changed). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 16, 2013 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I propose extending the EXTERNAL varlenas to be able to point to memory > instead just to disk. It seem apt to use EXTERNAL for this as they > aren't stored in the normal heap tuple but somewhere else. > Unfortunately there is no backward-compatible flag space in > varattrib_1b_e either to nicely denote this and we sure don't want to > break on-disk compatibility for this. Thus I propose to distinguish > on-disk and in-memory tuples via the varattrib_1b_e.va_len_1be. The intention was to use va_len_1be to allow extensibility with different external toast types. We were originally not going to have it at all and just before committing we became concerned that we wanted to avoid painting ourselves into a corner where we wouldn't be able to come up with any new formats for external toast types. So we added this one byte. i suggest thinking of it more as a "type" field that happens to be the length of the toast pointer by convention than an actual length header. Just as an example I have at various times proposed a column compression method that would store a dictionary of common values and the toast pointer would be a pointer to this dictionary and an index into it. I have no problem using it for this case since it's using up only one particular value for this field. As long as you don't want to use up all the possible values for a single type of external pointer it seems in line with the plan. -- greg
On Sat, Feb 16, 2013 at 11:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Given that there have been wishes to support something like b) for quite > some time, independent from logical decoding, it seems like a good idea > to add support for it. Its e.g. useful for avoiding repeated detoasting > or decompression of tuples. > > The problem with b) is that there is no space in varlena's flag bits to > directly denote that a varlena points into memory instead of either > directly containing the data or a varattrib_1b_e containing a > varatt_external pointing to an on-disk toasted tuple. So the other way that we could do this is to use something that's the same size as a TOAST pointer but has different content - the seemingly-obvious choice being va_toastrelid == 0. I'd be a little reluctant to do it the way you propose because we might, at some point, want to try to reduce the size of toast pointers. If you have a tuple with many attributes, the size of the TOAST pointers themselves starts to add up. It would be nice to be able to have 8 byte or even 4 byte toast pointers to handle those situations. If we steal one or both of those lengths to mean "the data is cached in memory somewhere" then we can't use those lengths in a smaller on-disk representation, which would seem a shame. But having said that, +1 on the general idea of getting something like this done. We really need a better infrastructure to avoid copying large values around repeatedly in memory - a gigabyte is a lot of data to be slinging around. Of course, you will not be surprised to hear that I think this is 9.4 material. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-02-19 08:48:05 -0500, Robert Haas wrote: > On Sat, Feb 16, 2013 at 11:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Given that there have been wishes to support something like b) for quite > > some time, independent from logical decoding, it seems like a good idea > > to add support for it. Its e.g. useful for avoiding repeated detoasting > > or decompression of tuples. > > > > The problem with b) is that there is no space in varlena's flag bits to > > directly denote that a varlena points into memory instead of either > > directly containing the data or a varattrib_1b_e containing a > > varatt_external pointing to an on-disk toasted tuple. > > So the other way that we could do this is to use something that's the > same size as a TOAST pointer but has different content - the > seemingly-obvious choice being va_toastrelid == 0. Unfortunately that would mean you need to copy the varatt_external (or whatever it would be called) to aligned storage to check what it is. Thats why I went the other way. Its a bit sad that varatt_1b_e only contains a length and not a type byte. I would like to change the storage of existing toast types but thats not going to work for pg_upgrade reasons... > I'd be a little > reluctant to do it the way you propose because we might, at some > point, want to try to reduce the size of toast pointers. If you have > a tuple with many attributes, the size of the TOAST pointers > themselves starts to add up. It would be nice to be able to have 8 > byte or even 4 byte toast pointers to handle those situations. If we > steal one or both of those lengths to mean "the data is cached in > memory somewhere" then we can't use those lengths in a smaller on-disk > representation, which would seem a shame. I agree. As I said above, having the type overlayed into the lenght was and is a bad idea, I just haven't found a better one thats compatible yet. Except inventing typlen=-3 aka "toast2" or something. But even that wouldn't help getting rid of existing pg_upgraded tables. Besides being a maintenance nightmare. The only reasonable thing I can see us doing is renaming varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a switch that maps types into lengths. But I think I would put this off, except placing a comment somewhere, until its gets necessary. > But having said that, +1 on the general idea of getting something like > this done. We really need a better infrastructure to avoid copying > large values around repeatedly in memory - a gigabyte is a lot of data > to be slinging around. > > Of course, you will not be surprised to hear that I think this is 9.4 material. Yes, obviously. But I need time to actually propose a working patch (I already found 2 bugs in what I had submitted), thats why I brought it up now. No point in wasting time if there's an oviously better idea around. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> So the other way that we could do this is to use something that's the >> same size as a TOAST pointer but has different content - the >> seemingly-obvious choice being va_toastrelid == 0. > > Unfortunately that would mean you need to copy the varatt_external (or > whatever it would be called) to aligned storage to check what it > is. Thats why I went the other way. How big a problem is that, though? >> I'd be a little >> reluctant to do it the way you propose because we might, at some >> point, want to try to reduce the size of toast pointers. If you have >> a tuple with many attributes, the size of the TOAST pointers >> themselves starts to add up. It would be nice to be able to have 8 >> byte or even 4 byte toast pointers to handle those situations. If we >> steal one or both of those lengths to mean "the data is cached in >> memory somewhere" then we can't use those lengths in a smaller on-disk >> representation, which would seem a shame. > > I agree. As I said above, having the type overlayed into the lenght was > and is a bad idea, I just haven't found a better one thats compatible > yet. > Except inventing typlen=-3 aka "toast2" or something. But even that > wouldn't help getting rid of existing pg_upgraded tables. Besides being > a maintenance nightmare. > > The only reasonable thing I can see us doing is renaming > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a > switch that maps types into lengths. But I think I would put this off, > except placing a comment somewhere, until its gets necessary. I guess I wonder how hard we think it would be to insert such a thing when it becomes necessary. How much stuff is there out there that cares about the fact that that length is a byte? >> But having said that, +1 on the general idea of getting something like >> this done. We really need a better infrastructure to avoid copying >> large values around repeatedly in memory - a gigabyte is a lot of data >> to be slinging around. >> >> Of course, you will not be surprised to hear that I think this is 9.4 material. > > Yes, obviously. But I need time to actually propose a working patch (I > already found 2 bugs in what I had submitted), thats why I brought it up > now. No point in wasting time if there's an oviously better idea around. Cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-02-19 09:12:02 -0500, Robert Haas wrote: > On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> So the other way that we could do this is to use something that's the > >> same size as a TOAST pointer but has different content - the > >> seemingly-obvious choice being va_toastrelid == 0. > > > > Unfortunately that would mean you need to copy the varatt_external (or > > whatever it would be called) to aligned storage to check what it > > is. Thats why I went the other way. > > How big a problem is that, though? There are quite some places where we test the actual type of a Datum inside tuptoaster.c. Copying it to local storage everytime might actually be noticeable performancewise. Besides the ugliness of needing a local variable, copying the data and only testing afterwards... > >> I'd be a little > >> reluctant to do it the way you propose because we might, at some > >> point, want to try to reduce the size of toast pointers. If you have > >> a tuple with many attributes, the size of the TOAST pointers > >> themselves starts to add up. It would be nice to be able to have 8 > >> byte or even 4 byte toast pointers to handle those situations. If we > >> steal one or both of those lengths to mean "the data is cached in > >> memory somewhere" then we can't use those lengths in a smaller on-disk > >> representation, which would seem a shame. > > > > I agree. As I said above, having the type overlayed into the lenght was > > and is a bad idea, I just haven't found a better one thats compatible > > yet. > > Except inventing typlen=-3 aka "toast2" or something. But even that > > wouldn't help getting rid of existing pg_upgraded tables. Besides being > > a maintenance nightmare. > > > > The only reasonable thing I can see us doing is renaming > > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a > > switch that maps types into lengths. But I think I would put this off, > > except placing a comment somewhere, until its gets necessary. > > I guess I wonder how hard we think it would be to insert such a thing > when it becomes necessary. How much stuff is there out there that > cares about the fact that that length is a byte? You mean whether we could store the length in 6 bytes and use two for the type? That should probably work as well. But I don't see much advantage in that given that all those sizes ought to be static. Redefining VARSIZE_1B_E as indicated above should be fairly easy, there aren't many callsites that touch stuff at such low level. Greetings, Andres Freund
On Tue, Feb 19, 2013 at 9:26 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-02-19 09:12:02 -0500, Robert Haas wrote: >> On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> >> So the other way that we could do this is to use something that's the >> >> same size as a TOAST pointer but has different content - the >> >> seemingly-obvious choice being va_toastrelid == 0. >> > >> > Unfortunately that would mean you need to copy the varatt_external (or >> > whatever it would be called) to aligned storage to check what it >> > is. Thats why I went the other way. >> >> How big a problem is that, though? > > There are quite some places where we test the actual type of a Datum > inside tuptoaster.c. Copying it to local storage everytime might > actually be noticeable performancewise. Besides the ugliness of needing > a local variable, copying the data and only testing afterwards... Hrm, OK. >> >> I'd be a little >> >> reluctant to do it the way you propose because we might, at some >> >> point, want to try to reduce the size of toast pointers. If you have >> >> a tuple with many attributes, the size of the TOAST pointers >> >> themselves starts to add up. It would be nice to be able to have 8 >> >> byte or even 4 byte toast pointers to handle those situations. If we >> >> steal one or both of those lengths to mean "the data is cached in >> >> memory somewhere" then we can't use those lengths in a smaller on-disk >> >> representation, which would seem a shame. >> > >> > I agree. As I said above, having the type overlayed into the lenght was >> > and is a bad idea, I just haven't found a better one thats compatible >> > yet. >> > Except inventing typlen=-3 aka "toast2" or something. But even that >> > wouldn't help getting rid of existing pg_upgraded tables. Besides being >> > a maintenance nightmare. >> > >> > The only reasonable thing I can see us doing is renaming >> > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a >> > switch that maps types into lengths. But I think I would put this off, >> > except placing a comment somewhere, until its gets necessary. >> >> I guess I wonder how hard we think it would be to insert such a thing >> when it becomes necessary. How much stuff is there out there that >> cares about the fact that that length is a byte? > > You mean whether we could store the length in 6 bytes and use two for > the type? That should probably work as well. But I don't see much > advantage in that given that all those sizes ought to be static. > Redefining VARSIZE_1B_E as indicated above should be fairly easy, there > aren't many callsites that touch stuff at such low level. /me blinks. No, that's not what I meant. I meant: how hard it would be to redefine VARSIZE_1B_E along the lines you suggest? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-02-20 10:16:45 -0500, Robert Haas wrote: > On Tue, Feb 19, 2013 at 9:26 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-02-19 09:12:02 -0500, Robert Haas wrote: > >> On Tue, Feb 19, 2013 at 9:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> >> I'd be a little > >> >> reluctant to do it the way you propose because we might, at some > >> >> point, want to try to reduce the size of toast pointers. If you have > >> >> a tuple with many attributes, the size of the TOAST pointers > >> >> themselves starts to add up. It would be nice to be able to have 8 > >> >> byte or even 4 byte toast pointers to handle those situations. If we > >> >> steal one or both of those lengths to mean "the data is cached in > >> >> memory somewhere" then we can't use those lengths in a smaller on-disk > >> >> representation, which would seem a shame. > >> > > >> > I agree. As I said above, having the type overlayed into the lenght was > >> > and is a bad idea, I just haven't found a better one thats compatible > >> > yet. > >> > Except inventing typlen=-3 aka "toast2" or something. But even that > >> > wouldn't help getting rid of existing pg_upgraded tables. Besides being > >> > a maintenance nightmare. > >> > > >> > The only reasonable thing I can see us doing is renaming > >> > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a > >> > switch that maps types into lengths. But I think I would put this off, > >> > except placing a comment somewhere, until its gets necessary. > >> > >> I guess I wonder how hard we think it would be to insert such a thing > >> when it becomes necessary. How much stuff is there out there that > >> cares about the fact that that length is a byte? > > > > You mean whether we could store the length in 6 bytes and use two for > > the type? That should probably work as well. But I don't see much > > advantage in that given that all those sizes ought to be static. > > Redefining VARSIZE_1B_E as indicated above should be fairly easy, there > > aren't many callsites that touch stuff at such low level. > > /me blinks. > > No, that's not what I meant. I meant: how hard it would be to > redefine VARSIZE_1B_E along the lines you suggest? Should be pretty easy. Will do so for the next revision. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 19, 2013 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: > The only reasonable thing I can see us doing is renaming > varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a > switch that maps types into lengths. But I think I would put this off, > except placing a comment somewhere, until its gets necessary. Is there any reason to make it a switch before we actually have two types that happen to have the same length? It might make the code clearer if there was an enum with the (one) type listed but as long as all the enum values happen to have the value of the length of the struct then it makes heap_form_tuple and heap_deform_tuple marginally faster. (unless gcc can optimize away the whole switch statement which might be plausible, especially if it's just a few ?: expressions) -- greg
On Thu, Feb 21, 2013 at 2:32 AM, Greg Stark <stark@mit.edu> wrote: > On Tue, Feb 19, 2013 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> The only reasonable thing I can see us doing is renaming >> varattrib_1b_e.va_len_1be into va_type and redefine VARSIZE_1B_E into a >> switch that maps types into lengths. But I think I would put this off, >> except placing a comment somewhere, until its gets necessary. > > Is there any reason to make it a switch before we actually have two > types that happen to have the same length? > > It might make the code clearer if there was an enum with the (one) > type listed but as long as all the enum values happen to have the > value of the length of the struct then it makes heap_form_tuple and > heap_deform_tuple marginally faster. (unless gcc can optimize away the > whole switch statement which might be plausible, especially if it's > just a few ?: expressions) For what it's worth much of this was discussed at the time. I originally wrote it as an enum and Tom changed it to a length byte, specifically for performance reasons, and said we could always change it back to an enum where some of the values just happened to be equal to their length if we needed it: http://www.postgresql.org/message-id/flat/82tzp7bbbh.fsf@mid.bfk.de#82tzp7bbbh.fsf@mid.bfk.de -- greg