Thread: Variable length varlena headers redux
I've been looking at this again and had a few conversations about it. This may be easier than I had originally thought, but there's one major issue that's bugging me. Do you see any way to avoid having every user function everywhere use a new macro api instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP?

The two approaches I see are either

a) To have two sets of macros. One set, VARATT_DATA and VARATT_SIZEP, would be for constructing new tuples and would behave exactly as it does now, so you always construct a four-byte-header datum. Then in heap_form*tuple we check if you can use a shorter header and convert. VARDATA/VARSIZE would be for looking at existing datums and would interpret the header bits. This seems very fragile, since one stray call site using VARATT_DATA to find the data in an existing datum would cause random bugs that only occur rarely in certain circumstances. It would even work as long as the size is filled in with VARATT_SIZEP first, which it usually is, but fail if someone changes the order of the statements.

or b) Throw away VARATT_DATA and VARATT_SIZEP and make all user functions everywhere change over to a new macro api. That seems like a pretty big burden. It's safer, but it means every contrib module would have to be updated, and so on.

I'm hoping I'm missing something and there's a way to do this without breaking the api for every user function.

Gregory Stark <stark@enterprisedb.com> writes:
> In any case it seems a bit backwards to me. Wouldn't it be better to
> preserve bits in the case of short length words where they're precious
> rather than long ones? If we make 0xxxxxxx the 1-byte case it means ...

Well, I don't find that real persuasive: you're saying that it's important to have a 1-byte not 2-byte header for datums between 64 and 127 bytes long. Which is by definition less than a 2% savings for those values.
I think it's more important to pick bitpatterns that reduce the number of cases heap_deform_tuple has to think about while decoding the length of a field --- every "if" in that inner loop is expensive.

I realized this morning that if we are going to preserve the rule that 4-byte-header and compressed-header cases can be distinguished from the data alone, there is no reason to be very worried about whether the 2-byte cases can represent the maximal length of an in-line datum. If you want to do 16K inline (and your page is big enough for that) you can just fall back to the 4-byte-header case. So there's no real disadvantage if the 2-byte headers can only go up to 4K or so. This gives us some more flexibility in the bitpattern choices.

Another thought that occurred to me is that if we preserve the convention that a length word's value includes itself, then for a 1-byte header the bit pattern 10000000 is meaningless --- the count has to be at least 1. So one trick we could play is to take over this value as the signal for "toast pointer follows", with the assumption that the tuple-decoder code knows a priori how big a toast pointer is. I am not real enamored of this, because it certainly adds one case to the inner heap_deform_tuple loop and it'll give us problems if we ever want more than one kind of toast pointer. But it's a possibility.
Anyway, a couple of encodings that I'm thinking about now involve limiting uncompressed data to 1G (same as now), so that we can play with the first 2 bits instead of just 1:

	00xxxxxx	4-byte length word, aligned, uncompressed data (up to 1G)
	01xxxxxx	4-byte length word, aligned, compressed data (up to 1G)
	100xxxxx	1-byte length word, unaligned, TOAST pointer
	1010xxxx	2-byte length word, unaligned, uncompressed data (up to 4K)
	1011xxxx	2-byte length word, unaligned, compressed data (up to 4K)
	11xxxxxx	1-byte length word, unaligned, uncompressed data (up to 63b)

or

	00xxxxxx	4-byte length word, aligned, uncompressed data (up to 1G)
	010xxxxx	2-byte length word, unaligned, uncompressed data (up to 8K)
	011xxxxx	2-byte length word, unaligned, compressed data (up to 8K)
	10000000	1-byte length word, unaligned, TOAST pointer
	1xxxxxxx	1-byte length word, unaligned, uncompressed data (up to 127b)
			(xxxxxxx not all zero)

This second choice allows longer datums in both the 1-byte and 2-byte header formats, but it hardwires the length of a TOAST pointer and requires four cases to be distinguished in the inner loop; the first choice only requires three cases, because TOAST pointer and 1-byte header can be handled by the same rule "length is low 6 bits of byte". The second choice also loses the ability to store in-line compressed data above 8K, but that's probably an insignificant loss.

There's more than one way to do it ...

			regards, tom lane
Gregory Stark <gsstark@mit.edu> writes:
> Do you see any way to avoid having every user function everywhere
> use a new macro api instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP?

Bruce and I had a long phone conversation about that today. We concluded that there doesn't seem to be any strong reason to change the macros for *reading* a varlena value's length. The problem is with the code that *writes* a length word. The mother example is textin():

	result = (text *) palloc(len + VARHDRSZ);
	VARATT_SIZEP(result) = len + VARHDRSZ;
	memcpy(VARDATA(result), inputText, len);

There is not a lot we can do with that second line: it's going to assign the value of "len + VARHDRSZ" to something that has to be a native C variable, struct field, or at best bit field. If we replaced that line with something like

	SET_VARLENA_LEN(result, len + VARHDRSZ);

then we'd have a *whole* lot more control. For example it'd be easy to implement the previously-discussed design involving storing uncompressed length words in network byte order: SET_VARLENA_LEN does htonl() and VARSIZE does ntohl(), and nothing else in the per-datatype functions needs to change.

Another idea that we were kicking around is to make an explicit distinction between little-endian and big-endian hardware: on big-endian hardware, store the two TOAST flag bits in the MSBs as now, but on little-endian, store them in the LSBs, shifting the length value up two bits. This would probably be marginally faster than htonl/ntohl depending on hardware and compiler intelligence, but either way you get to guarantee that the flag bits are in the physically first byte, which is the critical thing needed to be able to tell the difference between compressed and uncompressed length values.

By my count there are only 170 uses of VARATT_SIZEP in the entire backend (including contrib), so this is not an especially daunting change.
It would break existing user-written functions that return varlena values, but the fix wouldn't be painful for them either. We could guarantee that every problem spot is found by removing the current definition of the macro (and renaming the underlying struct fields to get the attention of the truly unreconstructed...). The important point here is that VARSIZE() still works, so only code that creates a new varlena value need be affected, not code that examines one. Too bad we didn't get this right in the original TOAST rewrite; but hindsight is always 20/20. Right now I think we gotta absorb some pain if we want to improve the header-overhead situation, and this seems about the minimum possible amount of pain. regards, tom lane
On Mon, Feb 12, 2007 at 11:19:14PM -0500, Tom Lane wrote: > Gregory Stark <gsstark@mit.edu> writes: > By my count there are only 170 uses of VARATT_SIZEP in the entire > backend (including contrib) so this is not an especially daunting > change. It would break existing user-written functions that return > varlena values, but the fix wouldn't be painful for them either. Could the "new style" macros be back-ported to previous releases in case we do this? That way module maintainers wouldn't need to maintain two different sets of code for it - they could use the new style and just compile it against an older version of pg? //Magnus
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> If we replaced that line with something like
>
>	SET_VARLENA_LEN(result, len + VARHDRSZ);
>
> then we'd have a *whole* lot more control.

I think that part was already clear. The problem was in VARDATA. I don't really see a way around it, though. Places that fill in VARDATA before the size (formatting.c seems to be the worst case) will just have to be changed, and it'll be a fairly fragile point.

> For example it'd be easy to implement the previously-discussed design
> involving storing uncompressed length words in network byte order:
> SET_VARLENA_LEN does htonl() and VARSIZE does ntohl() and nothing else in
> the per-datatype functions needs to change. Another idea that we were
> kicking around is to make an explicit distinction between little-endian and
> big-endian hardware: on big-endian hardware, store the two TOAST flag bits
> in the MSBs as now, but on little-endian, store them in the LSBs, shifting
> the length value up two bits. This would probably be marginally faster than
> htonl/ntohl depending on hardware and compiler intelligence, but either way
> you get to guarantee that the flag bits are in the physically first byte,
> which is the critical thing needed to be able to tell the difference between
> compressed and uncompressed length values.

Actually I think neither htonl nor bitshifting the entire 4-byte word is going to really work here. Both will require 4-byte alignment. Instead I think we have to access the length byte by byte as a (char *) and do arithmetic. Since it's the pointer being passed to VARSIZE, that isn't too hard, but it might perform poorly.

> The important point here is that VARSIZE() still works, so only code that
> creates a new varlena value need be affected, not code that examines one.

So what would VARSIZE() return, the size of the payload plus VARHDRSZ regardless of what actual size the header was?
That seems like it would break the least existing code, though removing all the VARHDRSZ offsets seems like it would be cleaner.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Magnus Hagander wrote: > On Mon, Feb 12, 2007 at 11:19:14PM -0500, Tom Lane wrote: > > Gregory Stark <gsstark@mit.edu> writes: > > > By my count there are only 170 uses of VARATT_SIZEP in the entire > > backend (including contrib) so this is not an especially daunting > > change. It would break existing user-written functions that return > > varlena values, but the fix wouldn't be painful for them either. > > Could the "new style" macros be back-ported to previous releases in case > we do this? That way module maintainers wouldn't need to maintain > two different sets of code for it - they could use the new style and > just compile it against an older version of pg? Yes, Tom and I talked about this. It could appear in the next minor release of all branches. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> For example it'd be easy to implement the previously-discussed design >> involving storing uncompressed length words in network byte order: >> SET_VARLENA_LEN does htonl() and VARSIZE does ntohl() and nothing else in >> the per-datatype functions needs to change. Another idea that we were >> kicking around is to make an explicit distinction between little-endian and >> big-endian hardware: on big-endian hardware, store the two TOAST flag bits >> in the MSBs as now, but on little-endian, store them in the LSBs, shifting >> the length value up two bits. This would probably be marginally faster than >> htonl/ntohl depending on hardware and compiler intelligence, but either way >> you get to guarantee that the flag bits are in the physically first byte, >> which is the critical thing needed to be able to tell the difference between >> compressed and uncompressed length values. > > Actually I think neither htonl nor bitshifting the entire 4-byte word is going > to really work here. Both will require 4-byte alignment. Instead I think we > have to access the length byte by byte as a (char*) and do arithmetic. Since > it's the pointer being passed to VARSIZE that isn't too hard, but it might > perform poorly. We would still require all datums with a 4-byte header to be 4-byte aligned, right? When reading, you would first check if it's a compressed or uncompressed header. If compressed, read the 1 byte header, if uncompressed, read the 4-byte header and do htonl or bitshifting. No need to do htonl or bitshifting on unaligned datums. >> The important point here is that VARSIZE() still works, so only code that >> creates a new varlena value need be affected, not code that examines one. > > So what would VARSIZE() return, the size of the payload plus VARHDRSZ > regardless of what actual size the header was? 
> That seems like it would break
> the least existing code though removing all the VARHDRSZ offsets seems like it
> would be cleaner.

My vote would be to change every caller. Though there are a lot of callers, it's a very simple change. To make it possible to compile an external module against 8.2 and 8.3, you could have a simple ifdef block to map the new macro to old behavior. Or we could backport the macro definitions as Magnus suggested.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Actually I think neither htonl nor bitshifting the entire 4-byte word is going
>> to really work here. Both will require 4-byte alignment. Instead I think we
>> have to access the length byte by byte as a (char*) and do arithmetic. Since
>> it's the pointer being passed to VARSIZE that isn't too hard, but it might
>> perform poorly.
>
> We would still require all datums with a 4-byte header to be 4-byte aligned,
> right? When reading, you would first check if it's a compressed or uncompressed
> header. If compressed, read the 1 byte header, if uncompressed, read the 4-byte
> header and do htonl or bitshifting. No need to do htonl or bitshifting on
> unaligned datums.

It's not easy to require datums with 4-byte headers to be 4-byte aligned. How do you know where to look for the bits that show it's an uncompressed header if you don't know where it's aligned yet? It could be done if you rule that when you're on an unaligned byte and see a \0 you scan forward until the aligned byte. But that seems just as cpu-expensive as just doing the arithmetic. And it wastes space to boot.

I'm thinking VARSIZE would look something like:

	#define VARSIZE(datum) \
		((((uint8 *) (datum))[0] & 0x80) ? \
		 (((uint8 *) (datum))[0] & 0x7F) : \
		 (((uint8 *) (datum))[0] << 24 | ((uint8 *) (datum))[1] << 16 | \
		  ((uint8 *) (datum))[2] << 8 | ((uint8 *) (datum))[3]))

Which is effectively the same as doing ntohl except that it only works as a right-hand side -- luckily VARSIZE always has a lhs. It also works for unaligned accesses. It's going to be fairly slow, but no slower than doing an unaligned access looking at nul padding bytes.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:
> Magnus Hagander wrote:
>> Could the "new style" macros be back-ported to previous releases in case
>> we do this?

> Yes, Tom and I talked about this. It could appear in the next minor
> release of all branches.

I don't really see the point of that. Third-party authors who want their code to be backwards-compatible would do something like

	#ifndef SET_VARLENA_LEN
	#define SET_VARLENA_LEN(var,len) (VARATT_SIZEP(var) = (len))
	#endif

While we could provide this same macro in later updates of the current release branches, those authors are still going to want to include the above in their code so as to be able to compile against existing releases. Therefore there's not really much point in us doing it too.

			regards, tom lane
Gregory Stark wrote: > a) To have two sets of macros, one of which, VARATT_DATA and > VARATT_SIZEP are for constructing new tuples and behaves exactly as > it does now. So you always construct a four-byte header datum. Then > in heap_form*tuple we check if you can use a shorter header and > convert. VARDATA/VARSIZE would be for looking at existing datums and > would interpret the header bits. Has any thought been given to headers *longer* than four bytes? I don't exactly recall a flood of field reports that one gigabyte per datum is too little, but as long as the encoding of variable length data is changed, one might as well prepare a little for the future. Of course, that would put a dent into any plan that wants to normalize the header to four bytes somewhere along the way. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Tue, Feb 13, 2007 at 09:44:03AM -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Magnus Hagander wrote: > >> Could the "new style" macros be back-ported to previous releases in case > >> we do this? > > > Yes, Tom and I talked about this. It could appear in the next minor > > release of all branches. > > I don't really see the point of that. Third-party authors who want > their code to be backwards-compatible would do something like > > #ifndef SET_VARLENA_LEN > #define SET_VARLENA_LEN(var,len) (VARATT_SIZEP(var) = (len)) > #endif > > While we could provide this same macro in later updates of the current > release branches, those authors are still going to want to include the > above in their code so as to be able to compile against existing > releases. Therefore there's not really much point in us doing it too. It'd be a help to those who wouldn't be building against releases with known security issues in them for one ;-) Sure, it's not important or a dealbreaker or so, but it would be convenient. //Magnus
Gregory Stark <stark@enterprisedb.com> writes:
> I don't really see a way around it though. Places that fill in VARDATA before
> the size (formatting.c seems to be the worst case) will just have to be
> changed and it'll be a fairly fragile point.

No, we're not going there: it'd break too much code now and it'd be a continuing source of bugs for the foreseeable future. The sane way to design this is that

(1) code written to existing practice will always generate 4-byte headers. (Hence, VARDATA() acts the same as now.) That's the format that generally gets passed around in memory.

(2) creation of a short header is handled by the TOAST code just before the tuple goes to disk.

(3) replacement of a short header with a 4-byte header is considered part of de-TOASTing.

After we have that working, we can work on offering alternative macros that let specific functions avoid the overhead of conversion between 4-byte headers and short ones, in much the same way that there are TOAST macros now that let specific functions get down-and-dirty with the out-of-line TOAST representation. But first we have to get to the point where 4-byte-header datums can be distinguished from short-header datums by inspection; and that requires either network byte order in the 4-byte length word or some other change in its representation.

> Actually I think neither htonl nor bitshifting the entire 4-byte word is going
> to really work here. Both will require 4-byte alignment.

And your point is what? The 4-byte form can continue to require alignment, and *will* require it in any case, since many of the affected datatypes expect alignment of the data within the varlena. The trick is that when we are examining a non-aligned address within a tuple, we have to be able to tell whether we are looking at the first byte of a short-header datum (not aligned) or a pad byte. This is easily done, for instance by decreeing that pad bytes must be zeroes.
I think we should probably consider making use of different alignment codes for different varlena datatypes. For instance the geometry types probably will still need align 'd' since they contain doubles; this may mean that we should just punt on any short-header optimization for them. But text and friends could have align 'c' showing that they need no padding and would be perfectly happy with a nonaligned VARDATA pointer. (Actually, maybe we should only do this whole thing for 'c'-alignable data types? But NUMERIC is a bit of a problem, it'd like 's'-alignment. OTOH we could just switch NUMERIC to an all-two-byte format that's independent of TOAST per se.) regards, tom lane
Heikki Linnakangas wrote: > Gregory Stark wrote: > > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> For example it'd be easy to implement the previously-discussed design > >> involving storing uncompressed length words in network byte order: > >> SET_VARLENA_LEN does htonl() and VARSIZE does ntohl() and nothing else in > >> the per-datatype functions needs to change. Another idea that we were > >> kicking around is to make an explicit distinction between little-endian and > >> big-endian hardware: on big-endian hardware, store the two TOAST flag bits > >> in the MSBs as now, but on little-endian, store them in the LSBs, shifting > >> the length value up two bits. This would probably be marginally faster than > >> htonl/ntohl depending on hardware and compiler intelligence, but either way > >> you get to guarantee that the flag bits are in the physically first byte, > >> which is the critical thing needed to be able to tell the difference between > >> compressed and uncompressed length values. > > > > Actually I think neither htonl nor bitshifting the entire 4-byte word is going > > to really work here. Both will require 4-byte alignment. Instead I think we > > have to access the length byte by byte as a (char*) and do arithmetic. Since > > it's the pointer being passed to VARSIZE that isn't too hard, but it might > > perform poorly. > > We would still require all datums with a 4-byte header to be 4-byte > aligned, right? When reading, you would first check if it's a compressed > or uncompressed header. If compressed, read the 1 byte header, if > uncompressed, read the 4-byte header and do htonl or bitshifting. No > need to do htonl or bitshifting on unaligned datums. I am not sure how to handle the alignment issue. If we require 1-byte headers to be 4-byte aligned, we lose a lot of the benefits of the 1-byte header. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Heikki Linnakangas wrote: >> We would still require all datums with a 4-byte header to be 4-byte >> aligned, right? When reading, you would first check if it's a compressed >> or uncompressed header. If compressed, read the 1 byte header, if >> uncompressed, read the 4-byte header and do htonl or bitshifting. No >> need to do htonl or bitshifting on unaligned datums. > > I am not sure how to handle the alignment issue. If we require 1-byte > headers to be 4-byte aligned, we lose a lot of the benefits of the > 1-byte header. Why would we require that? I don't see a problem with having 4-byte header 4-byte aligned, and 1-byte headers not aligned. The first access to the header is to check if it's a 4 or 1 byte header. That's a 1 byte wide access, requiring no alignment. After that you know which one it is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Heikki Linnakangas wrote: > >> We would still require all datums with a 4-byte header to be 4-byte > >> aligned, right? When reading, you would first check if it's a compressed > >> or uncompressed header. If compressed, read the 1 byte header, if > >> uncompressed, read the 4-byte header and do htonl or bitshifting. No > >> need to do htonl or bitshifting on unaligned datums. > > > > I am not sure how to handle the alignment issue. If we require 1-byte > > headers to be 4-byte aligned, we lose a lot of the benefits of the > > 1-byte header. > > Why would we require that? > > I don't see a problem with having 4-byte header 4-byte aligned, and > 1-byte headers not aligned. The first access to the header is to check > if it's a 4 or 1 byte header. That's a 1 byte wide access, requiring no > alignment. After that you know which one it is. But if you are walking through attributes, how do you know to look at the next byte or the next aligned byte? We have to force zeros in there? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> But if you are walking through attributes, how do you know to look at
> the next byte or the next aligned byte? We have to force zeros in
> there?

Yup: pad bytes must be zeroes (they are already) and a 1-byte header can't be a zero (easily done if its length includes itself). So the tuple-walking code would do something like

	if (looking-at-a-zero && not-at-4-byte-boundary)
		advance to next 4-byte boundary;
	check current byte to determine if 1-byte or 4-byte header;

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: > > I don't really see a way around it though. Places that fill in VARDATA before > > the size (formatting.c seems to be the worst case) will just have to be > > changed and it'll be a fairly fragile point. > > No, we're not going there: it'd break too much code now and it'd be a > continuing source of bugs for the foreseeable future. The sane way to > design this is that > > (1) code written to existing practice will always generate 4-byte > headers. (Hence, VARDATA() acts the same as now.) That's the format > that generally gets passed around in memory. So then we don't need to replace VARSIZE with SET_VARLENA_LEN at all. > (2) creation of a short header is handled by the TOAST code just before > the tuple goes to disk. > > (3) replacement of a short header with a 4-byte header is considered > part of de-TOASTing. So (nigh) every tuple will get deformed and reformed once before it goes to disk? Currently the toast code doesn't even look at a tuple if it's small enough, but in this case we would want it to fire even on very narrow rows. One design possibility I considered was doing this in heap_deform_tuple and heap_form_tuple. Basically skipping the extra deform/form_tuple cycle in the toast code. I had considered having heap_deform_tuple palloc copies of these data before returning them. But that has the same problems. The other problem is that there may be places in the code that receive a datum from someplace where they have every right to expect it not to be toasted. For example, plpgsql deforming a tuple they just formed, or even as the return value from a function. They might be quite surprised to receive a toasted tuple. Note also that that's going to force us to palloc and memcpy these data. Are there going to be circumstances where existing code where this changes the memory context lifetime of some data? 
If, for example, something like the inet code knows its arguments can never be large enough to get toasted and doesn't do a FREE_IF_COPY on its btree operator arguments.

> After we have that working, we can work on offering alternative macros
> that let specific functions avoid the overhead of conversion between
> 4-byte headers and short ones, in much the same way that there are TOAST
> macros now that let specific functions get down-and-dirty with the
> out-of-line TOAST representation. But first we have to get to the point
> where 4-byte-header datums can be distinguished from short-header datums
> by inspection; and that requires either network byte order in the 4-byte
> length word or some other change in its representation.
>
>> Actually I think neither htonl nor bitshifting the entire 4-byte word is going
>> to really work here. Both will require 4-byte alignment.
>
> And your point is what? The 4-byte form can continue to require
> alignment, and *will* require it in any case, since many of the affected
> datatypes expect alignment of the data within the varlena. The trick is
> that when we are examining a non-aligned address within a tuple, we have
> to be able to tell whether we are looking at the first byte of a
> short-header datum (not aligned) or a pad byte. This is easily done,
> for instance by decreeing that pad bytes must be zeroes.

Well, if we're doing it in toast then the alignment of the payload really doesn't matter at all. It'll be realigned after detoasting anyway. What I had had in mind was to prohibit using smaller headers than the alignment of the data type. But that was on the assumption we would continue to use the compressed header in memory and not copy it.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> (1) code written to existing practice will always generate 4-byte >> headers. (Hence, VARDATA() acts the same as now.) That's the format >> that generally gets passed around in memory. > So then we don't need to replace VARSIZE with SET_VARLENA_LEN at all. Yes, we do, because we have to alter the representation of 4-byte headers. Otherwise we can't tell which header format a datum is using. > So (nigh) every tuple will get deformed and reformed once before it goes to > disk? Currently the toast code doesn't even look at a tuple if it's small > enough, but in this case we would want it to fire even on very narrow rows. I'd be inclined to put the intelligence into heap_form_tuple and thereby avoid getting the TOAST code involved unless there are wide fields to deal with. > What I had had in mind was to prohibit using smaller headers than the > alignment of the data type. But that was on the assumption we would continue > to use the compressed header in memory and not copy it. Well, it wouldn't be too unreasonable to limit this whole mechanism to datatypes that have no alignment requirements on the *content* of their datums; which in practice is probably just text/varchar/char and perhaps inet. regards, tom lane
Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> I don't really see a way around it though. Places that fill in VARDATA before
>>> the size (formatting.c seems to be the worst case) will just have to be
>>> changed and it'll be a fairly fragile point.
>>
>> No, we're not going there: it'd break too much code now and it'd be a
>> continuing source of bugs for the foreseeable future. The sane way to
>> design this is that
>>
>> (1) code written to existing practice will always generate 4-byte
>> headers. (Hence, VARDATA() acts the same as now.) That's the format
>> that generally gets passed around in memory.
>
> So then we don't need to replace VARSIZE with SET_VARLENA_LEN at all.
>
>> (2) creation of a short header is handled by the TOAST code just before
>> the tuple goes to disk.
>>
>> (3) replacement of a short header with a 4-byte header is considered
>> part of de-TOASTing.
>
> So (nigh) every tuple will get deformed and reformed once before it goes to
> disk? Currently the toast code doesn't even look at a tuple if it's small
> enough, but in this case we would want it to fire even on very narrow rows.

One weird idea I had was that the macros can read 1 and 4-byte headers, but can only create 4-byte headers. The code that writes to the shared buffer pages would do the compression from 4 bytes down to 1 as needed. This might avoid changing any macros. It also allows us to carry around 4-byte headers in memory, which I think might be more efficient. I am not sure if I have heard this idea proposed already or not.

--
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
EnterpriseDB                               http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Tom Lane <tgl@sss.pgh.pa.us> writes:
>> So (nigh) every tuple will get deformed and reformed once before it goes to
>> disk? Currently the toast code doesn't even look at a tuple if it's small
>> enough, but in this case we would want it to fire even on very narrow rows.
>
> I'd be inclined to put the intelligence into heap_form_tuple and thereby
> avoid getting the TOAST code involved unless there are wide fields to
> deal with.

And have heap_deform_tuple / heap_getattr palloc and memcpy the datum on the way out? Or wait until detoast time and then do it?

If we do it on the way out of the heaptuple then we could have a rule that headers are always compressed in a tuple and always uncompressed out of a tuple.

But I'm surprised, I had the impression that you were reluctant to consider memcpying all the data all the time.

--
greg
Greg Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > > So (nigh) every tuple will get deformed and reformed once before it goes to > > > disk? Currently the toast code doesn't even look at a tuple if it's small > > > enough, but in this case we would want it to fire even on very narrow rows. > > > > I'd be inclined to put the intelligence into heap_form_tuple and thereby > > avoid getting the TOAST code involved unless there are wide fields to > > deal with. > > And have heap_deform_tuple / heap_getattr palloc and memcpy the datum on > the way out? Or wait until detoast time and then do it? > > If we do it on the way out of the heaptuple then we could have a rule that > headers are always compressed in a tuple and always uncompressed out of a > tuple. > > But I'm surprised, I had the impression that you were reluctant to consider > memcpying all the data all the time. Uh, if the macros can read 1 and 4-byte headers, why do we need to allocate memory for them? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I'd be inclined to put the intelligence into heap_form_tuple and thereby >> avoid getting the TOAST code involved unless there are wide fields to >> deal with. > And have heap_deform_tuple / heap_getattr palloc and memcpy the datum on > the way out? Or wait until detoast time and then do it? No, heap_deform_tuple / heap_getattr are not responsible for palloc'ing anything, only for computing appropriate pointers into the tuple. Existing functions that use PG_DETOAST_DATUM would incur a palloc to produce a 4-byte-header version of a short-header datum. We could then work on modifying one function at a time to use some alternative macro that doesn't force a useless palloc, but the system wouldn't be broken meanwhile; and only the high-traffic functions would be worth fixing at all. The point I'm trying to get across here is to do things one small step at a time; if you insist on a "big bang" patch then it'll never get done. You might want to go back and review the CVS history for some other big changes like TOAST and the version-1 function-call protocol to see our previous uses of this approach. regards, tom lane
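A rough illustration of the behavior Tom proposes for PG_DETOAST_DATUM: a short-header datum gets expanded into a freshly allocated 4-byte-header copy, so downstream code sees only the conventional format. The bit layout is hypothetical (low bit set marks a 1-byte header; 4-byte lengths shifted left two bits, low byte first), the function names are invented, and malloc stands in for palloc:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 1-byte header: low bit set, total length (header included)
 * in the upper 7 bits.  4-byte header: total length << 2, low byte first. */

static int is_short_header(const unsigned char *p) { return p[0] & 0x01; }

/* The de-TOASTing rule: hand back a conventional 4-byte-header copy, so
 * code using today's VARDATA/VARSIZE keeps working unchanged. */
static unsigned char *expand_short_datum(const unsigned char *p)
{
    uint32_t datalen = (uint32_t)(p[0] >> 1) - 1;  /* strip the 1-byte header */
    unsigned char *out = malloc(datalen + 4);
    uint32_t v = (datalen + 4) << 2;               /* rebuild a 4-byte header */
    out[0] = v; out[1] = v >> 8; out[2] = v >> 16; out[3] = v >> 24;
    memcpy(out + 4, p + 1, datalen);
    return out;
}
```

The palloc cost this implies is exactly what the later per-function macro conversions would whittle away.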
"Bruce Momjian" <bruce@momjian.us> writes: >> But I'm surprised, I had the impression that you were reluctant to consider >> memcpying all the data all the time. > > Uh, if the macros can read 1 and 4-byte headers, why do we need to > allocate memory for them? Because the macros can't read both 1 and 4-byte headers. If they could we would have the problem with VARDATA for code sites that write to the data before they write the size. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
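The ordering hazard Greg describes can be demonstrated with a stripped-down varlena. The struct and macros here are simplified stand-ins for the real ones (and malloc stands in for palloc); the point is that today's VARDATA can locate the data even before the size is written, which a header-bit-interpreting VARDATA could not:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VARHDRSZ 4

/* Simplified stand-in for the real varlena struct. */
typedef struct {
    uint32_t vl_len;
    char     vl_dat[1];
} my_varlena;

/* Today's macros assume a fixed 4-byte header, so the data offset is known
 * even while the header is still uninitialized. */
#define VARDATA(v)      ((v)->vl_dat)
#define VARATT_SIZEP(v) ((v)->vl_len)

/* The common construction idiom: data first, size afterwards.  If VARDATA
 * had to inspect header bits to decide whether the data starts 1 or 4 bytes
 * in, the memcpy below would be consulting garbage. */
static my_varlena *make_varlena(const char *src, uint32_t len)
{
    my_varlena *result = malloc(VARHDRSZ + len);
    memcpy(VARDATA(result), src, len);      /* header not yet written */
    VARATT_SIZEP(result) = len + VARHDRSZ;  /* size filled in last */
    return result;
}
```

formatting.c, mentioned earlier in the thread, is full of exactly this pattern, which is why a silently re-interpreting VARDATA is so dangerous.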
Gregory Stark wrote: > > "Bruce Momjian" <bruce@momjian.us> writes: > > >> But I'm surprised, I had the impression that you were reluctant to consider > >> memcpying all the data all the time. > > > > Uh, if the macros can read 1 and 4-byte headers, why do we need to > > allocate memory for them? > > Because the macros can't read 1 and 4-byte headers. If they could we would > have the problem with VARDATA for code sites that write to the data before > they write the size. OK, so what if we always create 4-byte headers, and read both. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > The point I'm trying to get across here is to do things one small step > at a time; if you insist on a "big bang" patch then it'll never get > done. You might want to go back and review the CVS history for some > other big changes like TOAST and the version-1 function-call protocol > to see our previous uses of this approach. Believe me I'm not looking for a "big bang" approach. It's just that I've found problems with any of the incremental approaches. I suppose it's inevitable that "big bang" approaches always seem like the most perfect since they automatically involve fixing anything that could be a problem. You keep suggesting things that I've previously considered and rejected -- perhaps prematurely. Off the top of my head I recall the following four options from our discussions. It looks like we're circling around option 4. As long as we're willing to live with the palloc/memcpy overhead at least for now given that we can reduce it by whittling away at the sites that use the old macros, that seems like a good compromise and the shortest development path except perhaps option 1. And I take it you're not worried about sites that might not detoast a datum or detoast one in the wrong memory context where previously they were guaranteed it wouldn't generate a copy? In particular I'm worried about btree code and plpgsql row/record handling. I'm not sure what to do about the alignment issue. We could just never align 1-byte headers. That would work just fine as long as the data types that need alignment don't get ported to the new macros. It seems silly to waste space on disk in order to save a cpu memcpy that we aren't even going to be saving for now anyways. Option 1) We detect cases where the typmod guarantees either a fixed size or a maximum size < 256 bytes. 
In which case, instead of copying the typlen from pg_type into the tupledesc and the table's attlen, we use the implied attlen or store -3 indicating single-byte headers everywhere.

Cons: This implies pallocing and memcpying the datum in heap_deform_tuple. It doesn't help text or varchar(xxxx) at all even if we mostly store small data in them.

Pros: It buys us 0-byte headers for things like numeric(1,0) and even char(n) on single-byte encodings. It buys us 1-byte headers for most numerics and char(n) or varchar(n).

Option 2) We have heap_form_tuple/heap_deform_tuple compress and uncompress the headers. The rule is that headers are always compressed in a tuple and never compressed when handed out as a datum.

Cons: This implies pallocing and memcpying the datum in heap_deform_tuple. It requires restricting the range of 1-byte headers to 127 or 63 bytes and always uses 1 byte even for fixed-size data. (We could get 0-byte headers for a small range (ascii characters and numeric integers up to 127) but then 1-byte headers would be down to 31-byte data.) It implies a different varlena format for on-disk and in-memory.

Pros: It works for text/varchar as long as we store small data. It lets us use network byte order or some other format for the on-disk headers without changing the macros to access in-memory headers.

Option 3) We have the toaster (or heap_form_tuple as a shortcut) compress the headers but delay decompression until pg_detoast_datum. Tuples only contain compressed headers, but Datums sometimes point to compressed headers and sometimes uncompressed headers, just as they sometimes point to toasted data and sometimes detoasted data.

Cons: This still implies pallocing and memcpying the datum, at least for now. There could be cases where code expects to deform_tuple and be guaranteed a non-toasted pointer into the tuple. Requires replacing VARATT_SIZEP with SET_VARLENA_LEN().

Pros: It allows future macros to examine the compressed datum without decompressing it. 
Option 4) We have compressed data constructed on the fly everywhere possible.

Cons: Requires replacing VARATT_SIZEP and also requires hacking VARDATA to find the data in the appropriate place. Might need an additional pair of macros for backwards compatibility in code that really needs to construct a 4-byte-headered varlena. Fragility, with the risk of VARSIZE / VARDATA being filled in out of order. Requires changing the header to be in network byte order all the time.

Pros: One consistent representation for varlena everywhere.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Bruce Momjian" <bruce@momjian.us> writes: >> Uh, if the macros can read 1 and 4-byte headers, why do we need to >> allocate memory for them? > Because the macros can't read 1 and 4-byte headers. If they could we would > have the problem with VARDATA for code sites that write to the data before > they write the size. The way I see this working is that VARDATA keeps its current behavior and hence can only be used with datums that are known to be in 4-byte-header form; hence, to avoid breaking code that uses it, PG_DETOAST_DATUM has to produce a 4-byte-header datum always. After we have the infrastructure in place, we'd make a pass over high-traffic functions to replace uses of PG_DETOAST_DATUM with something that doesn't forcibly expand 1-byte-header datums, and replace uses of VARDATA on the result with something that handles both header formats (and would be unsuitable for generating result datums, since it'd have to assume that the length is already filled in). I don't see any good reason why datatype-specific functions would ever need to generate the short-header format directly. The only point where it's worth trimming the header size is during heap_form_tuple, and we can do it there at no significant efficiency cost. So uses of VARDATA in connection with building a new datum need not be touched. I'm inclined also to suggest that VARSIZE() need only support 4-byte format: we could have a second macro that understands both formats and gets used in the same high-traffic functions in which we are replacing uses of VARDATA(). There's no benefit in making other sites support 1-byte format for VARSIZE() if they aren't going to support it for VARDATA(). regards, tom lane
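A sketch of the two-tier macro API Tom outlines: VARSIZE/VARDATA keep their 4-byte-only meaning (and so remain safe for constructing datums), while read-only variants that understand both header forms get adopted one high-traffic function at a time. The "_any" names and the bit layout (low bit set marks a 1-byte header; 4-byte lengths shifted left two bits, low byte first) are invented here; nothing had been settled:

```c
#include <stdint.h>

/* 4-byte-header-only accessor, matching today's VARSIZE semantics. */
static uint32_t varsize(const unsigned char *p)
{
    return ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
            (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24) >> 2;
}

/* Read-only accessors that understand both header forms.  These are
 * unsuitable for building a result datum, since they assume the length
 * bits are already filled in. */
static uint32_t varsize_any(const unsigned char *p)
{
    return (p[0] & 0x01) ? (uint32_t)(p[0] >> 1) : varsize(p);
}

static const unsigned char *vardata_any(const unsigned char *p)
{
    return p + ((p[0] & 0x01) ? 1 : 4);
}
```

Only functions rewritten to use both "_any" accessors together would skip the detoast-time expansion; using one without the other buys nothing, which matches Tom's point about VARSIZE.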
Hey all: This is a randomly inserted distraction to tell you that I really like to read about these ideas. No input from myself at this point. I'm happy with the direction you are taking. Thanks, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada | One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Gregory Stark <stark@enterprisedb.com> writes: > You keep suggesting things that I've previously considered and rejected -- > perhaps prematurely. Off the top of my head I recall the following four > options from our discussions. It looks like we're circling around option 4. No, I think I'm arguing for option 3. > And I take it you're not worried about sites that might not detoast a datum or > detoast one in the wrong memory context where previously they were guaranteed > it wouldn't generate a copy? In particular I'm worried about btree code and > plpgsql row/record handling. Any such site is already broken, because with the proposed design, code is only exposed to short-header datums if it is also exposed to toasted datums. It's possible that we would find a few places that hadn't gotten converted, but any such place is a must-fix anyway. In practice the TOAST support has been in there long enough that I don't think we'll find any ... at least not in the core code. It's entirely possible that there is user-defined code out there that uses varlena format and isn't careful about detoasting. This is an argument for allowing opt-out (see below). > I'm not sure what to do about the alignment issue. We could just never align > 1-byte headers. That would work just fine as long as the data types that need > alignment don't get ported to the new macros. It seems silly to waste space on > disk in order to save a cpu memcpy that we aren't even going to be saving for > now anyways. No, a datum that is in 1-byte-header format wouldn't have any special alignment inside a tuple. There are two paths to get at it: detoast it (producing an aligned, 4-byte-header, palloc'd datum) or use the still-to-be-named macros that let you access the unaligned content directly. 
I'm inclined to think that we might want to set things up so that varlena datatypes can individually opt-in or opt-out of this treatment; a datatype that needs alignment of its content might well wish to opt-out to avoid copying overhead. We could do that either with a different typlen code, or still typlen -1 but pay attention to whether typalign is 'c'. > Option 1) > We detect cases where the typmod guarantees either a fixed size or a maximum > size < 256 bytes. After last week I would veto this option anyway: it fails unless we always know typmod exactly, and I'm here to tell you we don't. regards, tom lane
Tom Lane wrote: > > We detect cases where the typmod guarantees either a fixed size or a maximum > > size < 256 bytes. > > After last week I would veto this option anyway: it fails unless we > always know typmod exactly, and I'm here to tell you we don't. If we can pull this off, it handles short values stored in TEXT fields too, which is a big win over the typmod idea I had. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> You keep suggesting things that I've previously considered and rejected -- >> perhaps prematurely. Off the top of my head I recall the following four >> options from our discussions. It looks like we're circling around option 4. > > No, I think I'm arguing for option 3. Oops, sorry, got my own numbering mixed up. >> And I take it you're not worried about sites that might not detoast a datum or >> detoast one in the wrong memory context where previously they were guaranteed >> it wouldn't generate a copy? In particular I'm worried about btree code and >> plpgsql row/record handling. > > Any such site is already broken, because with the proposed design, code > is only exposed to short-header datums if it is also exposed to toasted > datums. Well, the toaster doesn't run until we're about to put the tuple on disk. I was afraid there might be circumstances where data from in-memory tuples are returned and we use them without fearing them being toasted. In particular I was fearing the record/row handling in plpgsql. I was also worried about the tuples that indexes store. But it seems they don't go through heap_form_tuple at all. It does seem that the INDEX_TOAST_HACK would need some work -- but its existence means there can't be anyone else trusting index tuples not to contain toasted data, which is convenient. > I'm inclined to think that we might want to set things up so that > varlena datatypes can individually opt-in or opt-out of this treatment; > a datatype that needs alignment of its content might well wish to > opt-out to avoid copying overhead. We could do that either with a > different typlen code, or still typlen -1 but pay attention to whether > typalign is 'c'. Any reason to go with typalign? You can always use typalign='i' to get the regular headers. Alternatively, what does the trailing "a" in varlena signify? Would this be varlenb? 
>> Option 1) > >> We detect cases where the typmod guarantees either a fixed size or a maximum >> size < 256 bytes. > > After last week I would veto this option anyway: it fails unless we > always know typmod exactly, and I'm here to tell you we don't. Oh, heh, timing is everything I guess. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > Alternatively, what does the trailing "a" in varlena signify? Would this be > varlenb? "attribute" -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> Any such site is already broken, because with the proposed design, code >> is only exposed to short-header datums if it is also exposed to toasted >> datums. > Well the toaster doesn't run until we were about to put the tuple on disk. I > was afraid there might be circumstances where data from in-memory tuples are > returned and we use them without fearing them being toasted. In particular I > was fearing the record/row handling in plpgsql. True, which is another argument why datatype-specific functions should never return short-header datums: there are places that call such functions directly and probably expect an untoasted result. I'm not very concerned about stuff that's gone through a tuple though: in most cases that could be expected to contain toasted values anyway. regards, tom lane
On Tue, Feb 13, 2007 at 01:32:11PM -0500, Bruce Momjian wrote: > Gregory Stark wrote: > > Alternatively, what does the trailing "a" in varlena signify? Would this be > > varlenb? > > "attribute" > > -- Actually varlena stands for "variable length array." elein elein@varlena.com
I want to register, just in principle, that I object to changing the structure of a varlena. The idea behind the data type is simple, clean and fast. And it is easily understood by developers and by people developing applications and functions in PostgreSQL. Of course you will do what you will. Be careful out there. elein elein@varlena.com
elein <elein@varlena.com> writes: > I want to register, just in principle, that I object to changing > the structure of a varlena. Indeed, I'm doing my best to restrain Greg from changing what a datatype-specific function will see by default. The stuff with variable-length headers should be confined to a relatively small number of functions in which we're willing to pay a code-ugliness penalty for speed. regards, tom lane
Interesting problem, seems to be planner related:

select 1
from beitraege bei,
     b_zuordnungen bz,
     (select bei_id
        from b_zuordnungen bz, ben_zuordnungen z, strukturelemente se
       where se.id = z.str_id
         and se.sty_id = (select id from strukturtypen where code = 'GEO')
         and z.str_id = bz.str_id
         and z.ben_id = 100
      union
      select id from beitraege where kz_edit <> 'N' and useraend = 100
     ) as foo
where bz.bei_id = bei.id
  and foo.bei_id = bei.id
  and bei.red_id in (select gba.grp_id
                       from grp_ben_applikationen gba, grp_gruppen grp
                      where grp.id = gba.grp_id
                        and grp.kz_aktiv = 'J'
                        and gba.app_id in (select id from grp_applikationen
                                            where code in ('app1', 'app2'))
                        and gba.ben_id = 100)
  and (bei.bei_id_frei is null or bei.kz_edit = 'N')
  and (bei.bt_id, bz.str_id) in ((96,1259036), (96,2688382))
  and bei.red_id = 1777712
  and bei.id in (select bzu.bei_id
                   from b_zuordnungen bzu, strukturelemente se
                  where bzu.str_id = se.id and se.id = 1773715)
  and bei.id = 10157309;

ERROR: failed to build any 8-way joins

Interesting: removing any of the above where conditions makes the problem go away, e.g. removing "and bei.id=10157309". Tested with 8.2.1 and 8.1.4, same effect on both systems. Because of the large number of tables involved it's difficult to find a self-contained test case, but if necessary I'll give it a try. I could give 8.2.3 a try, but I doubt this will help. Any ideas? Regards Mario Weilguni
Mario Weilguni wrote: > Interesting: removing any of the above where conditions makes the problem go away, e.g. removing "and bei.id=10157309". > > Tested with 8.2.1 and 8.1.4, same effect on both systems. Because of the large number of tables involved it's difficult to find a self-contained test case, but if necessary I'll give it a try. > I could give 8.2.3 a try, but I doubt this will help. I think a similar problem was fixed after 8.2.3 was released, so you may want to check out the REL8_2_STABLE branch. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Thanks for the info. Is there a fix for the 8.1 branch? The production system is still 8.1. Regards, Mario Weilguni On Thursday, 15 February 2007 16:25, Alvaro Herrera wrote: > Mario Weilguni wrote: > > Interesting: removing any of the above where conditions makes the problem > > go away, e.g. removing "and bei.id=10157309". > > > > Tested with 8.2.1 and 8.1.4, same effect on both systems. Because of the > > large number of tables involved it's difficult to find a self-contained > > test case, but if necessary I'll give it a try. I could give 8.2.3 a try, but > > I doubt this will help. > > I think a similar problem was fixed after 8.2.3 was released, so you may > want to check out the REL8_2_STABLE branch.
Mario Weilguni <mweilguni@sime.com> writes: > ERROR: failed to build any 8-way joins Could you provide a self-contained test case for this? You probably don't need any data, just the table schemas. I fixed a problem with a similar symptom a couple days ago, but that was in logic that was new in 8.2 ... regards, tom lane