Thread: Placing hints in line pointers
Notes on a longer term idea... An item pointer (also called line pointer) is used to allow an external pointer to an item, while allowing us to place the tuple that anywhere on the page. An ItemId is 4 bytes long and currently consists of (see src/include/storage/itemid.h)... typedef struct ItemIdData{ unsigned lp_off:15, /* offset to tuple (from start of page) */ lp_flags:2, /* state of item pointer, see below */ lp_len:15; /* byte length of tuple */} ItemIdData; The offset to the tuple is 15 bits, which is sufficient to point to 32768 separate byte positions, and hence why we limit ourselves to 32kB blocks. If we use 4 byte alignment for tuples, then that would mean we wouldn't ever use the lower 2 bits of lp_off, nor would we use the lower 2 bits of lp_len. They are always set at zero. (Obviously, with 8 byte alignment we would have 3 bits spare in each, but I'm looking for something that works the same on various architectures for simplicity). So my suggestion is to make lp_off and lp_len store the values in terms of 4 byte chunks, which would allow us to rework the data structure like this... typedef struct ItemIdData { unsigned lp_off:13, /* offset to tuple (from start of page), number of 4 byte chunks */ lp_xmin_hint:2, /* committed and invalid hints for xmin */ lp_flags:2, /* state of item pointer, see below */ lp_len:13; /* byte length of tuple, numberof 4 byte chunks */ lp_xmax_hint:2, /*committed and invalid hints for xmax */ } ItemIdData; i.e. we have room for 4 additional bits and we use those to put the tuple hints for xmin and xmax Doing this would have two purposes: * We wouldn't need to follow the pointer if the row is marked aborted. This would save a random memory access for that tuple * It would isolate the tuple hint values into a smaller area of the block, so we would be able to avoid the annoyance of recalculating the checksums for the whole block when a single bit changes. We wouldn't need to do a FPW when a hint changes, we would only need to take a copy of the ItemId array, which is much smaller. And it could be protected by its own checksum. (In addition, if we wanted, this could be used to extend block size to 64kB if we used 8-byte alignment for tuples) --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 2013-06-01 at 15:45 +0100, Simon Riggs wrote: > Doing this would have two purposes: > > * We wouldn't need to follow the pointer if the row is marked aborted. > This would save a random memory access for that tuple That's quite similar to LP_DEAD, right? You could potentially set this new hint sooner, which may be an advantage. > We wouldn't need to do a FPW when a hint changes, we would only need > to take a copy of the ItemId array, which is much smaller. And it > could be protected by its own checksum. I like the direction of this idea. I don't see exactly how you want to make this safe, though. During replay, we can't just replace the ItemId array, because the other contents on the page might be newer (e.g. some tuples may have been cleaned out, leaving the item pointers pointing to the wrong place). > (In addition, if we wanted, this could be used to extend block size to > 64kB if we used 8-byte alignment for tuples) I think that supporting 64KB blocks has merit. Interesting observation about the extra bits available. That would be an on-disk format change, so perhaps this is something to add to the list of "waiting for a format change"? Regards,Jeff Davis
On Mon, Jun 10, 2013 at 3:43 AM, Jeff Davis <pgsql@j-davis.com> wrote: >> We wouldn't need to do a FPW when a hint changes, we would only need >> to take a copy of the ItemId array, which is much smaller. And it >> could be protected by its own checksum. > > I like the direction of this idea. One of the previous proposals was to move all the hint bits to a dedicated area. This had a few problems: 1) Three areas would have meant we wold have needed some tricky ability to relocate the third area since the current "grow from both ends" technique doesn't scale to three. Throwing them in with the line pointers might nicely solve this. 2) We kept finding more hint bits lying around. There are hint bits in the heap page header, there are hint bits in indexes, and I thought we might have found more hint bits in the tuple headers than the 4 you note. I'm not sure this is still true incidentally, I think the PD_ALLVISIBLE flag might no longer be a hint bit? Was that the only one in the page header? Are the index hint bits also relocatable to the line pointers in the index pages? 3) The reduction in the checksum coverage. Personally I thought this was a red herring -- they're hint bits, they're whole raison d'etre is to be a performance optimization. But if you toss the line pointers in with them then I see a problem. Protecting the line pointers is critically important. A flipped bit in a line pointer could cause all kinds of weirdness. And I don't think it would be easy to protect them with their own checksum. You would have the same problems you currently have of a process updating the hint bit behind your back while calculating the checksum or after your last WAL record but before the block is flushed. Now if this is combined with the other idea -- masking out *just* the hint bits from the checksum I wonder if moving them to the line pointers doesn't make that more feasible. Since they would be in a consistent location for every line pointer, instead of having to check on each iteration if we're looking at the beginning of a tuple, and the only thing we would be counting on being correct before checking the checksum would be the number of line pointers (rather than every line pointer offset). -- greg