Thread: On-disk Tuple Size
[I've moved this discussion about changing the line pointer from four
bytes to two from -general to -hackers, since it's fairly technical. The
entire message Tom is responding to is appended to this one.]

On Sat, 20 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > ... Then we could declare that all tuples must be aligned on a
> > four-byte boundary, use the top 14 bits of a 16-bit line pointer as
> > the address, and the bottom two bits for the LP_USED and LP_DELETED
> > flags. This would slightly simplify the code for determining the
> > flags, and incidentally boost the maximum page size to 64K.
>
> Hmm. Maybe, but the net effect would only be to reduce the minimum row
> overhead from 36 to 34 bytes. Not sure it's worth worrying about.

Well, unless the implementation is hideously complex, I'd say that every
byte is worth worrying about, given the amount of overhead that's
currently there. Going from 36 to 34 bytes could give something
approaching a 5% performance increase for tables with short rows.

(Actually, do we prefer the tables/rows or relations/tuples terminology
here? I guess I kinda tend to use the latter for physical stuff.)

If we could drop the OID from the tuple when it's not being used, that
would be another four bytes, bringing the performance increase up
towards 15% on tables with short rows. Of course I understand that all
this is contingent not only on such changes being acceptable, but on
someone actually caring enough to write them.

While we're at it, would someone have the time to explain to me how the
on-disk CommandIds are used? A quick look at the code indicates that
they are used for cursor consistency, among other things, but it's
still a bit mysterious to me.

> > ... I don't see why we would then
> > need the LP_DELETED flag at all.
>
> I believe we do want to distinguish three states: live tuple, dead
> tuple, and empty space.
> Otherwise there will be cases where you're forced to move data
> immediately to collapse empty space, when there's not a good reason to
> except that your representation can't cope.

I don't understand this. Why do you need to collapse empty space
immediately? Why not just wait until you can't find an empty fragment in
the page that's big enough, and then do the collapse?

Oh, on a final unrelated note, <john@akadine.com>, you're bouncing mail
from my host for reasons not well explained ("550 Access denied."). I
tried postmaster at your site, but that bounces mail too. If you want to
work out the problem, drop me e-mail from some address at which you can
be responded to.

cjs
--
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

------- Previous Message --------
Curt Sampson <cjs@cynic.net> writes:
> While we're at it, would someone have the time to explain to me
> how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction.
Just as you don't want your transaction's effects to become visible
until you commit, you don't want an individual command's effects to
become visible until you do CommandCounterIncrement. Among other things
this solves the Halloween problem for us (how do you stop an UPDATE
from trying to re-update the tuples it's already emitted, should it
chance to hit them during its table scan).

The command IDs aren't interesting anymore once the originating
transaction is over, but I don't see a realistic way to recycle the
space ...

>> I believe we do want to distinguish three states: live tuple, dead
>> tuple, and empty space. Otherwise there will be cases where you're
>> forced to move data immediately to collapse empty space, when there's
>> not a good reason to except that your representation can't cope.

> I don't understand this.

I thought more about this in the shower this morning, and realized the
fundamental drawback of the scheme you are suggesting: it requires the
line pointers and physical storage to be in the same order. (Or you
could make it work in reverse order, by looking at the prior pointer
instead of the next one to determine item size; that would actually
work a little better. But in any case line pointer order and physical
storage order are tied together.)

This is clearly a loser for index pages: most inserts would require a
data shuffle. But it is also a loser for heap pages, and the reason is
that on heap pages we cannot change a tuple's index (line pointer
number) once it's been created. If we did, it'd invalidate CTID forward
links, index entries, and heapscan cursor positions for open scans.
Indeed, pretty much the whole point of having the line pointers is to
provide a stable ID for a tuple --- if we didn't need that, we could
just walk through the physical storage.

When VACUUM removes a dead tuple, it compacts out the physical space
and marks the line pointer as unused. (Of course, it makes sure all
references to the tuple are gone first.) The next time we want to
insert a tuple on that page, we can recycle the unused line pointer
instead of allocating a new one from the end of the line pointer array.
However, the physical space for the new tuple should come from the main
free-space pool in the middle of the page. To implement the
pointers-without-sizes representation, we'd be forced to shuffle data
to make room for the tuple between the two adjacent-by-line-number
tuples.

The three states of a line pointer that I referred to are live
(pointing at a good tuple), dead (pointing at storage that used to
contain a good tuple, doesn't anymore, but hasn't been compacted out
yet), and empty (doesn't point at storage at all; the space it used to
describe has been merged into the middle-of-the-page free pool). ISTM a
pointers-only representation can handle the live and dead cases nicely,
but the empty case is going to be a real headache.

In short, a pointers-only representation would give us a lot less
flexibility in free space management. It's an interesting idea but I
doubt that saving two bytes per row is worth the extra overhead.

			regards, tom lane
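[The three line-pointer states and the VACUUM recycling behaviour Tom
describes can be sketched roughly as below. This is a simplified
illustration, not the actual PostgreSQL structures; the names are
invented for the sketch.]

```python
from enum import Enum

class LpState(Enum):
    LIVE = "live"      # points at a good tuple
    DEAD = "dead"      # old tuple storage, not yet compacted out
    UNUSED = "unused"  # space already merged into the free pool

def slot_for_insert(line_pointers):
    """Reuse an UNUSED line pointer if one exists; otherwise grow the
    array. Either way, the tuple's physical storage comes from the
    free-space pool in the middle of the page, not from this slot."""
    for i, state in enumerate(line_pointers):
        if state is LpState.UNUSED:
            return i
    line_pointers.append(LpState.UNUSED)
    return len(line_pointers) - 1
```

The point of the UNUSED state is exactly that the slot keeps its stable
index (so CTIDs and index entries stay valid) while its storage has
already gone back to the pool.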
On Sat, 20 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > While we're at it, would someone have the time to explain to me
> > how the on-disk CommandIds are used?
>
> To determine visibility of tuples for commands within a transaction.
> Just as you don't want your transaction's effects to become visible
> until you commit, you don't want an individual command's effects to
> become visible until you do CommandCounterIncrement. Among other
> things this solves the Halloween problem for us (how do you stop
> an UPDATE from trying to re-update the tuples it's already emitted,
> should it chance to hit them during its table scan).
>
> The command IDs aren't interesting anymore once the originating
> transaction is over, but I don't see a realistic way to recycle
> the space ...

Ah, I see. So basically, it's exactly parallel to the transaction IDs,
except it's for commands instead of transactions?

This seems to imply that the insert command ID fields are of interest
only to the transaction that did the insert. In other words, if your
transaction ID is not the one listed in t_xmin, the t_cmin field is
always ignored. And the same goes for t_cmax and t_xmax, right?

If this is the case, would it be possible to number the commands
per-transaction, rather than globally? Then the t_cmin for a particular
tuple might be, say, 7, and though there might be many transactions
that have processed or will process command number 7, we would know
which transaction this belongs to from the t_xmin field. Does this work
for cursors, which currently seem to rely on a global command ID? If
you keep track of the transaction ID as well, I think so, right?

Having per-transaction command IDs might allow us to reduce the range
of the t_cmin and t_cmax fields. Unfortunately, probably by not all
that much, since one doesn't want to limit the number of commands
within a single transaction to something as silly as 65536.
But perhaps we don't need to increment the command ID for every
command. If I do an insert, but I know that the previous command was
also an insert, then I know that there were no intervening reads in
this transaction, so can I reuse the previous command's ID? Could it be
that we need to increment the command ID only when we switch from
writing to reading or vice versa? There could still be transactions
that would run into problems, of course, but those might all be rather
pathological cases.

Or is everybody wishing they had some of whatever I'm smoking? :-)

cjs
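[The visibility rule being discussed in this exchange can be sketched as
below. This is a deliberately simplified model: it ignores commit
status, snapshots, and the t_xmax/t_cmax deletion side, and only shows
why t_cmin matters solely inside the inserting transaction -- which is
what makes per-transaction command numbering workable.]

```python
def inserted_tuple_visible(t_xmin, t_cmin, cur_xid, cur_cid):
    # Other transactions never consult t_cmin; for them, visibility is
    # decided by transaction status alone (assumed committed here).
    if t_xmin != cur_xid:
        return True
    # Within the inserting transaction, a command sees only tuples
    # inserted by *earlier* commands. This is what stops an UPDATE from
    # re-processing its own output (the Halloween problem).
    return t_cmin < cur_cid
```

Under per-transaction numbering, equal t_cmin values from different
transactions collide harmlessly, because the t_xmin test always comes
first.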
On Sat, 20 Apr 2002, Tom Lane wrote:

> >> I believe we do want to distinguish three states: live tuple, dead
> >> tuple, and empty space. Otherwise there will be cases where you're
> >> forced to move data immediately to collapse empty space, when
> >> there's not a good reason to except that your representation can't
> >> cope.
>
> > I don't understand this.
>
> I thought more about this in the shower this morning, and realized the
> fundamental drawback of the scheme you are suggesting: it requires the
> line pointers and physical storage to be in the same order.
> ...But in any case line pointer order and physical storage order are
> tied together.)

I thought that for a while too. As you point out, you need a list of
line pointers ordered by address within the block to build your map of
the space after the free space in the middle of the page, because you
get the end of the current item from the start address of the next one.
(We know where the free space in the middle is from the end of the line
pointer array.)

However, there's no reason it has to be stored in this order on the
disk. You can build a sorted list of line pointers in a separate area
of memory after you read the page. Yes, this uses a bit more CPU, but I
think it's going to be a pretty trivial amount. It's a short list, and
since you're touching the data anyway, it's going to be in the CPU
cache. The real cost you'll pay is in the time to access the area of
memory where you're storing the sorted list of line pointers. But the
potential saving here is up to 5% in I/O costs (due to using less disk
space).

> The three states of a line pointer that I referred to are live
> (pointing at a good tuple), dead (pointing at storage that used
> to contain a good tuple, doesn't anymore, but hasn't been compacted
> out yet), and empty (doesn't point at storage at all; the space it
> used to describe has been merged into the middle-of-the-page free
> pool).

Right.
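[The reconstruction step described above -- sort the line pointers in
memory, then derive each item's extent from its successor -- might look
something like this. An illustrative sketch, not PostgreSQL code;
`data_end` is assumed to be where the data area ends (the start of the
special space).]

```python
def reconstruct_sizes(offsets, data_end):
    # Sort line-pointer indexes by the offsets they hold; each item
    # then extends to the start of the next item by address, or to
    # data_end for the item stored last on the page.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    sizes = [0] * len(offsets)
    for rank, i in enumerate(order):
        if rank + 1 < len(order):
            sizes[i] = offsets[order[rank + 1]] - offsets[i]
        else:
            sizes[i] = data_end - offsets[i]
    return sizes
```

Note the sizes that come out are the full occupied extents, which is
exactly the rounding-to-alignment caveat Tom raises in his reply.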
I now realize that we do still need the three states, which in my case
are:

    live:       points to tuple data in use
    free space: points to unused space in the page, i.e., a dead tuple
    unused:     a line pointer that doesn't point to anything at all

> ISTM a pointers-only representation can handle the live and
> dead cases nicely, but the empty case is going to be a real headache.

This doesn't need a separate flag, since we can just have the line
pointer point to something obviously invalid, such as the page header.
(0 seems quite convenient for this.)

In the header, we need a count of the number of line pointers
(line_id_count below), but we can drop the beginning/end of free space
pointers, since we know that data space starts after the last line
pointer and ends at the beginning of the special space.

So here's an example of a page layout. Sizes are arbitrary ones I
picked for the sake of the example, except for the line_id sizes.

    Address  Size  Item
          0    24  page header (line_id_count = 6)
         24     2  line_id: 7751 (free space 1)
         26     2  line_id: 7800 (tuple 1)
         28     2  line_id: 0 (unused)
         30     2  line_id: 7600 (tuple 2)
         32     2  line_id: 8000 (tuple 3)
         34     2  line_id: 7941 (free space 2)
         36  7564  free space in the middle of the page
       7600   150  tuple 2
       7750    50  free space 1
       7800   100  tuple 1
       7940    60  free space 2
       8000    96  tuple 3
       8096    96  special space

Note above that the free-space pointers have the LSB set to indicate
that they point to free space, not tuples; so the first line_id
actually points to 7750.

When I do an insert, the first thing I do is scan for a free line
pointer. Finding one at address 28, I decide to re-use it. Then I look
for the smallest block of free space that will hold the data I need to
insert. If it fits exactly, I use it. If not, I extend the line pointer
array by one entry and make the new entry point to the free space
remaining in the block I used. If a big enough block of free space
doesn't exist, I compact the page and try again.
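[The insert procedure in the last paragraph can be sketched like so. A
sketch of the proposal only, not working PostgreSQL code; free blocks
are modeled as (offset, size) pairs derived from the free-space line
pointers.]

```python
def best_fit_insert(free_blocks, need):
    """Pick the smallest free block holding `need` bytes (best fit).
    Returns the chosen offset, or None if the caller must compact the
    page and retry. A non-exact fit leaves a remainder block, which
    would be given its own free-space line pointer."""
    fits = [b for b in free_blocks if b[1] >= need]
    if not fits:
        return None
    off, size = min(fits, key=lambda b: b[1])
    free_blocks.remove((off, size))
    if size > need:
        free_blocks.append((off + need, size - need))
    return off
```

With the example page above, a 50-byte tuple consumes free space 1
exactly, while a 40-byte tuple takes the front of it and leaves a
10-byte remainder at 7790.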
cjs
> Having per-transaction command IDs might allow us to reduce the range
> of the t_cmin and t_cmax fields. Unfortunately, probably by not all
> that much, since one doesn't want to limit the number of commands
> within a single transaction to something as silly as 65536.

If you can figure out how to make it roll over, sure, but that's a very
small number. Consider users who do most of their stuff via functions
(one transaction). Now consider a function that builds reports, stats,
etc. for some department. It's likely these work on a per-account
basis. We have a function making invoices that would wrap around that
at least 10x.
Curt Sampson <cjs@cynic.net> writes:
> If this is the case, would it be possible to number the commands
> per-transaction, rather than globally?

They are.

			regards, tom lane
Curt Sampson <cjs@cynic.net> writes:
> Yes, this uses a bit more CPU, but I think it's going to be a pretty
> trivial amount. It's a short list, and since you're touching the data
> anyway, it's going to be in the CPU cache. The real cost you'll pay is
> in the time to access the area of memory where you're storing the
> sorted list of line pointers. But the potential saving here is up to
> 5% in I/O costs (due to using less disk space).

At this point you're essentially arguing that it's faster to recompute
the list of item sizes than it is to read it off disk. Given that the
recomputation would require sorting the list of item locations (with up
to a couple hundred entries --- more than that if blocksize > 8K), I'm
not convinced of that.

Another difficulty is that we'd lose the ability to record item sizes
to the exact byte. What we'd reconstruct from the item locations are
sizes rounded up to the next MAXALIGN boundary. I am not sure that this
is a problem, but I'm not sure it's not, either.

The part of this idea that I actually like is overlapping the status
bits with the low-order part of the item location, using the assumption
that MAXALIGN is at least 4. That would allow us to support BLCKSZ up
to 64K, and probably save a cycle or two in fetching/storing the item
fields as well. The larger BLCKSZ limit isn't nearly as desirable as it
used to be, because of TOAST, and in fact it could be a net loser
because of increased WAL traffic. But it'd be interesting to try it and
see.

			regards, tom lane
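[The overlap Tom favours here -- and the 14-bit addressing from the
start of the thread -- can be sketched as below. Illustrative only; the
flag bit assignments are hypothetical, not PostgreSQL's.]

```python
# If MAXALIGN is at least 4, every item offset is a multiple of 4, so
# its low two bits are always zero on disk and can carry the status
# flags instead. The remaining 14 significant address bits, at 4-byte
# granularity, are enough to address anywhere in a 64K page.
LP_USED = 0x1      # hypothetical flag bits
LP_DELETED = 0x2

def pack_item(offset, flags):
    assert offset % 4 == 0, "requires MAXALIGN >= 4"
    assert 0 <= offset < 1 << 16, "16 bits reach a 64K BLCKSZ"
    return offset | flags

def unpack_item(packed):
    # Fetching either field is a single mask operation.
    return packed & ~0x3, packed & 0x3   # (offset, flags)
```

Determining the flags is then just a mask, which is the "cycle or two"
saving mentioned above.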
On Sun, 21 Apr 2002, Tom Lane wrote:

> At this point you're essentially arguing that it's faster to recompute
> the list of item sizes than it is to read it off disk. Given that the
> recomputation would require sorting the list of item locations (with
> up to a couple hundred entries --- more than that if blocksize > 8K)
> I'm not convinced of that.

No, not at all. What I'm arguing is that the I/O savings gained from
removing two bytes from the tuple overhead will more than compensate
for having to do a little bit more computation after reading the block.
How do I know? Well, I have very solid figures. I know because I pulled
them straight out of my....anyway. :-)

Yeah, it's more or less instinct that says to me that this would be a
win. If others don't agree, there's a pretty reasonable chance that I'm
wrong here. But I think it might be worthwhile spending a bit of effort
to see what we can do to reduce our tuple overhead. After all, there is
a good commercial DB that has much, much lower overhead, even if it's
not really comparable because it doesn't use MVCC. The best thing
really would be to see what other good MVCC databases do. I'm going to
go to the bookshop in the next few days and try to find out what
Oracle's physical layout is.

> Another difficulty is that we'd lose the ability to record item sizes
> to the exact byte. What we'd reconstruct from the item locations are
> sizes rounded up to the next MAXALIGN boundary. I am not sure that
> this is a problem, but I'm not sure it's not either.

Well, I don't see any real problem with it, but yeah, I might well be
missing something here.

> The larger BLCKSZ limit isn't nearly as desirable as it used to be,
> because of TOAST, and in fact it could be a net loser because of
> increased WAL traffic. But it'd be interesting to try it and see.

Mmmm, I hadn't thought about the WAL side of things.
In an ideal world, it wouldn't be a problem, because WAL writes would
be related only to tuple size and would have nothing to do with block
size. Or so it seems to me. But I have to go read the WAL code a bit
before I care to make any real assertions there.

cjs