Thread: On-disk Tuple Size

On-disk Tuple Size

From: Curt Sampson
[I've moved this discussion about changing the line pointer from four
bytes to two from -general to -hackers, since it's fairly technical.
The entire message Tom is responding to is appended to this one.]

On Sat, 20 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > ... Then we could declare that all tuples must be aligned on a
> > four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
> > address, and the bottom two bits for the LP_USED and LP_DELETED flags.
> > This would slightly simplify the code for determining the flags, and
> > incidentally boost the maximum page size to 64K.
>
> Hmm.  Maybe, but the net effect would only be to reduce the minimum row
> overhead from 36 to 34 bytes.  Not sure it's worth worrying about.

Well, unless the implementation is hideously complex, I'd say that
every byte is worth worrying about, given the amount of overhead that's
currently there. Going from 36 to 34 bytes could give something approaching a 5%
performance increase for tables with short rows. (Actually, do we prefer
the tables/rows or relations/tuples terminology here? I guess I kinda
tend to use the latter for physical stuff.)
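
To make that encoding concrete, here is a minimal sketch in C; the
names are hypothetical, not PostgreSQL's actual ItemIdData:

    /*
     * Hypothetical 2-byte line pointer.  Tuples are assumed to start on
     * 4-byte boundaries, so the low two bits of a byte offset are always
     * zero and can carry the flags; the remaining 14 bits address any
     * 4-byte-aligned offset below 64K.
     */
    typedef unsigned short LinePointer;

    #define LP_USED      0x0001
    #define LP_DELETED   0x0002
    #define LP_FLAG_MASK 0x0003

    #define LPGetOffset(lp)  ((unsigned int) ((lp) & ~LP_FLAG_MASK))
    #define LPGetFlags(lp)   ((lp) & LP_FLAG_MASK)
    #define LPMake(off, fl)  ((LinePointer) ((off) | (fl)))  /* off 4-byte aligned */

Since the flag bits sit where an aligned offset is always zero, no
shift is needed to recover the address.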

If we could drop the OID from the tuple when it's not being used,
that would be another four bytes, bringing the performance increase
up towards 15% on tables with short rows.

Of course I understand that all this is contingent not only on such
changes being acceptable, but someone actually caring enough to
write them.

While we're at it, would someone have the time to explain to me
how the on-disk CommandIds are used? A quick look at the code
indicates that this is used for cursor consistency, among other
things, but it's still a bit mysterious to me.

> > ... I don't see why we would then
> > need the LP_DELETED flag at all.
>
> I believe we do want to distinguish three states: live tuple, dead
> tuple, and empty space.  Otherwise there will be cases where you're
> forced to move data immediately to collapse empty space, when there's
> not a good reason to, except that your representation can't cope.

I don't understand this. Why do you need to collapse empty space
immediately? Why not just wait until you can't find an empty fragment
in the page that's big enough, and then do the collapse?

Oh, on a final unrelated note, <john@akadine.com>, you're bouncing
mail from my host for reasons not well explained ("550 Access
denied."). I tried postmaster at your site, but that bounces mail
too. If you want to work out the problem, drop me e-mail from an
address that can receive replies.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

------- Previous Message --------

Re: On-disk Tuple Size

From: Tom Lane
Curt Sampson <cjs@cynic.net> writes:
> While we're at it, would someone have the time to explain to me
> how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction.
Just as you don't want your transaction's effects to become visible
until you commit, you don't want an individual command's effects to
become visible until you do CommandCounterIncrement.  Among other
things this solves the Halloween problem for us (how do you stop
an UPDATE from trying to re-update the tuples it's already emitted,
should it chance to hit them during its table scan).
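
In rough code form, the heart of that rule is something like the
following sketch (an illustration only; the real tests live in the
HeapTupleSatisfies* routines and handle many more cases):

    #include <stdbool.h>

    typedef unsigned int TransactionId;
    typedef unsigned int CommandId;

    /*
     * Sketch of intra-transaction visibility: a tuple our own
     * transaction inserted is visible only to *later* commands, and a
     * tuple our own transaction deleted is invisible to later commands.
     */
    bool
    visible_to_command(TransactionId xmin, CommandId cmin,
                       TransactionId xmax, CommandId cmax,
                       TransactionId my_xid, CommandId my_cid)
    {
        if (xmin == my_xid && cmin >= my_cid)
            return false;   /* inserted by this or a later command */
        if (xmax == my_xid && cmax < my_cid)
            return false;   /* deleted by an earlier command */
        return true;        /* (cases involving other transactions omitted) */
    }

An UPDATE's newly written tuples carry the current command ID, so the
first test is exactly what keeps the UPDATE from re-visiting its own
output.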

The command IDs aren't interesting anymore once the originating
transaction is over, but I don't see a realistic way to recycle
the space ...

>> I believe we do want to distinguish three states: live tuple, dead
>> tuple, and empty space.  Otherwise there will be cases where you're
>> forced to move data immediately to collapse empty space, when there's
>> not a good reason to, except that your representation can't cope.

> I don't understand this.

I thought more about this in the shower this morning, and realized the
fundamental drawback of the scheme you are suggesting: it requires the
line pointers and physical storage to be in the same order.  (Or you
could make it work in reverse order, by looking at the prior pointer
instead of the next one to determine item size; that would actually
work a little better.  But in any case line pointer order and physical
storage order are tied together.)
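
The coupling shows up as soon as you try to compute an item's size
without a stored length (a sketch, with hypothetical names):

    /*
     * With no stored lengths, item i's size is the distance to the next
     * item's start -- which is only the *physically* next item if the
     * pointer array and the storage share the same order.
     */
    unsigned int
    item_size(const unsigned short *offsets, int i, int nitems,
              unsigned int data_end)
    {
        unsigned int next = (i + 1 < nitems) ? offsets[i + 1] : data_end;
        return next - offsets[i];
    }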

This is clearly a loser for index pages: most inserts would require
a data shuffle.  But it is also a loser for heap pages, and the reason
is that on heap pages we cannot change a tuple's index (line pointer
number) once it's been created.  If we did, it'd invalidate CTID
forward links, index entries, and heapscan cursor positions for open
scans.  Indeed, pretty much the whole point of having the line pointers
is to provide a stable ID for a tuple --- if we didn't need that we
could just walk through the physical storage.

When VACUUM removes a dead tuple, it compacts out the physical space
and marks the line pointer as unused.  (Of course, it makes sure all
references to the tuple are gone first.)  The next time we want to
insert a tuple on that page, we can recycle the unused line pointer
instead of allocating a new one from the end of the line pointer array.
However, the physical space for the new tuple should come from the
main free-space pool in the middle of the page.  To implement the
pointers-without-sizes representation, we'd be forced to shuffle data
to make room for the tuple between the two adjacent-by-line-number tuples.

The three states of a line pointer that I referred to are live
(pointing at a good tuple), dead (pointing at storage that used
to contain a good tuple, doesn't anymore, but hasn't been compacted
out yet), and empty (doesn't point at storage at all; the space it
used to describe has been merged into the middle-of-the-page free
pool).  ISTM a pointers-only representation can handle the live and
dead cases nicely, but the empty case is going to be a real headache.
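
Spelled out as an enum (names invented for the sketch):

    /* The three line pointer states, as an illustrative enum. */
    typedef enum
    {
        LP_LIVE,   /* points at a good tuple */
        LP_DEAD,   /* tuple is gone, storage not yet compacted out */
        LP_EMPTY   /* points at no storage; space is back in the free pool */
    } LinePointerState;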

In short, a pointers-only representation would give us a lot less
flexibility in free space management.  It's an interesting idea but
I doubt that saving two bytes per row is worth the extra overhead.
        regards, tom lane


Re: On-disk Tuple Size

From: Curt Sampson
On Sat, 20 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > While we're at it, would someone have the time to explain to me
> > how the on-disk CommandIds are used?
>
> To determine visibility of tuples for commands within a transaction.
> Just as you don't want your transaction's effects to become visible
> until you commit, you don't want an individual command's effects to
> become visible until you do CommandCounterIncrement.  Among other
> things this solves the Halloween problem for us (how do you stop
> an UPDATE from trying to re-update the tuples it's already emitted,
> should it chance to hit them during its table scan).
>
> The command IDs aren't interesting anymore once the originating
> transaction is over, but I don't see a realistic way to recycle
> the space ...

Ah, I see. So basically, it's exactly parallel to the transaction IDs
except it's for commands instead of transactions?

So this seems to imply to me that the insert command ID fields are of
interest only to the transaction that did the insert. In other words, if
your transaction ID is not the one listed in t_xmin, the t_cmin field is
always ignored. And the same goes for t_cmax and t_xmax, right?

If this is the case, would it be possible to number the commands
per-transaction, rather than globally? Then the t_cmin for a particular
tuple might be say, 7, but though there might be many transactions that
have processed or will process command number 7, we would know which
transaction this belongs to by the t_xmin field.

Does this work for cursors, which currently seem to rely on a global
command ID? If you keep track of the transaction ID as well, I think so,
right?

Having per-transaction command IDs might allow us to reduce the range of
the t_cmin and t_cmax fields. Unfortunately, probably by not all that
much, since one doesn't want to limit the number of commands within a
single transaction to something as silly as 65536.

But perhaps we don't need to increment the command ID for every command.
If I do an insert, but I know that the previous command was also an
insert, I know that there were no intervening reads in this transaction,
so can I use the previous command's ID? Could it be that we need to
increment the command ID only when we switch from writing to reading
or vice versa? There could still be transactions that would run into
problems, of course, but these might all be rather pathological cases.
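
A sketch of that idea, purely hypothetical and not how PostgreSQL
behaves today:

    typedef unsigned int CommandId;
    typedef enum { MODE_NONE, MODE_READ, MODE_WRITE } CmdMode;

    static CommandId current_cid = 0;
    static CmdMode   last_mode   = MODE_NONE;

    /* Bump the command ID only on a read/write boundary, instead of
     * once per command. */
    void
    note_command(CmdMode mode)
    {
        if (last_mode != MODE_NONE && mode != last_mode)
            current_cid++;
        last_mode = mode;
    }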

Or is everybody wishing they had some of whatever I'm smoking? :-)

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
 



Re: On-disk Tuple Size

From: Curt Sampson
On Sat, 20 Apr 2002, Tom Lane wrote:

> >> I believe we do want to distinguish three states: live tuple, dead
> >> tuple, and empty space.  Otherwise there will be cases where you're
> >> forced to move data immediately to collapse empty space, when there's
> >> not a good reason to, except that your representation can't cope.
>
> > I don't understand this.
>
> I thought more about this in the shower this morning, and realized the
> fundamental drawback of the scheme you are suggesting: it requires the
> line pointers and physical storage to be in the same order.
> ...But in any case line pointer order and physical storage order are
> tied together.)

I thought that for a while too. As you point out, you need a list of
line pointers ordered by address within the block to build your map of
the space beyond the mid-page free area, because you get the end of
the current item from the start address of the next one. (We know
where the free space in the middle begins from the end of the line
pointer array.)

However, there's no reason it has to be stored in this order on the
disk. You can build a sorted list of line pointers in a separate area of
memory after you read the page.

Yes, this uses a bit more CPU, but I think it's going to be a pretty
trivial amount. It's a short list, and since you're touching the data
anyway, it's going to be in the CPU cache. The real cost you'll pay is
in the time to access the area of memory where you're storing the sorted
list of line pointers. But the potential saving here is up to 5% in I/O
costs (due to using less disk space).
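
The reconstruction might look like this (hypothetical names, standard
qsort for brevity):

    #include <stdlib.h>
    #include <string.h>

    static int
    cmp_offsets(const void *a, const void *b)
    {
        unsigned short x = *(const unsigned short *) a;
        unsigned short y = *(const unsigned short *) b;
        return (x > y) - (x < y);
    }

    /*
     * After reading a page, copy its line pointer offsets into scratch
     * memory and sort by address; adjacent entries then give each
     * item's extent without any on-disk length field.
     */
    void
    build_sorted_offsets(const unsigned short *line_ids, int n,
                         unsigned short *scratch)
    {
        memcpy(scratch, line_ids, n * sizeof(unsigned short));
        qsort(scratch, n, sizeof(unsigned short), cmp_offsets);
    }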

> The three states of a line pointer that I referred to are live
> (pointing at a good tuple), dead (pointing at storage that used
> to contain a good tuple, doesn't anymore, but hasn't been compacted
> out yet), and empty (doesn't point at storage at all; the space it
> used to describe has been merged into the middle-of-the-page free
> pool).

Right. I now realize that we do still need the three states, which
in my scheme are:

    live:        points to tuple data in use
    free space:  points to unused space in the page, i.e., a dead tuple
    unused:      a line pointer that doesn't point to anything at all

> ISTM a pointers-only representation can handle the live and
> dead cases nicely, but the empty case is going to be a real headache.

This doesn't need a separate flag, since we can just have the line
pointer point to something obviously invalid, such as the page
header. (0 seems quite convenient for this.)

In the header, we need a count of the number of line pointers
(line_id_count in the example below), but we can drop the beginning/end
of free space pointers, since we know that data space starts after the
last line pointer, and ends at the beginning of special space.

So here's an example of a page layout. Sizes are arbitrary ones I
picked for the sake of the example, except for the line_id sizes.
    Address  Size  Item
    0        24    page header (line_id_count = 6)
    24       2     line_id: 7751 (free space 1)
    26       2     line_id: 7800 (tuple 1)
    28       2     line_id: 0    (unused)
    30       2     line_id: 7600 (tuple 2)
    32       2     line_id: 8000 (tuple 3)
    34       2     line_id: 7941 (free space 2)
    36       7564  free space in the middle of the page
    7600     150   tuple 2
    7750     50    free space 1
    7800     100   tuple 1
    7940     60    free space 2
    8000     96    tuple 3
    8096     96    special space

Note above that the free space pointers have the LSB set to indicate
that they point to free space, not tuples. So the first line_id
actually points to 7750.
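
Decoding a line_id under this scheme is then just a couple of mask
operations (hypothetical macros):

    #define LINE_ID_UNUSED(id)   ((id) == 0)       /* points at the page header */
    #define LINE_ID_IS_FREE(id)  (((id) & 1) != 0) /* LSB set: free space */
    #define LINE_ID_OFFSET(id)   ((id) & ~1)       /* 7751 -> 7750, 7800 -> 7800 */

This works because tuples and fragments always start on at least
2-byte boundaries, so the low bit of a real offset is always zero.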

When I do an insert, the first thing I do is scan for a free line
pointer. Finding a free one at 28, I decide to re-use that. Then
I look for the smallest block of free space that will hold the data
that I need to insert. If it fits exactly, I use it. If not, I
need to extend the line pointer array by one and make that point
to the remaining free space in the block of free space I used.

If a big enough block of free space doesn't exist, I compact the
page and try again.
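
Put as a sketch in C, with every helper hypothetical; error handling
and the bookkeeping for the used fragment's old line_id are glossed
over:

    typedef struct PageData *Page;   /* opaque page handle (hypothetical) */

    int  find_unused_line_id(Page page);
    int  extend_line_id_array(Page page);
    int  find_smallest_fitting_fragment(Page page, unsigned short len);
    void compact_page(Page page);
    unsigned short fragment_offset(Page page, int frag);
    unsigned short fragment_size(Page page, int frag);
    void add_free_line_id(Page page, unsigned short off, unsigned short len);
    void store_tuple(Page page, unsigned short off, const void *tup,
                     unsigned short len);
    void set_line_id(Page page, int slot, unsigned short off);

    unsigned short
    page_insert(Page page, const void *tup, unsigned short len)
    {
        int slot = find_unused_line_id(page);   /* e.g. the 0 entry at 28 */
        if (slot < 0)
            slot = extend_line_id_array(page);

        int frag = find_smallest_fitting_fragment(page, len);
        if (frag < 0)
        {
            compact_page(page);                 /* collapse all fragments */
            frag = find_smallest_fitting_fragment(page, len);
        }

        unsigned short off = fragment_offset(page, frag);
        unsigned short rest = fragment_size(page, frag) - len;
        if (rest > 0)                           /* not an exact fit: the remainder */
            add_free_line_id(page, off + len, rest); /* gets a line_id, LSB set */

        store_tuple(page, off, tup, len);
        set_line_id(page, slot, off);           /* LSB clear: a live tuple */
        return off;
    }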

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC



Re: On-disk Tuple Size

From: "Rod Taylor"
> Having per-transaction command IDs might allow us to reduce the
> range of the t_cmin and t_cmax fields. Unfortunately, probably by
> not all that much, since one doesn't want to limit the number of
> commands within a single transaction to something as silly as 65536.

If you can figure out how to make that roll over, sure, but that's a
very small number.

Consider users who do most of their stuff via functions (one
transaction).  Now consider the function that builds reports, stats,
etc. for some department.  It's likely these work on a per-account
basis.

We have a function making invoices that would wrap around that at
least 10x.



Re: On-disk Tuple Size

From: Tom Lane
Curt Sampson <cjs@cynic.net> writes:
> If this is the case, would it be possible to number the commands
> per-transaction, rather than globally?

They are.
        regards, tom lane


Re: On-disk Tuple Size

From: Tom Lane
Curt Sampson <cjs@cynic.net> writes:
> Yes, this uses a bit more CPU, but I think it's going to be a pretty
> trivial amount. It's a short list, and since you're touching the data
> anyway, it's going to be in the CPU cache. The real cost you'll pay is
> in the time to access the area of memory where you're storing the sorted
> list of line pointers. But the potential saving here is up to 5% in I/O
> costs (due to using less disk space).

At this point you're essentially arguing that it's faster to recompute
the list of item sizes than it is to read it off disk.  Given that the
recomputation would require sorting the list of item locations (with
up to a couple hundred entries --- more than that if blocksize > 8K)
I'm not convinced of that.

Another difficulty is that we'd lose the ability to record item sizes
to the exact byte.  What we'd reconstruct from the item locations are
sizes rounded up to the next MAXALIGN boundary.  I am not sure that
this is a problem, but I'm not sure it's not either.
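
For instance, assuming MAXALIGN is 8:

    #define ALIGN8(len)  (((len) + 7) & ~7)

    /*
     * A 61-byte item is laid down on an 8-byte boundary, so its
     * physically next neighbour starts ALIGN8(61) == 64 bytes later;
     * reconstructing sizes from adjacent offsets yields 64, and the
     * true 61 is lost.
     */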

The part of this idea that I actually like is overlapping the status
bits with the low order part of the item location, using the assumption
that MAXALIGN is at least 4.  That would allow us to support BLCKSZ up
to 64K, and probably save a cycle or two in fetching/storing the item
fields as well.  The larger BLCKSZ limit isn't nearly as desirable
as it used to be, because of TOAST, and in fact it could be a net loser
because of increased WAL traffic.  But it'd be interesting to try it
and see.
        regards, tom lane


Re: On-disk Tuple Size

From: Curt Sampson
On Sun, 21 Apr 2002, Tom Lane wrote:

> At this point you're essentially arguing that it's faster to recompute
> the list of item sizes than it is to read it off disk.  Given that the
> recomputation would require sorting the list of item locations (with
> up to a couple hundred entries --- more than that if blocksize > 8K)
> I'm not convinced of that.

No, not at all. What I'm arguing is that the I/O savings gained from
removing two bytes from the tuple overhead will more than compensate for
having to do a little bit more computation after reading the block.

How do I know? Well, I have very solid figures. I know because I pulled
them straight out of my....anyway. :-) Yeah, it's more or less instinct
that says to me that this would be a win. If others don't agree, there's
a pretty reasonable chance that I'm wrong here. But I think it might
be worthwhile spending a bit of effort to see what we can do to reduce
our tuple overhead. After all, there is a good commercial DB that has
much, much lower overhead, even if it's not really comparable because it
doesn't use MVCC. The best thing really would be to see what other good
MVCC databases do. I'm going to go to the bookshop in the next few days
and try to find out what Oracle's physical layout is.

> Another difficulty is that we'd lose the ability to record item sizes
> to the exact byte.  What we'd reconstruct from the item locations are
> sizes rounded up to the next MAXALIGN boundary.  I am not sure that
> this is a problem, but I'm not sure it's not either.

Well, I don't see any real problem with it, but yeah, I might well be
missing something here.

> The larger BLCKSZ limit isn't nearly as desirable as it used to be,
> because of TOAST, and in fact it could be a net loser because of
> increased WAL traffic. But it'd be interesting to try it and see.

Mmmm, I hadn't thought about the WAL side of things. In an ideal world,
it wouldn't be a problem because WAL writes would be related only to
tuple size, and would have nothing to do with block size. Or so it seems
to me. But I have to go read the WAL code a bit before I care to make
any real assertions there.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC