Re: On-disk Tuple Size - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: On-disk Tuple Size
Date
Msg-id Pine.NEB.4.43.0204211557120.6249-100000@angelic.cynic.net
Whole thread Raw
In response to Re: On-disk Tuple Size  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: On-disk Tuple Size  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sat, 20 Apr 2002, Tom Lane wrote:

> >> I believe we do want to distinguish three states: live tuple, dead
> >> tuple, and empty space.  Otherwise there will be cases where you're
> >> forced to move data immediately to collapse empty space, when there's
> >> not a good reason to except that your representation can't cope.
>
> > I don't understand this.
>
> I thought more about this in the shower this morning, and realized the
> fundamental drawback of the scheme you are suggesting: it requires the
> line pointers and physical storage to be in the same order.
> ...But in any case line pointer order and physical storage order are
> tied together.)

I thought that for a while too. As you point out, you want an list of
line pointers ordered by address in the block to build your map of space
after the free space in the middle of the page because you get the end
of the current block from the start address of the next block. (We know
where the free space in the middle is from the end of the line pointer
array.)

However, there's no reason it has to be stored in this order on the
disk. You can build a sorted list of line pointers in a separate area of
memory after you read the page.

Yes, this uses a bit more CPU, but I think it's going to be a pretty
trivial amount. It's a short list, and since you're touching the data
anyway, it's going to be in the CPU cache. The real cost you'll pay is
in the time to access the area of memory where you're storing the sorted
list of line pointers. But the potential saving here is up to 5% in I/O
costs (due to using less disk space).

> The three states of a line pointer that I referred to are live
> (pointing at a good tuple), dead (pointing at storage that used
> to contain a good tuple, doesn't anymore, but hasn't been compacted
> out yet), and empty (doesn't point at storage at all; the space it
> used to describe has been merged into the middle-of-the-page free
> pool).

Right. I now realize that we still do still need the three states,
which are in my case:
   live:    points to tuple data in use
   free space:    points to unused space in the page, i.e., a dead tuple.
   unused:    a line pointer that doesn't point to anything at all.

> ISTM a pointers-only representation can handle the live and
> dead cases nicely, but the empty case is going to be a real headache.

This doesn't need a separate flag, since we can just have the line
pointer point to something obviously invalid, such as the page
header. (0 seems quite convenient for this.)

In the header, we need a count of the number of line pointers
(line_id_count above), but we can drop the beginning/end of free
space pointers, since we know that data space starts after the last
line pointer, and ends at the beginnning of special space.

So here's an example of a page layout. Sizes are arbitrary ones I
picked for the sake of the example, except for the line_id sizes.
   Address    Size    Item
   0        24    page header (line_id_count = 6)
   24        2    line_id: 7751 (free space 1)   26        2    line_id: 7800 (tuple 1)   28        2    line_id:    0
(unused)  30        2    line_id: 7600 (tuple 2)   32        2    line_id: 8000 (tuple 3)   34        2    line_id:
7941(free space 2)
 
   36        7564    free space in the middle of the page
   7600    150    tuple 2   7750    50    free space 1   7800    100    tuple 1   7940    60    free space 2   8000
96   tuple 3   8096    96    special space
 

Note above that the free space pointers have the LSB set to indicate
that they point to free space, not tuples. So the first line_id
actually points to 7750.

When I do an insert, the first thing I do is scan for a free line
pointer. Finding a free one at 28, I decide to re-use that. Then
I look for the smallest block of free space that will hold the data
that I need to insert. If it fits, exactly, I use it. If not, I
need to extend the line pointer array by one and make that point
to the remaining free space in the block of free space I used.

If a big enough block of free space doesn't exist, I compact the
page and try again.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org   Don't you know, in this new Dark Age, we're
alllight.  --XTC
 



pgsql-hackers by date:

Previous
From: Curt Sampson
Date:
Subject: Re: Documentation on page files
Next
From: Martijn van Oosterhout
Date:
Subject: Re: Documentation on page files