Re: On-disk Tuple Size - Mailing list pgsql-hackers

From Tom Lane
Subject Re: On-disk Tuple Size
Msg-id 23713.1019318231@sss.pgh.pa.us
In response to On-disk Tuple Size  (Curt Sampson <cjs@cynic.net>)
Responses Re: On-disk Tuple Size  (Curt Sampson <cjs@cynic.net>)
List pgsql-hackers
Curt Sampson <cjs@cynic.net> writes:
> While we're at it, would someone have the time to explain to me
> how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction.
Just as you don't want your transaction's effects to become visible
until you commit, you don't want an individual command's effects to
become visible until you do CommandCounterIncrement.  Among other
things this solves the Halloween problem for us (how do you stop
an UPDATE from trying to re-update the tuples it's already emitted,
should it chance to hit them during its table scan).

The command IDs aren't interesting anymore once the originating
transaction is over, but I don't see a realistic way to recycle
the space ...

>> I believe we do want to distinguish three states: live tuple, dead
>> tuple, and empty space.  Otherwise there will be cases where you're
>> forced to move data immediately to collapse empty space, when there's
>> not a good reason to except that your representation can't cope.

> I don't understand this.

I thought more about this in the shower this morning, and realized the
fundamental drawback of the scheme you are suggesting: it requires the
line pointers and physical storage to be in the same order.  (Or you
could make it work in reverse order, by looking at the prior pointer
instead of the next one to determine item size; that would actually
work a little better.  But in any case line pointer order and physical
storage order are tied together.)
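
To make the dependency concrete, here is a rough sketch of the reverse-order variant, assuming tuple data is laid out downward from the end of an 8192-byte page and ignoring alignment and any special space. The names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 8192u   /* assumed page size */

/* Pointers-only scheme: each line pointer stores just a start offset.
 * Item i ends where item i-1 begins (or at the end of the page for
 * item 0), so its size is the gap between the two offsets.  This only
 * works if pointer order and physical order agree. */
static uint16_t item_size(const uint16_t *offsets, size_t i)
{
    assert(offsets != NULL);
    uint16_t end = (i == 0) ? (uint16_t) SKETCH_PAGE_SIZE : offsets[i - 1];
    assert(end >= offsets[i]);   /* fails if the orders ever diverge */
    return (uint16_t) (end - offsets[i]);
}
```

With offsets {8092, 8042, 7842}, the derived sizes are 100, 50, and 200 bytes. If item 1 were stored anywhere other than between items 0 and 2, the subtraction would no longer yield its size.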

This is clearly a loser for index pages: most inserts would require
a data shuffle.  But it is also a loser for heap pages, and the reason
is that on heap pages we cannot change a tuple's index (line pointer
number) once it's been created.  If we did, it'd invalidate CTID
forward links, index entries, and heapscan cursor positions for open
scans.  Indeed, pretty much the whole point of having the line pointers
is to provide a stable ID for a tuple --- if we didn't need that we
could just walk through the physical storage.

When VACUUM removes a dead tuple, it compacts out the physical space
and marks the line pointer as unused.  (Of course, it makes sure all
references to the tuple are gone first.)  The next time we want to
insert a tuple on that page, we can recycle the unused line pointer
instead of allocating a new one from the end of the line pointer array.
However, the physical space for the new tuple should come from the
main free-space pool in the middle of the page.  To implement the
pointers-without-sizes representation, we'd be forced to shuffle data
to make room for the tuple between the two adjacent-by-line-number tuples.

The three states of a line pointer that I referred to are live
(pointing at a good tuple), dead (pointing at storage that used
to contain a good tuple, doesn't anymore, but hasn't been compacted
out yet), and empty (doesn't point at storage at all; the space it
used to describe has been merged into the middle-of-the-page free
pool).  ISTM a pointers-only representation can handle the live and
dead cases nicely, but the empty case is going to be a real headache.
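
For comparison, a simplified sketch of the pointer-plus-size representation with the three states described above. The types, the 4-byte pointer size, and the 64-item limit are assumptions made for illustration, and compaction of the freed storage is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 8192u
#define SKETCH_MAX_ITEMS 64
#define SKETCH_LP_BYTES  4u      /* assumed on-page size of a line pointer */

typedef enum { LP_EMPTY, LP_LIVE, LP_DEAD } LpState;

typedef struct {
    uint16_t off;                /* where the tuple starts */
    uint16_t len;                /* explicit size: no ordering assumption */
    LpState  state;
} SketchLinePointer;

typedef struct {
    int      nptrs;              /* line pointers allocated so far */
    uint16_t upper;              /* tuple data starts here; free pool is below */
    SketchLinePointer lp[SKETCH_MAX_ITEMS];
} SketchPage;

static void page_init(SketchPage *p)
{
    p->nptrs = 0;
    p->upper = (uint16_t) SKETCH_PAGE_SIZE;
}

/* Returns the line pointer number used, or -1 if the tuple does not fit. */
static int page_add_tuple(SketchPage *p, uint16_t len)
{
    int slot = p->nptrs;
    for (int i = 0; i < p->nptrs; i++) {
        if (p->lp[i].state == LP_EMPTY) {   /* recycle an unused pointer */
            slot = i;
            break;
        }
    }
    int is_new = (slot == p->nptrs);
    if (is_new && p->nptrs == SKETCH_MAX_ITEMS)
        return -1;
    unsigned lower = (unsigned) (p->nptrs + is_new) * SKETCH_LP_BYTES;
    if (lower + len > p->upper)
        return -1;

    /* Storage always comes from the middle-of-the-page free pool. */
    p->upper = (uint16_t) (p->upper - len);
    p->lp[slot].off = p->upper;
    p->lp[slot].len = len;
    p->lp[slot].state = LP_LIVE;
    if (is_new)
        p->nptrs++;
    return slot;
}

/* Stand-in for what VACUUM does to the pointer; compaction not modeled. */
static void page_mark_empty(SketchPage *p, int slot)
{
    assert(slot >= 0 && slot < p->nptrs);
    p->lp[slot].state = LP_EMPTY;
}
```

In this sketch, after adding three tuples, emptying pointer 1, and adding a fourth tuple, pointer 1 is reused but its storage sits below that of pointer 2. Because each pointer carries its own length, nothing has to be shuffled to keep the two orders in step.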

In short, a pointers-only representation would give us a lot less
flexibility in free space management.  It's an interesting idea but
I doubt that saving two bytes per row is worth the extra overhead.
        regards, tom lane

