Re: On-disk Tuple Size - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: On-disk Tuple Size
Date
Msg-id Pine.NEB.4.43.0204220432470.8450-100000@angelic.cynic.net
In response to Re: On-disk Tuple Size  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, 21 Apr 2002, Tom Lane wrote:

> At this point you're essentially arguing that it's faster to recompute
> the list of item sizes than it is to read it off disk.  Given that the
> recomputation would require sorting the list of item locations (with
> up to a couple hundred entries --- more than that if blocksize > 8K)
> I'm not convinced of that.

No, not at all. What I'm arguing is that the I/O savings gained from
removing two bytes from the tuple overhead will more than compensate for
having to do a little bit more computation after reading the block.
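
To make the "little bit more computation" concrete, here is a minimal sketch of the recomputation Tom describes: sort the item start offsets, then take each item's size as the distance to the next higher offset. It is an illustration, not PostgreSQL code. The names `reconstruct_sizes`, `cmp_offset` and `BLOCK_SIZE` are hypothetical, and it assumes items are packed back to back up to the end of the block, with distinct offsets and no trailing special space.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 8192         /* assumed block size, for illustration only */

/* qsort/bsearch comparator for 16-bit item offsets. */
static int cmp_offset(const void *a, const void *b)
{
    uint16_t x = *(const uint16_t *) a;
    uint16_t y = *(const uint16_t *) b;

    return (x > y) - (x < y);
}

/*
 * offsets[] holds the start offset of each item in line-pointer order
 * (not sorted).  On return, sizes[i] is the distance from offsets[i] to
 * the next higher offset, or to the end of the block for the last item.
 * These are padded sizes, not exact tuple lengths.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
static int reconstruct_sizes(const uint16_t *offsets, uint16_t *sizes, size_t n)
{
    uint16_t *sorted;
    size_t    i;

    if (n == 0)
        return 0;

    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;

    memcpy(sorted, offsets, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_offset);   /* the sort Tom mentions */

    for (i = 0; i < n; i++)
    {
        const uint16_t *p;
        size_t          k;
        uint16_t        end;

        assert(offsets[i] < BLOCK_SIZE);

        /* Find this item's position in the sorted order. */
        p = bsearch(&offsets[i], sorted, n, sizeof *sorted, cmp_offset);
        k = (size_t) (p - sorted);

        /* The item ends where the next one starts, or at end of block. */
        end = (k + 1 < n) ? sorted[k + 1] : BLOCK_SIZE;
        sizes[i] = (uint16_t) (end - offsets[i]);
    }

    free(sorted);
    return 0;
}
```

For a couple hundred items this is one small sort plus a lookup per item, all in memory; whether that beats reading two extra bytes per tuple off disk is exactly the open question.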

How do I know? Well, I have very solid figures. I know because I pulled
them straight out of my....anyway. :-) Yeah, it's more or less instinct
that says to me that this would be a win. If others don't agree, there's
a pretty reasonable chance that I'm wrong here. But I think it might
be worthwhile spending a bit of effort to see what we can do to reduce
our tuple overhead. After all, there is a good commercial DB that has
much, much lower overhead, even if it's not really comparable because it
doesn't use MVCC. The best thing really would be to see what other good
MVCC databases do. I'm going to go to the bookshop in the next few days
and try to find out what Oracle's physical layout is.

> Another difficulty is that we'd lose the ability to record item sizes
> to the exact byte.  What we'd reconstruct from the item locations are
> sizes rounded up to the next MAXALIGN boundary.  I am not sure that
> this is a problem, but I'm not sure it's not either.

Well, I don't see any real problem with it, but yeah, I might well be
missing something here.
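
As a small illustration of what would be lost, the sketch below rounds a length up to an alignment boundary. The names `align_up` and `ALIGN_BYTES` are hypothetical stand-ins, and the value 8 is only an assumption; the real MAXALIGN value is platform-dependent.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN_BYTES 8           /* assumed alignment; platform-dependent in reality */

/* Round len up to the next multiple of ALIGN_BYTES (a power of two). */
static size_t align_up(size_t len)
{
    /* The bit trick below is only valid for power-of-two alignments. */
    assert((ALIGN_BYTES & (ALIGN_BYTES - 1)) == 0);

    return (len + (ALIGN_BYTES - 1)) & ~((size_t) (ALIGN_BYTES - 1));
}
```

Under that assumption, tuples of 41 and 48 bytes both occupy 48 bytes on the page, so a size reconstructed from item locations could not tell them apart. Any code that needs the exact length would have to get it some other way.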

> The larger BLCKSZ limit isn't nearly as desirable as it used to be,
> because of TOAST, and in fact it could be a net loser because of
> increased WAL traffic. But it'd be interesting to try it and see.

Mmmm, I hadn't thought about the WAL side of things. In an ideal world,
it wouldn't be a problem because WAL writes would be related only to
tuple size, and would have nothing to do with block size. Or so it seems
to me. But I have to go read the WAL code a bit before I care to make
any real assertions there.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 


