On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation. In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.
Aren't we in a way already doing this for indexes?
> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.
Not just the "bits", but the whole visibility info (xmin, xmax, tmin, tmax,
plus the bits) should be stored separately.
A separate "fork" for visibility should be organized as a b-tree index,
since we already have well-honed mechanisms for dealing with those
efficiently. The visibility fork would be stored in compressed form,
representing a range of all-visible or all-deleted tuples by its two
endpoints only, and the tree would also be reorganized when possible,
similar to what we currently do for HOT updates.
This will keep the visibility index really small for tables with few
updates, most likely one or two pages regardless of table size.
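
To make the compression idea concrete, here is a rough sketch of what a
leaf entry in the visibility fork might look like (the struct layout and
all names here are just illustration, with cmin/cmax standing in for
whatever command-level fields we end up keeping):

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical entry layout, for illustration only. */
typedef enum VisEntryKind
{
    VIS_RANGE_ALL_VISIBLE,  /* all tuples in [first, last] visible to everyone */
    VIS_RANGE_ALL_DELETED,  /* all tuples in [first, last] dead to everyone */
    VIS_SINGLE              /* full per-tuple visibility info follows */
} VisEntryKind;

typedef struct VisEntry
{
    VisEntryKind    kind;
    ItemPointerData first;  /* start of range, or the single tuple */
    ItemPointerData last;   /* end of range; unused for VIS_SINGLE */

    /* valid for VIS_SINGLE only */
    TransactionId   xmin;
    TransactionId   xmax;
    CommandId       cmin;
    CommandId       cmax;
    uint16          infomask;  /* the visibility bits */
} VisEntry;

A b-tree keyed on "first" keeps lookups by TID cheap, and adjacent
ranges can be merged as whole stretches of the table become all-visible.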
One important difference from ordinary indexes is that the visibility
info has to be written first, before writing the data to the heap and
creating the ordinary index entries.
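
In pseudo-C, the insert path would then look roughly like this (all
three function names are made up; the only point is the ordering):

/* Sketch of the required write ordering; names are hypothetical. */
void
insert_tuple(Relation rel, HeapTuple tup)
{
    vis_fork_insert(rel, tup);       /* 1. visibility entry first */
    heap_write_tuple(rel, tup);      /* 2. then the heap data */
    index_insert_entries(rel, tup);  /* 3. then the ordinary indexes */
}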
> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans. Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)
Another cost is heavy inserting/updating, where there will probably be
more lock contention: because visibility entries are generally much
smaller than heap tuples, the visibility info for new tuples will land
on the same visibility pages more often.
Of course, with visibility info in a separate fork, very narrow tables
will have the ratios reversed: for a one-byte-wide table the visibility
info will be several times bigger than the actual data, at least
initially, before compression has kicked in.
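
(Rough arithmetic, assuming four-byte xmin/xmax/tmin/tmax plus a couple
of bytes of flag bits: about 18 bytes of uncompressed visibility info
per tuple, against a one-byte column plus whatever remains of the heap
tuple header.)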
--------------------
Hannu