Improving free space usage (was: Reducing relation locking overhead) - Mailing list pgsql-hackers

From Jim C. Nasby
Subject Improving free space usage (was: Reducing relation locking overhead)
Date
Msg-id 20051208185702.GB58449@nasby.net
Whole thread Raw
In response to Re: Reducing relation locking overhead  (Hannu Krosing <hannu@skype.net>)
Responses Re: Improving free space usage (was: Reducing relation locking  (Hannu Krosing <hannu@skype.net>)
List pgsql-hackers
On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote:
> ??hel kenal p??eval, N, 2005-12-08 kell 01:08, kirjutas Jim C. Nasby:
> > On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
> > > ??hel kenal p??eval, N, 2005-12-08 kell 00:16, kirjutas Jim C. Nasby:
> > > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > > > > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > > > > What's worse, once you have excluded writes you have to rescan the entire
> > > > > > table to be sure you haven't missed anything. So in the scenarios where this
> > > > > > whole thing is actually interesting, ie enormous tables, you're still
> > > > > > talking about a fairly long interval with writes locked out. Maybe not as
> > > > > > long as a complete REINDEX, but long.
> > > > > 
> > > > > I was thinking you would set a flag to disable use of the FSM for
> > > > > inserts/updates while the reindex was running. So you would know where to find
> > > > > the new tuples, at the end of the table after the last tuple you read.
> > > > 
> > > > What about keeping a separate list of new tuples? Obviously we'd only do
> > > > this when an index was being built on a table. 
> > > 
> > > The problem with separate list is that it can be huge. For example on a
> > > table with 200 inserts/updates per second an index build lasting 6 hours
> > > would accumulate total on 6*3600*200 = 4320000 new tuples.
> > 
> > Sure, but it's unlikely that such a table would be very wide, so 4.3M
> > tuples would probably only amount to a few hundred MB of data. It's also
> > possible that this list could be vacuumed by whatever the regular vacuum
> > process is for the table.
> 
> I think that keeping such list as part the table at well defined
> location (like pages from N to M) is the best strategy, as it will
> automatically make all new tuples available to parallel processes and
> avoids both duplicate storage as well as the the need for changing
> insert/update code.

There's one thing I hate about that idea though: good luck trying to
move those tuples somewhere else after the index build is done and you
now want to shrink the table back down to a more normal size. If we had
a better way to do that it would be much more palatable, but right now
on a heavily updated table this would result in a lot of bloat.

Along those lines, I've wondered if it makes sense to add more
flexibility in how free space is reclaimed in a table. One obvious
possibility is to have a strategy where new tuples will always look to
the FSM for space (instead of going into the current page if possible),
and the FSM will always hand out the earliest page in the table it has.
This mode would have the effect of moving tuples towards the front of
the table, allowing for space reclamation. A variation might be that
this mode will not effect tuples that are generated as part of an UPDATE
and are in the first x% of the table, since it doesn't make sense to
move a tuple from page 2 to page 1 in a 1000 page table.

Another possibility is to always go to the FSM and to have the FSM hand
back the page that is closest to the new tuple according to a certain
index. This would allow for ALTER TABLE CLUSTER to be much more
self-maintaining. The downside is that you'd have to do a lookup on that
index, but presumably if the index is important enough to cluster on
then it should be well-cached. There's probably some other tweaks that
could be done as well to make this more performant.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


pgsql-hackers by date:

Previous
From: "Jim C. Nasby"
Date:
Subject: Re: Vertical Partitioning with TOAST
Next
From: Jan Wieck
Date:
Subject: Re: Vertical Partitioning with TOAST