Thread: Dead Space Map

Dead Space Map

From
Heikki Linnakangas
Date:
Hi,

The idea of using a so called dead space map to speed up vacuum has come 
up multiple times in this list in the last couple of years. I wrote an 
initial implementation of it to measure the performance impact it has on 
updates and on vacuum.

Potential uses for a dead space map are:

* speed up vacuum when there's few dead tuples

Vacuum will need to be modified to use index lookups to find index tuples 
corresponding the dead heap tuples. Otherwise you have to scan through 
all the indexes anyway.

* vacuuming pages one by one as they're written by bgwriter

I'm not sure how much difference this would make, but it would be an 
interesting experiment. In theory, you could save a lot of total I/O, 
because you would not need to come back to vacuum the pages later, but you 
would have to read in any index pages pointing to the dead heap tuples 
inside bgwriter.

* implementation of index-only scans

An index scan would not have to check the visibility information of heap 
tuples on those heap pages that are marked as clean in the dead space map.
This requires that the dead space map is implemented so that a page is 
reliably marked as dirty in all circumstances when it contains any tuples 
that are not visible to all backends.

The obvious drawback is that heap updates need to update the dead space 
map too.

My current implementation stores a bitmap of 32k bits in the special space 
of every 32k heap pages. Each bit in the bitmap corresponds one heap page. 
The bit is set every time a tuple is updated, and it's cleared by vacuum. 
This is a very simple approach, and doesn't take much space.

Is there something I'm missing? Any ideas?

I'm going to have some spare time to hack PostgreSQL in the coming 
months, and I'm thinking of refining this if there's interest. Is anyone 
else working on this?

- Heikki


Re: Dead Space Map

From
"Luke Lonergan"
Date:
Heikki,

On 2/27/06 9:53 AM, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:


> My current implementation stores a bitmap of 32k bits in the special space
> of every 32k heap pages. Each bit in the bitmap corresponds one heap page.
> The bit is set every time a tuple is updated, and it's cleared by vacuum.
> This is a very simple approach, and doesn't take much space.
> 
> Is there something I'm missing? Any ideas?

Sounds great!
> I'm going to have some spare time to hack PostgreSQL in the coming
> months, and I'm thinking of refining this if there's interest. Is anyone
> else working on this?

This idea seems like it could dramatically improve vacuum - commonly a big
issue.

- Luke




Re: Dead Space Map

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Vacuum will need to be modified to use index lookups to find index tuples 
> corresponding the dead heap tuples. Otherwise you have to scan through 
> all the indexes anyway.

This strikes me as a fairly bad idea, because it makes VACUUM dependent
on correct functioning of user-written code --- consider a functional
index involving a user-written function that was claimed to be immutable
and is not.  There are concurrency-safety issues too, I think, having to
do with the way that btree ensures we don't delete any index tuple that
some scan is stopped on.

> * vacuuming pages one by one as they're written by bgwriter

That's not happening.  VACUUM has to be a transaction and the bgwriter
does not run transactions; nor is it in any position to clean out index
entries associated with a heap page.  (To change this would at a minimum
require instituting a separate bgwriter process per database; or else a
wholesale rewrite of our catalog access infrastructure to allow it to
work in a non-database-specific context.  There are also interesting
deadlock problems to think about if the bgwriter can be blocked by other
transactions, or if it needs to read pages not currently in shared memory.)

> * implementation of index-only scans

> An index scan would not have to check the visibility information of heap 
> tuples on those heap pages that are marked as clean in the dead space map.
> This requires that the dead space map is implemented so that a page is 
> reliably marked as dirty in all circumstances when it contains any tuples 
> that are not visible to all backends.

The "reliably" part of this is likely to make it a non-starter.  Another
problem is that the semantics needed by this are not quite the same as
the semantics of whether a page needs to be visited by vacuum.

> My current implementation stores a bitmap of 32k bits in the special space 
> of every 32k heap pages. Each bit in the bitmap corresponds one heap page. 
> The bit is set every time a tuple is updated, and it's cleared by vacuum. 
> This is a very simple approach, and doesn't take much space.

I thought the plan was to use out-of-line storage associated with each
table "segment" file.
        regards, tom lane


Re: Dead Space Map

From
Heikki Linnakangas
Date:
On Mon, 27 Feb 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Vacuum will need to be modified to use index lookups to find index tuples
>> corresponding the dead heap tuples. Otherwise you have to scan through
>> all the indexes anyway.
>
> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> on correct functioning of user-written code --- consider a functional
> index involving a user-written function that was claimed to be immutable
> and is not.

If the user-defined function is broken, you're in more or less trouble 
anyway. A VACUUM FULL or REINDEX can be used to recover.

> There are concurrency-safety issues too, I think, having to
> do with the way that btree ensures we don't delete any index tuple that
> some scan is stopped on.

Hmm, I see. I'll have to study the btree implementation more thoroughly.

>> * implementation of index-only scans
>
>> An index scan would not have to check the visibility information of heap
>> tuples on those heap pages that are marked as clean in the dead space map.
>> This requires that the dead space map is implemented so that a page is
>> reliably marked as dirty in all circumstances when it contains any tuples
>> that are not visible to all backends.
>
> The "reliably" part of this is likely to make it a non-starter.

AFAICS all heap access goes through the functions in heapam.c. They need 
to be modified to update the dead space map. Also on recovery. The 
locking semantics of the dead space map need to be thought out, but I 
don't see any insurmountable problems.

> Another
> problem is that the semantics needed by this are not quite the same as
> the semantics of whether a page needs to be visited by vacuum.

True. We might have to have two bits per page. It's still not that 
bad, though, compared to the benefit.

>> My current implementation stores a bitmap of 32k bits in the special space
>> of every 32k heap pages. Each bit in the bitmap corresponds one heap page.
>> The bit is set every time a tuple is updated, and it's cleared by vacuum.
>> This is a very simple approach, and doesn't take much space.
>
> I thought the plan was to use out-of-line storage associated with each
> table "segment" file.

You're probably referring to Alvaro's auto-vacuum todo list from July:

http://archives.postgresql.org/pgsql-hackers/2005-07/msg01029.php

I find it more attractive to put the bitmap in the special space, for the 
reasons stated earlier by Jan Wieck:

http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php

That is, so that you can utilize the common buffer management code. Jan 
also had a plan there for the locking.

- Heikki


Re: Dead Space Map

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Mon, 27 Feb 2006, Tom Lane wrote:
>> This strikes me as a fairly bad idea, because it makes VACUUM dependent
>> on correct functioning of user-written code --- consider a functional
>> index involving a user-written function that was claimed to be immutable
>> and is not.

> If the user-defined function is broken, you're in more or less trouble 
> anyway.

Less.  A non-immutable function might result in lookup failures (not
finding the row you need) but not in database corruption, which is what
would ensue if VACUUM fails to remove an index tuple.  The index entry
would eventually point to a wrong table entry, after the table item slot
gets recycled for another tuple.

Moreover, you haven't pointed to any strong reason to adopt this
methodology.  It'd only be a win when vacuuming pretty small numbers
of tuples, which is not the design center for VACUUM, and isn't likely
to be the case in practice either if you're using autovacuum.  If you're
removing say 1% of the tuples, you are likely to be hitting every index
page to do it, meaning that the scan approach will be significantly
*more* efficient than retail lookups.
        regards, tom lane


Re: Dead Space Map

From
"Jim C. Nasby"
Date:
On Mon, Feb 27, 2006 at 01:17:36PM -0500, Tom Lane wrote:
> > * vacuuming pages one by one as they're written by bgwriter
> 
> That's not happening.  VACUUM has to be a transaction and the bgwriter
> does not run transactions; nor is it in any position to clean out index
> entries associated with a heap page.  (To change this would at a minimum
> require instituting a separate bgwriter process per database; or else a
> wholesale rewrite of our catalog access infrastructure to allow it to
> work in a non-database-specific context.  There are also interesting
> deadlock problems to think about if the bgwriter can be blocked by other
> transactions, or if it needs to read pages not currently in shared memory.)

Or there could be a seperate daemon that isn't associated with bgwriter.
AFAIK as long as it vacuums the dirty page before bgwrite wants to write
it you'd still get the IO benefit.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Dead Space Map

From
"Jim C. Nasby"
Date:
On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Mon, 27 Feb 2006, Tom Lane wrote:
> >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> >> on correct functioning of user-written code --- consider a functional
> >> index involving a user-written function that was claimed to be immutable
> >> and is not.
> 
> > If the user-defined function is broken, you're in more or less trouble 
> > anyway.
> 
> Less.  A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple.  The index entry
> would eventually point to a wrong table entry, after the table item slot
> gets recycled for another tuple.
Is there some (small) metadata that could be stored in the index to
protect against this, perhaps XID? Granted, it's another 4 bytes, but it
would only need to be in functional indexes. And there could still be a
means to turn it off, if you're 100% certain that the function is
immutable. lower() is probably the biggest use-case here...

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology.  It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum.  If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

The use case is any large table that sees updates in 'hot spots'.
Anything that's based on current time is a likely candidate, since often
most activity only concerns the past few days of data.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Dead Space Map

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
>> Moreover, you haven't pointed to any strong reason to adopt this
>> methodology.  It'd only be a win when vacuuming pretty small numbers
>> of tuples, which is not the design center for VACUUM, and isn't likely
>> to be the case in practice either if you're using autovacuum.  If you're
>> removing say 1% of the tuples, you are likely to be hitting every index
>> page to do it, meaning that the scan approach will be significantly
>> *more* efficient than retail lookups.

> The use case is any large table that sees updates in 'hot spots'.
> Anything that's based on current time is a likely candidate, since often
> most activity only concerns the past few days of data.

I'm unmoved by that argument too.  If the updates are clustered then
another effect kicks in: the existing btbulkdelete approach is able to
collapse all the deletions on a given index page into one WAL record.
With retail deletes it'd be difficult if not impossible to do that,
resulting in a significant increase in WAL traffic during a vacuum.
(We know it's significant because we saw a good improvement when we
fixed btbulkdelete to work that way, instead of issuing a separate
WAL record per deleted index entry as it once did.)
        regards, tom lane


Re: Dead Space Map

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:

> > * implementation of index-only scans
> 
> > An index scan would not have to check the visibility information of heap 
> > tuples on those heap pages that are marked as clean in the dead space map.
> > This requires that the dead space map is implemented so that a page is 
> > reliably marked as dirty in all circumstances when it contains any tuples 
> > that are not visible to all backends.
> 
> The "reliably" part of this is likely to make it a non-starter.  Another
> problem is that the semantics needed by this are not quite the same as
> the semantics of whether a page needs to be visited by vacuum.

It would be very disappointing if this part doesn't turn out to be possible.

I had always thought the semantics were the same, but now I'm realizing that
vacuum doesn't need to visit tuples that have been committed even if they're
not visible to some transaction. So having a "vacuum can ignore this" bit
doesn't help you with index scans.

But I think the thought process went the other direction. If you have the bit
intended for index scans indicating that the tuple is not "in doubt" ie, it's
visible to every transaction, then that also implies the tuple doesn't need to
be visited by vacuum.

Skipping pages that don't contain any in doubt tuples would be a huge win.
Even if there might be some additional pages that vacuum could in theory be
skipping too.

-- 
greg



Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, E, 2006-02-27 kell 13:17, kirjutas Tom Lane:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Vacuum will need to be modified to use index lookups to find index tuples 
> > corresponding the dead heap tuples. Otherwise you have to scan through 
> > all the indexes anyway.
> 
> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> on correct functioning of user-written code --- consider a functional
> index involving a user-written function that was claimed to be immutable
> and is not.  There are concurrency-safety issues too, I think, having to
> do with the way that btree ensures we don't delete any index tuple that
> some scan is stopped on.
> 
> > * vacuuming pages one by one as they're written by bgwriter
> 
> That's not happening.  VACUUM has to be a transaction 

WHY does vacuum need to be a tranasction ? I thought it was a programmer
workload optimisation (aka. lazyness :) ) to require ordinary lazy
vacuum to be in transaction.

There is no fundamental reason, why vacuum needs to run in a transaction
itselt.

-----------------
Hannu




Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, E, 2006-02-27 kell 15:05, kirjutas Tom Lane:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Mon, 27 Feb 2006, Tom Lane wrote:
> >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> >> on correct functioning of user-written code --- consider a functional
> >> index involving a user-written function that was claimed to be immutable
> >> and is not.
> 
> > If the user-defined function is broken, you're in more or less trouble 
> > anyway.
> 
> Less.  A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple. 

Arguably the database is *already* broken once one has used a
non-immutable function in index ops.

"Failing to remove" is a condition that is easily detected, so one can
flag the operator class as lossy (RECHECK) and actually fix that
brokenness.

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology.  It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum.  If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

One use-case would be keeping update-heavy databases healthy, by
ensuring that most updates will end up in the same page as originals.

That goal would be easier to achieve, if there were a process which
reclaims old tuples as fast as possible.

-----------------
Hannu




Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2006-02-28 kell 01:04, kirjutas Tom Lane:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
> >> Moreover, you haven't pointed to any strong reason to adopt this
> >> methodology.  It'd only be a win when vacuuming pretty small numbers
> >> of tuples, which is not the design center for VACUUM, and isn't likely
> >> to be the case in practice either if you're using autovacuum.  If you're
> >> removing say 1% of the tuples, you are likely to be hitting every index
> >> page to do it, meaning that the scan approach will be significantly
> >> *more* efficient than retail lookups.
> 
> > The use case is any large table that sees updates in 'hot spots'.
> > Anything that's based on current time is a likely candidate, since often
> > most activity only concerns the past few days of data.
> 
> I'm unmoved by that argument too.  If the updates are clustered then
> another effect kicks in: the existing btbulkdelete approach is able to
> collapse all the deletions on a given index page into one WAL record.
> With retail deletes it'd be difficult if not impossible to do that,
> resulting in a significant increase in WAL traffic during a vacuum.
> (We know it's significant because we saw a good improvement when we
> fixed btbulkdelete to work that way, instead of issuing a separate
> WAL record per deleted index entry as it once did.)

In the "big table with hotspots" case, I doubt that the win from
btbulkdelete will outweight having to scan 100000 index pages to get to
the 30 or 50 that can be bulkdeleted.

--------------
Hannu




Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2006-02-28 kell 10:04, kirjutas Hannu Krosing:
> Ühel kenal päeval, E, 2006-02-27 kell 15:05, kirjutas Tom Lane:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > > On Mon, 27 Feb 2006, Tom Lane wrote:
> > >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> > >> on correct functioning of user-written code --- consider a functional
> > >> index involving a user-written function that was claimed to be immutable
> > >> and is not.
> > 
> > > If the user-defined function is broken, you're in more or less trouble 
> > > anyway.
> > 
> > Less.  A non-immutable function might result in lookup failures (not
> > finding the row you need) but not in database corruption, which is what
> > would ensue if VACUUM fails to remove an index tuple. 

Or do you man that an index entry pointing to a non-existing tuple is
"corruption" ? It would be realtively easy to teach index access method
to just ignore (or even delete) the dangling index entries.

> Arguably the database is *already* broken once one has used a
> non-immutable function in index ops.
> 
> "Failing to remove" is a condition that is easily detected, so one can
> flag the operator class as lossy (RECHECK) and actually fix that
> brokenness.

Ok, maybe not fix but alleviate - you wont get any non-matching tuples,
but there still remains the possibility to get the sam tuple twice.

------------
Hannu




Re: Dead Space Map

From
Heikki Linnakangas
Date:
On Mon, 27 Feb 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On Mon, 27 Feb 2006, Tom Lane wrote:
>>> This strikes me as a fairly bad idea, because it makes VACUUM dependent
>>> on correct functioning of user-written code --- consider a functional
>>> index involving a user-written function that was claimed to be immutable
>>> and is not.
>
>> If the user-defined function is broken, you're in more or less trouble
>> anyway.
>
> Less.  A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple.  The index entry
> would eventually point to a wrong table entry, after the table item slot
> gets recycled for another tuple.

It's easy to detect when it happens (that you don't find the row in the 
index). You can then complain loudly or fall back to the full scan.

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology.  It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum.  If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

It really depends a lot on what kind of indexes the table has, clustering 
order, and what kind of deletions and updates happen. Assuming a random 
distribution of dead tuples, a scan is indeed going to be more efficient 
as you say.

Assuming a tightly packed bunch of dead tuples, however, say after a 
"delete from log where timestamp < now() - 1 month", you would benefit 
from a partial vacuum.

That's certainly something that needs to be tested. It's important to 
implement so that it falls back to full scan when it's significantly 
faster.

- Heikki


Re: Dead Space Map

From
Alvaro Herrera
Date:
Hannu Krosing wrote:

> In the "big table with hotspots" case, I doubt that the win from
> btbulkdelete will outweight having to scan 100000 index pages to get to
> the 30 or 50 that can be bulkdeleted.

Remember that WAL traffic is fsync'ed, so it can be _much_ slower than
anything else.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Dead Space Map

From
Martijn van Oosterhout
Date:
On Tue, Feb 28, 2006 at 09:52:24AM +0200, Hannu Krosing wrote:
> WHY does vacuum need to be a tranasction ? I thought it was a programmer
> workload optimisation (aka. lazyness :) ) to require ordinary lazy
> vacuum to be in transaction.
>
> There is no fundamental reason, why vacuum needs to run in a transaction
> itselt.

AIUI, vacuum needs to take locks on tables and possibly wait on other
transactions to complete. Similarly, other transactions may need to
wait on the vacuum (DDL statements). The mechanism in postgres to clean
up locks on crash and handle deadlock detection (I think) requires
everybody to be running in a transaction, so vacuum does too...

At least, that's what I thought,

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Dead Space Map

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> There is no fundamental reason, why vacuum needs to run in a transaction
> itselt.

I'll just take one point: you cannot hold locks unless you have a
transaction, therefore you cannot even look up the catalog info about
the table you wish to vacuum; let alone prevent someone else from
dropping the table under you.

The current bgwriter is incapable of looking at catalog contents as
such --- it operates only at the level of physical data blocks.
        regards, tom lane


Re: Dead Space Map

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> Or do you man that an index entry pointing to a non-existing tuple is
> "corruption" ? It would be realtively easy to teach index access method
> to just ignore (or even delete) the dangling index entries.

No, I mean that an index entry pointing at a WRONG tuple is corruption
(and no need for quotes about that one).  And that is the state that
will ensue, as soon as someone re-uses the line pointer.
        regards, tom lane


Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2006-02-28 kell 10:26, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > There is no fundamental reason, why vacuum needs to run in a transaction
> > itselt.
> 
> I'll just take one point: you cannot hold locks unless you have a
> transaction, therefore you cannot even look up the catalog info about
> the table you wish to vacuum; let alone prevent someone else from
> dropping the table under you.
> 
> The current bgwriter is incapable of looking at catalog contents as
> such --- it operates only at the level of physical data blocks.

Would'nt it still be possible to drop a table from below bgwriter ?

Or will pgwriter just ignore that.

--------------
Hannu




Re: Dead Space Map

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2006-02-28 kell 10:16, kirjutas Alvaro Herrera:
> Hannu Krosing wrote:
> 
> > In the "big table with hotspots" case, I doubt that the win from
> > btbulkdelete will outweight having to scan 100000 index pages to get to
> > the 30 or 50 that can be bulkdeleted.
> 
> Remember that WAL traffic is fsync'ed, so it can be _much_ slower than
> anything else.

But it should need to be fsynced only once, at commit time, so if you
can clean up more than one page, you end up with many wal records, but
just one fsync.

-------------
Hannu




Re: Dead Space Map

From
"Jim C. Nasby"
Date:
On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote:
> But I think the thought process went the other direction. If you have the bit
> intended for index scans indicating that the tuple is not "in doubt" ie, it's
> visible to every transaction, then that also implies the tuple doesn't need to
> be visited by vacuum.
> 
> Skipping pages that don't contain any in doubt tuples would be a huge win.
> Even if there might be some additional pages that vacuum could in theory be
> skipping too.

Agreed. IMO, *anything* that improves the efficiency of vacuum would be
of huge benefit, and keeping a bitmap of pages that are known to be 100%
visible would be a big start in that direction. For many large tables,
this case would cover a large percentage of pages, speeding up not only
vacuum but also index scans.

ISTM that the continuing debate about how to improve vacuum is due
largely to the fact that there are a very large number of possibilities.
I would very much like to see a decision on one to impliment as a
starting point. Ideas about some kind of dead-space-map, or a
known-clean-map have been floating around for at least 2 versions now.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Dead Space Map

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> Ühel kenal päeval, T, 2006-02-28 kell 10:26, kirjutas Tom Lane:
>> The current bgwriter is incapable of looking at catalog contents as
>> such --- it operates only at the level of physical data blocks.

> Would'nt it still be possible to drop a table from below bgwriter ?

The mechanism that handles that is that smgr.c calls
DropRelFileNodeBuffers before physically unlinking the file.
Hence, by the time the unlink happens, there is not any buffer
that the bgwriter might be trying to write into it.

Processes that might try to read in new pages must hold some kind
of lock on the relation the page belongs to, hence they must be
running a transaction.  Otherwise there would be a race condition here.
(The process executing the DROP TABLE must hold exclusive lock on the
table, thereby guaranteeing that there is no one trying to read in
new pages from it.)
        regards, tom lane


Re: Dead Space Map

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote:
> > But I think the thought process went the other direction. If you have the bit
> > intended for index scans indicating that the tuple is not "in doubt" ie, it's
> > visible to every transaction, then that also implies the tuple doesn't need to
> > be visited by vacuum.
> > 
> > Skipping pages that don't contain any in doubt tuples would be a huge win.
> > Even if there might be some additional pages that vacuum could in theory be
> > skipping too.
> 
> Agreed. IMO, *anything* that improves the efficiency of vacuum would be
> of huge benefit, and keeping a bitmap of pages that are known to be 100%
> visible would be a big start in that direction. For many large tables,
> this case would cover a large percentage of pages, speeding up not only
> vacuum but also index scans.
> 
> ISTM that the continuing debate about how to improve vacuum is due
> largely to the fact that there are a very large number of possibilities.
> I would very much like to see a decision on one to impliment as a
> starting point. Ideas about some kind of dead-space-map, or a
> known-clean-map have been floating around for at least 2 versions now.

Right, we should discuss all the possibilities and do something.  I
think we just don't know what yet.

One idea Simon and I had was to reuse heap rows where all indexed values
in the old row and the new row were the same, meaning the heap value
could be replaced without changing the indexes at all.  We thought this
would be very useful for frequently-updated rows.  It could also be used
if no index are on the table.

--  Bruce Momjian   http://candle.pha.pa.us SRA OSS, Inc.   http://www.sraoss.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Dead Space Map

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> One idea Simon and I had was to reuse heap rows where all indexed values
> in the old row and the new row were the same, meaning the heap value
> could be replaced without changing the indexes at all.  We thought this
> would be very useful for frequently-updated rows.  It could also be used
> if no index are on the table.

MVCC goes out the window, eh?  Not to mention transaction rollback ability?
        regards, tom lane


Re: Dead Space Map

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > One idea Simon and I had was to reuse heap rows where all indexed values
> > in the old row and the new row were the same, meaning the heap value
> > could be replaced without changing the indexes at all.  We thought this
> > would be very useful for frequently-updated rows.  It could also be used
> > if no index are on the table.
> 
> MVCC goes out the window, eh?  Not to mention transaction rollback ability?

If the old row is not visible to any transactions, why would it not work?

--  Bruce Momjian   http://candle.pha.pa.us SRA OSS, Inc.   http://www.sraoss.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Dead Space Map

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> MVCC goes out the window, eh?  Not to mention transaction rollback ability?

> If the old row is not visible to any transactions, why would it not work?

The old row is ALWAYS visible to somebody, until you commit (if you ever
do).  You can't simply overwrite existing data.
        regards, tom lane


Re: Dead Space Map

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> MVCC goes out the window, eh?  Not to mention transaction rollback ability?
> 
> > If the old row is not visible to any transactions, why would it not work?
> 
> The old row is ALWAYS visible to somebody, until you commit (if you ever
> do).  You can't simply overwrite existing data.

Huh, I am not suggesting overwriting tuples you created, but tuples
created by earlier transactions and now invisible to everyone.

I should be clearer.  Suppose you have a table with a single index on
the primary key.  You are updating the row over and over again (a
typical case).  You create the first row, commit, then it is updated
(two copies), commit, then you update it again.  That first created row
might not be visible to anyone, but has the same index value as the new
row you are about to add.  Why not reused that heap tuple?

--  Bruce Momjian   http://candle.pha.pa.us SRA OSS, Inc.   http://www.sraoss.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Dead Space Map

From
Heikki Linnakangas
Date:
On Tue, 28 Feb 2006, Bruce Momjian wrote:

> Tom Lane wrote:
>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>> Tom Lane wrote:
>>>> MVCC goes out the window, eh?  Not to mention transaction rollback ability?
>>
>>> If the old row is not visible to any transactions, why would it not work?
>>
>> The old row is ALWAYS visible to somebody, until you commit (if you ever
>> do).  You can't simply overwrite existing data.
>
> Huh, I am not suggesting overwriting tuples you created, but tuples
> created by earlier transactions and now invisible to everyone.
>
> I should be clearer.  Suppose you have a table with a single index on
> the primary key.  You are updating the row over and over again (a
> typical case).  You create the first row, commit, then it is updated
> (two copies), commit, then you update it again.  That first created row
> might not be visible to anyone, but has the same index value as the new
> row you are about to add.  Why not reused that heap tuple?

Index tuples have commit info hint bits in them, don't they? You would 
still have to update those.

- Heikki


Re: Dead Space Map

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> On Tue, 28 Feb 2006, Bruce Momjian wrote:
> 
> > Tom Lane wrote:
> >> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >>> Tom Lane wrote:
> >>>> MVCC goes out the window, eh?  Not to mention transaction rollback ability?
> >>
> >>> If the old row is not visible to any transactions, why would it not work?
> >>
> >> The old row is ALWAYS visible to somebody, until you commit (if you ever
> >> do).  You can't simply overwrite existing data.
> >
> > Huh, I am not suggesting overwriting tuples you created, but tuples
> > created by earlier transactions and now invisible to everyone.
> >
> > I should be clearer.  Suppose you have a table with a single index on
> > the primary key.  You are updating the row over and over again (a
> > typical case).  You create the first row, commit, then it is updated
> > (two copies), commit, then you update it again.  That first created row
> > might not be visible to anyone, but has the same index value as the new
> > row you are about to add.  Why not reused that heap tuple?
> 
> Index tuples have commit info hint bits in them, don't they? You would 
> still have to update those.

Right, but consider that we would have to use the index to find the
matching reusable heap to begin with, so we could clear the hint bit.

--  Bruce Momjian   http://candle.pha.pa.us SRA OSS, Inc.   http://www.sraoss.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Dead Space Map

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> The old row is ALWAYS visible to somebody, until you commit (if you ever
>> do).  You can't simply overwrite existing data.

> Huh, I am not suggesting overwriting tuples you created, but tuples
> created by earlier transactions and now invisible to everyone.

Hm?  If they're invisible to everyone, they're invisible to you too,
so you'd never select such a row as the basis for an update.

> I should be clearer.  Suppose you have a table with a single index on
> the primary key.  You are updating the row over and over again (a
> typical case).  You create the first row, commit, then it is updated
> (two copies), commit, then you update it again.  That first created row
> might not be visible to anyone, but has the same index value as the new
> row you are about to add.  Why not reused that heap tuple?

You appear to be talking about searching the heap to see if there's a
row that is vacuumable but coincidentally has the same physical length
and all the same index values as the row you'd like to insert.  I cannot
believe that the cost of doing that on every insert is a win compared to
vacuum.  Point 1: the cost is being paid up front (in the foreground
inserting transaction) not in a background vacuum.  Point 2: you cannot
just assume that such a row actually has the index entries you need ---
it might predate the creation of some indexes.  You'd have to actually
probe each of the indexes to see if there's an entry pointing at the
row.  Point 3: you can't do this if vacuum is currently running on the
table, as it might be in the midst of removing that same entry.  Hence
such an insert requires locking out vacuum, which eliminates one of the
main reasons for having lazy vacuum in the first place.  Point 4: you
also have conflicts against other inserts that might be trying to seize
on that same dead row.  The locking needed to fix that is considerably
worse than the short-term lock needed to soak up some free space on a
page, because you'd have to hold it across the time needed to check all
the indexes per point 2.
        regards, tom lane


Re: Dead Space Map

From
Greg Stark
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> I should be clearer.  Suppose you have a table with a single index on
> the primary key.  You are updating the row over and over again (a
> typical case).  You create the first row, commit, then it is updated
> (two copies), commit, then you update it again.  That first created row
> might not be visible to anyone, but has the same index value as the new
> row you are about to add.  Why not reused that heap tuple?

If you commit each update then your tuple might well be visible to other
transactions, how would you check that?

I originally thought you meant if you are repeatedly updating the same record
within the same transaction. In that case sure you could reuse the space but
a) only if it's big enough for the new record and b) how often do you really
do that?

-- 
greg



Re: Dead Space Map

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Index tuples have commit info hint bits in them, don't they?

Oooh, I forgot that one, but that's definitely strike 5.

And there's a strike 6: I'm pretty sure this idea breaks unique-index
checking.  Somebody else trying to insert a tuple with a duplicate key
value might have just looked at your tuple and concluded it was dead,
hence they could go ahead and insert their tuple ... but at the time you
are looking, they haven't managed to insert their index entry quite yet.
Even if you can avoid that race condition, it will certainly take a
second index search to catch the problem.
        regards, tom lane


Re: Dead Space Map

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Heikki Linnakangas wrote:
>> Index tuples have commit info hint bits in them, don't they? You would 
>> still have to update those.

> Right, but consider that we would have to use the index to find the
> matching reusable heap to begin with, so we could clear the hint bit.

Not that easy: there's a race condition, ie, someone who's just visited
the dead tuple might be on their way back to the index to set the hint
bit.

Have we killed this idea sufficiently dead yet?
        regards, tom lane


Re: Dead Space Map

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I originally thought you meant if you are repeatedly updating the same record
> within the same transaction. In that case sure you could reuse the space but
> a) only if it's big enough for the new record and b) how often do you really
> do that?

Also, it's not that easy even within a single transaction: you could
have active snapshots inside your xact that can still see the obsoleted
tuple.
        regards, tom lane


Re: Dead Space Map

From
Bruce Momjian
Date:
Greg Stark wrote:
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > I should be clearer.  Suppose you have a table with a single index on
> > the primary key.  You are updating the row over and over again (a
> > typical case).  You create the first row, commit, then it is updated
> > (two copies), commit, then you update it again.  That first created row
> > might not be visible to anyone, but has the same index value as the new
> > row you are about to add.  Why not reused that heap tuple?
> 
> If you commit each update then your tuple might well be visible to other
> transactions, how would you check that?

You check the xmin/xmax using standard visibility rules.

> I originally thought you meant if you are repeatedly updating the same record
> within the same transaction. In that case sure you could reuse the space but
> a) only if it's big enough for the new record and b) how often do you really
> do that?

Right, that is a rare case.

--  Bruce Momjian   http://candle.pha.pa.us SRA OSS, Inc.   http://www.sraoss.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Dead Space Map

From
Greg Stark
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Greg Stark wrote:
> > 
> > If you commit each update then your tuple might well be visible to other
> > transactions, how would you check that?
> 
> You check the xmin/xmax using standard visibility rules.

Would that really find anything? Even putting aside your own transaction
wouldn't it pick up any older transaction that started before yours and is
still running? They can't really see it but short of actually looking at their
snapshots I don't think you can determine that.

-- 
greg



Re: Dead Space Map

From
"Jim C. Nasby"
Date:
On Tue, Feb 28, 2006 at 11:58:44AM -0500, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote:
> > > But I think the thought process went the other direction. If you have the bit
> > > intended for index scans indicating that the tuple is not "in doubt" ie, it's
> > > visible to every transaction, then that also implies the tuple doesn't need to
> > > be visited by vacuum.
> > > 
> > > Skipping pages that don't contain any in doubt tuples would be a huge win.
> > > Even if there might be some additional pages that vacuum could in theory be
> > > skipping too.
> > 
> > Agreed. IMO, *anything* that improves the efficiency of vacuum would be
> > of huge benefit, and keeping a bitmap of pages that are known to be 100%
> > visible would be a big start in that direction. For many large tables,
> > this case would cover a large percentage of pages, speeding up not only
> > vacuum but also index scans.
> > 
> > ISTM that the continuing debate about how to improve vacuum is due
> > largely to the fact that there are a very large number of possibilities.
> > I would very much like to see a decision on one to impliment as a
> > starting point. Ideas about some kind of dead-space-map, or a
> > known-clean-map have been floating around for at least 2 versions now.
> 
> Right, we should discuss all the possibilities and do something.  I
> think we just don't know what yet.

I guess that might make more sense if the discussions were more
organized so that we could go back later and easily find the pros and
cons of all the options, but that doesn't really exist. Sure, there's
the archives, but trying to read through discussions there is a PITA,
let alone trying to find all the relevant bits in the first place. I
doubt that will change without bringing some kind of more formal
collaboration tool online (something I really doubt folks would go for).

In this case, ISTM that most of the ideas have already been pretty well
hashed out, and I can't think of anything that would be a better place
to start than either a known clean map or a dead space map. The
advantages of each seem pretty clear and I think the implimentation
issues are pretty well hashed out.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Dead Space Map

From
Sokolov Yura
Date:
>On Tue, 28 Feb 2006 01:04:00 -0500, Tom Lane wrote 
>"Jim C. Nasby" <jnasby ( at ) pervasive ( dot ) com> writes:
>> On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
>> Moreover, you haven't pointed to any strong reason to adopt this
>>> methodology.  It'd only be a win when vacuuming pretty small numbers
>>> of tuples, which is not the design center for VACUUM, and isn't likely
>>> to be the case in practice either if you're using autovacuum.  If you're
>>> removing say 1% of the tuples, you are likely to be hitting every index
>>> page to do it, meaning that the scan approach will be significantly
>>> *more* efficient than retail lookups.

>> The use case is any large table that sees updates in 'hot spots'.
>> Anything that's based on current time is a likely candidate, since often
>> most activity only concerns the past few days of data.

>I'm unmoved by that argument too.  If the updates are clustered then
>another effect kicks in: the existing btbulkdelete approach is able to
>collapse all the deletions on a given index page into one WAL record.
>With retail deletes it'd be difficult if not impossible to do that,
>resulting in a significant increase in WAL traffic during a vacuum.
>(We know it's significant because we saw a good improvement when we
>fixed btbulkdelete to work that way, instead of issuing a separate
>WAL record per deleted index entry as it once did.)

Excuse me for cutting in.

I think it is possible to build a bitmap of _index_ pages dinamicly while scanning table pages.
If we touched more than 1/3(for example) of index pages (for certain index) 
then we stop looking tuples in that index and fall back to full index scan for VACUUMing it.
If not, then we can VACUUM only touched pages of this index.

Excuse my English.