Thread: Dead Space Map
Hi,

The idea of using a so-called dead space map to speed up vacuum has come up multiple times on this list in the last couple of years. I wrote an initial implementation of it to measure the performance impact it has on updates and on vacuum.

Potential uses for a dead space map are:

* speed up vacuum when there are few dead tuples

Vacuum will need to be modified to use index lookups to find the index tuples corresponding to the dead heap tuples. Otherwise you have to scan through all the indexes anyway.

* vacuuming pages one by one as they're written by bgwriter

I'm not sure how much difference this would make, but it would be an interesting experiment. In theory, you could save a lot of total I/O, because you would not need to come back to vacuum the pages later, but you would have to read in any index pages pointing to the dead heap tuples inside bgwriter.

* implementation of index-only scans

An index scan would not have to check the visibility information of heap tuples on those heap pages that are marked as clean in the dead space map. This requires that the dead space map is implemented so that a page is reliably marked as dirty in all circumstances when it contains any tuples that are not visible to all backends.

The obvious drawback is that heap updates need to update the dead space map too.

My current implementation stores a bitmap of 32k bits in the special space of every 32k heap pages. Each bit in the bitmap corresponds to one heap page. The bit is set every time a tuple is updated, and it's cleared by vacuum. This is a very simple approach, and doesn't take much space.

Is there something I'm missing? Any ideas?

I'm going to have some spare time to hack PostgreSQL in the coming months, and I'm thinking of refining this if there's interest. Is anyone else working on this?

- Heikki
Heikki,

On 2/27/06 9:53 AM, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
> My current implementation stores a bitmap of 32k bits in the special space
> of every 32k heap pages. Each bit in the bitmap corresponds to one heap page.
> The bit is set every time a tuple is updated, and it's cleared by vacuum.
> This is a very simple approach, and doesn't take much space.
>
> Is there something I'm missing? Any ideas?

Sounds great!

> I'm going to have some spare time to hack PostgreSQL in the coming
> months, and I'm thinking of refining this if there's interest. Is anyone
> else working on this?

This idea seems like it could dramatically improve vacuum - commonly a big issue.

- Luke
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Vacuum will need to be modified to use index lookups to find index tuples
> corresponding to the dead heap tuples. Otherwise you have to scan through
> all the indexes anyway.

This strikes me as a fairly bad idea, because it makes VACUUM dependent on correct functioning of user-written code --- consider a functional index involving a user-written function that was claimed to be immutable and is not. There are concurrency-safety issues too, I think, having to do with the way that btree ensures we don't delete any index tuple that some scan is stopped on.

> * vacuuming pages one by one as they're written by bgwriter

That's not happening. VACUUM has to be a transaction and the bgwriter does not run transactions; nor is it in any position to clean out index entries associated with a heap page. (To change this would at a minimum require instituting a separate bgwriter process per database; or else a wholesale rewrite of our catalog access infrastructure to allow it to work in a non-database-specific context. There are also interesting deadlock problems to think about if the bgwriter can be blocked by other transactions, or if it needs to read pages not currently in shared memory.)

> * implementation of index-only scans
> An index scan would not have to check the visibility information of heap
> tuples on those heap pages that are marked as clean in the dead space map.
> This requires that the dead space map is implemented so that a page is
> reliably marked as dirty in all circumstances when it contains any tuples
> that are not visible to all backends.

The "reliably" part of this is likely to make it a non-starter. Another problem is that the semantics needed by this are not quite the same as the semantics of whether a page needs to be visited by vacuum.

> My current implementation stores a bitmap of 32k bits in the special space
> of every 32k heap pages. Each bit in the bitmap corresponds to one heap page.
> The bit is set every time a tuple is updated, and it's cleared by vacuum.
> This is a very simple approach, and doesn't take much space.

I thought the plan was to use out-of-line storage associated with each table "segment" file.

			regards, tom lane
On Mon, 27 Feb 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Vacuum will need to be modified to use index lookups to find index tuples
>> corresponding to the dead heap tuples. Otherwise you have to scan through
>> all the indexes anyway.
>
> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> on correct functioning of user-written code --- consider a functional
> index involving a user-written function that was claimed to be immutable
> and is not.

If the user-defined function is broken, you're in more or less trouble anyway. A VACUUM FULL or REINDEX can be used to recover.

> There are concurrency-safety issues too, I think, having to
> do with the way that btree ensures we don't delete any index tuple that
> some scan is stopped on.

Hmm, I see. I'll have to study the btree implementation more thoroughly.

>> * implementation of index-only scans
>
>> An index scan would not have to check the visibility information of heap
>> tuples on those heap pages that are marked as clean in the dead space map.
>> This requires that the dead space map is implemented so that a page is
>> reliably marked as dirty in all circumstances when it contains any tuples
>> that are not visible to all backends.
>
> The "reliably" part of this is likely to make it a non-starter.

AFAICS all heap access goes through the functions in heapam.c. They need to be modified to update the dead space map. Also on recovery. The locking semantics of the dead space map need to be thought out, but I don't see any insurmountable problems.

> Another
> problem is that the semantics needed by this are not quite the same as
> the semantics of whether a page needs to be visited by vacuum.

True. We might have to have two bits per page. It's still not that bad, though, compared to the benefit.

>> My current implementation stores a bitmap of 32k bits in the special space
>> of every 32k heap pages. Each bit in the bitmap corresponds to one heap page.
>> The bit is set every time a tuple is updated, and it's cleared by vacuum.
>> This is a very simple approach, and doesn't take much space.
>
> I thought the plan was to use out-of-line storage associated with each
> table "segment" file.

You're probably referring to Alvaro's auto-vacuum todo list from July:
http://archives.postgresql.org/pgsql-hackers/2005-07/msg01029.php

I find it more attractive to put the bitmap in the special space, for the reasons stated earlier by Jan Wieck:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php

That is, so that you can utilize the common buffer management code. Jan also had a plan there for the locking.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Mon, 27 Feb 2006, Tom Lane wrote:
>> This strikes me as a fairly bad idea, because it makes VACUUM dependent
>> on correct functioning of user-written code --- consider a functional
>> index involving a user-written function that was claimed to be immutable
>> and is not.

> If the user-defined function is broken, you're in more or less trouble
> anyway.

Less. A non-immutable function might result in lookup failures (not finding the row you need) but not in database corruption, which is what would ensue if VACUUM fails to remove an index tuple. The index entry would eventually point to a wrong table entry, after the table item slot gets recycled for another tuple.

Moreover, you haven't pointed to any strong reason to adopt this methodology. It'd only be a win when vacuuming pretty small numbers of tuples, which is not the design center for VACUUM, and isn't likely to be the case in practice either if you're using autovacuum. If you're removing say 1% of the tuples, you are likely to be hitting every index page to do it, meaning that the scan approach will be significantly *more* efficient than retail lookups.

			regards, tom lane
On Mon, Feb 27, 2006 at 01:17:36PM -0500, Tom Lane wrote:
> > * vacuuming pages one by one as they're written by bgwriter
>
> That's not happening. VACUUM has to be a transaction and the bgwriter
> does not run transactions; nor is it in any position to clean out index
> entries associated with a heap page. (To change this would at a minimum
> require instituting a separate bgwriter process per database; or else a
> wholesale rewrite of our catalog access infrastructure to allow it to
> work in a non-database-specific context. There are also interesting
> deadlock problems to think about if the bgwriter can be blocked by other
> transactions, or if it needs to read pages not currently in shared memory.)

Or there could be a separate daemon that isn't associated with bgwriter. AFAIK as long as it vacuums the dirty page before bgwriter wants to write it you'd still get the I/O benefit.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Mon, 27 Feb 2006, Tom Lane wrote:
> >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> >> on correct functioning of user-written code --- consider a functional
> >> index involving a user-written function that was claimed to be immutable
> >> and is not.
>
> > If the user-defined function is broken, you're in more or less trouble
> > anyway.
>
> Less. A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple. The index entry
> would eventually point to a wrong table entry, after the table item slot
> gets recycled for another tuple.

Is there some (small) metadata that could be stored in the index to protect against this, perhaps the XID? Granted, it's another 4 bytes, but it would only need to be in functional indexes. And there could still be a means to turn it off, if you're 100% certain that the function is immutable. lower() is probably the biggest use-case here...

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology. It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum. If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

The use case is any large table that sees updates in 'hot spots'. Anything that's based on current time is a likely candidate, since often most activity only concerns the past few days of data.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote: >> Moreover, you haven't pointed to any strong reason to adopt this >> methodology. It'd only be a win when vacuuming pretty small numbers >> of tuples, which is not the design center for VACUUM, and isn't likely >> to be the case in practice either if you're using autovacuum. If you're >> removing say 1% of the tuples, you are likely to be hitting every index >> page to do it, meaning that the scan approach will be significantly >> *more* efficient than retail lookups. > The use case is any large table that sees updates in 'hot spots'. > Anything that's based on current time is a likely candidate, since often > most activity only concerns the past few days of data. I'm unmoved by that argument too. If the updates are clustered then another effect kicks in: the existing btbulkdelete approach is able to collapse all the deletions on a given index page into one WAL record. With retail deletes it'd be difficult if not impossible to do that, resulting in a significant increase in WAL traffic during a vacuum. (We know it's significant because we saw a good improvement when we fixed btbulkdelete to work that way, instead of issuing a separate WAL record per deleted index entry as it once did.) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > * implementation of index-only scans
>
> > An index scan would not have to check the visibility information of heap
> > tuples on those heap pages that are marked as clean in the dead space map.
> > This requires that the dead space map is implemented so that a page is
> > reliably marked as dirty in all circumstances when it contains any tuples
> > that are not visible to all backends.
>
> The "reliably" part of this is likely to make it a non-starter. Another
> problem is that the semantics needed by this are not quite the same as
> the semantics of whether a page needs to be visited by vacuum.

It would be very disappointing if this part doesn't turn out to be possible. I had always thought the semantics were the same, but now I'm realizing that vacuum doesn't need to visit tuples that have been committed even if they're not visible to some transaction. So having a "vacuum can ignore this" bit doesn't help you with index scans.

But I think the thought process went the other direction. If you have the bit intended for index scans indicating that the tuple is not "in doubt", ie, it's visible to every transaction, then that also implies the tuple doesn't need to be visited by vacuum.

Skipping pages that don't contain any in-doubt tuples would be a huge win, even if there might be some additional pages that vacuum could in theory be skipping too.

--
greg
On Mon, 2006-02-27 at 13:17, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Vacuum will need to be modified to use index lookups to find index tuples
> > corresponding to the dead heap tuples. Otherwise you have to scan through
> > all the indexes anyway.
>
> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> on correct functioning of user-written code --- consider a functional
> index involving a user-written function that was claimed to be immutable
> and is not. There are concurrency-safety issues too, I think, having to
> do with the way that btree ensures we don't delete any index tuple that
> some scan is stopped on.
>
> > * vacuuming pages one by one as they're written by bgwriter
>
> That's not happening. VACUUM has to be a transaction

WHY does vacuum need to be a transaction? I thought it was a programmer workload optimisation (aka. laziness :) ) to require ordinary lazy vacuum to be in a transaction.

There is no fundamental reason why vacuum needs to run in a transaction itself.

-----------------
Hannu
On Mon, 2006-02-27 at 15:05, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Mon, 27 Feb 2006, Tom Lane wrote:
> >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> >> on correct functioning of user-written code --- consider a functional
> >> index involving a user-written function that was claimed to be immutable
> >> and is not.
>
> > If the user-defined function is broken, you're in more or less trouble
> > anyway.
>
> Less. A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple.

Arguably the database is *already* broken once one has used a non-immutable function in index ops.

"Failing to remove" is a condition that is easily detected, so one can flag the operator class as lossy (RECHECK) and actually fix that brokenness.

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology. It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum. If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

One use-case would be keeping update-heavy databases healthy, by ensuring that most updates will end up in the same page as the originals. That goal would be easier to achieve if there were a process which reclaims old tuples as fast as possible.

-----------------
Hannu
On Tue, 2006-02-28 at 01:04, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote:
> >> Moreover, you haven't pointed to any strong reason to adopt this
> >> methodology. It'd only be a win when vacuuming pretty small numbers
> >> of tuples, which is not the design center for VACUUM, and isn't likely
> >> to be the case in practice either if you're using autovacuum. If you're
> >> removing say 1% of the tuples, you are likely to be hitting every index
> >> page to do it, meaning that the scan approach will be significantly
> >> *more* efficient than retail lookups.
>
> > The use case is any large table that sees updates in 'hot spots'.
> > Anything that's based on current time is a likely candidate, since often
> > most activity only concerns the past few days of data.
>
> I'm unmoved by that argument too. If the updates are clustered then
> another effect kicks in: the existing btbulkdelete approach is able to
> collapse all the deletions on a given index page into one WAL record.
> With retail deletes it'd be difficult if not impossible to do that,
> resulting in a significant increase in WAL traffic during a vacuum.
> (We know it's significant because we saw a good improvement when we
> fixed btbulkdelete to work that way, instead of issuing a separate
> WAL record per deleted index entry as it once did.)

In the "big table with hotspots" case, I doubt that the win from btbulkdelete will outweigh having to scan 100000 index pages to get to the 30 or 50 that can be bulk-deleted.

--------------
Hannu
On Tue, 2006-02-28 at 10:04, Hannu Krosing wrote:
> On Mon, 2006-02-27 at 15:05, Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > > On Mon, 27 Feb 2006, Tom Lane wrote:
> > >> This strikes me as a fairly bad idea, because it makes VACUUM dependent
> > >> on correct functioning of user-written code --- consider a functional
> > >> index involving a user-written function that was claimed to be immutable
> > >> and is not.
> >
> > > If the user-defined function is broken, you're in more or less trouble
> > > anyway.
> >
> > Less. A non-immutable function might result in lookup failures (not
> > finding the row you need) but not in database corruption, which is what
> > would ensue if VACUUM fails to remove an index tuple.

Or do you mean that an index entry pointing to a non-existing tuple is "corruption"? It would be relatively easy to teach the index access method to just ignore (or even delete) the dangling index entries.

> Arguably the database is *already* broken once one has used a
> non-immutable function in index ops.
>
> "Failing to remove" is a condition that is easily detected, so one can
> flag the operator class as lossy (RECHECK) and actually fix that
> brokenness.

Ok, maybe not fix but alleviate - you won't get any non-matching tuples, but there still remains the possibility of getting the same tuple twice.

------------
Hannu
On Mon, 27 Feb 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On Mon, 27 Feb 2006, Tom Lane wrote:
>>> This strikes me as a fairly bad idea, because it makes VACUUM dependent
>>> on correct functioning of user-written code --- consider a functional
>>> index involving a user-written function that was claimed to be immutable
>>> and is not.
>
>> If the user-defined function is broken, you're in more or less trouble
>> anyway.
>
> Less. A non-immutable function might result in lookup failures (not
> finding the row you need) but not in database corruption, which is what
> would ensue if VACUUM fails to remove an index tuple. The index entry
> would eventually point to a wrong table entry, after the table item slot
> gets recycled for another tuple.

It's easy to detect when it happens (that you don't find the row in the index). You can then complain loudly or fall back to the full scan.

> Moreover, you haven't pointed to any strong reason to adopt this
> methodology. It'd only be a win when vacuuming pretty small numbers
> of tuples, which is not the design center for VACUUM, and isn't likely
> to be the case in practice either if you're using autovacuum. If you're
> removing say 1% of the tuples, you are likely to be hitting every index
> page to do it, meaning that the scan approach will be significantly
> *more* efficient than retail lookups.

It really depends a lot on what kind of indexes the table has, the clustering order, and what kind of deletions and updates happen. Assuming a random distribution of dead tuples, a scan is indeed going to be more efficient, as you say. Assuming a tightly packed bunch of dead tuples, however, say after a "delete from log where timestamp < now() - 1 month", you would benefit from a partial vacuum.

That's certainly something that needs to be tested. It's important to implement it so that it falls back to a full scan when that's significantly faster.

- Heikki
Hannu Krosing wrote:

> In the "big table with hotspots" case, I doubt that the win from
> btbulkdelete will outweigh having to scan 100000 index pages to get to
> the 30 or 50 that can be bulk-deleted.

Remember that WAL traffic is fsync'ed, so it can be _much_ slower than anything else.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Tue, Feb 28, 2006 at 09:52:24AM +0200, Hannu Krosing wrote:
> WHY does vacuum need to be a transaction? I thought it was a programmer
> workload optimisation (aka. laziness :) ) to require ordinary lazy
> vacuum to be in a transaction.
>
> There is no fundamental reason why vacuum needs to run in a transaction
> itself.

AIUI, vacuum needs to take locks on tables and possibly wait on other transactions to complete. Similarly, other transactions may need to wait on the vacuum (DDL statements). The mechanism in postgres to clean up locks on crash and handle deadlock detection (I think) requires everybody to be running in a transaction, so vacuum does too...

At least, that's what I thought,

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Hannu Krosing <hannu@skype.net> writes:
> There is no fundamental reason why vacuum needs to run in a transaction
> itself.

I'll just take one point: you cannot hold locks unless you have a transaction; therefore you cannot even look up the catalog info about the table you wish to vacuum, let alone prevent someone else from dropping the table under you.

The current bgwriter is incapable of looking at catalog contents as such --- it operates only at the level of physical data blocks.

			regards, tom lane
Hannu Krosing <hannu@skype.net> writes:
> Or do you mean that an index entry pointing to a non-existing tuple is
> "corruption"? It would be relatively easy to teach the index access method
> to just ignore (or even delete) the dangling index entries.

No, I mean that an index entry pointing at a WRONG tuple is corruption (and no need for quotes about that one). And that is the state that will ensue as soon as someone re-uses the line pointer.

			regards, tom lane
On Tue, 2006-02-28 at 10:26, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > There is no fundamental reason why vacuum needs to run in a transaction
> > itself.
>
> I'll just take one point: you cannot hold locks unless you have a
> transaction, therefore you cannot even look up the catalog info about
> the table you wish to vacuum; let alone prevent someone else from
> dropping the table under you.
>
> The current bgwriter is incapable of looking at catalog contents as
> such --- it operates only at the level of physical data blocks.

Wouldn't it still be possible to drop a table from below bgwriter? Or will bgwriter just ignore that?

--------------
Hannu
On Tue, 2006-02-28 at 10:16, Alvaro Herrera wrote:
> Hannu Krosing wrote:
>
> > In the "big table with hotspots" case, I doubt that the win from
> > btbulkdelete will outweigh having to scan 100000 index pages to get to
> > the 30 or 50 that can be bulk-deleted.
>
> Remember that WAL traffic is fsync'ed, so it can be _much_ slower than
> anything else.

But it should need to be fsynced only once, at commit time, so if you can clean up more than one page, you end up with many WAL records, but just one fsync.

-------------
Hannu
On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote:
> But I think the thought process went the other direction. If you have the bit
> intended for index scans indicating that the tuple is not "in doubt" ie, it's
> visible to every transaction, then that also implies the tuple doesn't need to
> be visited by vacuum.
>
> Skipping pages that don't contain any in doubt tuples would be a huge win.
> Even if there might be some additional pages that vacuum could in theory be
> skipping too.

Agreed. IMO, *anything* that improves the efficiency of vacuum would be of huge benefit, and keeping a bitmap of pages that are known to be 100% visible would be a big start in that direction. For many large tables, this case would cover a large percentage of pages, speeding up not only vacuum but also index scans.

ISTM that the continuing debate about how to improve vacuum is due largely to the fact that there are a very large number of possibilities. I would very much like to see a decision on one to implement as a starting point. Ideas about some kind of dead-space map, or a known-clean map, have been floating around for at least 2 versions now.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Hannu Krosing <hannu@skype.net> writes:
> On Tue, 2006-02-28 at 10:26, Tom Lane wrote:
>> The current bgwriter is incapable of looking at catalog contents as
>> such --- it operates only at the level of physical data blocks.

> Wouldn't it still be possible to drop a table from below bgwriter?

The mechanism that handles that is that smgr.c calls DropRelFileNodeBuffers before physically unlinking the file. Hence, by the time the unlink happens, there is not any buffer that the bgwriter might be trying to write into it.

Processes that might try to read in new pages must hold some kind of lock on the relation the page belongs to, hence they must be running a transaction. Otherwise there would be a race condition here. (The process executing the DROP TABLE must hold exclusive lock on the table, thereby guaranteeing that there is no one trying to read in new pages from it.)

			regards, tom lane
Jim C. Nasby wrote:
> On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote:
> > But I think the thought process went the other direction. If you have the bit
> > intended for index scans indicating that the tuple is not "in doubt" ie, it's
> > visible to every transaction, then that also implies the tuple doesn't need to
> > be visited by vacuum.
> >
> > Skipping pages that don't contain any in doubt tuples would be a huge win.
> > Even if there might be some additional pages that vacuum could in theory be
> > skipping too.
>
> Agreed. IMO, *anything* that improves the efficiency of vacuum would be
> of huge benefit, and keeping a bitmap of pages that are known to be 100%
> visible would be a big start in that direction. For many large tables,
> this case would cover a large percentage of pages, speeding up not only
> vacuum but also index scans.
>
> ISTM that the continuing debate about how to improve vacuum is due
> largely to the fact that there are a very large number of possibilities.
> I would very much like to see a decision on one to implement as a
> starting point. Ideas about some kind of dead-space map, or a
> known-clean map, have been floating around for at least 2 versions now.

Right, we should discuss all the possibilities and do something. I think we just don't know what yet.

One idea Simon and I had was to reuse heap rows where all indexed values in the old row and the new row were the same, meaning the heap value could be replaced without changing the indexes at all. We thought this would be very useful for frequently-updated rows. It could also be used if no indexes are on the table.

--
  Bruce Momjian                        http://candle.pha.pa.us
  SRA OSS, Inc.                        http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> One idea Simon and I had was to reuse heap rows where all indexed values
> in the old row and the new row were the same, meaning the heap value
> could be replaced without changing the indexes at all. We thought this
> would be very useful for frequently-updated rows. It could also be used
> if no indexes are on the table.

MVCC goes out the window, eh? Not to mention transaction rollback ability?

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > One idea Simon and I had was to reuse heap rows where all indexed values
> > in the old row and the new row were the same, meaning the heap value
> > could be replaced without changing the indexes at all. We thought this
> > would be very useful for frequently-updated rows. It could also be used
> > if no indexes are on the table.
>
> MVCC goes out the window, eh? Not to mention transaction rollback ability?

If the old row is not visible to any transactions, why would it not work?

--
  Bruce Momjian                        http://candle.pha.pa.us
  SRA OSS, Inc.                        http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> MVCC goes out the window, eh? Not to mention transaction rollback ability?

> If the old row is not visible to any transactions, why would it not work?

The old row is ALWAYS visible to somebody, until you commit (if you ever do). You can't simply overwrite existing data.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> MVCC goes out the window, eh? Not to mention transaction rollback ability?
>
> > If the old row is not visible to any transactions, why would it not work?
>
> The old row is ALWAYS visible to somebody, until you commit (if you ever
> do). You can't simply overwrite existing data.

Huh, I am not suggesting overwriting tuples you created, but tuples created by earlier transactions and now invisible to everyone.

I should be clearer. Suppose you have a table with a single index on the primary key. You are updating the row over and over again (a typical case). You create the first row, commit, then it is updated (two copies), commit, then you update it again. That first created row might not be visible to anyone, but has the same index value as the new row you are about to add. Why not reuse that heap tuple?

--
  Bruce Momjian                        http://candle.pha.pa.us
  SRA OSS, Inc.                        http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
On Tue, 28 Feb 2006, Bruce Momjian wrote: > Tom Lane wrote: >> Bruce Momjian <pgman@candle.pha.pa.us> writes: >>> Tom Lane wrote: >>>> MVCC goes out the window, eh? Not to mention transaction rollback ability? >> >>> If the old row is not visible to any transactions, why would it not work? >> >> The old row is ALWAYS visible to somebody, until you commit (if you ever >> do). You can't simply overwrite existing data. > > Huh, I am not suggesting overwriting tuples you created, but tuples > created by earlier transactions and now invisible to everyone. > > I should be clearer. Suppose you have a table with a single index on > the primary key. You are updating the row over and over again (a > typical case). You create the first row, commit, then it is updated > (two copies), commit, then you update it again. That first created row > might not be visible to anyone, but has the same index value as the new > row you are about to add. Why not reused that heap tuple? Index tuples have commit info hint bits in them, don't they? You would still have to update those. - Heikki
Heikki Linnakangas wrote: > On Tue, 28 Feb 2006, Bruce Momjian wrote: > > > Tom Lane wrote: > >> Bruce Momjian <pgman@candle.pha.pa.us> writes: > >>> Tom Lane wrote: > >>>> MVCC goes out the window, eh? Not to mention transaction rollback ability? > >> > >>> If the old row is not visible to any transactions, why would it not work? > >> > >> The old row is ALWAYS visible to somebody, until you commit (if you ever > >> do). You can't simply overwrite existing data. > > > > Huh, I am not suggesting overwriting tuples you created, but tuples > > created by earlier transactions and now invisible to everyone. > > > > I should be clearer. Suppose you have a table with a single index on > > the primary key. You are updating the row over and over again (a > > typical case). You create the first row, commit, then it is updated > > (two copies), commit, then you update it again. That first created row > > might not be visible to anyone, but has the same index value as the new > > row you are about to add. Why not reuse that heap tuple? > > Index tuples have commit info hint bits in them, don't they? You would > still have to update those. Right, but consider that we would have to use the index to find the matching reusable heap tuple to begin with, so we could clear the hint bit. -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> The old row is ALWAYS visible to somebody, until you commit (if you ever >> do). You can't simply overwrite existing data. > Huh, I am not suggesting overwriting tuples you created, but tuples > created by earlier transactions and now invisible to everyone. Hm? If they're invisible to everyone, they're invisible to you too, so you'd never select such a row as the basis for an update. > I should be clearer. Suppose you have a table with a single index on > the primary key. You are updating the row over and over again (a > typical case). You create the first row, commit, then it is updated > (two copies), commit, then you update it again. That first created row > might not be visible to anyone, but has the same index value as the new > row you are about to add. Why not reused that heap tuple? You appear to be talking about searching the heap to see if there's a row that is vacuumable but coincidentally has the same physical length and all the same index values as the row you'd like to insert. I cannot believe that the cost of doing that on every insert is a win compared to vacuum. Point 1: the cost is being paid up front (in the foreground inserting transaction) not in a background vacuum. Point 2: you cannot just assume that such a row actually has the index entries you need --- it might predate the creation of some indexes. You'd have to actually probe each of the indexes to see if there's an entry pointing at the row. Point 3: you can't do this if vacuum is currently running on the table, as it might be in the midst of removing that same entry. Hence such an insert requires locking out vacuum, which eliminates one of the main reasons for having lazy vacuum in the first place. Point 4: you also have conflicts against other inserts that might be trying to seize on that same dead row. 
The locking needed to fix that is considerably worse than the short-term lock needed to soak up some free space on a page, because you'd have to hold it across the time needed to check all the indexes per point 2. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I should be clearer. Suppose you have a table with a single index on > the primary key. You are updating the row over and over again (a > typical case). You create the first row, commit, then it is updated > (two copies), commit, then you update it again. That first created row > might not be visible to anyone, but has the same index value as the new > row you are about to add. Why not reused that heap tuple? If you commit each update then your tuple might well be visible to other transactions, how would you check that? I originally thought you meant if you are repeatedly updating the same record within the same transaction. In that case sure you could reuse the space but a) only if it's big enough for the new record and b) how often do you really do that? -- greg
Heikki Linnakangas <hlinnaka@iki.fi> writes: > Index tuples have commit info hint bits in them, don't they? Oooh, I forgot that one, but that's definitely strike 5. And there's a strike 6: I'm pretty sure this idea breaks unique-index checking. Somebody else trying to insert a tuple with a duplicate key value might have just looked at your tuple and concluded it was dead, hence they could go ahead and insert their tuple ... but at the time you are looking, they haven't managed to insert their index entry quite yet. Even if you can avoid that race condition, it will certainly take a second index search to catch the problem. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Heikki Linnakangas wrote: >> Index tuples have commit info hint bits in them, don't they? You would >> still have to update those. > Right, but consider that we would have to use the index to find the > matching reusable heap to begin with, so we could clear the hint bit. Not that easy: there's a race condition, ie, someone who's just visited the dead tuple might be on their way back to the index to set the hint bit. Have we killed this idea sufficiently dead yet? regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > I originally thought you meant if you are repeatedly updating the same record > within the same transaction. In that case sure you could reuse the space but > a) only if it's big enough for the new record and b) how often do you really > do that? Also, it's not that easy even within a single transaction: you could have active snapshots inside your xact that can still see the obsoleted tuple. regards, tom lane
Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > I should be clearer. Suppose you have a table with a single index on > > the primary key. You are updating the row over and over again (a > > typical case). You create the first row, commit, then it is updated > > (two copies), commit, then you update it again. That first created row > > might not be visible to anyone, but has the same index value as the new > > row you are about to add. Why not reused that heap tuple? > > If you commit each update then your tuple might well be visible to other > transactions, how would you check that? You check the xmin/xmax using standard visibility rules. > I originally thought you meant if you are repeatedly updating the same record > within the same transaction. In that case sure you could reuse the space but > a) only if it's big enough for the new record and b) how often do you really > do that? Right, that is a rare case. -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Greg Stark wrote: > > > > If you commit each update then your tuple might well be visible to other > > transactions, how would you check that? > > You check the xmin/xmax using standard visibility rules. Would that really find anything? Even putting aside your own transaction wouldn't it pick up any older transaction that started before yours and is still running? They can't really see it but short of actually looking at their snapshots I don't think you can determine that. -- greg
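[Editor's note] Greg's objection can be made concrete with a toy model of the "dead to everyone" test. The sketch below is illustrative only (the function and parameter names are invented, not PostgreSQL's actual visibility code): an old tuple version is safely reusable only if the deleting transaction committed and its xmax precedes the oldest xmin of every still-running snapshot, which is exactly the information Bruce's xmin/xmax check would need from other backends.

```python
def dead_to_everyone(tuple_xmax, xmax_committed, running_snapshot_xmins):
    # A tuple version is safely reusable only if the deleting transaction
    # committed AND no live snapshot could still consider it visible,
    # i.e. its xmax precedes every backend's snapshot xmin.
    if not xmax_committed:
        return False
    oldest_xmin = min(running_snapshot_xmins, default=float("inf"))
    return tuple_xmax < oldest_xmin

# Old version deleted by xact 100; a snapshot with xmin 90 is still running,
# so the tuple is NOT dead to everyone -- exactly Greg's point.
print(dead_to_everyone(100, True, [90, 120]))   # -> False
print(dead_to_everyone(100, True, [120, 130]))  # -> True
```

The catch Greg raises is that computing `running_snapshot_xmins` requires looking at every other backend's snapshot state on each update, which is not free.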
On Tue, Feb 28, 2006 at 11:58:44AM -0500, Bruce Momjian wrote: > Jim C. Nasby wrote: > > On Tue, Feb 28, 2006 at 01:18:14AM -0500, Greg Stark wrote: > > > But I think the thought process went the other direction. If you have the bit > > > intended for index scans indicating that the tuple is not "in doubt" ie, it's > > > visible to every transaction, then that also implies the tuple doesn't need to > > > be visited by vacuum. > > > > > > Skipping pages that don't contain any in doubt tuples would be a huge win. > > > Even if there might be some additional pages that vacuum could in theory be > > > skipping too. > > > > Agreed. IMO, *anything* that improves the efficiency of vacuum would be > > of huge benefit, and keeping a bitmap of pages that are known to be 100% > > visible would be a big start in that direction. For many large tables, > > this case would cover a large percentage of pages, speeding up not only > > vacuum but also index scans. > > > > ISTM that the continuing debate about how to improve vacuum is due > > largely to the fact that there are a very large number of possibilities. > > I would very much like to see a decision on one to implement as a > > starting point. Ideas about some kind of dead-space-map, or a > > known-clean-map have been floating around for at least 2 versions now. > > Right, we should discuss all the possibilities and do something. I > think we just don't know what yet. I guess that might make more sense if the discussions were more organized so that we could go back later and easily find the pros and cons of all the options, but that doesn't really exist. Sure, there's the archives, but trying to read through discussions there is a PITA, let alone trying to find all the relevant bits in the first place. I doubt that will change without bringing some kind of more formal collaboration tool online (something I really doubt folks would go for). 
In this case, ISTM that most of the ideas have already been pretty well hashed out, and I can't think of anything that would be a better place to start than either a known clean map or a dead space map. The advantages of each seem pretty clear and I think the implementation issues are pretty well hashed out. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
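[Editor's note] For readers joining the thread here, the dead space map Heikki described at the top can be sketched in a few lines. This is a toy model, not PostgreSQL code: the class name and API are invented, but the mechanics match his description of one bit per heap page, set on update and cleared by vacuum, so vacuum only visits pages whose bit is set.

```python
class DeadSpaceMap:
    """One bit per heap page: set on update, cleared by vacuum."""

    def __init__(self, n_pages):
        self.bits = bytearray((n_pages + 7) // 8)

    def mark_dirty(self, page):
        # Called from the update path: this page may now hold dead tuples.
        self.bits[page // 8] |= 1 << (page % 8)

    def clear(self, page):
        # Called by vacuum once the page has been cleaned.
        self.bits[page // 8] &= ~(1 << (page % 8))

    def dirty_pages(self):
        # Vacuum scans only pages whose bit is set, skipping the rest.
        return [p for p in range(len(self.bits) * 8)
                if self.bits[p // 8] & (1 << (p % 8))]

dsm = DeadSpaceMap(32768)      # 32k bits, one bitmap per 32k heap pages
for page in (7, 7, 1019):      # an update "hot spot"
    dsm.mark_dirty(page)
print(dsm.dirty_pages())       # -> [7, 1019]
```

The appeal is clear from the storage cost alone: 32,768 pages are covered by a 4 KB bitmap, and a hot-spot workload leaves vacuum with only a handful of set bits to visit instead of a full heap scan.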
On Tue, 28 Feb 2006 01:04:00 -0500, Tom Lane wrote: > "Jim C. Nasby" <jnasby ( at ) pervasive ( dot ) com> writes: >> On Mon, Feb 27, 2006 at 03:05:41PM -0500, Tom Lane wrote: >>> Moreover, you haven't pointed to any strong reason to adopt this >>> methodology. It'd only be a win when vacuuming pretty small numbers >>> of tuples, which is not the design center for VACUUM, and isn't likely >>> to be the case in practice either if you're using autovacuum. If you're >>> removing say 1% of the tuples, you are likely to be hitting every index >>> page to do it, meaning that the scan approach will be significantly >>> *more* efficient than retail lookups. >> The use case is any large table that sees updates in 'hot spots'. >> Anything that's based on current time is a likely candidate, since often >> most activity only concerns the past few days of data. > I'm unmoved by that argument too. If the updates are clustered then > another effect kicks in: the existing btbulkdelete approach is able to > collapse all the deletions on a given index page into one WAL record. > With retail deletes it'd be difficult if not impossible to do that, > resulting in a significant increase in WAL traffic during a vacuum. > (We know it's significant because we saw a good improvement when we > fixed btbulkdelete to work that way, instead of issuing a separate > WAL record per deleted index entry as it once did.) Excuse me for cutting in. I think it is possible to build a bitmap of _index_ pages dynamically while scanning table pages. If we touched more than 1/3 (for example) of the index pages for a certain index, then we stop looking up tuples in that index and fall back to a full index scan for VACUUMing it. If not, then we can VACUUM only the touched pages of this index.
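[Editor's note] The fallback heuristic proposed above can be sketched as follows. Everything here is illustrative: the function name, the return shape, and the 1/3 threshold are all placeholders for whatever a real implementation would choose. While vacuuming the heap you accumulate the set of index pages that would need retail cleanup; if too many are touched, the bulk scan (which also batches WAL per page, per Tom's point) wins anyway.

```python
def plan_index_vacuum(touched_index_pages, total_index_pages, threshold=1/3):
    # Collect the distinct index pages touched while scanning the heap.
    touched = set(touched_index_pages)
    if len(touched) > total_index_pages * threshold:
        # Too much of the index is affected: a btbulkdelete-style full
        # scan is cheaper and batches deletions into per-page WAL records.
        return ("full_scan", None)
    # Few pages touched: visit only those pages for retail cleanup.
    return ("retail", sorted(touched))

print(plan_index_vacuum([3, 8, 8], 100))   # -> ('retail', [3, 8])
print(plan_index_vacuum(range(40), 100))   # -> ('full_scan', None)
```

The threshold is the interesting tuning knob: it trades the I/O saved by skipping untouched index pages against the extra WAL and lookup cost of retail deletion that Tom describes.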