Thread: Quick idea for reducing VACUUM contention
Just wanted to record a quick idea in case it's useful in the future.

VACUUM reads all blocks in sequence and waits on each one to acquire a
cleanup lock. If VACUUM is running with vacuum_delay enabled then we
might take a slightly different approach: Read the heap blocks in
sequence, but make a conditional lock for cleanup on each block. If we
don't get it, sleep, then try again when we wake up. If we fail the
second time, just skip the block completely.

As long as we skip no more than 1% of the blocks we should be able to
do a very good job of cleanup, yet with reduced block contention as the
VACUUM proceeds.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
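A rough sketch of what that per-block loop might look like, built on the
existing bufmgr primitives; this is only an illustration of the idea
above: lazy_scan_page_for_cleanup() and the pages_skipped counter are
made-up names, and a plain pg_usleep stands in for whatever sleep the
cost-based delay machinery would actually do.

    /*
     * Sketch only.  onerel and vacrelstats come from the surrounding
     * lazy-vacuum context; lazy_scan_page_for_cleanup() and the
     * pages_skipped counter are hypothetical.
     */
    BlockNumber blkno;
    BlockNumber nblocks = RelationGetNumberOfBlocks(onerel);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(onerel, blkno);

        if (!ConditionalLockBufferForCleanup(buf))
        {
            /* Someone else holds a pin on the page: sleep once, retry. */
            pg_usleep(VacuumCostDelay * 1000L);

            if (!ConditionalLockBufferForCleanup(buf))
            {
                /* Still contended after the sleep: skip this block. */
                ReleaseBuffer(buf);
                vacrelstats->pages_skipped++;   /* hypothetical counter */
                continue;
            }
        }

        /* We hold the cleanup lock; prune and remove dead tuples. */
        lazy_scan_page_for_cleanup(onerel, buf, vacrelstats);  /* made up */
        UnlockReleaseBuffer(buf);

        /* Honour the usual cost-based vacuum delay between blocks. */
        vacuum_delay_point();
    }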
"Simon Riggs" <simon@2ndquadrant.com> wrote: > Read the heap blocks in sequence, but make a conditional lock for > cleanup on each block. If we don't get it, sleep, then try again when we > wake up. If we fail the second time, just skip the block completely. When we allow some skips in removing dead tuples, can we guarantee pg_class.relfrozenxid? I think we might need additional "freezing-xmax" operations to avoid XID-wraparound in the first path of vacuum, though it hardly occurs. It might be a future topic ... if we are in the direciton of "optimistic sweeping", is it possible to remove the second path of vacuum completely? We just add XID of the vacuum to dead tuples we see in the first path. When backends find a dead tuple and see the transaction identified by XID in it has commited, they can freely reuse the area of the dead tuple because we can assume index entries pointing the tuple have been removed by the vacuum. We would use the infrastructure introduced by HOT for this purpose. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > Read the heap blocks in sequence, but make a conditional lock for
> > cleanup on each block. If we don't get it, sleep, then try again when we
> > wake up. If we fail the second time, just skip the block completely.

It would be cool if we could do something like sweep a range of pages,
initiate I/O for those that are not in shared buffers, and while that is
running, lock and clean up the ones that are in shared buffers, skipping
those that are not lockable right away; when that's done, go back to the
buffers that were fetched by the I/O and clean those up, and retry the
locking for those that couldn't be locked the first time around, also
conditionally. When that's all done, a third pass could get the blocks
that weren't cleaned up in any of the previous passes (and this time the
lock would not be conditional). Then do a vacuum_delay sleep.

> When we allow some skips in removing dead tuples, can we guarantee
> pg_class.relfrozenxid?

No, we can't.

> I think we might need additional "freezing-xmax" operations to avoid
> XID wraparound in the first pass of vacuum, though it would rarely be
> needed.

I'm not sure I follow. Can you elaborate? Do you mean storing a
separate relfrozenxmax for each table, or something like that?

> It might be a future topic ... if we go in the direction of "optimistic
> sweeping", is it possible to remove the second pass of vacuum
> completely? We just add the XID of the vacuum to dead tuples we see in
> the first pass. When a backend finds a dead tuple and sees that the
> transaction identified by that XID has committed, it can freely reuse
> the area of the dead tuple, because we can assume the index entries
> pointing to the tuple have been removed by the vacuum.

I would be worried about leftover index entries being later used by new
tuples in the heap. Then when you visit the index and find that entry,
you go to the heap, find the new tuple and return it, which could be
bogus. (Unless, I think, you check in the index when you are going to
insert the new index tuple -- if the CTID is already used, reuse that
entry or remove it before insertion.)

I don't know. Maybe it's OK, but it seems messy even if it is.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
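A very rough skeleton of that range sweep, just to pin down the ordering
of the passes; every helper here (block_is_cached, start_async_read,
try_cleanup_page, cleanup_page) is an invented name, and the window size
is arbitrary:

    #define SWEEP_RANGE 64                      /* arbitrary window size */

    BlockNumber start;

    for (start = 0; start < nblocks; start += SWEEP_RANGE)
    {
        BlockNumber end = Min(start + SWEEP_RANGE, nblocks);
        BlockNumber b;
        bool        cached[SWEEP_RANGE];
        bool        done[SWEEP_RANGE] = {false};

        /* Kick off I/O for the pages that are not in shared buffers. */
        for (b = start; b < end; b++)
        {
            cached[b - start] = block_is_cached(onerel, b);
            if (!cached[b - start])
                start_async_read(onerel, b);    /* hypothetical async I/O */
        }

        /* Pass 1: resident pages, conditional cleanup lock only. */
        for (b = start; b < end; b++)
            if (cached[b - start])
                done[b - start] = try_cleanup_page(onerel, b);

        /* Pass 2: pages the I/O brought in, plus a conditional retry. */
        for (b = start; b < end; b++)
            if (!done[b - start])
                done[b - start] = try_cleanup_page(onerel, b);

        /* Pass 3: whatever is left, this time waiting for the lock. */
        for (b = start; b < end; b++)
            if (!done[b - start])
                cleanup_page(onerel, b);        /* unconditional lock */

        vacuum_delay_point();
    }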
On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
> ITAGAKI Takahiro wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>>
>>> Read the heap blocks in sequence, but make a conditional lock for
>>> cleanup on each block. If we don't get it, sleep, then try again when
>>> we wake up. If we fail the second time, just skip the block completely.
>
> It would be cool if we could do something like sweep a range of pages,
> initiate I/O for those that are not in shared buffers, and while that is
> running, lock and clean up the ones that are in shared buffers, skipping
> those that are not lockable right away; when that's done, go back to the
> buffers that were fetched by the I/O and clean those up, and retry the
> locking for those that couldn't be locked the first time around, also
> conditionally. When that's all done, a third pass could get the blocks
> that weren't cleaned up in any of the previous passes (and this time the
> lock would not be conditional).

Would that be substantially easier than just creating a bgreader?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim Nasby wrote:
> On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
>> ITAGAKI Takahiro wrote:
>>> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>>>
>>>> Read the heap blocks in sequence, but make a conditional lock for
>>>> cleanup on each block. If we don't get it, sleep, then try again
>>>> when we wake up. If we fail the second time, just skip the block
>>>> completely.
>>
>> It would be cool if we could do something like sweep a range of pages,
>> initiate I/O for those that are not in shared buffers, and while that
>> is running, lock and clean up the ones that are in shared buffers,
>> skipping those that are not lockable right away; when that's done, go
>> back to the buffers that were fetched by the I/O and clean those up,
>> and retry the locking for those that couldn't be locked the first time
>> around, also conditionally. When that's all done, a third pass could
>> get the blocks that weren't cleaned up in any of the previous passes
>> (and this time the lock would not be conditional).
>
> Would that be substantially easier than just creating a bgreader?

I'm not sure about easier, but I'm not sure the bgreader can do the same
job either. ISTM that the bgreader would be mostly in charge of reading
in advance of backends, whereas what I'm proposing is mostly about
finding the best spot for taking the cleanup lock. It might turn out to
be more trouble than it's worth, for sure, and I'm not in a hurry to
implement it in any case.

I'm also not so sure about skipping the vacuuming of a block entirely
just because it's not lockable.

--
Alvaro Herrera              http://www.amazon.com/gp/registry/CTMLCN8V17R4
"To create is as difficult as to be free" (Elsa Triolet)
Alvaro Herrera <alvherre@commandprompt.com> wrote:

> > I think we might need additional "freezing-xmax" operations to avoid
> > XID wraparound in the first pass of vacuum, though it would rarely be
> > needed.
>
> I'm not sure I follow. Can you elaborate? Do you mean storing a
> separate relfrozenxmax for each table, or something like that?

We need to work around wraparound of xmax in dead tuples. If we fail to
vacuum them and the XID counter wraps around, we cannot remove them until
the next wraparound, because we would treat them as deleted in the
*future*.

> > We just add the XID of the vacuum to dead tuples we see in the first
> > pass. When a backend finds a dead tuple and sees that the transaction
> > identified by that XID has committed, it can freely reuse the area of
> > the dead tuple, because we can assume the index entries pointing to
> > the tuple have been removed by the vacuum.
>
> I would be worried about leftover index entries being later used by new
> tuples in the heap. Then when you visit the index and find that entry,
> you go to the heap, find the new tuple and return it, which could be
> bogus.

To avoid dangling index entries, I'm thinking of reusing dead tuples only
once we see that the VACUUM transaction has committed successfully. That
means the VACUUM transaction has removed all index entries corresponding
to those dead tuples; they are then effectively Heap-Only Tuples, so we
can recycle them in the same manner as HOT-updated tuples.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
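To make that reuse rule concrete, the test a backend would have to apply
could look roughly like this; the stored vacuum_xid field and the
recycle_dead_tuple() helper are purely hypothetical, while
TransactionIdIsValid and TransactionIdDidCommit are existing primitives:

    /*
     * Hedged sketch of the proposed reuse test.  vacuum_xid is the
     * hypothetical XID stamped into the dead tuple by the vacuum's first
     * pass; recycle_dead_tuple() is likewise an invented helper.
     */
    if (TransactionIdIsValid(vacuum_xid) &&
        TransactionIdDidCommit(vacuum_xid))
    {
        /*
         * The vacuum that stamped this tuple committed, so its index
         * entries are gone; the space can be reclaimed in the same way
         * as a dead HOT tuple.
         */
        recycle_dead_tuple(relation, buffer, offnum);
    }
    else
    {
        /* Vacuum still in progress (or aborted): leave the tuple alone. */
    }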
ITAGAKI Takahiro wrote:
> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
> > > I think we might need additional "freezing-xmax" operations to avoid
> > > XID wraparound in the first pass of vacuum, though it would rarely
> > > be needed.
> >
> > I'm not sure I follow. Can you elaborate? Do you mean storing a
> > separate relfrozenxmax for each table, or something like that?
>
> We need to work around wraparound of xmax in dead tuples. If we fail to
> vacuum them and the XID counter wraps around, we cannot remove them
> until the next wraparound, because we would treat them as deleted in
> the *future*.

Oh, but this should not be a problem, because a tuple is either frozen
or removed completely -- xmax cannot precede xmin.

> > > We just add the XID of the vacuum to dead tuples we see in the first
> > > pass. When a backend finds a dead tuple and sees that the transaction
> > > identified by that XID has committed, it can freely reuse the area of
> > > the dead tuple, because we can assume the index entries pointing to
> > > the tuple have been removed by the vacuum.
> >
> > I would be worried about leftover index entries being later used by new
> > tuples in the heap. Then when you visit the index and find that entry,
> > you go to the heap, find the new tuple and return it, which could be
> > bogus.
>
> To avoid dangling index entries, I'm thinking of reusing dead tuples
> only once we see that the VACUUM transaction has committed successfully.
> That means the VACUUM transaction has removed all index entries
> corresponding to those dead tuples; they are then effectively Heap-Only
> Tuples, so we can recycle them in the same manner as HOT-updated tuples.

Hmm. OK, I admit I have no idea how HOT works.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jul 30, 2007, at 8:00 PM, Alvaro Herrera wrote:
> ITAGAKI Takahiro wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>>>> I think we might need additional "freezing-xmax" operations to avoid
>>>> XID wraparound in the first pass of vacuum, though it would rarely
>>>> be needed.
>>>
>>> I'm not sure I follow. Can you elaborate? Do you mean storing a
>>> separate relfrozenxmax for each table, or something like that?
>>
>> We need to work around wraparound of xmax in dead tuples. If we fail
>> to vacuum them and the XID counter wraps around, we cannot remove them
>> until the next wraparound, because we would treat them as deleted in
>> the *future*.
>
> Oh, but this should not be a problem, because a tuple is either frozen
> or removed completely -- xmax cannot precede xmin.

What if it's frozen, then deleted, and then we wrap on xmax? Wouldn't
that make the tuple re-appear?
--
Decibel!, aka Jim Nasby                               decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Decibel! wrote:
> On Jul 30, 2007, at 8:00 PM, Alvaro Herrera wrote:
>> ITAGAKI Takahiro wrote:
>>> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>>>>> I think we might need additional "freezing-xmax" operations to
>>>>> avoid XID wraparound in the first pass of vacuum, though it would
>>>>> rarely be needed.
>>>>
>>>> I'm not sure I follow. Can you elaborate? Do you mean storing a
>>>> separate relfrozenxmax for each table, or something like that?
>>>
>>> We need to work around wraparound of xmax in dead tuples. If we fail
>>> to vacuum them and the XID counter wraps around, we cannot remove
>>> them until the next wraparound, because we would treat them as
>>> deleted in the *future*.
>>
>> Oh, but this should not be a problem, because a tuple is either frozen
>> or removed completely -- xmax cannot precede xmin.
>
> What if it's frozen, then deleted, and then we wrap on xmax? Wouldn't
> that make the tuple re-appear?

That cannot happen, because the next vacuum will remove the tuple if the
Xmax is committed. If the deleting transaction aborts, then vacuum will
set Xmax to Invalid (see heap_freeze_tuple in heapam.c).

One potential problem you would see is if the deleting transaction marks
the tuple deleted and then does not commit for 2 billion transactions, so
that vacuum is unable to remove it because it shows up as
delete-in-progress. However, there are plenty of other problems you would
hit in that case (autovacuum starting to misbehave being the first you
would probably notice).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
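For reference, the xmax handling being pointed at is, in simplified and
paraphrased form, roughly the fragment below (modeled on
heap_freeze_tuple in heapam.c, not copied verbatim; tuple here is a
HeapTupleHeader, and cutoff_xid is the freeze limit passed in by vacuum):

    /*
     * Simplified paraphrase of the xmax branch of heap_freeze_tuple().
     * If xmax is a normal XID older than the freeze cutoff, the tuple
     * can only still be here because that deleter aborted (or merely
     * locked the tuple); a committed deleter that old would have let
     * vacuum remove the tuple already.  So xmax is reset to Invalid.
     */
    xid = HeapTupleHeaderGetXmax(tuple);
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, cutoff_xid))
    {
        HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);

        /* Make sure the tuple is no longer marked as deleted by anyone. */
        tuple->t_infomask &= ~HEAP_XMAX_COMMITTED;
        tuple->t_infomask |= HEAP_XMAX_INVALID;

        changed = true;
    }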
On Jul 30, 2007, at 1:47 PM, Alvaro Herrera wrote:
> Jim Nasby wrote:
>> On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
>>> It would be cool if we could do something like sweep a range of
>>> pages, initiate I/O for those that are not in shared buffers, and
>>> while that is running, lock and clean up the ones that are in shared
>>> buffers, skipping those that are not lockable right away; when that's
>>> done, go back to the buffers that were fetched by the I/O and clean
>>> those up, and retry
>>
>> Would that be substantially easier than just creating a bgreader?
>
> I'm not sure about easier, but I'm not sure the bgreader can do the
> same job either. ISTM that the bgreader would be mostly in charge of
> reading in advance of backends, whereas what I'm proposing is mostly
> about finding the best spot for taking the cleanup lock. It might turn
> out to be more trouble than it's worth, for sure, and I'm not in a
> hurry to implement it in any case.

I was referring specifically to the "read in what's not already in
shared buffers" part of that message... that seems to be something best
suited for a bgreader.
--
Decibel!, aka Jim Nasby                               decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)