Thread: Quick idea for reducing VACUUM contention

Quick idea for reducing VACUUM contention

From
"Simon Riggs"
Date:
Just wanted to record a quick idea in case it's useful in the future.

VACUUM reads all blocks in sequence and waits on each one to acquire a
cleanup lock.

If VACUUM is running with vacuum_delay enabled, we might take a
slightly different approach:

Read the heap blocks in sequence, but make a conditional lock for
cleanup on each block. If we don't get it, sleep, then try again when we
wake up. If we fail the second time, just skip the block completely.

As long as we skip no more than 1% of the blocks we should be able to do
a very good job of cleanup, yet with reduced block contention as the
VACUUM proceeds.
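
A minimal standalone sketch of that loop, just to make the policy concrete
(pthread_mutex_trylock stands in for a conditional buffer cleanup lock, and
cleanup_block() is a made-up placeholder; none of this is backend code):

/*
 * Scan policy sketch: try a conditional cleanup lock, sleep once and
 * retry, then give up and skip the block.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NBLOCKS 1000
#define VACUUM_DELAY_USEC 20000               /* stand-in for the vacuum delay */

static pthread_mutex_t block_lock[NBLOCKS];   /* stand-in for cleanup locks */

static void cleanup_block(int blkno) { (void) blkno; /* prune dead tuples here */ }

int main(void)
{
    int skipped = 0;

    for (int i = 0; i < NBLOCKS; i++)
        pthread_mutex_init(&block_lock[i], NULL);

    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        bool got_lock = (pthread_mutex_trylock(&block_lock[blkno]) == 0);

        if (!got_lock)
        {
            /* Block is busy: sleep, then try exactly once more. */
            usleep(VACUUM_DELAY_USEC);
            got_lock = (pthread_mutex_trylock(&block_lock[blkno]) == 0);
        }

        if (got_lock)
        {
            cleanup_block(blkno);
            pthread_mutex_unlock(&block_lock[blkno]);
        }
        else
            skipped++;                        /* give up on this block */
    }

    printf("skipped %d of %d blocks\n", skipped, NBLOCKS);
    return 0;
}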

-- 
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com



Re: Quick idea for reducing VACUUM contention

From
ITAGAKI Takahiro
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote:

> Read the heap blocks in sequence, but make a conditional lock for
> cleanup on each block. If we don't get it, sleep, then try again when we
> wake up. If we fail the second time, just skip the block completely.

When we allow some skips in removing dead tuples, can we guarantee
pg_class.relfrozenxid? I think we might need additional "freezing-xmax"
operations to avoid XID-wraparound in the first pass of vacuum, though
it hardly occurs.


It might be a future topic ... if we go in the direction of
"optimistic sweeping", is it possible to remove the second pass of vacuum
completely? We just add the XID of the vacuum to dead tuples we see in the
first pass. When backends find a dead tuple and see that the transaction
identified by that XID has committed, they can freely reuse the area of
the dead tuple, because we can assume the index entries pointing to the
tuple have already been removed by the vacuum. We would use the
infrastructure introduced by HOT for this purpose.
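
In very rough, illustrative C, the backend-side check might look like the
following (the DeadTupleStamp struct, the vacuum_xid field and both helpers
are invented for this sketch; nothing here is an existing structure or API):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical per-tuple stamp left by the first pass of vacuum. */
typedef struct DeadTupleStamp
{
    TransactionId vacuum_xid;   /* XID of the vacuum that found the tuple dead */
} DeadTupleStamp;

/* Stand-in for a clog lookup of the stamped XID. */
static bool xid_did_commit(TransactionId xid)
{
    return xid != 0;            /* pretend every stamped vacuum committed */
}

/*
 * A backend may reuse the dead tuple's space only once the stamping
 * vacuum has committed, i.e. its index entries are known to be gone.
 */
static bool can_reuse_dead_tuple(const DeadTupleStamp *stamp)
{
    return stamp->vacuum_xid != 0 && xid_did_commit(stamp->vacuum_xid);
}

int main(void)
{
    DeadTupleStamp stamp = { .vacuum_xid = 1234 };
    return can_reuse_dead_tuple(&stamp) ? 0 : 1;
}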

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Quick idea for reducing VACUUM contention

From
Alvaro Herrera
Date:
ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
> 
> > Read the heap blocks in sequence, but make a conditional lock for
> > cleanup on each block. If we don't get it, sleep, then try again when we
> > wake up. If we fail the second time, just skip the block completely.

It would be cool if we could do something like sweep a range of pages,
initiate IO for those that are not in shared buffers, and while that is
running, lock and clean up the ones that are in shared buffers, skipping
those that are not lockable right away; when that's done, go back to
those buffers that were gotten from I/O and clean those up.  And retry
the locking for those that couldn't be locked the first time around,
also conditionally.  And when that's all done, a third pass could get
those blocks that weren't cleaned up in any of the previous passes (and
this time the lock would not be conditional).

Then do a vacuum_delay sleep.
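
A standalone sketch of that scheme for one batch of pages (in_cache,
start_read, try_cleanup and cleanup are made-up stand-ins for buffer-manager
and pruning calls, and the batch size is arbitrary; this only shows the
intended control flow, not real backend code):

#include <stdbool.h>
#include <unistd.h>

#define RANGE 32                                    /* pages swept per batch */

static bool in_cache(int blk)    { return (blk % 3) != 0; }   /* fake cache state */
static void start_read(int blk)  { (void) blk; /* initiate I/O */ }
static bool try_cleanup(int blk) { return (blk % 5) != 0; }   /* conditional lock */
static void cleanup(int blk)     { (void) blk; /* wait for lock, then prune */ }

static void sweep_range(int first)
{
    bool deferred[RANGE] = {false};

    /* Kick off reads for pages that are not in shared buffers. */
    for (int i = 0; i < RANGE; i++)
        if (!in_cache(first + i))
            start_read(first + i);

    /* Pass 1: while those reads run, clean cached pages, conditional lock only. */
    for (int i = 0; i < RANGE; i++)
        if (in_cache(first + i) && !try_cleanup(first + i))
            deferred[i] = true;

    /* Pass 2: pages that arrived from I/O, plus one conditional retry. */
    for (int i = 0; i < RANGE; i++)
        if (!in_cache(first + i) || deferred[i])
            deferred[i] = !try_cleanup(first + i);

    /* Pass 3: whatever is left, this time waiting for the lock. */
    for (int i = 0; i < RANGE; i++)
        if (deferred[i])
            cleanup(first + i);

    usleep(20000);                                  /* vacuum_delay-style sleep */
}

int main(void)
{
    for (int first = 0; first < 1024; first += RANGE)
        sweep_range(first);
    return 0;
}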

> When we allow some skips in removing dead tuples, can we guarantee
> pg_class.relfrozenxid?

No we can't.

> I think we might need additional "freezing-xmax" operations to avoid
> XID-wraparound in the first pass of vacuum, though it hardly occurs.

I'm not sure I follow.  Can you elaborate?  Do you mean storing a
separate relfrozenxmax for each table or something like that?

> It might be a future topic ... if we go in the direction of
> "optimistic sweeping", is it possible to remove the second pass of vacuum
> completely? We just add the XID of the vacuum to dead tuples we see in the
> first pass. When backends find a dead tuple and see that the transaction
> identified by that XID has committed, they can freely reuse the area of
> the dead tuple, because we can assume the index entries pointing to the
> tuple have already been removed by the vacuum.

I would be worried about leftover index entries being later used by new
tuples in the heap.  Then when you visit the index, find that entry, go
to the heap and find the new tuple and return it, which could be bogus.
(Unless, I think, you check in the index when you are going to insert
the new index tuple -- if the CTID is already used, reuse that entry or
remove it before insertion).

I don't know.  Maybe it's OK but it seems messy even if it is.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Quick idea for reducing VACUUM contention

From
Jim Nasby
Date:
On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
> ITAGAKI Takahiro wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>>
>>> Read the heap blocks in sequence, but make a conditional lock for
>>> cleanup on each block. If we don't get it, sleep, then try again when we
>>> wake up. If we fail the second time, just skip the block completely.
>
> It would be cool if we could do something like sweep a range of pages,
> initiate IO for those that are not in shared buffers, and while that is
> running, lock and clean up the ones that are in shared buffers, skipping
> those that are not lockable right away; when that's done, go back to
> those buffers that were gotten from I/O and clean those up.  And retry
> the locking for those that couldn't be locked the first time around,
> also conditionally.  And when that's all done, a third pass could get
> those blocks that weren't cleaned up in any of the previous passes (and
> this time the lock would not be conditional).

Would that be substantially easier than just creating a bgreader?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Quick idea for reducing VACUUM contention

From
Alvaro Herrera
Date:
Jim Nasby wrote:
> On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
>> ITAGAKI Takahiro wrote:
>>> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>>>
>>>> Read the heap blocks in sequence, but make a conditional lock for
>>>> cleanup on each block. If we don't get it, sleep, then try again when we
>>>> wake up. If we fail the second time, just skip the block completely.
>>
>> It would be cool if we could do something like sweep a range of pages,
>> initiate IO for those that are not in shared buffers, and while that is
>> running, lock and clean up the ones that are in shared buffers, skipping
>> those that are not lockable right away; when that's done, go back to
>> those buffers that were gotten from I/O and clean those up.  And retry
>> the locking for those that couldn't be locked the first time around,
>> also conditionally.  And when that's all done, a third pass could get
>> those blocks that weren't cleaned up in any of the previous passes (and
>> this time the lock would not be conditional).
>
> Would that be substantially easier than just creating a bgreader?

I'm not sure about easier, but I'm not sure that the bgreader can do the
same job.  ISTM that the bgreader would be mostly in charge of reading
in advance of backends, whereas what I'm proposing is mostly about
finding the best spot for locking.  It might turn out to be more trouble
than it's worth though, for sure.  And in any case I'm not in a hurry to
implement it.

That said, I'm not so sure about skipping the vacuuming of a block just
because it's not immediately lockable.

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Crear es tan difícil como ser libre" (Elsa Triolet)


Re: Quick idea for reducing VACUUM contention

From
ITAGAKI Takahiro
Date:
Alvaro Herrera <alvherre@commandprompt.com> wrote:

> > I think we might need additional "freezing-xmax" operations to avoid
> > XID-wraparound in the first pass of vacuum, though it hardly occurs.
> 
> I'm not sure I follow.  Can you elaborate?  Do you mean storing a
> separate relfrozenxmax for each table or something like that?

We need to work around wraparound of xmax in dead tuples. If we fail to
vacuum them and the XID counter wraps around, we cannot remove them until
the next wraparound, because we treat them as deleted in the *future*.
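
To make that concrete, here is a standalone illustration (xid_precedes
mirrors the modulo-2^32 comparison the backend's TransactionIdPrecedes
applies to normal XIDs; the actual numbers are made up):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 comparison of two normal (non-permanent) XIDs */
static int xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

int main(void)
{
    TransactionId dead_xmax = 1000000;            /* tuple deleted long ago */
    TransactionId now = 1000000 + 100;            /* shortly afterwards */
    TransactionId later = 1000000 + 2200000000u;  /* more than 2^31 XIDs later */

    printf("soon after delete: xmax in past? %d\n",
           xid_precedes(dead_xmax, now));         /* 1: removable */
    printf("after wraparound:  xmax in past? %d\n",
           xid_precedes(dead_xmax, later));       /* 0: looks like the future */
    return 0;
}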


> > We just add the XID of the vacuum to dead tuples we see in the
> > first pass. When backends find a dead tuple and see that the transaction
> > identified by that XID has committed, they can freely reuse the area of
> > the dead tuple, because we can assume the index entries pointing to the
> > tuple have already been removed by the vacuum.
> 
> I would be worried about leftover index entries being later used by new
> tuples in the heap.  Then when you visit the index, find that entry, go
> to the heap and find the new tuple and return it, which could be bogus.

To avoid dangling index entries, I'm thinking about reusing dead tuples
only if we see that the VACUUM transaction has committed successfully.
That means the VACUUM transaction removed all index entries corresponding
to those dead tuples; they are now heap-only tuples, so we can recycle
them in the same manner as HOT-updated tuples.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Quick idea for reducing VACUUM contention

From
Alvaro Herrera
Date:
ITAGAKI Takahiro wrote:
> 
> Alvaro Herrera <alvherre@commandprompt.com> wrote:
> 
> > > I think we might need additional "freezing-xmax" operations to avoid
> > > XID-wraparound in the first pass of vacuum, though it hardly occurs.
> > 
> > I'm not sure I follow.  Can you elaborate?  Do you mean storing a
> > separate relfrozenxmax for each table or something like that?
> 
> We need to work around wraparound of xmax in dead tuples. If we fail to
> vacuum them and the XID counter wraps around, we cannot remove them until
> the next wraparound, because we treat them as deleted in the *future*.

Oh, but this should not be a problem, because a tuple is either frozen
or removed completely -- xmax cannot precede xmin.


> > > We just add the XID of the vacuum to dead tuples we see in the
> > > first pass. When backends find a dead tuple and see that the transaction
> > > identified by that XID has committed, they can freely reuse the area of
> > > the dead tuple, because we can assume the index entries pointing to the
> > > tuple have already been removed by the vacuum.
> > 
> > I would be worried about leftover index entries being later used by new
> > tuples in the heap.  Then when you visit the index, find that entry, go
> > to the heap and find the new tuple and return it, which could be bogus.
> 
> To avoid dangling index entries, I'm thinking about reusing dead tuples
> only if we see that the VACUUM transaction has committed successfully.
> That means the VACUUM transaction removed all index entries corresponding
> to those dead tuples; they are now heap-only tuples, so we can recycle
> them in the same manner as HOT-updated tuples.

Hmm.  OK, I admit I have no idea how HOT works.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Quick idea for reducing VACUUM contention

From
Decibel!
Date:
On Jul 30, 2007, at 8:00 PM, Alvaro Herrera wrote:
> ITAGAKI Takahiro wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>>>> I think we might need additional "freezing-xmax" operations to avoid
>>>> XID-wraparound in the first pass of vacuum, though it hardly occurs.
>>>
>>> I'm not sure I follow.  Can you elaborate?  Do you mean storing a
>>> separate relfrozenxmax for each table or something like that?
>>
>> We need to work around wraparound of xmax in dead tuples. If we fail to
>> vacuum them and the XID counter wraps around, we cannot remove them until
>> the next wraparound, because we treat them as deleted in the *future*.
>
> Oh, but this should not be a problem, because a tuple is either frozen
> or removed completely -- xmax cannot precede xmin.

What if it's frozen, then deleted, and then we wrap on xmax? Wouldn't  
that make the tuple re-appear?
-- 
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Quick idea for reducing VACUUM contention

From
Alvaro Herrera
Date:
Decibel! wrote:
> On Jul 30, 2007, at 8:00 PM, Alvaro Herrera wrote:
>> ITAGAKI Takahiro wrote:
>>> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>>>>> I think we might need additional "freezing-xmax" operations to avoid
>>>>> XID-wraparound in the first pass of vacuum, though it hardly occurs.
>>>>
>>>> I'm not sure I follow.  Can you elaborate?  Do you mean storing a
>>>> separate relfrozenxmax for each table or something like that?
>>>
>>> We need to work around wraparound of xmax in dead tuples. If we fail to
>>> vacuum them and the XID counter wraps around, we cannot remove them until
>>> the next wraparound, because we treat them as deleted in the *future*.
>>
>> Oh, but this should not be a problem, because a tuple is either frozen
>> or removed completely -- xmax cannot precede xmin.
>
> What if it's frozen, then deleted, and then we wrap on xmax? Wouldn't that 
> make the tuple re-appear?

That cannot happen, because the next vacuum will remove the tuple if the
Xmax is committed.  If the deleting transaction aborts, then vacuum will
set Xmax to Invalid (see heap_freeze_tuple in heapam.c).
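
As a simplified, standalone paraphrase of that logic (FakeTuple,
freeze_or_remove and the helpers are invented here purely for illustration;
the real code is heap_freeze_tuple and the vacuum scan itself):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct FakeTuple
{
    TransactionId xmax;          /* deleting XID, 0 if none */
    bool          xmax_committed;
} FakeTuple;

/* stand-in for the modulo-2^32 comparison used for normal XIDs */
static bool precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Returns true if the tuple can be removed; otherwise clears an aborted xmax. */
static bool freeze_or_remove(FakeTuple *tup, TransactionId cutoff)
{
    if (tup->xmax != InvalidTransactionId && precedes(tup->xmax, cutoff))
    {
        if (tup->xmax_committed)
            return true;                     /* dead: vacuum removes it */
        tup->xmax = InvalidTransactionId;    /* aborted delete: forget it */
    }
    return false;                            /* tuple stays */
}

int main(void)
{
    FakeTuple committed = { .xmax = 500, .xmax_committed = true };
    FakeTuple aborted   = { .xmax = 500, .xmax_committed = false };
    TransactionId cutoff = 10000;

    printf("committed delete: removable = %d\n",
           freeze_or_remove(&committed, cutoff));

    bool removable = freeze_or_remove(&aborted, cutoff);
    printf("aborted delete:   removable = %d, xmax reset to %u\n",
           removable, aborted.xmax);
    return 0;
}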

One potential problem you would see is if the deleting transaction marks
the tuple deleted and then doesn't commit for 2 billion transactions, so
vacuum is not able to remove it because it shows up as delete-in-progress.
However, there are plenty of other problems you would hit in that case
(autovacuum starting to misbehave being the first you would probably
notice).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Quick idea for reducing VACUUM contention

From
Decibel!
Date:
On Jul 30, 2007, at 1:47 PM, Alvaro Herrera wrote:
> Jim Nasby wrote:
>> On Jul 27, 2007, at 1:49 AM, Alvaro Herrera wrote:
>>> ITAGAKI Takahiro wrote:
>>> It would be cool if we could do something like sweep a range of pages,
>>> initiate IO for those that are not in shared buffers, and while that is
>>> running, lock and clean up the ones that are in shared buffers, skipping
>>> those that are not lockable right away; when that's done, go back to
>>> those buffers that were gotten from I/O and clean those up.  And retry
>>
>> Would that be substantially easier than just creating a bgreader?
>
> I'm not sure about easier, but I'm not sure that the bgreader can do the
> same job.  ISTM that the bgreader would be mostly in charge of reading
> in advance of backends, whereas what I'm proposing is mostly about
> finding the best spot for locking.  It might turn out to be more trouble
> than it's worth though, for sure.  And in any case I'm not in a hurry to
> implement it.

I was referring specifically to the "read in what's not already in  
shared buffers" part of Itagaki-san's message... that seems to be  
something best suited for a bgreader.
-- 
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)