autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
This problem has been discussed before. Those familiar with the subject
please skip the next paragraph.

When autovacuum finds a substantial amount of empty pages at the end of
a relation, it attempts to truncate it in lazy_truncate_heap(). Because
all the scanning had been done in parallel to normal DB activity, it
needs to verify that all those blocks are still empty. To do that
autovacuum grabs an AccessExclusiveLock on the relation, then scans
backwards to the last non-empty page. If any other backend needs to
access that table during this time, it will kill the autovacuum from
within the deadlock detection code, which by default kicks in after a
1000ms timeout. The autovacuum launcher will start another vacuum after
(by default) 60 seconds, which will most likely get killed again, and
again, and again. The net result is that the table is never truncated,
and every 60 seconds there is a 1-second hiccup before the autovacuum is
killed.

Proposal:

Add functions to lmgr that are derived from the lock release code, but
instead of releasing the lock and waking up waiters, just return a
boolean telling if there are any waiters that would be woken up if this
lock was released.
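
For readers not familiar with the lmgr internals, here is a simplified
sketch of what such a function could look like in lock.c. This is an
illustration from memory, not the patch itself; local lock table and
fast-path handling are elided, and names may differ from the real code:

    /* Simplified sketch -- not the actual patch code. */
    bool
    LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode)
    {
        LockMethod  lockMethodTable = LockMethods[locktag->locktag_lockmethodid];
        uint32      hashcode = LockTagHashCode(locktag);
        LWLockId    partitionLock = LockHashPartitionLock(hashcode);
        LOCK       *lock;
        bool        hasWaiters = false;

        LWLockAcquire(partitionLock, LW_SHARED);
        lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
                                                    (void *) locktag,
                                                    hashcode,
                                                    HASH_FIND, NULL);
        /* is any waiter queued in a mode that conflicts with the mode we hold? */
        if (lock != NULL &&
            (lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
            hasWaiters = true;
        LWLockRelease(partitionLock);

        return hasWaiters;
    }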

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c
to periodically check whether there is a conflicting lock request waiting.
If not, keep going. If there is a waiter, truncate the relation to the
point checked thus far, release the AccessExclusiveLock, then loop back
to where we acquire this lock in the first place and continue
checking/truncating.
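
Roughly, the intended control flow in lazy_truncate_heap() then becomes
something like the outline below. This is illustrative only; the variable
names (old_rel_pages, lock_waiter_detected, etc.) are placeholders, not
necessarily what the patch uses:

    /* Outline only -- not the actual patch code. */
    do
    {
        /* try to (re)acquire the exclusive lock; give up for now if we can't */
        if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
            return;

        /*
         * Re-check the relation size under the lock, then scan backwards over
         * the tail pages.  count_nondeletable_pages() periodically asks lmgr
         * whether a conflicting lock request is queued and bails out early if
         * so, reporting how far it got.
         */
        new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);

        if (new_rel_pages < old_rel_pages)
            RelationTruncate(onerel, new_rel_pages);

        UnlockRelation(onerel, AccessExclusiveLock);

        /* if we stopped because of a waiter, loop and continue where we left off */
    } while (lock_waiter_detected && new_rel_pages > vacrelstats->nonempty_pages);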

I have a working patch here:

https://github.com/wieck/postgres/tree/autovacuum-truncate-lock

This patch does introduce three new postgresql.conf parameters, which I
would be happy to get rid of if we could derive them from something
else. Something based on the deadlock timeout may be possible.

     autovacuum_truncate_lock_check = 100ms # how frequently to check
                                            # for conflicting locks
     autovacuum_truncate_lock_retry = 50    # how often to try acquiring
                                            # the exclusive lock
     autovacuum_truncate_lock_wait = 20ms   # nap in between attempts

With these settings, I see the truncate of a bloated table progressing
at a rate of 3 minutes per GB, while that table is accessed 20 times per
second.

The original "kill autovacuum" mechanism in the deadlock code is still
there. All this code really does is 10 lmgr lookups per second and
releasing the AccessExclusiveLock if there are any waiters. I don't
think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual
problem of a bloated table with significant empty pages at the end.


Comments?


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment

Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
Here is the patch for it.


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment

Re: autovacuum truncate exclusive lock round two

From
Stephen Frost
Date:
Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:
> This problem has been discussed before. Those familiar with the
> subject please skip the next paragraph.

Apologies if this was already thought-of and ruled out for some reason,
but...

> Because all the scanning had been done in parallel to normal DB
> activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed
since the last scan..?  Does that scan update the FSM, which would then
be updated by another backend in the event that it decided to write
something there?  Or do we consider the FSM to be completely
untrustworthy wrt this (and if so, I don't suppose there's any hope to
using the visibility map...)?

The notion of having to double-scan and the AccessExclusiveLock on the
relation are telling me this work-around, while completely possible,
isn't exactly ideal...

Perhaps another option would be a page-level lock, or something which is
larger than per-row (which strikes me as a lot of overhead for this, and
it's not clear how we'd do it) but smaller than an entire relation,
though there are certainly pain points there too.
Thanks,
    Stephen

Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
Stephen,

On 10/24/2012 10:46 PM, Stephen Frost wrote:
> Jan,
>
> * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> This problem has been discussed before. Those familiar with the
>> subject please skip the next paragraph.
>
> Apologies if this was already thought-of and ruled out for some reason,
> but...
>
>> Because all the scanning had been done in parallel to normal DB
>> activity, it needs to verify that all those blocks are still empty.
>
> Would it be possible to use the FSM to figure out if things have changed
> since the last scan..?  Does that scan update the FSM, which would then
> be updated by another backend in the event that it decided to write
> something there?  Or do we consider the FSM to be completely
> untrustworthy wrt this (and if so, I don't suppose there's any hope to
> using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to 
throwing away heap pages. Can we?

>
> The notion of having to double-scan and the AccessExclusiveLock on the
> relation are telling me this work-around, while completely possible,
> isn't exactly ideal...

Under normal circumstances, with just a few pages to trim off the end,
this is no problem. Those pages were the last pages just scanned by this
very autovacuum, so they are found in the shared buffers anyway. All the
second scan does in that case is fetch the pages once more from shared
buffers to be 100% sure that we are not truncating off new tuples. We
definitely need the AccessExclusiveLock to prevent someone from
extending the relation at the end between our check of the relation size
and the truncate. Fetching 50 empty blocks from the buffer cache while
at it isn't that big of a deal, and that is what it normally looks like.

The problem case this patch is dealing with is rolling-window tables
that have experienced some bloat. The typical example is a log table
that has new data constantly added and the oldest data constantly purged
out. This data normally rotates through some blocks like a rolling
window. If for some reason (purging turned off, for example) this table
bloats by several GB and later shrinks back to its normal content, soon
all the used blocks are at the beginning of the heap and we find tens of
thousands of empty pages at the end. Only then does the second scan take
more than 1000ms, and autovacuum is at risk of getting killed while at it.

Since we have experienced this problem several times now on our 
production systems, something clearly needs to be done. But IMHO it 
doesn't happen often enough to take any risk here.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 10/24/2012 10:46 PM, Stephen Frost wrote:
>> Would it be possible to use the FSM to figure out if things have changed
>> since the last scan..?  Does that scan update the FSM, which would then
>> be updated by another backend in the event that it decided to write
>> something there?  Or do we consider the FSM to be completely
>> untrustworthy wrt this (and if so, I don't suppose there's any hope to
>> using the visibility map...)?

> I honestly don't know if we can trust the FSM enough when it comes to 
> throwing away heap pages. Can we?

No.  Backends are under no obligation to update FSM for each individual
tuple insertion, and typically don't do so.

More to the point, you have to take AccessExclusiveLock *anyway*,
because this is interlocking not only against new insertions but plain
read-only seqscans: if a seqscan falls off the end of the table it will
be very unhappy.  So I don't see where we'd buy anything by consulting
the FSM.
        regards, tom lane



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 10/25/2012 9:45 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 10/24/2012 10:46 PM, Stephen Frost wrote:
>>> Would it be possible to use the FSM to figure out if things have changed
>>> since the last scan..?  Does that scan update the FSM, which would then
>>> be updated by another backend in the event that it decided to write
>>> something there?  Or do we consider the FSM to be completely
>>> untrustworthy wrt this (and if so, I don't suppose there's any hope to
>>> using the visibility map...)?
>
>> I honestly don't know if we can trust the FSM enough when it comes to
>> throwing away heap pages. Can we?
>
> No.  Backends are under no obligation to update FSM for each individual
> tuple insertion, and typically don't do so.
>
> More to the point, you have to take AccessExclusiveLock *anyway*,
> because this is interlocking not only against new insertions but plain
> read-only seqscans: if a seqscan falls off the end of the table it will
> be very unhappy.  So I don't see where we'd buy anything by consulting
> the FSM.

Thank you.

One thing that I haven't mentioned yet is that with this patch, we could 
actually insert a vacuum_delay_point() into the loop in 
count_nondeletable_pages(). We no longer cling to the exclusive lock but 
rather get out of the way as soon as somebody needs the table. Under 
this condition we no longer need to do the second scan full bore.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Stephen Frost
Date:
Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:
> The problem case this patch is dealing with is rolling window tables
> that experienced some bloat. The typical example is a log table,
> that has new data constantly added and the oldest data constantly
> purged out. This data normally rotates through some blocks like a
> rolling window. If for some reason (purging turned off for example)
> this table bloats by several GB and later shrinks back to its normal
> content, soon all the used blocks are at the beginning of the heap
> and we find tens of thousands of empty pages at the end. Only now
> does the second scan take more than 1000ms and autovacuum is at risk
> to get killed while at it.

My concern is that this could certainly also happen to a heavily updated
table in an OLTP type of environment where the requirement to take a
heavy lock to clean it up might prevent it from ever happening..  I was
simply hoping we could find a mechanism to lock just those pages we're
getting ready to nuke rather than the entire relation.  Perhaps we can
consider how to make those changes alongside of changes to eliminate or
reduce the extent locking that has been painful (for me at least) when
doing massive parallel loads into a table.

> Since we have experienced this problem several times now on our
> production systems, something clearly needs to be done. But IMHO it
> doesn't happen often enough to take any risk here.

I'm not advocating a 'do-nothing' approach, was just looking for another
option that might allow for this work to happen on the heap in parallel
with regular access.  Since we haven't got any way to do that currently,
+1 for moving forward with this as it clearly improves the current
situation.
Thanks,
    Stephen

Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 10/25/2012 10:12 AM, Stephen Frost wrote:
> Jan,
>
> * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> The problem case this patch is dealing with is rolling window tables
>> that experienced some bloat. The typical example is a log table,
>> that has new data constantly added and the oldest data constantly
>> purged out. This data normally rotates through some blocks like a
>> rolling window. If for some reason (purging turned off for example)
>> this table bloats by several GB and later shrinks back to its normal
>> content, soon all the used blocks are at the beginning of the heap
>> and we find tens of thousands of empty pages at the end. Only now
>> does the second scan take more than 1000ms and autovacuum is at risk
>> to get killed while at it.
>
> My concern is that this could certainly also happen to a heavily updated
> table in an OLTP type of environment where the requirement to take a
> heavy lock to clean it up might prevent it from ever happening..  I was
> simply hoping we could find a mechanism to lock just those pages we're
> getting ready to nuke rather than the entire relation.  Perhaps we can
> consider how to make those changes alongside of changes to eliminate or
> reduce the extent locking that has been painful (for me at least) when
> doing massive parallel loads into a table.

I've been testing this with a load of 20 writes/s to that bloated table.
Preventing not only the cleanup but the following ANALYZE as well is
precisely what happens. There may be multiple ways to get into this
situation, but once you're there the symptoms are the same. Vacuum fails
to truncate the table and causes a 1-second hiccup every minute, while
vacuum holds the exclusive lock until the deadlock detection code of
another transaction kills it.

My patch doesn't change the logic by which we ensure that we don't zap
any data by accident with the truncate, and Tom's comments suggest we
should stick to it. It only makes autovacuum check frequently whether
the AccessExclusiveLock is actually blocking anyone, and if so get out
of the way.

I would rather like to discuss ideas for how to do all this without 3
new GUCs.

In the original code, the maximum delay that autovacuum can cause by 
holding the exclusive lock is one deadlock_timeout (default 1s). It 
would appear reasonable to me to use max(deadlock_timeout/10,10ms) as 
the interval to check for a conflicting lock request. For another 
transaction that needs to access the table this is 10 times faster than 
it is now and still guarantees that autovacuum will make some progress 
with the truncate.

The other two GUCs control how often and how fast autovacuum tries to
acquire the exclusive lock in the first place. Since we actively release
the lock *because someone needs it*, it is pretty much guaranteed that
the immediately following lock attempt fails. We do a
ConditionalLockRelation() on purpose because there is a chance of
deadlocking. The current code only makes one lock attempt and gives up
immediately. I don't know what to derive a good value for how long to
retry from, but the nap time in between tries could be a hardcoded 20ms,
or we could use the cost-based vacuum nap time (which defaults to 20ms).
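
As a sketch of how those knobs could be derived rather than configured
(the names below, including max_attempts, are illustrative placeholders;
this is not the patch itself):

    /* Illustrative derivation of the intervals discussed above. */
    int     check_interval_ms = Max(DeadlockTimeout / 10, 10);   /* waiter checks */
    int     retry_nap_ms = (VacuumCostDelay > 0) ? VacuumCostDelay : 20;
    int     attempt;

    /* retry loop for (re)acquiring the exclusive lock */
    for (attempt = 0; attempt < max_attempts; attempt++)   /* max_attempts: placeholder */
    {
        if (ConditionalLockRelation(onerel, AccessExclusiveLock))
            break;                  /* got it, continue truncating */
        pg_usleep(retry_nap_ms * 1000L);
    }
    if (attempt >= max_attempts)
        return;                     /* still busy; leave it for the next autovacuum run */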

Any other ideas are welcome.


Thanks,
Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Alvaro Herrera
Date:
Jan Wieck wrote:

> In the original code, the maximum delay that autovacuum can cause by
> holding the exclusive lock is one deadlock_timeout (default 1s). It
> would appear reasonable to me to use max(deadlock_timeout/10,10ms)
> as the interval to check for a conflicting lock request. For another
> transaction that needs to access the table this is 10 times faster
> than it is now and still guarantees that autovacuum will make some
> progress with the truncate.

So you would be calling GetCurrentTimestamp() continuously?  Since you
mentioned adding a vacuum delay point I wonder if it would make sense to
test for lockers each time it would consider going to sleep, instead.
(One hazard to keep in mind is the case where no vacuum delay is
configured.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 10/25/2012 12:24 PM, Alvaro Herrera wrote:
> Jan Wieck wrote:
>
>> In the original code, the maximum delay that autovacuum can cause by
>> holding the exclusive lock is one deadlock_timeout (default 1s). It
>> would appear reasonable to me to use max(deadlock_timeout/10,10ms)
>> as the interval to check for a conflicting lock request. For another
>> transaction that needs to access the table this is 10 times faster
>> than it is now and still guarantees that autovacuum will make some
>> progress with the truncate.
>
> So you would be calling GetCurrentTimestamp() continuously?  Since you
> mentioned adding a vacuum delay point I wonder if it would make sense to
> test for lockers each time it would consider going to sleep, instead.
> (One hazard to keep in mind is the case where no vacuum delay is
> configured.)

Depends on your definition of "continuously". If doing one 
INSTR_TIME_SET_CURRENT(), which on Unix boils down to a gettimeofday(), 
every 32 ReadBufferExtended() calls counts as continuously, then yes.

Adding a vacuum_delay_point() is something we should consider. However, 
the vacuum_delay_point() call simply naps when enough cost has been 
racked up. You don't know if the next call will nap or not. We would 
have to extend that functionality with some vacuum_delay_would_nap() 
call to do what you suggest.
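
For illustration, such a vacuum_delay_would_nap() helper could simply
test the same condition that vacuum_delay_point() tests before sleeping.
This is a hypothetical sketch, not existing code:

    /* Hypothetical helper -- sketch only, mirrors the test in vacuum_delay_point(). */
    static bool
    vacuum_delay_would_nap(void)
    {
        return VacuumCostActive &&
               !InterruptPending &&
               VacuumCostBalance >= VacuumCostLimit;
    }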


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Amit Kapila
Date:
On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
> On 10/25/2012 10:12 AM, Stephen Frost wrote:
> > Jan,
> >
> > * Jan Wieck (JanWieck@Yahoo.com) wrote:
> >> The problem case this patch is dealing with is rolling window tables
> >> that experienced some bloat. The typical example is a log table,
> >> that has new data constantly added and the oldest data constantly
> >> purged out. This data normally rotates through some blocks like a
> >> rolling window. If for some reason (purging turned off for example)
> >> this table bloats by several GB and later shrinks back to its normal
> >> content, soon all the used blocks are at the beginning of the heap
> >> and we find tens of thousands of empty pages at the end. Only now
> >> does the second scan take more than 1000ms and autovacuum is at risk
> >> to get killed while at it.
> >
> > My concern is that this could certainly also happen to a heavily
> updated
> > table in an OLTP type of environment where the requirement to take a
> > heavy lock to clean it up might prevent it from ever happening..  I
> was
> > simply hoping we could find a mechanism to lock just those pages we're
> > getting ready to nuke rather than the entire relation.  Perhaps we can
> > consider how to make those changes alongside of changes to eliminate
> or
> > reduce the extent locking that has been painful (for me at least) when
> > doing massive parallel loads into a table.
> 
> I've been testing this with loads of 20 writes/s to that bloated table.
> Preventing not only the clean up, but the following ANALYZE as well is
> precisely what happens. There may be multiple ways how to get into this
> situation, but once you're there the symptoms are the same. Vacuum fails
> to truncate it and causing a 1 second hiccup every minute, while vacuum
> is holding the exclusive lock until the deadlock detection code of
> another transaction kills it.
> 
> My patch doesn't change the logic how we ensure that we don't zap any
> data by accident with the truncate and Tom's comments suggest we should
> stick to it. It only makes autovacuum check frequently if the
> AccessExclusiveLock is actually blocking anyone and then get out of the
> way.
> 
> I would rather like to discuss any ideas how to do all this without 3
> new GUCs.
> 
> In the original code, the maximum delay that autovacuum can cause by
> holding the exclusive lock is one deadlock_timeout (default 1s). It
> would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
> the interval to check for a conflicting lock request. For another
> transaction that needs to access the table this is 10 times faster than
> it is now and still guarantees that autovacuum will make some progress
> with the truncate.
 One other way could be to check for a conflicting lock request after
every few pages.

> The other two GUCs control how often and how fast autovacuum tries to
> acquire the exclusive lock in the first place. Since we actively release
> the lock *because someone needs it* it is pretty much guaranteed that
> the immediate next lock attempt fails. We on purpose do a
> ConditionalLockRelation() because there is a chance to deadlock. The
> current code only tries one lock attempt and gives up immediately. I
> don't know from what to derive a good value for how long to retry, 
 Can't we do something like: after the nap, check for the conditional
lock, and if we didn't get it, then take the lock unconditionally? The
reason it might be okay to take the lock unconditionally after one try
with your implementation is that it will release the lock anyway after
every few pages or after a small amount of time if there is any waiter.

> but
> the nap time in between tries could be a hardcoded 20ms or using the
> cost based vacuum nap time (which defaults to 20ms).

I think using the cost-based vacuum nap time or a default value is good.

Adding new parameters has user/administrator overhead; it is always
better if the value can be decided intelligently by the database itself.
However, if you feel these are parameters that can vary based on
different kinds of usage, then I think it is better to expose them to
users through configuration parameters.

With Regards,
Amit Kapila.




Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 10/26/2012 1:29 AM, Amit Kapila wrote:
> On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
>> On 10/25/2012 10:12 AM, Stephen Frost wrote:
>> > Jan,
>> >
>> > * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> >> The problem case this patch is dealing with is rolling window tables
>> >> that experienced some bloat. The typical example is a log table,
>> >> that has new data constantly added and the oldest data constantly
>> >> purged out. This data normally rotates through some blocks like a
>> >> rolling window. If for some reason (purging turned off for example)
>> >> this table bloats by several GB and later shrinks back to its normal
>> >> content, soon all the used blocks are at the beginning of the heap
>> >> and we find tens of thousands of empty pages at the end. Only now
>> >> does the second scan take more than 1000ms and autovacuum is at risk
>> >> to get killed while at it.
>> >
>> > My concern is that this could certainly also happen to a heavily
>> updated
>> > table in an OLTP type of environment where the requirement to take a
>> > heavy lock to clean it up might prevent it from ever happening..  I
>> was
>> > simply hoping we could find a mechanism to lock just those pages we're
>> > getting ready to nuke rather than the entire relation.  Perhaps we can
>> > consider how to make those changes alongside of changes to eliminate
>> or
>> > reduce the extent locking that has been painful (for me at least) when
>> > doing massive parallel loads into a table.
>>
>> I've been testing this with loads of 20 writes/s to that bloated table.
>> Preventing not only the clean up, but the following ANALYZE as well is
>> precisely what happens. There may be multiple ways how to get into this
>> situation, but once you're there the symptoms are the same. Vacuum fails
>> to truncate it and causing a 1 second hiccup every minute, while vacuum
>> is holding the exclusive lock until the deadlock detection code of
>> another transaction kills it.
>>
>> My patch doesn't change the logic how we ensure that we don't zap any
>> data by accident with the truncate and Tom's comments suggest we should
>> stick to it. It only makes autovacuum check frequently if the
>> AccessExclusiveLock is actually blocking anyone and then get out of the
>> way.
>>
>> I would rather like to discuss any ideas how to do all this without 3
>> new GUCs.
>>
>> In the original code, the maximum delay that autovacuum can cause by
>> holding the exclusive lock is one deadlock_timeout (default 1s). It
>> would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
>> the interval to check for a conflicting lock request. For another
>> transaction that needs to access the table this is 10 times faster than
>> it is now and still guarantees that autovacuum will make some progress
>> with the truncate.
>
>    One other way could be to check after every few pages for a conflicting
> lock request.

How is this any different from what my patch does? Did you even look at 
the code?

>
>> The other two GUCs control how often and how fast autovacuum tries to
>> acquire the exclusive lock in the first place. Since we actively release
>> the lock *because someone needs it* it is pretty much guaranteed that
>> the immediate next lock attempt fails. We on purpose do a
>> ConditionalLockRelation() because there is a chance to deadlock. The
>> current code only tries one lock attempt and gives up immediately. I
>> don't know from what to derive a good value for how long to retry,
>
>    Can't we do something like, after nap check for conditional lock and if it
> didn't get    
>    then get lock unconditionally.

No, we cannot. This is also well documented in the code.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Amit Kapila
Date:
On Friday, October 26, 2012 10:59 AM Amit Kapila wrote:
> On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
> > On 10/25/2012 10:12 AM, Stephen Frost wrote:
> > > Jan,
> > >
> > > * Jan Wieck (JanWieck@Yahoo.com) wrote:
> > >> The problem case this patch is dealing with is rolling window
> tables
> > >> that experienced some bloat. The typical example is a log table,
> > >> that has new data constantly added and the oldest data constantly
> > >> purged out. This data normally rotates through some blocks like a
> > >> rolling window. If for some reason (purging turned off for example)
> > >> this table bloats by several GB and later shrinks back to its
> normal
> > >> content, soon all the used blocks are at the beginning of the heap
> > >> and we find tens of thousands of empty pages at the end. Only now
> > >> does the second scan take more than 1000ms and autovacuum is at
> risk
> > >> to get killed while at it.
> > >
> > > My concern is that this could certainly also happen to a heavily
> > updated
> > > table in an OLTP type of environment where the requirement to take a
> > > heavy lock to clean it up might prevent it from ever happening..  I
> > was
> > > simply hoping we could find a mechanism to lock just those pages
> we're
> > > getting ready to nuke rather than the entire relation.  Perhaps we
> can
> > > consider how to make those changes alongside of changes to eliminate
> > or
> > > reduce the extent locking that has been painful (for me at least)
> when
> > > doing massive parallel loads into a table.
> >
> > I've been testing this with loads of 20 writes/s to that bloated
> table.
> > Preventing not only the clean up, but the following ANALYZE as well is
> > precisely what happens. There may be multiple ways how to get into
> this
> > situation, but once you're there the symptoms are the same. Vacuum
> fails
> > to truncate it and causing a 1 second hiccup every minute, while
> vacuum
> > is holding the exclusive lock until the deadlock detection code of
> > another transaction kills it.
> >
> > My patch doesn't change the logic how we ensure that we don't zap any
> > data by accident with the truncate and Tom's comments suggest we
> should
> > stick to it. It only makes autovacuum check frequently if the
> > AccessExclusiveLock is actually blocking anyone and then get out of
> the
> > way.
> >
> > I would rather like to discuss any ideas how to do all this without 3
> > new GUCs.
> >
> > In the original code, the maximum delay that autovacuum can cause by
> > holding the exclusive lock is one deadlock_timeout (default 1s). It
> > would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
> > the interval to check for a conflicting lock request. For another
> > transaction that needs to access the table this is 10 times faster
> than
> > it is now and still guarantees that autovacuum will make some progress
> > with the truncate.
> 
>   One other way could be to check after every few pages for a
> conflicting
> lock request.
> 
> > The other two GUCs control how often and how fast autovacuum tries to
> > acquire the exclusive lock in the first place. Since we actively
> release
> > the lock *because someone needs it* it is pretty much guaranteed that
> > the immediate next lock attempt fails. We on purpose do a
> > ConditionalLockRelation() because there is a chance to deadlock. The
> > current code only tries one lock attempt and gives up immediately. I
> > don't know from what to derive a good value for how long to retry,
> 
>   Can't we do something like, after nap check for conditional lock and
> if it
> didn't get
>   then get lock unconditionally.
>   The reason why after your implementation it might be okay to have lock
> unconditionally after one try is that
>   anyway after every few pages or after small time, it will release the
> lock
> if there is any waiter.

I am sorry; at this point, trying to take an unconditional X lock can
lead to deadlock, so the above is not possible.

> > but
> > the nap time in between tries could be a hardcoded 20ms or using the
> > cost based vacuum nap time (which defaults to 20ms).
> 
> I think using cost based vacuum nap time or default value is good.
> 
> Adding new parameters might have user/administrator overhead, it is
> always
> better if it can be intelligently decided by database itself.
> However if you feel these are parameters which can vary based on
> different
> kind of usage, then I think it is better to expose it through
> configuration
> parameters to users.
> 
> With Regards,
> Amit Kapila.
> 
> 
> 




Re: autovacuum truncate exclusive lock round two

From
Amit Kapila
Date:
On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
> On 10/26/2012 1:29 AM, Amit Kapila wrote:
> > On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
> >> On 10/25/2012 10:12 AM, Stephen Frost wrote:
> >> > Jan,
> >> >
> >> > * Jan Wieck (JanWieck@Yahoo.com) wrote:
> >> >> The problem case this patch is dealing with is rolling window
> tables
> >> >> that experienced some bloat. The typical example is a log table,
> >> >> that has new data constantly added and the oldest data constantly
> >> >> purged out. This data normally rotates through some blocks like a
> >> >> rolling window. If for some reason (purging turned off for
> example)
> >> >> this table bloats by several GB and later shrinks back to its
> normal
> >> >> content, soon all the used blocks are at the beginning of the heap
> >> >> and we find tens of thousands of empty pages at the end. Only now
> >> >> does the second scan take more than 1000ms and autovacuum is at
> risk
> >> >> to get killed while at it.
> >> >
> >> > My concern is that this could certainly also happen to a heavily
> >> updated
> >> > table in an OLTP type of environment where the requirement to take
> a
> >> > heavy lock to clean it up might prevent it from ever happening..  I
> >> was
> >> > simply hoping we could find a mechanism to lock just those pages
> we're
> >> > getting ready to nuke rather than the entire relation.  Perhaps we
> can
> >> > consider how to make those changes alongside of changes to
> eliminate
> >> or
> >> > reduce the extent locking that has been painful (for me at least)
> when
> >> > doing massive parallel loads into a table.
> >>
> >> I've been testing this with loads of 20 writes/s to that bloated
> table.
> >> Preventing not only the clean up, but the following ANALYZE as well
> is
> >> precisely what happens. There may be multiple ways how to get into
> this
> >> situation, but once you're there the symptoms are the same. Vacuum
> fails
> >> to truncate it and causing a 1 second hiccup every minute, while
> vacuum
> >> is holding the exclusive lock until the deadlock detection code of
> >> another transaction kills it.
> >>
> >> My patch doesn't change the logic how we ensure that we don't zap any
> >> data by accident with the truncate and Tom's comments suggest we
> should
> >> stick to it. It only makes autovacuum check frequently if the
> >> AccessExclusiveLock is actually blocking anyone and then get out of
> the
> >> way.
> >>
> >> I would rather like to discuss any ideas how to do all this without 3
> >> new GUCs.
> >>
> >> In the original code, the maximum delay that autovacuum can cause by
> >> holding the exclusive lock is one deadlock_timeout (default 1s). It
> >> would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
> >> the interval to check for a conflicting lock request. For another
> >> transaction that needs to access the table this is 10 times faster
> than
> >> it is now and still guarantees that autovacuum will make some
> progress
> >> with the truncate.
> >
> >    One other way could be to check after every few pages for a
> conflicting
> > lock request.
> 
> How is this any different from what my patch does?

The difference is that in the patch it checks for waiters by using 2
parameters, autovacuum_truncate_lock_check and blkno%32, and what I had
mentioned was to check only based on blkno. Will it affect too much if
we directly check for waiters after every 32 (any feasible number)
blocks?

> Did you even look at the code?

I hadn't looked at the code when I replied to your previous mail, but
now I have checked it.

With Regards,
Amit Kapila. 




Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 10/26/2012 6:35 AM, Amit Kapila wrote:
> On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
>> On 10/26/2012 1:29 AM, Amit Kapila wrote:
>> >    One other way could be to check after every few pages for a
>> conflicting
>> > lock request.
>>
>> How is this any different from what my patch does?
>    The difference is that in the patch it checks for waiters by using 2
> parameters autovacuum_truncate_lock_check and blkno%32 and what I
>    had mentioned was to check only based on blkno.
>    Will it effect too much if we directly check for waiters after every 32
> (any feasible number) blocks?

The blkno%32 is there to avoid doing the gettimeofday() call too often.
But relying on the blkno alone is IMHO not a good idea. It would have to
be a number small enough that, even on a busy system and when the pages
have to be read from disk, vacuum checks and releases the lock quickly,
but large enough that it doesn't create a significant number of spinlock
calls in the lmgr. We would end up with another parameter, a number of
blocks, for which a good value is a lot harder to estimate.
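
In code, the pattern being described looks roughly like this fragment
inside the backward scan. It is only a sketch; the relation-level wrapper
name, starttime, and check_interval_ms are assumed for the illustration:

    /* Inside the backward scan of count_nondeletable_pages() -- sketch only. */
    if ((blkno % 32) == 0)          /* avoid a gettimeofday() per page */
    {
        instr_time  currenttime;
        instr_time  elapsed;

        INSTR_TIME_SET_CURRENT(currenttime);
        elapsed = currenttime;
        INSTR_TIME_SUBTRACT(elapsed, starttime);

        if (INSTR_TIME_GET_MILLISEC(elapsed) >= check_interval_ms)
        {
            if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
                return blkno;           /* stop here; caller truncates and releases */
            starttime = currenttime;    /* start a new check interval */
        }
    }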


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Robert Haas
Date:
On Wed, Oct 24, 2012 at 4:20 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> This patch does introduce three new postgresql.conf parameters, which I
> would be happy to get rid of if we could derive them from something else.
> Something based on the deadlock timeout may be possible.
>
>     autovacuum_truncate_lock_check = 100ms # how frequent to check
>                                            # for conflicting locks
>     autovacuum_truncate_lock_retry = 50    # how often to try acquiring
>                                            # the exclusive lock
>     autovacuum_truncate_lock_wait = 20ms   # nap in between attempts

+1 for this general approach.

As you suggested downthread, I think that hard-coding
autovacuum_truncate_lock_check to one-tenth of the deadlock timeout
should be just fine.  For the other two parameters, I doubt we need to
make them configurable at all.  It's not exactly clear what to set
them to, but it does seem clear that the down side of setting them
incorrectly isn't very much as long as the defaults are roughly sane.
Personally, I'd be inclined to retry less frequently but over a
slightly longer time period - say twenty retries, one after every
100ms.  But I wouldn't be upset if we settled on what you've got here,
either.  We just don't want to let the total time we spend waiting for
the lock get too long, because that means pinning down an auto-vacuum
worker that might be critically needed elsewhere.  So the product of
autovacuum_truncate_lock_retry and autovacuum_truncate_lock_wait
probably should not be more than a couple of seconds.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum truncate exclusive lock round two

From
Dimitri Fontaine
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
> periodically check, if there is a conflicting lock request waiting. If not,
> keep going. If there is a waiter, truncate the relation to the point checked
> thus far, release the AccessExclusiveLock, then loop back to where we
> acquire this lock in the first place and continue checking/truncating.

I think that maybe we could just bail out after releasing the
AccessExclusiveLock and trust autovacuum to get back to truncating that
relation later. That would allow removing 2 of the 3 GUCs below:

>     autovacuum_truncate_lock_check = 100ms # how frequent to check
>                                            # for conflicting locks

This is the one remaining. Could we maybe check for a lock conflict after
every page moved backward, or some multiple thereof? The goal would be
to ensure that progress is made, while also being aware of concurrent
activity, a la CHECK_FOR_INTERRUPTS().

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support




Re: autovacuum truncate exclusive lock round two

From
Alvaro Herrera
Date:
Dimitri Fontaine wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
> > periodically check, if there is a conflicting lock request waiting. If not,
> > keep going. If there is a waiter, truncate the relation to the point checked
> > thus far, release the AccessExclusiveLock, then loop back to where we
> > acquire this lock in the first place and continue checking/truncating.
>
> I think that maybe we could just bail out after releasing the
> AccessExclusiveLock and trust autovacuum to get back to truncating that
> relation later.

That doesn't work, because the truncating code is not reached unless
vacuuming actually took place.  So if you interrupt it, it will just not
get called again later.  Maybe we could have autovacuum somehow invoke
that separately, but that would require keeping track somewhere of the
fact that truncation was aborted.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum truncate exclusive lock round two

From
Amit Kapila
Date:
On Friday, November 16, 2012 4:09 AM Alvaro Herrera wrote:
> Dimitri Fontaine wrote:
> > Jan Wieck <JanWieck@Yahoo.com> writes:
> > > Use this lmgr feature inside count_nondeletable_pages() of
> vacuumlazy.c to
> > > periodically check, if there is a conflicting lock request waiting.
> If not,
> > > keep going. If there is a waiter, truncate the relation to the point
> checked
> > > thus far, release the AccessExclusiveLock, then loop back to where
> we
> > > acquire this lock in the first place and continue
> checking/truncating.
> >
> > I think that maybe we could just bail out after releasing the
> > AccessExclusiveLock and trust autovacuum to get back to truncating
> that
> > relation later.
> 
> That doesn't work, because the truncating code is not reached unless
> vacuuming actually took place.  So if you interrupt it, it will just not
> get called again later.  Maybe we could have autovacuum somehow invoke
> that separately, but that would require that the fact that truncation
> was aborted is kept track of somewhere.

Won't it have a chance to be handled the next time vacuum triggers due
to updates/deletes on some other pages?

OTOH, maybe the same thing happens again next time and it is still not
able to complete the truncate. So I think it's better to complete it the
first time, but maybe using some heuristic wait-and-retry time rather
than configuration variables.

With Regards,
Amit Kapila





Re: autovacuum truncate exclusive lock round two

From
Alvaro Herrera
Date:
Jan,

Are you posting an updated patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Alvaro Herrera wrote:

> Are you posting an updated patch?

Well, there wasn't exactly a consensus on what should change, so I'll
throw some initial review comments out even though I still need to
check some things in the code and do more testing.

The patch applied cleanly, compiled without warnings, and passed all
tests in `make check-world`.

The previous discussion seemed to me to achieve consensus on the need
for the feature, but a general dislike for adding so many new GUC
settings to configure it. I concur on that. I agree with the feelings
of others that just using deadlock_timeout / 10 (probably clamped to
a minimum of 10ms) would be good instead of
autovacuum_truncate_lock_check. I agree that the values of the other
two settings probably aren't too critical as long as they are fairly
reasonable, which I would define as being 20ms to 100ms with
retries lasting no more than 2 seconds.  I'm inclined to suggest a
total time of deadlock_timeout * 2 clamped to 2 seconds, but maybe
that is over-engineering it.

There was also a mention of possibly inserting a vacuum_delay_point()
call outside of the AccessExclusiveLock. I don't feel strongly about
it, but if the page scanning cost can be tracked easily, I guess it
is better to do that.

Other than simplifying the code based on eliminating the new GUCs,
the coding issues I found were:

TRUE and FALSE were used as literals.  Since true and false are used
about 10 times as often and current code in the modified files is
mixed, I would recommend the lowercase form. We decided it wasn't
worth the trouble to convert the minority usage over, but I don't see
any reason to add new instances of the minority case.

In LockHasWaiters() the partition lock is acquired using
LW_EXCLUSIVE. I don't see why that can't be LW_SHARED. Was there a
reason for this, or was it just not changed after copy/paste?

I still need to review the timing calls, since I'm not familiar with
them so it wasn't immediately obvious to me whether they were being
used correctly. I have no reason to believe that they aren't, but
feel I should check.

Also, I want to do another pass looking just for off-by-one errors on
blkno. Again, I have no reason to believe that there is a problem; it
just seems like it would be a particularly easy type of mistake to
make and miss when a key structure has this field:
 BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */

And I want to test more.

But if a new version could be created based on the above, that would
be great. Chances seem relatively slim at this point that there will
be much else to change, if anything.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Kevin Grittner wrote:

> I still need to review the timing calls, since I'm not familiar
> with them so it wasn't immediately obvious to me whether they
> were being used correctly. I have no reason to believe that they
> aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is
done with INSTR_TIME_SET_ZERO() of the target followed by
INSTR_TIME_ADD() with the target and the source. It seems to me
that simple assignment would be more readable, and I can't see any
down side.

Why shouldn't:
    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

instead be?:
   starttime = currenttime;

Resetting starttime this way seems especially odd.

> Also, I want to do another pass looking just for off-by-one
> errors on blkno. Again, I have no reason to believe that there is
> a problem; it just seems like it would be a particularly easy
> type of mistake to make and miss when a key structure has this
> field:
> 
>   BlockNumber nonempty_pages;
>      /* actually, last nonempty page + 1 */

No problems found with that.
> And I want to test more.

The patch worked as advertised in all my tests, but I became
uncomfortable with the games being played with the last autovacuum
timestamp and the count of dead tuples. Sure, that causes
autovacuum to kick back in until it has dealt with the truncation,
but it makes it hard for a human looking at the stat views to see
what's going on, and I'm not sure it wouldn't lead to bad plans due
to stale statistics.

Personally, I would rather see us add a boolean to indicate that
autovacuum was needed (regardless of the math on the other columns)
and use that to control the retries -- leaving the other columns
free to reflect reality.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 11/28/2012 3:33 PM, Kevin Grittner wrote:
> Kevin Grittner wrote:
>
>> I still need to review the timing calls, since I'm not familiar
>> with them so it wasn't immediately obvious to me whether they
>> were being used correctly. I have no reason to believe that they
>> aren't, but feel I should check.
>
> It seems odd to me that assignment of one instr_time to another is
> done with INSTR_TIME_SET_ZERO() of the target followed by
> INSTR_TIME_ADD() with the target and the source. It seems to me
> that simple assignment would be more readable, and I can't see any
> down side.
>
> Why shouldn't:
>
>      INSTR_TIME_SET_ZERO(elapsed);
>      INSTR_TIME_ADD(elapsed, currenttime);
>      INSTR_TIME_SUBTRACT(elapsed, starttime);
>
> instead be?:
>
>      elapsed = currenttime;
>      INSTR_TIME_SUBTRACT(elapsed, starttime);
>
> And why shouldn't:
>
>      INSTR_TIME_SET_ZERO(starttime);
>      INSTR_TIME_ADD(starttime, currenttime);
>
> instead be?:
>
>      starttime = currenttime;
>
> Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
    starttime = currenttime;

portable if those are structs?

>
>> Also, I want to do another pass looking just for off-by-one
>> errors on blkno. Again, I have no reason to believe that there is
>> a problem; it just seems like it would be a particularly easy
>> type of mistake to make and miss when a key structure has this
>> field:
>>
>>   BlockNumber nonempty_pages;
>>      /* actually, last nonempty page + 1 */
>
> No problems found with that.
>
>> And I want to test more.
>
> The patch worked as advertised in all my tests, but I became
> uncomforatable with the games being played with the last autovacuum
> timestamp and the count of dead tuples. Sure, that causes
> autovacuum to kick back in until it has dealt with the truncation,
> but it makes it hard for a human looking at the stat views to see
> what's going on, and I'm not sure it wouldn't lead to bad plans due
> to stale statistics.
>
> Personally, I would rather see us add a boolean to indicate that
> autovacuum was needed (regardless of the math on the other columns)
> and use that to control the retries -- leaving the other columns
> free to reflect reality.

You mean to extend the stats by yet another column? The thing is that 
this whole case happens rarely. We see this every other month or so and 
only on a rolling window table after it got severely bloated due to some 
abnormal use (purging disabled for some operational reason). The whole 
situation resolves itself after a few minutes to maybe an hour or two.

Personally I don't think living with a few wrong stats on one table for 
that time is so bad that it warrants changing that much more code.

Lower casing TRUE/FALSE will be done.

If the LW_SHARED is enough in LockHasWaiters(), that's fine too.

I think we have a consensus that the check interval should be derived 
from deadlock_timeout/10 clamped to 10ms. I'm going to remove that GUC.

About the other two, I'm just not sure. Maybe use half the lock waiter
check interval as the lock retry interval, again clamped to 10ms, and
simply leave one GUC that controls how long autovacuum should retry
this. I'm not too worried about the retry interval. However, the problem
with the overall retry duration is that this value very much depends on
the usage pattern of the database. If long-running transactions (like >5
seconds) are the norm, then a hard-coded value of 2 seconds before
autovacuum gives up isn't going to help much.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 11/28/2012 3:33 PM, Kevin Grittner wrote:
>> Resetting starttime this way seems especially odd.

> instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
>      starttime = currenttime;
> portable if those are structs?

Sure.  We rely on struct assignment in lots of places.
        regards, tom lane
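
(For reference, struct assignment copies all members by value. A minimal
standalone illustration outside PostgreSQL, using struct timeval directly:

    #include <stdio.h>
    #include <sys/time.h>

    int
    main(void)
    {
        struct timeval currenttime;
        struct timeval starttime;

        gettimeofday(&currenttime, NULL);

        /* plain struct assignment copies every member by value */
        starttime = currenttime;

        printf("%ld.%06ld\n", (long) starttime.tv_sec, (long) starttime.tv_usec);
        return 0;
    }
)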



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 11/29/2012 9:46 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 11/28/2012 3:33 PM, Kevin Grittner wrote:
>>> Resetting starttime this way seems especially odd.
>
>> instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
>>      starttime = currenttime;
>> portable if those are structs?
>
> Sure.  We rely on struct assignment in lots of places.

Will be done then.


Thanks,
Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
Attached is a new patch that addresses most of the points raised in
discussion before.

1) Most of the configuration variables are derived from deadlock_timeout
now. The "check for conflicting lock request" interval is
deadlock_timeout/10, clamped to a minimum of 10ms. The "try to acquire
exclusive lock" interval is deadlock_timeout/20, also clamped to a
minimum of 10ms. The only GUC variable remaining is
autovacuum_truncate_lock_try=2000ms, with a range from 0 (just try once)
to 20000ms.
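
Expressed as code, the derivation described above amounts to roughly the
following (variable names are illustrative):

    /* Sketch of the derived intervals (names illustrative). */
    check_interval_ms = Max(DeadlockTimeout / 10, 10);   /* conflicting-lock checks */
    retry_interval_ms = Max(DeadlockTimeout / 20, 10);   /* exclusive-lock attempts */
    /* autovacuum_truncate_lock_try: 0 (try once) .. 20000 ms, default 2000 ms */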

I'd like to point out that this is a significant change in functionality
as without the config option for the check interval, there is no longer
any possibility to disable the call to LockHasWaiters() and return to
the original (deadlock code kills autovacuum) behavior.

2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The
LW_EXCLUSIVE was indeed a copy/paste result.

3) The instr_time handling was simplified as suggested.

4) Lower case TRUE/FALSE.


I did not touch the part about suppressing the stats and the ANALYZE
step of "auto vacuum+analyze". The situation is no different today. When
the deadlock code kills autovacuum, stats aren't updated either. And
this patch is meant to cause autovacuum to finish the truncate in a few
minutes or hours, so that the situation fixes itself.


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment

Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Jan Wieck wrote:

> Attached is a new patch that addresses most of the points raised
> in discussion before.
> 
> 1) Most of the configuration variables are derived from
> deadlock_timeout now. The "check for conflicting lock request"
> interval is deadlock_timeout/10, clamped to 10ms. The "try to
> acquire exclusive lock" interval is deadlock_timeout/20, also
> clamped to 10ms. The only GUC variable remaining is
> autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
> once) to 20000ms.

If we're going to keep this GUC, we need docs for it.

> I'd like to point out that this is a significant change in
> functionality as without the config option for the check
> interval, there is no longer any possibility to disable the call
> to LockHasWaiters() and return to the original (deadlock code
> kills autovacuum) behavior.

Arguably we could simplify the deadlock resolution code a little,
but it seems like it is probably safer to leave it as a failsafe,
at least for now.

> I did not touch the part about suppressing the stats and the
> ANALYZE step of "auto vacuum+analyze". The situation is no
> different today. When the deadlock code kills autovacuum, stats
> aren't updated either. And this patch is meant to cause
> autovacuum to finish the truncate in a few minutes or hours, so
> that the situation fixes itself.

Unless someone wants to argue for more aggressive updating of the
stats, I guess I'll yield to that argument.

Besides the documentation, the only thing I could spot on this
go-around was that this comment probably no longer has a reason for
being since this is no longer conditional and what it's doing is
obvious from the code:
   /* Initialize the starttime if we check for conflicting lock requests */

With docs and dropping the obsolete comment, I think this will be
ready.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/3/2012 5:42 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>
>> Attached is a new patch that addresses most of the points raised
>> in discussion before.
>>
>> 1) Most of the configuration variables are derived from
>> deadlock_timeout now. The "check for conflicting lock request"
>> interval is deadlock_timeout/10, clamped to 10ms. The "try to
>> acquire exclusive lock" interval is deadlock_timeout/20, also
>> clamped to 10ms. The only GUC variable remaining is
>> autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
>> once) to 20000ms.
>
> If we're going to keep this GUC, we need docs for it.

Certainly. But since we're still debating which and how many GUC 
variables we want, I don't think doc-time has come yet.

>
>> I'd like to point out that this is a significant change in
>> functionality as without the config option for the check
>> interval, there is no longer any possibility to disable the call
>> to LockHasWaiters() and return to the original (deadlock code
>> kills autovacuum) behavior.
>
> Arguably we could simplify the deadlock resolution code a little,
> but it seems like it is probably safer to leave it as a failsafe,
> at least for now.

Thinking about it, I'm not really happy with removing the 
autovacuum_truncate_lock_check GUC at all.

Fact is that the deadlock detection code and the configuration parameter 
for it should IMHO have nothing to do with all this in the first place. 
A properly implemented application does not deadlock. Someone running 
such a properly implemented application should be able to safely set 
deadlock_timeout to minutes without the slightest ill side effect, but 
with the benefit that the deadlock detection code itself does not add to 
the lock contention. The only reason one cannot do so today is because 
autovacuum's truncate phase could then freeze the application with an 
exclusive lock for that long.

I believe the check interval needs to be decoupled from the 
deadlock_timeout again. This will leave us with 2 GUCs at least.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Jan Wieck wrote:

> Thinking about it, I'm not really happy with removing the 
> autovacuum_truncate_lock_check GUC at all.
> 
> Fact is that the deadlock detection code and the configuration
> parameter for it should IMHO have nothing to do with all this in
> the first place. A properly implemented application does not
> deadlock.

I don't agree. I believe that in some cases it is possible and
practicable to set access rules which would prevent deadlocks in
application access to a database. In other cases the convolutions
required in the code, the effort in educating dozens or hundreds of
programmers maintaining the code (and keeping the training current
during staff turnover), and the staff time required for compliance
far outweigh the benefit of an occasional transaction retry.
However, it is enough for your argument that there are cases where
it can be done.

> Someone running such a properly implemented application should be
> able to safely set deadlock_timeout to minutes without the
> slightest ill side effect, but with the benefit that the deadlock
> detection code itself does not add to the lock contention. The
> only reason one cannot do so today is because autovacuum's
> truncate phase could then freeze the application with an
> exclusive lock for that long.
> 
> I believe the check interval needs to be decoupled from the 
> deadlock_timeout again.

OK

> This will leave us with 2 GUCs at least.

Hmm. What problems do you see with hard-coding reasonable values?
Adding two or three GUC settings for a patch with so little
user-visible impact seems weird. And it seems to me (and also
seemed to Robert) as though the specific values of the other two
settings really aren't that critical as long as they are anywhere
within a reasonable range. Configuring PostgreSQL can be
intimidating enough without adding knobs that really don't do
anything useful. Can you show a case where special values would be
helpful?

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/4/2012 8:06 AM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> I believe the check interval needs to be decoupled from the
>> deadlock_timeout again.
>
> OK
>
>> This will leave us with 2 GUCs at least.
>
> Hmm. What problems do you see with hard-coding reasonable values?

The question is: what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases 
where truncating a table can hurt, we are dealing with many gigabytes. 
The original vacuumlazy scan of them can take hours if not days. During 
that scan the vacuum worker has probably spent many hours napping at the 
vacuum delay points. For me, a 50ms retry interval for up to 5 seconds 
would be reasonable for (re)acquiring that lock.

The reasoning is that we need some sort of retry mechanism: if 
autovacuum just gave up the exclusive lock because someone needed 
access, it is more or less guaranteed that an immediate attempt to 
reacquire it will fail until that waiter has committed. But if it 
can't get the lock after 5 seconds, the system is busy enough that 
autovacuum should come back much later, when the launcher kicks it off 
again.

I don't care much about occupying that autovacuum worker for a few 
seconds. It just spent hours vacuuming that very table. How much harm 
will a couple more seconds do?
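
To illustrate, the (re)acquire attempt I have in mind is nothing more 
than a polling loop along these lines (just a sketch, not the actual 
patch; ConditionalLockRelation() and pg_usleep() already exist in the 
backend, while the function name and the constants are simply the 
numbers from above):

    static bool
    reacquire_truncate_lock(Relation onerel)
    {
        int     attempts = 5000 / 50;   /* 5 seconds total, 50ms per nap */

        while (attempts-- > 0)
        {
            /* try to get the AccessExclusiveLock without blocking anyone */
            if (ConditionalLockRelation(onerel, AccessExclusiveLock))
                return true;            /* got it, continue truncating */
            pg_usleep(50 * 1000L);      /* nap 50ms before the next try */
        }
        return false;   /* system is busy, let the launcher retry later */
    }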

The check interval for the LockHasWaiters() call, however, depends very 
much on the response time constraints of the application. A 200ms 
interval, for example, would cause the truncate phase to hold onto the 
exclusive lock for at least 200ms at a time. That means that a steady 
stream of short-running transactions would see 100ms of "blocking" on 
average, 200ms max. For many applications that is probably OK. If your 
response time constraint is <=50ms on 98% of transactions, you might 
want to have that knob, though.

I admit I really have no idea what the most reasonable default for that 
value would be. Something between 50ms and deadlock_timeout/2 I guess.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Jan Wieck wrote:

> [arguments for GUCs]

This is getting confusing. I thought I had already conceded the
case for autovacuum_truncate_lock_try, and you appeared to spend
most of your post arguing for it anyway. I think. It's a little
hard to tell. Perhaps the best thing is to present the issue to the
list and solicit more opinions on what to do. Please correct me if
I mis-state any of this.

The primary problem this patch is solving is that in some
workloads, autovacuum will repeatedly try to truncate the unused
pages at the end of a table, but will continually get canceled
after burning resources because another process wants to acquire a
lock on the table which conflicts with the one held by autovacuum.
This is handled by the deadlock checker, so another process must
block for the deadlock_timeout interval each time. All work done by
the truncate phase of autovacuum is lost on each interrupted
attempt. Statistical information is not updated, so another attempt
will trigger the next time autovacuum looks at whether to vacuum
the table.

It's obvious that this pattern not only fails to release
potentially large amounts of unused space back to the OS, but the
headbanging can continue to consume significant resources for an
extended period, and the repeated blocking for deadlock_timeout
could cause latency problems.

The patch has the truncate work, which requires an
AccessExclusiveLock, check at intervals whether another process
is waiting on its lock. That interval is one of the timings we need
to determine, and for which a GUC was initially proposed. I think
that the check should be fast enough that doing it once every 20ms
as a hard-coded interval would be good enough. When it sees this
situation, it truncates the file for as far as it has managed to
get, releases its lock on the table, sleeps for an interval, and
then checks to see if the lock has become available again.
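
In rough pseudo-C, that hot path amounts to something like this (a
simplified sketch of the behavior just described, not the patch text;
LockHasWaiters() is the new lmgr helper under discussion, and its exact
signature and the surrounding variable names are my shorthand):

    /* roughly every 20ms during the safety scan under AccessExclusiveLock */
    if (LockHasWaiters(onerel, AccessExclusiveLock))
    {
        /* persist the progress made so far, then get out of the way */
        RelationTruncate(onerel, new_rel_pages);
        UnlockRelation(onerel, AccessExclusiveLock);

        /* sleep briefly, then try to reacquire the lock and resume */
    }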

How long it should sleep between tries to reacquire the lock is
another possible GUC. Again, I'm inclined to think that this could
be hard-coded. Since autovacuum was knocked off-task after doing
some significant work, I'm inclined to make this interval a little
bigger, but I don't think it matters a whole lot. Anything between
20ms and 100ms seems sane. Maybe 50ms?

At any point that it is unable to acquire the lock, there is a
check for how long this autovacuum task has been starved for the
lock. Initially I was arguing for twice the deadlock_timeout on the
basis that this would probably be short enough not to leave the
autovacuum worker sidelined for too long, but long enough for the
attempt to get past a single deadlock between two other processes.
This is the setting Jan is least willing to concede.

If the autovacuum worker does abandon the attempt, it will keep
retrying, since we go out of our way to prevent the autovacuum
process from updating the statistics based on the "incomplete"
processing. This last interval is not how long it will attempt to
truncate, but how long it will keep one autovacuum worker making
unsuccessful attempts to acquire the lock before it is put to other
uses. Workers will keep coming back to this table until the
truncate phase is completed, just as it does without the patch; the
difference being that anytime it gets the lock, even briefly, it is
able to persist some progress.

So the question on the table is which of these three intervals
should be GUCs, and what values to use if they aren't.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/4/2012 1:51 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>
>> [arguments for GUCs]
>
> This is getting confusing. I thought I had already conceded the
> case for autovacuum_truncate_lock_try, and you appeared to spend
> most of your post arguing for it anyway. I think. It's a little
> hard to tell. Perhaps the best thing is to present the issue to the
> list and solicit more opinions on what to do. Please correct me if
> I mis-state any of this.
>
> The primary problem this patch is solving is that in some
> workloads, autovacuum will repeatedly try to truncate the unused
> pages at the end of a table, but will continually get canceled
> after burning resources because another process wants to acquire a
> lock on the table which conflicts with the one held by autovacuum.
> This is handled by the deadlock checker, so another process must
> block for the deadlock_timeout interval each time. All work done by
> the truncate phase of autovacuum is lost on each interrupted
> attempt. Statistical information is not updated, so another attempt
> will trigger the next time autovacuum looks at whether to vacuum
> the table.
>
> It's obvious that this pattern not only fails to release
> potentially large amounts of unused space back to the OS, but the
> headbanging can continue to consume significant resources and for
> an extended period, and the repeated blocking for deadlock_timeout
> could cause latency problems.
>
> The patch has the truncate work, which requires
> AccessExclusiveLock, check at intervals for whether another process
> is waiting on its lock. That interval is one of the timings we need
> to determine, and for which a GUC was initially proposed. I think
> that the check should be fast enough that doing it once every 20ms
> as a hard-coded interval would be good enough. When it sees this
> situation, it truncates the file for as far as it has managed to
> get, releases its lock on the table, sleeps for an interval, and
> then checks to see if the lock has become available again.
>
> How long it should sleep between tries to reacquire the lock is
> another possible GUC. Again, I'm inclined to think that this could
> be hard-coded. Since autovacuum was knocked off-task after doing
> some significant work, I'm inclined to make this interval a little
> bigger, but I don't think it matters a whole lot. Anything between
> 20ms and 100ms seems sane. Maybe 50ms?
>
> At any point that it is unable to acquire the lock, there is a
> check for how long this autovacuum task has been starved for the
> lock. Initially I was arguing for twice the deadlock_timeout on the
> basis that this would probably be short enough not to leave the
> autovacuum worker sidelined for too long, but long enough for the
> attempt to get past a single deadlock between two other processes.
> This is the setting Jan is least willing to concede.
>
> If the autovacuum worker does abandon the attempt, it will keep
> retrying, since we go out of our way to prevent the autovacuum
> process from updating the statistics based on the "incomplete"
> processing. This last interval is not how long it will attempt to
> truncate, but how long it will keep one autovacuum worker making
> unsuccessful attempts to acquire the lock before it is put to other
> uses. Workers will keep coming back to this table until the
> truncate phase is completed, just as it does without the patch; the
> difference being that anytime it gets the lock, even briefly, it is
> able to persist some progress.

That is all correct.

>
> So the question on the table is which of these three intervals
> should be GUCs, and what values to use if they aren't.

I could live with all the above defaults, but would like to see more 
comments on them.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Robert Haas
Date:
On Tue, Dec 4, 2012 at 2:05 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> So the question on the table is which of these three intervals
>> should be GUCs, and what values to use if they aren't.
>
> I could live with all the above defaults, but would like to see more
> comments on them.

I largely agree with what's already been said.  The only interval that
seems to me to maybe need its own knob is the total time after which
the autovacuum worker will give up.  If we set it to 2 *
deadlock_timeout, some people might find that a reason to raise
deadlock_timeout.  Since people *already* raise deadlock_timeout to
obscenely high values (a minute?  an hour???) and then complain that
things blow up in their face, I think there's a decent argument to be
made that piggybacking anything else on that setting is unwise.

Against that, FWICT, this problem only affects a small number of
users: Jan is the only person I can ever remember reporting this
issue.  I'm not dumb enough to think he's the only person who it
affects; but my current belief is that it's not an enormously common
problem.  So the main argument I can see against adding a GUC is that
the problem is too marginal to justify a setting of its own.  What I
really see as the key issue is: suppose we hardcode this to say 2
seconds.  Is that going to fix the problem effectively for 99% of the
people who have this problem, or for 25% of the people who have this
problem?  In the former case, we probably don't need a GUC; in the
latter case, we probably do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Robert Haas wrote:

> Since people *already* raise deadlock_timeout to obscenely high
> values (a minute? an hour???) and then complain that things blow
> up in their face, I think there's a decent argument to be made
> that piggybacking anything else on that setting is unwise.

If people are really doing that, then I tend to agree. I wasn't
aware of that practice.

> Against that, FWICT, this problem only affects a small number of
> users: Jan is the only person I can ever remember reporting this
> issue. I'm not dumb enough to think he's the only person who it
> affects; but my current belief is that it's not an enormously
> common problem. So the main argument I can see against adding a
> GUC is that the problem is too marginal to justify a setting of
> its own. What I really see as the key issue is: suppose we
> hardcode this to say 2 seconds. Is that going to fix the problem
> effectively for 99% of the people who have this problem, or for
> 25% of the people who have this problem? In the former case, we
> probably don't need a GUC; in the latter case, we probably do.

Given the fact that autovacuum will keep throwing workers at it to
essentially loop indefinitely at an outer level, I don't think the
exact setting of this interval is all that critical either. My gut
feel is that anything in the 2 second to 5 second range would be
sane, so I won't argue over any explicit setting within that range.
Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that,
we could be causing some small, heavily-updated table to be
neglected by autovacuum -- especially if you get multiple
autovacuum workers tied up in this delay on different tables at the
same time.

Regarding how many people are affected, I have seen several reports
of situations where users claim massive impact on performance when
autovacuum kicks in. The reports have not included enough detail to
quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
deadlock timeout was set to more than a second.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Robert Haas
Date:
On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner <kgrittn@mail.com> wrote:
> Robert Haas wrote:
>> Since people *already* raise deadlock_timeout to obscenely high
>> values (a minute? an hour???) and then complain that things blow
>> up in their face, I think there's a decent argument to be made
>> that piggybacking anything else on that setting is unwise.
>
> If people are really doing that, then I tend to agree. I wasn't
> aware of that practice.

It's probably not quite common enough to be called a "practice", but I
have encountered it a number of times in support situations.  Alas, I
no longer remember the details of exactly what misery it caused, but I
do remember it wasn't good.  :-)

>> Against that, FWICT, this problem only affects a small number of
>> users: Jan is the only person I can ever remember reporting this
>> issue. I'm not dumb enough to think he's the only person who it
>> affects; but my current belief is that it's not an enormously
>> common problem. So the main argument I can see against adding a
>> GUC is that the problem is too marginal to justify a setting of
>> its own. What I really see as the key issue is: suppose we
>> hardcode this to say 2 seconds. Is that going to fix the problem
>> effectively for 99% of the people who have this problem, or for
>> 25% of the people who have this problem? In the former case, we
>> probably don't need a GUC; in the latter case, we probably do.
>
> Given the fact that autovacuum will keep throwing workers at it to
> essentially loop indefinitely at an outer level, I don't think the
> exact setting of this interval is all that critical either. My gut
> feel is that anything in the 2 second to 5 second range would be
> sane, so I won't argue over any explicit setting within that range.
> Below that, I think the overhead of autovacuum coming back to the
> table repeatedly would probably start to get too high; above that,
> we could be causing some small, heavily-updated table to be
> neglected by autovacuum -- especially if you get multiple
> autovacuum workers tied up in this delay on different tables at the
> same time.

I think that part of what's tricky here is that the dynamics of this
problem depend heavily on table size.  I handled one support case
where lowering autovacuum_naptime to 15s was an indispensable part of
the solution, so in that case having an autovacuum worker retry for
more than a few seconds sounds kind of insane.  OTOH, that case
involved a small, rapidly changing table.  If you've got an enormous
table where vacuum takes an hour to chug through all of it, abandoning
the effort to truncate the table after a handful of seconds might
sound equally insane.

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table.  Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up.  That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
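
In code form the rule would be trivial, something like this
(illustration only; none of these names exist anywhere):

    static long
    truncate_giveup_timeout_ms(long vacuum_elapsed_ms)
    {
        long    timeout = vacuum_elapsed_ms / 10;   /* 10% of vacuum time */

        if (timeout < 1000)
            timeout = 1000;         /* floor: 1 second */
        if (timeout > 60000)
            timeout = 60000;        /* ceiling: 1 minute */
        return timeout;
    }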

> Regarding how many people are affected, I have seen several reports
> of situations where users claim massive impact on performance when
> autovacuum kicks in. The reports have not included enough detail to
> quantify the impact or in most cases to establish a cause, but this
> seems like it could have a noticeable impact, especially if the
> deadlock timeout was set to more than a second.

Yeah, I agree this could be a cause of those types of reports, but I
don't have any concrete evidence that any of the cases I've worked
were actually due to this specific issue.  The most recent case of
this type I worked on was due to I/O saturation - which, since it
happened to be EC2, really meant network saturation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/5/2012 2:00 PM, Robert Haas wrote:
> Maybe it'd be sensible to relate the retry time to the time spent
> vacuuming the table.  Say, if the amount of time spent retrying
> exceeds 10% of the time spent vacuuming the table, with a minimum of
> 1s and a maximum of 1min, give up.  That way, big tables will get a
> little more leeway than small tables, which is probably appropriate.

That sort of "dynamic" approach would indeed be interesting. But I fear 
that it is going to be complex at best. The amount of time spent in 
scanning heavily depends on the visibility map. The initial vacuum scan 
of a table can take hours or more, but it does update the visibility map 
even if the vacuum itself is aborted later. The next vacuum may scan 
that table in almost no time at all, because it skips all blocks that 
are marked "all visible".

So the total time the "scan" takes is no yardstick I'd use.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Robert Haas
Date:
On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 12/5/2012 2:00 PM, Robert Haas wrote:
>>
>> Maybe it'd be sensible to relate the retry time to the time spent
>> vacuuming the table.  Say, if the amount of time spent retrying
>> exceeds 10% of the time spent vacuuming the table, with a minimum of
>> 1s and a maximum of 1min, give up.  That way, big tables will get a
>> little more leeway than small tables, which is probably appropriate.
>
> That sort of "dynamic" approach would indeed be interesting. But I fear that
> it is going to be complex at best. The amount of time spent in scanning
> heavily depends on the visibility map. The initial vacuum scan of a table
> can take hours or more, but it does update the visibility map even if the
> vacuum itself is aborted later. The next vacuum may scan that table in
> almost no time at all, because it skips all blocks that are marked "all
> visible".

Well, if that's true, then there's little reason to worry about giving
up quickly, because the next autovacuum a minute later won't consume
many resources.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
Kevin and Robert are well aware of most of the below. I just want to put 
this out here so other people, who haven't followed the discussion too 
closely, may chime in.

Some details on the problem:

First of all, there is a minimum of 1000 pages that the vacuum scan 
must detect as possibly empty at the end of a relation. Without at 
least 8MB of possible free space at the end, the code never calls 
lazy_truncate_heap(). This means we don't have to worry about tiny 
relations at all. Any relation that stays under 8MB of turnover between 
autovacuum VACUUM runs can never run into this at all.
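
Put differently, the gate in front of the truncate phase boils down to 
something like this (a simplified paraphrase of that condition, not the 
exact source; variable names abbreviated):

    /* only attempt the truncate with >= 1000 possibly-empty pages
     * (8MB at the default 8kB block size) at the end of the relation */
    BlockNumber possibly_freeable = rel_pages - nonempty_pages;

    if (possibly_freeable >= 1000)
        lazy_truncate_heap(onerel, vacrelstats);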

Relations that have higher turnover than that, but in random places or 
with a high percentage of rather static rows, don't fall into the 
problem category either. They may never accumulate that much "contiguous 
free space at the end"; the turnover will be reusing free space all over 
the place. So again, lazy_truncate_heap() will never be called.

Relations that eventually build up more than 8MB of free space at the 
end aren't automatically a problem. The autovacuum VACUUM scan just 
scanned those pages at the end, which means that the safety scan for 
truncate, done under exclusive lock, is checking exactly those pages at 
the end and most likely they are still in memory. The truncate safety 
scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case (that I have found so far) is rolling 
window tables of significant size, which can bloat to multiple times 
their normal size every now and then. This is indeed a rare corner case 
and I have no idea how many users may (unknowingly) be suffering from it.

This rare corner case triggers lazy_truncate_heap() with a significant 
amount of free space to truncate. The table bloats, then all the bloat 
is deleted, and the periodic 100% turnover guarantees that all "live" 
tuples will shortly afterwards circulate in lower block numbers again, 
with gigabytes of empty space at the end.

This by itself still isn't a problem. The existing code may do the job 
just fine "unless" there is "frequent" access to that very table. Only 
with this special combination of circumstances do we actually have a 
problem.

Only now, with a significant amount of free space at the end and 
frequent access to the table, does the truncate safety scan take long 
enough, and actually have to read pages from disk, to interfere with 
client transactions.

At this point, the truncate safety scan may have to be interrupted to 
let the frequent other traffic go through. This is what we accomplish 
with the autovacuum_truncate_lock_check interval, where we voluntarily 
release the lock whenever someone else needs it. I agree with Kevin that 
a 20ms check interval is reasonable because the code to check this is 
even less expensive than releasing the exclusive lock we're holding.

At the same time, completely giving up and relying on the autovacuum 
launcher to start another worker isn't as free as it looks either. The 
next autovacuum worker will have to do the VACUUM scan first, before 
getting to the truncate phase; we cannot just skip blindly to the 
truncate code. If the truncate were repeatedly abandoned, the table 
would deteriorate and accumulate dead tuples again. The removal of dead 
tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages that 
are marked all-visible, so the scan won't physically read the majority 
of the empty pages at the end of the table over and over. But it will 
at least scan all pages that have been modified since the last VACUUM 
run.

To me this means that we want to be more generous to the truncate code 
about acquiring the exclusive lock. In my tests, I've seen that a 
rolling window table with a "live" set of just 10 MB or so, but empty 
space of 3 GB, can still have a 2-minute VACUUM scan time. Throwing that 
work away because we can't acquire the exclusive lock within 2 seconds 
is a waste of effort.

Something in between 2-60 seconds sounds more reasonable to me.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/6/2012 12:45 PM, Robert Haas wrote:
> On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
>> That sort of "dynamic" approach would indeed be interesting. But I fear that
>> it is going to be complex at best. The amount of time spent in scanning
>> heavily depends on the visibility map. The initial vacuum scan of a table
>> can take hours or more, but it does update the visibility map even if the
>> vacuum itself is aborted later. The next vacuum may scan that table in
>> almost no time at all, because it skips all blocks that are marked "all
>> visible".
>
> Well, if that's true, then there's little reason to worry about giving
> up quickly, because the next autovacuum a minute later won't consume
> many resources.

"Almost no time" is of course "relative" to what an actual scan and dead
tuple removal cost. Looking at a table with 3 GB of dead tuples at the
end, the initial vacuum scan takes hours. When vacuum comes back to this
table, cleaning up a couple megabytes of newly deceased tuples and then
skipping over the all visible pages may take a minute.

Based on the discussion and what I feel is a consensus, I have created
an updated patch that has no GUC at all. The hard-coded parameters in
include/postmaster/autovacuum.h are

     AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL      20 /* ms */
     AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL       50 /* ms */
     AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT             5000 /* ms */

I gave it the worst workload I can think of: a pgbench-style
application that throws about 10 transactions per second at it, so that
there is constantly a need to give up the lock due to conflicting lock
requests and then reacquire it. A "cleanup" process periodically moves
old tuples from the history table to an archive table, making history a
rolling window table. And a third job produces a 10-second transaction
2-3 times per minute, forcing autovacuum to give up on the lock
reacquisition.

Even with that workload, autovacuum is slowly but steadily chopping
away at the table.


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment

Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Jan Wieck wrote:

> Based on the discussion and what I feel is a consensus I have
> created an updated patch that has no GUC at all. The hard coded
> parameters in include/postmaster/autovacuum.h are
> 
>  AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
>  AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
>  AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

Since these really aren't part of the external API and are only
referenced in vacuumlazy.c, it seems more appropriate to define
them there.

> I gave that the worst workload I can think of. A pgbench (style) 
> application that throws about 10 transactions per second at it,
> so that there is constantly the need to give up the lock due to
> conflicting lock requests and then reacquiring it again. A
> "cleanup" process is periodically moving old tuples from the
> history table to an archive table, making history a rolling
> window table. And a third job that 2-3 times per minute produces
> a 10 second lasting transaction, forcing autovacuum to give up on
> the lock reacquisition.
> 
> Even with that workload autovacuum slow but steady is chopping
> away at the table.

Applies with minor offsets, builds without warning, and passes
`make check-world`. My tests based on your earlier posted test
script confirm the benefit.

There are some minor white-space issues; for example git diff
--color shows some trailing spaces in comments.

There are no docs, but since there are no user-visible changes in
behavior other than better performance and more prompt and reliable
truncation of tables where we were already doing so, it doesn't seem
like any new docs are needed. Due to the nature of the problem,
tests are tricky to run correctly and take a long time to run, so I
don't see how any regression tests would be appropriate, either.

This patch seems ready for committer, and I would be comfortable
with making the minor changes I mention above and committing it. 
I'll wait a day or two to allow any other comments or objections.

To summarize, there has been pathological behavior in an
infrequently-encountered corner case of autovacuum, wasting a lot
of resources indefinitely when it is encountered; this patch gives
a major performance improvement in this situation without any
other user-visible change and without requiring any new GUCs. It
adds a new public function in the locking area to allow a process
to check whether a particular lock it is holding is blocking any
other process, and another to wrap that to make it easy to check
whether the lock held on a particular table is blocking another
process. It uses this new capability to be smarter about scheduling
autovacuum's truncation work, and to avoid throwing away
incremental progress in doing so.

As such, I don't think it would be crazy to back-patch this, but I
think it would be wise to allow it to be proven on master/9.3 for a
while before considering that.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
Jan Wieck
Date:
On 12/9/2012 2:37 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>
>> Based on the discussion and what I feel is a consensus I have
>> created an updated patch that has no GUC at all. The hard coded
>> parameters in include/postmaster/autovacuum.h are
>>
>>  AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
>>  AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
>>  AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
>
> Since these really aren't part of the external API and are only
> referenced in vacuumlazy.c, it seems more appropriate to define
> them there.
>
>> I gave that the worst workload I can think of. A pgbench (style)
>> application that throws about 10 transactions per second at it,
>> so that there is constantly the need to give up the lock due to
>> conflicting lock requests and then reacquiring it again. A
>> "cleanup" process is periodically moving old tuples from the
>> history table to an archive table, making history a rolling
>> window table. And a third job that 2-3 times per minute produces
>> a 10 second lasting transaction, forcing autovacuum to give up on
>> the lock reacquisition.
>>
>> Even with that workload autovacuum slow but steady is chopping
>> away at the table.
>
> Applies with minor offsets, builds without warning, and passes
> `make check-world`. My tests based on your earlier posted test
> script confirm the benefit.
>
> There are some minor white-space issues; for example git diff
> --color shows some trailing spaces in comments.

Cleaned up all of those.


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment

Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Jan Wieck wrote:

> Cleaned up all of those.

Applied with trivial editing, mostly from a pgindent run against
modified files.

-Kevin



Re: autovacuum truncate exclusive lock round two

From
"Kevin Grittner"
Date:
Kevin Grittner wrote:

> Applied with trivial editing, mostly from a pgindent run against
> modified files.

Applied back as far as 9.0. Before that, the code didn't match well
enough for it to seem safe to apply without many hours of
additional testing.

I have confirmed occurrences of this problem at least as far back as
9.0 in the wild, where it is causing performance degradation severe
enough to force users to stop production usage long enough to
manually vacuum the affected tables. The use case is a lot like
what Jan described, where PostgreSQL is being used for high-volume
queuing. When there is a burst of activity which increases table
size, and then the queues are drained back to a normal state, the
problem shows up.

-Kevin