Thread: autovacuum truncate exclusive lock round two
This problem has been discussed before. Those familiar with the subject please skip the next paragraph.

When autovacuum finds a substantial number of empty pages at the end of a relation, it attempts to truncate the relation in lazy_truncate_heap(). Because all the scanning has been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty. To do that, autovacuum grabs an AccessExclusiveLock on the relation, then scans backwards to the last non-empty page. If any other backend needs to access the table during this time, it will kill the autovacuum from the deadlock detection code, which by default happens after a 1000ms timeout. The autovacuum launcher will start another vacuum after (by default) 60 seconds, which will most likely get killed again, and again, and again. The net result is that the table is never truncated and every 60 seconds there is a 1 second hiccup before the autovacuum is killed.

Proposal:

Add functions to lmgr that are derived from the lock release code, but instead of releasing the lock and waking up waiters, they just return a boolean telling whether there are any waiters that would be woken up if this lock were released.

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to periodically check if there is a conflicting lock request waiting. If not, keep going. If there is a waiter, truncate the relation to the point checked thus far, release the AccessExclusiveLock, then loop back to where we acquire this lock in the first place and continue checking/truncating.

I have a working patch here:

    https://github.com/wieck/postgres/tree/autovacuum-truncate-lock

This patch does introduce three new postgresql.conf parameters, which I would be happy to get rid of if we could derive them from something else. Something based on the deadlock timeout may be possible.

    autovacuum_truncate_lock_check = 100ms  # how frequently to check
                                            # for conflicting locks
    autovacuum_truncate_lock_retry = 50     # how often to try acquiring
                                            # the exclusive lock
    autovacuum_truncate_lock_wait = 20ms    # nap in between attempts

With these settings, I see the truncate of a bloated table progressing at a rate of 3 minutes per GB, while that table is accessed 20 times per second. The original "kill autovacuum" mechanism in the deadlock code is still there. All this code really does is 10 lmgr lookups per second and releasing the AccessExclusiveLock if there are any waiters. I don't think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual problem of a bloated table with significant empty pages at the end.

Comments?


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Attachment
Here is the patch for it.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Attachment
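For readers skimming the thread, here is a minimal C sketch of the flow the proposal describes, under stated assumptions: LVRelStats and count_nondeletable_pages() are the existing names in vacuumlazy.c, the new lmgr check is assumed to be used inside count_nondeletable_pages() (so it does not appear here), and error handling, WAL details and statistics are omitted. This is not the actual patch.

    /*
     * Sketch only: acquire the exclusive lock conditionally, verify and
     * truncate as far as possible, and loop whenever the backward scan had
     * to bail out early because another backend was waiting on our lock.
     */
    static void
    lazy_truncate_heap_sketch(Relation onerel, LVRelStats *vacrelstats)
    {
        BlockNumber new_rel_pages;

        for (;;)
        {
            /* Give up silently if somebody else already holds the table. */
            if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
                return;

            /*
             * Scan backwards from the end of the relation, re-verifying that
             * the pages are still empty.  The patch makes this function stop
             * and return early as soon as lmgr reports a waiter queued behind
             * our AccessExclusiveLock.
             */
            new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);

            /*
             * Truncate whatever has been verified so far (the real code skips
             * this when nothing was gained), then get out of the way.
             */
            RelationTruncate(onerel, new_rel_pages);
            UnlockRelation(onerel, AccessExclusiveLock);

            /* Finished the whole tail?  Otherwise reacquire and continue. */
            if (new_rel_pages <= vacrelstats->nonempty_pages)
                break;
        }
    }

The point of the loop is that no work is thrown away: every pass persists the truncation it managed to verify before yielding the lock.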
Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:
> This problem has been discussed before. Those familiar with the subject please skip the next paragraph.

Apologies if this was already thought of and ruled out for some reason, but...

> Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)?

The notion of having to double-scan and the AccessExclusiveLock on the relation are telling me this work-around, while completely possible, isn't exactly ideal... Perhaps another option would be a page-level lock or something which is larger than per-row (strikes me as a lot of overhead for this and it's not clear how we'd do it), but less than an entire relation, but there are certainly pain points there too.

	Thanks,

		Stephen
Stephen,

On 10/24/2012 10:46 PM, Stephen Frost wrote:
> * Jan Wieck (JanWieck@Yahoo.com) wrote:
>> Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty.
>
> Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

> The notion of having to double-scan and the AccessExclusiveLock on the relation are telling me this work-around, while completely possible, isn't exactly ideal...

Under normal circumstances, with just a few pages to trim off the end, this is no problem. Those pages were the last pages just scanned by this very autovacuum, so they are found in the shared buffers anyway. All the second scan does in that case is fetch the page once more from shared buffers to be 100% sure we are not truncating off new tuples. We definitely need the AccessExclusiveLock to prevent someone from extending the relation at the end between our check for the relation size and the truncate. Fetching 50 empty blocks from the buffer cache while at it isn't that big of a deal, and that is what it normally looks like.

The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. If for some reason (purging turned off, for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms, and autovacuum is at risk of getting killed while at it.

Since we have experienced this problem several times now on our production systems, something clearly needs to be done. But IMHO it doesn't happen often enough to take any risk here.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 10/24/2012 10:46 PM, Stephen Frost wrote:
>> Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)?

> I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

No. Backends are under no obligation to update FSM for each individual tuple insertion, and typically don't do so.

More to the point, you have to take AccessExclusiveLock *anyway*, because this is interlocking not only against new insertions but plain read-only seqscans: if a seqscan falls off the end of the table it will be very unhappy. So I don't see where we'd buy anything by consulting the FSM.

			regards, tom lane
On 10/25/2012 9:45 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?
>
> No. Backends are under no obligation to update FSM for each individual tuple insertion, and typically don't do so.
>
> More to the point, you have to take AccessExclusiveLock *anyway*, because this is interlocking not only against new insertions but plain read-only seqscans: if a seqscan falls off the end of the table it will be very unhappy. So I don't see where we'd buy anything by consulting the FSM.

Thank you.

One thing that I haven't mentioned yet is that with this patch, we could actually insert a vacuum_delay_point() into the loop in count_nondeletable_pages(). We no longer cling to the exclusive lock but rather get out of the way as soon as somebody needs the table. Under this condition we no longer need to do the second scan full bore.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
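To illustrate where such a call would sit, here is an excerpt-style sketch of the backward scan in count_nondeletable_pages(); onerel, blkno, vacrelstats and vac_strategy are the existing names in vacuumlazy.c, and the surrounding declarations and page inspection are assumed, not shown.

    /* inside count_nondeletable_pages(), scanning backwards (sketch) */
    while (blkno > vacrelstats->nonempty_pages)
    {
        Buffer  buf;

        /*
         * Since the exclusive lock is released as soon as somebody is waiting
         * on it, the second scan no longer has to run full bore; it can honor
         * the cost-based vacuum delay like the first pass does.
         */
        vacuum_delay_point();

        blkno--;
        buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, vac_strategy);
        /* ... check that the page is still empty, release the buffer ... */
    }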
Jan,

* Jan Wieck (JanWieck@Yahoo.com) wrote:
> The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. If for some reason (purging turned off, for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms, and autovacuum is at risk of getting killed while at it.

My concern is that this could certainly also happen to a heavily updated table in an OLTP type of environment where the requirement to take a heavy lock to clean it up might prevent it from ever happening.. I was simply hoping we could find a mechanism to lock just those pages we're getting ready to nuke rather than the entire relation. Perhaps we can consider how to make those changes alongside of changes to eliminate or reduce the extent locking that has been painful (for me at least) when doing massive parallel loads into a table.

> Since we have experienced this problem several times now on our production systems, something clearly needs to be done. But IMHO it doesn't happen often enough to take any risk here.

I'm not advocating a 'do-nothing' approach, was just looking for another option that might allow for this work to happen on the heap in parallel with regular access. Since we haven't got any way to do that currently, +1 for moving forward with this as it clearly improves the current situation.

	Thanks,

		Stephen
On 10/25/2012 10:12 AM, Stephen Frost wrote:
> My concern is that this could certainly also happen to a heavily updated table in an OLTP type of environment where the requirement to take a heavy lock to clean it up might prevent it from ever happening.. I was simply hoping we could find a mechanism to lock just those pages we're getting ready to nuke rather than the entire relation. Perhaps we can consider how to make those changes alongside of changes to eliminate or reduce the extent locking that has been painful (for me at least) when doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table. Preventing not only the clean up, but the following ANALYZE as well, is precisely what happens. There may be multiple ways how to get into this situation, but once you're there the symptoms are the same. Vacuum fails to truncate it, causing a 1 second hiccup every minute while vacuum is holding the exclusive lock, until the deadlock detection code of another transaction kills it.

My patch doesn't change the logic of how we ensure that we don't zap any data by accident with the truncate, and Tom's comments suggest we should stick to it. It only makes autovacuum check frequently if the AccessExclusiveLock is actually blocking anyone and then get out of the way.

I would rather like to discuss any ideas how to do all this without 3 new GUCs.

In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10, 10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.

The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry, but the nap time in between tries could be a hardcoded 20ms or the cost based vacuum nap time (which defaults to 20ms).

Any other ideas are welcome.


Thanks, Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Jan Wieck wrote:

> In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10, 10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.

So you would be calling GetCurrentTimestamp() continuously? Since you mentioned adding a vacuum delay point, I wonder if it would make sense to test for lockers each time it would consider going to sleep, instead. (One hazard to keep in mind is the case where no vacuum delay is configured.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/25/2012 12:24 PM, Alvaro Herrera wrote:
> So you would be calling GetCurrentTimestamp() continuously? Since you mentioned adding a vacuum delay point, I wonder if it would make sense to test for lockers each time it would consider going to sleep, instead. (One hazard to keep in mind is the case where no vacuum delay is configured.)

Depends on your definition of "continuously". If doing one INSTR_TIME_SET_CURRENT(), which on Unix boils down to a gettimeofday(), every 32 ReadBufferExtended() calls counts as continuously, then yes.

Adding a vacuum_delay_point() is something we should consider. However, the vacuum_delay_point() call simply naps when enough cost has been racked up. You don't know if the next call will nap or not. We would have to extend that functionality with some vacuum_delay_would_nap() call to do what you suggest.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
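As a rough sketch of the throttling just described (not the actual patch code): the clock is only read every 32 blocks, and the lmgr check only happens once the configured interval has elapsed. LockHasWaitersRelation(), check_interval_usec and the already-declared starttime are assumed names used here purely for illustration, inside the backward scan of count_nondeletable_pages().

    /* sketch: at the top of the backward scan, before reading the next block */
    if ((blkno % 32) == 0)
    {
        instr_time  currenttime;
        instr_time  elapsed;

        INSTR_TIME_SET_CURRENT(currenttime);
        elapsed = currenttime;
        INSTR_TIME_SUBTRACT(elapsed, starttime);

        if (INSTR_TIME_GET_MICROSEC(elapsed) >= check_interval_usec)
        {
            /* Somebody is queued behind our AccessExclusiveLock: stop here. */
            if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
                return blkno;       /* truncate only down to this point */

            starttime = currenttime;
        }
    }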
On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
> On 10/25/2012 10:12 AM, Stephen Frost wrote:
>> [...]
>
> I would rather like to discuss any ideas how to do all this without 3 new GUCs.
>
> In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10, 10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.

One other way could be to check after every few pages for a conflicting lock request.

> The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry,

Can't we do something like, after nap check for conditional lock and if it didn't get then get lock unconditionally. The reason why after your implementation it might be okay to have lock unconditionally after one try is that anyway after every few pages or after small time, it will release the lock if there is any waiter.

> but the nap time in between tries could be a hardcoded 20ms or the cost based vacuum nap time (which defaults to 20ms).

I think using the cost based vacuum nap time or a default value is good.

Adding new parameters might have user/administrator overhead; it is always better if it can be intelligently decided by the database itself. However, if you feel these are parameters which can vary based on different kinds of usage, then I think it is better to expose them through configuration parameters to users.

With Regards,
Amit Kapila.
On 10/26/2012 1:29 AM, Amit Kapila wrote:
> On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
>> [...]
>>
>> In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10, 10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.
>
> One other way could be to check after every few pages for a conflicting lock request.

How is this any different from what my patch does? Did you even look at the code?

>> The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry,
>
> Can't we do something like, after nap check for conditional lock and if it didn't get then get lock unconditionally.

No, we cannot. This is also well documented in the code.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
On Friday, October 26, 2012 10:59 AM Amit Kapila wrote:
> On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
>> [...]
>>
>> The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry,
>
> Can't we do something like, after nap check for conditional lock and if it didn't get then get lock unconditionally. The reason why after your implementation it might be okay to have lock unconditionally after one try is that anyway after every few pages or after small time, it will release the lock if there is any waiter.

I am sorry, at this point trying to take an unconditional X lock can lead to deadlock, so the above is not possible.

With Regards,
Amit Kapila.
On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
> On 10/26/2012 1:29 AM, Amit Kapila wrote:
> [...]
>
>> One other way could be to check after every few pages for a conflicting lock request.
>
> How is this any different from what my patch does?

The difference is that in the patch it checks for waiters by using 2 parameters, autovacuum_truncate_lock_check and blkno%32, and what I had mentioned was to check only based on blkno. Will it effect too much if we directly check for waiters after every 32 (any feasible number) blocks?

> Did you even look at the code?

I had not looked at the code when I gave the reply to your previous mail. But now I have checked it.

With Regards,
Amit Kapila.
On 10/26/2012 6:35 AM, Amit Kapila wrote:
> On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
>> On 10/26/2012 1:29 AM, Amit Kapila wrote:
>>> One other way could be to check after every few pages for a conflicting lock request.
>>
>> How is this any different from what my patch does?
>
> The difference is that in the patch it checks for waiters by using 2 parameters, autovacuum_truncate_lock_check and blkno%32, and what I had mentioned was to check only based on blkno. Will it effect too much if we directly check for waiters after every 32 (any feasible number) blocks?

The blkno%32 is there to not do the gettimeofday() call too often. But relying on the blkno alone is IMHO not a good idea. It would have to be a number small enough so that even on a busy system, and when the pages have to be read from disk, vacuum checks and releases the lock quickly, but large enough so that it doesn't create a significant amount of spinlock calls in the lmgr. We would end up with another parameter, a number of blocks, that is a lot harder to estimate a good value for.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
On Wed, Oct 24, 2012 at 4:20 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
> This patch does introduce three new postgresql.conf parameters, which I would be happy to get rid of if we could derive them from something else. Something based on the deadlock timeout may be possible.
>
>     autovacuum_truncate_lock_check = 100ms  # how frequently to check
>                                             # for conflicting locks
>     autovacuum_truncate_lock_retry = 50     # how often to try acquiring
>                                             # the exclusive lock
>     autovacuum_truncate_lock_wait = 20ms    # nap in between attempts

+1 for this general approach.

As you suggested downthread, I think that hard-coding autovacuum_truncate_lock_check to one-tenth of the deadlock timeout should be just fine.

For the other two parameters, I doubt we need to make them configurable at all. It's not exactly clear what to set them to, but it does seem clear that the down side of setting them incorrectly isn't very much as long as the defaults are roughly sane. Personally, I'd be inclined to retry less frequently but over a slightly longer time period - say twenty retries, one after every 100ms. But I wouldn't be upset if we settled on what you've got here, either. We just don't want to let the total time we spend waiting for the lock get too long, because that means pinning down an auto-vacuum worker that might be critically needed elsewhere. So the product of autovacuum_truncate_lock_retry and autovacuum_truncate_lock_wait probably should not be more than a couple of seconds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
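As a back-of-the-envelope sketch of the budget Robert describes, assuming the illustrative constants below (they are not from any committed patch) and the existing ConditionalLockRelation() and pg_usleep() calls:

    /* illustrative values: 20 retries * 100ms nap = at most ~2s pinned down */
    #define TRUNCATE_LOCK_RETRIES   20
    #define TRUNCATE_LOCK_WAIT_MS   100

    {
        int     tries;
        bool    got_lock = false;

        for (tries = 0; tries < TRUNCATE_LOCK_RETRIES; tries++)
        {
            if (ConditionalLockRelation(onerel, AccessExclusiveLock))
            {
                got_lock = true;
                break;
            }
            pg_usleep(TRUNCATE_LOCK_WAIT_MS * 1000L);   /* pg_usleep takes microseconds */
        }

        if (!got_lock)
            return;     /* give the worker back; autovacuum will come around again */
    }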
Jan Wieck <JanWieck@Yahoo.com> writes:
> Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to periodically check if there is a conflicting lock request waiting. If not, keep going. If there is a waiter, truncate the relation to the point checked thus far, release the AccessExclusiveLock, then loop back to where we acquire this lock in the first place and continue checking/truncating.

I think that maybe we could just bail out after releasing the AccessExclusiveLock and trust autovacuum to get back to truncating that relation later. That would allow removing 2 of the 3 GUCs below:

>     autovacuum_truncate_lock_check = 100ms  # how frequently to check
>                                             # for conflicting locks

This is the one remaining. Could we maybe check for lock conflict after every move backward a page, or some multiple thereof? The goal would be to ensure that progress is made, while also being aware of concurrent activity, ala CHECK_FOR_INTERRUPTS().

Regards,
--
Dimitri Fontaine                     http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
Dimitri Fontaine wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to periodically check if there is a conflicting lock request waiting. If not, keep going. If there is a waiter, truncate the relation to the point checked thus far, release the AccessExclusiveLock, then loop back to where we acquire this lock in the first place and continue checking/truncating.
>
> I think that maybe we could just bail out after releasing the AccessExclusiveLock and trust autovacuum to get back to truncating that relation later.

That doesn't work, because the truncating code is not reached unless vacuuming actually took place. So if you interrupt it, it will just not get called again later. Maybe we could have autovacuum somehow invoke that separately, but that would require that the fact that truncation was aborted is kept track of somewhere.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, November 16, 2012 4:09 AM Alvaro Herrera wrote:
> Dimitri Fontaine wrote:
>> I think that maybe we could just bail out after releasing the AccessExclusiveLock and trust autovacuum to get back to truncating that relation later.
>
> That doesn't work, because the truncating code is not reached unless vacuuming actually took place. So if you interrupt it, it will just not get called again later. Maybe we could have autovacuum somehow invoke that separately, but that would require that the fact that truncation was aborted is kept track of somewhere.

Won't it have a chance to be handled the next time vacuum triggers due to updates/deletes on some other pages? OTOH, maybe next time the same thing happens again and it is not able to complete the truncate. So I think it's better to complete it the first time, but maybe using some heuristic time for wait and retry rather than configuration variables.

With Regards,
Amit Kapila
Jan,

Are you posting an updated patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
> Are you posting an updated patch?

Well, there wasn't exactly a consensus on what should change, so I'll throw some initial review comments out even though I still need to check some things in the code and do more testing.

The patch applied cleanly, compiled without warnings, and passed all tests in `make check-world`.

The previous discussion seemed to me to achieve consensus on the need for the feature, but a general dislike for adding so many new GUC settings to configure it. I concur on that. I agree with the feelings of others that just using deadlock_timeout / 10 (probably clamped to a minimum of 10ms) would be good instead of autovacuum_truncate_lock_check. I agree that the values of the other two settings probably aren't too critical as long as they are fairly reasonable, which I would define as being 20ms to 100ms with retries lasting no more than 2 seconds. I'm inclined to suggest a total time of deadlock_timeout * 2 clamped to 2 seconds, but maybe that is over-engineering it.

There was also a mention of possibly inserting a vacuum_delay_point() call outside of the AccessExclusiveLock. I don't feel strongly about it, but if the page scanning cost can be tracked easily, I guess it is better to do that.

Other than simplifying the code based on eliminating the new GUCs, the coding issues I found were:

TRUE and FALSE were used as literals. Since true and false are used about 10 times as often and current code in the modified files is mixed, I would recommend the lowercase form. We decided it wasn't worth the trouble to convert the minority usage over, but I don't see any reason to add new instances of the minority case.

In LockHasWaiters() the partition lock is acquired using LW_EXCLUSIVE. I don't see why that can't be LW_SHARED. Was there a reason for this, or was it just not changed after copy/paste?

I still need to review the timing calls, since I'm not familiar with them, so it wasn't immediately obvious to me whether they were being used correctly. I have no reason to believe that they aren't, but feel I should check. Also, I want to do another pass looking just for off-by-one errors on blkno. Again, I have no reason to believe that there is a problem; it just seems like it would be a particularly easy type of mistake to make and miss when a key structure has this field:

    BlockNumber nonempty_pages;
                /* actually, last nonempty page + 1 */

And I want to test more.

But if a new version could be created based on the above, that would be great. Chances seem relatively slim at this point that there will be much else to change, if anything.

-Kevin
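For context on the LW_SHARED comment, here is a much-simplified sketch of the kind of check the patch's LockHasWaiters() performs once the LOCK entry has been located; partitionLock, lockMethodTable, lock, lockmode and hasWaiters are assumed to come from the surrounding lookup code, and the only point being made is the lock mode used on the partition lock.

    /* Inspecting the wait mask is read-only, so shared access should do. */
    LWLockAcquire(partitionLock, LW_SHARED);    /* the patch used LW_EXCLUSIVE */

    if ((lockMethodTable->conflictTab[lockmode] & lock->waitMask) != 0)
        hasWaiters = true;      /* someone is queued on a conflicting mode */

    LWLockRelease(partitionLock);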
Kevin Grittner wrote:
> I still need to review the timing calls, since I'm not familiar with them, so it wasn't immediately obvious to me whether they were being used correctly. I have no reason to believe that they aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is done with INSTR_TIME_SET_ZERO() of the target followed by INSTR_TIME_ADD() with the target and the source. It seems to me that simple assignment would be more readable, and I can't see any down side.

Why shouldn't:

    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

instead be?:

    starttime = currenttime;

Resetting starttime this way seems especially odd.

> Also, I want to do another pass looking just for off-by-one errors on blkno. Again, I have no reason to believe that there is a problem; it just seems like it would be a particularly easy type of mistake to make and miss when a key structure has this field:
>
>     BlockNumber nonempty_pages;
>                 /* actually, last nonempty page + 1 */

No problems found with that.

> And I want to test more.

The patch worked as advertised in all my tests, but I became uncomfortable with the games being played with the last autovacuum timestamp and the count of dead tuples. Sure, that causes autovacuum to kick back in until it has dealt with the truncation, but it makes it hard for a human looking at the stat views to see what's going on, and I'm not sure it wouldn't lead to bad plans due to stale statistics.

Personally, I would rather see us add a boolean to indicate that autovacuum was needed (regardless of the math on the other columns) and use that to control the retries -- leaving the other columns free to reflect reality.

-Kevin
On 11/28/2012 3:33 PM, Kevin Grittner wrote:
> It seems odd to me that assignment of one instr_time to another is done with INSTR_TIME_SET_ZERO() of the target followed by INSTR_TIME_ADD() with the target and the source. It seems to me that simple assignment would be more readable, and I can't see any down side.
> [...]
> Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is

    starttime = currenttime;

portable if those are structs?

> The patch worked as advertised in all my tests, but I became uncomfortable with the games being played with the last autovacuum timestamp and the count of dead tuples. Sure, that causes autovacuum to kick back in until it has dealt with the truncation, but it makes it hard for a human looking at the stat views to see what's going on, and I'm not sure it wouldn't lead to bad plans due to stale statistics.
>
> Personally, I would rather see us add a boolean to indicate that autovacuum was needed (regardless of the math on the other columns) and use that to control the retries -- leaving the other columns free to reflect reality.

You mean to extend the stats by yet another column?

The thing is that this whole case happens rarely. We see this every other month or so, and only on a rolling window table after it got severely bloated due to some abnormal use (purging disabled for some operational reason). The whole situation resolves itself after a few minutes to maybe an hour or two. Personally I don't think living with a few wrong stats on one table for that time is so bad that it warrants changing that much more code.

Lower casing TRUE/FALSE will be done.

If LW_SHARED is enough in LockHasWaiters(), that's fine too.

I think we have a consensus that the check interval should be derived from deadlock_timeout/10, clamped to 10ms. I'm going to remove that GUC. About the other two, I'm just not sure. Maybe use half the lock waiter check interval as the lock retry interval, again clamped to 10ms, and simply leave one GUC that controls how long autovacuum should retry this.

I'm not too worried about the retry interval. However, the problem with the overall retry duration is that this value very much depends on the usage pattern of the database. If long running transactions (like >5 seconds) are the norm, then a hard coded value of 2 seconds before autovacuum gives up isn't going to help much.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 11/28/2012 3:33 PM, Kevin Grittner wrote:
>> Resetting starttime this way seems especially odd.

> instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
>     starttime = currenttime;
> portable if those are structs?

Sure. We rely on struct assignment in lots of places.

			regards, tom lane
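In other words, the simplification Kevin suggests is plain C struct assignment, which copies the whole value regardless of whether instr_time is struct timeval (Unix) or LARGE_INTEGER (Windows). A tiny illustration:

    instr_time  starttime;
    instr_time  currenttime;

    INSTR_TIME_SET_CURRENT(currenttime);

    /* plain assignment works on either implementation of instr_time */
    starttime = currenttime;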
On 11/29/2012 9:46 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
>>     starttime = currenttime;
>> portable if those are structs?
>
> Sure. We rely on struct assignment in lots of places.

Will be done then.


Thanks, Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Attached is a new patch that addresses most of the points raised in the discussion before.

1) Most of the configuration variables are derived from deadlock_timeout now. The "check for conflicting lock request" interval is deadlock_timeout/10, clamped to 10ms. The "try to acquire exclusive lock" interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 20000ms.

   I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.

2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The LW_EXCLUSIVE was indeed a copy/paste result.

3) The instr_time handling was simplified as suggested.

4) Lower case TRUE/FALSE.

I did not touch the part about suppressing the stats and the ANALYZE step of "auto vacuum+analyze". The situation is no different today. When the deadlock code kills autovacuum, stats aren't updated either. And this patch is meant to cause autovacuum to finish the truncate in a few minutes or hours, so that the situation fixes itself.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Attachment
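For reference, the derivation described in point 1) boils down to something like the following; the variable names here are made up for illustration, DeadlockTimeout is the backend's C variable for deadlock_timeout (in milliseconds), and the clamping is to a minimum of 10ms as discussed upthread. This is a sketch of the v2 timing derivation, not the literal patch code.

    int     lock_check_interval_ms; /* how often to call LockHasWaiters() */
    int     lock_retry_wait_ms;     /* nap between ConditionalLockRelation() tries */

    lock_check_interval_ms = Max(DeadlockTimeout / 10, 10);
    lock_retry_wait_ms = Max(DeadlockTimeout / 20, 10);

    /*
     * autovacuum_truncate_lock_try (0..20000 ms, default 2000) stays a GUC
     * and bounds how long the worker keeps trying to (re)acquire the
     * AccessExclusiveLock before giving up for this autovacuum round.
     */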
Jan Wieck wrote:
> Attached is a new patch that addresses most of the points raised in the discussion before.
>
> 1) Most of the configuration variables are derived from deadlock_timeout now. The "check for conflicting lock request" interval is deadlock_timeout/10, clamped to 10ms. The "try to acquire exclusive lock" interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 20000ms.

If we're going to keep this GUC, we need docs for it.

> I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.

Arguably we could simplify the deadlock resolution code a little, but it seems like it is probably safer to leave it as a failsafe, at least for now.

> I did not touch the part about suppressing the stats and the ANALYZE step of "auto vacuum+analyze". The situation is no different today. When the deadlock code kills autovacuum, stats aren't updated either. And this patch is meant to cause autovacuum to finish the truncate in a few minutes or hours, so that the situation fixes itself.

Unless someone wants to argue for more aggressive updating of the stats, I guess I'll yield to that argument.

Besides the documentation, the only thing I could spot on this go-around was that this comment probably no longer has a reason for being, since this is no longer conditional and what it's doing is obvious from the code:

    /* Initialize the starttime if we check for conflicting lock requests */

With docs and dropping the obsolete comment, I think this will be ready.

-Kevin
On 12/3/2012 5:42 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> 1) Most of the configuration variables are derived from deadlock_timeout now. [...] The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 20000ms.
>
> If we're going to keep this GUC, we need docs for it.

Certainly. But since we're still debating which and how many GUC variables we want, I don't think doc-time has come yet.

>> I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.
>
> Arguably we could simplify the deadlock resolution code a little, but it seems like it is probably safer to leave it as a failsafe, at least for now.

Thinking about it, I'm not really happy with removing the autovacuum_truncate_lock_check GUC at all.

Fact is that the deadlock detection code and the configuration parameter for it should IMHO have nothing to do with all this in the first place. A properly implemented application does not deadlock. Someone running such a properly implemented application should be able to safely set deadlock_timeout to minutes without the slightest ill side effect, but with the benefit that the deadlock detection code itself does not add to the lock contention. The only reason one cannot do so today is because autovacuum's truncate phase could then freeze the application with an exclusive lock for that long.

I believe the check interval needs to be decoupled from the deadlock_timeout again. This will leave us with 2 GUCs at least.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Jan Wieck wrote:
> Thinking about it, I'm not really happy with removing the autovacuum_truncate_lock_check GUC at all.
>
> Fact is that the deadlock detection code and the configuration parameter for it should IMHO have nothing to do with all this in the first place. A properly implemented application does not deadlock.

I don't agree. I believe that in some cases it is possible and practicable to set access rules which would prevent deadlocks in application access to a database. In other cases the convolutions required in the code, the effort in educating dozens or hundreds of programmers maintaining the code (and keeping the training current during staff turnover), and the staff time required for compliance far outweigh the benefit of an occasional transaction retry. However, it is enough for your argument that there are cases where it can be done.

> Someone running such a properly implemented application should be able to safely set deadlock_timeout to minutes without the slightest ill side effect, but with the benefit that the deadlock detection code itself does not add to the lock contention. The only reason one cannot do so today is because autovacuum's truncate phase could then freeze the application with an exclusive lock for that long.
>
> I believe the check interval needs to be decoupled from the deadlock_timeout again.

OK

> This will leave us with 2 GUCs at least.

Hmm. What problems do you see with hard-coding reasonable values? Adding two or three GUC settings for a patch with so little user-visible impact seems weird. And it seems to me (and also seemed to Robert) as though the specific values of the other two settings really aren't that critical as long as they are anywhere within a reasonable range. Configuring PostgreSQL can be intimidating enough without adding knobs that really don't do anything useful.

Can you show a case where special values would be helpful?

-Kevin
On 12/4/2012 8:06 AM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> I believe the check interval needs to be decoupled from the deadlock_timeout again.
>
> OK
>
>> This will leave us with 2 GUCs at least.
>
> Hmm. What problems do you see with hard-coding reasonable values?

The question is what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases where truncating a table can hurt, we are dealing with many gigabytes. The original vacuumlazy scan of them can take hours if not days. During that scan the vacuum worker has probably spent many hours napping in the vacuum delay points. For me, a 50ms interval for 5 seconds would be reasonable for (re)acquiring that lock. The reasoning behind it is that we need some sort of retry mechanism, because if autovacuum just gave up the exclusive lock because someone needed access, it is more or less guaranteed that the immediate attempt to reacquire it will fail until that waiter has committed. But if it can't get a lock after 5 seconds, the system seems busy enough that autovacuum should come back much later, when the launcher kicks it off again. I don't care much about occupying that autovacuum worker for a few seconds. It just spent hours vacuuming that very table. How much harm will a couple more seconds do?

The check interval for the LockHasWaiters() call, however, depends very much on the response time constraints of the application. A 200ms interval, for example, would cause the truncate phase to hold onto the exclusive lock for 200ms at least. That means that a steady stream of short running transactions would see a 100ms "blocking" on average, 200ms max. For many applications that is probably OK. If your response time constraint is <=50ms on 98% of transactions, you might want to have that knob though.

I admit I really have no idea what the most reasonable default for that value would be. Something between 50ms and deadlock_timeout/2, I guess.


Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin
Jan Wieck wrote: > [arguments for GUCs] This is getting confusing. I thought I had already conceded the case for autovacuum_truncate_lock_try, and you appeared to spend most of your post arguing for it anyway. I think. It's a little hard to tell. Perhaps the best thing is to present the issue to the list and solicit more opinions on what to do. Please correct me if I mis-state any of this. The primary problem this patch is solving is that in some workloads, autovacuum will repeatedly try to truncate the unused pages at the end of a table, but will continually get canceled after burning resources because another process wants to acquire a lock on the table which conflicts with the one held by autovacuum. This is handled by the deadlock checker, so another process must block for the deadlock_timeout interval each time. All work done by the truncate phase of autovacuum is lost on each interrupted attempt. Statistical information is not updated, so another attempt will trigger the next time autovacuum looks at whether to vacuum the table. It's obvious that this pattern not only fails to release potentially large amounts of unused space back to the OS, but the headbanging can continue to consume significant resources for an extended period, and the repeated blocking for deadlock_timeout could cause latency problems. The patch has the truncate work, which requires AccessExclusiveLock, check at intervals for whether another process is waiting on its lock. That interval is one of the timings we need to determine, and for which a GUC was initially proposed. I think that the check should be fast enough that doing it once every 20ms as a hard-coded interval would be good enough. When it sees this situation, it truncates the file for as far as it has managed to get, releases its lock on the table, sleeps for an interval, and then checks to see if the lock has become available again. How long it should sleep between tries to reacquire the lock is another possible GUC. Again, I'm inclined to think that this could be hard-coded. Since autovacuum was knocked off-task after doing some significant work, I'm inclined to make this interval a little bigger, but I don't think it matters a whole lot. Anything between 20ms and 100ms seems sane. Maybe 50ms? At any point that it is unable to acquire the lock, there is a check for how long this autovacuum task has been starved for the lock. Initially I was arguing for twice the deadlock_timeout on the basis that this would probably be short enough not to leave the autovacuum worker sidelined for too long, but long enough for the attempt to get past a single deadlock between two other processes. This is the setting Jan is least willing to concede. If the autovacuum worker does abandon the attempt, it will keep retrying, since we go out of our way to prevent the autovacuum process from updating the statistics based on the "incomplete" processing. This last interval is not how long it will attempt to truncate, but how long it will keep one autovacuum worker making unsuccessful attempts to acquire the lock before it is put to other uses. Workers will keep coming back to this table until the truncate phase is completed, just as they do without the patch; the difference being that anytime it gets the lock, even briefly, it is able to persist some progress. So the question on the table is which of these three intervals should be GUCs, and what values to use if they aren't. -Kevin
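To make the mechanism concrete, here is a minimal sketch of the waiter check as it could sit inside count_nondeletable_pages() in vacuumlazy.c. This is an illustration only, not the patch itself; the helper LockHasWaitersRelation(), the lock_waiter_detected field and the hard-coded 20ms figure are assumptions made for the sake of the example.

static BlockNumber
count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
{
    BlockNumber blkno = vacrelstats->rel_pages;
    instr_time  starttime;
    instr_time  currenttime;

    INSTR_TIME_SET_CURRENT(starttime);

    while (blkno > vacrelstats->nonempty_pages)
    {
        /* Roughly every 20ms, look for someone queued behind our lock. */
        INSTR_TIME_SET_CURRENT(currenttime);
        INSTR_TIME_SUBTRACT(currenttime, starttime);
        if (INSTR_TIME_GET_MILLISEC(currenttime) >= 20)
        {
            if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
            {
                /*
                 * Someone needs the table.  Everything from blkno upward
                 * has been verified empty, so report that as the new end
                 * and let the caller truncate that much, release the
                 * lock, and retry later.
                 */
                vacrelstats->lock_waiter_detected = true;
                return blkno;
            }
            INSTR_TIME_SET_CURRENT(starttime);
        }

        blkno--;
        /* ... read the page and verify it is still empty (omitted) ... */
    }

    return vacrelstats->nonempty_pages;
}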
On 12/4/2012 1:51 PM, Kevin Grittner wrote: > Jan Wieck wrote: > >> [arguments for GUCs] > > This is getting confusing. I thought I had already conceded the > case for autovacuum_truncate_lock_try, and you appeared to spend > most of your post arguing for it anyway. I think. It's a little > hard to tell. Perhaps the best thing is to present the issue to the > list and solicit more opinions on what to do. Please correct me if > I mis-state any of this. > > The primary problem this patch is solving is that in some > workloads, autovacuum will repeatedly try to truncate the unused > pages at the end of a table, but will continually get canceled > after burning resources because another process wants to acquire a > lock on the table which conflicts with the one held by autovacuum. > This is handled by the deadlock checker, so another process must > block for the deadlock_timeout interval each time. All work done by > the truncate phase of autovacuum is lost on each interrupted > attempt. Statistical information is not updated, so another attempt > will trigger the next time autovacuum looks at whether to vacuum > the table. > > It's obvious that this pattern not only fails to release > potentially large amounts of unused space back to the OS, but the > headbanging can continue to consume significant resources and for > an extended period, and the repeated blocking for deadlock_timeout > could cause latency problems. > > The patch has the truncate work, which requires > AccessExclusiveLock, check at intervals for whether another process > is waiting on its lock. That interval is one of the timings we need > to determine, and for which a GUC was initially proposed. I think > that the check should be fast enough that doing it once every 20ms > as a hard-coded interval would be good enough. When it sees this > situation, it truncates the file for as far as it has managed to > get, releases its lock on the table, sleeps for an interval, and > then checks to see if the lock has become available again. > > How long it should sleep between tries to reacquire the lock is > another possible GUC. Again, I'm inclined to think that this could > be hard-coded. Since autovacuum was knocked off-task after doing > some significant work, I'm inclined to make this interval a little > bigger, but I don't think it matters a whole lot. Anything between > 20ms and 100ms seens sane. Maybe 50ms? > > At any point that it is unable to acquire the lock, there is a > check for how long this autovacuum task has been starved for the > lock. Initially I was arguing for twice the deadlock_timeout on the > basis that this would probably be short enough not to leave the > autovacuum worker sidelined for too long, but long enough for the > attempt to get past a single deadlock between two other processes. > This is the setting Jan is least willing to concede. > > If the autovacuum worker does abandon the attempt, it will keep > retrying, since we go out of our way to prevent the autovacuum > process from updating the statistics based on the "incomplete" > processing. This last interval is not how long it will attempt to > truncate, but how long it will keep one autovacuum worker making > unsuccessful attempts to acquire the lock before it is put to other > uses. Workers will keep coming back to this table until the > truncate phase is completed, just as it does without the patch; the > difference being that anytime it gets the lock, even briefly, it is > able to persist some progress. That is all correct. 
> > So the question on the table is which of these three intervals > should be GUCs, and what values to use if they aren't. I could live with all the above defaults, but would like to see more comments on them. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Tue, Dec 4, 2012 at 2:05 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> So the question on the table is which of these three intervals >> should be GUCs, and what values to use if they aren't. > > I could live with all the above defaults, but would like to see more > comments on them. I largely agree with what's already been said. The only interval that seems to me to maybe need its own knob is the total time after which the autovacuum worker will give up. If we set it to 2 * deadlock_timeout, some people might find that a reason to raise deadlock_timeout. Since people *already* raise deadlock_timeout to obscenely high values (a minute? an hour???) and then complain that things blow up in their face, I think there's a decent argument to be made that piggybacking anything else on that setting is unwise. Against that, FWICT, this problem only affects a small number of users: Jan is the only person I can ever remember reporting this issue. I'm not dumb enough to think he's the only person who it affects; but my current belief is that it's not an enormously common problem. So the main argument I can see against adding a GUC is that the problem is too marginal to justify a setting of its own. What I really see as the key issue is: suppose we hardcode this to say 2 seconds. Is that going to fix the problem effectively for 99% of the people who have this problem, or for 25% of the people who have this problem? In the former case, we probably don't need a GUC; in the latter case, we probably do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > Since people *already* raise deadlock_timeout to obscenely high > values (a minute? an hour???) and then complain that things blow > up in their face, I think there's a decent argument to be made > that piggybacking anything else on that setting is unwise. If people are really doing that, then I tend to agree. I wasn't aware of that practice. > Against that, FWICT, this problem only affects a small number of > users: Jan is the only person I can ever remember reporting this > issue. I'm not dumb enough to think he's the only person who it > affects; but my current belief is that it's not an enormously > common problem. So the main argument I can see against adding a > GUC is that the problem is too marginal to justify a setting of > its own. What I really see as the key issue is: suppose we > hardcode this to say 2 seconds. Is that going to fix the problem > effectively for 99% of the people who have this problem, or for > 25% of the people who have this problem? In the former case, we > probably don't need a GUC; in the latter case, we probably do. Given the fact that autovacuum will keep throwing workers at it to essentially loop indefinitely at an outer level, I don't think the exact setting of this interval is all that critical either. My gut feel is that anything in the 2 second to 5 second range would be sane, so I won't argue over any explicit setting within that range. Below that, I think the overhead of autovacuum coming back to the table repeatedly would probably start to get too high, and we could end up causing some small, heavily-updated table to be neglected by autovacuum -- especially if you get multiple autovacuum workers tied up in this delay on different tables at the same time. Regarding how many people are affected, I have seen several reports of situations where users claim massive impact on performance when autovacuum kicks in. The reports have not included enough detail to quantify the impact or in most cases to establish a cause, but this seems like it could have a noticeable impact, especially if the deadlock timeout was set to more than a second. -Kevin
On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner <kgrittn@mail.com> wrote: > Robert Haas wrote: >> Since people *already* raise deadlock_timeout to obscenely high >> values (a minute? an hour???) and then complain that things blow >> up in their face, I think there's a decent argument to be made >> that piggybacking anything else on that setting is unwise. > > If people are really doing that, then I tend to agree. I wasn't > aware of that practice. It's probably not quite common enough to be called a "practice", but I have encountered it a number of times in support situations. Alas, I no longer remember the details of exactly what misery it caused, but I do remember it wasn't good. :-) >> Against that, FWICT, this problem only affects a small number of >> users: Jan is the only person I can ever remember reporting this >> issue. I'm not dumb enough to think he's the only person who it >> affects; but my current belief is that it's not an enormously >> common problem. So the main argument I can see against adding a >> GUC is that the problem is too marginal to justify a setting of >> its own. What I really see as the key issue is: suppose we >> hardcode this to say 2 seconds. Is that going to fix the problem >> effectively for 99% of the people who have this problem, or for >> 25% of the people who have this problem? In the former case, we >> probably don't need a GUC; in the latter case, we probably do. > > Given the fact that autovacuum will keep throwing workers at it to > essentially loop indefinitely at an outer level, I don't think the > exact setting of this interval is all that critical either. My gut > feel is that anything in the 2 second to 5 second range would be > sane, so I won't argue over any explicit setting within that range. > Below that, I think the overhead of autovacuum coming back to the > table repeatedly would probably start to get too high; below that > we could be causing some small, heavily-updated table to be > neglected by autovacuum -- especially if you get multiple > autovacuum workers tied up in this delay on different tables at the > same time. I think that part of what's tricky here is that the dynamics of this problem depend heavily on table size. I handled one support case where lowering autovacuum_naptime to 15s was an indispensable part of the solution, so in that case having an autovacuum worker retry for more than a few seconds sounds kind of insane. OTOH, that case involved a small, rapidly changing table. If you've got an enormous table where vacuum takes an hour to chug through all of it, abandoning the effort to truncate the table after a handful of seconds might sound equally insane. Maybe it'd be sensible to relate the retry time to the time spent vacuuming the table. Say, if the amount of time spent retrying exceeds 10% of the time spent vacuuming the table, with a minimum of 1s and a maximum of 1min, give up. That way, big tables will get a little more leeway than small tables, which is probably appropriate. > Regarding how many people are affected, I have seen several reports > of situations where users claim massive impact on performance when > autovacuum kicks in. The reports have not included enough detail to > quantify the impact or in most cases to establish a cause, but this > seems like it could have a noticable impact, especially if the > deadlock timeout was set to more than a second.
Yeah, I agree this could be a cause of those types of reports, but I don't have any concrete evidence that any of the cases I've worked were actually due to this specific issue. The most recent case of this type I worked on was due to I/O saturation - which, since it happened to be EC2, really meant network saturation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
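For what it's worth, the retry-budget heuristic floated above (10% of the time spent vacuuming, clamped between 1 second and 1 minute) boils down to something like the following. This is purely illustrative; the function name is made up, and this is not what the patch ended up doing.

static int
truncate_retry_budget_ms(int64 vacuum_scan_ms)
{
    int64   budget = vacuum_scan_ms / 10;   /* 10% of the scan time */

    if (budget < 1000)
        budget = 1000;                      /* never less than 1 second */
    if (budget > 60000)
        budget = 60000;                     /* never more than 1 minute */

    return (int) budget;
}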
On 12/5/2012 2:00 PM, Robert Haas wrote: > Many it'd be sensible to relate the retry time to the time spend > vacuuming the table. Say, if the amount of time spent retrying > exceeds 10% of the time spend vacuuming the table, with a minimum of > 1s and a maximum of 1min, give up. That way, big tables will get a > little more leeway than small tables, which is probably appropriate. That sort of "dynamic" approach would indeed be interesting. But I fear that it is going to be complex at best. The amount of time spent in scanning heavily depends on the visibility map. The initial vacuum scan of a table can take hours or more, but it does update the visibility map even if the vacuum itself is aborted later. The next vacuum may scan that table in almost no time at all, because it skips all blocks that are marked "all visible". So the total time the "scan" takes is no yardstick I'd use. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote: > On 12/5/2012 2:00 PM, Robert Haas wrote: >> >> Many it'd be sensible to relate the retry time to the time spend >> vacuuming the table. Say, if the amount of time spent retrying >> exceeds 10% of the time spend vacuuming the table, with a minimum of >> 1s and a maximum of 1min, give up. That way, big tables will get a >> little more leeway than small tables, which is probably appropriate. > > That sort of "dynamic" approach would indeed be interesting. But I fear that > it is going to be complex at best. The amount of time spent in scanning > heavily depends on the visibility map. The initial vacuum scan of a table > can take hours or more, but it does update the visibility map even if the > vacuum itself is aborted later. The next vacuum may scan that table in > almost no time at all, because it skips all blocks that are marked "all > visible". Well, if that's true, then there's little reason to worry about giving up quickly, because the next autovacuum a minute later won't consume many resources. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin and Robert are well aware of most of the below. I just want to put this out here so other people who haven't followed the discussion too closely may chime in. Some details on the problem: First of all, the vacuum scan must detect a minimum of 1000 pages as possibly being all empty at the end of a relation. Without at least 8MB of possible free space at the end, the code never calls lazy_truncate_heap(). This means we don't have to worry about tiny relations at all. Any relation that stays under 8MB of turnover between autovacuum VACUUM runs can never run into this at all. Relations that have higher turnover than that, but at random places or with a high percentage of rather static rows, don't fall into the problem category either. They may never accumulate that much "contiguous free space at the end". The turnover will be reusing free space all over the place. So again, lazy_truncate_heap() won't ever be called. Relations that eventually build up more than 8MB of free space at the end aren't automatically a problem. The autovacuum VACUUM scan just scanned those pages at the end, which means that the safety scan for truncate, done under exclusive lock, is checking exactly those pages at the end, and most likely they are still in memory. The truncate safety scan will be fast due to a 99+% buffer cache hit rate. The only actual problem case (that I have found so far) is rolling window tables of significant size that can bloat to multiple times their normal size every now and then. This is indeed a rare corner case and I have no idea how many users may (unknowingly) be suffering from it. This rare corner case triggers lazy_truncate_heap() with a significant amount of free space to truncate. The table bloats, then all the bloat is deleted, and the periodic 100% turnover guarantees that all "live" tuples will soon circulate in lower block numbers again, with gigabytes of empty space at the end. This by itself still isn't a problem. The existing code may do the job just fine "unless" there is "frequent" access to that very table. Only under this special combination of circumstances do we actually have a problem. Only now, with a significant amount of free space at the end and frequent access to the table, does the truncate safety scan take long enough, and actually have to read pages from disk, to interfere with client transactions. At this point, the truncate safety scan may have to be interrupted to let the other, frequent traffic go through. This is what we accomplish with the autovacuum_truncate_lock_check interval, where we voluntarily release the lock whenever someone else needs it. I agree with Kevin that a 20ms check interval is reasonable, because the code to check this is even less expensive than releasing the exclusive lock we're holding. At the same time, completely giving up and relying on the autovacuum launcher to restart another worker isn't as free as it looks, either. The next autovacuum worker will have to do the VACUUM scan first, before getting to the truncate phase. We cannot just skip blindly to the truncate code. With the truncate repeatedly aborted, the table would deteriorate and accumulate dead tuples again. The removal of dead tuples and their index tuples has priority. As said earlier in the discussion, the VACUUM scan will skip pages that are marked as completely visible. So the scan won't physically read the majority of the empty pages at the end of the table over and over.
But it will at least scan all pages that have been modified since the last VACUUM run. To me this means that we want to be more generous to the truncate code about acquiring the exclusive lock. In my tests, I've seen that a rolling window table with a "live" set of just 10 MB or so, but empty space of 3 GB, can still have a 2 minute VACUUM scan time. Throwing that work away because we can't acquire the exclusive lock within 2 seconds is a waste of effort. Something between 2 and 60 seconds sounds more reasonable to me. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
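For readers wondering where the 8MB figure comes from: with the default 8kB block size, the 1000-page minimum described above amounts to roughly the following gate in front of lazy_truncate_heap(). This is a simplified sketch of the description in this thread, not the exact test in vacuumlazy.c (which may also take a fraction of the table size into account); the constant name is made up for the example.

/* 1000 pages at the default 8kB block size is about 8MB. */
#define TRUNCATE_MIN_EMPTY_PAGES    1000

/*
 * Only consider truncation when the vacuum scan found at least ~8MB of
 * possibly-empty pages at the end of the relation.
 */
possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
if (possibly_freeable >= TRUNCATE_MIN_EMPTY_PAGES)
    lazy_truncate_heap(onerel, vacrelstats);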
On 12/6/2012 12:45 PM, Robert Haas wrote: > On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck@yahoo.com> wrote: >> That sort of "dynamic" approach would indeed be interesting. But I fear that >> it is going to be complex at best. The amount of time spent in scanning >> heavily depends on the visibility map. The initial vacuum scan of a table >> can take hours or more, but it does update the visibility map even if the >> vacuum itself is aborted later. The next vacuum may scan that table in >> almost no time at all, because it skips all blocks that are marked "all >> visible". > > Well, if that's true, then there's little reason to worry about giving > up quickly, because the next autovacuum a minute later won't consume > many resources. "Almost no time" is of course "relative" to what an actual scan and dead tuple removal cost. Looking at a table with 3 GB of dead tuples at the end, the initial vacuum scan takes hours. When vacuum comes back to this table, cleaning up a couple megabytes of newly deceased tuples and then skipping over the all-visible pages may take a minute. Based on the discussion and what I feel is a consensus, I have created an updated patch that has no GUCs at all. The hard-coded parameters in include/postmaster/autovacuum.h are

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

I gave that the worst workload I can think of: a pgbench (style) application that throws about 10 transactions per second at it, so that there is constantly the need to give up the lock due to conflicting lock requests and then reacquire it. A "cleanup" process is periodically moving old tuples from the history table to an archive table, making history a rolling window table. And a third job, 2-3 times per minute, produces a transaction lasting 10 seconds, forcing autovacuum to give up on reacquiring the lock. Even with that workload, autovacuum is slowly but steadily chopping away at the table. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Attachment
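For readers not looking at the patch itself, here is roughly how the three hard-coded values above fit together, as a sketch of the truncation driver in lazy_truncate_heap(). This is illustrative only; the helpers used here (ConditionalLockRelation, count_nondeletable_pages, RelationTruncate, the lock_waiter_detected flag) follow the discussion in this thread, but the exact structure of the patch may differ.

do
{
    instr_time  starttime;
    instr_time  elapsed;

    INSTR_TIME_SET_CURRENT(starttime);

    /*
     * Try to (re)acquire the exclusive lock, napping 50ms between
     * attempts and giving up entirely after 5000ms.
     */
    while (!ConditionalLockRelation(onerel, AccessExclusiveLock))
    {
        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, starttime);
        if (INSTR_TIME_GET_MILLISEC(elapsed) >
            AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT)
            return;     /* too busy; a later autovacuum worker retries */

        pg_usleep(AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
    }

    /*
     * Re-verify the tail pages under the lock; this returns early and
     * flags lock_waiter_detected if another backend queues up behind us.
     */
    new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);

    /* Truncate whatever has been verified so far, then let others in. */
    if (new_rel_pages < vacrelstats->rel_pages)
    {
        RelationTruncate(onerel, new_rel_pages);
        vacrelstats->rel_pages = new_rel_pages;
    }
    UnlockRelation(onerel, AccessExclusiveLock);
} while (vacrelstats->lock_waiter_detected &&
         new_rel_pages > vacrelstats->nonempty_pages);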
Jan Wieck wrote: > Based on the discussion and what I feel is a consensus I have > created an updated patch that has no GUC at all. The hard coded > parameters in include/postmaster/autovacuum.h are >
> AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
> AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
> AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
Since these really aren't part of the external API and are only referenced in vacuumlazy.c, it seems more appropriate to define them there. > I gave that the worst workload I can think of. A pgbench (style) > application that throws about 10 transactions per second at it, > so that there is constantly the need to give up the lock due to > conflicting lock requests and then reacquiring it again. A > "cleanup" process is periodically moving old tuples from the > history table to an archive table, making history a rolling > window table. And a third job that 2-3 times per minute produces > a 10 second lasting transaction, forcing autovacuum to give up on > the lock reacquisition. > > Even with that workload autovacuum slow but steady is chopping > away at the table. Applies with minor offsets, builds without warning, and passes `make check-world`. My tests based on your earlier posted test script confirm the benefit. There are some minor white-space issues; for example git diff --color shows some trailing spaces in comments. There are no docs, but since there are no user-visible changes in behavior other than better performance and more prompt and reliable truncation of tables where we were already doing so, it doesn't seem like any new docs are needed. Due to the nature of the problem, tests are tricky to run correctly and take a long time to run, so I don't see how any regression tests would be appropriate, either. This patch seems ready for committer, and I would be comfortable with making the minor changes I mention above and committing it. I'll wait a day or two to allow any other comments or objections. To summarize, there has been pathological behavior in an infrequently-encountered corner case of autovacuum, wasting a lot of resources indefinitely when it is encountered; this patch gives a major performance improvement in this situation without any other user-visible change and without requiring any new GUCs. It adds a new public function in the locking area to allow a process to check whether a particular lock it is holding is blocking any other process, and another to wrap that to make it easy to check whether the lock held on a particular table is blocking another process. It uses this new capability to be smarter about scheduling autovacuum's truncation work, and to avoid throwing away incremental progress in doing so. As such, I don't think it would be crazy to back-patch this, but I think it would be wise to allow it to be proven on master/9.3 for a while before considering that. -Kevin
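For context on the lmgr addition mentioned in the summary, the relation-level wrapper is essentially a read-only probe of the lock queue, along these lines. The names LockHasWaiters() and LockHasWaitersRelation() are assumed here to match the patch; treat this as a sketch rather than the committed code.

bool
LockHasWaitersRelation(Relation relation, LOCKMODE lockmode)
{
    LOCKTAG     tag;

    SET_LOCKTAG_RELATION(tag,
                         relation->rd_lockInfo.lockRelId.dbId,
                         relation->rd_lockInfo.lockRelId.relId);

    /* True if some other backend is queued behind the lock we hold. */
    return LockHasWaiters(&tag, lockmode, false);
}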
On 12/9/2012 2:37 PM, Kevin Grittner wrote: > Jan Wieck wrote: > >> Based on the discussion and what I feel is a consensus I have >> created an updated patch that has no GUC at all. The hard coded >> parameters in include/postmaster/autovacuum.h are >> >> AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */ >> AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */ >> AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ > > Since these really aren't part of the external API and are only > referenced in vacuumlazy.c, it seems more appropriate to define > them there. > >> I gave that the worst workload I can think of. A pgbench (style) >> application that throws about 10 transactions per second at it, >> so that there is constantly the need to give up the lock due to >> conflicting lock requests and then reacquiring it again. A >> "cleanup" process is periodically moving old tuples from the >> history table to an archive table, making history a rolling >> window table. And a third job that 2-3 times per minute produces >> a 10 second lasting transaction, forcing autovacuum to give up on >> the lock reacquisition. >> >> Even with that workload autovacuum slow but steady is chopping >> away at the table. > > Applies with minor offsets, builds without warning, and passes > `make check-world`. My tests based on your earlier posted test > script confirm the benefit. > > There are some minor white-space issues; for example git diff > --color shows some trailing spaces in comments. Cleaned up all of those. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Attachment
Jan Wieck wrote: > Cleaned up all of those. Applied with trivial editing, mostly from a pgindent run against modified files. -Kevin
Kevin Grittner wrote: > Applied with trivial editing, mostly from a pgindent run against > modified files. Applied back as far as 9.0. Before that, the code didn't match well enough for it to seem safe to apply without many hours of additional testing. I have confirmed occurrences of this problem at least as far back as 9.0 in the wild, where it is causing performance degradation severe enough to force users to stop production usage long enough to manually vacuum the affected tables. The use case is a lot like what Jan described, where PostgreSQL is being used for high-volume queuing. When there is a burst of activity which increases table size, and then the queues are drained back to a normal state, the problem shows up. -Kevin