Re: autovacuum truncate exclusive lock round two - Mailing list pgsql-hackers
From: Kevin Grittner
Subject: Re: autovacuum truncate exclusive lock round two
Msg-id: 20121204185113.142840@gmx.com
In response to: autovacuum truncate exclusive lock round two (Jan Wieck <JanWieck@Yahoo.com>)
Responses: Re: autovacuum truncate exclusive lock round two
List: pgsql-hackers
Jan Wieck wrote:

> [arguments for GUCs]

This is getting confusing. I thought I had already conceded the case for autovacuum_truncate_lock_try, and you appeared to spend most of your post arguing for it anyway. I think. It's a little hard to tell. Perhaps the best thing is to present the issue to the list and solicit more opinions on what to do. Please correct me if I misstate any of this.

The primary problem this patch solves is that in some workloads autovacuum will repeatedly try to truncate the unused pages at the end of a table, but will keep getting canceled, after burning resources, because another process wants a lock on the table which conflicts with the one held by autovacuum. The cancellation is handled by the deadlock checker, so the other process must block for the deadlock_timeout interval each time. All work done by the truncate phase of autovacuum is lost on each interrupted attempt, and because the statistical information is not updated, another attempt will trigger the next time autovacuum looks at whether to vacuum the table. This pattern not only fails to release potentially large amounts of unused space back to the OS; the headbanging can also continue to consume significant resources for an extended period, and the repeated blocking for deadlock_timeout can cause latency problems.

The patch has the truncate work, which requires AccessExclusiveLock, check at intervals for whether another process is waiting on its lock. That interval is the first of the timings we need to determine, and one for which a GUC was initially proposed. I think the check should be cheap enough that doing it once every 20ms as a hard-coded interval would be good enough.

When the truncate work sees a waiter, it truncates the file as far as it has managed to get, releases its lock on the table, sleeps for an interval, and then checks whether the lock has become available again. How long to sleep between attempts to reacquire the lock is another possible GUC. Again, I'm inclined to think this could be hard-coded. Since autovacuum was knocked off-task after doing some significant work, I'm inclined to make this interval a little bigger than the first, but I don't think it matters a whole lot. Anything between 20ms and 100ms seems sane. Maybe 50ms?

At any point that it is unable to acquire the lock, there is a check of how long this autovacuum task has been starved for the lock. Initially I argued for twice the deadlock_timeout, on the basis that this would probably be short enough not to leave the autovacuum worker sidelined for too long, but long enough for the attempt to get past a single deadlock between two other processes. This is the setting Jan is least willing to concede. If the autovacuum worker does abandon the attempt, it will keep retrying, since we go out of our way to prevent the autovacuum process from updating the statistics based on the "incomplete" processing. This last interval is not how long it will attempt to truncate, but how long it will keep one autovacuum worker making unsuccessful attempts to acquire the lock before that worker is put to other uses. Workers will keep coming back to this table until the truncate phase is completed, just as they do without the patch; the difference is that any time a worker gets the lock, even briefly, it can persist some progress.

So the question on the table is which of these three intervals should be GUCs, and what values to use for any that aren't. A rough sketch of the loop, with all three intervals marked, is below.
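To keep the three timings straight, here is the shape of the thing in illustrative C. This is not the patch code; every name in it is a hypothetical stand-in for the real lock-manager and timing primitives, and the constants are just the values floated above.

/*
 * Illustrative sketch only -- not the patch.  All helper names are
 * hypothetical stand-ins; the constants are the values discussed above.
 */
#include <stdbool.h>

#define DEADLOCK_TIMEOUT_MS    1000     /* default deadlock_timeout */
#define CHECK_FOR_WAITERS_MS     20     /* interval 1: poll for lock waiters */
#define RETRY_SLEEP_MS           50     /* interval 2: sleep before reacquire */
#define MAX_STARVED_MS  (2 * DEADLOCK_TIMEOUT_MS)   /* interval 3: give up */

extern bool try_acquire_access_exclusive(void);
extern void release_access_exclusive(void);
extern bool someone_waiting_on_our_lock(void);
extern void truncate_some_pages(void);  /* truncates the file as it goes */
extern bool truncate_done(void);
extern long elapsed_ms(void);
extern void sleep_ms(long ms);

/* Assumes the AccessExclusiveLock is already held on entry. */
static void
truncate_phase_sketch(void)
{
    long last_check = elapsed_ms();
    long starved_since = -1;            /* -1 => we hold the lock */

    while (!truncate_done())
    {
        if (starved_since >= 0)
        {
            /* We yielded earlier; retry until the starvation limit. */
            if (!try_acquire_access_exclusive())
            {
                if (elapsed_ms() - starved_since > MAX_STARVED_MS)
                    return;             /* abandon; stats stay unupdated,
                                         * so a later worker comes back */
                sleep_ms(RETRY_SLEEP_MS);
                continue;
            }
            starved_since = -1;         /* got the lock back */
        }

        truncate_some_pages();          /* progress persists across exits */

        if (elapsed_ms() - last_check >= CHECK_FOR_WAITERS_MS)
        {
            last_check = elapsed_ms();
            if (someone_waiting_on_our_lock())
            {
                release_access_exclusive();     /* get out of the way */
                starved_since = elapsed_ms();
            }
        }
    }
}

Interval 1 controls how quickly we get out of a waiter's way, interval 2 how often we pester the lock manager trying to get back in, and interval 3 how long we let one worker be starved before it is given back to the pool.

-Kevin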