Re: autovacuum truncate exclusive lock round two - Mailing list pgsql-hackers

From Jan Wieck
Subject Re: autovacuum truncate exclusive lock round two
Date
Msg-id 50C0E520.6020802@Yahoo.com
Whole thread Raw
In response to Re: autovacuum truncate exclusive lock round two  (Jan Wieck <JanWieck@Yahoo.com>)
List pgsql-hackers
Kevin and Robert are well aware of most of the below. I just want to put 
this out here so other people, who haven't followed the discussion too 
closely, may chime in.

Some details on the problem:

First of all, there is a minimum number of 1000 pages that the vacuum 
scan must detect as possibly being all empty at the end of a relation. 
Without at least 8MB of possible free space at the end, the code never 
calls lazy_truncate_heap(). This means we don't have to worry about tiny 
relations at all. Any relation that stays under 8MB turnover between 
autovacuum VACUUM runs can never get into this ever.

Relations that have higher turnover than that, but at random places or 
with a high percentage of rather static rows, don't fall into the 
problem category either. They may never accumulate that much "contiguous 
free space at the end". The turnover will be reusing free space all over 
the place. So again, lazy_truncate_heap() won't be called ever.

Relations that eventually build up more than 8MB of free space at the 
end aren't automatically a problem. The autovacuum VACUUM scan just 
scanned those pages at the end, which means that the safety scan for 
truncate, done under exclusive lock, is checking exactly those pages at 
the end and most likely they are still in memory. The truncate safety 
scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case (I have found so far) are rolling window 
tables of significant size, that can bloat multiple times their normal 
size every now and then. This is indeed a rare corner case and I have no 
idea how many users may (unknowingly) be suffering from it.

This rare corner case triggers lazy_truncate_heap() with a significant 
amount of free space to truncate. The table bloats, then all the bloat 
is deleted and the periodic 100% turnover will guarantee that all "live" 
tuples will shortly after circulate in lower block numbers again, with 
gigabytes of empty space at the end.

This by itself isn't a problem still. The existing code may do the job 
just fine "unless" there is "frequent" access to that very table. Only 
at this special combination of circumstances we actually have a problem.

Only now, with a significant amount of free space at the end and 
frequent access to the table, the truncate safety scan takes long enough 
and has to actually read pages from disk to interfere with client 
transactions.

At this point, the truncate safety scan may have to be interrupted to 
let the frequent other traffic go through. This is what we accomplish 
with the autovacuum_truncate_lock_check interval, where we voluntarily 
release the lock whenever someone else needs it. I agree with Kevin that 
a 20ms check interval is reasonable because the code to check this is 
even less expensive than releasing the exclusive lock we're holding.

At the same time, completely giving up and relying on the autovacuum 
launcher to restart another worker isn't as free as it looks like 
either. The next autovacuum worker will have to do the VACUUM scan 
first, before getting to the truncate phase. We cannot just skip blindly 
to the truncate code. With repeated abortion of the truncate, the table 
would deteriorate and accumulate dead tuples again. The removal of dead 
tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages, that 
are marked as completely visible. So the scan won't physically read the 
majority of the empty pages at the end of the table over and over. But 
it will at least scan all pages, that had been modified since the last 
VACUUM run.

To me this means that we want to be more generous to the truncate code 
about acquiring the exclusive lock. In my tests, I've seen that a 
rolling window table with a "live" set of just 10 MB or so, but empty 
space of 3 GB, can still have a 2 minute VACUUM scan time. Throwing that 
work away because we can't acquire the exclusive lock withing 2 seconds 
is a waste of effort.

Something in between 2-60 seconds sounds more reasonable to me.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: ALTER TABLE ... NOREWRITE option
Next
From: Robert Haas
Date:
Subject: Re: Setting visibility map in VACUUM's second phase