Re: We need to log aborted autovacuums - Mailing list pgsql-hackers

From Greg Smith
Subject Re: We need to log aborted autovacuums
Date
Msg-id 4D316D4C.4000406@2ndquadrant.com
Whole thread Raw
In response to Re: We need to log aborted autovacuums  (Josh Berkus <josh@agliodbs.com>)
Responses Re: We need to log aborted autovacuums  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Josh Berkus wrote:
> The lack of vacuum could be occurring for any of 4 reasons:
>
> 1) Locking
> 2) You have a lot of tables and not enough autovac_workers / too much
> sleep time
> 3) You need to autovac this particular table more frequently, since it
> gets dirtied really fast
> 4) The table has been set with special autovac settings which keep it
> from being autovac'd
>
> We can currently distinguish between cased 2, 3, 4 based on existing
> available facts.  However, distinguishing case 1 from 2 or 3, in
> particular, isn't currently possible except by methods which require
> collecting a lot of ad-hoc monitoring data over a period of time.  This
> makes the effort required for the diagnosis completely out of proportion
> with the magnitude of the problem.
>   

Since no one coded up something more exciting, I just did a round of 
self-review of the little logging patch I sent upthread to see if it was 
a worthy CF candidate.  I agree that *something* should be done, and I'd 
rather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently, 
rubbish.  But unless I'm confused (it is late), I think there's a 
problem here that's much worse than what you described--in the case 
where someone has grabbed a really large lock on a table that needs 
cleanup at least.  I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables 
WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected.  I 
followed immediately by starting a new session, doing:

BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck 
there idle in a transaction holding a lock.  I threw some extra logging 
in the code path leading up to where we'd both like the lack of lock 
acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG:  t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG:  autovacuum: t: vac needed
INFO:  vacuum_rel:  processing 16528
INFO:  vacuum_rel:  trying to open relation 16528

All but the first one of those lines are not in the current code, and as 
noted already that one existing line is at DEBUG3.  This was trying to 
simulate the case you were complaining about:  autovacuum has been 
triggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet through, 
because it's not hitting that spot.  Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith    5490  4880  0 04:09 ?        00:00:00 postgres: autovacuum 
worker process   gsmith waiting                                  

$ psql -c "select procpid,now() - query_start as runtime,current_query 
from pg_stat_activity where current_query like 'autovacuum%'"procpid |     runtime     |            current_query
   
 
---------+-----------------+-------------------------------------   5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE
public.t

I haven't just failed to vacuum the table that needs it without any 
entry in the logs saying as much.  I've actually tied down an autovacuum 
worker and kept it from doing anything else until that lock is 
released!  When vacuum_rel inside of vacuum.c does this on a relation it 
wants to acquire a lock on:
   onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to 
get the lock it wants.  The abort code path I tried to insert logging 
into only occurs if the relation was deleted.  Watch what's happened 
while I've been typing:
procpid |     runtime     |            current_query           
---------+-----------------+-------------------------------------   5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE
public.t

I've now hit the AV worker with a denial of service attack, essentially, 
for over 35 minutes.  I'd bet that if I was a malicious user, I could 
make AV come to a halt altogether for the whole server with this 
approach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AV 
is calling it?  I don't see any other way to get a log mesage noting 
when this problem passed by in there, you can only see it live in these 
process view.  And the current behavior doesn't feel right to me.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: auto-sizing wal_buffers
Next
From: Greg Smith
Date:
Subject: Re: Spread checkpoint sync