Re: We need to log aborted autovacuums - Mailing list pgsql-hackers
From | Greg Smith |
---|---|
Subject | Re: We need to log aborted autovacuums |
Date | |
Msg-id | 4D316D4C.4000406@2ndquadrant.com Whole thread Raw |
In response to | Re: We need to log aborted autovacuums (Josh Berkus <josh@agliodbs.com>) |
Responses |
Re: We need to log aborted autovacuums
(Tom Lane <tgl@sss.pgh.pa.us>)
|
List | pgsql-hackers |
Josh Berkus wrote: > The lack of vacuum could be occurring for any of 4 reasons: > > 1) Locking > 2) You have a lot of tables and not enough autovac_workers / too much > sleep time > 3) You need to autovac this particular table more frequently, since it > gets dirtied really fast > 4) The table has been set with special autovac settings which keep it > from being autovac'd > > We can currently distinguish between cased 2, 3, 4 based on existing > available facts. However, distinguishing case 1 from 2 or 3, in > particular, isn't currently possible except by methods which require > collecting a lot of ad-hoc monitoring data over a period of time. This > makes the effort required for the diagnosis completely out of proportion > with the magnitude of the problem. > Since no one coded up something more exciting, I just did a round of self-review of the little logging patch I sent upthread to see if it was a worthy CF candidate. I agree that *something* should be done, and I'd rather have a logging-based solution than nothing at all here. So that patch is, to use a polite UK term I've adopted recently, rubbish. But unless I'm confused (it is late), I think there's a problem here that's much worse than what you described--in the case where someone has grabbed a really large lock on a table that needs cleanup at least. I started with the following test case: SHOW log_autovacuum_min_duration; DROP TABLE t; CREATE TABLE t(s serial,i integer); INSERT INTO t(i) SELECT generate_series(1,100000); SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables WHERE relname='t'; DELETE FROM t WHERE s<50000; \q Which, so long as the min duration is 0, does mostly what I expected. I followed immediately by starting a new session, doing: BEGIN; LOCK t; Before autovacuum got to the table; I left this session open, stuck there idle in a transaction holding a lock. I threw some extra logging in the code path leading up to where we'd both like the lack of lock acquisition to be logged, and it shows the following (at DEBUG3): DEBUG: t: vac: 49999 (threshold 50), anl: 149999 (threshold 50) DEBUG: autovacuum: t: vac needed INFO: vacuum_rel: processing 16528 INFO: vacuum_rel: trying to open relation 16528 All but the first one of those lines are not in the current code, and as noted already that one existing line is at DEBUG3. This was trying to simulate the case you were complaining about: autovacuum has been triggered, it knows there is work to be done, and it can't do said work. It hasn't triggered the error message I tried to add yet through, because it's not hitting that spot. Look where it's actually blocked at: $ ps -eaf | grep postgres gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum worker process gsmith waiting $ psql -c "select procpid,now() - query_start as runtime,current_query from pg_stat_activity where current_query like 'autovacuum%'"procpid | runtime | current_query ---------+-----------------+------------------------------------- 5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t I haven't just failed to vacuum the table that needs it without any entry in the logs saying as much. I've actually tied down an autovacuum worker and kept it from doing anything else until that lock is released! When vacuum_rel inside of vacuum.c does this on a relation it wants to acquire a lock on: onerel = try_relation_open(relid, lmode); It just hangs out there forever, if the only issue is not being able to get the lock it wants. The abort code path I tried to insert logging into only occurs if the relation was deleted. Watch what's happened while I've been typing: procpid | runtime | current_query ---------+-----------------+------------------------------------- 5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t I've now hit the AV worker with a denial of service attack, essentially, for over 35 minutes. I'd bet that if I was a malicious user, I could make AV come to a halt altogether for the whole server with this approach, even with relatively limited rights on the database. Does try_relation_open need to have a lock acquisition timeout when AV is calling it? I don't see any other way to get a log mesage noting when this problem passed by in there, you can only see it live in these process view. And the current behavior doesn't feel right to me. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
pgsql-hackers by date: