Thread: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database

I have been debugging a problem on a 9.3.10 Postgres database cluster with over 1200 databases.  10 workers, increased maintenance_work_mem, auto vacuum settings to run more frequently than default.   What I will notice is that autovacuum will run for a week or so and traverse databases as expected.  I will be able to see that age(datfrozenxid) for all 1200 databases will stay close to autovacuum_freeze_max_age as desired.

Then, suddenly I will see it get “stuck”.  Autovacuum launcher will not launch worker processes even though databases start to age past autovacuum_freeze_max_age.  If I create a list of databases and sort by age(datfrozenxid), connect to the database with the oldest and execute a simple:  "vacuum freeze pg_database;”, autovacuum springs back into action.

It’s never the same database where autovacuum seems to get “stuck”.  I’m attempting to gather more debugging information, but, also can’t understand why simply doing a “vacuum freeze pg_database” breaks up the jam.

Any thoughts?

Shawn
On Sun, May 1, 2016 at 10:39 PM, McCoy, Shawn <shamccoy@amazon.com> wrote:
> I have been debugging a problem on a 9.3.10 Postgres database cluster with
> over 1200 databases.  10 workers, increased maintenance_work_mem, auto
> vacuum settings to run more frequently than default.   What I will notice is
> that autovacuum will run for a week or so and traverse databases as
> expected.  I will be able to see that age(datfrozenxid) for all 1200
> databases will stay close to autovacuum_freeze_max_age as desired.
>
> Then, suddenly I will see it get “stuck”.  Autovacuum launcher will not
> launch worker processes even though databases start to age past
> autovacuum_freeze_max_age.  If I create a list of databases and sort by
> age(datfrozenxid), connect to the database with the oldest and execute a
> simple:  "vacuum freeze pg_database;”, autovacuum springs back into action.
>
> It’s never the same database where autovacuum seems to get “stuck”.  I’m
> attempting to gather more debugging information, but, also can’t understand
> why simply doing a “vacuum freeze pg_database” breaks up the jam.
>
> Any thoughts?

So when it's stuck, there are no AV worker processes running at all,
for a sustained period of time?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



I enabled debug level and can see that the launcher does in fact spawn a
worker, but, it appears, scans and disappears in milliseconds.  

It was stuck again this morning and it was spitting this out ad nauseum for
"my_database_name":

2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  pg_database: vac: 0 (threshold
141), anl: 57 (threshold 166)
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  autovac_balance_cost(pid=13567
db=143187, rel=1262, dobalance=yes cost_limit=1000, cost_limit_base=1000,
cost_delay=10)
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  vacuuming
"pg_catalog.pg_database"
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  index "pg_database_datname_index"
now contains 1325 row versions in 11 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL:  0 index row versions were
removed.       0 index pages have been deleted, 0 are currently reusable.       CPU 0.00s/0.00u sec elapsed 0.00 sec.
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  index "pg_database_oid_index" now
contains 1325 row versions in 6 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL:  0 index row versions were
removed.       0 index pages have been deleted, 0 are currently reusable.       CPU 0.00s/0.00u sec elapsed 0.00 sec.
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG:  "pg_database": found 0 removable,
0 nonremovable row versions in 0 out of 43 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL:  0 dead row versions cannot be
removed yet.       There were 0 unused item pointers.       0 pages are entirely empty.       CPU 0.00s/0.00u sec
elapsed0.00 sec.
 
2016-05-11 16:03:42 UTC::@:[13567]:LOG:  automatic vacuum of table
"my_database_name.pg_catalog.pg_database": index scans: 0       pages: 0 removed, 43 remain       tuples: 0 removed,
1156remain       buffer usage: 45 hits, 0 misses, 0 dirtied       avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
     system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
 

The problem is though, when I look at age(datfrozenxid) for
"my_database_name" it is the oldest at 654,189,546 and should trigger the
autovacuum_freeze_max_age of 200,000,000 but it doesn't.  The relation with
this age is "pg_database".  If I issue a manual "vacuum freeze pg_database"
in the database, autovacuum will run on the next oldest database but get
stuck again with same symptoms (i.e. pg_database is the relation with the
oldest age).  If I don't issue the manual "vacuum freeze" for all databases
and left things alone, it would get to wrap-around state.



--
View this message in context:
http://postgresql.nabble.com/Need-help-debugging-why-autovacuum-seems-stuck-until-I-use-superuser-to-vacuum-freeze-pg-database-tp5901425p5903104.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Shawn wrote:
> The problem is though, when I look at age(datfrozenxid) for
> "my_database_name" it is the oldest at 654,189,546 and should trigger the
> autovacuum_freeze_max_age of 200,000,000 but it doesn't.  The relation with
> this age is "pg_database".  If I issue a manual "vacuum freeze pg_database"
> in the database, autovacuum will run on the next oldest database but get
> stuck again with same symptoms (i.e. pg_database is the relation with the
> oldest age).  If I don't issue the manual "vacuum freeze" for all databases
> and left things alone, it would get to wrap-around state.

I was thinking about this in connection with the shared catalog
vacuuming bug I fixed yesterday.  One problem here is that the
relfrozenxid/relminmxid settings are local to each database, so a
for-wraparound vacuum would have to scan the table in all databases, and
advance the counter in all databases, before the overall "age" of the
system goes down again, which is inconvenient.  This is a problem
inherent to having these columns in a per-database catalog, and the only
solution I can see is to have another shared catalog containing age
state for the shared catalogs.  (However, this issue isn't terribly
serious -- it just means you have to scan the catalog N times instead of
one.)

However this doesn't explain why the vacuuming of pg_database in one
database doesn't actually advance the relfrozenxid in that database; it
certainly should.  This is the serious problem you're facing, it seems.
I wonder if you've tweaked the freeze settings somehow.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services