Thread: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database
Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database
From
"McCoy, Shawn"
Date:
I have been debugging a problem on a 9.3.10 Postgres database cluster with over 1200 databases. 10 workers, increased maintenance_work_mem, auto vacuum settings to run more frequently than default. What I will notice is that autovacuum will run for a week or so and traverse databases as expected. I will be able to see that age(datfrozenxid) for all 1200 databases will stay close to autovacuum_freeze_max_age as desired.
Then, suddenly I will see it get “stuck”. Autovacuum launcher will not launch worker processes even though databases start to age past autovacuum_freeze_max_age. If I create a list of databases and sort by age(datfrozenxid), connect to the database with the oldest and execute a simple: "vacuum freeze pg_database;”, autovacuum springs back into action.
It’s never the same database where autovacuum seems to get “stuck”. I’m attempting to gather more debugging information, but, also can’t understand why simply doing a “vacuum freeze pg_database” breaks up the jam.
Any thoughts?
Shawn
Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database
From
Robert Haas
Date:
On Sun, May 1, 2016 at 10:39 PM, McCoy, Shawn <shamccoy@amazon.com> wrote: > I have been debugging a problem on a 9.3.10 Postgres database cluster with > over 1200 databases. 10 workers, increased maintenance_work_mem, auto > vacuum settings to run more frequently than default. What I will notice is > that autovacuum will run for a week or so and traverse databases as > expected. I will be able to see that age(datfrozenxid) for all 1200 > databases will stay close to autovacuum_freeze_max_age as desired. > > Then, suddenly I will see it get “stuck”. Autovacuum launcher will not > launch worker processes even though databases start to age past > autovacuum_freeze_max_age. If I create a list of databases and sort by > age(datfrozenxid), connect to the database with the oldest and execute a > simple: "vacuum freeze pg_database;”, autovacuum springs back into action. > > It’s never the same database where autovacuum seems to get “stuck”. I’m > attempting to gather more debugging information, but, also can’t understand > why simply doing a “vacuum freeze pg_database” breaks up the jam. > > Any thoughts? So when it's stuck, there are no AV worker processes running at all, for a sustained period of time? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database
From
Shawn
Date:
I enabled debug level and can see that the launcher does in fact spawn a worker, but, it appears, scans and disappears in milliseconds. It was stuck again this morning and it was spitting this out ad nauseum for "my_database_name": 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: pg_database: vac: 0 (threshold 141), anl: 57 (threshold 166) 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: autovac_balance_cost(pid=13567 db=143187, rel=1262, dobalance=yes cost_limit=1000, cost_limit_base=1000, cost_delay=10) 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: vacuuming "pg_catalog.pg_database" 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: index "pg_database_datname_index" now contains 1325 row versions in 11 pages 2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: index "pg_database_oid_index" now contains 1325 row versions in 6 pages 2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. 2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: "pg_database": found 0 removable, 0 nonremovable row versions in 0 out of 43 pages 2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 dead row versions cannot be removed yet. There were 0 unused item pointers. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed0.00 sec. 2016-05-11 16:03:42 UTC::@:[13567]:LOG: automatic vacuum of table "my_database_name.pg_catalog.pg_database": index scans: 0 pages: 0 removed, 43 remain tuples: 0 removed, 1156remain buffer usage: 45 hits, 0 misses, 0 dirtied avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec The problem is though, when I look at age(datfrozenxid) for "my_database_name" it is the oldest at 654,189,546 and should trigger the autovacuum_freeze_max_age of 200,000,000 but it doesn't. The relation with this age is "pg_database". If I issue a manual "vacuum freeze pg_database" in the database, autovacuum will run on the next oldest database but get stuck again with same symptoms (i.e. pg_database is the relation with the oldest age). If I don't issue the manual "vacuum freeze" for all databases and left things alone, it would get to wrap-around state. -- View this message in context: http://postgresql.nabble.com/Need-help-debugging-why-autovacuum-seems-stuck-until-I-use-superuser-to-vacuum-freeze-pg-database-tp5901425p5903104.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database
From
Alvaro Herrera
Date:
Shawn wrote: > The problem is though, when I look at age(datfrozenxid) for > "my_database_name" it is the oldest at 654,189,546 and should trigger the > autovacuum_freeze_max_age of 200,000,000 but it doesn't. The relation with > this age is "pg_database". If I issue a manual "vacuum freeze pg_database" > in the database, autovacuum will run on the next oldest database but get > stuck again with same symptoms (i.e. pg_database is the relation with the > oldest age). If I don't issue the manual "vacuum freeze" for all databases > and left things alone, it would get to wrap-around state. I was thinking about this in connection with the shared catalog vacuuming bug I fixed yesterday. One problem here is that the relfrozenxid/relminmxid settings are local to each database, so a for-wraparound vacuum would have to scan the table in all databases, and advance the counter in all databases, before the overall "age" of the system goes down again, which is inconvenient. This is a problem inherent to having these columns in a per-database catalog, and the only solution I can see is to have another shared catalog containing age state for the shared catalogs. (However, this issue isn't terribly serious -- it just means you have to scan the catalog N times instead of one.) However this doesn't explain why the vacuuming of pg_database in one database doesn't actually advance the relfrozenxid in that database; it certainly should. This is the serious problem you're facing, it seems. I wonder if you've tweaked the freeze settings somehow. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services