Last time the problem was again solved by restarting the cluster. Last Saturday (after two weeks) the problem occurred again. Currently only one database is being processed by the autovacuum daemon, even though there are tables in other databases that should be processed and there are free autovacuum workers. We noticed that, just before the last other database was processed, there were a few deadlocks in the log on the hung database. A wal-e backup also started about an hour after that.
Any suggestions on how to find the sources of the problem?
I see that only one database is being processed by the autovacuum daemon, while there are tables in other databases that should be processed and there are free autovacuum workers.
Here is the information about frozenxid you requested.
Max Vikharev <bm.kinder@gmail.com> writes:
> Problem gone after restarting the cluster.
> Autovacuum started to process relations in other databases.
Hmm, interesting.
> I don't know how to reproduce the issue, we will monitor it.
> If there is any way to debug it when it occurs again - let me know.
Did you by any chance capture the contents of pg_database.datfrozenxid and datminmxid and compare them to the pg_class.relfrozenxid and relminmxid fields in the problematic databases?
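Something along these lines would capture that comparison next time (standard catalog queries; run the second one while connected to each problematic database):

```sql
-- Per-database wraparound horizons as autovacuum sees them:
SELECT datname, datfrozenxid, datminmxid, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- In each suspect database, the oldest per-relation horizons
-- (ordinary tables, TOAST tables, and materialized views):
SELECT relname, relfrozenxid, relminmxid, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 't', 'm')
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```

If pg_database claims a much older datfrozenxid than any relfrozenxid actually present in that database, that mismatch would be worth reporting.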
It's not hard to imagine that if the pg_database fields somehow didn't get updated correctly during pg_upgrade, autovacuum could become fixated on anti-wraparound work in a single database. However, that doesn't explain failing to examine the other databases at all, so I'm a bit at a loss.
Another thing to check is whether the stats collector is working. Specifically look at whether counts in pg_stat_all_tables are incrementing in the problem databases.
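Roughly, take a snapshot of the counters in a problem database, wait a minute or two of normal write traffic, and run it again; if the stats collector is working, the numbers should move:

```sql
-- Activity counters per table; if the stats collector is dead,
-- these stay frozen even under write load.
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A stuck last_autovacuum timestamp alongside growing n_dead_tup would also point at autovacuum skipping the database rather than the stats being stale.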
My guess is that somehow pg_upgrade left something in a slightly hosed state, and that restarting de-hosed it, so you aren't going to see this again ... at least not until your next upgrade. But I don't know exactly what that something could be.