autovacuum hung on simple tables - Mailing list pgsql-general

From senor
Subject autovacuum hung on simple tables
Date
Msg-id SN4P221MB0683F2707C055743A1EE5C2CF73B9@SN4P221MB0683.NAMP221.PROD.OUTLOOK.COM
Whole thread Raw
Responses Re: autovacuum hung on simple tables  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-general
Hi All,

I'm still trying to get a better understanding of the autovacuum process.
This is a different postgres installation as my previous posts and confusing me in new ways.
Still 11.4 running on CentOS 7 and 8 nvme in software raid

This issue started with postgres "...not accepting commands to avoid wraparound...".
On this server I was able to stop all access to DB and dedicate resources to only postgres. I thought I could allow
autovacuumto do its thing with a ton of workers.  

I think everything boils down to 2 questions:
1. Can autovacuum or manual vacuum be coerced into dealing with oldest first?
    1a. Where might I find advice on configuring postgres resources for maximum cpu & memory maintenance use. In other
wordsquickest path out of "not accepting commands" land. Besides increasing autovacuum_freeze_max_age. 
2. What can cause autovacuum to stall? Could associated toast or index bne the cause.

It appeared that autovacuum was not choosing the tables with the oldest xmin so I produced an ordered list of oldest
tableswith: 
SELECT oid::regclass, age(relfrozenxid)
FROM pg_class
WHERE relkind IN ('r', 't', 'm')
AND age(relfrozenxid) > 2000000000
ORDER BY 2 DESC

The list contained over 6000 tables from pg_toast. They all belonged to daily reports tables. The reports are created
dailyand not touched again. 

Most of the autovacuums that did start seem to be hung. Never completing even on the simplest tables.
The newest 2 autovacuums in the list are completing about one every couple seconds.
CPU and disk IO are nearly idle.
An example table is shown here:

phantom=# select
phantom-#       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'main')) as relation_size_main,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'fsm')) as relation_size_fsm,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'vm')) as relation_size_vm,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'init')) as relation_size_init,
phantom-#       pg_size_pretty(pg_table_size(relid)) as table_size,
phantom-#       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size
phantom-#  from
phantom-#       pg_catalog.pg_statio_user_tables
phantom-# where
phantom-#   relname like 'report_user_439';
 total_size | relation_size_main | relation_size_fsm | relation_size_vm | relation_size_init | table_size |
external_size

------------+--------------------+-------------------+------------------+--------------------+------------+---------------
 80 kB      | 8192 bytes         | 24 kB             | 8192 bytes       | 0 bytes            | 48 kB      | 72 kB
(1 row)


I scripted a vacuum loop using the oldest table list. It's extremely slow but it was making better progress than
autovacuumwas. 

Using ps I see that there were as many worker processes as defined with autovacuum_max_workers but pg_stat_activity
consistantlyshowed 19. I killed the script thinking there might be a conflict. I saw no difference after 30 minutes so
restartedscript. Never saw anything in pg_stat_progress_vacuum. 

vacuum settings:
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 40
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 4
 autovacuum_vacuum_cost_delay        | 0
 autovacuum_vacuum_cost_limit        | 5000
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | 0
 vacuum_cleanup_index_scale_factor   | 0.1
 vacuum_cost_delay                   | 0
 vacuum_cost_limit                   | 200
 vacuum_cost_page_dirty              | 20
 vacuum_cost_page_hit                | 1
 vacuum_cost_page_miss               | 10
 vacuum_defer_cleanup_age            | 0
 vacuum_freeze_min_age               | 50000000
 vacuum_freeze_table_age             | 150000000
 vacuum_multixact_freeze_min_age     | 5000000
 vacuum_multixact_freeze_table_age   | 150000000

I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see nothing but the
successfulvacuums from the script and my own fat-fingering commands in the postgres logs (set at info). 

Any hints are appreciated.
Senor


pgsql-general by date:

Previous
From: Bryn Llewellyn
Date:
Subject: Re: Putting the O/S user for "local" "peer" authentication in the "postgres" group vs chmod'ing the "pg*.conf" files to be readable by "all"
Next
From: milist ujang
Date:
Subject: Re: Postgres to edb AS, need conversion?