Re: autovacuum hung on simple tables - Mailing list pgsql-general

From senor
Subject Re: autovacuum hung on simple tables
Date
Msg-id SN4P221MB0683B708071E5052B557259AF7049@SN4P221MB0683.NAMP221.PROD.OUTLOOK.COM
Whole thread Raw
In response to Re: autovacuum hung on simple tables  (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses Re: autovacuum hung on simple tables  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-general
I'm a little late getting back to this but still have no solution.
I 100% agree that updating postgres is best. The usual excuses for not updating apply. It will be done when it is
allowedto be done. It remains 11.4 for now. I read through the changelogs up through 11.18 and didn't see anything
obviouslyrelated. The underlying cause could be pretty involved and something I wouldn't recognize. 

Thank you Laurenz Albe for reminding me about an important clue. I had inadvertently executed a vacuum freeze on a
tablethat pg_stat_activity indicated was currently being autovacuumed. The manual vacuum succeeded while leaving the
previousautovacuum still showing active in pg_stat_activity. Manual vacuum always completes, never stalls, but also
oftendoes not appear in pg_stat_progress_vacuum unless it's a longer process. 

It appears the process completes the vacuum but does not register that fact. relfrozenxid of the main table is what
wouldbe expected but an associated toast table was still very old. Cancelling all pending vacuums of the table and
manuallyrunning vacuum completes in a few seconds and both the main relation and toast are updated as expected with
lastvacuum time updated in pg_stat_all_tables. pg_stat_progress_vacuum never had any entry. Autoacuum and autovacuum
analyzeboth get hung. 

I often see the same table listed multiple times in pg_stat_activity with different pids and:
state: active
backend_type: autovacuum worker
wait_event_type: blank
state_change: 10-15 seconds after backend_start - about how long it takes to manually run vacuum on most tables.

What exactly does autovacuum rely on in the stats file? I ran strace on a hung autovacuum process and saw a repeated
readof /run/postgresql/db_16384.stat (tempfs). The file is 740MB which is about the same as other similar installations
I'vereviewed. I'm lacking in overall experience in this though. 

One final oddity:
I ran a query for oldest relfrozenxid and redirected to file. The query took around a minute. A few seconds after it
finished,I queried for 'active' in pg_stat_activity and the oldest relfrozenxid query was still listed. A few seconds
laterit had cleared. 

Can a corrupted stats file prevent autovac from reading/writing?


> I scripted a vacuum loop using the oldest table list. It's extremely slow but it was
> making better progress than autovacuum was.
>
> Using ps I see that there were as many worker processes as defined with autovacuum_max_workers
> but pg_stat_activity consistantly showed 19. I killed the script thinking there might be a conflict.
> I saw no difference after 30 minutes so restarted script.

I am not sure what exactly you are actually doing here, but you should know that there
can only be one VACUUM process per table.  If there is already an anti-wraparound autovacuum
running on the table, a manual VACUUM will simple be blocked until the autovacuum worker
is done.

> Never saw anything in pg_stat_progress_vacuum.

Now that would be weird, except if VACUUM cannot get the required lock on the table.

> vacuum settings:
>                 name                 |  setting
> -------------------------------------+-----------
>  autovacuum                          | on
>  autovacuum_freeze_max_age           | 200000000
>  autovacuum_max_workers              | 40
>  autovacuum_naptime                  | 4
>  autovacuum_vacuum_cost_delay        | 0
>  autovacuum_vacuum_cost_limit        | 5000
>  autovacuum_work_mem                 | -1
>  vacuum_freeze_min_age               | 50000000
>  vacuum_freeze_table_age             | 150000000
>
> I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see nothing
> but the successful vacuums from the script and my own fat-fingering commands in the postgres
> logs (set at info).

Sorry about the reply formatting. I tried using outlook web in Edge. Maybe that was a mistake.
Thanks,
Senor


pgsql-general by date:

Previous
From: David Rowley
Date:
Subject: Re: Postgres upgrade 12 - issues with OIDs
Next
From: Thomas Kellerer
Date:
Subject: Re: Get table catalog from pg_indexes