Re: Autovacuum endless loop in heap_page_prune()? - Mailing list pgsql-general

From Laurenz Albe
Subject Re: Autovacuum endless loop in heap_page_prune()?
Date
Msg-id 34f2beb882d607b360dea53433f95a767d1cad8f.camel@cybertec.at
In response to Autovacuum endless loop in heap_page_prune()?  (Peter <pmc@citylink.dinoex.sub.org>)
Responses Re: Autovacuum endless loop in heap_page_prune()?
List pgsql-general
On Sat, 2024-05-25 at 12:51 +0200, Peter wrote:
>  I just found autovacuum running for 6 hours on an 8 GB table; the VACUUM
> query doesn't cancel, the cluster doesn't stop, the autovacuum worker is
> not killable, truss shows no activity, and after kill -6 I got this backtrace:
>
> * thread #1, name = 'postgres', stop reason = signal SIGABRT
>   * frame #0: 0x0000000000548063 postgres`HeapTupleSatisfiesVacuumHorizon + 531
>     frame #1: 0x000000000054aed9 postgres`heap_page_prune + 537
>     frame #2: 0x000000000054e38a postgres`heap_vacuum_rel + 3626
>     frame #3: 0x00000000006af382 postgres`vacuum_rel + 626
>     frame #4: 0x00000000006aeeeb postgres`vacuum + 1611
>     frame #5: 0x00000000007b4664 postgres`do_autovacuum + 4292
>     frame #6: 0x00000000007b2342 postgres`AutoVacWorkerMain + 866
>     frame #7: 0x00000000007b1f97 postgres`StartAutoVacWorker + 39
>     frame #8: 0x00000000007ba0df postgres`sigusr1_handler + 783
>     frame #9: 0x00000008220da627 libthr.so.3`___lldb_unnamed_symbol683 + 215
>     frame #10: 0x00000008220d9b1a libthr.so.3`___lldb_unnamed_symbol664 + 314
>     frame #11: 0x00007ffffffff913
>     frame #12: 0x00000000007bba25 postgres`ServerLoop + 1541
>     frame #13: 0x00000000007b9467 postgres`PostmasterMain + 3207
>     frame #14: 0x000000000071a566 postgres`main + 758
>     frame #15: 0x00000000004f9995 postgres`_start + 261
>
> After restart, no problems reported yet.
>
> Storyline:
> this is the file-list table of my backup/archive system; it contains ~50
> million records. Recently I found a flaw in the backup system that left
> some old records unremoved. I wrote a script to clean them up; it first
> ran at 04:15 and reported that it had removed a lot of old data. I looked
> into pgAdmin 4, and it reported 9 million dead tuples.

This smells of index corruption.

I have seen cases where a corrupted index sends VACUUM into an endless loop
so that it does not react to query cancellation.

Check the index with the "bt_index_check()" function from the "amcheck"
extension.  If that reports a problem, rebuild the index.
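For example (assuming the table is called "filelist"; substitute your actual
table name), something like this checks all B-tree indexes on the table:

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- run bt_index_check() on every B-tree index of the table;
    -- it raises an error if it finds structural corruption
    SELECT c.relname,
           bt_index_check(index => c.oid, heapallindexed => true)
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    JOIN pg_am a ON a.oid = c.relam
    WHERE i.indrelid = 'filelist'::regclass
      AND a.amname = 'btree';

With "heapallindexed => true" it also verifies that every heap tuple has a
matching index entry, which is slower but more thorough.  If a problem is
reported, REINDEX INDEX CONCURRENTLY (or plain REINDEX on older versions)
rebuilds the index.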

Of course, as always, try to figure out how that could happen.
Apart from hardware problems, one frequent cause is upgrading glibc
(if the index is on a string column or expression).
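
On PostgreSQL 15 or later you can get a hint about that by comparing the
collation version recorded for the database with what the operating system
currently provides (just a sketch; the datcollversion column and the
pg_database_collation_actual_version() function exist from v15 on):

    SELECT datname,
           datcollate,
           datcollversion                            AS recorded_version,
           pg_database_collation_actual_version(oid) AS actual_version
    FROM pg_database
    WHERE datname = current_database();

If the two versions differ, indexes on collatable columns were built under
a different collation library version and may need to be rebuilt.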

Yours,
Laurenz Albe


