On Sat, 2024-05-25 at 12:51 +0200, Peter wrote:
> I just found Autovacuum run for 6 hours on a 8 GB table, VACUUM query
> doesnt cancel, cluster doesn't stop, autovacuum worker is not
> killable, truss shows no activity, after kill -6 this backtrace:
>
> * thread #1, name = 'postgres', stop reason = signal SIGABRT
> * frame #0: 0x0000000000548063 postgres`HeapTupleSatisfiesVacuumHorizon + 531
> frame #1: 0x000000000054aed9 postgres`heap_page_prune + 537
> frame #2: 0x000000000054e38a postgres`heap_vacuum_rel + 3626
> frame #3: 0x00000000006af382 postgres`vacuum_rel + 626
> frame #4: 0x00000000006aeeeb postgres`vacuum + 1611
> frame #5: 0x00000000007b4664 postgres`do_autovacuum + 4292
> frame #6: 0x00000000007b2342 postgres`AutoVacWorkerMain + 866
> frame #7: 0x00000000007b1f97 postgres`StartAutoVacWorker + 39
> frame #8: 0x00000000007ba0df postgres`sigusr1_handler + 783
> frame #9: 0x00000008220da627 libthr.so.3`___lldb_unnamed_symbol683 + 215
> frame #10: 0x00000008220d9b1a libthr.so.3`___lldb_unnamed_symbol664 + 314
> frame #11: 0x00007ffffffff913
> frame #12: 0x00000000007bba25 postgres`ServerLoop + 1541
> frame #13: 0x00000000007b9467 postgres`PostmasterMain + 3207
> frame #14: 0x000000000071a566 postgres`main + 758
> frame #15: 0x00000000004f9995 postgres`_start + 261
>
> After restart, no problems reported yet.
>
> Storyline:
> this is the file-list table of my backup/archive system, contains ~50
> mio records. Recently I found a flaw in the backup system, so that some
> old records weren't removed. I wrote a script to do this, that script
> did run first at 04:15 and reported it had now removed a lot of old
> data. I looked into pgadmin4 and it reported 9 mio dead tuples.
This smells of index corruption.
I have seen cases where a corrupted index sends VACUUM into an endless loop
so that it does not react to query cancellation.
Check the index with the "bt_index_check()" function from the "amcheck"
extension. If that reports a problem, rebuild the index.
Of course, as always, try to figure out how that could happen.
Apart from hardware problems, one frequent cause is upgrading glibc
(if the index on a string column or expression).
Yours,
Laurenz Albe