>> Can you run amcheck's bt_index_check() routine against some of the
>> indexes you've shown? There is perhaps some chance that index
>> corruption exists and causes VACUUM to take a very long time to delete
>> index pages. This is pretty much a wild guess, though.
Unfortunately I can't; we haven't enabled that extension yet, and since this is a production system I'm not ready to turn it on right away. That said, this theory seems unlikely, since the same problem occurs on different sites. Here's example output from another site, where the table size is 46.5 GB (again, data only), indexes 107 GB, toast 62 MB:
automatic aggressive vacuum of table "appdbname2.appschemaname.applications": index scans: 1
pages: 0 removed, 6091646 remain, 0 skipped due to pins, 6086395 skipped frozen
tuples: 2344 removed, 35295654 remain, 0 are dead but not yet removable, oldest xmin: 213412878
buffer usage: 251980554 hits, 14462331 misses, 18844 dirtied
avg read rate: 12.018 MB/s, avg write rate: 0.016 MB/s
system usage: CPU: user: 7734.43 s, system: 178.98 s, elapsed: 9401.36 s
Here again we see that there are 5251 blocks that need to be cleaned (6091646 - 6086395), buffer usage is 266442885 blocks (hits + misses), or ~2 TB, and processing time is 2.5h+.
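To make the arithmetic explicit, here is a quick back-of-the-envelope check of the figures quoted from the log above (assuming the default PostgreSQL block size of 8 kB):

```python
# Sanity-check the numbers from the autovacuum log output above.
BLOCK_SIZE = 8192  # default PostgreSQL block size in bytes

pages_remain = 6_091_646
pages_skipped_frozen = 6_086_395
hits = 251_980_554
misses = 14_462_331
elapsed_s = 9_401.36

# Blocks that actually needed cleaning
blocks_to_clean = pages_remain - pages_skipped_frozen
print(blocks_to_clean)  # 5251

# Total buffer accesses and the data volume they represent
buffer_accesses = hits + misses
data_touched_tb = buffer_accesses * BLOCK_SIZE / 1024**4
print(buffer_accesses)            # 266442885
print(round(data_touched_tb, 2))  # ~2 TB

# Average read rate: only misses cause physical reads
read_rate_mb_s = misses * BLOCK_SIZE / 1024**2 / elapsed_s
print(round(read_rate_mb_s, 3))   # 12.018, matching the log line
```

So to clean ~5 thousand blocks, VACUUM touched roughly 266 million buffers (~2 TB), which is consistent with the reported 12 MB/s average read rate over 9401 seconds.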
>> It's possible that VACUUM had to wait a long time for a cleanup lock
>> on one individual heap page here
If such a scenario is possible, it would make sense to add the lock wait time to the output. Something like:
system usage: CPU: user: 7734.43 s, system: 178.98 s, lock_wait: 1234.56 s, elapsed: 9401.36 s
Mikhail