On Thu, 26 Dec 2024 13:24:03 -0800
Will Storey <will@summercat.com> wrote:
> My incident was actually not caused by autovacuum. A VACUUM was run
> against the primary by a cronjob. A web service running read queries
> against hot standbys went down for several minutes as its queries
> were stuck in a lock queue.
>
> ...
>
> As I recall, I confirmed the cause via query logs. I noticed the
> table was vacuumed at the time, which lead me to learning about the
> page truncation behaviour. It has been a couple years though.
Ah - thanks - this is very helpful. I have also seen issues
specifically with hot standbys, which continue holding the exclusive
lock even when the primary read-write instance releases the lock.
A better solution in my opinion would be to enhance the WAL replay
process so that it can somehow temporarily relinquish the exclusive lock
under contention, similar to what the primary read-write instance is
able to do.
This is not an easy enhancement to make. Maybe we'd need the primary to
put more information into the WAL than it does today. Maybe we'd need
to leverage hot_standby_feedback to enable standbys to signal a primary
to release the lock.
Anyway thanks for the report - we need people reporting these issues on
the lists so that there's a little visibility into the impact.
Personally I'm still hesitant about the idea of globally disabling
vacuum truncation. That was never the goal of the
old_snapshot_threshold feature, interesting that you were able to
capitalize on the side-effect. Personally I'd still favor disabling it
only on the tables that are both frequently vacuumed and also
frequently queried on hot standbys.
In a pinch, you could disable it for all tables with a bit of dynamic
SQL and ensuring that new tables created in the future include the
syntax to disable it too.
-Jeremy