Thread: Disabling vacuum truncate for autovacuum

Disabling vacuum truncate for autovacuum

From
Will Storey
Date:
Hi!

I would like to disable vacuum's truncate behaviour for autovacuum.
Previously I had an outage due to its access exclusive lock when it was
replicated to a hot standby.

When that outage happened it was from a VACUUM call in a cronjob rather
than autovacuum. I now run such VACUUMs with TRUNCATE false which avoids
the issue for these. However I've realized that autovacuum could cause this
as well. This is of interest to me because I'm looking at tuning autovacuum
and getting rid of the cronjob, but I've realized relying on autovacuum
could be dangerous because of the truncates.

I believe the only way to disable this for autovacuum is by changing the
vacuum_truncate storage parameters on tables. (Ignoring the now removed
old_snapshot_threshold option). I'm thinking of altering all my tables to
turn it off. Is this a horrible idea? I expect I would need to monitor
tables for problematic growth, but that might be better than a surprise
outage. I suppose the growth could cause an outage too, but I'm thinking it
would be more controllable.

Would I need to disable the settings on catalog tables too? (To rule out
any possibility of it happening). Are there any other things I might be
missing?

I am also wondering if having an autovacuum setting to control it would be
a good idea for a feature. That would be simpler for me than altering all
my tables and help me avoid missing any (e.g. catalogs, new tables).

I might be worrying needlessly about this as maybe it is unlikely to
happen. I suppose it is workload dependent.

Thank you!

Will



Re: Disabling vacuum truncate for autovacuum

From
Laurenz Albe
Date:
On Mon, 2024-12-16 at 16:25 -0800, Will Storey wrote:
> I would like to disable vacuum's truncate behaviour for autovacuum.
> Previously I had an outage due to its access exclusive lock when it was
> replicated to a hot standby.
>
> When that outage happened it was from a VACUUM call in a cronjob rather
> than autovacuum. I now run such VACUUMs with TRUNCATE false which avoids
> the issue for these. However I've realized that autovacuum could cause this
> as well.
>
> I believe the only way to disable this for autovacuum is by changing the
> vacuum_truncate storage parameters on tables. (Ignoring the now removed
> old_snapshot_threshold option).

Yes, you can only do that table by table.


> I'm thinking of altering all my tables to
> turn it off. Is this a horrible idea? I expect I would need to monitor
> tables for problematic growth, but that might be better than a surprise
> outage. I suppose the growth could cause an outage too, but I'm thinking it
> would be more controllable.

I don't see a problem with disabling VACUUM truncation for normal workloads.
Some applications, like volatile queue tables, might need the feature, but
I'd assume that to be the exception.

> Would I need to disable the settings on catalog tables too? (To rule out
> any possibility of it happening). Are there any other things I might be
> missing?

Potentially yes.  But unless you are using temporary tables or create,
alter and drop lots of objects, that shouldn't be necessary.

> I am also wondering if having an autovacuum setting to control it would be
> a good idea for a feature.

I'm all for that.

Yours,
Laurenz Albe



Re: Disabling vacuum truncate for autovacuum

From
Will Storey
Date:
On Tue 2024-12-17 08:30:19 +0100, Laurenz Albe wrote:
> > Would I need to disable the settings on catalog tables too? (To rule out
> > any possibility of it happening). Are there any other things I might be
> > missing?
> 
> Potentially yes.  But unless you are using temporary tables or create,
> alter and drop lots of objects, that shouldn't be necessary.

I see. That makes sense. I do have some things that use temporary tables as
well as some jobs that create/drop objects. They are not very frequent nor
are there a huge number objects involved, but I suppose it could still be
an issue. I'm not keen on altering the catalogs, but it sounds like if I
want to be very safe then I would need to.

> > I am also wondering if having an autovacuum setting to control it would be
> > a good idea for a feature.
> 
> I'm all for that.

I previously had old_snapshot_threshold enabled, which would have done this
anyway I believe, including for the catalog tables. That was convenient!

> Yours,
> Laurenz Albe

Thank you Laurenz! I've read a bunch of your writing and I've learned a lot
from you. I'm a big fan :-). Thank you for what you do!



Re: Disabling vacuum truncate for autovacuum

From
Jeremy Schneider
Date:
On Mon, 16 Dec 2024 16:25:06 -0800
Will Storey <will@summercat.com> wrote:

> I would like to disable vacuum's truncate behaviour for autovacuum.
> Previously I had an outage due to its access exclusive lock when it
> was replicated to a hot standby.
> 
> When that outage happened it was from a VACUUM call in a cronjob
> rather than autovacuum. I now run such VACUUMs with TRUNCATE false
> which avoids the issue for these. However I've realized that
> autovacuum could cause this as well. This is of interest to me
> because I'm looking at tuning autovacuum and getting rid of the
> cronjob, but I've realized relying on autovacuum could be dangerous
> because of the truncates.

Can you tell us a little bit more about the outage? Autovacuum is
designed to quickly relinquish this lock if there is any contention, and
the dangers of disabling autovacuum are significant, so your statement
about autovac being "dangerous" will raise a lot of eyebrows.

Did your outage involve hot standbys serving read-only traffic, or did
it only involve a read-write database?

What was the exact nature of the outage and how did you narrow down the
cause to the exclusive lock held specifically during an autovacuum
truncation?

-Jeremy



Re: Disabling vacuum truncate for autovacuum

From
Will Storey
Date:
On Thu 2024-12-26 12:21:08 -0800, Jeremy Schneider wrote:
> On Mon, 16 Dec 2024 16:25:06 -0800
> Will Storey <will@summercat.com> wrote:
> 
> > I would like to disable vacuum's truncate behaviour for autovacuum.
> > Previously I had an outage due to its access exclusive lock when it
> > was replicated to a hot standby.
> > 
> > When that outage happened it was from a VACUUM call in a cronjob
> > rather than autovacuum. I now run such VACUUMs with TRUNCATE false
> > which avoids the issue for these. However I've realized that
> > autovacuum could cause this as well. This is of interest to me
> > because I'm looking at tuning autovacuum and getting rid of the
> > cronjob, but I've realized relying on autovacuum could be dangerous
> > because of the truncates.
> 
> Can you tell us a little bit more about the outage? Autovacuum is
> designed to quickly relinquish this lock if there is any contention, and
> the dangers of disabling autovacuum are significant, so your statement
> about autovac being "dangerous" will raise a lot of eyebrows.
>
> Did your outage involve hot standbys serving read-only traffic, or did
> it only involve a read-write database?
>
> What was the exact nature of the outage and how did you narrow down the
> cause to the exclusive lock held specifically during an autovacuum
> truncation?

My incident was actually not caused by autovacuum. A VACUUM was run against
the primary by a cronjob. A web service running read queries against hot
standbys went down for several minutes as its queries were stuck in a lock
queue.

While this was VACUUM, my understanding is that autovacuum could do this as
well because it does not see the queries on the hot standby that could be
blocked by it, so it won't know to stop its work. I think this issue is
part of what lead to the addition of the vacuum_truncate reloption
discussed in
https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com,
e.g. this message:
https://www.postgresql.org/message-id/20190408044345.ndxsnveqqlj3m67g%40alap3.anarazel.de.
I could be misunderstanding it though!

As I recall, I confirmed the cause via query logs. I noticed the table was
vacuumed at the time, which lead me to learning about the page truncation
behaviour. It has been a couple years though.

The cronjob still runs every night, but now with TRUNCATE false. I've been
thinking of trying to get rid of it and rely more on autovacuum which is
why I've been revisiting this. As well, we're no longer protected by
old_snapshot_threshold disabling the page truncation globally, due to that
being removed.



Re: Disabling vacuum truncate for autovacuum

From
Jeremy Schneider
Date:
On Thu, 26 Dec 2024 13:24:03 -0800
Will Storey <will@summercat.com> wrote:

> My incident was actually not caused by autovacuum. A VACUUM was run
> against the primary by a cronjob. A web service running read queries
> against hot standbys went down for several minutes as its queries
> were stuck in a lock queue.
> 
> ...
> 
> As I recall, I confirmed the cause via query logs. I noticed the
> table was vacuumed at the time, which lead me to learning about the
> page truncation behaviour. It has been a couple years though.

Ah - thanks - this is very helpful. I have also seen issues
specifically with hot standbys, which continue holding the exclusive
lock even when the primary read-write instance releases the lock.

A better solution in my opinion would be to enhance the WAL replay
process so that it can somehow temporarily relinquish the exclusive lock
under contention, similar to what the primary read-write instance is
able to do.

This is not an easy enhancement to make. Maybe we'd need the primary to
put more information into the WAL than it does today. Maybe we'd need
to leverage hot_standby_feedback to enable standbys to signal a primary
to release the lock.

Anyway thanks for the report - we need people reporting these issues on
the lists so that there's a little visibility into the impact.

Personally I'm still hesitant about the idea of globally disabling
vacuum truncation. That was never the goal of the
old_snapshot_threshold feature, interesting that you were able to
capitalize on the side-effect. Personally I'd still favor disabling it
only on the tables that are both frequently vacuumed and also
frequently queried on hot standbys.

In a pinch, you could disable it for all tables with a bit of dynamic
SQL and ensuring that new tables created in the future include the
syntax to disable it too.

-Jeremy