Thread: Introduce some randomness to autovacuum

Introduce some randomness to autovacuum

From
Junwang Zhao
Date:
Hi hackers,

After watching Robert's talk[1] on autovacuum and participating in the related
workshop yesterday, it appears that people are inclined to use prioritization
to address the issues highlighted in Robert's presentation. Here I list two
of the failure modes that were discussed.

- Spinning. Running repeatedly on the same table but not accomplishing
anything useful.
- Starvation. autovacuum can't vacuum everything that needs vacuuming.
- ...

The prioritization approach needs some basic infrastructure that Postgres doesn't have today.

I had a random thought that introducing some randomness might help
mitigate some of the issues mentioned above. Before performing vacuum
on the collected tables, we could rotate the table_oids list by a random
offset in the range [0, list_length(table_oids) - 1]. This way, every table
would have an equal chance of being vacuumed first, avoiding both spinning
and starvation.

Even if there is a broken table that repeatedly gets stuck, this random
approach would still provide opportunities for other tables to be vacuumed.
Eventually, the system would converge.

The change is something like the following. I haven't tested the code;
I'm just posting it here for discussion. Let me know your thoughts.

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 16756152b71..6dddd273d22 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -79,6 +79,7 @@
 #include "catalog/pg_namespace.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "lib/ilist.h"
 #include "libpq/pqsignal.h"
@@ -2267,6 +2268,29 @@ do_autovacuum(void)
                                           "Autovacuum Portal",
                                           ALLOCSET_DEFAULT_SIZES);
 
+       /*
+        * Randomly rotate the list of tables to vacuum.  This is to avoid
+        * always vacuuming the same table first, which could lead to spinning
+        * on the same table or vacuuming starvation.
+        */
+       if (list_length(table_oids) > 2)
+       {
+               int         rand;
+               static pg_prng_state prng_state;
+               List       *tmp_oids;
+
+               pg_prng_seed(&prng_state, (uint64) (getpid() ^ time(NULL)));
+               rand = (int) pg_prng_uint64_range(&prng_state, 0,
+                                                 list_length(table_oids) - 1);
+               if (rand != 0)
+               {
+                       /* rotate: tables [rand .. end] first, then [0 .. rand - 1] */
+                       tmp_oids = list_concat(list_copy_tail(table_oids, rand),
+                                              list_copy_head(table_oids, rand));
+                       list_free(table_oids);
+                       table_oids = tmp_oids;
+               }
+       }
        /*
         * Perform operations on collected tables.
         */


[1] How Autovacuum Goes Wrong: And Can We Please Make It Stop Doing
That? https://www.youtube.com/watch?v=RfTD-Twpvac


-- 
Regards
Junwang Zhao



Re: Introduce some randomness to autovacuum

From
wenhui qiu
Date:
Hi, I like your idea. It would be even better if the tables could be weighted so that larger tables are favored.



Re: Introduce some randomness to autovacuum

From
Nikita Malakhov
Date:
Hi!

I agree it is a good idea to shift the table list, although vacuuming larger tables first
is a questionable approach because smaller ones could wait a long time to be vacuumed.
The most obvious and simple rule is that the first table to be vacuumed
should not be the same as the first one from the previous iteration.
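Just to illustrate, something as simple as the sketch below would do. The names are invented, and since each autovacuum worker only remembers its own previous run, a real version would probably need to keep this OID in shared memory rather than a static variable:

#include "postgres.h"
#include "nodes/pg_list.h"

/* invented: the OID of the table we started with on the previous pass */
static Oid	last_first_table_oid = InvalidOid;

static List *
avoid_repeating_first_table(List *table_oids)
{
	if (table_oids != NIL &&
		linitial_oid(table_oids) == last_first_table_oid)
	{
		/* move last round's first table to the back of the list */
		table_oids = list_delete_first(table_oids);
		table_oids = lappend_oid(table_oids, last_first_table_oid);
	}

	if (table_oids != NIL)
		last_first_table_oid = linitial_oid(table_oids);

	return table_oids;
}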

On Fri, Apr 25, 2025 at 6:04 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi, I like your idea. It would be even better if the tables could be weighted so that larger tables are favored.


--

Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Introduce some randomness to autovacuum

From
Junwang Zhao
Date:
Hi Nikita, wenhui,

On Fri, Apr 25, 2025 at 11:16 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi!
>
> I agree it is a good idea to shift the table list, although vacuuming larger tables first
> is a questionable approach because smaller ones could wait a long time to be vacuumed.
> The most obvious and simple rule is that the first table to be vacuumed
> should not be the same as the first one from the previous iteration.
>
> On Fri, Apr 25, 2025 at 6:04 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>>
>> Hi, I like your idea. It would be even better if the tables could be weighted so that larger tables are favored.
>>
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/


Thanks for your feedback.

I ended up adding a GUC that may support different vacuum
strategies. I named it `autovacuum_vacuum_strategy`, but you might have
a better name. For now it supports only two strategies:

1. Sequential: Tables are vacuumed in the order they are collected.
2. Random: The list of tables is rotated around a randomly chosen
       pivot before vacuuming to avoid always starting with the same
       table, which prevents vacuuming starvation for some tables.

We can extend this with prioritization or other algorithms
in the future.
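For reference, here is a rough sketch of how the GUC could be wired up. The attached patch is authoritative; the enum names, option strings, GUC group, and description below are only illustrative:

/* illustrative only -- see the attached patch for the real change */
typedef enum
{
	VAC_STRATEGY_SEQUENTIAL,
	VAC_STRATEGY_RANDOM
} AutoVacuumStrategy;

static const struct config_enum_entry autovacuum_vacuum_strategy_options[] = {
	{"sequential", VAC_STRATEGY_SEQUENTIAL, false},
	{"random", VAC_STRATEGY_RANDOM, false},
	{NULL, 0, false}
};

int			autovacuum_vacuum_strategy = VAC_STRATEGY_SEQUENTIAL;

/* entry for the ConfigureNamesEnum[] array in guc_tables.c */
{
	{"autovacuum_vacuum_strategy", PGC_SIGHUP, AUTOVACUUM,
		gettext_noop("Selects the order in which autovacuum processes tables."),
		NULL},
	&autovacuum_vacuum_strategy,
	VAC_STRATEGY_SEQUENTIAL, autovacuum_vacuum_strategy_options,
	NULL, NULL, NULL
},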

--
Regards
Junwang Zhao

Attachment

Re: Introduce some randomness to autovacuum

From
Nathan Bossart
Date:
On Fri, Apr 25, 2025 at 10:02:49PM +0800, Junwang Zhao wrote:
> After watching Robert's talk[1] on autovacuum and participating in the related
> workshop yesterday, it appears that people are inclined to use prioritization
> to address the issues highlighted in Robert's presentation. Here I list two
> of the failure modes that were discussed.
> 
> - Spinning. Running repeatedly on the same table but not accomplishing
> anything useful.
> - Starvation. autovacuum can't vacuum everything that needs vacuuming.
> - ...
> 
> The prioritization approach needs some basic infrastructure that Postgres doesn't have today.
> 
> I had a random thought that introducing some randomness might help
> mitigate some of the issues mentioned above. Before performing vacuum
> on the collected tables, we could rotate the table_oids list by a random
> offset in the range [0, list_length(table_oids) - 1]. This way, every table
> would have an equal chance of being vacuumed first, avoiding both spinning
> and starvation.
> 
> Even if there is a broken table that repeatedly gets stuck, this random
> approach would still provide opportunities for other tables to be vacuumed.
> Eventually, the system would converge.

First off, thank you for thinking about this problem and for sharing your
thoughts.  Adding randomness to solve this is a creative idea.

That being said, I am -1 for this proposal.  Autovacuum parameters and
scheduling are already quite complicated, and making it nondeterministic
would add an additional layer of complexity (and may introduce its own
problems).  But more importantly, IMHO it masks the problems instead of
solving them more directly, and it could mask future problems, too.  It'd
probably behoove us to think about the known problems more deeply and to
craft more targeted solutions.

-- 
nathan



Re: Introduce some randomness to autovacuum

From
Sami Imseih
Date:
> - Spinning. Running repeatedly on the same table but not accomplishing
> anything useful.

> But more importantly, IMHO it masks the problems instead of
> solving them more directly, and it could mask future problems, too

To add more to Nathan's comment about masking future problems:
this will not solve the "spinning" problem. If the most common
reason for it is a long-running transaction, etc., all your tables will
eventually end up with wasted vacuum cycles because the xmin
horizon is not advancing.

--
Sami Imseih



Re: Introduce some randomness to autovacuum

From
Greg Sabino Mullane
Date:
On Wed, Apr 30, 2025 at 10:07 AM Junwang Zhao <zhjwpku@gmail.com> wrote:
I ended up adding a GUC that may support different vacuum
strategies.

+1 to this: it's a good solution to a tricky problem. I would be a -1 if this were not a GUC.

Yes, it is masking the problem, but maybe a better way to think about it is that it is delaying the performance impact, allowing more time for a manual intervention of the problematic table(s).

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Introduce some randomness to autovacuum

From
Sami Imseih
Date:
> Yes, it is masking the problem, but maybe a better way to think about it is that it is delaying the
> performance impact, allowing more time for a manual intervention of the problematic table(s).

I question how the user will gauge the success of setting the strategy
to "random". They may set it to random, but then fall into the same issues
and revert to the default strategy.

But also, the key as you mention is "manual intervention" which
requires proper monitoring. I will
argue that for the two cases that this proposal is seeking to correct,
we already have good
solutions that could be implemented by a user.

Let's take the "spinning" case again. If a table has some sort of
problem causing
vacuum to error out, one can just disable autovacuum on a per-table
level and correct
the issue. Also, the xmin horizon being held back ( which is claimed
to be the most common cause,
and I agree with that ), well that one is just going to cause all your
autovacuums to become
useless.

Also, I do think the starvation problem has a good answer now that
autovacuum_max_workers
can be modified online. Maybe something can be done for autovacuum to
auto-tune this
setting to give more workers at times when it's needed. Not sure what
that looks like,
but it is more possible now that this setting does not require a restart.

--
Sami Imseih
Amazon Web Services (AWS)



Re: Introduce some randomness to autovacuum

From
Robert Treat
Date:
On Wed, Apr 30, 2025 at 1:56 PM Sami Imseih <samimseih@gmail.com> wrote:
>
> > Yes, it is masking the problem, but maybe a better way to think about it is that it is delaying the
> > performance impact, allowing more time for a manual intervention of the problematic table(s).
>
> I question how the user will gauge the success of setting the strategy
> to "random". They may set it to random, but then fall into the same issues
> and revert to the default strategy.
>
> But also, the key as you mention is "manual intervention" which
> requires proper monitoring. I will
> argue that for the two cases that this proposal is seeking to correct,
> we already have good
> solutions that could be implemented by a user.
>

I would have a lot more faith in this discussion if there were any kind
of external solution that had gained popularity as a general solution,
but this doesn't seem to be the case (and trying to wedge something into
the server will likely hurt that kind of research).

As an example, the first fallacy of autovacuum management is the idea
that a single strategy will always work. Having implemented a number
of crude vacuum management systems in user space already, I know that
I have run into multiple cases where I had to essentially build two
different "queues" of vacuums (one for xids, one for bloat) to be fed
into Postgres so as not to be blocked (in either direction) by
conflicting priorities no matter how wonky things got. I can imagine a
set of GUCs that we could put in to try to mimic such types of
behavior, but I suspect it would take quite a few rounds before we got
the behavior right.


Robert Treat
https://xzilla.net



Re: Introduce some randomness to autovacuum

From
David Rowley
Date:
On Thu, 1 May 2025 at 03:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
> That being said, I am -1 for this proposal.  Autovacuum parameters and
> scheduling are already quite complicated, and making it nondeterministic
> would add an additional layer of complexity (and may introduce its own
> problems).  But more importantly, IMHO it masks the problems instead of
> solving them more directly, and it could mask future problems, too.  It'd
> probably behoove us to think about the known problems more deeply and to
> craft more targeted solutions.

-1 from me too.

It sounds like the aim is to fix the problem with autovacuum vacuuming
the same table over and over and being unable to remove enough dead
tuples due to something holding back the oldest xmin horizon.  Why
can't we just fix that by remembering the value of
VacuumCutoffs.OldestXmin and only coming back to that table once
that's moved forward some amount?
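Very roughly, I'm imagining something like the following. The last_oldest_xmin field is invented here (pgstat doesn't track it today), and a real version would probably want to require some minimum advance rather than "moved at all":

#include "postgres.h"
#include "access/transam.h"
#include "pgstat.h"

/*
 * Sketch: skip re-vacuuming a table until the OldestXmin we'd use has
 * moved past the one used by the previous (auto)vacuum of that table.
 */
static bool
oldest_xmin_has_advanced(PgStat_StatTabEntry *tabentry,
						 TransactionId current_oldest_xmin)
{
	/* no stats, or never vacuumed: don't hold the table back */
	if (tabentry == NULL ||
		!TransactionIdIsValid(tabentry->last_oldest_xmin))
		return true;

	return TransactionIdFollows(current_oldest_xmin,
								tabentry->last_oldest_xmin);
}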

David



Re: Introduce some randomness to autovacuum

From
Junwang Zhao
Date:
Hi Sami,

On Thu, May 1, 2025 at 1:56 AM Sami Imseih <samimseih@gmail.com> wrote:
>
> > Yes, it is masking the problem, but maybe a better way to think about it is that it is delaying the
> > performance impact, allowing more time for a manual intervention of the problematic table(s).
>
> I question how the user will gauge the success of setting the strategy
> to "random". They may set it to random, but then fall into the same issues
> and revert to the default strategy.
>
> But also, the key as you mention is "manual intervention" which
> requires proper monitoring. I will
> argue that for the two cases that this proposal is seeking to correct,
> we already have good
> solutions that could be implemented by a user.
>
> Let's take the "spinning" case again. If a table has some sort of
> problem causing
> vacuum to error out, one can just disable autovacuum on a per-table
> level and correct
> the issue. Also, the xmin horizon being held back ( which is claimed
> to be the most common cause,
> and I agree with that ), well that one is just going to cause all your
> autovacuums to become
> useless.

Yeah, I tend to agree with you that holding back the xmin horizon will
make autovacuum useless for all tables.

But I have a question, let me quote Andres' comment on slack first:

```quote begin
It seems a bit silly to not just do some basic prioritization instead,
but perhaps we just need to reach for some basic stuff, given that
we seem unable to progress on prioritization.
```quote end

If randomness is not working, ISTM that prioritization will not benefit
the "spinning" case either, am I right?

>
> Also, I do think the starvation problem has a good answer now that
> autovacuum_max_workers
> can be modified online. Maybe something can be done for autovacuum to
> auto-tune this
> setting to give more workers at times when it's needed. Not sure what
> that looks like,
> but it is more possible now that this setting does not require a restart.

Good to know, thanks.

One case I didn't mention is that repeatedly vacuuming a corrupted table
might starve other tables too; randomness gives those tables some chance
to be vacuumed. I do admit that multiple vacuum workers can mitigate this
a little if the corrupted table's vacuum lasts for some time, but I think
randomness handles it better.

>
> --
> Sami Imseih
> Amazon Web Services (AWS)



--
Regards
Junwang Zhao



Re: Introduce some randomness to autovacuum

From
Junwang Zhao
Date:
On Thu, May 1, 2025 at 8:12 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 1 May 2025 at 03:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > That being said, I am -1 for this proposal.  Autovacuum parameters and
> > scheduling are already quite complicated, and making it nondeterministic
> > would add an additional layer of complexity (and may introduce its own
> > problems).  But more importantly, IMHO it masks the problems instead of
> > solving them more directly, and it could mask future problems, too.  It'd
> > probably behoove us to think about the known problems more deeply and to
> > craft more targeted solutions.
>
> -1 from me too.
>
> It sounds like the aim is to fix the problem with autovacuum vacuuming
> the same table over and over and being unable to remove enough dead
> tuples due to something holding back the oldest xmin horizon.  Why
> can't we just fix that by remembering the value that
> VacuumCutoffs.OldestXmin and only coming back to that table once
> that's moved forward some amount?

Users expect tables to be auto vacuumed when:
*dead_tuples > vac_base_thresh + vac_scale_factor * reltuples*
If we depend on the xid moving forward to do autovacuum, I think
there's a chance some bloated tables won't be vacuumed?
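To spell out the condition I mean, here is a simplified version of the dead-tuple check, loosely based on relation_needs_vacanalyze() in autovacuum.c; the insert threshold and anti-wraparound forcing are ignored, and the function name is mine:

#include "postgres.h"

/* simplified sketch of the dead-tuple trigger condition */
static bool
dead_tuple_threshold_exceeded(float4 reltuples, float4 dead_tuples,
							  int vac_base_thresh, float4 vac_scale_factor)
{
	float4		vacthresh;

	vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;

	return dead_tuples > vacthresh;
}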


>
> David



--
Regards
Junwang Zhao



Re: Introduce some randomness to autovacuum

From
David Rowley
Date:
On Thu, 1 May 2025 at 17:35, Junwang Zhao <zhjwpku@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 8:12 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > It sounds like the aim is to fix the problem with autovacuum vacuuming
> > the same table over and over and being unable to remove enough dead
> > tuples due to something holding back the oldest xmin horizon.  Why
> > can't we just fix that by remembering the value of
> > VacuumCutoffs.OldestXmin and only coming back to that table once
> > that's moved forward some amount?
>
> Users expect tables to be auto vacuumed when:
> *dead_tuples > vac_base_thresh + vac_scale_factor * reltuples*
> If we depend on the xid moving forward to do autovacuum, I think
> there's a chance some bloated tables won't be vacuumed?

Can you explain why you think that?  The idea is to start vacuuming other
tables that perhaps can have dead tuples removed instead of repeating
vacuums on the same table over and over without any chance of being
able to remove any more dead tuples than we could during the last
vacuum.

David