Re: another autovacuum scheduling thread - Mailing list pgsql-hackers

From Greg Burd
Subject Re: another autovacuum scheduling thread
Date
Msg-id bff7dc5a-f839-449b-b56a-aab36cd3e39d@app.fastmail.com
In response to Re: another autovacuum scheduling thread  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Wed, Mar 18, 2026, at 12:09 PM, Nathan Bossart wrote:
> On Wed, Mar 18, 2026 at 12:06:34PM +1300, David Rowley wrote:
>> I think it would have been better to have done this about 3 months
>> ago, but I think it's probably still fine to do now. Feature freeze is
>> still quite a long way from release. I do expect that the most likely
>> time that someone would find issues with this change would be during
>> beta or RC, as that's when people would give PostgreSQL production
>> workloads to try out. During the dev cycle, I expect it's *mostly*
>> just hackers giving the database toy workloads in a very targeted way
>> to something specific that they're hacking on.  Anyway, now since
>> you've added the GUCs to control the weights, there's a way for users
>> to have some influence, so at least there's an escape hatch.
>
> Thanks for chiming in.

Hey Nathan, et al., I'll chime in too!  Apologies in advance for the length of this message.  I need to learn how to be more concise...

First off, thanks for looking into this!  I think your work is a significant improvement over where we are today and should be in v19.  Put simply, I think the v12 patch is no worse than the pg_class order we use today and in many cases much better.  That said, I think we should tweak it a bit more while we're at it.

>> I think the GUCs are probably a good idea. I expect the most likely
>> change that people might want to make would be to raise the priority
>> of analyze over vacuum since that's often much faster to complete. We
>> know that some workloads are very sensitive to outdated statistics.
>>
>> On the other hand, we shouldn't be taking adding 5 new autovacuum GUCs
>> lightly as there are already so many. If we are going to come up with
>> something better than this in a few years then it could be better to
>> wait to reduce the pain of having to remove GUCs in the future. I
>> don't personally have any better ideas, so I'd rather see it go in
>> than not.
>
> Yeah, adding these GUCs feels a bit like etching in stone, but if folks
> want configurability, and nobody has better ideas, I'm not sure what else
> to do.

I'm late to the review process.  I know David Rowley proposed the unified scoring approach that became the foundation of this patch, and I think that's a great direction.  However, I'm concerned that the patch's default scoring weights don't give XID-age urgency sufficient priority over dead-tuple urgency.  The weight GUCs (autovacuum_vacuum_score_weight, etc.) can address this, but they max out at 1.0, meaning you can only reduce dead-tuple priority, not increase XID priority.
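To make that asymmetry concrete, here's a tiny Python sketch.  It assumes a simple weighted-sum combination; the function and weight names are illustrative, not the patch's actual code:

```python
# Sketch (assumed formula, not the patch's actual code): if the final
# score is a weighted sum and each weight is clamped to [0.0, 1.0],
# a component can only be attenuated, never amplified.

def combined_score(dead_score, freeze_score, w_dead=1.0, w_freeze=1.0):
    """Illustrative weighted sum; names are hypothetical."""
    return w_dead * dead_score + w_freeze * freeze_score

dead = 40.0     # typical dead-tuple score for a churn table in the sim
freeze = 1.5    # age/freeze_max_age for a table just past the limit

# Even the most freeze-favoring legal weights (w_dead=0.0, w_freeze=1.0)
# can't lift a freeze-only table above an ordinary churn table.
most_freeze_favoring = combined_score(0.0, freeze, w_dead=0.0, w_freeze=1.0)
plain_churn = combined_score(dead, 0.0)
assert most_freeze_favoring < plain_churn
```

With the cap at 1.0, a DBA can zero out bloat priority entirely and a freshly-past-the-limit table still scores lower than any moderately active one.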

>> I didn't look at the patch in detail, but noticed that you might want
>> to add a "See Section N.N.N for more information." to the new GUCs to
>> link to the text you've added on how they're used.
>
> Good idea.  I've added that.
>
>> Do you think it's worth making the call to
>> list_sort(tables_to_process, TableToProcessComparator) conditional on
>> a variable set like: sort_required |= (score != 0.0);? I recall that
>> someone had concerns that the actual sort itself could add overhead.
>
> I don't think it's necessary.  I tested sorting 1M and 10M tables with
> randomly generated scores (on my MacBook, with assertions enabled).  The
> former took ~150 milliseconds, and the latter took ~1770 milliseconds.  I
> suspect there are far bigger problems to worry about if you have anywhere
> near that many tables.
>
> --
> nathan
>
> Attachments:
> * v12-0001-autovacuum-scheduling-improvements.patch

I decided to model before/after behavior using discrete-event simulation.  With a bit of LLM help, we now have autovacuum_simulation.py and autovacuum_simulation_fixed.py to compare all three approaches (before, v12, and a proposed small fix to v12) across 20 runs with randomized OID assignments.

Results from the models are at the end of this email.  They simulate the before/after/fixed behavior of the v12 autovacuum scheduling patch and a suggested fix (I'll get to why in a minute).

They test a contention scenario: 100 tables (5 already past freeze_max_age, 15 approaching it at staggered rates, and 80 high-churn tables constantly exceeding their dead-tuple vacuum thresholds) competing for 3 autovacuum workers over 7 days.  I'm sure there are other scenarios we could test, but this one felt representative enough based on what I've seen.
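For reference, the scenario boils down to something like this (a hypothetical condensed form of the generator's parameters; the real generator is in the attached scripts):

```python
# Condensed sketch of the simulated workload.  Parameter names are
# illustrative; see the attached scripts for the actual generator.
WORKLOAD = {
    "critical": {"count": 5,  "state": "already past freeze_max_age"},
    "aging":    {"count": 15, "state": "approaching freeze_max_age at staggered rates"},
    "active":   {"count": 80, "state": "constantly exceeds dead-tuple vacuum threshold"},
}
SIM = {"workers": 3, "days": 7, "step_seconds": 60}

assert sum(t["count"] for t in WORKLOAD.values()) == 100
```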

In the first test, the "Before" scheduler processes tables in OID order; the "After" scheduler uses the patch's urgency scoring with all default GUC values.

The model of the v12 patch produces a clear system-wide improvement: the number of tables simultaneously past freeze_max_age peaks at roughly 20 under score-based scheduling versus 80+ under OID order.  This happens because high-churn tables earn high dead-tuple scores, get vacuumed frequently, and their relfrozenxid ages reset as a side effect, preventing them from ever reaching freeze_max_age.  Under OID order, high-OID active tables are *starved* (for no other reason than that they were created later than their counterparts!) and their XID ages grow needlessly into the danger zone.  The v12 patch also correctly prioritizes the most dangerous force-vacuum tables first (those closest to failsafe age), whereas OID order's success with any particular table is coincidental.

But there's a downside in v12: the simulation reveals a scoring-scale concern under default weights.  Active tables accumulate dead-tuple scores of 18–70+ within minutes of their last vacuum, while force-vacuum tables that have just crossed freeze_max_age carry XID scores of only 1.0–4.0 (age/freeze_max_age).  The exponential boost doesn't activate until failsafe age (1.6B), which is 8× the freeze_max_age threshold.  In the gap between 200M and 1.6B, force-vacuum tables are consistently outranked by ordinary dead-tuple work.  In the tested scenario, this meant only 2 of 20 at-risk tables were actually vacuumed by the score-based scheduler (versus 5 under OID order, due to luck), and average per-table wraparound exposure was 26% worse.
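A quick back-of-envelope check of that scale gap, assuming xid_score = age / freeze_max_age as in the parenthetical above (a sketch of the shape of the problem, not the patch's code):

```python
# Assumed scoring shape: xid_score = age / freeze_max_age, with the
# exponential boost only kicking in at failsafe age.  Constants match
# the simulation config.

FREEZE_MAX_AGE = 200_000_000
FAILSAFE_AGE = 1_600_000_000

def xid_score(age):
    return age / FREEZE_MAX_AGE

just_crossed = xid_score(FREEZE_MAX_AGE)   # 1.0: just hit the limit
near_failsafe = xid_score(FAILSAFE_AGE)    # 8.0: about to hit failsafe

# Low end of dead-tuple scores observed for active tables in the sim.
typical_dead_tuple_score = 18.0

# Everywhere in the 200M..1.6B window, XID urgency loses to ordinary
# dead-tuple work under equal default weights.
assert near_failsafe < typical_dead_tuple_score
```

So even a table 8× past freeze_max_age never outranks a mildly active table until the boost finally fires.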

One possible remedy within the current design would be to either raise the default autovacuum_freeze_score_weight or apply a floor multiplier when force_vacuum is true.  For example, ensuring any force-vacuum score is at least as large as the maximum non-force-vacuum score in the current candidate set.  Alternatively, the weight GUC range could be expanded above 1.0 to allow DBAs to explicitly boost XID-age priority.  The existing 0.0–1.0 range only allows reducing component priority, which makes it difficult to express "wraparound prevention is more important than bloat control."
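The floor idea could look something like this (a pure-Python mock-up in the style of the attached simulation scripts; the field names are hypothetical, not the patch's struct members):

```python
# Sketch of the "floor" remedy: after computing raw scores, lift every
# force-vacuum table's score to at least the best non-force score in
# the candidate set, so wraparound work can't be starved by bloat work.

def apply_force_floor(candidates):
    """candidates: list of dicts with 'score' and 'force_vacuum' keys."""
    non_force = [c["score"] for c in candidates if not c["force_vacuum"]]
    if not non_force:
        return candidates            # nothing to floor against
    floor = max(non_force)
    for c in candidates:
        if c["force_vacuum"]:
            c["score"] = max(c["score"], floor)
    return candidates

tables = [
    {"name": "churn_1", "score": 42.0, "force_vacuum": False},
    {"name": "old_1",   "score": 1.2,  "force_vacuum": True},
]
apply_force_floor(tables)
assert tables[1]["score"] == 42.0   # force-vacuum table no longer outranked
```

This keeps the relative ordering among force-vacuum tables intact while guaranteeing they sort at or above everything else.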

Tiered sorting using the existing wraparound flag might be the simplest and safest fix.  While the thread discussed using scoring with an exponential boost to prioritize wraparound tables, I think a two-tier approach (wraparound tier vs. routine tier) would be more robust.  We already treat wraparound as non-negotiable (force_vacuum bypasses av_enabled and triggers emergency behavior).  The scoring system should reflect this by making wraparound a hard priority tier, not a score component competing with bloat cleanup.

Instead of sorting solely by score, sort first by the wraparound flag, then by score within each tier.  The score computation stays exactly as-is; it's only used for relative ordering among force_vacuum tables and among non-force-vacuum tables.

Something like:

/*
 * Comparison function for sorting TableToProcess candidates.
 *
 * Tables at risk of wraparound are always processed before routine
 * maintenance work.  Within each tier, tables are sorted by descending
 * urgency score.
 */
static int
TableToProcessComparator(const ListCell *a, const ListCell *b)
{
    TableToProcess *ta = (TableToProcess *) lfirst(a);
    TableToProcess *tb = (TableToProcess *) lfirst(b);

    /* Wraparound prevention always takes priority */
    if (ta->wraparound && !tb->wraparound)
        return -1;
    if (!ta->wraparound && tb->wraparound)
        return 1;

    /* Within same tier, highest score first */
    if (ta->score > tb->score)
        return -1;
    if (ta->score < tb->score)
        return 1;
    return 0;
}

The simulation code, workload generator, and visualizations are attached (in "foo_tgz" because my last attempt at this email was stuck in the moderation queue).  I'd welcome feedback on whether the scenario is representative.  autovacuum_simulation_fixed.py includes the proposed fix and also runs a number of iterations in which it randomizes the table OIDs, so as to remove any dependency on ordering that may be implied by our current algorithm (the "before", or v18, algorithm).

The proposed addition on top of your fix is in the attached v20260318 patch.

best.

-greg


$ ./autovacuum_simulation.py
======================================================================
AUTOVACUUM SCHEDULING SIMULATION (v12 patch)
Accurate model of score-based prioritization
======================================================================

PostgreSQL config:
 autovacuum_freeze_max_age     =   200,000,000
 vacuum_failsafe_age           = 1,600,000,000
 autovacuum_vacuum_threshold   = 50
 autovacuum_vacuum_scale_factor= 0.2
 autovacuum_max_workers        = 3

Simulation: 7 days, 60s steps, seed=42

Generating tables...
   5 critical   — already past freeze_max_age
  15 aging      — approaching freeze_max_age
  80 active     — high dead-tuple rate

BEFORE simulation (catalog OID order):
 [OID order  ]   0%
 [OID order  ]  20%
 [OID order  ]  40%
 [OID order  ]  60%
 [OID order  ]  80%
 [OID order  ] 100%

AFTER simulation (score-based priority):
 [Score-based]   0%
 [Score-based]  20%
 [Score-based]  40%
 [Score-based]  60%
 [Score-based]  80%
 [Score-based] 100%

==============================================================================
RESULTS: Exposure time (minutes at risk before vacuum)
Table               OID   Crossed     Before      After   Change
------------------------------------------------------------------------------
critical_0        16465   day 0.0    10080m     10080m      +0%
critical_1        16398   day 0.0       14m     10080m  -71900%
critical_2        16387   day 0.0        4m     10080m  -251900%
critical_3        16478   day 0.0    10080m         4m    +100%
critical_4        16419   day 0.0    10080m         4m    +100%
aging_0           16415   day 0.3     9648m      9648m      +0%
aging_1           16412   day 0.7     9114m      9114m      +0%
aging_7           16459   day 0.7     9027m      9027m      +0%
aging_6           16395   day 0.8        9m      8982m  -99700%
aging_10          16481   day 0.9     8721m      8721m      +0%
aging_11          16472   day 0.9     8721m      8721m      +0%
aging_2           16401   day 1.0      415m      8579m   -1967%
aging_3           16397   day 1.0       12m      8578m  -71383%
aging_4           16470   day 1.1     8499m      8499m      +0%
aging_12          16411   day 1.1     8483m      8483m      +0%
aging_8           16438   day 1.2     8296m      8296m      +0%
aging_5           16453   day 1.3     8178m      8178m      +0%
aging_14          16448   day 2.1     7101m      7101m      +0%
aging_9           16388   day 2.2        4m      6979m  -174375%
aging_13          16413   day 2.2     6959m      6959m      +0%
------------------------------------------------------------------------------
AVERAGE                               6172m      7806m     -26%
MAXIMUM                              10080m     10080m      +0%
==============================================================================

Generating visualization...
 ✓ output/autovacuum_scheduling_impact.png

Done.


-------------------------------------------------------

$ ./autovacuum_simulation_fixed.py
========================================================================
THREE-WAY AUTOVACUUM SCHEDULING COMPARISON
Before (OID) vs v12 Patch (Score) vs Proposed Fix (Tiered)
========================================================================

Config: 3 workers, 7-day sim, 60s steps, 20 runs
Tables: 5 critical + 15 aging + 80 active = 100
freeze_max_age = 200,000,000
Estimated runtime: 3-8 minutes

Run      OID avg    Score avg   Tiered avg
-------------------------------------------
  1       7222m       7892m          4m
  2       7642m       7892m          4m
  3       7961m       7892m          4m
  4       6333m       7892m          4m
  5       8110m       7892m          4m
  6       6359m       7892m          4m
  7       6629m       7892m          4m
  8       8526m       7892m          4m
  9       6385m       7892m          4m
 10       6813m       7892m          4m
 11       8588m       7892m          4m
 12       7261m       7892m          4m
 13       6682m       7892m          4m
 14       8035m       7892m          4m
 15       5667m       7892m          4m
 16       7595m       7892m          4m
 17       6394m       7892m          4m
 18       6686m       7892m          4m
 19       7819m       7892m          4m
 20       7178m       7892m          4m

========================================================================
AGGREGATE RESULTS
========================================================================

Avg exposure per run (minutes):
 OID       :    7194 ± 816      (min=5667, max=8588)
 Score     :    7892 ± 0        (min=7892, max=7892)
 Tiered    :       4 ± 0        (min=4, max=4)

Peak concurrent force_vacuum tables:
 OID       :      82 ± 3        (min=79, max=88)
 Score     :      20 ± 0        (min=20, max=20)
 Tiered    :       5 ± 0        (min=5, max=5)

Pairwise wins (lower avg exposure = better):
 Score beats OID: 5/20   loses: 15/20   ties: 0/20
 Tiered beats OID: 20/20   loses: 0/20   ties: 0/20
 Tiered beats Score: 20/20   loses: 0/20   ties: 0/20

Variance (std dev of avg exposure across runs):
 OID       : 816 min
 Score     : 0 min
 Tiered    : 0 min

Per-table mean exposure (minutes):
 Table                      OID           Score          Tiered
 --------------------------------------------------------------
 critical_0        7564±4470    10080±0           7±0
 critical_1        8577±3671    10080±0           7±0
 critical_2        9073±3099     8938±0           3±0
 critical_3        8581±3661        4±0           4±0
 critical_4        9167±2826        3±0           3±0
 aging_0           7253±4256     9648±0           4±0
 aging_1           7324±3675     9114±0           4±0
 aging_2           6865±3517     8579±0           5±0
 aging_3           7253±3714     8851±0           4±0
 aging_4           6018±4014     8579±0           4±0
 aging_5           6856±2951     8064±0           4±0
 aging_6           6799±3482     8496±0           3±0
 aging_7           5765±4318     8853±0           4±0
 aging_8           7091±3227     8644±0           4±0
 aging_9           6358±2738     7479±0           4±0
 aging_10          7106±3620     8870±0           4±0
 aging_11          7175±3673     8965±0           4±0
 aging_12          6499±3299     8106±0           4±0
 aging_13          6080±3108     7595±0           4±0
 aging_14          6479±3913     8891±0           4±0

========================================================================
Completed in 108 seconds (1.8 minutes)
========================================================================

Generating visualization...
 ✓ output/three_way_comparison.png
