Re: another autovacuum scheduling thread - Mailing list pgsql-hackers

From Greg Burd
Subject Re: another autovacuum scheduling thread
Date
Msg-id 3ca1e398-c787-47e9-9afc-8e298b94dac0@app.fastmail.com
In response to Re: another autovacuum scheduling thread  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: another autovacuum scheduling thread
List pgsql-hackers
On Thu, Mar 19, 2026, at 7:44 AM, David Rowley wrote:
> On Thu, 19 Mar 2026 at 22:57, Greg Burd <greg@burd.me> wrote:
>> I'm late in the review process. I know David Rowley proposed the unified scoring approach that became the foundation
>> of this patch, and I think that's a great direction. However, I'm concerned that the patch's default scoring weights
>> don't give XID-age urgency sufficient priority over dead-tuple urgency. The weight GUCs (autovacuum_vacuum_score_weight,
>> etc.) can address this, but they max at 1.0, meaning you can only reduce dead-tuple priority, not increase XID priority.
>

Hello David,

> I think that it would be good if you could state *why* you disagree
> with the proposed scoring rather than *that* you disagree. All this
> stuff was talked about around [1]. For me, I don't see what's
> particularly alarming about a table reaching
> autovaccum_max_freeze_age. That GUC is set to less than 10% of the
> total transaction ID space of where the table must be frozen. Why is
> it you think these should take priority over everything else? SLRU
> buffers are configurable since v17, so having to lookup the clog for a
> wider range of xids isn't as big an issue as it used to be, plus
> memory and L3 sizes are bigger than they used to be. Is slow clog
> lookups what you're concerned about? You didn't really say.

Fair point. Let me be more specific.

My concern isn't that wraparound vacuums are inherently alarming; I agree with you that reaching freeze_max_age isn't a
crisis. The issue is a scoring-scale problem in the gap between freeze_max_age (200M) and failsafe age (1.6B).

In that 1.4B XID window, force_vacuum tables have XID scores of 1.0–8.0 (age/freeze_max_age), while typical active
tables accumulate dead-tuple scores of 18–70+ within hours of their last vacuum. The exponential boost doesn't activate
until failsafe age, so force_vacuum tables are systematically outranked by routine bloat cleanup for what could be days
or weeks in production.
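To make the scale mismatch concrete, here's a toy computation. The formulas are illustrative assumptions, not the patch's actual code: I'm assuming xid_score = xid_age / freeze_max_age and a dead-tuple score of n_dead over the usual vacuum threshold.

```python
# Toy numbers for the scoring gap described above. Both formulas are
# illustrative assumptions, not the patch's exact code:
#   xid_score  = xid_age / autovacuum_freeze_max_age
#   dead_score = n_dead / (vacuum_threshold + scale_factor * reltuples)

FREEZE_MAX_AGE = 200_000_000

def xid_score(xid_age):
    return xid_age / FREEZE_MAX_AGE

def dead_score(n_dead, reltuples, vacuum_threshold=50, scale_factor=0.2):
    return n_dead / (vacuum_threshold + scale_factor * reltuples)

# force_vacuum table roughly halfway through the 1.4B-XID window:
print(xid_score(900_000_000))                   # 4.5
# busy 1M-row table a few hours after its last vacuum:
print(round(dead_score(4_000_000, 1_000_000)))  # 20
```

Even a table nearing the failsafe age (xid_score capped around 8.0) loses to a single busy table that has accumulated a few hours of dead tuples.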

I tried to model this in a stress test with 100 tables competing for 3 workers over 7 days; the v12 score-based
scheduler actually performed worse than OID order for wraparound exposure:

Algorithm    Avg exposure      Peak concurrent risk
───────────────────────────────────────────────────
OID          7194 ± 816 min    82 tables
Score        7892 ±   0 min    20 tables
Tiered          4 ±   0 min     5 tables

The score-based scheduler reduced peak concurrent risk (20 vs 82), which is good, but average per-table exposure
increased 10% because force_vacuum tables were starved. In 15 out of 20 runs with randomized OIDs, the score scheduler
performed worse than OID order.

> Having said that, I'd not realised that Nathan capped the new GUCs at
> 1.0. I think we should allow those to be set higher, likely at least
> to 10.0.

That would definitely help. If autovacuum_freeze_score_weight could be set to 8.0–10.0, DBAs could manually restore the
priority we want.

> Maybe we could consider adjusting the code that's setting the
> xid_score/mxid_score so that we start scaling the score aggressively
> when if (xid_age >= effective_xid_failsafe_age /
> Max(autovacuum_freeze_score_weight,1.0)) becomes true

This is clever; it would make the aggressive scaling kick in earlier when the weight is higher. At weight=8.0, you'd
get the exponential boost starting at 200M (failsafe/8) instead of 1.6B.
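Spelling out your suggestion with toy numbers (names are illustrative, not the patch's):

```python
# Sketch of the suggested threshold change: aggressive scaling starts
# once xid_age crosses effective_xid_failsafe_age / Max(weight, 1.0).
# Constant and function names are illustrative, not the patch's.

EFFECTIVE_XID_FAILSAFE_AGE = 1_600_000_000

def boost_threshold(freeze_score_weight):
    return EFFECTIVE_XID_FAILSAFE_AGE / max(freeze_score_weight, 1.0)

print(int(boost_threshold(1.0)))  # 1600000000 -- current behavior
print(int(boost_threshold(2.0)))  # 800000000
print(int(boost_threshold(8.0)))  # 200000000 -- equals freeze_max_age
```

The Max(weight, 1.0) clamp also keeps weights below 1.0 from pushing the threshold past the failsafe age.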

Both of these approaches would work. The tiered-sorting proposal was motivated by simplicity. The code already treats
wraparound as categorically different (force_vacuum bypasses av_enabled, triggers emergency behavior, can't be disabled
per-table). Making the sort order reflect that same categorical distinction felt more aligned with the existing logic
than trying to tune scoring weights to create the same effect.
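The whole tiered proposal reduces to a two-part sort key, sketched here with illustrative field names:

```python
# Minimal sketch of tiered sorting: force_vacuum tables form a strictly
# higher tier; within a tier, highest score first. Field names are
# illustrative, not the patch's.

def tiered_key(t):
    # Tier 0 (force_vacuum) sorts ahead of tier 1; negate score so
    # higher scores come first within a tier.
    return (0 if t["force_vacuum"] else 1, -t["score"])

tables = [
    {"name": "bloated",  "force_vacuum": False, "score": 45.0},
    {"name": "wrapping", "force_vacuum": True,  "score": 1.2},
    {"name": "idle",     "force_vacuum": False, "score": 0.3},
]
print([t["name"] for t in sorted(tables, key=tiered_key)])
# ['wrapping', 'bloated', 'idle']
```

No weight tuning is involved: a force_vacuum table outranks any bloat score by construction.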

But I'm not religious about it, and I don't have a strong intuition for which would be easier for DBAs to grok and use,
or for us to maintain, so if you do I'll follow your lead. If raising the GUC caps to 10.0 and adding the / Max(weight,
1.0) scaling factor achieves the same goal with less conceptual change, that works for me. The key issue is ensuring
force_vacuum tables don't get starved by high-scoring bloat work, and either approach solves that.

> Then, if people
> want to play it safer, then they can set
> autovacuum_freeze_score_weight = 2.0 and have the aggressive scaling
> kick in at 800 million, or whatever half of effective_xid_failsafe_age
> is set to. You could set yours to 8.0, if you really want tables over
> autovacuum_freeze_max_age to take priority over everything else. I
> just don't see or understand the reason why you'd want to.
>
> It's a fairly common misconception that a wraparound vacuum is
> something to be alarmed about. Maybe you've fallen for that?

Not alarmed, but I think the system should process them promptly once they're flagged as force_vacuum, rather than
letting them queue behind routine work for potentially days. My simulation suggests the current default weights don't
achieve that.

> I recall
> a few proposals to adjust the wording that's shown in pg_stat_activity
> to make them seem less alarming.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvqobtKMwJbhKB_c%3D3-TM%3DTgS3bcuvzcWMm3ee1c0mz9hw%40mail.gmail.com

Attached is autovacuum_simulation_v3.py, which implements your suggestions as a fourth mode called 'dynamic'. It has:

- raised GUC caps: the dynamic mode uses weight=8.0
- exponential scaling at: dynamic_xid_threshold = c.effective_xid_failsafe / max(weight, 1.0)

With weight=8.0, this means the v12 exponential boost starts at 1.6B XIDs while the dynamic mode's starts at 200M XIDs
(1.6B / 8.0). Did I capture your suggestion accurately?
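In sketch form, the dynamic mode's XID scoring amounts to something like the following; the exponential boost shape here is illustrative, not the patch's exact formula:

```python
# Simplified sketch of a dynamic-mode XID score: plain ratio below the
# weight-scaled threshold, aggressive (exponential) growth above it.
# The boost shape is illustrative, not the patch's exact formula.

FREEZE_MAX_AGE = 200_000_000
EFFECTIVE_XID_FAILSAFE = 1_600_000_000

def dynamic_xid_score(xid_age, weight=8.0):
    base = xid_age / FREEZE_MAX_AGE
    threshold = EFFECTIVE_XID_FAILSAFE / max(weight, 1.0)
    if xid_age >= threshold:
        base *= 2.0 ** (xid_age / threshold)  # boost grows past threshold
    return base

print(dynamic_xid_score(100_000_000))  # 0.5 -- below threshold, plain ratio
print(dynamic_xid_score(400_000_000))  # 8.0 -- 2.0 * 2**2 past the 200M mark
```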

I ran the simulation comparing your dynamic scaling approach (weight=8.0, exponential boost at 200M) against tiered
sorting. The good news: dynamic scaling is a massive improvement over v12; wraparound exposure drops to about 38 minutes.

The difference between dynamic and tiered shows up in tables that cross freeze_max_age during the simulation: with
score-only sorting, they still compete with high-scoring active tables until their XID ages grow large enough, resulting
in wraparound exposures of 24-102 minutes.

Tiered sorting processes them immediately upon crossing the threshold, keeping exposure at 3-5 minutes regardless of
when they cross.

Both approaches solve the v12 problem and IMO all three are an improvement over what we ship today. I think some form
of this patch should make it into v19.

best.

-greg


$ ./autovacuum_simulation_v3.py
================================================================================
FOUR-WAY AUTOVACUUM SCHEDULING COMPARISON
OID vs v12 Score vs Dynamic Scaling (weight=8.0) vs Tiered Sort
================================================================================

Config: 3 workers, 7-day sim, 60s steps, 20 runs
Tables: 5 critical + 15 aging + 80 active = 100
freeze_max_age = 200,000,000
Estimated runtime: 3-8 minutes

 Run      OID avg    Score avg  Dynamic avg   Tiered avg
--------------------------------------------------------
   1       7222m       7892m         38m          4m
   2       7642m       7892m         38m          4m
   3       7961m       7892m         38m          4m
   4       6333m       7892m         38m          4m
   5       8110m       7892m         38m          4m
   6       6359m       7892m         38m          4m
   7       6629m       7892m         38m          4m
   8       8526m       7892m         38m          4m
   9       6385m       7892m         38m          4m
  10       6813m       7892m         38m          4m
  11       8588m       7892m         38m          4m
  12       7261m       7892m         38m          4m
  13       6682m       7892m         38m          4m
  14       8035m       7892m         38m          4m
  15       5667m       7892m         38m          4m
  16       7595m       7892m         38m          4m
  17       6394m       7892m         38m          4m
  18       6686m       7892m         38m          4m
  19       7819m       7892m         38m          4m
  20       7178m       7892m         38m          4m

========================================================================
AGGREGATE RESULTS
========================================================================

Avg exposure per run (minutes):
  OID       :    7194 ± 816      (min=5667, max=8588)
  Score     :    7892 ± 0        (min=7892, max=7892)
  Dynamic   :      38 ± 0        (min=38, max=38)
  Tiered    :       4 ± 0        (min=4, max=4)

Peak concurrent force_vacuum tables:
  OID       :      82 ± 3        (min=79, max=88)
  Score     :      20 ± 0        (min=20, max=20)
  Dynamic   :       5 ± 0        (min=5, max=5)
  Tiered    :       5 ± 0        (min=5, max=5)

Pairwise wins (lower avg exposure = better):
  Score beats OID: 5/20   loses: 15/20   ties: 0/20
  Dynamic beats OID: 20/20   loses: 0/20   ties: 0/20
  Tiered beats OID: 20/20   loses: 0/20   ties: 0/20
  Dynamic beats Score: 20/20   loses: 0/20   ties: 0/20
  Tiered beats Score: 20/20   loses: 0/20   ties: 0/20
  Tiered beats Dynamic: 20/20   loses: 0/20   ties: 0/20

Variance (std dev of avg exposure across runs):
  OID       : 816 min
  Score     : 0 min
  Dynamic   : 0 min
  Tiered    : 0 min

Per-table mean exposure (minutes):
  Table                      OID           Score         Dynamic          Tiered
  ------------------------------------------------------------------------------
  critical_0        7564±4470    10080±0           7±0           7±0
  critical_1        8577±3671    10080±0           7±0           7±0
  critical_2        9073±3099     8938±0           3±0           3±0
  critical_3        8581±3661        4±0           4±0           4±0
  critical_4        9167±2826        3±0           3±0           3±0
  aging_0           7253±4256     9648±0          95±0           4±0
  aging_1           7324±3675     9114±0          24±0           4±0
  aging_2           6865±3517     8579±0          86±0           5±0
  aging_3           7253±3714     8851±0          18±0           4±0
  aging_4           6018±4014     8579±0          47±0           4±0
  aging_5           6856±2951     8064±0          90±0           4±0
  aging_6           6799±3482     8496±0          45±0           3±0
  aging_7           5765±4318     8853±0          21±0           4±0
  aging_8           7091±3227     8644±0           6±0           4±0
  aging_9           6358±2738     7479±0         102±0           4±0
  aging_10          7106±3620     8870±0          36±0           4±0
  aging_11          7175±3673     8965±0          65±0           4±0
  aging_12          6499±3299     8106±0          51±0           4±0
  aging_13          6080±3108     7595±0          28±0           4±0
  aging_14          6479±3913     8891±0          27±0           4±0

========================================================================
Completed in 149 seconds (2.5 minutes)
========================================================================

Generating visualization...
  ✓ output/four_way_comparison.png

Done.
