Thread: Re: Eagerly scan all-visible pages to amortize aggressive vacuum

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Andres Freund

Date:

07 November 2024, 18:42:29

Hi,

On 2024-11-01 19:35:22 -0400, Melanie Plageman wrote:
> It implements a new "semi-aggressive" vacuum. Semi-aggressive vacuums
> eagerly scan some number of all-visible but not all-frozen pages in
> hopes of freezing them. All-visible pages that are eagerly scanned and
> set all-frozen in the visibility map are considered successful eager
> scans and those not frozen are considered failed eager scans.

I think it's worth mentioning that this would be useful to eventually be able
to mark pages as all-visible during on-access-pruning, to allow for
index-only-scans (and also faster seqscans). Today that'd have the danger of
increasing the "anti-wraparound vacuum debt" unconscionably, but with this
patch that'd be addressed.


> Because we want to amortize our eager scanning across a few vacuums,
> we cap the maximum number of successful eager scans to a percentage of
> the total number of all-visible but not all-frozen pages in the table
> (currently 20%).

One thing worth mentioning around here seems that we currently can't
partially-aggressively freeze tuples that are "too young" and how that
interacts with everything else.


> To demonstrate the results, I ran an append-only workload run for over
> 3 hours on master and with my patch applied. The patched version of
> Postgres amortized the work of freezing the all-visible but not
> all-frozen pages nicely. The first aggressive vacuum with the patch
> was 44 seconds and on master it was 1201 seconds.

Oops.


> patch
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 44 seconds (msecs: 44661).
>     pages: 0 removed, 27425085 remain, 1104095 scanned (4.03% of
> total), 709889 eagerly scanned
>     frozen: 316544 pages from table (1.15% of total) had 17409920
> tuples frozen. 316544 pages set all-frozen in the VM
>     I/O timings: read: 1160.599 ms, write: 2461.205 ms. approx time
> spent in vacuum delay: 16230 ms.
>     buffer usage: 1105630 hits, 1111898 reads, 646229 newly dirtied,
> 1750426 dirtied.
>     WAL usage: 1027099 records, 316566 full page images, 276209780 bytes.
>
> master
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 1201 seconds (msecs: 1201487).
>     pages: 0 removed, 27515348 remain, 15800948 scanned (57.43% of
> total), 15098257 eagerly scanned
>     frozen: 15096384 pages from table (54.87% of total) had 830247896
> tuples frozen. 15096384 pages set all-frozen in the VM
>     I/O timings: read: 246537.348 ms, write: 73000.498 ms. approx time
> spent in vacuum delay: 349166 ms.
>     buffer usage: 15798343 hits, 15813524 reads, 15097063 newly
> dirtied, 31274333 dirtied.
>     WAL usage: 30733564 records, 15097073 full page images, 11789257631 bytes.
>
> This is because, with the patch, the freezing work is being off-loaded
> to earlier vacuums.
>
> In the attached chart.png, you can see the vm_page_freezes climbing
> steadily with the patch, whereas on master, there are sudden spikes
> aligned with the aggressive vacuums. You can also see that the number
> of pages that are all-visible but not all-frozen grows steadily on
> master until the aggressive vacuum. This is vacuum's "backlog" of
> freezing work.

What's the reason for all-visible-but-not-all-frozen to increase to a higher
value initially than where it later settles?


> In this benchmark, the TPS is rate-limited, but using pgbench
> per-statement reports, I calculated that the P99 latency for this
> benchmark is 16 ms on master and 1 ms with the patch. Vacuuming pages
> sooner decreases vacuum reads and doing the freezing work spread over
> more vacuums improves P99 transaction latency.

Nice!


> Below is the comparative WAL volume, checkpointer and background
> writer writes, reads and writes done by all other backend types, time
> spent vacuuming in milliseconds, and p99 latency. Notice that overall
> vacuum IO time is substantially lower with the patch.
>
>    version     wal  cptr_bgwriter_w   other_rw  vac_io_time  p99_lat
>     patch   770 GB          5903264  235073744   513722         1
>     master  767 GB          5908523  216887764  1003654        16

Hm. It's not clear to me why other_rw is higher with the patch? After all,
given the workload, there's no chance of unnecessarily freezing tuples? Is
that just because at the end of the benchmark there's leftover work?



> From 69b517f34caf39ad814691d6412c68d54e852990 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 10:53:37 -0400
> Subject: [PATCH v1 1/9] Rename LVRelState->frozen_pages
>
> Rename frozen_pages to tuple_freeze_pages in LVRelState, the struct used
> for tracking state during vacuuming of a heap relation. frozen_pages
> sounds like it includes every all-frozen page. That is a misnomer. It
> does not include pages with already frozen tuples. It also includes
> pages that are not actually all-frozen.

Hm. Is tuple_freeze_pages that much clearer, it could also just indicate pages
that already were frozen. How about "newly_frozen_pages"?


> Subject: [PATCH v1 2/9] Make visibilitymap_set() return previous state of

LGTM.



> From c317a272713bb833f7f2761a5be1924f8e1bdb4d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 31 Oct 2024 18:19:18 -0400
> Subject: [PATCH v1 3/9] Count pages set all-visible and all-frozen in VM
>  during vacuum
>
> Vacuum already counts and logs pages with newly frozen tuples. Count and
> log pages set all-frozen in the VM too. This includes pages that are
> empty before or after vacuuming.
>
> While we are at it, count and log the number of pages vacuum set
> all-visible. Pages that are all-visible but not all-frozen are debt for
> future aggressive vacuums. The newly all-visible and all-frozen counts
> give us visiblity into the rate at which this debt is being accrued and
> paid down.

Hm. Why is it interesting to log VM changes separately from the state changes
of underlying pages? Wouldn't it e.g. be more intersting to log the number of
empty pages that vacuum [re-]discovered? I've a bit hard time seeing how a
user could take advantage of this.


> From 67781cc2511bb7d62ccc9461f1787272820abcc4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:07:50 -0400
> Subject: [PATCH v1 4/9] Replace uses of blkno local variable in
>  lazy_scan_heap()

Largely LGTM, but I'm not sure that it's worth having as a separate commit.


> @@ -1114,7 +1117,6 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>              ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
>              vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>          }
> -        *blkno = vacrel->rel_pages;
>          return false;
>      }

I'd seat *blkno to InvalidBlockNumber or such, that'd make misuse more
apparent than having it set to some random older block.


> From 67b5565ad57d3b196695f85811dde2044ba79f3e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:14:24 -0400
> Subject: [PATCH v1 5/9] Move vacuum VM buffer release
>
> The VM buffer for the next unskippable block can be released after the
> main loop in lazy_scan_heap(). Doing so de-clutters
> heap_vac_scan_next_block() and opens up more refactoring options.

That's vague...


> From 8485dc400b3d4e9f895170af4f5fb1bb959b8495 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:36:58 -0400
> Subject: [PATCH v1 6/9] Remove superfluous next_block local variable in vacuum
>  code
>
> Reduce the number of block related variables in lazy_scan_heap() and its
> helpers by removing the next_block local variable from
> heap_vac_scan_next_block().

I don't mind this change, but I also don't get how it's related to anything
else here or why it's really better than the status quo.


> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 4b1eadea1f2..52c9d49f2b1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1112,19 +1112,17 @@ static bool
>  heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>                           bool *all_visible_according_to_vm)
>  {
> -    BlockNumber next_block;
> -
>      /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
> -    next_block = vacrel->current_block + 1;
> +    vacrel->current_block++;

I realize this isn't introduced in this commit, but darn, that's ugly.


> From 78ad9e022b95e024ff5bfa96af78e9e44730c970 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:42:10 -0400
> Subject: [PATCH v1 7/9] Make heap_vac_scan_next_block() return BlockNumber


> @@ -857,7 +857,8 @@ lazy_scan_heap(LVRelState *vacrel)
>      vacrel->next_unskippable_allvis = false;
>      vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>
> -    while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
> +    while (BlockNumberIsValid(blkno = heap_vac_scan_next_block(vacrel,
> +                                                               &all_visible_according_to_vm)))

Personally I'd write this as

while (true)
{
    BlockNumber blkno;

    blkno = heap_vac_scan_next_block(vacrel, ...);

    if (!BlockNumberIsValid(blkno))
       break;

Mostly because it's good to use more minimal scopes when possible,
particularly when previously the scope intentionally was larger. But also
partially because I don't love variable assignments inside a macro call,
inside a while().



> From 818d1c3b068c6705611256cfc3eb1f10bdc0b684 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 1 Nov 2024 18:25:05 -0400
> Subject: [PATCH v1 8/9] WIP: Add more general summary to vacuumlazy.c
>
> Currently the summary at the top of vacuumlazy.c provides some specific
> details related to the new dead TID storage in 17. I plan to add a
> summary and maybe some sub-sections to contextualize it.

I like this idea. It's hard to understand vacuumlazy.c without already
understanding vacuumlazy.c, which isn't a good situation.


> ---
>  src/backend/access/heap/vacuumlazy.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 7ce69953ba0..15a04c6b10b 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -3,6 +3,17 @@
>   * vacuumlazy.c
>   *      Concurrent ("lazy") vacuuming.
>   *
> + * Heap relations are vacuumed in three main phases. In the first phase,
> + * vacuum scans relation pages, pruning and freezing tuples and saving dead
> + * tuples' TIDs in a TID store. If that TID store fills up or vacuum finishes
> + * scanning the relation, it progresses to the second phase: index vacuuming.
> + * After index vacuuming is complete, vacuum scans the blocks of the relation
> + * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
> + * that space for future tuples. Finally, vacuum may truncate the relation if
> + * it has emptied pages at the end. XXX: this summary needs work.

Yea, at least we ought to mention that the phasing can be different when there
are no indexes and that the later phases can heuristically be omitted when
there aren't enough dead items.



> From f21f0bab1dbe675be4b4dddcb2eea486d8a69d36 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 12:15:08 -0400
> Subject: [PATCH v1 9/9] Eagerly scan all-visible pages to amortize aggressive
>  vacuum
>
> Introduce semi-aggressive vacuums, which scan some of the all-visible
> but not all-frozen pages in the relation to amortize the cost of an
> aggressive vacuum.

I wonder if "aggressive" is really the right terminology going
forward... Somehow it doesn't seem particularly descriptive anymore if, in
many workloads, almost all vacuums are going to be aggressive-ish.


> Because the goal is to freeze these all-visible pages, all-visible pages
> that are eagerly scanned and set all-frozen in the visibility map are
> considered successful eager scans and those not frozen are considered
> failed eager scans.
>
> If too many eager scans fail in a row, eager scanning is temporarily
> suspended until a later portion of the relation. Because the goal is to
> amortize aggressive vacuums, we cap the number of successes as well.
> Once we reach the maximum number of blocks successfully eager scanned
> and frozen, the semi-aggressive vacuum is downgraded to an unaggressive
> vacuum.
> ---
>  src/backend/access/heap/vacuumlazy.c | 327 +++++++++++++++++++++++----
>  src/backend/commands/vacuum.c        |  20 +-
>  src/include/commands/vacuum.h        |  27 ++-
>  src/tools/pgindent/typedefs.list     |   1 +
>  4 files changed, 326 insertions(+), 49 deletions(-)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 15a04c6b10b..adabb5ff5f1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -12,6 +12,40 @@
>   * that space for future tuples. Finally, vacuum may truncate the relation if
>   * it has emptied pages at the end. XXX: this summary needs work.
>   *
> + * Relation Scanning:
> + *
> + * Vacuum scans the heap relation, starting at the beginning and progressing
> + * to the end, skipping pages as permitted by their visibility status, vacuum
> + * options, and the aggressiveness level of the vacuum.
> + *
> + * When page skipping is enabled, unaggressive vacuums may skip scanning pages
> + * that are marked all-visible in the visibility map. We may choose not to
> + * skip pages if the range of skippable pages is below SKIP_PAGES_THRESHOLD.
> + *
> + * Semi-aggressive vacuums will scan skippable pages in an effort to freeze
> + * them and decrease the backlog of all-visible but not all-frozen pages that
> + * have to be processed to advance relfrozenxid and avoid transaction ID
> + * wraparound.
> + *
> + * We count it as a success when we are able to set an eagerly scanned page
> + * all-frozen in the VM and a failure when we are not able to set the page
> + * all-frozen.
> + *
> + * Because we want to amortize the overhead of freezing pages over multiple
> + * vacuums, we cap the number of successful eager scans to
> + * EAGER_SCAN_SUCCESS_RATE of the number of all-visible but not all-frozen
> + * pages at the beginning of the vacuum.
> + *
> + * On the assumption that different regions of the table are likely to contain
> + * similarly aged data, we use a localized failure cap instead of a global cap
> + * for the whole relation. The failure count is reset on each region of the
> + * table, comprised of RELSEG_SIZE blocks (or 1/4 of the table size for a
> + * small table). In each region, we tolerate MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * before suspending eager scanning until the end of the region.

I'm a bit surprised to see such large regions. Why not something finer, in the
range of a few megabytes?  The FSM steers new tuples quite aggressively to the
start of the table, which means that in many workloads there will be old and
new data interspersed at the start of the table. Using RELSEG_SIZE sized
regions for semi-aggressive vacuuming will mean that we'll often not do any
semi-aggressive processing beyond the start of the relation, as we'll reach
the failure rate quickly.

I also find it layering-wise a bit weird to use RELSEG_SIZE, that's really imo
is just an md.c concept.


> +/*
> + * Semi-aggressive vacuums eagerly scan some all-visible but not all-frozen
> + * pages. Since our goal is to freeze these pages, an eager scan that fails to
> + * set the page all-frozen in the VM is considered to have "failed".
> + *
> + * On the assumption that different regions of the table tend to have
> + * similarly aged data, once we fail to freeze MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * blocks, we suspend eager scanning until vacuum has progressed to another
> + * region of the table with potentially older data.
> + */
> +#define MAX_SUCCESSIVE_EAGER_SCAN_FAILS 1024

Can this really be a constant, given that the semi-aggressive regions are
shrunk for small tables?


> +/*
> + * An eager scan of a page that is set all-frozen in the VM is considered
> + * "successful". To spread out eager scanning across multiple semi-aggressive
> + * vacuums, we limit the number of successful eager scans (as well as the
> + * number of failures). The maximum number of successful eager scans is
> + * calculated as a ratio of the all-visible but not all-frozen pages at the
> + * beginning of the vacuum.
> + */
> +#define EAGER_SCAN_SUCCESS_RATE 0.2
>  typedef struct LVRelState

There imo should be newlines between the define and LVRelState.

>  {
>      /* Target heap relation and its indexes */
> @@ -153,8 +208,22 @@ typedef struct LVRelState
>      BufferAccessStrategy bstrategy;
>      ParallelVacuumState *pvs;
>
> -    /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
> -    bool        aggressive;
> +    /*
> +     * Whether or not this is an aggressive, semi-aggressive, or unaggressive
> +     * VACUUM. A fully aggressive vacuum must set relfrozenxid >= FreezeLimit
> +     * and therefore must scan every unfrozen tuple. A semi-aggressive vacuum
> +     * will scan a certain number of all-visible pages until it is downgraded
> +     * to an unaggressive vacuum.
> +     */
> +    VacAggressive aggressive;

Hm. A few comments:

- why is VacAggressive defined in vacuum.h? Isn't this fairly tightly coupled
  to heapam?
- Kinda feels like the type should be named VacAggressivness or such?


> +    /*
> +     * A semi-aggressive vacuum that has failed to freeze too many eagerly
> +     * scanned blocks in a row suspends eager scanning. unaggressive_to is the
> +     * block number of the first block eligible for resumed eager scanning.
> +     */
> +    BlockNumber unaggressive_to;

What's it set to otherwise? What is it set to in aggressive vacuums?


> @@ -227,6 +296,26 @@ typedef struct LVRelState
>      BlockNumber next_unskippable_block; /* next unskippable block */
>      bool        next_unskippable_allvis;    /* its visibility status */
>      Buffer        next_unskippable_vmbuffer;    /* buffer containing its VM bit */
> +
> +    /*
> +     * Count of skippable blocks eagerly scanned as part of a semi-aggressive
> +     * vacuum (for logging only).
> +     */
> +    BlockNumber eager_scanned;

I'd add _pages or such.


> +    /*
> +     * The number of eagerly scanned blocks a semi-aggressive vacuum failed to
> +     * freeze (due to age) in the current eager scan region. It is reset each
> +     * time we hit MAX_SUCCESSIVE_EAGER_SCAN_FAILS.
> +     */
> +    BlockNumber eager_scanned_failed_frozen;
> +
> +    /*
> +     * The remaining number of blocks a semi-aggressive vacuum will consider
> +     * eager scanning. This is initialized to EAGER_SCAN_SUCCESS_RATE of the
> +     * total number of all-visible but not all-frozen pages.
> +     */
> +    BlockNumber remaining_eager_scan_successes;

I think it might look better if you just bundled these into a struct like

      struct
      {
        BlockNumber scanned;
        BlockNumber failed_frozen;
        BlockNumber remaining_successes;
      } eager_pages;


> @@ -471,24 +569,49 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>           * Force aggressive mode, and disable skipping blocks using the
>           * visibility map (even those set all-frozen)
>           */
> -        vacrel->aggressive = true;
> +        vacrel->aggressive = VAC_AGGRESSIVE;
>          skipwithvm = false;
>      }
>
>      vacrel->skipwithvm = skipwithvm;
> +    vacrel->eager_scanned = 0;
> +    vacrel->eager_scanned_failed_frozen = 0;
> +
> +    /*
> +     * Even if we successfully freeze them, we want to cap the number of
> +     * eagerly scanned blocks so that we spread out the overhead across
> +     * multiple vacuums. remaining_eager_scan_successes is only used by
> +     * semi-aggressive vacuums.
> +     */

Somehow the "them" feels like a dangling pointer that's initialized too late ;)


> +    visibilitymap_count(rel, &orig_rel_allvisible, &orig_rel_allfrozen);
> +    vacrel->remaining_eager_scan_successes =
> +        (BlockNumber) (EAGER_SCAN_SUCCESS_RATE * (orig_rel_allvisible - orig_rel_allfrozen));
>
>      if (verbose)
>      {
> -        if (vacrel->aggressive)
> -            ereport(INFO,
> -                    (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> -        else
> -            ereport(INFO,
> -                    (errmsg("vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> +        switch (vacrel->aggressive)
> +        {
> +            case VAC_UNAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_AGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_SEMIAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("semiaggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +        }

Wonder if we should have a function that returns the aggressiveness of a
vacuum as an already translated string. There are other places where we emit
the aggressiveness as part of a message, and it's pretty silly to duplicate
most of the message.


> @@ -545,11 +668,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>       * Non-aggressive VACUUMs may advance them by any amount, or not at all.
>       */
>      Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
> -           TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
> +           TransactionIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                         vacrel->cutoffs.FreezeLimit :
>                                           vacrel->cutoffs.relfrozenxid,
>                                           vacrel->NewRelfrozenXid));
>      Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
> -           MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
> +           MultiXactIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                       vacrel->cutoffs.MultiXactCutoff :
>                                         vacrel->cutoffs.relminmxid,
>                                         vacrel->NewRelminMxid));
>      if (vacrel->skippedallvis)

These are starting to feel somewhat complicated. Wonder if it'd be easier to
read if they were written as normal ifs.


> +/*
> + * Helper to decrement a block number to 0 without wrapping around.
> + */
> +static void
> +decrement_blkno(BlockNumber *block)
> +{
> +    if ((*block) > 0)
> +        (*block)--;
> +}

Huh.


> @@ -956,11 +1094,23 @@ lazy_scan_heap(LVRelState *vacrel)
>          if (!got_cleanup_lock)
>              LockBuffer(buf, BUFFER_LOCK_SHARE);
>
> +        page_freezes = vacrel->vm_page_freezes;
> +
>          /* Check for new or empty pages before lazy_scan_[no]prune call */
>          if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, !got_cleanup_lock,
>                                     vmbuffer))
>          {
>              /* Processed as new/empty page (lock and pin released) */
> +
> +            /* count an eagerly scanned page as a failure or a success */
> +            if (was_eager_scanned)
> +            {
> +                if (vacrel->vm_page_freezes > page_freezes)
> +                    decrement_blkno(&vacrel->remaining_eager_scan_successes);
> +                else
> +                    vacrel->eager_scanned_failed_frozen++;
> +            }
> +
>              continue;
>          }

Maybe I'm confused, but ISTM that remaining_eager_scan_successes shouldn't
actually be a BlockNumber, given it doesn't actually indicates a specific
block...

I don't understand why we would sometimes want to treat empty pages as a
failure? They can't fail to be frozen, no?

I'm not sure it makes sense to count them as progress towards the success
limit either - afaict we just rediscovered free space is the table. That's imo
separate from semi-aggressive freezing.


Storing page_freezes as a copy of vacrel->vm_page_freezes and then checking if
that increased feels like a somewhat ugly way of tracking if freezing
happend. There's no more direct way?

Why is decrement_blkno() needed? How can we ever get into negative territory?
Shouldn't eager scanning have been disabled when
remaining_eager_scan_successes reaches zero and thus prevent
remaining_eager_scan_successes from ever going below zero?



> @@ -1144,7 +1310,65 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>           */
>          bool        skipsallvis;
>
> -        find_next_unskippable_block(vacrel, &skipsallvis);
> +        /*
> +         * Figure out if we should disable eager scan going forward or
> +         * downgrade to an unaggressive vacuum altogether.
> +         */
> +        if (vacrel->aggressive == VAC_SEMIAGGRESSIVE)
> +        {
> +            /*
> +             * If we hit our success limit, there is no need to eagerly scan
> +             * any additional pages. Downgrade the vacuum to unaggressive.
> +             */
> +            if (vacrel->remaining_eager_scan_successes == 0)
> +                vacrel->aggressive = VAC_UNAGGRESSIVE;
> +
> +            /*
> +             * If we hit the max number of failed eager scans for this region
> +             * of the table, figure out where the next eager scan region
> +             * should start. Eager scanning is effectively disabled until we
> +             * scan a block in that new region.
> +             */
> +            else if (vacrel->eager_scanned_failed_frozen >=
> +                     MAX_SUCCESSIVE_EAGER_SCAN_FAILS)
> +            {
> +                BlockNumber region_size,
> +                            offset;
> +

Why are we doing this logic here, rather than after incrementing
eager_scanned_failed_frozen? Seems that'd limit the amount of times we need to
run through this logic substantially?


> +                /*
> +                 * On the assumption that different regions of the table are
> +                 * likely to have similarly aged data, we will retry eager
> +                 * scanning again later. For a small table, we'll retry eager
> +                 * scanning every quarter of the table. For a larger table,
> +                 * we'll consider eager scanning again after processing
> +                 * another region's worth of data.
> +                 *
> +                 * We consider the region to start from the first failure, so
> +                 * calculate the block to restart eager scanning from there.
> +                 */
> +                region_size = Min(RELSEG_SIZE, (vacrel->rel_pages / 4));
> +
> +                offset = vacrel->eager_scanned_failed_frozen % region_size;
> +
> +                Assert(vacrel->eager_scanned > 0);
> +
> +                vacrel->unaggressive_to = vacrel->current_block + (region_size - offset);
> +                vacrel->eager_scanned_failed_frozen = 0;
> +            }
> +        }

I'd put the logic for this in a separate function, feels complicated and
independent enough for it to be worth not having it in heap_vac_scan_next_block().


> @@ -1199,6 +1423,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * The next unskippable block and its visibility information is updated in
>   * vacrel.
>   *
> + * consider_eager_scan indicates whether or not we should consider scanning
> + * all-visible but not all-frozen blocks. was_eager_scanned is set to true if
> + * we decided to eager scan a block. In this case, next_unskippable_block is
> + * set to that block number.
> + *
>   * Note: our opinion of which blocks can be skipped can go stale immediately.
>   * It's okay if caller "misses" a page whose all-visible or all-frozen marking
>   * was concurrently cleared, though.  All that matters is that caller scan all
> @@ -1208,7 +1437,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * to skip such a range is actually made, making everything safe.)
>   */
>  static void
> -find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> +find_next_unskippable_block(
> +                            LVRelState *vacrel,
> +                            bool consider_eager_scan,
> +                            bool *was_eager_scanned,
> +                            bool *skipsallvis)

I don't think we ever have function definitions without a parameter on the
line with the function name.

Hm - isn't consider_eager_scan potentially outdated after
find_next_unskippable_block() iterated over a bunch of blocks? If
consider_eager_scan is false, this could very well go into the next
"semi-aggressive region", where consider_eager_scan should have been
re-enabled, no?


Greetings,

Andres Freund

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

15 December 2024, 09:05:27

On Fri, Dec 13, 2024 at 5:53 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, Nov 7, 2024 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> > Hi,
>
> Thanks for the review!
> Attached v2 should address your feedback and also fixes a few bugs with v1.
>
> I've still yet to run very long-running benchmarks. I did start running more
> varied benchmark scenarios -- but all still under two hours. So far, the
> behavior is as expected.
>
> > On 2024-11-01 19:35:22 -0400, Melanie Plageman wrote:
> > > Because we want to amortize our eager scanning across a few vacuums,
> > > we cap the maximum number of successful eager scans to a percentage of
> > > the total number of all-visible but not all-frozen pages in the table
> > > (currently 20%).
> >
> > One thing worth mentioning around here seems that we currently can't
> > partially-aggressively freeze tuples that are "too young" and how that
> > interacts with everything else.
>
> I'm not sure I know what you mean. Are you talking about how we don't freeze
> tuples that are visible to everyone but younger than the freeze limit?
>

FWIW that was my interpretation of his statement, though I had a
clarifying question around this topic myself, which is, from a user
perspective when would we expect to see these eager vacuums? ISTM we
would be doing 'normal vacuums' prior to vacuum_freeze_min_age, and
'aggressive vacuums' after (autovacuum_freeze_max_age -
vacuum_freeze_min_age), so the expectation is that 'eager vacuums'
would fall into the ~50 million transaction window between those two
points (assuming defaults, which admittedly I don't use). Does that
sound right?

> > > In the attached chart.png, you can see the vm_page_freezes climbing
> > > steadily with the patch, whereas on master, there are sudden spikes
> > > aligned with the aggressive vacuums. You can also see that the number
> > > of pages that are all-visible but not all-frozen grows steadily on
> > > master until the aggressive vacuum. This is vacuum's "backlog" of
> > > freezing work.
> >
> > What's the reason for all-visible-but-not-all-frozen to increase to a higher
> > value initially than where it later settles?
>
> My guess is that it has to do with shorter, more frequent vacuums at the
> beginning of the benchmark when the relation is smaller (and we haven't
> exceeded shared buffers or memory yet). They are setting pages all-visible, but
> we haven't used up enough xids yet to qualify for an eager vacuum.
>
> The peak of AVnAF pages aligns with the start of the first eager vacuum. We
> don't do any eager scanning until we are sure there is some data requiring
> freeze (see this criteria):
>
>     if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
>         TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
>                                       vacrel->cutoffs.FreezeLimit))
>
> Once we have used up enough xids to qualify for the first eager vacuum, the
> number of AVnAF pages starts to go down.
>
> It would follow from this theory that we would see a build-up like this after
> each relfrozenxid advancement (so after the next aggressive vacuum).
>
> But I think we don't see this because the vacuums are longer by the time
> aggressive vacuums have started, so we end up using up enough XIDs between
> vacuums to qualify for eager vacuums on vacuums after the aggressive vacuum.
>
> That is just my theory though.
>

I like your theory but it's a little too counterintuitive for me :-)

I would expect we'd see a change in the vacuum time & rate after the
first aggressive scan, which incidentally your graph does show for
master, but it looks a bit too smooth on your original patchset. I
guess there could be a sweet spot where the rate of changes fit
perfectly with regards to the line between lazy / eager vacuums, but
hard to imagine you were that lucky.

> > > Below is the comparative WAL volume, checkpointer and background
> > > writer writes, reads and writes done by all other backend types, time
> > > spent vacuuming in milliseconds, and p99 latency. Notice that overall
> > > vacuum IO time is substantially lower with the patch.
> > >
> > >    version     wal  cptr_bgwriter_w   other_rw  vac_io_time  p99_lat
> > >     patch   770 GB          5903264  235073744   513722         1
> > >     master  767 GB          5908523  216887764  1003654        16
> >
> > Hm. It's not clear to me why other_rw is higher with the patch? After all,
> > given the workload, there's no chance of unnecessarily freezing tuples? Is
> > that just because at the end of the benchmark there's leftover work?
>
> So other_rw is mostly client backend and autovacuum reads and writes. It is
> higher with the patch because there are actually more vacuum reads and writes
> with the patch than on master. However the autovacuum worker read and write
> time is much lower. Those blocks are more often in shared buffers, I would
> guess.
>
<snip>
> From e36b4fac345be44954410c4f0e61467dc0f49a72 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 12 Dec 2024 16:44:37 -0500
> Subject: [PATCH v2 10/10] Eagerly scan all-visible pages to amortize
> aggressive vacuum
>
> @@ -27,11 +27,37 @@
>  * to the end, skipping pages as permitted by their visibility status, vacuum
>  * options, and the eagerness level of the vacuum.
>  *
> + * There are three vacuum eagerness levels: normal vacuum, eager vacuum, and
> + * aggressive vacuum.
> + *
>  * When page skipping is enabled, non-aggressive vacuums may skip scanning
> - * pages that are marked all-visible in the visibility map. We may choose not
> + * pages that are marked all-visible in the visibility map. It may choose not
>  * to skip pages if the range of skippable pages is below
>  * SKIP_PAGES_THRESHOLD.
>  *

I find the above confusing since page skipping is the regular activity
but referred to in the negative, and because you use the term
"non-aggressive vacuums" which in prior releases only mapped to
"normal" vacuums, but now would map to both "normal" and "eager"
vacuums, and it isn't clear that is desired (in my head anyway). Does
the following still convey what you meant (and hopefully work better
with the paragraphs that follow)?

When page skipping is not disabled, a normal vacuum may skip scanning
pages that are marked all-visible in the visibility map if the range
of skippable pages is below SKIP_PAGES_THRESHOLD.

> + * Eager vacuums will scan skippable pages in an effort to freeze them and
> + * decrease the backlog of all-visible but not all-frozen pages that have to
> + * be processed to advance relfrozenxid and avoid transaction ID wraparound.
> + *

> @@ -170,6 +197,51 @@ typedef enum
>       VACUUM_ERRCB_PHASE_TRUNCATE,
> } VacErrPhase;
>
> +/*
> + * Eager vacuums scan some all-visible but not all-frozen pages. Since our
> + * goal is to freeze these pages, an eager scan that fails to set the page
> + * all-frozen in the VM is considered to have "failed".
> + *
> + * On the assumption that different regions of the table tend to have
> + * similarly aged data, once we fail to freeze EAGER_SCAN_MAX_FAILS_PER_REGION
> + * blocks in a region of size EAGER_SCAN_REGION_SIZE, we suspend eager
> + * scanning until vacuum has progressed to another region of the table with
> + * potentially older data.
> + */
> +#define EAGER_SCAN_REGION_SIZE 4096
> +#define EAGER_SCAN_MAX_FAILS_PER_REGION 128
> +
> +/*
> + * An eager scan of a page that is set all-frozen in the VM is considered
> + * "successful". To spread out eager scanning across multiple eager vacuums,
> + * we limit the number of successful eager page scans. The maximum number of
> + * successful eager page scans is calculated as a ratio of the all-visible but
> + * not all-frozen pages at the beginning of the vacuum.
> + */
> +#define EAGER_SCAN_SUCCESS_RATE 0.2
> +
> +/*
> + * The eagerness level of a vacuum determines how many all-visible but
> + * not all-frozen pages it eagerly scans.
> + *
> + * A normal vacuum (eagerness VAC_NORMAL) scans no all-visible pages (with the
> + * exception of those scanned due to SKIP_PAGES_THRESHOLD).
> + *
> + * An eager vacuum (eagerness VAC_EAGER) scans a number of pages up to a limit
> + * based on whether or not it is succeeding or failing. An eager vacuum is
> + * downgraded to a normal vacuum when it hits its success quota. An aggressive
> + * vacuum cannot be downgraded. No eagerness level is ever upgraded.
> + *

At the risk of being overly nit-picky... eager vacuums scan their
subset of all-visible pages "up to a limit" based solely on the
success ratio. In the case of (excessive) failures, there is no limit
to the number of pages scanned, only a pause in the pages scanned
until the next region.

> + * An aggressive vacuum (eagerness EAGER_FULL) must scan all all-visible but
> + * not all-frozen pages.
> + */

I think the above should be VAC_AGGRESSIVE vs EAGER_FULL, no?

> +typedef enum VacEagerness
> +{
> +     VAC_NORMAL,
> +     VAC_EAGER,
> +     VAC_AGGRESSIVE,
> +} VacEagerness;
> +

> @@ -516,25 +772,20 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>                * Force aggressive mode, and disable skipping blocks using the
>                * visibility map (even those set all-frozen)
>                */
> -             vacrel->aggressive = true;
> +             aggressive = true;
>               skipwithvm = false;
>       }
>
>       vacrel->skipwithvm = skipwithvm;
>
> +     heap_vacuum_set_up_eagerness(vacrel, aggressive);
> +
>       if (verbose)
> -     {
> -             if (vacrel->aggressive)
> -                     ereport(INFO,
> -                                     (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> -                                                     vacrel->dbname, vacrel->relnamespace,
> -                                                     vacrel->relname)));
> -             else
> -                     ereport(INFO,
> -                                     (errmsg("vacuuming \"%s.%s.%s\"",
> -                                                     vacrel->dbname, vacrel->relnamespace,
> -                                                     vacrel->relname)));
> -     }
> +             ereport(INFO,
> +                             (errmsg("%s of \"%s.%s.%s\"",
> +                                             vac_eagerness_description(vacrel->eagerness),
> +                                             vacrel->dbname, vacrel->relnamespace,
> +                                             vacrel->relname)));
>
>       /*
>        * Allocate dead_items memory using dead_items_alloc.  This handles

One thing I am wondering about is that since we actually modify
vacrel->eagerness during the "success downgrade" cycle, a single
vacuum run could potentially produce messages with both eager vacuum
and normal vacuum language. I don't think that'd be a problem in the
above spot, but wondering if it might be elsewhere (maybe in
pg_stat_activity?).


Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Nazir Bilal Yavuz

Date:

17 December 2024, 18:56:59

Hi,

Thank you for working on this!

On Sat, 14 Dec 2024 at 01:53, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
>
> Thanks for the review!
> Attached v2 should address your feedback and also fixes a few bugs with v1.

Here are couple of code comments:

=== [PATCH v2 07/10] ===

It took me a while to understand that heap_vac_scan_next_block() loops
until rel_pages. What do you think about adding
Assert(vacrel->current_block == rel_pages) before the
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
rel_pages) and having a comment on main loop should process blocks
until rel_pages?

=== [PATCH v2 09/10] ===

+ * If there are no indexes or index scanning is disabled, phase II may be
+ * skipped. If phase I identified very few dead index entries, vacuum may skip
+ * phases II and III. Index vacuuming deletes the dead index entries from the
+ * TID store.

phase III is not mentioned in the previous comments. It could be
better to first explain what phase III is before mentioning it.

+ * After index vacuuming is complete, vacuum scans the blocks of the relation
+ * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
+ * that space for future tuples. Finally, vacuum may truncate the relation if
+ * it has emptied pages at the end.
+ *
+ * After finishing all phases of work, vacuum updates relation statistics in
+ * pg_class and the cumulative statistics subsystem.

There is no clear definition of phase III here as well. I can not
understand what phase III is and which parts the vacuum may skip.

=== [PATCH v2 10/10] ===

+        /*
+         * The number of eagerly scanned blocks an eager vacuum failed to
+         * freeze (due to age) in the current eager scan region. Eager vacuums
+         * reset it to EAGER_SCAN_MAX_FAILS_PER_REGION each time they enter a
+         * new region of the relation. Aggressive vacuums also decrement this
+         * coutner but it is initialized to # blocks in the relation.
+         */

s/coutner/counter

 /* non-export function prototypes */
+
 static void lazy_scan_heap(LVRelState *vacrel);

Extra blank line.

+    if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
+        TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
+                                      vacrel->cutoffs.FreezeLimit))
+        oldest_unfrozen_requires_freeze = true;
+
+    if (MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
+        MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
+                                    vacrel->cutoffs.MultiXactCutoff))
+        oldest_unfrozen_requires_freeze = true;

You may want to short-circuit the second if condition with
!oldest_unfrozen_requires_freeze but it may increase the complexity,
up to you.

+    vacrel->eager_pages.remaining_fails = EAGER_SCAN_MAX_FAILS_PER_REGION *
+        (1 - vacrel->next_eager_scan_region_start / EAGER_SCAN_REGION_SIZE);

This will always return EAGER_SCAN_MAX_FAILS_PER_REGION or 0 because
of the integer dividing.

+    if (aggressive)
+    {
+        vacrel->eagerness = VAC_AGGRESSIVE;
+
+        /*
+         * An aggressive vacuum must scan every all-visible page to safely
+         * advance the relfrozenxid and/or relminmxid. As such, there is no
+         * cap to the number of allowed successes or failures.
+         */
+        vacrel->eager_pages.remaining_fails = vacrel->rel_pages + 1;
+        vacrel->eager_pages.remaining_successes = vacrel->rel_pages + 1;
+        return;
+    }
...
...
+        if (was_eager_scanned)
+        {
+            if (vm_page_frozen)
+            {
+                Assert(vacrel->eager_pages.remaining_successes > 0);
+                vacrel->eager_pages.remaining_successes--;
+
+                if (vacrel->eager_pages.remaining_successes == 0)
+                {
+                    Assert(vacrel->eagerness == VAC_EAGER);

My initial thought was that since *was_eager_scanned* is true,
Assert(vacrel->eagerness == VAC_EAGER) should be under the top if
condition (I assumed that was_eager_scanned is only true for eager
vacuums, not for aggressive vacuums too) but I see your point here.
Since you set vacrel->eager_pages.remaining_successes to
vacrel->rel_pages + 1, vacrel->eager_pages.remaining_successes can not
reach 0 although all pages are processed as successful. I think
comment is needed in both places to explain why
vacrel->eager_pages.[remaining_fails | remaining_successes] is set to
vacrel->rel_pages + 1 and why vacrel->eagerness should be VAC_EAGER
when was_eager_scanned is true and
vacrel->eager_pages.remaining_successes is 0.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Melanie Plageman

Date:

18 December 2024, 01:54:31

Thanks for the review!

On Tue, Dec 17, 2024 at 10:57 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Here are couple of code comments:
>
> === [PATCH v2 07/10] ===
>
> It took me a while to understand that heap_vac_scan_next_block() loops
> until rel_pages. What do you think about adding
> Assert(vacrel->current_block == rel_pages) before the
> pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
> rel_pages) and having a comment on main loop should process blocks
> until rel_pages?

I added these to the v3 I sent in [1]

> === [PATCH v2 09/10] ===
>
> + * If there are no indexes or index scanning is disabled, phase II may be
> + * skipped. If phase I identified very few dead index entries, vacuum may skip
> + * phases II and III. Index vacuuming deletes the dead index entries from the
> + * TID store.
>
> phase III is not mentioned in the previous comments. It could be
> better to first explain what phase III is before mentioning it.
>
> + * After index vacuuming is complete, vacuum scans the blocks of the relation
> + * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
> + * that space for future tuples. Finally, vacuum may truncate the relation if
> + * it has emptied pages at the end.
> + *
> + * After finishing all phases of work, vacuum updates relation statistics in
> + * pg_class and the cumulative statistics subsystem.
>
> There is no clear definition of phase III here as well. I can not
> understand what phase III is and which parts the vacuum may skip.

I've attempted to address this in v3.

> === [PATCH v2 10/10] ===
>
> +        /*
> +         * The number of eagerly scanned blocks an eager vacuum failed to
> +         * freeze (due to age) in the current eager scan region. Eager vacuums
> +         * reset it to EAGER_SCAN_MAX_FAILS_PER_REGION each time they enter a
> +         * new region of the relation. Aggressive vacuums also decrement this
> +         * coutner but it is initialized to # blocks in the relation.
> +         */
>
> s/coutner/counter

Fixed

>  /* non-export function prototypes */
> +
>  static void lazy_scan_heap(LVRelState *vacrel);
>
> Extra blank line.

Fixed

> +    if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
> +        TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
> +                                      vacrel->cutoffs.FreezeLimit))
> +        oldest_unfrozen_requires_freeze = true;
> +
> +    if (MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
> +        MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
> +                                    vacrel->cutoffs.MultiXactCutoff))
> +        oldest_unfrozen_requires_freeze = true;
>
> You may want to short-circuit the second if condition with
> !oldest_unfrozen_requires_freeze but it may increase the complexity,
> up to you.

Good point. Done.

> +    vacrel->eager_pages.remaining_fails = EAGER_SCAN_MAX_FAILS_PER_REGION *
> +        (1 - vacrel->next_eager_scan_region_start / EAGER_SCAN_REGION_SIZE);
>
> This will always return EAGER_SCAN_MAX_FAILS_PER_REGION or 0 because
> of the integer dividing.

Ah, wow, thanks so much for catching that!

> +    if (aggressive)
> +    {
> +        vacrel->eagerness = VAC_AGGRESSIVE;
> +
> +        /*
> +         * An aggressive vacuum must scan every all-visible page to safely
> +         * advance the relfrozenxid and/or relminmxid. As such, there is no
> +         * cap to the number of allowed successes or failures.
> +         */
> +        vacrel->eager_pages.remaining_fails = vacrel->rel_pages + 1;
> +        vacrel->eager_pages.remaining_successes = vacrel->rel_pages + 1;
> +        return;
> +    }
> ...
> ...
> +        if (was_eager_scanned)
> +        {
> +            if (vm_page_frozen)
> +            {
> +                Assert(vacrel->eager_pages.remaining_successes > 0);
> +                vacrel->eager_pages.remaining_successes--;
> +
> +                if (vacrel->eager_pages.remaining_successes == 0)
> +                {
> +                    Assert(vacrel->eagerness == VAC_EAGER);
>
> My initial thought was that since *was_eager_scanned* is true,
> Assert(vacrel->eagerness == VAC_EAGER) should be under the top if
> condition (I assumed that was_eager_scanned is only true for eager
> vacuums, not for aggressive vacuums too) but I see your point here.
> Since you set vacrel->eager_pages.remaining_successes to
> vacrel->rel_pages + 1, vacrel->eager_pages.remaining_successes can not
> reach 0 although all pages are processed as successful. I think
> comment is needed in both places to explain why
> vacrel->eager_pages.[remaining_fails | remaining_successes] is set to
> vacrel->rel_pages + 1 and why vacrel->eagerness should be VAC_EAGER
> when was_eager_scanned is true and
> vacrel->eager_pages.remaining_successes is 0.

Makes sense. I've attempted to clarify as you suggest in v3.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Zni_idCUyKTBteRM-G5X1qiB9mf75rZGtHpt%2Bnk1z4Gg%40mail.gmail.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

19 December 2024, 00:29:50

On Tue, Dec 17, 2024 at 5:54 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Makes sense. I've attempted to clarify as you suggest in v3.

I would just commit 0001. There's nothing to be gained by waiting around.

I don't care about 0002 much. It doesn't seem particularly better or
worse. I would suggest that if you want to do this, maybe don't split
up this:

    vacrel->blkno = InvalidBlockNumber;
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

You could put the moved code before or after after that instead of the
middle of it. Unless there's some reason it makes more sense in the
middle.

I don't care about 0003 much, either. It looks correct to me, but
whether it's better is subjective.

I dislike 0004 as presented. Reducing the scope of blkno is a false
economy; the function doesn't do much of anything interesting after
the main loop. And why do we want to report rel_pages to the progress
reporting machinery instead of blkno? I'd rather that the code report
where it actually ended up (blkno) rather than reporting where it
thinks it must have ended up (rel_pages).

I agree that the assert that 0005 replaces is confusing, but replacing
8 lines of code with 37 is not an improvement in my book.

I like 0006. The phase I-II-III terminology doesn't appear anywhere in
the code (at least, not to my knowledge) but we speak of it that way
so frequently in email and in conversation that mentioning it
someplace that a future developer might find seems advisable. I think
it would be worth mentioning in the second paragraph of the comment
that we may resume phase I after completing phase III if we entered
phase II due to lack of memory. I'm not sure what the right way to
phrase this is -- in a sense, this possibility means that these aren't
really phases at all, however much we may speak of them that way. But
as I say, we talk about them that way all the time, so it's good that
this is finally adding comments to match.

Regarding 0007:

- Why do we measure failure as an integer total but success as a
percentage? Maybe the thought here is that failures are counted per
region but successes are counted globally across the relation, but
then I wonder if that's what we want, and if so, whether we need some
comments to explain why we want that rather than doing both per
region.

- I do not like the nested struct definition for eager_pages. Either
define the struct separately, give it a name, and then refer to that
name here, or just define the elements individually, and maybe give
them a common prefix or something. I don't think what you have here is
a style we normally use. As you can see, pgindent is not particularly
fond of it. It seems particularly weird given that you didn't even put
all the stuff related to eagerness inside of it.

- It's a little weird that you mostly treat eager vacuums as an
intermediate position between aggressive and normal, but then we
decide whether we're eager in a different place than we decide whether
we're aggressive.

- On a related note, won't most vacuums be VAC_EAGER rather than
VAC_NORMAL, thus making VAC_NORMAL a misnomer? I wonder if it's better
to merge eager and normal together, and just treat the cases where we
judge eager scanning not worthwhile as a mild special case of an
otherwise-normal vacuum. It's important that a user can tell what
happened from the log message, but it doesn't seem absolutely
necessary for the start-of-vacuum message to make that clear. It could
just be that %u all-visible scanned => 0 all-visible scanned means we
didn't end up being at all eager.

- Right after the comment "Produce distinct output for the corner case
all the same, just in case," you've changed it so that there's no
longer an if-statement. While you haven't really falsified the comment
in any deep sense, some rewording is probably in order. Also, the
first sentence of this comment overtly contradicts itself without
really explaining anything useful. That's a preexisting problem for
which you're not responsible, but maybe it makes sense to fix it while
you're adjusting this comment.

- Perhaps it's unfair of me, but I think I would have hoped for an
acknowledgement in this commit message, considering that I believe I
was the one who suggested breaking the relation into logical regions,
trying to freeze a percentage of the all-visible-but-not-all-frozen
pages, and capping both successes and failures. Starting the region at
a random offset wasn't my idea, and the specific thresholds you've
chosen were not the ones I suggested, and limiting successes globally
rather than per-region was not what I think I had in mind, and I don't
mean to take away from everything you've done to move this forward,
but unless I am misunderstanding the situation, this particular patch
(0007) is basically an implementation of an algorithm that, as far as
I know, I was the first to propose.

- Which of course also means that I tend to like the idea, but also
that I'm biased. Still, what is the reasonable alternative to this
patch? I have a hard time believing that it's "do nothing". As far as
I know, pretty much everyone agrees that the large burst of work that
tends to occur when an aggressive vacuum kicks off is extremely
problematic, particularly but not only the first time it kicks off on
a table or group of tables that may have accumulated many
all-visible-but-not-all-frozen pages. This patch might conceivably err
in moving that work too aggressively to earlier vacuums, thus making
those vacuums too expensive or wasting work if the pages end up being
modified again; or it might conceivably err in moving work
insufficiently aggressively to earlier vacuums, leaving too much
remaining work when the aggressive vacuum finally happens. In fact, I
would be surprised if it doesn't have those problems in some
situations. But it could have moderately severe cases of those
problems and still be quite a bit better than what we have now
overall.

- So,I think we should avoid fine-tuning this and try to understand if
there's anything wrong with the big picture. Can we imagine a user who
is systematically unhappy with this change? Like, not a user who got
hosed once because of some bad luck, but someone who is constantly and
predictably getting hosed? They'd need to be accumulating lots of
all-visible-not-all-frozen pages in relatively large tables on a
regular basis, but then I guess they'd need to either drop the tables
before the aggressive vacuum happened, or they'd need to render the
pages not-all-visible again before the aggressive vacuum would have
happened. I'm not entirely sure how possible that is. My best guess is
that it's possible if the timing of your autovacuum runs is
particularly poor -- you just need to load some data, vacuum it early
enough that the XIDs are still young so it doesn't get frozen, then
have the eager vacuum hit it, and then update it. That doesn't seem
impossible, but I'm not sure if it's possible to make it happen often
enough and severely enough to really cause a problem. And I'm not sure
we're going to find that out before this is committed, so that brings
me to:

- If this patch is going to go into the source tree in PostgreSQL 18,
it should do that soon. This is a bad patch to be committing in late
March. I propose that it should be shipped by end of January or wait a
year. Even if it goes into the tree by the end of January, there is a
chance we'll find problems that we can't fix quickly and have to
revert the whole thing. But that becomes much more likely if it is
committed close to feature freeze. There is a certain kind of patch -
and I think it includes this patch - where you just can't ever really
know what the problems people are going to hit IRL are until it's
committed and people try things and hit problems, perhaps not even as
part of an intent to test this patch, but just testing in general. If
we don't find those problems with enough time remaining to react to
them in a thoughtful way, that's where full reverts start to happen.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

21 December 2024, 18:28:26

On Tue, Dec 17, 2024 at 5:51 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Thanks for taking a look!
>
> I've rebased and attached an updated v3 which also addresses review feedback.
>
> On Sun, Dec 15, 2024 at 1:05 AM Robert Treat <rob@xzilla.net> wrote:
> > On Fri, Dec 13, 2024 at 5:53 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
<snip>
>
> So, all vacuums are eager after the first eager vacuum (in this
> workload).

I have been thinking about this statement and a sense of awkwardness
that I felt about how this implementation came together which I think
Robert captured well with his statement "- It's a little weird that
you mostly treat eager vacuums as an intermediate position between
aggressive and normal, but then we decide whether we're eager in a
different place than we decide whether
we're aggressive."

It feels to me like eager vacuums are not so much a distinct thing,
but that, like how all vacuum do index cleanup unless told not to, all
vacuums are optimistic that they will convert avp to afp, only we bail
out quickly if the table is below vacuum_freeze_min_age and we throw
the limits out if it is above autovacuum_freeze_max_age, similar to
how we manage vacuum cost limit.

> > > + * The eagerness level of a vacuum determines how many all-visible but
> > > + * not all-frozen pages it eagerly scans.
> > > + *
> > > + * A normal vacuum (eagerness VAC_NORMAL) scans no all-visible pages (with the
> > > + * exception of those scanned due to SKIP_PAGES_THRESHOLD).
> > > + *
> > > + * An eager vacuum (eagerness VAC_EAGER) scans a number of pages up to a limit
> > > + * based on whether or not it is succeeding or failing. An eager vacuum is
> > > + * downgraded to a normal vacuum when it hits its success quota. An aggressive
> > > + * vacuum cannot be downgraded. No eagerness level is ever upgraded.
> > > + *
> >
> > At the risk of being overly nit-picky... eager vacuums scan their
> > subset of all-visible pages "up to a limit" based solely on the
> > success ratio. In the case of (excessive) failures, there is no limit
> > to the number of pages scanned, only a pause in the pages scanned
> > until the next region.
>
> Yes, this is a good point. I've changed it to specifically indicate
> the limit is based on successes. However, I feel like I should either
> not mention the success limit here or mention the regional limiting
> based on failures -- otherwise it is confusing. Can you think of a
> wording that would be good for this comment?
>

I don't feel like I came up with anything concise :-)

I think the gist is something like "a given vacuum will scan a number
of pages up to either a limit based on successes, after which it is
downgraded to a normal vacuum, or a subset of pages based on failures
across different regions within the heap." but even that feels like we
are relying on people already understanding what is going on rather
than defining/explaining it.

> > > + * An aggressive vacuum (eagerness EAGER_FULL) must scan all all-visible but
> > > + * not all-frozen pages.
> > > + */
> >
> > I think the above should be VAC_AGGRESSIVE vs EAGER_FULL, no?
>
> Yep, thanks! Fixed.
>
> > >       vacrel->skipwithvm = skipwithvm;
> > >
> > > +     heap_vacuum_set_up_eagerness(vacrel, aggressive);
> > > +
> > >       if (verbose)
> > > -     {
> > > -             if (vacrel->aggressive)
> > > -                     ereport(INFO,
> > > -                                     (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> > > -                                                     vacrel->dbname, vacrel->relnamespace,
> > > -                                                     vacrel->relname)));
> > > -             else
> > > -                     ereport(INFO,
> > > -                                     (errmsg("vacuuming \"%s.%s.%s\"",
> > > -                                                     vacrel->dbname, vacrel->relnamespace,
> > > -                                                     vacrel->relname)));
> > > -     }
> > > +             ereport(INFO,
> > > +                             (errmsg("%s of \"%s.%s.%s\"",
> > > +                                             vac_eagerness_description(vacrel->eagerness),
> > > +                                             vacrel->dbname, vacrel->relnamespace,
> > > +                                             vacrel->relname)));
> > >
> > >       /*
> > >        * Allocate dead_items memory using dead_items_alloc.  This handles
> >
> > One thing I am wondering about is that since we actually modify
> > vacrel->eagerness during the "success downgrade" cycle, a single
> > vacuum run could potentially produce messages with both eager vacuum
> > and normal vacuum language. I don't think that'd be a problem in the
> > above spot, but wondering if it might be elsewhere (maybe in
> > pg_stat_activity?).
>
> Great point. It's actually already an issue in the vacuum log output.
> In the new patch, I save the eagerness level at the beginning and use
> it in the logging output. Any future users of the eagerness level will
> have to figure out if they care about the original eagerness level or
> the current one.
>

In a general sense, you want to know if your vacuums are converting
avp to afp since it can explain i/o usage. In this version of the
patch, we know it is turned on based on VacEagerness, but it'd be nice
to know if it got turned off; ie. basically if we end up in
if (vacrel->eager_pages.remaining_successes == 0) maybe drop a note in the logs.

In a world where all vacuums are eager vacuums, I think you still want
the latter messaging, but I think the former would end up being noted
based on the outcome of criteria in vacuum_get_cutoffs() (primarily if
we are over the freeze limit).

Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Melanie Plageman

Date:

23 December 2024, 20:58:09

On Sat, Dec 21, 2024 at 10:28 AM Robert Treat <rob@xzilla.net> wrote:
>
> On Tue, Dec 17, 2024 at 5:51 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> It feels to me like eager vacuums are not so much a distinct thing,
> but that, like how all vacuum do index cleanup unless told not to, all
> vacuums are optimistic that they will convert avp to afp, only we bail
> out quickly if the table is below vacuum_freeze_min_age and we throw
> the limits out if it is above autovacuum_freeze_max_age, similar to
> how we manage vacuum cost limit.

I like this framing/way of thinking about it. v4 in [1] eliminates the
concept of eager vacuums. I still use the aggressive vacuum and
non-aggressive vacuum framing. But, if we add a reloption/GUC to allow
configuring failures per region (proposed in [1]), that means more
using-facing docs and I think this way of framing it as a spectrum of
all-visible page scanning based on relfrozenxid's relationship to each
of these GUCs might be a clear way to explain eager scanning to users.

> In a general sense, you want to know if your vacuums are converting
> avp to afp since it can explain i/o usage. In this version of the
> patch, we know it is turned on based on VacEagerness, but it'd be nice
> to know if it got turned off; ie. basically if we end up in
> if (vacrel->eager_pages.remaining_successes == 0) maybe drop a note in the logs.

I've added a logging message like this.

> In a world where all vacuums are eager vacuums, I think you still want
> the latter messaging, but I think the former would end up being noted
> based on the outcome of criteria in vacuum_get_cutoffs() (primarily if
> we are over the freeze limit).

Hmm. Now that I've eliminated the concept of eager vacuums, users are
not informed whether or not eager scanning is enabled at the beginning
of vacuum. Only when eager scanning is disabled during vacuum or at
the end when they see the number of pages eagerly scanned would they
know whether or not this eager scanning happened.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_byJgf3wB8ukv6caXROReS1SRZsVPb7RWP%2B8qPtdDGykw%40mail.gmail.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

24 January, 23:43:05

On Fri, Jan 24, 2025 at 3:02 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> This thought exercise made me realize something is wrong with my
> current patch, though. If you set the failure tolerance
> (vacuum_eager_scan_max_fails) to 0 right now, it disables eager
> scanning altogether. That might be unexpected. You would probably
> expect setting that to 0 to still allow eager scanning if it is only
> succeeding. That made me think that perhaps I should have a -1 value
> that disables eager scanning altogether and 0 just sets the failure
> tolerance to 0.

I would not expect that. I would expect setting a value of N to mean
"try until we have N failures". "Try until you have 0 failures" is
logically equivalent to "don't try".

> I think you're right. I would go with a percentage. I don't see many
> other GUCs that are percents. What would you call it? Perhaps
> vacuum_eager_scan_fail_threshold? The % of the blocks in the table
> vacuum may scan and fail to freeze.
>
> There is an issue I see with making it a percentage. The current
> default vacuum_eager_scan_max_fails is 128 out of 4096. That means you
> are allowed to scan about 3% of the blocks in the table even if you
> fail to freeze every one. I don't think there are very many reasonable
> values above 256, personally. So, it might be weird to add a
> percentage GUC value with only very low acceptable values. Part of
> this is that we are not making the success cap configurable, so that
> means that you might have lots of extra I/O if you are both failing
> and succeeding. Someone configuring this GUC might think this is
> controlling the amount of extra I/O they are incurring.

Why would it be unreasonable to set this value to 25% or even 100%? I
grant that it doesn't sound like the most prudent possible value, but
if I'm willing to fritter away my VACUUM resources to have the best
possible chance of eagerly freezing stuff, isn't that up to me? I
think setting this value to 100% would be WAY less damaging than
work_mem='1TB' or autovacuum_naptime='7d', both of which are allowed.
In fact, setting this to 100% could theoretically have no negative
consequences at all, if it so happens that no freeze failures occur.
Couldn't it even be a win, if freeze failures are common but
minimizing the impact of aggressive vacuum is of overwhelming
importance?

Looking at ConfigureNamesReal[], some existing words we use to talk
about real-valued GUCs include: cost, fraction, factor, bias,
multiplier, target, rate. Of those, I like "fraction" and "rate" best
here, because they imply a range of 0-1 rather than anything broader.
vacuum_max_eager_freeze_failure_rate? Kinda long...

> The while was meant to contrast the sentence before it about typically
> only scanning pages that have been modified since the previous vacuum.
> But, I see how that didn't work out for me. How about this:
>
> """
> <command>VACUUM</command> typically scans pages that have been
> modified since the last vacuum. However, some all-visible but not
> all-frozen pages are eagerly scanned to try and freeze them. Note that
> the <structfield>relfrozenxid</structfield> can only be advanced when
> every page of the table that might contain unfrozen XIDs is scanned.
> """

"to try and freeze them" seems a little awkward to me. The rest seems fine.

"in the hope that we can freeze them"?

That still doesn't seem great to me. Maybe it's worth expending more
text here, not sure.

> I rewrote the comment using some of your input and trying to clarify
> my initial point about opportunistic vs guaranteed freezing.
>
> """
> If FreezeLimit has not advanced past the relfrozenxid or
> MultiXactCutoff has not advanced past relminmxid, we are unlikely to
> be able to freeze anything new.
> Tuples with XIDs older than OldestXmin or MXIDs older than OldestMxact
> are technically freezable, but we won't freeze them unless some other
> criteria for opportunistic freezing is met.

It's not super clear to me what "some other criteria" means here, but
maybe it's fine.

> It is also possible than a transaction newer than the FreezeLimit has

than->that

> ended, rendering additional tuples freezable.
> As a heuristic, however, it makes sense to wait until the FreezeLimit
> or MultiXactCutoff has advanced before eager scanning.
> """
>
> > + ereport(INFO,
> > + (errmsg("Vacuum successfully froze %u eager scanned blocks of
> > \"%s.%s.%s\". Now disabling eager scanning.",
> >
> > I predict that if Tom sees this, you will get complaints about both
> > the wording of the message, which pretty clearly does not conform to
> > style guidelines for a primary error message, and also about the use
> > of the INFO level. Allow me to suggest DEBUG1 or DEBUG2 and "disabling
> > eager scanning after freezing %u eagerly scanned blocks".
>
> I've used your wording. Just for future reference, are the style
> guidelines I was violating the capitalization and punctuation? Are
> these documented somewhere? Also, what is a primary error message?
> INFO level? Or ones that use ereport and are translated? I looked at
> other messages and saw that they don't capitalize the first word or
> use punctuation, so I assume that those were problems.

Yes. Primary error messages, i.e. errmsg(), are not capitalized and
punctuated, unlike errdetail() and errhint() messages, which are.

See https://www.postgresql.org/docs/current/error-style-guide.html

INFO level is used for VERY few things. I can't tell you off the top
of my head when it's appropriate, but I think the answer is "almost
never".

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

26 January, 19:30:52

On Fri, Jan 24, 2025 at 3:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 24, 2025 at 3:02 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > > + ereport(INFO,
> > > + (errmsg("Vacuum successfully froze %u eager scanned blocks of
> > > \"%s.%s.%s\". Now disabling eager scanning.",
> > >
> > > I predict that if Tom sees this, you will get complaints about both
> > > the wording of the message, which pretty clearly does not conform to
> > > style guidelines for a primary error message, and also about the use
> > > of the INFO level. Allow me to suggest DEBUG1 or DEBUG2 and "disabling
> > > eager scanning after freezing %u eagerly scanned blocks".
> >
> > I've used your wording. Just for future reference, are the style
> > guidelines I was violating the capitalization and punctuation? Are
> > these documented somewhere? Also, what is a primary error message?
> > INFO level? Or ones that use ereport and are translated? I looked at
> > other messages and saw that they don't capitalize the first word or
> > use punctuation, so I assume that those were problems.
>
> Yes. Primary error messages, i.e. errmsg(), are not capitalized and
> punctuated, unlike errdetail() and errhint() messages, which are.
>
> See https://www.postgresql.org/docs/current/error-style-guide.html
>
> INFO level is used for VERY few things. I can't tell you off the top
> of my head when it's appropriate, but I think the answer is "almost
> never".
>

Maybe, but one of the areas that INFO is used for is providing
additional details in VACUUM VERBOSE output, and this seems like it
would be pretty useful information to have if you are trying to
discern changes in i/o rate during a vacuum, or trying to tune the
failure rate setting, or several other related fields (including
automated capture by tools like pganalyze), so I believe INFO is the
right choice for this.


Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

26 January, 23:11:42

On Fri, Jan 24, 2025 at 3:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 24, 2025 at 3:02 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > This thought exercise made me realize something is wrong with my
> > current patch, though. If you set the failure tolerance
> > (vacuum_eager_scan_max_fails) to 0 right now, it disables eager
> > scanning altogether. That might be unexpected. You would probably
> > expect setting that to 0 to still allow eager scanning if it is only
> > succeeding. That made me think that perhaps I should have a -1 value
> > that disables eager scanning altogether and 0 just sets the failure
> > tolerance to 0.
>
> I would not expect that. I would expect setting a value of N to mean
> "try until we have N failures". "Try until you have 0 failures" is
> logically equivalent to "don't try".
>

Hmm... isn't "don't try" really more logically equivalent to "stop
after zero failures", with "try until zero failures" more of the
inverse... more like "never stop never stopping"?

> > I think you're right. I would go with a percentage. I don't see many
> > other GUCs that are percents. What would you call it? Perhaps
> > vacuum_eager_scan_fail_threshold? The % of the blocks in the table
> > vacuum may scan and fail to freeze.
> >
> > There is an issue I see with making it a percentage. The current
> > default vacuum_eager_scan_max_fails is 128 out of 4096. That means you
> > are allowed to scan about 3% of the blocks in the table even if you
> > fail to freeze every one. I don't think there are very many reasonable
> > values above 256, personally. So, it might be weird to add a
> > percentage GUC value with only very low acceptable values. Part of
> > this is that we are not making the success cap configurable, so that
> > means that you might have lots of extra I/O if you are both failing
> > and succeeding. Someone configuring this GUC might think this is
> > controlling the amount of extra I/O they are incurring.
>
> Why would it be unreasonable to set this value to 25% or even 100%? I
> grant that it doesn't sound like the most prudent possible value, but
> if I'm willing to fritter away my VACUUM resources to have the best
> possible chance of eagerly freezing stuff, isn't that up to me? I
> think setting this value to 100% would be WAY less damaging than
> work_mem='1TB' or autovacuum_naptime='7d', both of which are allowed.
> In fact, setting this to 100% could theoretically have no negative
> consequences at all, if it so happens that no freeze failures occur.
> Couldn't it even be a win, if freeze failures are common but
> minimizing the impact of aggressive vacuum is of overwhelming
> importance?
>

Yeah, I don't see much reason to be concerned about this being a foot
gun; I don't think there is a way to configure it to be more
disruptive than the folks I have seen running vacuum freeze in cron
jobs :-)

And I do think there is potential upside; we've mostly talked about
this in cases that use settings close to the defaults, but I think
there is a significant number of people who use numbers quite
different from the defaults (specifically, increasing
autovac_max_freeze_age / vacuum_freeze_table_age considerably higher)
where a higher failure rate is probably worth it, especially as vacuum
time increases. (It feels very analogous to checkpoint smoothing when
put this way)

> > The while was meant to contrast the sentence before it about typically
> > only scanning pages that have been modified since the previous vacuum.
> > But, I see how that didn't work out for me. How about this:
> >
> > """
> > <command>VACUUM</command> typically scans pages that have been
> > modified since the last vacuum. However, some all-visible but not
> > all-frozen pages are eagerly scanned to try and freeze them. Note that
> > the <structfield>relfrozenxid</structfield> can only be advanced when
> > every page of the table that might contain unfrozen XIDs is scanned.
> > """
>
> "to try and freeze them" seems a little awkward to me. The rest seems fine.
>
> "in the hope that we can freeze them"?
>
> That still doesn't seem great to me. Maybe it's worth expending more
> text here, not sure.
>

Yeah, this feels like a net negative. As I see it, we're trying to
connect three not-obviously related ideas for users who are trying to
understand how this system works especially with regards to
relfrozenxid advanced (based on the section of the docs we are in),
loosely
1. you've probably heard vacuum mainly scans modified pages for cleanup
2. you might notice it also scans non-modified pages, because
sometimes it wants to freeze stuff
3. you might think, if we are freezing stuff, why doesn't relfrozenxid advance?

So that middle bit is trying to act as glue that pulls this all
together. I thought the previous version was closer, with Haas's
feedback I might go with something more like this:

    <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but
<structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
+    While <command>VACUUM</command> typically scans pages that have been
+    modified since the last vacuum, it may also eagerly scan some
all-visible but not
+    all-frozen pages in an attempt to freeze them, but the
+    <structfield>relfrozenxid</structfield> will only be advanced when
+    every page of the table that might contain unfrozen XIDs is scanned.
+    This happens when

> > I rewrote the comment using some of your input and trying to clarify
> > my initial point about opportunistic vs guaranteed freezing.
> >
> > """
> > If FreezeLimit has not advanced past the relfrozenxid or
> > MultiXactCutoff has not advanced past relminmxid, we are unlikely to
> > be able to freeze anything new.
> > Tuples with XIDs older than OldestXmin or MXIDs older than OldestMxact
> > are technically freezable, but we won't freeze them unless some other
> > criteria for opportunistic freezing is met.
>
> It's not super clear to me what "some other criteria" means here, but
> maybe it's fine.
>
> > It is also possible that a transaction newer than the FreezeLimit has
> > ended, rendering additional tuples freezable.
> > As a heuristic, however, it makes sense to wait until the FreezeLimit
> > or MultiXactCutoff has advanced before eager scanning.
> > """
> >

Yeah, I also think the "some other criteria" part seems poor, although
tbh this all feels like a net negative to the previous wording or
Haas's suggested change, like you're describing what the code below it
does rather than what the intentions of the code are. For example, I
like the "freeze horizon" language vs explicit FreezeLimit since the
code could change even if the goal doesn't. But in lue of going back
to that wording (modulo your 1 line explanation in your email), if I
were to attempt to split the difference:

We only want to
enable eager scanning if we are likely to be able to freeze some of
the pages in the relation, which is unlikely if FreezeLimit has not
advanced past
relfrozenxid or if MultiXactCutoff has not advanced passed
relminmxid. Granted, there may be pages we didn't try to freeze
before, or some previously blocking XID greater than FreezeLimit may
have now ended (allowing for freezing), but as a heuristic we wait
until the
FreezeLimit advances to increase our chances of successful freezing.

Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

27 January, 20:03:53

On Sun, Jan 26, 2025 at 3:11 PM Robert Treat <rob@xzilla.net> wrote:
> Hmm... isn't "don't try" really more logically equivalent to "stop
> after zero failures", with "try until zero failures" more of the
> inverse... more like "never stop never stopping"?

No. Or at least, I don't think that's how English works.

> So that middle bit is trying to act as glue that pulls this all
> together. I thought the previous version was closer, with Haas's
> feedback I might go with something more like this:
>
>     <para>
> -    <command>VACUUM</command> normally only scans pages that have been modified
> -    since the last vacuum, but
> <structfield>relfrozenxid</structfield> can only be
> -    advanced when every page of the table
> -    that might contain unfrozen XIDs is scanned.  This happens when
> +    While <command>VACUUM</command> typically scans pages that have been
> +    modified since the last vacuum, it may also eagerly scan some
> all-visible but not
> +    all-frozen pages in an attempt to freeze them, but the
> +    <structfield>relfrozenxid</structfield> will only be advanced when
> +    every page of the table that might contain unfrozen XIDs is scanned.
> +    This happens when

I like that.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Masahiko Sawada

Date:

27 January, 20:52:16

On Mon, Jan 27, 2025 at 9:45 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Jan 24, 2025 at 3:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jan 24, 2025 at 3:02 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > > I think you're right. I would go with a percentage. I don't see many
> > > other GUCs that are percents. What would you call it? Perhaps
> > > vacuum_eager_scan_fail_threshold? The % of the blocks in the table
> > > vacuum may scan and fail to freeze.
> > >
> > > There is an issue I see with making it a percentage. The current
> > > default vacuum_eager_scan_max_fails is 128 out of 4096. That means you
> > > are allowed to scan about 3% of the blocks in the table even if you
> > > fail to freeze every one. I don't think there are very many reasonable
> > > values above 256, personally. So, it might be weird to add a
> > > percentage GUC value with only very low acceptable values. Part of
> > > this is that we are not making the success cap configurable, so that
> > > means that you might have lots of extra I/O if you are both failing
> > > and succeeding. Someone configuring this GUC might think this is
> > > controlling the amount of extra I/O they are incurring.
> >
> > Why would it be unreasonable to set this value to 25% or even 100%? I
> > grant that it doesn't sound like the most prudent possible value, but
> > if I'm willing to fritter away my VACUUM resources to have the best
> > possible chance of eagerly freezing stuff, isn't that up to me? I
> > think setting this value to 100% would be WAY less damaging than
> > work_mem='1TB' or autovacuum_naptime='7d', both of which are allowed.
> > In fact, setting this to 100% could theoretically have no negative
> > consequences at all, if it so happens that no freeze failures occur.
> > Couldn't it even be a win, if freeze failures are common but
> > minimizing the impact of aggressive vacuum is of overwhelming
> > importance?
> >
> > Looking at ConfigureNamesReal[], some existing words we use to talk
> > about real-valued GUCs include: cost, fraction, factor, bias,
> > multiplier, target, rate. Of those, I like "fraction" and "rate" best
> > here, because they imply a range of 0-1 rather than anything broader.
> > vacuum_max_eager_freeze_failure_rate? Kinda long...
>
> attached v11 uses a fraction with this name. It follows the
> conventions and I find it descriptive.
>
> Changing the configuration to a fraction does mean that quite a few of
> the comments are different in this version. I think the description of
> the guc in
> config.sgml could use some review in particular. I tried to make it
> clear that the percentage is not the maximum number of blocks that
> could be eager scanned, because you may also eagerly scan blocks and
> succeed.
>
> Other note: I noticed AutoVacOpts that are floating point numbers (like
> vacuum_cost_delay) are float8s but their associated GUCs (like
> autovacuum_vacuum_cost_delay) are doubles. These are going to be equivalent,
> but I wanted to note the inconsistency in case I was missing something.
>
> > > """
> > > <command>VACUUM</command> typically scans pages that have been
> > > modified since the last vacuum. However, some all-visible but not
> > > all-frozen pages are eagerly scanned to try and freeze them. Note that
> > > the <structfield>relfrozenxid</structfield> can only be advanced when
> > > every page of the table that might contain unfrozen XIDs is scanned.
> > > """
> >
> > "to try and freeze them" seems a little awkward to me. The rest seems fine.
> >
> > "in the hope that we can freeze them"?
> >
> > That still doesn't seem great to me. Maybe it's worth expending more
> > text here, not sure.
>
> Given this and Treat's input from a different mail:
>
> > Yeah, this feels like a net negative. As I see it, we're trying to
> > connect three not-obviously related ideas for users who are trying to
> > understand how this system works especially with regards to
> > relfrozenxid advanced (based on the section of the docs we are in),
> > loosely
> > 1. you've probably heard vacuum mainly scans modified pages for cleanup
> > 2. you might notice it also scans non-modified pages, because
> > sometimes it wants to freeze stuff
> > 3. you might think, if we are freezing stuff, why doesn't relfrozenxid advance?
> >
> > So that middle bit is trying to act as glue that pulls this all
> > together. I thought the previous version was closer, with Haas's
> > feedback I might go with something more like this:
>
> I've mostly used his suggested wording in attached v11.
>
> > > I rewrote the comment using some of your input and trying to clarify
> > > my initial point about opportunistic vs guaranteed freezing.
> > >
> > > """
> > > If FreezeLimit has not advanced past the relfrozenxid or
> > > MultiXactCutoff has not advanced past relminmxid, we are unlikely to
> > > be able to freeze anything new.
> > > Tuples with XIDs older than OldestXmin or MXIDs older than OldestMxact
> > > are technically freezable, but we won't freeze them unless some other
> > > criteria for opportunistic freezing is met.
> >
> > It's not super clear to me what "some other criteria" means here, but
> > maybe it's fine.
>
> Given this thought and Treat's feedback below
>
> > Yeah, I also think the "some other criteria" part seems poor, although
> > tbh this all feels like a net negative to the previous wording or
> > Haas's suggested change, like you're describing what the code below it
> > does rather than what the intentions of the code are. For example, I
> > like the "freeze horizon" language vs explicit FreezeLimit since the
> > code could change even if the goal doesn't. But in lue of going back
> > to that wording (modulo your 1 line explanation in your email), if I
> > were to attempt to split the difference:
> >
> > We only want to
> > enable eager scanning if we are likely to be able to freeze some of
> > the pages in the relation, which is unlikely if FreezeLimit has not
> > advanced past
> > relfrozenxid or if MultiXactCutoff has not advanced passed
> > relminmxid. Granted, there may be pages we didn't try to freeze
> > before, or some previously blocking XID greater than FreezeLimit may
> > have now ended (allowing for freezing), but as a heuristic we wait
> > until the
> > FreezeLimit advances to increase our chances of successful freezing.
>
> I've updated the proposed text to mostly be in line with what he suggested.
>
> > > ended, rendering additional tuples freezable.
> > > As a heuristic, however, it makes sense to wait until the FreezeLimit
> > > or MultiXactCutoff has advanced before eager scanning.
> > > """
> > >
> > > > + ereport(INFO,
> > > > + (errmsg("Vacuum successfully froze %u eager scanned blocks of
> > > > \"%s.%s.%s\". Now disabling eager scanning.",
> > > >
> > > > I predict that if Tom sees this, you will get complaints about both
> > > > the wording of the message, which pretty clearly does not conform to
> > > > style guidelines for a primary error message, and also about the use
> > > > of the INFO level. Allow me to suggest DEBUG1 or DEBUG2 and "disabling
> > > > eager scanning after freezing %u eagerly scanned blocks".
> > >
> > > I've used your wording. Just for future reference, are the style
> > > guidelines I was violating the capitalization and punctuation? Are
> > > these documented somewhere? Also, what is a primary error message?
> > > INFO level? Or ones that use ereport and are translated? I looked at
> > > other messages and saw that they don't capitalize the first word or
> > > use punctuation, so I assume that those were problems.
> >
> > Yes. Primary error messages, i.e. errmsg(), are not capitalized and
> > punctuated, unlike errdetail() and errhint() messages, which are.
> >
> > See https://www.postgresql.org/docs/current/error-style-guide.html
> >
> > INFO level is used for VERY few things. I can't tell you off the top
> > of my head when it's appropriate, but I think the answer is "almost
> > never".
>
> I have changed it to DEBUG2. I started it as INFO because the other vacuum
> messages about whether or not the vacuum is an aggressive vacuum were at INFO
> level.I don't know what a user might prefer. Treat said this downthread:
>
> > Maybe, but one of the areas that INFO is used for is providing
> > additional details in VACUUM VERBOSE output, and this seems like it
> > would be pretty useful information to have if you are trying to
> > discern changes in i/o rate during a vacuum, or trying to tune the
> > failure rate setting, or several other related fields (including
> > automated capture by tools like pganalyze), so I believe INFO is the
> > right choice for this.
>
> I'm happy to go either way. I don't want users mad about verbosity, but if they
> think it's helpful, then that seems good.
>

Thank you for updating the patch. I was reviewing the v10 patch and
had some comments. I believe these comments are still valid for v11,
but please ignore them if outdated.

+       if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
+               TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
+
   vacrel->cutoffs.FreezeLimit))
+               oldest_unfrozen_before_cutoff = true;
+
+       if (!oldest_unfrozen_before_cutoff &&
+               MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
+               MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
+
 vacrel->cutoffs.MultiXactCutoff))
+               oldest_unfrozen_before_cutoff = true;

Given that our freeze check strictly checks if an xid is older than
the cutoff (exclusive bound), I think we should check if the
relfrozenxid and relminmxid strictly precede the cutoff values.

---
        if (*was_eager_scanned)
            vacrel->eager_scanned_pages++;

How about incrementing this counter near the place where incrementing
scanned_pages (i.e., at the beginning of the loop of
heap_vac_scan_next_block())? It would make it easy to understand the
difference between eager_scanned_pages and scanned_pages.

---
+ * No vm_page_frozen output parameter (like what is passed to
+ * lazy_scan_prune()) is passed here because empty pages are always frozen and
+ * thus could never be eagerly scanned.

The last part "thus could never be eagerly scanned" confused me a bit;
IIUC we count all pages that are scanned because of this new eager
scan feature as "eagerly scanned pages". We increment
eager_scanned_pages counter even if the page is either new or empty.
This fact seems to contradict the sentence "empty pages could never be
eagerly scanned".

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

28 January, 21:48:56

On Mon, Jan 27, 2025 at 12:45 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Jan 24, 2025 at 3:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jan 24, 2025 at 3:02 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> attached v11 uses a fraction with this name. It follows the
> conventions and I find it descriptive.
>
> Changing the configuration to a fraction does mean that quite a few of
> the comments are different in this version. I think the description of
> the guc in
> config.sgml could use some review in particular. I tried to make it
> clear that the percentage is not the maximum number of blocks that
> could be eager scanned, because you may also eagerly scan blocks and
> succeed.
>

Specifies the maximum fraction of pages that <command>VACUUM</command>
may scan and <emphasis>fail</emphasis> to set all-frozen in the
visibility map before disabling eager scanning. A value of 0 disables
eager scanning altogether. The default is 0.03 (3%).

Note that when eager scanning is enabled, successful page freeze
attempts will not count against this limit, although they are
internally capped at 20% of the all-visible but not all-frozen pages
within the relation, in an effort to amortize overhead for future
aggressive vacuums.

This parameter can only be set in the
<filename>postgresql.conf</filename> file or on the server command
line; but the setting can be overridden for individual tables by
changing the <link
linkend="reloption-vacuum-max-eager-freeze-failure-rate">corresponding
table storage parameter</link>. For more information, see <xref
linkend="vacuum-for-wraparound"/>.


> Other note: I noticed AutoVacOpts that are floating point numbers (like
> vacuum_cost_delay) are float8s but their associated GUCs (like
> autovacuum_vacuum_cost_delay) are doubles. These are going to be equivalent,
> but I wanted to note the inconsistency in case I was missing something.
>

::thinking emoji::

I think what you have is right, though certainly better to have a more
qualified opinion.
On a not entirely related item, I find it interesting that you set
your max value to 1.0, but vacuum_scale_factor (and similar) set to
100.00.
I think you have it right though, what does it mean to set a table's
scale factor to 10,000%?

> > >
> > > > + ereport(INFO,
> > > > + (errmsg("Vacuum successfully froze %u eager scanned blocks of
> > > > \"%s.%s.%s\". Now disabling eager scanning.",
> > > >
> > > > I predict that if Tom sees this, you will get complaints about both
> > > > the wording of the message, which pretty clearly does not conform to
> > > > style guidelines for a primary error message, and also about the use
> > > > of the INFO level. Allow me to suggest DEBUG1 or DEBUG2 and "disabling
> > > > eager scanning after freezing %u eagerly scanned blocks".
> > >
> > > I've used your wording. Just for future reference, are the style
> > > guidelines I was violating the capitalization and punctuation? Are
> > > these documented somewhere? Also, what is a primary error message?
> > > INFO level? Or ones that use ereport and are translated? I looked at
> > > other messages and saw that they don't capitalize the first word or
> > > use punctuation, so I assume that those were problems.
> >
> > Yes. Primary error messages, i.e. errmsg(), are not capitalized and
> > punctuated, unlike errdetail() and errhint() messages, which are.
> >
> > See https://www.postgresql.org/docs/current/error-style-guide.html
> >
> > INFO level is used for VERY few things. I can't tell you off the top
> > of my head when it's appropriate, but I think the answer is "almost
> > never".
>
> I have changed it to DEBUG2. I started it as INFO because the other vacuum
> messages about whether or not the vacuum is an aggressive vacuum were at INFO
> level.I don't know what a user might prefer. Treat said this downthread:
>
> > Maybe, but one of the areas that INFO is used for is providing
> > additional details in VACUUM VERBOSE output, and this seems like it
> > would be pretty useful information to have if you are trying to
> > discern changes in i/o rate during a vacuum, or trying to tune the
> > failure rate setting, or several other related fields (including
> > automated capture by tools like pganalyze), so I believe INFO is the
> > right choice for this.
>
> I'm happy to go either way. I don't want users mad about verbosity, but if they
> think it's helpful, then that seems good.
>

If there is a configurable, people will want to tune it, and DEBUG
level messages aren't a usable solution.

Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Masahiko Sawada

Date:

29 January, 00:38:00

On Mon, Jan 27, 2025 at 1:22 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jan 27, 2025 at 12:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for updating the patch. I was reviewing the v10 patch and
> > had some comments. I believe these comments are still valid for v11,
> > but please ignore them if outdated.
>
> Thanks so much for the review!
>
> > +       if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
> > +               TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
> > +
> >    vacrel->cutoffs.FreezeLimit))
> > +               oldest_unfrozen_before_cutoff = true;
> > +
> > +       if (!oldest_unfrozen_before_cutoff &&
> > +               MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
> > +               MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
> > +
> >  vacrel->cutoffs.MultiXactCutoff))
> > +               oldest_unfrozen_before_cutoff = true;
> >
> > Given that our freeze check strictly checks if an xid is older than
> > the cutoff (exclusive bound), I think we should check if the
> > relfrozenxid and relminmxid strictly precede the cutoff values.
>
> Makes sense. I've changed that.
>
> > ---
> >         if (*was_eager_scanned)
> >             vacrel->eager_scanned_pages++;
> >
> > How about incrementing this counter near the place where incrementing
> > scanned_pages (i.e., at the beginning of the loop of
> > heap_vac_scan_next_block())? It would make it easy to understand the
> > difference between eager_scanned_pages and scanned_pages.
>
> Right. That makes sense. I've changed that in the attached v12.
>
> > ---
> > + * No vm_page_frozen output parameter (like what is passed to
> > + * lazy_scan_prune()) is passed here because empty pages are always frozen and
> > + * thus could never be eagerly scanned.
> >
> > The last part "thus could never be eagerly scanned" confused me a bit;
> > IIUC we count all pages that are scanned because of this new eager
> > scan feature as "eagerly scanned pages". We increment
> > eager_scanned_pages counter even if the page is either new or empty.
> > This fact seems to contradict the sentence "empty pages could never be
> > eagerly scanned".
>
> Ah, so what I mean by this is that the first time an empty page is
> vacuumed, it is set all-visible and all-frozen in the VM. We only
> eagerly scan pages that are _only_ all-visible (not also all-frozen).
> So, no empty pages will ever be eligible for eager scanning. I've
> updated the comment, because I agree it was confusing. See what you
> think now.

Thank you for updating the patch! These updates look good to me.

BTW I realized that we need to update tab-complete.c too to support
the tab-completion for vacuum_max_eager_freeze_failure_rate.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Melanie Plageman

Date:

29 January, 19:34:55

On Fri, Jan 24, 2025 at 11:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 24, 2025 at 9:15 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > So, in this case, there is only one table in question, so 1 autovacuum
> > worker (and up to 2 maintenance parallel workers for index vacuuming).
> > The duration I provided is just the absolute duration from start of
> > vacuum to finish -- not considering the amount of time each parallel
> > worker may have been working (also it includes time spent delaying).
> > The benchmark ran for 2.8 hours. I configured vacuum to run
> > frequently. In this case, master spent 47% of the total time vacuuming
> > and the patch spent 56%.
>
> Definitely not insignificant, but I think it's OK for a worst case.
> Autovacuum is a background process, so it's not like a 20% regression
> on query performance.

So, I've done a few runs with FPIs turned off to reduce run variance
caused by vacuum and checkpoint timing.
Of course this means that the amount of IO done by vacuum is very
different from a benchmark run with realistic settings.

I reran two of my simulations:

1)
- hot tail
    32 clients inserting 20 rows then updating 1 row
    duration: 3 hours

There is a small increase in total time spent vacuuming (< 10%). But
it is spread out. The first aggressive vacuum of the table is 20
seconds with the patch and 9 minutes on master. And this is not an
append-only workload -- the tail of the table (up to 200,000 rows old)
is being updated (and potentially unfrozen). So, this feels like a
win.

The insert/update P99 latency is lower (better) with the patch around
the time of the aggressive vacuum.

2)
- hot tail with delete (worst-case)
    32 clients inserting 20 rows then updating 1 row and 1
rate-limited client deleting all data before it can be aggressively
vacuumed
    durations: 3 hours

There is a 10-15% increase in total time spent vacuuming with the
patch (30-40% of total benchmark runtime is spent vacuuming).

I ran the benchmark for 4 hours as well, and for that duration I
started to see a larger increase in vacuum IO time with the patch.
However, the 4 hour run had only one aggressive vacuum (around the 2.5
hour mark), so the numbers are hard to compare because the patch is
meant to do some of the work of the next aggressive vacuum in advance.

The insert/update P99 latency is the same or lower (better) with the patch.

Next I plan to run the hottail delete benchmark with default settings
(including FPIs) with master and with the patch for about 24 hours
each. I'm hoping the long duration will smooth out some of the run
variance even with FPIs.

- Melanie

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Andres Freund

Date:

04 February, 05:09:30

Hi,

On 2025-01-29 14:12:52 -0500, Melanie Plageman wrote:
> From 71f32189aad510b73d221fb0478ffd916e5e5dde Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 27 Jan 2025 12:23:00 -0500
> Subject: [PATCH v14] Eagerly scan all-visible pages to amortize aggressive
>  vacuum
> 
> Amortize the cost of an aggressive vacuum by eagerly scanning some
> all-visible but not all-frozen pages during normal vacuums.

I think it'd be good to explain the problem this is trying to address a bit
more (i.e. the goal is to avoid the situation which a lot of work is deferred
until an aggressive vacuum, which is then very expensive).


> Because the goal is to freeze these all-visible pages, all-visible pages
> that are eagerly scanned and set all-frozen in the visibility map are
> counted as successful eager freezes and those not frozen are considered
> failed eager freezes.

I don't really understand this sentence. "Because the goal is to freeze these
all-visible pages" doesn't really relate with the rest of the sentence.



> +     <varlistentry id="guc-vacuum-max-eager-freeze-failure-rate" xreflabel="vacuum_max_eager_freeze_failure_rate">
> +      <term><varname>vacuum_max_eager_freeze_failure_rate</varname> (<type>floating point</type>)
> +      <indexterm>
> +       <primary><varname>vacuum_max_eager_freeze_failure_rate</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Specifies the maximum fraction of pages that
> +        <command>VACUUM</command> may scan and <emphasis>fail</emphasis> to set
> +        all-frozen in the visibility map before disabling eager scanning. A
> +        value of <literal>0</literal> disables eager scanning altogether. The
> +        default is <literal>0.03</literal> (3%).
> +       </para>

Fraction of what?


> +       <para>
> +        Note that when eager scanning is enabled, successful page freezes
> +        do not count against this limit, although they are internally
> +        capped at 20% of the all-visible but not all-frozen pages in the
> +        relation. Capping successful page freezes helps amortize the
> +        overhead across multiple normal vacuums.
> +       </para>

What does it mean that they are not counted, but are capped?


> +   <para>
> +    If a table is building up a backlog of all-visible but not all-frozen
> +    pages, a normal vacuum may choose to scan skippable pages in an effort to
> +    freeze them. Doing so decreases the number of pages the next aggressive
> +    vacuum must scan. These are referred to as <firstterm>eagerly
> +    scanned</firstterm> pages. Eager scanning can be tuned to attempt
> +    to freeze more all-visible pages by increasing
> +    <xref linkend="guc-vacuum-max-eager-freeze-failure-rate"/>. Even if eager
> +    scanning has kept the number of all-visible but not all-frozen pages to a
> +    minimum, most tables still require periodic aggressive vacuuming.
> +   </para>

Maybe mention that the aggressive vacuuming will often be cheaper than without
eager freezing, even if necessary?


> + * Normal vacuums count all-visible pages eagerly scanned as a success when
> + * they are able to set them all-frozen in the VM and as a failure when they
> + * are not able to set them all-frozen.

Maybe some more punctuation would make this more readable? Or a slight
rephrasing?


> + * Because we want to amortize the overhead of freezing pages over multiple
> + * vacuums, normal vacuums cap the number of successful eager freezes to
> + * MAX_EAGER_FREEZE_SUCCESS_RATE of the number of all-visible but not
> + * all-frozen pages at the beginning of the vacuum. Once the success cap has
> + * been hit, eager scanning is disabled for the remainder of the vacuum of the
> + * relation.

It also caps the maximum "downside" of freezing eagerly, right? Seems worth
mentioning.



> +    /*
> +     * Now calculate the eager scan start block. Start at a random spot
> +     * somewhere within the first eager scan region. This avoids eager
> +     * scanning and failing to freeze the exact same blocks each vacuum of the
> +     * relation.
> +     */

If I understand correctly, we're not really choosing a spot inside the first
eager scan region, we determine the bounds of the first region?



> @@ -930,16 +1188,21 @@ lazy_scan_heap(LVRelState *vacrel)
>      vacrel->current_block = InvalidBlockNumber;
>      vacrel->next_unskippable_block = InvalidBlockNumber;
>      vacrel->next_unskippable_allvis = false;
> +    vacrel->next_unskippable_eager_scanned = false;
>      vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>  
> -    while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
> +    while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm,
> +                                    &was_eager_scanned))

Pedantic^3: Is past tense really appropriate? We *will* be scanning that page
in the body of the loop, right?


> @@ -1064,7 +1327,45 @@ lazy_scan_heap(LVRelState *vacrel)
>          if (got_cleanup_lock)
>              lazy_scan_prune(vacrel, buf, blkno, page,
>                              vmbuffer, all_visible_according_to_vm,
> -                            &has_lpdead_items);
> +                            &has_lpdead_items, &vm_page_frozen);
> +
> +        /*
> +         * Count an eagerly scanned page as a failure or a success.
> +         */
> +        if (was_eager_scanned)

Hm - how should pages be counted that we couldn't get a lock on?  I think
right now they'll be counted as a failure, but that doesn't seem quite right.


> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 0ab921a169b..32a1b8c46a1 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -2826,6 +2826,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>          tab->at_params.is_wraparound = wraparound;
>          tab->at_params.log_min_duration = log_min_duration;
>          tab->at_params.toast_parent = InvalidOid;
> +
> +        /*
> +         * Later we check reloptions for vacuum_max_eager_freeze_failure_rate
> +         * override
> +         */
> +        tab->at_params.max_eager_freeze_failure_rate = vacuum_max_eager_freeze_failure_rate;
>          tab->at_storage_param_vac_cost_limit = avopts ?
>              avopts->vacuum_cost_limit : 0;
>          tab->at_storage_param_vac_cost_delay = avopts ?

I'd mention where that is, so that a reader of that comment doesn't have to
search around...


I think this is pretty close!

Greetings,

Andres Freund

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

04 February, 20:17:35

On Mon, Feb 3, 2025 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > +     <varlistentry id="guc-vacuum-max-eager-freeze-failure-rate" xreflabel="vacuum_max_eager_freeze_failure_rate">
> > +      <term><varname>vacuum_max_eager_freeze_failure_rate</varname> (<type>floating point</type>)
> > +      <indexterm>
> > +       <primary><varname>vacuum_max_eager_freeze_failure_rate</varname> configuration parameter</primary>
> > +      </indexterm>
> > +      </term>
> > +      <listitem>
> > +       <para>
> > +        Specifies the maximum fraction of pages that
> > +        <command>VACUUM</command> may scan and <emphasis>fail</emphasis> to set
> > +        all-frozen in the visibility map before disabling eager scanning. A
> > +        value of <literal>0</literal> disables eager scanning altogether. The
> > +        default is <literal>0.03</literal> (3%).
> > +       </para>
>
> Fraction of what?

"pages", according to the text.

I'm not really sure what you think is wrong with this. There may be an
even better way to explain this but this doesn't seem bad to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Treat

Date:

04 February, 22:57:40

On Tue, Feb 4, 2025 at 12:44 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Mon, Feb 3, 2025 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2025-01-29 14:12:52 -0500, Melanie Plageman wrote:
> > > From 71f32189aad510b73d221fb0478ffd916e5e5dde Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
>
> > > @@ -1064,7 +1327,45 @@ lazy_scan_heap(LVRelState *vacrel)
> > >               if (got_cleanup_lock)
> > >                       lazy_scan_prune(vacrel, buf, blkno, page,
> > >                                                       vmbuffer, all_visible_according_to_vm,
> > > -                                                     &has_lpdead_items);
> > > +                                                     &has_lpdead_items, &vm_page_frozen);
> > > +
> > > +             /*
> > > +              * Count an eagerly scanned page as a failure or a success.
> > > +              */
> > > +             if (was_eager_scanned)
> >
> > Hm - how should pages be counted that we couldn't get a lock on?  I think
> > right now they'll be counted as a failure, but that doesn't seem quite right.
>
> Yea, I thought that counting them as failures made sense because we
> did fail to freeze them. However, now that you mention it, we didn't
> fail to freeze them because of age, so maybe we don't want to count
> them as failures. I don't expect us to have a bunch of contended
> all-visible pages, so I think the question is about what makes it more
> clear in the code. What do you think? Should I reset was_eager_scanned
> to false if we don't get the cleanup lock?
>

I feel like if we are making the trade-off in resources to attempt
eager scanning, and we weren't making progress for whatever reason
(and in the lock failure cases, wouldn't some of those be things that
would prevent us from freezing?) then it would generally be ok to bias
towards bailing sooner rather than later.


Robert Treat
https://xzilla.net

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

04 February, 23:55:19

On Tue, Feb 4, 2025 at 2:57 PM Robert Treat <rob@xzilla.net> wrote:
> > Yea, I thought that counting them as failures made sense because we
> > did fail to freeze them. However, now that you mention it, we didn't
> > fail to freeze them because of age, so maybe we don't want to count
> > them as failures. I don't expect us to have a bunch of contended
> > all-visible pages, so I think the question is about what makes it more
> > clear in the code. What do you think? Should I reset was_eager_scanned
> > to false if we don't get the cleanup lock?
>
> I feel like if we are making the trade-off in resources to attempt
> eager scanning, and we weren't making progress for whatever reason
> (and in the lock failure cases, wouldn't some of those be things that
> would prevent us from freezing?) then it would generally be ok to bias
> towards bailing sooner rather than later.

Failures to acquire cleanup locks are, hopefully, rare, so it may not
matter that much. Having said that, if we skip a page because we can't
acquire a cleanup lock on it, I think that means that it was already
present in shared_buffers, which means that we didn't have to do an
I/O to get it. Since I think the point of the failure cap is mostly to
limit wasted I/O, I would lean toward NOT counting such cases as
failures.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Melanie Plageman

Date:

05 February, 01:34:46

On Tue, Feb 4, 2025 at 3:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 4, 2025 at 2:57 PM Robert Treat <rob@xzilla.net> wrote:
> > > Yea, I thought that counting them as failures made sense because we
> > > did fail to freeze them. However, now that you mention it, we didn't
> > > fail to freeze them because of age, so maybe we don't want to count
> > > them as failures. I don't expect us to have a bunch of contended
> > > all-visible pages, so I think the question is about what makes it more
> > > clear in the code. What do you think? Should I reset was_eager_scanned
> > > to false if we don't get the cleanup lock?
> >
> > I feel like if we are making the trade-off in resources to attempt
> > eager scanning, and we weren't making progress for whatever reason
> > (and in the lock failure cases, wouldn't some of those be things that
> > would prevent us from freezing?) then it would generally be ok to bias
> > towards bailing sooner rather than later.
>
> Failures to acquire cleanup locks are, hopefully, rare, so it may not
> matter that much. Having said that, if we skip a page because we can't
> acquire a cleanup lock on it, I think that means that it was already
> present in shared_buffers, which means that we didn't have to do an
> I/O to get it. Since I think the point of the failure cap is mostly to
> limit wasted I/O, I would lean toward NOT counting such cases as
> failures.

I think I misspoke when I said we are unlikely to have contended
all-visible pages. I suppose it is trivial to concoct a scenario where
there are many pinned all-visible pages.

Initially I agreed with you that we shouldn't count eagerly scanned
pages we failed to freeze because we didn't get the cleanup lock as
failures. If the page is pinned in shared buffers, it is necessarily
not going to cost us a read -- which is the main overhead of failed
eager freezes.

However, if we don't count an eagerly scanned page as a failure when
we don't get the cleanup lock because we won't have incurred a read,
then why would we count any eagerly scanned page in shared buffers as
a failure? In the case that we actually tried freezing the page and
failed because it was too new, that is giving us information about
whether or not we should keep trying to freeze. So, it is not just
about doing the read but also about what the failure indicates about
the data.

Interestingly, we call heap_tuple_should_freeze() in
lazy_scan_noprune(), so we actually could tell whether or not there
are some tuples on the page old enough to trigger freezing if we had
gotten the cleanup lock. One option would be to add a new output
parameter to lazy_scan_noprune() that indicates whether or not there
were tuples with xids older than the FreezeLimit, and only if there
were not, count it as a failed eager scan.

- Melanie

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

05 February, 19:47:22

On Tue, Feb 4, 2025 at 5:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I think I misspoke when I said we are unlikely to have contended
> all-visible pages. I suppose it is trivial to concoct a scenario where
> there are many pinned all-visible pages.

It's hard to keep heap pages pinned for a really long time, so there
aren't likely to be a lot of them at once. Maybe async I/O will change
that somewhat, but the word in this sentence that I would emphasize is
"concoct". I don't think it's normal for there to be lots of pinned
pages just hanging around.

(It's a bit easier to have index pages that stay pinned for a long
time because an index scan can, or at least could last I checked, hold
onto a pin while the cursor was open even if we're not currently
executing the query. But for a heap page that doesn't happen, AFAIK.)

> However, if we don't count an eagerly scanned page as a failure when
> we don't get the cleanup lock because we won't have incurred a read,
> then why would we count any eagerly scanned page in shared buffers as
> a failure? In the case that we actually tried freezing the page and
> failed because it was too new, that is giving us information about
> whether or not we should keep trying to freeze. So, it is not just
> about doing the read but also about what the failure indicates about
> the data.

That's a good point, but failure to get a tuple lock isn't a judgement
either way on whether the XIDs in the page are old or new.

> Interestingly, we call heap_tuple_should_freeze() in
> lazy_scan_noprune(), so we actually could tell whether or not there
> are some tuples on the page old enough to trigger freezing if we had
> gotten the cleanup lock. One option would be to add a new output
> parameter to lazy_scan_noprune() that indicates whether or not there
> were tuples with xids older than the FreezeLimit, and only if there
> were not, count it as a failed eager scan.

Yeah, that could be done, and it doesn't sound like a bad idea.
However, I'm also unsure whether we need to add additional logic for
this case. My intuition is that it just won't happen very often. I'm
not quite confident that I'm correct about that, but if I had to
guess, my guess would be "rare scenario".

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Robert Haas

Date:

10 February, 22:30:15

On Wed, Feb 5, 2025 at 12:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I started getting worried thinking about this. If you have a cursor
> for select * from a table and fetch forward far enough, couldn't
> vacuum fail to get cleanup locks on a whole range of blocks?

I don't think so. A given scan holds at most 1 heap pin at a time, so
I don't see how a cursor doing a FETCH FORWARD would make anything
happen that wouldn't otherwise. I do think it's theoretically possible
that a running query and a running VACUUM could be going at exactly
the same speed so that the VACUUM always tries to pin every page at
the exact same moment the SELECT has it pinned, but in practice I
think it is very unlikely that the timing will work out like that.
Like win-the-lottery kind of unlikely.

> Yea, I guess that means the freeze limit caps wasted read I/O -- but
> you could end up doing no read I/O and still hitting the freeze fail
> limit because it is trying to detect when data is too new to be worth
> eager scanning. But that's probably the behavior we want.

The too-new case is very scary, I think. It's already a problem that
when there are long-running transactions in play, autovacuum commands
a VACUUM which finds nothing it can usefully clean up. It would be
very valuable for someone to figure out a way to prevent that, but the
relevance to this patch is that we need to try hard to avoid making
that problem substantially worse than it is already. Even if the whole
database is memory-resident, the too-new case allows for wasting a
bunch of effort that we would rather not waste; the cap here tries to
make sure that isn't any worse than it needs to be.

> > > Interestingly, we call heap_tuple_should_freeze() in
> > > lazy_scan_noprune(), so we actually could tell whether or not there
> > > are some tuples on the page old enough to trigger freezing if we had
> > > gotten the cleanup lock. One option would be to add a new output
> > > parameter to lazy_scan_noprune() that indicates whether or not there
> > > were tuples with xids older than the FreezeLimit, and only if there
> > > were not, count it as a failed eager scan.
> >
> > Yeah, that could be done, and it doesn't sound like a bad idea.
> > However, I'm also unsure whether we need to add additional logic for
> > this case. My intuition is that it just won't happen very often. I'm
> > not quite confident that I'm correct about that, but if I had to
> > guess, my guess would be "rare scenario".
>
> I've not done this in the attached v16. I have added a comment about it.
> I think not doing it is a judgment call and not a bug, right?

As of this writing, I do not believe that this is an essential part of
this patch, but it is always possible that with the benefit of
hindsight it will seem otherwise. That's just life, though. If I knew
in advance which of my decisions would later seem like mistakes, I
would make very few mistakes.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Andres Freund

Date:

11 February, 04:06:12

Hi,

On 2025-02-04 12:44:22 -0500, Melanie Plageman wrote:
> On Mon, Feb 3, 2025 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > > +     /*
> > > +      * Now calculate the eager scan start block. Start at a random spot
> > > +      * somewhere within the first eager scan region. This avoids eager
> > > +      * scanning and failing to freeze the exact same blocks each vacuum of the
> > > +      * relation.
> > > +      */
> >
> > If I understand correctly, we're not really choosing a spot inside the first
> > eager scan region, we determine the bounds of the first region?
> 
> I'm not sure I understand how those are different, but I updated the
> comment a bit. Maybe you can elaborate what you mean?

Let's assume that we use regions of 512 pages. Without randomness we'd do:

[0 .. 512) [512 .. 1024) ...


IMO, if we were to choose a spot inside the first region, we'd do:

[random(0, 512) .. 512) [512 .. 1024)


If we choose the bounds of the first region we'd do:

[0, random(0, 512)) [$prior bound .. $prior_bound + 512)

or something like that.

Greetings,

Andres Freund


PS: planning to respond to cleanup lock stuff downthread, after a bit of
exercise or tomorrow morning

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Andres Freund

Date:

11 February, 19:06:27

Hi,

On 2025-02-10 14:30:15 -0500, Robert Haas wrote:
> On Wed, Feb 5, 2025 at 12:23 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I started getting worried thinking about this. If you have a cursor
> > for select * from a table and fetch forward far enough, couldn't
> > vacuum fail to get cleanup locks on a whole range of blocks?
> 
> I don't think so. A given scan holds at most 1 heap pin at a time, so
> I don't see how a cursor doing a FETCH FORWARD would make anything
> happen that wouldn't otherwise. I do think it's theoretically possible
> that a running query and a running VACUUM could be going at exactly
> the same speed so that the VACUUM always tries to pin every page at
> the exact same moment the SELECT has it pinned, but in practice I
> think it is very unlikely that the timing will work out like that.
> Like win-the-lottery kind of unlikely.

I think the most common way to get a range of pages pinned is to have a bunch
of backends doing index nested loop joins to a fairly narrow value range in a
table. You can sometimes see (via pg_buffercache) that there are some pages
that are rather heavily pinned, but just about no other page in the table is.


> > Yea, I guess that means the freeze limit caps wasted read I/O -- but
> > you could end up doing no read I/O and still hitting the freeze fail
> > limit because it is trying to detect when data is too new to be worth
> > eager scanning. But that's probably the behavior we want.
> 
> The too-new case is very scary, I think. It's already a problem that
> when there are long-running transactions in play, autovacuum commands
> a VACUUM which finds nothing it can usefully clean up. It would be
> very valuable for someone to figure out a way to prevent that, but the
> relevance to this patch is that we need to try hard to avoid making
> that problem substantially worse than it is already. Even if the whole
> database is memory-resident, the too-new case allows for wasting a
> bunch of effort that we would rather not waste; the cap here tries to
> make sure that isn't any worse than it needs to be.

IME the most common way for this issue is anti-wraparound vacuums that can't
actually clean up anything, due to some old transaction. We'll just fire those
off over and over again, without taking into account that we still can't make
progress.

That scenario doesn't apply the same way to this patch, because by the time
you get to an anti-wrap vacuum, this patch doesn't do anything.

Of course it's possible that the dead tuple trigger causes repeated vacuums
that each can't clean up anything, but that's much less common IME.


> > > > Interestingly, we call heap_tuple_should_freeze() in
> > > > lazy_scan_noprune(), so we actually could tell whether or not there
> > > > are some tuples on the page old enough to trigger freezing if we had
> > > > gotten the cleanup lock. One option would be to add a new output
> > > > parameter to lazy_scan_noprune() that indicates whether or not there
> > > > were tuples with xids older than the FreezeLimit, and only if there
> > > > were not, count it as a failed eager scan.
> > >
> > > Yeah, that could be done, and it doesn't sound like a bad idea.
> > > However, I'm also unsure whether we need to add additional logic for
> > > this case. My intuition is that it just won't happen very often. I'm
> > > not quite confident that I'm correct about that, but if I had to
> > > guess, my guess would be "rare scenario".
> >
> > I've not done this in the attached v16. I have added a comment about it.
> > I think not doing it is a judgment call and not a bug, right?
> 
> As of this writing, I do not believe that this is an essential part of
> this patch, but it is always possible that with the benefit of
> hindsight it will seem otherwise. That's just life, though. If I knew
> in advance which of my decisions would later seem like mistakes, I
> would make very few mistakes.

I agree that we can disregard this for now.

Greetings,

Andres Freund

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Andres Freund

Date:

11 February, 19:31:46

On 2025-02-05 12:23:29 -0500, Melanie Plageman wrote:
> Attached v16 implements the logic to not count pages we failed to
> freeze because of cleanup lock contention as eager freeze failures.

That looks good to me.

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From

Melanie Plageman

Date:

11 February, 21:57:25

On Tue, Feb 11, 2025 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-02-05 12:23:29 -0500, Melanie Plageman wrote:
> > Attached v16 implements the logic to not count pages we failed to
> > freeze because of cleanup lock contention as eager freeze failures.
>
> That looks good to me.

Cool. Committed and marked as such in the cf app.

- Melanie