Thread: Re: Eagerly scan all-visible pages to amortize aggressive vacuum

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Andres Freund
Date:
Hi,

On 2024-11-01 19:35:22 -0400, Melanie Plageman wrote:
> It implements a new "semi-aggressive" vacuum. Semi-aggressive vacuums
> eagerly scan some number of all-visible but not all-frozen pages in
> hopes of freezing them. All-visible pages that are eagerly scanned and
> set all-frozen in the visibility map are considered successful eager
> scans and those not frozen are considered failed eager scans.

I think it's worth mentioning that this would be useful to eventually be able
to mark pages as all-visible during on-access-pruning, to allow for
index-only-scans (and also faster seqscans). Today that'd have the danger of
increasing the "anti-wraparound vacuum debt" unconscionably, but with this
patch that'd be addressed.


> Because we want to amortize our eager scanning across a few vacuums,
> we cap the maximum number of successful eager scans to a percentage of
> the total number of all-visible but not all-frozen pages in the table
> (currently 20%).

One thing worth mentioning around here seems that we currently can't
partially-aggressively freeze tuples that are "too young" and how that
interacts with everything else.


> To demonstrate the results, I ran an append-only workload run for over
> 3 hours on master and with my patch applied. The patched version of
> Postgres amortized the work of freezing the all-visible but not
> all-frozen pages nicely. The first aggressive vacuum with the patch
> was 44 seconds and on master it was 1201 seconds.

Oops.


> patch
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 44 seconds (msecs: 44661).
>     pages: 0 removed, 27425085 remain, 1104095 scanned (4.03% of
> total), 709889 eagerly scanned
>     frozen: 316544 pages from table (1.15% of total) had 17409920
> tuples frozen. 316544 pages set all-frozen in the VM
>     I/O timings: read: 1160.599 ms, write: 2461.205 ms. approx time
> spent in vacuum delay: 16230 ms.
>     buffer usage: 1105630 hits, 1111898 reads, 646229 newly dirtied,
> 1750426 dirtied.
>     WAL usage: 1027099 records, 316566 full page images, 276209780 bytes.
>
> master
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 1201 seconds (msecs: 1201487).
>     pages: 0 removed, 27515348 remain, 15800948 scanned (57.43% of
> total), 15098257 eagerly scanned
>     frozen: 15096384 pages from table (54.87% of total) had 830247896
> tuples frozen. 15096384 pages set all-frozen in the VM
>     I/O timings: read: 246537.348 ms, write: 73000.498 ms. approx time
> spent in vacuum delay: 349166 ms.
>     buffer usage: 15798343 hits, 15813524 reads, 15097063 newly
> dirtied, 31274333 dirtied.
>     WAL usage: 30733564 records, 15097073 full page images, 11789257631 bytes.
>
> This is because, with the patch, the freezing work is being off-loaded
> to earlier vacuums.
>
> In the attached chart.png, you can see the vm_page_freezes climbing
> steadily with the patch, whereas on master, there are sudden spikes
> aligned with the aggressive vacuums. You can also see that the number
> of pages that are all-visible but not all-frozen grows steadily on
> master until the aggressive vacuum. This is vacuum's "backlog" of
> freezing work.

What's the reason for all-visible-but-not-all-frozen to increase to a higher
value initially than where it later settles?


> In this benchmark, the TPS is rate-limited, but using pgbench
> per-statement reports, I calculated that the P99 latency for this
> benchmark is 16 ms on master and 1 ms with the patch. Vacuuming pages
> sooner decreases vacuum reads and doing the freezing work spread over
> more vacuums improves P99 transaction latency.

Nice!


> Below is the comparative WAL volume, checkpointer and background
> writer writes, reads and writes done by all other backend types, time
> spent vacuuming in milliseconds, and p99 latency. Notice that overall
> vacuum IO time is substantially lower with the patch.
>
>    version     wal  cptr_bgwriter_w   other_rw  vac_io_time  p99_lat
>     patch   770 GB          5903264  235073744   513722         1
>     master  767 GB          5908523  216887764  1003654        16

Hm. It's not clear to me why other_rw is higher with the patch? After all,
given the workload, there's no chance of unnecessarily freezing tuples? Is
that just because at the end of the benchmark there's leftover work?



> From 69b517f34caf39ad814691d6412c68d54e852990 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 10:53:37 -0400
> Subject: [PATCH v1 1/9] Rename LVRelState->frozen_pages
>
> Rename frozen_pages to tuple_freeze_pages in LVRelState, the struct used
> for tracking state during vacuuming of a heap relation. frozen_pages
> sounds like it includes every all-frozen page. That is a misnomer. It
> does not include pages with already frozen tuples. It also includes
> pages that are not actually all-frozen.

Hm. Is tuple_freeze_pages that much clearer, it could also just indicate pages
that already were frozen. How about "newly_frozen_pages"?


> Subject: [PATCH v1 2/9] Make visibilitymap_set() return previous state of

LGTM.



> From c317a272713bb833f7f2761a5be1924f8e1bdb4d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 31 Oct 2024 18:19:18 -0400
> Subject: [PATCH v1 3/9] Count pages set all-visible and all-frozen in VM
>  during vacuum
>
> Vacuum already counts and logs pages with newly frozen tuples. Count and
> log pages set all-frozen in the VM too. This includes pages that are
> empty before or after vacuuming.
>
> While we are at it, count and log the number of pages vacuum set
> all-visible. Pages that are all-visible but not all-frozen are debt for
> future aggressive vacuums. The newly all-visible and all-frozen counts
> give us visiblity into the rate at which this debt is being accrued and
> paid down.

Hm. Why is it interesting to log VM changes separately from the state changes
of underlying pages? Wouldn't it e.g. be more intersting to log the number of
empty pages that vacuum [re-]discovered? I've a bit hard time seeing how a
user could take advantage of this.


> From 67781cc2511bb7d62ccc9461f1787272820abcc4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:07:50 -0400
> Subject: [PATCH v1 4/9] Replace uses of blkno local variable in
>  lazy_scan_heap()

Largely LGTM, but I'm not sure that it's worth having as a separate commit.


> @@ -1114,7 +1117,6 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>              ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
>              vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>          }
> -        *blkno = vacrel->rel_pages;
>          return false;
>      }

I'd seat *blkno to InvalidBlockNumber or such, that'd make misuse more
apparent than having it set to some random older block.


> From 67b5565ad57d3b196695f85811dde2044ba79f3e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:14:24 -0400
> Subject: [PATCH v1 5/9] Move vacuum VM buffer release
>
> The VM buffer for the next unskippable block can be released after the
> main loop in lazy_scan_heap(). Doing so de-clutters
> heap_vac_scan_next_block() and opens up more refactoring options.

That's vague...


> From 8485dc400b3d4e9f895170af4f5fb1bb959b8495 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:36:58 -0400
> Subject: [PATCH v1 6/9] Remove superfluous next_block local variable in vacuum
>  code
>
> Reduce the number of block related variables in lazy_scan_heap() and its
> helpers by removing the next_block local variable from
> heap_vac_scan_next_block().

I don't mind this change, but I also don't get how it's related to anything
else here or why it's really better than the status quo.


> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 4b1eadea1f2..52c9d49f2b1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1112,19 +1112,17 @@ static bool
>  heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>                           bool *all_visible_according_to_vm)
>  {
> -    BlockNumber next_block;
> -
>      /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
> -    next_block = vacrel->current_block + 1;
> +    vacrel->current_block++;

I realize this isn't introduced in this commit, but darn, that's ugly.


> From 78ad9e022b95e024ff5bfa96af78e9e44730c970 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:42:10 -0400
> Subject: [PATCH v1 7/9] Make heap_vac_scan_next_block() return BlockNumber


> @@ -857,7 +857,8 @@ lazy_scan_heap(LVRelState *vacrel)
>      vacrel->next_unskippable_allvis = false;
>      vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>
> -    while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
> +    while (BlockNumberIsValid(blkno = heap_vac_scan_next_block(vacrel,
> +                                                               &all_visible_according_to_vm)))

Personally I'd write this as

while (true)
{
    BlockNumber blkno;

    blkno = heap_vac_scan_next_block(vacrel, ...);

    if (!BlockNumberIsValid(blkno))
       break;

Mostly because it's good to use more minimal scopes when possible,
particularly when previously the scope intentionally was larger. But also
partially because I don't love variable assignments inside a macro call,
inside a while().



> From 818d1c3b068c6705611256cfc3eb1f10bdc0b684 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 1 Nov 2024 18:25:05 -0400
> Subject: [PATCH v1 8/9] WIP: Add more general summary to vacuumlazy.c
>
> Currently the summary at the top of vacuumlazy.c provides some specific
> details related to the new dead TID storage in 17. I plan to add a
> summary and maybe some sub-sections to contextualize it.

I like this idea. It's hard to understand vacuumlazy.c without already
understanding vacuumlazy.c, which isn't a good situation.


> ---
>  src/backend/access/heap/vacuumlazy.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 7ce69953ba0..15a04c6b10b 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -3,6 +3,17 @@
>   * vacuumlazy.c
>   *      Concurrent ("lazy") vacuuming.
>   *
> + * Heap relations are vacuumed in three main phases. In the first phase,
> + * vacuum scans relation pages, pruning and freezing tuples and saving dead
> + * tuples' TIDs in a TID store. If that TID store fills up or vacuum finishes
> + * scanning the relation, it progresses to the second phase: index vacuuming.
> + * After index vacuuming is complete, vacuum scans the blocks of the relation
> + * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
> + * that space for future tuples. Finally, vacuum may truncate the relation if
> + * it has emptied pages at the end. XXX: this summary needs work.

Yea, at least we ought to mention that the phasing can be different when there
are no indexes and that the later phases can heuristically be omitted when
there aren't enough dead items.



> From f21f0bab1dbe675be4b4dddcb2eea486d8a69d36 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 12:15:08 -0400
> Subject: [PATCH v1 9/9] Eagerly scan all-visible pages to amortize aggressive
>  vacuum
>
> Introduce semi-aggressive vacuums, which scan some of the all-visible
> but not all-frozen pages in the relation to amortize the cost of an
> aggressive vacuum.

I wonder if "aggressive" is really the right terminology going
forward... Somehow it doesn't seem particularly descriptive anymore if, in
many workloads, almost all vacuums are going to be aggressive-ish.


> Because the goal is to freeze these all-visible pages, all-visible pages
> that are eagerly scanned and set all-frozen in the visibility map are
> considered successful eager scans and those not frozen are considered
> failed eager scans.
>
> If too many eager scans fail in a row, eager scanning is temporarily
> suspended until a later portion of the relation. Because the goal is to
> amortize aggressive vacuums, we cap the number of successes as well.
> Once we reach the maximum number of blocks successfully eager scanned
> and frozen, the semi-aggressive vacuum is downgraded to an unaggressive
> vacuum.
> ---
>  src/backend/access/heap/vacuumlazy.c | 327 +++++++++++++++++++++++----
>  src/backend/commands/vacuum.c        |  20 +-
>  src/include/commands/vacuum.h        |  27 ++-
>  src/tools/pgindent/typedefs.list     |   1 +
>  4 files changed, 326 insertions(+), 49 deletions(-)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 15a04c6b10b..adabb5ff5f1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -12,6 +12,40 @@
>   * that space for future tuples. Finally, vacuum may truncate the relation if
>   * it has emptied pages at the end. XXX: this summary needs work.
>   *
> + * Relation Scanning:
> + *
> + * Vacuum scans the heap relation, starting at the beginning and progressing
> + * to the end, skipping pages as permitted by their visibility status, vacuum
> + * options, and the aggressiveness level of the vacuum.
> + *
> + * When page skipping is enabled, unaggressive vacuums may skip scanning pages
> + * that are marked all-visible in the visibility map. We may choose not to
> + * skip pages if the range of skippable pages is below SKIP_PAGES_THRESHOLD.
> + *
> + * Semi-aggressive vacuums will scan skippable pages in an effort to freeze
> + * them and decrease the backlog of all-visible but not all-frozen pages that
> + * have to be processed to advance relfrozenxid and avoid transaction ID
> + * wraparound.
> + *
> + * We count it as a success when we are able to set an eagerly scanned page
> + * all-frozen in the VM and a failure when we are not able to set the page
> + * all-frozen.
> + *
> + * Because we want to amortize the overhead of freezing pages over multiple
> + * vacuums, we cap the number of successful eager scans to
> + * EAGER_SCAN_SUCCESS_RATE of the number of all-visible but not all-frozen
> + * pages at the beginning of the vacuum.
> + *
> + * On the assumption that different regions of the table are likely to contain
> + * similarly aged data, we use a localized failure cap instead of a global cap
> + * for the whole relation. The failure count is reset on each region of the
> + * table, comprised of RELSEG_SIZE blocks (or 1/4 of the table size for a
> + * small table). In each region, we tolerate MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * before suspending eager scanning until the end of the region.

I'm a bit surprised to see such large regions. Why not something finer, in the
range of a few megabytes?  The FSM steers new tuples quite aggressively to the
start of the table, which means that in many workloads there will be old and
new data interspersed at the start of the table. Using RELSEG_SIZE sized
regions for semi-aggressive vacuuming will mean that we'll often not do any
semi-aggressive processing beyond the start of the relation, as we'll reach
the failure rate quickly.

I also find it layering-wise a bit weird to use RELSEG_SIZE, that's really imo
is just an md.c concept.


> +/*
> + * Semi-aggressive vacuums eagerly scan some all-visible but not all-frozen
> + * pages. Since our goal is to freeze these pages, an eager scan that fails to
> + * set the page all-frozen in the VM is considered to have "failed".
> + *
> + * On the assumption that different regions of the table tend to have
> + * similarly aged data, once we fail to freeze MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * blocks, we suspend eager scanning until vacuum has progressed to another
> + * region of the table with potentially older data.
> + */
> +#define MAX_SUCCESSIVE_EAGER_SCAN_FAILS 1024

Can this really be a constant, given that the semi-aggressive regions are
shrunk for small tables?


> +/*
> + * An eager scan of a page that is set all-frozen in the VM is considered
> + * "successful". To spread out eager scanning across multiple semi-aggressive
> + * vacuums, we limit the number of successful eager scans (as well as the
> + * number of failures). The maximum number of successful eager scans is
> + * calculated as a ratio of the all-visible but not all-frozen pages at the
> + * beginning of the vacuum.
> + */
> +#define EAGER_SCAN_SUCCESS_RATE 0.2
>  typedef struct LVRelState

There imo should be newlines between the define and LVRelState.

>  {
>      /* Target heap relation and its indexes */
> @@ -153,8 +208,22 @@ typedef struct LVRelState
>      BufferAccessStrategy bstrategy;
>      ParallelVacuumState *pvs;
>
> -    /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
> -    bool        aggressive;
> +    /*
> +     * Whether or not this is an aggressive, semi-aggressive, or unaggressive
> +     * VACUUM. A fully aggressive vacuum must set relfrozenxid >= FreezeLimit
> +     * and therefore must scan every unfrozen tuple. A semi-aggressive vacuum
> +     * will scan a certain number of all-visible pages until it is downgraded
> +     * to an unaggressive vacuum.
> +     */
> +    VacAggressive aggressive;

Hm. A few comments:

- why is VacAggressive defined in vacuum.h? Isn't this fairly tightly coupled
  to heapam?
- Kinda feels like the type should be named VacAggressivness or such?


> +    /*
> +     * A semi-aggressive vacuum that has failed to freeze too many eagerly
> +     * scanned blocks in a row suspends eager scanning. unaggressive_to is the
> +     * block number of the first block eligible for resumed eager scanning.
> +     */
> +    BlockNumber unaggressive_to;

What's it set to otherwise? What is it set to in aggressive vacuums?


> @@ -227,6 +296,26 @@ typedef struct LVRelState
>      BlockNumber next_unskippable_block; /* next unskippable block */
>      bool        next_unskippable_allvis;    /* its visibility status */
>      Buffer        next_unskippable_vmbuffer;    /* buffer containing its VM bit */
> +
> +    /*
> +     * Count of skippable blocks eagerly scanned as part of a semi-aggressive
> +     * vacuum (for logging only).
> +     */
> +    BlockNumber eager_scanned;

I'd add _pages or such.


> +    /*
> +     * The number of eagerly scanned blocks a semi-aggressive vacuum failed to
> +     * freeze (due to age) in the current eager scan region. It is reset each
> +     * time we hit MAX_SUCCESSIVE_EAGER_SCAN_FAILS.
> +     */
> +    BlockNumber eager_scanned_failed_frozen;
> +
> +    /*
> +     * The remaining number of blocks a semi-aggressive vacuum will consider
> +     * eager scanning. This is initialized to EAGER_SCAN_SUCCESS_RATE of the
> +     * total number of all-visible but not all-frozen pages.
> +     */
> +    BlockNumber remaining_eager_scan_successes;

I think it might look better if you just bundled these into a struct like

      struct
      {
        BlockNumber scanned;
        BlockNumber failed_frozen;
        BlockNumber remaining_successes;
      } eager_pages;


> @@ -471,24 +569,49 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>           * Force aggressive mode, and disable skipping blocks using the
>           * visibility map (even those set all-frozen)
>           */
> -        vacrel->aggressive = true;
> +        vacrel->aggressive = VAC_AGGRESSIVE;
>          skipwithvm = false;
>      }
>
>      vacrel->skipwithvm = skipwithvm;
> +    vacrel->eager_scanned = 0;
> +    vacrel->eager_scanned_failed_frozen = 0;
> +
> +    /*
> +     * Even if we successfully freeze them, we want to cap the number of
> +     * eagerly scanned blocks so that we spread out the overhead across
> +     * multiple vacuums. remaining_eager_scan_successes is only used by
> +     * semi-aggressive vacuums.
> +     */

Somehow the "them" feels like a dangling pointer that's initialized too late ;)


> +    visibilitymap_count(rel, &orig_rel_allvisible, &orig_rel_allfrozen);
> +    vacrel->remaining_eager_scan_successes =
> +        (BlockNumber) (EAGER_SCAN_SUCCESS_RATE * (orig_rel_allvisible - orig_rel_allfrozen));
>
>      if (verbose)
>      {
> -        if (vacrel->aggressive)
> -            ereport(INFO,
> -                    (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> -        else
> -            ereport(INFO,
> -                    (errmsg("vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> +        switch (vacrel->aggressive)
> +        {
> +            case VAC_UNAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_AGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_SEMIAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("semiaggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +        }

Wonder if we should have a function that returns the aggressiveness of a
vacuum as an already translated string. There are other places where we emit
the aggressiveness as part of a message, and it's pretty silly to duplicate
most of the message.


> @@ -545,11 +668,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>       * Non-aggressive VACUUMs may advance them by any amount, or not at all.
>       */
>      Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
> -           TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
> +           TransactionIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                         vacrel->cutoffs.FreezeLimit :
>                                           vacrel->cutoffs.relfrozenxid,
>                                           vacrel->NewRelfrozenXid));
>      Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
> -           MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
> +           MultiXactIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                       vacrel->cutoffs.MultiXactCutoff :
>                                         vacrel->cutoffs.relminmxid,
>                                         vacrel->NewRelminMxid));
>      if (vacrel->skippedallvis)

These are starting to feel somewhat complicated. Wonder if it'd be easier to
read if they were written as normal ifs.


> +/*
> + * Helper to decrement a block number to 0 without wrapping around.
> + */
> +static void
> +decrement_blkno(BlockNumber *block)
> +{
> +    if ((*block) > 0)
> +        (*block)--;
> +}

Huh.


> @@ -956,11 +1094,23 @@ lazy_scan_heap(LVRelState *vacrel)
>          if (!got_cleanup_lock)
>              LockBuffer(buf, BUFFER_LOCK_SHARE);
>
> +        page_freezes = vacrel->vm_page_freezes;
> +
>          /* Check for new or empty pages before lazy_scan_[no]prune call */
>          if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, !got_cleanup_lock,
>                                     vmbuffer))
>          {
>              /* Processed as new/empty page (lock and pin released) */
> +
> +            /* count an eagerly scanned page as a failure or a success */
> +            if (was_eager_scanned)
> +            {
> +                if (vacrel->vm_page_freezes > page_freezes)
> +                    decrement_blkno(&vacrel->remaining_eager_scan_successes);
> +                else
> +                    vacrel->eager_scanned_failed_frozen++;
> +            }
> +
>              continue;
>          }

Maybe I'm confused, but ISTM that remaining_eager_scan_successes shouldn't
actually be a BlockNumber, given it doesn't actually indicates a specific
block...

I don't understand why we would sometimes want to treat empty pages as a
failure? They can't fail to be frozen, no?

I'm not sure it makes sense to count them as progress towards the success
limit either - afaict we just rediscovered free space is the table. That's imo
separate from semi-aggressive freezing.


Storing page_freezes as a copy of vacrel->vm_page_freezes and then checking if
that increased feels like a somewhat ugly way of tracking if freezing
happend. There's no more direct way?

Why is decrement_blkno() needed? How can we ever get into negative territory?
Shouldn't eager scanning have been disabled when
remaining_eager_scan_successes reaches zero and thus prevent
remaining_eager_scan_successes from ever going below zero?



> @@ -1144,7 +1310,65 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>           */
>          bool        skipsallvis;
>
> -        find_next_unskippable_block(vacrel, &skipsallvis);
> +        /*
> +         * Figure out if we should disable eager scan going forward or
> +         * downgrade to an unaggressive vacuum altogether.
> +         */
> +        if (vacrel->aggressive == VAC_SEMIAGGRESSIVE)
> +        {
> +            /*
> +             * If we hit our success limit, there is no need to eagerly scan
> +             * any additional pages. Downgrade the vacuum to unaggressive.
> +             */
> +            if (vacrel->remaining_eager_scan_successes == 0)
> +                vacrel->aggressive = VAC_UNAGGRESSIVE;
> +
> +            /*
> +             * If we hit the max number of failed eager scans for this region
> +             * of the table, figure out where the next eager scan region
> +             * should start. Eager scanning is effectively disabled until we
> +             * scan a block in that new region.
> +             */
> +            else if (vacrel->eager_scanned_failed_frozen >=
> +                     MAX_SUCCESSIVE_EAGER_SCAN_FAILS)
> +            {
> +                BlockNumber region_size,
> +                            offset;
> +

Why are we doing this logic here, rather than after incrementing
eager_scanned_failed_frozen? Seems that'd limit the amount of times we need to
run through this logic substantially?


> +                /*
> +                 * On the assumption that different regions of the table are
> +                 * likely to have similarly aged data, we will retry eager
> +                 * scanning again later. For a small table, we'll retry eager
> +                 * scanning every quarter of the table. For a larger table,
> +                 * we'll consider eager scanning again after processing
> +                 * another region's worth of data.
> +                 *
> +                 * We consider the region to start from the first failure, so
> +                 * calculate the block to restart eager scanning from there.
> +                 */
> +                region_size = Min(RELSEG_SIZE, (vacrel->rel_pages / 4));
> +
> +                offset = vacrel->eager_scanned_failed_frozen % region_size;
> +
> +                Assert(vacrel->eager_scanned > 0);
> +
> +                vacrel->unaggressive_to = vacrel->current_block + (region_size - offset);
> +                vacrel->eager_scanned_failed_frozen = 0;
> +            }
> +        }

I'd put the logic for this in a separate function, feels complicated and
independent enough for it to be worth not having it in heap_vac_scan_next_block().


> @@ -1199,6 +1423,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * The next unskippable block and its visibility information is updated in
>   * vacrel.
>   *
> + * consider_eager_scan indicates whether or not we should consider scanning
> + * all-visible but not all-frozen blocks. was_eager_scanned is set to true if
> + * we decided to eager scan a block. In this case, next_unskippable_block is
> + * set to that block number.
> + *
>   * Note: our opinion of which blocks can be skipped can go stale immediately.
>   * It's okay if caller "misses" a page whose all-visible or all-frozen marking
>   * was concurrently cleared, though.  All that matters is that caller scan all
> @@ -1208,7 +1437,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * to skip such a range is actually made, making everything safe.)
>   */
>  static void
> -find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> +find_next_unskippable_block(
> +                            LVRelState *vacrel,
> +                            bool consider_eager_scan,
> +                            bool *was_eager_scanned,
> +                            bool *skipsallvis)

I don't think we ever have function definitions without a parameter on the
line with the function name.

Hm - isn't consider_eager_scan potentially outdated after
find_next_unskippable_block() iterated over a bunch of blocks? If
consider_eager_scan is false, this could very well go into the next
"semi-aggressive region", where consider_eager_scan should have been
re-enabled, no?


Greetings,

Andres Freund



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Robert Treat
Date:
On Fri, Dec 13, 2024 at 5:53 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, Nov 7, 2024 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> > Hi,
>
> Thanks for the review!
> Attached v2 should address your feedback and also fixes a few bugs with v1.
>
> I've still yet to run very long-running benchmarks. I did start running more
> varied benchmark scenarios -- but all still under two hours. So far, the
> behavior is as expected.
>
> > On 2024-11-01 19:35:22 -0400, Melanie Plageman wrote:
> > > Because we want to amortize our eager scanning across a few vacuums,
> > > we cap the maximum number of successful eager scans to a percentage of
> > > the total number of all-visible but not all-frozen pages in the table
> > > (currently 20%).
> >
> > One thing worth mentioning around here seems that we currently can't
> > partially-aggressively freeze tuples that are "too young" and how that
> > interacts with everything else.
>
> I'm not sure I know what you mean. Are you talking about how we don't freeze
> tuples that are visible to everyone but younger than the freeze limit?
>

FWIW that was my interpretation of his statement, though I had a
clarifying question around this topic myself, which is, from a user
perspective when would we expect to see these eager vacuums? ISTM we
would be doing 'normal vacuums' prior to vacuum_freeze_min_age, and
'aggressive vacuums' after (autovacuum_freeze_max_age -
vacuum_freeze_min_age), so the expectation is that 'eager vacuums'
would fall into the ~50 million transaction window between those two
points (assuming defaults, which admittedly I don't use). Does that
sound right?

> > > In the attached chart.png, you can see the vm_page_freezes climbing
> > > steadily with the patch, whereas on master, there are sudden spikes
> > > aligned with the aggressive vacuums. You can also see that the number
> > > of pages that are all-visible but not all-frozen grows steadily on
> > > master until the aggressive vacuum. This is vacuum's "backlog" of
> > > freezing work.
> >
> > What's the reason for all-visible-but-not-all-frozen to increase to a higher
> > value initially than where it later settles?
>
> My guess is that it has to do with shorter, more frequent vacuums at the
> beginning of the benchmark when the relation is smaller (and we haven't
> exceeded shared buffers or memory yet). They are setting pages all-visible, but
> we haven't used up enough xids yet to qualify for an eager vacuum.
>
> The peak of AVnAF pages aligns with the start of the first eager vacuum. We
> don't do any eager scanning until we are sure there is some data requiring
> freeze (see this criteria):
>
>     if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
>         TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
>                                       vacrel->cutoffs.FreezeLimit))
>
> Once we have used up enough xids to qualify for the first eager vacuum, the
> number of AVnAF pages starts to go down.
>
> It would follow from this theory that we would see a build-up like this after
> each relfrozenxid advancement (so after the next aggressive vacuum).
>
> But I think we don't see this because the vacuums are longer by the time
> aggressive vacuums have started, so we end up using up enough XIDs between
> vacuums to qualify for eager vacuums on vacuums after the aggressive vacuum.
>
> That is just my theory though.
>

I like your theory but it's a little too counterintuitive for me :-)

I would expect we'd see a change in the vacuum time & rate after the
first aggressive scan, which incidentally your graph does show for
master, but it looks a bit too smooth on your original patchset. I
guess there could be a sweet spot where the rate of changes fit
perfectly with regards to the line between lazy / eager vacuums, but
hard to imagine you were that lucky.

> > > Below is the comparative WAL volume, checkpointer and background
> > > writer writes, reads and writes done by all other backend types, time
> > > spent vacuuming in milliseconds, and p99 latency. Notice that overall
> > > vacuum IO time is substantially lower with the patch.
> > >
> > >    version     wal  cptr_bgwriter_w   other_rw  vac_io_time  p99_lat
> > >     patch   770 GB          5903264  235073744   513722         1
> > >     master  767 GB          5908523  216887764  1003654        16
> >
> > Hm. It's not clear to me why other_rw is higher with the patch? After all,
> > given the workload, there's no chance of unnecessarily freezing tuples? Is
> > that just because at the end of the benchmark there's leftover work?
>
> So other_rw is mostly client backend and autovacuum reads and writes. It is
> higher with the patch because there are actually more vacuum reads and writes
> with the patch than on master. However the autovacuum worker read and write
> time is much lower. Those blocks are more often in shared buffers, I would
> guess.
>
<snip>
> From e36b4fac345be44954410c4f0e61467dc0f49a72 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 12 Dec 2024 16:44:37 -0500
> Subject: [PATCH v2 10/10] Eagerly scan all-visible pages to amortize
> aggressive vacuum
>
> @@ -27,11 +27,37 @@
>  * to the end, skipping pages as permitted by their visibility status, vacuum
>  * options, and the eagerness level of the vacuum.
>  *
> + * There are three vacuum eagerness levels: normal vacuum, eager vacuum, and
> + * aggressive vacuum.
> + *
>  * When page skipping is enabled, non-aggressive vacuums may skip scanning
> - * pages that are marked all-visible in the visibility map. We may choose not
> + * pages that are marked all-visible in the visibility map. It may choose not
>  * to skip pages if the range of skippable pages is below
>  * SKIP_PAGES_THRESHOLD.
>  *

I find the above confusing since page skipping is the regular activity
but referred to in the negative, and because you use the term
"non-aggressive vacuums" which in prior releases only mapped to
"normal" vacuums, but now would map to both "normal" and "eager"
vacuums, and it isn't clear that is desired (in my head anyway). Does
the following still convey what you meant (and hopefully work better
with the paragraphs that follow)?

When page skipping is not disabled, a normal vacuum may skip scanning
pages that are marked all-visible in the visibility map if the range
of skippable pages is below SKIP_PAGES_THRESHOLD.

> + * Eager vacuums will scan skippable pages in an effort to freeze them and
> + * decrease the backlog of all-visible but not all-frozen pages that have to
> + * be processed to advance relfrozenxid and avoid transaction ID wraparound.
> + *

> @@ -170,6 +197,51 @@ typedef enum
>       VACUUM_ERRCB_PHASE_TRUNCATE,
> } VacErrPhase;
>
> +/*
> + * Eager vacuums scan some all-visible but not all-frozen pages. Since our
> + * goal is to freeze these pages, an eager scan that fails to set the page
> + * all-frozen in the VM is considered to have "failed".
> + *
> + * On the assumption that different regions of the table tend to have
> + * similarly aged data, once we fail to freeze EAGER_SCAN_MAX_FAILS_PER_REGION
> + * blocks in a region of size EAGER_SCAN_REGION_SIZE, we suspend eager
> + * scanning until vacuum has progressed to another region of the table with
> + * potentially older data.
> + */
> +#define EAGER_SCAN_REGION_SIZE 4096
> +#define EAGER_SCAN_MAX_FAILS_PER_REGION 128
> +
> +/*
> + * An eager scan of a page that is set all-frozen in the VM is considered
> + * "successful". To spread out eager scanning across multiple eager vacuums,
> + * we limit the number of successful eager page scans. The maximum number of
> + * successful eager page scans is calculated as a ratio of the all-visible but
> + * not all-frozen pages at the beginning of the vacuum.
> + */
> +#define EAGER_SCAN_SUCCESS_RATE 0.2
> +
> +/*
> + * The eagerness level of a vacuum determines how many all-visible but
> + * not all-frozen pages it eagerly scans.
> + *
> + * A normal vacuum (eagerness VAC_NORMAL) scans no all-visible pages (with the
> + * exception of those scanned due to SKIP_PAGES_THRESHOLD).
> + *
> + * An eager vacuum (eagerness VAC_EAGER) scans a number of pages up to a limit
> + * based on whether or not it is succeeding or failing. An eager vacuum is
> + * downgraded to a normal vacuum when it hits its success quota. An aggressive
> + * vacuum cannot be downgraded. No eagerness level is ever upgraded.
> + *

At the risk of being overly nit-picky... eager vacuums scan their
subset of all-visible pages "up to a limit" based solely on the
success ratio. In the case of (excessive) failures, there is no limit
to the number of pages scanned, only a pause in the pages scanned
until the next region.

> + * An aggressive vacuum (eagerness EAGER_FULL) must scan all all-visible but
> + * not all-frozen pages.
> + */

I think the above should be VAC_AGGRESSIVE vs EAGER_FULL, no?

> +typedef enum VacEagerness
> +{
> +     VAC_NORMAL,
> +     VAC_EAGER,
> +     VAC_AGGRESSIVE,
> +} VacEagerness;
> +

> @@ -516,25 +772,20 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>                * Force aggressive mode, and disable skipping blocks using the
>                * visibility map (even those set all-frozen)
>                */
> -             vacrel->aggressive = true;
> +             aggressive = true;
>               skipwithvm = false;
>       }
>
>       vacrel->skipwithvm = skipwithvm;
>
> +     heap_vacuum_set_up_eagerness(vacrel, aggressive);
> +
>       if (verbose)
> -     {
> -             if (vacrel->aggressive)
> -                     ereport(INFO,
> -                                     (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> -                                                     vacrel->dbname, vacrel->relnamespace,
> -                                                     vacrel->relname)));
> -             else
> -                     ereport(INFO,
> -                                     (errmsg("vacuuming \"%s.%s.%s\"",
> -                                                     vacrel->dbname, vacrel->relnamespace,
> -                                                     vacrel->relname)));
> -     }
> +             ereport(INFO,
> +                             (errmsg("%s of \"%s.%s.%s\"",
> +                                             vac_eagerness_description(vacrel->eagerness),
> +                                             vacrel->dbname, vacrel->relnamespace,
> +                                             vacrel->relname)));
>
>       /*
>        * Allocate dead_items memory using dead_items_alloc.  This handles

One thing I am wondering about is that since we actually modify
vacrel->eagerness during the "success downgrade" cycle, a single
vacuum run could potentially produce messages with both eager vacuum
and normal vacuum language. I don't think that'd be a problem in the
above spot, but wondering if it might be elsewhere (maybe in
pg_stat_activity?).


Robert Treat
https://xzilla.net



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Sat, 14 Dec 2024 at 01:53, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
>
> Thanks for the review!
> Attached v2 should address your feedback and also fixes a few bugs with v1.

Here are couple of code comments:

=== [PATCH v2 07/10] ===

It took me a while to understand that heap_vac_scan_next_block() loops
until rel_pages. What do you think about adding
Assert(vacrel->current_block == rel_pages) before the
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
rel_pages) and having a comment on main loop should process blocks
until rel_pages?

=== [PATCH v2 09/10] ===

+ * If there are no indexes or index scanning is disabled, phase II may be
+ * skipped. If phase I identified very few dead index entries, vacuum may skip
+ * phases II and III. Index vacuuming deletes the dead index entries from the
+ * TID store.

phase III is not mentioned in the previous comments. It could be
better to first explain what phase III is before mentioning it.

+ * After index vacuuming is complete, vacuum scans the blocks of the relation
+ * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
+ * that space for future tuples. Finally, vacuum may truncate the relation if
+ * it has emptied pages at the end.
+ *
+ * After finishing all phases of work, vacuum updates relation statistics in
+ * pg_class and the cumulative statistics subsystem.

There is no clear definition of phase III here as well. I can not
understand what phase III is and which parts the vacuum may skip.

=== [PATCH v2 10/10] ===

+        /*
+         * The number of eagerly scanned blocks an eager vacuum failed to
+         * freeze (due to age) in the current eager scan region. Eager vacuums
+         * reset it to EAGER_SCAN_MAX_FAILS_PER_REGION each time they enter a
+         * new region of the relation. Aggressive vacuums also decrement this
+         * coutner but it is initialized to # blocks in the relation.
+         */

s/coutner/counter

 /* non-export function prototypes */
+
 static void lazy_scan_heap(LVRelState *vacrel);

Extra blank line.

+    if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
+        TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
+                                      vacrel->cutoffs.FreezeLimit))
+        oldest_unfrozen_requires_freeze = true;
+
+    if (MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
+        MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
+                                    vacrel->cutoffs.MultiXactCutoff))
+        oldest_unfrozen_requires_freeze = true;

You may want to short-circuit the second if condition with
!oldest_unfrozen_requires_freeze but it may increase the complexity,
up to you.

+    vacrel->eager_pages.remaining_fails = EAGER_SCAN_MAX_FAILS_PER_REGION *
+        (1 - vacrel->next_eager_scan_region_start / EAGER_SCAN_REGION_SIZE);

This will always return EAGER_SCAN_MAX_FAILS_PER_REGION or 0 because
of the integer dividing.

+    if (aggressive)
+    {
+        vacrel->eagerness = VAC_AGGRESSIVE;
+
+        /*
+         * An aggressive vacuum must scan every all-visible page to safely
+         * advance the relfrozenxid and/or relminmxid. As such, there is no
+         * cap to the number of allowed successes or failures.
+         */
+        vacrel->eager_pages.remaining_fails = vacrel->rel_pages + 1;
+        vacrel->eager_pages.remaining_successes = vacrel->rel_pages + 1;
+        return;
+    }
...
...
+        if (was_eager_scanned)
+        {
+            if (vm_page_frozen)
+            {
+                Assert(vacrel->eager_pages.remaining_successes > 0);
+                vacrel->eager_pages.remaining_successes--;
+
+                if (vacrel->eager_pages.remaining_successes == 0)
+                {
+                    Assert(vacrel->eagerness == VAC_EAGER);

My initial thought was that since *was_eager_scanned* is true,
Assert(vacrel->eagerness == VAC_EAGER) should be under the top if
condition (I assumed that was_eager_scanned is only true for eager
vacuums, not for aggressive vacuums too) but I see your point here.
Since you set vacrel->eager_pages.remaining_successes to
vacrel->rel_pages + 1, vacrel->eager_pages.remaining_successes can not
reach 0 although all pages are processed as successful. I think
comment is needed in both places to explain why
vacrel->eager_pages.[remaining_fails | remaining_successes] is set to
vacrel->rel_pages + 1 and why vacrel->eagerness should be VAC_EAGER
when was_eager_scanned is true and
vacrel->eager_pages.remaining_successes is 0.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Melanie Plageman
Date:
Thanks for the review!

On Tue, Dec 17, 2024 at 10:57 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Here are couple of code comments:
>
> === [PATCH v2 07/10] ===
>
> It took me a while to understand that heap_vac_scan_next_block() loops
> until rel_pages. What do you think about adding
> Assert(vacrel->current_block == rel_pages) before the
> pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
> rel_pages) and having a comment on main loop should process blocks
> until rel_pages?

I added these to the v3 I sent in [1]

> === [PATCH v2 09/10] ===
>
> + * If there are no indexes or index scanning is disabled, phase II may be
> + * skipped. If phase I identified very few dead index entries, vacuum may skip
> + * phases II and III. Index vacuuming deletes the dead index entries from the
> + * TID store.
>
> phase III is not mentioned in the previous comments. It could be
> better to first explain what phase III is before mentioning it.
>
> + * After index vacuuming is complete, vacuum scans the blocks of the relation
> + * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
> + * that space for future tuples. Finally, vacuum may truncate the relation if
> + * it has emptied pages at the end.
> + *
> + * After finishing all phases of work, vacuum updates relation statistics in
> + * pg_class and the cumulative statistics subsystem.
>
> There is no clear definition of phase III here as well. I can not
> understand what phase III is and which parts the vacuum may skip.

I've attempted to address this in v3.

> === [PATCH v2 10/10] ===
>
> +        /*
> +         * The number of eagerly scanned blocks an eager vacuum failed to
> +         * freeze (due to age) in the current eager scan region. Eager vacuums
> +         * reset it to EAGER_SCAN_MAX_FAILS_PER_REGION each time they enter a
> +         * new region of the relation. Aggressive vacuums also decrement this
> +         * coutner but it is initialized to # blocks in the relation.
> +         */
>
> s/coutner/counter

Fixed

>  /* non-export function prototypes */
> +
>  static void lazy_scan_heap(LVRelState *vacrel);
>
> Extra blank line.

Fixed

> +    if (TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid) &&
> +        TransactionIdPrecedesOrEquals(vacrel->cutoffs.relfrozenxid,
> +                                      vacrel->cutoffs.FreezeLimit))
> +        oldest_unfrozen_requires_freeze = true;
> +
> +    if (MultiXactIdIsValid(vacrel->cutoffs.relminmxid) &&
> +        MultiXactIdPrecedesOrEquals(vacrel->cutoffs.relminmxid,
> +                                    vacrel->cutoffs.MultiXactCutoff))
> +        oldest_unfrozen_requires_freeze = true;
>
> You may want to short-circuit the second if condition with
> !oldest_unfrozen_requires_freeze but it may increase the complexity,
> up to you.

Good point. Done.

> +    vacrel->eager_pages.remaining_fails = EAGER_SCAN_MAX_FAILS_PER_REGION *
> +        (1 - vacrel->next_eager_scan_region_start / EAGER_SCAN_REGION_SIZE);
>
> This will always return EAGER_SCAN_MAX_FAILS_PER_REGION or 0 because
> of the integer dividing.

Ah, wow, thanks so much for catching that!

> +    if (aggressive)
> +    {
> +        vacrel->eagerness = VAC_AGGRESSIVE;
> +
> +        /*
> +         * An aggressive vacuum must scan every all-visible page to safely
> +         * advance the relfrozenxid and/or relminmxid. As such, there is no
> +         * cap to the number of allowed successes or failures.
> +         */
> +        vacrel->eager_pages.remaining_fails = vacrel->rel_pages + 1;
> +        vacrel->eager_pages.remaining_successes = vacrel->rel_pages + 1;
> +        return;
> +    }
> ...
> ...
> +        if (was_eager_scanned)
> +        {
> +            if (vm_page_frozen)
> +            {
> +                Assert(vacrel->eager_pages.remaining_successes > 0);
> +                vacrel->eager_pages.remaining_successes--;
> +
> +                if (vacrel->eager_pages.remaining_successes == 0)
> +                {
> +                    Assert(vacrel->eagerness == VAC_EAGER);
>
> My initial thought was that since *was_eager_scanned* is true,
> Assert(vacrel->eagerness == VAC_EAGER) should be under the top if
> condition (I assumed that was_eager_scanned is only true for eager
> vacuums, not for aggressive vacuums too) but I see your point here.
> Since you set vacrel->eager_pages.remaining_successes to
> vacrel->rel_pages + 1, vacrel->eager_pages.remaining_successes can not
> reach 0 although all pages are processed as successful. I think
> comment is needed in both places to explain why
> vacrel->eager_pages.[remaining_fails | remaining_successes] is set to
> vacrel->rel_pages + 1 and why vacrel->eagerness should be VAC_EAGER
> when was_eager_scanned is true and
> vacrel->eager_pages.remaining_successes is 0.

Makes sense. I've attempted to clarify as you suggest in v3.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Zni_idCUyKTBteRM-G5X1qiB9mf75rZGtHpt%2Bnk1z4Gg%40mail.gmail.com



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Robert Haas
Date:
On Tue, Dec 17, 2024 at 5:54 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Makes sense. I've attempted to clarify as you suggest in v3.

I would just commit 0001. There's nothing to be gained by waiting around.

I don't care about 0002 much. It doesn't seem particularly better or
worse. I would suggest that if you want to do this, maybe don't split
up this:

    vacrel->blkno = InvalidBlockNumber;
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

You could put the moved code before or after after that instead of the
middle of it. Unless there's some reason it makes more sense in the
middle.

I don't care about 0003 much, either. It looks correct to me, but
whether it's better is subjective.

I dislike 0004 as presented. Reducing the scope of blkno is a false
economy; the function doesn't do much of anything interesting after
the main loop. And why do we want to report rel_pages to the progress
reporting machinery instead of blkno? I'd rather that the code report
where it actually ended up (blkno) rather than reporting where it
thinks it must have ended up (rel_pages).

I agree that the assert that 0005 replaces is confusing, but replacing
8 lines of code with 37 is not an improvement in my book.

I like 0006. The phase I-II-III terminology doesn't appear anywhere in
the code (at least, not to my knowledge) but we speak of it that way
so frequently in email and in conversation that mentioning it
someplace that a future developer might find seems advisable. I think
it would be worth mentioning in the second paragraph of the comment
that we may resume phase I after completing phase III if we entered
phase II due to lack of memory. I'm not sure what the right way to
phrase this is -- in a sense, this possibility means that these aren't
really phases at all, however much we may speak of them that way. But
as I say, we talk about them that way all the time, so it's good that
this is finally adding comments to match.

Regarding 0007:

- Why do we measure failure as an integer total but success as a
percentage? Maybe the thought here is that failures are counted per
region but successes are counted globally across the relation, but
then I wonder if that's what we want, and if so, whether we need some
comments to explain why we want that rather than doing both per
region.

- I do not like the nested struct definition for eager_pages. Either
define the struct separately, give it a name, and then refer to that
name here, or just define the elements individually, and maybe give
them a common prefix or something. I don't think what you have here is
a style we normally use. As you can see, pgindent is not particularly
fond of it. It seems particularly weird given that you didn't even put
all the stuff related to eagerness inside of it.

- It's a little weird that you mostly treat eager vacuums as an
intermediate position between aggressive and normal, but then we
decide whether we're eager in a different place than we decide whether
we're aggressive.

- On a related note, won't most vacuums be VAC_EAGER rather than
VAC_NORMAL, thus making VAC_NORMAL a misnomer? I wonder if it's better
to merge eager and normal together, and just treat the cases where we
judge eager scanning not worthwhile as a mild special case of an
otherwise-normal vacuum. It's important that a user can tell what
happened from the log message, but it doesn't seem absolutely
necessary for the start-of-vacuum message to make that clear. It could
just be that %u all-visible scanned => 0 all-visible scanned means we
didn't end up being at all eager.

- Right after the comment "Produce distinct output for the corner case
all the same, just in case," you've changed it so that there's no
longer an if-statement. While you haven't really falsified the comment
in any deep sense, some rewording is probably in order. Also, the
first sentence of this comment overtly contradicts itself without
really explaining anything useful. That's a preexisting problem for
which you're not responsible, but maybe it makes sense to fix it while
you're adjusting this comment.

- Perhaps it's unfair of me, but I think I would have hoped for an
acknowledgement in this commit message, considering that I believe I
was the one who suggested breaking the relation into logical regions,
trying to freeze a percentage of the all-visible-but-not-all-frozen
pages, and capping both successes and failures. Starting the region at
a random offset wasn't my idea, and the specific thresholds you've
chosen were not the ones I suggested, and limiting successes globally
rather than per-region was not what I think I had in mind, and I don't
mean to take away from everything you've done to move this forward,
but unless I am misunderstanding the situation, this particular patch
(0007) is basically an implementation of an algorithm that, as far as
I know, I was the first to propose.

- Which of course also means that I tend to like the idea, but also
that I'm biased. Still, what is the reasonable alternative to this
patch? I have a hard time believing that it's "do nothing". As far as
I know, pretty much everyone agrees that the large burst of work that
tends to occur when an aggressive vacuum kicks off is extremely
problematic, particularly but not only the first time it kicks off on
a table or group of tables that may have accumulated many
all-visible-but-not-all-frozen pages. This patch might conceivably err
in moving that work too aggressively to earlier vacuums, thus making
those vacuums too expensive or wasting work if the pages end up being
modified again; or it might conceivably err in moving work
insufficiently aggressively to earlier vacuums, leaving too much
remaining work when the aggressive vacuum finally happens. In fact, I
would be surprised if it doesn't have those problems in some
situations. But it could have moderately severe cases of those
problems and still be quite a bit better than what we have now
overall.

- So,I think we should avoid fine-tuning this and try to understand if
there's anything wrong with the big picture. Can we imagine a user who
is systematically unhappy with this change? Like, not a user who got
hosed once because of some bad luck, but someone who is constantly and
predictably getting hosed? They'd need to be accumulating lots of
all-visible-not-all-frozen pages in relatively large tables on a
regular basis, but then I guess they'd need to either drop the tables
before the aggressive vacuum happened, or they'd need to render the
pages not-all-visible again before the aggressive vacuum would have
happened. I'm not entirely sure how possible that is. My best guess is
that it's possible if the timing of your autovacuum runs is
particularly poor -- you just need to load some data, vacuum it early
enough that the XIDs are still young so it doesn't get frozen, then
have the eager vacuum hit it, and then update it. That doesn't seem
impossible, but I'm not sure if it's possible to make it happen often
enough and severely enough to really cause a problem. And I'm not sure
we're going to find that out before this is committed, so that brings
me to:

- If this patch is going to go into the source tree in PostgreSQL 18,
it should do that soon. This is a bad patch to be committing in late
March. I propose that it should be shipped by end of January or wait a
year. Even if it goes into the tree by the end of January, there is a
chance we'll find problems that we can't fix quickly and have to
revert the whole thing. But that becomes much more likely if it is
committed close to feature freeze. There is a certain kind of patch -
and I think it includes this patch - where you just can't ever really
know what the problems people are going to hit IRL are until it's
committed and people try things and hit problems, perhaps not even as
part of an intent to test this patch, but just testing in general. If
we don't find those problems with enough time remaining to react to
them in a thoughtful way, that's where full reverts start to happen.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Robert Treat
Date:
On Tue, Dec 17, 2024 at 5:51 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Thanks for taking a look!
>
> I've rebased and attached an updated v3 which also addresses review feedback.
>
> On Sun, Dec 15, 2024 at 1:05 AM Robert Treat <rob@xzilla.net> wrote:
> > On Fri, Dec 13, 2024 at 5:53 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
<snip>
>
> So, all vacuums are eager after the first eager vacuum (in this
> workload).

I have been thinking about this statement and a sense of awkwardness
that I felt about how this implementation came together which I think
Robert captured well with his statement "- It's a little weird that
you mostly treat eager vacuums as an intermediate position between
aggressive and normal, but then we decide whether we're eager in a
different place than we decide whether
we're aggressive."

It feels to me like eager vacuums are not so much a distinct thing,
but that, like how all vacuum do index cleanup unless told not to, all
vacuums are optimistic that they will convert avp to afp, only we bail
out quickly if the table is below vacuum_freeze_min_age and we throw
the limits out if it is above autovacuum_freeze_max_age, similar to
how we manage vacuum cost limit.

> > > + * The eagerness level of a vacuum determines how many all-visible but
> > > + * not all-frozen pages it eagerly scans.
> > > + *
> > > + * A normal vacuum (eagerness VAC_NORMAL) scans no all-visible pages (with the
> > > + * exception of those scanned due to SKIP_PAGES_THRESHOLD).
> > > + *
> > > + * An eager vacuum (eagerness VAC_EAGER) scans a number of pages up to a limit
> > > + * based on whether or not it is succeeding or failing. An eager vacuum is
> > > + * downgraded to a normal vacuum when it hits its success quota. An aggressive
> > > + * vacuum cannot be downgraded. No eagerness level is ever upgraded.
> > > + *
> >
> > At the risk of being overly nit-picky... eager vacuums scan their
> > subset of all-visible pages "up to a limit" based solely on the
> > success ratio. In the case of (excessive) failures, there is no limit
> > to the number of pages scanned, only a pause in the pages scanned
> > until the next region.
>
> Yes, this is a good point. I've changed it to specifically indicate
> the limit is based on successes. However, I feel like I should either
> not mention the success limit here or mention the regional limiting
> based on failures -- otherwise it is confusing. Can you think of a
> wording that would be good for this comment?
>

I don't feel like I came up with anything concise :-)

I think the gist is something like "a given vacuum will scan a number
of pages up to either a limit based on successes, after which it is
downgraded to a normal vacuum, or a subset of pages based on failures
across different regions within the heap." but even that feels like we
are relying on people already understanding what is going on rather
than defining/explaining it.

> > > + * An aggressive vacuum (eagerness EAGER_FULL) must scan all all-visible but
> > > + * not all-frozen pages.
> > > + */
> >
> > I think the above should be VAC_AGGRESSIVE vs EAGER_FULL, no?
>
> Yep, thanks! Fixed.
>
> > >       vacrel->skipwithvm = skipwithvm;
> > >
> > > +     heap_vacuum_set_up_eagerness(vacrel, aggressive);
> > > +
> > >       if (verbose)
> > > -     {
> > > -             if (vacrel->aggressive)
> > > -                     ereport(INFO,
> > > -                                     (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> > > -                                                     vacrel->dbname, vacrel->relnamespace,
> > > -                                                     vacrel->relname)));
> > > -             else
> > > -                     ereport(INFO,
> > > -                                     (errmsg("vacuuming \"%s.%s.%s\"",
> > > -                                                     vacrel->dbname, vacrel->relnamespace,
> > > -                                                     vacrel->relname)));
> > > -     }
> > > +             ereport(INFO,
> > > +                             (errmsg("%s of \"%s.%s.%s\"",
> > > +                                             vac_eagerness_description(vacrel->eagerness),
> > > +                                             vacrel->dbname, vacrel->relnamespace,
> > > +                                             vacrel->relname)));
> > >
> > >       /*
> > >        * Allocate dead_items memory using dead_items_alloc.  This handles
> >
> > One thing I am wondering about is that since we actually modify
> > vacrel->eagerness during the "success downgrade" cycle, a single
> > vacuum run could potentially produce messages with both eager vacuum
> > and normal vacuum language. I don't think that'd be a problem in the
> > above spot, but wondering if it might be elsewhere (maybe in
> > pg_stat_activity?).
>
> Great point. It's actually already an issue in the vacuum log output.
> In the new patch, I save the eagerness level at the beginning and use
> it in the logging output. Any future users of the eagerness level will
> have to figure out if they care about the original eagerness level or
> the current one.
>

In a general sense, you want to know if your vacuums are converting
avp to afp since it can explain i/o usage. In this version of the
patch, we know it is turned on based on VacEagerness, but it'd be nice
to know if it got turned off; ie. basically if we end up in
if (vacrel->eager_pages.remaining_successes == 0) maybe drop a note in the logs.

In a world where all vacuums are eager vacuums, I think you still want
the latter messaging, but I think the former would end up being noted
based on the outcome of criteria in vacuum_get_cutoffs() (primarily if
we are over the freeze limit).

Robert Treat
https://xzilla.net



Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From
Melanie Plageman
Date:
On Sat, Dec 21, 2024 at 10:28 AM Robert Treat <rob@xzilla.net> wrote:
>
> On Tue, Dec 17, 2024 at 5:51 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> It feels to me like eager vacuums are not so much a distinct thing,
> but that, like how all vacuum do index cleanup unless told not to, all
> vacuums are optimistic that they will convert avp to afp, only we bail
> out quickly if the table is below vacuum_freeze_min_age and we throw
> the limits out if it is above autovacuum_freeze_max_age, similar to
> how we manage vacuum cost limit.

I like this framing/way of thinking about it. v4 in [1] eliminates the
concept of eager vacuums. I still use the aggressive vacuum and
non-aggressive vacuum framing. But, if we add a reloption/GUC to allow
configuring failures per region (proposed in [1]), that means more
using-facing docs and I think this way of framing it as a spectrum of
all-visible page scanning based on relfrozenxid's relationship to each
of these GUCs might be a clear way to explain eager scanning to users.

> In a general sense, you want to know if your vacuums are converting
> avp to afp since it can explain i/o usage. In this version of the
> patch, we know it is turned on based on VacEagerness, but it'd be nice
> to know if it got turned off; ie. basically if we end up in
> if (vacrel->eager_pages.remaining_successes == 0) maybe drop a note in the logs.

I've added a logging message like this.

> In a world where all vacuums are eager vacuums, I think you still want
> the latter messaging, but I think the former would end up being noted
> based on the outcome of criteria in vacuum_get_cutoffs() (primarily if
> we are over the freeze limit).

Hmm. Now that I've eliminated the concept of eager vacuums, users are
not informed whether or not eager scanning is enabled at the beginning
of vacuum. Only when eager scanning is disabled during vacuum or at
the end when they see the number of pages eagerly scanned would they
know whether or not this eager scanning happened.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_byJgf3wB8ukv6caXROReS1SRZsVPb7RWP%2B8qPtdDGykw%40mail.gmail.com