Re: Eagerly scan all-visible pages to amortize aggressive vacuum - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Eagerly scan all-visible pages to amortize aggressive vacuum
Date
Msg-id ctdjzroezaxmiyah3gwbwm67defsrwj2b5fpfs4ku6msfpxeia@mwjyqlhwr2wu
Whole thread Raw
List pgsql-hackers
Hi,

On 2024-11-01 19:35:22 -0400, Melanie Plageman wrote:
> It implements a new "semi-aggressive" vacuum. Semi-aggressive vacuums
> eagerly scan some number of all-visible but not all-frozen pages in
> hopes of freezing them. All-visible pages that are eagerly scanned and
> set all-frozen in the visibility map are considered successful eager
> scans and those not frozen are considered failed eager scans.

I think it's worth mentioning that this would be useful to eventually be able
to mark pages as all-visible during on-access-pruning, to allow for
index-only-scans (and also faster seqscans). Today that'd have the danger of
increasing the "anti-wraparound vacuum debt" unconscionably, but with this
patch that'd be addressed.


> Because we want to amortize our eager scanning across a few vacuums,
> we cap the maximum number of successful eager scans to a percentage of
> the total number of all-visible but not all-frozen pages in the table
> (currently 20%).

One thing worth mentioning around here seems that we currently can't
partially-aggressively freeze tuples that are "too young" and how that
interacts with everything else.


> To demonstrate the results, I ran an append-only workload run for over
> 3 hours on master and with my patch applied. The patched version of
> Postgres amortized the work of freezing the all-visible but not
> all-frozen pages nicely. The first aggressive vacuum with the patch
> was 44 seconds and on master it was 1201 seconds.

Oops.


> patch
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 44 seconds (msecs: 44661).
>     pages: 0 removed, 27425085 remain, 1104095 scanned (4.03% of
> total), 709889 eagerly scanned
>     frozen: 316544 pages from table (1.15% of total) had 17409920
> tuples frozen. 316544 pages set all-frozen in the VM
>     I/O timings: read: 1160.599 ms, write: 2461.205 ms. approx time
> spent in vacuum delay: 16230 ms.
>     buffer usage: 1105630 hits, 1111898 reads, 646229 newly dirtied,
> 1750426 dirtied.
>     WAL usage: 1027099 records, 316566 full page images, 276209780 bytes.
>
> master
>   LOG:  automatic aggressive vacuum of table "history": index scans: 0
>   vacuum duration: 1201 seconds (msecs: 1201487).
>     pages: 0 removed, 27515348 remain, 15800948 scanned (57.43% of
> total), 15098257 eagerly scanned
>     frozen: 15096384 pages from table (54.87% of total) had 830247896
> tuples frozen. 15096384 pages set all-frozen in the VM
>     I/O timings: read: 246537.348 ms, write: 73000.498 ms. approx time
> spent in vacuum delay: 349166 ms.
>     buffer usage: 15798343 hits, 15813524 reads, 15097063 newly
> dirtied, 31274333 dirtied.
>     WAL usage: 30733564 records, 15097073 full page images, 11789257631 bytes.
>
> This is because, with the patch, the freezing work is being off-loaded
> to earlier vacuums.
>
> In the attached chart.png, you can see the vm_page_freezes climbing
> steadily with the patch, whereas on master, there are sudden spikes
> aligned with the aggressive vacuums. You can also see that the number
> of pages that are all-visible but not all-frozen grows steadily on
> master until the aggressive vacuum. This is vacuum's "backlog" of
> freezing work.

What's the reason for all-visible-but-not-all-frozen to increase to a higher
value initially than where it later settles?


> In this benchmark, the TPS is rate-limited, but using pgbench
> per-statement reports, I calculated that the P99 latency for this
> benchmark is 16 ms on master and 1 ms with the patch. Vacuuming pages
> sooner decreases vacuum reads and doing the freezing work spread over
> more vacuums improves P99 transaction latency.

Nice!


> Below is the comparative WAL volume, checkpointer and background
> writer writes, reads and writes done by all other backend types, time
> spent vacuuming in milliseconds, and p99 latency. Notice that overall
> vacuum IO time is substantially lower with the patch.
>
>    version     wal  cptr_bgwriter_w   other_rw  vac_io_time  p99_lat
>     patch   770 GB          5903264  235073744   513722         1
>     master  767 GB          5908523  216887764  1003654        16

Hm. It's not clear to me why other_rw is higher with the patch? After all,
given the workload, there's no chance of unnecessarily freezing tuples? Is
that just because at the end of the benchmark there's leftover work?



> From 69b517f34caf39ad814691d6412c68d54e852990 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 10:53:37 -0400
> Subject: [PATCH v1 1/9] Rename LVRelState->frozen_pages
>
> Rename frozen_pages to tuple_freeze_pages in LVRelState, the struct used
> for tracking state during vacuuming of a heap relation. frozen_pages
> sounds like it includes every all-frozen page. That is a misnomer. It
> does not include pages with already frozen tuples. It also includes
> pages that are not actually all-frozen.

Hm. Is tuple_freeze_pages that much clearer, it could also just indicate pages
that already were frozen. How about "newly_frozen_pages"?


> Subject: [PATCH v1 2/9] Make visibilitymap_set() return previous state of

LGTM.



> From c317a272713bb833f7f2761a5be1924f8e1bdb4d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 31 Oct 2024 18:19:18 -0400
> Subject: [PATCH v1 3/9] Count pages set all-visible and all-frozen in VM
>  during vacuum
>
> Vacuum already counts and logs pages with newly frozen tuples. Count and
> log pages set all-frozen in the VM too. This includes pages that are
> empty before or after vacuuming.
>
> While we are at it, count and log the number of pages vacuum set
> all-visible. Pages that are all-visible but not all-frozen are debt for
> future aggressive vacuums. The newly all-visible and all-frozen counts
> give us visiblity into the rate at which this debt is being accrued and
> paid down.

Hm. Why is it interesting to log VM changes separately from the state changes
of underlying pages? Wouldn't it e.g. be more intersting to log the number of
empty pages that vacuum [re-]discovered? I've a bit hard time seeing how a
user could take advantage of this.


> From 67781cc2511bb7d62ccc9461f1787272820abcc4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:07:50 -0400
> Subject: [PATCH v1 4/9] Replace uses of blkno local variable in
>  lazy_scan_heap()

Largely LGTM, but I'm not sure that it's worth having as a separate commit.


> @@ -1114,7 +1117,6 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>              ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
>              vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>          }
> -        *blkno = vacrel->rel_pages;
>          return false;
>      }

I'd seat *blkno to InvalidBlockNumber or such, that'd make misuse more
apparent than having it set to some random older block.


> From 67b5565ad57d3b196695f85811dde2044ba79f3e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:14:24 -0400
> Subject: [PATCH v1 5/9] Move vacuum VM buffer release
>
> The VM buffer for the next unskippable block can be released after the
> main loop in lazy_scan_heap(). Doing so de-clutters
> heap_vac_scan_next_block() and opens up more refactoring options.

That's vague...


> From 8485dc400b3d4e9f895170af4f5fb1bb959b8495 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:36:58 -0400
> Subject: [PATCH v1 6/9] Remove superfluous next_block local variable in vacuum
>  code
>
> Reduce the number of block related variables in lazy_scan_heap() and its
> helpers by removing the next_block local variable from
> heap_vac_scan_next_block().

I don't mind this change, but I also don't get how it's related to anything
else here or why it's really better than the status quo.


> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 4b1eadea1f2..52c9d49f2b1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1112,19 +1112,17 @@ static bool
>  heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>                           bool *all_visible_according_to_vm)
>  {
> -    BlockNumber next_block;
> -
>      /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
> -    next_block = vacrel->current_block + 1;
> +    vacrel->current_block++;

I realize this isn't introduced in this commit, but darn, that's ugly.


> From 78ad9e022b95e024ff5bfa96af78e9e44730c970 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 11:42:10 -0400
> Subject: [PATCH v1 7/9] Make heap_vac_scan_next_block() return BlockNumber


> @@ -857,7 +857,8 @@ lazy_scan_heap(LVRelState *vacrel)
>      vacrel->next_unskippable_allvis = false;
>      vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>
> -    while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
> +    while (BlockNumberIsValid(blkno = heap_vac_scan_next_block(vacrel,
> +                                                               &all_visible_according_to_vm)))

Personally I'd write this as

while (true)
{
    BlockNumber blkno;

    blkno = heap_vac_scan_next_block(vacrel, ...);

    if (!BlockNumberIsValid(blkno))
       break;

Mostly because it's good to use more minimal scopes when possible,
particularly when previously the scope intentionally was larger. But also
partially because I don't love variable assignments inside a macro call,
inside a while().



> From 818d1c3b068c6705611256cfc3eb1f10bdc0b684 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 1 Nov 2024 18:25:05 -0400
> Subject: [PATCH v1 8/9] WIP: Add more general summary to vacuumlazy.c
>
> Currently the summary at the top of vacuumlazy.c provides some specific
> details related to the new dead TID storage in 17. I plan to add a
> summary and maybe some sub-sections to contextualize it.

I like this idea. It's hard to understand vacuumlazy.c without already
understanding vacuumlazy.c, which isn't a good situation.


> ---
>  src/backend/access/heap/vacuumlazy.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 7ce69953ba0..15a04c6b10b 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -3,6 +3,17 @@
>   * vacuumlazy.c
>   *      Concurrent ("lazy") vacuuming.
>   *
> + * Heap relations are vacuumed in three main phases. In the first phase,
> + * vacuum scans relation pages, pruning and freezing tuples and saving dead
> + * tuples' TIDs in a TID store. If that TID store fills up or vacuum finishes
> + * scanning the relation, it progresses to the second phase: index vacuuming.
> + * After index vacuuming is complete, vacuum scans the blocks of the relation
> + * indicated by the TIDs in the TID store and reaps the dead tuples, freeing
> + * that space for future tuples. Finally, vacuum may truncate the relation if
> + * it has emptied pages at the end. XXX: this summary needs work.

Yea, at least we ought to mention that the phasing can be different when there
are no indexes and that the later phases can heuristically be omitted when
there aren't enough dead items.



> From f21f0bab1dbe675be4b4dddcb2eea486d8a69d36 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 28 Oct 2024 12:15:08 -0400
> Subject: [PATCH v1 9/9] Eagerly scan all-visible pages to amortize aggressive
>  vacuum
>
> Introduce semi-aggressive vacuums, which scan some of the all-visible
> but not all-frozen pages in the relation to amortize the cost of an
> aggressive vacuum.

I wonder if "aggressive" is really the right terminology going
forward... Somehow it doesn't seem particularly descriptive anymore if, in
many workloads, almost all vacuums are going to be aggressive-ish.


> Because the goal is to freeze these all-visible pages, all-visible pages
> that are eagerly scanned and set all-frozen in the visibility map are
> considered successful eager scans and those not frozen are considered
> failed eager scans.
>
> If too many eager scans fail in a row, eager scanning is temporarily
> suspended until a later portion of the relation. Because the goal is to
> amortize aggressive vacuums, we cap the number of successes as well.
> Once we reach the maximum number of blocks successfully eager scanned
> and frozen, the semi-aggressive vacuum is downgraded to an unaggressive
> vacuum.
> ---
>  src/backend/access/heap/vacuumlazy.c | 327 +++++++++++++++++++++++----
>  src/backend/commands/vacuum.c        |  20 +-
>  src/include/commands/vacuum.h        |  27 ++-
>  src/tools/pgindent/typedefs.list     |   1 +
>  4 files changed, 326 insertions(+), 49 deletions(-)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 15a04c6b10b..adabb5ff5f1 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -12,6 +12,40 @@
>   * that space for future tuples. Finally, vacuum may truncate the relation if
>   * it has emptied pages at the end. XXX: this summary needs work.
>   *
> + * Relation Scanning:
> + *
> + * Vacuum scans the heap relation, starting at the beginning and progressing
> + * to the end, skipping pages as permitted by their visibility status, vacuum
> + * options, and the aggressiveness level of the vacuum.
> + *
> + * When page skipping is enabled, unaggressive vacuums may skip scanning pages
> + * that are marked all-visible in the visibility map. We may choose not to
> + * skip pages if the range of skippable pages is below SKIP_PAGES_THRESHOLD.
> + *
> + * Semi-aggressive vacuums will scan skippable pages in an effort to freeze
> + * them and decrease the backlog of all-visible but not all-frozen pages that
> + * have to be processed to advance relfrozenxid and avoid transaction ID
> + * wraparound.
> + *
> + * We count it as a success when we are able to set an eagerly scanned page
> + * all-frozen in the VM and a failure when we are not able to set the page
> + * all-frozen.
> + *
> + * Because we want to amortize the overhead of freezing pages over multiple
> + * vacuums, we cap the number of successful eager scans to
> + * EAGER_SCAN_SUCCESS_RATE of the number of all-visible but not all-frozen
> + * pages at the beginning of the vacuum.
> + *
> + * On the assumption that different regions of the table are likely to contain
> + * similarly aged data, we use a localized failure cap instead of a global cap
> + * for the whole relation. The failure count is reset on each region of the
> + * table, comprised of RELSEG_SIZE blocks (or 1/4 of the table size for a
> + * small table). In each region, we tolerate MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * before suspending eager scanning until the end of the region.

I'm a bit surprised to see such large regions. Why not something finer, in the
range of a few megabytes?  The FSM steers new tuples quite aggressively to the
start of the table, which means that in many workloads there will be old and
new data interspersed at the start of the table. Using RELSEG_SIZE sized
regions for semi-aggressive vacuuming will mean that we'll often not do any
semi-aggressive processing beyond the start of the relation, as we'll reach
the failure rate quickly.

I also find it layering-wise a bit weird to use RELSEG_SIZE, that's really imo
is just an md.c concept.


> +/*
> + * Semi-aggressive vacuums eagerly scan some all-visible but not all-frozen
> + * pages. Since our goal is to freeze these pages, an eager scan that fails to
> + * set the page all-frozen in the VM is considered to have "failed".
> + *
> + * On the assumption that different regions of the table tend to have
> + * similarly aged data, once we fail to freeze MAX_SUCCESSIVE_EAGER_SCAN_FAILS
> + * blocks, we suspend eager scanning until vacuum has progressed to another
> + * region of the table with potentially older data.
> + */
> +#define MAX_SUCCESSIVE_EAGER_SCAN_FAILS 1024

Can this really be a constant, given that the semi-aggressive regions are
shrunk for small tables?


> +/*
> + * An eager scan of a page that is set all-frozen in the VM is considered
> + * "successful". To spread out eager scanning across multiple semi-aggressive
> + * vacuums, we limit the number of successful eager scans (as well as the
> + * number of failures). The maximum number of successful eager scans is
> + * calculated as a ratio of the all-visible but not all-frozen pages at the
> + * beginning of the vacuum.
> + */
> +#define EAGER_SCAN_SUCCESS_RATE 0.2
>  typedef struct LVRelState

There imo should be newlines between the define and LVRelState.

>  {
>      /* Target heap relation and its indexes */
> @@ -153,8 +208,22 @@ typedef struct LVRelState
>      BufferAccessStrategy bstrategy;
>      ParallelVacuumState *pvs;
>
> -    /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
> -    bool        aggressive;
> +    /*
> +     * Whether or not this is an aggressive, semi-aggressive, or unaggressive
> +     * VACUUM. A fully aggressive vacuum must set relfrozenxid >= FreezeLimit
> +     * and therefore must scan every unfrozen tuple. A semi-aggressive vacuum
> +     * will scan a certain number of all-visible pages until it is downgraded
> +     * to an unaggressive vacuum.
> +     */
> +    VacAggressive aggressive;

Hm. A few comments:

- why is VacAggressive defined in vacuum.h? Isn't this fairly tightly coupled
  to heapam?
- Kinda feels like the type should be named VacAggressivness or such?


> +    /*
> +     * A semi-aggressive vacuum that has failed to freeze too many eagerly
> +     * scanned blocks in a row suspends eager scanning. unaggressive_to is the
> +     * block number of the first block eligible for resumed eager scanning.
> +     */
> +    BlockNumber unaggressive_to;

What's it set to otherwise? What is it set to in aggressive vacuums?


> @@ -227,6 +296,26 @@ typedef struct LVRelState
>      BlockNumber next_unskippable_block; /* next unskippable block */
>      bool        next_unskippable_allvis;    /* its visibility status */
>      Buffer        next_unskippable_vmbuffer;    /* buffer containing its VM bit */
> +
> +    /*
> +     * Count of skippable blocks eagerly scanned as part of a semi-aggressive
> +     * vacuum (for logging only).
> +     */
> +    BlockNumber eager_scanned;

I'd add _pages or such.


> +    /*
> +     * The number of eagerly scanned blocks a semi-aggressive vacuum failed to
> +     * freeze (due to age) in the current eager scan region. It is reset each
> +     * time we hit MAX_SUCCESSIVE_EAGER_SCAN_FAILS.
> +     */
> +    BlockNumber eager_scanned_failed_frozen;
> +
> +    /*
> +     * The remaining number of blocks a semi-aggressive vacuum will consider
> +     * eager scanning. This is initialized to EAGER_SCAN_SUCCESS_RATE of the
> +     * total number of all-visible but not all-frozen pages.
> +     */
> +    BlockNumber remaining_eager_scan_successes;

I think it might look better if you just bundled these into a struct like

      struct
      {
        BlockNumber scanned;
        BlockNumber failed_frozen;
        BlockNumber remaining_successes;
      } eager_pages;


> @@ -471,24 +569,49 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>           * Force aggressive mode, and disable skipping blocks using the
>           * visibility map (even those set all-frozen)
>           */
> -        vacrel->aggressive = true;
> +        vacrel->aggressive = VAC_AGGRESSIVE;
>          skipwithvm = false;
>      }
>
>      vacrel->skipwithvm = skipwithvm;
> +    vacrel->eager_scanned = 0;
> +    vacrel->eager_scanned_failed_frozen = 0;
> +
> +    /*
> +     * Even if we successfully freeze them, we want to cap the number of
> +     * eagerly scanned blocks so that we spread out the overhead across
> +     * multiple vacuums. remaining_eager_scan_successes is only used by
> +     * semi-aggressive vacuums.
> +     */

Somehow the "them" feels like a dangling pointer that's initialized too late ;)


> +    visibilitymap_count(rel, &orig_rel_allvisible, &orig_rel_allfrozen);
> +    vacrel->remaining_eager_scan_successes =
> +        (BlockNumber) (EAGER_SCAN_SUCCESS_RATE * (orig_rel_allvisible - orig_rel_allfrozen));
>
>      if (verbose)
>      {
> -        if (vacrel->aggressive)
> -            ereport(INFO,
> -                    (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> -        else
> -            ereport(INFO,
> -                    (errmsg("vacuuming \"%s.%s.%s\"",
> -                            vacrel->dbname, vacrel->relnamespace,
> -                            vacrel->relname)));
> +        switch (vacrel->aggressive)
> +        {
> +            case VAC_UNAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_AGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("aggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +
> +            case VAC_SEMIAGGRESSIVE:
> +                ereport(INFO,
> +                        (errmsg("semiaggressively vacuuming \"%s.%s.%s\"",
> +                                vacrel->dbname, vacrel->relnamespace,
> +                                vacrel->relname)));
> +                break;
> +        }

Wonder if we should have a function that returns the aggressiveness of a
vacuum as an already translated string. There are other places where we emit
the aggressiveness as part of a message, and it's pretty silly to duplicate
most of the message.


> @@ -545,11 +668,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>       * Non-aggressive VACUUMs may advance them by any amount, or not at all.
>       */
>      Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
> -           TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
> +           TransactionIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                         vacrel->cutoffs.FreezeLimit :
>                                           vacrel->cutoffs.relfrozenxid,
>                                           vacrel->NewRelfrozenXid));
>      Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
> -           MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
> +           MultiXactIdPrecedesOrEquals(vacrel->aggressive == VAC_AGGRESSIVE ?
> +                                       vacrel->cutoffs.MultiXactCutoff :
>                                         vacrel->cutoffs.relminmxid,
>                                         vacrel->NewRelminMxid));
>      if (vacrel->skippedallvis)

These are starting to feel somewhat complicated. Wonder if it'd be easier to
read if they were written as normal ifs.


> +/*
> + * Helper to decrement a block number to 0 without wrapping around.
> + */
> +static void
> +decrement_blkno(BlockNumber *block)
> +{
> +    if ((*block) > 0)
> +        (*block)--;
> +}

Huh.


> @@ -956,11 +1094,23 @@ lazy_scan_heap(LVRelState *vacrel)
>          if (!got_cleanup_lock)
>              LockBuffer(buf, BUFFER_LOCK_SHARE);
>
> +        page_freezes = vacrel->vm_page_freezes;
> +
>          /* Check for new or empty pages before lazy_scan_[no]prune call */
>          if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, !got_cleanup_lock,
>                                     vmbuffer))
>          {
>              /* Processed as new/empty page (lock and pin released) */
> +
> +            /* count an eagerly scanned page as a failure or a success */
> +            if (was_eager_scanned)
> +            {
> +                if (vacrel->vm_page_freezes > page_freezes)
> +                    decrement_blkno(&vacrel->remaining_eager_scan_successes);
> +                else
> +                    vacrel->eager_scanned_failed_frozen++;
> +            }
> +
>              continue;
>          }

Maybe I'm confused, but ISTM that remaining_eager_scan_successes shouldn't
actually be a BlockNumber, given it doesn't actually indicates a specific
block...

I don't understand why we would sometimes want to treat empty pages as a
failure? They can't fail to be frozen, no?

I'm not sure it makes sense to count them as progress towards the success
limit either - afaict we just rediscovered free space is the table. That's imo
separate from semi-aggressive freezing.


Storing page_freezes as a copy of vacrel->vm_page_freezes and then checking if
that increased feels like a somewhat ugly way of tracking if freezing
happend. There's no more direct way?

Why is decrement_blkno() needed? How can we ever get into negative territory?
Shouldn't eager scanning have been disabled when
remaining_eager_scan_successes reaches zero and thus prevent
remaining_eager_scan_successes from ever going below zero?



> @@ -1144,7 +1310,65 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>           */
>          bool        skipsallvis;
>
> -        find_next_unskippable_block(vacrel, &skipsallvis);
> +        /*
> +         * Figure out if we should disable eager scan going forward or
> +         * downgrade to an unaggressive vacuum altogether.
> +         */
> +        if (vacrel->aggressive == VAC_SEMIAGGRESSIVE)
> +        {
> +            /*
> +             * If we hit our success limit, there is no need to eagerly scan
> +             * any additional pages. Downgrade the vacuum to unaggressive.
> +             */
> +            if (vacrel->remaining_eager_scan_successes == 0)
> +                vacrel->aggressive = VAC_UNAGGRESSIVE;
> +
> +            /*
> +             * If we hit the max number of failed eager scans for this region
> +             * of the table, figure out where the next eager scan region
> +             * should start. Eager scanning is effectively disabled until we
> +             * scan a block in that new region.
> +             */
> +            else if (vacrel->eager_scanned_failed_frozen >=
> +                     MAX_SUCCESSIVE_EAGER_SCAN_FAILS)
> +            {
> +                BlockNumber region_size,
> +                            offset;
> +

Why are we doing this logic here, rather than after incrementing
eager_scanned_failed_frozen? Seems that'd limit the amount of times we need to
run through this logic substantially?


> +                /*
> +                 * On the assumption that different regions of the table are
> +                 * likely to have similarly aged data, we will retry eager
> +                 * scanning again later. For a small table, we'll retry eager
> +                 * scanning every quarter of the table. For a larger table,
> +                 * we'll consider eager scanning again after processing
> +                 * another region's worth of data.
> +                 *
> +                 * We consider the region to start from the first failure, so
> +                 * calculate the block to restart eager scanning from there.
> +                 */
> +                region_size = Min(RELSEG_SIZE, (vacrel->rel_pages / 4));
> +
> +                offset = vacrel->eager_scanned_failed_frozen % region_size;
> +
> +                Assert(vacrel->eager_scanned > 0);
> +
> +                vacrel->unaggressive_to = vacrel->current_block + (region_size - offset);
> +                vacrel->eager_scanned_failed_frozen = 0;
> +            }
> +        }

I'd put the logic for this in a separate function, feels complicated and
independent enough for it to be worth not having it in heap_vac_scan_next_block().


> @@ -1199,6 +1423,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * The next unskippable block and its visibility information is updated in
>   * vacrel.
>   *
> + * consider_eager_scan indicates whether or not we should consider scanning
> + * all-visible but not all-frozen blocks. was_eager_scanned is set to true if
> + * we decided to eager scan a block. In this case, next_unskippable_block is
> + * set to that block number.
> + *
>   * Note: our opinion of which blocks can be skipped can go stale immediately.
>   * It's okay if caller "misses" a page whose all-visible or all-frozen marking
>   * was concurrently cleared, though.  All that matters is that caller scan all
> @@ -1208,7 +1437,11 @@ heap_vac_scan_next_block(LVRelState *vacrel,
>   * to skip such a range is actually made, making everything safe.)
>   */
>  static void
> -find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
> +find_next_unskippable_block(
> +                            LVRelState *vacrel,
> +                            bool consider_eager_scan,
> +                            bool *was_eager_scanned,
> +                            bool *skipsallvis)

I don't think we ever have function definitions without a parameter on the
line with the function name.

Hm - isn't consider_eager_scan potentially outdated after
find_next_unskippable_block() iterated over a bunch of blocks? If
consider_eager_scan is false, this could very well go into the next
"semi-aggressive region", where consider_eager_scan should have been
re-enabled, no?


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: RFC: Extension Packaging & Lookup
Next
From: Andres Freund
Date:
Subject: Re: Adding NetBSD and OpenBSD to Postgres CI