From 4838bd1f11b748d2082caedfe4da506b8fe3f67a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v8 2/3] Make block-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen.  VACUUM won't freeze _any_ tuples on the page unless
_all_ tuples (that remain after pruning) are all-visible.  It may
occasionally be necessary to freeze the page due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff.  But the
FreezeLimit mechanism will seldom have any impact on which pages are
frozen anymore -- it is just a backstop now.

Freezing can now informally be thought of as something that takes place
at the level of an entire page, or not at all -- differences in XIDs
among tuples on the same page are not interesting, barring extreme
cases.  Freezing a page is now practically synonymous with setting the
page to all-visible in the visibility map, at least to users.

The main upside of the new approach to freezing is that it makes the
overhead of vacuuming much more predictable over time.  We avoid the
need for large balloon payments, since the system no longer accumulates
"freezing debt" that can only be paid off by anti-wraparound vacuuming.
This seems to have been particularly troublesome with append-only
tables, especially in the common case where XIDs from pages that are
marked all-visible for the first time are still fairly young (in
particular, not as old as indicated by VACUUM's vacuum_freeze_min_age
freezing cutoff).  Before now, nothing stopped these pages from being
set to all-visible (without also being set to all-frozen) the first
time they were reached by VACUUM, which meant that they just couldn't
be frozen until the next anti-wraparound VACUUM -- at which point the
XIDs from the unfrozen tuples might be much older than
vacuum_freeze_min_age.
In summary, the old vacuum_freeze_min_age-based FreezeLimit cutoff
could not _reliably_ limit freezing debt unless the GUC was set to 0.

There is a virtuous cycle enabled by the new approach to freezing:
freezing more tuples earlier during non-aggressive VACUUMs allows us to
advance relfrozenxid eagerly, which buys time.  This creates every
opportunity for the workload to naturally generate enough dead tuples
(or newly inserted tuples) to make the autovacuum launcher launch a
non-aggressive autovacuum.  The overall effect is that most individual
tables no longer require _any_ anti-wraparound vacuum operations.  This
effect also owes much to the enhancement added by commit ?????, which
loosened the coupling between freezing and advancing relfrozenxid,
allowing VACUUM to precisely determine a new relfrozenxid.

It's still possible (and sometimes even likely) that VACUUM won't be
able to freeze a tuple with a somewhat older XID due only to a cleanup
lock not being immediately available.  It's even possible that some
VACUUM operations will fail to advance relfrozenxid by very many XIDs
as a consequence.  But the impact over time should be negligible.  The
next VACUUM operation for the table will effectively get a new
opportunity to freeze (or perhaps remove) the same tuple that was
originally missed.  Once that happens, relfrozenxid will completely
catch up.  (Actually, one could reasonably argue that we never really
"fell behind" in the first place -- the amount of freezing needed to
significantly advance relfrozenxid won't have measurably increased at
any point.  A once-off drop in the extent to which VACUUM can advance
relfrozenxid is almost certainly harmless noise.)
Author: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 84 ++++++++++++++++++++++++----
 1 file changed, 72 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d481a300b..ea4b75189 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,6 +169,7 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoff for pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
@@ -200,6 +201,7 @@ typedef struct LVRelState
 	BlockNumber scanned_pages;	/* # pages examined (not skipped via VM) */
 	BlockNumber frozenskipped_pages;	/* # frozen pages skipped via VM */
 	BlockNumber removed_pages;	/* # pages removed by relation truncation */
+	BlockNumber newly_frozen_pages; /* # pages with tuples frozen by us */
 	BlockNumber lpdead_item_pages;	/* # pages with LP_DEAD items */
 	BlockNumber missed_dead_pages;	/* # pages with missed dead tuples */
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -474,6 +476,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	/* Set cutoffs for entire VACUUM */
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->FreezeLimit = FreezeLimit;
 	vacrel->MultiXactCutoff = MultiXactCutoff;
 
@@ -654,12 +657,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 vacrel->relnamespace,
 							 vacrel->relname,
 							 vacrel->num_index_scans);
-			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
+			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total), %u newly frozen (%.2f%% of total)\n"),
 							 vacrel->removed_pages,
 							 vacrel->rel_pages,
 							 vacrel->scanned_pages,
 							 orig_rel_pages == 0 ? 0 :
-							 100.0 * vacrel->scanned_pages / orig_rel_pages);
+							 100.0 * vacrel->scanned_pages / orig_rel_pages,
+							 vacrel->newly_frozen_pages,
+							 orig_rel_pages == 0 ? 0 :
+							 100.0 * vacrel->newly_frozen_pages / orig_rel_pages);
 
 			appendStringInfo(&buf, _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
 							 (long long) vacrel->tuples_deleted,
@@ -827,6 +833,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
 	vacrel->scanned_pages = 0;
 	vacrel->frozenskipped_pages = 0;
 	vacrel->removed_pages = 0;
+	vacrel->newly_frozen_pages = 0;
 	vacrel->lpdead_item_pages = 0;
 	vacrel->missed_dead_pages = 0;
 	vacrel->nonempty_pages = 0;
@@ -1027,7 +1034,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
 				/*
 				 * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
 				 * crossed, or this is the last page.  Scan the page, even
-				 * though it's all-visible (and possibly even all-frozen).
+				 * though it's all-visible (and likely all-frozen, too).
 				 */
 				all_visible_according_to_vm = true;
 			}
@@ -1589,7 +1596,7 @@ lazy_scan_prune(LVRelState *vacrel,
 	ItemId		itemid;
 	HeapTupleData tuple;
 	HTSV_Result res;
-	int			tuples_deleted,
+	int			tuples_deleted = 0,
 				lpdead_items,
 				recently_dead_tuples,
 				num_tuples,
@@ -1600,6 +1607,9 @@ lazy_scan_prune(LVRelState *vacrel,
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 	TransactionId NewRelfrozenxid;
 	MultiXactId NewRelminmxid;
+	TransactionId FreezeLimit = vacrel->FreezeLimit;
+	MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+	bool		freezeblk = false;
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
 
@@ -1610,7 +1620,6 @@ retry:
 	/* Initialize (or reset) page-level counters */
 	NewRelfrozenxid = vacrel->NewRelfrozenxid;
 	NewRelminmxid = vacrel->NewRelminmxid;
-	tuples_deleted = 0;
 	lpdead_items = 0;
 	recently_dead_tuples = 0;
 	num_tuples = 0;
@@ -1625,9 +1634,9 @@ retry:
 	 * lpdead_items's final value can be thought of as the number of tuples
 	 * that were deleted from indexes.
 	 */
-	tuples_deleted = heap_page_prune(rel, buf, vistest,
-									 InvalidTransactionId, 0, &nnewlpdead,
-									 &vacrel->offnum);
+	tuples_deleted += heap_page_prune(rel, buf, vistest,
+									  InvalidTransactionId, 0, &nnewlpdead,
+									  &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect LP_DEAD items and check for tuples
@@ -1678,11 +1687,16 @@ retry:
 		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
 		 * treated as advisory/unreliable, so we might as well be slightly
 		 * optimistic.
+		 *
+		 * We delay setting all_visible to false due to seeing an LP_DEAD
+		 * item.  We need to test "is the page all_visible if we just consider
+		 * remaining tuples with tuple storage?" below, when considering if we
+		 * should freeze the tuples on the page.  (all_visible will be set to
+		 * false for caller once we've decided on what to freeze.)
 		 */
 		if (ItemIdIsDead(itemid))
 		{
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
 			prunestate->has_lpdead_items = true;
 			continue;
 		}
@@ -1816,8 +1830,8 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
-									  vacrel->FreezeLimit,
-									  vacrel->MultiXactCutoff,
+									  FreezeLimit,
+									  MultiXactCutoff,
 									  &frozen[nfrozen],
 									  &tuple_totally_frozen,
 									  &NewRelfrozenxid,
@@ -1837,6 +1851,50 @@ retry:
 
 	vacrel->offnum = InvalidOffsetNumber;
 
+	/*
+	 * Freeze the whole page using OldestXmin (not FreezeLimit) as our cutoff
+	 * if the page is now eligible to be marked all_visible (barring any
+	 * LP_DEAD items) when the page is not already eligible to be marked
+	 * all_frozen.  We generally expect to freeze all of a block's tuples
+	 * together and at once, or none at all.  FreezeLimit is just a backstop
+	 * mechanism that makes sure that we don't overlook one or two older
+	 * tuples.
+	 *
+	 * For example, it's just about possible that successive VACUUM operations
+	 * will never quite manage to use the main block-level logic to freeze one
+	 * old tuple from a page where all other tuples are continually updated.
+	 * We should not be in any hurry to freeze such a tuple.  Even still, it's
+	 * better if we take care of it before an anti-wraparound VACUUM becomes
+	 * necessary -- that would mean that we'd have to wait for a cleanup lock
+	 * during the aggressive VACUUM, which has risks of its own.
+	 *
+	 * FIXME This code structure has been used for prototyping and testing the
+	 * algorithm, details of which have settled.  Code itself to be rewritten,
+	 * though.  It is backwards right now -- should be _starting_ with
+	 * OldestXmin (not FreezeLimit), since that's what happens at the
+	 * conceptual level.
+	 *
+	 * TODO Make vacuum_freeze_min_age GUC/reloption default -1, which will be
+	 * interpreted as "whatever autovacuum_freeze_max_age/2 is".  Idea is to
+	 * make FreezeLimit into a true backstop, and to do our best to avoid
+	 * waiting for a cleanup lock (always prefer to punt to the next VACUUM,
+	 * since we can advance relfrozenxid to the oldest XID on the page inside
+	 * lazy_scan_noprune).
+	 */
+	if (!freezeblk &&
+		((nfrozen > 0 && nfrozen < num_tuples) ||
+		 (prunestate->all_visible && !prunestate->all_frozen)))
+	{
+		freezeblk = true;
+		FreezeLimit = vacrel->OldestXmin;
+		MultiXactCutoff = vacrel->OldestMxact;
+		goto retry;
+	}
+
+	/* Time to define all_visible in a way that accounts for LP_DEAD items */
+	if (lpdead_items > 0)
+		prunestate->all_visible = false;
+
 	/*
 	 * We have now divided every item on the page into either an LP_DEAD item
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -1854,6 +1912,8 @@ retry:
 	{
 		Assert(prunestate->hastup);
 
+		vacrel->newly_frozen_pages++;
+
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
 		 * now.
@@ -1882,7 +1942,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
 									 frozen, nfrozen);
 			PageSetLSN(page, recptr);
 		}
-- 
2.30.2