From bf390a64d41ef61313bb4004916844f158c398d6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 9 Nov 2020 12:59:30 -0800
Subject: [PATCH v10 3/3] Teach nbtree to use bottom-up index deletion.

Teach nbtree to eagerly delete duplicate tuples representing old versions
in the event of a localized flood of version churn.  This situation is
detected using heuristics, including the recently added "index is
logically unchanged by an UPDATE" executor hint.

The immediate goal of bottom-up index deletion in nbtree is to avoid
"unnecessary" page splits caused entirely by duplicates needed only for
MVCC/versioning purposes.  It naturally has an even more useful effect,
though: it acts as a backstop against accumulating an excessive number of
index tuple versions for any given _logical row_.  Note that the
relationship between this localized condition and the proportion of
garbage tuples in the entire index is very loose, and can be very
volatile.

Bottom-up index deletion complements what we might now call "top-down
index deletion": index vacuuming performed by VACUUM.  It responds to the
immediate local needs of queries, while leaving it up to autovacuum to
perform infrequent clean sweeps of the index.

Also extend deletion of LP_DEAD-marked index tuples by teaching it to
delete extra index tuples (that are not LP_DEAD-marked) in passing.  This
doesn't increase the number of table blocks accessed by deletion, at least
in the common case where the table is not unlogged and wal_level >=
replica.  It increases the number of index tuples deleted significantly in
many cases.  For example, it almost never fails to delete at least a few
extra index tuples when the regression tests run, and can delete vastly
more index tuples fairly often.

Bottom-up deletion uses the same WAL record that we use when deleting
LP_DEAD items (the xl_btree_delete record).
This commit extends _bt_delitems_delete() to support granular TID deletion in posting list tuples, and to support a caller-supplied latestRemovedXid. Bump XLOG_PAGE_MAGIC because xl_btree_delete changed. No bump in BTREE_VERSION, since there are no changes to the on-disk representation of nbtree indexes. Indexes built on PostgreSQL 12 or PostgreSQL 13 will automatically benefit from the optimization (i.e. no reindexing required) following a pg_upgrade. This commit is the final major component of bottom-up index deletion, following an earlier commit that added heapam support. Author: Peter Geoghegan Reviewed-By: Victor Yegorov Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com --- src/include/access/nbtree.h | 21 +- src/include/access/nbtxlog.h | 101 ++--- src/backend/access/common/reloptions.c | 10 + src/backend/access/nbtree/README | 133 ++++++- src/backend/access/nbtree/nbtdedup.c | 315 +++++++++++++++- src/backend/access/nbtree/nbtinsert.c | 346 +++++++++++++++-- src/backend/access/nbtree/nbtpage.c | 490 ++++++++++++++++++------- src/backend/access/nbtree/nbtree.c | 2 +- src/backend/access/nbtree/nbtsort.c | 1 - src/backend/access/nbtree/nbtutils.c | 4 +- src/backend/access/nbtree/nbtxlog.c | 94 +++-- src/bin/psql/tab-complete.c | 4 +- doc/src/sgml/btree.sgml | 109 +++++- doc/src/sgml/ref/create_index.sgml | 16 + 14 files changed, 1360 insertions(+), 286 deletions(-) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 3b60e696eb..3c93d7f9b5 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -17,6 +17,7 @@ #include "access/amapi.h" #include "access/itup.h" #include "access/sdir.h" +#include "access/tableam.h" #include "access/xlogreader.h" #include "catalog/pg_am_d.h" #include "catalog/pg_index.h" @@ -767,7 +768,8 @@ typedef BTDedupStateData *BTDedupState; /* * BTVacuumPostingData is state that represents how to VACUUM a posting list - * tuple when some (though not 
all) of its TIDs are to be deleted. + * tuple when some (though not all) of its TIDs are to be deleted. (Also used + * by bottom-up index deletion.) * * Convention is that itup field is the original posting list tuple on input, * and palloc()'d final tuple used to overwrite existing tuple on output. @@ -963,6 +965,7 @@ typedef struct BTOptions /* fraction of newly inserted tuples prior to trigger index cleanup */ float8 vacuum_cleanup_index_scale_factor; bool deduplicate_items; /* Try to deduplicate items? */ + bool delete_items; /* Bottom-up delete items? */ } BTOptions; #define BTGetFillFactor(relation) \ @@ -978,6 +981,11 @@ typedef struct BTOptions relation->rd_rel->relam == BTREE_AM_OID), \ ((relation)->rd_options ? \ ((BTOptions *) (relation)->rd_options)->deduplicate_items : true)) +#define BTGetDeleteItems(relation) \ + (AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \ + relation->rd_rel->relam == BTREE_AM_OID), \ + ((relation)->rd_options ? \ + ((BTOptions *) (relation)->rd_options)->delete_items : true)) /* * Constant definition for progress reporting. 
Phase numbers must match @@ -1031,6 +1039,8 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan); extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, Size newitemsz, bool checkingunique); +extern bool _bt_bottomup_pass(Relation rel, Buffer buf, Relation heapRel, + Size newitemsz); extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base, OffsetNumber baseoff); extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup); @@ -1045,7 +1055,8 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting, * prototypes for functions in nbtinsert.c */ extern bool _bt_doinsert(Relation rel, IndexTuple itup, - IndexUniqueCheck checkUnique, Relation heapRel); + IndexUniqueCheck checkUnique, bool indexUnchanged, + Relation heapRel); extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack); extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child); @@ -1083,9 +1094,9 @@ extern bool _bt_page_recyclable(Page page); extern void _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable, int ndeletable, BTVacuumPosting *updatable, int nupdatable); -extern void _bt_delitems_delete(Relation rel, Buffer buf, - OffsetNumber *deletable, int ndeletable, - Relation heapRel); +extern void _bt_delitems_delete_check(Relation rel, Buffer buf, + Relation heapRel, + TM_IndexDeleteOp *delstate); extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact); diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h index 5c014bdc66..db1eb11042 100644 --- a/src/include/access/nbtxlog.h +++ b/src/include/access/nbtxlog.h @@ -176,24 +176,6 @@ typedef struct xl_btree_dedup #define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16)) -/* - * This is what we need to know about delete of individual leaf index tuples. 
- * The WAL record can represent deletion of any number of index tuples on a - * single index page when *not* executed by VACUUM. Deletion of a subset of - * the TIDs within a posting list tuple is not supported. - * - * Backup Blk 0: index page - */ -typedef struct xl_btree_delete -{ - TransactionId latestRemovedXid; - uint32 ndeleted; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ -} xl_btree_delete; - -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32)) - /* * This is what we need to know about page reuse within btree. This record * only exists to generate a conflict point for Hot Standby. @@ -211,9 +193,61 @@ typedef struct xl_btree_reuse_page #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page)) /* - * This is what we need to know about which TIDs to remove from an individual - * posting list tuple during vacuuming. An array of these may appear at the - * end of xl_btree_vacuum records. + * xl_btree_vacuum and xl_btree_delete records describe deletion of index + * tuples on a leaf page. The former variant is used by VACUUM, while the + * latter variant is used by the ad-hoc deletions that sometimes take place + * when btinsert() is called. + * + * The records are very similar. The only difference is that xl_btree_delete + * has to include a latestRemovedXid field to generate recovery conflicts. + * (VACUUM operations can just rely on earlier conflicts generated during + * pruning of the table whose TIDs the to-be-deleted index tuples point to. + * There are also small differences between each REDO routine that we don't go + * into here.) + * + * xl_btree_vacuum and xl_btree_delete both represent deletion of any number + * of index tuples on a single leaf page using page offset numbers. Both also + * support "updates" of index tuples, which is how deletes of a subset of TIDs + * contained in an existing posting list tuple are implemented. + * + * Updated posting list tuples are represented using xl_btree_update metadata. 
+ * The REDO routines each use the xl_btree_update entries (plus each + * corresponding original index tuple from the target leaf page) to generate + * the final updated tuple. + * + * Updates are only used when there will be some remaining TIDs left by the + * REDO routine. Otherwise the posting list tuple just gets deleted outright. + */ +typedef struct xl_btree_vacuum +{ + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_vacuum; + +#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) + +typedef struct xl_btree_delete +{ + TransactionId latestRemovedXid; + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_delete; + +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16)) + +/* + * The offsets that appear in xl_btree_update metadata are offsets into the + * original posting list from tuple, not page offset numbers. These are + * 0-based. The page offset number for the original posting list tuple comes + * from main xl_btree_delete/xl_btree_vacuum record. */ typedef struct xl_btree_update { @@ -224,31 +258,6 @@ typedef struct xl_btree_update #define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16)) -/* - * This is what we need to know about a VACUUM of a leaf page. The WAL record - * can represent deletion of any number of index tuples on a single index page - * when executed by VACUUM. It can also support "updates" of index tuples, - * which is how deletes of a subset of TIDs contained in an existing posting - * list tuple are implemented. 
(Updates are only used when there will be some - * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can - * just be deleted). - * - * Updated posting list tuples are represented using xl_btree_update metadata. - * The REDO routine uses each xl_btree_update (plus its corresponding original - * index tuple from the target leaf page) to generate the final updated tuple. - */ -typedef struct xl_btree_vacuum -{ - uint16 ndeleted; - uint16 nupdated; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TUPLES METADATA ARRAY FOLLOWS */ -} xl_btree_vacuum; - -#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) - /* * This is what we need to know about marking an empty subtree for deletion. * The target identifies the tuple removed from the parent page (note that we diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 8ccc228a8c..95e29345de 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -168,6 +168,16 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "delete_items", + "Enables \"bottom-up index deletion\" feature for this btree index", + RELOPT_KIND_BTREE, + ShareUpdateExclusiveLock /* since it applies only to later + * inserts */ + }, + true + }, /* list terminator */ {{NULL}} }; diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 27f555177e..ebe4408378 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -419,8 +419,8 @@ without a backend's cached page also being detected as invalidated, but only when we happen to recycle a block that once again gets recycled as the rightmost leaf page. 
-On-the-Fly Deletion Of Index Tuples ----------------------------------- +On-the-Fly deletion of LP_DEAD-bit-set index tuples +--------------------------------------------------- If a process visits a heap tuple and finds that it's dead and removable (ie, dead to all open transactions, not only that process), then we can @@ -439,19 +439,26 @@ from the index immediately; since index scans only stop "between" pages, no scan can lose its place from such a deletion. We separate the steps because we allow LP_DEAD to be set with only a share lock (it's exactly like a hint bit for a heap tuple), but physically removing tuples requires -exclusive lock. In the current code we try to remove LP_DEAD tuples when -we are otherwise faced with having to split a page to do an insertion (and -hence have exclusive lock on it already). Deduplication can also prevent -a page split, but removing LP_DEAD tuples is the preferred approach. -(Note that posting list tuples can only have their LP_DEAD bit set when -every table TID within the posting list is known dead.) +exclusive lock. Also, delaying the deletion often allows us to pick up +extra index tuples that weren't initially safe for index scans to mark +LP_DEAD. Live index tuples that are close to LP_DEAD-marked tuples in +time and space are highly likely to become dead-to-all shortly. +This makes workloads that greatly benefit from the LP_DEAD optimization +resilient against intermittent disruption from long-running transactions +that hold open an MVCC snapshot (compared to the behavior prior to +PostgreSQL 14, the version that taught the LP_DEAD deletion process to +check if nearby index tuples are safe to delete in passing). -This leaves the index in a state where it has no entry for a dead tuple -that still exists in the heap. This is not a problem for the current -implementation of VACUUM, but it could be a problem for anything that -explicitly tries to find index entries for dead tuples.
(However, the -same situation is created by REINDEX, since it doesn't enter dead -tuples into the index.) +We only try to delete LP_DEAD tuples (and nearby tuples) when we are +otherwise faced with having to split a page to do an insertion (and hence +have exclusive lock on it already). Deduplication and bottom-up index +deletion can also prevent a page split, but removing LP_DEAD tuples is +always the preferred approach. (Note that posting list tuples can only +have their LP_DEAD bit set when every table TID within the posting list is +known dead. This isn't much of a problem because LP_DEAD deletion can +often still do granular deletion of TIDs from a posting list. This will +happen when the posting list tuple's TIDs point to a table block that some +LP_DEAD-marked index tuple happens to point to.) It's sufficient to have an exclusive lock on the index page, not a super-exclusive lock, to do deletion of LP_DEAD items. It might seem @@ -469,6 +476,87 @@ LSN of the page, and only act to set LP_DEAD bits when the LSN has not changed at all. (Avoiding dropping the pin entirely also makes it safe, of course.) +Bottom-Up deletion +------------------ + +We attempt to delete whatever duplicates happen to be present on the page +when the duplicates are suspected to be caused by version churn from +successive UPDATEs. This only happens when we receive an executor hint +indicating that optimizations like heapam's HOT have not worked out for +the index -- the incoming tuple must be a logically unchanged duplicate +which is needed for MVCC purposes, suggesting that that might well be the +dominant source of new index tuples on the leaf page in question. (Also, +bottom-up deletion is triggered within unique indexes in cases with +continual INSERT and DELETE related churn, since that is easy to detect +without any external hint.) 
+ +On-the-fly deletion of LP_DEAD-bit-set items (which can include deletion +of other close by index tuples) will already have failed to prevent a page +split when a bottom-up deletion pass takes place (often because no LP_DEAD +bits were ever set on the page). The two mechanisms have closely related +implementations. The same WAL records are used for each operation, and +the same tableam infrastructure is used to determine what TIDs/tuples are +actually safe to delete. The implementations only differ in how they pick +TIDs to consider for deletion, and whether or not the tableam will give up +before accessing all table blocks (bottom-up deletion lives with the +uncertainty of its success by keeping the cost of failure low). Even +still, the two mechanisms are clearly distinct at the conceptual level. + +Bottom-up index deletion is driven entirely by heuristics (whereas +on-the-fly deletion is guaranteed to delete at least those index tuples +that are already LP_DEAD marked). We have no certainty that we'll find +even one index tuple to delete. That's why we access as few tableam +blocks as possible, and only commit to accessing the next table block in +line when a positive outcome for the operation as a whole still looks +likely. This means that the tableam needs to have a fairly good idea of +how much space it has freed on the leaf page, to keep the costs and +benefits in balance per operation (and even across successive operations +affecting the same leaf page). + +Bottom-up index deletion can be thought of as a backstop mechanism against +unnecessary version-driven page splits. It is based in part on an idea +from generational garbage collection: the "generational hypothesis". This +is the empirical observation that "most objects die young". Within +nbtree, new index tuples often quickly appear in the same place, and then +quickly become garbage. 
There can be intense concentrations of garbage in +relatively few leaf pages (or there would be without the intervention of +bottom-up deletion). This occurs with workloads that consist of skewed +UPDATEs. There is little to lose and much to gain by spending a few +cycles to become reasonably sure that a page split is truly necessary +(when it seems like there is some chance of that) -- page splits are +expensive, and practically irreversible. + +We expect to find a reasonably large number of tuples that are safe to +delete within each bottom-up pass. If we don't then we won't need to +consider the question of bottom-up deletion for the same leaf page for +quite a while (usually because the page splits, which resolves the +situation, at least for a while). We expect to perform regular bottom-up +deletion operations against pages that are at constant risk of unnecessary +page splits caused only by version churn. When the mechanism works well +we'll constantly be "on the verge" of having version-churn-driven page +splits, but never actually have even one. + +Our duplicate heuristics work well despite being fairly simple. +Unnecessary page splits only occur when there are truly pathological +levels of version churn (in theory a small amount of version churn could +make a page split occur earlier than strictly necessary, but that's pretty +harmless). We don't have to understand the underlying workload; we only +have to understand the general nature of the pathology that we target. +Version churn is easy to spot when it is truly pathological. Affected +leaf pages are homogeneous. + +If version churn hasn't become a real problem then we don't actually want +to do anything about it anyway (we should be lazy about cleaning it up, at +least). 
All that really matters is that garbage does not become +concentrated in any one part of the key space (the number of physical +versions accessed by queries to read any given logical row should remain +low over time and across all parts of the key space). Remaining garbage +tuples can be thought of as "floating garbage" that VACUUM will eventually +get around to removing (VACUUM can be thought of as a top-down mechanism +that bottom-up garbage collection complements). The absolute number of +garbage tuples (and even the proportion of all index tuples that are +garbage) is generally much less important. + WAL Considerations ------------------ @@ -767,9 +855,10 @@ into a single physical tuple with a posting list (a simple array of heap TIDs with the standard item pointer format). Deduplication is always applied lazily, at the point where it would otherwise be necessary to perform a page split. It occurs only when LP_DEAD items have been -removed, as our last line of defense against splitting a leaf page. We -can set the LP_DEAD bit with posting list tuples, though only when all -TIDs are known dead. +removed, as our last line of defense against splitting a leaf page +(bottom-up index deletion may be attempted first, as our second last line +of defense). We can set the LP_DEAD bit with posting list tuples, though +only when all TIDs are known dead. Our lazy approach to deduplication allows the page space accounting used during page splits to have absolutely minimal special case logic for @@ -826,6 +915,16 @@ delay a split that is probably inevitable anyway. This allows us to avoid the overhead of attempting to deduplicate with unique indexes that always have few or no duplicates. +Note: Avoiding "unnecessary" page splits driven by version churn is also +the goal of bottom-up index deletion, which was added to PostgreSQL 14. +Bottom-up index deletion is now the preferred way to deal with this +problem (with all kinds of indexes, though especially with unique +indexes). 
Still, deduplication can sometimes augment bottom-up index +deletion. When deletion cannot free tuples (due to an old snapshot +holding up cleanup), falling back on deduplication provides additional +capacity. Delaying the page split by deduplicating can allow a future +bottom-up deletion pass of the same page to succeed. + Posting list splits ------------------- diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c index 9e535124c4..a4050ddd68 100644 --- a/src/backend/access/nbtree/nbtdedup.c +++ b/src/backend/access/nbtree/nbtdedup.c @@ -19,6 +19,8 @@ #include "miscadmin.h" #include "utils/rel.h" +static void _bt_bottomup_finish_pending(Page page, TM_IndexDeleteOp *delstate, + BTDedupState state); static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state, OffsetNumber minoff, IndexTuple newitem); static void _bt_singleval_fillfactor(Page page, BTDedupState state, @@ -267,6 +269,157 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, pfree(state); } +/* + * Perform bottom-up index deletion pass. + * + * See if duplicate index tuples are eligible to be deleted by accessing + * visibility information from the tableam. Give up if we have to access more + * than a few tableam blocks. Caller tries to avoid "unnecessary" page splits + * (splits driven only by version churn) by calling here when it looks like + * that's about to happen. It's normal for there to be a lot of calls here + * for pages that are constantly at risk of an unnecessary split. + * + * Each failure to delete a duplicate/promising tuple here is a kind of + * learning experience. It results in caller falling back on splitting the + * page (or on a deduplication pass), discouraging future calls back here for + * the same key space range covered by a failed page (or at least discouraging + * processing the original duplicates in case where caller falls back on a + * successful deduplication pass). 
We converge on the most effective strategy + * for each page in the index over time. + * + * Returns true on success, in which case caller can assume page split will be + * avoided for a reasonable amount of time. Returns false when caller should + * deduplicate the page (if possible at all). + * + * Note: occasionally a true return value does not actually indicate that any + * items could be deleted. It might just indicate that caller should not go + * on to perform a deduplication pass. Caller is not expected to care about + * the difference. + * + * Note: Caller should have already deleted all existing items with their + * LP_DEAD bits set. + */ +bool +_bt_bottomup_pass(Relation rel, Buffer buf, Relation heapRel, Size newitemsz) +{ + OffsetNumber offnum, + minoff, + maxoff; + Page page = BufferGetPage(buf); + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); + BTDedupState state; + TM_IndexDeleteOp delstate; + bool neverdedup = false; + int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + + /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */ + newitemsz += sizeof(ItemIdData); + + /* Initialize deduplication state */ + state = (BTDedupState) palloc(sizeof(BTDedupStateData)); + state->deduplicate = true; + state->nmaxitems = 0; + state->maxpostingsize = BLCKSZ; /* "posting list size" not a concern */ + state->base = NULL; + state->baseoff = InvalidOffsetNumber; + state->basetupsize = 0; + state->htids = palloc(state->maxpostingsize); + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; + state->nintervals = 0; + + /* + * Initialize tableam state that describes bottom-up index deletion + * operation. + * + * We will ask tableam to free 1/16 of BLCKSZ. We don't usually expect to + * have to free much space each call here in order to avoid page splits. + * We don't want to be too aggressive since in general the tableam will + * have to access more table blocks when we ask for more free space. 
In + * general we try to be conservative about what we ask for (though not too + * conservative), while leaving it up to the tableam to ramp up the number + * of tableam blocks accessed when conditions in the table structure + * happen to favor it. + * + * We expect to end up back here again and again for any leaf page that is + * more or less constantly at risk of unnecessary page splits -- in fact + * that's what happens when bottom-up deletion really helps. We must + * avoid thrashing when this becomes very frequent at the level of an + * individual page. Our free space target helps with that. It balances + * the costs and benefits over time and across related bottom-up deletion + * passes. + */ + delstate.alltids = false; /* Only visit most promising table blocks */ + delstate.ndeltids = 0; + delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete)); + delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus)); + delstate.targetfreespace = Max(BLCKSZ / 16, newitemsz); + + /* Now remember details of the page in the state we'll pass to tableam */ + minoff = P_FIRSTDATAKEY(opaque); + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = minoff; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + + Assert(!ItemIdIsDead(itemid)); + + if (offnum == minoff) + { + _bt_dedup_start_pending(state, itup, offnum); + } + else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts && + _bt_dedup_save_htid(state, itup)) + { + /* Tuple is equal; just added its TIDs to pending interval */ + } + else + { + /* Finalize interval -- move its TIDs to bottom-up state */ + _bt_bottomup_finish_pending(page, &delstate, state); + + /* itup starts new pending interval */ + _bt_dedup_start_pending(state, itup, offnum); + } + } + /* Finalize final interval -- move its TIDs to bottom-up state */ + _bt_bottomup_finish_pending(page, &delstate, 
state); + + /* + * When there are no duplicates on the page at all, we should not tell + * caller to deduplicate later on. + * + * Note: We accept the possibility that there may be no promising + * tuples/duplicates at all (we always finish what we started). The + * tableam has its own heuristics that it can fall back on, so it still + * has some chance of success. + */ + if (state->nintervals == 0) + neverdedup = true; + + /* Done with dedup state */ + pfree(state->htids); + pfree(state); + + /* Confirm which TIDs are dead-to-all, then physically delete */ + _bt_delitems_delete_check(rel, buf, heapRel, &delstate); + + /* Done with deletion state */ + pfree(delstate.deltids); + pfree(delstate.status); + + /* Carry out earlier decision to have caller avoid deduplication now */ + if (neverdedup) + return true; + + /* Don't dedup when we won't end up back here any time soon anyway */ + return PageGetExactFreeSpace(page) >= Max(BLCKSZ / 24, newitemsz); +} + /* * Create a new pending posting list tuple based on caller's base tuple. * @@ -452,6 +605,164 @@ _bt_dedup_finish_pending(Page newpage, BTDedupState state) return spacesaving; } +/* + * Finalize interval during bottom-up index deletion. + * + * Determines which TIDs are to be marked promising based on heuristics. + */ +static void +_bt_bottomup_finish_pending(Page page, TM_IndexDeleteOp *delstate, + BTDedupState state) +{ + bool dupinterval = (state->nitems > 1); + + Assert(state->nitems > 0); + Assert(state->nitems <= state->nhtids); + Assert(state->intervals[state->nintervals].baseoff == state->baseoff); + + /* + * All TIDs from all tuples are at least recorded in state. Tuples are + * marked promising when they're duplicates (i.e. when they appear in an + * interval with more than one item, as when we expect to create a new + * posting list tuple in the deduplication case). + * + * It's easy to see what this means in the plain non-pivot tuple case: + * TIDs from duplicate plain tuples are promising.
Posting list tuples + * are more subtle. We ought to do something with posting list tuples, + * though plain tuples tend to be more promising targets. (Plain tuples + * are the most likely to be dead/deletable because they suggest version + * churn. And they allow us to free more space when we actually succeed). + */ + for (int i = 0; i < state->nitems; i++) + { + OffsetNumber offnum = state->baseoff + i; + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + TM_IndexDelete *cdeltid; + TM_IndexStatus *dstatus; + + cdeltid = &delstate->deltids[delstate->ndeltids]; + dstatus = &delstate->status[delstate->ndeltids]; + + if (!BTreeTupleIsPosting(itup)) + { + /* Easy case: A plain non-pivot tuple's TID */ + cdeltid->tid = itup->t_tid; + cdeltid->id = delstate->ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = dupinterval; + dstatus->deleteitup = false; /* for now */ + dstatus->tupsize = + ItemIdGetLength(itemid) + sizeof(ItemIdData); + delstate->ndeltids++; + } + else + { + /* + * Harder case: A posting list tuple's TIDs (multiple TIDs). + * + * Only a single TID from a posting list tuple may be promising, + * and only when it appears in a duplicate tuple (just like plain + * tuple case). In general there is a good chance that the + * posting list tuple relates to multiple logical rows, rather + * than multiple versions of just one logical row. (It can only + * be the latter case when a previous bottom-up deletion pass + * failed, necessitating a deduplication pass, which isn't all + * that common.) + * + * There is a pretty good chance that at least one of the logical + * rows from the posting list was updated, and so had a successor + * version (about as good a chance as it is in the regular tuple + * case, at least). We should at least try to follow the regular + * tuple case while making the conservative assumption that there + * can only be one affected logical row per posting list tuple. 
We + * do that by picking one TID when it appears to be from the + * predominant tableam block in the posting list (if any one + * tableam block predominates). The approach we take is to either + * choose the first or last TID in the posting list (if any at + * all). We go with whichever one is on the same tableam block at + * the middle tuple (and only the first TID when both the first + * and last TIDs relate to the same tableam block -- we could + * easily be too aggressive here). + * + * If it turns out that there are multiple old versions of a + * single logical table row, we still have a pretty good chance of + * being able to delete them this way. We don't want to give too + * strong a signal to the tableam. But we should always try to + * give some useful hints. Even cases with considerable + * uncertainty can consistently avoid an unnecessary page split, + * in part because the tableam will have tricks of its own for + * figuring out where to look in marginal cases. + */ + int nitem = BTreeTupleGetNPosting(itup); + bool firstpromise = false; + bool lastpromise = false; + + Assert(_bt_posting_valid(itup)); + + if (dupinterval) + { + /* Figure out if there really should be promising TIDs */ + BlockNumber minblocklist, + midblocklist, + maxblocklist; + ItemPointer mintid, + midtid, + maxtid; + + mintid = BTreeTupleGetHeapTID(itup); + midtid = BTreeTupleGetPostingN(itup, nitem / 2); + maxtid = BTreeTupleGetMaxHeapTID(itup); + minblocklist = ItemPointerGetBlockNumber(mintid); + midblocklist = ItemPointerGetBlockNumber(midtid); + maxblocklist = ItemPointerGetBlockNumber(maxtid); + + firstpromise = (minblocklist == midblocklist); + lastpromise = (!firstpromise && midblocklist == maxblocklist); + } + + /* No more than one TID from itup can be promising */ + Assert(!(firstpromise && lastpromise)); + + for (int p = 0; p < nitem; p++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, p); + + cdeltid->tid = *htid; + cdeltid->id = delstate->ndeltids; + 
dstatus->idxoffnum = offnum; + dstatus->ispromising = false; + + if ((firstpromise && p == 0) || + (lastpromise && p == nitem - 1)) + dstatus->ispromising = true; + + dstatus->deleteitup = false; /* for now */ + dstatus->tupsize = sizeof(ItemPointerData) + 1; + delstate->ndeltids++; + + cdeltid++; + dstatus++; + } + } + } + + if (dupinterval) + { + /* + * Maintain interval state for consistency with true deduplication + * case + */ + state->intervals[state->nintervals].nitems = state->nitems; + state->nintervals++; + } + + /* Reset state for next interval */ + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; +} + /* * Determine if page non-pivot tuples (data items) are all duplicates of the * same value -- if they are, deduplication's "single value" strategy should @@ -622,8 +933,8 @@ _bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids) * Generate a replacement tuple by "updating" a posting list tuple so that it * no longer has TIDs that need to be deleted. * - * Used by VACUUM. Caller's vacposting argument points to the existing - * posting list tuple to be updated. + * Used by both VACUUM and bottom-up index deletion. Caller's vacposting + * argument points to the existing posting list tuple to be updated. * * On return, caller's vacposting argument will point to final "updated" * tuple, which will be palloc()'d in caller's memory context. 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index dde43b1415..4fc1d0001e 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -17,7 +17,6 @@ #include "access/nbtree.h" #include "access/nbtxlog.h" -#include "access/tableam.h" #include "access/transam.h" #include "access/xloginsert.h" #include "miscadmin.h" @@ -37,6 +36,7 @@ static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate, static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel); static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack); @@ -61,7 +61,13 @@ static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTInsertState insertstate, bool lpdeadonly, bool checkingunique, - bool uniquedup); + bool uniquedup, bool indexUnchanged); +static void _bt_lpdead_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff); +static BlockNumber *_bt_lpdead_blocks(Page page, OffsetNumber *deletable, + int ndeletable, int *nblocks); +static int _bt_lpdead_blocks_cmp(const void *arg1, const void *arg2); /* * _bt_doinsert() -- Handle insertion of a single index tuple in the tree. @@ -75,6 +81,11 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and * don't actually insert. * + * indexUnchanged executor hint indicates if itup is from an + * UPDATE that didn't logically change the indexed value, but + * must nevertheless have a new entry to point to a successor + * version. + * * The result value is only significant for UNIQUE_CHECK_PARTIAL: * it must be true if the entry is known unique, else false. 
* (In the current implementation we'll also return true after a
@@ -83,7 +94,8 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
 */
bool
_bt_doinsert(Relation rel, IndexTuple itup,
- IndexUniqueCheck checkUnique, Relation heapRel)
+ IndexUniqueCheck checkUnique, bool indexUnchanged,
+ Relation heapRel)
{
 bool is_unique = false;
 BTInsertStateData insertstate;
@@ -238,7 +250,7 @@ search:
 * checkingunique.
 */
 newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
- stack, heapRel);
+ indexUnchanged, stack, heapRel);
 _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack, itup, insertstate.itemsz, newitemoff, insertstate.postingoff, false);
@@ -777,6 +789,17 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 * room for the new tuple, this function moves right, trying to find a
 * legal page that does.)
 *
+ * If 'indexUnchanged' is true, this is for an UPDATE that didn't
+ * logically change the indexed value, but must nevertheless have a new
+ * entry to point to a successor version. This hint from the executor
+ * will influence our behavior when the page might have to be split and
+ * we must consider our options. Bottom-up index deletion can avoid
+ * pathological version-driven page splits, but we only want to go to the
+ * trouble of trying it when we already have moderate confidence that
+ * it's appropriate. The hint should not significantly affect our
+ * behavior over time unless practically all inserts onto the leaf page
+ * get the hint.
+ *
 * On exit, insertstate buffer contains the chosen insertion page, and
 * the offset within that page is returned.
If _bt_findinsertloc needed * to move right, the lock and pin on the original page are released, and @@ -793,6 +816,7 @@ static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel) { @@ -817,7 +841,7 @@ _bt_findinsertloc(Relation rel, if (itup_key->heapkeyspace) { /* Keep track of whether checkingunique duplicate seen */ - bool uniquedup = false; + bool uniquedup = indexUnchanged; /* * If we're inserting into a unique index, we may have to walk right @@ -881,7 +905,8 @@ _bt_findinsertloc(Relation rel, */ if (PageGetFreeSpace(page) < insertstate->itemsz) _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false, - checkingunique, uniquedup); + checkingunique, uniquedup, + indexUnchanged); } else { @@ -923,7 +948,8 @@ _bt_findinsertloc(Relation rel, { /* Erase LP_DEAD items (won't deduplicate) */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + checkingunique, false, + indexUnchanged); if (PageGetFreeSpace(page) >= insertstate->itemsz) break; /* OK, now we have enough space */ @@ -977,7 +1003,7 @@ _bt_findinsertloc(Relation rel, * This can only erase LP_DEAD items (it won't deduplicate). */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + checkingunique, false, indexUnchanged); /* * Do new binary search. New insert location cannot overlap with any @@ -2609,15 +2635,24 @@ _bt_pgaddtup(Page page, * _bt_delete_or_dedup_one_page - Try to avoid a leaf page split by attempting * a variety of operations. * - * There are two operations performed here: deleting items already marked - * LP_DEAD, and deduplication. If both operations fail to free enough space - * for the incoming item then caller will go on to split the page. We always - * attempt our preferred strategy (which is to delete items whose LP_DEAD bit - * are set) first. If that doesn't work out we move on to deduplication. 
+ * There are three operations performed here: deleting items already marked
+ * LP_DEAD, bottom-up index deletion, and deduplication. If all three
+ * operations fail to free enough space for the incoming item then caller will
+ * go on to split the page. We always attempt our preferred strategy (which
+ * is to delete items whose LP_DEAD bits are set) first. If that doesn't work
+ * out we consider alternatives. Most calls here will not exhaustively
+ * attempt all three operations. Deduplication and bottom-up index deletion
+ * are relatively expensive operations, so we try to pick one or the other up
+ * front (whichever one seems better for this specific page).
+ *
- * Caller's checkingunique and uniquedup arguments help us decide if we should
- * perform deduplication, which is primarily useful with low cardinality data,
- * but can sometimes absorb version churn.
+ * Caller's checkingunique, uniquedup, and indexUnchanged arguments help us
+ * decide which alternative strategy we should attempt (or attempt first).
+ * Deduplication is primarily useful with low cardinality data. Bottom-up
+ * index deletion is a backstop against version churn caused by repeated
+ * UPDATE statements where affected indexes don't receive logical changes
+ * (because an optimization like heapam's HOT cannot be applied in the
+ * tableam). But useful interplay between both techniques over time is
+ * sometimes possible.
 *
 * Callers that only want us to look for/delete LP_DEAD items can ask for that
 * directly by passing true 'lpdeadonly' argument.
@@ -2639,11 +2674,12 @@ static void
_bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
 BTInsertState insertstate,
 bool lpdeadonly, bool checkingunique,
- bool uniquedup)
+ bool uniquedup, bool indexUnchanged)
{
 OffsetNumber deletable[MaxIndexTuplesPerPage];
 int ndeletable = 0;
 OffsetNumber offnum,
+ minoff,
 maxoff;
 Buffer buffer = insertstate->buf;
 BTScanInsert itup_key = insertstate->itup_key;
@@ -2657,8 +2693,9 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
 * Scan over all items to see which ones need to be deleted according to
 * LP_DEAD flags.
 */
+ minoff = P_FIRSTDATAKEY(opaque);
 maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = P_FIRSTDATAKEY(opaque);
+ for (offnum = minoff;
 offnum <= maxoff;
 offnum = OffsetNumberNext(offnum))
 {
@@ -2670,7 +2707,8 @@
 if (ndeletable > 0)
 {
- _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ _bt_lpdead_pass(rel, buffer, heapRel, deletable, ndeletable,
+ minoff, maxoff);
 insertstate->bounds_valid = false;
 /* Return when a page split has already been avoided */
@@ -2689,18 +2727,19 @@
 * return at this point (or when we go on to try either or both of our
 * other strategies and they also fail). We do not bother expending a
 * separate write to clear it, however. Caller will definitely clear it
- * when it goes on to split the page (plus deduplication knows to clear
- * the flag when it actually modifies the page).
+ * when it goes on to split the page (note also that the deduplication
+ * process knows to clear the flag when it actually modifies the page).
 */
 if (lpdeadonly)
 return;

 /*
 * We can get called in the checkingunique case when there is no reason to
- * believe that there are any duplicates on the page; we should at least
- * still check for LP_DEAD items. If that didn't work out, give up and
- * let caller split the page.
Deduplication cannot be justified given
- * there is no reason to think that there are duplicates.
+ * believe that there are any duplicates on the page; we just needed to
+ * check for LP_DEAD items. When we are called under these circumstances
+ * and get this far, LP_DEAD item deletion didn't work out, and so we give
+ * up and let caller split the page. (A bottom-up pass or a deduplication
+ * pass is also unlikely to work out.)
 */
 if (checkingunique && !uniquedup)
 return;
@@ -2708,6 +2747,22 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
 /* Assume bounds about to be invalidated (this is almost certain now) */
 insertstate->bounds_valid = false;
+
+ /*
+ * Perform bottom-up index deletion pass when executor hint indicated that
+ * incoming item is logically unchanged, or for a unique index that is
+ * known to have physical duplicates for some other reason. (There is a
+ * large overlap between these two cases for a unique index. It's worth
+ * having both triggering conditions in order to apply the optimization in
+ * the event of successive related INSERT and DELETE statements.)
+ *
+ * We'll go on to do a deduplication pass when a bottom-up pass fails to
+ * free an acceptable amount of space (a non-trivial fraction of
+ * the page that exceeds the new item's size).
+ */
+ if (BTGetDeleteItems(rel) && (indexUnchanged || uniquedup) &&
+ _bt_bottomup_pass(rel, buffer, heapRel, insertstate->itemsz))
+ return;
+
 /*
 * Perform deduplication pass, though only when it is enabled for the
 * index and known to be safe (it must be an allequalimage index).
@@ -2716,3 +2771,244 @@
 _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
 insertstate->itemsz, checkingunique);
 }
+
+/*
+ * _bt_lpdead_pass - Try to avoid a leaf page split by deleting LP_DEAD-set
+ * index tuples, as well as any other nearby tuples that are convenient to
+ * delete in passing.
+ *
+ * The tableam can inexpensively check extra index tuples whose TIDs happen to
+ * point to the same table blocks as an LP_DEAD-marked tuple's TID.
+ * This routine is responsible for gathering TIDs from LP_DEAD-marked index
+ * tuples (which are surely deletable) alongside index tuples with same-block
+ * TIDs (which are totally speculative) for processing by the tableam.
+ * Physical deletion of the final known-safe TIDs from the leaf page takes
+ * place at the end.
+ *
+ * In practice it is often possible to delete at least a few extra tuples here
+ * for indexUnchanged callers. This will happen when LP_DEAD bit setting was
+ * temporarily disrupted by some transaction that held open an MVCC snapshot
+ * for a relatively long time; we can pick up newer version-duplicate index
+ * tuples that couldn't have their LP_DEAD bits set by UPDATEs, provided
+ * they're on the same tableam block as earlier versions that were marked (and
+ * provided the snapshot is no longer held open by now). We don't try to be
+ * clever, though. We simply focus on extra tuples that are practically free
+ * to check in passing. Indeed, the number of extra index tuples that
+ * turn out to be deletable often greatly exceeds the number of LP_DEAD-marked
+ * index tuples.
+ */ +static void +_bt_lpdead_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff) +{ + Page page = BufferGetPage(buffer); + TM_IndexDeleteOp delstate; + BlockNumber *blocks; + int nblocks; + OffsetNumber offnum; + + blocks = _bt_lpdead_blocks(page, deletable, ndeletable, &nblocks); + + delstate.alltids = true; /* Not doing bottom-up deletion */ + delstate.ndeltids = 0; + delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete)); + delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus)); + delstate.targetfreespace = 0; /* Visiting all table blocks anyway */ + + for (offnum = minoff; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + TM_IndexDelete *cdeltid; + TM_IndexStatus *dstatus; + BlockNumber tidblock; + BlockNumber *match; + + cdeltid = &delstate.deltids[delstate.ndeltids]; + dstatus = &delstate.status[delstate.ndeltids]; + + if (!BTreeTupleIsPosting(itup)) + { + /* Plain non-pivot tuple's TID */ + tidblock = ItemPointerGetBlockNumber(&itup->t_tid); + + match = (BlockNumber *) bsearch(&tidblock, blocks, nblocks, + sizeof(BlockNumber), + _bt_lpdead_blocks_cmp); + + if (!match) + continue; + + /* + * TID has heap block among those pointed to by LP_DEAD-bit set + * tuples on leaf page + */ + cdeltid->tid = itup->t_tid; + cdeltid->id = delstate.ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = false; /* irrelevant */ + dstatus->deleteitup = ItemIdIsDead(itemid); /* for now */ + dstatus->tupsize = 1; /* irrelevant */ + delstate.ndeltids++; + } + else + { + int nitem = BTreeTupleGetNPosting(itup); + + for (int p = 0; p < nitem; p++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, p); + + tidblock = ItemPointerGetBlockNumber(htid); + + match = (BlockNumber *) bsearch(&tidblock, blocks, nblocks, + 
sizeof(BlockNumber), + _bt_lpdead_blocks_cmp); + + if (!match) + continue; + + /* + * TID has heap block among those pointed to by LP_DEAD-bit + * set tuples on leaf page + */ + cdeltid->tid = *htid; + cdeltid->id = delstate.ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = false; /* irrelevant */ + dstatus->deleteitup = ItemIdIsDead(itemid); /* for now */ + dstatus->tupsize = 1; /* irrelevant */ + delstate.ndeltids++; + + cdeltid++; + dstatus++; + } + } + } + + Assert(delstate.ndeltids >= ndeletable); + + /* Physically delete LP_DEAD tuples (plus extra dead-to-all TIDs) */ + _bt_delitems_delete_check(rel, buffer, heapRel, &delstate); + + /* be tidy */ + pfree(blocks); + pfree(delstate.deltids); + pfree(delstate.status); +} + +/* + * _bt_lpdead_blocks() -- Build a list of LP_DEAD related table blocks + * + * Build a list of those blocks pointed to by index tuples that caller found + * had their LP_DEAD bits set. Used by _bt_lpdead_pass to delete extra nearby + * tuples that are convenient to delete in passing. 
+ */ +static BlockNumber * +_bt_lpdead_blocks(Page page, OffsetNumber *deletable, int ndeletable, + int *nblocks) +{ + int spacenhtids; + int nhtids; + ItemPointer htids; + BlockNumber *blocks; + BlockNumber lastblock = InvalidBlockNumber; + + /* Array will grow iff there are posting list tuples to consider */ + spacenhtids = ndeletable; + nhtids = 0; + htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids); + for (int i = 0; i < ndeletable; i++) + { + ItemId itemid; + IndexTuple itup; + + itemid = PageGetItemId(page, deletable[i]); + itup = (IndexTuple) PageGetItem(page, itemid); + + Assert(ItemIdIsDead(itemid)); + Assert(!BTreeTupleIsPivot(itup)); + + if (!BTreeTupleIsPosting(itup)) + { + if (nhtids + 1 > spacenhtids) + { + spacenhtids *= 2; + htids = (ItemPointer) + repalloc(htids, sizeof(ItemPointerData) * spacenhtids); + } + + Assert(ItemPointerIsValid(&itup->t_tid)); + ItemPointerCopy(&itup->t_tid, &htids[nhtids]); + nhtids++; + } + else + { + int nposting = BTreeTupleGetNPosting(itup); + + if (nhtids + nposting > spacenhtids) + { + spacenhtids = Max(spacenhtids * 2, nhtids + nposting); + htids = (ItemPointer) + repalloc(htids, sizeof(ItemPointerData) * spacenhtids); + } + + for (int j = 0; j < nposting; j++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, j); + + Assert(ItemPointerIsValid(htid)); + ItemPointerCopy(htid, &htids[nhtids]); + nhtids++; + } + } + } + + Assert(nhtids >= ndeletable); + + qsort((void *) htids, nhtids, sizeof(ItemPointerData), + (int (*) (const void *, const void *)) ItemPointerCompare); + + blocks = palloc(sizeof(BlockNumber) * nhtids); + *nblocks = 0; + + for (int i = 0; i < nhtids; i++) + { + ItemPointer tid = htids + i; + BlockNumber tidblock = ItemPointerGetBlockNumber(tid); + + if (tidblock == lastblock) + continue; + + lastblock = tidblock; + blocks[*nblocks] = tidblock; + (*nblocks)++; + } + + pfree(htids); + + return blocks; +} + +/* + * _bt_lpdead_blocks_cmp() -- BlockNumber comparator + * + * Used by 
_bt_lpdead_pass to search through its list of table blocks that are + * known to be pointed to by TIDs in LP_DEAD-marked index tuples. + */ +static int +_bt_lpdead_blocks_cmp(const void *arg1, const void *arg2) +{ + BlockNumber b1 = *((BlockNumber *) arg1); + BlockNumber b2 = *((BlockNumber *) arg2); + + if (b1 < b2) + return -1; + else if (b1 > b2) + return 1; + + return 0; +} diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 848123d921..63d2694d89 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -38,8 +38,15 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf); static void _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid); -static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable); +static void _bt_delitems_delete(Relation rel, Buffer buf, + TransactionId latestRemovedXid, + OffsetNumber *deletable, int ndeletable, + BTVacuumPosting *updatable, int nupdatable, + Relation heapRel); +static char *_bt_delitems_updates(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, + Size *updatedbuflen, + bool needswal); static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack); static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, @@ -1110,15 +1117,15 @@ _bt_page_recyclable(Page page) * sorted in ascending order. * * Routine deals with deleting TIDs when some (but not all) of the heap TIDs - * in an existing posting list item are to be removed by VACUUM. This works - * by updating/overwriting an existing item with caller's new version of the - * item (a version that lacks the TIDs that are to be deleted). + * in an existing posting list item are to be removed. This works by + * updating/overwriting an existing item with caller's new version of the item + * (a version that lacks the TIDs that are to be deleted). 
* * We record VACUUMs and b-tree deletes differently in WAL. Deletes must - * generate their own latestRemovedXid by accessing the heap directly, whereas - * VACUUMs rely on the initial heap scan taking care of it indirectly. Also, - * only VACUUM can perform granular deletes of individual TIDs in posting list - * tuples. + * generate their own latestRemovedXid by accessing the table directly, + * whereas VACUUMs rely on the initial heap scan taking care of it indirectly. + * Also, we remove the VACUUM cycle ID from pages, which b-tree deletes don't + * do. */ void _bt_delitems_vacuum(Relation rel, Buffer buf, @@ -1127,7 +1134,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { Page page = BufferGetPage(buf); BTPageOpaque opaque; - Size itemsz; + bool needswal = RelationNeedsWAL(rel); char *updatedbuf = NULL; Size updatedbuflen = 0; OffsetNumber updatedoffsets[MaxIndexTuplesPerPage]; @@ -1135,45 +1142,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* Shouldn't be called unless there's something to do */ Assert(ndeletable > 0 || nupdatable > 0); - for (int i = 0; i < nupdatable; i++) - { - /* Replace work area IndexTuple with updated version */ - _bt_update_posting(updatable[i]); - - /* Maintain array of updatable page offsets for WAL record */ - updatedoffsets[i] = updatable[i]->updatedoffset; - } - - /* XLOG stuff -- allocate and fill buffer before critical section */ - if (nupdatable > 0 && RelationNeedsWAL(rel)) - { - Size offset = 0; - - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - - itemsz = SizeOfBtreeUpdate + - vacposting->ndeletedtids * sizeof(uint16); - updatedbuflen += itemsz; - } - - updatedbuf = palloc(updatedbuflen); - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - xl_btree_update update; - - update.ndeletedtids = vacposting->ndeletedtids; - memcpy(updatedbuf + offset, &update.ndeletedtids, - SizeOfBtreeUpdate); - offset += SizeOfBtreeUpdate; - - itemsz = 
update.ndeletedtids * sizeof(uint16); - memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); - offset += itemsz; - } - } + /* Generate new version of posting lists without deleted TIDs */ + if (nupdatable > 0) + updatedbuf = _bt_delitems_updates(updatable, nupdatable, + updatedoffsets, &updatedbuflen, + needswal); /* No ereport(ERROR) until changes are logged */ START_CRIT_SECTION(); @@ -1194,6 +1167,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { OffsetNumber updatedoffset = updatedoffsets[i]; IndexTuple itup; + Size itemsz; itup = updatable[i]->itup; itemsz = MAXALIGN(IndexTupleSize(itup)); @@ -1227,7 +1201,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (needswal) { XLogRecPtr recptr; xl_btree_vacuum xlrec_vacuum; @@ -1260,7 +1234,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* can't leak memory here */ if (updatedbuf != NULL) pfree(updatedbuf); - /* free tuples generated by calling _bt_update_posting() */ + /* free tuples allocated within _bt_delitems_updates() */ for (int i = 0; i < nupdatable; i++) pfree(updatable[i]->itup); } @@ -1269,36 +1243,70 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, * Delete item(s) from a btree leaf page during single-page cleanup. * * This routine assumes that the caller has pinned and write locked the - * buffer. Also, the given deletable array *must* be sorted in ascending - * order. + * buffer. Also, the given deletable and updatable arrays *must* be sorted in + * ascending order. + * + * Routine deals with deleting TIDs when some (but not all) of the heap TIDs + * in an existing posting list item are to be removed. This works by + * updating/overwriting an existing item with caller's new version of the item + * (a version that lacks the TIDs that are to be deleted). * * This is nearly the same as _bt_delitems_vacuum as far as what it does to - * the page, but it needs to generate its own latestRemovedXid by accessing - * the heap. 
This is used by the REDO routine to generate recovery conflicts.
- * Also, it doesn't handle posting list tuples unless the entire tuple can be
- * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
+ * the page, but it needs its own latestRemovedXid from caller (caller gets
+ * this from the tableam). This is used by the REDO routine to generate
+ * recovery conflicts. The other difference is that _bt_delitems_vacuum will
+ * clear the page's VACUUM cycle ID. We must never do that.
 */
-void
-_bt_delitems_delete(Relation rel, Buffer buf,
+static void
+_bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 OffsetNumber *deletable, int ndeletable,
+ BTVacuumPosting *updatable, int nupdatable,
 Relation heapRel)
{
 Page page = BufferGetPage(buf);
 BTPageOpaque opaque;
- TransactionId latestRemovedXid = InvalidTransactionId;
+ bool needswal = RelationNeedsWAL(rel);
+ char *updatedbuf = NULL;
+ Size updatedbuflen = 0;
+ OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];

 /* Shouldn't be called unless there's something to do */
- Assert(ndeletable > 0);
+ Assert(ndeletable > 0 || nupdatable > 0);

- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
- latestRemovedXid =
- _bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
+ /* Generate new versions of posting lists without deleted TIDs */
+ if (nupdatable > 0)
+ updatedbuf = _bt_delitems_updates(updatable, nupdatable,
+ updatedoffsets, &updatedbuflen,
+ needswal);

 /* No ereport(ERROR) until changes are logged */
 START_CRIT_SECTION();

- /* Fix the page */
- PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Handle posting tuple updates.
+ *
+ * Deliberately do this before handling simple deletes. If we did it the
+ * other way around (i.e. WAL record order -- simple deletes before
+ * updates) then we'd have to make compensating changes to the 'updatable'
+ * array of offset numbers.
+ */ + for (int i = 0; i < nupdatable; i++) + { + OffsetNumber updatedoffset = updatedoffsets[i]; + IndexTuple itup; + Size itemsz; + + itup = updatable[i]->itup; + itemsz = MAXALIGN(IndexTupleSize(itup)); + if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup, + itemsz)) + elog(PANIC, "failed to update partially dead item in block %u of index \"%s\"", + BufferGetBlockNumber(buf), RelationGetRelationName(rel)); + } + + /* Now handle simple deletes of entire tuples */ + if (ndeletable > 0) + PageIndexMultiDelete(page, deletable, ndeletable); /* * Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID, @@ -1318,25 +1326,29 @@ _bt_delitems_delete(Relation rel, Buffer buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (needswal) { XLogRecPtr recptr; xl_btree_delete xlrec_delete; xlrec_delete.latestRemovedXid = latestRemovedXid; xlrec_delete.ndeleted = ndeletable; + xlrec_delete.nupdated = nupdatable; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); XLogRegisterData((char *) &xlrec_delete, SizeOfBtreeDelete); - /* - * The deletable array is not in the buffer, but pretend that it is. - * When XLogInsert stores the whole buffer, the array need not be - * stored too. 
- */
- XLogRegisterBufData(0, (char *) deletable,
- ndeletable * sizeof(OffsetNumber));
+ if (ndeletable > 0)
+ XLogRegisterBufData(0, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ if (nupdatable > 0)
+ {
+ XLogRegisterBufData(0, (char *) updatedoffsets,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+ }

 recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE);

@@ -1344,83 +1356,299 @@
 }

 END_CRIT_SECTION();
+
+ /* can't leak memory here */
+ if (updatedbuf != NULL)
+ pfree(updatedbuf);
+ /* free tuples allocated within _bt_delitems_updates() */
+ for (int i = 0; i < nupdatable; i++)
+ pfree(updatable[i]->itup);
}

/*
- * Get the latestRemovedXid from the table entries pointed to by the non-pivot
- * tuples being deleted.
+ * Set up state needed to delete TIDs from posting list tuples via "updating"
+ * the tuple. Performs steps common to both _bt_delitems_vacuum and
+ * _bt_delitems_delete. These steps must take place before each function's
+ * critical section begins.
 *
- * This is a specialized version of index_compute_xid_horizon_for_tuples().
- * It's needed because btree tuples don't always store table TID using the
- * standard index tuple header field.
+ * updatable and nupdatable are inputs, though note that we will use
+ * _bt_update_posting() to replace the original itup with a pointer to a final
+ * version in palloc()'d memory. Caller should free the tuples when it's done.
+ *
+ * The first nupdatable entries from updatedoffsets are set to the page offset
+ * number for posting list tuples that caller updates. This is mostly useful
+ * because caller may need to WAL-log the page offsets (though we always do
+ * this for caller as a convenience).
+ *
+ * Returns a buffer consisting of an array of xl_btree_update structs that
+ * describe the steps we perform here for caller (though only when needswal is
+ * true).
Also sets *updatedbuflen to the final size of the buffer. This + * buffer is used by caller when WAL logging is required. */ -static TransactionId -_bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable) +static char * +_bt_delitems_updates(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, Size *updatedbuflen, + bool needswal) { - TransactionId latestRemovedXid = InvalidTransactionId; - int spacenhtids; - int nhtids; - ItemPointer htids; + char *updatedbuf = NULL; + Size buflen = 0; - /* Array will grow iff there are posting list tuples to consider */ - spacenhtids = ndeletable; - nhtids = 0; - htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids); - for (int i = 0; i < ndeletable; i++) + /* Shouldn't be called unless there's something to do */ + Assert(nupdatable > 0); + + for (int i = 0; i < nupdatable; i++) { - ItemId itemid; - IndexTuple itup; + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; - itemid = PageGetItemId(page, deletable[i]); - itup = (IndexTuple) PageGetItem(page, itemid); + /* Replace work area IndexTuple with updated version */ + _bt_update_posting(vacposting); - Assert(ItemIdIsDead(itemid)); - Assert(!BTreeTupleIsPivot(itup)); + /* Keep track of size of xl_btree_update for updatedbuf in passing */ + itemsz = SizeOfBtreeUpdate + vacposting->ndeletedtids * sizeof(uint16); + buflen += itemsz; - if (!BTreeTupleIsPosting(itup)) + /* Build updatedoffsets buffer in passing */ + updatedoffsets[i] = vacposting->updatedoffset; + } + + /* XLOG stuff */ + if (needswal) + { + Size offset = 0; + + /* Allocate, set final size for caller */ + updatedbuf = palloc(buflen); + *updatedbuflen = buflen; + for (int i = 0; i < nupdatable; i++) { - if (nhtids + 1 > spacenhtids) - { - spacenhtids *= 2; - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; + xl_btree_update update; - 
Assert(ItemPointerIsValid(&itup->t_tid)); - ItemPointerCopy(&itup->t_tid, &htids[nhtids]); - nhtids++; - } - else - { - int nposting = BTreeTupleGetNPosting(itup); + update.ndeletedtids = vacposting->ndeletedtids; + memcpy(updatedbuf + offset, &update.ndeletedtids, + SizeOfBtreeUpdate); + offset += SizeOfBtreeUpdate; - if (nhtids + nposting > spacenhtids) - { - spacenhtids = Max(spacenhtids * 2, nhtids + nposting); - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } - - for (int j = 0; j < nposting; j++) - { - ItemPointer htid = BTreeTupleGetPostingN(itup, j); - - Assert(ItemPointerIsValid(htid)); - ItemPointerCopy(htid, &htids[nhtids]); - nhtids++; - } + itemsz = update.ndeletedtids * sizeof(uint16); + memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); + offset += itemsz; } } - Assert(nhtids >= ndeletable); + return updatedbuf; +} - latestRemovedXid = - table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids); +/* + * Comparator used by _bt_delitems_delete_check() to restore deltids array + * back to its original leaf-page-wise sort order + */ +static int +_bt_delitems_cmp(const void *a, const void *b) +{ + TM_IndexDelete *indexdelete1 = (TM_IndexDelete *) a; + TM_IndexDelete *indexdelete2 = (TM_IndexDelete *) b; - pfree(htids); + if (indexdelete1->id > indexdelete2->id) + return 1; + if (indexdelete1->id < indexdelete2->id) + return -1; - return latestRemovedXid; + Assert(false); + + return 0; +} + +/* + * Try to delete item(s) from a btree leaf page during single-page cleanup. + * + * nbtree interface to table_index_delete_check(). Deletes a subset of index + * tuples that caller suspects to be dead-to-all: those that are actually + * dead-to-all, and therefore safe to delete. Used by bottom-up index + * deletion. + * + * Simple deletion of LP_DEAD-set index tuples caller goes through here too. + * It used to call _bt_delitems_delete() directly, but using this interface + * has distinct advantages. 
It often allows us to delete some extra index + * tuples that happen to be dead-to-all but happen not to have had their + * LP_DEAD bit set in passing (LP_DEAD caller includes these extra TIDs in + * delstate). The extra cost of this approach is acceptable because a + * latestRemovedXid value will be needed anyway. It will need to be acquired + * by visiting all relevant table blocks again, so including extra TIDs is + * cheap. (Actually, it's only strictly necessary to get a latestRemovedXid + * with logged indexes. LP_DEAD deletion still uses this approach in all + * cases, just to be consistent.) + * + * Note: We rely on the assumption that the delstate.deltids array is sorted + * on its id field, which is a proxy for the original leaf-page-wise order of + * index tuples. Caller must gather items in delstate in the natural way: + * by appending each TID that we consider in leaf-page-wise order. + */ +void +_bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel, + TM_IndexDeleteOp *delstate) +{ + Page page = BufferGetPage(buf); + TransactionId latestRemovedXid; + OffsetNumber postingidxoffnum; + int ndeletable, + nupdatable; + OffsetNumber deletable[MaxIndexTuplesPerPage]; + BTVacuumPosting updatable[MaxIndexTuplesPerPage]; + + /* + * Use tableam interface to determine which tuples to delete first. + * + * There is a good chance that accessing table block buffers won't result + * in any misses. Temporal locality is important here. + */ + latestRemovedXid = table_index_delete_check(heapRel, delstate); + + /* The tableam may have nothing (though only for bottom-up caller) */ + if (delstate->ndeltids == 0) + return; + + /* Don't need to WAL-log latestRemovedXid in all cases */ + if (!XLogStandbyInfoActive() || !RelationNeedsWAL(rel)) + latestRemovedXid = InvalidTransactionId; + + /* + * Construct a leaf-page-wise description of what _bt_delitems_delete() + * needs to do to physically delete index tuples from the page.
+ * + * Must sort deltids array (which is typically much smaller now) first. + * It must match the order expected by the loop: leaf-page-wise order. + */ + qsort(delstate->deltids, delstate->ndeltids, sizeof(TM_IndexDelete), + _bt_delitems_cmp); + postingidxoffnum = InvalidOffsetNumber; + ndeletable = 0; + nupdatable = 0; + for (int i = 0; i < delstate->ndeltids; i++) + { + TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id; + OffsetNumber idxoffnum = dstatus->idxoffnum; + ItemId itemid = PageGetItemId(page, idxoffnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + int tidi, + nitem; + BTVacuumPosting vacposting; + + if (idxoffnum == postingidxoffnum) + { + /* + * This deltid entry is a TID from a posting list tuple that has + * already been completely processed (since we process all of a + * posting list's TIDs together, once) + */ + Assert(BTreeTupleIsPosting(itup)); + continue; + } + + if (!BTreeTupleIsPosting(itup)) + { + /* Plain non-pivot tuple */ + Assert(ItemPointerEquals(&itup->t_tid, &delstate->deltids[i].tid)); + if (dstatus->deleteitup) + deletable[ndeletable++] = idxoffnum; + continue; + } + + /* + * Posting list tuple. Process all of its TIDs together, at once. + * + * tidi is a local iterator over the deltids array for this posting + * list's TIDs. We're going to peek at later entries in the deltids + * array here. Remember to skip over the itup-related entries that we + * peek at here when we get back to the top of the outermost deltids + * loop (we should do nothing more with them there). + * + * Innermost loop exploits the fact that both itup's TIDs and the + * entries from the array (whose TIDs came from itup) are in ascending + * TID order. We avoid unnecessary TID comparisons by starting each + * execution of the innermost loop at the point where the previous + * execution (for previous TID from itup) left off.
+ */ + postingidxoffnum = idxoffnum; /* Remember: process itup once only */ + tidi = i; /* Initialize for itup's first TID */ + vacposting = NULL; /* Describes what to do with itup */ + nitem = BTreeTupleGetNPosting(itup); + for (int j = 0; j < nitem; j++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, j); + int cmp = -1; + + for (; tidi < delstate->ndeltids; tidi++) + { + TM_IndexDelete *tcdeltid = &delstate->deltids[tidi]; + TM_IndexStatus *tdstatus = (delstate->status + tcdeltid->id); + + /* Stop when we get to first entry beyond itup's entries */ + Assert(tdstatus->idxoffnum >= idxoffnum); + if (tdstatus->idxoffnum != idxoffnum) + break; + + /* Skip any non-deletable entries for itup */ + if (!tdstatus->deleteitup) + continue; + + /* Have we found matching deletable entry for htid? */ + cmp = ItemPointerCompare(htid, &tcdeltid->tid); + + /* Keep going until equal or greater tid from array located */ + if (cmp <= 0) + break; + } + + /* Final check on htid: must match a deletable array entry */ + if (cmp != 0) + continue; + + if (vacposting == NULL) + { + /* + * First deletable TID for itup found. Start maintaining + * metadata describing which TIDs to delete from itup. 
+ */ + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + nitem * sizeof(uint16)); + vacposting->itup = itup; + vacposting->updatedoffset = idxoffnum; + vacposting->ndeletedtids = 0; + } + + /* htid will be deleted from itup */ + vacposting->deletetids[vacposting->ndeletedtids++] = j; + } + + if (vacposting == NULL) + { + /* No TIDs to delete from itup -- do nothing */ + } + else if (vacposting->ndeletedtids == nitem) + { + /* Straight delete of itup (to delete all TIDs) */ + deletable[ndeletable++] = idxoffnum; + /* Turns out we won't need granular information */ + pfree(vacposting); + } + else + { + /* Delete some but not all TIDs from itup */ + Assert(vacposting->ndeletedtids > 0 && + vacposting->ndeletedtids < nitem); + updatable[nupdatable++] = vacposting; + } + } + + /* Physically delete the dead-to-all TIDs we've located */ + _bt_delitems_delete(rel, buf, latestRemovedXid, deletable, ndeletable, + updatable, nupdatable, heapRel); + + /* be tidy */ + for (int i = 0; i < nupdatable; i++) + pfree(updatable[i]); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index d6c8ad5d27..0d7f5199e5 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -209,7 +209,7 @@ btinsert(Relation rel, Datum *values, bool *isnull, itup = index_form_tuple(RelationGetDescr(rel), values, isnull); itup->t_tid = *ht_ctid; - result = _bt_doinsert(rel, itup, checkUnique, heapRel); + result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel); pfree(itup); diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 8730de25ed..d5d90cf696 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -49,7 +49,6 @@ #include "access/parallel.h" #include "access/relscan.h" #include "access/table.h" -#include "access/tableam.h" #include "access/xact.h" #include "access/xlog.h" #include "access/xloginsert.h" diff --git 
a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 2f5f14e527..831cc28eac 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -2108,7 +2108,9 @@ btoptions(Datum reloptions, bool validate) {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL, offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}, {"deduplicate_items", RELOPT_TYPE_BOOL, - offsetof(BTOptions, deduplicate_items)} + offsetof(BTOptions, deduplicate_items)}, + {"delete_items", RELOPT_TYPE_BOOL, + offsetof(BTOptions, delete_items)} }; diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 5135b800af..3e7289fe49 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -556,6 +556,47 @@ btree_xlog_dedup(XLogReaderState *record) UnlockReleaseBuffer(buf); } +static void +btree_xlog_updates(Page page, OffsetNumber *updatedoffsets, + xl_btree_update *updates, int nupdated) +{ + BTVacuumPosting vacposting; + IndexTuple origtuple; + ItemId itemid; + Size itemsz; + + for (int i = 0; i < nupdated; i++) + { + itemid = PageGetItemId(page, updatedoffsets[i]); + origtuple = (IndexTuple) PageGetItem(page, itemid); + + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + updates->ndeletedtids * sizeof(uint16)); + vacposting->updatedoffset = updatedoffsets[i]; + vacposting->itup = origtuple; + vacposting->ndeletedtids = updates->ndeletedtids; + memcpy(vacposting->deletetids, + (char *) updates + SizeOfBtreeUpdate, + updates->ndeletedtids * sizeof(uint16)); + + _bt_update_posting(vacposting); + + /* Overwrite updated version of tuple */ + itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); + if (!PageIndexTupleOverwrite(page, updatedoffsets[i], + (Item) vacposting->itup, itemsz)) + elog(PANIC, "failed to update partially dead item"); + + pfree(vacposting->itup); + pfree(vacposting); + + /* advance to next xl_btree_update from array */ + updates = 
(xl_btree_update *) + ((char *) updates + SizeOfBtreeUpdate + + updates->ndeletedtids * sizeof(uint16)); + } +} + static void btree_xlog_vacuum(XLogReaderState *record) { @@ -589,41 +630,7 @@ btree_xlog_vacuum(XLogReaderState *record) xlrec->nupdated * sizeof(OffsetNumber)); - for (int i = 0; i < xlrec->nupdated; i++) - { - BTVacuumPosting vacposting; - IndexTuple origtuple; - ItemId itemid; - Size itemsz; - - itemid = PageGetItemId(page, updatedoffsets[i]); - origtuple = (IndexTuple) PageGetItem(page, itemid); - - vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + - updates->ndeletedtids * sizeof(uint16)); - vacposting->updatedoffset = updatedoffsets[i]; - vacposting->itup = origtuple; - vacposting->ndeletedtids = updates->ndeletedtids; - memcpy(vacposting->deletetids, - (char *) updates + SizeOfBtreeUpdate, - updates->ndeletedtids * sizeof(uint16)); - - _bt_update_posting(vacposting); - - /* Overwrite updated version of tuple */ - itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); - if (!PageIndexTupleOverwrite(page, updatedoffsets[i], - (Item) vacposting->itup, itemsz)) - elog(PANIC, "failed to update partially dead item"); - - pfree(vacposting->itup); - pfree(vacposting); - - /* advance to next xl_btree_update from array */ - updates = (xl_btree_update *) - ((char *) updates + SizeOfBtreeUpdate + - updates->ndeletedtids * sizeof(uint16)); - } + btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated); } if (xlrec->ndeleted > 0) @@ -675,7 +682,22 @@ btree_xlog_delete(XLogReaderState *record) page = (Page) BufferGetPage(buffer); - PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); + if (xlrec->nupdated > 0) + { + OffsetNumber *updatedoffsets; + xl_btree_update *updates; + + updatedoffsets = (OffsetNumber *) + (ptr + xlrec->ndeleted * sizeof(OffsetNumber)); + updates = (xl_btree_update *) ((char *) updatedoffsets + + xlrec->nupdated * + sizeof(OffsetNumber)); + + btree_xlog_updates(page, updatedoffsets, updates, 
xlrec->nupdated); + } + + if (xlrec->ndeleted > 0) + PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); /* Mark the page as not containing any LP_DEAD items */ opaque = (BTPageOpaque) PageGetSpecialPointer(page); diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 8afc780acc..497bf5c3ac 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -1765,14 +1765,14 @@ psql_completion(const char *text, int start, int end) /* ALTER INDEX SET|RESET ( */ else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "(")) COMPLETE_WITH("fillfactor", - "vacuum_cleanup_index_scale_factor", "deduplicate_items", /* BTREE */ + "vacuum_cleanup_index_scale_factor", "deduplicate_items", "delete_items", /* BTREE */ "fastupdate", "gin_pending_list_limit", /* GIN */ "buffering", /* GiST */ "pages_per_range", "autosummarize" /* BRIN */ ); else if (Matches("ALTER", "INDEX", MatchAny, "SET", "(")) COMPLETE_WITH("fillfactor =", - "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", /* BTREE */ + "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", "delete_items =", /* BTREE */ "fastupdate =", "gin_pending_list_limit =", /* GIN */ "buffering =", /* GiST */ "pages_per_range =", "autosummarize =" /* BRIN */ diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml index bb395e6a85..9e4abf40d2 100644 --- a/doc/src/sgml/btree.sgml +++ b/doc/src/sgml/btree.sgml @@ -629,6 +629,86 @@ options(relopts local_relopts *) returns + + Bottom-up index deletion + + B-Tree indexes are not directly aware that under MVCC, there might + be multiple extant versions of the same logical table row; to an + index, each tuple is an independent object that needs its own index + entry. Version churn tuples may sometimes + accumulate and adversely affect query latency and throughput. This + typically occurs with UPDATE-heavy workloads + where most individual updates cannot apply the + HOT optimization. 
Changing the value of only + one column covered by one index during an UPDATE + always necessitates a new set of index tuples + — one for each and every index on the + table. Note in particular that this includes indexes that were not + logically modified by the UPDATE. + All indexes will need a successor physical index tuple that points + to the latest version in the table. Each new tuple within each + index will generally need to coexist with the original + updated tuple for a short period of time (typically + until some time after the UPDATE transaction + commits). This process produces the majority of all garbage index + tuples in some scenarios. + + + Bottom-up index deletion targets this + particular variety of index tuple garbage. It effectively enforces + a soft limit on how many versions there can be in each index for + any given logical row. It is generally very effective provided + there are no long-lived snapshots that hold back cleanup. + Bottom-up index deletion complements the top-down + index cleanup performed by VACUUM. It targets + leaf pages that are disproportionately affected by the accumulation + of garbage index tuples, while leaving it up to + VACUUM to perform infrequent clean sweeps of all + indexes. A bottom-up deletion pass takes place when a leaf page + does not have enough free space to fit an incoming tuple, though + only when the incoming tuple originates from an + UPDATE that did not logically change any of the + columns covered by the index in question. + + + The deletion process must closely cooperate with the table access + method. Despite the lack of convenient access to + authoritative information about how index + tuples represent versions or are related to each other, it is + possible for the B-Tree implementation to target garbage index + tuples using relatively simple heuristics. These heuristics decide + on which table blocks to visit based on where dead tuples seem most + likely to be concentrated.
Some number of table blocks must be + accessed to get the required authoritative information, but it + isn't necessary to access very many table blocks each time. Also, + each table block access must actually enable the implementation + to delete at least one additional index tuple. The whole process + ends when any single table block access fails to yield any index + tuple deletions. + + + The delete_items storage parameter can be used + to disable bottom-up index deletion within individual indexes. + Disabling bottom-up index deletion isn't usually helpful. + + + + It's also possible for index tuple deletion to take place + following opportunistic setting of LP_DEAD + status bits. This avoids a relatively expensive bottom-up + deletion pass, which must access table blocks directly. + + + LP_DEAD status bits are set when passing index + scans happen to notice that an index tuple is dead to every + possible MVCC snapshot (not just their own). + LP_DEAD-set tuples are already known to be safe + to delete, so it isn't necessary to access the table blocks + directly. + + + + Deduplication @@ -702,25 +782,16 @@ options(relopts local_relopts *) returns deduplication isn't usually helpful. -
Note that even the tuples from a unique index are - not necessarily physically unique when stored - on disk due to version churn. The deduplication optimization is - selectively applied within unique indexes. It targets those pages - that appear to have version duplicates. The high level goal is to - give VACUUM more time to run before an - unnecessary page split caused by version churn can - take place. + It is sometimes possible for unique indexes (as well as unique + constraints) to use deduplication. This allows leaf pages to + temporarily absorb extra version churn duplicates. + Deduplication in unique indexes augments bottom-up index deletion, + especially in cases where a long-running transactions holds a + snapshot that blocks garbage collection. The goal is to buy time + for the bottom-up index deletion strategy to become effective + again. Delaying page splits until a single long-running + transaction naturally goes away can allow a bottom-up deletion pass + to succeed where an earlier deletion pass failed. diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index 29dee5689e..573a3e2894 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -435,6 +435,22 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] + + delete_items (boolean) + + delete_items storage parameter + + + + + Controls usage of the B-tree bottom-up index deletion technique + described in . Set to + ON or OFF to enable or + disable the optimization. The default is ON. + + + + vacuum_cleanup_index_scale_factor (floating point) -- 2.25.1