From 32d4f8e5a65d3f1410c7460cbb9348564715203b Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 29 Dec 2020 15:14:35 -0800 Subject: [PATCH v12 2/2] Add bottom-up index deletion. Teach heapam and nbtree to eagerly delete duplicate tuples that represent old versions. This leaf page level process is triggered in response to a flood of versions on the page. Heuristics detect the problem at the leaf page level (including the recently added "index is logically unchanged by an UPDATE" executor hint). The immediate goal of bottom-up index deletion in nbtree is to avoid "unnecessary" page splits caused entirely by duplicates needed only for MVCC/versioning purposes. It naturally has an even more useful effect, though: it acts as a backstop against accumulating an excessive number of index tuple versions for any given _logical row_. Bottom-up index deletion complements what we might now call "top-down index deletion": index vacuuming performed by VACUUM. Bottom-up index deletion responds to the immediate local needs of queries, while leaving it up to autovacuum to perform infrequent clean sweeps of the index. The previous tableam interface used by index AMs to perform tuple deletion (the table_compute_xid_horizon_for_tuples() function) has been replaced with a new interface that supports certain new requirements. Many (perhaps all) of the capabilities added to nbtree by this commit could also be extended to other index AMs. That is left as work for a later commit. Also extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic to consider extra index tuples (that are not LP_DEAD-marked) for deletion in passing. This increases the number of index tuples deleted significantly in many cases. The LP_DEAD deletion process (which is now called "simple deletion") won't need to visit any extra table blocks to check these extra tuples, since we have to visit the same table blocks to generate a latestRemovedXid value anyway (actually, it isn't necessary to generate a latestRemovedXid with an unlogged index, but deleting extra tuples at the last second is reason enough to visit the table blocks, so that's what we do in all cases). Testing has shown that the enhanced LP_DEAD deletion process almost never fails to delete at least a few extra not-LP_DEAD-marked index tuples when the regression tests are run. In practice the enhanced deletion process can pick up a significant number of "extra" index tuples. It's not uncommon for a deletion operation to delete many more speculative "extra" index tuples than LP_DEAD-marked tuples. Also extend nbtree's _bt_delitems_delete() function to support granular TID deletion in posting list tuples. Both simple index deletion and bottom-up index deletion now support deleting individual TIDs from posting list tuples. The lack of any "granular LP_DEAD bitmap" for each nbtree posting list is now likely to be even less of a problem. We can now delete plenty of TIDs from posting list tuples provided nearby LP_DEAD-bit-set tuples give the tableam the right _general_ idea about where to look for deletable items. The overall impact on simple deletion is often quite significant. Bump XLOG_PAGE_MAGIC because xl_btree_delete changed. No bump in BTREE_VERSION, since there are no changes to the on-disk representation of nbtree indexes. Indexes built on PostgreSQL 12 or PostgreSQL 13 will automatically benefit from the optimization (i.e. no reindexing required) following a pg_upgrade. 
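To illustrate the shape of the new tableam interface, a simple-deletion caller fills in a TM_IndexDeleteOp roughly as follows. This is only a sketch closely modeled on the genam.c changes below; the deadtids, nitems, and hrel variables are stand-ins, and buffer locking, error handling, and pfree() calls are omitted.

    TM_IndexDeleteOp delstate;
    TransactionId latestRemovedXid;

    delstate.bottomup = false;          /* simple deletion, not bottom-up */
    delstate.bottomupfreespace = 0;     /* only used by bottom-up callers */
    delstate.ndeltids = 0;
    delstate.deltids = palloc(nitems * sizeof(TM_IndexDelete));
    delstate.status = palloc(nitems * sizeof(TM_IndexStatus));

    for (int i = 0; i < nitems; i++)
    {
        /* deadtids[] holds table TIDs taken from LP_DEAD-marked index tuples */
        delstate.deltids[i].tid = deadtids[i];
        delstate.deltids[i].id = delstate.ndeltids;
        delstate.status[i].idxoffnum = InvalidOffsetNumber; /* unused here */
        delstate.status[i].knowndeletable = true;   /* already known dead */
        delstate.status[i].promising = false;       /* bottom-up only */
        delstate.status[i].freespace = 0;           /* bottom-up only */
        delstate.ndeltids++;
    }

    /* tableam checks the TIDs and returns the conflict horizon */
    latestRemovedXid = table_compute_delete_for_tuples(hrel, &delstate);

Splitting the state across the small TM_IndexDelete array and the wider TM_IndexStatus array keeps the array that the tableam sorts and scans as small as possible, which matters once hundreds of TIDs are involved.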
Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Reviewed-By: Victor Yegorov Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com --- src/include/access/heapam.h | 5 +- src/include/access/nbtree.h | 25 +- src/include/access/nbtxlog.h | 101 +++-- src/include/access/tableam.h | 138 +++++- src/backend/access/common/reloptions.c | 10 + src/backend/access/heap/heapam.c | 552 +++++++++++++++++++++-- src/backend/access/heap/heapam_handler.c | 2 +- src/backend/access/index/genam.c | 51 ++- src/backend/access/nbtree/README | 133 +++++- src/backend/access/nbtree/nbtdedup.c | 314 ++++++++++++- src/backend/access/nbtree/nbtinsert.c | 364 ++++++++++++--- src/backend/access/nbtree/nbtpage.c | 500 ++++++++++++++------ src/backend/access/nbtree/nbtree.c | 2 +- src/backend/access/nbtree/nbtsort.c | 1 - src/backend/access/nbtree/nbtutils.c | 4 +- src/backend/access/nbtree/nbtxlog.c | 94 ++-- src/backend/access/rmgrdesc/nbtdesc.c | 4 +- src/backend/access/table/tableamapi.c | 2 +- src/bin/psql/tab-complete.c | 4 +- doc/src/sgml/btree.sgml | 147 ++++-- doc/src/sgml/ref/create_index.sgml | 60 ++- 21 files changed, 2098 insertions(+), 415 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 54b2eb7378..a0d55c9165 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -166,9 +166,8 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid); extern void simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup); -extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *items, - int nitems); +extern TransactionId heap_compute_delete_for_tuples(Relation rel, + TM_IndexDeleteOp *delstate); /* in heap/pruneheap.c */ struct GlobalVisState; diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 3b60e696eb..e9c65563db 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -17,6 +17,7 @@ #include "access/amapi.h" #include "access/itup.h" #include "access/sdir.h" +#include "access/tableam.h" #include "access/xlogreader.h" #include "catalog/pg_am_d.h" #include "catalog/pg_index.h" @@ -168,7 +169,7 @@ typedef struct BTMetaPageData /* * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples * that may be stored on a btree leaf page. It is used to size the - * per-page temporary buffers used by index scans. + * per-page temporary buffers. * * Note: we don't bother considering per-tuple overheads here to keep * things simple (value is based on how many elements a single array of @@ -766,8 +767,9 @@ typedef struct BTDedupStateData typedef BTDedupStateData *BTDedupState; /* - * BTVacuumPostingData is state that represents how to VACUUM a posting list - * tuple when some (though not all) of its TIDs are to be deleted. + * BTVacuumPostingData is state that represents how to VACUUM (or delete) a + * posting list tuple when some (though not all) of its TIDs are to be + * deleted. * * Convention is that itup field is the original posting list tuple on input, * and palloc()'d final tuple used to overwrite existing tuple on output. @@ -963,6 +965,7 @@ typedef struct BTOptions /* fraction of newly inserted tuples prior to trigger index cleanup */ float8 vacuum_cleanup_index_scale_factor; bool deduplicate_items; /* Try to deduplicate items? */ + bool bottomup_delete_items; /* Bottom-up delete items? 
*/ } BTOptions; #define BTGetFillFactor(relation) \ @@ -978,6 +981,11 @@ typedef struct BTOptions relation->rd_rel->relam == BTREE_AM_OID), \ ((relation)->rd_options ? \ ((BTOptions *) (relation)->rd_options)->deduplicate_items : true)) +#define BTGetBottomupDeleteItems(relation) \ + (AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \ + relation->rd_rel->relam == BTREE_AM_OID), \ + ((relation)->rd_options ? \ + ((BTOptions *) (relation)->rd_options)->bottomup_delete_items : true)) /* * Constant definition for progress reporting. Phase numbers must match @@ -1031,6 +1039,8 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan); extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, Size newitemsz, bool checkingunique); +extern bool _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel, + Size newitemsz); extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base, OffsetNumber baseoff); extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup); @@ -1045,7 +1055,8 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting, * prototypes for functions in nbtinsert.c */ extern bool _bt_doinsert(Relation rel, IndexTuple itup, - IndexUniqueCheck checkUnique, Relation heapRel); + IndexUniqueCheck checkUnique, bool indexUnchanged, + Relation heapRel); extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack); extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child); @@ -1083,9 +1094,9 @@ extern bool _bt_page_recyclable(Page page); extern void _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable, int ndeletable, BTVacuumPosting *updatable, int nupdatable); -extern void _bt_delitems_delete(Relation rel, Buffer buf, - OffsetNumber *deletable, int ndeletable, - Relation heapRel); +extern void _bt_delitems_delete_check(Relation rel, Buffer buf, + Relation heapRel, + TM_IndexDeleteOp *delstate); extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact); diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h index 5c014bdc66..6a20402877 100644 --- a/src/include/access/nbtxlog.h +++ b/src/include/access/nbtxlog.h @@ -176,24 +176,6 @@ typedef struct xl_btree_dedup #define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16)) -/* - * This is what we need to know about delete of individual leaf index tuples. - * The WAL record can represent deletion of any number of index tuples on a - * single index page when *not* executed by VACUUM. Deletion of a subset of - * the TIDs within a posting list tuple is not supported. - * - * Backup Blk 0: index page - */ -typedef struct xl_btree_delete -{ - TransactionId latestRemovedXid; - uint32 ndeleted; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ -} xl_btree_delete; - -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32)) - /* * This is what we need to know about page reuse within btree. This record * only exists to generate a conflict point for Hot Standby. @@ -211,9 +193,61 @@ typedef struct xl_btree_reuse_page #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page)) /* - * This is what we need to know about which TIDs to remove from an individual - * posting list tuple during vacuuming. An array of these may appear at the - * end of xl_btree_vacuum records. + * xl_btree_vacuum and xl_btree_delete records describe deletion of index + * tuples on a leaf page. 
The former variant is used by VACUUM, while the + * latter variant is used by the ad-hoc deletions that sometimes take place + * when btinsert() is called. + * + * The records are very similar. The only difference is that xl_btree_delete + * has to include a latestRemovedXid field to generate recovery conflicts. + * (VACUUM operations can just rely on earlier conflicts generated during + * pruning of the table whose TIDs the to-be-deleted index tuples point to. + * There are also small differences between each REDO routine that we don't go + * into here.) + * + * xl_btree_vacuum and xl_btree_delete both represent deletion of any number + * of index tuples on a single leaf page using page offset numbers. Both also + * support "updates" of index tuples, which is how deletes of a subset of TIDs + * contained in an existing posting list tuple are implemented. + * + * Updated posting list tuples are represented using xl_btree_update metadata. + * The REDO routines each use the xl_btree_update entries (plus each + * corresponding original index tuple from the target leaf page) to generate + * the final updated tuple. + * + * Updates are only used when there will be some remaining TIDs left by the + * REDO routine. Otherwise the posting list tuple just gets deleted outright. + */ +typedef struct xl_btree_vacuum +{ + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_vacuum; + +#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) + +typedef struct xl_btree_delete +{ + TransactionId latestRemovedXid; + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_delete; + +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16)) + +/* + * The offsets that appear in xl_btree_update metadata are offsets into the + * original posting list from tuple, not page offset numbers. These are + * 0-based. The page offset number for the original posting list tuple comes + * from the main xl_btree_vacuum/xl_btree_delete record. */ typedef struct xl_btree_update { @@ -224,31 +258,6 @@ typedef struct xl_btree_update #define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16)) -/* - * This is what we need to know about a VACUUM of a leaf page. The WAL record - * can represent deletion of any number of index tuples on a single index page - * when executed by VACUUM. It can also support "updates" of index tuples, - * which is how deletes of a subset of TIDs contained in an existing posting - * list tuple are implemented. (Updates are only used when there will be some - * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can - * just be deleted). - * - * Updated posting list tuples are represented using xl_btree_update metadata. - * The REDO routine uses each xl_btree_update (plus its corresponding original - * index tuple from the target leaf page) to generate the final updated tuple. 
- */ -typedef struct xl_btree_vacuum -{ - uint16 ndeleted; - uint16 nupdated; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TUPLES METADATA ARRAY FOLLOWS */ -} xl_btree_vacuum; - -#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) - /* * This is what we need to know about marking an empty subtree for deletion. * The target identifies the tuple removed from the parent page (note that we diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 387eb34a61..ced4060194 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -128,6 +128,118 @@ typedef struct TM_FailureData bool traversed; } TM_FailureData; +/* + * State representing call to table_compute_delete_for_tuples(), which checks + * TID status with tableam for index deletion purposes. Index AM caller + * provides a TM_IndexDeleteOp, which points to two palloc()'d arrays. Each + * array has one entry per TID that the tableam is asked to consider. (The + * two arrays are conceptually one single variable-sized array, though. Two + * arrays/structs are used to keep the main TM_IndexDelete array small for + * performance reasons.) + * + * Most index AM callers perform simple index tuple deletion (by specifying + * bottomup = false), and include only known-dead deltids. These known-dead + * entries are generally marked knowndeletable = true directly (typically + * these are TIDs from LP_DEAD-marked index tuples), but that isn't a strict + * requirement. + * + * Callers that specify bottomup = true are "bottom-up index deletion" + * callers. The considerations for the tableam are more subtle with these + * callers because they ask the tableam to perform highly speculative work, + * and might only expect the tableam to check a small fraction of all entries. + * Caller is not allowed to specify knowndeletable = true for any entry + * because everything is highly speculative. Bottom-up caller provides + * context and hints to tableam -- see comments below for details on how index + * AMs and tableams should coordinate during bottom-up index deletion. + * + * Simple index deletion callers may ask the tableam to perform speculative + * work, too. This is a little like bottom-up deletion, but not too much. + * The tableam will only perform speculative work when it's practically free + * to do so in passing for simple deletion caller (while always performing + * whatever work is needed to enable knowndeletable/LP_DEAD index tuples to + * be deleted within index AM). This is the real reason why it's possible for + * simple index deletion caller to specify knowndeletable = false up front + * (this means "check if it's possible for me to delete corresponding index + * tuple when it's cheap to do so in passing"). The tableam isn't strictly + * obligated to check these "extra" TIDs. However, heap-style tableams should + * manage to check all extra tuples for simple deletion callers. (We provide + * this flexibility in case it's needed by future tableams that diverge + * significantly from the traditional Postgres heapam design.) + * + * Index AMs whose simple deletion operations may include some extra TIDs must + * be sure to only pick extra TIDs that will not increase the total number of + * distinct table blocks visited inside heap-style tableams. 
In other words, + * index AMs should only take extra TIDs from index tuples that happen to + * point to table blocks that are also pointed to by other LP_DEAD-marked + * index tuples for the same deletion operation (i.e. those index tuples with + * knowndeletable = true deltids entries). + * + * The final contents of the array are interesting to callers that ask tableam + * to perform any speculative work (i.e. when _any_ items have knowndeletable + * set to false up front). These callers will naturally need to consult final + * array contents to determine which index tuples are actually safe to delete. + * Even callers that don't ask tableam to do any speculative work will still + * generally need to call table_compute_delete_for_tuples() just to get a + * latestRemovedXid transaction ID value. (The exception is callers that + * don't need to generate recovery conflicts at all, where it should be okay + * to skip the call entirely). + * + * The index AM can keep track of which index tuple relates to which deltid by + * setting idxoffnum (and/or relying on each entry being uniquely identifiable + * using tid), which is important when the final contents of the array will + * need to be interpreted -- the array can shrink from initial size after + * tableam processing and/or have entries in a new order (tableam may sort + * deltids array for its own reasons). Bottom-up callers may find that final + * ndeltids is 0 on return from call to tableam, in which case no index tuple + * deletions are possible. Simple deletion callers can rely on any entries + * they know to be deletable appearing in the final array as deletable. + */ +typedef struct TM_IndexDelete +{ + ItemPointerData tid; /* table TID from index tuple */ + int16 id; /* Offset into TM_IndexStatus array */ +} TM_IndexDelete; + +typedef struct TM_IndexStatus +{ + OffsetNumber idxoffnum; /* Index am page offset number */ + bool knowndeletable; /* Currently known to be deletable? */ + + /* Bottom-up index deletion specific fields follow */ + bool promising; /* Promising (duplicate) index tuple? */ + int16 freespace; /* Space freed in index if deleted */ +} TM_IndexStatus; + +/* + * Bottom-up deletion and index AM/tableam coordination/cooperation is an + * important part of making the optimization work well. The index AM provides + * hints about where to look to the tableam by marking some entries as + * "promising". Index AM does this with duplicate index tuples that are + * strongly suspected to be old versions left behind by UPDATEs that did not + * logically modify indexed values. Index AM may find it helpful to only mark + * entries as promising when they're thought to have been affected by such an + * UPDATE in the recent past. + * + * Bottom-up index deletion casts a wide net at first, usually by including + * all TIDs on a target index page. It is up to the tableam to worry about + * the cost of checking transaction status information. The tableam is in + * control, but needs careful guidance from the index AM. Index AM requests + * that bottomupfreespace target be met, while tableam measures progress + * towards that goal by tallying the per-entry freespace value for known + * deletable entries. (All !bottomup callers can just set these space related + * fields to zero.) + */ +typedef struct TM_IndexDeleteOp +{ + bool bottomup; /* Bottom-up (not simple) deletion? 
*/ + int bottomupfreespace; /* Bottom-up space target */ + + /* Mutable per-TID information follows (index AM initializes entries) */ + int ndeltids; /* Current # of deltids/status elements */ + TM_IndexDelete *deltids; + TM_IndexStatus *status; +} TM_IndexDeleteOp; + /* "options" flag bits for table_tuple_insert */ /* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */ #define TABLE_INSERT_SKIP_FSM 0x0002 @@ -342,10 +454,9 @@ typedef struct TableAmRoutine TupleTableSlot *slot, Snapshot snapshot); - /* see table_compute_xid_horizon_for_tuples() */ - TransactionId (*compute_xid_horizon_for_tuples) (Relation rel, - ItemPointerData *items, - int nitems); + /* see table_compute_delete_for_tuples() */ + TransactionId (*compute_delete_for_tuples) (Relation rel, + TM_IndexDeleteOp *delstate); /* ------------------------------------------------------------------------ @@ -1122,16 +1233,21 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, } /* - * Compute the newest xid among the tuples pointed to by items. This is used - * to compute what snapshots to conflict with when replaying WAL records for - * page-level index vacuums. + * Compute which index tuples are safe to delete, and the newest xid among the + * tuples that caller finds it is able to delete. + * + * Sets deletable tuples in entries from caller's TM_IndexDeleteOp state that + * are found to point to dead-to-all tuples in the table. See the + * TM_IndexDeleteOp struct for full details. + * + * Returns a latestRemovedXid transaction ID that index AM must use to + * generate a recovery conflict when required. This is the newest xid among + * the tuples pointed to by deltids TIDs that caller can delete. */ static inline TransactionId -table_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *items, - int nitems) +table_compute_delete_for_tuples(Relation rel, TM_IndexDeleteOp *delstate) { - return rel->rd_tableam->compute_xid_horizon_for_tuples(rel, items, nitems); + return rel->rd_tableam->compute_delete_for_tuples(rel, delstate); } diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 8ccc228a8c..b717669bed 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -168,6 +168,16 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "bottomup_delete_items", + "Enables \"bottom-up index deletion\" feature for this btree index", + RELOPT_KIND_BTREE, + ShareUpdateExclusiveLock /* since it applies only to later + * inserts */ + }, + true + }, /* list terminator */ {{NULL}} }; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 26c2006f23..6f612582c8 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -55,6 +55,7 @@ #include "miscadmin.h" #include "pgstat.h" #include "port/atomics.h" +#include "port/pg_bitutils.h" #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h" @@ -102,6 +103,8 @@ static void MultiXactIdWait(MultiXactId multi, MultiXactStatus status, uint16 in int *remaining); static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status, uint16 infomask, Relation rel, int *remaining); +static void heap_delete_sort(TM_IndexDeleteOp *delstate); +static int heap_delete_bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate); static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup); static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_changed, bool 
*copy); @@ -166,18 +169,33 @@ static const struct #ifdef USE_PREFETCH /* - * heap_compute_xid_horizon_for_tuples and xid_horizon_prefetch_buffer use - * this structure to coordinate prefetching activity. + * heap_compute_delete_for_tuples and compute_delete_prefetch_buffer use this + * structure to coordinate prefetching activity */ typedef struct { BlockNumber cur_hblkno; int next_item; - int nitems; - ItemPointerData *tids; -} XidHorizonPrefetchState; + int ndeltids; + TM_IndexDelete *deltids; +} DeletePrefetchState; #endif +/* heap_compute_delete_for_tuples bottom-up index deletion constants */ +#define BOTTOMUP_FAVORABLE_STRIDE 3 +#define BOTTOMUP_MAX_NBLOCKS 6 + +/* + * heap_compute_delete_for_tuples uses this structure to determine which heap + * pages to visit, and in what order for bottom-up index deletion check + */ +typedef struct IndexDeleteCounts +{ + int16 npromisingtids; /* Number of "promising" TIDs in group */ + int16 ntids; /* Number of TIDs in group */ + int16 ifirsttid; /* Offset to group's first deltid */ +} IndexDeleteCounts; + /* * This table maps tuple lock strength values for each particular * MultiXactStatus value. @@ -6936,28 +6954,32 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, #ifdef USE_PREFETCH /* - * Helper function for heap_compute_xid_horizon_for_tuples. Issue prefetch + * Helper function for heap_compute_delete_for_tuples. Issue prefetch * requests for the number of buffers indicated by prefetch_count. The * prefetch_state keeps track of all the buffers that we can prefetch and * which ones have already been prefetched; each call to this function picks * up where the previous call left off. + * + * Note: we expect the deltids array to be sorted in an order that groups TIDs + * by heap block, with all TIDs for each block appearing together in exactly + * one group. */ static void -xid_horizon_prefetch_buffer(Relation rel, - XidHorizonPrefetchState *prefetch_state, - int prefetch_count) +compute_delete_prefetch_buffer(Relation rel, + DeletePrefetchState *prefetch_state, + int prefetch_count) { BlockNumber cur_hblkno = prefetch_state->cur_hblkno; int count = 0; int i; - int nitems = prefetch_state->nitems; - ItemPointerData *tids = prefetch_state->tids; + int ndeltids = prefetch_state->ndeltids; + TM_IndexDelete *deltids = prefetch_state->deltids; for (i = prefetch_state->next_item; - i < nitems && count < prefetch_count; + i < ndeltids && count < prefetch_count; i++) { - ItemPointer htid = &tids[i]; + ItemPointer htid = &deltids[i].tid; if (cur_hblkno == InvalidBlockNumber || ItemPointerGetBlockNumber(htid) != cur_hblkno) @@ -6978,24 +7000,29 @@ xid_horizon_prefetch_buffer(Relation rel, #endif /* - * Get the latestRemovedXid from the heap pages pointed at by the index - * tuples being deleted. + * heapam implementation of tableam's compute_delete_for_tuples interface. * - * We used to do this during recovery rather than on the primary, but that - * approach now appears inferior. It meant that the primary could generate - * a lot of work for the standby without any back-pressure to slow down the - * primary, and it required the standby to have reached consistency, whereas - * we want to have correct information available even before that point. + * This is a helper function that enables incremental index tuple deletion + * within index AMs. See tableam.h comments for a thorough explanation of the + * API. + * + * Some details here are quite subtle, mostly for the benefit of bottom-up + * index deletion callers. 
We have to closely cooperate with the index AM + caller to keep the costs and the benefits of bottom-up deletion in balance. * * It's possible for this to generate a fair amount of I/O, since we may be * deleting hundreds of tuples from a single index block. To amortize that * cost to some degree, this uses prefetching and combines repeat accesses to - * the same block. + * the same block. The number of heap blocks visited over time is also + * minimized by applying various strategies. For example, simple index + * deletion callers may include "extra" tuples that they do not know to + * be deletable. + * + * Returns the latestRemovedXid from the heap pages pointed at by the index + * tuples that caller will go on to delete. */ TransactionId -heap_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *tids, - int nitems) +heap_compute_delete_for_tuples(Relation rel, TM_IndexDeleteOp *delstate) { /* Initial assumption is that earlier pruning took care of conflict */ TransactionId latestRemovedXid = InvalidTransactionId; @@ -7005,25 +7032,44 @@ heap_compute_xid_horizon_for_tuples(Relation rel, OffsetNumber maxoff = InvalidOffsetNumber; TransactionId priorXmax; #ifdef USE_PREFETCH - XidHorizonPrefetchState prefetch_state; + DeletePrefetchState prefetch_state; int prefetch_distance; #endif + SnapshotData SnapshotNonVacuumable; + int finalndeltids = 0, + nblocksaccessed = 0; + + /* State that's only used in bottom-up index deletion case */ + int nblocksfavorable = 0; + int curtargetfreespace = delstate->bottomupfreespace, + lastfreespace = 0, + actualfreespace = 0; + bool bottomup_final_block = false; + + InitNonVacuumableSnapshot(SnapshotNonVacuumable, GlobalVisTestFor(rel)); + + /* Sort caller's deltids array by TID for further processing */ + heap_delete_sort(delstate); /* - * Sort to avoid repeated lookups for the same page, and to make it more - * likely to access items in an efficient order. In particular, this - * ensures that if there are multiple pointers to the same page, they all - * get processed looking up and locking the page just once. + * Bottom-up case: Resort deltids array in an order attuned to where the + * greatest number of promising TIDs are to be found, and determine how + * many blocks from the start of sorted array should be considered + * favorable. + * + * Note: This will usually shrink deltids array, capping the number of + * blocks accessed to BOTTOMUP_MAX_NBLOCKS. This helps to avoid + * unnecessary bottom-up case prefetching. */ - qsort((void *) tids, nitems, sizeof(ItemPointerData), - (int (*) (const void *, const void *)) ItemPointerCompare); + if (delstate->bottomup) + nblocksfavorable = heap_delete_bottomup_sort_and_shrink(delstate); #ifdef USE_PREFETCH /* Initialize prefetch state. */ prefetch_state.cur_hblkno = InvalidBlockNumber; prefetch_state.next_item = 0; - prefetch_state.nitems = nitems; - prefetch_state.tids = tids; + prefetch_state.ndeltids = delstate->ndeltids; + prefetch_state.deltids = delstate->deltids; /* * Compute the prefetch distance that we will attempt to maintain. @@ -7038,23 +7084,99 @@ heap_compute_xid_horizon_for_tuples(Relation rel, prefetch_distance = get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace); + /* Cap initial prefetch distance for bottom-up deletion caller */ + if (delstate->bottomup) + { + Assert(nblocksfavorable >= 1); + prefetch_distance = Min(prefetch_distance, nblocksfavorable); + } + /* Start prefetching. 
*/ - xid_horizon_prefetch_buffer(rel, &prefetch_state, prefetch_distance); + compute_delete_prefetch_buffer(rel, &prefetch_state, prefetch_distance); #endif - /* Iterate over all tids, and check their horizon */ - for (int i = 0; i < nitems; i++) + /* Iterate over deltids, determine which to delete, check their horizon */ + Assert(delstate->ndeltids > 0); + for (int i = 0; i < delstate->ndeltids; i++) { - ItemPointer htid = &tids[i]; + TM_IndexDelete *ideltid = &delstate->deltids[i]; + TM_IndexStatus *istatus = delstate->status + ideltid->id; + ItemPointer htid = &ideltid->tid; OffsetNumber offnum; /* - * Read heap buffer, but avoid refetching if it's the same block as - * required for the last tid. + * Read buffer, and perform required extra steps each time a new block + * is encountered. Avoid refetching if it's the same block as the one + * from the last htid. */ if (blkno == InvalidBlockNumber || ItemPointerGetBlockNumber(htid) != blkno) { + /* + * Consider giving up early for bottom-up index deletion caller + * first. (Only prefetch next-next block afterwards, when it + * becomes clear that we're at least going to access the next + * block in line.) + * + * Sometimes the first block frees so much space for bottom-up + * caller that the deletion process can end without accessing any + * more blocks. It is usually necessary to access 2 or 3 blocks + * per bottom-up deletion operation, though. + */ + if (delstate->bottomup) + { + /* + * We often allow caller to delete a few additional items + * whose entries we reached after the point that space target + * from caller was satisfied. The cost of accessing the page + * was already paid at that point, so it made sense to finish + * it off. When that happened, we finalize everything here + * (by finishing off the whole bottom-up deletion operation + * without needlessly paying the cost of accessing any more + * blocks). + */ + if (bottomup_final_block) + break; + + /* + * Give up when we didn't enable our caller to free any + * additional space as a result of processing the page that we + * just finished up with. This rule is the main way in which + * we keep the cost of bottom-up deletion under control. + */ + if (nblocksaccessed >= 1 && actualfreespace == lastfreespace) + break; + lastfreespace = actualfreespace; /* for next time */ + + /* + * Deletion operation (which is bottom-up) will definitely + * access the next block in line. Prepare for that now. + * + * Decay target free space so that we don't hang on for too + * long with a marginal case. (Space target is only truly + * helpful when it allows us to recognize that we don't need + * to access more than 1 or 2 blocks to satisfy caller due to + * agreeable workload characteristics.) + * + * We are a bit more patient when we encounter contiguous + * blocks, though: the decay process is only applied when the + * next block in line is not a favorable/contiguous block. + * This is not an exception to the general rule; we still + * insist on finding at least one deletable item per block + * accessed. + * + * Note: The first block in line is always treated as a + * favorable block, so the earliest possible point that the + * decay can be applied is just before we access the second + * block in line. 
+ */ + Assert(nblocksaccessed > 0 || nblocksfavorable > 0); + if (nblocksfavorable > 0) + nblocksfavorable--; + else + curtargetfreespace /= 2; + } + /* release old buffer */ if (BufferIsValid(buf)) { @@ -7065,6 +7187,9 @@ heap_compute_xid_horizon_for_tuples(Relation rel, blkno = ItemPointerGetBlockNumber(htid); buf = ReadBuffer(rel, blkno); + nblocksaccessed++; + Assert(!delstate->bottomup || + nblocksaccessed <= BOTTOMUP_MAX_NBLOCKS); #ifdef USE_PREFETCH @@ -7072,7 +7197,7 @@ heap_compute_xid_horizon_for_tuples(Relation rel, * To maintain the prefetch distance, prefetch one more page for * each page we read. */ - xid_horizon_prefetch_buffer(rel, &prefetch_state, 1); + compute_delete_prefetch_buffer(rel, &prefetch_state, 1); #endif LockBuffer(buf, BUFFER_LOCK_SHARE); @@ -7081,6 +7206,31 @@ heap_compute_xid_horizon_for_tuples(Relation rel, maxoff = PageGetMaxOffsetNumber(page); } + if (!istatus->knowndeletable) + { + ItemPointerData tmp = *htid; + HeapTupleData heapTuple; + + /* Are any tuples from this HOT chain non-vacuumable? */ + if (heap_hot_search_buffer(&tmp, rel, buf, &SnapshotNonVacuumable, + &heapTuple, NULL, true)) + continue; + + /* Caller will delete, since whole HOT chain is vacuumable */ + istatus->knowndeletable = true; + + /* Maintain index free space info for bottom-up deletion case */ + if (delstate->bottomup) + { + Assert(istatus->freespace > 0); + actualfreespace += istatus->freespace; + if (actualfreespace >= curtargetfreespace) + bottomup_final_block = true; + } + } + else + Assert(!delstate->bottomup && !istatus->promising); + /* * Maintain latestRemovedXid value for deletion operation as a whole * by advancing current value using heap tuple headers. This is @@ -7148,8 +7298,20 @@ heap_compute_xid_horizon_for_tuples(Relation rel, offnum = ItemPointerGetOffsetNumber(&htup->t_ctid); priorXmax = HeapTupleHeaderGetUpdateXid(htup); } + + /* Enable further/final shrinking of deltids for caller */ + finalndeltids = i + 1; } + /* + * Shrink deltids array to exclude non-deletable entries at the end. This + * is not just a minor optimization. Final deltids array size might be + * zero for a bottom-up caller. Index AM is explicitly allowed to rely on + * ndeltids being zero in all cases with zero total deletable entries. + */ + Assert(finalndeltids > 0 || delstate->bottomup); + delstate->ndeltids = finalndeltids; + if (BufferIsValid(buf)) { LockBuffer(buf, BUFFER_LOCK_UNLOCK); @@ -7159,6 +7321,316 @@ heap_compute_xid_horizon_for_tuples(Relation rel, return latestRemovedXid; } +/* + * Specialized inlineable comparison function for heap_delete_sort() + */ +static inline int +heap_delete_sort_cmp(TM_IndexDelete *deltid1, TM_IndexDelete *deltid2) +{ + ItemPointer tid1 = &deltid1->tid; + ItemPointer tid2 = &deltid2->tid; + + { + BlockNumber blk1 = ItemPointerGetBlockNumber(tid1); + BlockNumber blk2 = ItemPointerGetBlockNumber(tid2); + + if (blk1 != blk2) + return (blk1 < blk2) ? -1 : 1; + } + { + OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1); + OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2); + + if (pos1 != pos2) + return (pos1 < pos2) ? -1 : 1; + } + + pg_unreachable(); + + return 0; +} + +/* + * Sort deltids array from delstate by TID. This prepares it for further + * processing. + * + * This operation becomes a noticeable consumer of CPU cycles with some + * workloads. 
This is especially likely with bottom-up index deletion heavy + * workloads, especially when B-Tree deduplication is also used and we might + * well have over a thousand TIDs/deltids (even with default BLCKSZ). This + * justifies a specialized sort routine. + * + * We use shellsort because it's easy to specialize, compiles to relatively + * few instructions, and is adaptive to presorted inputs/subsets (which are + * typical here). The TM_IndexDelete struct is only 8 bytes, so swap + * operations are expected to be cheap here. + */ +static void +heap_delete_sort(TM_IndexDeleteOp *delstate) +{ + TM_IndexDelete *deltids = delstate->deltids; + int ndeltids = delstate->ndeltids; + int low = 0; + + /* + * Shellsort gap sequence (taken from Sedgewick-Incerpi paper). + * + * This implementation is fast with array sizes up to ~4500. This covers + * all supported BLCKSZ values. + */ + const int gaps[9] = {1968, 861, 336, 112, 48, 21, 7, 3, 1}; + + /* Think carefully before changing anything here */ + StaticAssertStmt(sizeof(TM_IndexDelete) <= 8, + "element size exceeds 8 bytes"); + + for (int g = 0; g < lengthof(gaps); g++) + { + for (int hi = gaps[g], i = low + hi; i < ndeltids; i++) + { + TM_IndexDelete d = deltids[i]; + int j = i; + + while (j >= hi && heap_delete_sort_cmp(&deltids[j - hi], &d) >= 0) + { + deltids[j] = deltids[j - hi]; + j -= hi; + } + deltids[j] = d; + } + } +} + +/* + * Determine how many blocks should count as favorable during a bottom-up + * index deletion pass. + * + * Favorable blocks are contiguous heap blocks, which are likely to have + * relatively many dead items. These blocks are cheaper to access together + * all at once. Having many favorable blocks is common with low cardinality + * index tuples, where heap locality will have a relatively large influence on + * which heap blocks we visit (and the order they're processed in). + * + * Caller is expected to have sorted deltids in final bottom-up deletion order + * (block group order). + * + * Returns number of favorable blocks, starting from (and including) the first + * block in line for processing. See heap_compute_delete_for_tuples() for + * details on how the value is applied. + */ +static int +get_nblocksfavorable(IndexDeleteCounts *blockgroups, int nblockgroups, + TM_IndexDelete *deltids) +{ + int nblocksfavorable = 0; + BlockNumber lastblock = InvalidBlockNumber; + + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *group = blockgroups + b; + TM_IndexDelete *firstdtid = deltids + group->ifirsttid; + BlockNumber block = ItemPointerGetBlockNumber(&firstdtid->tid); + + /* Note: it's okay if lastblock expression overflows */ + if (BlockNumberIsValid(lastblock) && + (block < lastblock || + block > lastblock + BOTTOMUP_FAVORABLE_STRIDE)) + break; + + nblocksfavorable++; + lastblock = block; + } + + /* + * We always indicate that there is at least 1 favorable block (the first + * in line to process). The first block must always be in sorted order + * because the ordering is relative to the first (or previous) block. + * (heap_compute_delete_for_tuples() is okay with this degenerate case + * because it is supposed to always visit the first heap page in line.) 
+ */ + Assert(nblocksfavorable >= 1); + + return nblocksfavorable; +} + +/* + * qsort comparison function for heap_delete_bottomup_sort_and_shrink() + */ +static int +heap_delete_bottomup_sort_and_shrink_cmp(const void *arg1, const void *arg2) +{ + const IndexDeleteCounts *group1 = (const IndexDeleteCounts *) arg1; + const IndexDeleteCounts *group2 = (const IndexDeleteCounts *) arg2; + + /* + * Most significant field is npromisingtids (which we invert the order of + * so as to sort in desc order). + * + * Caller should have already normalized npromisingtids fields into + * power-of-two values (buckets). + */ + if (group1->npromisingtids > group2->npromisingtids) + return -1; + if (group1->npromisingtids < group2->npromisingtids) + return 1; + + /* + * Tiebreak: desc ntids sort order. + * + * We cannot expect power-of-two values for ntids fields. We should + * behave as if they were already rounded up for us instead. + */ + if (group1->ntids != group2->ntids) + { + uint32 ntids1 = pg_nextpower2_32((uint32) group1->ntids); + uint32 ntids2 = pg_nextpower2_32((uint32) group2->ntids); + + if (ntids1 > ntids2) + return -1; + if (ntids1 < ntids2) + return 1; + } + + /* + * Tiebreak: asc offset-into-deltids-for-block (offset to first TID for + * block in deltids array) order. + * + * This is equivalent to sorting in ascending heap block number order + * (among otherwise equal subsets of the array). This approach allows us + * to avoid accessing the out-of-line TID. (We rely on the assumption + * that the deltids array was sorted in ascending heap TID order when + * these offsets to the first TID from each heap block group were formed.) + */ + if (group1->ifirsttid > group2->ifirsttid) + return 1; + if (group1->ifirsttid < group2->ifirsttid) + return -1; + + pg_unreachable(); + + return 0; +} + +/* + * heap_compute_delete_for_tuples() helper function for bottom-up deletion + * callers. + * + * Sorts deltids array in the order needed for useful processing by bottom-up + * deletion. The array should already be sorted in TID order when we're + * called. The sort process groups heap TIDs from deltids into heap block + * number groupings. Earlier/more-promising groups/blocks are those that are + * known to have the most "promising" TIDs. + * + * Sets new size of deltids array (ndeltids) in state. deltids will only have + * TIDs from the BOTTOMUP_MAX_NBLOCKS most promising heap blocks when we + * return. This is usually far fewer. + * + * Returns number of "favorable" blocks. 
+ */ +static int +heap_delete_bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate) +{ + IndexDeleteCounts *blockgroups; + TM_IndexDelete *reordereddeltids; + BlockNumber curblock = InvalidBlockNumber; + int nblockgroups = 0; + int ncopied = 0; + int nblocksfavorable = 0; + + Assert(delstate->bottomup); + Assert(delstate->ndeltids > 0); + + /* Calculate per-heap-block count of TIDs */ + blockgroups = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids); + for (int i = 0; i < delstate->ndeltids; i++) + { + TM_IndexDelete *ideltid = &delstate->deltids[i]; + TM_IndexStatus *istatus = delstate->status + ideltid->id; + ItemPointer htid = &ideltid->tid; + bool promising = istatus->promising; + + if (curblock != ItemPointerGetBlockNumber(htid)) + { + /* New block group */ + nblockgroups++; + + Assert(curblock < ItemPointerGetBlockNumber(htid) || + !BlockNumberIsValid(curblock)); + + curblock = ItemPointerGetBlockNumber(htid); + blockgroups[nblockgroups - 1].ifirsttid = i; + blockgroups[nblockgroups - 1].ntids = 1; + blockgroups[nblockgroups - 1].npromisingtids = 0; + } + else + { + blockgroups[nblockgroups - 1].ntids++; + } + + if (promising) + blockgroups[nblockgroups - 1].npromisingtids++; + } + + /* + * We're about ready to sort block groups to determine the optimal order + * for visiting heap pages. But before we do, round the number of + * promising tuples for each block group up to the nearest power-of-two + * (unless there are zero promising tuples). + * + * This scheme usefully divides heap pages into buckets. Each bucket + * contains heap pages that are approximately equally promising, that we + * want to treat as exactly equivalent (at least initially). We should + * not let the most promising heap pages win or lose (get accessed or not + * accessed by bottom-up deletion) on the basis of _relatively_ small + * differences in the total number of promising tuples. + * + * Note that we effectively have the same power-of-two bucketing scheme + * with the ntids field (which is compared after npromisingtids). The + * only reason that we don't fix ntids here is that the original values + * will be needed when copying the final TIDs from winning block groups + * back into caller's deltids array. + */ + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *group = blockgroups + b; + + if (group->npromisingtids != 0) + group->npromisingtids = + pg_nextpower2_32((uint32) group->npromisingtids); + } + + /* Sort groups and rearrange caller's deltids array */ + qsort(blockgroups, nblockgroups, sizeof(IndexDeleteCounts), + heap_delete_bottomup_sort_and_shrink_cmp); + reordereddeltids = palloc(delstate->ndeltids * sizeof(TM_IndexDelete)); + + nblockgroups = Min(BOTTOMUP_MAX_NBLOCKS, nblockgroups); + /* Determine number of favorable blocks at the start of array */ + nblocksfavorable = get_nblocksfavorable(blockgroups, nblockgroups, + delstate->deltids); + + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *group = blockgroups + b; + TM_IndexDelete *firstdtid = delstate->deltids + group->ifirsttid; + + memcpy(reordereddeltids + ncopied, firstdtid, + sizeof(TM_IndexDelete) * group->ntids); + ncopied += group->ntids; + } + + /* Copy final grouped and sorted TIDs back into start of caller's array */ + memcpy(delstate->deltids, reordereddeltids, + sizeof(TM_IndexDelete) * ncopied); + delstate->ndeltids = ncopied; + + /* be tidy */ + pfree(reordereddeltids); + pfree(blockgroups); + + return nblocksfavorable; +} + /* * Perform XLogInsert to register a heap cleanup info message. 
These * messages are sent once per VACUUM and are required because diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index c6f438de72..37c037b820 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -2563,7 +2563,7 @@ static const TableAmRoutine heapam_methods = { .tuple_get_latest_tid = heap_get_latest_tid, .tuple_tid_valid = heapam_tuple_tid_valid, .tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot, - .compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples, + .compute_delete_for_tuples = heap_compute_delete_for_tuples, .relation_set_new_filenode = heapam_relation_set_new_filenode, .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate, diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index e3164e674a..ad5b589e58 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -276,11 +276,25 @@ BuildIndexValueDescription(Relation indexRelation, /* * Get the latestRemovedXid from the table entries pointed at by the index - * tuples being deleted. + * tuples being deleted using an AM-generic approach. * - * Note: index access methods that don't consistently use the standard - * IndexTuple + heap TID item pointer representation will need to provide - * their own version of this function. + * This is a table_compute_delete_for_tuples() shim used by index AMs that + * have simple requirements. These callers only need to consult the tableam + * to get a latestRemovedXid value. They do not try to get the tableam to + * check extra/uncertain TIDs opportunistically, and so do not need any custom + * logic to check which specific index tuples are deletable in the end (they + * must all be deletable). + * + * We assume that caller index AM uses the standard IndexTuple representation, + * with table TIDs stored in the t_tid field. We also expect (and assert) + * that the line pointers on page for 'itemnos' offsets are already marked + * LP_DEAD. + * + * Sophisticated users of the table_compute_delete_for_tuples() interface will + * find it worthwhile to go through the same steps for all indexes (even + * unlogged indexes), just to get the extra benefits. It's okay for our + * callers to skip the call here entirely in the case of indexes that don't + * need a latestRemovedXid value, though. 
*/ TransactionId index_compute_xid_horizon_for_tuples(Relation irel, @@ -289,12 +303,17 @@ index_compute_xid_horizon_for_tuples(Relation irel, OffsetNumber *itemnos, int nitems) { - ItemPointerData *ttids = - (ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems); + TM_IndexDeleteOp delstate; TransactionId latestRemovedXid = InvalidTransactionId; Page ipage = BufferGetPage(ibuf); IndexTuple itup; + delstate.bottomup = false; + delstate.bottomupfreespace = 0; + delstate.ndeltids = 0; + delstate.deltids = palloc(nitems * sizeof(TM_IndexDelete)); + delstate.status = palloc(nitems * sizeof(TM_IndexStatus)); + /* identify what the index tuples about to be deleted point to */ for (int i = 0; i < nitems; i++) { @@ -303,14 +322,26 @@ index_compute_xid_horizon_for_tuples(Relation irel, iitemid = PageGetItemId(ipage, itemnos[i]); itup = (IndexTuple) PageGetItem(ipage, iitemid); - ItemPointerCopy(&itup->t_tid, &ttids[i]); + Assert(ItemIdIsDead(iitemid)); + + ItemPointerCopy(&itup->t_tid, &delstate.deltids[i].tid); + delstate.deltids[i].id = delstate.ndeltids; + delstate.status[i].idxoffnum = InvalidOffsetNumber; /* unused */ + delstate.status[i].knowndeletable = true; /* LP_DEAD-marked */ + delstate.status[i].promising = false; /* unused */ + delstate.status[i].freespace = 0; /* unused */ + + delstate.ndeltids++; } /* determine the actual xid horizon */ - latestRemovedXid = - table_compute_xid_horizon_for_tuples(hrel, ttids, nitems); + latestRemovedXid = table_compute_delete_for_tuples(hrel, &delstate); - pfree(ttids); + /* assert tableam agrees that all items are deletable */ + Assert(delstate.ndeltids == nitems); + + pfree(delstate.deltids); + pfree(delstate.status); return latestRemovedXid; } diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 27f555177e..ebe4408378 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -419,8 +419,8 @@ without a backend's cached page also being detected as invalidated, but only when we happen to recycle a block that once again gets recycled as the rightmost leaf page. -On-the-Fly Deletion Of Index Tuples ------------------------------------ +On-the-Fly deletion of LP_DEAD-bit-set index tuples +--------------------------------------------------- If a process visits a heap tuple and finds that it's dead and removable (ie, dead to all open transactions, not only that process), then we can @@ -439,19 +439,26 @@ from the index immediately; since index scans only stop "between" pages, no scan can lose its place from such a deletion. We separate the steps because we allow LP_DEAD to be set with only a share lock (it's exactly like a hint bit for a heap tuple), but physically removing tuples requires -exclusive lock. In the current code we try to remove LP_DEAD tuples when -we are otherwise faced with having to split a page to do an insertion (and -hence have exclusive lock on it already). Deduplication can also prevent -a page split, but removing LP_DEAD tuples is the preferred approach. -(Note that posting list tuples can only have their LP_DEAD bit set when -every table TID within the posting list is known dead.) +exclusive lock. Also, delaying the deletion often allows us to pick up +extra index tuples that weren't initially safe for index scans to mark +LP_DEAD. Live index tuples that are close to LP_DEAD-marked tuples in +time and space are usually highly likely to become dead-to-all shortly. 
+This makes workloads that greatly benefit from the LP_DEAD optimization +resilient against intermittent disruption from long running transactions +that hold open an MVCC snapshot (compared to the behavior prior to +PostgreSQL 14, the version that taught the LP_DEAD deletion process to +check if nearby index tuples are safe to delete in passing). -This leaves the index in a state where it has no entry for a dead tuple -that still exists in the heap. This is not a problem for the current -implementation of VACUUM, but it could be a problem for anything that -explicitly tries to find index entries for dead tuples. (However, the -same situation is created by REINDEX, since it doesn't enter dead -tuples into the index.) +We only try to delete LP_DEAD tuples (and nearby tuples) when we are +otherwise faced with having to split a page to do an insertion (and hence +have exclusive lock on it already). Deduplication and bottom-up index +deletion can also prevent a page split, but removing LP_DEAD tuples is +always the preferred approach. (Note that posting list tuples can only +have their LP_DEAD bit set when every table TID within the posting list is +known dead. This isn't much of a problem because LP_DEAD deletion can +often still do granular deletion of TIDs from a posting list. This will +happen when the posting list tuple's TIDs point to a table block that some +LP_DEAD-marked index tuple happens to point to.) It's sufficient to have an exclusive lock on the index page, not a super-exclusive lock, to do deletion of LP_DEAD items. It might seem @@ -469,6 +476,87 @@ LSN of the page, and only act to set LP_DEAD bits when the LSN has not changed at all. (Avoiding dropping the pin entirely also makes it safe, of course.) +Bottom-Up deletion +------------------ + +We attempt to delete whatever duplicates happen to be present on the page +when the duplicates are suspected to be caused by version churn from +successive UPDATEs. This only happens when we receive an executor hint +indicating that optimizations like heapam's HOT have not worked out for +the index -- the incoming tuple must be a logically unchanged duplicate +which is needed for MVCC purposes, suggesting that that might well be the +dominant source of new index tuples on the leaf page in question. (Also, +bottom-up deletion is triggered within unique indexes in cases with +continual INSERT and DELETE related churn, since that is easy to detect +without any external hint.) + +On-the-fly deletion of LP_DEAD-bit-set items (which can include deletion +of other close by index tuples) will already have failed to prevent a page +split when a bottom-up deletion pass takes place (often because no LP_DEAD +bits were ever set on the page). The two mechanisms have closely related +implementations. The same WAL records are used for each operation, and +the same tableam infrastructure is used to determine what TIDs/tuples are +actually safe to delete. The implementations only differ in how they pick +TIDs to consider for deletion, and whether or not the tableam will give up +before accessing all table blocks (bottom-up deletion lives with the +uncertainty of its success by keeping the cost of failure low). Even +still, the two mechanisms are clearly distinct at the conceptual level. + +Bottom-up index deletion is driven entirely by heuristics (whereas +on-the-fly deletion is guaranteed to delete at least those index tuples +that are already LP_DEAD marked). We have no certainty that we'll find +even one index tuple to delete. 
That's why we access as few tableam +blocks as possible, and only commit to accessing the next table block in +line when a positive outcome for the operation as a whole still looks +likely. This means that the tableam needs to have a fairly good idea of +how much space it has freed on the leaf page, to keep the costs and +benefits in balance per operation (and even across successive operations +affecting the same leaf page). + +Bottom-up index deletion can be thought of as a backstop mechanism against +unnecessary version-driven page splits. It is based in part on an idea +from generational garbage collection: the "generational hypothesis". This +is the empirical observation that "most objects die young". Within +nbtree, new index tuples often quickly appear in the same place, and then +quickly become garbage. There can be intense concentrations of garbage in +relatively few leaf pages (or there would be without the intervention of +bottom-up deletion). This occurs with workloads that consist of skewed +UPDATEs. There is little to lose and much to gain by spending a few +cycles to become reasonably sure that a page split is truly necessary +(when it seems like there is some chance of that) -- page splits are +expensive, and practically irreversible. + +We expect to find a reasonably large number of tuples that are safe to +delete within each bottom-up pass. If we don't then we won't need to +consider the question of bottom-up deletion for the same leaf page for +quite a while (usually because the page splits, which resolves the +situation, at least for a while). We expect to perform regular bottom-up +deletion operations against pages that are at constant risk of unnecessary +page splits caused only by version churn. When the mechanism works well +we'll constantly be "on the verge" of having version-churn-driven page +splits, but never actually have even one. + +Our duplicate heuristics work well despite being fairly simple. +Unnecessary page splits only occur when there are truly pathological +levels of version churn (in theory a small amount of version churn could +make a page split occur earlier than strictly necessary, but that's pretty +harmless). We don't have to understand the underlying workload; we only +have to understand the general nature of the pathology that we target. +Version churn is easy to spot when it is truly pathological. Affected +leaf pages are homogeneous. + +If version churn hasn't become a real problem then we don't actually want +to do anything about it anyway (we should be lazy about cleaning it up, at +least). All that really matters is that garbage does not become +concentrated in any one part of the key space (the number of physical +versions accessed by queries to read any given logical row should remain +low over time and across all parts of the key space). Remaining garbage +tuples can be thought of as "floating garbage" that VACUUM will eventually +get around to removing (VACUUM can be thought of as a top-down mechanism +that bottom-up garbage collection complements). The absolute number of +garbage tuples (and even the proportion of all index tuples that are +garbage) is generally much less important. + WAL Considerations ------------------ @@ -767,9 +855,10 @@ into a single physical tuple with a posting list (a simple array of heap TIDs with the standard item pointer format). Deduplication is always applied lazily, at the point where it would otherwise be necessary to perform a page split. 
It occurs only when LP_DEAD items have been -removed, as our last line of defense against splitting a leaf page. We -can set the LP_DEAD bit with posting list tuples, though only when all -TIDs are known dead. +removed, as our last line of defense against splitting a leaf page +(bottom-up index deletion may be attempted first, as our second last line +of defense). We can set the LP_DEAD bit with posting list tuples, though +only when all TIDs are known dead. Our lazy approach to deduplication allows the page space accounting used during page splits to have absolutely minimal special case logic for @@ -826,6 +915,16 @@ delay a split that is probably inevitable anyway. This allows us to avoid the overhead of attempting to deduplicate with unique indexes that always have few or no duplicates. +Note: Avoiding "unnecessary" page splits driven by version churn is also +the goal of bottom-up index deletion, which was added to PostgreSQL 14. +Bottom-up index deletion is now the preferred way to deal with this +problem (with all kinds of indexes, though especially with unique +indexes). Still, deduplication can sometimes augment bottom-up index +deletion. When deletion cannot free tuples (due to an old snapshot +holding up cleanup), falling back on deduplication provides additional +capacity. Delaying the page split by deduplicating can allow a future +bottom-up deletion pass of the same page to succeed. + Posting list splits ------------------- diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c index 9e535124c4..6576ca8606 100644 --- a/src/backend/access/nbtree/nbtdedup.c +++ b/src/backend/access/nbtree/nbtdedup.c @@ -1,7 +1,7 @@ /*------------------------------------------------------------------------- * * nbtdedup.c - * Deduplicate items in Postgres btrees. + * Deduplicate or bottom-up delete items in Postgres btrees. * * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California @@ -19,6 +19,8 @@ #include "miscadmin.h" #include "utils/rel.h" +static void _bt_bottomupdel_finish_pending(Page page, BTDedupState state, + TM_IndexDeleteOp *delstate); static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state, OffsetNumber minoff, IndexTuple newitem); static void _bt_singleval_fillfactor(Page page, BTDedupState state, @@ -267,6 +269,168 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, pfree(state); } +/* + * Perform bottom-up index deletion pass. + * + * See if duplicate index tuples (plus certain nearby tuples) are eligible to + * be deleted using tableam deletion interface's dedicated bottom-up deletion + * flag. The high level goal here is to entirely prevent "unnecessary" page + * splits caused by MVCC version churn from UPDATEs (when the UPDATEs don't + * logically modify any of the columns covered by the 'rel' index). + * + * A leaf page that has never been the target of a bottom-up deletion pass is + * likely to stay that way forever. But once a bottom-up deletion pass is + * actually triggered against the leaf page, all bets are off -- it will + * probably be targeted by many more bottom-up passes in the near future. + * + * The implementation assumes that any bottom-up deletion pass is just the + * latest in a long line of related bottom-up passes that affect the same leaf + * page. 
If that assumption turns out to be wrong then we'll split the page + * (or perhaps deduplicate it) soon after, resolving the situation at the + * level of the key space covered by the original leaf page. The cost of + * being wrong is fairly low, and must be paid only once (in wasted cycles). + * But when the assumption turns out to be correct it'll usually work out + * again and again, across many successive deletion operations. Besides all + * this, it's unlikely that we'll "get it wrong" in the first place, since we + * know for sure that at least the incoming item is a "logically unchanged" + * index tuple. + * + * However, "getting it wrong" might eventually become unavoidable in the + * presence of a long-running transaction that holds open an MVCC snapshot. + * Even then, bottom-up deletion will probably manage to delete many garbage + * tuples before becoming totally ineffective. There is no practical way to + * know ahead of time whether or not the process will work out, and the cost + * of trying is still relatively low. + * + * Returns true on success, in which case caller can assume page split will be + * avoided for a reasonable amount of time. Returns false when caller should + * deduplicate the page (if possible at all). + * + * Note: Occasionally a true return value does not actually indicate that any + * items could be deleted. It might just indicate that caller should not go + * on to perform a deduplication pass. Caller is not expected to care about + * the difference. + * + * Note: Caller should have already deleted all existing items with their + * LP_DEAD bits set. + */ +bool +_bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel, + Size newitemsz) +{ + OffsetNumber offnum, + minoff, + maxoff; + Page page = BufferGetPage(buf); + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); + BTDedupState state; + TM_IndexDeleteOp delstate; + bool neverdedup; + int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + + /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */ + newitemsz += sizeof(ItemIdData); + + /* Initialize deduplication state */ + state = (BTDedupState) palloc(sizeof(BTDedupStateData)); + state->deduplicate = true; + state->nmaxitems = 0; + state->maxpostingsize = BLCKSZ; /* We're not really deduplicating */ + state->base = NULL; + state->baseoff = InvalidOffsetNumber; + state->basetupsize = 0; + state->htids = palloc(state->maxpostingsize); + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; + state->nintervals = 0; + + /* + * Initialize tableam state that describes bottom-up index deletion + * operation. + * + * We'll go on to ask the tableam to search for TIDs whose index tuples we + * can safely delete. The tableam will search until our leaf page space + * target is satisfied, or until the cost of continuing with the tableam + * operation seems too high. It focuses its efforts on TIDs associated + * with duplicate index tuples that we mark "promising". + * + * This space target is a little arbitrary. The tableam must be able to + * keep the costs and benefits in balance. We provide the tableam with + * exhaustive information about what might work, without directly + * concerning ourselves with avoiding work during the tableam call. Our + * role in costing the bottom-up deletion process is strictly advisory. 
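+ *
+ * For example, with the default 8kB block size the space target set just
+ * below works out to Max(512, newitemsz) bytes: the tableam keeps checking
+ * table blocks until it has identified at least that much deletable space
+ * on the leaf page, or until it decides that continuing is unlikely to pay
+ * for itself.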
+ */ + delstate.bottomup = true; + delstate.bottomupfreespace = Max(BLCKSZ / 16, newitemsz); + delstate.ndeltids = 0; + delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete)); + delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus)); + + minoff = P_FIRSTDATAKEY(opaque); + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = minoff; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + + Assert(!ItemIdIsDead(itemid)); + + if (offnum == minoff) + { + _bt_dedup_start_pending(state, itup, offnum); + } + else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts && + _bt_dedup_save_htid(state, itup)) + { + /* Tuple is equal; just added its TIDs to pending interval */ + } + else + { + /* Finalize interval -- move its TIDs to delete state */ + _bt_bottomupdel_finish_pending(page, state, &delstate); + + /* itup starts new pending interval */ + _bt_dedup_start_pending(state, itup, offnum); + } + } + /* Finalize final interval -- move its TIDs to delete state */ + _bt_bottomupdel_finish_pending(page, state, &delstate); + + /* + * The tableam uses its own heuristics. They can influence the table + * blocks that it visits, especially when promising tuples are not + * concentrated in just a few table blocks. This is why we don't give up + * now in the event of having few (or even zero) promising tuples for the + * tableam. + * + * When there are no duplicates on the page at all we tell our caller to + * not attempt deduplication (by reporting "success"). Having zero + * duplicates/promising tuples should be rare, but when it happens we + * might as well save caller a few cycles. + */ + neverdedup = false; + if (state->nintervals == 0) + neverdedup = true; + + pfree(state->htids); + pfree(state); + + /* Ask tableam which TIDs are deletable, then physically delete them */ + _bt_delitems_delete_check(rel, buf, heapRel, &delstate); + + pfree(delstate.deltids); + pfree(delstate.status); + + if (neverdedup) + return true; + + /* Don't dedup when we won't end up back here any time soon anyway */ + return PageGetExactFreeSpace(page) >= Max(BLCKSZ / 24, newitemsz); +} + /* * Create a new pending posting list tuple based on caller's base tuple. * @@ -452,6 +616,150 @@ _bt_dedup_finish_pending(Page newpage, BTDedupState state) return spacesaving; } +/* + * Finalize interval during bottom-up index deletion. + * + * During a bottom-up pass we expect that TIDs will be recorded in dedup state + * first, and then get moved over to delstate (in variable-sized batches) by + * calling here. Call here happens when the number of TIDs in a dedup + * interval is known, and interval gets finalized (i.e. when caller sees next + * tuple on the page is not a duplicate, or when caller runs out of tuples to + * process from leaf page). + * + * This is where bottom-up deletion determines and remembers which entries are + * duplicates. This will be important information to the tableam delete + * infrastructure later on. Plain index tuple duplicates are marked + * "promising" here, per tableam contract. + * + * Our approach to marking entries whose TIDs come from posting lists is more + * complicated. Posting lists can only be formed by a deduplication pass (or + * during an index build), so recent version churn affecting the pointed-to + * logical rows is not particularly likely. 
We may still give a weak signal + * about posting list tuples' entries (by marking just one of its TIDs/entries + * promising), though this is only a possibility in the event of further + * duplicate index tuples in final interval that covers posting list tuple (as + * in the plain tuple case). A weak signal/hint will be useful to the tableam + * when it has no stronger signal to go with for the deletion operation as a + * whole. + * + * The heuristics we use work well in practice because we only need to give + * the tableam the right _general_ idea about where to look. Garbage tends to + * naturally get concentrated in relatively few table blocks with workloads + * that bottom-up deletion targets. The tableam cannot possibly rank all + * available table blocks sensibly based on the hints we provide, but that's + * okay -- only the extremes matter. The tableam just needs to be able to + * predict which few table blocks will have the most dead-to-all tuples for + * each deletion operation, with low variance (variance in the number of truly + * deletable TIDs) across related deletion operations. + */ +static void +_bt_bottomupdel_finish_pending(Page page, BTDedupState state, + TM_IndexDeleteOp *delstate) +{ + bool dupinterval = (state->nitems > 1); + + Assert(state->nitems > 0); + Assert(state->nitems <= state->nhtids); + Assert(state->intervals[state->nintervals].baseoff == state->baseoff); + + for (int i = 0; i < state->nitems; i++) + { + OffsetNumber offnum = state->baseoff + i; + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + TM_IndexDelete *ideltid = &delstate->deltids[delstate->ndeltids]; + TM_IndexStatus *istatus = &delstate->status[delstate->ndeltids]; + + if (!BTreeTupleIsPosting(itup)) + { + /* Simple case: A plain non-pivot tuple */ + ideltid->tid = itup->t_tid; + ideltid->id = delstate->ndeltids; + istatus->idxoffnum = offnum; + istatus->knowndeletable = false; /* for now */ + istatus->promising = dupinterval; /* simple rule */ + istatus->freespace = ItemIdGetLength(itemid) + sizeof(ItemIdData); + + delstate->ndeltids++; + } + else + { + /* + * Complicated case: A posting list tuple. + * + * We make the conservative assumption that there can only be at + * most one affected logical row per posting list tuple. There + * will be at most one promising entry in deltids to represent + * this presumed lone logical row. Note that this isn't even + * considered unless the posting list tuple is also in an interval + * of duplicates -- this complicated rule is just a variant of the + * simple rule used to decide if plain index tuples are promising. 
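+ *
+ * For example (assuming the posting list tuple is inside an interval of
+ * duplicates), if its TIDs point to table blocks 7, 7, 7, 9, and 12, only
+ * the first TID can be marked promising (the first block predominates); if
+ * they point to blocks 3, 9, 9, 9, and 9, only the last TID can be marked
+ * promising; if they point to blocks 3, 9, 9, 12, and 14, no TID is marked
+ * promising.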
+ */ + int nitem = BTreeTupleGetNPosting(itup); + bool firstpromising = false; + bool lastpromising = false; + + Assert(_bt_posting_valid(itup)); + + if (dupinterval) + { + /* + * Complicated rule: either the first or last TID in the + * posting list gets marked promising (if any at all) + */ + BlockNumber minblocklist, + midblocklist, + maxblocklist; + ItemPointer mintid, + midtid, + maxtid; + + mintid = BTreeTupleGetHeapTID(itup); + midtid = BTreeTupleGetPostingN(itup, nitem / 2); + maxtid = BTreeTupleGetMaxHeapTID(itup); + minblocklist = ItemPointerGetBlockNumber(mintid); + midblocklist = ItemPointerGetBlockNumber(midtid); + maxblocklist = ItemPointerGetBlockNumber(maxtid); + + /* Only entry with predominant table block can be promising */ + firstpromising = (minblocklist == midblocklist); + lastpromising = (!firstpromising && + midblocklist == maxblocklist); + } + + for (int p = 0; p < nitem; p++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, p); + + ideltid->tid = *htid; + ideltid->id = delstate->ndeltids; + istatus->idxoffnum = offnum; + istatus->knowndeletable = false; /* for now */ + istatus->promising = false; + if ((firstpromising && p == 0) || + (lastpromising && p == nitem - 1)) + istatus->promising = true; + istatus->freespace = sizeof(ItemPointerData); /* at worst */ + + ideltid++; + istatus++; + delstate->ndeltids++; + } + } + } + + if (dupinterval) + { + state->intervals[state->nintervals].nitems = state->nitems; + state->nintervals++; + } + + /* Reset state for next interval */ + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; +} + /* * Determine if page non-pivot tuples (data items) are all duplicates of the * same value -- if they are, deduplication's "single value" strategy should @@ -622,8 +930,8 @@ _bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids) * Generate a replacement tuple by "updating" a posting list tuple so that it * no longer has TIDs that need to be deleted. * - * Used by VACUUM. Caller's vacposting argument points to the existing - * posting list tuple to be updated. + * Used by both VACUUM and index deletion. Caller's vacposting argument + * points to the existing posting list tuple to be updated. * * On return, caller's vacposting argument will point to final "updated" * tuple, which will be palloc()'d in caller's memory context. 
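To make the duplicate-interval bookkeeping added above easier to follow, here is a minimal standalone sketch in plain C of the rule for plain (non-posting-list) tuples: every member of an interval that contains more than one equal tuple is marked promising for the tableam, while members of singleton intervals are not. DemoTuple and mark_promising are illustrative stand-ins rather than PostgreSQL code, and integer keys stand in for whole index tuples.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for an index tuple: only the key matters for grouping here */
typedef struct
{
    int     key;        /* indexed key value */
    bool    promising;  /* set when tuple belongs to a duplicate interval */
} DemoTuple;

/*
 * Group equal keys into intervals (input is assumed to be in key order, as
 * on a leaf page) and mark every member of a multi-tuple interval promising.
 */
static void
mark_promising(DemoTuple *tuples, int ntuples)
{
    int     start = 0;

    for (int i = 1; i <= ntuples; i++)
    {
        if (i == ntuples || tuples[i].key != tuples[start].key)
        {
            bool    dupinterval = (i - start) > 1;

            for (int j = start; j < i; j++)
                tuples[j].promising = dupinterval;
            start = i;      /* next tuple starts a new interval */
        }
    }
}

int
main(void)
{
    DemoTuple   page[] = {
        {1, false}, {1, false}, {1, false},     /* interval of 3 duplicates */
        {2, false},                             /* singleton interval */
        {3, false}, {3, false}                  /* interval of 2 duplicates */
    };
    int         n = sizeof(page) / sizeof(page[0]);

    mark_promising(page, n);
    for (int i = 0; i < n; i++)
        printf("key=%d promising=%d\n", page[i].key, (int) page[i].promising);
    return 0;
}

In _bt_bottomupdel_pass() itself the grouping is done with _bt_keep_natts_fast() and _bt_dedup_save_htid(), and each deltids/status entry additionally records the tuple's TID, its page offset number, and the leaf page space that deleting it would free.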
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index dde43b1415..75869847f3 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -17,9 +17,9 @@ #include "access/nbtree.h" #include "access/nbtxlog.h" -#include "access/tableam.h" #include "access/transam.h" #include "access/xloginsert.h" +#include "lib/qunique.h" #include "miscadmin.h" #include "storage/lmgr.h" #include "storage/predicate.h" @@ -37,6 +37,7 @@ static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate, static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel); static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack); @@ -60,8 +61,14 @@ static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, OffsetNumber itup_off, bool newfirstdataitem); static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTInsertState insertstate, - bool lpdeadonly, bool checkingunique, - bool uniquedup); + bool simpleonly, bool checkingunique, + bool uniquedup, bool indexUnchanged); +static void _bt_simpledel_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff); +static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable, + int ndeletable, int *nblocks); +static inline int _bt_blk_cmp(const void *arg1, const void *arg2); /* * _bt_doinsert() -- Handle insertion of a single index tuple in the tree. @@ -75,6 +82,11 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and * don't actually insert. * + * indexUnchanged executor hint indicates if itup is from an + * UPDATE that didn't logically change the indexed value, but + * must nevertheless have a new entry to point to a successor + * version. + * * The result value is only significant for UNIQUE_CHECK_PARTIAL: * it must be true if the entry is known unique, else false. * (In the current implementation we'll also return true after a @@ -83,7 +95,8 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, */ bool _bt_doinsert(Relation rel, IndexTuple itup, - IndexUniqueCheck checkUnique, Relation heapRel) + IndexUniqueCheck checkUnique, bool indexUnchanged, + Relation heapRel) { bool is_unique = false; BTInsertStateData insertstate; @@ -238,7 +251,7 @@ search: * checkingunique. */ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique, - stack, heapRel); + indexUnchanged, stack, heapRel); _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack, itup, insertstate.itemsz, newitemoff, insertstate.postingoff, false); @@ -480,11 +493,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, * items as quickly as we can. We only apply _bt_compare() when * we get to a non-killed item. We could reuse the bounds to * avoid _bt_compare() calls for known equal tuples, but it - * doesn't seem worth it. Workloads with heavy update activity - * tend to have many deduplication passes, so we'll often avoid - * most of those comparisons, too (we call _bt_compare() when the - * posting list tuple is initially encountered, though not when - * processing later TIDs from the same tuple). + * doesn't seem worth it. 
*/ if (!inposting) curitemid = PageGetItemId(page, offset); @@ -777,6 +786,17 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, * room for the new tuple, this function moves right, trying to find a * legal page that does.) * + * If 'indexUnchanged' is true, this is for an UPDATE that didn't + * logically change the indexed value, but must nevertheless have a new + * entry to point to a successor version. This hint from the executor + * will influence our behavior when the page might have to be split and + * we must consider our options. Bottom-up index deletion can avoid + * pathological version-driven page splits, but we only want to go to the + * trouble of trying it when we already have moderate confidence that + * it's appropriate. The hint should not significantly affect our + * behavior over time unless practically all inserts on to the leaf page + * get the hint. + * * On exit, insertstate buffer contains the chosen insertion page, and * the offset within that page is returned. If _bt_findinsertloc needed * to move right, the lock and pin on the original page are released, and @@ -793,6 +813,7 @@ static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel) { @@ -817,7 +838,7 @@ _bt_findinsertloc(Relation rel, if (itup_key->heapkeyspace) { /* Keep track of whether checkingunique duplicate seen */ - bool uniquedup = false; + bool uniquedup = indexUnchanged; /* * If we're inserting into a unique index, we may have to walk right @@ -874,14 +895,14 @@ _bt_findinsertloc(Relation rel, } /* - * If the target page is full, see if we can obtain enough space using - * one or more strategies (e.g. erasing LP_DEAD items, deduplication). - * Page splits are expensive, and should only go ahead when truly - * necessary. + * If the target page cannot fit newitem, try to avoid splitting the + * page (at the point of insert) by applying deletion or deduplication + * now */ if (PageGetFreeSpace(page) < insertstate->itemsz) _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false, - checkingunique, uniquedup); + checkingunique, uniquedup, + indexUnchanged); } else { @@ -921,9 +942,9 @@ _bt_findinsertloc(Relation rel, */ if (P_HAS_GARBAGE(opaque)) { - /* Erase LP_DEAD items (won't deduplicate) */ + /* Perform simple deletion */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + false, false, false); if (PageGetFreeSpace(page) >= insertstate->itemsz) break; /* OK, now we have enough space */ @@ -970,14 +991,11 @@ _bt_findinsertloc(Relation rel, /* * There is an overlapping posting list tuple with its LP_DEAD bit * set. We don't want to unnecessarily unset its LP_DEAD bit while - * performing a posting list split, so delete all LP_DEAD items early. - * This is the only case where LP_DEAD deletes happen even though - * there is space for newitem on the page. - * - * This can only erase LP_DEAD items (it won't deduplicate). + * performing a posting list split, so perform simple index tuple + * deletion early. */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + false, false, false); /* * Do new binary search. New insert location cannot overlap with any @@ -2606,21 +2624,19 @@ _bt_pgaddtup(Page page, } /* - * _bt_delete_or_dedup_one_page - Try to avoid a leaf page split by attempting - * a variety of operations. + * _bt_delete_or_dedup_one_page - Try to avoid a leaf page split. 
* - * There are two operations performed here: deleting items already marked - * LP_DEAD, and deduplication. If both operations fail to free enough space - * for the incoming item then caller will go on to split the page. We always - * attempt our preferred strategy (which is to delete items whose LP_DEAD bit - * are set) first. If that doesn't work out we move on to deduplication. + * There are three operations performed here: simple index deletion, bottom-up + * index deletion, and deduplication. If all three operations fail to free + * enough space for the incoming item then caller will go on to split the + * page. We always consider simple deletion first. If that doesn't work out + * we consider alternatives. Callers that only want us to consider simple + * deletion (without any fallback) ask for that using the 'simpleonly' + * argument. * - * Caller's checkingunique and uniquedup arguments help us decide if we should - * perform deduplication, which is primarily useful with low cardinality data, - * but can sometimes absorb version churn. - * - * Callers that only want us to look for/delete LP_DEAD items can ask for that - * directly by passing true 'lpdeadonly' argument. + * We usually pick only one alternative "complex" operation when simple + * deletion alone won't prevent a page split. The 'checkingunique', + * 'uniquedup', and 'indexUnchanged' arguments are used for that. * * Note: We used to only delete LP_DEAD items when the BTP_HAS_GARBAGE page * level flag was found set. The flag was useful back when there wasn't @@ -2638,12 +2654,13 @@ _bt_pgaddtup(Page page, static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTInsertState insertstate, - bool lpdeadonly, bool checkingunique, - bool uniquedup) + bool simpleonly, bool checkingunique, + bool uniquedup, bool indexUnchanged) { OffsetNumber deletable[MaxIndexTuplesPerPage]; int ndeletable = 0; OffsetNumber offnum, + minoff, maxoff; Buffer buffer = insertstate->buf; BTScanInsert itup_key = insertstate->itup_key; @@ -2651,14 +2668,16 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); Assert(P_ISLEAF(opaque)); - Assert(lpdeadonly || itup_key->heapkeyspace); + Assert(simpleonly || itup_key->heapkeyspace); + Assert(!simpleonly || (!checkingunique && !uniquedup && !indexUnchanged)); /* * Scan over all items to see which ones need to be deleted according to * LP_DEAD flags. */ + minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); - for (offnum = P_FIRSTDATAKEY(opaque); + for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { @@ -2670,7 +2689,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, if (ndeletable > 0) { - _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel); + _bt_simpledel_pass(rel, buffer, heapRel, deletable, ndeletable, + minoff, maxoff); insertstate->bounds_valid = false; /* Return when a page split has already been avoided */ @@ -2682,37 +2702,263 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, } /* - * Some callers only want to delete LP_DEAD items. Return early for these - * callers. + * We're done with simple deletion. Return early with callers that only + * call here so that simple deletion can be considered. This includes + * callers that explicitly ask for this and checkingunique callers that + * probably don't have any duplicates on the page, and thus probably have + * nothing more to gain. 
* * Note: The page's BTP_HAS_GARBAGE hint flag may still be set when we * return at this point (or when we go on to try either or both of our * other strategies and they also fail). We do not bother expending a * separate write to clear it, however. Caller will definitely clear it - when it goes on to split the page (plus deduplication knows to clear - the flag when it actually modifies the page). + when it goes on to split the page (note also that the deduplication process + knows to clear the flag when it actually modifies the page). */ - if (lpdeadonly) - return; - - /* - * We can get called in the checkingunique case when there is no reason to - * believe that there are any duplicates on the page; we should at least - * still check for LP_DEAD items. If that didn't work out, give up and - * let caller split the page. Deduplication cannot be justified given - * there is no reason to think that there are duplicates. - */ - if (checkingunique && !uniquedup) + if (simpleonly || (checkingunique && !uniquedup)) return; /* Assume bounds about to be invalidated (this is almost certain now) */ insertstate->bounds_valid = false; /* - * Perform deduplication pass, though only when it is enabled for the - * index and known to be safe (it must be an allequalimage index). + * Perform bottom-up index deletion pass when executor hint indicated that + * incoming item is logically unchanged, or for a unique index that is + * known to have physical duplicates for some other reason. (There is a + * large overlap between these two cases for a unique index. It's worth + * having both triggering conditions in order to apply the optimization in + * the event of successive related INSERT and DELETE statements.) + * + * We'll go on to do a deduplication pass when a bottom-up pass fails to + * delete an acceptable amount of free space (a significant fraction of + * the page, or space for the new item, whichever is greater). + * + * Note: Bottom-up index deletion uses the same equality/equivalence + * routines as deduplication internally. However, it does not merge + * together index tuples, so the same correctness considerations do not + * apply. We deliberately omit an index-is-allequalimage test here. */ + if (BTGetBottomupDeleteItems(rel) && (indexUnchanged || uniquedup) && + _bt_bottomupdel_pass(rel, buffer, heapRel, insertstate->itemsz)) + return; + + /* Perform deduplication pass (when enabled and index-is-allequalimage) */ if (BTGetDeduplicateItems(rel) && itup_key->allequalimage) _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup, insertstate->itemsz, checkingunique); } + +/* + * _bt_simpledel_pass - Simple index tuple deletion pass. + * + * We delete all LP_DEAD-set index tuples on a leaf page. The offset numbers + * of all such tuples are determined by caller (caller passes these to us as + * its 'deletable' argument). + * + * We might also delete extra index tuples that turn out to be safe to delete + * in passing (though they must be cheap to check in passing to begin with). + * There is no certainty that any extra tuples will be deleted, though. The + * high level goal of the approach we take is to get the most out of each call + * here (without noticeably increasing the per-call overhead compared to what + * we need to do just to be able to delete the page's LP_DEAD-marked index + * tuples). + * + * The number of extra index tuples that turn out to be deletable might + * greatly exceed the number of LP_DEAD-marked index tuples due to various + * locality related effects. 
For example, it's possible that the total number + * of table blocks (pointed to by all TIDs on the leaf page) is naturally + * quite low, in which case we might end up checking if it's possible to + * delete _most_ index tuples on the page (without the tableam needing to + * access additional table blocks). The tableam will sometimes stumble upon + * _many_ extra deletable index tuples in indexes where this pattern is + * common. + */ +static void +_bt_simpledel_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff) +{ + Page page = BufferGetPage(buffer); + BlockNumber *deadblocks; + int ndeadblocks; + TM_IndexDeleteOp delstate; + OffsetNumber offnum; + + /* Get array of table blocks pointed to by LP_DEAD-set tuples */ + deadblocks = _bt_deadblocks(page, deletable, ndeletable, &ndeadblocks); + + /* Initialize tableam state that describes index deletion operation */ + delstate.bottomup = false; + delstate.bottomupfreespace = 0; + delstate.ndeltids = 0; + delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete)); + delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus)); + + for (offnum = minoff; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + TM_IndexDelete *odeltid = &delstate.deltids[delstate.ndeltids]; + TM_IndexStatus *ostatus = &delstate.status[delstate.ndeltids]; + BlockNumber tidblock; + void *match; + + if (!BTreeTupleIsPosting(itup)) + { + tidblock = ItemPointerGetBlockNumber(&itup->t_tid); + match = bsearch(&tidblock, deadblocks, ndeadblocks, + sizeof(BlockNumber), _bt_blk_cmp); + + if (!match) + { + Assert(!ItemIdIsDead(itemid)); + continue; + } + + /* + * TID's table block is among those pointed to by the TIDs from + * LP_DEAD-bit set tuples on page -- add TID to deltids + */ + odeltid->tid = itup->t_tid; + odeltid->id = delstate.ndeltids; + ostatus->idxoffnum = offnum; + ostatus->knowndeletable = ItemIdIsDead(itemid); + ostatus->promising = false; /* unused */ + ostatus->freespace = 0; /* unused */ + + delstate.ndeltids++; + } + else + { + int nitem = BTreeTupleGetNPosting(itup); + + for (int p = 0; p < nitem; p++) + { + ItemPointer tid = BTreeTupleGetPostingN(itup, p); + + tidblock = ItemPointerGetBlockNumber(tid); + match = bsearch(&tidblock, deadblocks, ndeadblocks, + sizeof(BlockNumber), _bt_blk_cmp); + + if (!match) + { + Assert(!ItemIdIsDead(itemid)); + continue; + } + + /* + * TID's table block is among those pointed to by the TIDs + * from LP_DEAD-bit set tuples on page -- add TID to deltids + */ + odeltid->tid = *tid; + odeltid->id = delstate.ndeltids; + ostatus->idxoffnum = offnum; + ostatus->knowndeletable = ItemIdIsDead(itemid); + ostatus->promising = false; /* unused */ + ostatus->freespace = 0; /* unused */ + + odeltid++; + ostatus++; + delstate.ndeltids++; + } + } + } + + pfree(deadblocks); + + Assert(delstate.ndeltids >= ndeletable); + + /* Physically delete LP_DEAD tuples (plus any extra dead-to-all TIDs) */ + _bt_delitems_delete_check(rel, buffer, heapRel, &delstate); + + pfree(delstate.deltids); + pfree(delstate.status); +} + +/* + * _bt_deadblocks() -- Get LP_DEAD related table blocks. + * + * Builds sorted and unique-ified array of table block numbers from index + * tuple TIDs whose line pointers are marked LP_DEAD. + * + * Returns final array, and sets *nblocks to its final size for caller. 
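+ *
+ * For example, if the LP_DEAD-marked tuples' TIDs are (10,3), (10,7),
+ * (42,1), and (42,6), the returned array is {10, 42} and *nblocks is set
+ * to 2.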
+ */ +static BlockNumber * +_bt_deadblocks(Page page, OffsetNumber *deletable, int ndeletable, + int *nblocks) +{ + int spacentids, + ntids; + BlockNumber *tidblocks; + + /* + * Accumulate each TID's block in array whose initial size has space for + * one table block per LP_DEAD-set tuple. Array will only need to grow + * when there are LP_DEAD-marked posting list tuples (which is not that + * common). + */ + spacentids = ndeletable; + ntids = 0; + tidblocks = (BlockNumber *) palloc(sizeof(BlockNumber) * spacentids); + for (int i = 0; i < ndeletable; i++) + { + ItemId itemid = PageGetItemId(page, deletable[i]); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + + Assert(ItemIdIsDead(itemid)); + + if (!BTreeTupleIsPosting(itup)) + { + if (ntids + 1 > spacentids) + { + spacentids *= 2; + tidblocks = (BlockNumber *) + repalloc(tidblocks, sizeof(BlockNumber) * spacentids); + } + + tidblocks[ntids++] = ItemPointerGetBlockNumber(&itup->t_tid); + } + else + { + int nposting = BTreeTupleGetNPosting(itup); + + if (ntids + nposting > spacentids) + { + spacentids = Max(spacentids * 2, ntids + nposting); + tidblocks = (BlockNumber *) + repalloc(tidblocks, sizeof(BlockNumber) * spacentids); + } + + for (int j = 0; j < nposting; j++) + { + ItemPointer tid = BTreeTupleGetPostingN(itup, j); + + tidblocks[ntids++] = ItemPointerGetBlockNumber(tid); + } + } + } + + qsort(tidblocks, ntids, sizeof(BlockNumber), _bt_blk_cmp); + *nblocks = qunique(tidblocks, ntids, sizeof(BlockNumber), _bt_blk_cmp); + + return tidblocks; +} + +/* + * _bt_blk_cmp() -- qsort comparison function for _bt_simpledel_pass + */ +static inline int +_bt_blk_cmp(const void *arg1, const void *arg2) +{ + BlockNumber b1 = *((BlockNumber *) arg1); + BlockNumber b2 = *((BlockNumber *) arg2); + + if (b1 < b2) + return -1; + else if (b1 > b2) + return 1; + + return 0; +} diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 793434c026..70c3382b10 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -38,8 +38,14 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf); static void _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid); -static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable); +static void _bt_delitems_delete(Relation rel, Buffer buf, + TransactionId latestRemovedXid, + OffsetNumber *deletable, int ndeletable, + BTVacuumPosting *updatable, int nupdatable, + Relation heapRel); +static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, + Size *updatedbuflen, bool needswal); static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack); static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, @@ -1110,15 +1116,15 @@ _bt_page_recyclable(Page page) * sorted in ascending order. * * Routine deals with deleting TIDs when some (but not all) of the heap TIDs - * in an existing posting list item are to be removed by VACUUM. This works - * by updating/overwriting an existing item with caller's new version of the - * item (a version that lacks the TIDs that are to be deleted). + * in an existing posting list item are to be removed. This works by + * updating/overwriting an existing item with caller's new version of the item + * (a version that lacks the TIDs that are to be deleted). * * We record VACUUMs and b-tree deletes differently in WAL. 
Deletes must - * generate their own latestRemovedXid by accessing the heap directly, whereas - * VACUUMs rely on the initial heap scan taking care of it indirectly. Also, - * only VACUUM can perform granular deletes of individual TIDs in posting list - * tuples. + * generate their own latestRemovedXid by accessing the table directly, + * whereas VACUUMs rely on the initial heap scan taking care of it indirectly. + * Also, we remove the VACUUM cycle ID from pages, which b-tree deletes don't + * do. */ void _bt_delitems_vacuum(Relation rel, Buffer buf, @@ -1127,7 +1133,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { Page page = BufferGetPage(buf); BTPageOpaque opaque; - Size itemsz; + bool needswal = RelationNeedsWAL(rel); char *updatedbuf = NULL; Size updatedbuflen = 0; OffsetNumber updatedoffsets[MaxIndexTuplesPerPage]; @@ -1135,45 +1141,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* Shouldn't be called unless there's something to do */ Assert(ndeletable > 0 || nupdatable > 0); - for (int i = 0; i < nupdatable; i++) - { - /* Replace work area IndexTuple with updated version */ - _bt_update_posting(updatable[i]); - - /* Maintain array of updatable page offsets for WAL record */ - updatedoffsets[i] = updatable[i]->updatedoffset; - } - - /* XLOG stuff -- allocate and fill buffer before critical section */ - if (nupdatable > 0 && RelationNeedsWAL(rel)) - { - Size offset = 0; - - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - - itemsz = SizeOfBtreeUpdate + - vacposting->ndeletedtids * sizeof(uint16); - updatedbuflen += itemsz; - } - - updatedbuf = palloc(updatedbuflen); - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - xl_btree_update update; - - update.ndeletedtids = vacposting->ndeletedtids; - memcpy(updatedbuf + offset, &update.ndeletedtids, - SizeOfBtreeUpdate); - offset += SizeOfBtreeUpdate; - - itemsz = update.ndeletedtids * sizeof(uint16); - memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); - offset += itemsz; - } - } + /* Generate new version of posting lists without deleted TIDs */ + if (nupdatable > 0) + updatedbuf = _bt_delitems_update(updatable, nupdatable, + updatedoffsets, &updatedbuflen, + needswal); /* No ereport(ERROR) until changes are logged */ START_CRIT_SECTION(); @@ -1194,6 +1166,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { OffsetNumber updatedoffset = updatedoffsets[i]; IndexTuple itup; + Size itemsz; itup = updatable[i]->itup; itemsz = MAXALIGN(IndexTupleSize(itup)); @@ -1218,7 +1191,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, * Clear the BTP_HAS_GARBAGE page flag. * * This flag indicates the presence of LP_DEAD items on the page (though - * not reliably). Note that we only trust it with pg_upgrade'd + * not reliably). Note that we only rely on it with pg_upgrade'd * !heapkeyspace indexes. That's why clearing it here won't usually * interfere with _bt_delitems_delete(). 
*/ @@ -1227,7 +1200,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (needswal) { XLogRecPtr recptr; xl_btree_vacuum xlrec_vacuum; @@ -1260,7 +1233,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* can't leak memory here */ if (updatedbuf != NULL) pfree(updatedbuf); - /* free tuples generated by calling _bt_update_posting() */ + /* free tuples allocated within _bt_delitems_update() */ for (int i = 0; i < nupdatable; i++) pfree(updatable[i]->itup); } @@ -1269,40 +1242,66 @@ * Delete item(s) from a btree leaf page during single-page cleanup. * * This routine assumes that the caller has pinned and write locked the - * buffer. Also, the given deletable array *must* be sorted in ascending - * order. + * buffer. Also, the given deletable and updatable arrays *must* be sorted in + * ascending order. + * + * Routine deals with deleting TIDs when some (but not all) of the heap TIDs + * in an existing posting list item are to be removed. This works by + * updating/overwriting an existing item with caller's new version of the item + * (a version that lacks the TIDs that are to be deleted). * * This is nearly the same as _bt_delitems_vacuum as far as what it does to - * the page, but it needs to generate its own latestRemovedXid by accessing - * the heap. This is used by the REDO routine to generate recovery conflicts. - * Also, it doesn't handle posting list tuples unless the entire tuple can be - * deleted as a whole (since there is only one LP_DEAD bit per line pointer). + * the page, but it needs its own latestRemovedXid from caller (caller gets + * this from tableam). This is used by the REDO routine to generate recovery + * conflicts. The other difference is that _bt_delitems_vacuum will clear + * the page's VACUUM cycle ID. We must never do that. 
*/ -void -_bt_delitems_delete(Relation rel, Buffer buf, +static void +_bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid, OffsetNumber *deletable, int ndeletable, + BTVacuumPosting *updatable, int nupdatable, Relation heapRel) { Page page = BufferGetPage(buf); BTPageOpaque opaque; - TransactionId latestRemovedXid = InvalidTransactionId; + bool needswal = RelationNeedsWAL(rel); + char *updatedbuf = NULL; + Size updatedbuflen = 0; + OffsetNumber updatedoffsets[MaxIndexTuplesPerPage]; /* Shouldn't be called unless there's something to do */ - Assert(ndeletable > 0); + Assert(ndeletable > 0 || nupdatable > 0); - if (XLogStandbyInfoActive() && RelationNeedsWAL(rel)) - latestRemovedXid = - _bt_xid_horizon(rel, heapRel, page, deletable, ndeletable); + /* Generate new versions of posting lists without deleted TIDs */ + if (nupdatable > 0) + updatedbuf = _bt_delitems_update(updatable, nupdatable, + updatedoffsets, &updatedbuflen, + needswal); /* No ereport(ERROR) until changes are logged */ START_CRIT_SECTION(); - /* Fix the page */ - PageIndexMultiDelete(page, deletable, ndeletable); + /* Handle updates and deletes just like _bt_delitems_vacuum */ + for (int i = 0; i < nupdatable; i++) + { + OffsetNumber updatedoffset = updatedoffsets[i]; + IndexTuple itup; + Size itemsz; + + itup = updatable[i]->itup; + itemsz = MAXALIGN(IndexTupleSize(itup)); + if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup, + itemsz)) + elog(PANIC, "failed to update partially dead item in block %u of index \"%s\"", + BufferGetBlockNumber(buf), RelationGetRelationName(rel)); + } + + if (ndeletable > 0) + PageIndexMultiDelete(page, deletable, ndeletable); /* - * Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID, - * because this is not called by VACUUM + * Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID at + * this point. The VACUUM command alone controls vacuum cycle IDs. */ opaque = (BTPageOpaque) PageGetSpecialPointer(page); @@ -1310,7 +1309,7 @@ _bt_delitems_delete(Relation rel, Buffer buf, * Clear the BTP_HAS_GARBAGE page flag. * * This flag indicates the presence of LP_DEAD items on the page (though - * not reliably). Note that we only trust it with pg_upgrade'd + * not reliably). Note that we only rely on it with pg_upgrade'd * !heapkeyspace indexes. */ opaque->btpo_flags &= ~BTP_HAS_GARBAGE; @@ -1318,25 +1317,29 @@ _bt_delitems_delete(Relation rel, Buffer buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (needswal) { XLogRecPtr recptr; xl_btree_delete xlrec_delete; xlrec_delete.latestRemovedXid = latestRemovedXid; xlrec_delete.ndeleted = ndeletable; + xlrec_delete.nupdated = nupdatable; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_STANDARD); XLogRegisterData((char *) &xlrec_delete, SizeOfBtreeDelete); - /* - * The deletable array is not in the buffer, but pretend that it is. - * When XLogInsert stores the whole buffer, the array need not be - * stored too. 
- */ - XLogRegisterBufData(0, (char *) deletable, - ndeletable * sizeof(OffsetNumber)); + if (ndeletable > 0) + XLogRegisterBufData(0, (char *) deletable, + ndeletable * sizeof(OffsetNumber)); + + if (nupdatable > 0) + { + XLogRegisterBufData(0, (char *) updatedoffsets, + nupdatable * sizeof(OffsetNumber)); + XLogRegisterBufData(0, updatedbuf, updatedbuflen); + } recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE); @@ -1344,83 +1347,310 @@ } END_CRIT_SECTION(); + + /* can't leak memory here */ + if (updatedbuf != NULL) + pfree(updatedbuf); + /* free tuples allocated within _bt_delitems_update() */ + for (int i = 0; i < nupdatable; i++) + pfree(updatable[i]->itup); } /* - * Get the latestRemovedXid from the table entries pointed to by the non-pivot - * tuples being deleted. + * Set up state needed to delete TIDs from posting list tuples via "updating" + * the tuple. Performs steps common to both _bt_delitems_vacuum and + * _bt_delitems_delete. These steps must take place before each function's + * critical section begins. * - * This is a specialized version of index_compute_xid_horizon_for_tuples(). - * It's needed because btree tuples don't always store table TID using the - * standard index tuple header field. + * updatable and nupdatable are inputs, though note that we will use + * _bt_update_posting() to replace the original itup with a pointer to a final + * version in palloc()'d memory. Caller should free the tuples when it's done. + * + * The first nupdatable entries from updatedoffsets are set to the page offset + * number for posting list tuples that caller updates. This is mostly useful + * because caller may need to WAL-log the page offsets (though we always do + * this for caller out of convenience). + * + * Returns a buffer consisting of an array of xl_btree_update structs that + * describe the steps we perform here for caller (though only when needswal is + * true). Also sets *updatedbuflen to the final size of the buffer. This + * buffer is used by caller when WAL logging is required. 
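+ *
+ * For example, when two posting list tuples are updated, losing 3 TIDs and
+ * 1 TID respectively, the buffer holds an xl_btree_update header with
+ * ndeletedtids = 3, then three uint16 entries identifying which TIDs to
+ * remove from the first tuple's posting list, then a second header with
+ * ndeletedtids = 1 followed by a single uint16 entry.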
*/ -static TransactionId -_bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable) +static char * +_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, Size *updatedbuflen, + bool needswal) { - TransactionId latestRemovedXid = InvalidTransactionId; - int spacenhtids; - int nhtids; - ItemPointer htids; + char *updatedbuf = NULL; + Size buflen = 0; - /* Array will grow iff there are posting list tuples to consider */ - spacenhtids = ndeletable; - nhtids = 0; - htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids); - for (int i = 0; i < ndeletable; i++) + /* Shouldn't be called unless there's something to do */ + Assert(nupdatable > 0); + + for (int i = 0; i < nupdatable; i++) { - ItemId itemid; - IndexTuple itup; + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; - itemid = PageGetItemId(page, deletable[i]); - itup = (IndexTuple) PageGetItem(page, itemid); + /* Replace work area IndexTuple with updated version */ + _bt_update_posting(vacposting); - Assert(ItemIdIsDead(itemid)); - Assert(!BTreeTupleIsPivot(itup)); + /* Keep track of size of xl_btree_update for updatedbuf in passing */ + itemsz = SizeOfBtreeUpdate + vacposting->ndeletedtids * sizeof(uint16); + buflen += itemsz; - if (!BTreeTupleIsPosting(itup)) + /* Build updatedoffsets buffer in passing */ + updatedoffsets[i] = vacposting->updatedoffset; + } + + /* XLOG stuff */ + if (needswal) + { + Size offset = 0; + + /* Allocate, set final size for caller */ + updatedbuf = palloc(buflen); + *updatedbuflen = buflen; + for (int i = 0; i < nupdatable; i++) { - if (nhtids + 1 > spacenhtids) - { - spacenhtids *= 2; - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; + xl_btree_update update; - Assert(ItemPointerIsValid(&itup->t_tid)); - ItemPointerCopy(&itup->t_tid, &htids[nhtids]); - nhtids++; - } - else - { - int nposting = BTreeTupleGetNPosting(itup); + update.ndeletedtids = vacposting->ndeletedtids; + memcpy(updatedbuf + offset, &update.ndeletedtids, + SizeOfBtreeUpdate); + offset += SizeOfBtreeUpdate; - if (nhtids + nposting > spacenhtids) - { - spacenhtids = Max(spacenhtids * 2, nhtids + nposting); - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } - - for (int j = 0; j < nposting; j++) - { - ItemPointer htid = BTreeTupleGetPostingN(itup, j); - - Assert(ItemPointerIsValid(htid)); - ItemPointerCopy(htid, &htids[nhtids]); - nhtids++; - } + itemsz = update.ndeletedtids * sizeof(uint16); + memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); + offset += itemsz; } } - Assert(nhtids >= ndeletable); + return updatedbuf; +} - latestRemovedXid = - table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids); +/* + * Comparator used by _bt_delitems_delete_check() to restore deltids array + * back to its original leaf-page-wise sort order + */ +static int +_bt_delitems_cmp(const void *a, const void *b) +{ + TM_IndexDelete *indexdelete1 = (TM_IndexDelete *) a; + TM_IndexDelete *indexdelete2 = (TM_IndexDelete *) b; - pfree(htids); + if (indexdelete1->id > indexdelete2->id) + return 1; + if (indexdelete1->id < indexdelete2->id) + return -1; - return latestRemovedXid; + Assert(false); + + return 0; +} + +/* + * Try to delete item(s) from a btree leaf page during single-page cleanup. + * + * nbtree interface to table_compute_delete_for_tuples(). 
Deletes a subset of + * index tuples from caller's deltids array: those whose TIDs are found + * dead-to-all in the table (or marked dead-to-all up-front, which we only + * allow from our simple index deletion caller). We physically delete this + * subset from buf leaf page last of all. + * + * Simple index deletion caller only includes TIDs from index tuples marked + * LP_DEAD, as well as extra TIDs it found on the same leaf page that can be + * included without increasing the total number of distinct table blocks for + * the deletion operation as a whole. This approach often allows us to delete + * some extra index tuples that happen to be dead-to-all at little additional + * cost. The design probably only makes sense with a heap style tableam. + * This should still be okay when the table does not use a heap structure, + * though. In general, the tableam contract provides significant wiggle-room + * for tableams. This allows the tableam to opt out of most work, and to + * avoid any imaginable table I/O overhead that might be imposed by checking + * "extra" TIDs on our behalf. + * + * Bottom-up index deletion caller provides all the TIDs from the leaf page, + * without expecting that tableam will check most of them. The tableam has + * considerable discretion around which entries/blocks it checks (even more so + * than in the simple index deletion case), so once again we don't concern + * ourselves with the overhead for the tableam. We need only be concerned + * about providing relevant context to the tableam. + * + * Note: Caller must have added deltids entries (to delstate's array) in + * leaf-page-wise order: page offset number order, TID order among entries + * taken from the same posting list tuple (tiebreak on TID). This order is + * convenient to work with here. + * + * Note: We also rely on the id field of each deltids element "capturing" this + * original leaf-page-wise order. That is, we expect to be able to get back + * to the original leaf-page-wise order just by sorting deltids on the id + * field (tableam will sort deltids for its own reasons, so we'll need to put + * it back in leaf-page-wise order afterwards). + */ +void +_bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel, + TM_IndexDeleteOp *delstate) +{ + Page page = BufferGetPage(buf); + TransactionId latestRemovedXid; + OffsetNumber postingidxoffnum = InvalidOffsetNumber; + int ndeletable = 0, + nupdatable = 0; + OffsetNumber deletable[MaxIndexTuplesPerPage]; + BTVacuumPosting updatable[MaxIndexTuplesPerPage]; + + /* Use tableam interface to determine which tuples to delete first */ + latestRemovedXid = table_compute_delete_for_tuples(heapRel, delstate); + + /* Should not WAL-log latestRemovedXid unless it's required */ + if (!XLogStandbyInfoActive() || !RelationNeedsWAL(rel)) + latestRemovedXid = InvalidTransactionId; + + /* + * Construct a leaf-page-wise description of what _bt_delitems_delete() + * needs to do to physically delete index tuples from the page. + * + * Must sort deltids array to restore leaf-page-wise order (original order + * before call to tableam). This is the order that the loop expects. + * + * Note that deltids array might be a lot smaller now. It might even have + * no entries at all (with bottom-up deletion caller), in which case there + * is nothing left to do. 
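+ *
+ * (Each deltids entry was assigned an id equal to its position at the time
+ * it was added, in leaf-page-wise order, so an ascending sort on id is all
+ * it takes to get back to that order.)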
+ */ + qsort(delstate->deltids, delstate->ndeltids, sizeof(TM_IndexDelete), + _bt_delitems_cmp); + if (delstate->ndeltids == 0) + { + Assert(delstate->bottomup); + return; + } + + /* We definitely have to delete at least one index tuple (or one TID) */ + for (int i = 0; i < delstate->ndeltids; i++) + { + TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id; + OffsetNumber idxoffnum = dstatus->idxoffnum; + ItemId itemid = PageGetItemId(page, idxoffnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + int nestedi, + nitem; + BTVacuumPosting vacposting; + + Assert(OffsetNumberIsValid(idxoffnum)); + + if (idxoffnum == postingidxoffnum) + { + /* + * This deltid entry is a TID from a posting list tuple that has + * already been completely processed + */ + Assert(BTreeTupleIsPosting(itup)); + continue; + } + + if (!BTreeTupleIsPosting(itup)) + { + /* Plain non-pivot tuple */ + Assert(ItemPointerEquals(&itup->t_tid, &delstate->deltids[i].tid)); + if (dstatus->knowndeletable) + deletable[ndeletable++] = idxoffnum; + continue; + } + + /* + * itup is a posting list tuple whose lowest deltids entry (which may + * or may not be for the first TID from itup) is considered here now. + * We should process all of the deltids entries for the posting list + * together now, though (not just the lowest). Remember to skip over + * later itup-related entries during later iterations of outermost + * loop. + */ + postingidxoffnum = idxoffnum; /* Remember work in outermost loop */ + nestedi = i; /* Initialize for first itup deltids entry */ + vacposting = NULL; /* Describes final action for itup */ + nitem = BTreeTupleGetNPosting(itup); + for (int p = 0; p < nitem; p++) + { + ItemPointer ptid = BTreeTupleGetPostingN(itup, p); + int ptidcmp = -1; + + /* + * This nested loop reuses work across ptid TIDs taken from itup. + * We take advantage of the fact that both itup's TIDs and deltids + * entries (within a single itup/posting list grouping) must both + * be in ascending TID order. + */ + for (; nestedi < delstate->ndeltids; nestedi++) + { + TM_IndexDelete *tcdeltid = &delstate->deltids[nestedi]; + TM_IndexStatus *tdstatus = (delstate->status + tcdeltid->id); + + /* Stop once we get past all itup related deltids entries */ + Assert(tdstatus->idxoffnum >= idxoffnum); + if (tdstatus->idxoffnum != idxoffnum) + break; + + /* Skip past non-deletable itup related entries up front */ + if (!tdstatus->knowndeletable) + continue; + + /* Entry is first partial ptid match (or an exact match)? */ + ptidcmp = ItemPointerCompare(&tcdeltid->tid, ptid); + if (ptidcmp >= 0) + { + /* Greater than or equal (partial or exact) match... */ + break; + } + } + + /* ...exact ptid match to a deletable deltids entry? 
*/ + if (ptidcmp != 0) + continue; + + /* Exact match for deletable deltids entry -- ptid gets deleted */ + if (vacposting == NULL) + { + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + nitem * sizeof(uint16)); + vacposting->itup = itup; + vacposting->updatedoffset = idxoffnum; + vacposting->ndeletedtids = 0; + } + vacposting->deletetids[vacposting->ndeletedtids++] = p; + } + + /* Final decision on itup, a posting list tuple */ + + if (vacposting == NULL) + { + /* No TIDs to delete from itup -- do nothing */ + } + else if (vacposting->ndeletedtids == nitem) + { + /* Straight delete of itup (to delete all TIDs) */ + deletable[ndeletable++] = idxoffnum; + /* Turns out we won't need granular information */ + pfree(vacposting); + } + else + { + /* Delete some (but not all) TIDs from itup */ + Assert(vacposting->ndeletedtids > 0 && + vacposting->ndeletedtids < nitem); + updatable[nupdatable++] = vacposting; + } + } + + /* Physically delete tuples (or TIDs) using deletable (or updatable) */ + _bt_delitems_delete(rel, buf, latestRemovedXid, deletable, ndeletable, + updatable, nupdatable, heapRel); + + /* be tidy */ + for (int i = 0; i < nupdatable; i++) + pfree(updatable[i]); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index a3d757c28f..f8f10bbca1 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -209,7 +209,7 @@ btinsert(Relation rel, Datum *values, bool *isnull, itup = index_form_tuple(RelationGetDescr(rel), values, isnull); itup->t_tid = *ht_ctid; - result = _bt_doinsert(rel, itup, checkUnique, heapRel); + result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel); pfree(itup); diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 8730de25ed..d5d90cf696 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -49,7 +49,6 @@ #include "access/parallel.h" #include "access/relscan.h" #include "access/table.h" -#include "access/tableam.h" #include "access/xact.h" #include "access/xlog.h" #include "access/xloginsert.h" diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 2f5f14e527..4548c39dda 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -2108,7 +2108,9 @@ btoptions(Datum reloptions, bool validate) {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL, offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}, {"deduplicate_items", RELOPT_TYPE_BOOL, - offsetof(BTOptions, deduplicate_items)} + offsetof(BTOptions, deduplicate_items)}, + {"bottomup_delete_items", RELOPT_TYPE_BOOL, + offsetof(BTOptions, bottomup_delete_items)} }; diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 5135b800af..3e7289fe49 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -556,6 +556,47 @@ btree_xlog_dedup(XLogReaderState *record) UnlockReleaseBuffer(buf); } +static void +btree_xlog_updates(Page page, OffsetNumber *updatedoffsets, + xl_btree_update *updates, int nupdated) +{ + BTVacuumPosting vacposting; + IndexTuple origtuple; + ItemId itemid; + Size itemsz; + + for (int i = 0; i < nupdated; i++) + { + itemid = PageGetItemId(page, updatedoffsets[i]); + origtuple = (IndexTuple) PageGetItem(page, itemid); + + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + updates->ndeletedtids * sizeof(uint16)); + vacposting->updatedoffset = 
updatedoffsets[i]; + vacposting->itup = origtuple; + vacposting->ndeletedtids = updates->ndeletedtids; + memcpy(vacposting->deletetids, + (char *) updates + SizeOfBtreeUpdate, + updates->ndeletedtids * sizeof(uint16)); + + _bt_update_posting(vacposting); + + /* Overwrite updated version of tuple */ + itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); + if (!PageIndexTupleOverwrite(page, updatedoffsets[i], + (Item) vacposting->itup, itemsz)) + elog(PANIC, "failed to update partially dead item"); + + pfree(vacposting->itup); + pfree(vacposting); + + /* advance to next xl_btree_update from array */ + updates = (xl_btree_update *) + ((char *) updates + SizeOfBtreeUpdate + + updates->ndeletedtids * sizeof(uint16)); + } +} + static void btree_xlog_vacuum(XLogReaderState *record) { @@ -589,41 +630,7 @@ btree_xlog_vacuum(XLogReaderState *record) xlrec->nupdated * sizeof(OffsetNumber)); - for (int i = 0; i < xlrec->nupdated; i++) - { - BTVacuumPosting vacposting; - IndexTuple origtuple; - ItemId itemid; - Size itemsz; - - itemid = PageGetItemId(page, updatedoffsets[i]); - origtuple = (IndexTuple) PageGetItem(page, itemid); - - vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + - updates->ndeletedtids * sizeof(uint16)); - vacposting->updatedoffset = updatedoffsets[i]; - vacposting->itup = origtuple; - vacposting->ndeletedtids = updates->ndeletedtids; - memcpy(vacposting->deletetids, - (char *) updates + SizeOfBtreeUpdate, - updates->ndeletedtids * sizeof(uint16)); - - _bt_update_posting(vacposting); - - /* Overwrite updated version of tuple */ - itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); - if (!PageIndexTupleOverwrite(page, updatedoffsets[i], - (Item) vacposting->itup, itemsz)) - elog(PANIC, "failed to update partially dead item"); - - pfree(vacposting->itup); - pfree(vacposting); - - /* advance to next xl_btree_update from array */ - updates = (xl_btree_update *) - ((char *) updates + SizeOfBtreeUpdate + - updates->ndeletedtids * sizeof(uint16)); - } + btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated); } if (xlrec->ndeleted > 0) @@ -675,7 +682,22 @@ btree_xlog_delete(XLogReaderState *record) page = (Page) BufferGetPage(buffer); - PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); + if (xlrec->nupdated > 0) + { + OffsetNumber *updatedoffsets; + xl_btree_update *updates; + + updatedoffsets = (OffsetNumber *) + (ptr + xlrec->ndeleted * sizeof(OffsetNumber)); + updates = (xl_btree_update *) ((char *) updatedoffsets + + xlrec->nupdated * + sizeof(OffsetNumber)); + + btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated); + } + + if (xlrec->ndeleted > 0) + PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); /* Mark the page as not containing any LP_DEAD items */ opaque = (BTPageOpaque) PageGetSpecialPointer(page); diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c index e099107f91..a3d81a94a7 100644 --- a/src/backend/access/rmgrdesc/nbtdesc.c +++ b/src/backend/access/rmgrdesc/nbtdesc.c @@ -63,8 +63,8 @@ btree_desc(StringInfo buf, XLogReaderState *record) { xl_btree_delete *xlrec = (xl_btree_delete *) rec; - appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u", - xlrec->latestRemovedXid, xlrec->ndeleted); + appendStringInfo(buf, "latestRemovedXid %u; ndeleted %u; nupdated %u", + xlrec->latestRemovedXid, xlrec->ndeleted, xlrec->nupdated); break; } case XLOG_BTREE_MARK_PAGE_HALFDEAD: diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c index 
58de0743ba..181fa8f2f8 100644 --- a/src/backend/access/table/tableamapi.c +++ b/src/backend/access/table/tableamapi.c @@ -66,7 +66,7 @@ GetTableAmRoutine(Oid amhandler) Assert(routine->tuple_tid_valid != NULL); Assert(routine->tuple_get_latest_tid != NULL); Assert(routine->tuple_satisfies_snapshot != NULL); - Assert(routine->compute_xid_horizon_for_tuples != NULL); + Assert(routine->compute_delete_for_tuples != NULL); Assert(routine->tuple_insert != NULL); diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 3a43c09bf6..c1f56cd657 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -1765,14 +1765,14 @@ psql_completion(const char *text, int start, int end) /* ALTER INDEX SET|RESET ( */ else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "(")) COMPLETE_WITH("fillfactor", - "vacuum_cleanup_index_scale_factor", "deduplicate_items", /* BTREE */ + "vacuum_cleanup_index_scale_factor", "deduplicate_items", "bottomup_delete_items", /* BTREE */ "fastupdate", "gin_pending_list_limit", /* GIN */ "buffering", /* GiST */ "pages_per_range", "autosummarize" /* BRIN */ ); else if (Matches("ALTER", "INDEX", MatchAny, "SET", "(")) COMPLETE_WITH("fillfactor =", - "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", /* BTREE */ + "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", "bottomup_delete_items =", /* BTREE */ "fastupdate =", "gin_pending_list_limit =", /* GIN */ "buffering =", /* GiST */ "pages_per_range =", "autosummarize =" /* BRIN */ diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml index bb395e6a85..9062a3f80d 100644 --- a/doc/src/sgml/btree.sgml +++ b/doc/src/sgml/btree.sgml @@ -629,6 +629,104 @@ options(relopts local_relopts *) returns + + Bottom-up index deletion + + B-Tree indexes are not directly aware that under MVCC, there might + be multiple extant versions of the same logical table row; to an + index, each tuple is an independent object that needs its own index + entry. Version churn tuples may sometimes + accumulate and adversely affect query latency and throughput. This + typically occurs with UPDATE-heavy workloads + where most individual updates cannot apply the + HOT optimization. Changing the value of only + one column covered by one index during an UPDATE + always necessitates a new set of index tuples + — one for each and every index on the + table. Note in particular that this includes indexes that were not + logically modified by the UPDATE. + All indexes will need a successor physical index tuple that points + to the latest version in the table. Each new tuple within each + index will generally need to coexist with the original + updated tuple for a short period of time (typically + until some time after the UPDATE transaction + commits). + + + B-Tree indexes incrementally delete version churn index tuples by + performing bottom-up index deletion passes. + Each deletion pass is triggered in reaction to an anticipated + version churn page split. A page split will usually + be avoided, though it's possible that certain implementation-level + heuristics will fail to identify and delete even one garbage index + tuple (in which case a page split or deduplication pass resolves + the issue of an incoming new tuple not fitting on a leaf page). + The worst case number of versions that any index scan must traverse + (for any single logical row) is an important contributor to overall + system responsiveness and throughput. 
A bottom-up index deletion + pass targets suspected garbage tuples in a single leaf page based + on qualitative distinctions involving logical + rows and versions. This contrasts with the top-down + index cleanup performed by autovacuum workers, which is triggered + when certain quantitative table-level + thresholds are exceeded (see ). + Bottom-up index deletion is enabled by default. + + + + Not all deletion operations that are performed within B-Tree + indexes are bottom-up deletion operations. There is a distinct + category of index tuple deletion: simple index tuple + deletion. This is a deferred maintenance operation + that deletes known dead-to-all index tuples (those whose item + identifier's LP_DEAD bit is already set). Like + bottom-up index deletion, simple index deletion takes place at the + point that a page split is anticipated as a way of avoiding the + split. + + + Simple deletion is opportunistic in the sense that it can only + take place when recent index scans set the + LP_DEAD bits of affected items in passing. + Prior to PostgreSQL 14, the only + category of B-Tree deletion was simple deletion. The main + differences between it and bottom-up deletion are that only the + former is opportunistically driven by the activity of passing + index scans, while only the latter specifically targets version + churn from UPDATEs that do not logically modify + indexed columns. + + + + Bottom-up index deletion performs the vast majority of all garbage + index tuple cleanup for particular indexes with certain workloads. + This is expected with any B-Tree index that is subject to + significant version churn from UPDATEs that + rarely or never logically modify the columns that the index covers. + The average and worst case number of versions per logical row can + be kept low purely through targeted incremental bottom-up deletion + passes. However, an exhaustive clean sweep of the + index (i.e. index vacuuming by VACUUM) will + eventually be required as part of broad cleanup of the table and + all of its indexes taken together. This is necessary because the + core VACUUM implementation must conservatively + assume that bottom-up index deletion missed some remaining garbage + index tuples. VACUUM must never allow a table + TID to get recycled unless and + until it is completely certain that there is no remaining index + tuple containing the TID in question. + + + It's also possible for an index to never benefit from bottom-up + index deletion, even when other indexes on the same table greatly + benefit. The optimization should have almost no overhead for these + indexes. The bottomup_delete_items storage + parameter can be used to disable bottom-up index deletion within + individual indexes (simple index tuple deletion is not affected). + Disabling bottom-up index deletion isn't usually helpful. + + + Deduplication @@ -666,15 +764,17 @@ options(relopts local_relopts *) returns The deduplication process occurs lazily, when a new item is - inserted that cannot fit on an existing leaf page. This prevents - (or at least delays) leaf page splits. Unlike GIN posting list - tuples, B-Tree posting list tuples do not need to expand every time - a new duplicate is inserted; they are merely an alternative - physical representation of the original logical contents of the - leaf page. This design prioritizes consistent performance with - mixed read-write workloads. Most client applications will at least - see a moderate performance benefit from using deduplication. 
- Deduplication is enabled by default. + inserted that cannot fit on an existing leaf page, though only when + index tuple deletion could not free sufficient space for the new + item (typically deletion is briefly considered and then skipped + over). Unlike GIN posting list tuples, B-Tree posting list tuples + do not need to expand every time a new duplicate is inserted; they + are merely an alternative physical representation of the original + logical contents of the leaf page. This design prioritizes + consistent performance with mixed read-write workloads. Most + client applications will at least see a moderate performance + benefit from using deduplication. Deduplication is enabled by + default. CREATE INDEX and REINDEX @@ -702,25 +802,16 @@ options(relopts local_relopts *) returns deduplication isn't usually helpful. - B-Tree indexes are not directly aware that under MVCC, there might - be multiple extant versions of the same logical table row; to an - index, each tuple is an independent object that needs its own index - entry. Version duplicates may sometimes accumulate - and adversely affect query latency and throughput. This typically - occurs with UPDATE-heavy workloads where most - individual updates cannot apply the HOT - optimization (often because at least one indexed column gets - modified, necessitating a new set of index tuple versions — - one new tuple for each and every index). In - effect, B-Tree deduplication ameliorates index bloat caused by - version churn. Note that even the tuples from a unique index are - not necessarily physically unique when stored - on disk due to version churn. The deduplication optimization is - selectively applied within unique indexes. It targets those pages - that appear to have version duplicates. The high level goal is to - give VACUUM more time to run before an - unnecessary page split caused by version churn can - take place. + It is sometimes possible for unique indexes (as well as unique + constraints) to use deduplication. This allows leaf pages to + temporarily absorb extra version churn duplicates. + Deduplication in unique indexes augments bottom-up index deletion, + especially in cases where a long-running transaction holds a + snapshot that blocks garbage collection. The goal is to buy time + for the bottom-up index deletion strategy to become effective + again. Delaying page splits until a single long-running + transaction naturally goes away can allow a bottom-up deletion pass + to succeed where an earlier deletion pass failed. diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index 2054d5d943..7965055d4a 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] The fillfactor for an index is a percentage that determines how full the index method will try to pack index pages. For B-trees, leaf pages - are filled to this percentage during initial index build, and also + are filled to this percentage during initial index builds, and also when extending the index at the right (adding new largest key values). If pages subsequently become completely full, they will be split, leading to - gradual degradation in the index's efficiency. B-trees use a default + fragmentation of the on-disk index structure. B-trees use a default fillfactor of 90, but any integer value from 10 to 100 can be selected.
- If the table is static then fillfactor 100 is best to minimize the - index's physical size, but for heavily updated tables a smaller - fillfactor is better to minimize the need for page splits. The - other index methods use fillfactor in different but roughly analogous - ways; the default fillfactor varies between methods. + + + B-tree indexes on tables where many inserts and/or updates are + anticipated can benefit from lower fillfactor settings at + CREATE INDEX time (following bulk loading into the + table). Values in the range of 50 - 90 can usefully smooth + out the rate of page splits during the + early life of the B-tree index (lowering fillfactor like this may even + lower the absolute number of page splits, though this effect is highly + workload dependent). The B-tree bottom-up index deletion technique + described in is dependent on having + some extra space on pages to store extra + tuple versions, and so can be affected by fillfactor (though the effect + is usually not significant). + + + In other specific cases it might be useful to increase fillfactor to + 100 at CREATE INDEX time as a way of maximizing + space utilization. You should only consider this when you are + completely sure that the table is static (i.e. that it will never be + affected by either inserts or updates). A fillfactor setting of 100 + otherwise risks harming performance: even a few + updates or inserts will cause a sudden flood of page splits. + + + The other index methods use fillfactor in different but roughly + analogous ways; the default fillfactor varies between methods. @@ -407,6 +429,25 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] + + bottomup_delete_items (boolean) + + bottomup_delete_items storage parameter + + + + + Controls usage of the B-tree bottom-up index deletion technique + described in . Set to + ON or OFF to enable or + disable the optimization. (Alternative spellings of + ON and OFF are allowed as + described in .). The default is + ON. + + + + deduplicate_items (boolean) @@ -418,10 +459,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] . Set to ON or OFF to enable or - disable the optimization. (Alternative spellings of - ON and OFF are allowed as - described in .) The default is - ON. + disable the optimization. The default is ON. -- 2.27.0
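Illustrative usage of the new reloption documented above (not part of the patch itself; the table and index names are made up for the example). As with deduplicate_items, bottomup_delete_items is an ordinary boolean storage parameter, so it can be set at CREATE INDEX time and changed or reset later with ALTER INDEX:

    -- hypothetical index; disable bottom-up deletion for it at creation time
    CREATE INDEX orders_status_idx ON orders (status)
      WITH (bottomup_delete_items = off);

    -- turn it back on later, or revert to the default (on)
    ALTER INDEX orders_status_idx SET (bottomup_delete_items = on);
    ALTER INDEX orders_status_idx RESET (bottomup_delete_items);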