From 9ac3ca3acb0930010a5c6aeb75c4cafd75e9ff06 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Wed, 25 Mar 2026 16:48:43 -0400
Subject: [PATCH v23 2/8] Add amgetbatch interface and adopt it in nbtree.

Add a new amgetbatch index AM interface that allows index access
methods to implement plain index scans and index-only scans that return
index entries in batches comprising all matching items from an index
page, rather than one match at a time. Also switch nbtree over from
amgettuple to the new amgetbatch interface.

The new interface allows the table AM to apply knowledge of which TIDs
will be returned to the scan in the near future to perform
optimizations like I/O prefetching. Prefetching is set to be added by
an upcoming commit.

With amgetbatch, a scan-level policy determines whether each batch's
index page buffer pin is dropped eagerly by the index AM (for plain
scans with an MVCC snapshot, where the snapshot itself prevents TID
recycling problems) or retained as an interlock against concurrent TID
recycling by VACUUM. The interlock is retained for plain non-MVCC
scans and for index-only scans, and is dropped by the table AM via the
new amunguardbatch callback when it is safe to do so. (Actually, index
AMs are usually able to drop the pin at the same time that they release
the lock. In practice, the amunguardbatch callback is only really
needed during index-only scans, where dropping the pin interlock might
need to be delayed ever so slightly, as explained below.)

This extends the dropPin mechanism added to nbtree by commit 2ed5b87f,
and generalizes it to work with all index AMs that support the new
amgetbatch interface (LP_DEAD marking of index entries must be
performed by implementing the new amkillitemsbatch callback, which has
a documented contract describing how index AMs must reason about
concurrent TID recycling). Scans can always safely drop index page
pins eagerly, provided the scan uses an MVCC snapshot (unlike the
nbtree dropPin optimization, which had no way of doing this safely
during index-only scans due to how amgettuple works, and only gained
support for scans of unlogged relations in recent commit 8a879119).

The old ammarkpos and amrestrpos index AM callbacks are removed. With
amgetbatch, mark/restore of scan positions is managed by the table AM,
with help from indexbatch.c utility functions, rather than being
delegated to the index AM. All amgetbatch-capable index AMs inherently
support mark/restore without needing to implement it themselves. Table
AMs must provide the new index_fetch_markpos and index_fetch_restrpos
callbacks to make all this work.

An upcoming commit that will add index prefetching will use a read
stream to read heap pages during index scans. The read stream is
careful to limit how many buffers it pins at any one time, lest we run
into problems due to having too many buffers pinned. Simply never
holding on to index page buffer pins greatly simplifies resource
management for index prefetching; there's no risk of unintended
interactions between the read stream and the index AM. The only
downside is that we cannot support prefetching during scans that use a
non-MVCC snapshot, which seems quite acceptable.

In practice, heapam doesn't drop each batch's index page buffer pin at
the earliest opportunity during index-only scans. This was deemed
necessary to avoid regressing index-only scans with a LIMIT, in
particular with nestloop anti-joins and nestloop semi-joins; eagerly
loading all the visibility information up front regressed such queries.
The new amgetbatch interface gives table AMs the authority to decide
when to drop index page pins/unguard batches, so this can be considered
a heapam implementation detail (index AMs don't need to know about it).
This scheme still allows index prefetching to consistently hold no more
than one batch index page pin at a time, even when an index-only scan
(that must perform some heap fetches) holds open several index batches
at once in order to maintain an adequate prefetch distance.

Index access methods that support plain index scans must now implement
either the amgetbatch interface or the amgettuple interface (not both).
An upcoming patch will add support for amgetbatch to the hash index AM.
But the amgettuple interface will still be used by the GiST and SP-GiST
index AMs for now. Both share a set of problems that make it unclear
how to go about adding support. Both AMs reconstruct index data as
HeapTuples via heap_form_tuple during index-only scans, performing
retail palloc allocations that are incompatible with the flat,
fixed-size, recyclable per-batch memory model that amgetbatch's
currTuples workspace requires. Moreover, both AMs have known bugs
involving buffer pin management during index-only scans: they release
index leaf page pins immediately, rather than holding them as an
interlock against concurrent TID recycling by VACUUM, creating a race
condition in which VACUUM can remove a heap tuple and then mark its
page all-visible while the index-only scan still holds a reference to
the now-recycled TID [1]. These index AMs cannot adopt amgetbatch
without first fixing the pin-handling deficiency that they already have
under amgettuple (it's not clear how to fix the problem within the
confines of the current amgettuple design, let alone in a way that's
compatible with amgetbatch).
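To illustrate the consumer side of the new interface, here is a minimal
sketch (not part of the patch) of the per-tuple loop a table AM can run
on top of the indexbatch.c utilities. The function name toy_getnext_tid
is made up for illustration; heapam's real loop
(heapam_index_getnext_scanbatch_pos, below) additionally trims the ring
buffer's head batch and resolves visibility info for index-only scans,
both of which this sketch omits:

    #include "access/indexbatch.h"
    #include "access/relscan.h"

    /* Hypothetical table AM helper: return next matching TID, or NULL */
    static ItemPointer
    toy_getnext_tid(IndexScanDesc scan, ScanDirection dir)
    {
        BatchRingItemPos *pos = &scan->batchringbuf.scanPos;
        IndexScanBatch batch = NULL;

        if (pos->valid)
        {
            /* Keep consuming items from the current batch */
            batch = index_scan_batch(scan, pos->batch);
            if (index_scan_pos_advance(dir, batch, pos))
                return &batch->items[pos->item].tableTid;
        }

        /* Current batch exhausted (or first call); ask index AM for more */
        batch = tableam_util_fetch_next_batch(scan, dir, batch, pos);
        if (batch == NULL)
            return NULL;        /* no more matches in this direction */

        /* Move the read position to the start of the new batch */
        index_scan_pos_nextbatch(dir, batch, pos);
        return &batch->items[pos->item].tableTid;
    }

The same structure appears in heapam, except that once the position
moves on to a new batch, the obsolescent head batch is freed via
tableam_util_free_batch.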
[1] https://postgr.es/m/CAH2-Wz%3DjjiNL9FCh8C1L-GUH15f4WFTWub2x%2B_NucngcDDcHKw%40mail.gmail.com Author: Tomas Vondra Author: Peter Geoghegan Reviewed-By: Andres Freund Reviewed-By: Thomas Munro Discussion: https://postgr.es/m/cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com Discussion: https://postgr.es/m/efac3238-6f34-41ea-a393-26cc0441b506%40vondra.me --- src/include/access/amapi.h | 27 +- src/include/access/genam.h | 1 + src/include/access/heapam.h | 6 + src/include/access/indexbatch.h | 204 +++++ src/include/access/nbtree.h | 190 ++--- src/include/access/relscan.h | 345 ++++++++- src/include/access/tableam.h | 77 +- src/include/nodes/pathnodes.h | 6 +- src/backend/access/brin/brin.c | 6 +- src/backend/access/gin/ginget.c | 6 +- src/backend/access/gin/ginutil.c | 6 +- src/backend/access/gist/gist.c | 6 +- src/backend/access/hash/hash.c | 6 +- src/backend/access/heap/heapam_handler.c | 3 + src/backend/access/heap/heapam_indexscan.c | 542 ++++++++++++- src/backend/access/index/Makefile | 3 +- src/backend/access/index/amapi.c | 5 + src/backend/access/index/genam.c | 5 + src/backend/access/index/indexam.c | 54 +- src/backend/access/index/indexbatch.c | 726 ++++++++++++++++++ src/backend/access/index/meson.build | 1 + src/backend/access/nbtree/README | 74 +- src/backend/access/nbtree/nbtpage.c | 13 +- src/backend/access/nbtree/nbtreadpage.c | 207 +++-- src/backend/access/nbtree/nbtree.c | 469 +++++------ src/backend/access/nbtree/nbtsearch.c | 567 ++++++-------- src/backend/access/nbtree/nbtutils.c | 245 ------ src/backend/access/nbtree/nbtxlog.c | 6 +- src/backend/access/spgist/spgutils.c | 6 +- src/backend/access/table/tableamapi.c | 3 + src/backend/commands/indexcmds.c | 2 +- src/backend/executor/execAmi.c | 2 +- src/backend/executor/nodeMergejoin.c | 4 +- src/backend/optimizer/path/indxpath.c | 6 +- src/backend/optimizer/util/plancat.c | 8 +- src/backend/replication/logical/relation.c | 9 +- src/backend/utils/adt/amutils.c | 8 +- contrib/bloom/blutils.c | 6 +- doc/src/sgml/indexam.sgml | 571 ++++++++++++-- doc/src/sgml/ref/create_table.sgml | 13 +- .../modules/dummy_index_am/dummy_index_am.c | 6 +- src/tools/pgindent/typedefs.list | 12 +- 42 files changed, 3164 insertions(+), 1298 deletions(-) create mode 100644 src/include/access/indexbatch.h create mode 100644 src/backend/access/index/indexbatch.c diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h index ecfbd017d..fdfd8d600 100644 --- a/src/include/access/amapi.h +++ b/src/include/access/amapi.h @@ -198,6 +198,19 @@ typedef void (*amrescan_function) (IndexScanDesc scan, typedef bool (*amgettuple_function) (IndexScanDesc scan, ScanDirection direction); +/* next batch of valid tuples */ +typedef IndexScanBatch (*amgetbatch_function) (IndexScanDesc scan, + IndexScanBatch priorbatch, + ScanDirection direction); + +/* drop TID recycling interlock held to prevent concurrent VACUUM recycling */ +typedef void (*amunguardbatch_function) (IndexScanDesc scan, + IndexScanBatch batch); + +/* mark dead items in index page */ +typedef void (*amkillitemsbatch_function) (IndexScanDesc scan, + IndexScanBatch batch); + /* fetch all valid tuples */ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan, TIDBitmap *tbm); @@ -205,11 +218,9 @@ typedef int64 (*amgetbitmap_function) (IndexScanDesc scan, /* end index scan */ typedef void (*amendscan_function) (IndexScanDesc scan); -/* mark current scan position */ -typedef void (*ammarkpos_function) (IndexScanDesc scan); - -/* restore marked scan position */ -typedef void 
(*amrestrpos_function) (IndexScanDesc scan); +/* invalidate index AM state that independently tracks scan's position */ +typedef void (*amposreset_function) (IndexScanDesc scan, + IndexScanBatch batch); /* * Callback function signatures - for parallel index scans. @@ -309,10 +320,12 @@ typedef struct IndexAmRoutine ambeginscan_function ambeginscan; amrescan_function amrescan; amgettuple_function amgettuple; /* can be NULL */ + amgetbatch_function amgetbatch; /* can be NULL */ + amunguardbatch_function amunguardbatch; /* can be NULL */ + amkillitemsbatch_function amkillitemsbatch; /* can be NULL */ amgetbitmap_function amgetbitmap; /* can be NULL */ amendscan_function amendscan; - ammarkpos_function ammarkpos; /* can be NULL */ - amrestrpos_function amrestrpos; /* can be NULL */ + amposreset_function amposreset; /* can be NULL */ /* interface functions to support parallel index scans */ amestimateparallelscan_function amestimateparallelscan; /* can be NULL */ diff --git a/src/include/access/genam.h b/src/include/access/genam.h index db62e0ca1..ae587f8de 100644 --- a/src/include/access/genam.h +++ b/src/include/access/genam.h @@ -96,6 +96,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state); /* struct definitions appear in relscan.h */ typedef struct IndexScanDescData *IndexScanDesc; +typedef struct IndexScanBatchData *IndexScanBatch; typedef struct SysScanDescData *SysScanDesc; typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc; diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 3ca42eb93..79976f50f 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -130,6 +130,7 @@ typedef struct IndexFetchHeapData /* For visibility map checks (index-only scans and on-access pruning) */ Buffer xs_vmbuffer; /* visibility map buffer */ + int xs_vm_items; /* # items to resolve visibility info for */ } IndexFetchHeapData; /* Result codes for HeapTupleSatisfiesVacuum */ @@ -434,8 +435,13 @@ extern bool heapam_fetch_tid(Relation rel, ItemPointer tid, Snapshot snapshot, TupleTableSlot *slot, bool *all_dead); extern IndexFetchTableData *heapam_index_fetch_begin(IndexScanDesc scan, uint32 flags); +extern void heapam_index_fetch_batch_init(IndexScanDesc scan, + IndexScanBatch batch, + bool new_alloc); extern void heapam_index_fetch_reset(IndexScanDesc scan); extern void heapam_index_fetch_end(IndexScanDesc scan); +extern void heapam_index_fetch_markpos(IndexScanDesc scan); +extern void heapam_index_fetch_restrpos(IndexScanDesc scan); extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, Snapshot snapshot, HeapTuple heapTuple, bool *all_dead, bool first_call); diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h new file mode 100644 index 000000000..9f87fb96a --- /dev/null +++ b/src/include/access/indexbatch.h @@ -0,0 +1,204 @@ +/*------------------------------------------------------------------------- + * + * indexbatch.h + * Batch-based index scan infrastructure for the amgetbatch interface. + * + * Provides functions used by table AMs to manage an index scan's positional + * state (stored in IndexScanDesc.batchringbuf), and to manage underlying + * resources such as memory and buffer pins. Also provides various utility + * functions used by index AMs for batch resource management. + * + * This module does not provide elementary operations for manipulating the + * scan's ring buffer (e.g., for appending a batch). 
Those are implemented as + * inline functions defined beside IndexScanDesc and IndexScanBatch. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/access/indexbatch.h + * + *------------------------------------------------------------------------- + */ +#ifndef INDEXBATCH_H +#define INDEXBATCH_H + +#include "access/amapi.h" +#include "access/genam.h" +#include "access/relscan.h" +#include "storage/buf.h" +#include "utils/rel.h" + +/* ---------------------------------------------------------------------------- + * Utilities called by table AMs + * ---------------------------------------------------------------------------- + */ + +/* + * Sets up the batch ring buffer structure for use by an index scan. + * + * Called from table AM's index_fetch_begin callback during amgetbatch scans. + */ +static inline void +tableam_util_batchscan_init(IndexScanDesc scan) +{ + Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); + + scan->batchringbuf.scanPos.valid = false; + scan->batchringbuf.markPos.valid = false; + + scan->batchringbuf.markBatch = NULL; + scan->batchringbuf.headBatch = 0; + scan->batchringbuf.nextBatch = 0; + + scan->usebatchring = true; +} + +extern void tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan); +extern void tableam_util_batchscan_end(IndexScanDesc scan); +extern void tableam_util_batchscan_mark_pos(IndexScanDesc scan); +extern void tableam_util_batchscan_restore_pos(IndexScanDesc scan); +extern void tableam_util_scanbatch_dirchange(IndexScanDesc scan); +extern void tableam_util_scanpos_killitem(IndexScanDesc scan); +extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch); +extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch); + +/* + * Fetch the next batch of matching items for the scan (or the first). + * + * Called when caller's current batch (passed to us as priorBatch) has no more + * matching items in the given scan direction. Caller passes a NULL + * priorBatch on the first call here for the scan. + * + * Returns the next batch to be processed by caller in the given scan + * direction, or NULL when there are no more matches in that direction. + * + * This is where batches are appended to the scan's ring buffer. We don't + * free any batches here, though; that is left up to the caller. The caller + * is also responsible for advancing their position. + */ +static pg_attribute_always_inline IndexScanBatch +tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction, + IndexScanBatch priorBatch, BatchRingItemPos *pos) +{ + IndexScanBatch batch = NULL; + BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf; + + if (!priorBatch) + { + /* First call for the scan */ + Assert(pos == &batchringbuf->scanPos); + } + else if (unlikely(priorBatch->dir != direction)) + { + /* + * We detected a change in scan direction across batches. Prepare + * scan's batchringbuf state for us to get the next batch for the + * opposite scan direction to the one used when priorBatch was + * returned by amgetbatch. 
+ */ + tableam_util_scanbatch_dirchange(scan); + + /* priorBatch is now batchringbuf's only batch */ + Assert(pos->batch == batchringbuf->headBatch); + Assert(index_scan_batch_count(scan) == 1); + } + else if (index_scan_batch_loaded(scan, pos->batch + 1)) + { + /* Next batch already loaded for us */ + batch = index_scan_batch(scan, pos->batch + 1); + + Assert(priorBatch->dir == direction); + Assert(batch->dir == direction); + Assert(batch->firstItem <= batch->lastItem); + return batch; + } + + /* + * Assert preconditions for calling amgetbatch. + * + * priorBatch had better be for the last valid batch currently in the ring + * buffer (batches must stay in scan order). If it isn't then we should + * have already returned some existing loaded batch earlier. + */ + Assert(!index_scan_batch_full(scan)); + Assert(!priorBatch || + (index_scan_batch_count(scan) > 0 && priorBatch->dir == direction && + index_scan_batch(scan, batchringbuf->nextBatch - 1) == priorBatch)); + + /* + * Before we call amgetbatch again, check if priorBatch is already known + * to be the last batch with matching items in this scan direction + */ + if (priorBatch && + (ScanDirectionIsForward(direction) ? + priorBatch->knownEndForward : + priorBatch->knownEndBackward)) + return NULL; + + batch = scan->indexRelation->rd_indam->amgetbatch(scan, priorBatch, + direction); + if (batch) + { + /* We got the batch from the index AM */ + Assert(batch->dir == direction); + Assert(batch->firstItem <= batch->lastItem); + + /* Append batch to the end of ring buffer/write it to buffer index */ + index_scan_batch_append(scan, batch); + + /* + * Theoretically we should set knownEndForward/knownEndBackward to + * false (whichever is used when moving in the opposite direction) + * when this is the scan's first returned batch. We don't bother + * because the index AM should always record that fact in its own + * opaque area. (These fields only exist because we don't want index + * AMs setting _any_ field from any priorbatch that we pass to them. + * Besides, it would be cumbersome for index AMs to keep track of + * which batch is the current amgetbatch call's original priorbatch.) + */ + } + else + { + /* amgetbatch returned NULL */ + if (priorBatch) + { + /* + * There are no further matches to be found in the current scan + * direction, following priorBatch. Remember that priorBatch is + * the last batch with matching items. + */ + if (ScanDirectionIsForward(direction)) + priorBatch->knownEndForward = true; + else + priorBatch->knownEndBackward = true; + } + } + + /* xs_hitup isn't currently supported by amgetbatch scans */ + Assert(!scan->xs_hitup); + + return batch; +} + +/* ---------------------------------------------------------------------------- + * Utilities called by index AMs + * ---------------------------------------------------------------------------- + */ +extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, + Buffer buf); +extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan); +extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch); + +/* + * Utility macro for accessing the index AM's per-batch opaque data. + * + * Each batch allocation places the index AM opaque area at a fixed negative + * offset from the IndexScanBatch pointer (see indexam_util_batch_alloc). + * This macro returns a typed pointer to that area, asserting that everybody + * has the same idea about where the index AM opaque area is in passing. 
+ */ +#define indexam_util_batch_get_amdata(scan, batch, type) \ + (AssertMacro((scan)->batch_index_opaque_size == MAXALIGN(sizeof(type))), \ + ((type *) ((char *) (batch) - MAXALIGN(sizeof(type))))) + +#endif /* INDEXBATCH_H */ diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index da7503c57..697cd7a54 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -15,6 +15,7 @@ #define NBTREE_H #include "access/amapi.h" +#include "access/indexbatch.h" #include "access/itup.h" #include "access/sdir.h" #include "catalog/pg_am_d.h" @@ -924,112 +925,6 @@ typedef struct BTVacuumPostingData typedef BTVacuumPostingData *BTVacuumPosting; -/* - * BTScanOpaqueData is the btree-private state needed for an indexscan. - * This consists of preprocessed scan keys (see _bt_preprocess_keys() for - * details of the preprocessing), information about the current location - * of the scan, and information about the marked location, if any. (We use - * BTScanPosData to represent the data needed for each of current and marked - * locations.) In addition we can remember some known-killed index entries - * that must be marked before we can move off the current page. - * - * Index scans work a page at a time: we pin and read-lock the page, identify - * all the matching items on the page and save them in BTScanPosData, then - * release the read-lock while returning the items to the caller for - * processing. This approach minimizes lock/unlock traffic. We must always - * drop the lock to make it okay for caller to process the returned items. - * Whether or not we can also release the pin during this window will vary. - * We drop the pin (when so->dropPin) to avoid blocking progress by VACUUM - * (see nbtree/README section about making concurrent TID recycling safe). - * We'll always release both the lock and the pin on the current page before - * moving on to its sibling page. - * - * If we are doing an index-only scan, we save the entire IndexTuple for each - * matched item, otherwise only its heap TID and offset. The IndexTuples go - * into a separate workspace array; each BTScanPosItem stores its tuple's - * offset within that array. Posting list tuples store a "base" tuple once, - * allowing the same key to be returned for each TID in the posting list - * tuple. - */ - -typedef struct BTScanPosItem /* what we remember about each match */ -{ - ItemPointerData heapTid; /* TID of referenced heap item */ - OffsetNumber indexOffset; /* index item's location within page */ - LocationIndex tupleOffset; /* IndexTuple's offset in workspace, if any */ -} BTScanPosItem; - -typedef struct BTScanPosData -{ - Buffer buf; /* currPage buf (invalid means unpinned) */ - - /* page details as of the saved position's call to _bt_readpage */ - BlockNumber currPage; /* page referenced by items array */ - BlockNumber prevPage; /* currPage's left link */ - BlockNumber nextPage; /* currPage's right link */ - XLogRecPtr lsn; /* currPage's LSN (when so->dropPin) */ - - /* scan direction for the saved position's call to _bt_readpage */ - ScanDirection dir; - - /* - * If we are doing an index-only scan, nextTupleOffset is the first free - * location in the associated tuple storage workspace. - */ - int nextTupleOffset; - - /* - * moreLeft and moreRight track whether we think there may be matching - * index entries to the left and right of the current page, respectively. - */ - bool moreLeft; - bool moreRight; - - /* - * The items array is always ordered in index order (ie, increasing - * indexoffset). 
When scanning backwards it is convenient to fill the - * array back-to-front, so we start at the last slot and fill downwards. - * Hence we need both a first-valid-entry and a last-valid-entry counter. - * itemIndex is a cursor showing which entry was last returned to caller. - */ - int firstItem; /* first valid index in items[] */ - int lastItem; /* last valid index in items[] */ - int itemIndex; /* current index in items[] */ - - BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */ -} BTScanPosData; - -typedef BTScanPosData *BTScanPos; - -#define BTScanPosIsPinned(scanpos) \ -( \ - AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ - !BufferIsValid((scanpos).buf)), \ - BufferIsValid((scanpos).buf) \ -) -#define BTScanPosUnpin(scanpos) \ - do { \ - ReleaseBuffer((scanpos).buf); \ - (scanpos).buf = InvalidBuffer; \ - } while (0) -#define BTScanPosUnpinIfPinned(scanpos) \ - do { \ - if (BTScanPosIsPinned(scanpos)) \ - BTScanPosUnpin(scanpos); \ - } while (0) - -#define BTScanPosIsValid(scanpos) \ -( \ - AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ - !BufferIsValid((scanpos).buf)), \ - BlockNumberIsValid((scanpos).currPage) \ -) -#define BTScanPosInvalidate(scanpos) \ - do { \ - (scanpos).buf = InvalidBuffer; \ - (scanpos).currPage = InvalidBlockNumber; \ - } while (0) - /* We need one of these for each equality-type SK_SEARCHARRAY scan key */ typedef struct BTArrayKeyInfo { @@ -1050,6 +945,43 @@ typedef struct BTArrayKeyInfo ScanKey high_compare; /* array's < or <= upper bound */ } BTArrayKeyInfo; +/* Per-batch data private to the btree index AM */ +typedef struct BTBatchData +{ + Buffer buf; /* leaf page's buffer pin */ + BlockNumber currPage; /* leaf page's block number */ + BlockNumber prevPage; /* leaf page's left sibling */ + BlockNumber nextPage; /* leaf page's right sibling */ + bool moreLeft; /* more pages of interest to the left? */ + bool moreRight; /* more pages of interest to the right? */ +} BTBatchData; + +/* Access the btree-private per-batch data from an IndexScanBatch pointer */ +#define BTBatchGetData(scan, batch) \ + indexam_util_batch_get_amdata(scan, batch, BTBatchData) + +/* + * BTScanOpaqueData is the btree-private state needed for an indexscan. + * This consists of preprocessed scan keys (see _bt_preprocess_keys() for + * details of the preprocessing), and information about the current array + * keys. There are assumptions about how the current array keys track the + * progress of the index scan through the index's key space (see _bt_readpage, + * btposreset, and _bt_advance_array_keys), but we don't track anything about + * the current scan position/batch in this opaque struct. + * + * Index scans work a page at a time, as required by the amgetbatch contract: + * we pin and read-lock the page, identify all the matching items on the page + * and return them in a newly allocated batch. We then release the read-lock + * using amgetbatch utility routines. This approach minimizes lock/unlock + * traffic. _bt_next is passed priorbatch, which has a BTBatchData area that + * tells us which page is next in line to be read in the given scan direction + * (this is often the same priorbatch passed to btgetbatch by core code). + * + * If we are doing an index-only scan, we save the entire IndexTuple for each + * matched item, otherwise only its table TID and offset. Posting list tuples + * store a "base" tuple once, allowing the same key to be used for each TID in + * the posting list. 
+ */ typedef struct BTScanOpaqueData { /* these fields are set by _bt_preprocess_keys(): */ @@ -1066,32 +998,6 @@ typedef struct BTScanOpaqueData BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */ FmgrInfo *orderProcs; /* ORDER procs for required equality keys */ MemoryContext arrayContext; /* scan-lifespan context for array data */ - - /* info about killed items if any (killedItems is NULL if never used) */ - int *killedItems; /* currPos.items indexes of killed items */ - int numKilled; /* number of currently stored items */ - bool dropPin; /* drop leaf pin before btgettuple returns? */ - - /* - * If we are doing an index-only scan, these are the tuple storage - * workspaces for the currPos and markPos respectively. Each is of size - * BLCKSZ, so it can hold as much as a full page's worth of tuples. - */ - char *currTuples; /* tuple storage for currPos */ - char *markTuples; /* tuple storage for markPos */ - - /* - * If the marked position is on the same page as current position, we - * don't use markPos, but just keep the marked itemIndex in markItemIndex - * (all the rest of currPos is valid for the mark position). Hence, to - * determine if there is a mark, first look at markItemIndex, then at - * markPos. - */ - int markItemIndex; /* itemIndex, or -1 if not valid */ - - /* keep these last in struct for efficiency */ - BTScanPosData currPos; /* current position data */ - BTScanPosData markPos; /* marked position, if any */ } BTScanOpaqueData; typedef BTScanOpaqueData *BTScanOpaque; @@ -1160,14 +1066,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull, extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys); extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys); extern void btinitparallelscan(void *target); -extern bool btgettuple(IndexScanDesc scan, ScanDirection dir); +extern IndexScanBatch btgetbatch(IndexScanDesc scan, + IndexScanBatch priorbatch, + ScanDirection dir); extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm); extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, ScanKey orderbys, int norderbys); +extern void btunguardbatch(IndexScanDesc scan, IndexScanBatch batch); +extern void btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch); extern void btparallelrescan(IndexScanDesc scan); extern void btendscan(IndexScanDesc scan); -extern void btmarkpos(IndexScanDesc scan); -extern void btrestrpos(IndexScanDesc scan); +extern void btposreset(IndexScanDesc scan, IndexScanBatch batch); extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback, @@ -1271,8 +1180,9 @@ extern void _bt_preprocess_keys(IndexScanDesc scan); /* * prototypes for functions in nbtreadpage.c */ -extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir, - OffsetNumber offnum, bool firstpage); +extern bool _bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, + ScanDirection dir, OffsetNumber offnum, + bool firstpage); extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir); extern int _bt_binsrch_array_skey(FmgrInfo *orderproc, bool cur_elem_trig, ScanDirection dir, @@ -1287,15 +1197,15 @@ extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP, int access, bool returnstack); extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate); extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum); -extern bool 
_bt_first(IndexScanDesc scan, ScanDirection dir); -extern bool _bt_next(IndexScanDesc scan, ScanDirection dir); +extern IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir); +extern IndexScanBatch _bt_next(IndexScanDesc scan, ScanDirection dir, + IndexScanBatch priorbatch); extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost); /* * prototypes for functions in nbtutils.c */ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup); -extern void _bt_killitems(IndexScanDesc scan); extern BTCycleId _bt_vacuum_cycleid(Relation rel); extern BTCycleId _bt_start_vacuum(Relation rel); extern void _bt_end_vacuum(Relation rel); diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 986b4f5f3..3421c83c1 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -126,6 +126,12 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker; */ typedef struct IndexFetchTableData { + /* Table AM per-batch opaque area size (MAXALIGN'd), set by AM */ + uint16 batch_opaque_size; + + /* Per-item trailing data size in each batch */ + uint16 batch_per_item_size; + /* * Bitmask of ScanOptions affecting the relation. No SO_INTERNAL_FLAGS are * permitted. @@ -133,13 +139,184 @@ typedef struct IndexFetchTableData uint32 flags; } IndexFetchTableData; +/* + * Location of a BatchMatchingItem within the scan's ring buffer + */ +typedef struct BatchRingItemPos +{ + /* Position references a valid IndexScanDescData.batchbuf[] entry? */ + bool valid; + + /* IndexScanDescData.batchbuf[]-wise index to relevant IndexScanBatch */ + uint8 batch; + + /* IndexScanBatch.items[]-wise index to relevant BatchMatchingItem */ + int item; + +} BatchRingItemPos; + +/* + * Matching item returned by amgetbatch (in returned IndexScanBatch) during an + * index scan. Used by table AM to locate relevant matching table tuple. + */ +typedef struct BatchMatchingItem +{ + ItemPointerData tableTid; /* TID of referenced table item */ + OffsetNumber indexOffset; /* index item's location within page */ + LocationIndex tupleOffset; /* index tuple's currTuples offset, if any */ +} BatchMatchingItem; + +/* + * Data about one batch of items returned by (and passed to) amgetbatch during + * index scans. + * + * Each batch allocation has the following memory layout: + * + * [table AM opaque area] <- allocation base, at -(batch_table_offset) + * [index AM opaque area] <- at -(batch_index_opaque_size) + * [IndexScanBatchData] <- batch pointer, returned by amgetbatch + * [items[maxitemsbatch]] + * [table AM trailing data] <- per-item area (e.g., for visibility info) + * [currTuples workspace] <- index AM stores index tuples here for + * index-only scans (batch_tuples_workspace) + * + * batch_table_offset combines both AM opaque sizes into a single offset from + * the batch pointer to the true allocation base. The indexbatch.c utilities + * pfree a batch by passing pfree a pointer returned by index_scan_batch_base. + * We rely on the assumption that batches have a fixed layout for the duration + * of an index scan (batches are cached for reuse to avoid palloc churn). + * + * The table AM can overlay a small fixed-size struct at the start of the + * allocated space, which it accesses using a index_scan_batch_base shim + * accessor function. Convention for table AMs is to store a pointer to its + * per-item area in this fixed-size area (e.g., heapam stores a visInfo + * pointer here), in addition to other information tracked at the batch level. 
+ * + * The index AM opaque area is accessed via a custom accessor that uses a + * fixed compile-time constant offset for efficiency (a constant that is + * tracked in the scan descriptor as batch_index_opaque_size). + */ +typedef struct IndexScanBatchData +{ + /* Index page's LSN, optionally used by amkillitemsbatch routines */ + XLogRecPtr lsn; + + /* scan direction when the index page was read */ + ScanDirection dir; + + /* + * knownEndBackward and knownEndForward indicate that this batch is the + * last one with matching items in the relevant scan direction. When + * amgetbatch returns NULL for a given direction, the corresponding flag + * is set on the priorbatch that was passed to that call. We cannot know + * this when a batch is first returned by amgetbatch; it only becomes + * apparent when we try and fail to continue the scan past it. + * + * This allows table AMs to avoid redundant amgetbatch calls with the same + * priorbatch -- the index AM might need to read additional index pages to + * determine there are no more matching items beyond caller's priorbatch. + */ + bool knownEndBackward; + bool knownEndForward; + + /* + * Batch still holds TID recycling interlock? + */ + bool isGuarded; + + /* + * Matching items state for this batch. Output by index AM for table AM. + * + * The items array is always ordered in index order (ie, by increasing + * indexoffset). When scanning backwards it is convenient for index AMs + * to fill the array back-to-front, starting at the last item slot and + * filling downwards. This is why we need both a first-valid-entry and a + * last-valid-entry counter. + * + * Note: these are signed because it's sometimes convenient to use -1 to + * represent an out-of-bounds space just before firstItem (when it's 0). + */ + int firstItem; /* first valid index in items[] */ + int lastItem; /* last valid index in items[] */ + + /* info about dead items, if any (palloc'd separately, NULL if unused) */ + int numDead; /* number of currently stored items */ + int *deadItems; /* items[]-wise indexes of dead items */ + + /* + * If we are doing an index-only scan, this is the tuple storage workspace + * for the matching tuples (tuples referenced by items[]). The workspace + * size is determined by the index AM (batch_tuples_workspace). + * + * currTuples points into the trailing portion of this allocation, past + * items[] and any table AM trailing data. It is NULL for plain index + * scans. + */ + char *currTuples; /* tuple storage for items[] */ + BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items */ +} IndexScanBatchData; + +typedef struct IndexScanBatchData *IndexScanBatch; + +/* + * State used by table AMs to manage an index scan that uses the amgetbatch + * interface. Scans use a ring buffer of batches returned by amgetbatch. + * + * This data structure provides table AMs with a way to read ahead of the + * current read position by _multiple_ batches/index pages. The further out + * the table AM reads ahead like this, the further it can see into the future. + * That way the table AM is able to reorder work as aggressively as desired. 
+ */ +typedef struct BatchRingBuffer +{ + /* current positions in IndexScanDescData.batchbuf[] for scan */ + BatchRingItemPos scanPos; /* scan's read position */ + BatchRingItemPos markPos; /* mark/restore position */ + + /* markPos's batch (not in ring buffer when markBatch != scanBatch) */ + IndexScanBatch markBatch; + + /* + * headBatch is an index to the earliest still-valid ring buffer batch + * slot in batchbuf[]. The actual array position for its IndexScanBatch + * is headBatch & (INDEX_SCAN_MAX_BATCHES - 1), since these indexes use + * unsigned wrapping arithmetic. headBatch must be the scan's current + * scanBatch (i.e. the current scanPos batch). + */ + uint8 headBatch; + + /* + * nextBatch is an index to the next _empty_ ring buffer batch slot in + * batchbuf[]. As with headBatch, the actual batchbuf[] array position is + * nextBatch & (INDEX_SCAN_MAX_BATCHES - 1). A new batch can only be + * appended to this position/slot when !index_scan_batch_full(). + * + * Note: the scan's most recently appended batch (its tail batch) is + * always located at (nextBatch - 1) & (INDEX_SCAN_MAX_BATCHES - 1). + */ + uint8 nextBatch; +} BatchRingBuffer; + struct IndexScanInstrumentation; /* * We use the same IndexScanDescData structure for both amgettuple-based * and amgetbitmap-based index scans. Some fields are only relevant in - * amgettuple-based scans. + * amgettuple-based scans. Others are only used in amgetbatch-based scans. + * + * The ring buffer used by amgetbatch scans is stored here as a fixed array of + * pointers to batches. We need a minimum of two ring buffer batches (but use + * INDEX_SCAN_MAX_BATCHES), since table AMs only remove a batch after they've + * already called amgetbatch again and appended the returned batch. */ +#define INDEX_SCAN_CACHE_BATCHES 2 +#define INDEX_SCAN_MAX_BATCHES 64 + +StaticAssertDecl(INDEX_SCAN_MAX_BATCHES <= PG_INT8_MAX + 1, + "index_scan_batch_loaded relies on int8 ring buffer arithmetic"); +StaticAssertDecl((INDEX_SCAN_MAX_BATCHES & (INDEX_SCAN_MAX_BATCHES - 1)) == 0, + "INDEX_SCAN_MAX_BATCHES must be a power of 2"); + typedef struct IndexScanDescData { /* scan parameters */ @@ -150,6 +327,26 @@ typedef struct IndexScanDescData int numberOfOrderBys; /* number of ordering operators */ struct ScanKeyData *keyData; /* array of index qualifier descriptors */ struct ScanKeyData *orderByData; /* array of ordering op descriptors */ + + /* index access method's private state */ + void *opaque; /* access-method-specific info */ + + /* scan's amgetbatch state (only used by amgetbatch/usebatchring scans) */ + BatchRingBuffer batchringbuf; + + /* + * Array of pointers to recyclable batches, used by all amgetbatch scans + * and by amgetbitmap scans of an index AM that supports amgetbatch + */ + IndexScanBatch batchcache[INDEX_SCAN_CACHE_BATCHES]; + + /* Array of pointers to batches, referenced within batchringbuf */ + IndexScanBatch batchbuf[INDEX_SCAN_MAX_BATCHES]; + + bool usebatchring; /* scan uses amgetbatch/batchringbuf? */ + bool batchImmediateUnguard; /* eagerly drop TID recycling + * interlock? */ + bool xs_want_itup; /* caller requests index tuples */ bool xs_temp_snap; /* unregister snapshot at scan end? */ @@ -158,9 +355,8 @@ typedef struct IndexScanDescData bool ignore_killed_tuples; /* do not return killed entries */ bool xactStartedInRecovery; /* prevents killing/seeing killed * tuples */ - - /* index access method's private state */ - void *opaque; /* access-method-specific info */ + /* xs_snapshot uses an MVCC snapshot? 
*/ + bool MVCCScan; /* * Instrumentation counters maintained by all index AMs during both @@ -186,7 +382,7 @@ typedef struct IndexScanDescData /* * Resolved table_index_getnext_slot callback, which is set by - * table_index_fetch_begin at the start of amgettuple scans + * table_index_fetch_begin at the start of amgetbatch/amgettuple scans */ bool (*xs_getnext_slot) (struct IndexScanDescData *scan, ScanDirection direction, @@ -194,6 +390,14 @@ typedef struct IndexScanDescData bool xs_recheck; /* T means scan keys must be rechecked */ + /* batch size information, set once by index AM in ambeginscan */ + uint16 maxitemsbatch; /* size of each batch's items[] array */ + uint16 batch_index_opaque_size; /* MAXALIGN'd index AM opaque size */ + uint16 batch_tuples_workspace; /* currTuples workspace size */ + + /* Computed offset, used to get table AM's opaque area from a batch */ + uint16 batch_table_offset; + /* * When fetching with an ordering operator, the values of the ORDER BY * expressions of the last returned tuple, according to the index. If @@ -237,4 +441,135 @@ typedef struct SysScanDescData struct TupleTableSlot *slot; } SysScanDescData; +/* + * How many batches are currently loaded in the ring buffer? + */ +static inline uint8 +index_scan_batch_count(IndexScanDescData *scan) +{ + return (uint8) (scan->batchringbuf.nextBatch - + scan->batchringbuf.headBatch); +} + +/* + * Do we already have a batch loaded at 'idx' offset in scan's ring buffer? + * + * NOTE: a stale batch idx can alias a currently-loaded range due to + * wraparound, producing a false positive. False negatives are not possible. + */ +static inline bool +index_scan_batch_loaded(IndexScanDescData *scan, uint8 idx) +{ + return (int8) (idx - scan->batchringbuf.headBatch) >= 0 && + (int8) (idx - scan->batchringbuf.nextBatch) < 0; +} + +/* + * Have we loaded the maximum number of batches? + */ +static inline bool +index_scan_batch_full(IndexScanDescData *scan) +{ + return index_scan_batch_count(scan) == INDEX_SCAN_MAX_BATCHES; +} + +/* + * Return batch for the provided index. + */ +static inline IndexScanBatch +index_scan_batch(IndexScanDescData *scan, uint8 idx) +{ + Assert(index_scan_batch_loaded(scan, idx)); + + return scan->batchbuf[idx & (INDEX_SCAN_MAX_BATCHES - 1)]; +} + +/* + * Append given batch to scan's batch ring buffer. + */ +static inline void +index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch) +{ + BatchRingBuffer *ringbuf = &scan->batchringbuf; + uint8 nextBatch = ringbuf->nextBatch; + + Assert(!index_scan_batch_full(scan)); + + scan->batchbuf[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch; + ringbuf->nextBatch++; +} + +/* + * Return the true allocation base of a batch (used to pfree batches) + */ +static inline void * +index_scan_batch_base(IndexScanDescData *scan, IndexScanBatch batch) +{ + return (char *) batch - scan->batch_table_offset; +} + +/* + * Advance position to its next item in the batch. + * + * Advance to the next item within the provided batch (or to the previous item, + * when scanning backwards). + * + * Returns true if the position could be advanced. Returns false when there + * are no more items from the batch remaining in the given scan direction. 
+ */ +static inline bool +index_scan_pos_advance(ScanDirection direction, + IndexScanBatch batch, BatchRingItemPos *pos) +{ + Assert(pos->valid); + + if (ScanDirectionIsForward(direction)) + { + if (++pos->item > batch->lastItem) + return false; + } + else /* ScanDirectionIsBackward */ + { + if (--pos->item < batch->firstItem) + return false; + } + + /* Advanced within batch */ + return true; +} + +/* + * Advance batch position to the start of its new batch. + * + * When we're called, this position should point to a batch that caller just + * finished consuming from. When we return, this position will point to + * nextBatch, the next batch from the ring buffer. We'll have also set the + * position's item offset to nextBatch's first item in the given direction + * (which is actually nextBatch's _last_ item when scanning backwards). + * + * nextBatch doesn't have to be (and often isn't) the most recently appended + * batch in the scan's ring buffer. It is merely the next batch in line to be + * consumed from the point of view of our caller. + */ +static inline void +index_scan_pos_nextbatch(ScanDirection direction, + IndexScanBatch nextBatch, BatchRingItemPos *pos) +{ + Assert(nextBatch->dir == direction); + Assert(nextBatch->firstItem <= nextBatch->lastItem); + + /* Increment batch (might wrap), or initialize it to zero */ + if (pos->valid) + pos->batch++; + else + pos->batch = 0; + + pos->valid = true; + + if (ScanDirectionIsForward(direction)) + pos->item = nextBatch->firstItem; + else + pos->item = nextBatch->lastItem; +} + #endif /* RELSCAN_H */ diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 62016fd0b..2e23a2eb5 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -452,10 +452,11 @@ typedef struct TableAmRoutine * flags is a bitmask of ScanOptions affecting underlying table scan * behavior. See scan_begin() for more information on passing these. * - * Callback is responsible for setting IndexScanDesc.xs_getnext_slot to - * the appropriate slot-based callback. Tuples can then be fetched via - * table_index_getnext_slot(). No separate slot-based callback exists in - * this struct! + * Callback is responsible for initializing the scan's batch ring buffer + * (when the scan's index AM supports the amgetbatch interface), and for + * setting IndexScanDesc.xs_getnext_slot to the appropriate slot-based + * callback. Tuples can then be fetched via table_index_getnext_slot(). + * No separate slot-based callback exists in this struct! * * In principle a single general-purpose callback (stored here) would * suffice, but using specialized variants allows the table AM to provide @@ -467,10 +468,31 @@ typedef struct TableAmRoutine * columns do not change, need to return the current/correct version of * the tuple that is visible to the snapshot, even if the tid points to an * older version of the tuple. + * + * Callback also initializes the batch_opaque_size and batch_per_item_size + * fields in the returned struct, to let the core code know how much + * memory will be required in the opaque table AM portions of each batch + * allocation (during amgetbatch index scans). Table AMs can store things + * like per-item visibility information in each allocated batch. See + * relscan.h for details. */ struct IndexFetchTableData *(*index_fetch_begin) (IndexScanDesc scan, uint32 flags); + /* + * Initialize table AM's per-batch opaque area within a batch allocation. + * + * Called by indexam_util_batch_alloc for each new or recycled batch. 
+ * Table AMs should set up its opaque area (at a negative offset from the + * batch pointer) and any trailing per-item data (e.g. visibility flags). + * + * 'new_alloc' is true for freshly palloc'd batches, false for batches + * recycled from the cache. + */ + void (*index_fetch_batch_init) (IndexScanDesc scan, + IndexScanBatch batch, + bool new_alloc); + /* * Reset index scan for a rescan. Resets table-owned resources. */ @@ -481,6 +503,16 @@ typedef struct TableAmRoutine */ void (*index_fetch_end) (IndexScanDesc scan); + /* + * Mark the current scan position so it can be restored later + */ + void (*index_fetch_markpos) (IndexScanDesc scan); + + /* + * Restore a previously marked scan position + */ + void (*index_fetch_restrpos) (IndexScanDesc scan); + /* ------------------------------------------------------------------------ * Callbacks for non-modifying operations on individual tuples * ------------------------------------------------------------------------ @@ -1273,6 +1305,28 @@ table_index_fetch_reset(IndexScanDesc scan) scan->heapRelation->rd_tableam->index_fetch_reset(scan); } +/* + * Mark the current scan position so it can be restored later + */ +static inline void +table_index_fetch_markpos(IndexScanDesc scan) +{ + Assert(scan->xs_heapfetch); + + scan->heapRelation->rd_tableam->index_fetch_markpos(scan); +} + +/* + * Restore a previously marked scan position + */ +static inline void +table_index_fetch_restrpos(IndexScanDesc scan) +{ + Assert(scan->xs_heapfetch); + + scan->heapRelation->rd_tableam->index_fetch_restrpos(scan); +} + /* * Release resources and deallocate the IndexFetchTableData in the scan. */ @@ -1284,6 +1338,21 @@ table_index_fetch_end(IndexScanDesc scan) scan->heapRelation->rd_tableam->index_fetch_end(scan); } +/* + * Initialize table AM's per-batch opaque area within a batch allocation. + * + * Called by indexam_util_batch_alloc for each new or recycled batch. + */ +static inline void +table_index_fetch_batch_init(IndexScanDesc scan, IndexScanBatch batch, + bool new_alloc) +{ + Assert(scan->xs_heapfetch); + + scan->heapRelation->rd_tableam->index_fetch_batch_init(scan, batch, + new_alloc); +} + /* * Fetch the next tuple from an index scan into `slot`, scanning in the * specified direction. Returns true if a tuple satisfying the scan keys and diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index 693b879f7..85991d447 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -1437,12 +1437,12 @@ typedef struct IndexOptInfo bool amoptionalkey; bool amsearcharray; bool amsearchnulls; - /* does AM have amgettuple interface? */ - bool amhasgettuple; + /* does AM have amgetbatch (or gettuple) interface? */ + bool amcanplainscan; /* does AM have amgetbitmap interface? */ bool amhasgetbitmap; bool amcanparallel; - /* does AM have ammarkpos interface? */ + /* is AM prepared for us to restore a mark? 
*/ bool amcanmarkpos; /* AM's cost estimator */ /* Rather than include amapi.h here, we declare amcostestimate like this */ diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index bdb30752e..9f44cf31e 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -298,10 +298,12 @@ brinhandler(PG_FUNCTION_ARGS) .ambeginscan = brinbeginscan, .amrescan = brinrescan, .amgettuple = NULL, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = bringetbitmap, .amendscan = brinendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c index 6b148e69a..8f7033d62 100644 --- a/src/backend/access/gin/ginget.c +++ b/src/backend/access/gin/ginget.c @@ -1953,9 +1953,9 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm) * into the main index, and so we might visit it a second time during the * main scan. This is okay because we'll just re-set the same bit in the * bitmap. (The possibility of duplicate visits is a major reason why GIN - * can't support the amgettuple API, however.) Note that it would not do - * to scan the main index before the pending list, since concurrent - * cleanup could then make us miss entries entirely. + * can't support either the amgettuple or amgetbatch API.) Note that it + * would not do to scan the main index before the pending list, since + * concurrent cleanup could then make us miss entries entirely. */ scanPendingInsert(scan, tbm, &ntids); diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c index fe7b984ff..32422865b 100644 --- a/src/backend/access/gin/ginutil.c +++ b/src/backend/access/gin/ginutil.c @@ -82,10 +82,12 @@ ginhandler(PG_FUNCTION_ARGS) .ambeginscan = ginbeginscan, .amrescan = ginrescan, .amgettuple = NULL, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = gingetbitmap, .amendscan = ginendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index 8565e225b..a58c5222f 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -103,10 +103,12 @@ gisthandler(PG_FUNCTION_ARGS) .ambeginscan = gistbeginscan, .amrescan = gistrescan, .amgettuple = gistgettuple, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = gistgetbitmap, .amendscan = gistendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 8d8cd30dc..9ae9f0448 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -114,10 +114,12 @@ hashhandler(PG_FUNCTION_ARGS) .ambeginscan = hashbeginscan, .amrescan = hashrescan, .amgettuple = hashgettuple, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = hashgetbitmap, .amendscan = hashendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 
657ae4414..1f97b6e14 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -2555,8 +2555,11 @@ static const TableAmRoutine heapam_methods = { .parallelscan_reinitialize = table_block_parallelscan_reinitialize, .index_fetch_begin = heapam_index_fetch_begin, + .index_fetch_batch_init = heapam_index_fetch_batch_init, .index_fetch_reset = heapam_index_fetch_reset, .index_fetch_end = heapam_index_fetch_end, + .index_fetch_markpos = heapam_index_fetch_markpos, + .index_fetch_restrpos = heapam_index_fetch_restrpos, .tuple_insert = heapam_tuple_insert, .tuple_insert_speculative = heapam_tuple_insert_speculative, diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c index e1ba19d8a..5ff9d8bea 100644 --- a/src/backend/access/heap/heapam_indexscan.c +++ b/src/backend/access/heap/heapam_indexscan.c @@ -16,12 +16,49 @@ #include "access/amapi.h" #include "access/heapam.h" +#include "access/indexbatch.h" #include "access/relscan.h" #include "access/visibilitymap.h" #include "storage/predicate.h" #include "utils/pgstat_internal.h" +/* + * Per-batch data private to the heap table AM. + * + * Stored at a negative offset from the IndexScanBatch pointer, in the + * fixed-size table AM opaque area of each batch allocation. + */ +typedef struct HeapBatchData +{ + uint8 *visInfo; /* per-item visibility flags, or NULL */ +} HeapBatchData; + +/* + * Per-item visibility flags stored in HeapBatchData.visInfo array + */ +#define HEAP_BATCH_VIS_CHECKED 0x01 /* checked item in VM? */ +#define HEAP_BATCH_VIS_ALL_VISIBLE 0x02 /* block is known all-visible? */ + +static inline HeapBatchData *heapam_index_batch_data(IndexScanDesc scan, + IndexScanBatch batch); +static inline ItemPointer heapam_index_return_scanpos_tid(IndexScanDesc scan, + IndexFetchHeapData *hscan, + ScanDirection direction, + IndexScanBatch scanBatch, + BatchRingItemPos *scanPos, + bool *all_visible); +static void heapam_index_batch_pos_visibility(IndexScanDesc scan, + ScanDirection direction, + IndexScanBatch batch, + HeapBatchData *hbatch, + BatchRingItemPos *pos); +static bool heapam_index_plain_batch_getnext_slot(IndexScanDesc scan, + ScanDirection direction, + TupleTableSlot *slot); +static bool heapam_index_only_batch_getnext_slot(IndexScanDesc scan, + ScanDirection direction, + TupleTableSlot *slot); static bool heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot); @@ -77,26 +114,92 @@ heapam_index_fetch_begin(IndexScanDesc scan, uint32 flags) { IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData); + hscan->xs_base.batch_opaque_size = MAXALIGN(sizeof(HeapBatchData)); + hscan->xs_base.batch_per_item_size = sizeof(uint8); /* visInfo element size */ hscan->xs_base.flags = flags; - hscan->xs_cbuf = InvalidBuffer; - hscan->xs_blk = InvalidBlockNumber; - hscan->xs_vmbuffer = InvalidBuffer; - /* Resolve which getnext_slot implementation to use for this scan */ - if (scan->xs_want_itup) - scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot; + /* Current heap block state */ + Assert(hscan->xs_cbuf == InvalidBuffer); + hscan->xs_blk = InvalidBlockNumber; + + /* VM related state */ + Assert(hscan->xs_vmbuffer == InvalidBuffer); + hscan->xs_vm_items = 1; + + /* Resolve which xs_getnext_slot implementation to use for this scan */ + if (scan->indexRelation->rd_indam->amgetbatch != NULL) + { + /* amgetbatch index AM */ + if (scan->xs_want_itup) + scan->xs_getnext_slot = 
heapam_index_only_batch_getnext_slot; + else + scan->xs_getnext_slot = heapam_index_plain_batch_getnext_slot; + + /* Set up scan's batch ring buffer in passing */ + tableam_util_batchscan_init(scan); + } else - scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot; + { + /* amgettuple index AM */ + if (scan->xs_want_itup) + scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot; + else + scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot; + } return &hscan->xs_base; } +/* + * Initialize the heap table AM's per-batch opaque area (HeapBatchData). + * Called by indexam_util_batch_alloc for each new or recycled batch. + */ +void +heapam_index_fetch_batch_init(IndexScanDesc scan, IndexScanBatch batch, + bool new_alloc) +{ + HeapBatchData *hbatch = heapam_index_batch_data(scan, batch); + + if (scan->xs_want_itup) + { + if (new_alloc) + { + /* + * The visInfo pointer is stored at the very start of the palloc'd + * space, in the fixed-sized table AM opaque area. visInfo points + * to just past the end of the variable-sized items[maxitemsbatch] + * array (to a space that is also sized according to whatever the + * index AM set maxitemsbatch to). + */ + Size itemsEnd; + + itemsEnd = MAXALIGN(offsetof(IndexScanBatchData, items) + + sizeof(BatchMatchingItem) * scan->maxitemsbatch); + hbatch->visInfo = (uint8 *) ((char *) batch + itemsEnd); + } + + /* Clear visibility flags (needed for both new and recycled batches) */ + memset(hbatch->visInfo, 0, scan->maxitemsbatch); + } + else + { + hbatch->visInfo = NULL; + } +} + void heapam_index_fetch_reset(IndexScanDesc scan) { + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; + + /* Rescans should avoid an excessive number of VM lookups */ + hscan->xs_vm_items = 1; + + /* Reset batch ring buffer state */ + if (scan->usebatchring) + tableam_util_batchscan_reset(scan, false); + /* - * Resets are a no-op. - * * Deliberately avoid dropping pins now held in xs_cbuf and xs_vmbuffer. * This saves cycles during certain tight nested loop joins (it can avoid * repeated pinning and unpinning of the same buffer across rescans). @@ -116,9 +219,35 @@ heapam_index_fetch_end(IndexScanDesc scan) if (BufferIsValid(hscan->xs_vmbuffer)) ReleaseBuffer(hscan->xs_vmbuffer); + /* Free all batch related resources */ + if (scan->usebatchring) + tableam_util_batchscan_end(scan); + pfree(hscan); } +/* + * Save batch ring buffer's current scanPos as its markPos + */ +void +heapam_index_fetch_markpos(IndexScanDesc scan) +{ + Assert(scan->usebatchring); + + tableam_util_batchscan_mark_pos(scan); +} + +/* + * Restore batch ring buffer's markPos into its scanPos + */ +void +heapam_index_fetch_restrpos(IndexScanDesc scan) +{ + Assert(scan->usebatchring); + + tableam_util_batchscan_restore_pos(scan); +} + /* * heap_hot_search_buffer - search HOT chain for tuple satisfying snapshot * @@ -367,7 +496,8 @@ heapam_index_fetch_tuple(Relation rel, */ static pg_attribute_always_inline bool heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan, - TupleTableSlot *slot, bool *heap_continue) + TupleTableSlot *slot, bool *heap_continue, + bool amgetbatch) { bool all_dead = false; bool found; @@ -381,20 +511,339 @@ heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan, pgstat_count_heap_fetch(scan->indexRelation); /* - * If we scanned a whole HOT chain and found only dead tuples, tell index - * AM to kill its entry for that TID (this will take effect in the next - * amgettuple call, in index_getnext_tid). 
We do not do this when in - * recovery because it may violate MVCC to do so. See comments in - * RelationGetIndexScan(). + * If we scanned a whole HOT chain and found only dead tuples, remember it + * for later. We do not do this when in recovery because it may violate + * MVCC to do so. See comments in RelationGetIndexScan(). */ if (!scan->xactStartedInRecovery) - scan->kill_prior_tuple = all_dead; + { + if (amgetbatch) + { + if (all_dead) + tableam_util_scanpos_killitem(scan); + } + else + { + /* + * Tell amgettuple-based index AM to kill its entry for that TID + * (this will take effect in the next call, in index_getnext_tid) + */ + scan->kill_prior_tuple = all_dead; + } + } return found; } /* - * Common implementation for both heapam_index_*_getnext_slot variants. + * Get next TID from batch ring buffer, moving in the given scan direction. + * Also sets *all_visible for item when caller passes a non-NULL arg. + */ +static pg_attribute_always_inline ItemPointer +heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan, + ScanDirection direction, bool *all_visible) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &batchringbuf->scanPos; + IndexScanBatch scanBatch = NULL; + bool hadExistingScanBatch; + + Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch); + Assert(scanPos->valid || index_scan_batch_count(scan) == 0); + Assert(all_visible == NULL || scan->xs_want_itup); + + /* + * Check if there's an existing loaded scanBatch for us to return the next + * matching item's TID/index tuple from + */ + hadExistingScanBatch = scanPos->valid; + if (scanPos->valid) + { + /* + * scanPos is valid, so scanBatch must already be loaded in batch ring + * buffer. We rely on that here. + */ + pg_assume(batchringbuf->headBatch == scanPos->batch); + + scanBatch = index_scan_batch(scan, scanPos->batch); + + if (index_scan_pos_advance(direction, scanBatch, scanPos)) + return heapam_index_return_scanpos_tid(scan, hscan, direction, + scanBatch, scanPos, + all_visible); + } + + /* + * Either ran out of items from our existing scanBatch, or it hasn't been + * loaded yet (because this is the first call here for the entire scan). + * Try to advance scanBatch to the next batch (or get the first batch). + */ + scanBatch = tableam_util_fetch_next_batch(scan, direction, + scanBatch, scanPos); + + if (!scanBatch) + { + /* + * We're done; no more batches in the current scan direction. + * + * Note: scanPos is generally still valid at this point. The scan + * might still back up in the other direction. + */ + return NULL; + } + + /* + * Advanced scanBatch. Now position scanPos to the start of new + * scanBatch. 
+ */ + index_scan_pos_nextbatch(direction, scanBatch, scanPos); + Assert(index_scan_batch(scan, scanPos->batch) == scanBatch); + + /* + * Remove the head batch from the batch ring buffer (except when this new + * scanBatch is our only one) + */ + if (hadExistingScanBatch) + { + IndexScanBatch headBatch = index_scan_batch(scan, + batchringbuf->headBatch); + + Assert(headBatch != scanBatch); + Assert(batchringbuf->headBatch != scanPos->batch); + + /* free obsolescent head batch (unless it is scan's markBatch) */ + tableam_util_free_batch(scan, headBatch); + + /* Remove the batch from the ring buffer (even if it's markBatch) */ + batchringbuf->headBatch++; + } + + /* In practice scanBatch will always be the ring buffer's headBatch */ + Assert(batchringbuf->headBatch == scanPos->batch); + + return heapam_index_return_scanpos_tid(scan, hscan, direction, + scanBatch, scanPos, all_visible); +} + +/* + * Access the heap-private fixed-size data from the beginning of an allocated + * IndexScanBatch, using caller's IndexScanBatch pointer + */ +static inline HeapBatchData * +heapam_index_batch_data(IndexScanDesc scan, IndexScanBatch batch) +{ + /* heapam's fixed-size space is at the start of the palloc'd area */ + return (HeapBatchData *) index_scan_batch_base(scan, batch); +} + +/* + * Save the current scanPos/scanBatch item's TID in scan's xs_heaptid, and + * return a pointer to that TID. When all_visible isn't NULL (during an + * index-only scan), also sets item's visibility status in *all_visible. + * + * heapam_index_getnext_scanbatch_pos helper function. + */ +static inline ItemPointer +heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, + ScanDirection direction, + IndexScanBatch scanBatch, + BatchRingItemPos *scanPos, + bool *all_visible) +{ + HeapBatchData *hbatch; + + pgstat_count_index_tuples(scan->indexRelation, 1); + + /* Set xs_heaptid, which caller (and core executor) will need */ + scan->xs_heaptid = scanBatch->items[scanPos->item].tableTid; + + if (all_visible == NULL) + { + /* + * Plain index scan. + */ + Assert(!scan->xs_want_itup); + return &scan->xs_heaptid; + } + + /* + * Index-only scan. + * + * Also set xs_itup, which caller also needs. + */ + Assert(scan->xs_want_itup); + scan->xs_itup = (IndexTuple) (scanBatch->currTuples + + scanBatch->items[scanPos->item].tupleOffset); + + /* + * Set visibility info for the current scanPos item (plus possibly some + * additional items in the current scan direction) as needed + */ + hbatch = heapam_index_batch_data(scan, scanBatch); + if (!(hbatch->visInfo[scanPos->item] & HEAP_BATCH_VIS_CHECKED)) + heapam_index_batch_pos_visibility(scan, direction, scanBatch, hbatch, + scanPos); + + /* Finally, set all_visible for caller */ + *all_visible = + (hbatch->visInfo[scanPos->item] & HEAP_BATCH_VIS_ALL_VISIBLE) != 0; + + return &scan->xs_heaptid; +} + +/* + * Obtain visibility information for a TID from caller's batch. + * + * Called during amgetbatch index-only scans. We always check the visibility + * of caller's item (an offset into caller's batch->items[] array). We might + * also set visibility info for other items from caller's batch more + * proactively when that makes sense. 
+ * + * We keep two competing considerations in balance when determining whether to + * check additional items: the need to keep the cost of visibility map access + * under control when most items will never be returned by the scan anyway + * (important for inner index scans of anti-joins and semi-joins), and the + * need to unguard batches promptly. + * + * Once we've resolved visibility for all items in a batch, we can safely + * unguard it by calling amunguardbatch. This is safe with respect to + * concurrent VACUUM because the batch's guard (typically a buffer pin on the + * originating index page) blocks VACUUM from acquiring a conflicting cleanup + * lock on that page. Copying the relevant visibility map data into our local + * cache suffices to prevent unsafe concurrent TID recycling: if any of these + * TIDs point to dead heap tuples, VACUUM cannot possibly return from + * ambulkdelete and mark the pointed-to heap pages as all-visible. VACUUM + * _can_ do so once the batch is unguarded, but that's okay; we'll be working + * off of cached visibility info that indicates that the dead TIDs are NOT + * all-visible. + * + * What about the opposite case, where a page was all-visible when we cached + * the VM bits but tuples on it are deleted afterwards? That is safe too: any + * tuple that was visible to all when we read the VM must also be visible to + * our MVCC snapshot, so it is correct to skip the heap fetch for those TIDs. + */ +static void +heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction, + IndexScanBatch batch, HeapBatchData *hbatch, + BatchRingItemPos *pos) +{ + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; + int posItem = pos->item; + bool allbatchitemsvisible; + BlockNumber curvmheapblkno = InvalidBlockNumber; + uint8 curvmheapblkflags = 0; + + Assert(hbatch == heapam_index_batch_data(scan, batch)); + + /* + * The batch must still be guarded whenever we're called. + * + * amunguardbatch can't be called until we've already set _every_ batch + * item's visInfo[] status, but if we've already done so for this batch + * then it shouldn't ever get passed to us again by some subsequent call. + * (This relies on index-only scans always being !batchImmediateUnguard.) + */ + Assert(batch->isGuarded && !scan->batchImmediateUnguard); + + /* + * Set visibility info for a range of items, in scan order. + * + * Note: visibilitymap_get_status does not lock the visibility map buffer, + * so the result could be slightly stale. See the "Memory ordering + * effects" discussion above visibilitymap_get_status for an explanation + * of why this is okay. 
+ */ + if (ScanDirectionIsForward(direction)) + { + int lastSetItem = Min(batch->lastItem, + posItem + hscan->xs_vm_items - 1); + + for (int setItem = posItem; setItem <= lastSetItem; setItem++) + { + ItemPointer tid = &batch->items[setItem].tableTid; + BlockNumber heapblkno = ItemPointerGetBlockNumber(tid); + uint8 flags; + + if (heapblkno == curvmheapblkno) + { + hbatch->visInfo[setItem] = curvmheapblkflags; + continue; + } + + flags = HEAP_BATCH_VIS_CHECKED; + if (VM_ALL_VISIBLE(scan->heapRelation, heapblkno, &hscan->xs_vmbuffer)) + flags |= HEAP_BATCH_VIS_ALL_VISIBLE; + + hbatch->visInfo[setItem] = curvmheapblkflags = flags; + curvmheapblkno = heapblkno; + } + + allbatchitemsvisible = lastSetItem >= batch->lastItem && + (posItem == batch->firstItem || + (hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED)); + } + else + { + int lastSetItem = Max(batch->firstItem, + posItem - hscan->xs_vm_items + 1); + + for (int setItem = posItem; setItem >= lastSetItem; setItem--) + { + ItemPointer tid = &batch->items[setItem].tableTid; + BlockNumber heapblkno = ItemPointerGetBlockNumber(tid); + uint8 flags; + + if (heapblkno == curvmheapblkno) + { + hbatch->visInfo[setItem] = curvmheapblkflags; + continue; + } + + flags = HEAP_BATCH_VIS_CHECKED; + if (VM_ALL_VISIBLE(scan->heapRelation, heapblkno, &hscan->xs_vmbuffer)) + flags |= HEAP_BATCH_VIS_ALL_VISIBLE; + + hbatch->visInfo[setItem] = curvmheapblkflags = flags; + curvmheapblkno = heapblkno; + } + + allbatchitemsvisible = lastSetItem <= batch->firstItem && + (posItem == batch->lastItem || + (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED)); + } + + /* + * It's safe to unguard the batch (via amunguardbatch) as soon as we've + * resolved the visibility status of all of its items (unless this is a + * non-MVCC scan) + */ + if (allbatchitemsvisible) + { + Assert(hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED); + Assert(hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED); + + /* + * Note: nodeIndexonlyscan.c only supports MVCC snapshots, but we + * still cope with index-only scan callers with other snapshot types. + * This is not unexpected; selfuncs.c performs index-only + * scans that use SnapshotNonVacuumable. + */ + if (scan->MVCCScan) + tableam_util_unguard_batch(scan, batch); + } + + /* + * Else check visibility for twice as many items next time, or all items. + * We check all items in one go once we're past the scan's first batch. + */ + else if (hscan->xs_vm_items < (batch->lastItem - batch->firstItem)) + hscan->xs_vm_items *= 2; + else + hscan->xs_vm_items = scan->maxitemsbatch; +} + +/* + * Common implementation for all four heapam_index_*_getnext_slot variants. + * * The result is true if a tuple satisfying the scan keys and the snapshot was * found, false otherwise. The tuple is stored in the specified slot. @@ -403,12 +852,13 @@ heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan, * dropped by a future call here (or by a later call to heapam_index_fetch_end * through index_endscan). * - * The index_only parameter is a compile-time constant at each call site, - * allowing the compiler to specialize the code for each variant. + * The index_only and amgetbatch parameters are compile-time constants at each + * call site, allowing the compiler to specialize the code for each variant. 
*/ static pg_attribute_always_inline bool heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, - TupleTableSlot *slot, bool index_only) + TupleTableSlot *slot, bool index_only, + bool amgetbatch) { IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; bool *heap_continue = &scan->xs_heap_continue; @@ -422,14 +872,20 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, if (!*heap_continue) { /* Get the next TID from the index */ - tid = index_getnext_tid(scan, direction); + if (amgetbatch) + tid = heapam_index_getnext_scanbatch_pos(scan, hscan, + direction, + index_only ? + &all_visible : NULL); + else + tid = index_getnext_tid(scan, direction); /* If we're out of index entries, we're done */ if (tid == NULL) break; - /* For index-only scans, check the visibility map */ - if (index_only) + /* For non-batch index-only scans, check the visibility map */ + if (index_only && !amgetbatch) all_visible = VM_ALL_VISIBLE(scan->heapRelation, ItemPointerGetBlockNumber(tid), &hscan->xs_vmbuffer); @@ -454,7 +910,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, scan->instrument->ntabletuplefetches++; if (!heapam_index_fetch_heap_item(scan, hscan, slot, - heap_continue)) + heap_continue, amgetbatch)) { /* * No visible tuple. If caller set a visited-pages limit @@ -486,7 +942,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, * want us to assume that just having one visible tuple in the * hot chain is always good enough. */ - Assert(!(*heap_continue && IsMVCCSnapshot(scan->xs_snapshot))); + Assert(!(*heap_continue && scan->MVCCScan)); } else { @@ -513,8 +969,8 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, * entry. If we don't find anything, loop around and grab the * next TID from the index. 
*/ - if (heapam_index_fetch_heap_item(scan, hscan, slot, - heap_continue)) + if (heapam_index_fetch_heap_item(scan, hscan, slot, heap_continue, + amgetbatch)) return true; } } @@ -522,16 +978,40 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, return false; } +/* xs_getnext_slot callback: amgetbatch, plain index scan */ +static pg_attribute_hot bool +heapam_index_plain_batch_getnext_slot(IndexScanDesc scan, + ScanDirection direction, + TupleTableSlot *slot) +{ + Assert(!scan->xs_want_itup && scan->usebatchring); + Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); + + return heapam_index_getnext_slot(scan, direction, slot, false, true); +} + +/* xs_getnext_slot callback: amgetbatch, index-only scan */ +static pg_attribute_hot bool +heapam_index_only_batch_getnext_slot(IndexScanDesc scan, + ScanDirection direction, + TupleTableSlot *slot) +{ + Assert(scan->xs_want_itup && scan->usebatchring); + Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); + + return heapam_index_getnext_slot(scan, direction, slot, true, true); +} + /* xs_getnext_slot callback: amgettuple, plain index scan */ static pg_attribute_hot bool heapam_index_plain_tuple_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot) { - Assert(!scan->xs_want_itup); + Assert(!scan->xs_want_itup && !scan->usebatchring); Assert(scan->indexRelation->rd_indam->amgettuple != NULL); - return heapam_index_getnext_slot(scan, direction, slot, false); + return heapam_index_getnext_slot(scan, direction, slot, false, false); } /* xs_getnext_slot callback: amgettuple, index-only scan */ @@ -540,8 +1020,8 @@ heapam_index_only_tuple_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot) { - Assert(scan->xs_want_itup); + Assert(scan->xs_want_itup && !scan->usebatchring); Assert(scan->indexRelation->rd_indam->amgettuple != NULL); - return heapam_index_getnext_slot(scan, direction, slot, true); + return heapam_index_getnext_slot(scan, direction, slot, true, false); } diff --git a/src/backend/access/index/Makefile b/src/backend/access/index/Makefile index 6f2e3061a..e6d681b40 100644 --- a/src/backend/access/index/Makefile +++ b/src/backend/access/index/Makefile @@ -16,6 +16,7 @@ OBJS = \ amapi.o \ amvalidate.o \ genam.o \ - indexam.o + indexam.o \ + indexbatch.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/index/amapi.c b/src/backend/access/index/amapi.c index efa007030..d4adbbeb2 100644 --- a/src/backend/access/index/amapi.c +++ b/src/backend/access/index/amapi.c @@ -55,6 +55,11 @@ GetIndexAmRoutine(Oid amhandler) Assert(routine->amrescan != NULL); Assert(routine->amendscan != NULL); + /* Assert that AM doesn't have an invalid combination of callbacks */ + Assert((routine->amgetbatch != NULL) == (routine->amunguardbatch != NULL)); + Assert(routine->amkillitemsbatch == NULL || routine->amgetbatch != NULL); + Assert(routine->amgetbatch != NULL || routine->amposreset == NULL); + return routine; } diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index acc9f3e6a..17ea93b4d 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -89,6 +89,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys) scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */ scan->numberOfKeys = nkeys; scan->numberOfOrderBys = norderbys; + scan->usebatchring = false; /* set later for amgetbatch callers */ + memset(&scan->batchcache, 0, sizeof(scan->batchcache)); /* * We 
allocate key workspace here, but it won't get filled until amrescan. @@ -126,6 +128,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys) scan->xs_hitup = NULL; scan->xs_hitupdesc = NULL; + scan->batch_index_opaque_size = 0; + scan->batch_tuples_workspace = 0; + scan->batch_table_offset = 0; scan->xs_visited_pages_limit = 0; return scan; diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 3fac4c30d..443346089 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -13,7 +13,7 @@ * INTERFACE ROUTINES * index_open - open an index relation by relation OID * index_close - close an index relation - * index_beginscan - start a scan of an index with amgettuple + * index_beginscan - start a scan of an index with amgetbatch/amgettuple * index_beginscan_bitmap - start a scan of an index with amgetbitmap * index_rescan - restart a scan of an index * index_endscan - end a scan @@ -278,6 +278,7 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation, scan->xs_temp_snap = temp_snap; scan->xs_snapshot = snapshot; + scan->MVCCScan = IsMVCCLikeSnapshot(snapshot); scan->instrument = instrument; /* @@ -289,6 +290,7 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation, scan->heapRelation = heapRelation; scan->xs_want_itup = index_only_scan; scan->xs_heap_continue = false; + scan->batchImmediateUnguard = (scan->MVCCScan && !index_only_scan); /* prepare to fetch index matches from table */ scan->xs_heapfetch = table_index_fetch_begin(scan, flags); @@ -297,11 +299,19 @@ index_beginscan_internal(Relation indexRelation, Relation heapRelation, Assert(scan->xs_getnext_slot != NULL); } + /* + * Bitmap index scans should never use a batch ring buffer (though can use + * the scan's batch cache). Plain index scans (and index-only scans) + * should only use a batch ring buffer with an amgetbatch index AM. + */ + Assert(!scan->xs_heapfetch ? !scan->usebatchring : + (indexRelation->rd_indam->amgetbatch != NULL) == scan->usebatchring); + return scan; } /* - * index_beginscan - start a scan of an index with amgettuple + * index_beginscan - start a scan of an index with amgetbatch/amgettuple * * Caller must be holding suitable locks on the heap and the index. */ @@ -395,7 +405,21 @@ index_endscan(IndexScanDesc scan) SCAN_CHECKS; CHECK_SCAN_PROCEDURE(amendscan); - /* Release resources (like buffer pins) from table accesses */ + /* + * amgetbitmap scans of an index AM that supports amgetbatch make limited + * use of the scan's batch cache. Check for that. + */ + if (!scan->usebatchring && scan->batchcache[0] != NULL) + { + Assert(scan->xs_heapfetch == NULL); + Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); + pfree(index_scan_batch_base(scan, scan->batchcache[0])); + } + + /* + * Release resources (like buffer pins and batch ring buffer) held by + * table AM for index scan + */ if (scan->xs_heapfetch) { table_index_fetch_end(scan); @@ -423,24 +447,24 @@ void index_markpos(IndexScanDesc scan) { SCAN_CHECKS; - CHECK_SCAN_PROCEDURE(ammarkpos); + CHECK_SCAN_PROCEDURE(amgetbatch); - scan->indexRelation->rd_indam->ammarkpos(scan); + table_index_fetch_markpos(scan); } /* ---------------- * index_restrpos - restore a scan position * - * NOTE: this only restores the internal scan state of the index AM. See + * NOTE: this only restores the batch positional state of the table AM. See * comments for ExecRestrPos(). 
* * NOTE: For heap, in the presence of HOT chains, mark/restore only works * correctly if the scan's snapshot is MVCC-safe; that ensures that there's at * most one returnable tuple in each HOT chain, and so restoring the prior - * state at the granularity of the index AM is sufficient. Since the only - * current user of mark/restore functionality is nodeMergejoin.c, this - * effectively means that merge-join plans only work for MVCC snapshots. This - * could be fixed if necessary, but for now it seems unimportant. + * state at the scan item granularity is sufficient. Since the only current + * user of mark/restore functionality is nodeMergejoin.c, this effectively + * means that merge-join plans only work for MVCC snapshots. This could be + * fixed if necessary, but for now it seems unimportant. * ---------------- */ void @@ -449,16 +473,11 @@ index_restrpos(IndexScanDesc scan) Assert(IsMVCCLikeSnapshot(scan->xs_snapshot)); SCAN_CHECKS; - CHECK_SCAN_PROCEDURE(amrestrpos); + CHECK_SCAN_PROCEDURE(amgetbatch); - /* reset table AM state for restoring the marked position */ - if (scan->xs_heapfetch) - table_index_fetch_reset(scan); - - scan->kill_prior_tuple = false; /* for safety */ scan->xs_heap_continue = false; - scan->indexRelation->rd_indam->amrestrpos(scan); + table_index_fetch_restrpos(scan); } /* @@ -632,6 +651,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) SCAN_CHECKS; CHECK_SCAN_PROCEDURE(amgettuple); + Assert(!scan->usebatchring); /* XXX: we should assert that a snapshot is pushed or registered */ Assert(TransactionIdIsValid(RecentXmin)); diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c new file mode 100644 index 000000000..b2b3afe80 --- /dev/null +++ b/src/backend/access/index/indexbatch.c @@ -0,0 +1,726 @@ +/*------------------------------------------------------------------------- + * + * indexbatch.c + * Batch-based index scan infrastructure for the amgetbatch interface. + * + * This module provides the core infrastructure for batch-based index scans, + * which allow index AMs to return multiple matching TIDs per page in a single + * call. The batch ring buffer is owned by the table AM. + * + * The ring buffer loads batches in index key space/index scan order. + * + * Most functions here are table AM utilities (tableam_util_*), called by + * table AMs during amgetbatch index scans. These manage the batch ring + * buffer's lifecycle and positional state, and help with certain aspects of + * resource management. The table AM uses scanPos (and its scanBatch batch) + * to return items from batches returned by amgetbatch. + * + * There are also some index AM utilities (indexam_util_*), called by index + * AMs that implement the amgetbatch interface, to help manage resources like + * memory, locks, and buffer pins. Index AMs free and unlock batches as + * described in indexam.sgml. 
+ * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/index/indexbatch.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/amapi.h" +#include "access/indexbatch.h" +#include "access/tableam.h" +#include "common/int.h" +#include "lib/qunique.h" + +static void batch_free(IndexScanDesc scan, IndexScanBatch batch, + bool allow_cache); +static inline bool batch_cache_store(IndexScanDesc scan, IndexScanBatch batch); +static int batch_compare_int(const void *va, const void *vb); + +/* + * Reset ring buffer and related positional state used during an amgetbatch + * index scan + */ +void +tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + IndexScanBatch markBatch = batchringbuf->markBatch; + bool markBatchFreed = false; + + batchringbuf->scanPos.valid = false; + batchringbuf->markPos.valid = false; + + /* Ensure batch_free won't skip the old markBatch in the loop below */ + batchringbuf->markBatch = NULL; + + for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++) + { + IndexScanBatch batch = index_scan_batch(scan, i); + + if (batch == markBatch) + markBatchFreed = true; + + batch_free(scan, batch, !endscan); + } + + if (!markBatchFreed && unlikely(markBatch)) + batch_free(scan, markBatch, !endscan); + + batchringbuf->headBatch = 0; + batchringbuf->nextBatch = 0; +} + +/* + * Free resources at end of a batch index scan. + * + * Called by table AM when an index scan is ending, right before the owning + * scan descriptor goes away. Cleans up all batch related resources. + */ +void +tableam_util_batchscan_end(IndexScanDesc scan) +{ + /* Free all remaining loaded batches (even markBatch), bypassing cache */ + tableam_util_batchscan_reset(scan, true); + + for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++) + { + IndexScanBatch cached = scan->batchcache[i]; + + if (cached == NULL) + continue; + + if (cached->deadItems) + pfree(cached->deadItems); + pfree(index_scan_batch_base(scan, cached)); + } +} + +/* + * Set a mark from scanPos position + * + * Called from the table AM's index_fetch_markpos callback. Saves the current + * scan position and associated batch so that the scan can be restored to this + * point later, via tableam_util_batchscan_restore_pos. The marked batch is + * retained and not freed until a new mark is set or the scan ends (or until + * the mark is restored). 
+ */ +void +tableam_util_batchscan_mark_pos(IndexScanDesc scan) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos; + BatchRingItemPos *markPos = &batchringbuf->markPos; + IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch); + IndexScanBatch markBatch = batchringbuf->markBatch; + bool freeMarkBatch; + + Assert(scan->MVCCScan); + + /* + * Free the previous mark batch (if any) -- but only if it isn't our + * scanBatch (defensively make sure that markBatch isn't some later + * still-needed batch, too) + */ + if (!markBatch || markBatch == scanBatch) + { + /* Definitely no markBatch that we should free now */ + freeMarkBatch = false; + } + else if (likely(!index_scan_batch_loaded(scan, markPos->batch))) + { + /* Definitely have a no-longer-loaded markBatch to free */ + freeMarkBatch = true; + } + else + { + /* + * index_scan_batch_loaded indicates that markPos->batch is loaded, + * but after uint8 wraparound a stale batch offset can alias a + * currently-loaded range (false positive). Confirm by checking + * whether the batch pointer in markPos->batch's slot still matches. + */ + freeMarkBatch = (index_scan_batch(scan, markPos->batch) != markBatch); + } + + if (freeMarkBatch) + { + /* Free markBatch, since it isn't loaded/needed for batchringbuf */ + batchringbuf->markBatch = NULL; /* else call won't free markBatch */ + tableam_util_free_batch(scan, markBatch); + } + + /* copy the scan's position */ + batchringbuf->markPos = *scanPos; + batchringbuf->markBatch = scanBatch; +} + +/* + * Restore scanPos to the previously saved markPos position. + * + * Called from the table AM's index_fetch_restrpos callback. Restores the + * scan to a position saved using tableam_util_batchscan_mark_pos earlier. + * The scan's markPos becomes its scanPos. The marked batch is restored as + * the current scanBatch when needed. + * + * We just discard all batches (other than markBatch/restored scanBatch), + * except when markBatch is already the scan's current scanBatch. + */ +void +tableam_util_batchscan_restore_pos(IndexScanDesc scan) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos; + BatchRingItemPos *markPos = &batchringbuf->markPos; + IndexScanBatch markBatch = batchringbuf->markBatch; + IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch); + + Assert(scan->MVCCScan); + Assert(scan->xs_heapfetch); + Assert(markPos->valid); + + if (scanBatch == markBatch) + { + /* markBatch is already scanBatch; needn't change batchringbuf */ + Assert(scanPos->batch == markPos->batch); + + scanPos->item = markPos->item; + return; + } + + /* + * markBatch is behind scanBatch, and so must not be saved in ring buffer + * anymore. We have to deal with restoring the mark the hard way: by + * invalidating all other loaded batches. This is similar to the case + * where the scan direction changes and the scan actually crosses + * batch/index page boundaries (see tableam_util_scanbatch_dirchange). + * + * First, free all batches that are still in the ring buffer. 
+ */ + for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++) + { + IndexScanBatch batch = index_scan_batch(scan, i); + + Assert(batch != markBatch); + + tableam_util_free_batch(scan, batch); + } + + /* + * Next "append" standalone markBatch, which will become scanBatch + * (scanBatch is always the ring buffer's headBatch) + */ + markPos->batch = 0; + batchringbuf->scanPos = *markPos; + batchringbuf->nextBatch = batchringbuf->headBatch = markPos->batch; + index_scan_batch_append(scan, markBatch); + Assert(index_scan_batch(scan, batchringbuf->scanPos.batch) == markBatch); + + /* + * Finally, call amposreset to tell the index AM to invalidate any private + * state that independently tracks the scan's progress + */ + if (scan->indexRelation->rd_indam->amposreset) + scan->indexRelation->rd_indam->amposreset(scan, markBatch); + + /* + * Note: markBatch.deadItems[] might already contain dead items, and might + * yet have more dead items saved. tableam_util_free_batch is prepared + * for that. + */ +} + +/* + * Handle cross-batch change in scan direction + * + * Called by table AM when its scan changes direction in a way that + * necessitates backing the scan up to an index page originally associated + * with a now-freed batch. + * + * When we return, batchringbuf will only contain one batch (the current + * headBatch/scanBatch) and will look as if the new scan direction had been + * used from the start. Caller can then safely pass this batch to amgetbatch + * to determine which batch comes next in the new scan direction. This + * approach isn't particularly efficient, but it works well enough for what + * ought to be a relatively rare occurrence. + */ +void +tableam_util_scanbatch_dirchange(IndexScanDesc scan) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + IndexScanBatch scanBatch; + + /* + * Release batches starting from the current "tail" batch, working + * backwards until the current head batch (which is also the current + * scanBatch) is the only batch that hasn't been freed + */ + while (index_scan_batch_count(scan) > 1) + { + uint8 tailidx = batchringbuf->nextBatch - 1; + IndexScanBatch tail = index_scan_batch(scan, tailidx); + + Assert(tailidx != batchringbuf->scanPos.batch); + + tableam_util_free_batch(scan, tail); + batchringbuf->nextBatch--; + } + + /* scanBatch is now the only batch still loaded */ + Assert(batchringbuf->headBatch == batchringbuf->scanPos.batch); + scanBatch = index_scan_batch(scan, batchringbuf->headBatch); + + /* + * Flip scanBatch's scan direction to reflect the reversal. Also reset + * any index AM state that independently tracks scan progress. + */ + scanBatch->dir = -scanBatch->dir; + if (scan->indexRelation->rd_indam->amposreset) + scan->indexRelation->rd_indam->amposreset(scan, scanBatch); +} + +/* + * Record that scanPos item is dead + * + * Records an offset to the current scanBatch/scanPos item, saving it in + * scanBatch's deadItems array. The items' index tuples will later be + * marked LP_DEAD when the current scanBatch is freed. 
+ */ +void +tableam_util_scanpos_killitem(IndexScanDesc scan) +{ + BatchRingItemPos *scanPos = &scan->batchringbuf.scanPos; + IndexScanBatch scanBatch = index_scan_batch(scan, scanPos->batch); + + if (scanBatch->deadItems == NULL) + scanBatch->deadItems = palloc_array(int, scan->maxitemsbatch); + if (scanBatch->numDead < scan->maxitemsbatch) + scanBatch->deadItems[scanBatch->numDead++] = scanPos->item; +} + +/* + * Release resources associated with a batch + * + * Called by table AM's ordered index scan implementation when it is finished + * with a batch and wishes to release its resources. + * + * We call amunguardbatch to drop the TID recycling interlock (e.g. buffer + * pin) when it hasn't been dropped yet. For plain MVCC scans (where + * batchImmediateUnguard is set), the interlock was already dropped eagerly + * in indexam_util_batch_unlock, so we skip the amunguardbatch call here. + * Index-only scans must delay dropping the interlock until visibility is + * resolved for all items in the batch, so amunguardbatch may still need to + * act here. For non-MVCC snapshot scans, the interlock is always held + * until amunguardbatch drops it here -- this is the only place willing to + * unguard a non-MVCC scan's batch. + * + * When the batch has dead items (numDead > 0) and the index AM provides an + * amkillitemsbatch callback, we call it to set LP_DEAD bits in the index + * page. We always recycle the batch memory via indexam_util_batch_release. + * + * Note: Calling here when 'batch' is also batchringbuf.markBatch is a no-op. + * Callers that don't want this should set batchringbuf.markBatch to NULL + * before calling us. Note that markBatch has to be explicitly freed. + */ +void +tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch) +{ + /* Pass through to implementation function, with allow_cache=true */ + batch_free(scan, batch, true); +} + +/* + * Free a batch, optionally caching it for reuse. + * + * tableam_util_free_batch implementation function. We split out the + * implementation like this because we don't want to give external table AM + * callers the option of passing allow_cache=false. + * + * When allow_cache is true, we try to store the batch in the scan's batch + * cache for later reuse. When allow_cache is false (typically because the + * scan is shutting down), we pfree the caller's batch unconditionally. + */ +static void +batch_free(IndexScanDesc scan, IndexScanBatch batch, bool allow_cache) +{ + Assert(!(scan->batchImmediateUnguard && batch->isGuarded)); + Assert(batch->isGuarded || scan->MVCCScan); + + /* don't free caller's batch if it is scan's current markBatch */ + if (batch == scan->batchringbuf.markBatch) + return; + + /* Drop TID recycling interlock via amunguardbatch as needed */ + if (!scan->batchImmediateUnguard && batch->isGuarded) + tableam_util_unguard_batch(scan, batch); + + /* + * Let the index AM set LP_DEAD bits in the index page, if applicable. + * + * batch.deadItems[] is now in whatever order the scan returned items in. + * We might have even saved the same item/TID twice. + * + * Sort and unique-ify deadItems[]. That way the index AM can safely + * assume that items will always be in their original index page order. 
+ */ + if (batch->numDead > 0 && + scan->indexRelation->rd_indam->amkillitemsbatch != NULL) + { + if (batch->numDead > 1) + { + qsort(batch->deadItems, batch->numDead, sizeof(int), + batch_compare_int); + batch->numDead = qunique(batch->deadItems, batch->numDead, + sizeof(int), batch_compare_int); + } + + scan->indexRelation->rd_indam->amkillitemsbatch(scan, batch); + } + + /* + * Try to store caller's batch in this amgetbatch scan's cache of + * previously released batches first (when caller requests it) + */ + if (allow_cache && batch_cache_store(scan, batch)) + return; + + /* just pfree the caller's batch (plus batch's deadItems, if any) */ + if (batch->deadItems) + pfree(batch->deadItems); + pfree(index_scan_batch_base(scan, batch)); +} + +/* + * Drop the batch's TID recycling interlock via amunguardbatch + * + * Called by the table AM when it's safe to drop whatever interlock the index + * AM holds to prevent unsafe concurrent TID recycling by VACUUM (typically a + * buffer pin on the batch's index page in batch's opaque area). + */ +void +tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch) +{ + /* Should be called exactly once iff !batchImmediateUnguard */ + Assert(!scan->batchImmediateUnguard); + Assert(batch->isGuarded); + + scan->indexRelation->rd_indam->amunguardbatch(scan, batch); + + batch->isGuarded = false; +} + +/* + * Unlock batch's index page buffer lock + * + * Unlocks the given buffer in preparation for amgetbatch returning items + * saved in that batch. Performs extra steps required by amgetbatch callers + * in passing. + * + * Only call here when a batch has one or more matching items to return using + * amgetbatch (or for amgetbitmap to load into its bitmap of matching TIDs). + * When an index page has no matches, it's always safe for index AMs to drop + * both the lock and the pin for themselves. + * + * Note: It is convenient for index AMs that implement both amgetbatch and + * amgetbitmap to consistently use the same batch management approach, since + * that avoids introducing special cases to lower-level code. We drop both + * the lock and the pin on batch's page on behalf of amgetbitmap callers. + * + * For amgetbatch callers, when batchImmediateUnguard is set (plain MVCC + * scans), we also release the pin here (the TID recycling interlock), so + * that no later amunguardbatch callback will be needed. Otherwise the table + * AM will call amunguardbatch later when it's safe to drop the interlock. + * + * Index AMs whose TID recycling interlock is not just a buffer pin, or whose + * amunguardbatch does not simply release a pin, are not obligated to use this + * function. They can implement their own equivalent. Such index AMs are also + * free to use the batch LSN field themselves; their amkillitemsbatch routine + * can use that LSN in the usual way, or in whatever way the AM deems necessary + * (core code will not use it for any other purpose). + */ +void +indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf) +{ + /* batch must have one or more matching items returned by index AM */ + Assert(batch->firstItem >= 0 && batch->firstItem <= batch->lastItem); + + if (scan->usebatchring) + { + /* amgetbatch (not amgetbitmap) caller */ + Assert(scan->heapRelation != NULL); + + /* + * Have to set batch->lsn so that amkillitemsbatch has a way to detect + * when concurrent table TID recycling by VACUUM might have taken + * place. It'll only be safe to set any index tuple LP_DEAD bits when + * the page LSN hasn't advanced. 
+ * + * Plain MVCC scans (batchImmediateUnguard) also release the pin now, + * dropping the TID recycling interlock so that no amunguardbatch + * callback will be needed later. The index AM caller must clear its + * own opaque buf field after we return. + * + * Non-immediate-unguard scans retain the pin; the table AM will call + * amunguardbatch to drop the interlock when ready. + */ + batch->lsn = BufferGetLSNAtomic(buf); + if (scan->batchImmediateUnguard) + { + /* drop both the lock and the pin */ + UnlockReleaseBuffer(buf); + } + else + { + /* just drop the lock (hold on to interlock pin) */ + UnlockBuffer(buf); + } + + /* If we released buffer pin, batch is now unguarded */ + batch->isGuarded = !scan->batchImmediateUnguard; + } + else + { + /* amgetbitmap (not amgetbatch) caller */ + Assert(scan->heapRelation == NULL); + + /* drop both the lock and the pin */ + UnlockReleaseBuffer(buf); + } +} + +/* + * Allocate a new batch + * + * Used by index AMs that support amgetbatch interface (both during amgetbatch + * and amgetbitmap scans). + * + * Returns IndexScanBatch with space to fit scan->maxitemsbatch-many + * BatchMatchingItem entries. This will either be a newly allocated batch, or + * a batch recycled from the cache managed by indexam_util_batch_release. See + * comments above indexam_util_batch_release. + * + * Housekeeping fields (buf, knownEndBackward/Forward, firstItem, lastItem, + * numDead, deadItems, currTuples) are initialized here. The table AM's + * batch_init callback is invoked here to initialize the table AM opaque area. + * The index AM caller is responsible for filling in its per-batch opaque + * fields and the matching items[] array. + * + * Once the batch has the required matching items, caller should generally + * pass it to indexam_util_batch_unlock, ahead of it being returned through + * index AM's amgetbatch routine. If it turns out that the batch won't need + * to be returned like this (e.g., due to the scan having no more matches), + * caller should pass its empty/unused batch to indexam_util_batch_release. + */ +IndexScanBatch +indexam_util_batch_alloc(IndexScanDesc scan) +{ + IndexScanBatch batch = NULL; + bool new_alloc = false; + + /* + * Lazily compute batch_table_offset on first allocation. This combines + * the table AM and index AM opaque sizes into a single offset that can be + * used to find the table AM opaque area (and the true allocation base) + * from the batch pointer. + */ + if (scan->batch_table_offset == 0 && + (scan->batch_index_opaque_size > 0 || + (scan->xs_heapfetch && scan->xs_heapfetch->batch_opaque_size > 0))) + { + uint16 table_opaque = scan->xs_heapfetch ? + scan->xs_heapfetch->batch_opaque_size : 0; + + scan->batch_table_offset = table_opaque + + scan->batch_index_opaque_size; + } + + /* First look for an existing batch from the cache */ + if (scan->usebatchring) + { + for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++) + { + if (scan->batchcache[i] != NULL) + { + /* Return cached unreferenced batch */ + batch = scan->batchcache[i]; + scan->batchcache[i] = NULL; + break; + } + } + } + else if (scan->batchcache[0] != NULL) + { + /* + * Reuse cached batch from prior amgetbitmap iteration. This path is + * hit on every amgetbitmap call here after the scan's first. 
+ */ + batch = scan->batchcache[0]; + scan->batchcache[0] = NULL; + } + + if (!batch) + { + Size prefix_sz; + Size base_sz; + Size trailing_sz; + Size allocsz; + char *raw; + + /* AM opaque areas before the batch pointer */ + prefix_sz = scan->batch_table_offset; + + /* IndexScanBatchData header + items[] */ + base_sz = offsetof(IndexScanBatchData, items) + + sizeof(BatchMatchingItem) * scan->maxitemsbatch; + + /* + * Trailing data after items[]: per-item data (owned by table AM), + * then currTuples workspace (owned by index AM, read by table AM) + */ + trailing_sz = 0; + if (scan->xs_want_itup) + { + if (scan->xs_heapfetch && + scan->xs_heapfetch->batch_per_item_size > 0) + trailing_sz += MAXALIGN(scan->xs_heapfetch->batch_per_item_size * + scan->maxitemsbatch); + trailing_sz += scan->batch_tuples_workspace; + } + + allocsz = prefix_sz + MAXALIGN(base_sz) + trailing_sz; + raw = palloc(allocsz); + batch = (IndexScanBatch) (raw + prefix_sz); + + /* Set up currTuples pointer for index-only scans */ + if (scan->xs_want_itup && scan->batch_tuples_workspace > 0) + { + Size itemsEnd = MAXALIGN(base_sz); + Size tableTrailing = 0; + + if (scan->xs_heapfetch && + scan->xs_heapfetch->batch_per_item_size > 0) + tableTrailing = MAXALIGN(scan->xs_heapfetch->batch_per_item_size * + scan->maxitemsbatch); + batch->currTuples = (char *) batch + itemsEnd + tableTrailing; + } + else + batch->currTuples = NULL; + + /* + * Batches allocate deadItems lazily (though note that cached batches + * keep their deadItems allocation when recycled) + */ + batch->deadItems = NULL; + new_alloc = true; + } + + /* xs_want_itup scans must get a currTuples space */ + Assert(!(scan->xs_want_itup && scan->batch_tuples_workspace > 0 && + batch->currTuples == NULL)); + + /* Let the table AM initialize its per-batch opaque area */ + if (scan->xs_heapfetch) + table_index_fetch_batch_init(scan, batch, new_alloc); + + /* shared initialization */ + batch->knownEndBackward = false; + batch->knownEndForward = false; + batch->isGuarded = false; + batch->firstItem = -1; + batch->lastItem = -1; + batch->numDead = 0; + + return batch; +} + +/* + * Release allocated batch + * + * This function is called by index AMs to release a batch allocated by + * indexam_util_batch_alloc. Batches are cached here for reuse to reduce + * palloc/pfree overhead. + * + * It's safe to release a batch immediately when it was used to read a page + * that returned no matches to the scan. Batches actually returned by index + * AM's amgetbatch routine (i.e. batches for pages with one or more matches) + * must be released by tableam_util_free_batch, which calls here after the + * index AM's amkillitemsbatch routine (if any). Index AMs that use batches + * should call here to release a batch from their amgetbatch or amgetbitmap + * routines. + * + * The rules for batch ownership differ slightly for amgetbitmap scans; see + * the amgetbitmap documentation in doc/src/sgml/indexam.sgml for details. + */ +void +indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch) +{ + if (!scan->usebatchring) + { + /* + * amgetbitmap scan caller. + * + * amgetbitmap routines are required to allocate no more than one + * batch at a time, so we'll always have a free slot. 
+ */ + Assert(scan->batchcache[0] == NULL); + Assert(scan->heapRelation == NULL); + Assert(batch->deadItems == NULL); + Assert(batch->currTuples == NULL); + + scan->batchcache[0] = batch; + return; + } + + /* amgetbatch scan caller */ + Assert(scan->heapRelation != NULL); + + /* + * Try to store caller's batch in this amgetbatch scan's cache of + * previously released batches first + */ + if (batch_cache_store(scan, batch)) + return; + + /* Cache full; just free the caller's batch */ + if (batch->deadItems) + pfree(batch->deadItems); + pfree(index_scan_batch_base(scan, batch)); +} + +/* + * Try to store a batch in the scan's batch cache. + * + * Returns true if a free slot was found, false if the cache is full. + */ +static inline bool +batch_cache_store(IndexScanDesc scan, IndexScanBatch batch) +{ + for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++) + { + if (scan->batchcache[i] == NULL) + { + scan->batchcache[i] = batch; + return true; + } + } + + return false; +} + +/* + * qsort comparison function for int arrays + */ +static int +batch_compare_int(const void *va, const void *vb) +{ + int a = *((const int *) va); + int b = *((const int *) vb); + + return pg_cmp_s32(a, b); +} diff --git a/src/backend/access/index/meson.build b/src/backend/access/index/meson.build index da64cb595..83dfa3f2b 100644 --- a/src/backend/access/index/meson.build +++ b/src/backend/access/index/meson.build @@ -5,4 +5,5 @@ backend_sources += files( 'amvalidate.c', 'genam.c', 'indexam.c', + 'indexbatch.c', ) diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index cb921ca2e..a37869b71 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -179,18 +179,15 @@ hold on to the pin (used when reading from the leaf page) until _after_ they're done visiting the heap (for TIDs from pinned leaf page) prevents concurrent TID recycling. VACUUM cannot get a conflicting cleanup lock until the index scan is totally finished processing its leaf page. +This is required by any index AM that implements the amgetbatch +interface. (See also, doc/src/sgml/indexam.sgml). -This approach is fairly coarse, so we avoid it whenever possible. In -practice most index scans won't hold onto their pin, and so won't block -VACUUM. These index scans must deal with TID recycling directly, which is -more complicated and not always possible. See later section on making -concurrent TID recycling safe. - -Opportunistic index tuple deletion performs almost the same page-level -modifications while only holding an exclusive lock. This is safe because -there is no question of TID recycling taking place later on -- only VACUUM -can make TIDs recyclable. See also simple deletion and bottom-up -deletion, below. +Opportunistic index tuple deletion performs the same page-level +modifications as VACUUM, while only holding an exclusive lock. This is +safe because there is no question of TID recycling taking place -- only +VACUUM can make TIDs recyclable. In other words, VACUUM's cleanup lock +serves to protect non-MVCC snapshot scans from concurrent TID recycling +hazards; it doesn't protect the B-Tree structure itself. Because a pin is not always held, and a page can be split even while someone does hold a pin on it, it is possible that an indexscan will @@ -440,54 +437,6 @@ whenever it is subsequently taken from the FSM for reuse. The deleted page's contents will be overwritten by the split operation (it will become the new right sibling page). 
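A minimal sketch of how an amgetbatch implementation is expected to combine the indexbatch.c utilities above when reading a single index page. The foo_* names are hypothetical stand-ins for AM-specific code (nbtree's real counterpart to foo_read_page is _bt_readpage, and its indexam_util_batch_unlock wrapper is _bt_batch_unlock); only indexam_util_batch_alloc, indexam_util_batch_unlock, indexam_util_batch_release, and UnlockReleaseBuffer are taken from the patch/core code as-is, and access/indexbatch.h is assumed to declare the batch types:

#include "postgres.h"

#include "access/genam.h"
#include "access/indexbatch.h"
#include "storage/bufmgr.h"

/* hypothetical AM-specific page reader, in the mold of _bt_readpage */
static bool foo_read_page(IndexScanDesc scan, IndexScanBatch batch,
						  Buffer buf, ScanDirection dir);

/*
 * Read one pinned and locked index page into a batch, following the
 * contract described above indexam_util_batch_alloc and
 * indexam_util_batch_unlock.  Returns NULL when the page has no matches.
 */
static IndexScanBatch
foo_get_page_batch(IndexScanDesc scan, Buffer buf, ScanDirection dir)
{
	/* new or recycled allocation; table AM opaque area gets initialized */
	IndexScanBatch batch = indexam_util_batch_alloc(scan);

	/* AM-specific step: fill batch->items[] (and currTuples) from the page */
	if (!foo_read_page(scan, batch, buf, dir))
	{
		/* No matches: drop both lock and pin ourselves, recycle the batch */
		UnlockReleaseBuffer(buf);
		indexam_util_batch_release(scan, batch);
		return NULL;
	}

	/*
	 * Matches found: set batch->lsn/isGuarded and drop the buffer lock
	 * (plus, for plain MVCC scans, the pin) before handing the batch to
	 * the table AM.  Guarded batches keep their pin until the table AM
	 * calls amunguardbatch; tableam_util_free_batch recycles the memory.
	 */
	indexam_util_batch_unlock(scan, batch, buf);
	return batch;
}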
-Making concurrent TID recycling safe ------------------------------------- - -As explained in the earlier section about deleting index tuples during -VACUUM, we implement a locking protocol that allows individual index scans -to avoid concurrent TID recycling. Index scans opt-out (and so drop their -leaf page pin when visiting the heap) whenever it's safe to do so, though. -Dropping the pin early is useful because it avoids blocking progress by -VACUUM. This is particularly important with index scans used by cursors, -since idle cursors sometimes stop for relatively long periods of time. In -extreme cases, a client application may hold on to an idle cursors for -hours or even days. Blocking VACUUM for that long could be disastrous. - -Index scans that don't hold on to a buffer pin are protected by holding an -MVCC snapshot instead. This more limited interlock prevents wrong answers -to queries, but it does not prevent concurrent TID recycling itself (only -holding onto the leaf page pin while accessing the heap ensures that). - -Index-only scans can never drop their buffer pin, since they are unable to -tolerate having a referenced TID become recyclable. Index-only scans -typically just visit the visibility map (not the heap proper), and so will -not reliably notice that any stale TID reference (for a TID that pointed -to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in -the heap by VACUUM. This could easily allow VACUUM to set the whole heap -page to all-visible in the visibility map immediately afterwards. An MVCC -snapshot is only sufficient to avoid problems during plain index scans -because they must access granular visibility information from the heap -proper. A plain index scan will even recognize LP_UNUSED items in the -heap (items that could be recycled but haven't been just yet) as "not -visible" -- even when the heap page is generally considered all-visible. - -LP_DEAD setting of index tuples by the kill_prior_tuple optimization -(described in full in simple deletion, below) is also more complicated for -index scans that drop their leaf page pins. We must be careful to avoid -LP_DEAD-marking any new index tuple that looks like a known-dead index -tuple because it happens to share the same TID, following concurrent TID -recycling. It's just about possible that some other session inserted a -new, unrelated index tuple, on the same leaf page, which has the same -original TID. It would be totally wrong to LP_DEAD-set this new, -unrelated index tuple. - -We handle this kill_prior_tuple race condition by having affected index -scans conservatively assume that any change to the leaf page at all -implies that it was reached by btbulkdelete in the interim period when no -buffer pin was held. This is implemented by not setting any LP_DEAD bits -on the leaf page at all when the page's LSN has changed. (This is why we -implement "fake" LSNs for unlogged index relations.) - Fastpath For Index Insertion ---------------------------- @@ -734,7 +683,7 @@ of readers could still move right to recover if we didn't couple same-level locks), but we prefer to be conservative here. During recovery all index scans start with ignore_killed_tuples = false -and we never set kill_prior_tuple. We do this because the oldest xmin +and we never LP_DEAD-mark tuples. We do this because the oldest xmin on the standby server can be older than the oldest xmin on the primary server, which means tuples can be marked LP_DEAD even when they are still visible on the standby. 
We don't WAL log tuple LP_DEAD bits, but @@ -756,9 +705,8 @@ non-MVCC scans is not required on standby nodes. We still get a full cleanup lock when replaying VACUUM records during recovery, but recovery does not need to lock every leaf page (only those leaf pages that have items to delete) -- that's sufficient to avoid breaking index-only scans -during recovery (see section above about making TID recycling safe). That -leaves concern only for plain index scans. (XXX: Not actually clear why -this is totally unnecessary during recovery.) +during recovery. That leaves concern only for plain index scans. +(XXX: Not actually clear why this is totally unnecessary during recovery.) MVCC snapshot plain index scans are always safe, for the same reasons that they're safe during original execution. HeapTupleSatisfiesToast() doesn't diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 054703861..0046c84d1 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -1060,6 +1060,9 @@ _bt_relbuf(Relation rel, Buffer buf) * Lock is acquired without acquiring another pin. This is like a raw * LockBuffer() call, but performs extra steps needed by Valgrind. * + * Note: _bt_batch_unlock in nbtsearch.c (indexam_util_batch_unlock wrapper + * function) has matching Valgrind buffer lock instrumentation. + * * Note: Caller may need to call _bt_checkpage() with buf when pin on buf * wasn't originally acquired in _bt_getbuf() or _bt_relandgetbuf(). */ @@ -1101,13 +1104,19 @@ _bt_unlockbuf(Relation rel, Buffer buf) * Buffer is pinned and locked, which means that it is expected to be * defined and addressable. Check that proactively. */ - VALGRIND_CHECK_MEM_IS_DEFINED(BufferGetPage(buf), BLCKSZ); +#if defined(USE_VALGRIND) + Page page = BufferGetPage(buf); + + VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ); +#endif /* LockBuffer() asserts that pin is held by this backend */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); +#if defined(USE_VALGRIND) if (!RelationUsesLocalBuffers(rel)) - VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ); + VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ); +#endif } /* diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c index 2ba1ca660..39c661498 100644 --- a/src/backend/access/nbtree/nbtreadpage.c +++ b/src/backend/access/nbtree/nbtreadpage.c @@ -32,6 +32,7 @@ typedef struct BTReadPageState { /* Input parameters, set by _bt_readpage for _bt_checkkeys */ ScanDirection dir; /* current scan direction */ + BlockNumber currpage; /* current page being read */ OffsetNumber minoff; /* Lowest non-pivot tuple's offset */ OffsetNumber maxoff; /* Highest non-pivot tuple's offset */ IndexTuple finaltup; /* Needed by scans with array keys */ @@ -63,14 +64,13 @@ static bool _bt_scanbehind_checkkeys(IndexScanDesc scan, ScanDirection dir, IndexTuple finaltup); static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir, IndexTuple finaltup); -static void _bt_saveitem(BTScanOpaque so, int itemIndex, - OffsetNumber offnum, IndexTuple itup); -static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex, - OffsetNumber offnum, const ItemPointerData *heapTid, - IndexTuple itup); -static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex, - OffsetNumber offnum, - ItemPointer heapTid, int tupleOffset); +static void _bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum, + IndexTuple itup, int *tupleOffset); +static int _bt_setuppostingitems(IndexScanBatch newbatch, 
int itemIndex, + OffsetNumber offnum, const ItemPointerData *tableTid, + IndexTuple itup, int *tupleOffset); +static inline void _bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum, + ItemPointer tableTid, int baseOffset); static bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys, IndexTuple tuple, int tupnatts); static bool _bt_check_compare(IndexScanDesc scan, ScanDirection dir, @@ -111,15 +111,15 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan); /* - * _bt_readpage() -- Load data from current index page into so->currPos + * _bt_readpage() -- Load data from current index page into newbatch. * - * Caller must have pinned and read-locked so->currPos.buf; the buffer's state - * is not changed here. Also, currPos.moreLeft and moreRight must be valid; - * they are updated as appropriate. All other fields of so->currPos are + * Caller must have pinned and read-locked newbatch.buf; the buffer's state is + * not changed here. Also, newbatch's moreLeft and moreRight must be valid; + * they are updated as appropriate. All other fields of newbatch are * initialized from scratch here. * * We scan the current page starting at offnum and moving in the indicated - * direction. All items matching the scan keys are loaded into currPos.items. + * direction. All items matching the scan keys are saved in newbatch.items. * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports * that there can be no more matching tuples in the current scan direction * (could just be for the current primitive index scan when scan has arrays). @@ -131,11 +131,12 @@ static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan); * Returns true if any matching items found on the page, false if none. */ bool -_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, - bool firstpage) +_bt_readpage(IndexScanDesc scan, IndexScanBatch newbatch, ScanDirection dir, + OffsetNumber offnum, bool firstpage) { Relation rel = scan->indexRelation; BTScanOpaque so = (BTScanOpaque) scan->opaque; + BTBatchData *btnewbatch = BTBatchGetData(scan, newbatch); Page page; BTPageOpaque opaque; OffsetNumber minoff; @@ -144,23 +145,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, bool arrayKeys, ignore_killed_tuples = scan->ignore_killed_tuples; int itemIndex, + tupleOffset = 0, indnatts; /* save the page/buffer block number, along with its sibling links */ - page = BufferGetPage(so->currPos.buf); + page = BufferGetPage(btnewbatch->buf); opaque = BTPageGetOpaque(page); - so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf); - so->currPos.prevPage = opaque->btpo_prev; - so->currPos.nextPage = opaque->btpo_next; - /* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */ - pstate.dir = so->currPos.dir = dir; - so->currPos.nextTupleOffset = 0; + pstate.currpage = btnewbatch->currPage = BufferGetBlockNumber(btnewbatch->buf); + btnewbatch->prevPage = opaque->btpo_prev; + btnewbatch->nextPage = opaque->btpo_next; + pstate.dir = newbatch->dir = dir; /* either moreRight or moreLeft should be set now (may be unset later) */ - Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight : - so->currPos.moreLeft); + Assert(ScanDirectionIsForward(dir) ? 
btnewbatch->moreRight : btnewbatch->moreLeft); Assert(!P_IGNORE(opaque)); - Assert(BTScanPosIsPinned(so->currPos)); Assert(!so->needPrimScan); /* initialize local variables */ @@ -188,14 +186,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, { /* allow next/prev page to be read by other worker without delay */ if (ScanDirectionIsForward(dir)) - _bt_parallel_release(scan, so->currPos.nextPage, - so->currPos.currPage); + _bt_parallel_release(scan, btnewbatch->nextPage, + btnewbatch->currPage); else - _bt_parallel_release(scan, so->currPos.prevPage, - so->currPos.currPage); + _bt_parallel_release(scan, btnewbatch->prevPage, + btnewbatch->currPage); } - PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot); + PredicateLockPage(rel, pstate.currpage, scan->xs_snapshot); if (ScanDirectionIsForward(dir)) { @@ -212,11 +210,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, !_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup)) { /* Schedule another primitive index scan after all */ - so->currPos.moreRight = false; + btnewbatch->moreRight = false; so->needPrimScan = true; if (scan->parallel_scan) _bt_parallel_primscan_schedule(scan, - so->currPos.currPage); + btnewbatch->currPage); return false; } } @@ -280,26 +278,26 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, if (!BTreeTupleIsPosting(itup)) { /* Remember it */ - _bt_saveitem(so, itemIndex, offnum, itup); + _bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset); itemIndex++; } else { - int tupleOffset; + int baseOffset; /* Set up posting list state (and remember first TID) */ - tupleOffset = - _bt_setuppostingitems(so, itemIndex, offnum, + baseOffset = + _bt_setuppostingitems(newbatch, itemIndex, offnum, BTreeTupleGetPostingN(itup, 0), - itup); + itup, &tupleOffset); itemIndex++; /* Remember all later TIDs (must be at least one) */ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++) { - _bt_savepostingitem(so, itemIndex, offnum, + _bt_savepostingitem(newbatch, itemIndex, offnum, BTreeTupleGetPostingN(itup, i), - tupleOffset); + baseOffset); itemIndex++; } } @@ -339,12 +337,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, } if (!pstate.continuescan) - so->currPos.moreRight = false; + btnewbatch->moreRight = false; Assert(itemIndex <= MaxTIDsPerBTreePage); - so->currPos.firstItem = 0; - so->currPos.lastItem = itemIndex - 1; - so->currPos.itemIndex = 0; + newbatch->firstItem = 0; + newbatch->lastItem = itemIndex - 1; } else { @@ -361,11 +358,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, !_bt_scanbehind_checkkeys(scan, dir, pstate.finaltup)) { /* Schedule another primitive index scan after all */ - so->currPos.moreLeft = false; + btnewbatch->moreLeft = false; so->needPrimScan = true; if (scan->parallel_scan) _bt_parallel_primscan_schedule(scan, - so->currPos.currPage); + btnewbatch->currPage); return false; } } @@ -466,27 +463,27 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, { /* Remember it */ itemIndex--; - _bt_saveitem(so, itemIndex, offnum, itup); + _bt_saveitem(newbatch, itemIndex, offnum, itup, &tupleOffset); } else { uint16 nitems = BTreeTupleGetNPosting(itup); - int tupleOffset; + int baseOffset; /* Set up posting list state (and remember last TID) */ itemIndex--; - tupleOffset = - _bt_setuppostingitems(so, itemIndex, offnum, + baseOffset = + _bt_setuppostingitems(newbatch, itemIndex, offnum, BTreeTupleGetPostingN(itup, nitems - 1), - itup); + 
itup, &tupleOffset); /* Remember all prior TIDs (must be at least one) */ for (int i = nitems - 2; i >= 0; i--) { itemIndex--; - _bt_savepostingitem(so, itemIndex, offnum, + _bt_savepostingitem(newbatch, itemIndex, offnum, BTreeTupleGetPostingN(itup, i), - tupleOffset); + baseOffset); } } } @@ -502,12 +499,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, * be found there */ if (!pstate.continuescan) - so->currPos.moreLeft = false; + btnewbatch->moreLeft = false; Assert(itemIndex >= 0); - so->currPos.firstItem = itemIndex; - so->currPos.lastItem = MaxTIDsPerBTreePage - 1; - so->currPos.itemIndex = MaxTIDsPerBTreePage - 1; + newbatch->firstItem = itemIndex; + newbatch->lastItem = MaxTIDsPerBTreePage - 1; } /* @@ -524,7 +520,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, */ Assert(!pstate.forcenonrequired); - return (so->currPos.firstItem <= so->currPos.lastItem); + return (newbatch->firstItem <= newbatch->lastItem); } /* @@ -1027,90 +1023,91 @@ _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir, return true; } -/* Save an index item into so->currPos.items[itemIndex] */ +/* Save an index item into newbatch.items[itemIndex] */ static void -_bt_saveitem(BTScanOpaque so, int itemIndex, - OffsetNumber offnum, IndexTuple itup) +_bt_saveitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum, + IndexTuple itup, int *tupleOffset) { - BTScanPosItem *currItem = &so->currPos.items[itemIndex]; - Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup)); - currItem->heapTid = itup->t_tid; - currItem->indexOffset = offnum; - if (so->currTuples) + newbatch->items[itemIndex].tableTid = itup->t_tid; + newbatch->items[itemIndex].indexOffset = offnum; + + if (newbatch->currTuples) { Size itupsz = IndexTupleSize(itup); - currItem->tupleOffset = so->currPos.nextTupleOffset; - memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz); - so->currPos.nextTupleOffset += MAXALIGN(itupsz); + newbatch->items[itemIndex].tupleOffset = *tupleOffset; + memcpy(newbatch->currTuples + *tupleOffset, itup, itupsz); + *tupleOffset += MAXALIGN(itupsz); } } /* * Setup state to save TIDs/items from a single posting list tuple. * - * Saves an index item into so->currPos.items[itemIndex] for TID that is - * returned to scan first. Second or subsequent TIDs for posting list should - * be saved by calling _bt_savepostingitem(). + * Saves an index item into newbatch.items[itemIndex] for TID that is returned + * to scan first. Second or subsequent TIDs for posting list should be saved + * by calling _bt_savepostingitem(). * - * Returns an offset into tuple storage space that main tuple is stored at if - * needed. + * Returns baseOffset, an offset into tuple storage space that main tuple is + * stored at if needed. 
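+ * (The return value is only meaningful during scans that save index tuples
+ * in newbatch->currTuples; otherwise 0 is returned.  Caller passes it to
+ * _bt_savepostingitem as its baseOffset argument.)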
*/ static int -_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum, - const ItemPointerData *heapTid, IndexTuple itup) +_bt_setuppostingitems(IndexScanBatch newbatch, int itemIndex, + OffsetNumber offnum, const ItemPointerData *tableTid, + IndexTuple itup, int *tupleOffset) { - BTScanPosItem *currItem = &so->currPos.items[itemIndex]; + BatchMatchingItem *item = &newbatch->items[itemIndex]; Assert(BTreeTupleIsPosting(itup)); - currItem->heapTid = *heapTid; - currItem->indexOffset = offnum; - if (so->currTuples) + item->tableTid = *tableTid; + item->indexOffset = offnum; + + if (newbatch->currTuples) { /* Save base IndexTuple (truncate posting list) */ IndexTuple base; Size itupsz = BTreeTupleGetPostingOffset(itup); itupsz = MAXALIGN(itupsz); - currItem->tupleOffset = so->currPos.nextTupleOffset; - base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset); + item->tupleOffset = *tupleOffset; + base = (IndexTuple) (newbatch->currTuples + *tupleOffset); memcpy(base, itup, itupsz); /* Defensively reduce work area index tuple header size */ base->t_info &= ~INDEX_SIZE_MASK; base->t_info |= itupsz; - so->currPos.nextTupleOffset += itupsz; + *tupleOffset += itupsz; - return currItem->tupleOffset; + return item->tupleOffset; } return 0; } /* - * Save an index item into so->currPos.items[itemIndex] for current posting + * Save an index item into newbatch.items[itemIndex] for current posting * tuple. * * Assumes that _bt_setuppostingitems() has already been called for current - * posting list tuple. Caller passes its return value as tupleOffset. + * posting list tuple. Caller passes its return value as baseOffset. */ static inline void -_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum, - ItemPointer heapTid, int tupleOffset) +_bt_savepostingitem(IndexScanBatch newbatch, int itemIndex, OffsetNumber offnum, + ItemPointer tableTid, int baseOffset) { - BTScanPosItem *currItem = &so->currPos.items[itemIndex]; + BatchMatchingItem *item = &newbatch->items[itemIndex]; - currItem->heapTid = *heapTid; - currItem->indexOffset = offnum; + item->tableTid = *tableTid; + item->indexOffset = offnum; /* * Have index-only scans return the same base IndexTuple for every TID * that originates from the same posting list */ - if (so->currTuples) - currItem->tupleOffset = tupleOffset; + if (newbatch->currTuples) + item->tupleOffset = baseOffset; } #define LOOK_AHEAD_REQUIRED_RECHECKS 3 @@ -2821,14 +2818,15 @@ new_prim_scan: * * Note: We make a soft assumption that the current scan direction will * also be used within _bt_next, when it is asked to step off this page. - * It is up to _bt_next to cancel this scheduled primitive index scan - * whenever it steps to a page in the direction opposite currPos.dir. + * The scan direction might be reversed during the next amgetbatch call, + * but not before a call to btposreset that resets the array keys to the + * first positions/elements used when scanning in this other direction. */ pstate->continuescan = false; /* Tell _bt_readpage we're done... */ so->needPrimScan = true; /* ...but call _bt_first again */ if (scan->parallel_scan) - _bt_parallel_primscan_schedule(scan, so->currPos.currPage); + _bt_parallel_primscan_schedule(scan, pstate->currpage); /* Caller's tuple doesn't match the new qual */ return false; @@ -2841,9 +2839,8 @@ end_toplevel_scan: * This ends the entire top-level scan in the current scan direction. 
* * Note: The scan's arrays (including any non-required arrays) are now in - * their final positions for the current scan direction. If the scan - * direction happens to change, then the arrays will already be in their - * first positions for what will then be the current scan direction. + * their final positions for the current scan direction. This is just + * defensive. */ pstate->continuescan = false; /* Tell _bt_readpage we're done... */ so->needPrimScan = false; /* ...and don't call _bt_first again */ @@ -2910,17 +2907,9 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir, /* * The array keys are now exhausted. * - * Restore the array keys to the state they were in immediately before we - * were called. This ensures that the arrays only ever ratchet in the - * current scan direction. - * - * Without this, scans could overlook matching tuples when the scan - * direction gets reversed just before btgettuple runs out of items to - * return, but just after _bt_readpage prepares all the items from the - * scan's final page in so->currPos. When we're on the final page it is - * typical for so->currPos to get invalidated once btgettuple finally - * returns false, which'll effectively invalidate the scan's array keys. - * That hasn't happened yet, though -- and in general it may never happen. + * Defensively restore the array keys to the positions they were in + * immediately before we were called (i.e. to their final positions for + * the current scan direction). */ _bt_start_array_keys(scan, -dir); diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 6d870e4eb..e087cb824 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -161,11 +161,13 @@ bthandler(PG_FUNCTION_ARGS) .amadjustmembers = btadjustmembers, .ambeginscan = btbeginscan, .amrescan = btrescan, - .amgettuple = btgettuple, + .amgettuple = NULL, + .amgetbatch = btgetbatch, + .amunguardbatch = btunguardbatch, + .amkillitemsbatch = btkillitemsbatch, .amgetbitmap = btgetbitmap, .amendscan = btendscan, - .ammarkpos = btmarkpos, - .amrestrpos = btrestrpos, + .amposreset = btposreset, .amestimateparallelscan = btestimateparallelscan, .aminitparallelscan = btinitparallelscan, .amparallelrescan = btparallelrescan, @@ -224,13 +226,13 @@ btinsert(Relation rel, Datum *values, bool *isnull, } /* - * btgettuple() -- Get the next tuple in the scan. + * btgetbatch() -- Get the first or next batch of tuples in the scan */ -bool -btgettuple(IndexScanDesc scan, ScanDirection dir) +IndexScanBatch +btgetbatch(IndexScanDesc scan, IndexScanBatch priorbatch, ScanDirection dir) { BTScanOpaque so = (BTScanOpaque) scan->opaque; - bool res; + IndexScanBatch batch = priorbatch; Assert(scan->heapRelation != NULL); @@ -243,45 +245,20 @@ btgettuple(IndexScanDesc scan, ScanDirection dir) /* * If we've already initialized this scan, we can just advance it in * the appropriate direction. If we haven't done so yet, we call - * _bt_first() to get the first item in the scan. + * _bt_first() to get the first batch in the scan. */ - if (!BTScanPosIsValid(so->currPos)) - res = _bt_first(scan, dir); + if (batch == NULL) + batch = _bt_first(scan, dir); else - { - /* - * Check to see if we should kill the previously-fetched tuple. - */ - if (scan->kill_prior_tuple) - { - /* - * Yes, remember it for later. (We'll deal with all such - * tuples at once right before leaving the index page.) 
The - * test for numKilled overrun is not just paranoia: if the - * caller reverses direction in the indexscan then the same - * item might get entered multiple times. It's not worth - * trying to optimize that, so we don't detect it, but instead - * just forget any excess entries. - */ - if (so->killedItems == NULL) - so->killedItems = palloc_array(int, MaxTIDsPerBTreePage); - if (so->numKilled < MaxTIDsPerBTreePage) - so->killedItems[so->numKilled++] = so->currPos.itemIndex; - } + batch = _bt_next(scan, dir, batch); - /* - * Now continue the scan. - */ - res = _bt_next(scan, dir); - } - - /* If we have a tuple, return it ... */ - if (res) + /* If we have a batch, return it ... */ + if (batch) break; /* ... otherwise see if we need another primitive index scan */ } while (so->numArrayKeys && _bt_start_prim_scan(scan)); - return res; + return batch; } /* @@ -291,38 +268,43 @@ int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) { BTScanOpaque so = (BTScanOpaque) scan->opaque; + IndexScanBatch batch; int64 ntids = 0; - ItemPointer heapTid; + ItemPointer tableTid; Assert(scan->heapRelation == NULL); /* Each loop iteration performs another primitive index scan */ do { - /* Fetch the first page & tuple */ - if (_bt_first(scan, ForwardScanDirection)) + /* Fetch the first batch */ + if ((batch = _bt_first(scan, ForwardScanDirection))) { - /* Save tuple ID, and continue scanning */ - heapTid = &scan->xs_heaptid; - tbm_add_tuples(tbm, heapTid, 1, false); + int itemIndex = 0; + + /* Save first tuple's TID */ + tableTid = &batch->items[itemIndex].tableTid; + tbm_add_tuples(tbm, tableTid, 1, false); ntids++; for (;;) { - /* - * Advance to next tuple within page. This is the same as the - * easy case in _bt_next(). - */ - if (++so->currPos.itemIndex > so->currPos.lastItem) + /* Advance to next TID within page-sized batch */ + if (++itemIndex > batch->lastItem) { - /* let _bt_next do the heavy lifting */ - if (!_bt_next(scan, ForwardScanDirection)) + /* + * _bt_next releases the prior batch for bitmap callers + * before allocating the next one, so only one batch is + * ever used at a time + */ + itemIndex = 0; + batch = _bt_next(scan, ForwardScanDirection, batch); + if (!batch) break; } - /* Save tuple ID, and continue scanning */ - heapTid = &so->currPos.items[so->currPos.itemIndex].heapTid; - tbm_add_tuples(tbm, heapTid, 1, false); + tableTid = &batch->items[itemIndex].tableTid; + tbm_add_tuples(tbm, tableTid, 1, false); ntids++; } } @@ -349,8 +331,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys) /* allocate private workspace */ so = palloc_object(BTScanOpaqueData); - BTScanPosInvalidate(so->currPos); - BTScanPosInvalidate(so->markPos); if (scan->numberOfKeys > 0) so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData)); else @@ -364,19 +344,11 @@ btbeginscan(Relation rel, int nkeys, int norderbys) so->orderProcs = NULL; so->arrayContext = NULL; - so->killedItems = NULL; /* until needed */ - so->numKilled = 0; - - /* - * We don't know yet whether the scan will be index-only, so we do not - * allocate the tuple workspace arrays until btrescan. However, we set up - * scan->xs_itupdesc whether we'll need it or not, since that's so cheap. 
- */ - so->currTuples = so->markTuples = NULL; - - scan->xs_itupdesc = RelationGetDescr(rel); - scan->opaque = so; + scan->xs_itupdesc = RelationGetDescr(rel); + scan->maxitemsbatch = MaxTIDsPerBTreePage; + scan->batch_index_opaque_size = MAXALIGN(sizeof(BTBatchData)); + scan->batch_tuples_workspace = BLCKSZ; return scan; } @@ -390,64 +362,185 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, { BTScanOpaque so = (BTScanOpaque) scan->opaque; - /* we aren't holding any read locks, but gotta drop the pins */ - if (BTScanPosIsValid(so->currPos)) - { - /* Before leaving current page, deal with any killed items */ - if (so->numKilled > 0) - _bt_killitems(scan); - BTScanPosUnpinIfPinned(so->currPos); - BTScanPosInvalidate(so->currPos); - } - - /* - * We prefer to eagerly drop leaf page pins before btgettuple returns. - * This avoids making VACUUM wait to acquire a cleanup lock on the page. - * - * We cannot safely drop leaf page pins during index-only scans due to a - * race condition involving VACUUM setting pages all-visible in the VM. - * It's also unsafe for plain index scans that use a non-MVCC snapshot. - * - * Also opt out of dropping leaf page pins eagerly during bitmap scans. - * Pins cannot be held for more than an instant during bitmap scans either - * way, so we might as well avoid wasting cycles on acquiring page LSNs. - * - * See nbtree/README section on making concurrent TID recycling safe. - * - * Note: so->dropPin should never change across rescans. - */ - so->dropPin = (!scan->xs_want_itup && - IsMVCCLikeSnapshot(scan->xs_snapshot) && - scan->heapRelation != NULL); - - so->markItemIndex = -1; - so->needPrimScan = false; - so->scanBehind = false; - so->oppositeDirCheck = false; - BTScanPosUnpinIfPinned(so->markPos); - BTScanPosInvalidate(so->markPos); - - /* - * Allocate tuple workspace arrays, if needed for an index-only scan and - * not already done in a previous rescan call. To save on palloc - * overhead, both workspaces are allocated as one palloc block; only this - * function and btendscan know that. - */ - if (scan->xs_want_itup && so->currTuples == NULL) - { - so->currTuples = (char *) palloc(BLCKSZ * 2); - so->markTuples = so->currTuples + BLCKSZ; - } - /* * Reset the scan keys */ if (scankey && scan->numberOfKeys > 0) memcpy(scan->keyData, scankey, scan->numberOfKeys * sizeof(ScanKeyData)); + so->needPrimScan = false; + so->scanBehind = false; + so->oppositeDirCheck = false; so->numberOfKeys = 0; /* until _bt_preprocess_keys sets it */ so->numArrayKeys = 0; /* ditto */ } +/* + * btunguardbatch() -- Drop batch's TID recycling interlock (buffer pin) + * + * Called by the table AM when it's safe to drop the buffer pin held to + * prevent concurrent TID recycling by VACUUM. 
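+ * Dropping the pin as soon as it's no longer needed avoids making VACUUM
+ * wait to acquire a cleanup lock on the batch's leaf page.  Only called
+ * during !batchImmediateUnguard scans, whose batches retain their pin as a
+ * guard against concurrent TID recycling (see the batch->isGuarded
+ * assertion below).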
+ */ +void +btunguardbatch(IndexScanDesc scan, IndexScanBatch batch) +{ + BTBatchData *btbatch = BTBatchGetData(scan, batch); + + /* Should be called exactly once iff !batchImmediateUnguard */ + Assert(!scan->batchImmediateUnguard); + Assert(batch->isGuarded); + + ReleaseBuffer(btbatch->buf); +} + +/* + * btkillitemsbatch() -- Mark dead items' index tuples LP_DEAD + */ +void +btkillitemsbatch(IndexScanDesc scan, IndexScanBatch batch) +{ + Relation rel = scan->indexRelation; + BTBatchData *btbatch = BTBatchGetData(scan, batch); + Page page; + BTPageOpaque opaque; + OffsetNumber minoff; + OffsetNumber maxoff; + bool killedsomething = false; + Buffer buf; + XLogRecPtr latestlsn; + + /* Table AM should have already released batch page's pin by now */ + Assert(batch->numDead > 0); + + buf = _bt_getbuf(rel, btbatch->currPage, BT_READ); + + latestlsn = BufferGetLSNAtomic(buf); + Assert(batch->lsn <= latestlsn); + if (batch->lsn != latestlsn) + { + /* Modified, give up on hinting */ + _bt_relbuf(rel, buf); + return; + } + + page = BufferGetPage(buf); + opaque = BTPageGetOpaque(page); + minoff = P_FIRSTDATAKEY(opaque); + maxoff = PageGetMaxOffsetNumber(page); + + /* Iterate through batch->deadItems[] in leaf page order */ + for (int i = 0; i < batch->numDead; i++) + { + int itemIndex = batch->deadItems[i]; + BatchMatchingItem *kitem = &batch->items[itemIndex]; + OffsetNumber offnum = kitem->indexOffset; + + Assert(itemIndex >= batch->firstItem && itemIndex <= batch->lastItem); + Assert(i == 0 || + offnum >= batch->items[batch->deadItems[i - 1]].indexOffset); + + if (offnum < minoff) + continue; /* pure paranoia */ + while (offnum <= maxoff) + { + ItemId iid = PageGetItemId(page, offnum); + IndexTuple ituple = (IndexTuple) PageGetItem(page, iid); + bool killtuple = false; + + if (BTreeTupleIsPosting(ituple)) + { + int pi = i + 1; + int nposting = BTreeTupleGetNPosting(ituple); + int j; + + for (j = 0; j < nposting; j++) + { + ItemPointer item = BTreeTupleGetPostingN(ituple, j); + + if (!ItemPointerEquals(item, &kitem->tableTid)) + break; /* out of posting list loop */ + + Assert(kitem->indexOffset == offnum); + + /* + * Read-ahead to later kitems here. + * + * We rely on the assumption that not advancing kitem here + * will prevent us from considering the posting list tuple + * fully dead by not matching its next heap TID in next + * loop iteration. + * + * If, on the other hand, this is the final heap TID in + * the posting list tuple, then tuple gets killed + * regardless (i.e. we handle the case where the last + * kitem is also the last heap TID in the last index tuple + * correctly -- posting tuple still gets killed). + */ + if (pi < batch->numDead) + kitem = &batch->items[batch->deadItems[pi++]]; + } + + /* + * Don't bother advancing the outermost loop's int iterator to + * avoid processing dead items that relate to the same + * offnum/posting list tuple. This micro-optimization hardly + * seems worth it. (Further iterations of the outermost loop + * will fail to match on this same posting list's first heap + * TID instead, so we'll advance to the next offnum/index + * tuple pretty quickly.) + */ + if (j == nposting) + killtuple = true; + } + else if (ItemPointerEquals(&ituple->t_tid, &kitem->tableTid)) + killtuple = true; + + /* + * Mark index item as dead, if it isn't already. 
Since this + * happens while holding a shared buffer lock, it's possible that + * multiple processes attempt to do this simultaneously, leading + * to multiple full-page images being sent to WAL (if + * wal_log_hints or data checksums are enabled), which is + * undesirable. + */ + if (killtuple && !ItemIdIsDead(iid)) + { + if (!killedsomething) + { + /* + * Use the hint bit infrastructure to check if we can + * update the page while just holding a share lock. If we + * are not allowed, there's no point continuing. + */ + if (!BufferBeginSetHintBits(buf)) + goto unlock_page; + } + + /* found the item/all posting list items */ + ItemIdMarkDead(iid); + killedsomething = true; + break; /* out of inner search loop */ + } + offnum = OffsetNumberNext(offnum); + } + } + + /* + * Since this can be redone later if needed, mark as dirty hint. + * + * Whenever we mark anything LP_DEAD, we also set the page's + * BTP_HAS_GARBAGE flag, which is likewise just a hint. (Note that we + * only rely on the page-level flag in !heapkeyspace indexes.) + */ + if (killedsomething) + { + opaque->btpo_flags |= BTP_HAS_GARBAGE; + BufferFinishSetHintBits(buf, true, true); + } + +unlock_page: + _bt_relbuf(rel, buf); +} + /* * btendscan() -- close down a scan */ @@ -456,116 +549,63 @@ btendscan(IndexScanDesc scan) { BTScanOpaque so = (BTScanOpaque) scan->opaque; - /* we aren't holding any read locks, but gotta drop the pins */ - if (BTScanPosIsValid(so->currPos)) - { - /* Before leaving current page, deal with any killed items */ - if (so->numKilled > 0) - _bt_killitems(scan); - BTScanPosUnpinIfPinned(so->currPos); - } - - so->markItemIndex = -1; - BTScanPosUnpinIfPinned(so->markPos); - - /* No need to invalidate positions, the RAM is about to be freed. */ - /* Release storage */ if (so->keyData != NULL) pfree(so->keyData); /* so->arrayKeys and so->orderProcs are in arrayContext */ if (so->arrayContext != NULL) MemoryContextDelete(so->arrayContext); - if (so->killedItems != NULL) - pfree(so->killedItems); - if (so->currTuples != NULL) - pfree(so->currTuples); - /* so->markTuples should not be pfree'd, see btrescan */ pfree(so); } /* - * btmarkpos() -- save current scan position + * btposreset() -- reset array key state for scan position change + * + * Called by the core system when the scan's logical position is about to + * change in a way that invalidates our array key state. This happens when + * restoring a marked position, or when the scan crosses a batch boundary + * while moving in the opposite direction to the one originally used. + * + * For direction changes, the core system will have already flipped the + * batch's dir field before calling here; we use this updated direction when + * resetting our array keys. For mark restoration, the batch's dir will + * retain its original value (from when btgetbatch returned it). */ void -btmarkpos(IndexScanDesc scan) +btposreset(IndexScanDesc scan, IndexScanBatch batch) { BTScanOpaque so = (BTScanOpaque) scan->opaque; + BTBatchData *btbatch = BTBatchGetData(scan, batch); - /* There may be an old mark with a pin (but no lock). */ - BTScanPosUnpinIfPinned(so->markPos); + if (!so->numArrayKeys) + return; /* - * Just record the current itemIndex. If we later step to next page - * before releasing the marked position, _bt_steppage makes a full copy of - * the currPos struct in markPos. If (as often happens) the mark is moved - * before we leave the page, we don't have to do that work. + * Reset array keys to initial state for the batch's scan direction. 
Also + * clear needPrimScan and related flags. These were set based on the soft + * assumption that the scan would always proceed in the same direction. + * + * These steps work around the soft assumption being violated: they force + * the scan to step to the next/previous page, making the arrays recover. + * When we go to read that page, _bt_readpage will reliably determine if a + * primitive scan really is needed based on the page's tuples. If there's + * a primitive scan, it will reposition the scan using new array values + * (based on the tuples from the neighboring page we'll step on to). + * + * We need to reset the array key state in the correct direction so that + * we won't get confused. When the array keys are behind the key space + * for the page we're stepping on to (behind in terms of the scan dir), + * they will catch up automatically. But when they're ahead of that + * page's key space, the scan could miss matching tuples. */ - if (BTScanPosIsValid(so->currPos)) - so->markItemIndex = so->currPos.itemIndex; + _bt_start_array_keys(scan, batch->dir); + if (ScanDirectionIsForward(batch->dir)) + btbatch->moreRight = true; else - { - BTScanPosInvalidate(so->markPos); - so->markItemIndex = -1; - } -} - -/* - * btrestrpos() -- restore scan to last saved position - */ -void -btrestrpos(IndexScanDesc scan) -{ - BTScanOpaque so = (BTScanOpaque) scan->opaque; - - if (so->markItemIndex >= 0) - { - /* - * The scan has never moved to a new page since the last mark. Just - * restore the itemIndex. - * - * NB: In this case we can't count on anything in so->markPos to be - * accurate. - */ - so->currPos.itemIndex = so->markItemIndex; - } - else - { - /* - * The scan moved to a new page after last mark or restore, and we are - * now restoring to the marked page. We aren't holding any read - * locks, but if we're still holding the pin for the current position, - * we must drop it. - */ - if (BTScanPosIsValid(so->currPos)) - { - /* Before leaving current page, deal with any killed items */ - if (so->numKilled > 0) - _bt_killitems(scan); - BTScanPosUnpinIfPinned(so->currPos); - } - - if (BTScanPosIsValid(so->markPos)) - { - /* bump pin on mark buffer for assignment to current buffer */ - if (BTScanPosIsPinned(so->markPos)) - IncrBufferRefCount(so->markPos.buf); - memcpy(&so->currPos, &so->markPos, - offsetof(BTScanPosData, items[1]) + - so->markPos.lastItem * sizeof(BTScanPosItem)); - if (so->currTuples) - memcpy(so->currTuples, so->markTuples, - so->markPos.nextTupleOffset); - /* Reset the scan's array keys (see _bt_steppage for why) */ - if (so->numArrayKeys) - { - _bt_start_array_keys(scan, so->currPos.dir); - so->needPrimScan = false; - } - } - else - BTScanPosInvalidate(so->currPos); - } + btbatch->moreLeft = true; + so->needPrimScan = false; + so->scanBehind = false; + so->oppositeDirCheck = false; } /* @@ -881,15 +921,6 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page, *next_scan_page = InvalidBlockNumber; *last_curr_page = InvalidBlockNumber; - /* - * Reset so->currPos, and initialize moreLeft/moreRight such that the next - * call to _bt_readnextpage treats this backend similarly to a serial - * backend that steps from *last_curr_page to *next_scan_page (unless this - * backend's so->currPos is initialized by _bt_readfirstpage before then). 
- */ - BTScanPosInvalidate(so->currPos); - so->currPos.moreLeft = so->currPos.moreRight = true; - if (first) { /* @@ -1039,8 +1070,6 @@ _bt_parallel_done(IndexScanDesc scan) BTParallelScanDesc btscan; bool status_changed = false; - Assert(!BTScanPosIsValid(so->currPos)); - /* Do nothing, for non-parallel scans */ if (parallel_scan == NULL) return; diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index aae6acb7f..c089ec38d 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -23,53 +23,49 @@ #include "pgstat.h" #include "storage/predicate.h" #include "utils/lsyscache.h" +#include "utils/memdebug.h" #include "utils/rel.h" -static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so); +static inline void _bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, + Buffer buf); static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key, Buffer buf, bool forupdate, BTStack stack, int access); static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf); static int _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum); -static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so); -static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir); -static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, - ScanDirection dir); -static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, - BlockNumber lastcurrblkno, ScanDirection dir, - bool seized); +static IndexScanBatch _bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch, + OffsetNumber offnum, ScanDirection dir); +static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, + BlockNumber lastcurrblkno, + ScanDirection dir, bool firstpage); static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno, BlockNumber lastcurrblkno); -static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir); +static IndexScanBatch _bt_endpoint(IndexScanDesc scan, ScanDirection dir, + IndexScanBatch firstbatch); /* - * _bt_drop_lock_and_maybe_pin() + * _bt_batch_unlock() -- nbtree wrapper for indexam_util_batch_unlock. * - * Unlock so->currPos.buf. If scan is so->dropPin, drop the pin, too. - * Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock. + * Performs the same Valgrind instrumentation as _bt_unlockbuf. */ static inline void -_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so) +_bt_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, Buffer buf) { - if (!so->dropPin) - { - /* Just drop the lock (not the pin) */ - _bt_unlockbuf(rel, so->currPos.buf); - return; - } +#if defined(USE_VALGRIND) + Page page = BufferGetPage(buf); - /* - * Drop both the lock and the pin. - * - * Have to set so->currPos.lsn so that _bt_killitems has a way to detect - * when concurrent heap TID recycling by VACUUM might have taken place. - */ - so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf); - _bt_relbuf(rel, so->currPos.buf); - so->currPos.buf = InvalidBuffer; + VALGRIND_CHECK_MEM_IS_DEFINED(page, BLCKSZ); +#endif + + indexam_util_batch_unlock(scan, batch, buf); + +#if defined(USE_VALGRIND) + if (!RelationUsesLocalBuffers(scan->indexRelation)) + VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ); +#endif } /* @@ -860,26 +856,25 @@ _bt_compare(Relation rel, } /* - * _bt_first() -- Find the first item in a scan. + * _bt_first() -- Find the first batch in a scan. 
* * We need to be clever about the direction of scan, the search - * conditions, and the tree ordering. We find the first item (or, - * if backwards scan, the last item) in the tree that satisfies the - * qualifications in the scan key. On success exit, data about the - * matching tuple(s) on the page has been loaded into so->currPos. We'll - * drop all locks and hold onto a pin on page's buffer, except during - * so->dropPin scans, when we drop both the lock and the pin. - * _bt_returnitem sets the next item to return to scan on success exit. + * conditions, and the tree ordering. We find the first leaf page (or + * the last leaf page, when scanning backwards) in the tree with at least + * one tuple that satisfies the qualifications in the scan key. On + * success exit, we return a new batch with that page's matching items. * - * If there are no matching items in the index, we return false, with no - * pins or locks held. so->currPos will remain invalid. + * If there are no matching items in the index (in the given scan direction), + * we just return NULL. Note that returning NULL doesn't necessarily mean the + * end of the top-level scan; caller should check so->needPrimScan to + * determine if another primitive index scan is required. * * Note that scan->keyData[], and the so->keyData[] scankey built from it, * are both search-type scankeys (see nbtree/README for more about this). * Within this routine, we build a temporary insertion-type scankey to use * in locating the scan start position. */ -bool +IndexScanBatch _bt_first(IndexScanDesc scan, ScanDirection dir) { Relation rel = scan->indexRelation; @@ -892,8 +887,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) StrategyNumber strat_total = InvalidStrategy; BlockNumber blkno = InvalidBlockNumber, lastcurrblkno; + IndexScanBatch firstbatch; + BTBatchData *btfirstbatch; - Assert(!BTScanPosIsValid(so->currPos)); + /* Allocate space for first batch */ + firstbatch = indexam_util_batch_alloc(scan); + btfirstbatch = BTBatchGetData(scan, firstbatch); /* * Examine the scan keys and eliminate any redundant keys; also mark the @@ -909,7 +908,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) { Assert(!so->needPrimScan); _bt_parallel_done(scan); - return false; + indexam_util_batch_release(scan, firstbatch); + return NULL; } /* @@ -918,7 +918,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) */ if (scan->parallel_scan != NULL && !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true)) - return false; + { + indexam_util_batch_release(scan, firstbatch); + return NULL; /* definitely done (so->needPrimScan is unset) */ + } /* * Initialize the scan's arrays (if any) for the current scan direction @@ -938,11 +941,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) Assert(!so->needPrimScan); Assert(blkno != P_NONE); - if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true)) - return false; + indexam_util_batch_release(scan, firstbatch); - _bt_returnitem(scan, so); - return true; + return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, true); } /* @@ -1242,7 +1243,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * Note: calls _bt_readfirstpage for us, which releases the parallel scan. */ if (keysz == 0) - return _bt_endpoint(scan, dir); + return _bt_endpoint(scan, dir, firstbatch); /* * We want to start the scan somewhere within the index. 
Set up an @@ -1502,7 +1503,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) default: /* can't get here, but keep compiler quiet */ elog(ERROR, "unrecognized strat_total: %d", (int) strat_total); - return false; + return NULL; } /* @@ -1510,9 +1511,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * position ourselves on the target leaf page. */ Assert(ScanDirectionIsBackward(dir) == inskey.backward); - _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false); + _bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false); - if (!BufferIsValid(so->currPos.buf)) + if (unlikely(!BufferIsValid(btfirstbatch->buf))) { Assert(!so->needPrimScan); @@ -1528,22 +1529,23 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) if (IsolationIsSerializable()) { PredicateLockRelation(rel, scan->xs_snapshot); - _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ, false); + _bt_search(rel, NULL, &inskey, &btfirstbatch->buf, BT_READ, false); } - if (!BufferIsValid(so->currPos.buf)) + if (!BufferIsValid(btfirstbatch->buf)) { _bt_parallel_done(scan); - return false; + indexam_util_batch_release(scan, firstbatch); + return NULL; } } /* position to the precise item on the page */ - offnum = _bt_binsrch(rel, &inskey, so->currPos.buf); + offnum = _bt_binsrch(rel, &inskey, btfirstbatch->buf); /* * Now load data from the first page of the scan (usually the page - * currently in so->currPos.buf). + * currently in firstbatch.buf). * * If inskey.nextkey = false and inskey.backward = false, offnum is * positioned at the first non-pivot tuple >= inskey.scankeys. @@ -1561,164 +1563,72 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * for the page. For example, when inskey is both < the leaf page's high * key and > all of its non-pivot tuples, offnum will be "maxoff + 1". */ - if (!_bt_readfirstpage(scan, offnum, dir)) - return false; - - _bt_returnitem(scan, so); - return true; + return _bt_readfirstpage(scan, firstbatch, offnum, dir); } /* - * _bt_next() -- Get the next item in a scan. + * _bt_next() -- Get the next batch in a scan. * - * On entry, so->currPos describes the current page, which may be pinned - * but is not locked, and so->currPos.itemIndex identifies which item was - * previously returned. + * On entry, priorbatch describes the batch that was last returned by + * btgetbatch. We'll use the prior batch's positioning information to + * decide which leaf page to read next. * - * On success exit, so->currPos is updated as needed, and _bt_returnitem - * sets the next item to return to the scan. so->currPos remains valid. - * - * On failure exit (no more tuples), we invalidate so->currPos. It'll - * still be possible for the scan to return tuples by changing direction, - * though we'll need to call _bt_first anew in that other direction. + * On success exit, returns the next batch. There must be at least one + * matching tuple on any returned batch (else we'd just return NULL). + * Note that returning NULL doesn't necessarily mean the end of the + * top-level scan; caller should check so->needPrimScan to determine + * if another primitive index scan is required. */ -bool -_bt_next(IndexScanDesc scan, ScanDirection dir) +IndexScanBatch +_bt_next(IndexScanDesc scan, ScanDirection dir, IndexScanBatch priorbatch) { - BTScanOpaque so = (BTScanOpaque) scan->opaque; - - Assert(BTScanPosIsValid(so->currPos)); - - /* - * Advance to next tuple on current page; or if there's no more, try to - * step to the next page with data. 
- */ - if (ScanDirectionIsForward(dir)) - { - if (++so->currPos.itemIndex > so->currPos.lastItem) - { - if (!_bt_steppage(scan, dir)) - return false; - } - } - else - { - if (--so->currPos.itemIndex < so->currPos.firstItem) - { - if (!_bt_steppage(scan, dir)) - return false; - } - } - - _bt_returnitem(scan, so); - return true; -} - -/* - * Return the index item from so->currPos.items[so->currPos.itemIndex] to the - * index scan by setting the relevant fields in caller's index scan descriptor - */ -static inline void -_bt_returnitem(IndexScanDesc scan, BTScanOpaque so) -{ - BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex]; - - /* Most recent _bt_readpage must have succeeded */ - Assert(BTScanPosIsValid(so->currPos)); - Assert(so->currPos.itemIndex >= so->currPos.firstItem); - Assert(so->currPos.itemIndex <= so->currPos.lastItem); - - /* Return next item, per amgettuple contract */ - scan->xs_heaptid = currItem->heapTid; - if (so->currTuples) - scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset); -} - -/* - * _bt_steppage() -- Step to next page containing valid data for scan - * - * Wrapper on _bt_readnextpage that performs final steps for the current page. - * - * On entry, so->currPos must be valid. Its buffer will be pinned, though - * never locked. (Actually, when so->dropPin there won't even be a pin held, - * though so->currPos.currPage must still be set to a valid block number.) - */ -static bool -_bt_steppage(IndexScanDesc scan, ScanDirection dir) -{ - BTScanOpaque so = (BTScanOpaque) scan->opaque; + BTBatchData *btpriorbatch = BTBatchGetData(scan, priorbatch); BlockNumber blkno, lastcurrblkno; - - Assert(BTScanPosIsValid(so->currPos)); - - /* Before leaving current page, deal with any killed items */ - if (so->numKilled > 0) - _bt_killitems(scan); + bool moreInDir; /* - * Before we modify currPos, make a copy of the page data if there was a - * mark position that needs it. + * The core code must deal with cross-batch scan direction changes for us. + * A batch management routine that flips priorbatch's scan direction (and + * calls btposreset to deal with the scan's array keys) is used for this. */ - if (so->markItemIndex >= 0) - { - /* bump pin on current buffer for assignment to mark buffer */ - if (BTScanPosIsPinned(so->currPos)) - IncrBufferRefCount(so->currPos.buf); - memcpy(&so->markPos, &so->currPos, - offsetof(BTScanPosData, items[1]) + - so->currPos.lastItem * sizeof(BTScanPosItem)); - if (so->markTuples) - memcpy(so->markTuples, so->currTuples, - so->currPos.nextTupleOffset); - so->markPos.itemIndex = so->markItemIndex; - so->markItemIndex = -1; - - /* - * If we're just about to start the next primitive index scan - * (possible with a scan that has arrays keys, and needs to skip to - * continue in the current scan direction), moreLeft/moreRight only - * indicate the end of the current primitive index scan. They must - * never be taken to indicate that the top-level index scan has ended - * (that would be wrong). - * - * We could handle this case by treating the current array keys as - * markPos state. But depending on the current array state like this - * would add complexity. Instead, we just unset markPos's copy of - * moreRight or moreLeft (whichever might be affected), while making - * btrestrpos reset the scan's arrays to their initial scan positions. - * In effect, btrestrpos leaves advancing the arrays up to the first - * _bt_readpage call (that takes place after it has restored markPos). 
- */ - if (so->needPrimScan) - { - if (ScanDirectionIsForward(so->currPos.dir)) - so->markPos.moreRight = true; - else - so->markPos.moreLeft = true; - } - - /* mark/restore not supported by parallel scans */ - Assert(!scan->parallel_scan); - } - - BTScanPosUnpinIfPinned(so->currPos); + Assert(priorbatch->dir == dir); /* Walk to the next page with data */ if (ScanDirectionIsForward(dir)) - blkno = so->currPos.nextPage; + blkno = btpriorbatch->nextPage; else - blkno = so->currPos.prevPage; - lastcurrblkno = so->currPos.currPage; + blkno = btpriorbatch->prevPage; + lastcurrblkno = btpriorbatch->currPage; + moreInDir = ScanDirectionIsForward(dir) ? + btpriorbatch->moreRight : btpriorbatch->moreLeft; /* - * Cancel primitive index scans that were scheduled when the call to - * _bt_readpage for currPos happened to use the opposite direction to the - * one that we're stepping in now. (It's okay to leave the scan's array - * keys as-is, since the next _bt_readpage will advance them.) + * For bitmap scan callers, release the prior batch now so that + * _bt_readnextpage can reuse its memory. That way bitmap scans never + * need more than one batch allocation. */ - if (so->currPos.dir != dir) - so->needPrimScan = false; + if (!scan->usebatchring) + indexam_util_batch_release(scan, priorbatch); + + if (blkno == P_NONE || !moreInDir) + { + /* + * priorbatch's page is known to be the final leaf page with matches + * in this scan direction (its _bt_readpage call figured that out). + * + * Note: if so->needPrimScan is set, then priorbatch's leaf page is + * actually just the final page for the current primitive index scan + * in this scan direction (the scan will continue in _bt_first). + */ + _bt_parallel_done(scan); + return NULL; + } + + /* parallel scan must seize the scan to get next blkno */ + if (scan->parallel_scan != NULL && + !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false)) + return NULL; /* done iff so->needPrimScan wasn't set */ return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false); } @@ -1732,178 +1642,169 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) * to stop the scan on this page by calling _bt_checkkeys against the high * key. See _bt_readpage for full details. * - * On entry, so->currPos must be pinned and locked (so offnum stays valid). + * On entry, firstbatch must be pinned and locked (so offnum stays valid). * Parallel scan callers must have seized the scan before calling here. * - * On exit, we'll have updated so->currPos and retained locks and pins - * according to the same rules as those laid out for _bt_readnextpage exit. - * Like _bt_readnextpage, our return value indicates if there are any matching - * records in the given direction. + * On success exit, returns unlocked batch containing data from the next page + * that has at least one matching item. If there are no matching items in the + * given scan direction, we just return NULL. Note that returning NULL + * doesn't necessarily mean the end of the top-level scan; btgetbatch and + * btgetbitmap check so->needPrimScan to determine if another primitive index + * scan is required. * * We always release the scan for a parallel scan caller, regardless of * success or failure; we'll call _bt_parallel_release as soon as possible. 
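+ * Note that a successful return might not hand back caller's firstbatch:
+ * when the first page has no matches, firstbatch is released, and
+ * _bt_readnextpage allocates a fresh batch for whichever later page turns
+ * out to contain the scan's next matching item.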
*/ -static bool -_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir) +static IndexScanBatch +_bt_readfirstpage(IndexScanDesc scan, IndexScanBatch firstbatch, + OffsetNumber offnum, ScanDirection dir) { BTScanOpaque so = (BTScanOpaque) scan->opaque; + BTBatchData *btfirstbatch = BTBatchGetData(scan, firstbatch); + BlockNumber blkno, + lastcurrblkno; + bool moreInDir; - so->numKilled = 0; /* just paranoia */ - so->markItemIndex = -1; /* ditto */ - - /* Initialize so->currPos for the first page (page in so->currPos.buf) */ + /* Initialize firstbatch's position for the first page */ if (so->needPrimScan) { Assert(so->numArrayKeys); - so->currPos.moreLeft = true; - so->currPos.moreRight = true; + btfirstbatch->moreLeft = true; + btfirstbatch->moreRight = true; so->needPrimScan = false; } else if (ScanDirectionIsForward(dir)) { - so->currPos.moreLeft = false; - so->currPos.moreRight = true; + btfirstbatch->moreLeft = false; + btfirstbatch->moreRight = true; } else { - so->currPos.moreLeft = true; - so->currPos.moreRight = false; + btfirstbatch->moreLeft = true; + btfirstbatch->moreRight = false; } /* * Attempt to load matching tuples from the first page. * - * Note that _bt_readpage will finish initializing the so->currPos fields. + * Note that _bt_readpage will finish initializing the firstbatch fields. * _bt_readpage also releases parallel scan (even when it returns false). */ - if (_bt_readpage(scan, dir, offnum, true)) + if (_bt_readpage(scan, firstbatch, dir, offnum, true)) { - Relation rel = scan->indexRelation; - - /* - * _bt_readpage succeeded. Drop the lock (and maybe the pin) on - * so->currPos.buf in preparation for btgettuple returning tuples. - */ - Assert(BTScanPosIsPinned(so->currPos)); - _bt_drop_lock_and_maybe_pin(rel, so); - return true; + /* _bt_readpage saved one or more matches in firstbatch.items[] */ + _bt_batch_unlock(scan, firstbatch, btfirstbatch->buf); + return firstbatch; } - /* There's no actually-matching data on the page in so->currPos.buf */ - _bt_unlockbuf(scan->indexRelation, so->currPos.buf); + /* There's no actually-matching data on the page returned by _bt_search */ + _bt_relbuf(scan->indexRelation, btfirstbatch->buf); - /* Call _bt_readnextpage using its _bt_steppage wrapper function */ - if (!_bt_steppage(scan, dir)) - return false; + /* Walk to the next page with data */ + if (ScanDirectionIsForward(dir)) + blkno = btfirstbatch->nextPage; + else + blkno = btfirstbatch->prevPage; + lastcurrblkno = btfirstbatch->currPage; + moreInDir = ScanDirectionIsForward(dir) ? 
+ btfirstbatch->moreRight : btfirstbatch->moreLeft; - /* _bt_readpage for a later page (now in so->currPos) succeeded */ - return true; + /* Release firstbatch (will be recycled if we reach _bt_readnextpage) */ + indexam_util_batch_release(scan, firstbatch); + + if (blkno == P_NONE || !moreInDir) + { + /* + * firstbatch _bt_readpage call ended scan in this direction (though + * if so->needPrimScan was set the scan will continue in _bt_first) + */ + _bt_parallel_done(scan); + return NULL; + } + + /* parallel scan must seize the scan to get next blkno */ + if (scan->parallel_scan != NULL && + !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false)) + return NULL; /* done iff so->needPrimScan wasn't set */ + + return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false); } /* * _bt_readnextpage() -- Read next page containing valid data for _bt_next * - * Caller's blkno is the next interesting page's link, taken from either the - * previously-saved right link or left link. lastcurrblkno is the page that - * was current at the point where the blkno link was saved, which we use to - * reason about concurrent page splits/page deletions during backwards scans. - * In the common case where seized=false, blkno is either so->currPos.nextPage - * or so->currPos.prevPage, and lastcurrblkno is so->currPos.currPage. + * Caller's blkno is the prior batch's nextPage or prevPage (depending on the + * current scan direction), and lastcurrblkno is the prior batch's currPage. + * We use lastcurrblkno to reason about concurrent page splits/page deletions + * during backwards scans. * - * On entry, so->currPos shouldn't be locked by caller. so->currPos.buf must - * be InvalidBuffer/unpinned as needed by caller (note that lastcurrblkno - * won't need to be read again in almost all cases). Parallel scan callers - * that seized the scan before calling here should pass seized=true; such a - * caller's blkno and lastcurrblkno arguments come from the seized scan. - * seized=false callers just pass us the blkno/lastcurrblkno taken from their - * so->currPos, which (along with so->currPos itself) can be used to end the - * scan. A seized=false caller's blkno can never be assumed to be the page - * that must be read next during a parallel scan, though. We must figure that - * part out for ourselves by seizing the scan (the correct page to read might - * already be beyond the seized=false caller's blkno during a parallel scan, - * unless blkno/so->currPos.nextPage/so->currPos.prevPage is already P_NONE, - * or unless so->currPos.moreRight/so->currPos.moreLeft is already unset). + * On entry, no page should be locked by caller. * - * On success exit, so->currPos is updated to contain data from the next - * interesting page, and we return true. We hold a pin on the buffer on - * success exit (except during so->dropPin index scans, when we drop the pin - * eagerly to avoid blocking VACUUM). + * On success exit, returns unlocked batch containing data from the next page + * that has at least one matching item. If there are no more matching items + * in the given scan direction, we just return NULL. Note that returning NULL + * doesn't necessarily mean the end of the top-level scan; btgetbatch and + * btgetbitmap check so->needPrimScan to determine if another primitive index + * scan is required. * - * If there are no more matching records in the given direction, we invalidate - * so->currPos (while ensuring it retains no locks or pins), and return false. 
- * - * We always release the scan for a parallel scan caller, regardless of - * success or failure; we'll call _bt_parallel_release as soon as possible. + * Parallel scan callers must seize the scan before calling here. blkno and + * lastcurrblkno should come from the seized scan. We'll release the scan as + * soon as possible. */ -static bool +static IndexScanBatch _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, - BlockNumber lastcurrblkno, ScanDirection dir, bool seized) + BlockNumber lastcurrblkno, ScanDirection dir, bool firstpage) { Relation rel = scan->indexRelation; - BTScanOpaque so = (BTScanOpaque) scan->opaque; + IndexScanBatch newbatch; + BTBatchData *btnewbatch; - Assert(so->currPos.currPage == lastcurrblkno || seized); - Assert(!(blkno == P_NONE && seized)); - Assert(!BTScanPosIsPinned(so->currPos)); + /* Allocate space for new batch */ + newbatch = indexam_util_batch_alloc(scan); + btnewbatch = BTBatchGetData(scan, newbatch); /* - * Remember that the scan already read lastcurrblkno, a page to the left - * of blkno (or remember reading a page to the right, for backwards scans) + * newbatch will be the batch for blkno, a page to the right of + * lastcurrblkno (or to the left, when the scan is moving backwards). + * + * Note: caller's blkno is tentative. newbatch actually stores matches + * from the next leaf page in this scan direction that has at least one + * matching item. This is usually caller's blkno page, but might be some + * other page to its right (or to its left) instead. */ - if (ScanDirectionIsForward(dir)) - so->currPos.moreLeft = true; - else - so->currPos.moreRight = true; + btnewbatch->moreLeft = true; /* for lastcurrblkno (or tentative) */ + btnewbatch->moreRight = true; /* tentative (or for lastcurrblkno) */ for (;;) { Page page; BTPageOpaque opaque; - if (blkno == P_NONE || - (ScanDirectionIsForward(dir) ? 
- !so->currPos.moreRight : !so->currPos.moreLeft)) - { - /* most recent _bt_readpage call (for lastcurrblkno) ended scan */ - Assert(so->currPos.currPage == lastcurrblkno && !seized); - BTScanPosInvalidate(so->currPos); - _bt_parallel_done(scan); /* iff !so->needPrimScan */ - return false; - } - - Assert(!so->needPrimScan); - - /* parallel scan must never actually visit so->currPos blkno */ - if (!seized && scan->parallel_scan != NULL && - !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false)) - { - /* whole scan is now done (or another primitive scan required) */ - BTScanPosInvalidate(so->currPos); - return false; - } + Assert(!((BTScanOpaque) scan->opaque)->needPrimScan); + Assert(blkno != P_NONE && lastcurrblkno != P_NONE); if (ScanDirectionIsForward(dir)) { /* read blkno, but check for interrupts first */ CHECK_FOR_INTERRUPTS(); - so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ); + btnewbatch->buf = _bt_getbuf(rel, blkno, BT_READ); } else { /* read blkno, avoiding race (also checks for interrupts) */ - so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno, + btnewbatch->buf = _bt_lock_and_validate_left(rel, &blkno, lastcurrblkno); - if (so->currPos.buf == InvalidBuffer) + if (btnewbatch->buf == InvalidBuffer) { /* must have been a concurrent deletion of leftmost page */ - BTScanPosInvalidate(so->currPos); _bt_parallel_done(scan); - return false; + indexam_util_batch_release(scan, newbatch); + return NULL; } } - page = BufferGetPage(so->currPos.buf); + page = BufferGetPage(btnewbatch->buf); opaque = BTPageGetOpaque(page); lastcurrblkno = blkno; if (likely(!P_IGNORE(opaque))) @@ -1911,17 +1812,17 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, /* see if there are any matches on this page */ if (ScanDirectionIsForward(dir)) { - /* note that this will clear moreRight if we can stop */ - if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), seized)) + if (_bt_readpage(scan, newbatch, dir, + P_FIRSTDATAKEY(opaque), firstpage)) break; - blkno = so->currPos.nextPage; + blkno = btnewbatch->nextPage; } else { - /* note that this will clear moreLeft if we can stop */ - if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), seized)) + if (_bt_readpage(scan, newbatch, dir, + PageGetMaxOffsetNumber(page), firstpage)) break; - blkno = so->currPos.prevPage; + blkno = btnewbatch->prevPage; } } else @@ -1936,19 +1837,38 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, } /* no matching tuples on this page */ - _bt_relbuf(rel, so->currPos.buf); - seized = false; /* released by _bt_readpage (or by us) */ + _bt_relbuf(rel, btnewbatch->buf); + + /* Continue the scan in this direction? */ + if (blkno == P_NONE || + (ScanDirectionIsForward(dir) ? + !btnewbatch->moreRight : !btnewbatch->moreLeft)) + { + /* + * blkno _bt_readpage call ended scan in this direction (though if + * so->needPrimScan was set the scan will continue in _bt_first) + */ + _bt_parallel_done(scan); + indexam_util_batch_release(scan, newbatch); + return NULL; + } + + /* parallel scan must seize the scan to get next blkno */ + if (scan->parallel_scan != NULL && + !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false)) + { + indexam_util_batch_release(scan, newbatch); + return NULL; /* done iff so->needPrimScan wasn't set */ + } + + firstpage = false; /* next page cannot be first */ } - /* - * _bt_readpage succeeded. Drop the lock (and maybe the pin) on - * so->currPos.buf in preparation for btgettuple returning tuples. 
- */ - Assert(so->currPos.currPage == blkno); - Assert(BTScanPosIsPinned(so->currPos)); - _bt_drop_lock_and_maybe_pin(rel, so); + /* _bt_readpage saved one or more matches in newbatch.items[] */ + Assert(btnewbatch->currPage == blkno); + _bt_batch_unlock(scan, newbatch, btnewbatch->buf); - return true; + return newbatch; } /* @@ -2174,25 +2094,24 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost) * Parallel scan callers must have seized the scan before calling here. * Exit conditions are the same as for _bt_first(). */ -static bool -_bt_endpoint(IndexScanDesc scan, ScanDirection dir) +static IndexScanBatch +_bt_endpoint(IndexScanDesc scan, ScanDirection dir, IndexScanBatch firstbatch) { Relation rel = scan->indexRelation; - BTScanOpaque so = (BTScanOpaque) scan->opaque; + BTBatchData *btfirstbatch = BTBatchGetData(scan, firstbatch); Page page; BTPageOpaque opaque; OffsetNumber start; - Assert(!BTScanPosIsValid(so->currPos)); - Assert(!so->needPrimScan); + Assert(!((BTScanOpaque) scan->opaque)->needPrimScan); /* * Scan down to the leftmost or rightmost leaf page. This is a simplified * version of _bt_search(). */ - so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir)); + btfirstbatch->buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir)); - if (!BufferIsValid(so->currPos.buf)) + if (!BufferIsValid(btfirstbatch->buf)) { /* * Empty index. Lock the whole relation, as nothing finer to lock @@ -2200,10 +2119,10 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) */ PredicateLockRelation(rel, scan->xs_snapshot); _bt_parallel_done(scan); - return false; + return NULL; } - page = BufferGetPage(so->currPos.buf); + page = BufferGetPage(btfirstbatch->buf); opaque = BTPageGetOpaque(page); Assert(P_ISLEAF(opaque)); @@ -2229,9 +2148,5 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) /* * Now load data from the first page of the scan. */ - if (!_bt_readfirstpage(scan, start, dir)) - return false; - - _bt_returnitem(scan, so); - return true; + return _bt_readfirstpage(scan, firstbatch, start, dir); } diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 732bc750c..415e2a1c0 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -19,10 +19,7 @@ #include "access/nbtree.h" #include "access/reloptions.h" -#include "access/relscan.h" #include "commands/progress.h" -#include "common/int.h" -#include "lib/qunique.h" #include "miscadmin.h" #include "storage/lwlock.h" #include "utils/datum.h" @@ -30,7 +27,6 @@ #include "utils/rel.h" -static int _bt_compare_int(const void *va, const void *vb); static int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright, BTScanInsert itup_key); @@ -145,247 +141,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup) return key; } -/* - * qsort comparison function for int arrays - */ -static int -_bt_compare_int(const void *va, const void *vb) -{ - int a = *((const int *) va); - int b = *((const int *) vb); - - return pg_cmp_s32(a, b); -} - -/* - * _bt_killitems - set LP_DEAD state for items an indexscan caller has - * told us were killed - * - * scan->opaque, referenced locally through so, contains information about the - * current page and killed tuples thereon (generally, this should only be - * called if so->numKilled > 0). - * - * Caller should not have a lock on the so->currPos page, but must hold a - * buffer pin when !so->dropPin. When we return, it still won't be locked. 
- * It'll continue to hold whatever pins were held before calling here. - * - * We match items by heap TID before assuming they are the right ones to set - * LP_DEAD. If the scan is one that holds a buffer pin on the target page - * continuously from initially reading the items until applying this function - * (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the - * page, so the page's TIDs can't have been recycled by now. There's no risk - * that we'll confuse a new index tuple that happens to use a recycled TID - * with a now-removed tuple with the same TID (that used to be on this same - * page). We can't rely on that during scans that drop buffer pins eagerly - * (so->dropPin scans), though, so we must condition setting LP_DEAD bits on - * the page LSN having not changed since back when _bt_readpage saw the page. - * We totally give up on setting LP_DEAD bits when the page LSN changed. - * - * We give up much less often during !so->dropPin scans, but it still happens. - * We cope with cases where items have moved right due to insertions. If an - * item has moved off the current page due to a split, we'll fail to find it - * and just give up on it. - */ -void -_bt_killitems(IndexScanDesc scan) -{ - Relation rel = scan->indexRelation; - BTScanOpaque so = (BTScanOpaque) scan->opaque; - Page page; - BTPageOpaque opaque; - OffsetNumber minoff; - OffsetNumber maxoff; - int numKilled = so->numKilled; - bool killedsomething = false; - Buffer buf; - - Assert(numKilled > 0); - Assert(BTScanPosIsValid(so->currPos)); - Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */ - - /* Always invalidate so->killedItems[] before leaving so->currPos */ - so->numKilled = 0; - - /* - * We need to iterate through so->killedItems[] in leaf page order; the - * loop below expects this (when marking posting list tuples, at least). - * so->killedItems[] is now in whatever order the scan returned items in. - * Scrollable cursor scans might have even saved the same item/TID twice. - * - * Sort and unique-ify so->killedItems[] to deal with all this. - */ - if (numKilled > 1) - { - qsort(so->killedItems, numKilled, sizeof(int), _bt_compare_int); - numKilled = qunique(so->killedItems, numKilled, sizeof(int), - _bt_compare_int); - } - - if (!so->dropPin) - { - /* - * We have held the pin on this page since we read the index tuples, - * so all we need to do is lock it. The pin will have prevented - * concurrent VACUUMs from recycling any of the TIDs on the page. 
- */ - Assert(BTScanPosIsPinned(so->currPos)); - buf = so->currPos.buf; - _bt_lockbuf(rel, buf, BT_READ); - } - else - { - XLogRecPtr latestlsn; - - Assert(!BTScanPosIsPinned(so->currPos)); - buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); - - latestlsn = BufferGetLSNAtomic(buf); - Assert(so->currPos.lsn <= latestlsn); - if (so->currPos.lsn != latestlsn) - { - /* Modified, give up on hinting */ - _bt_relbuf(rel, buf); - return; - } - - /* Unmodified, hinting is safe */ - } - - page = BufferGetPage(buf); - opaque = BTPageGetOpaque(page); - minoff = P_FIRSTDATAKEY(opaque); - maxoff = PageGetMaxOffsetNumber(page); - - /* Iterate through so->killedItems[] in leaf page order */ - for (int i = 0; i < numKilled; i++) - { - int itemIndex = so->killedItems[i]; - BTScanPosItem *kitem = &so->currPos.items[itemIndex]; - OffsetNumber offnum = kitem->indexOffset; - - Assert(itemIndex >= so->currPos.firstItem && - itemIndex <= so->currPos.lastItem); - Assert(i == 0 || - offnum >= so->currPos.items[so->killedItems[i - 1]].indexOffset); - - if (offnum < minoff) - continue; /* pure paranoia */ - while (offnum <= maxoff) - { - ItemId iid = PageGetItemId(page, offnum); - IndexTuple ituple = (IndexTuple) PageGetItem(page, iid); - bool killtuple = false; - - if (BTreeTupleIsPosting(ituple)) - { - int pi = i + 1; - int nposting = BTreeTupleGetNPosting(ituple); - int j; - - /* - * Note that the page may have been modified in almost any way - * since we first read it (in the !so->dropPin case), so it's - * possible that this posting list tuple wasn't a posting list - * tuple when we first encountered its heap TIDs. - */ - for (j = 0; j < nposting; j++) - { - ItemPointer item = BTreeTupleGetPostingN(ituple, j); - - if (!ItemPointerEquals(item, &kitem->heapTid)) - break; /* out of posting list loop */ - - /* - * kitem must have matching offnum when heap TIDs match, - * though only in the common case where the page can't - * have been concurrently modified - */ - Assert(kitem->indexOffset == offnum || !so->dropPin); - - /* - * Read-ahead to later kitems here. - * - * We rely on the assumption that not advancing kitem here - * will prevent us from considering the posting list tuple - * fully dead by not matching its next heap TID in next - * loop iteration. - * - * If, on the other hand, this is the final heap TID in - * the posting list tuple, then tuple gets killed - * regardless (i.e. we handle the case where the last - * kitem is also the last heap TID in the last index tuple - * correctly -- posting tuple still gets killed). - */ - if (pi < numKilled) - kitem = &so->currPos.items[so->killedItems[pi++]]; - } - - /* - * Don't bother advancing the outermost loop's int iterator to - * avoid processing killed items that relate to the same - * offnum/posting list tuple. This micro-optimization hardly - * seems worth it. (Further iterations of the outermost loop - * will fail to match on this same posting list's first heap - * TID instead, so we'll advance to the next offnum/index - * tuple pretty quickly.) - */ - if (j == nposting) - killtuple = true; - } - else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)) - killtuple = true; - - /* - * Mark index item as dead, if it isn't already. Since this - * happens while holding a buffer lock possibly in shared mode, - * it's possible that multiple processes attempt to do this - * simultaneously, leading to multiple full-page images being sent - * to WAL (if wal_log_hints or data checksums are enabled), which - * is undesirable. 
- */ - if (killtuple && !ItemIdIsDead(iid)) - { - if (!killedsomething) - { - /* - * Use the hint bit infrastructure to check if we can - * update the page while just holding a share lock. If we - * are not allowed, there's no point continuing. - */ - if (!BufferBeginSetHintBits(buf)) - goto unlock_page; - } - - /* found the item/all posting list items */ - ItemIdMarkDead(iid); - killedsomething = true; - break; /* out of inner search loop */ - } - offnum = OffsetNumberNext(offnum); - } - } - - /* - * Since this can be redone later if needed, mark as dirty hint. - * - * Whenever we mark anything LP_DEAD, we also set the page's - * BTP_HAS_GARBAGE flag, which is likewise just a hint. (Note that we - * only rely on the page-level flag in !heapkeyspace indexes.) - */ - if (killedsomething) - { - opaque->btpo_flags |= BTP_HAS_GARBAGE; - BufferFinishSetHintBits(buf, true, true); - } - -unlock_page: - if (!so->dropPin) - _bt_unlockbuf(rel, buf); - else - _bt_relbuf(rel, buf); -} - - /* * The following routines manage a shared-memory area in which we track * assignment of "vacuum cycle IDs" to currently-active btree vacuuming diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index dff7d286f..3bc5e5ccd 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -1095,15 +1095,15 @@ btree_mask(char *pagedata, BlockNumber blkno) /* * In btree leaf pages, it is possible to modify the LP_FLAGS without * emitting any WAL record. Hence, mask the line pointer flags. See - * _bt_killitems(), _bt_check_unique() for details. + * btkillitemsbatch(), _bt_check_unique() for details. */ mask_lp_flags(page); } /* * BTP_HAS_GARBAGE is just an un-logged hint bit. So, mask it. See - * _bt_delete_or_dedup_one_page(), _bt_killitems(), and _bt_check_unique() - * for details. + * _bt_delete_or_dedup_one_page(), btkillitemsbatch(), and + * _bt_check_unique() for details. 
*/ maskopaq->btpo_flags &= ~BTP_HAS_GARBAGE; diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c index f2ee333f6..1f4523102 100644 --- a/src/backend/access/spgist/spgutils.c +++ b/src/backend/access/spgist/spgutils.c @@ -88,10 +88,12 @@ spghandler(PG_FUNCTION_ARGS) .ambeginscan = spgbeginscan, .amrescan = spgrescan, .amgettuple = spggettuple, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = spggetbitmap, .amendscan = spgendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c index c09473b97..0fff801a4 100644 --- a/src/backend/access/table/tableamapi.c +++ b/src/backend/access/table/tableamapi.c @@ -53,6 +53,9 @@ GetTableAmRoutine(Oid amhandler) Assert(routine->index_fetch_begin != NULL); Assert(routine->index_fetch_reset != NULL); Assert(routine->index_fetch_end != NULL); + Assert(routine->index_fetch_batch_init != NULL); + Assert(routine->index_fetch_markpos != NULL); + Assert(routine->index_fetch_restrpos != NULL); Assert(routine->fetch_tid != NULL); Assert(routine->tuple_fetch_row_version != NULL); diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index cba379810..53f1ad82d 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -885,7 +885,7 @@ DefineIndex(ParseState *pstate, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("access method \"%s\" does not support multicolumn indexes", accessMethodName))); - if (exclusion && amRoutine->amgettuple == NULL) + if (exclusion && amRoutine->amgettuple == NULL && amRoutine->amgetbatch == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("access method \"%s\" does not support exclusion constraints", diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c index 37fe03fdc..979a852fe 100644 --- a/src/backend/executor/execAmi.c +++ b/src/backend/executor/execAmi.c @@ -429,7 +429,7 @@ ExecSupportsMarkRestore(Path *pathnode) case T_IndexOnlyScan: /* - * Not all index types support mark/restore. + * Not all index types support restoring a mark */ return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos; diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index f8421a74c..6e8fe07f6 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -54,8 +54,8 @@ * the inner "5's". This requires repositioning the inner "cursor" * to point at the first inner "5". This is done by "marking" the * first inner 5 so we can restore the "cursor" to it before joining - * with the second outer 5. The access method interface provides - * routines to mark and restore to a tuple. + * with the second outer 5. The table AM interface provides + * routines to mark and restore to a tuple during index scans. 
* * * Essential operation of the merge join algorithm is as follows: diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 67d9dc35f..edc7e4736 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -43,7 +43,7 @@ /* Whether we are looking for plain indexscan, bitmap scan, or either */ typedef enum { - ST_INDEXSCAN, /* must support amgettuple */ + ST_INDEXSCAN, /* must support amgettuple or amgetbatch */ ST_BITMAPSCAN, /* must support amgetbitmap */ ST_ANYSCAN, /* either is okay */ } ScanTypeControl; @@ -747,7 +747,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel, { IndexPath *ipath = (IndexPath *) lfirst(lc); - if (index->amhasgettuple) + if (index->amcanplainscan) add_path(rel, (Path *) ipath); if (index->amhasgetbitmap && @@ -835,7 +835,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel, switch (scantype) { case ST_INDEXSCAN: - if (!index->amhasgettuple) + if (!index->amcanplainscan) return NIL; break; case ST_BITMAPSCAN: diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 7c4be1748..06a2e949d 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -310,11 +310,11 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent, info->amsearcharray = amroutine->amsearcharray; info->amsearchnulls = amroutine->amsearchnulls; info->amcanparallel = amroutine->amcanparallel; - info->amhasgettuple = (amroutine->amgettuple != NULL); + info->amcanplainscan = (amroutine->amgetbatch != NULL || + amroutine->amgettuple != NULL); info->amhasgetbitmap = amroutine->amgetbitmap != NULL && relation->rd_tableam->scan_bitmap_next_tuple != NULL; - info->amcanmarkpos = (amroutine->ammarkpos != NULL && - amroutine->amrestrpos != NULL); + info->amcanmarkpos = amroutine->amgetbatch != NULL; info->amcostestimate = amroutine->amcostestimate; Assert(info->amcostestimate != NULL); @@ -411,7 +411,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent, info->amsearcharray = false; info->amsearchnulls = false; info->amcanparallel = false; - info->amhasgettuple = false; + info->amcanplainscan = false; info->amhasgetbitmap = false; info->amcanmarkpos = false; info->amcostestimate = NULL; diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c index 0b1d80b5b..0d0bd468f 100644 --- a/src/backend/replication/logical/relation.c +++ b/src/backend/replication/logical/relation.c @@ -836,6 +836,7 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap) { AttrNumber keycol; oidvector *indclass; + const IndexAmRoutine *amroutine; /* The index must not be a partial index */ if (!heap_attisnull(idxrel->rd_indextuple, Anum_pg_index_indpred, NULL)) @@ -887,10 +888,12 @@ IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap) return false; /* - * The given index access method must implement "amgettuple", which will - * be used later to fetch the tuples. See RelationFindReplTupleByIndex(). + * The given index access method must implement "amgettuple" or + * "amgetbatch", which will be used later to fetch the tuples. See + * RelationFindReplTupleByIndex(). 
*/ - if (GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false)->amgettuple == NULL) + amroutine = GetIndexAmRoutineByAmId(idxrel->rd_rel->relam, false); + if (amroutine->amgettuple == NULL && amroutine->amgetbatch == NULL) return false; return true; diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c index c81fb61a0..ddfd1b55c 100644 --- a/src/backend/utils/adt/amutils.c +++ b/src/backend/utils/adt/amutils.c @@ -363,10 +363,11 @@ indexam_property(FunctionCallInfo fcinfo, PG_RETURN_BOOL(routine->amclusterable); case AMPROP_INDEX_SCAN: - PG_RETURN_BOOL(routine->amgettuple ? true : false); + PG_RETURN_BOOL(routine->amgettuple != NULL || + routine->amgetbatch != NULL); case AMPROP_BITMAP_SCAN: - PG_RETURN_BOOL(routine->amgetbitmap ? true : false); + PG_RETURN_BOOL(routine->amgetbitmap != NULL); case AMPROP_BACKWARD_SCAN: PG_RETURN_BOOL(routine->amcanbackward); @@ -392,7 +393,8 @@ indexam_property(FunctionCallInfo fcinfo, PG_RETURN_BOOL(routine->amcanmulticol); case AMPROP_CAN_EXCLUDE: - PG_RETURN_BOOL(routine->amgettuple ? true : false); + PG_RETURN_BOOL(routine->amgettuple != NULL || + routine->amgetbatch != NULL); case AMPROP_CAN_INCLUDE: PG_RETURN_BOOL(routine->amcaninclude); diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index 5111cdc6d..6139fb959 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -146,10 +146,12 @@ blhandler(PG_FUNCTION_ARGS) .ambeginscan = blbeginscan, .amrescan = blrescan, .amgettuple = NULL, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = blgetbitmap, .amendscan = blendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index f48da3185..322ba6b75 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -167,10 +167,12 @@ typedef struct IndexAmRoutine ambeginscan_function ambeginscan; amrescan_function amrescan; amgettuple_function amgettuple; /* can be NULL */ + amgetbatch_function amgetbatch; /* can be NULL */ + amunguardbatch_function amunguardbatch; /* can be NULL */ + amkillitemsbatch_function amkillitemsbatch; /* can be NULL */ amgetbitmap_function amgetbitmap; /* can be NULL */ amendscan_function amendscan; - ammarkpos_function ammarkpos; /* can be NULL */ - amrestrpos_function amrestrpos; /* can be NULL */ + amposreset_function amposreset; /* can be NULL */ /* interface functions to support parallel index scans */ amestimateparallelscan_function amestimateparallelscan; /* can be NULL */ @@ -676,8 +678,38 @@ ambeginscan (Relation indexRelation, must create this struct by calling RelationGetIndexScan(). In most cases ambeginscan does little beyond making that call and perhaps - acquiring locks; + acquiring locks and initializing standard IndexScanDesc fields; the interesting parts of index-scan startup are in amrescan. + Index access methods that use the amgetbatch interface + must also set the following fields in the scan descriptor: + + + + scan->maxitemsbatch: the maximum number of items + that can appear in a single batch (typically derived from the index page + size, e.g., MaxIndexTuplesPerPage). + + + + + scan->batch_index_opaque_size: the + MAXALIGN'd size of the index AM's per-batch opaque + area. 
Each batch allocation reserves this much space immediately before + the IndexScanBatchData pointer, for use by the + index AM to store per-page navigation state (e.g., batch index page's + buffer pin and sibling page links). + + + + + scan->batch_tuples_workspace: the size in bytes + of the per-batch tuple storage workspace used for index-only scans + (typically BLCKSZ), or 0 if the index AM does not + support index-only scans. The workspace is accessible via + batch->currTuples. + + + @@ -749,6 +781,264 @@ amgettuple (IndexScanDesc scan, amgettuple field in its IndexAmRoutine struct must be set to NULL. + + + As of PostgreSQL version 19, mark/restore of + scan positions is only supported for scans that use the + amgetbatch interface. + amgettuple scans do not support mark/restore. The + mark/restore implementation lives in the table AM, through its + table_index_fetch_markpos and + table_index_fetch_restrpos implementations. + + + + + +IndexScanBatch +amgetbatch (IndexScanDesc scan, + IndexScanBatch priorbatch, + ScanDirection direction); + + Return the next batch of index tuples in the given scan, moving in the + given direction (forward or backward in the index). Returns an instance of + IndexScanBatch with index tuples loaded, or + NULL if there are no more index tuples in the given + scan direction. + + + + The amgetbatch interface is an alternative to + amgettuple that returns matching index entries in batches + rather than one at a time. By returning all matching index entries from a + single index page together, the table AM gains visibility into which table + blocks will be needed in the near future. + + + + The table AM passes priorbatch to indicate where the + index AM should continue scanning from (or NULL on the + first call for the scan). The index AM uses information from + priorbatch to determine which index page to read next. + Unlike amgettuple, where the index AM maintains its + own scan position, with amgetbatch it is the caller + that controls the progress of the scan through the index. The caller + will typically pass the most recently returned batch, but this is not + guaranteed — for example, following the restoration of a marked + position, an earlier batch may be passed instead. + + + + A batch returned by amgetbatch is associated with an + index page containing at least one matching item/tuple. A buffer + pin can be held onto by the table AM as an interlock against concurrent TID + recycling by VACUUM. The table AM drops this interlock + by calling amunguardbatch when it is safe to do so. + See for details on buffer pin management + during index scans. + + + + An IndexScanBatch that is returned by + amgetbatch is no longer managed by the access method. + It is up to the table AM caller to decide when it should be freed (via + tableam_util_free_batch). Note also that + amgetbatch functions must never modify the + priorbatch parameter. The core + src/backend/access/nbtree/ implementation provides a + reference example of the amgetbatch interface. + + + + The same caveats described for amgettuple apply here + too: an entry in the returned batch means only that the index contains + an entry that matches the scan keys, not that the tuple necessarily still + exists in the heap or will pass the caller's snapshot test. + + + + Index access methods using amgetbatch must set + scan->xs_recheck to indicate whether rechecking of + scan keys is required, in the same way as amgettuple + does.
However, scan->xs_recheck must be set consistently + for an entire scan rather than varying on a per-tuple basis. This is a key + difference from amgettuple, which can set + scan->xs_recheck independently for each tuple it returns. + Index access methods that require granular control over + scan->xs_recheck must use the amgettuple + interface instead of amgetbatch. + + + + Similarly, the amgetbatch interface does not currently + support index-only scans that return data in the form of a + HeapTuple pointer. Index-only scans work by + copying IndexTuple records from index pages into a + local buffer associated with each batch. xs_itupdesc + works in the same way as already described for amgettuple. + The index access method must not set the scan->xs_itup + field itself. + With amgettuple, the index AM sets + scan->xs_hitup to point to a reconstructed + HeapTuple whose lifetime extends until the next + amgettuple call — only one tuple is valid at a + time. With amgetbatch, multiple batches are held open + simultaneously and items are consumed asynchronously by the table AM, so + there is no equivalent single-tuple lifetime for per-item + HeapTuple pointers. The batch infrastructure + provides per-batch storage for IndexTuple copies, + but has no analogous mechanism for HeapTuple data + (used by some index AMs for reconstructed tuples that might not fit in + IndexTuple format). This limitation could be + addressed in a future version of PostgreSQL. + + + + The index access method must provide either amgetbatch + or amgettuple, but not both. + + + + The amgetbatch function need only be provided if the + access method supports plain index scans. If it doesn't, + the amgetbatch field in its + IndexAmRoutine struct must be set to NULL. + + + + +void +amunguardbatch (IndexScanDesc scan, + IndexScanBatch batch); + + Called by the table AM (via + tableam_util_unguard_batch) when it is safe to drop + the TID recycling interlock that the index AM holds on the batch's index + leaf page, which prevents concurrent TID recycling by + VACUUM. + Formally, an index AM may hold a different kind of interlock, or multiple + interlocks, in its per-batch opaque area, but in practice the built-in + index AM that supports amgetbatch — B-tree + — holds a single buffer pin. See for + details on buffer pin management during index scans. This function will be + called at most once for each guarded batch; it is not called when the index + AM has already unguarded the batch itself (as it does when + batchImmediateUnguard is true, which is the + common case). + + + + + The index AM may choose to retain its own buffer pins when this serves an + internal purpose (for example, maintaining a descent stack of pinned index + pages for reuse across amgetbatch calls). However, + any scheme that retains buffer pins managed by the index AM must be sure + to free the pins at an opportune point (at a minimum whenever + amendscan is called, and typically when + amrescan is called). It must also keep the number of + retained pins fixed and small. + + + + + The amunguardbatch function is required for any index + access method that provides amgetbatch. + + + + +void +amkillitemsbatch (IndexScanDesc scan, + IndexScanBatch batch); + + Called by the table AM when it has finished processing a batch that + contains dead items, to set LP_DEAD bits in the batch's + index page. The batch's index page will not be locked by the caller; the + index AM must acquire and release its own lock (and pin) on the index page. 
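A minimal sketch of what an amkillitemsbatch implementation can look like for a hypothetical index AM, assuming the per-batch opaque area records the batch's block number and that the batch exposes the saved page lsn plus a sorted deadItems[] array with an ndeadItems count. The FooBatchOpaque type, the foo_batch_opaque() accessor, and the ndeadItems field are illustrative names, not part of the patch; the LSN-comparison convention it relies on is documented just below.

    #include "postgres.h"

    #include "access/genam.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    /* Illustrative per-batch opaque area; a real AM defines its own layout */
    typedef struct FooBatchOpaque
    {
        BlockNumber currPage;   /* index page this batch was read from */
    } FooBatchOpaque;

    static void
    fookillitemsbatch(IndexScanDesc scan, IndexScanBatch batch)
    {
        Relation        rel = scan->indexRelation;
        FooBatchOpaque *opaque = foo_batch_opaque(batch);   /* assumed accessor */
        Buffer          buf;
        Page            page;
        bool            killedsomething = false;

        /* Re-read and share-lock the batch's index page; caller holds no lock */
        buf = ReadBuffer(rel, opaque->currPage);
        LockBuffer(buf, BUFFER_LOCK_SHARE);

        /*
         * If the page changed since the batch was read, its TIDs may have been
         * recycled.  LP_DEAD marking is only a hint, so simply give up.
         */
        if (BufferGetLSNAtomic(buf) != batch->lsn)
        {
            UnlockReleaseBuffer(buf);
            return;
        }

        page = BufferGetPage(buf);

        /* deadItems[] arrives sorted and deduplicated, in page offset order */
        for (int i = 0; i < batch->ndeadItems; i++)     /* assumed field name */
        {
            ItemId      iid = PageGetItemId(page, batch->deadItems[i]);

            if (!ItemIdIsDead(iid))
            {
                ItemIdMarkDead(iid);
                killedsomething = true;
            }
        }

        if (killedsomething)
            MarkBufferDirtyHint(buf, true);     /* unlogged hint-style change */

        UnlockReleaseBuffer(buf);
    }

The shape mirrors the B-tree approach described in the surrounding text: re-lock the page, compare LSNs, mark dead items, and dirty the buffer only as a hint.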
+ + + + Implementing amkillitemsbatch is optional for + amgetbatch index AMs (those that don't can leave + the field set to NULL), but doing so is recommended for + performance, as it allows future scans to skip known-dead index entries. + The core index access method that currently supports + amgetbatch (B-tree) implements + LP_DEAD marking, though third-party index access methods + are free to choose whether to implement this feature. The table AM may + call tableam_util_scanpos_killitem to mark dead items as + the scan progresses. If the batch contains any such dead items, the batch's + deadItems array will have been sorted and + deduplicated before amkillitemsbatch is called, with + item offsets appearing in ascending order (that is, in index page order, + which is also batch order) and no offset appearing more than once. Index + access methods can rely on this ordering when processing dead items: the + deadItems array can be walked in lockstep with + the index page's item pointers, since both are in ascending page offset + number order. This also means the table AM need not call + tableam_util_scanpos_killitem in any particular order. + + + + + Index access methods using amgettuple rely on the + kill_prior_tuple mechanism instead to mark dead + tuples; the src/backend/access/gist/ implementation + provides a reference example. + + + + + When implementing amkillitemsbatch, the index AM + must verify that the index page has not been modified since the batch was + originally read. The standard way to do this is to call + indexam_util_batch_unlock during + amgetbatch, which releases the index page lock and + saves the page LSN in + the batch's lsn field. Later, within + amkillitemsbatch, the index AM re-reads the page, + compares the current page LSN against + batch->lsn, and gives up on setting + LP_DEAD bits if the LSN has advanced. An advanced LSN + indicates that the page was modified — possibly by + VACUUM recycling table TIDs — so it would be + unsafe to assume that index entries still point to the same heap/table + tuples. Since LP_DEAD marking is only an optimization + hint, it is always safe to skip it. B-tree uses this approach. + + + + + This LSN comparison technique requires the index AM to use fake + (monotonically increasing) LSNs on its pages for relations where WAL is + not generated, since real LSNs are not available in that case. See the + B-tree index implementation for a reference example of this + technique. An index AM that does not implement fake LSNs can still + provide amkillitemsbatch, but should simply do + nothing when the relation does not generate WAL (i.e., when + RelationNeedsWAL() is false), since the LSN + comparison would be unreliable. + + + + + + Index AMs are not obligated to use + indexam_util_batch_unlock — they can implement + their own equivalent, and are free to use the batch + lsn field in whatever way they deem necessary + + + + + This LSN-based verification means that the table AM need not consider + whether unguarding a batch could introduce TID recycling hazards for a + subsequent amkillitemsbatch call. The hazards are the + same in both cases, but since amkillitemsbatch + independently verifies the page LSN and can always safely give up on + setting LP_DEAD bits, correctness is obvious without any + coupling between the two. + @@ -768,8 +1058,8 @@ amgetbitmap (IndexScanDesc scan, itself, and therefore callers recheck both the scan conditions and the partial index predicate (if any) for recheckable tuples. 
That might not always be true, however. - amgetbitmap and - amgettuple cannot be used in the same index scan; there + Only one of amgetbitmap, amgetbatch, + or amgettuple can be used in any given index scan; there are other restrictions too when using amgetbitmap, as explained in . @@ -781,6 +1071,39 @@ amgetbitmap (IndexScanDesc scan, struct must be set to NULL. + + Index access methods that support amgetbatch will + typically also support amgetbitmap, and almost all of + the index AM's internal scanning code is shared between the two paths. The + main difference is that during amgetbitmap only one + batch is allocated at a time (via indexam_util_batch_alloc), + unlike amgetbatch where the table AM manages several + batches in a dedicated batch ring buffer data structure. + + + + The only change needed to maintain this invariant is a single call to + indexam_util_batch_release at the point where the + scan moves between index pages, conditional on the scan's + usebatchring field being false (indicating a + bitmap index scan). The index AM releases its prior batch + just as it is about to generate the next batch — the same point + where it extracts navigation state (such as sibling-page links) from + priorbatch. No other changes to the index AM's + scanning logic are needed. This early release is specific to + amgetbitmap scans; during + amgetbatch scans the priorbatch + is strictly owned by the caller (the table AM), and the index AM must + never release it. See _bt_next for a reference + example. + + + + The released batch is cached internally and reused by the next + indexam_util_batch_alloc call, avoiding repeated + memory allocation during the bitmap scan. + + void @@ -795,32 +1118,52 @@ amendscan (IndexScanDesc scan); void -ammarkpos (IndexScanDesc scan); +amposreset (IndexScanDesc scan, + IndexScanBatch batch); - Mark current scan position. The access method need only support one - remembered scan position per scan. + Notify the index AM that the table AM is about to change the scan's + logical position in a way that may invalidate index AM state that + independently tracks the scan's progress. This callback is invoked when + the table AM is about to process a batch in a different direction than + was used when the batch was originally returned by + amgetbatch, and also when a marked scan position is + about to be restored. Some index AMs maintain internal state that + advances in lockstep with the scan under the soft assumption that the scan + direction will not change. Such state may fall behind the scan's true + position without harm (simply reading the next index page will allow the + state to catch up), but must never get ahead of it. When + the scan direction changes or a marked position is restored, the assumption + is violated, so the index AM must reset the state to a safe starting point + for the new direction. For example, B-tree uses this callback to reset its + ScalarArrayOpExpr array keys to their initial positions + for the new scan direction. - The ammarkpos function need only be provided if the access - method supports ordered scans. If it doesn't, - the ammarkpos field in its IndexAmRoutine - struct may be set to NULL. + When amposreset is called due to a cross-batch + direction change, the core system will have already flipped the batch's + dir field to reflect the new scan direction + before making the call. The index AM should use this updated direction + when resetting any state that depends on knowing which way the scan is + proceeding. 
When called to restore a marked position, the batch's + dir is not modified; it retains the direction + from when the batch was originally returned. In both cases, the batch + passed to amposreset is the batch that will next be + passed to amgetbatch as its + priorbatch. Note in particular that the + priorbatch.dir field is + guaranteed to have the same scan direction as when + amposreset was called. - -void -amrestrpos (IndexScanDesc scan); - - Restore the scan to the most recently marked position. - - - - The amrestrpos function need only be provided if the access - method supports ordered scans. If it doesn't, - the amrestrpos field in its IndexAmRoutine - struct may be set to NULL. + Index access methods that have private state which must be reset when the + scan position changes must provide an amposreset + implementation. Index AMs with no such state may set + amposreset to NULL. The + amposreset function can only be provided when the + access method supports ordered scans through the + amgetbatch interface. @@ -975,6 +1318,8 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); Access methods that always return entries in the natural ordering of their data (such as btree) should set amcanorder to true. + Both amgetbatch and amgettuple + scans support this capability. Currently, such access methods must use btree-compatible strategy numbers for their equality and ordering operators. @@ -994,34 +1339,41 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); - The amgettuple function has a direction argument, + Note that amgetbatch scans do not currently support + ordering operators. The core executor expects amgettuple + to set xs_orderbyvals for each returned tuple, + but there is currently no mechanism to associate per-item ordering values + with individual items within a batch. This would require an additional + layer of indirection that does not yet exist, but could be added in a + future version of PostgreSQL. + + + + The amgetbatch function has a direction argument, which can be either ForwardScanDirection (the normal case) or BackwardScanDirection. If the first call after amrescan specifies BackwardScanDirection, then the - set of matching index entries is to be scanned back-to-front rather than in - the normal front-to-back direction, so amgettuple must return - the last matching tuple in the index, rather than the first one as it - normally would. (This will only occur for access - methods that set amcanorder to true.) After the - first call, amgettuple must be prepared to advance the scan in + returned batch must be the batch containing the last matching item(s), + rather than the batch containing the first matching item(s). + amgetbatch must be prepared to advance the scan in either direction from the most recently returned entry. (But if amcanbackward is false, all subsequent calls will have the same direction as the first one.) - Access methods that support ordered scans must support marking a - position in a scan and later returning to the marked position. The same - position might be restored multiple times. However, only one position need - be remembered per scan; a new ammarkpos call overrides the - previously marked position. An access method that does not support ordered - scans need not provide ammarkpos and amrestrpos - functions in IndexAmRoutine; set those pointers to NULL - instead. + All amgetbatch index AMs inherently support + mark/restore of scan positions (amgettuple index AMs + do not). 
The mark/restore implementation lives in the table AM and works + with any amgetbatch implementation. Index AMs that + maintain internal state which tracks the scan's progress must provide an + amposreset callback to be notified when the scan's + logical position changes unexpectedly, such as during mark/restore; see + for details. - Both the scan position and the mark position (if any) must be maintained + The scan position (if any) must be maintained by the table AM and index AM consistently in the face of concurrent insertions or deletions in the index. It is OK if a freshly-inserted entry is not returned by a scan that would have found the entry if it had existed when the scan started, or for @@ -1040,15 +1392,17 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); which the index returns the actual data not just the TID of the heap tuple. This will only avoid I/O if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check - MVCC visibility. But that is no concern of the access method's. + MVCC visibility. But that is no concern of the index access method's. - Instead of using amgettuple, an index scan can be done with - amgetbitmap to fetch all tuples in one call. This can be - noticeably more efficient than amgettuple because it allows - avoiding lock/unlock cycles within the access method. In principle - amgetbitmap should have the same effects as repeated + Instead of using amgetbatch or + amgettuple, an index scan can be done with + amgetbitmap to fetch all tuples in one call. This can + be noticeably more efficient than with an ordered scan + because it allows efficient sequential access to table AM pages containing + matches. In principle amgetbitmap should have the + same effects as repeated amgetbatch or amgettuple calls, but we impose several restrictions to simplify matters. First of all, amgetbitmap returns all tuples at once and marking or restoring scan positions isn't @@ -1059,15 +1413,15 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); Also, there is no provision for index-only scans with amgetbitmap, since there is no way to return the contents of index tuples. - Finally, amgetbitmap - does not guarantee any locking of the returned tuples, with implications - spelled out in . + Finally, amgetbitmap does not hold any index page pins + after it returns (similarly to amgetbatch scans with + an MVCC snapshot), as described in . Note that it is permitted for an access method to implement only - amgetbitmap and not amgettuple, or vice versa, - if its internal implementation is unsuited to one API or the other. + amgetbitmap and not amgetbatch/amgettuple, + or vice versa, if its internal implementation is unsuited to one API or the other. @@ -1123,11 +1477,17 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); - An index scan must maintain a pin - on the index page holding the item last returned by - amgettuple, and ambulkdelete cannot delete - entries from pages that are pinned by other backends. The need - for this rule is explained below. + A pin must be held on any index page whose items might still need to + be followed, and ambulkdelete must acquire a + cleanup lock on each index page, which will block if any other + backend holds a pin on that page. + For amgettuple scans, the index access method + manages this pin directly. 
+ For amgetbatch scans, the index AM holds a buffer + pin on each batch's index leaf page (in its per-batch opaque area), + while the table AM controls when the interlock is dropped via + amunguardbatch. + The need for this rule is explained below. @@ -1138,39 +1498,94 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); VACUUM. This creates no serious problems if that item number is still unused when the reader reaches it, since an empty - item slot will be ignored by heap_fetch(). But what if a + item slot will simply be treated as not-visible. But what if a third backend has already re-used the item slot for something else? When using an MVCC-compliant snapshot, there is no problem because the new occupant of the slot is certain to be too new to pass the snapshot test. However, with a non-MVCC-compliant snapshot (such as SnapshotAny), it would be possible to accept and return - a row that does not in fact match the scan keys. We could defend - against this scenario by requiring the scan keys to be rechecked - against the heap row in all cases, but that is too expensive. Instead, - we use a pin on an index page as a proxy to indicate that the reader - might still be in flight from the index entry to the matching - heap entry. Making ambulkdelete block on such a pin ensures - that VACUUM cannot delete the heap entry before the reader - is done with it. This solution costs little in run time, and adds blocking - overhead only in the rare cases where there actually is a conflict. + a wholly unrelated row (one that does not necessarily satisfy the scan + keys). We can optionally use a pin on an index page as a proxy to indicate + that the reader might still be in flight from the index + entry to the matching heap entry. Making ambulkdelete + block on such a pin ensures that VACUUM cannot delete + the heap entry before the reader is done with it. This solution costs + little in run time, and adds blocking overhead only in the rare cases where + there actually is a conflict. For plain index scans that use an + MVCC-compliant snapshot, holding the pin is unnecessary because the scan + will always visit the heap page, where the snapshot itself will reject any + recycled TID's new occupant. (Index-only scans are a special case, as + discussed below.) - This solution requires that index scans be synchronous: we have - to fetch each heap tuple immediately after scanning the corresponding index - entry. This is expensive for a number of reasons. An - asynchronous scan in which we collect many TIDs from the index, - and only visit the heap tuples sometime later, requires much less index - locking overhead and can allow a more efficient heap access pattern. - Per the above analysis, we must use the synchronous approach for - non-MVCC-compliant snapshots, but an asynchronous scan is workable - for a query using an MVCC snapshot. + This solution requires that amgettuple index scans be + synchronous: the table AM must fetch each heap tuple + immediately after scanning the corresponding index entry. This is + expensive for a number of reasons. The + amgetbatch interface, by contrast, was designed to + allow scans to be asynchronous. - In an amgetbitmap index scan, the access method does not - keep an index pin on any of the returned tuples. Therefore - it is only safe to use such scans with MVCC-compliant snapshots. 
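To illustrate the division of labor just described, a table AM that is done with a guarded batch only needs to drop the interlock before freeing the batch. A sketch along these lines, where the two utility function signatures are assumptions based on the names used in this patch's documentation rather than verified APIs:

    /*
     * Sketch only: once no TID from this batch will be dereferenced again,
     * drop the VACUUM interlock (which reaches the index AM's amunguardbatch
     * callback), then free the batch.  Signatures are assumed, not verified.
     */
    static void
    example_finish_with_batch(IndexScanDesc scan, IndexScanBatch batch,
                              bool still_guarded)
    {
        if (still_guarded)
            tableam_util_unguard_batch(scan, batch);

        tableam_util_free_batch(scan, batch);
    }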
+ Whether a batch's TID recycling interlock (typically an index page buffer + pin) is dropped immediately or deferred is controlled by a generic, + scan-level policy that is determined when the scan is opened — it is + not under the control of either the index AM or the table AM. The scan's + batchImmediateUnguard flag encodes this policy. + It is set based on two criteria that are known to the core scan machinery: + whether the scan uses an MVCC-compliant snapshot, and whether it is an + index-only scan. Specifically, + batchImmediateUnguard is true when the scan uses + an MVCC snapshot and is not an index-only scan. + + + + When batchImmediateUnguard is true, the + interlock is dropped inside + indexam_util_batch_unlock (before the batch is even + returned to the table AM), because a plain index scan with an MVCC + snapshot will always visit the heap page, where the MVCC visibility check + is authoritative — even if VACUUM recycles a TID, + the new occupant cannot pass the snapshot test. + + + + When batchImmediateUnguard is false, the + interlock is retained until the table AM explicitly calls + amunguardbatch, because the scan cannot rely on that + heap page MVCC backstop. For non-MVCC scans, there is no MVCC snapshot to + reject a recycled TID's new occupant at all. For index-only scans, even + with an MVCC snapshot, the scan typically avoids visiting the heap page + altogether (using the visibility map instead), so the MVCC check that would + catch a recycled TID usually never runs. In both cases the interlock on + the index page is what prevents VACUUM from recycling + TIDs while the scan is still in flight. + + + + When batchImmediateUnguard is false, the table + AM decides when to call + amunguardbatch; the index AM decides + what to release. When + batchImmediateUnguard is true, the index AM + drops the interlock itself (inside + indexam_util_batch_unlock), and + amunguardbatch is never called. + + + + Similarly, an amgetbitmap index scan is inherently + asynchronous: all matching TIDs are collected into a bitmap before any heap + access begins. Such scans therefore require an MVCC-compliant snapshot, + and there is no need for the access method to hold index page pins. + + + + Index access methods that use amgettuple must manage + pin lifetime themselves, since there is no table AM intermediary (unlike + with amgetbatch). The index AM must hold a pin on the + current index page until the scan moves to a different page or ends. diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 80829b239..18de984d2 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -1173,12 +1173,13 @@ WITH ( MODULUS numeric_literal, REM - The access method must support amgettuple (see ); at present this means GIN - cannot be used. Although it's allowed, there is little point in using - B-tree or hash indexes with an exclusion constraint, because this - does nothing that an ordinary unique constraint doesn't do better. - So in practice the access method will always be GiST or + The access method must support either amgetbatch + or amgettuple (see ); at + present this means GIN cannot be used. Although + it's allowed, there is little point in using B-tree or hash indexes + with an exclusion constraint, because this does nothing that an + ordinary unique constraint doesn't do better. So in practice the + access method will always be GiST or SP-GiST. 
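The scan-level unguard policy described in the indexam.sgml additions above reduces to a single boolean computed once at scan setup. A minimal sketch, assuming the flag lives directly on the scan descriptor and that the caller already knows whether this is an index-only scan; only IsMVCCSnapshot() is a pre-existing PostgreSQL facility here, the rest follows the naming used in the documentation above:

    #include "postgres.h"

    #include "access/relscan.h"
    #include "utils/snapmgr.h"

    /*
     * Drop each batch's index page pin eagerly only when a plain index scan
     * with an MVCC snapshot guarantees that the heap visit itself will reject
     * any recycled TID's new occupant; index-only scans keep the interlock.
     */
    static void
    example_set_unguard_policy(IndexScanDesc scan, Snapshot snapshot,
                               bool index_only)
    {
        scan->batchImmediateUnguard = IsMVCCSnapshot(snapshot) && !index_only;
    }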
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c index 31f8d2b81..a9756f65e 100644 --- a/src/test/modules/dummy_index_am/dummy_index_am.c +++ b/src/test/modules/dummy_index_am/dummy_index_am.c @@ -334,10 +334,12 @@ dihandler(PG_FUNCTION_ARGS) .ambeginscan = dibeginscan, .amrescan = direscan, .amgettuple = NULL, + .amgetbatch = NULL, + .amunguardbatch = NULL, + .amkillitemsbatch = NULL, .amgetbitmap = NULL, .amendscan = diendscan, - .ammarkpos = NULL, - .amrestrpos = NULL, + .amposreset = NULL, .amestimateparallelscan = NULL, .aminitparallelscan = NULL, .amparallelrescan = NULL, diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index c72f6c595..e3c8ed684 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -208,6 +208,7 @@ BOOL BOOLEAN BOX BTArrayKeyInfo +BTBatchData BTBuildState BTCallbackState BTCycleId @@ -235,8 +236,6 @@ BTScanInsertData BTScanKeyPreproc BTScanOpaque BTScanOpaqueData -BTScanPosData -BTScanPosItem BTShared BTSortArrayContext BTSpool @@ -265,6 +264,9 @@ BaseBackupCmd BaseBackupTargetHandle BaseBackupTargetType BatchMVCCState +BatchMatchingItem +BatchRingBuffer +BatchRingItemPos BeginDirectModify_function BeginForeignInsert_function BeginForeignModify_function @@ -1246,6 +1248,7 @@ HbaLine HeadlineJsonState HeadlineParsedText HeadlineWordEntry +HeapBatchData HeapCheckContext HeapCheckReadStreamData HeapPageFreeze @@ -1325,6 +1328,8 @@ IndexOrderByDistance IndexPath IndexRuntimeKeyInfo IndexScan +IndexScanBatch +IndexScanBatchData IndexScanDesc IndexScanDescData IndexScanInstrumentation @@ -3540,18 +3545,17 @@ amcanreturn_function amcostestimate_function amendscan_function amestimateparallelscan_function +amgetbatch_function amgetbitmap_function amgettreeheight_function amgettuple_function aminitparallelscan_function aminsert_function aminsertcleanup_function -ammarkpos_function amoptions_function amparallelrescan_function amproperty_function amrescan_function -amrestrpos_function amtranslate_cmptype_function amtranslate_strategy_function amvacuumcleanup_function -- 2.53.0