From 53d523156a68df5f4045ccb874eb81dcb12b0d78 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Wed, 25 Mar 2026 16:58:09 -0400
Subject: [PATCH v23 5/8] heapam: Add index scan I/O prefetching.

This commit implements I/O prefetching for index scans (and index-only
scans that require heap fetches). This was made possible by the recent
addition of batching interfaces to both the table AM and index AM APIs.

The amgetbatch index AM interface provides batches of matching TIDs
(rather than one tuple at a time), each of which must be taken from
index tuples that appear together on a single index page. This allows
multiple batches to be held open simultaneously. Giving the table AM
an explicit understanding of index AM concepts/index page boundaries
allows it to consider all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM and core code. This is closely related to the
scan position added by commit FIXME, which introduced the amgetbatch
interface. A read stream callback advances the read stream as needed
to provide sufficiently many heap block numbers to maintain the read
stream's target prefetch distance.

Testing has shown that index prefetching can make index scans much
faster. Large range scans that return many tuples can be as much as
30x faster with local SSDs when buffered I/O is used, and 50x faster
or more with higher-latency storage such as network-attached block
devices, where the benefit of hiding I/O latency through prefetching
is even greater.

A new GUC (enable_indexscan_prefetch) controls the use of index
prefetching. The default setting is 'on', so all plain index scans
use prefetching where support exists. All index-only scans will also
use prefetching automatically where supported (once the scan starts to
require a significant number of heap fetches).

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetch position using TIDs that
appear on a leaf page that's ahead of the current scan position's leaf
page. This is crucial with scans of indexes where each leaf page
happens to have relatively few distinct heap blocks among its matching
TIDs (as well as with scans whose leaf pages have relatively few total
matching items).

Index scans can have as many as 64 open batches, which testing has
shown to be about the maximum number that can ever be useful. Batches
are maintained in scan order using a simple ring buffer data structure.
In rare cases where the scan exceeds this quasi-arbitrary limit of 64,
the read stream is temporarily paused using the read stream pausing
mechanism added by commit 38229cb9. Prefetching (via the read stream)
is resumed only after the scan position advances beyond its current
open batch and then frees and removes the batch from the scan's batch
ring buffer. Testing has shown that it isn't very common for scans to
hold open more than about 10 batches to get the desired I/O prefetch
distance.

The heuristic used to decide when to begin prefetching delays
initialization of the scan's read stream until the scan transitions
from its first batch to its second batch. Each batch corresponds to
matching TIDs from a single index leaf page, so prefetching only
begins once the scan reads from its second leaf page containing at
least one matching item. A selective index scan that touches only one
leaf page never reaches the second batch, so the heuristic correctly
avoids prefetching overhead.
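For illustration only (this is not code from this commit: SketchState
and sketch_next_tid are hypothetical stand-ins for the batch ring
buffer machinery), a table AM's read stream callback has roughly this
shape:

    #include "postgres.h"
    #include "storage/itemptr.h"
    #include "storage/read_stream.h"

    typedef struct SketchState
    {
        /* last block handed to the stream; starts out InvalidBlockNumber */
        BlockNumber last_block;
        /* ... plus a prefetch position that runs ahead of the scan ... */
    } SketchState;

    /* hypothetical helper: returns the next matching TID ahead of the scan */
    static bool sketch_next_tid(SketchState *state, ItemPointer tid);

    static BlockNumber
    sketch_prefetch_next_block(ReadStream *stream,
                               void *callback_private_data,
                               void *per_buffer_data)
    {
        SketchState    *state = (SketchState *) callback_private_data;
        ItemPointerData tid;

        while (sketch_next_tid(state, &tid))
        {
            BlockNumber blkno = ItemPointerGetBlockNumber(&tid);

            /* never hand the stream the same block twice in succession */
            if (blkno == state->last_block)
                continue;

            state->last_block = blkno;
            return blkno;
        }

        /* no more matching TIDs in this scan direction */
        return InvalidBlockNumber;
    }

The real callback (heapam_index_prefetch_next_block) must in addition
restart from the scan position after a direction change, pause the
stream when all 64 batch ring buffer slots are in use, and skip heap
blocks that an index-only scan already knows to be all-visible.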
The picture for more complicated cases of this startup heuristic is
mixed. The same principle applies to nestloop inner index scans with
very tight limits (e.g., a correlated subquery with LIMIT 1), where
each rescan reads from only a single leaf page: the heuristic avoids
the cost of repeatedly resetting a read stream across many rescans.
On the other hand, some selective scans that access randomly-ordered
heap pages would genuinely benefit from prefetching, but never get as
far as reaching their second batch -- a missed opportunity, where the
heuristic is overly cautious. Conversely, the heuristic is not
cautious enough with slightly less selective nestloop inner scans
(e.g., LIMIT 3 within a LATERAL join). These rescans may span two
leaf pages, just barely crossing the second-batch threshold, while
still only needing to fetch two or three heap pages -- not enough for
prefetching to realistically help or pay for itself on any individual
rescan. Such queries are regressed by the work from this commit
(relative to PostgreSQL 18), though only when the scan has to read
heap pages from storage.

Adding a smarter heuristic that addresses both shortcomings remains as
work for a future release. Passing down an ExecSetTupleBound style
hint and using that hint to influence how the read stream ramps up its
distance seems like a promising approach.

Author: Tomas Vondra
Author: Peter Geoghegan
Reviewed-By: Andres Freund
Reviewed-By: Thomas Munro
Discussion: https://postgr.es/m/cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com
---
 src/include/access/heapam.h                   |  13 +
 src/include/access/indexbatch.h               |   9 +-
 src/include/access/relscan.h                  |  38 ++
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_indexscan.c    | 459 +++++++++++++++++-
 src/backend/access/index/indexbatch.c         |  36 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     |  84 +++-
 doc/src/sgml/tableam.sgml                     |   8 +
 src/test/regress/expected/sysviews.out        |   3 +-
 13 files changed, 660 insertions(+), 16 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 9281d8645..daa009720 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -134,6 +134,19 @@ typedef struct IndexFetchHeapData /* Plain index scan xs_lastinblock optimization */ bool xs_lastinblock; /* last TID on this block in current batch? */ + + /* + * Read stream state for prefetching (only used during amgetbatch scans). + * + * The read stream moves ahead of the scan's current position using its + * own prefetching position (per conventions supported by indexbatch.c). + * The read stream is allocated early in the scan, and reset on rescan + * (and when the scan direction changes). + */ + bool xs_paused; /* paused until next batch is read?
*/ + ScanDirection xs_read_stream_dir; /* index scan direction */ + BlockNumber xs_prefetch_block; /* last block returned to xs_read_stream */ + ReadStream *xs_read_stream; /* prefetching read stream */ } IndexFetchHeapData; /* Result codes for HeapTupleSatisfiesVacuum */ diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h index 9f87fb96a..c7b3d4750 100644 --- a/src/include/access/indexbatch.h +++ b/src/include/access/indexbatch.h @@ -44,6 +44,7 @@ tableam_util_batchscan_init(IndexScanDesc scan) Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); scan->batchringbuf.scanPos.valid = false; + scan->batchringbuf.prefetchPos.valid = false; scan->batchringbuf.markPos.valid = false; scan->batchringbuf.markBatch = NULL; @@ -65,16 +66,16 @@ extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch) /* * Fetch the next batch of matching items for the scan (or the first). * - * Called when caller's current batch (passed to us as priorBatch) has no more - * matching items in the given scan direction. Caller passes a NULL - * priorBatch on the first call here for the scan. + * Called when caller's current scanBatch/prefetchBatch (passed to us as + * priorBatch) has no more matching items in the given scan direction. Caller + * passes a NULL priorBatch on the first call here for the scan. * * Returns the next batch to be processed by caller in the given scan * direction, or NULL when there are no more matches in that direction. * * This is where batches are appended to the scan's ring buffer. We don't * free any batches here, though; that is left up to the caller. The caller - * is also responsible for advancing their position. + * is also responsible for advancing their scanPos/prefetchPos position. */ static pg_attribute_always_inline IndexScanBatch tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction, diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 3421c83c1..d8e6685b6 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -215,6 +215,10 @@ typedef struct IndexScanBatchData * This allows table AMs to avoid redundant amgetbatch calls with the same * priorbatch -- the index AM might need to read additional index pages to * determine there are no more matching items beyond caller's priorbatch. + * In particular, during prefetching the read stream callback discovers + * the end-of-scan via prefetchBatch. tableam_util_fetch_next_batch() + * checks these flags so that the scan side doesn't repeat the same + * amgetbatch call when it later reaches that batch as scanBatch. */ bool knownEndBackward; bool knownEndForward; @@ -266,11 +270,14 @@ typedef struct IndexScanBatchData *IndexScanBatch; * current read position by _multiple_ batches/index pages. The further out * the table AM reads ahead like this, the further it can see into the future. * That way the table AM is able to reorder work as aggressively as desired. + * Index scans sometimes need to readahead by several dozen batches in order + * to maintain an optimal I/O prefetch distance (for reading table blocks). 
*/ typedef struct BatchRingBuffer { /* current positions in IndexScanDescData.batchbuf[] for scan */ BatchRingItemPos scanPos; /* scan's read position */ + BatchRingItemPos prefetchPos; /* prefetching position */ BatchRingItemPos markPos; /* mark/restore position */ /* markPos's batch (not in ring buffer when markBatch != scanBatch) */ @@ -508,6 +515,37 @@ index_scan_batch_base(IndexScanDescData *scan, IndexScanBatch batch) return (char *) batch - scan->batch_table_offset; } +/* + * Compare two batch ring positions in the given scan direction. + * + * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is + * ahead of pos2. + */ +static inline int +index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2, + ScanDirection direction) +{ + int8 batchdiff; + + Assert(pos1->valid && pos2->valid); + + batchdiff = (int8) (pos1->batch - pos2->batch); + if (batchdiff != 0) + { + /* Resolve comparison using differing batch offsets */ + return batchdiff; + } + + /* + * Resolve comparison using items[]-wise indexes from caller's positions, + * since both positions point to the same ring buffer batch + */ + if (ScanDirectionIsForward(direction)) + return pos1->item - pos2->item; + else + return pos2->item - pos1->item; +} + /* * Advance position to its next item in the batch. * diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h index f2fd5d315..419300a6b 100644 --- a/src/include/optimizer/cost.h +++ b/src/include/optimizer/cost.h @@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather; extern PGDLLIMPORT bool enable_seqscan; extern PGDLLIMPORT bool enable_indexscan; extern PGDLLIMPORT bool enable_indexonlyscan; +extern PGDLLIMPORT bool enable_indexscan_prefetch; extern PGDLLIMPORT bool enable_bitmapscan; extern PGDLLIMPORT bool enable_tidscan; extern PGDLLIMPORT bool enable_sort; diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c index b6e57f8f8..fd70ad5bc 100644 --- a/src/backend/access/heap/heapam_indexscan.c +++ b/src/backend/access/heap/heapam_indexscan.c @@ -19,6 +19,7 @@ #include "access/indexbatch.h" #include "access/relscan.h" #include "access/visibilitymap.h" +#include "optimizer/cost.h" #include "storage/predicate.h" #include "utils/pgstat_internal.h" @@ -40,6 +41,9 @@ typedef struct HeapBatchData #define HEAP_BATCH_VIS_CHECKED 0x01 /* checked item in VM? */ #define HEAP_BATCH_VIS_ALL_VISIBLE 0x02 /* block is known all-visible? 
*/ +static pg_noinline void heapam_index_dirchange_reset(IndexFetchHeapData *hscan, + ScanDirection direction, + BatchRingBuffer *batchringbuf); static inline HeapBatchData *heapam_index_batch_data(IndexScanDesc scan, IndexScanBatch batch); static inline ItemPointer heapam_index_return_scanpos_tid(IndexScanDesc scan, @@ -53,6 +57,9 @@ static void heapam_index_batch_pos_visibility(IndexScanDesc scan, IndexScanBatch batch, HeapBatchData *hbatch, BatchRingItemPos *pos); +static BlockNumber heapam_index_prefetch_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data); static bool heapam_index_plain_batch_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot); @@ -129,6 +136,10 @@ heapam_index_fetch_begin(IndexScanDesc scan, uint32 flags) /* xs_lastinblock optimization state */ Assert(!hscan->xs_lastinblock); + /* Read stream state (other fields initialized by callback) */ + Assert(hscan->xs_read_stream_dir == NoMovementScanDirection); + Assert(hscan->xs_read_stream == NULL); + /* Resolve which xs_getnext_slot implementation to use for this scan */ if (scan->indexRelation->rd_indam->amgetbatch != NULL) { @@ -198,6 +209,15 @@ heapam_index_fetch_reset(IndexScanDesc scan) /* Rescans should avoid an excessive number of VM lookups */ hscan->xs_vm_items = 1; + /* Defensively do an unconditional read stream direction reset */ + hscan->xs_read_stream_dir = NoMovementScanDirection; + + if (hscan->xs_read_stream) + { + hscan->xs_paused = false; + read_stream_reset(hscan->xs_read_stream); + } + /* Reset batch ring buffer state */ if (scan->usebatchring) tableam_util_batchscan_reset(scan, false); @@ -222,6 +242,9 @@ heapam_index_fetch_end(IndexScanDesc scan) if (BufferIsValid(hscan->xs_vmbuffer)) ReleaseBuffer(hscan->xs_vmbuffer); + if (hscan->xs_read_stream) + read_stream_end(hscan->xs_read_stream); + /* Free all batch related resources */ if (scan->usebatchring) tableam_util_batchscan_end(scan); @@ -246,8 +269,16 @@ heapam_index_fetch_markpos(IndexScanDesc scan) void heapam_index_fetch_restrpos(IndexScanDesc scan) { + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; + Assert(scan->usebatchring); + if (hscan->xs_read_stream) + { + hscan->xs_paused = false; + read_stream_reset(hscan->xs_read_stream); + } + tableam_util_batchscan_restore_pos(scan); } @@ -449,7 +480,14 @@ heapam_index_fetch_tuple(Relation rel, if (BufferIsValid(hscan->xs_cbuf)) ReleaseBuffer(hscan->xs_cbuf); - hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk); + /* + * When using a read stream, the stream will already know which block + * number comes next (though an assertion will verify a match below) + */ + if (hscan->xs_read_stream) + hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL); + else + hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk); /* * Prune page when it is pinned for the first time @@ -572,6 +610,69 @@ heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan, return found; } +/* + * Handle a change in index scan direction (at the tuple granularity). + * + * Resets the read stream, since we can't rely on scanPos continuing to agree + * with the blocks that read stream already consumed using prefetchPos. + * + * Note: iff the scan _continues_ in this new direction, and actually steps + * off scanBatch to an earlier index page, tableam_util_fetch_next_batch will + * deal with it. But that might never happen; the scan might yet change + * direction again (or just end before returning more items). 
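+ * (For example, a cursor might step backward over a single tuple, only to resume fetching forward again, never crossing a batch boundary.)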
+ */ +static pg_noinline void +heapam_index_dirchange_reset(IndexFetchHeapData *hscan, + ScanDirection direction, + BatchRingBuffer *batchringbuf) +{ + /* Reset read stream state */ + batchringbuf->prefetchPos.valid = false; + hscan->xs_paused = false; + hscan->xs_read_stream_dir = direction; + + /* Reset read stream itself */ + if (hscan->xs_read_stream) + read_stream_reset(hscan->xs_read_stream); +} + +/* + * Decide whether to start a read stream for heap block prefetching during an + * index scan. + * + * Called each time a new batch is obtained from the index AM, barring the + * first time that happens. We delay initializing the stream until reading + * from the scan's second batch. This heuristic avoids wasting cycles on + * starting a read stream for very selective index scans. + * + * We avoid prefetching during scans where we're unable to unguard (unpin) + * each batch's buffers right away (non-MVCC snapshot scans). We are not + * prepared to sensibly limit the total number of buffer pins held (read + * stream handles all pin resource management for us, and knows nothing + * about pins held on index pages/within batches). + * + * We also delay creating a read stream during index-only scans that haven't + * done any heap fetches yet. We don't want to waste any cycles on + * allocating a read stream until we have a demonstrated need to perform + * heap fetches. + */ +static pg_attribute_always_inline void +heapam_index_consider_prefetching(IndexScanDesc scan, + IndexFetchHeapData *hscan) +{ + Assert(!hscan->xs_read_stream); + Assert(!scan->batchringbuf.prefetchPos.valid); + + if (scan->MVCCScan && enable_indexscan_prefetch && + hscan->xs_blk != InvalidBlockNumber) /* for index-only scans */ + hscan->xs_read_stream = + read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, + scan->heapRelation, MAIN_FORKNUM, + heapam_index_prefetch_next_block, + scan, 0); + /* else don't start a read stream for prefetching (not yet, at least) */ +} + /* * Get next TID from batch ring buffer, moving in the given scan direction. * Also sets *all_visible for item when caller passes a non-NULL arg. @@ -589,6 +690,12 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan Assert(scanPos->valid || index_scan_batch_count(scan) == 0); Assert(all_visible == NULL || scan->xs_want_itup); + /* Handle resetting the read stream when scan direction changes */ + if (hscan->xs_read_stream_dir == NoMovementScanDirection) + hscan->xs_read_stream_dir = direction; /* first call */ + else if (unlikely(hscan->xs_read_stream_dir != direction)) + heapam_index_dirchange_reset(hscan, direction, batchringbuf); + /* * Check if there's an existing loaded scanBatch for us to return the next * matching item's TID/index tuple from @@ -598,7 +705,7 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan { /* * scanPos is valid, so scanBatch must already be loaded in batch ring - * buffer. We rely on that here. + * buffer. We rely on that here (can't do this with prefetchBatch). */ pg_assume(batchringbuf->headBatch == scanPos->batch); @@ -629,6 +736,17 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan return NULL; } + if (hadExistingScanBatch && !hscan->xs_read_stream) + { + Assert(!scan->batchringbuf.prefetchPos.valid); + + /* + * Not using a read stream to do index prefetching. Decide whether to + * start one now. + */ + heapam_index_consider_prefetching(scan, hscan); + } + /* * Advanced scanBatch. 
Now position scanPos to the start of new * scanBatch. @@ -644,6 +762,7 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan { IndexScanBatch headBatch = index_scan_batch(scan, batchringbuf->headBatch); + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; Assert(headBatch != scanBatch); Assert(batchringbuf->headBatch != scanPos->batch); @@ -651,12 +770,47 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexFetchHeapData *hscan /* free obsolescent head batch (unless it is scan's markBatch) */ tableam_util_free_batch(scan, headBatch); + /* + * If we're about to release the batch that prefetchPos currently + * points to, just invalidate prefetchPos. See the comments about + * prefetchPos/scanPos within heapam_index_prefetch_next_block for an + * explanation. + * + * This handling is approximately the opposite of resuming a paused + * read stream: this helps the scan deal with prefetchPos falling + * behind scanPos, whereas pausing is used when scanPos has fallen + * behind (very far behind) prefetchPos. + */ + if (prefetchPos->valid && + prefetchPos->batch == batchringbuf->headBatch) + prefetchPos->valid = false; + /* Remove the batch from the ring buffer (even if it's markBatch) */ batchringbuf->headBatch++; + + if (unlikely(hscan->xs_paused)) + { + /* + * heapam_index_prefetch_next_block paused the scan's read stream + * due to our running out of free batch slots. Now that we've + * freed up one such slot, we can resume the read stream (since + * there's now space for heapam_index_prefetch_next_block to store + * one more batch). + */ + Assert(prefetchPos->batch != scanPos->batch); + Assert(prefetchPos->valid && + index_scan_batch_loaded(scan, prefetchPos->batch)); + Assert(index_scan_pos_cmp(prefetchPos, scanPos, direction) > 0); + Assert(!index_scan_batch_full(scan)); + + read_stream_resume(hscan->xs_read_stream); + hscan->xs_paused = false; + } } /* In practice scanBatch will always be the ring buffer's headBatch */ Assert(batchringbuf->headBatch == scanPos->batch); + Assert(!hscan->xs_paused); return heapam_index_return_scanpos_tid(scan, hscan, direction, scanBatch, scanPos, all_visible); @@ -774,6 +928,13 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, * (important for inner index scans of anti-joins and semi-joins), and the * need to unguard batches promptly. * + * In no event will the scan be allowed to guard more than one batch at a + * time. The primary reason for this restriction is to avoid unintended + * interactions with the read stream, which has its own strategy for keeping + * the number of pins held by the backend under control. (Unguarding via + * the amunguardbatch callback often means releasing a buffer pin on an + * index page, which counts against the same shared pin limit.) + * * Once we've resolved visibility for all items in a batch, we can safely * unguard it by calling amunguardbatch. This is safe with respect to * concurrent VACUUM because the batch's guard (typically a buffer pin on the @@ -911,6 +1072,300 @@ heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction, hscan->xs_vm_items = scan->maxitemsbatch; } +/* + * Return the next block to the read stream when performing index prefetching. + * + * The initial batch is always loaded by heapam_index_getnext_scanbatch_pos. + * We don't get called until the first read_stream_next_buffer call, when a + * heap block is requested from the scan's stream for the first time. 
+ * + * The position of the read_stream is stored in prefetchPos. It is typical + * for prefetchPos to consistently stay ahead of the scanPos position that's + * used to track the next TID heapam_index_getnext_scanbatch_pos will return + * to the scan (after the first time we get called). However, that isn't a + * strict precondition (though as explained below we implement a scheme + * essentially equivalent to making it a strict precondition). There is a + * true strict postcondition, though: when we return we'll always leave + * scanPos <= prefetchPos. + */ +static BlockNumber +heapam_index_prefetch_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &batchringbuf->scanPos; + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; + ScanDirection xs_read_stream_dir = hscan->xs_read_stream_dir; + IndexScanBatch prefetchBatch; + bool fromScanPos = false; + + /* + * scanPos must always be valid when prefetching takes place. There has + * to be at least one batch, loaded as our scanBatch. The scan direction + * must be established, too. + */ + Assert(index_scan_batch_count(scan) > 0); + Assert(scanPos->valid && index_scan_batch_loaded(scan, scanPos->batch)); + Assert(scan->MVCCScan); + Assert(!hscan->xs_paused); + Assert(xs_read_stream_dir != NoMovementScanDirection); + + /* + * Handle initialization on the first call here, when prefetchPos isn't + * yet valid (also handles the prefetchPos < scanPos edge case). + * + * If prefetchPos has not been initialized yet, that typically indicates + * that this is the first call here for the entire scan. We initialize + * prefetchPos using the current scanPos, since the current scanBatch + * item's TID should have its block number returned by the read stream + * first. When this happens, it's likely that prefetchPos will get ahead + * of scanPos very soon, after the _next_ call here returns. + * + * There's also an edge case that we handle using exactly the same steps. + * It's possible for prefetchPos to "fall behind" scanPos, at least in a + * trivial sense: if many adjacent matching items contain TIDs that all + * point to the same heap block, scanPos can actually overtake prefetchPos + * (prefetchPos can't advance until we're actually called). A similar + * issue arises during index-only scans that require only a few heap + * fetches: we'll tend to be called far less often than we'd be called + * during an equivalent plain index scan due to all-visible items. An + * all-visible item will advance scanPos, but can't trigger a call to here + * (just like an item that points to the same heap block that the previous + * item also pointed to). + * + * This scheme produces exactly the same block prefetch requests as a + * scheme that requires heapam_index_getnext_scanbatch_pos to actively + * ensure that "prefetchPos < scanPos" can never happen. That isn't a + * strict precondition for this function because making it explicit would + * impose a performance penalty on heapam_index_getnext_scanbatch_pos. + * + * Note: when heapam_index_getnext_scanbatch_pos frees a batch that + * prefetchPos points to, it'll at least invalidate prefetchPos for us. 
+ * This removes any danger of prefetchPos.batch falling so far behind + * scanPos.batch that it wraps around (and appears to be ahead of scanPos + * instead of behind it). In other words, in a certain sense we actually + * _can_ trust heapam_index_getnext_scanbatch_pos to not let prefetchPos + * fall behind scanPos: it can't happen at the batch granularity (only at + * the item/tuple granularity, which we can always cope with here). + */ + if (!prefetchPos->valid || + index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 0) + { + hscan->xs_prefetch_block = InvalidBlockNumber; + *prefetchPos = *scanPos; + fromScanPos = true; + + /* + * We must avoid keeping any batch guarded for more than an instant, + * to avoid undesirable interactions with the scan's read stream. See + * comment and assertion at the top of the loop below. + */ + if (scan->xs_want_itup) + { + /* + * Make heapam_index_batch_pos_visibility release resources + * eagerly + */ + hscan->xs_vm_items = scan->maxitemsbatch; + + /* Make sure that this new prefetchBatch is unguarded */ + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + if (prefetchBatch->isGuarded) + { + HeapBatchData *hbatch = heapam_index_batch_data(scan, + prefetchBatch); + + /* Set visibility info not set through scanBatch */ + heapam_index_batch_pos_visibility(scan, xs_read_stream_dir, + prefetchBatch, hbatch, + prefetchPos); + } + } + } + + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + + /* + * If prefetchPos wasn't just initialized using scanPos, we're directly + * picking up prefetching where the last call here left off. Assert that + * xs_prefetch_block matches the last item we returned as expected. + * + * Note: we don't actually need a xs_prefetch_block field at all; we could + * just take the last block we returned from prefetchPos directly instead. + * But maintaining xs_prefetch_block explicitly is slightly more robust. + * It gives us a way to make sure that the last call here left prefetchPos + * in a consistent state (e.g., when the read stream had to be paused). + */ +#ifdef USE_ASSERT_CHECKING + if (!fromScanPos) + { + BatchMatchingItem *lastitem = &prefetchBatch->items[prefetchPos->item]; + BlockNumber last_block = ItemPointerGetBlockNumber(&lastitem->tableTid); + + Assert(last_block == hscan->xs_prefetch_block); + } +#endif + + for (;;) + { + BatchMatchingItem *item; + BlockNumber prefetch_block; + + /* + * We never call amgetbatch without immediately unguarding the batch, + * either within the index AM or here (when we eagerly load all of the + * batch's visibility information during an index-only scan). The + * index AM won't hold onto TID interlock buffer pins, keeping the + * absolute number of pins held to a minimum. + * + * This is defensive. The read stream tries to be careful about not + * pinning too many buffers, and that's harder to do reliably if there + * are variable numbers of pins taken without such care. + */ + Assert(!prefetchBatch->isGuarded); + if (fromScanPos) + { + /* + * Don't increment item when prefetchPos was just initialized + * using scanPos. We'll return the scanPos item's heap block + * directly on the first call here. In other words, we'll return + * the heap block from TID passed to heapam_index_fetch_tuple at + * the point where it called read_stream_next_buffer for the first + * time during the scan. (As explained above, we also end up here + * during the first call to read_stream_next_buffer following + * prefetchPos falling behind scanPos/being invalidated for us.) 
+ */ + fromScanPos = false; + } + else if (!index_scan_pos_advance(xs_read_stream_dir, + prefetchBatch, prefetchPos)) + { + /* + * Ran out of items from prefetchBatch. Try to advance to the + * scan's next batch. + */ + if (unlikely(index_scan_batch_full(scan))) + { + /* + * Can't advance prefetchBatch because all available + * batchringbuf batch slots are currently in use. + * + * Deal with this by momentarily pausing the read stream. + * heapam_index_getnext_scanbatch_pos will resume the read + * stream later, though only after scanPos has consumed all + * remaining items from scanBatch (at which point the current + * head batch will be freed, making a slot available for reuse + * here by us). + * + * In practice we hardly ever need to do this. It would be + * possible to avoid the need to pause the read stream by + * dynamically allocating slots, but that would add complexity + * for no real benefit. It also seems like a good idea to + * impose some hard limit on the number of batches that + * prefetchPos can get ahead of scanPos by (especially in the + * case of index-only scans, where we often won't have any + * heap block to return from most of the scan's batches). + */ + hscan->xs_paused = true; + + /* + * Before returning, advance prefetchPos in the opposite + * direction to the one used by the scan. This undoes the + * effects of the most recent advance. We're not going to + * return any block, so it seems like a good idea to leave + * prefetchPos in a state consistent with that. + */ + if (ScanDirectionIsForward(xs_read_stream_dir)) + { + Assert(prefetchPos->item == prefetchBatch->lastItem + 1); + prefetchPos->item = prefetchBatch->lastItem; + } + else + { + Assert(prefetchPos->item == prefetchBatch->firstItem - 1); + prefetchPos->item = prefetchBatch->firstItem; + } + + return read_stream_pause(stream); + } + + prefetchBatch = tableam_util_fetch_next_batch(scan, + xs_read_stream_dir, + prefetchBatch, + prefetchPos); + if (!prefetchBatch) + { + /* + * No more batches in this direction, so all the batches that + * the scan will ever require have now been returned + */ + return InvalidBlockNumber; + } + + /* Position prefetchPos to the start of new prefetchBatch */ + index_scan_pos_nextbatch(xs_read_stream_dir, + prefetchBatch, prefetchPos); + + if (scan->xs_want_itup) + { + HeapBatchData *hbatch = heapam_index_batch_data(scan, + prefetchBatch); + + /* make sure we have visibility info for the entire batch */ + Assert(hscan->xs_vm_items == scan->maxitemsbatch); + heapam_index_batch_pos_visibility(scan, xs_read_stream_dir, + prefetchBatch, hbatch, + prefetchPos); + } + } + + /* + * prefetchPos now points to the next item whose TID's heap block + * number might need to be prefetched + */ + Assert(index_scan_batch(scan, prefetchPos->batch) == prefetchBatch); + Assert(prefetchPos->item >= prefetchBatch->firstItem && + prefetchPos->item <= prefetchBatch->lastItem); + /* scanPos is always <= prefetchPos when we return */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, xs_read_stream_dir) <= 0); + + if (scan->xs_want_itup) + { + HeapBatchData *hbatch = heapam_index_batch_data(scan, + prefetchBatch); + + Assert(hbatch->visInfo[prefetchPos->item] & HEAP_BATCH_VIS_CHECKED); + if (hbatch->visInfo[prefetchPos->item] & HEAP_BATCH_VIS_ALL_VISIBLE) + { + /* item is known to be all-visible -- don't prefetch */ + continue; + } + } + + item = &prefetchBatch->items[prefetchPos->item]; + prefetch_block = ItemPointerGetBlockNumber(&item->tableTid); + + if (prefetch_block == hscan->xs_prefetch_block) 
+ { + /* + * prefetch_block matches the last prefetchPos item's TID's heap + * block number; we must not return the same prefetch_block twice + * (twice in succession) + */ + continue; + } + + /* We have a new heap block number to return to read stream */ + hscan->xs_prefetch_block = prefetch_block; + return prefetch_block; + } + + return InvalidBlockNumber; +} + /* * Common implementation for all four heapam_index_*_getnext_slot variants. * diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c index b2b3afe80..c6c50306d 100644 --- a/src/backend/access/index/indexbatch.c +++ b/src/backend/access/index/indexbatch.c @@ -5,15 +5,22 @@ * * This module provides the core infrastructure for batch-based index scans, * which allow index AMs to return multiple matching TIDs per page in a single - * call. The batch ring buffer is owned by the table AM. + * call. The batch ring buffer is owned by the table AM, typically maintained + * alongside a read stream used for prefetching table blocks. * - * The ring buffer loads batches in index key space/index scan order. + * The ring buffer loads batches in index key space/index scan order. This + * allows the table AM to maintain an adequate prefetch distance: its read + * stream callback is thereby able to request table blocks referenced by index + * pages that are well ahead of the current scan position's index page. * * Most functions here are table AM utilities (tableam_util_*), called by * table AMs during amgetbatch index scans. These manage the batch ring * buffer's lifecycle and positional state, and help with certain aspects of * resource management. The table AM uses scanPos (and its scanBatch batch) - * to return items from batches returned by amgetbatch. + * to return items from batches returned by amgetbatch. Table AMs that + * support index I/O prefetching use prefetchPos (and its prefetchBatch batch) + * by implementing a read stream callback that consumes items well ahead of + * scanPos. * * There are also some index AM utilities (indexam_util_*), called by index * AMs that implement the amgetbatch interface, to help manage resources like @@ -55,6 +62,7 @@ tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan) bool markBatchFreed = false; batchringbuf->scanPos.valid = false; + batchringbuf->prefetchPos.valid = false; batchringbuf->markPos.valid = false; /* Ensure batch_free won't skip the old markBatch in the loop below */ @@ -170,7 +178,12 @@ tableam_util_batchscan_mark_pos(IndexScanDesc scan) * the current scanBatch when needed. * * We just discard all batches (other than markBatch/restored scanBatch), - * except when markBatch is already the scan's current scanBatch. + * except when markBatch is already the scan's current scanBatch. We always + * invalidate prefetchPos. The read stream and related prefetching state are + * reset by the caller (which calls this function as it resets that state). + * This approach keeps things simple for table AMs: most code that deals with + * batches is thereby able to assume that the common case where scan direction + * never changes is the only case. */ void tableam_util_batchscan_restore_pos(IndexScanDesc scan) @@ -185,6 +198,14 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan) Assert(scan->xs_heapfetch); Assert(markPos->valid); + /* + * Restoring a mark always requires stopping prefetching. This is similar + * to the handling table AMs implement to deal with a tuple-level change + * in the scan's direction. 
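Either way, prefetchPos no longer corresponds to blocks that the read stream has already returned, so prefetching must restart from scanPos.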
The read stream must have already been reset + * by the caller (via table_index_fetch_reset). + */ + batchringbuf->prefetchPos.valid = false; + if (scanBatch == markBatch) { /* markBatch is already scanBatch; needn't change batchringbuf */ @@ -249,6 +270,13 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan) * to determine which batch comes next in the new scan direction. This * approach isn't particularly efficient, but it works well enough for what * ought to be a relatively rare occurrence. + * + * Caller must have reset the scan's read stream before calling here. That + * needs to happen as soon as the scan requests a tuple in whatever scan + * direction is opposite-to-current. We only deal with the case where the + * scan backs up by enough items to cross a batch boundary (when the scan + * resumes scanning in its original direction/ends before crossing a boundary, + * there isn't any need to call here). */ void tableam_util_scanbatch_dirchange(IndexScanDesc scan) diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 1c575e56f..6fcb815f7 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -146,6 +146,7 @@ int max_parallel_workers_per_gather = 2; bool enable_seqscan = true; bool enable_indexscan = true; bool enable_indexonlyscan = true; +bool enable_indexscan_prefetch = true; bool enable_bitmapscan = true; bool enable_tidscan = true; bool enable_sort = true; diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat index a315c4ab8..48eb1b6b9 100644 --- a/src/backend/utils/misc/guc_parameters.dat +++ b/src/backend/utils/misc/guc_parameters.dat @@ -932,6 +932,13 @@ boot_val => 'true', }, +{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', + short_desc => 'Enables prefetching for index scans and index-only scans.', + flags => 'GUC_EXPLAIN', + variable => 'enable_indexscan_prefetch', + boot_val => 'true', +}, + { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', short_desc => 'Enables the planner\'s use of materialization.', flags => 'GUC_EXPLAIN', diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 6d0337853..577555518 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -420,6 +420,7 @@ #enable_incremental_sort = on #enable_indexscan = on #enable_indexonlyscan = on +#enable_indexscan_prefetch = on #enable_material = on #enable_memoize = on #enable_mergejoin = on diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d3fea738c..17daeaed4 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5712,6 +5712,22 @@ ANY num_sync ( + enable_indexscan_prefetch (boolean) + + enable_indexscan_prefetch configuration parameter + + + + + Enables or disables prefetching for index scan and index-only scan + plan types. Prefetching can improve performance by reading table AM + pages ahead of when they are needed during index scans. The default + is on. 
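For example, SET enable_indexscan_prefetch = off disables prefetching for the current session.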
+ + + + enable_material (boolean) diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 4484a2165..d990c1a53 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -810,9 +810,12 @@ amgetbatch (IndexScanDesc scan, The amgetbatch interface is an alternative to amgettuple that returns matching index entries in batches - rather than one at a time. By returning all matching index entries from a - single index page together, the table AM gains visibility into which table - blocks will be needed in the near future. + rather than one at a time. This enables the table access method to + optimize table block access patterns and perform I/O prefetching. + By returning all matching index entries from a single index page together, + the table AM can read ahead through the index and identify which table + blocks will be needed, allowing prefetching of table AM pages during + ordered index scans. @@ -937,7 +940,9 @@ amunguardbatch (IndexScanDesc scan, to free the pins at an opportune point (at a minimum whenever amendscan is called, and typically when amrescan is called). It must also keep the number of - retained pins fixed and small. + retained pins fixed and small, to avoid exhausting the backend's buffer + pin limit (which is shared with the table AM's read stream for index scan + prefetching). @@ -1426,6 +1431,60 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); or vice versa, if its internal implementation is unsuited to one API or the other. + + Table AM Considerations for Batch Scanning + + + This section is primarily relevant to table access + method authors. + + + + When an index scan uses the amgetbatch interface, the + table AM has sole control over the IndexScanDesc's + batchringbuf, including creating, resetting, + and ending the batch ring buffer within the appropriate table AM + callbacks, and managing positional state and TID recycling interlocking + (that is, determining when to unguard each batch, which will typically + release an index page buffer pin associated with the batch). Index access + methods should not access or manipulate these fields. See the + src/backend/access/heap/heapam_indexscan.c + implementation for a reference example. + + + + The scanPos field within + batchringbuf tracks which batch and item within + that batch will be returned next to the executor. The table AM must advance + scanPos as tuples are returned by + table_index_getnext_slot, and must also modify this + field when restoring a saved mark. + + + + The prefetchPos field tracks the position used + for I/O prefetching. It is initialized from + scanPos within the table AM's read stream + callback, which thereafter advances it independently, allowing + the table AM to prefetch table blocks pointed to by items that are well + ahead of the current scan position. As the read stream ramps up, + prefetchPos can get far ahead of + scanPos, spanning multiple index pages if necessary to + maintain an optimal I/O prefetch distance for table block reads. A major + goal of the amgetbatch interface is to allow the + table AM to prefetch without being limited to items from the current + scanPos batch's index leaf page. + + + + For details on the TID recycling interlock during batch scans, including + the batchImmediateUnguard policy and the + amunguardbatch callback, see + . + + + + @@ -1526,7 +1585,22 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); immediately after scanning the corresponding index entry. 
This is expensive for a number of reasons. The amgetbatch interface, by contrast, was designed to - allow scans to be asynchronous. + allow scans to be asynchronous: by collecting batches of + TIDs from multiple index pages, the table AM can prefetch the corresponding + table blocks well ahead of the current scan position (using asynchronous + I/O when available), allowing a more efficient heap access pattern. Not + all scans end up being asynchronous in practice, but the interface is + designed to allow it. Per the above analysis, we must use the synchronous + approach for non-MVCC-compliant snapshots, but an asynchronous scan is + workable for plain index scans that use an MVCC snapshot. + + + + Because the table AM reads multiple index leaf pages ahead via + amgetbatch to facilitate this prefetching, it cannot + practically hold pins on all those pages simultaneously. Therefore, I/O + prefetching with amgetbatch is only possible when an + MVCC-compliant snapshot is in use. diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml index 9ccf5b739..4542e00b4 100644 --- a/doc/src/sgml/tableam.sgml +++ b/doc/src/sgml/tableam.sgml @@ -129,6 +129,14 @@ my_tableam_handler(PG_FUNCTION_ARGS) optional), the block number needs to provide locality. + + Table access methods must support ordered index scans using the + amgetbatch interface. See also + for details on interfacing with + amgetbatch index access methods, and managing the + scan's position. + + For crash safety, an AM can use postgres' WAL, or a custom implementation. diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out index 132b56a58..32bc3dd3e 100644 --- a/src/test/regress/expected/sysviews.out +++ b/src/test/regress/expected/sysviews.out @@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_incremental_sort | on enable_indexonlyscan | on enable_indexscan | on + enable_indexscan_prefetch | on enable_material | on enable_memoize | on enable_mergejoin | on @@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_seqscan | on enable_sort | on enable_tidscan | on -(25 rows) +(26 rows) -- There are always wait event descriptions for various types. InjectionPoint -- may be present or absent, depending on history since last postmaster start. -- 2.53.0