From da06d010fff0b21eeccac2234e62d5795db16760 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 15 Nov 2025 14:03:58 -0500 Subject: [PATCH v12 10/23] Add heapam index scan I/O prefetching. This commit implements I/O prefetching for index scans (and index-only scans that require heap fetches). This was made possible by the recent addition of batching interfaces to both the table AM and index AM APIs. The amgetbatch index AM interface provides batches of matching TIDs (rather than one tuple at a time), each of which must be taken from index tuples that appear together on a single index page. This allows multiple batches to be held open simultaneously. Giving the table AM an explicit understanding of index AM concepts/index page boundaries allows it to consider all of the relevant costs and benefits. Prefetching is implemented using a prefetching position under the control of the table AM and core code. This is closely related to the scan position added by commit FIXME, which introduced the amgetbatch interface. A read stream callback advances the prefetching position as needed to provide the read stream with sufficiently many heap block numbers to maintain its target prefetch distance. Testing has shown that index prefetching can make index scans much faster. Large range scans that return many tuples can be as much as 35x faster. An important goal of the amgetbatch design is to enable the table AM's read stream callback to advance its prefetch position using TIDs that appear on a leaf page that's ahead of the current scan position's leaf page. This is crucial with scans of indexes where each leaf page happens to have relatively few distinct heap blocks among its matching TIDs (as well as with scans whose leaf pages have relatively few matching items overall). Index scans can have as many as 64 open batches, which testing has shown to be about the maximum number that can ever be useful. Batches are maintained in scan order using a simple ring buffer data structure. In rare cases where the scan exceeds this quasi-arbitrary limit of 64, the read stream is temporarily paused. Prefetching (via the read stream) is resumed only after the scan position advances beyond its current open batch, at which point that batch is freed and removed from the scan's batch ring buffer. Testing has shown that it isn't very common for scans to hold open more than about 10 batches to get the desired I/O prefetch distance.
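The ring buffer indexes its slots with small unsigned counters that wrap around. Below is a minimal standalone sketch (plain C, not PostgreSQL code; uint8_t/int8_t stand-ins for the real BatchRingItemPos fields) of the modular comparison that the new index_scan_pos_cmp helper uses to order two positions. It is valid whenever the two positions are within 127 slots of each other, comfortably above the 64 open-batch ceiling:

	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Sketch of index_scan_pos_cmp's batch comparison: slot indexes wrap
	 * at 256, but casting their difference to int8_t recovers relative
	 * order whenever the positions are within 127 slots of each other.
	 */
	static int
	batch_cmp(uint8_t batch1, uint8_t batch2)
	{
		/* negative: batch1 behind; zero: same batch; positive: ahead */
		return (int8_t) (batch1 - batch2);
	}

	int
	main(void)
	{
		uint8_t		scan = 250;		/* scanPos batch slot */
		uint8_t		prefetch = 5;	/* prefetchPos slot, already wrapped */

		printf("%d\n", batch_cmp(prefetch, scan));	/* 11: prefetch ahead */
		printf("%d\n", batch_cmp(scan, prefetch));	/* -11: scan behind */
		return 0;
	}

This is also why the heapam code below invalidates prefetchPos rather than ever letting it trail scanPos by an unbounded number of batches (see the uint8 overflow note in the heapam_handler.c changes).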
Author: Tomas Vondra Author: Peter Geoghegan Reviewed-By: Andres Freund Reviewed-By: Thomas Munro Discussion: https://postgr.es/m/cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com --- src/include/access/heapam.h | 13 + src/include/access/relscan.h | 35 ++ src/include/optimizer/cost.h | 1 + src/backend/access/heap/heapam_handler.c | 372 +++++++++++++++++- src/backend/access/index/indexbatch.c | 52 ++- src/backend/access/nbtree/README | 2 +- src/backend/optimizer/path/costsize.c | 1 + src/backend/storage/aio/read_stream.c | 2 + src/backend/utils/misc/guc_parameters.dat | 7 + src/backend/utils/misc/postgresql.conf.sample | 1 + doc/src/sgml/config.sgml | 16 + doc/src/sgml/indexam.sgml | 103 ++++- doc/src/sgml/tableam.sgml | 8 + src/test/regress/expected/sysviews.out | 3 +- 14 files changed, 591 insertions(+), 25 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 55579b881..bb5a23dbd 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -123,6 +123,19 @@ typedef struct IndexFetchHeapData Buffer xs_vmbuf; /* visibility map buffer */ int xs_vm_items; /* # items to resolve visibility info for */ + /* For batch index scans that use read stream for prefetching */ + ReadStream *xs_read_stream; + + /* + * The read stream is allocated at the beginning of the scan and reset on + * rescan or when the scan direction changes. The scan direction is saved + * each time a new tuple is requested. If the scan direction changes from + * one tuple to the next, the read stream releases all previously pinned + * buffers and resets the prefetch block. + */ + ScanDirection xs_read_stream_dir; /* index scan direction */ + BlockNumber xs_prefetch_block; /* last block returned to xs_read_stream */ + bool xs_paused; /* paused until next batch is read? */ bool xs_lastinblock; /* last TID on this block in current batch? */ /* NB: if xs_cbuf or vmbuf are not InvalidBuffer, we hold a pin */ diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index a7abf5a78..6ab2dedc4 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -204,6 +204,10 @@ typedef struct IndexScanBatchData * This allows table AMs to avoid redundant amgetbatch calls with the same * priorbatch -- the index AM might need to read additional index pages to * determine there are no more matching items beyond caller's priorbatch. + * In particular, during prefetching the read stream callback discovers + * the end-of-scan via prefetchBatch. The table AM checks these flags so + * that the scan side doesn't repeat the same amgetbatch call when it + * later reaches that batch as scanBatch. */ bool knownEndBackward; bool knownEndForward; @@ -262,12 +266,21 @@ typedef struct IndexScanBatchData *IndexScanBatch; * matches in. However, table AMs are free to fetch table tuples in whatever * order is most convenient/efficient -- provided that such reordering cannot * affect the order that table_index_getnext_slot later returns tuples in. + * + * This data structure also provides table AMs with a way to read ahead of the + * current read position by _multiple_ batches/index pages. The further out + * the table AM reads ahead like this, the further it can see into the future. + * That way the table AM is able to reorder work as aggressively as desired. 
+ * For example, index scans sometimes need to read ahead by as many as a few + * dozen amgetbatch batches in order to maintain an optimal I/O prefetch + * distance (distance for reading table blocks/fetching table tuples). */ typedef struct BatchRingBuffer { /* current positions in batches[] for scan */ BatchRingItemPos scanPos; /* scan's read position */ BatchRingItemPos markPos; /* mark/restore position */ + BatchRingItemPos prefetchPos; /* prefetching position */ IndexScanBatch markBatch; @@ -477,6 +490,28 @@ index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch) ringbuf->nextBatch++; } +/* + * Compare two batch ring positions in the given scan direction. + * + * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is + * ahead of pos2. + */ +static inline int +index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2, + ScanDirection direction) +{ + int8 batchdiff = (int8) (pos1->batch - pos2->batch); + + if (batchdiff != 0) + return batchdiff; + + /* Same batch, compare items */ + if (ScanDirectionIsForward(direction)) + return pos1->item - pos2->item; + else + return pos2->item - pos1->item; +} + /* * Advance position to its next item in the batch. * diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h index f2fd5d315..419300a6b 100644 --- a/src/include/optimizer/cost.h +++ b/src/include/optimizer/cost.h @@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather; extern PGDLLIMPORT bool enable_seqscan; extern PGDLLIMPORT bool enable_indexscan; extern PGDLLIMPORT bool enable_indexonlyscan; +extern PGDLLIMPORT bool enable_indexscan_prefetch; extern PGDLLIMPORT bool enable_bitmapscan; extern PGDLLIMPORT bool enable_tidscan; extern PGDLLIMPORT bool enable_sort; diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 147c1a256..0d5e9907a 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -37,6 +37,7 @@ #include "commands/progress.h" #include "executor/executor.h" #include "miscadmin.h" +#include "optimizer/cost.h" #include "pgstat.h" #include "storage/bufmgr.h" #include "storage/bufpage.h" @@ -60,6 +61,9 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan); static bool BitmapHeapScanNextBlock(TableScanDesc scan, bool *recheck, uint64 *lossy_pages, uint64 *exact_pages); +static BlockNumber heapam_getnext_stream(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data); /* ------------------------------------------------------------------------ @@ -101,6 +105,17 @@ heapam_index_fetch_reset(IndexFetchTableData *scan) /* Rescans should avoid an excessive number of VM lookups */ hscan->xs_vm_items = 1; + /* Reset read stream direction unconditionally */ + hscan->xs_read_stream_dir = NoMovementScanDirection; + + /* Reset read stream itself, and other associated state */ + if (hscan->xs_read_stream) + { + hscan->xs_prefetch_block = InvalidBlockNumber; + hscan->xs_paused = false; + read_stream_reset(hscan->xs_read_stream); + } + /* * Deliberately avoid dropping any pins now held in xs_cbuf and xs_vmbuf.
* This saves cycles during certain tight nested loop joins, and during @@ -122,6 +137,9 @@ heapam_index_fetch_end(IndexFetchTableData *scan) if (BufferIsValid(hscan->xs_vmbuf)) ReleaseBuffer(hscan->xs_vmbuf); + if (hscan->xs_read_stream) + read_stream_end(hscan->xs_read_stream); + pfree(hscan); } @@ -191,7 +209,14 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, if (BufferIsValid(hscan->xs_cbuf)) ReleaseBuffer(hscan->xs_cbuf); - hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk); + /* + * When using a read stream, the stream will already know which block + * number comes next (though an assertion will verify a match below) + */ + if (hscan->xs_read_stream) + hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL); + else + hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk); /* * Prune page when it is pinned for the first time @@ -276,6 +301,30 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, * (important for inner index scans of anti-joins and semi-joins), and the * need to not hold onto index leaf pages for too long. * + * Dropping leaf page pins early + * ----------------------------- + * + * In no event will the scan be allowed to hold onto more than one batch's + * leaf page pin at a time. The primary reason for this restriction is to + * avoid unintended interactions with the read stream, which has its own + * strategy for keeping the number of pins held by the backend under control. + * + * Once we've resolved visibility for all items in a batch, we can safely drop + * its leaf page pin. This is safe with respect to concurrent VACUUM because + * index vacuuming will block on acquiring a conflicting cleanup lock on the + * batch's index page due to our holding a pin on that same page. Copying the + * relevant visibility map data into our local cache suffices to prevent unsafe + * concurrent TID recycling: if any of these TIDs point to dead heap tuples, + * VACUUM cannot possibly return from ambulkdelete and mark the pointed-to + * heap pages as all-visible. VACUUM _can_ do so once we release the batch's + * pin, but that's okay; we'll be working off of cached visibility info that + * indicates that the dead TIDs are NOT all-visible. + * + * Note: We cannot drop the pin early when the scan uses a non-MVCC snapshot; + * we must delay it until all heap fetches for the loaded batch have taken + * place. This is why we don't support prefetching during such scans. See + * doc/src/sgml/indexam.sgml. + * * Note on Memory Ordering Effects * ------------------------------- * @@ -490,11 +539,13 @@ static pg_attribute_hot IndexScanBatch heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction, IndexScanBatch priorBatch, BatchRingItemPos *pos) { + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; IndexScanBatch batch = NULL; BatchRingBuffer *batchringbuf PG_USED_FOR_ASSERTS_ONLY = &scan->batchringbuf; /* XXX: we should assert that a snapshot is pushed or registered */ Assert(TransactionIdIsValid(RecentXmin)); + Assert(direction == hscan->xs_read_stream_dir); if (!priorBatch) { @@ -576,9 +627,39 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction, */ if (unlikely(scan->xs_read_extremal_only) && priorBatch) { + Assert(!hscan->xs_read_stream); Assert(scan->xs_want_itup); return NULL; } + + /* + * Delay initializing stream until reading from scan's second batch. + * This heuristic avoids wasting cycles on starting a read stream for + * very selective index scans. 
We can likely improve upon this, but + * it works well enough for now. + * + * Also avoid prefetching during scans where we're unable to drop each + * batch's buffer pin right away (non-MVCC snapshot scans). We are + * not prepared to sensibly limit the total number of buffer pins held + * (the read stream handles all pin resource management for us, and knows + * nothing about pins held on index pages/within batches). + * + * Also delay creating a read stream during index-only scans that + * haven't done any heap fetches yet. We don't want to waste any + * cycles on allocating a read stream until we have a demonstrated + * need to perform heap fetches. + */ + if (!hscan->xs_read_stream && priorBatch && scan->MVCCScan && + hscan->xs_blk != InvalidBlockNumber && /* for index-only scans */ + enable_indexscan_prefetch) + { + Assert(!batchringbuf->prefetchPos.valid); + + hscan->xs_read_stream = + read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, + scan->heapRelation, MAIN_FORKNUM, + heapam_getnext_stream, scan, 0); + } } else { @@ -603,6 +684,32 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction, return batch; } +/* + * Handle a change in index scan direction (at the tuple granularity). + * + * Resets the read stream, since we can't rely on scanPos continuing to agree + * with the blocks that the read stream already consumed using prefetchPos. + * + * Note: iff the scan _continues_ in this new direction, and actually steps + * off scanBatch to an earlier index page, heapam_batch_getnext will deal with + * it. But that might never happen; the scan might yet change direction again + * (or just end before returning more items). + */ +static pg_noinline void +heapam_dirchange_readstream_reset(IndexFetchHeapData *hscan, + ScanDirection direction, + BatchRingBuffer *batchringbuf) +{ + /* Reset read stream state */ + batchringbuf->prefetchPos.valid = false; + hscan->xs_paused = false; + hscan->xs_read_stream_dir = direction; + + /* Reset read stream itself */ + if (hscan->xs_read_stream) + read_stream_reset(hscan->xs_read_stream); +} + /* ---------------- * heapam_batch_getnext_tid - get next TID from batch ring buffer * @@ -621,6 +728,12 @@ heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch); Assert(scanPos->valid || index_scan_batch_count(scan) == 0); + /* Handle resetting the read stream when scan direction changes */ + if (hscan->xs_read_stream_dir == NoMovementScanDirection) + hscan->xs_read_stream_dir = direction; /* first call */ + else if (unlikely(hscan->xs_read_stream_dir != direction)) + heapam_dirchange_readstream_reset(hscan, direction, batchringbuf); + /* * Check if there's an existing loaded scanBatch for us to return the next * matching item's TID/index tuple from */ { /* * scanPos is valid, so scanBatch must already be loaded in batch ring - * buffer. We rely on that here. + * buffer. We rely on that here (can't do this with prefetchBatch).
*/ Assert(batchringbuf->headBatch == scanPos->batch); @@ -674,21 +787,276 @@ heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, { IndexScanBatch headBatch = index_scan_batch(scan, batchringbuf->headBatch); + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; /* free obsolescent head batch (unless it is scan's markBatch) */ tableam_util_free_batch(scan, headBatch); + /* + * If we're about to release the batch that prefetchPos currently + * points to, just invalidate prefetchPos. We'll reinitialize it + * using scanPos if and when heapam_getnext_stream is next called. (We + * must avoid confusing a prefetchPos->batch that's actually before + * headBatch with one that's after nextBatch due to uint8 overflow; + * simplest way is to invalidate prefetchPos like this.) + */ + if (prefetchPos->valid && + prefetchPos->batch == batchringbuf->headBatch) + prefetchPos->valid = false; + /* Remove the batch from the ring buffer */ batchringbuf->headBatch++; + + if (hscan->xs_paused) + { + /* + * The scan's read stream was paused by heapam_getnext_stream due + * to exhausting all available free batch slots. We just freed up + * one such slot now, though. Resume the read stream to re-enable + * prefetching. + */ + Assert(!index_scan_batch_full(scan)); + read_stream_resume(hscan->xs_read_stream); + hscan->xs_paused = false; + } } /* In practice scanBatch will always be the ring buffer's headBatch */ Assert(batchringbuf->headBatch == scanPos->batch); + Assert(!hscan->xs_paused); return heapam_batch_return_tid(scan, hscan, direction, scanBatch, scanPos, all_visible); } +/* + * heapam_getnext_stream + * return the next block to pass to the read stream + * + * The initial batch is always loaded by heapam_batch_getnext_tid. We don't + * get called until the first read_stream_next_buffer() call, when a heap + * block is requested from the scan's stream for the first time. + * + * The position of the read_stream is stored in prefetchPos. It is typical for + * prefetchPos to consistently stay ahead of the scanPos position that's used to + * track the next TID to be returned to the scan by heapam_batch_getnext_tid + * after the first time we get called. However, that isn't a precondition. + * There is a strict postcondition, though: when we return we'll always leave + * scanPos <= prefetchPos (except in cases where we return InvalidBlockNumber). + */ +static BlockNumber +heapam_getnext_stream(ReadStream *stream, void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &batchringbuf->scanPos; + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; + ScanDirection xs_read_stream_dir = hscan->xs_read_stream_dir; + IndexScanBatch prefetchBatch; + bool fromScanPos = false; + + /* + * scanPos must always be valid when prefetching takes place. There has + * to be at least one batch, loaded as our scanBatch. The scan direction + * must be established, too. + */ + Assert(index_scan_batch_count(scan) > 0); + Assert(scan->MVCCScan); + Assert(scanPos->valid); + Assert(!hscan->xs_paused); + Assert(xs_read_stream_dir != NoMovementScanDirection); + + /* + * prefetchPos might not yet be valid. It might have also fallen behind + * scanPos. Deal with both. 
+ * + * If prefetchPos has not been initialized yet, that typically indicates + * that this is the first call here for the entire scan. We initialize + * prefetchPos using the current scanPos, since the current scanBatch + * item's TID should have its block number returned by the read stream + * first. It's likely that prefetchPos will get ahead of scanPos before + * long, but that hasn't happened yet. + * + * It's also possible for prefetchPos to "fall behind" scanPos, at least + * in a trivial sense: if many adjacent items are returned that contain + * TIDs that point to the same heap block, scanPos can actually overtake + * prefetchPos (prefetchPos can't advance until the scan actually calls + * read_stream_next_buffer). Reinitializing from scanPos is enough to + * ensure that prefetchPos still fetches the next heap block that scanPos + * will require (prefetchPos can never fall behind "by more than one group + * of items that all point to the same heap block", so this is safe). + * + * Note: when heapam_batch_getnext_tid frees a batch that prefetchPos + * points to, it'll invalidate prefetchPos for us. This removes any + * danger of prefetchPos.batch falling so far behind scanPos.batch that it + * wraps around (and appears to be ahead of scanPos instead of behind it). + */ + if (!prefetchPos->valid || + index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 0) + { + hscan->xs_prefetch_block = InvalidBlockNumber; + *prefetchPos = *scanPos; + fromScanPos = true; + + /* + * We must avoid holding on to any batch's buffer pin for more than an + * instant, to avoid undesirable interactions with the scan's read + * stream. Plain index scans always get this behavior automatically. + * Index-only scans are made to drop their buffer pin eagerly through + * a policy of always eagerly setting all the batch item's visibility + * info in one go. + */ + if (scan->xs_want_itup) + { + /* Make heapam_batch_resolve_visibility drop batch pins eagerly */ + hscan->xs_vm_items = scan->maxitemsbatch; + + /* Make sure that this new prefetchBatch holds no pin */ + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + if (BufferIsValid(prefetchBatch->buf)) + { + HeapBatchData *hbatch = heap_batch_data(prefetchBatch, scan); + + /* Set visibility info not set through scanBatch */ + heapam_batch_resolve_visibility(scan, xs_read_stream_dir, + prefetchBatch, hbatch, + prefetchPos); + } + + /* No buffer pin will be kept on any batch from here on */ + Assert(!BufferIsValid(prefetchBatch->buf)); + } + } + + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + for (;;) + { + BatchMatchingItem *item; + BlockNumber prefetch_block; + + /* + * We never call amgetbatch without immediately dropping the batch's + * buffer pin (which requires special care during index-only scans). + * The read stream is sensitive to buffer shortages, so we defensively + * avoid anything that visibly affects the per-backend buffer limit. + */ + Assert(!BufferIsValid(prefetchBatch->buf)); + + if (fromScanPos) + { + /* + * Don't increment item when prefetchPos was just initialized + * using scanPos. We'll return the scanPos item's heap block + * directly on the first call here. In other words, we'll return + * the heap block for the TID passed to heapam_index_fetch_tuple + * at the point where it called read_stream_next_buffer for the + * first time during the scan. 
+ */ + fromScanPos = false; + } + else if (!index_scan_pos_advance(xs_read_stream_dir, + prefetchBatch, prefetchPos)) + { + /* + * Ran out of items from prefetchBatch. Try to advance to the + * scan's next batch. + */ + if (unlikely(index_scan_batch_full(scan))) + { + /* + * Can't advance prefetchBatch because all available + * batchringbuf batch slots are currently in use. + * + * Deal with this by momentarily pausing the read stream. + * heapam_batch_getnext_tid will resume the read stream later, + * though only after scanPos has consumed all remaining items + * from scanBatch (at which point scanBatch will be freed, + * making its slot available for reuse by a later batch). + * + * In practice we hardly ever need to do this. It would be + * possible to avoid the need to pause the read stream by + * dynamically allocating slots, but that would add complexity + * for no real benefit. + */ + hscan->xs_paused = true; + return read_stream_pause(stream); + } + + prefetchBatch = heapam_batch_getnext(scan, xs_read_stream_dir, + prefetchBatch, prefetchPos); + if (!prefetchBatch) + { + /* + * Failed to load next batch, so all the batches that the scan + * will ever require (barring a change in scan direction) are + * now loaded + */ + return InvalidBlockNumber; + } + + /* Position prefetchPos to the start of new prefetchBatch */ + index_scan_pos_nextbatch(xs_read_stream_dir, + prefetchBatch, prefetchPos); + + if (scan->xs_want_itup && BufferIsValid(prefetchBatch->buf)) + { + HeapBatchData *hbatch = heap_batch_data(prefetchBatch, scan); + + /* make sure we have visibility info for the entire batch */ + heapam_batch_resolve_visibility(scan, xs_read_stream_dir, + prefetchBatch, hbatch, + prefetchPos); + } + + /* heapam_batch_resolve_visibility must drop buffer pin */ + Assert(!BufferIsValid(prefetchBatch->buf)); + } + + /* + * prefetchPos now points to the next item whose TID's heap block + * number might need to be prefetched + */ + Assert(index_scan_batch(scan, prefetchPos->batch) == prefetchBatch); + Assert(prefetchPos->item >= prefetchBatch->firstItem && + prefetchPos->item <= prefetchBatch->lastItem); + /* scanPos is always <= prefetchPos when we return */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, xs_read_stream_dir) <= 0); + + if (scan->xs_want_itup) + { + HeapBatchData *hbatch = heap_batch_data(prefetchBatch, scan); + + Assert(hbatch->visInfo[prefetchPos->item] & BATCH_VIS_CHECKED); + if (hbatch->visInfo[prefetchPos->item] & BATCH_VIS_ALL_VISIBLE) + { + /* item is known to be all-visible -- don't prefetch */ + continue; + } + } + + item = &prefetchBatch->items[prefetchPos->item]; + prefetch_block = ItemPointerGetBlockNumber(&item->tableTid); + + if (prefetch_block == hscan->xs_prefetch_block) + { + /* + * prefetch_block matches the last prefetchPos item's TID's heap + * block number; we must not return the same prefetch_block twice + * (twice in succession) + */ + continue; + } + + /* We have a new heap block number to return to read stream */ + hscan->xs_prefetch_block = prefetch_block; + return prefetch_block; + } + + return InvalidBlockNumber; +} + /* ---------------- * index_fetch_heap - get the scan's next heap tuple * diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c index cdfb9e762..1d6facf24 100644 --- a/src/backend/access/index/indexbatch.c +++ b/src/backend/access/index/indexbatch.c @@ -10,7 +10,10 @@ * approach enables efficient prefetching of table AM blocks during ordered * index scans. 
* - * The ring buffer loads batches in index key space order. + * The ring buffer loads batches in index key space order. This allows the + * table AM to maintain an adequate prefetch distance: its read stream + * callback is thereby able to request table blocks referenced by index pages + * that are well ahead of the current scan position's index page. * * There's three types of functions in this module: * @@ -28,6 +31,28 @@ * AMs that implement the amgetbatch interface. These manage batch * allocation, index page buffer lock release, and batch memory recycling. * + * These three layers coordinate without explicit coupling: the core lifecycle + * functions assume that table AMs use scanPos/scanBatch and prefetchPos/ + * prefetchBatch in a standardized way (see heapam_handler.c for the reference + * implementation), while table AMs assume that index AMs free and unlock + * batches according to the conventions established here. See indexam.sgml + * for the full specification of the amgetbatch/amkillitemsbatch contract. + * + * The table AM fully controls the read stream as its own private state. + * When the scan direction changes, the table AM must immediately reset its + * read stream and invalidate prefetchPos -- blocks already requested via + * prefetchPos will no longer match what scanPos needs to return. + * + * Crossing a batch boundary in a new scan direction is a separate process, + * handled here: table AMs are required to call tableam_util_batch_dirchange + * to leave the scan's batch ring buffer in a consistent state. The current + * implementation handles this by simply discarding most batches. The key + * invariant is that all loaded batches must be in a consistent scan direction + * order. (During cross-batch direction changes, the current scanBatch will + * have its IndexScanBatchData.dir flipped, but we have no provision for + * keeping all other loaded batches. It's not clear that it'd be useful to + * hold onto them; the scan direction is unlikely to change back.) + * * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * @@ -61,6 +86,7 @@ index_batchscan_init(IndexScanDesc scan) scan->batchringbuf.scanPos.valid = false; scan->batchringbuf.markPos.valid = false; + scan->batchringbuf.prefetchPos.valid = false; scan->batchringbuf.markBatch = NULL; scan->batchringbuf.headBatch = 0; /* initial head batch */ @@ -85,6 +111,7 @@ index_batchscan_reset(IndexScanDesc scan) batchringbuf->scanPos.valid = false; batchringbuf->markPos.valid = false; + batchringbuf->prefetchPos.valid = false; /* * Ensure tableam_util_free_batch won't skip the old markBatch in the loop @@ -221,7 +248,13 @@ index_batchscan_mark_pos(IndexScanDesc scan) * the current scanBatch when needed. * * We just discard all batches (other than markBatch/restored scanBatch), - * except when markBatch is already the scan's current scanBatch. + * except when markBatch is already the scan's current scanBatch. We always + * invalidate prefetchPos. The read stream and related prefetching state are + * reset by table_index_fetch_reset(), called before this function. This + * approach keeps things simple for table AMs: most code that deals with + * batches is thereby able to assume that the common case where scan direction + * never changes is the only case (tableam_util_batch_dirchange takes a + * similar approach to handling a cross-batch change in scan direction). 
*/ void index_batchscan_restore_pos(IndexScanDesc scan) @@ -236,6 +269,14 @@ index_batchscan_restore_pos(IndexScanDesc scan) Assert(!batchringbuf->done); Assert(markPos->valid); + /* + * Restoring a mark always requires stopping prefetching. This is similar + * to the handling table AMs implement to deal with a tuple-level change + * in the scan's direction. The read stream must have already been reset + * by the caller (via table_index_fetch_reset). + */ + batchringbuf->prefetchPos.valid = false; + if (scanBatch == markBatch) { /* markBatch is already scanBatch; needn't change batchringbuf */ @@ -305,6 +346,13 @@ index_batchscan_restore_pos(IndexScanDesc scan) * point on batchringbuf will look as if our new scan direction had been used * from the start. This approach isn't particularly efficient, but it works * well enough for what ought to be a relatively rare occurrence. + * + * Caller must have reset the scan's read stream before calling here. That + * needs to happen as soon as the scan requests a tuple in whatever scan + * direction is opposite-to-current. We only deal with the case where the + * scan backs up by enough items to cross a batch boundary (when the scan + * resumes scanning in its original direction/ends before crossing a boundary, + * there isn't any need to call here). */ void tableam_util_batch_dirchange(IndexScanDesc scan) diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index e75577a7e..3939391ae 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -186,7 +186,7 @@ interface. (See also, doc/src/sgml/indexam.sgml). Blocking VACUUM like this can be disruptive, so table AMs avoid it whenever possible. The heap table AM usually drops leaf page pins right away, though not during scans that use a non-MVCC snapshot. Index-only scans may also -retain pins in some cases. +retain pins in some cases, though prefetching requires dropping them. Opportunistic index tuple deletion performs the same page-level modifications as VACUUM, while only holding an exclusive lock. This is diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 89ca4e08b..78d87cd8b 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -145,6 +145,7 @@ int max_parallel_workers_per_gather = 2; bool enable_seqscan = true; bool enable_indexscan = true; bool enable_indexonlyscan = true; +bool enable_indexscan_prefetch = true; bool enable_bitmapscan = true; bool enable_tidscan = true; bool enable_sort = true; diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index cd54c1a74..e78c3b15a 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -930,6 +930,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) /* Look-ahead distance ramps up rapidly after we do I/O. 
*/ distance = stream->distance * 2; + if (distance && distance < PG_INT16_MAX) + distance++; distance = Min(distance, stream->max_pinned_buffers); stream->distance = distance; diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat index a5a0edf25..78c3c647f 100644 --- a/src/backend/utils/misc/guc_parameters.dat +++ b/src/backend/utils/misc/guc_parameters.dat @@ -891,6 +891,13 @@ boot_val => 'true', }, +{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', + short_desc => 'Enables prefetching for index scans and index-only scans.', + flags => 'GUC_EXPLAIN', + variable => 'enable_indexscan_prefetch', + boot_val => 'true', +}, + { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', short_desc => 'Enables the planner\'s use of materialization.', flags => 'GUC_EXPLAIN', diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e686d88af..aad256ea8 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -417,6 +417,7 @@ #enable_incremental_sort = on #enable_indexscan = on #enable_indexonlyscan = on +#enable_indexscan_prefetch = on #enable_material = on #enable_memoize = on #enable_mergejoin = on diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8cdd826fb..4a4a09ad7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5712,6 +5712,22 @@ ANY num_sync ( + enable_indexscan_prefetch (boolean) + + enable_indexscan_prefetch configuration parameter + + + + + Enables or disables prefetching for index-scan and index-only-scan + plan types. Prefetching can improve performance by reading table AM + pages ahead of when they are needed during index scans. The default + is on. + + + + enable_material (boolean) diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 213a05ad8..146990e9c 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -808,9 +808,12 @@ amgetbatch (IndexScanDesc scan, The amgetbatch interface is an alternative to amgettuple that returns matching index entries in batches - rather than one at a time. By returning all matching index entries from a - single index page together, the table AM gains visibility into which table - blocks will be needed in the near future. + rather than one at a time. This enables the table access method to + optimize table block access patterns and perform I/O prefetching. + By returning all matching index entries from a single index page together, + the table AM can read ahead through the index and identify which table + blocks will be needed, allowing prefetching of table AM pages during + ordered index scans. @@ -1341,6 +1344,60 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); or vice versa, if its internal implementation is unsuited to one API or the other. + + Table AM Considerations for Batch Scanning + + + This section is primarily relevant to + table access method authors. + When an index scan uses the amgetbatch interface, + the table AM is responsible for managing position state within the + IndexScanDesc's + batchringbuf and for controlling when + buffer pins on index pages are released. + + +
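The sketch below condenses the contract spelled out by the field descriptions that follow. It is a self-contained toy modeled loosely on this patch's heapam_getnext_stream, using simplified stand-in types rather than the real PostgreSQL structures: a read-stream-style callback that (re)initializes prefetchPos from scanPos, advances it across batch boundaries, and suppresses consecutive duplicate block numbers. Pause/resume handling for a full ring buffer is omitted.

	#include <stdio.h>

	#define InvalidBlockNumber	(-1)

	typedef struct Batch
	{
		int			blocks[3];	/* heap block of each matching TID */
		int			nitems;
	} Batch;

	typedef struct Pos
	{
		int			batch;
		int			item;
		int			valid;
	} Pos;

	/*
	 * Three already-loaded batches; duplicate blocks occur within and
	 * across batch boundaries, as with real index scans.
	 */
	static Batch ring[] = {
		{{10, 10, 11}, 3},
		{{11, 12, 12}, 3},
		{{20, 21, 22}, 3},
	};
	static int	nbatches = 3;

	static Pos	scanPos = {0, 0, 1};
	static Pos	prefetchPos = {0, 0, 0};
	static int	last_block = InvalidBlockNumber;

	/*
	 * Advance prefetchPos by one item, moving to the next batch as
	 * needed.  Returns 0 when there are no further items.
	 */
	static int
	advance(void)
	{
		if (++prefetchPos.item < ring[prefetchPos.batch].nitems)
			return 1;
		if (prefetchPos.batch + 1 >= nbatches)
			return 0;			/* no further batches to load */
		prefetchPos.batch++;
		prefetchPos.item = 0;
		return 1;
	}

	/* Read-stream-style callback: next block to prefetch, or Invalid */
	static int
	getnext_stream(void)
	{
		if (!prefetchPos.valid)
		{
			/* (re)initialize from scanPos; don't skip its current item */
			prefetchPos = scanPos;
			last_block = InvalidBlockNumber;
		}
		else if (!advance())
			return InvalidBlockNumber;

		for (;;)
		{
			int			block = ring[prefetchPos.batch].blocks[prefetchPos.item];

			if (block != last_block)
			{
				/* never return the same block twice in succession */
				last_block = block;
				return block;
			}
			if (!advance())
				return InvalidBlockNumber;
		}
	}

	int
	main(void)
	{
		int			block;

		while ((block = getnext_stream()) != InvalidBlockNumber)
			printf("prefetch block %d\n", block);	/* 10 11 12 20 21 22 */
		return 0;
	}
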
+ The scanPos field within + batchringbuf tracks which batch and item within + that batch will be returned next to the executor. The table AM must advance + scanPos as tuples are returned by + table_index_getnext_slot. The core code may also + modify this field during operations such as mark/restore. + + + + The prefetchPos field tracks the position used + for I/O prefetching. It is initialized from + scanPos and then advanced within the table AM's + read stream callback, allowing the table AM to prefetch table blocks + pointed to by items that are well ahead of the current scan position. + Initially prefetchPos matches + scanPos, but as the read stream ramps up it can + get far ahead, spanning multiple index pages if necessary to + maintain an optimal I/O prefetch distance for table block reads. A major + goal of the amgetbatch interface is to allow the + table AM to prefetch without being limited to items from the current + scanPos batch's index leaf page. + + + + Both scanPos and + prefetchPos are controlled by the table AM and + core code; index access methods should not access or manipulate these + fields. See the src/backend/access/heap/ + implementation for a reference example. + + + + Buffer pins on index pages returned by amgetbatch are + managed by the table AM. See the amgetbatch and + amkillitemsbatch descriptions in + for details. + + + + @@ -1434,31 +1491,39 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); This solution requires that amgettuple index scans be synchronous: the table AM must fetch each heap tuple immediately after scanning the corresponding index entry. This is - expensive for a number of reasons. An - asynchronous scan in which we collect many TIDs from the - index, and only visit the heap tuples sometime later, requires much less - index locking overhead and can allow a more efficient heap access pattern. + expensive for a number of reasons. The + amgetbatch interface, by contrast, was designed to + allow scans to be asynchronous: by collecting batches of + TIDs from multiple index pages, the table AM can prefetch the corresponding + table blocks well ahead of the current scan position (using asynchronous + I/O when available), requiring much less index locking overhead and allowing + a more efficient heap access pattern. Not all scans end up being + asynchronous in practice, but the interface is designed to allow it. Per the above analysis, we must use the synchronous approach for non-MVCC-compliant snapshots, but an asynchronous scan is workable for a query using an MVCC snapshot. - With amgetbatch scans, the table AM controls when - buffer pins on index pages are dropped rather than the index AM. - In practice, the heap table AM (and any table AM with similar concurrency - rules) usually drops pins eagerly for MVCC snapshot scans, but retains - pins for non-MVCC snapshot scans. Index-only scans may retain pins in - some cases, while plain index scans that use an MVCC snapshot always drop - their pins eagerly. Index access methods that implement - amgetbatch do not control when pins are dropped; that - decision is delegated to the table AM. + Because the table AM reads multiple index leaf pages ahead via + amgetbatch to facilitate this prefetching, it cannot + practically hold pins on all those pages simultaneously. Therefore, + I/O prefetching with + amgetbatch is only possible when an MVCC-compliant + snapshot is in use. In practice, the heap table AM (and any table AM + with similar concurrency rules) usually drops pins eagerly for MVCC + snapshot scans, but retains pins for non-MVCC snapshot scans.
Index-only + scans may retain pins in some cases, while plain index scans that use an + MVCC snapshot always drop their pins eagerly. Index access methods that + implement amgetbatch do not control when pins are + dropped; that decision is delegated to the table AM. - In an amgetbitmap index scan, the access method does - not keep an index pin on any of the returned tuples. Therefore - it is only safe to use such scans with MVCC-compliant snapshots. + Similarly, an amgetbitmap index scan is inherently + asynchronous: all matching TIDs are collected into a bitmap before any heap + access begins. Such scans therefore require an MVCC-compliant snapshot, + and there is no need for the access method to hold index page pins. diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml index 9ccf5b739..8e70a6196 100644 --- a/doc/src/sgml/tableam.sgml +++ b/doc/src/sgml/tableam.sgml @@ -129,6 +129,14 @@ my_tableam_handler(PG_FUNCTION_ARGS) optional), the block number needs to provide locality. + + Table access methods can support ordered index scans using the + amgetbatch interface. See also + for details on interfacing with + amgetbatch index access methods, and managing the + scan's position. + + For crash safety, an AM can use postgres' WAL, or a custom implementation. diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out index 132b56a58..32bc3dd3e 100644 --- a/src/test/regress/expected/sysviews.out +++ b/src/test/regress/expected/sysviews.out @@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_incremental_sort | on enable_indexonlyscan | on enable_indexscan | on + enable_indexscan_prefetch | on enable_material | on enable_memoize | on enable_mergejoin | on @@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_seqscan | on enable_sort | on enable_tidscan | on -(25 rows) +(26 rows) -- There are always wait event descriptions for various types. InjectionPoint -- may be present or absent, depending on history since last postmaster start. -- 2.53.0
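One further illustration: the read_stream.c hunk in this patch changes the post-I/O look-ahead ramp-up from plain doubling to doubling plus one. A standalone sketch of the resulting growth, assuming PG_INT16_MAX is 32767 and an arbitrary example cap of 256 pinned buffers:

	#include <stdio.h>

	#define PG_INT16_MAX	32767

	int
	main(void)
	{
		int			distance = 1;
		int			max_pinned_buffers = 256;	/* arbitrary example cap */

		for (int io = 1; io <= 10; io++)
		{
			/* mirrors the patched ramp-up in read_stream_next_buffer() */
			distance = distance * 2;
			if (distance && distance < PG_INT16_MAX)
				distance++;
			if (distance > max_pinned_buffers)
				distance = max_pinned_buffers;

			/* prints 3, 7, 15, 31, 63, 127, 255, 256, 256, 256 */
			printf("after I/O #%d: distance = %d\n", io, distance);
		}
		return 0;
	}

Without the extra increment the sequence would be 2, 4, 8, and so on; the patch does not spell out the motivation for the added step, so the above should be read purely as a description of the new behavior.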