Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: 103ec76d-8bbc-09e1-460c-8785751431b5@enterprisedb.com
In response to: Re: WIP: WAL prefetch (another approach) (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On 2/10/21 10:50 PM, Stephen Frost wrote:
>
> ...
>
>> +/*
>> + * Scan the current record for block references, and consider prefetching.
>> + *
>> + * Return true if we processed the current record to completion and still have
>> + * queue space to process a new record, and false if we saturated the I/O
>> + * queue and need to wait for recovery to advance before we continue.
>> + */
>> +static bool
>> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
>> +{
>> +	DecodedXLogRecord *record = prefetcher->record;
>> +
>> +	Assert(!XLogPrefetcherSaturated(prefetcher));
>> +
>> +	/*
>> +	 * We might already have been partway through processing this record when
>> +	 * our queue became saturated, so we need to start where we left off.
>> +	 */
>> +	for (int block_id = prefetcher->next_block_id;
>> +		 block_id <= record->max_block_id;
>> +		 ++block_id)
>> +	{
>> +		DecodedBkpBlock *block = &record->blocks[block_id];
>> +		PrefetchBufferResult prefetch;
>> +		SMgrRelation reln;
>> +
>> +		/* Ignore everything but the main fork for now. */
>> +		if (block->forknum != MAIN_FORKNUM)
>> +			continue;
>> +
>> +		/*
>> +		 * If there is a full page image attached, we won't be reading the
>> +		 * page, so you might think we should skip it. However, if the
>> +		 * underlying filesystem uses larger logical blocks than us, it
>> +		 * might still need to perform a read-before-write some time later.
>> +		 * Therefore, only prefetch if configured to do so.
>> +		 */
>> +		if (block->has_image && !recovery_prefetch_fpw)
>> +		{
>> +			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
>> +			continue;
>> +		}
>
> FPIs in the stream aren't going to just avoid reads when the
> filesystem's block size matches PG's- they're also going to avoid
> subsequent modifications to the block, provided we don't end up pushing
> that block out of shared buffers, right?
>
> That is, if you have an empty shared buffers and see:
>
> Block 5 FPI
> Block 6 FPI
> Block 5 Update
> Block 6 Update
>
> it seems like, with this patch, we're going to prefetch blocks 5 & 6,
> even though we almost certainly won't actually need them.
>

Yeah, that's a good point. I think it'd make sense to keep track of
recent FPIs and skip prefetching such blocks. But how exactly should we
implement that, and how many blocks do we need to track? If we see an
FPI, for how long should we skip prefetching that block?

I don't think the history needs to be very long, for two reasons.
Firstly, the usual pattern is an FPI followed shortly by several changes
to the same block. Secondly, maintenance_io_concurrency limits this
naturally - by the time we're that far ahead, redo should already have
placed the FPI into shared buffers, allowing us to skip the prefetch
anyway.

So I think a window of maintenance_io_concurrency is sufficient. We
might track more buffers to also skip prefetches of blocks that were
evicted from shared buffers, but that seems like overkill.

However, maintenance_io_concurrency can be quite high, so a simple queue
is not very suitable - searching it linearly for each block would be too
expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block
at the position determined by hash(relfilenode, block):

    hashtable[hash(relfilenode, block)] = {relfilenode, block, LSN}

Then, when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip
the prefetch if it's sufficiently recent. Otherwise we prefetch. We may
issue some extra prefetches due to collisions, but I think that's fine -
there should not be very many of them, thanks to the hash table being
oversized.
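To make the idea concrete, here's a rough standalone sketch. This is not
patch code - the names, the toy hash function, the fixed window size and
the recency check are all made up for illustration; real code would use
RelFileNode, XLogRecPtr and a proper hash function:

```c
#include <stdint.h>

/* Illustrative stand-in for maintenance_io_concurrency. */
#define IO_CONCURRENCY	10
#define TABLE_SIZE		(2 * IO_CONCURRENCY)	/* oversized to reduce collisions */

typedef struct RecentBlock
{
	uint32_t	relfilenode;	/* 0 = empty slot (toy convention) */
	uint32_t	blkno;
	uint64_t	lsn;			/* LSN at which we prefetched / saw the FPI */
} RecentBlock;

/* Fixed-size array, no allocations needed. */
static RecentBlock recent[TABLE_SIZE];

static unsigned
hash_block(uint32_t relfilenode, uint32_t blkno)
{
	/* toy multiplicative mix, not PostgreSQL's actual hash */
	return (relfilenode * 2654435761u ^ blkno * 40503u) % TABLE_SIZE;
}

/* Remember a block we just prefetched, or for which we saw an FPI. */
static void
remember_block(uint32_t relfilenode, uint32_t blkno, uint64_t lsn)
{
	unsigned	h = hash_block(relfilenode, blkno);

	recent[h].relfilenode = relfilenode;
	recent[h].blkno = blkno;
	recent[h].lsn = lsn;
}

/*
 * Skip the prefetch only if the single candidate slot matches both keys
 * and its LSN is recent enough.  A collision simply means we issue one
 * extra (harmless) prefetch.
 */
static int
should_skip_prefetch(uint32_t relfilenode, uint32_t blkno,
					 uint64_t current_lsn, uint64_t recency_window)
{
	RecentBlock *slot = &recent[hash_block(relfilenode, blkno)];

	return slot->relfilenode == relfilenode &&
		   slot->blkno == blkno &&
		   current_lsn - slot->lsn <= recency_window;
}
```

The point of the one-slot probe is that both insert and lookup are O(1)
with no chaining, so the cost per WAL block reference stays negligible.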
The good thing is that this is a simple, fixed-size data structure, with
no need for allocations etc.

>> +		/* Fast path for repeated references to the same relation. */
>> +		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
>> +		{
>> +			/*
>> +			 * If this is a repeat access to the same block, then skip it.
>> +			 *
>> +			 * XXX We could also check for last_blkno + 1 too, and also update
>> +			 * last_blkno; it's not clear if the kernel would do a better job
>> +			 * of sequential prefetching.
>> +			 */
>> +			if (block->blkno == prefetcher->last_blkno)
>> +			{
>> +				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
>> +				continue;
>> +			}
>
> I'm sure this will help with some cases, but it wouldn't help with the
> case that I mention above, as I understand it.
>

It won't, but it's a pretty effective check nonetheless. In some recent
experiments with random pgbench, it eliminated ~15% of prefetches.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company