
From Tomas Vondra
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id 103ec76d-8bbc-09e1-460c-8785751431b5@enterprisedb.com
In response to Re: WIP: WAL prefetch (another approach)  (Stephen Frost <sfrost@snowman.net>)
Responses Re: WIP: WAL prefetch (another approach)
List pgsql-hackers
On 2/10/21 10:50 PM, Stephen Frost wrote:
>
> ...
>
>> +/*
>> + * Scan the current record for block references, and consider prefetching.
>> + *
>> + * Return true if we processed the current record to completion and still have
>> + * queue space to process a new record, and false if we saturated the I/O
>> + * queue and need to wait for recovery to advance before we continue.
>> + */
>> +static bool
>> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
>> +{
>> +    DecodedXLogRecord *record = prefetcher->record;
>> +
>> +    Assert(!XLogPrefetcherSaturated(prefetcher));
>> +
>> +    /*
>> +     * We might already have been partway through processing this record when
>> +     * our queue became saturated, so we need to start where we left off.
>> +     */
>> +    for (int block_id = prefetcher->next_block_id;
>> +         block_id <= record->max_block_id;
>> +         ++block_id)
>> +    {
>> +        DecodedBkpBlock *block = &record->blocks[block_id];
>> +        PrefetchBufferResult prefetch;
>> +        SMgrRelation reln;
>> +
>> +        /* Ignore everything but the main fork for now. */
>> +        if (block->forknum != MAIN_FORKNUM)
>> +            continue;
>> +
>> +        /*
>> +         * If there is a full page image attached, we won't be reading the
>> +         * page, so you might think we should skip it.  However, if the
>> +         * underlying filesystem uses larger logical blocks than us, it
>> +         * might still need to perform a read-before-write some time later.
>> +         * Therefore, only prefetch if configured to do so.
>> +         */
>> +        if (block->has_image && !recovery_prefetch_fpw)
>> +        {
>> +            pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
>> +            continue;
>> +        }
> 
> FPIs in the stream aren't going to just avoid reads when the
> filesystem's block size matches PG's - they're also going to avoid
> subsequent modifications to the block, provided we don't end up pushing
> that block out of shared buffers, right?
> 
> That is, if you have an empty shared buffers and see:
> 
> Block 5 FPI
> Block 6 FPI
> Block 5 Update
> Block 6 Update
> 
> it seems like, with this patch, we're going to Prefetch Block 5 & 6,
> even though we almost certainly won't actually need them.
> 

Yeah, that's a good point. I think it'd make sense to keep track of 
recent FPIs and skip prefetching those blocks. But how exactly should we 
implement that - how many blocks do we need to track, and after seeing 
an FPI, for how long should we keep skipping prefetches of that block?

I don't think the history needs to be very long, for two reasons. 
Firstly, the usual pattern is an FPI followed shortly by several changes 
to that same block. Secondly, maintenance_io_concurrency limits this 
naturally - once we're more than that far ahead, redo should already 
have placed the FPI into shared buffers, allowing us to skip the 
prefetch.

So I think sizing the history by maintenance_io_concurrency is 
sufficient. We might track more blocks, to also skip prefetches of 
blocks that were since evicted from shared buffers, but that seems like 
overkill.

However, maintenance_io_concurrency can be quite high, so a simple queue 
is not very suitable - searching it linearly for each block would be too 
expensive. But I think we can use a simple hash table tracking 
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) 
elements, and whenever we prefetch a block or find an FPI, we simply add 
the block to the array at the position determined by hash(relfilenode, 
block):

     hashtable[hash(...)] = {relfilenode, block, LSN}

and then, when deciding whether to prefetch a block, we look at that one 
position. If the (relfilenode, block) matches, we check the LSN and skip 
the prefetch if it's sufficiently recent. Otherwise we prefetch.

We may issue some extra prefetches due to hash collisions, but I think 
that's fine - there should not be very many of them, thanks to the hash 
table being over-sized.

The good thing is that this is a simple, fixed-size data structure - 
there's no need for allocations etc.
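
Roughly like this - just a sketch to illustrate the idea; the names, the 
hash function and the "recent enough" test are all made up here, not 
taken from the patch:

    /* needs storage/relfilenode.h, storage/block.h, access/xlogdefs.h */

    /* made-up LSN distance standing in for "sufficiently recent" */
    #define PREFETCH_FILTER_WINDOW  (16 * 1024 * 1024)

    /*
     * One entry per slot; a collision simply overwrites the old entry,
     * which at worst costs us one extra prefetch later.
     */
    typedef struct PrefetchFilterEntry
    {
        RelFileNode rnode;
        BlockNumber blkno;
        XLogRecPtr  lsn;
    } PrefetchFilterEntry;

    static PrefetchFilterEntry *filter;  /* 2 * maintenance_io_concurrency */
    static int      filter_nslots;

    static inline int
    PrefetchFilterSlot(RelFileNode rnode, BlockNumber blkno)
    {
        /* any cheap hash of (relfilenode, block) will do here */
        return (int) ((rnode.relNode * 1000003 + blkno) % filter_nslots);
    }

    /* remember the block whenever we issue a prefetch or see an FPI */
    static void
    PrefetchFilterInsert(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        PrefetchFilterEntry *entry = &filter[PrefetchFilterSlot(rnode, blkno)];

        entry->rnode = rnode;
        entry->blkno = blkno;
        entry->lsn = lsn;
    }

    /* true if we saw this block recently enough to skip the prefetch */
    static bool
    PrefetchFilterContains(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        PrefetchFilterEntry *entry = &filter[PrefetchFilterSlot(rnode, blkno)];

        return RelFileNodeEquals(entry->rnode, rnode) &&
               entry->blkno == blkno &&
               (lsn - entry->lsn) < PREFETCH_FILTER_WINDOW;
    }

Initialization is then just a single palloc0 of the array at startup, 
sized from maintenance_io_concurrency.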



>> +        /* Fast path for repeated references to the same relation. */
>> +        if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
>> +        {
>> +            /*
>> +             * If this is a repeat access to the same block, then skip it.
>> +             *
>> +             * XXX We could also check for last_blkno + 1 too, and also update
>> +             * last_blkno; it's not clear if the kernel would do a better job
>> +             * of sequential prefetching.
>> +             */
>> +            if (block->blkno == prefetcher->last_blkno)
>> +            {
>> +                pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
>> +                continue;
>> +            }
> 
> I'm sure this will help with some cases, but it wouldn't help with the
> case that I mention above, as I understand it.
> 

It won't, but it's a pretty effective check nonetheless. I've done some 
experiments recently, and with a random pgbench workload this check 
eliminates ~15% of prefetches.
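
(For the record, the XXX variant mentioned in that comment would be just 
a small extension of the existing check - a sketch, reusing the patch's 
field names; whether it beats the kernel's own sequential read-ahead is 
exactly the open question:)

    /* skip repeated or immediately sequential access, tracking position */
    if (block->blkno == prefetcher->last_blkno ||
        block->blkno == prefetcher->last_blkno + 1)
    {
        prefetcher->last_blkno = block->blkno;
        pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
        continue;
    }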


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


