Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id 20210213213930.GM27507@tamriel.snowman.net
In response to Re: WIP: WAL prefetch (another approach)  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> > Yeah, that's a good point. I think it'd make sense to keep track of recent
> > FPIs and skip prefetching such blocks. But how exactly should we implement
> > that, how many blocks do we need to track? If you get an FPI, how long
> > should we skip prefetching of that block?
> >
> > I don't think the history needs to be very long, for two reasons. Firstly,
> > the usual pattern is that we have FPI + several changes for that block
> > shortly after it. Secondly, maintenance_io_concurrency limits this naturally
> > - after crossing that, redo should place the FPI into shared buffers,
> > allowing us to skip the prefetch.
> >
> > So I think using maintenance_io_concurrency is sufficient. We might track
> > more buffers to allow skipping prefetches of blocks that were evicted from
> > shared buffers, but that seems like an overkill.
> >
> > However, maintenance_io_concurrency can be quite high, so just a simple
> > queue is not very suitable - searching it linearly for each block would be
> > too expensive. But I think we can use a simple hash table, tracking
> > (relfilenode, block, LSN), over-sized to minimize collisions.
> >
> > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> > and whenever we prefetch a block or find an FPI, we simply add the block to
> > the array as determined by hash(relfilenode, block)
> >
> >     hashtable[hash(...)] = {relfilenode, block, LSN}
> >
> > and then when deciding whether to prefetch a block, we look at that one
> > position. If the (relfilenode, block) match, we check the LSN and skip the
> > prefetch if it's sufficiently recent. Otherwise we prefetch.
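
For concreteness, the overwrite-on-collision table described above could look roughly like this. This is only a minimal sketch with hypothetical names and a stand-in hash and relfilenode representation; real code would use the full RelFileNode struct and something like hash_bytes():

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct RecentBlockEntry
{
    uint32_t    relfilenode;    /* stand-in for the full RelFileNode */
    uint32_t    block;
    uint64_t    lsn;            /* LSN of the FPI or prefetch */
    bool        valid;
} RecentBlockEntry;

#define RECENT_BLOCK_SLOTS 64   /* ~ 2 * maintenance_io_concurrency */

static RecentBlockEntry recent_blocks[RECENT_BLOCK_SLOTS];

static uint32_t
recent_block_hash(uint32_t relfilenode, uint32_t block)
{
    /* cheap mixing only; real code would use hash_bytes() or similar */
    return (relfilenode * 2654435761u ^ block * 40503u) % RECENT_BLOCK_SLOTS;
}

/* Remember that we saw an FPI (or issued a prefetch) for this block. */
static void
recent_block_add(uint32_t relfilenode, uint32_t block, uint64_t lsn)
{
    RecentBlockEntry *e = &recent_blocks[recent_block_hash(relfilenode, block)];

    e->relfilenode = relfilenode;
    e->block = block;
    e->lsn = lsn;
    e->valid = true;
}

/*
 * Should we skip prefetching this block?  We only look at the one slot
 * it hashes to: on a mismatch we simply prefetch, so a collision costs
 * us nothing worse than a redundant prefetch.
 */
static bool
recent_block_skip_prefetch(uint32_t relfilenode, uint32_t block,
                           uint64_t min_lsn)
{
    RecentBlockEntry *e = &recent_blocks[recent_block_hash(relfilenode, block)];

    return e->valid &&
        e->relfilenode == relfilenode &&
        e->block == block &&
        e->lsn >= min_lsn;
}
```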
>
> I'm a bit doubtful this is really needed at this point. Yes, the
> prefetching will do a buffer table lookup - but it's a lookup that
> already happens today. And the patch already avoids doing a second
> lookup after prefetching (by optimistically caching the last Buffer id,
> and re-checking).

I agree that when a page is looked up, and found, in the buffer table,
the subsequent caching of the buffer id in the WAL records does a good
job of avoiding having to re-do that lookup.  However, that isn't the
case which was being discussed here, nor what Tomas's suggestion was
intended to address.

What I pointed out up-thread, and what's being discussed here, is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed together rather than in the ideal order.  When that
happens, with this patch, the FPIs will be ignored, the regular WAL
records will reference blocks which aren't found in shared buffers
(yet), and we'll both issue prefetches for those blocks and end up
having spent effort on a buffer lookup that we'll later have to re-do.

To address the unnecessary syscalls, we really just need to keep track
of any FPIs that we've seen between the point where the prefetching is
happening and the point where the replay is being done.  Once replay
has replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer holding the FPI's page; in other words, only
wal_decode_buffer_size worth of WAL needs to be considered.
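
A minimal sketch of that window test (types and names here are
stand-ins for illustration, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's type */

/* Hypothetical record of one FPI noticed by the WAL decoder. */
typedef struct PendingFPI
{
    uint32_t    block;
    XLogRecPtr  fpi_lsn;        /* where in the WAL the FPI sits */
} PendingFPI;

/*
 * An FPI only matters while it is still ahead of the replay position:
 * once replay has passed fpi_lsn, the page has been restored into
 * shared buffers and the normal buffer-table lookup succeeds anyway.
 * Since prefetching runs at most wal_decode_buffer_size ahead of
 * replay, that is the only window of WAL we ever need to track.
 */
static bool
should_issue_prefetch(const PendingFPI *f, uint32_t block,
                      XLogRecPtr replay_lsn)
{
    if (f->block == block && f->fpi_lsn > replay_lsn)
        return false;           /* replay will load this page for us */
    return true;
}
```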

We could further leverage this tracking of FPIs to skip the prefetch
syscalls by caching, along with the FPI record, which later records in
the queue address the blocks covered by FPIs earlier in the queue.
Then, when replay hits the FPI and loads the page into shared_buffers,
it could update those other WAL records in the queue with the buffer id
of the page, very likely allowing us to avoid having to do another
lookup later on.
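
That back-patching step might look something like this sketch (all
names are hypothetical, not from the patch):

```c
#include <stdint.h>

#define CACHED_BUFFER_UNKNOWN (-1)

/* Hypothetical entry for a decoded-but-not-yet-replayed WAL record. */
typedef struct QueuedRecord
{
    uint32_t    block;          /* block this WAL record touches */
    int         cached_buffer;  /* shared-buffer id, or UNKNOWN */
} QueuedRecord;

/*
 * When replay restores an FPI into shared buffers, walk the records
 * still queued behind it and fill in the buffer id for any that touch
 * the same block, so their later buffer-table lookups can be skipped.
 */
static void
fpi_replayed_backpatch(QueuedRecord *queue, int nqueued,
                       uint32_t block, int buffer_id)
{
    for (int i = 0; i < nqueued; i++)
    {
        if (queue[i].block == block)
            queue[i].cached_buffer = buffer_id;
    }
}
```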

> I think there's potential for some significant optimization going
> forward, but I think it's basically optimization over what we're doing
> today. As this is already a nontrivial patch, I'd argue for doing so
> separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case: specifically, where full-page writes are
disabled or checkpoints are very large.  That being the case, though, I
would think it's reasonable to ask that the patch go out of its way to
avoid slowing down the more common configurations, particularly since
it's proposed to be on by default (which I agree with, provided it ends
up improving the common cases, which I think the suggestions above
would certainly make more likely).

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much benchmarking in this
thread for that case, that is, where full-page writes are enabled and
checkpoints are smaller than shared buffers.  Jakub's testing was done
with FPWs disabled, and Tomas's testing used checkpoints which were
much larger than the size of shared buffers on the system doing the
replay.  While it's certainly good that this patch improves those
cases, we should also be looking out for the worst case and make sure
that the patch doesn't degrade performance there.

Thanks,

Stephen

