Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id YkpiH3sFL8XsPBgs@jrouhaud
Whole thread Raw
In response to Re: WIP: WAL prefetch (another approach)  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: WIP: WAL prefetch (another approach)
List pgsql-hackers
On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > So I finally finished looking at this patch.  Here again, AFAICS the feature is
> > working as expected and I didn't find any problem.  I just have some minor
> > comments, like for the previous patch.
>
> Thanks very much for the review.  I've attached a new version
> addressing most of your feedback, and also rebasing over the new
> WAL-logged CREATE DATABASE.  I've also fixed a couple of bugs (see
> end).
>
> > For the docs:
> >
> > +        Whether to try to prefetch blocks that are referenced in the WAL that
> > +        are not yet in the buffer pool, during recovery.  Valid values are
> > +        <literal>off</literal> (the default), <literal>on</literal> and
> > +        <literal>try</literal>.  The setting <literal>try</literal> enables
> > +        prefetching only if the operating system provides the
> > +        <function>posix_fadvise</function> function, which is currently used
> > +        to implement prefetching.  Note that some operating systems provide the
> > +        function, but don't actually perform any prefetching.
> >
> > Is there any reason not to change it to try?  I'm wondering if some system says
> > that the function exists but simply raise an error if you actually try to use
> > it.  I think that at least WSL does that for some functions.
>
> Yeah, we could just default it to try.  Whether we should ship that
> way is another question, but done for now.

Should there be an associated pg15 open item for that, when the patch will be
committed?  Note that in wal.sgml, the patch still says:

+   [...] By default, prefetching in
+   recovery is disabled.

I guess this should be changed even if we eventually choose to disable it by
default?

> I don't think there are any supported systems that have a
> posix_fadvise() that fails with -1, or we'd know about it, because
> we already use it in other places.  We do support one OS that provides
> a dummy function in libc that does nothing at all (Solaris/illumos),
> and at least a couple that enter the kernel but are known to do
> nothing at all for WILLNEED (AIX, FreeBSD).

Ah, I didn't know that, thanks for the info!

> >  bool
> >  XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
> >                    RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
> > +{
> > +   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
> > +}
> > +
> > +bool
> > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
> > +                   RelFileNode *rnode, ForkNumber *forknum,
> > +                   BlockNumber *blknum,
> > +                   Buffer *prefetch_buffer)
> >  {
> >
> > It's missing comments on that function.  XLogRecGetBlockTag comments should
> > probably be reworded at the same time.
>
> New comment added for XLogRecGetBlockInfo().  Wish I could come up
> with a better name for that...  Not quite sure what you thought I should
> change about XLogRecGetBlockTag().

Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it
would be better to document only the specific behavior for this one (so no
prefetch_buffer), rather than duplicating the whole description in both places.
It seems like a good recipe to miss one of the comments the next time something
is changed there.

For the name, why not the usual XLogRecGetBlockTagExtended()?

> > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> >          */
> >         if (lastSourceFailed)
> >         {
> > +           /*
> > +            * Don't allow any retry loops to occur during nonblocking
> > +            * readahead.  Let the caller process everything that has been
> > +            * decoded already first.
> > +            */
> > +           if (nonblocking)
> > +               return XLREAD_WOULDBLOCK;
> >
> > Is that really enough?  I'm wondering if the code path in ReadRecord() that
> > forces lastSourceFailed to False while it actually failed when switching into
> > archive recovery (xlogrecovery.c around line 3044) can be problematic here.
>
> I don't see the problem scenario, could you elaborate?

Sorry, I missed that in standby mode ReadRecord would keep going until a record
is found, so no problem indeed.

> > +   /* Do we have a clue where the buffer might be already? */
> > +   if (BufferIsValid(recent_buffer) &&
> > +       mode == RBM_NORMAL &&
> > +       ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
> > +   {
> > +       buffer = recent_buffer;
> > +       goto recent_buffer_fast_path;
> > +   }
> >
> > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?
>
> Hmm.  I guess ReadRecentBuffer() should really do that.  Done.

Ah, I also thought it be be better there but was assuming that there was some
possible usage where it's not wanted.  Good then!

Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is
incremented as appropriate?  FWIW that's the first place I looked when checking
if the stats would be incremented.

> > Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
> > so some comments would be helpful.
>
> OK, I'll come back to that.

Ok!

>
> > +/*
> > + * A callback that reads ahead in the WAL and tries to initiate one IO.
> > + */
> > +static LsnReadQueueNextStatus
> > +XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
> >
> > Should there be a bit more comments about what this function is supposed to
> > enforce?
>
> I have added a comment to explain.

small typos:

+ * Returns LRQ_NEXT_IO if the next block reference and it isn't in the buffer
+ * pool, [...]

I guess s/if the next block/if there's a next block/ or s/and it//.

+ * Returns LRQ_NO_IO if we examined the next block reference and found that it
+ * was already in the buffer pool.

should be LRQ_NEXT_NO_IO, and also this is returned if prefetching is disabled
or it the next block isn't prefetchable.

> > I'm wondering if it's a bit overkill to implement this as a callback.  Do you
> > have near future use cases in mind?  For now no other code could use the
> > infrastructure at all as the lrq is private, so some changes will be needed to
> > make it truly configurable anyway.
>
> Yeah.  Actually, in the next step I want to throw away the lrq part,
> and keep just the XLogPrefetcherNextBlock() function, with some small
> modifications.

Ah I see, that makes sense then.
>
> Admittedly the control flow is a little confusing, but the point of
> this architecture is to separate "how to prefetch one more thing" from
> "when to prefetch, considering I/O depth and related constraints".
> The first thing, "how", is represented by XLogPrefetcherNextBlock().
> The second thing, "when", is represented here by the
> LsnReadQueue/lrq_XXX stuff that is private in this file for now, but
> later I will propose to replace that second thing with the
> pg_streaming_read facility of commitfest entry 38/3316.  This is a way
> of getting there step by step.  I also wrote briefly about that here:
>
> https://www.postgresql.org/message-id/CA%2BhUKGJ7OqpdnbSTq5oK%3DdjSeVW2JMnrVPSm8JC-_dbN6Y7bpw%40mail.gmail.com

I unsurprisingly didn't read the direct IO patch, and also joined the
prefetching thread quite recently so I missed that mail.  Thanks for the
pointer!

>
> > If we keep it as a callback, I think it would make sense to extract some part,
> > like the main prefetch filters / global-limit logic, so other possible
> > implementations can use it if needed.  It would also help to reduce this
> > function a bit, as it's somewhat long.
>
> I can't imagine reusing any of those filtering things anywhere else.
> I admit that the function is kinda long...

Yeah, I thought your plan was to provide custom prefetching method or something
like that.  As-is, apart from making the function less long it wouldn't do
much.

> Other changes:
> [...]
> 3.  The handling for XLOG_SMGR_CREATE was firing for every fork, but
> it really only needed to fire for the main fork, for now.  (There's no
> reason at all this thing shouldn't prefetch other forks, that's just
> left for later).

Ah indeed.  While at it, should there some comments on top of the file
mentioning that only the main fork is prefetched?

> 4.  To make it easier to see the filtering logic at work, I added code
> to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL.
> Could be extended to show more internal state and events...

FTR I also tested the patch defining this.  I will probably define it on my
buildfarm animal when the patch is committed to make sure it doesn't get
broken.

> 5.  While retesting various scenarios, it bothered me that big seq
> scan UPDATEs would repeatedly issue posix_fadvise() for the same block
> (because multiple rows in a page are touched by consecutive records,
> and the page doesn't make it into the buffer pool until a bit later).
> I resurrected the defences I had against that a few versions back
> using a small window of recent prefetches, which I'd originally
> developed as a way to avoid explicit prefetches of sequential scans
> (prefetch 1, 2, 3, ...).  That turned out to be useless superstition
> based on ancient discussions in this mailing list, but I think it's
> still useful to avoid obviously stupid sequences of repeat system
> calls (prefetch 1, 1, 1, ...).  So now it has a little one-cache-line
> sized window of history, to avoid doing that.

Nice!

+ * To detect repeat access to the same block and skip useless extra system
+ * calls, we remember a small windows of recently prefetched blocks.

Should it be "repeated" access, and small window (singular)?

Also, I'm wondering if the "seq" part of the related pieces is a bit too much
specific, as there could be other workloads that lead to repeated update of the
same blocks.  Maybe it's ok to use it for internal variables, but the new
skip_seq field seems a bit too obscure for some user facing thing.  Maybe
skip_same, skip_repeated or something like that?

> I need to re-profile a few workloads after these changes, and then
> there are a couple of bikeshed-colour items:
>
> 1.  It's completely arbitrary that it limits its lookahead to
> maintenance_io_concurrency * 4 blockrefs ahead in the WAL.  I have no
> principled reason to choose 4.  In the AIO version of this (to
> follow), that number of blocks finishes up getting pinned at the same
> time, so more thought might be needed on that, but that doesn't apply
> here yet, so it's a bit arbitrary.

Yeah, I don't see that as a blocker for now.  Maybe use some #define to make it
more obvious though, as it's a bit hidden in the code right now?

> 3.  At some point in this long thread I was convinced to name the view
> pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch.
> That seems silly...

FWIW I prefer recovery_prefetch to prefetch_recovery.



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side