Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id CA+hUKGKs7cNkiMyNY9aSn2bF2AQNj9bb_dbY2B_6d_srnx5+mw@mail.gmail.com
Whole thread Raw
In response to Re: WIP: WAL prefetch (another approach)  (Julien Rouhaud <rjuju123@gmail.com>)
Responses Re: WIP: WAL prefetch (another approach)  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> So I finally finished looking at this patch.  Here again, AFAICS the feature is
> working as expected and I didn't find any problem.  I just have some minor
> comments, like for the previous patch.

Thanks very much for the review.  I've attached a new version
addressing most of your feedback, and also rebasing over the new
WAL-logged CREATE DATABASE.  I've also fixed a couple of bugs (see
end).

> For the docs:
>
> +        Whether to try to prefetch blocks that are referenced in the WAL that
> +        are not yet in the buffer pool, during recovery.  Valid values are
> +        <literal>off</literal> (the default), <literal>on</literal> and
> +        <literal>try</literal>.  The setting <literal>try</literal> enables
> +        prefetching only if the operating system provides the
> +        <function>posix_fadvise</function> function, which is currently used
> +        to implement prefetching.  Note that some operating systems provide the
> +        function, but don't actually perform any prefetching.
>
> Is there any reason not to change it to try?  I'm wondering if some system says
> that the function exists but simply raise an error if you actually try to use
> it.  I think that at least WSL does that for some functions.

Yeah, we could just default it to try.  Whether we should ship that
way is another question, but done for now.

I don't think there are any supported systems that have a
posix_fadvise() that fails with -1, or we'd know about it, because
we already use it in other places.  We do support one OS that provides
a dummy function in libc that does nothing at all (Solaris/illumos),
and at least a couple that enter the kernel but are known to do
nothing at all for WILLNEED (AIX, FreeBSD).

> +  <para>
> +   The <xref linkend="guc-recovery-prefetch"/> parameter can
> +   be used to improve I/O performance during recovery by instructing
> +   <productname>PostgreSQL</productname> to initiate reads
> +   of disk blocks that will soon be needed but are not currently in
> +   <productname>PostgreSQL</productname>'s buffer pool.
> +   The <xref linkend="guc-maintenance-io-concurrency"/> and
> +   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
> +   concurrency and distance, respectively.
> +   By default, prefetching in recovery is disabled.
> +  </para>
>
> I think that "improving I/O performance" is a bit misleading, maybe reduce I/O
> wait time or something like that?  Also, I don't know if we need to be that
> precise, but maybe we should say that it's the underlying kernel that will
> (asynchronously) initiate the reads, and postgres will simply notifies it.

Updated with this new text:

   The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
   I/O wait times during recovery by instructing the kernel to initiate reads
   of disk blocks that will soon be needed but are not currently in
   <productname>PostgreSQL</productname>'s buffer pool and will soon be read.

> +  <para>
> +   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
> +   one row.  It is filled with nulls if recovery is not running or WAL
> +   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
> +   for more information.
> +  </para>
>
> That's not the implemented behavior as far as I can see.  It just prints whatever is in SharedStats
> regardless of the recovery state or the prefetch_wal setting (assuming that
> there's no pending reset request).

Yeah.  Updated text: "It is filled with nulls if recovery has not run
or ...".

> Similarly, there's a mention that
> pg_stat_reset_shared('wal') will reset the stats, but I don't see anything
> calling XLogPrefetchRequestResetStats().

It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I
got carried away between v18 and v19 while simplifying the stats and
lost a hunk I should have kept.  Fixed.

> Finally, I think we should documented what are the cumulated counters in that
> view (that should get reset) and the dynamic counters (that shouldn't get
> reset).

OK, done.

> For the code:
>
>  bool
>  XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
>                    RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
> +{
> +   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
> +}
> +
> +bool
> +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
> +                   RelFileNode *rnode, ForkNumber *forknum,
> +                   BlockNumber *blknum,
> +                   Buffer *prefetch_buffer)
>  {
>
> It's missing comments on that function.  XLogRecGetBlockTag comments should
> probably be reworded at the same time.

New comment added for XLogRecGetBlockInfo().  Wish I could come up
with a better name for that...  Not quite sure what you thought I should
change about XLogRecGetBlockTag().

> +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
>            bool fetching_ckpt, TimeLineID replayTLI)
>  {
>     XLogRecord *record;
> +   XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
>
> nit: maybe name it XLogPrefetcherGetReader()?

OK.

>   * containing it (if not open already), and returns true. When end of standby
>   * mode is triggered by the user, and there is no more WAL available, returns
>   * false.
> + *
> + * If nonblocking is true, then give up immediately if we can't satisfy the
> + * request, returning XLREAD_WOULDBLOCK instead of waiting.
>   */
> -static bool
> +static XLogPageReadResult
>  WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>
> The comment still mentions a couple of time returning true/false rather than
> XLREAD_*, same for at least XLogPageRead().

Fixed.

> @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>          */
>         if (lastSourceFailed)
>         {
> +           /*
> +            * Don't allow any retry loops to occur during nonblocking
> +            * readahead.  Let the caller process everything that has been
> +            * decoded already first.
> +            */
> +           if (nonblocking)
> +               return XLREAD_WOULDBLOCK;
>
> Is that really enough?  I'm wondering if the code path in ReadRecord() that
> forces lastSourceFailed to False while it actually failed when switching into
> archive recovery (xlogrecovery.c around line 3044) can be problematic here.

I don't see the problem scenario, could you elaborate?

>                 {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
>                         gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
>                         gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch
referencedblocks."),
 
>                         GUC_UNIT_BYTE
>                 },
>                 &wal_decode_buffer_size,
>                 512 * 1024, 64 * 1024, INT_MAX,
>
> Should the max be MaxAllocSize?

Hmm.  OK, done.

> +   /* Do we have a clue where the buffer might be already? */
> +   if (BufferIsValid(recent_buffer) &&
> +       mode == RBM_NORMAL &&
> +       ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
> +   {
> +       buffer = recent_buffer;
> +       goto recent_buffer_fast_path;
> +   }
>
> Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?

Hmm.  I guess ReadRecentBuffer() should really do that.  Done.

> Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
> so some comments would be helpful.

OK, I'll come back to that.

> xlogprefetcher.c:
>
> + * data.  XLogRecBufferForRedo() cooperates uses information stored in the
> + * decoded record to find buffers ently.
>
> I'm not sure what you wanted to say here.  Also, I don't see any
> XLogRecBufferForRedo() anywhere, I'm assuming it's
> XLogReadBufferForRedo?

Yeah, typos.  I rewrote that comment.

> +/*
> + * A callback that reads ahead in the WAL and tries to initiate one IO.
> + */
> +static LsnReadQueueNextStatus
> +XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
>
> Should there be a bit more comments about what this function is supposed to
> enforce?

I have added a comment to explain.

> I'm wondering if it's a bit overkill to implement this as a callback.  Do you
> have near future use cases in mind?  For now no other code could use the
> infrastructure at all as the lrq is private, so some changes will be needed to
> make it truly configurable anyway.

Yeah.  Actually, in the next step I want to throw away the lrq part,
and keep just the XLogPrefetcherNextBlock() function, with some small
modifications.

Admittedly the control flow is a little confusing, but the point of
this architecture is to separate "how to prefetch one more thing" from
"when to prefetch, considering I/O depth and related constraints".
The first thing, "how", is represented by XLogPrefetcherNextBlock().
The second thing, "when", is represented here by the
LsnReadQueue/lrq_XXX stuff that is private in this file for now, but
later I will propose to replace that second thing with the
pg_streaming_read facility of commitfest entry 38/3316.  This is a way
of getting there step by step.  I also wrote briefly about that here:

https://www.postgresql.org/message-id/CA%2BhUKGJ7OqpdnbSTq5oK%3DdjSeVW2JMnrVPSm8JC-_dbN6Y7bpw%40mail.gmail.com

> If we keep it as a callback, I think it would make sense to extract some part,
> like the main prefetch filters / global-limit logic, so other possible
> implementations can use it if needed.  It would also help to reduce this
> function a bit, as it's somewhat long.

I can't imagine reusing any of those filtering things anywhere else.
I admit that the function is kinda long...

> Also, about those filters:
>
> +           if (rmid == RM_XLOG_ID)
> +           {
> +               if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
> +                   record_type == XLOG_END_OF_RECOVERY)
> +               {
> +                   /*
> +                    * These records might change the TLI.  Avoid potential
> +                    * bugs if we were to allow "read TLI" and "replay TLI" to
> +                    * differ without more analysis.
> +                    */
> +                   prefetcher->no_readahead_until = record->lsn;
> +               }
> +           }
>
> Should there be a note that it's still ok to process this record in the loop
> just after, as it won't contain any prefetchable data, or simply jump to the
> end of that loop?

Comment added.

> +/*
> + * Increment a counter in shared memory.  This is equivalent to *counter++ on a
> + * plain uint64 without any memory barrier or locking, except on platforms
> + * where readers can't read uint64 without possibly observing a torn value.
> + */
> +static inline void
> +XLogPrefetchIncrement(pg_atomic_uint64 *counter)
> +{
> +   Assert(AmStartupProcess() || !IsUnderPostmaster);
> +   pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> +}
>
> I'm curious about this one.  Is it to avoid expensive locking on platforms that
> don't have a lockless pg_atomic_fetch_add_u64?

My goal here is only to make sure that systems without
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY don't see bogus/torn values.  On
more typical systems, I just want plain old counter++, for the CPU to
feel free to reorder, without the overheads of LOCK XADD.

> +Datum
> +pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
> +{
> [...]
>
> This function could use the new SetSingleFuncCall() function introduced in
> 9e98583898c.

Oh, yeah, that looks much nicer!

> +# - Prefetching during recovery -
> +
> +#wal_decode_buffer_size = 512kB        # lookahead window used for prefetching
>
> This one should be documented as "(change requires restart)"

Done.

Other changes:

1.  The logic for handling relations and blocks that don't exist
(presumably, yet) wasn't quite right.  The previous version could
raise an error in smgrnblocks() if a referenced relation doesn't exist
at all on disk.  I don't know how to actually reach that case
(considering the analysis this thing does of SMGR create etc to avoid
touching relations that haven't been created yet), but if it is
possible somehow, then it will handle this gracefully.

To check for missing relations I use smgrexists().  To make that fast,
I changed it to not close segments when in recovery, which is OK
because recovery already closes SMGR relations when replaying anything
that would unlink files.

2.  The logic for filtering out access to an entire database wasn't
quite right.  In this new version, that's necessary only for
file-based CREATE DATABASE, since that does bulk creation of relations
without any individual WAL records to analyse.  This works by using
{inv, dbNode, inv} as a key in the filter hash table, but I was trying
to look things up by {spcNode, dbNode, inv}.  Fixed.

3.  The handling for XLOG_SMGR_CREATE was firing for every fork, but
it really only needed to fire for the main fork, for now.  (There's no
reason at all this thing shouldn't prefetch other forks, that's just
left for later).

4.  To make it easier to see the filtering logic at work, I added code
to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL.
Could be extended to show more internal state and events...

5.  While retesting various scenarios, it bothered me that big seq
scan UPDATEs would repeatedly issue posix_fadvise() for the same block
(because multiple rows in a page are touched by consecutive records,
and the page doesn't make it into the buffer pool until a bit later).
I resurrected the defences I had against that a few versions back
using a small window of recent prefetches, which I'd originally
developed as a way to avoid explicit prefetches of sequential scans
(prefetch 1, 2, 3, ...).  That turned out to be useless superstition
based on ancient discussions in this mailing list, but I think it's
still useful to avoid obviously stupid sequences of repeat system
calls (prefetch 1, 1, 1, ...).  So now it has a little one-cache-line
sized window of history, to avoid doing that.

I need to re-profile a few workloads after these changes, and then
there are a couple of bikeshed-colour items:

1.  It's completely arbitrary that it limits its lookahead to
maintenance_io_concurrency * 4 blockrefs ahead in the WAL.  I have no
principled reason to choose 4.  In the AIO version of this (to
follow), that number of blocks finishes up getting pinned at the same
time, so more thought might be needed on that, but that doesn't apply
here yet, so it's a bit arbitrary.

2.  Defaults for wal_decode_buffer_size and maintenance_io_concurrency
are likewise arbitrary.

3.  At some point in this long thread I was convinced to name the view
pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch.
That seems silly...

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Identify missing publications from publisher while create/alter subscription.
Next
From: John Naylor
Date:
Subject: Re: do only critical work during single-user vacuum?