Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id CA+hUKG+wFV2artamBSCQXai56kLea=u7VTd+QudUXFefM3qihg@mail.gmail.com
Whole thread Raw
In response to Re: WIP: WAL prefetch (another approach)  (Andres Freund <andres@anarazel.de>)
Responses Re: WIP: WAL prefetch (another approach)
List pgsql-hackers
Hi,

Thanks for all that feedback.  It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).

It also fixes a couple of other problems I found with the previous version:

1.  While streaming, whenever it hit the end of available data (ie LSN
written by WAL receiver), it would close and then reopen the WAL
segment.  Fixed by the machinery in 0007 which allows for "would
block" as distinct from other errors.

2.  During crash recovery, there were some edge cases where it would
try to read the next WAL segment when there isn't one.  Also fixed by
0007.

3.  It was maxing out at maintenance_io_concurrency - 1 due to a silly
circular buffer fence post bug.

Note that 0006 is just for illustration, it's not proposed for commit.

On Wed, Mar 25, 2020 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
> On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:
> > From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <tmunro@postgresql.org>
> > Date: Mon, 9 Dec 2019 17:22:07 +1300
> > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
> >
> > A later patch will read received WAL to prefetch referenced blocks,
> > without waiting for the data to be flushed to disk.  To do that, it
> > needs to be able to see the write pointer advancing in shared memory.
> >
> > The function formerly bearing name was recently renamed to
> > WalRcvGetFlushRecPtr(), which better described what it does.
>
> Hm. I'm a bit weary of reusing the name with a different meaning. If
> there's any external references, this'll hide that they need to
> adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?

Well, at least external code won't compile due to the change in arguments:

extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart,
TimeLineID *receiveTLI);
extern XLogRecPtr GetWalRcvWriteRecPtr(void);

Anyone who is using that for some kind of data integrity purposes
should hopefully be triggered to investigate, no?  I tried to think of
a better naming scheme but...

> > From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 17 Mar 2020 17:26:41 +1300
> > Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
> >
> > Report whether a prefetch was actually initiated due to a cache miss, so
> > that callers can limit the number of concurrent I/Os they try to issue,
> > without counting the prefetch calls that did nothing because the page
> > was already in our buffers.
> >
> > If the requested block was already cached, return a valid buffer.  This
> > might enable future code to avoid a buffer mapping lookup, though it
> > will need to recheck the buffer before using it because it's not pinned
> > so could be reclaimed at any time.
> >
> > Report neither hit nor miss when a relation's backing file is missing,
> > to prepare for use during recovery.  This will be used to handle cases
> > of relations that are referenced in the WAL but have been unlinked
> > already due to actions covered by WAL records that haven't been replayed
> > yet, after a crash.
>
> We probably should take this into account in nodeBitmapHeapscan.c

Indeed.  The naive version would be something like:

diff --git a/src/backend/executor/nodeBitmapHeapscan.c
b/src/backend/executor/nodeBitmapHeapscan.c
index 726d3a2d9a..3cd644d0ac 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,13 +484,11 @@ BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan)
                                        node->prefetch_iterator = NULL;
                                        break;
                                }
-                               node->prefetch_pages++;

                                /*
                                 * If we expect not to have to
actually read this heap page,
                                 * skip this prefetch call, but
continue to run the prefetch
-                                * logic normally.  (Would it be
better not to increment
-                                * prefetch_pages?)
+                                * logic normally.
                                 *
                                 * This depends on the assumption that
the index AM will
                                 * report the same recheck flag for
this future heap page as
@@ -504,7 +502,13 @@ BitmapPrefetch(BitmapHeapScanState *node,
TableScanDesc scan)

                  &node->pvmbuffer));

                                if (!skip_fetch)
-                                       PrefetchBuffer(scan->rs_rd,
MAIN_FORKNUM, tbmpre->blockno);
+                               {
+                                       PrefetchBufferResult prefetch;
+
+                                       prefetch =
PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+                                       if (prefetch.initiated_io)
+                                               node->prefetch_pages++;
+                               }
                        }
                }

... but that might get arbitrarily far ahead, so it probably needs
some kind of cap, and the parallel version is a bit more complicated.
Something for later, along with more prefetching opportunities.

> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index d30aed6fd9..4ceb40a856 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -469,11 +469,13 @@ static int      ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
> >  /*
> >   * Implementation of PrefetchBuffer() for shared buffers.
> >   */
> > -void
> > +PrefetchBufferResult
> >  PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
> >                                        ForkNumber forkNum,
> >                                        BlockNumber blockNum)
> >  {
> > +     PrefetchBufferResult result = { InvalidBuffer, false };
> > +
> >  #ifdef USE_PREFETCH
> >       BufferTag       newTag;         /* identity of requested block */
> >       uint32          newHash;        /* hash value for newTag */
> > @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
> >
> >       /* If not in buffers, initiate prefetch */
> >       if (buf_id < 0)
> > -             smgrprefetch(smgr_reln, forkNum, blockNum);
> > +     {
> > +             /*
> > +              * Try to initiate an asynchronous read.  This returns false in
> > +              * recovery if the relation file doesn't exist.
> > +              */
> > +             if (smgrprefetch(smgr_reln, forkNum, blockNum))
> > +                     result.initiated_io = true;
> > +     }
> > +     else
> > +     {
> > +             /*
> > +              * Report the buffer it was in at that time.  The caller may be able
> > +              * to avoid a buffer table lookup, but it's not pinned and it must be
> > +              * rechecked!
> > +              */
> > +             result.buffer = buf_id + 1;
>
> Perhaps it'd be better to name this "last_buffer" or such, to make it
> clearer that it may be outdated?

OK.  Renamed to "recent_buffer".

> > -void
> > +PrefetchBufferResult
> >  PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
> >  {
> >  #ifdef USE_PREFETCH
> > @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
> >                                        errmsg("cannot access temporary tables of other sessions")));
> >
> >               /* pass it off to localbuf.c */
> > -             PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> > +             return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> >       }
> >       else
> >       {
> >               /* pass it to the shared buffer version */
> > -             PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> > +             return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> >       }
> > +#else
> > +     PrefetchBuffer result = { InvalidBuffer, false };
> > +
> > +     return result;
> >  #endif                                                       /* USE_PREFETCH */
> >  }
>
> Hm. Now that results are returned indicating whether the buffer is in
> s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?

Yeah.  Done.

> > +/*
> > + * Type returned by PrefetchBuffer().
> > + */
> > +typedef struct PrefetchBufferResult
> > +{
> > +     Buffer          buffer;                 /* If valid, a hit (recheck needed!) */
>
> I assume there's no user of this yet? Even if there's not, I wonder if
> it still is worth adding and referencing a helper to do so correctly?

It *is* used, but only to see if it's valid.  0006 is a not-for-commit
patch to show how you might use it later to read a buffer.  To
actually use this for something like bitmap heap scan, you'd first
need to fix the modularity violations in that code (I mean we have
PrefetchBuffer() in nodeBitmapHeapscan.c, but the corresponding
[ReleaseAnd]ReadBuffer() in heapam.c, and you'd need to get these into
the same module and/or to communicate in some graceful way).

> > From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Wed, 18 Mar 2020 16:35:27 +1300
> > Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
> >
> > Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
> > positive number of bytes, then read ahead in the WAL at most that
> > distance, and initiate asynchronous reading of referenced blocks.  The
> > goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
> > of concurrency asynchronous reads is capped by the existing
> > maintenance_io_concurrency GUC.  The feature is disabled by default.
> >
> > Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
> > Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
> > Discussion:
> > https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
>
> Why is it disabled by default? Just for "risk management"?

Well, it's not free, and might not help you, so not everyone would
want it on.  I think the overheads can be mostly removed with more
work in a later release.  Perhaps we could commit it enabled by
default, and then discuss it before release after looking at some more
data?  On that basis I have now made it default to on, with
max_wal_prefetch_distance = 256kB, if your build has USE_PREFETCH.
Obviously this number can be discussed.

> > +     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
> > +      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
> > +      <indexterm>
> > +       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
> > +      </indexterm>
> > +      </term>
> > +      <listitem>
> > +       <para>
> > +        The maximum distance to look ahead in the WAL during recovery, to find
> > +        blocks to prefetch.  Prefetching blocks that will soon be needed can
> > +        reduce I/O wait times.  The number of concurrent prefetches is limited
> > +        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
> > +        If this value is specified without units, it is taken as bytes.
> > +        The default is -1, meaning that WAL prefetching is disabled.
> > +       </para>
> > +      </listitem>
> > +     </varlistentry>
>
> Is it worth noting that a too large distance could hurt, because the
> buffers might get evicted again?

OK, I tried to explain that.

> > +     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
> > +      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
> > +      <indexterm>
> > +       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
> > +      </indexterm>
> > +      </term>
> > +      <listitem>
> > +       <para>
> > +        Whether to prefetch blocks with full page images during recovery.
> > +        Usually this doesn't help, since such blocks will not be read.  However,
> > +        on file systems with a block size larger than
> > +        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
> > +        read-before-write when a blocks are later written.
> > +        This setting has no effect unless
> > +        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
> > +        The default is off.
> > +       </para>
> > +      </listitem>
> > +     </varlistentry>
>
> Hm. I think this needs more details - it's not clear enough what this
> actually controls. I assume it's about prefetching for WAL records that
> contain the FPW, but it also could be read to be about not prefetching
> any pages that had FPWs before, or such?

Ok, I have elaborated.

> >       </variablelist>
> >       </sect2>
> >       <sect2 id="runtime-config-wal-archiving">
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 987580d6df..df4291092b 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
> >        </entry>
> >       </row>
> >
> > +     <row>
> > +
<entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
> > +      <entry>Only one row, showing statistics about blocks prefetched during recovery.
> > +       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
> > +      </entry>
> > +     </row>
> > +
>
> 'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
> have additional prefetching stat tables going forward. Perhaps
> 'pg_stat_prefetch_wal'?

Works for me, though while thinking about this I realised that the
"WAL" part was bothering me.  It sounds like we're prefetching WAL
itself, which would be a different thing.  So I renamed this view to
pg_stat_prefetch_recovery.

Then I renamed the main GUCs that control this thing to:

  max_recovery_prefetch_distance
  recovery_prefetch_fpw

> > +    <row>
> > +     <entry><structfield>distance</structfield></entry>
> > +     <entry><type>integer</type></entry>
> > +     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
> > +    </row>
> > +    <row>
> > +     <entry><structfield>queue_depth</structfield></entry>
> > +     <entry><type>integer</type></entry>
> > +     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
> > +    </row>
> > +   </tbody>
> > +   </tgroup>
> > +  </table>
>
> Is there a way we could have a "historical" version of at least some of
> these? An average queue depth, or such?

Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.

> It'd be useful to somewhere track the time spent initiating prefetch
> requests. Otherwise it's quite hard to evaluate whether the queue is too
> deep (and just blocks in the OS).

I agree that that sounds useful, and I thought about various ways to
do that that involved new views, until I eventually found myself
wondering: why isn't recovery's I/O already tracked via the existing
stats views?  For example, why can't I see blks_read, blks_hit,
blk_read_time etc moving in pg_stat_database due to recovery activity?

I seems like if you made that work first, or created a new view
pgstatio view for that, then you could add prefetching counters and
timing (if track_io_timing is on) to the existing machinery so that
bufmgr.c would automatically capture it, and then not only recovery
but also stuff like bitmap heap scan could also be measured the same
way.

However, time is short, so I'm not attempting to do anything like that
now.  You can measure the posix_fadvise() times with OS facilities in
the meantime.

> I think it'd be good to have a 'reset time' column.

Done, as stats_reset following other examples.

> > +  <para>
> > +   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
> > +   one row.  It is filled with nulls if recovery is not running or WAL
> > +   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
> > +   for more information.  The counters in this view are reset whenever the
> > +   <xref linkend="guc-max-wal-prefetch-distance"/>,
> > +   <xref linkend="guc-wal-prefetch-fpw"/> or
> > +   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
> > +   the server configuration is reloaded.
> > +  </para>
> > +
>
> So pg_stat_reset_shared() cannot be used? If so, why?

Hmm.  OK, I made pg_stat_reset_shared('prefetch_recovery') work.

> It sounds like the counters aren't persisted via the stats system - if
> so, why?

Ok, I made it persist the simple counters by sending to the to stats
collector periodically.  The view still shows data straight out of
shmem though, not out of the stats file.  Now I'm wondering if I
should have the view show it from the stats file, more like other
things, now that I understand that a bit better...  hmm.

> > @@ -7105,6 +7114,31 @@ StartupXLOG(void)
> >                               /* Handle interrupt signals of startup process */
> >                               HandleStartupProcInterrupts();
> >
> > +                             /*
> > +                              * The first time through, or if any relevant settings or the
> > +                              * WAL source changes, we'll restart the prefetching machinery
> > +                              * as appropriate.  This is simpler than trying to handle
> > +                              * various complicated state changes.
> > +                              */
> > +                             if (unlikely(reset_wal_prefetcher))
> > +                             {
> > +                                     /* If we had one already, destroy it. */
> > +                                     if (prefetcher)
> > +                                     {
> > +                                             XLogPrefetcherFree(prefetcher);
> > +                                             prefetcher = NULL;
> > +                                     }
> > +                                     /* If we want one, create it. */
> > +                                     if (max_wal_prefetch_distance > 0)
> > +                                                     prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
> > +
         currentSource == XLOG_FROM_STREAM);
 
> > +                                     reset_wal_prefetcher = false;
> > +                             }
>
> Do we really need all of this code in StartupXLOG() itself? Could it be
> in HandleStartupProcInterrupts() or at least a helper routine called
> here?

It's now done differently, so that StartupXLOG() only has three new
lines: XLogPrefetchBegin() before the loop, XLogPrefetch() in the
loop, and XLogPrefetchEnd() after the loop.

> > +                             /* Peform WAL prefetching, if enabled. */
> > +                             if (prefetcher)
> > +                                     XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
> > +
> >                               /*
> >                                * Pause WAL replay, if requested by a hot-standby session via
> >                                * SetRecoveryPause().
>
> Personally, I'd rather have the if () be in
> XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
> the call bothers you (but I don't think it needs to).

Done.

> > +/*-------------------------------------------------------------------------
> > + *
> > + * xlogprefetcher.c
> > + *           Prefetching support for PostgreSQL write-ahead log manager
> > + *
>
> An architectural overview here would be good.

OK, added.

> > +struct XLogPrefetcher
> > +{
> > +     /* Reader and current reading state. */
> > +     XLogReaderState *reader;
> > +     XLogReadLocalOptions options;
> > +     bool                    have_record;
> > +     bool                    shutdown;
> > +     int                             next_block_id;
> > +
> > +     /* Book-keeping required to avoid accessing non-existing blocks. */
> > +     HTAB               *filter_table;
> > +     dlist_head              filter_queue;
> > +
> > +     /* Book-keeping required to limit concurrent prefetches. */
> > +     XLogRecPtr         *prefetch_queue;
> > +     int                             prefetch_queue_size;
> > +     int                             prefetch_head;
> > +     int                             prefetch_tail;
> > +
> > +     /* Details of last prefetch to skip repeats and seq scans. */
> > +     SMgrRelation    last_reln;
> > +     RelFileNode             last_rnode;
> > +     BlockNumber             last_blkno;
>
> Do you have a comment somewhere explaining why you want to avoid
> seqscans (I assume it's about avoiding regressions in linux, but only
> because I recall chatting with you about it).

I've added a note to the new architectural comments.

> > +/*
> > + * On modern systems this is really just *counter++.  On some older systems
> > + * there might be more to it, due to inability to read and write 64 bit values
> > + * atomically.  The counters will only be written to by one process, and there
> > + * is no ordering requirement, so there's no point in using higher overhead
> > + * pg_atomic_fetch_add_u64().
> > + */
> > +static inline void inc_counter(pg_atomic_uint64 *counter)
> > +{
> > +     pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> > +}
>
> Could be worthwhile to add to the atomics infrastructure itself - on the
> platforms where this needs spinlocks this will lead to two acquisitions,
> rather than one.

Ok, I added pg_atomic_unlocked_add_fetch_XXX().  (Could also be
"fetch_add", I don't care, I don't use the result).

> > +/*
> > + * Create a prefetcher that is ready to begin prefetching blocks referenced by
> > + * WAL that is ahead of the given lsn.
> > + */
> > +XLogPrefetcher *
> > +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
> > +{
> > +     static HASHCTL hash_table_ctl = {
> > +             .keysize = sizeof(RelFileNode),
> > +             .entrysize = sizeof(XLogPrefetcherFilter)
> > +     };
> > +     XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
> > +
> > +     prefetcher->options.nowait = true;
> > +     if (streaming)
> > +     {
> > +             /*
> > +              * We're only allowed to read as far as the WAL receiver has written.
> > +              * We don't have to wait for it to be flushed, though, as recovery
> > +              * does, so that gives us a chance to get a bit further ahead.
> > +              */
> > +             prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
> > +     }
> > +     else
> > +     {
> > +             /* We're allowed to read as far as we can. */
> > +             prefetcher->options.read_upto_policy = XLRO_LSN;
> > +             prefetcher->options.lsn = (XLogRecPtr) -1;
> > +     }
> > +     prefetcher->reader = XLogReaderAllocate(wal_segment_size,
> > +                                                                                     NULL,
> > +                                                                                     read_local_xlog_page,
> > +                                                                                     &prefetcher->options);
> > +     prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
> > +                                                                                &hash_table_ctl,
> > +                                                                                HASH_ELEM | HASH_BLOBS);
> > +     dlist_init(&prefetcher->filter_queue);
> > +
> > +     /*
> > +      * The size of the queue is based on the maintenance_io_concurrency
> > +      * setting.  In theory we might have a separate queue for each tablespace,
> > +      * but it's not clear how that should work, so for now we'll just use the
> > +      * general GUC to rate-limit all prefetching.
> > +      */
> > +     prefetcher->prefetch_queue_size = maintenance_io_concurrency;
> > +     prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
> > +     prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
> > +
> > +     /* Prepare to read at the given LSN. */
> > +     ereport(LOG,
> > +                     (errmsg("WAL prefetch started at %X/%X",
> > +                                     (uint32) (lsn << 32), (uint32) lsn)));
> > +     XLogBeginRead(prefetcher->reader, lsn);
> > +
> > +     XLogPrefetcherResetMonitoringStats();
> > +
> > +     return prefetcher;
> > +}
> > +
> > +/*
> > + * Destroy a prefetcher and release all resources.
> > + */
> > +void
> > +XLogPrefetcherFree(XLogPrefetcher *prefetcher)
> > +{
> > +     double          avg_distance = 0;
> > +     double          avg_queue_depth = 0;
> > +
> > +     /* Log final statistics. */
> > +     if (prefetcher->samples > 0)
> > +     {
> > +             avg_distance = prefetcher->distance_sum / prefetcher->samples;
> > +             avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
> > +     }
> > +     ereport(LOG,
> > +                     (errmsg("WAL prefetch finished at %X/%X; "
> > +                                     "prefetch = " UINT64_FORMAT ", "
> > +                                     "skip_hit = " UINT64_FORMAT ", "
> > +                                     "skip_new = " UINT64_FORMAT ", "
> > +                                     "skip_fpw = " UINT64_FORMAT ", "
> > +                                     "skip_seq = " UINT64_FORMAT ", "
> > +                                     "avg_distance = %f, "
> > +                                     "avg_queue_depth = %f",
> > +                      (uint32) (prefetcher->reader->EndRecPtr << 32),
> > +                      (uint32) (prefetcher->reader->EndRecPtr),
> > +                      pg_atomic_read_u64(&MonitoringStats->prefetch),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_hit),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_new),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_fpw),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_seq),
> > +                      avg_distance,
> > +                      avg_queue_depth)));
> > +     XLogReaderFree(prefetcher->reader);
> > +     hash_destroy(prefetcher->filter_table);
> > +     pfree(prefetcher->prefetch_queue);
> > +     pfree(prefetcher);
> > +
> > +     XLogPrefetcherResetMonitoringStats();
> > +}
>
> It's possibly overkill, but I think it'd be a good idea to do all the
> allocations within a prefetch specific memory context. That makes
> detecting potential leaks or such easier.

I looked into that, but in fact it's already pretty clear how much
memory this thing is using, if you call
MemoryContextStats(TopMemoryContext), because it's almost all in a
named hash table:

TopMemoryContext: 155776 total in 6 blocks; 18552 free (8 chunks); 137224 used
  XLogPrefetcherFilterTable: 16384 total in 2 blocks; 4520 free (3
chunks); 11864 used
  SP-GiST temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
  GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  GIN recovery temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
  Btree recovery temporary context: 8192 total in 1 blocks; 7928 free
(0 chunks); 264 used
  RecoveryLockLists: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
  PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
  MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Pending ops context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  LOCALLOCK hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
  Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
  ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 358208 bytes in 20 blocks; 86832 free (15 chunks); 271376 used

The XLogPrefetcher struct itself is not measured seperately, but I
don't think that's a problem, it's small and there's only ever one at
a time.  It's that XLogPrefetcherFilterTable that is of variable size
(though it's often empty).  While thinking about this, I made
prefetch_queue into a flexible array rather than a pointer to palloc'd
memory, which seemed a bit tidier.

> > +     /* Can we drop any filters yet, due to problem records begin replayed? */
>
> Odd grammar.

Rewritten.

> > +     XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
>
> Hm, why isn't this part of the loop below?

It only needs to run when replaying_lsn has advanced (ie when records
have been replayed).  I hope the new comment makes that clearer.

> > +     /* Main prefetch loop. */
> > +     for (;;)
> > +     {
>
> This kind of looks like a separate process' main loop. The name
> indicates similar. And there's no architecture documentation
> disinclining one from that view...

OK, I have updated the comment.

> The loop body is quite long. I think it should be split into a number of
> helper functions. Perhaps one to ensure a block is read, one to maintain
> stats, and then one to process block references?

I've broken the function up.  It's now:

StartupXLOG()
 -> XLogPrefetch()
     -> XLogPrefetcherReadAhead()
         -> XLogPrefetcherScanRecords()
             -> XLogPrefetcherScanBlocks()

> > +             /*
> > +              * Scan the record for block references.  We might already have been
> > +              * partway through processing this record when we hit maximum I/O
> > +              * concurrency, so start where we left off.
> > +              */
> > +             for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
> > +             {
>
> Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
> 'blockid' or such.

Done.

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: adding partitioned tables to publications
Next
From: Robert Haas
Date:
Subject: Re: Improving connection scalability: GetSnapshotData()