From Andres Freund
Subject Re: WIP: WAL prefetch (another approach)
Date 2020-03-24 22:31:52
Msg-id 20200324223152.v5qrjmjjo4aukktk@alap3.anarazel.de
In response to Re: WIP: WAL prefetch (another approach)  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi,

On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:
> From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <tmunro@postgresql.org>
> Date: Mon, 9 Dec 2019 17:22:07 +1300
> Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
> 
> A later patch will read received WAL to prefetch referenced blocks,
> without waiting for the data to be flushed to disk.  To do that, it
> needs to be able to see the write pointer advancing in shared memory.
> 
> The function formerly bearing this name was recently renamed to
> WalRcvGetFlushRecPtr(), which better described what it does.

Hm. I'm a bit wary of reusing the name with a different meaning. If
there are any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?


> From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 17 Mar 2020 17:26:41 +1300
> Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
> 
> Report whether a prefetch was actually initiated due to a cache miss, so
> that callers can limit the number of concurrent I/Os they try to issue,
> without counting the prefetch calls that did nothing because the page
> was already in our buffers.
> 
> If the requested block was already cached, return a valid buffer.  This
> might enable future code to avoid a buffer mapping lookup, though it
> will need to recheck the buffer before using it because it's not pinned
> so could be reclaimed at any time.
> 
> Report neither hit nor miss when a relation's backing file is missing,
> to prepare for use during recovery.  This will be used to handle cases
> of relations that are referenced in the WAL but have been unlinked
> already due to actions covered by WAL records that haven't been replayed
> yet, after a crash.

We probably should take this into account in nodeBitmapHeapscan.c


> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index d30aed6fd9..4ceb40a856 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -469,11 +469,13 @@ static int    ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
>  /*
>   * Implementation of PrefetchBuffer() for shared buffers.
>   */
> -void
> +PrefetchBufferResult
>  PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
>                       ForkNumber forkNum,
>                       BlockNumber blockNum)
>  {
> +    PrefetchBufferResult result = { InvalidBuffer, false };
> +
>  #ifdef USE_PREFETCH
>      BufferTag    newTag;        /* identity of requested block */
>      uint32        newHash;    /* hash value for newTag */
> @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
>  
>      /* If not in buffers, initiate prefetch */
>      if (buf_id < 0)
> -        smgrprefetch(smgr_reln, forkNum, blockNum);
> +    {
> +        /*
> +         * Try to initiate an asynchronous read.  This returns false in
> +         * recovery if the relation file doesn't exist.
> +         */
> +        if (smgrprefetch(smgr_reln, forkNum, blockNum))
> +            result.initiated_io = true;
> +    }
> +    else
> +    {
> +        /*
> +         * Report the buffer it was in at that time.  The caller may be able
> +         * to avoid a buffer table lookup, but it's not pinned and it must be
> +         * rechecked!
> +         */
> +        result.buffer = buf_id + 1;

Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?


> -void
> +PrefetchBufferResult
>  PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
>  {
>  #ifdef USE_PREFETCH
> @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
>                       errmsg("cannot access temporary tables of other sessions")));
>  
>          /* pass it off to localbuf.c */
> -        PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> +        return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
>      }
>      else
>      {
>          /* pass it to the shared buffer version */
> -        PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> +        return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
>      }
> +#else
> +    PrefetchBufferResult result = { InvalidBuffer, false };
> +
> +    return result;
>  #endif                            /* USE_PREFETCH */
>  }

Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?
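Roughly what I mean, as a self-contained toy sketch (the buffer table and the
names here are invented for illustration, not the real bufmgr.c code): only
the fadvise call needs to be compiled out, the mapping lookup and the result
can stay accurate either way.

```c
#include <stdbool.h>

typedef int Buffer;
#define InvalidBuffer 0

typedef struct PrefetchBufferResult
{
    Buffer buffer;          /* valid if the block was already cached */
    bool   initiated_io;    /* true if an async read was started */
} PrefetchBufferResult;

/* Toy buffer table for the demo: block n is cached iff n is even. */
static int
lookup_buffer(int blockNum)
{
    return (blockNum % 2 == 0) ? blockNum + 1 : -1;
}

/*
 * Sketch: the cache lookup runs unconditionally; only the prefetch
 * syscall is gated on prefetch support, so result.buffer stays
 * meaningful even when prefetching is compiled out.
 */
static PrefetchBufferResult
prefetch_buffer(int blockNum, bool prefetch_supported)
{
    PrefetchBufferResult result = { InvalidBuffer, false };
    int buf_id = lookup_buffer(blockNum);

    if (buf_id >= 0)
        result.buffer = buf_id;         /* hit: report it either way */
    else if (prefetch_supported)
        result.initiated_io = true;     /* miss: issue the async read */

    return result;
}
```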



> +/*
> + * Type returned by PrefetchBuffer().
> + */
> +typedef struct PrefetchBufferResult
> +{
> +    Buffer        buffer;            /* If valid, a hit (recheck needed!) */

I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper that does the recheck
correctly?
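E.g. something like the following untested sketch (toy types and a
hypothetical helper name; the real thing would have to pin the buffer and
recheck the tag under the buffer header lock rather than just compare tags):

```c
#include <stdbool.h>

typedef struct BufferTag { int rel; int block; } BufferTag;
typedef struct BufferDesc { BufferTag tag; } BufferDesc;

static bool
tags_equal(BufferTag a, BufferTag b)
{
    return a.rel == b.rel && a.block == b.block;
}

/*
 * Hypothetical helper: use the unpinned buffer hint from a
 * PrefetchBufferResult only if it still holds the expected block;
 * otherwise the caller falls back to a fresh buffer table lookup.
 */
static bool
recheck_prefetch_hint(BufferDesc *buffers, int nbuffers,
                      int hinted_buffer, BufferTag expected)
{
    if (hinted_buffer <= 0 || hinted_buffer > nbuffers)
        return false;               /* hint out of range / invalid */
    return tags_equal(buffers[hinted_buffer - 1].tag, expected);
}
```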


> From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Wed, 18 Mar 2020 16:35:27 +1300
> Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
> 
> Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
> positive number of bytes, then read ahead in the WAL at most that
> distance, and initiate asynchronous reading of referenced blocks.  The
> goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
> of concurrent asynchronous reads is capped by the existing
> maintenance_io_concurrency GUC.  The feature is disabled by default.
> 
> Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
> Discussion:
> https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com

Why is it disabled by default? Just for "risk management"?


> +     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
> +      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
> +      <indexterm>
> +       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        The maximum distance to look ahead in the WAL during recovery, to find
> +        blocks to prefetch.  Prefetching blocks that will soon be needed can
> +        reduce I/O wait times.  The number of concurrent prefetches is limited
> +        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
> +        If this value is specified without units, it is taken as bytes.
> +        The default is -1, meaning that WAL prefetching is disabled.
> +       </para>
> +      </listitem>
> +     </varlistentry>

Is it worth noting that too large a distance could hurt, because the
buffers might get evicted again?


> +     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
> +      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
> +      <indexterm>
> +       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Whether to prefetch blocks with full page images during recovery.
> +        Usually this doesn't help, since such blocks will not be read.  However,
> +        on file systems with a block size larger than
> +        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
> +        read-before-write when a block is later written.
> +        This setting has no effect unless
> +        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
> +        The default is off.
> +       </para>
> +      </listitem>
> +     </varlistentry>

Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?


>       </variablelist>
>       </sect2>
>       <sect2 id="runtime-config-wal-archiving">
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 987580d6df..df4291092b 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
>        </entry>
>       </row>
>  
> +     <row>
> +      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
> +      <entry>Only one row, showing statistics about blocks prefetched during recovery.
> +       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
> +      </entry>
> +     </row>
> +

'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?


> +    <row>
> +     <entry><structfield>distance</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
> +    </row>
> +    <row>
> +     <entry><structfield>queue_depth</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
> +    </row>
> +   </tbody>
> +   </tgroup>
> +  </table>

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).

I think it'd be good to have a 'reset time' column.


> +  <para>
> +   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
> +   one row.  It is filled with nulls if recovery is not running or WAL
> +   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
> +   for more information.  The counters in this view are reset whenever the
> +   <xref linkend="guc-max-wal-prefetch-distance"/>,
> +   <xref linkend="guc-wal-prefetch-fpw"/> or
> +   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
> +   the server configuration is reloaded.
> +  </para>
> +

So pg_stat_reset_shared() cannot be used? If so, why?

It sounds like the counters aren't persisted via the stats system - if
so, why?



> @@ -7105,6 +7114,31 @@ StartupXLOG(void)
>                  /* Handle interrupt signals of startup process */
>                  HandleStartupProcInterrupts();
>  
> +                /*
> +                 * The first time through, or if any relevant settings or the
> +                 * WAL source changes, we'll restart the prefetching machinery
> +                 * as appropriate.  This is simpler than trying to handle
> +                 * various complicated state changes.
> +                 */
> +                if (unlikely(reset_wal_prefetcher))
> +                {
> +                    /* If we had one already, destroy it. */
> +                    if (prefetcher)
> +                    {
> +                        XLogPrefetcherFree(prefetcher);
> +                        prefetcher = NULL;
> +                    }
> +                    /* If we want one, create it. */
> +                    if (max_wal_prefetch_distance > 0)
> +                            prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
> +                                                                currentSource == XLOG_FROM_STREAM);
> +                    reset_wal_prefetcher = false;
> +                }

Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?
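I.e. something like this untested sketch (the globals and the allocator are
stubbed out here, and the helper name is made up):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;
typedef struct XLogPrefetcher { XLogRecPtr start_lsn; } XLogPrefetcher;

/* Stubs standing in for the patch's globals and allocator. */
static bool reset_wal_prefetcher = true;
static int  max_wal_prefetch_distance = 1024;

static XLogPrefetcher *
XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
{
    XLogPrefetcher *p = malloc(sizeof(*p));

    (void) streaming;
    p->start_lsn = lsn;
    return p;
}

static void
XLogPrefetcherFree(XLogPrefetcher *p)
{
    free(p);
}

/*
 * Hypothetical helper hoisting the reset dance out of StartupXLOG()'s
 * redo loop: tear down and recreate the prefetcher when settings or
 * the WAL source changed.
 */
static void
XLogPrefetcherMaybeReconfigure(XLogPrefetcher **prefetcher,
                               XLogRecPtr read_ptr, bool streaming)
{
    if (!reset_wal_prefetcher)
        return;
    if (*prefetcher)
    {
        XLogPrefetcherFree(*prefetcher);
        *prefetcher = NULL;
    }
    if (max_wal_prefetch_distance > 0)
        *prefetcher = XLogPrefetcherAllocate(read_ptr, streaming);
    reset_wal_prefetcher = false;
}
```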


> +                /* Perform WAL prefetching, if enabled. */
> +                if (prefetcher)
> +                    XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
> +
>                  /*
>                   * Pause WAL replay, if requested by a hot-standby session via
>                   * SetRecoveryPause().

Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).
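I.e. roughly this, as an untested sketch with the types stubbed out (the
internal-function split is my invention):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef struct XLogPrefetcher XLogPrefetcher;   /* opaque, as in the patch */

static int readahead_calls = 0;     /* instrumentation for the demo */

/* Stub standing in for the real read-ahead work. */
static void
XLogPrefetcherReadAheadInternal(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
{
    (void) prefetcher;
    (void) lsn;
    readahead_calls++;
}

/*
 * Hypothetical inline wrapper: the "is prefetching enabled" test lives
 * here rather than at each call site in StartupXLOG().
 */
static inline void
XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
{
    if (prefetcher == NULL)
        return;                     /* prefetching disabled */
    XLogPrefetcherReadAheadInternal(prefetcher, replaying_lsn);
}
```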


> +/*-------------------------------------------------------------------------
> + *
> + * xlogprefetcher.c
> + *        Prefetching support for PostgreSQL write-ahead log manager
> + *

An architectural overview here would be good.


> +struct XLogPrefetcher
> +{
> +    /* Reader and current reading state. */
> +    XLogReaderState *reader;
> +    XLogReadLocalOptions options;
> +    bool            have_record;
> +    bool            shutdown;
> +    int                next_block_id;
> +
> +    /* Book-keeping required to avoid accessing non-existing blocks. */
> +    HTAB           *filter_table;
> +    dlist_head        filter_queue;
> +
> +    /* Book-keeping required to limit concurrent prefetches. */
> +    XLogRecPtr       *prefetch_queue;
> +    int                prefetch_queue_size;
> +    int                prefetch_head;
> +    int                prefetch_tail;
> +
> +    /* Details of last prefetch to skip repeats and seq scans. */
> +    SMgrRelation    last_reln;
> +    RelFileNode        last_rnode;
> +    BlockNumber        last_blkno;

Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).


> +/*
> + * On modern systems this is really just *counter++.  On some older systems
> + * there might be more to it, due to inability to read and write 64 bit values
> + * atomically.  The counters will only be written to by one process, and there
> + * is no ordering requirement, so there's no point in using higher overhead
> + * pg_atomic_fetch_add_u64().
> + */
> +static inline void inc_counter(pg_atomic_uint64 *counter)
> +{
> +    pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> +}

Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.
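FWIW, on platforms with native 64bit atomics the relaxed load/store pair
compiles down to plain moves. A self-contained illustration using C11
atomics as a stand-in for pg_atomic_* (single-writer only, as the comment
in the patch says):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Single-writer counter bump, modeled on the patch's inc_counter():
 * a relaxed load plus a relaxed store, deliberately not an atomic
 * read-modify-write, since only one process ever writes the counter.
 */
static inline void
unlocked_inc(atomic_uint_fast64_t *counter)
{
    uint_fast64_t v = atomic_load_explicit(counter, memory_order_relaxed);

    atomic_store_explicit(counter, v + 1, memory_order_relaxed);
}
```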


> +/*
> + * Create a prefetcher that is ready to begin prefetching blocks referenced by
> + * WAL that is ahead of the given lsn.
> + */
> +XLogPrefetcher *
> +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
> +{
> +    static HASHCTL hash_table_ctl = {
> +        .keysize = sizeof(RelFileNode),
> +        .entrysize = sizeof(XLogPrefetcherFilter)
> +    };
> +    XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
> +
> +    prefetcher->options.nowait = true;
> +    if (streaming)
> +    {
> +        /*
> +         * We're only allowed to read as far as the WAL receiver has written.
> +         * We don't have to wait for it to be flushed, though, as recovery
> +         * does, so that gives us a chance to get a bit further ahead.
> +         */
> +        prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
> +    }
> +    else
> +    {
> +        /* We're allowed to read as far as we can. */
> +        prefetcher->options.read_upto_policy = XLRO_LSN;
> +        prefetcher->options.lsn = (XLogRecPtr) -1;
> +    }
> +    prefetcher->reader = XLogReaderAllocate(wal_segment_size,
> +                                            NULL,
> +                                            read_local_xlog_page,
> +                                            &prefetcher->options);
> +    prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
> +                                           &hash_table_ctl,
> +                                           HASH_ELEM | HASH_BLOBS);
> +    dlist_init(&prefetcher->filter_queue);
> +
> +    /*
> +     * The size of the queue is based on the maintenance_io_concurrency
> +     * setting.  In theory we might have a separate queue for each tablespace,
> +     * but it's not clear how that should work, so for now we'll just use the
> +     * general GUC to rate-limit all prefetching.
> +     */
> +    prefetcher->prefetch_queue_size = maintenance_io_concurrency;
> +    prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
> +    prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
> +
> +    /* Prepare to read at the given LSN. */
> +    ereport(LOG,
> +            (errmsg("WAL prefetch started at %X/%X",
> +                    (uint32) (lsn >> 32), (uint32) lsn)));
> +    XLogBeginRead(prefetcher->reader, lsn);
> +
> +    XLogPrefetcherResetMonitoringStats();
> +
> +    return prefetcher;
> +}
> +
> +/*
> + * Destroy a prefetcher and release all resources.
> + */
> +void
> +XLogPrefetcherFree(XLogPrefetcher *prefetcher)
> +{
> +    double        avg_distance = 0;
> +    double        avg_queue_depth = 0;
> +
> +    /* Log final statistics. */
> +    if (prefetcher->samples > 0)
> +    {
> +        avg_distance = prefetcher->distance_sum / prefetcher->samples;
> +        avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
> +    }
> +    ereport(LOG,
> +            (errmsg("WAL prefetch finished at %X/%X; "
> +                    "prefetch = " UINT64_FORMAT ", "
> +                    "skip_hit = " UINT64_FORMAT ", "
> +                    "skip_new = " UINT64_FORMAT ", "
> +                    "skip_fpw = " UINT64_FORMAT ", "
> +                    "skip_seq = " UINT64_FORMAT ", "
> +                    "avg_distance = %f, "
> +                    "avg_queue_depth = %f",
> +             (uint32) (prefetcher->reader->EndRecPtr >> 32),
> +             (uint32) (prefetcher->reader->EndRecPtr),
> +             pg_atomic_read_u64(&MonitoringStats->prefetch),
> +             pg_atomic_read_u64(&MonitoringStats->skip_hit),
> +             pg_atomic_read_u64(&MonitoringStats->skip_new),
> +             pg_atomic_read_u64(&MonitoringStats->skip_fpw),
> +             pg_atomic_read_u64(&MonitoringStats->skip_seq),
> +             avg_distance,
> +             avg_queue_depth)));
> +    XLogReaderFree(prefetcher->reader);
> +    hash_destroy(prefetcher->filter_table);
> +    pfree(prefetcher->prefetch_queue);
> +    pfree(prefetcher);
> +
> +    XLogPrefetcherResetMonitoringStats();
> +}

It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.



> +    /* Can we drop any filters yet, due to problem records begin replayed? */

Odd grammar.


> +    XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);

Hm, why isn't this part of the loop below?


> +    /* Main prefetch loop. */
> +    for (;;)
> +    {

This kind of looks like a separate process' main loop, and the name
suggests as much. And there's no architecture documentation
disinclining one from that view...


The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?


> +        /*
> +         * Scan the record for block references.  We might already have been
> +         * partway through processing this record when we hit maximum I/O
> +         * concurrency, so start where we left off.
> +         */
> +        for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
> +        {

Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.

Greetings,

Andres Freund


