Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date
Msg-id 20231104004348.zhxzk2p266i3r2z2@awork3.anarazel.de
Whole thread Raw
In response to Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Improve WALRead() to suck data directly from WAL buffers when possible
List pgsql-hackers
Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:10:51 +0000
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
>  src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
>  src/include/access/xlog.h         |   1 +
>  2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
>      XLogRecPtr *xlblocks;        /* 1st byte ptr-s + XLOG_BLCKSZ */
>      int            XLogCacheBlck;    /* highest allocated xlog buffer index */
>
> +    /*
> +     * Start address of oldest initialized page in XLog buffers.
> +     *
> +     * We mainly track oldest initialized page explicitly to quickly tell if a
> +     * given WAL record is available in XLog buffers. It also can be used for
> +     * other purposes, see notes below.
> +     *
> +     * OldestInitializedPage gives XLog buffers following properties:
> +     *
> +     * 1) At any given point of time, pages in XLog buffers array are sorted
> +     * in an ascending order from OldestInitializedPage till InitializedUpTo.
> +     * Note that we verify this property for assert-only builds, see
> +     * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).


> +     * 2) OldestInitializedPage is monotonically increasing (by virtue of how
> +     * postgres generates WAL records), that is, its value never decreases.
> +     * This property lets someone read its value without a lock. There's no
> +     * problem even if its value is slightly stale i.e. concurrently being
> +     * updated. One can still use it for finding if a given WAL record is
> +     * available in XLog buffers. At worst, one might get false positives
> +     * (i.e. OldestInitializedPage may tell that the WAL record is available
> +     * in XLog buffers, but when one actually looks at it, it isn't really
> +     * available). This is more efficient and performant than acquiring a lock
> +     * for reading. Note that we may not need a lock to read
> +     * OldestInitializedPage but we need to update it holding
> +     * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively, for reading no lock is required.


> +     *
> +     * 3) One can start traversing XLog buffers from OldestInitializedPage
> +     * till InitializedUpTo to list out all valid WAL records and stats, and
> +     * expose them via SQL-callable functions to users.
> +     *
> +     * 4) XLog buffers array is inherently organized as a circular, sorted and
> +     * rotated array with OldestInitializedPage as pivot with the property
> +     * where LSN of previous buffer page (if valid) is greater than
> +     * OldestInitializedPage and LSN of next buffer page (if valid) is greater
> +     * than OldestInitializedPage.
> +     */
> +    XLogRecPtr    OldestInitializedPage;

It seems a bit odd to name a LSN containing variable *Page...


>      /*
>       * InsertTimeLineID is the timeline into which new WAL is being inserted
>       * and flushed. It is zero during recovery, and does not change once set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
>  #define NextBufIdx(idx)        \
>          (((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx)        \
> +        (((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inlines and adding assertions that idx
is not bigger than XLogCtl->XLogCacheBlck?


>  /*
>   * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
>   * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
>  static void WALInsertLockRelease(void);
>  static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
>  /*
>   * Insert an XLOG record represented by an already-constructed chain of data
>   * chunks.  This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
>          XLogCtl->InitializedUpTo = NewPageEndPtr;
>
>          npages++;
> +
> +        /*
> +         * Try updating oldest initialized XLog buffer page.
> +         *
> +         * Update it if we are initializing an XLog buffer page for the first
> +         * time or if XLog buffers are full and we are wrapping around.
> +         */
> +        if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
> +            XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
> +        {
> +            Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
> +
> +            XLogCtl->OldestInitializedPage = NewPageBeginPtr;
> +        }

Wait, isn't this too late?  At this point the buffer can already be used by
GetXLogBuffers().  I think thi sneeds to happen at the latest just before
        *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;


Why is it legal to get here with XLogCtl->OldestInitializedPage being invalid?


> +
> +/*
> + * Returns whether or not a given WAL record is available in XLog buffers.
> + *
> + * Note that we don't read OldestInitializedPage under a lock, see description
> + * near its definition in xlog.c for more details.
> + *
> + * Note that caller needs to pass in an LSN known to the server, not a future
> + * or unwritten or unflushed LSN.
> + */
> +bool
> +IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
> +{
> +    if (!XLogRecPtrIsInvalid(lsn) &&
> +        !XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
> +        lsn >= XLogCtl->OldestInitializedPage &&
> +        lsn < XLogCtl->InitializedUpTo)
> +    {
> +        return true;
> +    }
> +
> +    return false;
> +}
> diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
> index a14126d164..35235010e6 100644
> --- a/src/include/access/xlog.h
> +++ b/src/include/access/xlog.h
> @@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
>  extern void SetInstallXLogFileSegmentActive(void);
>  extern bool IsInstallXLogFileSegmentActive(void);
>  extern void XLogShutdownWalRcv(void);
> +extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
>
>  /*
>   * Routines to start, stop, and get status of a base backup.
> --
> 2.34.1
>

> From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:36:11 +0000
> Subject: [PATCH v14] Allow WAL reading from WAL buffers
>
> This commit adds WALRead() the capability to read WAL from WAL
> buffers when possible. When requested WAL isn't available in WAL
> buffers, the WAL is read from the WAL file as usual. It relies on
> WALBufMappingLock so that no one replaces the WAL buffer page that
> we're reading from. It skips reading from WAL buffers if
> WALBufMappingLock can't be acquired immediately. In other words,
> it doesn't wait for WALBufMappingLock to be available. This helps
> reduce the contention on WALBufMappingLock.
>
> This commit benefits the callers of WALRead(), that are walsenders
> and pg_walinspect. They can now avoid reading WAL from the WAL
> file (possibly avoiding disk IO). Tests show that the WAL buffers
> hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
> standby, with pgbench --scale=300 --client=32 --time=900. In other
> words, the walsenders avoided 95% of the time reading from the
> file/avoided pread system calls:
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
>
> This commit also benefits when direct IO is enabled for WAL.
> Reading WAL from WAL buffers puts back the performance close to
> that of without direct IO for WAL:
> https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
>
> This commit also paves the way for the following features in
> future:
> - Improves synchronous replication performance by replicating
> directly from WAL buffers.
> - A opt-in way for the walreceivers to receive unflushed WAL.
> More details here:
> https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de
>
> Author: Bharath Rupireddy
> Reviewed-by: Dilip Kumar, Andres Freund
> Reviewed-by: Nathan Bossart, Kuntal Ghosh
> Discussion:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> ---
>  src/backend/access/transam/xlog.c       | 205 ++++++++++++++++++++++++
>  src/backend/access/transam/xlogreader.c |  41 ++++-
>  src/include/access/xlog.h               |   6 +
>  3 files changed, 250 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index fdf2ef310b..ff5dccaaa7 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
>      return cachedPos + ptr % XLOG_BLCKSZ;
>  }

>

> +/*
> + * Read WAL from WAL buffers.
> + *
> + * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
> + * 'startptr', on timeline 'tli' and return total read bytes.
> + *
> + * This function returns quickly in the following cases:
> + * - When passed-in timeline is different than server's current insertion
> + * timeline as WAL is always inserted into WAL buffers on insertion timeline.
> + * - When server is in recovery as WAL buffers aren't currently used in
> + * recovery.
> + *
> + * Note that this function reads as much as it can from WAL buffers, meaning,
> + * it may not read all the requested 'count' bytes. Caller must be aware of
> + * this and deal with it.
> + *
> + * Note that this function is not available for frontend code as WAL buffers is
> + * an internal mechanism to the server.
>
> + */
> +Size
> +XLogReadFromBuffers(XLogReaderState *state,
> +                    XLogRecPtr startptr,
> +                    TimeLineID tli,
> +                    Size count,
> +                    char *buf)
> +{
> +    XLogRecPtr    ptr;
> +    XLogRecPtr    cur_lsn;
> +    Size        nbytes;
> +    Size        ntotal;
> +    Size        nbatch;
> +    char       *batchstart;
> +
> +    if (RecoveryInProgress())
> +        return 0;
>
> +    if (tli != GetWALInsertionTimeLine())
> +        return 0;
> +
> +    Assert(!XLogRecPtrIsInvalid(startptr));
> +
> +    cur_lsn = GetFlushRecPtr(NULL);
> +    if (unlikely(startptr > cur_lsn))
> +        elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database
systemWAL LSN %X/%X",
 
> +             LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.



> +    /*
> +     * Holding WALBufMappingLock ensures inserters don't overwrite this value
> +     * while we are reading it. We try to acquire it in shared mode so that
> +     * the concurrent WAL readers are also allowed. We try to do as less work
> +     * as possible while holding the lock to not impact concurrent WAL writers
> +     * much. We quickly exit to not cause any contention, if the lock isn't
> +     * immediately available.
> +     */
> +    if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> +        return 0;

That seems problematic - that lock is often heavily contended.  We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we read bogus data, but that's a bit of a wasted effort.


> +    ptr = startptr;
> +    nbytes = count;                /* Total bytes requested to be read by caller. */
> +    ntotal = 0;                    /* Total bytes read. */
> +    nbatch = 0;                    /* Bytes to be read in single batch. */
> +    batchstart = NULL;            /* Location to read from for single batch. */

What does "batch" mean?


> +    while (nbytes > 0)
> +    {
> +        XLogRecPtr    expectedEndPtr;
> +        XLogRecPtr    endptr;
> +        int            idx;
> +        char       *page;
> +        char       *data;
> +        XLogPageHeader phdr;
> +
> +        idx = XLogRecPtrToBufIdx(ptr);
> +        expectedEndPtr = ptr;
> +        expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
> +        endptr = XLogCtl->xlblocks[idx];
> +
> +        /* Requested WAL isn't available in WAL buffers. */
> +        if (expectedEndPtr != endptr)
> +            break;
> +
> +        /*
> +         * We found WAL buffer page containing given XLogRecPtr. Get starting
> +         * address of the page and a pointer to the right location of given
> +         * XLogRecPtr in that page.
> +         */
> +        page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
> +        data = page + ptr % XLOG_BLCKSZ;
> +
> +        /*
> +         * The fact that we acquire WALBufMappingLock while reading the WAL
> +         * buffer page itself guarantees that no one else initializes it or
> +         * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> +         * need to ensure that we are not reading a page that just got
> +         * initialized. For this, we look at the needed page header.
> +         */
> +        phdr = (XLogPageHeader) page;
> +
> +        /* Return, if WAL buffer page doesn't look valid. */
> +        if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +              phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +              phdr->xlp_tli == tli))
> +            break;

I don't think this code should ever encounter a page where this is not the
case?  We particularly shouldn't do so silently, seems that could hide all
kinds of problems.



> +        /*
> +         * Note that we don't perform all page header checks here to avoid
> +         * extra work in production builds; callers will anyway do those
> +         * checks extensively. However, in an assert-enabled build, we perform
> +         * all the checks here and raise an error if failed.
> +         */

Why?


> +        /* Count what is wanted, not the whole page. */
> +        if ((data + nbytes) <= (page + XLOG_BLCKSZ))
> +        {
> +            /* All the bytes are in one page. */
> +            nbatch += nbytes;
> +            ntotal += nbytes;
> +            nbytes = 0;
> +        }
> +        else
> +        {
> +            Size        navailable;
> +
> +            /*
> +             * All the bytes are not in one page. Deduce available bytes on
> +             * the current page, count them and continue to look for remaining
> +             * bytes.
> +             */
s/deducate/deduct/? Perhaps better subtract?


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Next
From: Jeff Davis
Date:
Subject: Re: Inconsistent use of "volatile" when accessing shared memory?