Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Msg-id: 20231104004348.zhxzk2p266i3r2z2@awork3.anarazel.de
In response to: Re: Improve WALRead() to suck data directly from WAL buffers when possible (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List: pgsql-hackers
Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:10:51 +0000
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
>  src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
>  src/include/access/xlog.h         |   1 +
>  2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
>  	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
>  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
>
> +	/*
> +	 * Start address of oldest initialized page in XLog buffers.
> +	 *
> +	 * We mainly track oldest initialized page explicitly to quickly tell if a
> +	 * given WAL record is available in XLog buffers. It also can be used for
> +	 * other purposes, see notes below.
> +	 *
> +	 * OldestInitializedPage gives XLog buffers following properties:
> +	 *
> +	 * 1) At any given point of time, pages in XLog buffers array are sorted
> +	 * in an ascending order from OldestInitializedPage till InitializedUpTo.
> +	 * Note that we verify this property for assert-only builds, see
> +	 * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).

> +	 * 2) OldestInitializedPage is monotonically increasing (by virtue of how
> +	 * postgres generates WAL records), that is, its value never decreases.
> +	 * This property lets someone read its value without a lock. There's no
> +	 * problem even if its value is slightly stale i.e. concurrently being
> +	 * updated. One can still use it for finding if a given WAL record is
> +	 * available in XLog buffers. At worst, one might get false positives
> +	 * (i.e. OldestInitializedPage may tell that the WAL record is available
> +	 * in XLog buffers, but when one actually looks at it, it isn't really
> +	 * available). This is more efficient and performant than acquiring a lock
> +	 * for reading. Note that we may not need a lock to read
> +	 * OldestInitializedPage but we need to update it holding
> +	 * WALBufMappingLock.

I'd s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

  To update OldestInitializedPage, WALBufMappingLock needs to be held
  exclusively, for reading no lock is required.

> +	 *
> +	 * 3) One can start traversing XLog buffers from OldestInitializedPage
> +	 * till InitializedUpTo to list out all valid WAL records and stats, and
> +	 * expose them via SQL-callable functions to users.
> +	 *
> +	 * 4) XLog buffers array is inherently organized as a circular, sorted and
> +	 * rotated array with OldestInitializedPage as pivot with the property
> +	 * where LSN of previous buffer page (if valid) is greater than
> +	 * OldestInitializedPage and LSN of next buffer page (if valid) is greater
> +	 * than OldestInitializedPage.
> +	 */
> +	XLogRecPtr	OldestInitializedPage;

It seems a bit odd to name a LSN containing variable *Page...

>  	/*
>  	 * InsertTimeLineID is the timeline into which new WAL is being inserted
>  	 * and flushed. It is zero during recovery, and does not change once set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
>  #define NextBufIdx(idx)		\
>  	(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx)		\
> +	(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inlines and adding assertions
that idx is not bigger than XLogCtl->XLogCacheBlck?
>  /*
>   * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
>   * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
>  static void WALInsertLockRelease(void);
>  static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
>  /*
>   * Insert an XLOG record represented by an already-constructed chain of data
>   * chunks. This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
>  		XLogCtl->InitializedUpTo = NewPageEndPtr;
>
>  		npages++;
> +
> +		/*
> +		 * Try updating oldest initialized XLog buffer page.
> +		 *
> +		 * Update it if we are initializing an XLog buffer page for the first
> +		 * time or if XLog buffers are full and we are wrapping around.
> +		 */
> +		if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
> +			XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
> +		{
> +			Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
> +
> +			XLogCtl->OldestInitializedPage = NewPageBeginPtr;
> +		}

Wait, isn't this too late? At this point the buffer can already be used
by GetXLogBuffer(). I think this needs to happen at the latest just
before

  *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

Why is it legal to get here with XLogCtl->OldestInitializedPage being
invalid?

> +
> +/*
> + * Returns whether or not a given WAL record is available in XLog buffers.
> + *
> + * Note that we don't read OldestInitializedPage under a lock, see description
> + * near its definition in xlog.c for more details.
> + *
> + * Note that caller needs to pass in an LSN known to the server, not a future
> + * or unwritten or unflushed LSN.
> + */
> +bool
> +IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
> +{
> +	if (!XLogRecPtrIsInvalid(lsn) &&
> +		!XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
> +		lsn >= XLogCtl->OldestInitializedPage &&
> +		lsn < XLogCtl->InitializedUpTo)
> +	{
> +		return true;
> +	}
> +
> +	return false;
> +}
> diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
> index a14126d164..35235010e6 100644
> --- a/src/include/access/xlog.h
> +++ b/src/include/access/xlog.h
> @@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
>  extern void SetInstallXLogFileSegmentActive(void);
>  extern bool IsInstallXLogFileSegmentActive(void);
>  extern void XLogShutdownWalRcv(void);
> +extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
>
>  /*
>   * Routines to start, stop, and get status of a base backup.
> --
> 2.34.1
>
> From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:36:11 +0000
> Subject: [PATCH v14] Allow WAL reading from WAL buffers
>
> This commit adds WALRead() the capability to read WAL from WAL
> buffers when possible. When requested WAL isn't available in WAL
> buffers, the WAL is read from the WAL file as usual. It relies on
> WALBufMappingLock so that no one replaces the WAL buffer page that
> we're reading from. It skips reading from WAL buffers if
> WALBufMappingLock can't be acquired immediately. In other words,
> it doesn't wait for WALBufMappingLock to be available. This helps
> reduce the contention on WALBufMappingLock.
>
> This commit benefits the callers of WALRead(), that are walsenders
> and pg_walinspect. They can now avoid reading WAL from the WAL
> file (possibly avoiding disk IO). Tests show that the WAL buffers
> hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
> standby, with pgbench --scale=300 --client=32 --time=900.
> In other words, the walsenders avoided 95% of the time reading from the
> file/avoided pread system calls:
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
>
> This commit also benefits when direct IO is enabled for WAL.
> Reading WAL from WAL buffers puts back the performance close to
> that of without direct IO for WAL:
> https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
>
> This commit also paves the way for the following features in
> future:
> - Improves synchronous replication performance by replicating
>   directly from WAL buffers.
> - A opt-in way for the walreceivers to receive unflushed WAL.
>   More details here:
>   https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de
>
> Author: Bharath Rupireddy
> Reviewed-by: Dilip Kumar, Andres Freund
> Reviewed-by: Nathan Bossart, Kuntal Ghosh
> Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> ---
>  src/backend/access/transam/xlog.c       | 205 ++++++++++++++++++++++++
>  src/backend/access/transam/xlogreader.c |  41 ++++-
>  src/include/access/xlog.h               |   6 +
>  3 files changed, 250 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index fdf2ef310b..ff5dccaaa7 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
>  	return cachedPos + ptr % XLOG_BLCKSZ;
>  }
>
> +/*
> + * Read WAL from WAL buffers.
> + *
> + * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
> + * 'startptr', on timeline 'tli' and return total read bytes.
> + *
> + * This function returns quickly in the following cases:
> + * - When passed-in timeline is different than server's current insertion
> + * timeline as WAL is always inserted into WAL buffers on insertion timeline.
> + * - When server is in recovery as WAL buffers aren't currently used in
> + * recovery.
> + *
> + * Note that this function reads as much as it can from WAL buffers, meaning,
> + * it may not read all the requested 'count' bytes. Caller must be aware of
> + * this and deal with it.
> + *
> + * Note that this function is not available for frontend code as WAL buffers is
> + * an internal mechanism to the server.
> + */
> +Size
> +XLogReadFromBuffers(XLogReaderState *state,
> +					XLogRecPtr startptr,
> +					TimeLineID tli,
> +					Size count,
> +					char *buf)
> +{
> +	XLogRecPtr	ptr;
> +	XLogRecPtr	cur_lsn;
> +	Size		nbytes;
> +	Size		ntotal;
> +	Size		nbatch;
> +	char	   *batchstart;
> +
> +	if (RecoveryInProgress())
> +		return 0;
> +
> +	if (tli != GetWALInsertionTimeLine())
> +		return 0;
> +
> +	Assert(!XLogRecPtrIsInvalid(startptr));
> +
> +	cur_lsn = GetFlushRecPtr(NULL);
> +	if (unlikely(startptr > cur_lsn))
> +		elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> +			 LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be
legitimate to read the WAL before it was fully flushed.

> +	/*
> +	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
> +	 * while we are reading it. We try to acquire it in shared mode so that
> +	 * the concurrent WAL readers are also allowed. We try to do as less work
> +	 * as possible while holding the lock to not impact concurrent WAL writers
> +	 * much. We quickly exit to not cause any contention, if the lock isn't
> +	 * immediately available.
> +	 */
> +	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> +		return 0;

That seems problematic - that lock is often heavily contended. We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading
the page, then read the page contents *without* holding a lock, and
then check IsWALRecordAvailableInXLogBuffers() again - if the page was
replaced in the interim we read bogus data, but that's a bit of a
wasted effort.

> +	ptr = startptr;
> +	nbytes = count;				/* Total bytes requested to be read by caller. */
> +	ntotal = 0;					/* Total bytes read. */
> +	nbatch = 0;					/* Bytes to be read in single batch. */
> +	batchstart = NULL;			/* Location to read from for single batch. */

What does "batch" mean?

> +	while (nbytes > 0)
> +	{
> +		XLogRecPtr	expectedEndPtr;
> +		XLogRecPtr	endptr;
> +		int			idx;
> +		char	   *page;
> +		char	   *data;
> +		XLogPageHeader phdr;
> +
> +		idx = XLogRecPtrToBufIdx(ptr);
> +		expectedEndPtr = ptr;
> +		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
> +		endptr = XLogCtl->xlblocks[idx];
> +
> +		/* Requested WAL isn't available in WAL buffers. */
> +		if (expectedEndPtr != endptr)
> +			break;
> +
> +		/*
> +		 * We found WAL buffer page containing given XLogRecPtr. Get starting
> +		 * address of the page and a pointer to the right location of given
> +		 * XLogRecPtr in that page.
> +		 */
> +		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
> +		data = page + ptr % XLOG_BLCKSZ;
> +
> +		/*
> +		 * The fact that we acquire WALBufMappingLock while reading the WAL
> +		 * buffer page itself guarantees that no one else initializes it or
> +		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> +		 * need to ensure that we are not reading a page that just got
> +		 * initialized. For this, we look at the needed page header.
> +		 */
> +		phdr = (XLogPageHeader) page;
> +
> +		/* Return, if WAL buffer page doesn't look valid. */
> +		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +			  phdr->xlp_tli == tli))
> +			break;

I don't think this code should ever encounter a page where this is not
the case? We particularly shouldn't do so silently, seems that could
hide all kinds of problems.

> +		/*
> +		 * Note that we don't perform all page header checks here to avoid
> +		 * extra work in production builds; callers will anyway do those
> +		 * checks extensively. However, in an assert-enabled build, we perform
> +		 * all the checks here and raise an error if failed.
> +		 */

Why?

> +		/* Count what is wanted, not the whole page. */
> +		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
> +		{
> +			/* All the bytes are in one page. */
> +			nbatch += nbytes;
> +			ntotal += nbytes;
> +			nbytes = 0;
> +		}
> +		else
> +		{
> +			Size		navailable;
> +
> +			/*
> +			 * All the bytes are not in one page. Deduce available bytes on
> +			 * the current page, count them and continue to look for remaining
> +			 * bytes.
> +			 */

s/deduce/deduct/? Perhaps better subtract?

Greetings,

Andres Freund