Re: Checkpointer write combining - Mailing list pgsql-hackers

From Chao Li
Subject Re: Checkpointer write combining
Date
Msg-id 20E76846-4816-45BC-84F0-D8836D9AFA4C@gmail.com
Whole thread Raw
In response to Re: Checkpointer write combining  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers

> On Nov 20, 2025, at 06:12, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
>

My bad. Yes, I missed to download and apply 0005.

> - Melanie
>
<v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v10-0002-Split-FlushBuffer-into-two-parts.patch><v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v10-0004-Write-combining-for-BAS_BULKWRITE.patch><v10-0005-Add-database-Oid-to-CkptSortItem.patch><v10-0006-Implement-checkpointer-data-write-combining.patch><v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch>

I went through the whole patch set again, and got a few more comments:

1 - 0002
```
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+                  bool from_ring, IOContext io_context)
+{
+    XLogRecPtr    max_lsn = InvalidXLogRecPtr;
+    LWLock       *content_lock;

-    /* Find smgr relation for buffer */
-    if (reln == NULL)
-        reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+    /* Set up this victim buffer to be flushed */
+    if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+        return;
```

I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.

Because the function comment clearly says “the content lock must be held exclusively”. Also, looking at the code where
CleanVictimBuffer()is called: 
```
            if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
            {
                LWLockRelease(content_lock);
                UnpinBuffer(buf_hdr);
                continue;
            }

            /* Content lock is released inside CleanVictimBuffer */
            CleanVictimBuffer(buf_hdr, from_ring, io_context);
```

In the previous “if” clause, it releases content_lock.

2 - 0002
```
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+                  bool from_ring, IOContext io_context)
```

The function comment says "the content lock must be held exclusively”, but from the code:
```
            content_lock = BufferDescriptorGetContentLock(buf_hdr);
            if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
            {
                /*
                 * Someone else has locked the buffer, so give it up and loop
                 * back to get another one.
                 */
                UnpinBuffer(buf_hdr);
                continue;
            }

            /*
             * If using a nondefault strategy, and writing the buffer would
             * require a WAL flush, let the strategy decide whether to go
             * ahead and write/reuse the buffer or to choose another victim.
             * We need the content lock to inspect the page LSN, so this can't
             * be done inside StrategyGetBuffer.
             */
            if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
            {
                LWLockRelease(content_lock);
                UnpinBuffer(buf_hdr);
                continue;
            }

            /* Content lock is released inside CleanVictimBuffer */
            CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
Content_lock is acquired with LW_SHARED.

3 - 0003

In CleanVictimBuffer(), more logic are added, but the content_lock leak problem is still there.

4 - 0003
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+    uint32        buf_state = LockBufHdr(bufdesc);
+
+    *lsn = BufferGetLSN(bufdesc);
+
+    UnlockBufHdr(bufdesc);
+
+    /*
+     * See buffer flushing code for more details on why we condition this on
+     * the relation being logged.
+     */
+    return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
```

This is a new function. I am thinking that if we should only update “lsn” when not BM_PERMANENT? Because from the
existingcode: 
```
    if (buf_state & BM_PERMANENT)
        XLogFlush(recptr);
```

XLogFlush should only happen when BM_PERMANENT.

5 - 0004 - I am thinking if that could be a race condition?

PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the
pagecontents while the batch is being assembled, because batching only holds content_lock per page temporarily, not
acrossthe entire run. 
So that:
    • Page A pinned + content lock acquired + LSN read → content lock released
    • Another backend modifies Page A and sets new LSN, dirties buffer
    • Page A is written by this batch using an outdated checksum / outdated page contents

6 - 0006 - Ah, 0006 seems to have solved comment 3 about BM_PERMANENT.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Andreas Karlsson
Date:
Subject: Re: pg_utility ?
Next
From: Andres Freund
Date:
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)