Re: Checkpointer write combining - Mailing list pgsql-hackers
| From | Chao Li |
|---|---|
| Subject | Re: Checkpointer write combining |
| Date | |
| Msg-id | 20E76846-4816-45BC-84F0-D8836D9AFA4C@gmail.com |
| In response to | Re: Checkpointer write combining (Melanie Plageman <melanieplageman@gmail.com>) |
| List | pgsql-hackers |
> On Nov 20, 2025, at 06:12, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
>
My bad. Yes, I missed downloading and applying 0005.
> - Melanie
>
<v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v10-0002-Split-FlushBuffer-into-two-parts.patch><v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v10-0004-Write-combining-for-BAS_BULKWRITE.patch><v10-0005-Add-database-Oid-to-CkptSortItem.patch><v10-0006-Implement-checkpointer-data-write-combining.patch><v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch>
I went through the whole patch set again and have a few more comments:
1 - 0002
```
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+ bool from_ring, IOContext io_context)
+{
+ XLogRecPtr max_lsn = InvalidXLogRecPtr;
+ LWLock *content_lock;
- /* Find smgr relation for buffer */
- if (reln == NULL)
- reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+ /* Set up this victim buffer to be flushed */
+ if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+ return;
```
I believe that when PrepareFlushBuffer(bufdesc, &max_lsn) returns false, we should release content_lock before the early "return". The function comment says the content lock is held on entry and that "the buffer is returned pinned but not locked", so returning without releasing it would leak the lock. Also, looking at the code where CleanVictimBuffer() is called:
```
if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
{
LWLockRelease(content_lock);
UnpinBuffer(buf_hdr);
continue;
}
/* Content lock is released inside CleanVictimBuffer */
CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
The preceding "if" branch releases content_lock before continuing, so I would expect the early-return path in CleanVictimBuffer() to do the same.
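To make the suggestion concrete, here is a minimal sketch of the early-return path I have in mind (a sketch only, not a tested patch; everything after the quoted hunk is elided):
```
static void
CleanVictimBuffer(BufferDesc *bufdesc,
                  bool from_ring, IOContext io_context)
{
    XLogRecPtr  max_lsn = InvalidXLogRecPtr;
    LWLock     *content_lock = BufferDescriptorGetContentLock(bufdesc);

    /* Set up this victim buffer to be flushed */
    if (!PrepareFlushBuffer(bufdesc, &max_lsn))
    {
        /*
         * Nothing to write, but still honor the contract that the buffer
         * is returned pinned and *not* locked.
         */
        LWLockRelease(content_lock);
        return;
    }

    /* ... rest of the function unchanged ... */
}
```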
2 - 0002
```
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+ bool from_ring, IOContext io_context)
```
The function comment says "the content lock must be held exclusively", but in the calling code:
```
content_lock = BufferDescriptorGetContentLock(buf_hdr);
if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
{
/*
* Someone else has locked the buffer, so give it up and loop
* back to get another one.
*/
UnpinBuffer(buf_hdr);
continue;
}
/*
* If using a nondefault strategy, and writing the buffer would
* require a WAL flush, let the strategy decide whether to go
* ahead and write/reuse the buffer or to choose another victim.
* We need the content lock to inspect the page LSN, so this can't
* be done inside StrategyGetBuffer.
*/
if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
{
LWLockRelease(content_lock);
UnpinBuffer(buf_hdr);
continue;
}
/* Content lock is released inside CleanVictimBuffer */
CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
content_lock is acquired with LW_SHARED, not LW_EXCLUSIVE, so either the comment or the caller needs adjusting.
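One way I can imagine making the real requirement explicit (just my sketch, not something in the patch) is to assert the lock at the top of the function and word the header comment to match whichever mode is actually needed:
```
static void
CleanVictimBuffer(BufferDesc *bufdesc,
                  bool from_ring, IOContext io_context)
{
    LWLock     *content_lock = BufferDescriptorGetContentLock(bufdesc);

    /*
     * If a share lock is in fact sufficient for flushing, keep this
     * assertion and say "shared or exclusive" in the header comment;
     * otherwise the caller's LWLockConditionalAcquire() should use
     * LW_EXCLUSIVE instead of LW_SHARED.
     */
    Assert(LWLockHeldByMe(content_lock));

    /* ... */
}
```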
3 - 0003
In CleanVictimBuffer(), more logic is added, but the content_lock leak described in comment 1 is still there.
4 - 0003
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+ uint32 buf_state = LockBufHdr(bufdesc);
+
+ *lsn = BufferGetLSN(bufdesc);
+
+ UnlockBufHdr(bufdesc);
+
+ /*
+ * See buffer flushing code for more details on why we condition this on
+ * the relation being logged.
+ */
+ return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
```
This is a new function. I am wondering whether we should only update "lsn" when BM_PERMANENT is set, because in the existing code:
```
if (buf_state & BM_PERMANENT)
XLogFlush(recptr);
```
XLogFlush() is only called when BM_PERMANENT is set.
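In other words, something like the following is what I have in mind (a sketch based only on the hunk quoted above, mirroring the existing BM_PERMANENT check):
```
static bool
BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
{
    uint32      buf_state = LockBufHdr(bufdesc);
    XLogRecPtr  recptr = BufferGetLSN(bufdesc);

    UnlockBufHdr(bufdesc);

    /*
     * Unlogged/temp relations never need a WAL flush before the buffer
     * can be written out, so leave *lsn untouched for them.
     */
    if (!(buf_state & BM_PERMANENT))
        return false;

    *lsn = recptr;
    return XLogNeedsFlush(*lsn);
}
```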
5 - 0004 - I am wondering whether there could be a race condition here.
PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds each page's content_lock temporarily, not across the entire run.
So the following interleaving seems possible:
• Page A is pinned, its content lock is acquired, its LSN is read, and then the content lock is released
• Another backend modifies Page A, sets a new LSN, and dirties the buffer
• Page A is written by this batch using an outdated checksum / outdated page contents
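If that is a real window, one mitigation I could imagine (purely a sketch on my side; BatchWriteOnePage() and the private copy buffer are my assumptions, not the patch's design, and BufHdrGetBlock() is the bufmgr.c-internal macro) is to compute each page's checksum into a copy while the shared content lock is held, so the checksum always matches the bytes handed to smgrwrite():
```
static void
BatchWriteOnePage(BufferDesc *bufdesc, SMgrRelation reln, char *copybuf)
{
    LWLock     *content_lock = BufferDescriptorGetContentLock(bufdesc);
    Block       block;

    LWLockAcquire(content_lock, LW_SHARED);
    block = BufHdrGetBlock(bufdesc);

    /* Copy and checksum under the lock, so the written image is consistent. */
    memcpy(copybuf, block, BLCKSZ);
    PageSetChecksumInplace((Page) copybuf, bufdesc->tag.blockNum);

    LWLockRelease(content_lock);

    smgrwrite(reln, BufTagGetForkNum(&bufdesc->tag),
              bufdesc->tag.blockNum, copybuf, false);
}
```
But the patch may already close this window somewhere I missed (e.g. via the buffer I/O machinery), in which case please ignore this.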
6 - 0006 - Ah, 0006 seems to have resolved comment 4 about BM_PERMANENT.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/