Thanks for continuing to review this!
On Wed, Nov 19, 2025 at 9:32 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> +static void
> +CleanVictimBuffer(BufferDesc *bufdesc,
> +                  bool from_ring, IOContext io_context)
> +{
> +    XLogRecPtr  max_lsn = InvalidXLogRecPtr;
> +    LWLock     *content_lock;
>
> -    /* Find smgr relation for buffer */
> -    if (reln == NULL)
> -        reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
> +    /* Set up this victim buffer to be flushed */
> +    if (!PrepareFlushBuffer(bufdesc, &max_lsn))
> +        return;
>
> I believe that when PrepareFlushBuffer(bufdesc, &max_lsn) returns false, we should release “content_lock” before the “return”.
Oh wow, you are so right. That's a big mistake! I would have thought a
test would fail, but it seems we don't have coverage of this. I've
fixed the code in the attached v11. I'll see how difficult it is to
write a test that covers the case where we try to do IO but someone
else has already done it.
> + * Buffer must be pinned, the content lock must be held exclusively, and the
> + * buffer header spinlock must not be held. The exclusive lock is released and
> + * the buffer is returned pinned but not locked.
>
> The function comment says "the content lock must be held exclusively", but in the code the
> content lock is acquired with LW_SHARED.
Thanks, I've updated the comment.
> +static bool
> +BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
> +{
> +    uint32      buf_state = LockBufHdr(bufdesc);
> +
> +    *lsn = BufferGetLSN(bufdesc);
> +
> +    UnlockBufHdr(bufdesc);
> +
> +    /*
> +     * See buffer flushing code for more details on why we condition this on
> +     * the relation being logged.
> +     */
> +    return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
> +}
>
> This is a new function. I am wondering whether we should only update “lsn” when BM_PERMANENT is set?
That makes sense. I don't use the lsn unless the buffer is logged
(BM_PERMANENT), but I think it is weird for the function to set the
LSN if it is returning false. I've modified the function to set it to
InvalidXLogRecPtr when it would return false. I've also added an
assert before XLogFlush() to make sure the buffer is logged before
flushing the WAL.
> 5 - 0004 - I am wondering whether this could be a race condition?
>
> PageSetBatchChecksumInplace() is called once after all the pages were pinned earlier, but other backends may modify the
page contents while the batch is being assembled, because batching only holds the content_lock per page temporarily, not
across the entire run.
> So that:
> • Page A pinned + content lock acquired + LSN read → content lock released
> • Another backend modifies Page A and sets new LSN, dirties buffer
> • Page A is written by this batch using an outdated checksum / outdated page contents
So, there is a race condition but it is slightly different from the
scenario you mention. We actually hold the content lock until after
doing the write. That means someone else can't get an exclusive lock
and modify the tuple data in the buffer. However, queries can set hint
bits while only holding a share lock. That can update the LSN (if an
FPI is required), which would cause bad things to happen when we write
out a buffer with an outdated checksum. What we do in master is make a
copy of the buffer, checksum it, and write out the copied buffer (see
PageSetChecksumCopy() and its function comment).
I have an XXX in PageSetBatchChecksumInplace() and in the commit
message for this patch explaining that it isn't committable until
Andres' ongoing work adding a new lock type for setting hint bits is
committed. So, this specific part of this patch is WIP.
- Melanie