Re: Checkpointer write combining - Mailing list pgsql-hackers

From Chao Li
Subject Re: Checkpointer write combining
Date
Msg-id BF806DCB-D361-4626-9A68-95F50FD18CC8@gmail.com
In response to Re: Checkpointer write combining  (Chao Li <li.evan.chao@gmail.com>)
List pgsql-hackers

> On Feb 25, 2026, at 14:52, Chao Li <li.evan.chao@gmail.com> wrote:
>
>
>
>> On Feb 20, 2026, at 07:41, Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>> On Wed, Jan 14, 2026 at 6:35 PM Melanie Plageman
>> <melanieplageman@gmail.com> wrote:
>>>
>>> Thanks! v13 attached.
>>
>> I've added background writer write combining and then spent some time
>> refactoring and benchmarking this over the last few weeks, and I've
>> realized I won't be able to get it in shape for Postgres 19. I've
>> attached version 14 here with my WIP code so I can come back to it in
>> the next release.
>>
>> 0012 is an invasive refactor of GetVictimBuffer() and all the
>> functions it calls that needs to be split apart into smaller commits
>> and redistributed throughout the other patches in the set. Other than
>> that, there are a number of todos both in the code and from a
>> benchmarking perspective:
>>
>> - Probe for preceding as well as following blocks when looking for
>> blocks to combine
>> - Experiment with different batch sizes
>> - Enable and test bulkread and vacuum write combining
>>
>> I also periodically get this troubling warning: "WARNING:  invalid
>> page in block 2 of relation "base/16384/2608_fsm"; zeroing out page".
>> So clearly there's at least one bug somewhere.
>>
>> FWIW benchmarks showed a large improvement from backend and bgwriter
>> write combining, but I'll provide more detailed output from benchmarks
>> when I post a more reviewable patch set.
>>
>> 0001 and 0002 are completely independent cleanup that should be pushed
>> anyway, so I might do that, but I didn't want to do so without having
>> posted on the list at least once.
>>
>> - Melanie
>>
<v14-0001-Make-FlushUnlockedBuffer-use-its-parameters.patch>
<v14-0002-Make-ScheduleBufferTagForWriteback-static.patch>
<v14-0003-Streamline-buffer-rejection-for-bulkreads-of-unl.patch>
<v14-0004-Allow-PinBuffer-to-skip-increasing-usage-count.patch>
<v14-0005-Eagerly-flush-bulkwrite-strategy-ring.patch>
<v14-0006-Write-combining-for-BAS_BULKWRITE.patch>
<v14-0007-Add-database-Oid-to-CkptSortItem.patch>
<v14-0008-Implement-checkpointer-data-write-combining.patch>
<v14-0009-Remove-SyncOneBuffer-and-refactor-BgBufferSync.patch>
<v14-0010-Normal-backend-write-combining.patch>
<v14-0011-Add-write-combining-for-background-writer.patch>
<v14-0012-WIP-refactor.patch>
>
> A few comments for v14:
>
> 1 - 0001
> ```
> - FlushBuffer(buf, reln, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + FlushBuffer(buf, reln, io_object, io_context);
> ```
>
> This changes the hardcoded IOOBJECT_RELATION to io_object when calling FlushBuffer(). But FlushBuffer() itself still
> uses IOOBJECT_RELATION instead of io_object, so the change will not actually take effect:
> ```
>    pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
>                                              IOOP_WRITE, io_start, 1, BLCKSZ);
> ```
>
> So, an update in FlushBuffer() is also needed.
>
> 2 - 0003 - freelist.c
> ```
> bool
> StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
> {
> + Assert(strategy);
> +
> /* We only do this in bulkread mode */
> if (strategy->btype != BAS_BULKREAD)
> return false;
> @@ -795,8 +800,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
> strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
> return false;
>
> + Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
> + Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
> ```
>
> I don’t quite understand Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
>
> The comment says “the buffer header spinlock must not be held,” but I doubt this Assert actually ensures that. Since
> BM_LOCKED is just a bit in shared state, isn't it possible for a concurrent backend to grab the lock right when we're
> checking it?
>
> If a concurrent process locks the header a split-second before the Assert, we'll get a false-positive crash in a dev
> environment even though the code is fine. If they lock it a split-second after, then the Assert didn't really catch
> anything. It feels like a race condition either way.
>
> I think the only way to truly ensure a spinlock is "not held" is to actually acquire it. If we're just checking the
> bit like this, it's just a snapshot that might be stale by the time the instruction finishes. What was the specific
> worry here: a self-deadlock, or something else?
>
> 3 - 0004
> ```
> +typedef enum BufferUsageCountChange
> +{
> + BUC_ZERO,
> + BUC_MAX_ONE,
> + BUC_ONE,
> +} BufferUsageCountChange;
> ```
>
> Can we add some brief comments explaining each enum item? I find it hard to guess the meanings from the names without
> further reading the code.
>
> 4 - 0005 - bufmgr.c
> ```
> +/*
> + * Prepare bufdesc for eager flushing.
> + *
> + * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
> + * pinned and locked and with BM_IO_IN_PROGRESS set, or NULL if this buffer
> + * does not contain a block that should be flushed.
> + *
> + * If returning a buffer, also return its LSN.
> + */
> +static BufferDesc *
> +PrepareOrRejectEagerFlushBuffer(Buffer bufnum)
> ```
>
> This header comment looks stale.
>
> * It says “and with BM_IO_IN_PROGRESS set”, but I don’t see where BM_IO_IN_PROGRESS is set in this function.
> * It says “also return its LSN”, but I don’t see LSN is returned.
>
> 5 - 0006
> ```
> + /* Start IO on the first buffer */
> + if (!StartBufferIO(buf_hdr, false, false))
> + goto clean;
> ```
>
> This failure branch feels like it leaks the content lock. The buffer was already share-locked via
> BufferLockConditional() earlier, and I don't see the "clean" path unlocking that content lock before returning.
>
> 6 - 0009
> ```
> + FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + UnpinBuffer(bufHdr);
> +
> + ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &bufHdr->tag);
> ```
>
> bufHdr is unlocked and unpinned here, so it might be unsafe to still use bufHdr->tag. I think we can copy bufHdr->tag
> to a local variable before unlocking the buffer.
>
> ~~ My brain is stuck, I will continue 0010 tomorrow ~~
>

7 - 0010
```
- * batch_limit is the largest batch we are allowed to construct given the
- * remaining blocks in the table, the number of available pins, and the
- * current configuration.
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
```

I don’t see a parameter named max_batch_size on FindStrategyFlushAdjacents(); batch_limit is still there.

8 - 0010
```
 static void
+FindFlushAdjacents(BufferDesc *batch_start,
+                   uint32 batch_limit,
+                   BufferWriteBatch *batch)
+{
+    BufferTag    require;        /* requested block's buffer tag */
+    uint32        newHash;        /* hash value for require */
+    LWLock       *newPartitionLock;    /* buffer partition lock for it */
+    int            buf_id;
+
+    /* create a tag so we can lookup the buffers */
+    InitBufferTag(&require, &batch->reln->smgr_rlocator.locator,
+                  batch->forkno, InvalidBlockNumber);
+
+    for (; batch->n < batch_limit; batch->n++)
+    {
+        XLogRecPtr    lsn;
+
+        require.blockNum = batch->start + batch->n;
+
+        Assert(BlockNumberIsValid(require.blockNum));
```

When I first read the Assert, I was confused about how it could assume the block is valid; then I realized that
WriteBatchInit() ensures a safe “limit”. Maybe add a brief comment for the Assert.

9 - 0010
```
+        batch->bufdescs[batch->n] =
+            PrepareOrRejectEagerFlushBuffer(buf_id + 1,
+                                            &require,
+                                            newPartitionLock,
+                                            &lsn);
+        if (lsn > batch->max_lsn)
+            batch->max_lsn = lsn;
+
+        if (batch->bufdescs[batch->n] == NULL)
+            break;
```

I think we can move if (batch->bufdescs[batch->n] == NULL) to before if (lsn > batch->max_lsn).

This is not a correctness issue, because PrepareOrRejectEagerFlushBuffer sets lsn to InvalidXLogRecPtr when it returns
NULL, but doing the NULL check first just feels more reasonable.

10 - 0010
```
+ * max_lsn may be updated if the provided buffer LSN exceeds the current max
+ * LSN.
  */
 static BufferDesc *
 PrepareOrRejectEagerFlushBuffer(Buffer bufnum,
                                 BufferTag *require,
+                                LWLock *buftable_lock,
                                 XLogRecPtr *lsn)
```

The function doesn’t have a parameter named max_lsn; do you mean “lsn”?

11 - 0010
```
+FindFlushAdjacents(BufferDesc *batch_start,
+                   uint32 batch_limit,
+                   BufferWriteBatch *batch)
```

Looks like batch_start is not used at all in this function.

12 - 0011
```
+ * Callers specify if and by how much they want to bump the buffer's usage
+ * count.
```

I don’t get what this comment means.

~~ As 0012 is WIP, I didn't review it. ~~

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
