Re: Checkpointer write combining - Mailing list pgsql-hackers

From Chao Li
Subject Re: Checkpointer write combining
Date
Msg-id 377BC880-1616-4DEF-B9EF-5E297C358F7D@gmail.com
Whole thread Raw
In response to Re: Checkpointer write combining  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Checkpointer write combining
List pgsql-hackers

> On Feb 20, 2026, at 07:41, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Wed, Jan 14, 2026 at 6:35 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>>
>> Thanks! v13 attached.
>
> I've added background writer write combining and then spent some time
> refactoring and benchmarking this over the last few weeks, and I've
> realized I won't be able to get it in shape for Postgres 19. I've
> attached version 14 here with my WIP code so I can come back to it in
> the next release.
>
> 0012 is an invasive refactor of GetVictimBuffer() and all the
> functions it calls that needs to be split apart into smaller commits
> and redistributed throughout the other patches in the set. Other than
> that, there are a number of todos both in the code and from a
> benchmarking perspective:
>
> - Probe for preceding as well as following blocks when looking for
> blocks to combine
> - Experiment with different batch sizes
> - Enable and test bulkread and vacuum write combining
>
> I also periodically get this troubling warning: "WARNING:  invalid
> page in block 2 of relation "base/16384/2608_fsm"; zeroing out page".
> So clearly there's at least one bug somewhere.
>
> FWIW benchmarks showed a large improvement from backend and bgwriter
> write combining, but I'll provide more detailed output from benchmarks
> when I post a more reviewable patch set.
>
> 0001 and 0002 are completely independent cleanup that should be pushed
> anyway, so I might do that, but I didn't want to do so without having
> posted on the list at least once.
>
> - Melanie
>
<v14-0001-Make-FlushUnlockedBuffer-use-its-parameters.patch><v14-0002-Make-ScheduleBufferTagForWriteback-static.patch><v14-0003-Streamline-buffer-rejection-for-bulkreads-of-unl.patch><v14-0004-Allow-PinBuffer-to-skip-increasing-usage-count.patch><v14-0005-Eagerly-flush-bulkwrite-strategy-ring.patch><v14-0006-Write-combining-for-BAS_BULKWRITE.patch><v14-0007-Add-database-Oid-to-CkptSortItem.patch><v14-0008-Implement-checkpointer-data-write-combining.patch><v14-0009-Remove-SyncOneBuffer-and-refactor-BgBufferSync.patch><v14-0010-Normal-backend-write-combining.patch><v14-0011-Add-write-combining-for-background-writer.patch><v14-0012-WIP-refactor.patch>

A few comments for v14:

1 - 0001
```
-    FlushBuffer(buf, reln, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+    FlushBuffer(buf, reln, io_object, io_context);
```

This changes the hardcode IOOBJECT_RELATION to io_object when calling FlushBuffer(). But FlushBuffer() itself still
usesIOOBJECT_RELATION instead of io_object, so the change will actually not take effective: 
```
    pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
                                              IOOP_WRITE, io_start, 1, BLCKSZ);
```

So, an update in FlushBuffer() is also needed.

2 - 0003 - freelist.c
```
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+    Assert(strategy);
+
     /* We only do this in bulkread mode */
     if (strategy->btype != BAS_BULKREAD)
         return false;
@@ -795,8 +800,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
         strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
         return false;

+    Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
+    Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
```

I don’t quite understand Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));

The comment says “the buffer header spinlock must not be held,” but I doubt this Assert actually ensures that. Since
BM_LOCKEDis just a bit in a shared state, isn't it possible for a concurrent backend to grab the lock right when we’re
checkingit? 

If a concurrent process locks the header a split-second before the Assert, we’ll get a false-positive crash in a dev
environmenteven though the code is fine. If they lock it a split-second after, then the Assert didn't really catch
anything.It feels like a race condition either way. 

I think the only way to truly ensure a spinlock is "not held" is to actually acquire it. If we’re just checking the bit
likethis, it’s just a snapshot that might be stale by the time the instruction finishes. What was the specific worry
here—aself-deadlock, or something else? 

3 - 0004
```
+typedef enum BufferUsageCountChange
+{
+    BUC_ZERO,
+    BUC_MAX_ONE,
+    BUC_ONE,
+} BufferUsageCountChange;
```

Can we add some brief comments to explain every enum item? I feel hard to guess the meaning from the names without
furtherreading the code. 

4 - 0005 - bufmgr.c
```
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked and with BM_IO_IN_PROGRESS set, or NULL if this buffer
+ * does not contain a block that should be flushed.
+ *
+ * If returning a buffer, also return its LSN.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum)
```

This header comment looks stale.

* It says “and with BM_IO_IN_PROGRESS set”, but I don’t see where BM_IO_IN_PROGRESS is set in this function.
* It says “also return its LSN”, but I don’t see LSN is returned.

5 - 0006
```
+            /* Start IO on the first buffer */
+            if (!StartBufferIO(buf_hdr, false, false))
+                goto clean;
```

This failure branch feels like leaking the content lock. The buffer was already share-locked via
BufferLockConditional()earlier, and I don’t see the “clean" path unlock that content lock before returning. 

6 - 0009
```
+        FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+        UnpinBuffer(bufHdr);
+
+        ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &bufHdr->tag);
```

bufHdr is unlocked and unpinned, it might be unsafe to still use bufHdr->tag. I think we can copy bufHdr->tag to a
localvariable before unlock the buffer. 

~~ My brain is stuck, I will continue 0010 tomorrow ~~

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Kirill Reshke
Date:
Subject: Re: amcheck: fix bug of missing corruption in allequalimage validation
Next
From: Chao Li
Date:
Subject: Re: amcheck: fix bug of missing corruption in allequalimage validation