Home > mailing lists

Re: Checkpointer write combining - Mailing list pgsql-hackers

From	BharatDB
Subject	Re: Checkpointer write combining
Date	October 28 15:51:41
Msg-id	CAAh00EQ_vL+1TFLSSS5YXYH+KCg=P5ND3-simX3L2c9P536Xrw@mail.gmail.com Whole thread Raw
In response to	Re: Checkpointer write combining (Chao Li <li.evan.chao@gmail.com>)
List	pgsql-hackers

Tree view

On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:

Hi Milanie,

Thanks for updating the patch set. I review 1-6 and got a few more small comments. I didn’t review 0007 as it’s marked as WIP.

>
> - Melanie
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
/*
* Internal buffer management routines
*/
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
```

Nit: here you added two empty lines, I think we need only 1.

2 - 0002
```
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+ bool from_ring, IOContext io_context)
+{

- /* Find smgr relation for buffer */
- if (reln == NULL)
- reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+ XLogRecPtr max_lsn = InvalidXLogRecPtr;
+ LWLock *content_lock;
```

Nit: the empty line after “{“ should be removed.

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+ if (++(*sweep_cursor) >= strategy->nbuffers)
+ *sweep_cursor = 0;
+
+ return strategy->buffers[*sweep_cursor];
+}
```

Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

4 - 0003
```
static BufferDesc *
NextStratBufToFlush(BufferAccessStrategy strategy,
Buffer sweep_end,
XLogRecPtr *lsn, int *sweep_cursor)
``

“Strat” is confusing. I think it’s the short version of “Strategy”. As this is a static function, and other function names all have the whole word of “strategy”, why don’t also use the whole word in this function name as well?

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+ uint32 max_possible_buffer_limit;
+ uint32 max_write_batch_size;
+ int strategy_pin_limit;
+
+ max_write_batch_size = io_combine_limit;
+
+ strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+ max_possible_buffer_limit = GetPinLimit();
+
+ max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+ max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+ max_write_batch_size = Max(1, max_write_batch_size);
+ max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+ Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+ return max_write_batch_size;
+}
```

This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

/* Clamp to io_combine_limit and enforce minimum of 1 */
if (max_write_batch_size > io_combine_limit)
max_write_batch_size = io_combine_limit;
if (max_write_batch_size == 0)
max_write_batch_size = 1;

Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
return max_write_batch_size;
}
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Hello All,

As per reference to the previous mails, I understood the changes made and had tried to replicate the patches into the source code for the bug fix but it didn't show any significant bug. Also I ran some verification tests for the recent changes related to batched write statistics during checkpoints.
Below are my observations and results:

Test Setup
PostgreSQL version: 19devel (custom build)
OS: Ubuntu Linux
Port: 55432
Database: postgres
Test tool: pgbench
Duration: 120 seconds
Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres

Log Output

After running the workload, I triggered a manual checkpoint and checked the latest log entry:

2025-10-28 16:53:05.696 IST [11422] LOG: checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;

distance=33437 kB, estimate=308790 kB;

Observations:

Metric	Value	Source	Interpretation
Buffers written	1383	From log	Consistent with moderate workload
Checkpoint write time	0.023 s	From log	Realistic for ~11 MB write
Checkpoint sync time	0.017 s	From log	Reasonable
Total checkpoint time	0.071 s	From log	≈ write + sync + small overhead
CHECKPOINT runtime (psql)	-	-	Fast, confirms idle background activity

The total time closely matches the sum of write and sync times, with only a small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no missing or duplicated values.

Expected behavior observed:

write + sync ≈ total (no zero, truncation, or aggregation bug)
Buffer counts and timing scale realistically with workload
No evidence of under- or over-counted times
Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s overhead => normal for control file + metadata writes.

Comparison to Prior Reports:

Test	Pre-Patch	Post-Patch	Difference
Checkpoint duration	6.5 s	5.0 s	−23
Heavy workload test	16 s	8 s	−50

Result: It shows consistent and stable timing even under moderate pgbench load — confirming the patch is integrated and functioning correctly.

Final Status:

Category	Result
Batched Write Stats Accuracy	Verified OK
Timing Aggregation Correctness	Verified OK
pg_stat_checkpointer Consistency	Verified OK
Log Formatting	Normal
Performance Regression	None detected

Inference:

The reported write, sync, and total timings are consistent and realistic.
No anomalies or timing mismatches were seen.
Checkpoint performance seems stable and accurate under moderate load.
No regression or counter accumulation issues detected.

Also, I have been verifying output using manual queries recording the observations and find no significant delays. Observations are recorded below and the screenshots are attached herewith.

Observations:

Metric	Before Activity	After Activity	Notes
Buffers written	27949	27972	Matches wrote 23 buffers from log
Write time	206882	206887	Matches write=0.005 s from log
Sync time	840	877	Matches sync=0.037 s from log
num_done	6	7	Completed successfully
slru_written	8	9	Matches wrote 1 SLRU buffer from log
Stats reset	yes	yes	successful
num_requested	7	8	manual checkpoint recorded

For further review, I have attached the screenshots of my terminal herewith.

Kindly review the observations and please let me know if any additional details need to be focused on.

Thanking you.

Regards,

Soumya

Attachment

pgsql-hackers by date:

From: vignesh C
Date: 28 October, 15:51:04
Subject: Re: Logical Replication of sequences

From: Alexander Kukushkin
Date: 28 October, 16:03:00
Subject: Re: Issue with logical replication slot during switchover