Re: Checkpointer write combining - Mailing list pgsql-hackers

From BharatDB
Subject Re: Checkpointer write combining
Date
Msg-id CAAh00EQ_vL+1TFLSSS5YXYH+KCg=P5ND3-simX3L2c9P536Xrw@mail.gmail.com
Whole thread Raw
In response to Re: Checkpointer write combining  (Chao Li <li.evan.chao@gmail.com>)
List pgsql-hackers


On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:
Hi Milanie,

Thanks for updating the patch set. I review 1-6 and got a few more small comments. I didn’t review 0007 as it’s marked as WIP.

>
> - Melanie
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
```

Nit: here you added two empty lines, I think we need only 1.

2 - 0002
```
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+                                 bool from_ring, IOContext io_context)
+{

-       /* Find smgr relation for buffer */
-       if (reln == NULL)
-               reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
+       LWLock     *content_lock;
```

Nit: the empty line after “{“ should be removed.

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+       if (++(*sweep_cursor) >= strategy->nbuffers)
+               *sweep_cursor = 0;
+
+       return strategy->buffers[*sweep_cursor];
+}
```

Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

4 - 0003
```
static BufferDesc *
NextStratBufToFlush(BufferAccessStrategy strategy,
Buffer sweep_end,
XLogRecPtr *lsn, int *sweep_cursor)
``

“Strat” is confusing. I think it’s the short version of “Strategy”. As this is a static function, and other function names all have the whole word of “strategy”, why don’t also use the whole word in this function name as well?

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+       uint32          max_possible_buffer_limit;
+       uint32          max_write_batch_size;
+       int                     strategy_pin_limit;
+
+       max_write_batch_size = io_combine_limit;
+
+       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+       max_possible_buffer_limit = GetPinLimit();
+
+       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+       max_write_batch_size = Max(1, max_write_batch_size);
+       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+       return max_write_batch_size;
+}
```

This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
        int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
        uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

        /* Clamp to io_combine_limit and enforce minimum of 1 */
        if (max_write_batch_size > io_combine_limit)
                max_write_batch_size = io_combine_limit;
        if (max_write_batch_size == 0)
                max_write_batch_size = 1;

        Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
        return max_write_batch_size;
}
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/



Hello All,
 
As per reference to the previous mails, I understood the changes made and had tried to replicate the patches into the source code for the bug fix but it didn't show any significant bug. Also I ran some verification tests for the recent changes related to batched write statistics during checkpoints. 
Below are my observations and results:

Test Setup
  • PostgreSQL version: 19devel (custom build)

  • OS: Ubuntu Linux

  • Port: 55432

  • Database: postgres

  • Test tool: pgbench

  • Duration: 120 seconds

  • Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres

Log Output

After running the workload, I triggered a manual checkpoint and checked the latest log entry:

2025-10-28 16:53:05.696 IST [11422] LOG:  checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB; 

Observations:

Metric

Value

Source

Interpretation

Buffers written

1383

From log

Consistent with moderate workload

Checkpoint write time

0.023 s

From log

Realistic for ~11 MB write

Checkpoint sync time

0.017 s

From log

Reasonable

Total checkpoint time

0.071 s

From log

≈ write + sync + small overhead

CHECKPOINT runtime (psql)

-

-

Fast, confirms idle background activity

 

The total time closely matches the sum of write and sync times, with only a small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no missing or duplicated values.

Expected behavior observed:

  • write + sync ≈ total (no zero, truncation, or aggregation bug)

  • Buffer counts and timing scale realistically with workload

  • No evidence of under- or over-counted times

  • Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s overhead => normal for control file + metadata writes.

Comparison to Prior Reports:


Test

Pre-Patch

Post-Patch

Difference

Checkpoint duration

6.5 s

5.0 s

−23

Heavy workload test

16 s

8 s

−50

Result: It shows consistent and stable timing even under moderate pgbench load — confirming the patch is integrated and functioning correctly.

Final Status:

Category

Result

Batched Write Stats Accuracy

Verified OK

Timing Aggregation Correctness

Verified OK

pg_stat_checkpointer Consistency

Verified OK

Log Formatting

Normal

Performance Regression

None detected

Inference:

  • The reported write, sync, and total timings are consistent and realistic.

  • No anomalies or timing mismatches were seen.

  • Checkpoint performance seems stable and accurate under moderate load.

  • No regression or counter accumulation issues detected.


Also, I have been verifying output using manual queries recording the observations and find no significant delays. Observations are recorded below and the screenshots are attached herewith.

Observations:

Metric

Before Activity

After Activity

Notes

Buffers written

27949

27972

Matches wrote 23 buffers from log

Write time

206882

206887

Matches write=0.005 s from log

Sync time

840

877

Matches sync=0.037 s from log

num_done

6

7

Completed successfully

slru_written

8

9

Matches wrote 1 SLRU buffer from log

Stats reset

yes

yes

successful

num_requested

7

8

manual checkpoint recorded


For further review, I have attached the screenshots of my terminal herewith. 
Kindly review the observations and please let me know if any additional details need to be focused on.

Thanking you.

Regards,
Soumya

Attachment

pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Logical Replication of sequences
Next
From: Alexander Kukushkin
Date:
Subject: Re: Issue with logical replication slot during switchover