
From Jakub Wartak
Subject Re: [WIP] Pipelined Recovery
Date
Msg-id CAKZiRmyFjGnscqrSS4+61RHxzEDXE3c08v48GAEXFkvsJx-yUw@mail.gmail.com
In response to [WIP] Pipelined Recovery  (Imran Zaheer <imran.zhir@gmail.com>)
List pgsql-hackers
On Fri, Jan 30, 2026 at 3:56 PM Imran Zaheer <imran.zhir@gmail.com> wrote:
>
> Hi,
>
> Based on a suggestion by my colleague Ants Aasma, I worked on this
> idea of adding parallelism to the WAL recovery process.
>
> The crux of this idea is to decode the WAL using parallel workers. Now
> the replay process can get the records from the shared memory queue
> directly. This way, we can decrease some CPU load on the recovery process.
>
> Implementing this idea yielded an improvement of around 20% in the
> recovery times, but results may differ based on workloads. I have
> attached some benchmarks for different workloads.
>
> Following are some recovery tests with the default configs. Here p1
> shows pipeline enabled. (db size) is the backup database size on
> which the recovery happens. You can see more detail related to the
> benchmarks in the attached file `recoveries-benchmark-v01`.
>
>                      elapsed (p0)     elapsed (p1)    % perf     db size
>
> inserts.sql          272s 10ms        197s 570ms      27.37%     480 MB
> updates.sql          177s 420ms       117s 80ms       34.01%     480 MB
> hot-updates.sql      36s 940ms        29s 240ms       20.84%     480 MB
> nonhot.sql           36s 570ms        28s 980ms       20.75%     480 MB
> simple-update        20s 160ms        11s 580ms       42.56%     4913 MB
> tpcb-like            20s 590ms        13s 640ms       33.75%     4913 MB
>
> A similar approach was also suggested by Matthias van de Meent earlier in a
> separate thread [1]. Right now I am using one bgw for decoding and filling
> up the shared message queue, and the redo apply loop simply receives the decoded record
> from the queue. After the redo is finished, the consumer (startup
> process) can request a shutdown from the producer (pipeline bgw)
> before exiting recovery.
>
> This idea can be coupled with another idea of pinning the buffers in
> parallel before the recovery process needs them. This will try to
> parallelize most of the work being done in
> `XLogReadBufferForRedoExtended`. The Redo can simply receive
> the already pinned buffers from a queue, but for implementing
> this, we still need some R&D on that, as IPC and pinning/unpinning of
> buffers across two processes can be tricky.
>
> If someone wants to reproduce the benchmark, they can do so using
> these scripts [2].
>
> Looking forward to your reviews, comments, etc.

Hi Imran,

It's great that you are undertaking such a cool project (I also think that
recovery performance is one of the two biggest performance limitations today).
I've taken a quick look at the attached benchmarks and, to my pleasant
surprise, they include flamegraphs!

Thoughts on results:

- on the flamegraphs, BufferAlloc->GetVictimBuffer->FlushBuffer() shows up
often in both scenarios (p0/p1). I recall from my old WAL I/O prefetching
stress-test experiments [1] that tuning the bgwriter played an important role
(most of the time the bgwriter settings are left at their defaults). It is
certainly connected to the pipelining idea too, and based on your results,
mine, and some other field reports, we are not effective there. Basically we
have the wrong bgwriter defaults for standbys, and no one notices unless they
measure it in depth. We could also tweak the bgwriter in the standby role to
make the pipelining more effective, but maybe I am missing something? Perhaps
the bgwriter on a standby should run full-steam as long as there is work to be
done (pressure), instead of being rate limited? This is not a critique of the
patch, but maybe you have investigated that road too, as it is related to the
pipelining concept?
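
  For reference, the kind of tuning I mean is just the existing bgwriter GUCs
  on the standby, e.g. (values purely illustrative, not a recommendation):

    bgwriter_delay = 10ms            # default 200ms; wake up more often
    bgwriter_lru_maxpages = 1000     # default 100; allow more writes per round
    bgwriter_lru_multiplier = 4.0    # default 2.0; scale the estimate up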

- there's plenty of ValidXLogRecord->pg_comp_crc32c_sb8(), probably even too
much, so I think either an old CPU was used or something prevented SSE4.2
from being picked up (pg_comp_crc32c_sse42()). I'm speculating, but you would
probably get even better results from the patch (which are impressive anyway)
with pg_comp_crc32c_sse42(), because the CRC computation wouldn't be such a
bottleneck anymore.
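
  A quick way to check whether the hardware supports it at all (assuming
  Linux) is:

    grep -m1 -o -w sse4_2 /proc/cpuinfo || echo "no SSE4.2"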

- The flamegraphs themselves cover the whole PostgreSQL instance, right, or am
I misunderstanding them? In the long run it would probably be better to
profile just the startup/recovering process (that's way harder to script for
sure, due to the need for the isolated PID: perf record -p <PID>)
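
  Something along these lines should do it (assuming Linux and that the
  startup process title is visible to pgrep, i.e. no cluster_name prefix or
  the pattern adjusted accordingly):

    perf record -F 99 -g -p "$(pgrep -f 'postgres: startup')" -- sleep 60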

- You need to rebase due to 1eb09ed63a8 from a couple of days ago:
  CONFLICT (content): Merge conflict in src/backend/postmaster/bgworker.c

- naming: I don't have anything against wal_pipeline.c, it's just that
  the suite of C files is named src/backend/access/transam/xlog*.c
  (so perhaps xlogrecoverypipeline.c (?) to stay consistent)

- I was getting lots of errors during linking (e.g. undefined reference to
  `WalPipeline_IsActive') and it appears that the patch is missing meson
  support:

    --- a/src/backend/access/transam/meson.build
    +++ b/src/backend/access/transam/meson.build
    @@ -25,6 +25,7 @@ backend_sources += files(
    'xlogstats.c',
    'xlogutils.c',
    'xlogwait.c',
    +'wal_pipeline.c'


- nitpicking, but you need to add wal_pipeline to postgresql.conf.sample,
  otherwise the 003_check_guc.pl test fails.
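
  Presumably just a commented-out entry in the appropriate section, something
  along the lines of (assuming the default is off; adjust the wording to
  whatever the GUC description ends up being):

    #wal_pipeline = off    # decode WAL in a separate worker during recovery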

- the good news is that the test run with PG_TEST_EXTRA="wal_consistency_checking" passes

- but the bad news is that PG_TEST_EXTRA="wal_consistency_checking" combined
  with PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on", which activates all of
  this, does not pass multiple tests (I'll possibly look into those in the
  following days, unless you are faster)
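
  For the record, this is roughly how I'm running it (adjust the build
  directory and suite to taste):

    PG_TEST_EXTRA="wal_consistency_checking" \
    PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on" \
    meson test -C build --suite recovery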

- at least here, on a fresh standby built with pg_basebackup -R and restarted
  with wal_pipeline=on, I've got a bug:
    2026-02-04 10:29:40.146 CET [335197] LOG:  [walpipeline] started.
    2026-02-04 10:29:40.146 CET [335197] LOG:  redo starts at 0/02000028
    2026-02-04 10:29:40.147 CET [335198] LOG:  invalid record length at 0/03000060: expected at least 24, got 0
    2026-02-04 10:29:40.147 CET [335197] LOG:  consistent recovery state reached at 0/03000060
    2026-02-04 10:29:40.147 CET [335197] LOG:  [walpipeline] consumer: received shutdown message from the producer
    2026-02-04 10:29:40.147 CET [335191] LOG:  database system is ready to accept read-only connections
    2026-02-04 10:29:40.157 CET [335198] LOG:  [walpipeline] producer: exiting: sent=5 received=5
    2026-02-04 10:29:40.167 CET [335197] LOG:  WAL pipeline stopped
    2026-02-04 10:29:40.167 CET [335197] LOG:  redo done at 0/03000028 system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
    2026-02-04 10:29:40.169 CET [335197] LOG:  selected new timeline ID: 2
    2026-02-04 10:29:40.184 CET [335197] LOG:  archive recovery complete
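
  Roughly what I did, in case it helps reproduce it (connection options and
  paths are just placeholders):

    pg_basebackup -D standby -R -c fast
    echo "wal_pipeline = on" >> standby/postgresql.conf
    pg_ctl -D standby -l standby.log start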

- even when starting from scratch with wal_pipeline = on it's similar:
   2026-02-04 10:34:14.386 CET [335833] LOG:  database system was interrupted; last known up at 2026-02-04 10:34:02 CET
   2026-02-04 10:34:14.453 CET [335833] LOG:  starting backup recovery with redo LSN 0/0B0000E0, checkpoint LSN 0/0B000138, on timeline ID 1
   2026-02-04 10:34:14.453 CET [335833] LOG:  entering standby mode
   2026-02-04 10:34:14.482 CET [335833] LOG:  [walpipeline] started.
   2026-02-04 10:34:14.482 CET [335833] LOG:  redo starts at 0/0B0000E0
   2026-02-04 10:34:14.484 CET [335833] LOG:  completed backup recovery with redo LSN 0/0B0000E0 and end LSN 0/0B0001E0
   2026-02-04 10:34:14.484 CET [335833] LOG:  consistent recovery state reached at 0/0B0001E0
   2026-02-04 10:34:14.484 CET [335833] LOG:  [walpipeline] consumer: received shutdown message from the producer
   2026-02-04 10:34:14.484 CET [335827] LOG:  database system is ready to accept read-only connections
   2026-02-04 10:34:14.493 CET [335834] LOG:  [walpipeline] producer: exiting: sent=4 received=4
   2026-02-04 10:34:14.504 CET [335833] LOG:  WAL pipeline stopped
   2026-02-04 10:34:14.504 CET [335833] LOG:  redo done at 0/0B0001E0 system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
   2026-02-04 10:34:14.506 CET [335833] LOG:  selected new timeline ID: 2

   it appears that the "Cannot read more records, shut it down" ->
   WalPipeline_SendShutdown() path is taken, but I haven't pursued it further.

Also some very quick review comments:

+bool
+WalPipeline_SendRecord(XLogReaderState *record)
+{
[..]
+
+    if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
+    {
+        elog(WARNING, "[walpipeline] producer: wal record at %X/%X too large (%zu bytes), skipping",
+             LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
+        pfree(buffer);
+        return true;
+    }

When/why could this happen, and if it did happen, shouldn't this be a PANIC
instead? Silently skipping a WAL record during replay doesn't sound safe.
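
I.e. something along these lines is what I have in mind (just a sketch):

    if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
        elog(PANIC, "[walpipeline] producer: wal record at %X/%X too large (%zu bytes)",
             LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);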

+/* Size of the shared memory queue (can be made configurable) */
+#define WAL_PIPELINE_QUEUE_SIZE  (128 * 1024 * 1024)  /* 8 MB */
+
+/* Maximum size of a single message */
+#define WAL_PIPELINE_MAX_MSG_SIZE  (2 * 1024 * 1024)  /* 1 MB */

Maybe those should take into account (XLOG_)BLCKSZ too?
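
E.g. expressing them in units of WAL blocks rather than raw bytes, something
like (multipliers purely illustrative):

    #define WAL_PIPELINE_QUEUE_SIZE    (1024 * XLOG_BLCKSZ)  /* 8 MB with default 8 kB XLOG_BLCKSZ */
    #define WAL_PIPELINE_MAX_MSG_SIZE  (128 * XLOG_BLCKSZ)   /* 1 MB with default 8 kB XLOG_BLCKSZ */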

-J.

[1] -
https://www.postgresql.org/message-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0@VI1PR0701MB6960.eurprd07.prod.outlook.com


