Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write - Mailing list pgsql-hackers

From DEVOPS_WwIT
Subject Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write
Date
Msg-id bcfc751c-24ab-40fa-92b0-7dbae4c94e7f@ww-it.cn
In response to Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write  (Robert Treat <rob@xzilla.net>)
List pgsql-hackers
Hi

I personally believe that the doublewrite buffer is very smart for MySQL's 
InnoDB, but not a good idea for Postgres; currently, WAL is the best 
solution for Postgres.

Maybe a next-generation log system for Postgres could use OrioleDB's 
storage engine.


Regards

Tony

On 2026/2/19 02:00, Robert Treat wrote:
> On Mon, Feb 16, 2026 at 9:07 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
>> On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <baotiao@gmail.com> wrote:
>>> Hi hackers,
>>>
>>> I raised this topic a while back [1] but didn't get much traction, so
>>> I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
>>> for PostgreSQL as an alternative to full_page_writes.
>>>
>>> The core argument is straightforward. FPW and checkpoint frequency are
>>> fundamentally at odds:
>>>
>>> - FPW wants fewer checkpoints -- each checkpoint triggers a wave of
>>> full-page WAL writes for every page dirtied for the first time,
>>> bloating WAL and tanking write throughput.
>>> - Fast crash recovery wants more checkpoints -- less WAL to replay
>>> means the database comes back sooner.
>>>
>>> DWB resolves this tension by moving torn page protection out of the
>>> WAL path entirely. Instead of writing full pages into WAL (foreground,
>>> latency-sensitive), dirty pages are sequentially written to a
>>> dedicated doublewrite buffer area on disk before being flushed to
>>> their actual locations. The buffer is fsync'd once when full, then
>>> pages are scatter-written to their final positions. On crash recovery,
>>> intact copies from the DWB repair any torn pages.
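The write path described above can be sketched roughly as follows. This is a minimal Python simulation over ordinary files, not the patch's actual code; the `dwb_flush` name and the 12-byte per-copy header are invented for illustration:

```python
import os
import struct
import zlib

PAGE_SIZE = 8192                       # PostgreSQL's default block size
HEADER = struct.Struct("<IQ")          # (destination page number, CRC32 of page)

def dwb_flush(dwb_path, data_path, dirty_pages):
    """Flush a batch of dirty pages with doublewrite protection.

    dirty_pages maps page number -> PAGE_SIZE bytes of page content.
    """
    # Step 1: write every page sequentially into the doublewrite area,
    # tagging each copy with its destination and a checksum, then fsync
    # the whole batch once.
    with open(dwb_path, "wb") as dwb:
        for pageno, page in dirty_pages.items():
            dwb.write(HEADER.pack(pageno, zlib.crc32(page)))
            dwb.write(page)
        dwb.flush()
        os.fsync(dwb.fileno())         # one fsync amortized over the batch

    # Step 2: scatter-write the pages to their final locations. If a
    # crash tears a page here, the intact DWB copy can repair it.
    with open(data_path, "r+b") as data:
        for pageno, page in dirty_pages.items():
            data.seek(pageno * PAGE_SIZE)
            data.write(page)
        data.flush()
        os.fsync(data.fileno())
    # Step 3: the DWB area can now be recycled for the next batch.
```

The trade-off the proposal describes is visible here: both writes happen in a background flush path, and one batched fsync() replaces per-page full-page images in the foreground WAL stream.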
>>>
>>> Key design differences:
>>>
>>> - FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL latency
>>> - DWB: 2 page writes (background flush path) = minimal user-visible impact
>>> - DWB batches fsync() across multiple pages; WAL fsync batching is
>>> limited by foreground latency constraints
>>> - DWB decouples torn page protection from checkpoint frequency, so you
>>> can checkpoint as often as you want without write amplification
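The recovery half of the scheme, repairing torn data pages from intact DWB copies on restart, could be sketched in the same style. Again this is a hypothetical illustration, not the patch's code; the `dwb_recover` name and header layout are invented:

```python
import os
import struct
import zlib

PAGE_SIZE = 8192
HEADER = struct.Struct("<IQ")          # (destination page number, CRC32 of page)

def dwb_recover(dwb_path, data_path):
    """Repair torn data pages from intact doublewrite copies.

    Returns the list of page numbers that were restored.
    """
    repaired = []
    with open(dwb_path, "rb") as dwb, open(data_path, "r+b") as data:
        while True:
            hdr = dwb.read(HEADER.size)
            if len(hdr) < HEADER.size:
                break
            pageno, crc = HEADER.unpack(hdr)
            copy = dwb.read(PAGE_SIZE)
            if zlib.crc32(copy) != crc:
                # This DWB copy is itself torn, which means the crash hit
                # before the scatter write: the data page was never touched.
                continue
            data.seek(pageno * PAGE_SIZE)
            if zlib.crc32(data.read(PAGE_SIZE)) != crc:
                # Data page is torn or stale: restore the intact copy.
                data.seek(pageno * PAGE_SIZE)
                data.write(copy)
                repaired.append(pageno)
        data.flush()
        os.fsync(data.fileno())
    return repaired
```

Because the DWB area is fsync'd before the scatter write begins, every page either has an intact copy in the DWB or was never overwritten in place, so recovery never needs the WAL to undo a torn write.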
>>>
>>> I ran sysbench benchmarks (io-bound, --tables=10
>>> --table_size=10000000) with checkpoint_timeout=30s,
>>> shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
>>> database, VACUUM FULL, 60s warmup, 300s run.
>>>
>>> Results (TPS):
>>>
>>>                       FPW OFF    FPW ON     DWB ON
>>> read_write/32        18,038      7,943     13,009
>>> read_write/64        24,249      9,533     15,387
>>> read_write/128       27,801      9,715     15,387
>>> write_only/32        53,146     18,116     31,460
>>> write_only/64        57,628     19,589     32,875
>>> write_only/128       59,454     14,857     33,814
>>>
>>> Avg latency (ms):
>>>
>>>                       FPW OFF    FPW ON     DWB ON
>>> read_write/32          1.77       4.03       2.46
>>> read_write/64          2.64       6.71       4.16
>>> read_write/128         4.60      13.17       9.81
>>> write_only/32          0.60       1.77       1.02
>>> write_only/64          1.11       3.27       1.95
>>> write_only/128         2.15       8.61       3.78
>>>
>>> FPW ON drops to ~25% of baseline (FPW OFF). DWB ON holds at ~57%. In
>>> write-heavy scenarios DWB delivers over 2x the throughput of FPW with
>>> significantly better latency.
>>>
>>> The implementation is here: https://github.com/baotiao/postgres
>>>
>>> I'd appreciate any feedback on the approach. Would be great if the
>>> community could take a look and see if this direction is worth
>>> pursuing upstream.
>> Hi Baotiao
>>
>> I'm a newbie here, but I took an interest in your idea; probably everyone
>> else is busy with work on other patches before the commit freeze.
>>
> I'm somewhat less of a noob here, so I'll confirm that this proposal
> has basically zero chance of getting in, at least for the v19 cycle.
> This isn't so much about the proposal itself, but more that if you
> were trying to pick the worst time of year to submit a large,
> complicated feature into the PostgreSQL workflow, this would be really
> close to it.
>
> However, I have also wondered about this specific trade-off (FPW vs
> DWB) for years, but until now, the level of effort required to produce
> a meaningful POC that would confirm if the idea was worth pursuing was
> so large that I think it stopped anyone from even trying. So,
> hopefully everyone will realize that we don't live in that world
> anymore, and as a side benefit, apparently the idea is worth pursuing.
>
>> I think it would be valuable to have this, as I've been hit by PostgreSQL's
>> unsteady (chainsaw-like) WAL traffic, especially from touching pages for
>> the first time after a checkpoint, up to the point of saturating network
>> links. The common counter-argument to doublewrite buffering is probably
>> that FPIs may(?) increase the WAL replay rate on standbys, and this would
>> have to be measured and taken into account (but we should also consider
>> how much the maintenance_io_concurrency/posix_fadvise() prefetching we do
>> today helps avoid I/O stalls when fetching pages - so it might be
>> basically free). I see that you even got benefits by not using FPI.
>> Interesting.
>>
>> Some notes/questions about the patches itself:
>>
> So, I haven't looked at the code itself; to be honest, I am a bit too
> paranoid to dive into generated code that would seem to carry some
> likely level of legal risk around potential reuse of GPL/proprietary
> code it might be based on (either in its original training, inference,
> or the context used for generation. Yeah, I know InnoDB isn't written in
> C, but still). That said, I did have some feedback and questions on
> the proposal itself, and some suggestions for how to move things
> forward.
>
> It would be helpful if you could provide a little more information on
> the system you ran these benchmarks on, specifically the
> underlying OS/filesystem/hardware, and I'd even be interested in
> the build flags. I'd also be interested to know if you did any kind of
> crash-safety testing... while it is great to have improved
> performance, presumably that isn't actually the primary point of these
> subsystems. It'd also be worth knowing if you tested this on any
> systems with replication (physical or logical), since we'd need to
> understand the potential downstream effects. I'm tempted to say you
> should have an AI generate some pgbench scripts. Granted, it's early, and
> fine if you haven't done any of this, but I imagine we'll need to look at
> it eventually.
>
>> 0. The convention here is to send patches using:
>>     git format-patch -v<VERSION> HEAD~<numberOfpatches>
>>     for easier review. The 0003 probably should be out of scope. Anyway,
>>     I've attached all of those, so maybe somebody else will take a look
>>     at them too; they look very mature. Is this code used in production
>>     anywhere yet? (And BTW, the numbers are quite impressive.)
>>
> While Jakub is right that the convention is to send patches, that
> convention is based on a manual development model, not an agentic
> development model. While there is no official project policy on this,
> IMHO the thing we really need from you is not the code output, but the
> prompts that were used to generate the code. There are plenty of folks
> who have access to Claude and could then use those prompts to
> "recreate with enough proximity" the work you had claude do, and that
> process would also allow for additional verification and reduction of
> any legal concerns or concerns about investing further human
> time/energy. (No offense, but as you are not a regular contributor,
> you could analogize this to when third parties do large code dumps and
> say "here's a contribution, it's up to you to figure out how to use
> it". Ideally we want other folks to be able to pick up the project and
> continue with it, even if it means recreating it, and that works best
> if we have the underlying prompts).
>
> The Claude Code configuration file is a good start, but certainly not
> enough. Probably the ideal here would be full session logs, although a
> developer-diary would probably also suffice. I'm kind of guessing here
> because I don't know the scope of the prompts involved or how you were
> interacting with Claude in order to get where you are now, but those
> seem like the more obvious tools for work of this size whose intention
> is to be open.
>
>
> Robert Treat
> https://xzilla.net
>
>


