Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write - Mailing list pgsql-hackers
| From | DEVOPS_WwIT |
|---|---|
| Subject | Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write |
| Date | |
| Msg-id | bcfc751c-24ab-40fa-92b0-7dbae4c94e7f@ww-it.cn |
| In response to | Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write (Robert Treat <rob@xzilla.net>) |
| List | pgsql-hackers |
Hi,

I personally believe that the doublewrite buffer is a very smart design for MySQL's InnoDB, but not a good idea for Postgres. Currently, WAL is the best solution for Postgres; perhaps a next-generation log system for Postgres could use OrioleDB's storage engine.

Regards,
Tony

On 2026/2/19 02:00, Robert Treat wrote:
> On Mon, Feb 16, 2026 at 9:07 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
>> On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <baotiao@gmail.com> wrote:
>>> Hi hackers,
>>>
>>> I raised this topic a while back [1] but didn't get much traction, so
>>> I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
>>> for PostgreSQL as an alternative to full_page_writes.
>>>
>>> The core argument is straightforward. FPW and checkpoint frequency are
>>> fundamentally at odds:
>>>
>>> - FPW wants fewer checkpoints -- each checkpoint triggers a wave of
>>>   full-page WAL writes for every page dirtied for the first time,
>>>   bloating WAL and tanking write throughput.
>>> - Fast crash recovery wants more checkpoints -- less WAL to replay
>>>   means the database comes back sooner.
>>>
>>> DWB resolves this tension by moving torn page protection out of the
>>> WAL path entirely. Instead of writing full pages into WAL (foreground,
>>> latency-sensitive), dirty pages are sequentially written to a
>>> dedicated doublewrite buffer area on disk before being flushed to
>>> their actual locations. The buffer is fsync'd once when full, then
>>> pages are scatter-written to their final positions. On crash recovery,
>>> intact copies from the DWB repair any torn pages.
>>>
>>> Key design differences:
>>>
>>> - FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL latency
>>> - DWB: 2 page writes (background flush path) = minimal user-visible impact
>>> - DWB batches fsync() across multiple pages; WAL fsync batching is
>>>   limited by foreground latency constraints
>>> - DWB decouples torn page protection from checkpoint frequency, so you
>>>   can checkpoint as often as you want without write amplification
>>>
>>> I ran sysbench benchmarks (io-bound, --tables=10
>>> --table_size=10000000) with checkpoint_timeout=30s,
>>> shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
>>> database, VACUUM FULL, 60s warmup, 300s run.
>>>
>>> Results (TPS):
>>>
>>>                    FPW OFF   FPW ON   DWB ON
>>> read_write/32       18,038    7,943   13,009
>>> read_write/64       24,249    9,533   15,387
>>> read_write/128      27,801    9,715   15,387
>>> write_only/32       53,146   18,116   31,460
>>> write_only/64       57,628   19,589   32,875
>>> write_only/128      59,454   14,857   33,814
>>>
>>> Avg latency (ms):
>>>
>>>                    FPW OFF   FPW ON   DWB ON
>>> read_write/32         1.77     4.03     2.46
>>> read_write/64         2.64     6.71     4.16
>>> read_write/128        4.60    13.17     9.81
>>> write_only/32         0.60     1.77     1.02
>>> write_only/64         1.11     3.27     1.95
>>> write_only/128        2.15     8.61     3.78
>>>
>>> FPW ON drops to ~25% of baseline (FPW OFF). DWB ON holds at ~57%. In
>>> write-heavy scenarios DWB delivers over 2x the throughput of FPW with
>>> significantly better latency.
>>>
>>> The implementation is here: https://github.com/baotiao/postgres
>>>
>>> I'd appreciate any feedback on the approach. Would be great if the
>>> community could take a look and see if this direction is worth
>>> pursuing upstream.
>>
>> Hi Baotiao,
>>
>> I'm a newbie here, but I took your idea with some interest; probably everyone
>> else is busy with work on other patches before the commit freeze.
>>
> I'm somewhat less of a noob here, so I'll confirm that this proposal
> has basically zero chance of getting in, at least for the v19 cycle.
> This isn't so much about the proposal itself, but more that if you
> were trying to pick the worst time of year to submit a large,
> complicated feature into the postgresql workflow, this would be really
> close to it.
>
> However, I have also wondered about this specific trade-off (FPW vs
> DWB) for years, but until now the level of effort required to produce
> a meaningful POC that would confirm whether the idea was worth
> pursuing was so large that I think it stopped anyone from even trying.
> So, hopefully everyone will realize that we don't live in that world
> anymore, and as a side benefit, apparently the idea is worth pursuing.
>
>> I think it would be valuable to have this, as I've been hit by PostgreSQL's
>> unsteady (chain-saw-like) WAL traffic, especially related to touching pages
>> for the first time after a checkpoint, up to the point of saturating network
>> links. The common counter-argument to double buffering is probably that FPI
>> may(?) increase the WAL replication rate to standbys, and this would have to
>> be measured (but we should also take into account how much the
>> maintenance_io_concurrency/posix_fadvise() prefetching that we do today
>> helps avoid any I/O stalls on fetching pages, so it should be basically
>> free). I even see that you got benefits by not using FPI. Interesting.
>>
>> Some notes/questions about the patches themselves:
>>
> So, I haven't looked at the code itself; tbh I am a bit too
> paranoid to dive into generated code that would seem to carry some
> likely level of legal risk around potential reuse of GPL/proprietary
> code it might be based on (either in its original training, inference,
> or the context used for generation. Yeah, I know InnoDB isn't written in
> C, but still). That said, I did have some feedback and questions on
> the proposal itself, and some suggestions for how to move things
> forward.
>
> It would be helpful if you could provide a little more information on
> the system you are running these benchmarks on, specifically for me
> the underlying OS/filesystem/hardware, and I'd even be interested in
> the build flags. I'd also be interested to know if you did any kind of
> crash safety testing... while it is great to have improved
> performance, presumably that isn't actually the primary point of these
> subsystems. It'd also be worth knowing if you tested this on any
> systems with replication (physical or logical), since we'd need to
> understand those potential downstream effects. I'm tempted to say you
> should have an AI generate some pgbench scripts. Granted, it's early
> and fine if you haven't done any of this, but I imagine we'll need to
> look at it eventually.
>
>> 0. The convention here is to send the patches using:
>>    git format-patch -v<VERSION> HEAD~<numberOfpatches>
>> for easier review. The 0003 probably should be out of scope. Anyway, I've
>> attached all of those, so maybe somebody else is going to take a look at
>> them too; they look very mature. Is this code used in production already
>> anywhere? (And BTW the numbers are quite impressive.)
>>
> While Jakub is right that the convention is to send patches, that
> convention is based on a manual development model, not an agentic
> development model. While there is no official project policy on this,
> IMHO the thing we really need from you is not the code output, but the
> prompts that were used to generate the code. There are plenty of folks
> who have access to Claude who could then use those prompts to
> "recreate with enough proximity" the work you had Claude do, and that
> process would also allow for additional verification and a reduction
> of any legal concerns, or concerns about investing further human
> time/energy.
> (No offense, but as you are not a regular contributor,
> you could analogize this to when third parties do large code dumps and
> say "here's a contribution, it's up to you to figure out how to use
> it". Ideally we want other folks to be able to pick up the project and
> continue with it, even if it means recreating it, and that works best
> if we have the underlying prompts.)
>
> The Claude Code configuration file is a good start, but certainly not
> enough. Probably the ideal here would be full session logs, although a
> developer diary would probably also suffice. I'm kind of guessing here
> because I don't know the scope of the prompts involved or how you were
> interacting with Claude in order to get where you are now, but those
> seem like the more obvious tools for work of this size whose intention
> is to be open.
>
> Robert Treat
> https://xzilla.net