Re: new option to allow pg_rewind to run without full_page_writes - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: new option to allow pg_rewind to run without full_page_writes
Msg-id: CA+hUKG+K-cc+LLn=Ys6ivf-+AqyHqd1ycsPHYRLo9oW3PbCDTQ@mail.gmail.com
In response to: Re: new option to allow pg_rewind to run without full_page_writes (Jérémie Grauer <jeremie.grauer@cosium.com>)
List: pgsql-hackers
On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer <jeremie.grauer@cosium.com> wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it to run into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn page (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.
>
> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:

1. Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.

2. Here we are using full_page_writes to handle lack of atomicity when
there are concurrent reads and writes to the same file from different
threads. Basically, by turning on full_page_writes we say that we
don't trust any block that might have been written to during the
copying. Again, ZFS already handles that for itself: it uses range
locking in the read and write paths (see zfs_rangelock_enter() in
zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.

Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't. That's what the
2nd thing is protecting against.
I think it must be possible to show that breaking on ZFS too, *if* the
file regions arriving into system calls are NOT correctly aligned. As
Andres points out, <stdio.h> buffered IO streams create a risk there:
we have no idea what system calls are reaching ZFS, so it doesn't seem
safe to turn off full page writes unless you also fix that.

> > Does ZFS actually guarantee that there never can be short reads? As
> > soon as they are possible, full page writes are needed
>
> I may be missing something here: how does full_page_writes prevents
> short _reads_ ?

I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned. Then I think you'd be safe?
Because zfs_read() does:

        /*
         * Lock the range against changes.
         */
        zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
            zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);

So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B19bZKidSiWmMsDmgUVe%3D_rr0m57LfR%2BnAbWprVDd_cw%40mail.gmail.com