Re: new option to allow pg_rewind to run without full_page_writes - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: new option to allow pg_rewind to run without full_page_writes
Date
Msg-id CA+hUKG+K-cc+LLn=Ys6ivf-+AqyHqd1ycsPHYRLo9oW3PbCDTQ@mail.gmail.com
In response to Re: new option to allow pg_rewind to run without full_page_writes  (Jérémie Grauer <jeremie.grauer@cosium.com>)
List pgsql-hackers
On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer
<jeremie.grauer@cosium.com> wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it to run into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn page (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:

1.  Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.

2.  Here we are using full_page_writes to handle lack of atomicity
when there are concurrent reads and writes to the same file from
different threads.  Basically, by turning on full_page_writes we say
that we don't trust any block that might have been written to during
the copying.  Again, ZFS already handles that for itself: it uses
range locking in the read and write paths (see zfs_rangelock_enter()
in zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.

Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't.  That's what the
2nd thing is protecting against.  I think it must be possible to show
the same breakage on ZFS too, *if* the file regions passed to system
calls are NOT correctly aligned.  As Andres points out, <stdio.h>
buffered IO streams create a risk there: we have no idea what system
calls are reaching ZFS, so it doesn't seem safe to turn off full page
writes unless you also fix that.
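
To make "correctly aligned" concrete, here is a minimal sketch of the
condition a single read or write request would have to meet for a
per-call range lock to cover whole PostgreSQL pages.  The helper name
io_covers_whole_pages() is hypothetical, and BLCKSZ is assumed to be
the default 8192; neither is taken from the patch under discussion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */

/*
 * Hypothetical helper: true if the request [offset, offset + len) starts
 * and ends on page boundaries, so it covers only whole pages and never a
 * fragment of one.
 */
static bool
io_covers_whole_pages(off_t offset, size_t len)
{
    return offset >= 0 &&
        len > 0 &&
        offset % BLCKSZ == 0 &&
        len % BLCKSZ == 0;
}
```

A request of one or more whole pages passes the check.  A 4096-byte
request, which a buffered stream might issue depending on its buffer
size, does not, because the other half of that page would arrive in a
separate system call.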

> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed
>
> I may be missing something here: how does full_page_writes prevents
> short _reads_ ?

I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned.  Then I think you'd be safe?
Because zfs_read() does:

    /*
     * Lock the range against changes.
     */
    zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
        zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);

So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.
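
A minimal sketch of what such a read path could look like, assuming the
default BLCKSZ of 8192 and a file whose length is a multiple of BLCKSZ.
The names read_block_aligned() and MAX_READ_ATTEMPTS are hypothetical;
this is not pg_rewind's code, only an illustration of re-reading the
whole page after a short read instead of fetching the missing tail.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */
#define MAX_READ_ATTEMPTS 5     /* assumed: arbitrary limit for this sketch */

/*
 * Hypothetical helper: read page number blkno from fd into buf, which
 * must hold BLCKSZ bytes.
 *
 * Every pread() asks for exactly one whole, page-aligned block.  A short
 * read is thrown away and the whole page is requested again, so the
 * returned page always comes from a single system call.
 *
 * Returns 1 if a whole page was read, 0 at end of file, and -1 on error
 * with errno set.
 */
static int
read_block_aligned(int fd, off_t blkno, char *buf)
{
    off_t       offset = blkno * (off_t) BLCKSZ;

    for (int attempt = 0; attempt < MAX_READ_ATTEMPTS; attempt++)
    {
        ssize_t     n = pread(fd, buf, BLCKSZ, offset);

        if (n == BLCKSZ)
            return 1;           /* whole page from one system call */
        if (n == 0)
            return 0;           /* nothing at this offset: end of file */
        if (n < 0 && errno != EINTR)
            return -1;          /* real error, errno set by pread() */

        /* Interrupted or short read: discard and retry the whole page. */
    }

    errno = EIO;                /* gave up after repeated short reads */
    return -1;
}
```

A caller would loop over block numbers and treat a return of -1 as
fatal.  Giving up after a fixed number of attempts is a choice made for
this sketch; a trailing partial page would also end up on that path.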

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B19bZKidSiWmMsDmgUVe%3D_rr0m57LfR%2BnAbWprVDd_cw%40mail.gmail.com
