Re: new option to allow pg_rewind to run without full_page_writes - Mailing list pgsql-hackers

From Andres Freund
Subject Re: new option to allow pg_rewind to run without full_page_writes
Date
Msg-id 20221106023819.tpmvqa6kuy4cvtc7@awork3.anarazel.de
In response to new option to allow pg_rewind to run without full_page_writes  (Jérémie Grauer <jeremie.grauer@cosium.com>)
Responses Re: new option to allow pg_rewind to run without full_page_writes
List pgsql-hackers
Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it to run into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn page (maybe btrfs too, but I don't know for sure for this
> one).

Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.

Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.
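
To illustrate the interleaving I mean, here is a toy simulation, not PostgreSQL code. The names PAGE_SZ, CHUNK_SZ, read_chunk() and simulate_torn_read() are made up for this sketch, and the "disk" is just an in-memory array. It assumes the writer replaces the whole 8k page atomically, which is the best case a filesystem could offer, and shows that a reader fetching the page as two 4k chunks can still see a mix of old and new contents:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ  8192   /* assumed page size for this sketch */
#define CHUNK_SZ 4096   /* assumed size of each individual read */

/* Copy one chunk of the simulated on-disk page into the reader's buffer. */
static void
read_chunk(const char *disk, char *dst, int chunk)
{
    assert(chunk >= 0 && chunk < PAGE_SZ / CHUNK_SZ);
    memcpy(dst + chunk * CHUNK_SZ, disk + chunk * CHUNK_SZ, CHUNK_SZ);
}

/*
 * Simulate a reader that fetches a page as two chunks while a writer
 * replaces the whole page between the two reads.  Returns 1 if the
 * reader's copy mixes old and new contents (a torn read), 0 otherwise.
 */
static int
simulate_torn_read(void)
{
    char disk[PAGE_SZ];
    char seen[PAGE_SZ];

    memset(disk, 'A', PAGE_SZ);     /* old version of the page */
    read_chunk(disk, seen, 0);      /* reader: first half */
    memset(disk, 'B', PAGE_SZ);     /* writer: replaces the whole page */
    read_chunk(disk, seen, 1);      /* reader: second half */

    /* First half is old ('A'), second half is new ('B'). */
    return seen[0] != seen[PAGE_SZ - 1];
}
```

The write itself is never torn here; the tearing comes purely from the reader splitting the page into two reads.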

Also, for reasons I don't understand, we use C streaming IO for
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks.  Afaict we really shouldn't use file streams here; then we'd at least
have control over that aspect.
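
As a rough sketch of what "having control" could look like, assuming POSIX pread() on a plain file descriptor: the helper below requests a whole page in one system call and reports a short read instead of silently stitching several reads together. read_page_once() and PAGE_SZ are hypothetical names for this example, not existing PostgreSQL functions, and a single full-size pread() by itself still says nothing about whether the filesystem returns the page atomically.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192    /* assumed block size for this sketch */

/*
 * Read block number blkno with a single full-page pread() request.
 *
 * Returns 1 if the whole page arrived in one call, 0 on a short read
 * (including end of file), and -1 on error.  The call is retried only
 * when it was interrupted before transferring any data (EINTR).
 */
static int
read_page_once(int fd, unsigned blkno, char *buf)
{
    ssize_t n;

    assert(buf != NULL);

    do {
        n = pread(fd, buf, PAGE_SZ, (off_t) blkno * PAGE_SZ);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;
    return n == PAGE_SZ ? 1 : 0;
}
```

A caller would treat a return value of 0 as "cannot trust this page" rather than issuing a second read for the remainder, since completing the page with a second read is exactly what reintroduces the torn-read window.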


Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.



This isn't a fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).

Greetings,

Andres Freund


