Re: AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: AIO writes vs hint bits vs checksums
Date
Msg-id CA+hUKGJsndPVmEOcgWeKnZit-u6pOWnGaq0pACXOQfn79sfDwA@mail.gmail.com
In response to Re: AIO writes vs hint bits vs checksums  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
> > However, our habit of modifying buffers while IO is going on is
> > causing issues with filesystem level checksums as well, as evidenced by the
> > fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> > tend to think it'd be better to just stop doing that altogether (we also do
> > that for WAL, when writing out a partial page, but a potential fix there would
> > be different, I think).
>
> +many.  Interesting point re the WAL variant.  For the record, here's
> some discussion and a repro for that problem, which Andrew currently
> works around in a build farm animal with mount options:
>
> https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com

Here's an interesting new development in that area, this time from
OpenZFS, which committed its long-awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion.  Clearly it has the same checksum stability problem
as BTRFS and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy
goal.  Curiously, they seem to have landed on two different solutions
with three different possible behaviours: (1) on FreeBSD, temporarily
make the memory non-writeable; (2) on Linux, where they couldn't do that,
run an extra checksum verification on write.  I haven't fully
grokked all this yet, or even tried it, and it's not released or
anything, but it looks a bit like all three behaviours are bad for our
current hint bit design: on FreeBSD, setting a hint bit might crash
(?) if a write is in progress in another process, and on Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs).  I plan to
look more closely soon and see if I understood that right...
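
To make the checksum stability problem concrete, here is a toy sketch in Python. It is not PostgreSQL or OpenZFS code: zlib.crc32 stands in for whatever checksum the filesystem really uses, and the page size, the hint bit offset and all function names are made up for illustration. It shows a zero-copy write whose checksum goes stale when a hint bit is set mid-write, and a bounce-buffer write that stays consistent at the cost of a copy.

```python
import zlib

PAGE_SIZE = 8192          # assumption: one 8kB page
HINT_BIT_OFFSET = 100     # hypothetical byte that holds a hint bit


def checksum(buf) -> int:
    """Stand-in checksum; the real filesystem algorithm is different."""
    return zlib.crc32(bytes(buf))


def set_hint_bit(page: bytearray) -> None:
    """Simulates another backend setting a hint bit while a write is in flight."""
    page[HINT_BIT_OFFSET] |= 0x01


def write_zero_copy(page: bytearray, concurrent_modifier=None):
    """Checksum the caller's memory, then transfer that same memory later."""
    csum = checksum(page)
    if concurrent_modifier is not None:
        concurrent_modifier(page)       # page changes after the checksum was taken
    on_disk = bytes(page)               # the "device" sees the modified contents
    return on_disk, csum


def write_with_bounce_buffer(page: bytearray, concurrent_modifier=None):
    """Copy first, then checksum and transfer the private copy."""
    bounce = bytes(page)                # the extra copy that zero-copy wants to avoid
    csum = checksum(bounce)
    if concurrent_modifier is not None:
        concurrent_modifier(page)       # only the shared page changes, not the copy
    return bounce, csum


def verify(on_disk: bytes, csum: int) -> bool:
    """What a later read (or a verify-on-write step) would check."""
    return checksum(on_disk) == csum


if __name__ == "__main__":
    data, csum = write_zero_copy(bytearray(PAGE_SIZE), set_hint_bit)
    print("zero-copy, concurrent hint bit, checksum ok:", verify(data, csum))

    data, csum = write_with_bounce_buffer(bytearray(PAGE_SIZE), set_hint_bit)
    print("bounce buffer, concurrent hint bit, checksum ok:", verify(data, csum))
```

In this model, a verify-on-write step amounts to calling verify() before the write is acknowledged (the write fails), while skipping it means the mismatch only shows up when the block is read back.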

[1] https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda


