Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Date
Msg-id CA+hUKG+cKFKnn7wg_1=NJPw2ekdG+kw9Z4HcYR=XzehkT7n5kw@mail.gmail.com
Whole thread Raw
In response to Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
List pgsql-hackers
On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <michael@paquier.xyz> wrote:
> FWIW, when it comes to that we have a couple of routines that just use
> '0' to mean such a thing, aka palloc0().  I find 0002 confusing, as it
> introduces in fe_utils.c a new wrapper
> (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
> (pg_pwritev_with_retry) for pwrite().

What about pg_write_initial_zeros(fd, size)?  How it writes zeros (ie
using vector I/O, and retrying) seems to be an internal detail, no?
"initial" to make it clearer that it's at offset 0, or maybe
pg_pwrite_zeros(fd, size, offset).

> A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> designed to work on WAL segments initialization and it uses
> XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> in its name that tells us so.  This makes me question whether
> file_utils.c is a good location for this second thing.  Could a new
> file be a better location?  We have a xlogutils.c in the backend, and
> a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange.  We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack.  I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon).  The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

I think this should also handle the remainder after processing whole
blocks, just for completeness.  If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Next
From: Tomas Vondra
Date:
Subject: Re: logical decoding and replication of sequences