Thread: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

05 August 2022, 10:25:26

Hi,

I noticed that dir_open_for_write() in walmethods.c uses write() for
WAL file initialization (note that this code is used by pg_receivewal
and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
XLogFileInitInternal() to avoid partial writes. Do we need to fix
this?

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

06 August 2022, 06:41:18

On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:
> I noticed that dir_open_for_write() in walmethods.c uses write() for
> WAL file initialization (note that this code is used by pg_receivewal
> and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
> XLogFileInitInternal() to avoid partial writes. Do we need to fix
> this?

0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

> Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

07 August 2022, 01:12:11

On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:
> > I noticed that dir_open_for_write() in walmethods.c uses write() for
> > WAL file initialization (note that this code is used by pg_receivewal
> > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
> > XLogFileInitInternal() to avoid partial writes. Do we need to fix
> > this?
>
> 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
so that everyone can use it.

> > Thoughts?
>
> Makes sense to me for the WAL segment pre-padding initialization, as
> we still want to point to the beginning of the segment after we are
> done with the pre-padding, and the code has an extra lseek().

Thanks. I attached the v1 patch, please review it.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment

v1-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 August 2022, 02:12:56

On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
> so that everyone can use it.
>
> > > Thoughts?
> >
> > Makes sense to me for the WAL segment pre-padding initialization, as
> > we still want to point to the beginning of the segment after we are
> > done with the pre-padding, and the code has an extra lseek().
>
> Thanks. I attached the v1 patch, please review it.

Hi Bharath,

+1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
commit message should say that is happening.  Maybe the move should
even be in a separate patch (IMHO it's nice to separate mechanical
change patches from new logic/work patches).

+/*
+ * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
+ * file of size total_sz in batches of size block_sz.
+ */
+ssize_t
+pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)

Hmm, why not give it a proper name that says it writes zeroes?

Size arguments around syscall-like facilities are usually size_t.

+    memset(zbuffer.data, 0, block_sz);

I believe the modern fashion as of a couple of weeks ago is to tell
the compiler to zero-initialise.  Since it's a union you'd need
designated initialiser syntax, something like zbuffer = { .data = {0}
}?

+        iov[i].iov_base = zbuffer.data;
+        iov[i].iov_len = block_sz;

How can it be OK to use caller supplied block_sz, when
sizeof(zbuffer.data) is statically determined?  What is the point of
this argument?

-            dir_data->lasterrno = errno;
+            /* If errno isn't set, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;

This coding pattern is used in places where we blame short writes on
lack of disk space without bothering to retry.  But the code used in
this patch already handles short writes: it always retries, until
eventually, if you really are out of disk space, you should get an
actual ENOSPC from the operating system.  So I don't think the
guess-it-must-be-ENOSPC technique applies here.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

07 August 2022, 05:11:49

On Sun, Aug 7, 2022 at 7:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
> > so that everyone can use it.
> >
> > > > Thoughts?
> > >
> > > Makes sense to me for the WAL segment pre-padding initialization, as
> > > we still want to point to the beginning of the segment after we are
> > > done with the pre-padding, and the code has an extra lseek().
> >
> > Thanks. I attached the v1 patch, please review it.
>
> Hi Bharath,
>
> +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
> commit message should say that is happening.  Maybe the move should
> even be in a separate patch (IMHO it's nice to separate mechanical
> change patches from new logic/work patches).

Agree. I separated out the changes.

> +/*
> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
> + * file of size total_sz in batches of size block_sz.
> + */
> +ssize_t
> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)
>
> Hmm, why not give it a proper name that says it writes zeroes?

Done.

> Size arguments around syscall-like facilities are usually size_t.
>
> +    memset(zbuffer.data, 0, block_sz);
>
> I believe the modern fashion as of a couple of weeks ago is to tell
> the compiler to zero-initialise.  Since it's a union you'd need
> designated initialiser syntax, something like zbuffer = { .data = {0}
> }?

Hm, but we have many places still using memset(). If we were to change
these syntaxes, IMO, it must be done separately.

> +        iov[i].iov_base = zbuffer.data;
> +        iov[i].iov_len = block_sz;
>
> How can it be OK to use caller supplied block_sz, when
> sizeof(zbuffer.data) is statically determined?  What is the point of
> this argument?

Yes, removed block_sz function parameter.

> -            dir_data->lasterrno = errno;
> +            /* If errno isn't set, assume problem is no disk space */
> +            dir_data->lasterrno = errno ? errno : ENOSPC;
>
> This coding pattern is used in places where we blame short writes on
> lack of disk space without bothering to retry.  But the code used in
> this patch already handles short writes: it always retries, until
> eventually, if you really are out of disk space, you should get an
> actual ENOSPC from the operating system.  So I don't think the
> guess-it-must-be-ENOSPC technique applies here.

Done.

Thanks for reviewing. PSA v2 patch-set. Also,I added a CF entry
https://commitfest.postgresql.org/39/3803/ to give the patches a
chance to get tested.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

07 August 2022, 07:56:00

On Sun, Aug 07, 2022 at 10:41:49AM +0530, Bharath Rupireddy wrote:
> Agree. I separated out the changes.

+
+/*
+ * A convenience wrapper for pwritev() that retries on partial write.  If an
+ * error is returned, it is unspecified how much has been written.
+ */
+ssize_t
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)

If moving this routine, this could use a more explicit description,
especially on errno, for example, that could be consumed by the caller
on failure to know what's happening.

>> +/*
>> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
>> + * file of size total_sz in batches of size block_sz.
>> + */
>> +ssize_t
>> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)
>>
>> Hmm, why not give it a proper name that says it writes zeroes?
>
> Done.

FWIW, when it comes to that we have a couple of routines that just use
'0' to mean such a thing, aka palloc0().  I find 0002 confusing, as it
introduces in fe_utils.c a new wrapper
(pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
(pg_pwritev_with_retry) for pwrite().

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so.  This makes me question whether
file_utils.c is a good location for this second thing.  Could a new
file be a better location?  We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 August 2022, 09:49:19

On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <michael@paquier.xyz> wrote:
> FWIW, when it comes to that we have a couple of routines that just use
> '0' to mean such a thing, aka palloc0().  I find 0002 confusing, as it
> introduces in fe_utils.c a new wrapper
> (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
> (pg_pwritev_with_retry) for pwrite().

What about pg_write_initial_zeros(fd, size)?  How it writes zeros (ie
using vector I/O, and retrying) seems to be an internal detail, no?
"initial" to make it clearer that it's at offset 0, or maybe
pg_pwrite_zeros(fd, size, offset).

> A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> designed to work on WAL segments initialization and it uses
> XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> in its name that tells us so.  This makes me question whether
> file_utils.c is a good location for this second thing.  Could a new
> file be a better location?  We have a xlogutils.c in the backend, and
> a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange.  We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack.  I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon).  The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

I think this should also handle the remainder after processing whole
blocks, just for completeness.  If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

07 August 2022, 15:52:39

On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> > A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> > designed to work on WAL segments initialization and it uses
> > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> > in its name that tells us so.  This makes me question whether
> > file_utils.c is a good location for this second thing.  Could a new
> > file be a better location?  We have a xlogutils.c in the backend, and
> > a name similar to that in src/common/ would be one possibility.
>
> Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
> maybe it's OK to use BLCKSZ with a comment to say that it's a bit
> arbitrary, or maybe it's better to define a new zero buffer of some
> arbitrary size just in this code if that is too strange.  We could
> experiment with different size buffers to see how it performs, bearing
> in mind that every time we double it you halve the number of system
> calls, but also bearing in mind that at some point it's too much for
> the stack.  I can tell you that the way that code works today was not
> really written with performance in mind (unlike, say, the code
> reverted from 9.4 that tried to do this with posix_fallocate()), it
> was just finding an excuse to call pwritev(), to exercise new fallback
> code being committed for use by later AIO stuff (more patches coming
> soon).  The retry support was added because it seemed plausible that
> some system out there would start to do short writes as we cranked up
> the sizes for some implementation reason other than ENOSPC, so we
> should make a reusable retry routine.

Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
reduces the system calls to half (right now, pg_pwritev_with_retry()
gets called 64 times per 16MB WAL file, it writes in the batches of 32
blocks per call).

Is there a ready-to-use tool or script or specific settings for
pgbench (pgbench command line options or GUC settings) that I can play
with to measure the performance?

> I think this should also handle the remainder after processing whole
> blocks, just for completeness.  If I call the code as presented with size
> 8193, I think this code will only write 8192 bytes.

Hm, I will fix it.

> I think if this ever needs to work on O_DIRECT files there would be an
> alignment constraint on the buffer and size, but I don't think we have
> to worry about that for now.

We can add a comment about the above limitation, if required.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

08 August 2022, 12:40:23

On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> > > designed to work on WAL segments initialization and it uses
> > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> > > in its name that tells us so.  This makes me question whether
> > > file_utils.c is a good location for this second thing.  Could a new
> > > file be a better location?  We have a xlogutils.c in the backend, and
> > > a name similar to that in src/common/ would be one possibility.
> >
> > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
> > maybe it's OK to use BLCKSZ with a comment to say that it's a bit
> > arbitrary, or maybe it's better to define a new zero buffer of some
> > arbitrary size just in this code if that is too strange.  We could
> > experiment with different size buffers to see how it performs, bearing
> > in mind that every time we double it you halve the number of system
> > calls, but also bearing in mind that at some point it's too much for
> > the stack.  I can tell you that the way that code works today was not
> > really written with performance in mind (unlike, say, the code
> > reverted from 9.4 that tried to do this with posix_fallocate()), it
> > was just finding an excuse to call pwritev(), to exercise new fallback
> > code being committed for use by later AIO stuff (more patches coming
> > soon).  The retry support was added because it seemed plausible that
> > some system out there would start to do short writes as we cranked up
> > the sizes for some implementation reason other than ENOSPC, so we
> > should make a reusable retry routine.
>
> Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
> reduces the system calls to half (right now, pg_pwritev_with_retry()
> gets called 64 times per 16MB WAL file, it writes in the batches of 32
> blocks per call).
>
> Is there a ready-to-use tool or script or specific settings for
> pgbench (pgbench command line options or GUC settings) that I can play
> with to measure the performance?

I played with a simple insert use-case [1] that generates ~380 WAL
files, with different block sizes. To my surprise, I have not seen any
improvement with larger block sizes. I may be doing something wrong
here, suggestions on to test and see the benefits are welcome.

> > I think this should also handle the remainder after processing whole
> > blocks, just for completeness.  If I call the code as presented with size
> > 8193, I think this code will only write 8192 bytes.
>
> Hm, I will fix it.

Fixed.

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

09 August 2022, 07:30:35

On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I played with a simple insert use-case [1] that generates ~380 WAL
> files, with different block sizes. To my surprise, I have not seen any
> improvement with larger block sizes. I may be doing something wrong
> here, suggestions on to test and see the benefits are welcome.
>
> > > I think this should also handle the remainder after processing whole
> > > blocks, just for completeness.  If I call the code as presented with size
> > > 8193, I think this code will only write 8192 bytes.
> >
> > Hm, I will fix it.
>
> Fixed.
>
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
>
> Please review it further.

I tried to vary the zero buffer size to see if there's any huge
benefit for the WAL-generating queries. Unfortunately, I didn't see
any benefit on my dev system (16 vcore, 512GB SSD, 32GB RAM) . The use
case I've tried is at [1] and the results are at [2].

Having said that, the use of pg_pwritev_with_retry() in walmethods.c
will definitely reduce number of system calls - on HEAD the
dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ i.e. 16MB/8KB =
2,048 write() calls and with patch it makes only 64
pg_pwritev_with_retry() calls with XLOG_BLCKSZ zero buffer size. The
proposed patches will provide straight 32x reduction in system calls
(for pg_receivewal and pg_basebackup) apart from the safety against
partial writes.

[1]
/* built source code with release flags */
./configure --with-zlib --enable-depend --prefix=$PWD/inst/
--with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2'
> install.log && make -j 8 install > install.log 2>&1 &

\q
./pg_ctl -D data -l logfile stop
rm -rf data

/* ensured that nothing exists in OS page cache */
free -m
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit
free -m

./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";'
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";'
./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";'
./pg_ctl -D data -l logfile restart
./psql -d postgres -c 'create table foo(bar int);'
./psql -d postgres
\timing
insert into foo select * from generate_series(1, 100000000);   /* this
query generates about 385 WAL files, no checkpoint hence no recycle of
old WAL files, all new WAL files */

[2]
HEAD
Time: 84249.535 ms (01:24.250)

HEAD with wal_init_zero off
Time: 75086.300 ms (01:15.086)

#define    PWRITEV_BLCKSZ    XLOG_BLCKSZ
Time: 85254.302 ms (01:25.254)

#define    PWRITEV_BLCKSZ    (4 * XLOG_BLCKSZ)
Time: 83542.885 ms (01:23.543)

#define    PWRITEV_BLCKSZ    (16 * XLOG_BLCKSZ)
Time: 84035.770 ms (01:24.036)

#define    PWRITEV_BLCKSZ    (64 * XLOG_BLCKSZ)
Time: 84749.021 ms (01:24.749)

#define    PWRITEV_BLCKSZ    (256 * XLOG_BLCKSZ)
Time: 84273.466 ms (01:24.273)

#define    PWRITEV_BLCKSZ    (512 * XLOG_BLCKSZ)
Time: 84233.576 ms (01:24.234)

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

20 September 2022, 23:00:26

On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
> 
> Please review it further.

0001 looks reasonable to me.

+        errno = 0;
+        rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
appropriately.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work.  Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

20 September 2022, 23:06:08

On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote:
> On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
>> I'm attaching v5 patch-set. I've addressed review comments received so
>> far and fixed a compiler warning that CF bot complained about.
>> 
>> Please review it further.
> 
> 0001 looks reasonable to me.
> 
> +        errno = 0;
> +        rc = pg_pwritev_zeros(fd, pad_to_size);
> 
> Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
> appropriately.
> 
> +/*
> + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
> + * writing more bytes per pg_pwritev_with_retry() call is proven to be more
> + * performant.
> + */
> +#define PWRITEV_BLCKSZ  XLOG_BLCKSZ
> 
> This seems like something we should sort out now instead of leaving as
> future work.  Given your recent note, I think we should just use
> XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
> findings with different buffer sizes.

I also noticed that the latest patch set no longer applies, so I've marked
the commitfest entry as waiting-on-author.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

23 September 2022, 06:16:56

On Wed, Sep 21, 2022 at 4:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> 0001 looks reasonable to me.

Thanks for reviewing.

> +        errno = 0;
> +        rc = pg_pwritev_zeros(fd, pad_to_size);
>
> Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
> appropriately.

Right, pg_pwritev_zeros(), (rather pg_pwritev_with_retry() ensures
that pwritev() or pwrite()) sets the correct errno, please see
Thomas's comments [1] as well. Removed it.

> +/*
> + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
> + * writing more bytes per pg_pwritev_with_retry() call is proven to be more
> + * performant.
> + */
> +#define PWRITEV_BLCKSZ  XLOG_BLCKSZ
>
> This seems like something we should sort out now instead of leaving as
> future work.  Given your recent note, I think we should just use
> XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
> findings with different buffer sizes.

Agreed. Removed the new structure and added a comment.

Another change that I had to do was to allow lseek() even after
pwrite() (via pg_pwritev_zeros()) on Windows in walmethods.c. Without
this, the regression tests start to fail on Windows. And I figured out
that it's not an issue with this patch, it looks like an issue with
pwrite() implementation in win32pwrite.c, see the failure here [2], I
plan to start a separate thread to discuss this.

Please review the attached v4 patch set further.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com
[2] https://github.com/BRupireddy/postgres/tree/use_pwrite_without_lseek_on_windiws

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

23 September 2022, 20:24:39

+        PGAlignedXLogBlock zbuffer;
+
+        memset(zbuffer.data, 0, XLOG_BLCKSZ);

This seems excessive for only writing a single byte.

+#ifdef WIN32
+        /*
+         * XXX: It looks like on Windows, we need an explicit lseek() call here
+         * despite using pwrite() implementation from win32pwrite.c. Otherwise
+         * an error occurs.
+         */

I think this comment is too vague.  Can we describe the error in more
detail?  Or better yet, can we fix it as a prerequisite to this patch set?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

24 September 2022, 04:14:20

On Sat, Sep 24, 2022 at 8:24 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> +#ifdef WIN32
> +        /*
> +         * XXX: It looks like on Windows, we need an explicit lseek() call here
> +         * despite using pwrite() implementation from win32pwrite.c. Otherwise
> +         * an error occurs.
> +         */
>
> I think this comment is too vague.  Can we describe the error in more
> detail?  Or better yet, can we fix it as a prerequisite to this patch set?

Although WriteFile() with a synchronous file handle and an explicit
offset doesn't use the current file position, it appears that it still
changes it.  :-(

You'd think from the documentation[1] that that isn't the case, because it says:

"Considerations for working with synchronous file handles:

 * If lpOverlapped is NULL, the write operation starts at the current
file position and WriteFile does not return until the operation is
complete, and the system updates the file pointer before WriteFile
returns.

 * If lpOverlapped is not NULL, the write operation starts at the
offset that is specified in the OVERLAPPED structure and WriteFile
does not return until the write operation is complete. The system
updates the OVERLAPPED Internal and InternalHigh fields before
WriteFile returns."

So it's explicitly saying the file pointer is updated in the first
bullet point and not the second, but src/port/win32pwrite.c is doing
the second thing.  Yet the following assertion added to Bharath's code
fails[2]:

+++ b/src/bin/pg_basebackup/walmethods.c
@@ -221,6 +221,10 @@ dir_open_for_write(WalWriteMethod *wwmethod,
const char *pathname,
        if (pad_to_size && wwmethod->compression_algorithm ==
PG_COMPRESSION_NONE)
        {
                ssize_t rc;
+               off_t before_offset;
+               off_t after_offset;
+
+               before_offset = lseek(fd, 0, SEEK_CUR);

                rc = pg_pwritev_zeros(fd, pad_to_size);

@@ -231,6 +235,9 @@ dir_open_for_write(WalWriteMethod *wwmethod, const
char *pathname,
                        return NULL;
                }

+               after_offset = lseek(fd, 0, SEEK_CUR);
+               Assert(before_offset == after_offset);
+

[1] https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
[2]
https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

26 September 2022, 15:03:53

On Sat, Sep 24, 2022 at 9:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Although WriteFile() with a synchronous file handle and an explicit
> offset doesn't use the current file position, it appears that it still
> changes it.  :-(
>
> You'd think from the documentation[1] that that isn't the case, because it says:
>
> "Considerations for working with synchronous file handles:
>
>  * If lpOverlapped is NULL, the write operation starts at the current
> file position and WriteFile does not return until the operation is
> complete, and the system updates the file pointer before WriteFile
> returns.
>
>  * If lpOverlapped is not NULL, the write operation starts at the
> offset that is specified in the OVERLAPPED structure and WriteFile
> does not return until the write operation is complete. The system
> updates the OVERLAPPED Internal and InternalHigh fields before
> WriteFile returns."
>
> So it's explicitly saying the file pointer is updated in the first
> bullet point and not the second, but src/port/win32pwrite.c is doing
> the second thing.

The WriteFile() and pwrite() implementation in win32pwrite.c are
changing the file pointer on Windows, see the following and [4] for
more details.

# Running: pg_basebackup --no-sync -cfast -D
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/backup2
--no-manifest --waldir
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/xlog2
pg_basebackup: offset_before 0 and offset_after 16777216 aren't the same
Assertion failed: offset_before == offset_after, file
../src/bin/pg_basebackup/walmethods.c, line 254

Irrespective of what Windows does with file pointers in WriteFile(),
should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
something like [5]? This is rather hackish without fully knowing what
Windows does internally in WriteFile(), but this does fix inherent
issues that our pwrite() callers (there are quite a number of places
that use pwrite() and presumes file pointer doesn't change on Windows)
may have on Windows. See the regression tests passing [6] with the fix
[5].

> Yet the following assertion added to Bharath's code
> fails[2]:

That was a very quick patch though, here's another version adding
offset checks and an assertion [3], see the assertion failures here
[4].

I also think that we need to discuss this issue separately.

Thoughts?

> [1]
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
> [2]
https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
[3] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v2
[4] -
https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
[5]
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 7f2e62e8a7..542b548279 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset)
                return -1;
        }

+       /*
+        * On Windows, it is found that WriteFile() changes the file
pointer and we
+        * want pwrite() to not change. Hence, we explicitly reset the
file pointer
+        * to beginning of the file.
+        */
+       if (lseek(fd, 0, SEEK_SET) != 0)
+       {
+               _dosmaperr(GetLastError());
+               return -1;
+       }
+
        return result;
 }
[6] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v3

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

26 September 2022, 21:27:04

On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:
> Irrespective of what Windows does with file pointers in WriteFile(),
> should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
> something like [5]? This is rather hackish without fully knowing what
> Windows does internally in WriteFile(), but this does fix inherent
> issues that our pwrite() callers (there are quite a number of places
> that use pwrite() and presumes file pointer doesn't change on Windows)
> may have on Windows. See the regression tests passing [6] with the fix
> [5].

I think so.  I don't see why we would rather have each caller ensure
pwrite() behaves as documented.

> +       /*
> +        * On Windows, it is found that WriteFile() changes the file
> pointer and we
> +        * want pwrite() to not change. Hence, we explicitly reset the
> file pointer
> +        * to beginning of the file.
> +        */
> +       if (lseek(fd, 0, SEEK_SET) != 0)
> +       {
> +               _dosmaperr(GetLastError());
> +               return -1;
> +       }
> +
>         return result;
>  }

Why reset to the beginning of the file?  Shouldn't we reset it to what it
was before the call to pwrite()?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

26 September 2022, 21:37:38

On Tue, Sep 27, 2022 at 10:27 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:
> > Irrespective of what Windows does with file pointers in WriteFile(),
> > should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
> > something like [5]? This is rather hackish without fully knowing what
> > Windows does internally in WriteFile(), but this does fix inherent
> > issues that our pwrite() callers (there are quite a number of places
> > that use pwrite() and presumes file pointer doesn't change on Windows)
> > may have on Windows. See the regression tests passing [6] with the fix
> > [5].
>
> I think so.  I don't see why we would rather have each caller ensure
> pwrite() behaves as documented.

I don't think so, that's an extra kernel call.  I think I'll just have
to revert part of my recent change that removed the pg_ prefix from
those function names in our code, and restore the comment that warns
you about the portability hazard (I thought it went away with HP-UX
10, where we were literally calling lseek() before every write()).
The majority of users of these functions don't intermix them with
calls to read()/write(), so they don't care about the file position,
so I think it's just something we'll have to continue to be mindful of
in the places that do.

Unless, that is, I can find a way to stop it from doing that...  I've
added this to my Windows to-do list.  I am going to have a round of
Windows hacking quite soon.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

26 September 2022, 21:55:31

On Tue, Sep 27, 2022 at 10:37:38AM +1300, Thomas Munro wrote:
> I don't think so, that's an extra kernel call.  I think I'll just have
> to revert part of my recent change that removed the pg_ prefix from
> those function names in our code, and restore the comment that warns
> you about the portability hazard (I thought it went away with HP-UX
> 10, where we were literally calling lseek() before every write()).
> The majority of users of these functions don't intermix them with
> calls to read()/write(), so they don't care about the file position,
> so I think it's just something we'll have to continue to be mindful of
> in the places that do.

Ah, you're right, it's probably best to avoid the extra system call for the
majority of callers that don't care about the file position.  I retract my
previous message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

27 September 2022, 05:42:59

On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I don't think so, that's an extra kernel call.  I think I'll just have
> to revert part of my recent change that removed the pg_ prefix from
> those function names in our code, and restore the comment that warns
> you about the portability hazard (I thought it went away with HP-UX
> 10, where we were literally calling lseek() before every write()).
> The majority of users of these functions don't intermix them with
> calls to read()/write(), so they don't care about the file position,
> so I think it's just something we'll have to continue to be mindful of
> in the places that do.

Yes, all of the existing pwrite() callers don't care about the file
position, but the new callers such as the actual idea and patch
proposed here in this thread cares.

Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of
which [1] you're planning to revert? If so, will we have the
pg_{pwrite, pread} back? IIUC, we don't have lseek(SEEK_SET) in
pg_{pwrite, pread} right? It is the callers responsibility to set the
file position correctly if they wish to, isn't it? Oftentimes,
developers miss the notes in the function comments and use these
functions expecting them to not change file position which works well
on non-Windows platforms but fails on Windows.

This makes me think that we can have pwrite(), pread() introduced by
cf112c122060568aa06efe4e6e6fb9b2dd4f1090 as-is and re-introduce
pg_{pwrite, pread} with pwrite()/pread()+lseek(SEEK_SET) in
win32pwrite.c and win32pread.c. These functions reduce the caller's
efforts and reduce the duplicate code. If okay, I'm happy to lend my
hands on this patch.

Thoughts?

> Unless, that is, I can find a way to stop it from doing that...  I've
> added this to my Windows to-do list.  I am going to have a round of
> Windows hacking quite soon.

Thanks! I think it's time for me to start a new thread just for this
to get more attention and opinion from other hackers and not to
sidetrack the original idea and patch proposed in this thread.

[1] https://github.com/postgres/postgres/blob/REL_14_STABLE/src/port/pwrite.c#L11

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

27 September 2022, 09:30:22

On Tue, Sep 27, 2022 at 6:43 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I don't think so, that's an extra kernel call.  I think I'll just have
> > to revert part of my recent change that removed the pg_ prefix from
> > those function names in our code, and restore the comment that warns
> > you about the portability hazard (I thought it went away with HP-UX
> > 10, where we were literally calling lseek() before every write()).
> > The majority of users of these functions don't intermix them with
> > calls to read()/write(), so they don't care about the file position,
> > so I think it's just something we'll have to continue to be mindful of
> > in the places that do.
>
> Yes, all of the existing pwrite() callers don't care about the file
> position, but the new callers such as the actual idea and patch
> proposed here in this thread cares.
>
> Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of
> which [1] you're planning to revert?

Yeah, just the renaming parts of that.  The lseek()-based emulation is
definitely not coming back.  Something like the attached.

Attachment

0001-Restore-pg_pread-and-friends.txt

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

27 September 2022, 12:03:33

On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Something like the attached.

Isn't it also better to add notes in win32pread.c and win32pwrite.c
about the callers doing lseek(SEEK_SET) if they wish to and Windows
implementations changing file position? We can also add a TODO item
about replacing pg_ versions with pread and friends someday when
Windows fixes the issue? Having it in the commit and include/port.h is
good, but the comments in win32pread.c and win32pwrite.c make life
easier IMO. Otherwise, the patch LGTM.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

27 September 2022, 19:01:26

On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:>
> On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Something like the attached.
>
> Isn't it also better to add notes in win32pread.c and win32pwrite.c
> about the callers doing lseek(SEEK_SET) if they wish to and Windows
> implementations changing file position? We can also add a TODO item
> about replacing pg_ versions with pread and friends someday when
> Windows fixes the issue? Having it in the commit and include/port.h is
> good, but the comments in win32pread.c and win32pwrite.c make life
> easier IMO. Otherwise, the patch LGTM.

Thanks, will do.  FWIW I doubt the OS itself will change released
behaviour like that, but I did complain about the straight-up
misleading documentation and they agreed and fixed it[1].

After some looking around, the only way I could find to avoid this
behaviour is to switch to async handles, which do behave as documented
in this respect, but then you can't use plain read/write at all unless
you write replacement wrappers for those too, and AFAICT that adds
more system calls (start write, wait for write to finish) and
complexity/weirdness without any real payoff so it seems like the
wrong direction, at least without more research that I'm not pursuing
currently.  (FWIW in AIO porting experiments a while back we used
async handles to get IOs running concurrently with CPU work and
possibly other IOs so it was actually worth doing, but that was only
for smgr and the main wal writing code where there's no intermixed
plain read/write calls as you have here, so the problem doesn't even
come up.)

[1] https://github.com/MicrosoftDocs/sdk-api/pull/1309

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

28 September 2022, 04:43:20

On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:>
> > On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Something like the attached.
> >
> > Isn't it also better to add notes in win32pread.c and win32pwrite.c
> > about the callers doing lseek(SEEK_SET) if they wish to and Windows
> > implementations changing file position? We can also add a TODO item
> > about replacing pg_ versions with pread and friends someday when
> > Windows fixes the issue? Having it in the commit and include/port.h is
> > good, but the comments in win32pread.c and win32pwrite.c make life
> > easier IMO. Otherwise, the patch LGTM.
>
> Thanks, will do.

I'm looking forward to getting it in.

> FWIW I doubt the OS itself will change released
> behaviour like that, but I did complain about the straight-up
> misleading documentation and they agreed and fixed it[1].
>
> [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309

Great! Is there any plan to request for change in WriteFile() to not
alter file position?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

28 September 2022, 06:11:52

On Wed, Sep 28, 2022 at 5:43 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > FWIW I doubt the OS itself will change released
> > behaviour like that, but I did complain about the straight-up
> > misleading documentation and they agreed and fixed it[1].
> >
> > [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309
>
> Great! Is there any plan to request for change in WriteFile() to not
> alter file position?

Not from me.  I stick to open source problems.  Reporting bugs in
documentation is legitimate self defence, though.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

29 September 2022, 06:02:32

On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> +        PGAlignedXLogBlock zbuffer;
> +
> +        memset(zbuffer.data, 0, XLOG_BLCKSZ);
>
> This seems excessive for only writing a single byte.

Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
wal_segment_size - 1).

> +#ifdef WIN32
> +        /*
> +         * XXX: It looks like on Windows, we need an explicit lseek() call here
> +         * despite using pwrite() implementation from win32pwrite.c. Otherwise
> +         * an error occurs.
> +         */
>
> I think this comment is too vague.  Can we describe the error in more
> detail?  Or better yet, can we fix it as a prerequisite to this patch set?

The commit b6d8a60aba322678585ebe11dab072a37ac32905 brings back
pg_pwrite() and its friends. This puts the responsibility of doing
lseek(SEEK_SET) on the callers if they wish to.

Please see the v5 patch set and review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

29 September 2022, 17:27:11

On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>> +        PGAlignedXLogBlock zbuffer;
>> +
>> +        memset(zbuffer.data, 0, XLOG_BLCKSZ);
>>
>> This seems excessive for only writing a single byte.
> 
> Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> wal_segment_size - 1).

I don't think removing the use of PGAlignedXLogBlock here introduces any
sort of alignment risk, so this should be alright.

+#ifdef WIN32
+        /*
+         * WriteFile() on Windows changes the current file position, hence we
+         * need an explicit lseek() here. See pg_pwrite() implementation in
+         * win32pwrite.c for more details.
+         */

Should we really surround this with a WIN32 check, or should we just
unconditionally lseek() here?  I understand that this helps avoid an extra
system call on many platforms, but in theory another platform introduced in
the future could have the same problem, and this seems like something that
could easily be missed.  Presumably we could do something fancier to
indicate pg_pwrite()'s behavior in this regard, but I don't know if that
sort of complexity is really worth it in order to save an lseek().

+        iov[0].iov_base = zbuffer.data;

This seems superfluous, but I don't think it's hurting anything.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

30 September 2022, 01:15:30

On Fri, Sep 30, 2022 at 6:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> > On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> +        PGAlignedXLogBlock zbuffer;
> >> +
> >> +        memset(zbuffer.data, 0, XLOG_BLCKSZ);
> >>
> >> This seems excessive for only writing a single byte.
> >
> > Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> > wal_segment_size - 1).
>
> I don't think removing the use of PGAlignedXLogBlock here introduces any
> sort of alignment risk, so this should be alright.
>
> +#ifdef WIN32
> +        /*
> +         * WriteFile() on Windows changes the current file position, hence we
> +         * need an explicit lseek() here. See pg_pwrite() implementation in
> +         * win32pwrite.c for more details.
> +         */
>
> Should we really surround this with a WIN32 check, or should we just
> unconditionally lseek() here?  I understand that this helps avoid an extra
> system call on many platforms, but in theory another platform introduced in
> the future could have the same problem, and this seems like something that
> could easily be missed.  Presumably we could do something fancier to
> indicate pg_pwrite()'s behavior in this regard, but I don't know if that
> sort of complexity is really worth it in order to save an lseek().

+1 for just doing it always, with a one-liner comment like
"pg_pwritev*() might move the file position".  No reason to spam the
source tree with more explanations of the exact reason.  If someone
ever comes up with another case where p- and non-p- I/O functions are
intermixed and it's really worth saving a system call (don't get me
wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
that, but this case isn't hot?) then I like your idea of a macro to
tell you whether you need to.

Earlier I wondered why we'd want to include "pg_pwritev" in the name
of this zero-filling function (pwritev being an internal
implementation detail), but now it seems like maybe a good idea
because it highlights the file position portability problem by being a
member of that family of similarly-named functions.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

30 September 2022, 02:39:04

On Fri, Sep 30, 2022 at 6:46 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> +1 for just doing it always, with a one-liner comment like
> "pg_pwritev*() might move the file position".  No reason to spam the
> source tree with more explanations of the exact reason.

+1 for resetting the file position in a platform-independent manner.
But, a description there won't hurt IMO and it saves time for the
hackers who spend time there and think why it's that way.

> If someone
> ever comes up with another case where p- and non-p- I/O functions are
> intermixed and it's really worth saving a system call (don't get me
> wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
> that, but this case isn't hot?) then I like your idea of a macro to
> tell you whether you need to.

I don't think we go that route as the code isn't a hot path and an
extra system call wouldn't hurt performance much, a comment there
should work.

> Earlier I wondered why we'd want to include "pg_pwritev" in the name
> of this zero-filling function (pwritev being an internal
> implementation detail), but now it seems like maybe a good idea
> because it highlights the file position portability problem by being a
> member of that family of similarly-named functions.

Hm.

On Thu, Sep 29, 2022 at 10:57 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> +        iov[0].iov_base = zbuffer.data;
>
> This seems superfluous, but I don't think it's hurting anything.

Yes, I removed it. Adding a comment, something like [1], would make it
more verbose, hence I've not added.

I'm attaching the v6 patch set, please review it further.

[1]
        /*
         * Use the first vector buffer to write the remaining size. Note that
         * zero buffer was already pointed to it above, hence just specifying
         * the size is enough here.
         */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Nathan Bossart

Date:

30 September 2022, 03:09:56

On Fri, Sep 30, 2022 at 08:09:04AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v6 patch set, please review it further.

Looks reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

27 October 2022, 05:54:00

On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> Looks reasonable to me.

0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
applied.

Regarding 0002, using pg_pwrite_zeros() as a routine name, as
suggested by Thomas, sounds good to me.  However, I am not really a
fan of its dependency with PGAlignedXLogBlock, because it should be
able to work with any buffers of any sizes, as long as the input
buffer is aligned, shouldn't it?  For example, what about
PGAlignedBlock?  So, should we make this more extensible?  My guess
would be the addition of the block size and the block pointer to the
arguments of pg_pwrite_zeros(), in combination with a check to make
sure that the input buffer is MAXALIGN()'d (with an Assert() rather
than just an elog/pg_log_error?).
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

27 October 2022, 09:27:47

On Thu, Oct 27, 2022 at 11:24 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> > Looks reasonable to me.
>
> 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
> applied.

Thanks.

> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me.

Changed.

> However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it?  For example, what about
> PGAlignedBlock?  So, should we make this more extensible? My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

+1 to pass in the aligned buffer, its size and an assertion on the buffer size.

Please see the attached v7 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v7-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

27 October 2022, 22:58:25

Hi,

Interestingly, I also needed something like pg_pwrite_zeros() today. Exposed
via smgr, for more efficient relation extensions.

On 2022-10-27 14:54:00 +0900, Michael Paquier wrote:
> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me.  However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it?  For example, what about
> PGAlignedBlock?  So, should we make this more extensible?  My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

I don't like passing in the buffer. That leads to code like in Bharat's latest
version, where we now zero that buffer on every invocation of
pg_pwrite_zeros() - not at all cheap. And every caller has to have provisions
to provide that buffer.

The block sizes don't need to match, do they? As long as the block is properly
aligned, we can change the iov_len of the final iov to match whatever the size
is being passed in, no?

Why don't we define a

static PGAlignedBlock source_of_zeroes;

in file_utils.c, and use that in pg_pwrite_zeros(), being careful to set the
iov_len arguments correctly?

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

28 October 2022, 02:09:38

On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> The block sizes don't need to match, do they? As long as the block is properly
> aligned, we can change the iov_len of the final iov to match whatever the size
> is being passed in, no?

Hmm.  Based on what Bharath has written upthread, it does not seem to
matter if the size of the aligned block changes, either:
https://www.postgresql.org/message-id/CALj2ACUccjR7KbKqWOsQmqH1ZGEDyJ7hH5Ef+DOhcv7+kOnjCQ@mail.gmail.com

I am honestly not sure whether it is a good idea to make file_utils.c
depend on one of the compile-time page sizes in this routine, be it
the page size of the WAL page size, as pg_write_zeros() would be used
for some rather low-level operations.  But we could as well just use a
locally-defined structure with a buffer at 4kB or 8kB and call it a
day?
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

28 October 2022, 06:08:51

On Fri, Oct 28, 2022 at 7:39 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is properly
> > aligned, we can change the iov_len of the final iov to match whatever the size
> > is being passed in, no?
>
> Hmm. Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
> https://www.postgresql.org/message-id/CALj2ACUccjR7KbKqWOsQmqH1ZGEDyJ7hH5Ef+DOhcv7+kOnjCQ@mail.gmail.com
>
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size of the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations. But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

+1. Please see the attached v8 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v8-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

28 October 2022, 08:00:23

On Fri, Oct 28, 2022 at 11:38:51AM +0530, Bharath Rupireddy wrote:
> +1. Please see the attached v8 patch.

+   char        data[PG_WRITE_BLCKSZ];
+   double      force_align_d;
+   int64       force_align_i64;
+} PGAlignedWriteBlock;
I have not checked in details, but that should do the job.

Andres, Thomas, Nathan, perhaps you have a different view on the
matter?
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

28 October 2022, 22:07:00

Hi,

On 2022-10-28 11:09:38 +0900, Michael Paquier wrote:
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is properly
> > aligned, we can change the iov_len of the final iov to match whatever the size
> > is being passed in, no?
> 
> Hmm.  Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
> https://www.postgresql.org/message-id/CALj2ACUccjR7KbKqWOsQmqH1ZGEDyJ7hH5Ef+DOhcv7+kOnjCQ@mail.gmail.com
> 
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size of the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations.  But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

29 October 2022, 06:24:02

On Sat, Oct 29, 2022 at 3:37 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-28 11:09:38 +0900, Michael Paquier wrote:
> > On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > > The block sizes don't need to match, do they? As long as the block is properly
> > > aligned, we can change the iov_len of the final iov to match whatever the size
> > > is being passed in, no?
> >
> > Hmm.  Based on what Bharath has written upthread, it does not seem to
> > matter if the size of the aligned block changes, either:
> > https://www.postgresql.org/message-id/CALj2ACUccjR7KbKqWOsQmqH1ZGEDyJ7hH5Ef+DOhcv7+kOnjCQ@mail.gmail.com
> >
> > I am honestly not sure whether it is a good idea to make file_utils.c
> > depend on one of the compile-time page sizes in this routine, be it
> > the page size of the WAL page size, as pg_write_zeros() would be used
> > for some rather low-level operations.  But we could as well just use a
> > locally-defined structure with a buffer at 4kB or 8kB and call it a
> > day?
>
> Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock.

Hm. I tend to agree with Andres here, i.e. using PGAlignedBlock is
sufficient. It seems like we are using PGAlignedBlock for heap, index,
history file, fsm, visibility map file pages as well.

Please see the attached v9 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v9-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Alvaro Herrera

Date:

30 October 2022, 14:44:32

On 2022-Oct-27, Michael Paquier wrote:

> On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> > Looks reasonable to me.
> 
> 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
> applied.

Maybe something a bit useless, but while perusing the commits I noticed
a forward struct declaration was moved from one file to another; this is
claimed to avoid including port/pg_iovec.h in common/file_utils.h.  We
do that kind of thing in a few places, but in this particular case it
seems a bit of a pointless exercise, since pg_iovec.h doesn't include
anything else and it's a quite simple header.

So I'm kinda proposing that we only do the forward struct initialization
dance when it really saves on things -- in particular, when it helps
avoid or reduce massive indirect header inclusion.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Attachment

0001-No-need-to-avoid-including-pg_iovec.h.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

30 October 2022, 23:31:08

On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> So I'm kinda proposing that we only do the forward struct initialization
> dance when it really saves on things -- in particular, when it helps
> avoid or reduce massive indirect header inclusion.

Sure.

>  extern ssize_t pg_pwritev_with_retry(int fd,
> -                                     const struct iovec *iov,
> +                                     const iovec *iov,
>                                       int iovcnt,
>                                       off_t offset);

However this still needs to be defined as a struct, no?
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

31 October 2022, 06:20:27

On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> > So I'm kinda proposing that we only do the forward struct initialization
> > dance when it really saves on things -- in particular, when it helps
> > avoid or reduce massive indirect header inclusion.
>
> Sure.

I don't think including pg_iovec.h in file_utils.h is a good idea. I
agree that pg_iovec.h is fairly a small header file but file_utils.h
is included in 21 .c files, as of today and the file_utils.h footprint
might increase in future. Therefore, I'd vote for forward struct
initialization as it is on HEAD today.

> >  extern ssize_t pg_pwritev_with_retry(int fd,
> > -                                     const struct iovec *iov,
> > +                                     const iovec *iov,
> >                                       int iovcnt,
> >                                       off_t offset);
>
> However this still needs to be defined as a struct, no?

Yes, we need a struct there because we haven't typedef'ed struct iovec.

Also, the patch forgets to remove "port/pg_iovec.h" from file_utils.c

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

01 November 2022, 03:02:48

On Mon, Oct 31, 2022 at 11:50 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> > > So I'm kinda proposing that we only do the forward struct initialization
> > > dance when it really saves on things -- in particular, when it helps
> > > avoid or reduce massive indirect header inclusion.
> >
> > Sure.
>
> I don't think including pg_iovec.h in file_utils.h is a good idea. I
> agree that pg_iovec.h is fairly a small header file but file_utils.h
> is included in 21 .c files, as of today and the file_utils.h footprint
> might increase in future. Therefore, I'd vote for forward struct
> initialization as it is on HEAD today.

I'm attaching the v9 patch from upthread here again for further review
and to make CF bot happy.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v9-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

08 November 2022, 04:07:57

On Tue, Nov 01, 2022 at 08:32:48AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v9 patch from upthread here again for further review
> and to make CF bot happy.

So, I have looked at that, and at the end concluded that Andres'
suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
in the long run.  Thomas has mentioned upthread that some of the
comments don't need to be that long, so I have tweaked these to be
minimal, and updated a few more areas.  Note that this has been split
into two commits: one to introduce the new routine in file_utils.c and
a second for the switch in walmethods.c.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

John Naylor

Date:

11 November 2022, 05:44:14

On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> So, I have looked at that, and at the end concluded that Andres'
> suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> in the long run. Thomas has mentioned upthread that some of the
> comments don't need to be that long, so I have tweaked these to be
> minimal, and updated a few more areas. Note that this has been split
> into two commits: one to introduce the new routine in file_utils.c and
> a second for the switch in walmethods.c.

Was there supposed to be an attachment here?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

11 November 2022, 06:23:08

On Fri, Nov 11, 2022 at 11:14 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > So, I have looked at that, and at the end concluded that Andres'
> > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> > in the long run.  Thomas has mentioned upthread that some of the
> > comments don't need to be that long, so I have tweaked these to be
> > minimal, and updated a few more areas.  Note that this has been split
> > into two commits: one to introduce the new routine in file_utils.c and
> > a second for the switch in walmethods.c.
>
> Was there supposed to be an attachment here?

Nope. The patches have already been committed -
3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
28cc2976a9cf0ed661dbc55f49f669192cce1c89.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

11 November 2022, 07:11:31

On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> On Fri, Nov 11, 2022 at 11:14 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>> Was there supposed to be an attachment here?
>
> Nope. The patches have already been committed -
> 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> 28cc2976a9cf0ed661dbc55f49f669192cce1c89.

The committed patches are pretty much the same as the last version
sent on this thread, except that the changes have been split across
the files they locally impact, with a few simplifications tweaks to
the comments.  Hencem I did not see any need to send a new version for
this case.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

John Naylor

Date:

11 November 2022, 09:21:31

On Fri, Nov 11, 2022 at 2:12 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> > On Fri, Nov 11, 2022 at 11:14 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> >> Was there supposed to be an attachment here?
> >
> > Nope. The patches have already been committed -
> > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> > 28cc2976a9cf0ed661dbc55f49f669192cce1c89.
>
> The committed patches are pretty much the same as the last version
> sent on this thread, except that the changes have been split across
> the files they locally impact, with a few simplifications tweaks to
> the comments. Hencem I did not see any need to send a new version for
> this case.

Ah, I missed that -- sorry for the noise.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

11 February 2023, 22:44:24

Hi,

On 2022-11-01 08:32:48 +0530, Bharath Rupireddy wrote:
> +/*
> + * pg_pwrite_zeros
> + *
> + * Writes zeros to a given file. Input parameters are "fd" (file descriptor of
> + * the file), "size" (size of the file in bytes).
> + *
> + * On failure, a negative value is returned and errno is set appropriately so
> + * that the caller can use it accordingly.
> + */
> +ssize_t
> +pg_pwrite_zeros(int fd, size_t size)
> +{
> +    PGAlignedBlock    zbuffer;
> +    size_t    zbuffer_sz;
> +    struct iovec    iov[PG_IOV_MAX];
> +    int        blocks;
> +    size_t    remaining_size = 0;
> +    int        i;
> +    ssize_t    written;
> +    ssize_t    total_written = 0;
> +
> +    zbuffer_sz = sizeof(zbuffer.data);
> +
> +    /* Zero-fill the buffer. */
> +    memset(zbuffer.data, 0, zbuffer_sz);

I previously commented on this - why are we memseting a buffer on every call
to this?  That's not at all free.

Something like
    static const PGAlignedBlock zerobuf = {0};
would do the trick.  You do need to cast the const away, to assign to
iov_base, but that's not too ugly.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

12 February 2023, 14:29:00

On Sun, Feb 12, 2023 at 4:14 AM Andres Freund <andres@anarazel.de> wrote:
>
> > +ssize_t
> > +pg_pwrite_zeros(int fd, size_t size)
> > +{
> > +     PGAlignedBlock  zbuffer;
> > +     size_t  zbuffer_sz;
> > +     struct iovec    iov[PG_IOV_MAX];
> > +     int             blocks;
> > +     size_t  remaining_size = 0;
> > +     int             i;
> > +     ssize_t written;
> > +     ssize_t total_written = 0;
> > +
> > +     zbuffer_sz = sizeof(zbuffer.data);
> > +
> > +     /* Zero-fill the buffer. */
> > +     memset(zbuffer.data, 0, zbuffer_sz);
>
> I previously commented on this - why are we memseting a buffer on every call
> to this?  That's not at all free.
>
> Something like
>     static const PGAlignedBlock zerobuf = {0};
> would do the trick.  You do need to cast the const away, to assign to
> iov_base, but that's not too ugly.

Thanks for looking at it. We know that we don't change the zbuffer in
the function, so can we avoid static const and have just a static
variable, like the attached
v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
you see any problem with it?

FWIW, it comes out like the attached
v1-0001-Use-static-const-variable-to-avoid-memset-calls-i.patch with
static const.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

12 February 2023, 17:31:36

Hi,

On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> Thanks for looking at it. We know that we don't change the zbuffer in
> the function, so can we avoid static const and have just a static
> variable, like the attached
> v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> you see any problem with it?

Making it static const is better, because it allows the memory for the
variable to be put in a readonly section.


>      /* Prepare to write out a lot of copies of our zero buffer at once. */
>      for (i = 0; i < lengthof(iov); ++i)
>      {
> -        iov[i].iov_base = zbuffer.data;
> +        iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
>          iov[i].iov_len = zbuffer_sz;
>      }

Another thing: I think we should either avoid iterating over all the IOVs if
we don't need them, or, even better, initialize the array as a constant, once.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

13 February 2023, 04:08:14

On Sun, Feb 12, 2023 at 09:31:36AM -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

Where you imply that we'd still use memset() once on iov[PG_IOV_MAX],
right?
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

13 February 2023, 04:45:03

On Sun, Feb 12, 2023 at 11:01 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > Thanks for looking at it. We know that we don't change the zbuffer in
> > the function, so can we avoid static const and have just a static
> > variable, like the attached
> > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> > you see any problem with it?
>
> Making it static const is better, because it allows the memory for the
> variable to be put in a readonly section.

Okay.

> >       /* Prepare to write out a lot of copies of our zero buffer at once. */
> >       for (i = 0; i < lengthof(iov); ++i)
> >       {
> > -             iov[i].iov_base = zbuffer.data;
> > +             iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
> >               iov[i].iov_len = zbuffer_sz;
> >       }
>
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

How about like the attached patch? It makes the iovec static variable
and points the zero buffer only once/for the first time to iovec. This
avoids for-loop on every call.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v1-0001-Optimize-zero-buffer-handling-in-pg_pwrite_zeros.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Kyotaro Horiguchi

Date:

13 February 2023, 09:33:34

At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Sun, Feb 12, 2023 at 11:01 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > >       /* Prepare to write out a lot of copies of our zero buffer at once. */
> > >       for (i = 0; i < lengthof(iov); ++i)
> > >       {
> > > -             iov[i].iov_base = zbuffer.data;
> > > +             iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
> > >               iov[i].iov_len = zbuffer_sz;
> > >       }
> >
> > Another thing: I think we should either avoid iterating over all the IOVs if
> > we don't need them, or, even better, initialize the array as a constant, once.

FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension?
[*1]) for constant array initialization, but individual members don't
accept assigning a const value, thus I did deconstify as the follows.

>    static const struct iovec    iov[PG_IOV_MAX] =
>        {[0 ... PG_IOV_MAX - 1] =
>         {
>             .iov_base = (void *)&zbuffer.data,
>             .iov_len = BLCKSZ
>         }
>        };

I didn't checked the actual mapping, but if I tried an assignment
"iov[0].iov_base = NULL", it failed as "assignment of member
‘iov_base’ in read-only object", so is it successfully placed in a
read-only segment?

Later code assigns iov[0].iov_len thus we need to provide a separate
iov non-const variable, or can we use pwrite instead there?  (I didn't
find pg_pwrite_with_retry(), though)

> How about like the attached patch? It makes the iovec static variable
> and points the zero buffer only once/for the first time to iovec. This
> avoids for-loop on every call.

As the patch itself, it seems forgetting to reset iov[0].iov_len after
writing a partial block.

retards.

*1: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

13 February 2023, 17:39:47

On 2023-02-13 18:33:34 +0900, Kyotaro Horiguchi wrote:
> At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> > On Sun, Feb 12, 2023 at 11:01 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > > >       /* Prepare to write out a lot of copies of our zero buffer at once. */
> > > >       for (i = 0; i < lengthof(iov); ++i)
> > > >       {
> > > > -             iov[i].iov_base = zbuffer.data;
> > > > +             iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, &zbuffer)->data);
> > > >               iov[i].iov_len = zbuffer_sz;
> > > >       }
> > >
> > > Another thing: I think we should either avoid iterating over all the IOVs if
> > > we don't need them, or, even better, initialize the array as a constant, once.
> 
> FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension?
> [*1]) for constant array initialization, but individual members don't
> accept assigning a const value, thus I did deconstify as the follows.
> 
> >    static const struct iovec    iov[PG_IOV_MAX] =
> >        {[0 ... PG_IOV_MAX - 1] =
> >         {
> >             .iov_base = (void *)&zbuffer.data,
> >             .iov_len = BLCKSZ
> >         }
> >        };
> 
> I didn't checked the actual mapping, but if I tried an assignment
> "iov[0].iov_base = NULL", it failed as "assignment of member
> ‘iov_base’ in read-only object", so is it successfully placed in a
> read-only segment?
> 
> Later code assigns iov[0].iov_len thus we need to provide a separate
> iov non-const variable, or can we use pwrite instead there?  (I didn't
> find pg_pwrite_with_retry(), though)

Given that we need to do that, and given that we already need to loop to
handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
worth avoiding iov initialization.

But I think it's worth limiting the initialization to blocks.

I'd also try to combine the first pg_writev_* with the second one.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

14 February 2023, 01:10:56

Hi,

On 2023-02-12 09:31:36 -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
offset parameter.  Huh, what lead to the function being so constrained?

- Andres

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

14 February 2023, 07:06:24

On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote:
> I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
> offset parameter.  Huh, what lead to the function being so constrained?

Its current set of uses cases, where we only use it now to initialize
with zeros with WAL segments.  If you have a case that plans to use
that stuff with an offset, no problem with me.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

14 February 2023, 12:30:00

On Mon, Feb 13, 2023 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Later code assigns iov[0].iov_len thus we need to provide a separate
> > iov non-const variable, or can we use pwrite instead there?  (I didn't
> > find pg_pwrite_with_retry(), though)
>
> Given that we need to do that, and given that we already need to loop to
> handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
> worth avoiding iov initialization.
>
> But I think it's worth limiting the initialization to blocks.

We can still optimize away the for loop by using a single iovec for
remaining size, like the attached v2 patch.

> I'd also try to combine the first pg_writev_* with the second one.

Done, PSA v2 patch.

On Tue, Feb 14, 2023 at 6:40 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-02-12 09:31:36 -0800, Andres Freund wrote:
> > Another thing: I think we should either avoid iterating over all the IOVs if
> > we don't need them, or, even better, initialize the array as a constant, once.
>
> I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
> offset parameter.  Huh, what lead to the function being so constrained?

Done, PSA v2 patch.

We could do few more things, but honestly I feel they're unnecessary:
1) An assert-only code that checks if the asked file contents are
zeroed at the end of pg_pwrite_zeros (to be more defensive, but
reading 16MB files and checking if it's zero-filled will surely
slowdown the Assert builds).
2) A small test module passing in a file with the size to write isn't
multiple of block size, meaning, the code we have in the function to
write last remaining bytes (less than BLCKSZ) gets covered which isn't
covered right now -
https://coverage.postgresql.org/src/common/file_utils.c.gcov.html.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v2-0001-Refactor-pg_pwrite_zeros.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

15 February 2023, 00:46:07

Hi,

On 2023-02-14 16:06:24 +0900, Michael Paquier wrote:
> On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote:
> > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
> > offset parameter.  Huh, what lead to the function being so constrained?
> 
> Its current set of uses cases, where we only use it now to initialize
> with zeros with WAL segments.  If you have a case that plans to use
> that stuff with an offset, no problem with me.

Then it really shouldn't have been named pg_pwrite_zeros(). The point of the
p{write,read}{,v} family of functions is to be able to specify the offset to
read/write at. I assume the p is for position, but I'm not sure.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

15 February 2023, 00:55:25

Hi,

On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> On Mon, Feb 13, 2023 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > Later code assigns iov[0].iov_len thus we need to provide a separate
> > > iov non-const variable, or can we use pwrite instead there?  (I didn't
> > > find pg_pwrite_with_retry(), though)
> >
> > Given that we need to do that, and given that we already need to loop to
> > handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
> > worth avoiding iov initialization.
> >
> > But I think it's worth limiting the initialization to blocks.
>
> We can still optimize away the for loop by using a single iovec for
> remaining size, like the attached v2 patch.
>
> > I'd also try to combine the first pg_writev_* with the second one.
>
> Done, PSA v2 patch.

This feels way too complicated to me.  How about something more like the
attached?


> 2) A small test module passing in a file with the size to write isn't
> multiple of block size, meaning, the code we have in the function to
> write last remaining bytes (less than BLCKSZ) gets covered which isn't
> covered right now -

FWIW, I tested this locally by just specifying a smaller size than BLCKSZ for
the write size.

Greetings,

Andres Freund

Attachment

v3-0001-Revise-pg_pwrite_zeros.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

15 February 2023, 01:26:41

On Tue, Feb 14, 2023 at 04:46:07PM -0800, Andres Freund wrote:
> Then it really shouldn't have been named pg_pwrite_zeros(). The point of the
> p{write,read}{,v} family of functions is to be able to specify the offset to
> read/write at. I assume the p is for position, but I'm not sure.

'p' could stand for POSIX, though both read() and pread() are in it.
Anyway, it looks that your guess may be right:
https://stackoverflow.com/questions/17877556/what-does-p-stand-for-in-function-names-pwrite-and-pread

Even there, people don't seem completely sure.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Kyotaro Horiguchi

Date:

15 February 2023, 01:28:37

At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > Done, PSA v2 patch.
> 
> This feels way too complicated to me.  How about something more like the
> attached?

I like this one, but the parameters offset and size are in a different
order from pwrite(fd, buf, count, offset).  I perfer the arrangement
suggested by Bharath. And isn't it better to use Min(remaining_size,
BLCKSZ) instead of a bare if statement?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

15 February 2023, 02:00:28

Hi

On 2023-02-15 10:28:37 +0900, Kyotaro Horiguchi wrote:
> At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > > Done, PSA v2 patch.
> > 
> > This feels way too complicated to me.  How about something more like the
> > attached?
> 
> I like this one, but the parameters offset and size are in a different
> order from pwrite(fd, buf, count, offset).  I perfer the arrangement
> suggested by Bharath.

Yes, it probably is better. Not sure why I went with that order.


> And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if
> statement?

I really can't make myself care about which version is better :)

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

15 February 2023, 07:30:00

On Wed, Feb 15, 2023 at 6:25 AM Andres Freund <andres@anarazel.de> wrote:
>
> > Done, PSA v2 patch.
>
> This feels way too complicated to me.  How about something more like the
> attached?

Thanks. I kind of did cost analysis of v2 and v3:

Input: zero-fill a file of size 256*8K.
v2 patch:
iovec initialization with zerobuf for loop - 1 time
zero-fill 32 blocks at once - 8 times
stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) * PG_IOV_MAX

v3 patch:
iovec initialization with zerobuf for loop - 8 times (7 times more
than v2 patch)
zero-fill 32 blocks at once - 8 times (no change from v2 patch)
stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) *
PG_IOV_MAX (no change from v2 patch)

The v3 patch reduces initialization of iovec array elements which is a
clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
many times (I assume this is what is needed for the relation extension
lock improvements feature). However, it increases the number of iovec
initialization with zerobuf for the cases when pg_pwrite_zeros is
called for sizes far greater than BLCKSZ (for instance, WAL file
initialization).

FWIW, I attached v4 patch, a simplified version of the v2  - it
initializes all the iovec array elements if the total blocks to be
written crosses lengthof(iovec array), otherwise it initializes only
the needed blocks.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v4-0001-Refactor-pg_pwrite_zeros.patch

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

16 February 2023, 07:58:23

On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote:
> The v3 patch reduces initialization of iovec array elements which is a
> clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
> many times (I assume this is what is needed for the relation extension
> lock improvements feature). However, it increases the number of iovec
> initialization with zerobuf for the cases when pg_pwrite_zeros is
> called for sizes far greater than BLCKSZ (for instance, WAL file
> initialization).

It seems to me that v3 would do extra initializations only if
pg_pwritev_with_retry() does *not* retry its writes, but that's not
the case as it retries on a partial write as per its name.  The number
of iov buffers is stricly capped by remaining_size.  FWIW, I find v3
proposed more elegant.

> FWIW, I attached v4 patch, a simplified version of the v2  - it
> initializes all the iovec array elements if the total blocks to be
> written crosses lengthof(iovec array), otherwise it initializes only
> the needed blocks.

+   static size_t   zbuf_sz = BLCKSZ;
In v4, what's the advantage of marking that as static?  It could
actually be dangerous if this is carelessly updated.  Well, that's not
the case, still..
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

16 February 2023, 19:00:20

Hi,

On 2023-02-16 16:58:23 +0900, Michael Paquier wrote:
> On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote:
> > The v3 patch reduces initialization of iovec array elements which is a
> > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
> > many times (I assume this is what is needed for the relation extension
> > lock improvements feature). However, it increases the number of iovec
> > initialization with zerobuf for the cases when pg_pwrite_zeros is
> > called for sizes far greater than BLCKSZ (for instance, WAL file
> > initialization).

In those cases the cost of initializing the IOV doesn't matter, relative to
the other costs. The important point is to not initialize a lot of elements if
they're not even needed. Because we need to overwrite the trailing iov
element, it doesn't seem worth to try to "pre-initialize" iov.

Referencing a static variable is more expensive than accessing an on-stack
variable. Having a first-call check is more expensive than not having it.

Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter
seems like a bad idea, because it's a compiler constant.

> It seems to me that v3 would do extra initializations only if
> pg_pwritev_with_retry() does *not* retry its writes, but that's not
> the case as it retries on a partial write as per its name.  The number
> of iov buffers is stricly capped by remaining_size.

I don't really understand this bit?

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

17 February 2023, 07:19:46

On Thu, Feb 16, 2023 at 11:00:20AM -0800, Andres Freund wrote:
> I don't really understand this bit?

As of this message, I saw this quote:
https://www.postgresql.org/message-id/fCALj2ACXEBwY_bM3kmZEkYpcXsM+yGitpYHi4FdT6MSk6YRtKTQ@mail.gmail.com
"However, it increases the number of iovec initialization with zerobuf
for the cases when pg_pwrite_zeros is called for sizes far greater
than BLCKSZ (for instance, WAL file initialization)."

But it looks like I misunderstood what this quote meant compared to
what v3 does.  It is true that v3 sets iov_len and iov_base more than
needed when writing sizes larger than BLCKSZ.  Seems like you think
that it is not really going to matter much to track which iovecs have
been already initialized during the first loop on
pg_pwritev_with_retry() to keep the code shorter?
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

17 February 2023, 17:31:14

Hi,

On 2023-02-17 16:19:46 +0900, Michael Paquier wrote:
> But it looks like I misunderstood what this quote meant compared to
> what v3 does.  It is true that v3 sets iov_len and iov_base more than
> needed when writing sizes larger than BLCKSZ.

I don't think it does for writes larger than BLCKSZ, it just does more for
writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going
to be spent elsewhere.

> Seems like you think that it is not really going to matter much to track
> which iovecs have been already initialized during the first loop on
> pg_pwritev_with_retry() to keep the code shorter?

Yes. I'd bet that, in the unlikely case you're going to see any difference at
all, unconditionally initializing is going to win.

Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every
call. Even if we could do some further optimizations of what I did in the
patch, you can initialize needed IOVs repeatedly a *lot* of times, before it
shows up...

I'm inclined to go with my version, with the argument order swapped to
Bharath's order.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

20 February 2023, 05:33:22

On Fri, Feb 17, 2023 at 09:31:14AM -0800, Andres Freund wrote:
> On 2023-02-17 16:19:46 +0900, Michael Paquier wrote:
>> But it looks like I misunderstood what this quote meant compared to
>> what v3 does.  It is true that v3 sets iov_len and iov_base more than
>> needed when writing sizes larger than BLCKSZ.
>
> I don't think it does for writes larger than BLCKSZ, it just does more for
> writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going
> to be spent elsewhere.

Yep.

>> Seems like you think that it is not really going to matter much to track
>> which iovecs have been already initialized during the first loop on
>> pg_pwritev_with_retry() to keep the code shorter?
>
> Yes. I'd bet that, in the unlikely case you're going to see any difference at
> all, unconditionally initializing is going to win.
>
> Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every
> call. Even if we could do some further optimizations of what I did in the
> patch, you can initialize needed IOVs repeatedly a *lot* of times, before it
> shows up...
>
> I'm inclined to go with my version, with the argument order swapped to
> Bharath's order.

Okay.  That's fine by me.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Bharath Rupireddy

Date:

20 February 2023, 08:24:00

On Mon, Feb 20, 2023 at 11:03 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > I'm inclined to go with my version, with the argument order swapped to
> > Bharath's order.
>
> Okay.  That's fine by me.

I ran some tests on my dev system [1] and I don't see much difference
between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
Andres to keep the code simple and elegant.

[1]
HEAD: 16MB (12.231 ms), 8190 Bytes (0.199 ms), 8192 Bytes (0.176 ms),
1GB (603.668 ms), 10GB (21184.936 ms (00:21.185))
v3 patch: 16MB (12.632 ms), 8190 Bytes (0.183 ms), 8192 Bytes (0.166
ms), 1GB (610.428 ms), 10GB (22647.308 ms (00:22.647))
v4 patch: 16MB (12.044 ms), 8190 Bytes (0.167 ms), 8192 Bytes (0.139
ms), 1GB (603.848 ms), 10GB (21225.331 ms (00:21.225))

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

06 March 2023, 04:29:50

On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> I ran some tests on my dev system [1] and I don't see much difference
> between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> Andres to keep the code simple and elegant.

This thread has stalled for a couple of weeks, so I have gone back to
it.  Testing on a tmpfs I am not seeing a difference if performance
for any of the approaches discussed.  At the end, as I am the one
behind the introduction of pg_pwrite_zeros(), I have applied v3 after
switches the size and offset parameters to be the same way as in v4.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

06 March 2023, 22:29:50

Hi,

On 2023-03-06 13:29:50 +0900, Michael Paquier wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference if performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switches the size and offset parameters to be the same way as in v4.

Thanks.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 March 2023, 02:42:03

On Mon, Mar 6, 2023 at 5:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference if performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switches the size and offset parameters to be the same way as in v4.

Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
initialised that struct.  I guess it wants {{0}} instead of {0}.
Apparently old GCC was wrong about that warning[1], but that system
doesn't have the back-patched fixes?  Not sure.

[1] https://stackoverflow.com/questions/63355760/how-standard-is-the-0-initializer-in-c89

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 March 2023, 02:44:46

On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
> initialised that struct.  I guess it wants {{0}} instead of {0}.
> Apparently old GCC was wrong about that warning[1], but that system
> doesn't have the back-patched fixes?  Not sure.

Oh, you already pushed a fix.  But now I'm wondering if it's useful to
have old buggy compilers set to run with -Werror.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

07 March 2023, 04:32:22

On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
>> initialised that struct.  I guess it wants {{0}} instead of {0}.
>> Apparently old GCC was wrong about that warning[1], but that system
>> doesn't have the back-patched fixes?  Not sure.

6392f2a was one such case.

> Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> have old buggy compilers set to run with -Werror.

Yes, as far as I can see when investigating the issue, this is an old
bug of gcc when detecting where the initialization needs to be
applied.  And at the same time the fix is deadly simple, so the
current statu-quo does not sound that bad to me.  Note that lapwing is
one of the only animals testing 32b builds, and it has saved from
quite few bugs over the years.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Michael Paquier

Date:

07 March 2023, 05:00:50

On Mon, Mar 06, 2023 at 02:29:50PM -0800, Andres Freund wrote:
> Thanks.

Sure, no problem.  If there is anything else needed for this thread,
feel free to ping me here.
--
Michael

Attachment

signature.asc

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 March 2023, 06:14:51

On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > have old buggy compilers set to run with -Werror.
>
> Yes, as far as I can see when investigating the issue, this is an old
> bug of gcc when detecting where the initialization needs to be
> applied.  And at the same time the fix is deadly simple, so the
> current statu-quo does not sound that bad to me.  Note that lapwing is
> one of the only animals testing 32b builds, and it has saved from
> quite few bugs over the years.

Yeah, but I'm just wondering, why not run a current release on it[1]?
Debian is one of the few distributions still supporting 32 bit
kernels, and it's good to test rare things, but AFAIK the primary
reason we finish up with EOL'd OSes in the 'farm is because they have
been forgotten (the secondary reason is because they couldn't be
upgraded because the OS dropped the [micro]architecture).  Unlike
vintage SPARC, actual users might plausibly be running a current
release on a 32 bit Intel system, I guess (maybe on a Quark
microcontroller?)?

BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
kernel, which probably doesn't change much at the level we care about,
so maybe this doesn't matter much... just sharing an observation that
we're wasting time thinking about an OS release that gave up the ghost
in 2016, because it is running with -Werror.  *shrug*

[1] https://wiki.debian.org/DebianReleases

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Julien Rouhaud

Date:

07 March 2023, 06:47:05

On Tue, Mar 07, 2023 at 07:14:51PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > > have old buggy compilers set to run with -Werror.
> >
> > Yes, as far as I can see when investigating the issue, this is an old
> > bug of gcc when detecting where the initialization needs to be
> > applied.  And at the same time the fix is deadly simple, so the
> > current statu-quo does not sound that bad to me.  Note that lapwing is
> > one of the only animals testing 32b builds, and it has saved from
> > quite few bugs over the years.
>
> Yeah, but I'm just wondering, why not run a current release on it[1]?
> Debian is one of the few distributions still supporting 32 bit
> kernels, and it's good to test rare things, but AFAIK the primary
> reason we finish up with EOL'd OSes in the 'farm is because they have
> been forgotten (the secondary reason is because they couldn't be
> upgraded because the OS dropped the [micro]architecture).

I registered lapwing as a 32b Debian 7 so I thought it would be expected to
keep it as-is rather than upgrading to all newer major Debian versions,
especially since there were newer debian animal registered (no 32b though
AFAICS).  I'm not opposed to upgrading it but I think there's still value in
having somewhat old packages versions being tested, especially since there
isn't much 32b coverage of those.  I would be happy to register a newer 32b
version, or even sid, if needed but the -m32 part on the CI makes me think
there isn't much value doing that now.

Now about the -Werror:

> BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
> kernel, which probably doesn't change much at the level we care about,
> so maybe this doesn't matter much... just sharing an observation that
> we're wasting time thinking about an OS release that gave up the ghost
> in 2016, because it is running with -Werror.  *shrug*

I think this is the first time that a problem raised by -Werror on that old
animal is actually a false positive, while there were many times it reported
useful stuff.  Now this has been up for years before we got better CI tooling,
especially with -m32 support, so there might not be any value to have it
anymore.  As I mentioned at [1] I don't mind removing it or just work on
upgrading any dependency (or removing known buggy compiler flags) to keep it
without being annoying.  In any case I'm usually quite fast at reacting to any
problem/complaint on that animal, so you don't have to worry about the
buildfarm being red too long if it came to that.

[1] https://www.postgresql.org/message-id/20220921155025.wdixzbrt2uzbi6vz%40jrouhaud

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Thomas Munro

Date:

07 March 2023, 07:33:11

On Tue, Mar 7, 2023 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> I registered lapwing as a 32b Debian 7 so I thought it would be expected to
> keep it as-is rather than upgrading to all newer major Debian versions,
> especially since there were newer debian animal registered (no 32b though
> AFAICS).

Animals do get upgraded: see the "w. e. f." ("with effect from") line
in https://buildfarm.postgresql.org/cgi-bin/show_members.pl which
comes from people running something like ./update_personality.pl
--os-version "11" so that it shows up on the website.

>  I'm not opposed to upgrading it but I think there's still value in
> having somewhat old packages versions being tested, especially since there
> isn't much 32b coverage of those.  I would be happy to register a newer 32b
> version, or even sid, if needed but the -m32 part on the CI makes me think
> there isn't much value doing that now.

Totally up to you as an animal zoo keeper but in my humble opinion the
interesting range of Debian releases currently is 11-13, or maybe 10
if you really want to test the LTS/old-stable release (and CI is
testing 11).

> I think this is the first time that a problem raised by -Werror on that old
> animal is actually a false positive, while there were many times it reported
> useful stuff.  Now this has been up for years before we got better CI tooling,
> especially with -m32 support, so there might not be any value to have it
> anymore.  As I mentioned at [1] I don't mind removing it or just work on
> upgrading any dependency (or removing known buggy compiler flags) to keep it
> without being annoying.  In any case I'm usually quite fast at reacting to any
> problem/complaint on that animal, so you don't have to worry about the
> buildfarm being red too long if it came to that.

Yeah, it's given us lots of useful data, thanks.  Personally I would
upgrade it so it keeps telling us useful things but I feel like I've
said enough about that so I'll shut up now :-)  Re: being red too
long... yeah that reminds me, I really need to fix seawasp ASAP...

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Alvaro Herrera

Date:

07 March 2023, 10:09:20

On 2023-Mar-07, Julien Rouhaud wrote:

> I registered lapwing as a 32b Debian 7 so I thought it would be expected to
> keep it as-is rather than upgrading to all newer major Debian versions,
> especially since there were newer debian animal registered (no 32b though
> AFAICS).  I'm not opposed to upgrading it but I think there's still value in
> having somewhat old packages versions being tested, especially since there
> isn't much 32b coverage of those.  I would be happy to register a newer 32b
> version, or even sid, if needed but the -m32 part on the CI makes me think
> there isn't much value doing that now.

I think a pure 32bit animal running contemporary Debian would be better
than just ditching the animal completely, as would appear to be the
alternative, precisely because we have no other 32bit machine running
x86 Linux.

Maybe you can have *two* animals on the same machine: one running the
old Debian without -Werror, and the one with new Debian and that flag
kept.

I think CI is not a replacement for the buildfarm.  It helps catche some
problems earlier, but we shouldn't think that we no longer need some
buildfarm animals because CI runs those configs.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From

Andres Freund

Date:

17 March 2023, 00:00:10

Hi,

On 2023-03-07 15:44:46 +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
> > initialised that struct.  I guess it wants {{0}} instead of {0}.
> > Apparently old GCC was wrong about that warning[1], but that system
> > doesn't have the back-patched fixes?  Not sure.
> 
> Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> have old buggy compilers set to run with -Werror.

I think it's actively harmful to do so. Avoiding warnings on a > 10 year old
compiler a) is a waste of time b) unnecessarily requires making our code
uglier.

Greetings,

Andres Freund