Thread: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Christoph Berg

Date:

24 April 2023, 08:53:35

Re: Andres Freund
> Add smgrzeroextend(), FileZero(), FileFallocate()

Hi,

I'm often seeing PG16 builds erroring out in the pgbench tests:

00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check &&
/bin/mkdir-p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench &&
TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log'
TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check'
PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH"
LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432'
top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..'
PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'/usr/bin/prove -I
/<<PKGBUILDDIR>>/build/../src/test/perl/-I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
 
00:33:12 # +++ tap check in src/bin/pgbench +++
00:33:14 #   Failed test 'concurrent OID generation status (got 2 vs expected 0)'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #   Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #                   'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
00:33:14 # transaction type:
/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
00:33:14 # scaling factor: 1
00:33:14 # query mode: prepared
00:33:14 # number of clients: 5
00:33:14 # number of threads: 1
00:33:14 # maximum number of tries: 1
00:33:14 # number of transactions per client: 25
00:33:14 # number of transactions actually processed: 118/125
00:33:14 # number of failed transactions: 0 (0.000%)
00:33:14 # latency average = 26.470 ms
00:33:14 # initial connection time = 66.583 ms
00:33:14 # tps = 188.889760 (without initial connection time)
00:33:14 # '
00:33:14 #     doesn't match '(?^:processed: 125/125)'
00:33:14 #   Failed test 'concurrent OID generation stderr /(?^:^$)/'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #                   'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR:  could not extend
file"base/5/3501" with FileFallocate(): Interrupted system call
 
00:33:14 # HINT:  Check free disk space.
00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
00:33:14 # '
00:33:14 #     doesn't match '(?^:^$)'
00:33:26 # Looks like you failed 3 tests of 428.
00:33:26 t/001_pgbench_with_server.pl ..
00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)

I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/

This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
that test works though, and the FS seems to support posix_fallocate:

#include <fcntl.h>
#include <stdio.h>

int main ()
{
        int f;
        int err;

        if (!(f = open("moo", O_CREAT | O_RDWR, 0666)))
                perror("open");

        err = posix_fallocate(f, 0, 10);
        perror("posix_fallocate");

        return 0;
}

$ ./a.out
posix_fallocate: Success

The problem has been there for some weeks - I didn't report it earlier
as I was on vacation, in the free time trying to bootstrap s390x
support for apt.pg.o, and there was this other direct IO problem
making all the builds fail for some time.

Christoph

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Melanie Plageman

Date:

24 April 2023, 15:58:55

On Mon, Apr 24, 2023 at 10:53:35AM +0200, Christoph Berg wrote:
> Re: Andres Freund
> > Add smgrzeroextend(), FileZero(), FileFallocate()
> 
> Hi,
> 
> I'm often seeing PG16 builds erroring out in the pgbench tests:
> 
> 00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
> 00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check
&&/bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench &&
TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log'
TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check'
PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH"
LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432'
top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..'
PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'/usr/bin/prove -I
/<<PKGBUILDDIR>>/build/../src/test/perl/-I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
 
> 00:33:12 # +++ tap check in src/bin/pgbench +++
> 00:33:14 #   Failed test 'concurrent OID generation status (got 2 vs expected 0)'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #   Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #                   'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
> 00:33:14 # transaction type:
/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
> 00:33:14 # scaling factor: 1
> 00:33:14 # query mode: prepared
> 00:33:14 # number of clients: 5
> 00:33:14 # number of threads: 1
> 00:33:14 # maximum number of tries: 1
> 00:33:14 # number of transactions per client: 25
> 00:33:14 # number of transactions actually processed: 118/125
> 00:33:14 # number of failed transactions: 0 (0.000%)
> 00:33:14 # latency average = 26.470 ms
> 00:33:14 # initial connection time = 66.583 ms
> 00:33:14 # tps = 188.889760 (without initial connection time)
> 00:33:14 # '
> 00:33:14 #     doesn't match '(?^:processed: 125/125)'
> 00:33:14 #   Failed test 'concurrent OID generation stderr /(?^:^$)/'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #                   'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR:  could not
extendfile "base/5/3501" with FileFallocate(): Interrupted system call
 
> 00:33:14 # HINT:  Check free disk space.
> 00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
> 00:33:14 # '
> 00:33:14 #     doesn't match '(?^:^$)'
> 00:33:26 # Looks like you failed 3 tests of 428.
> 00:33:26 t/001_pgbench_with_server.pl ..
> 00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)
> 
> I don't think the disk is full since it's always hitting that same
> spot, on some of the builds:
> 
> https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/
> 
> This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
> that test works though, and the FS seems to support posix_fallocate:
> 
> #include <fcntl.h>
> #include <stdio.h>
> 
> int main ()
> {
>         int f;
>         int err;
> 
>         if (!(f = open("moo", O_CREAT | O_RDWR, 0666)))
>                 perror("open");
> 
>         err = posix_fallocate(f, 0, 10);
>         perror("posix_fallocate");
> 
>         return 0;
> }
> 
> $ ./a.out
> posix_fallocate: Success
> 
> The problem has been there for some weeks - I didn't report it earlier
> as I was on vacation, in the free time trying to bootstrap s390x
> support for apt.pg.o, and there was this other direct IO problem
> making all the builds fail for some time.

I noticed that dsm_impl_posix_resize() does a do while rc==EINTR and
FileFallocate() doesn't. From what the comment says in
dsm_impl_posix_resize() and some cursory googling, posix_fallocate()
doesn't restart automatically on most systems, so a do while() rc==EINTR
is often used. Is there a reason it isn't used in FileFallocate() I
wonder?

- Melanie

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

24 April 2023, 22:32:25

Hi,

On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:
> I'm often seeing PG16 builds erroring out in the pgbench tests:

Interesting!

> I don't think the disk is full since it's always hitting that same
> spot, on some of the builds:

Yea, the EINTR pretty clearly indicates that it's not really out-of-space.

> https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/
> 
> This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
> that test works though, and the FS seems to support posix_fallocate:

I guess it requires a bunch of memory (?) pressure for this to happen
(triggering blocking during fallocate, opening the window for a signal to
arrive), which likely only happens when running things concurrently.

We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.

The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

Greetings,

Andres Freund

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

25 April 2023, 00:16:23

Hi,

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:
> > I'm often seeing PG16 builds erroring out in the pgbench tests:
> > I don't think the disk is full since it's always hitting that same
> > spot, on some of the builds:
> 
> Yea, the EINTR pretty clearly indicates that it's not really out-of-space.

FWIW, I tried to reproduce this, without success - not too surprising, I
assume it's rather timing dependent.


> We obviously can add a retry loop to FileFallocate(), similar to what's
> already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> further, and do it for all the fd.c routines where it's remotely plausible
> EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> the functions.
> 
> 
> The following are documented to potentially return EINTR, without fd.c having
> code to retry:
> 
> - FileWriteback() / pg_flush_data()
> - FileSync() / pg_fsync()
> - FileFallocate()
> - FileTruncate()
> 
> With the first two there's the added complication that it's not entirely
> obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable.  I guess we could add pg_* routines for
FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

Greetings,

Andres Freund

Attachment

v1-0001-fd.c-Retry-after-EINTR-in-more-places.patch

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Christoph Berg

Date:

25 April 2023, 18:24:30

Re: Andres Freund
> A prototype of that approach is attached. I pushed the retry handling into the
> pg_* routines where applicable.  I guess we could add pg_* routines for
> FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> Christoph, could you verify this fixes your issue?

Everything green with the patch applied. Thanks!

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/839/

Christoph

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Thomas Munro

Date:

25 April 2023, 23:37:55

On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > We obviously can add a retry loop to FileFallocate(), similar to what's
> > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> > further, and do it for all the fd.c routines where it's remotely plausible
> > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> > the functions.
> >
> >
> > The following are documented to potentially return EINTR, without fd.c having
> > code to retry:
> >
> > - FileWriteback() / pg_flush_data()
> > - FileSync() / pg_fsync()
> > - FileFallocate()
> > - FileTruncate()
> >
> > With the first two there's the added complication that it's not entirely
> > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> > latter is a bit more sensible?
>
> A prototype of that approach is attached. I pushed the retry handling into the
> pg_* routines where applicable.  I guess we could add pg_* routines for
> FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.

One problem we ran into with the the shm_open() case (which is nearly
identical under the covers, since shm_open() just opens a file in a
tmpfs on Linux) was that a simple retry loop like this could never
terminate if the process was receiving a lot of signals from the
recovery process, which is why we went with the idea of masking
signals instead.  Eventually we should probably grow the file in
smaller chunks with a CFI in between so that we both guarantee that we
make progress (by masking for smaller size increases) and service
interrupts in a timely fashion (by unmasking between loops).  I don't
think that applies here because we're not trying to fallocate
humongous size increases in one go, but I just want to note that we're
making a different choice.  I think this looks reasonable for the use
cases we actually have here.

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Christoph Berg

Date:

23 May 2023, 14:25:59

Re: Andres Freund
> A prototype of that approach is attached. I pushed the retry handling into the
> pg_* routines where applicable.  I guess we could add pg_* routines for
> FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> Christoph, could you verify this fixes your issue?

Hi,

I believe this issue is still open for PG16.

Christoph

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Michael Paquier

Date:

24 May 2023, 01:42:03

On Tue, May 23, 2023 at 04:25:59PM +0200, Christoph Berg wrote:
> I believe this issue is still open for PG16.

Right.  I've added an item to the list, to not forget.
--
Michael

Attachment

signature.asc

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Kyotaro Horiguchi

Date:

24 May 2023, 01:56:28

At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > We obviously can add a retry loop to FileFallocate(), similar to what's
> > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> > > further, and do it for all the fd.c routines where it's remotely plausible
> > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> > > the functions.
> > >
> > >
> > > The following are documented to potentially return EINTR, without fd.c having
> > > code to retry:
> > >
> > > - FileWriteback() / pg_flush_data()
> > > - FileSync() / pg_fsync()
> > > - FileFallocate()
> > > - FileTruncate()
> > >
> > > With the first two there's the added complication that it's not entirely
> > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> > > latter is a bit more sensible?
> >
> > A prototype of that approach is attached. I pushed the retry handling into the
> > pg_* routines where applicable.  I guess we could add pg_* routines for
> > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> One problem we ran into with the the shm_open() case (which is nearly
> identical under the covers, since shm_open() just opens a file in a
> tmpfs on Linux) was that a simple retry loop like this could never
> terminate if the process was receiving a lot of signals from the
> recovery process, which is why we went with the idea of masking
> signals instead.  Eventually we should probably grow the file in
> smaller chunks with a CFI in between so that we both guarantee that we
> make progress (by masking for smaller size increases) and service
> interrupts in a timely fashion (by unmasking between loops).  I don't
> think that applies here because we're not trying to fallocate
> humongous size increases in one go, but I just want to note that we're
> making a different choice.  I think this looks reasonable for the use
> cases we actually have here.

FWIW I share the same feeling about looping by EINTR without signals
being blocked. If we just retry the same operation without processing
signals after getting EINTR, I think blocking signals is better. We
could block signals more gracefully, but I'm not sure it's worth the
complexity.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

24 May 2023, 02:28:45

Hi,

On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:
> At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > > We obviously can add a retry loop to FileFallocate(), similar to what's
> > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> > > > further, and do it for all the fd.c routines where it's remotely plausible
> > > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> > > > the functions.
> > > >
> > > >
> > > > The following are documented to potentially return EINTR, without fd.c having
> > > > code to retry:
> > > >
> > > > - FileWriteback() / pg_flush_data()
> > > > - FileSync() / pg_fsync()
> > > > - FileFallocate()
> > > > - FileTruncate()
> > > >
> > > > With the first two there's the added complication that it's not entirely
> > > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> > > > latter is a bit more sensible?
> > >
> > > A prototype of that approach is attached. I pushed the retry handling into the
> > > pg_* routines where applicable.  I guess we could add pg_* routines for
> > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> > 
> > One problem we ran into with the the shm_open() case (which is nearly
> > identical under the covers, since shm_open() just opens a file in a
> > tmpfs on Linux) was that a simple retry loop like this could never
> > terminate if the process was receiving a lot of signals from the
> > recovery process, which is why we went with the idea of masking
> > signals instead.  Eventually we should probably grow the file in
> > smaller chunks with a CFI in between so that we both guarantee that we
> > make progress (by masking for smaller size increases) and service
> > interrupts in a timely fashion (by unmasking between loops).  I don't
> > think that applies here because we're not trying to fallocate
> > humongous size increases in one go, but I just want to note that we're
> > making a different choice.  I think this looks reasonable for the use
> > cases we actually have here.
> 
> FWIW I share the same feeling about looping by EINTR without signals
> being blocked. If we just retry the same operation without processing
> signals after getting EINTR, I think blocking signals is better. We
> could block signals more gracefully, but I'm not sure it's worth the
> complexity.

I seriously doubt it's a good path to go down in this case. As Thomas
mentioned, this case isn't really comparable to the shm_open() one, due to the
bounded vs unbounded amount of memory we're dealing with.

What would be the benefit?

Greetings,

Andres Freund

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Kyotaro Horiguchi

Date:

24 May 2023, 04:13:51

At Tue, 23 May 2023 19:28:45 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> > > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > > > We obviously can add a retry loop to FileFallocate(), similar to what's
> > > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> > > > > further, and do it for all the fd.c routines where it's remotely plausible
> > > > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> > > > > the functions.
> > > > >
> > > > >
> > > > > The following are documented to potentially return EINTR, without fd.c having
> > > > > code to retry:
> > > > >
> > > > > - FileWriteback() / pg_flush_data()
> > > > > - FileSync() / pg_fsync()
> > > > > - FileFallocate()
> > > > > - FileTruncate()
> > > > >
> > > > > With the first two there's the added complication that it's not entirely
> > > > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> > > > > latter is a bit more sensible?
> > > >
> > > > A prototype of that approach is attached. I pushed the retry handling into the
> > > > pg_* routines where applicable.  I guess we could add pg_* routines for
> > > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> > > 
> > > One problem we ran into with the the shm_open() case (which is nearly
> > > identical under the covers, since shm_open() just opens a file in a
> > > tmpfs on Linux) was that a simple retry loop like this could never
> > > terminate if the process was receiving a lot of signals from the
> > > recovery process, which is why we went with the idea of masking
> > > signals instead.  Eventually we should probably grow the file in
> > > smaller chunks with a CFI in between so that we both guarantee that we
> > > make progress (by masking for smaller size increases) and service
> > > interrupts in a timely fashion (by unmasking between loops).  I don't
> > > think that applies here because we're not trying to fallocate
> > > humongous size increases in one go, but I just want to note that we're
> > > making a different choice.  I think this looks reasonable for the use
> > > cases we actually have here.
> > 
> > FWIW I share the same feeling about looping by EINTR without signals
> > being blocked. If we just retry the same operation without processing
> > signals after getting EINTR, I think blocking signals is better. We
> > could block signals more gracefully, but I'm not sure it's worth the
> > complexity.
> 
> I seriously doubt it's a good path to go down in this case. As Thomas
> mentioned, this case isn't really comparable to the shm_open() one, due to the
> bounded vs unbounded amount of memory we're dealing with.
> 
> What would be the benefit?

I'm not certain what you mean by "it" here. Regarding signal blocking,
the benefit would be a lower chance of getting constantly interrupted
by a string of frequent interrupts, which can't be prevented just by
looping over. From what I gathered, Thomas meant that we don't need to
use chunking to prevent long periods of ignoring interrupts because
we're extending a file by a few blocks. However, I might have
misunderstood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Alvaro Herrera

Date:

06 June 2023, 19:53:00

On 2023-Apr-24, Andres Freund wrote:

> A prototype of that approach is attached. I pushed the retry handling into the
> pg_* routines where applicable.  I guess we could add pg_* routines for
> FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> Christoph, could you verify this fixes your issue?

So, is anyone making progress on this?  I don't see anything in the
thread.

On adding the missing pg_* wrappers: I think if we don't (and we leave
the retry loops at the File* layer), then the risk is that some external
code would add calls to the underlying File* routines trusting them to
do the retrying, which would then become broken when we move the retry
loops to the pg_* wrappers when we add them.  That doesn't seem terribly
serious to me.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Christoph Berg

Date:

07 June 2023, 13:18:42

Re: Alvaro Herrera
> > Christoph, could you verify this fixes your issue?
> 
> So, is anyone making progress on this?  I don't see anything in the
> thread.

Well, I had reported that I haven't been seeing any problems since I
applied the patch to the postgresql-16.deb package. So for me, the
patch looks like it solves the problem.

Christoph

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

10 June 2023, 04:35:38

Hi,

On 2023-06-06 21:53:00 +0200, Alvaro Herrera wrote:
> On 2023-Apr-24, Andres Freund wrote:
> 
> > A prototype of that approach is attached. I pushed the retry handling into the
> > pg_* routines where applicable.  I guess we could add pg_* routines for
> > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> > 
> > Christoph, could you verify this fixes your issue?
> 
> So, is anyone making progress on this?  I don't see anything in the
> thread.

Thanks for bringing it up again, I had lost track. I now added an open items
entry.

My gut feeling is that we should go with something quite minimal at this
stage.

> On adding the missing pg_* wrappers: I think if we don't (and we leave
> the retry loops at the File* layer), then the risk is that some external
> code would add calls to the underlying File* routines trusting them to
> do the retrying, which would then become broken when we move the retry
> loops to the pg_* wrappers when we add them.  That doesn't seem terribly
> serious to me.

I'm not too worried about that either.

Unless somebody strongly advocates a different path, I plan to push something
along the lines of the prototype I had posted. After reading over it a bunch
more times, some of the code is a bit finnicky.

I wish we had some hack that made syscalls EINTR at a random intervals, just
to make it realistic to test these paths...

Greetings,

Andres Freund

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

19 June 2023, 17:27:06

Hi Tom,

Unfortunately, due to some personal life business, it took until for me to
feel comfortable pushing the fix for
https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
(FileFallocate() erroring out with EINTR due to running on tmpfs).

Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
with the patch and patch + test infra, and they all passed.

Greetings,

Andres Freund

Attachment

v2-0001-fd.c-Retry-after-EINTR-in-more-places.patch

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Tom Lane

Date:

19 June 2023, 17:37:45

Andres Freund <andres@anarazel.de> writes:
> Unfortunately, due to some personal life business, it took until for me to
> feel comfortable pushing the fix for
> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
> (FileFallocate() erroring out with EINTR due to running on tmpfs).
> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
> with the patch and patch + test infra, and they all passed.

We still have a week till beta2 wrap, so I'd say push.  If the buildfarm
gets unhappy you can revert.

            regards, tom lane

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

19 June 2023, 18:16:56

Hi,

On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> Unfortunately, due to some personal life business, it took until for me to
>> feel comfortable pushing the fix for
>> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
>> (FileFallocate() erroring out with EINTR due to running on tmpfs).
>> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
>> with the patch and patch + test infra, and they all passed.
>
>We still have a week till beta2 wrap, so I'd say push.  If the buildfarm
>gets unhappy you can revert.

Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Amit Kapila

Date:

21 June 2023, 03:24:48

On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Andres Freund <andres@anarazel.de> writes:
> >> Unfortunately, due to some personal life business, it took until for me to
> >> feel comfortable pushing the fix for
> >> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
> >> (FileFallocate() erroring out with EINTR due to running on tmpfs).
> >> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
> >> with the patch and patch + test infra, and they all passed.
> >
> >We still have a week till beta2 wrap, so I'd say push.  If the buildfarm
> >gets unhappy you can revert.
>
> Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3
>

Can we close the open item corresponding to this after your commit 0d369ac650?

--
With Regards,
Amit Kapila.

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From

Andres Freund

Date:

21 June 2023, 03:49:20

Hi,

On 2023-06-21 08:54:48 +0530, Amit Kapila wrote:
> On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >Andres Freund <andres@anarazel.de> writes:
> > >> Unfortunately, due to some personal life business, it took until for me to
> > >> feel comfortable pushing the fix for
> > >> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de
> > >> (FileFallocate() erroring out with EINTR due to running on tmpfs).
> > >> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs
> > >> with the patch and patch + test infra, and they all passed.
> > >
> > >We still have a week till beta2 wrap, so I'd say push.  If the buildfarm
> > >gets unhappy you can revert.
> >
> > Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3
> >
> 
> Can we close the open item corresponding to this after your commit 0d369ac650?

Yes, sorry for forgetting that. Done now.

Greetings,

Andres Freund