Thread: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Christoph Berg
Date:
Re: Andres Freund > Add smgrzeroextend(), FileZero(), FileFallocate() Hi, I'm often seeing PG16 builds erroring out in the pgbench tests: 00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench' 00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && /bin/mkdir-p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log' TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check' PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH" LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432' top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..' PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'/usr/bin/prove -I /<<PKGBUILDDIR>>/build/../src/test/perl/-I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl 00:33:12 # +++ tap check in src/bin/pgbench +++ 00:33:14 # Failed test 'concurrent OID generation status (got 2 vs expected 0)' 00:33:14 # at t/001_pgbench_with_server.pl line 31. 00:33:14 # Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/' 00:33:14 # at t/001_pgbench_with_server.pl line 31. 00:33:14 # 'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc)) 00:33:14 # transaction type: /<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert 00:33:14 # scaling factor: 1 00:33:14 # query mode: prepared 00:33:14 # number of clients: 5 00:33:14 # number of threads: 1 00:33:14 # maximum number of tries: 1 00:33:14 # number of transactions per client: 25 00:33:14 # number of transactions actually processed: 118/125 00:33:14 # number of failed transactions: 0 (0.000%) 00:33:14 # latency average = 26.470 ms 00:33:14 # initial connection time = 66.583 ms 00:33:14 # tps = 188.889760 (without initial connection time) 00:33:14 # ' 00:33:14 # doesn't match '(?^:processed: 125/125)' 00:33:14 # Failed test 'concurrent OID generation stderr /(?^:^$)/' 00:33:14 # at t/001_pgbench_with_server.pl line 31. 00:33:14 # 'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not extend file"base/5/3501" with FileFallocate(): Interrupted system call 00:33:14 # HINT: Check free disk space. 00:33:14 # pgbench: error: Run was aborted; the above results are incomplete. 00:33:14 # ' 00:33:14 # doesn't match '(?^:^$)' 00:33:26 # Looks like you failed 3 tests of 428. 00:33:26 t/001_pgbench_with_server.pl .. 00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0) I don't think the disk is full since it's always hitting that same spot, on some of the builds: https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/ This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running that test works though, and the FS seems to support posix_fallocate: #include <fcntl.h> #include <stdio.h> int main () { int f; int err; if (!(f = open("moo", O_CREAT | O_RDWR, 0666))) perror("open"); err = posix_fallocate(f, 0, 10); perror("posix_fallocate"); return 0; } $ ./a.out posix_fallocate: Success The problem has been there for some weeks - I didn't report it earlier as I was on vacation, in the free time trying to bootstrap s390x support for apt.pg.o, and there was this other direct IO problem making all the builds fail for some time. Christoph
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Melanie Plageman
Date:
On Mon, Apr 24, 2023 at 10:53:35AM +0200, Christoph Berg wrote: > Re: Andres Freund > > Add smgrzeroextend(), FileZero(), FileFallocate() > > Hi, > > I'm often seeing PG16 builds erroring out in the pgbench tests: > > 00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench' > 00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check &&/bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log' TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check' PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH" LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432' top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..' PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'/usr/bin/prove -I /<<PKGBUILDDIR>>/build/../src/test/perl/-I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl > 00:33:12 # +++ tap check in src/bin/pgbench +++ > 00:33:14 # Failed test 'concurrent OID generation status (got 2 vs expected 0)' > 00:33:14 # at t/001_pgbench_with_server.pl line 31. > 00:33:14 # Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/' > 00:33:14 # at t/001_pgbench_with_server.pl line 31. > 00:33:14 # 'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc)) > 00:33:14 # transaction type: /<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert > 00:33:14 # scaling factor: 1 > 00:33:14 # query mode: prepared > 00:33:14 # number of clients: 5 > 00:33:14 # number of threads: 1 > 00:33:14 # maximum number of tries: 1 > 00:33:14 # number of transactions per client: 25 > 00:33:14 # number of transactions actually processed: 118/125 > 00:33:14 # number of failed transactions: 0 (0.000%) > 00:33:14 # latency average = 26.470 ms > 00:33:14 # initial connection time = 66.583 ms > 00:33:14 # tps = 188.889760 (without initial connection time) > 00:33:14 # ' > 00:33:14 # doesn't match '(?^:processed: 125/125)' > 00:33:14 # Failed test 'concurrent OID generation stderr /(?^:^$)/' > 00:33:14 # at t/001_pgbench_with_server.pl line 31. > 00:33:14 # 'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not extendfile "base/5/3501" with FileFallocate(): Interrupted system call > 00:33:14 # HINT: Check free disk space. > 00:33:14 # pgbench: error: Run was aborted; the above results are incomplete. > 00:33:14 # ' > 00:33:14 # doesn't match '(?^:^$)' > 00:33:26 # Looks like you failed 3 tests of 428. > 00:33:26 t/001_pgbench_with_server.pl .. > 00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0) > > I don't think the disk is full since it's always hitting that same > spot, on some of the builds: > > https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/ > > This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running > that test works though, and the FS seems to support posix_fallocate: > > #include <fcntl.h> > #include <stdio.h> > > int main () > { > int f; > int err; > > if (!(f = open("moo", O_CREAT | O_RDWR, 0666))) > perror("open"); > > err = posix_fallocate(f, 0, 10); > perror("posix_fallocate"); > > return 0; > } > > $ ./a.out > posix_fallocate: Success > > The problem has been there for some weeks - I didn't report it earlier > as I was on vacation, in the free time trying to bootstrap s390x > support for apt.pg.o, and there was this other direct IO problem > making all the builds fail for some time. I noticed that dsm_impl_posix_resize() does a do while rc==EINTR and FileFallocate() doesn't. From what the comment says in dsm_impl_posix_resize() and some cursory googling, posix_fallocate() doesn't restart automatically on most systems, so a do while() rc==EINTR is often used. Is there a reason it isn't used in FileFallocate() I wonder? - Melanie
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On 2023-04-24 10:53:35 +0200, Christoph Berg wrote: > I'm often seeing PG16 builds erroring out in the pgbench tests: Interesting! > I don't think the disk is full since it's always hitting that same > spot, on some of the builds: Yea, the EINTR pretty clearly indicates that it's not really out-of-space. > https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/ > > This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running > that test works though, and the FS seems to support posix_fallocate: I guess it requires a bunch of memory (?) pressure for this to happen (triggering blocking during fallocate, opening the window for a signal to arrive), which likely only happens when running things concurrently. We obviously can add a retry loop to FileFallocate(), similar to what's already present e.g. in FileRead(). But I wonder if we shouldn't go a bit further, and do it for all the fd.c routines where it's remotely plausible EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to the functions. The following are documented to potentially return EINTR, without fd.c having code to retry: - FileWriteback() / pg_flush_data() - FileSync() / pg_fsync() - FileFallocate() - FileTruncate() With the first two there's the added complication that it's not entirely obvious whether it'd be better to handle this in File* or pg_*. I'd argue the latter is a bit more sensible? Greetings, Andres Freund
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On 2023-04-24 15:32:25 -0700, Andres Freund wrote: > On 2023-04-24 10:53:35 +0200, Christoph Berg wrote: > > I'm often seeing PG16 builds erroring out in the pgbench tests: > > I don't think the disk is full since it's always hitting that same > > spot, on some of the builds: > > Yea, the EINTR pretty clearly indicates that it's not really out-of-space. FWIW, I tried to reproduce this, without success - not too surprising, I assume it's rather timing dependent. > We obviously can add a retry loop to FileFallocate(), similar to what's > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit > further, and do it for all the fd.c routines where it's remotely plausible > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to > the functions. > > > The following are documented to potentially return EINTR, without fd.c having > code to retry: > > - FileWriteback() / pg_flush_data() > - FileSync() / pg_fsync() > - FileFallocate() > - FileTruncate() > > With the first two there's the added complication that it's not entirely > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the > latter is a bit more sensible? A prototype of that approach is attached. I pushed the retry handling into the pg_* routines where applicable. I guess we could add pg_* routines for FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. Christoph, could you verify this fixes your issue? Greetings, Andres Freund
Attachment
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Christoph Berg
Date:
Re: Andres Freund > A prototype of that approach is attached. I pushed the retry handling into the > pg_* routines where applicable. I guess we could add pg_* routines for > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > Christoph, could you verify this fixes your issue? Everything green with the patch applied. Thanks! https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/839/ Christoph
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Thomas Munro
Date:
On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-04-24 15:32:25 -0700, Andres Freund wrote: > > We obviously can add a retry loop to FileFallocate(), similar to what's > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit > > further, and do it for all the fd.c routines where it's remotely plausible > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to > > the functions. > > > > > > The following are documented to potentially return EINTR, without fd.c having > > code to retry: > > > > - FileWriteback() / pg_flush_data() > > - FileSync() / pg_fsync() > > - FileFallocate() > > - FileTruncate() > > > > With the first two there's the added complication that it's not entirely > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the > > latter is a bit more sensible? > > A prototype of that approach is attached. I pushed the retry handling into the > pg_* routines where applicable. I guess we could add pg_* routines for > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. One problem we ran into with the the shm_open() case (which is nearly identical under the covers, since shm_open() just opens a file in a tmpfs on Linux) was that a simple retry loop like this could never terminate if the process was receiving a lot of signals from the recovery process, which is why we went with the idea of masking signals instead. Eventually we should probably grow the file in smaller chunks with a CFI in between so that we both guarantee that we make progress (by masking for smaller size increases) and service interrupts in a timely fashion (by unmasking between loops). I don't think that applies here because we're not trying to fallocate humongous size increases in one go, but I just want to note that we're making a different choice. I think this looks reasonable for the use cases we actually have here.
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Christoph Berg
Date:
Re: Andres Freund > A prototype of that approach is attached. I pushed the retry handling into the > pg_* routines where applicable. I guess we could add pg_* routines for > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > Christoph, could you verify this fixes your issue? Hi, I believe this issue is still open for PG16. Christoph
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Michael Paquier
Date:
On Tue, May 23, 2023 at 04:25:59PM +0200, Christoph Berg wrote: > I believe this issue is still open for PG16. Right. I've added an item to the list, to not forget. -- Michael
Attachment
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Kyotaro Horiguchi
Date:
At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote: > > > We obviously can add a retry loop to FileFallocate(), similar to what's > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit > > > further, and do it for all the fd.c routines where it's remotely plausible > > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to > > > the functions. > > > > > > > > > The following are documented to potentially return EINTR, without fd.c having > > > code to retry: > > > > > > - FileWriteback() / pg_flush_data() > > > - FileSync() / pg_fsync() > > > - FileFallocate() > > > - FileTruncate() > > > > > > With the first two there's the added complication that it's not entirely > > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the > > > latter is a bit more sensible? > > > > A prototype of that approach is attached. I pushed the retry handling into the > > pg_* routines where applicable. I guess we could add pg_* routines for > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > One problem we ran into with the the shm_open() case (which is nearly > identical under the covers, since shm_open() just opens a file in a > tmpfs on Linux) was that a simple retry loop like this could never > terminate if the process was receiving a lot of signals from the > recovery process, which is why we went with the idea of masking > signals instead. Eventually we should probably grow the file in > smaller chunks with a CFI in between so that we both guarantee that we > make progress (by masking for smaller size increases) and service > interrupts in a timely fashion (by unmasking between loops). I don't > think that applies here because we're not trying to fallocate > humongous size increases in one go, but I just want to note that we're > making a different choice. I think this looks reasonable for the use > cases we actually have here. FWIW I share the same feeling about looping by EINTR without signals being blocked. If we just retry the same operation without processing signals after getting EINTR, I think blocking signals is better. We could block signals more gracefully, but I'm not sure it's worth the complexity. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote: > At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote: > > > > We obviously can add a retry loop to FileFallocate(), similar to what's > > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit > > > > further, and do it for all the fd.c routines where it's remotely plausible > > > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to > > > > the functions. > > > > > > > > > > > > The following are documented to potentially return EINTR, without fd.c having > > > > code to retry: > > > > > > > > - FileWriteback() / pg_flush_data() > > > > - FileSync() / pg_fsync() > > > > - FileFallocate() > > > > - FileTruncate() > > > > > > > > With the first two there's the added complication that it's not entirely > > > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the > > > > latter is a bit more sensible? > > > > > > A prototype of that approach is attached. I pushed the retry handling into the > > > pg_* routines where applicable. I guess we could add pg_* routines for > > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > > > One problem we ran into with the the shm_open() case (which is nearly > > identical under the covers, since shm_open() just opens a file in a > > tmpfs on Linux) was that a simple retry loop like this could never > > terminate if the process was receiving a lot of signals from the > > recovery process, which is why we went with the idea of masking > > signals instead. Eventually we should probably grow the file in > > smaller chunks with a CFI in between so that we both guarantee that we > > make progress (by masking for smaller size increases) and service > > interrupts in a timely fashion (by unmasking between loops). I don't > > think that applies here because we're not trying to fallocate > > humongous size increases in one go, but I just want to note that we're > > making a different choice. I think this looks reasonable for the use > > cases we actually have here. > > FWIW I share the same feeling about looping by EINTR without signals > being blocked. If we just retry the same operation without processing > signals after getting EINTR, I think blocking signals is better. We > could block signals more gracefully, but I'm not sure it's worth the > complexity. I seriously doubt it's a good path to go down in this case. As Thomas mentioned, this case isn't really comparable to the shm_open() one, due to the bounded vs unbounded amount of memory we're dealing with. What would be the benefit? Greetings, Andres Freund
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Kyotaro Horiguchi
Date:
At Tue, 23 May 2023 19:28:45 -0700, Andres Freund <andres@anarazel.de> wrote in > Hi, > > On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote: > > At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > > > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote: > > > > > We obviously can add a retry loop to FileFallocate(), similar to what's > > > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit > > > > > further, and do it for all the fd.c routines where it's remotely plausible > > > > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to > > > > > the functions. > > > > > > > > > > > > > > > The following are documented to potentially return EINTR, without fd.c having > > > > > code to retry: > > > > > > > > > > - FileWriteback() / pg_flush_data() > > > > > - FileSync() / pg_fsync() > > > > > - FileFallocate() > > > > > - FileTruncate() > > > > > > > > > > With the first two there's the added complication that it's not entirely > > > > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue the > > > > > latter is a bit more sensible? > > > > > > > > A prototype of that approach is attached. I pushed the retry handling into the > > > > pg_* routines where applicable. I guess we could add pg_* routines for > > > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > > > > > One problem we ran into with the the shm_open() case (which is nearly > > > identical under the covers, since shm_open() just opens a file in a > > > tmpfs on Linux) was that a simple retry loop like this could never > > > terminate if the process was receiving a lot of signals from the > > > recovery process, which is why we went with the idea of masking > > > signals instead. Eventually we should probably grow the file in > > > smaller chunks with a CFI in between so that we both guarantee that we > > > make progress (by masking for smaller size increases) and service > > > interrupts in a timely fashion (by unmasking between loops). I don't > > > think that applies here because we're not trying to fallocate > > > humongous size increases in one go, but I just want to note that we're > > > making a different choice. I think this looks reasonable for the use > > > cases we actually have here. > > > > FWIW I share the same feeling about looping by EINTR without signals > > being blocked. If we just retry the same operation without processing > > signals after getting EINTR, I think blocking signals is better. We > > could block signals more gracefully, but I'm not sure it's worth the > > complexity. > > I seriously doubt it's a good path to go down in this case. As Thomas > mentioned, this case isn't really comparable to the shm_open() one, due to the > bounded vs unbounded amount of memory we're dealing with. > > What would be the benefit? I'm not certain what you mean by "it" here. Regarding signal blocking, the benefit would be a lower chance of getting constantly interrupted by a string of frequent interrupts, which can't be prevented just by looping over. From what I gathered, Thomas meant that we don't need to use chunking to prevent long periods of ignoring interrupts because we're extending a file by a few blocks. However, I might have misunderstood. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Alvaro Herrera
Date:
On 2023-Apr-24, Andres Freund wrote: > A prototype of that approach is attached. I pushed the retry handling into the > pg_* routines where applicable. I guess we could add pg_* routines for > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > Christoph, could you verify this fixes your issue? So, is anyone making progress on this? I don't see anything in the thread. On adding the missing pg_* wrappers: I think if we don't (and we leave the retry loops at the File* layer), then the risk is that some external code would add calls to the underlying File* routines trusting them to do the retrying, which would then become broken when we move the retry loops to the pg_* wrappers when we add them. That doesn't seem terribly serious to me. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Christoph Berg
Date:
Re: Alvaro Herrera > > Christoph, could you verify this fixes your issue? > > So, is anyone making progress on this? I don't see anything in the > thread. Well, I had reported that I haven't been seeing any problems since I applied the patch to the postgresql-16.deb package. So for me, the patch looks like it solves the problem. Christoph
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On 2023-06-06 21:53:00 +0200, Alvaro Herrera wrote: > On 2023-Apr-24, Andres Freund wrote: > > > A prototype of that approach is attached. I pushed the retry handling into the > > pg_* routines where applicable. I guess we could add pg_* routines for > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here. > > > > Christoph, could you verify this fixes your issue? > > So, is anyone making progress on this? I don't see anything in the > thread. Thanks for bringing it up again, I had lost track. I now added an open items entry. My gut feeling is that we should go with something quite minimal at this stage. > On adding the missing pg_* wrappers: I think if we don't (and we leave > the retry loops at the File* layer), then the risk is that some external > code would add calls to the underlying File* routines trusting them to > do the retrying, which would then become broken when we move the retry > loops to the pg_* wrappers when we add them. That doesn't seem terribly > serious to me. I'm not too worried about that either. Unless somebody strongly advocates a different path, I plan to push something along the lines of the prototype I had posted. After reading over it a bunch more times, some of the code is a bit finnicky. I wish we had some hack that made syscalls EINTR at a random intervals, just to make it realistic to test these paths... Greetings, Andres Freund
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi Tom, Unfortunately, due to some personal life business, it took until for me to feel comfortable pushing the fix for https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de (FileFallocate() erroring out with EINTR due to running on tmpfs). Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs with the patch and patch + test infra, and they all passed. Greetings, Andres Freund
Attachment
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > Unfortunately, due to some personal life business, it took until for me to > feel comfortable pushing the fix for > https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de > (FileFallocate() erroring out with EINTR due to running on tmpfs). > Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs > with the patch and patch + test infra, and they all passed. We still have a week till beta2 wrap, so I'd say push. If the buildfarm gets unhappy you can revert. regards, tom lane
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Andres Freund <andres@anarazel.de> writes: >> Unfortunately, due to some personal life business, it took until for me to >> feel comfortable pushing the fix for >> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de >> (FileFallocate() erroring out with EINTR due to running on tmpfs). >> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs >> with the patch and patch + test infra, and they all passed. > >We still have a week till beta2 wrap, so I'd say push. If the buildfarm >gets unhappy you can revert. Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3 Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Amit Kapila
Date:
On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote: > > On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >Andres Freund <andres@anarazel.de> writes: > >> Unfortunately, due to some personal life business, it took until for me to > >> feel comfortable pushing the fix for > >> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de > >> (FileFallocate() erroring out with EINTR due to running on tmpfs). > >> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs > >> with the patch and patch + test infra, and they all passed. > > > >We still have a week till beta2 wrap, so I'd say push. If the buildfarm > >gets unhappy you can revert. > > Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3 > Can we close the open item corresponding to this after your commit 0d369ac650? -- With Regards, Amit Kapila.
Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
From
Andres Freund
Date:
Hi, On 2023-06-21 08:54:48 +0530, Amit Kapila wrote: > On Mon, Jun 19, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote: > > > > On June 19, 2023 10:37:45 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >Andres Freund <andres@anarazel.de> writes: > > >> Unfortunately, due to some personal life business, it took until for me to > > >> feel comfortable pushing the fix for > > >> https://www.postgresql.org/message-id/ZEZDj1H61ryrmY9o@msg.df7cb.de > > >> (FileFallocate() erroring out with EINTR due to running on tmpfs). > > >> Do you want me to hold off before beta2 is wrapped? I did a bunch of CI runs > > >> with the patch and patch + test infra, and they all passed. > > > > > >We still have a week till beta2 wrap, so I'd say push. If the buildfarm > > >gets unhappy you can revert. > > > > Hah. Somehow I confused myself into thinking you're wrapping later today. Calendar math vs Andres: 6753:3 > > > > Can we close the open item corresponding to this after your commit 0d369ac650? Yes, sorry for forgetting that. Done now. Greetings, Andres Freund