Thread: buildfarm instance bichir stuck
Hi,
Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.
Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't really production use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed it after a long wait - see [1].
Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not working may be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll.
Tracing commits, it seems that the commit 6a2a70a02018d6362f9841cc2f499cc45405e86b broke things and I can confirm that 'make check' works if I rollback to the preceding commit ( 83709a0d5a46559db016c50ded1a95fd3b0d3be6 ).
Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't really production use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed it after a long wait - see [1].
Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not working may be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll.
Tracing commits, it seems that the commit 6a2a70a02018d6362f9841cc2f499cc45405e86b broke things and I can confirm that 'make check' works if I rollback to the preceding commit ( 83709a0d5a46559db016c50ded1a95fd3b0d3be6 ).
Not sure if many agree but 2 things stood out here:
1) Buildfarm never got the message that a commit broke an instance. Ideally I'd have expected buildfarm to have an optimistic timeout that could have helped - for e.g. right now, the CREATE DATABASE is still stuck since 18 hrs.
2) bichir is clearly not a production use-case (it takes 5 hrs to complete a HEAD run!), so let me know if this change is intentional (I guess I'll stop maintaining it if so) but thought I'd still put this out in case it interests someone.
-
thanks
robins
Reference:
1) Last run that I had to kill - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05
#####################################################
The current run is running since yesterday.
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ tail -2 lastcommand.log
running on port 5678 with PID 8715
============== creating database "regression" ==============
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr 7 12:48:26 AEST 2021
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls -la
total 840
drwxrwxr-x 1 postgres postgres 4096 Apr 6 09:00 .
drwxrwxr-x 1 postgres postgres 4096 Apr 6 08:55 ..
-rw-rw-r-- 1 postgres postgres 1358 Apr 6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres 91546 Apr 6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres 40 Apr 6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres 2890 Apr 6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr 6 09:00 make.log
root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.
root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0 0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f41a9bc8eb1 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=<optimized out>) at fe-misc.c:1133
#3 pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#4 0x00007f41a9bc8ff0 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x7fffd979a0b0, finish_time=<optimized out>) at fe-misc.c:1007
#5 0x00007f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at fe-exec.c:1963
#6 0x00007f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at fe-exec.c:2306
#7 0x00007f41a9bc5ef2 in PQexec (conn=<optimized out>, query=query@entry=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at fe-exec.c:2148
#8 0x00007f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at common.c:1303
#9 0x00007f41aa2160a6 in main (argc=<optimized out>, argv=<optimized out>) at startup.c:369
#####################################################
Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past 'CREATE DATABASE'
=======================
robins@WSLv1:~/proj/postgres/postgres$ git checkout 83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/li b" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --no-locale --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_sched ule
============== removing existing temp instance ==============
============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 58080 with PID 25879
============== creating database "regression" ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries ==============
test tablespace ... ok 1239 ms
parallel group (20 tests): boolean char varchar name text int2 int4 int8 oid float4 float8 bit^CGNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt
But checking out 6a2a70a02018d6362f9841cc2f499cc45405e86b we can see that it hangs at 'CREATE DATABASE'
=======================================
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ git checkout 6a2a70a02018d6362f9841cc2f499cc45405e86b
Previous HEAD position was 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
HEAD is now at 6a2a70a020 Use signalfd(2) for epoll latches.
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/lib" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --no-locale --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_schedule
============== removing existing temp instance ==============
============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 58080 with PID 26702
============== creating database "regression" ==============
stuck here ^^^
^CCancel request sent
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
command failed: "psql" -X -c "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'" "postgres"
pg_ctl: PID file "/home/robins/proj/postgres/postgres/src/test/regress/./tmp_check/data/postmaster.pid" does not exist
Is server running?
pg_regress: could not stop postmaster: exit code was 256
GNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt
-
thanks
robins
Reference:
1) Last run that I had to kill - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05
#####################################################
The current run is running since yesterday.
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ tail -2 lastcommand.log
running on port 5678 with PID 8715
============== creating database "regression" ==============
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr 7 12:48:26 AEST 2021
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls -la
total 840
drwxrwxr-x 1 postgres postgres 4096 Apr 6 09:00 .
drwxrwxr-x 1 postgres postgres 4096 Apr 6 08:55 ..
-rw-rw-r-- 1 postgres postgres 1358 Apr 6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres 91546 Apr 6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres 40 Apr 6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres 2890 Apr 6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr 6 09:00 make.log
root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.
root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0 0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f41a9bc8eb1 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=<optimized out>) at fe-misc.c:1133
#3 pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#4 0x00007f41a9bc8ff0 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x7fffd979a0b0, finish_time=<optimized out>) at fe-misc.c:1007
#5 0x00007f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at fe-exec.c:1963
#6 0x00007f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at fe-exec.c:2306
#7 0x00007f41a9bc5ef2 in PQexec (conn=<optimized out>, query=query@entry=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at fe-exec.c:2148
#8 0x00007f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at common.c:1303
#9 0x00007f41aa2160a6 in main (argc=<optimized out>, argv=<optimized out>) at startup.c:369
#####################################################
Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past 'CREATE DATABASE'
=======================
robins@WSLv1:~/proj/postgres/postgres$ git checkout 83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/li b" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --no-locale --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_sched ule
============== removing existing temp instance ==============
============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 58080 with PID 25879
============== creating database "regression" ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries ==============
test tablespace ... ok 1239 ms
parallel group (20 tests): boolean char varchar name text int2 int4 int8 oid float4 float8 bit^CGNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt
But checking out 6a2a70a02018d6362f9841cc2f499cc45405e86b we can see that it hangs at 'CREATE DATABASE'
=======================================
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ git checkout 6a2a70a02018d6362f9841cc2f499cc45405e86b
Previous HEAD position was 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
HEAD is now at 6a2a70a020 Use signalfd(2) for epoll latches.
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/lib" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --no-locale --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_schedule
============== removing existing temp instance ==============
============== creating temporary instance ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 58080 with PID 26702
============== creating database "regression" ==============
stuck here ^^^
^CCancel request sent
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
command failed: "psql" -X -c "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'" "postgres"
pg_ctl: PID file "/home/robins/proj/postgres/postgres/src/test/regress/./tmp_check/data/postmaster.pid" does not exist
Is server running?
pg_regress: could not stop postmaster: exit code was 256
GNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt
On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote: > Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b. Hrmph. That's "Use signalfd(2) for epoll latches." I had a similar report from an illumos user (but it was intermittent). I have never seen such a failure on Linux. My first guess is that these two systems that are doing Linux system call emulation have implemented subtly different semantics, and something is going wrong like this: a SIGUSR1 arrives to tell you some important news about a procsignal and the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid, SIGURG), but somehow that fails to wake up the epoll() you are sleeping in which contains the signalfd that should receive the signal and report it by being readable, due to some internal race. Or something like that. But I haven't been able to verify that theory because I don't have any of those computers. If it is indeed something like that and not a bug in my code, then I was thinking that the main tool available to deal with it would be to set WAIT_USE_POLL in the relevant template file, so that we don't use the combination of epoll + signalfd on illlumos, but then WSL1 thows a spanner in the works because AFAIK it's masquerading as Ubuntu, running PostgreSQL from an Ubuntu package with a freaky kernel. Hmm. > It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I cansee that it's been unable to even start regression tests after that commit went in. Oh, well at least it's easily reproducible then, that's something! > Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't reallyproduction use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed itafter a long wait - see [1]. > > Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not workingmay be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll. That's actually the client. I guess there is also a backend process stuck somewhere in epoll_wait()?
Hi Thomas,
Thanks for taking a look at this promptly.
On Wed, 7 Apr 2021 at 16:17, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> > It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.
>
> Oh, well at least it's easily reproducible then, that's something!
Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.
>
> That's actually the client. I guess there is also a backend process
> stuck somewhere in epoll_wait()?
You're right (and yes my bad, I was looking at the client). The server process is stuck in epoll_wait(). Let me know if you need me to give any other info that may be helpful.
root@WSLv1:~# gdb -batch -ex bt -p 29887
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
#0 0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007fa088c355dc in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffd2d4c090, cur_timeout=-1, set=0x7fffcbcc56e8) at latch.c:1428
#2 WaitEventSetWait (set=0x7fffcbcc56e8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffd2d4c090, nevents=nevents@entry=1, wait_event_info=wait_ev
#3 0x00007fa088c35a14 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217733) at
#4 0x00007fa088c43ed8 in ConditionVariableTimedSleep (cv=0x7fa0873cc498, timeout=-1, wait_event_info=134217733) at condition_variable.c:163
#5 0x00007fa088bba8bc in RequestCheckpoint (flags=flags@entry=44) at checkpointer.c:1017
#6 0x00007fa088a46315 in createdb (pstate=pstate@entry=0x7fffcbcebbc0, stmt=stmt@entry=0x7fffcbcca558) at dbcommands.c:711
.
.
.
-
robins
Thanks for taking a look at this promptly.
On Wed, 7 Apr 2021 at 16:17, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> > It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.
>
> Oh, well at least it's easily reproducible then, that's something!
Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch.
>
> That's actually the client. I guess there is also a backend process
> stuck somewhere in epoll_wait()?
You're right (and yes my bad, I was looking at the client). The server process is stuck in epoll_wait(). Let me know if you need me to give any other info that may be helpful.
root@WSLv1:~# gdb -batch -ex bt -p 29887
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
#0 0x00007fa087741a07 in epoll_wait (epfd=10, events=0x7fffcbcc5748, maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007fa088c355dc in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffd2d4c090, cur_timeout=-1, set=0x7fffcbcc56e8) at latch.c:1428
#2 WaitEventSetWait (set=0x7fffcbcc56e8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffd2d4c090, nevents=nevents@entry=1, wait_event_info=wait_ev
#3 0x00007fa088c35a14 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217733) at
#4 0x00007fa088c43ed8 in ConditionVariableTimedSleep (cv=0x7fa0873cc498, timeout=-1, wait_event_info=134217733) at condition_variable.c:163
#5 0x00007fa088bba8bc in RequestCheckpoint (flags=flags@entry=44) at checkpointer.c:1017
#6 0x00007fa088a46315 in createdb (pstate=pstate@entry=0x7fffcbcebbc0, stmt=stmt@entry=0x7fffcbcca558) at dbcommands.c:711
.
.
.
-
robins
On 4/7/21 2:16 AM, Thomas Munro wrote: > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote: >> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b. > Hrmph. That's "Use signalfd(2) for epoll latches." I had a similar > report from an illumos user (but it was intermittent). I have never > seen such a failure on Linux. My first guess is that these two > systems that are doing Linux system call emulation have implemented > subtly different semantics, and something is going wrong like this: a > SIGUSR1 arrives to tell you some important news about a procsignal and > the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid, > SIGURG), but somehow that fails to wake up the epoll() you are > sleeping in which contains the signalfd that should receive the signal > and report it by being readable, due to some internal race. Or > something like that. But I haven't been able to verify that theory > because I don't have any of those computers. If it is indeed > something like that and not a bug in my code, then I was thinking that > the main tool available to deal with it would be to set WAIT_USE_POLL > in the relevant template file, so that we don't use the combination of > epoll + signalfd on illlumos, but then WSL1 thows a spanner in the > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL > from an Ubuntu package with a freaky kernel. Hmm. > To test this the OP could just add CPPFLAGS => '-DWAIT_USE_POLL', to his animal's config's config_env stanza. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Thanks Andrew.
On Wed, 7 Apr 2021 at 21:49, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 4/7/21 2:16 AM, Thomas Munro wrote:
> > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> >> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> > ...If it is indeed
> > something like that and not a bug in my code, then I was thinking that
> > the main tool available to deal with it would be to set WAIT_USE_POLL
> > in the relevant template file, so that we don't use the combination of
> > epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
> > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> > from an Ubuntu package with a freaky kernel. Hmm.
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep CPPFLAGS configure.log| grep using
configure: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
configure:19511: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep -A2 "creating database" lastcommand.log
============== creating database "regression" ==============
CREATE DATABASE
ALTER DATABASE
-
thanks
The build's still running but the CPPFLAGS hint does seem to have helped (see below).
Unless advised otherwise, I intend to let that option be, so as to get bichir back online. If a future commit 'fixes' things, I could rollback this flag to test things out (or try out other options if required).
> On 4/7/21 2:16 AM, Thomas Munro wrote:
> > On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> >> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> > ...If it is indeed
> > something like that and not a bug in my code, then I was thinking that
> > the main tool available to deal with it would be to set WAIT_USE_POLL
> > in the relevant template file, so that we don't use the combination of
> > epoll + signalfd on illlumos, but then WSL1 thows a spanner in the
> > works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> > from an Ubuntu package with a freaky kernel. Hmm.
> To test this the OP could just add
> CPPFLAGS => '-DWAIT_USE_POLL',
> to his animal's config's config_env stanza.
> CPPFLAGS => '-DWAIT_USE_POLL',
> to his animal's config's config_env stanza.
This did help in getting past the previous hurdle.
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep CPPFLAGS configure.log| grep using
configure: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
configure:19511: using CPPFLAGS=-DWAIT_USE_POLL -D_GNU_SOURCE -I/usr/include/libxml2
postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ grep -A2 "creating database" lastcommand.log
============== creating database "regression" ==============
CREATE DATABASE
ALTER DATABASE
-
thanks
robins
Robins Tharakan <tharakan@gmail.com> writes: > Not sure if many agree but 2 things stood out here: > 1) Buildfarm never got the message that a commit broke an instance. Ideally > I'd have expected buildfarm to have an optimistic timeout that could have > helped - for e.g. right now, the CREATE DATABASE is still stuck since 18 > hrs. As far as that goes, you can set wait_timeout in the animal's config to something comfortably more than the longest run time you expect. It doesn't default to enabled though, possibly because picking a one-size-fits-all value would be impossible. I do use it on some of my flakier dinosaurs, and I've noticed that when it does kick in, the buildfarm run just stops dead and no report is sent to the BF server. That has advantages in not cluttering the BF status with run-failed-because-of-$weird_problem issues, but it doesn't help from the standpoint of noticing when your animal is stuck. Maybe it'd be better to change that behavior. (I can also attest from personal experience that what had been a comfortable amount of slop when you picked it tends to become less so over time. Consider yourself warned.) regards, tom lane
On 4/7/21 1:07 PM, Tom Lane wrote: > Robins Tharakan <tharakan@gmail.com> writes: >> Not sure if many agree but 2 things stood out here: >> 1) Buildfarm never got the message that a commit broke an instance. Ideally >> I'd have expected buildfarm to have an optimistic timeout that could have >> helped - for e.g. right now, the CREATE DATABASE is still stuck since 18 >> hrs. > As far as that goes, you can set wait_timeout in the animal's config > to something comfortably more than the longest run time you expect. > It doesn't default to enabled though, possibly because picking a > one-size-fits-all value would be impossible. > > I do use it on some of my flakier dinosaurs, and I've noticed that > when it does kick in, the buildfarm run just stops dead and no report > is sent to the BF server. That has advantages in not cluttering the > BF status with run-failed-because-of-$weird_problem issues, but it > doesn't help from the standpoint of noticing when your animal is stuck. > Maybe it'd be better to change that behavior. > Yeah, I'll have a look. It's not simple for a bunch of reasons. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 4/7/21 1:07 PM, Tom Lane wrote: >> I do use it on some of my flakier dinosaurs, and I've noticed that >> when it does kick in, the buildfarm run just stops dead and no report >> is sent to the BF server. That has advantages in not cluttering the >> BF status with run-failed-because-of-$weird_problem issues, but it >> doesn't help from the standpoint of noticing when your animal is stuck. >> Maybe it'd be better to change that behavior. > Yeah, I'll have a look. It's not simple for a bunch of reasons. On further thought, that doesn't seem like the place to fix it. I'd rather be able to ask the buildfarm server to send me nagmail if my animal hasn't sent a report in N days (where N had better be owner-configurable). This would catch not only animal-is-hung, but also other classes of problems like whole-machine-is-hung or you-broke-your-firewall-configuration-so-it-cant-contact-the-server. I've had issues of those sorts before ... regards, tom lane
On 4/7/21 4:02 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 4/7/21 1:07 PM, Tom Lane wrote: >>> I do use it on some of my flakier dinosaurs, and I've noticed that >>> when it does kick in, the buildfarm run just stops dead and no report >>> is sent to the BF server. That has advantages in not cluttering the >>> BF status with run-failed-because-of-$weird_problem issues, but it >>> doesn't help from the standpoint of noticing when your animal is stuck. >>> Maybe it'd be better to change that behavior. >> Yeah, I'll have a look. It's not simple for a bunch of reasons. > On further thought, that doesn't seem like the place to fix it. > I'd rather be able to ask the buildfarm server to send me nagmail > if my animal hasn't sent a report in N days (where N had better > be owner-configurable). This would catch not only animal-is-hung, > but also other classes of problems like whole-machine-is-hung or > you-broke-your-firewall-configuration-so-it-cant-contact-the-server. > I've had issues of those sorts before ... > > That already exists, and has for a long time. See the 'alerts' stanza of your config file. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 4/7/21 4:02 PM, Tom Lane wrote: >> On further thought, that doesn't seem like the place to fix it. >> I'd rather be able to ask the buildfarm server to send me nagmail >> if my animal hasn't sent a report in N days (where N had better >> be owner-configurable). > That already exists, and has for a long time. See the 'alerts' stanza of > your config file. Oh! In that case, I don't think we need anything else. regards, tom lane
On Wed, Apr 7, 2021 at 7:31 PM Robins Tharakan <tharakan@gmail.com> wrote: > Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch. From your description it sounds like signals are not arriving at all, rather than some more complicated race. Let's go back to basics... what does the attached program print for you? I see: tmunro@x1:~/junk$ cc test-signalfd.c tmunro@x1:~/junk$ ./a.out blocking SIGURG... creating a signalfd to receive SIGURG... creating an epoll set... adding signalfd to epoll set... polling the epoll set... 0 sending a signal... polling the epoll set... 1
Attachment
On Fri, 9 Apr 2021 at 16:12, Thomas Munro <thomas.munro@gmail.com> wrote:
> From your description it sounds like signals are not arriving at all,
> rather than some more complicated race. Let's go back to basics...
> what does the attached program print for you? I see:
>
> tmunro@x1:~/junk$ cc test-signalfd.c
> tmunro@x1:~/junk$ ./a.out
> blocking SIGURG...
> creating a signalfd to receive SIGURG...
> creating an epoll set...
> adding signalfd to epoll set...
> polling the epoll set... 0
> sending a signal...
> polling the epoll set... 1
I get pretty much the same. Some additional info below, although not sure if it'd be of any help here.
robins@WSLv1:~/proj/hackers$ cc test-signalfd.c
robins@WSLv1:~/proj/hackers$ ./a.out
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1
robins@WSLv1:~/proj/hackers$ cat /proc/cpuinfo | egrep 'flags|model' | sort -u
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt ibrs ibpb stibp ssbd
model : 142
model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
robins@WSLv1:~/proj/hackers$ uname -a
Linux WSLv1 4.4.0-19041-Microsoft #488-Microsoft Mon Sep 01 13:43:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
C:>wsl -l -v
NAME STATE VERSION
* Ubuntu-18.04 Running 1
-
robins
blocking SIGURG...
creating a signalfd to receive SIGURG...
creating an epoll set...
adding signalfd to epoll set...
polling the epoll set... 0
sending a signal...
polling the epoll set... 1
robins@WSLv1:~/proj/hackers$ cat /proc/cpuinfo | egrep 'flags|model' | sort -u
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt ibrs ibpb stibp ssbd
model : 142
model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
robins@WSLv1:~/proj/hackers$ uname -a
Linux WSLv1 4.4.0-19041-Microsoft #488-Microsoft Mon Sep 01 13:43:00 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
C:>wsl -l -v
NAME STATE VERSION
* Ubuntu-18.04 Running 1
-
robins
On Fri, Apr 9, 2021 at 6:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Apr 7, 2021 at 7:31 PM Robins Tharakan <tharakan@gmail.com> wrote: > > Correct. This is easily reproducible on this test-instance, so let me know if you want me to test a patch. > > From your description it sounds like signals are not arriving at all, > rather than some more complicated race. Let's go back to basics... I was looking into the portability of SIGURG and OOB socket data for something totally different (hallway track discussion from PGCon, could we use that for query cancel, like FTP does, instead of opening another socket?), and lo and behold, someone has figured out a workaround for this latch problem: https://github.com/microsoft/WSL/issues/8619 I don't really want to add code to scrape uname() ouput detect different kernels at runtime as shown there, but it doesn't seem to make a difference on Linux if we just always do what was suggested. I didn't look too hard into whether that is the right place to put the call, or really understand *why* it works, and since I am not a Windows user and we don't have a WSL1 CI, I can't confirm that it works or explore whether there is some other ordering of operations that would be better but still work, but if that does the trick then maybe we should just do something like the attached. Thoughts?