buildfarm instance bichir stuck - Mailing list pgsql-hackers

From Robins Tharakan
Subject buildfarm instance bichir stuck
Msg-id CAEP4nAymAZP1VEBNoWAQca85ZtU5YxuwS95+Vu+XW+-eMfq_vQ@mail.gmail.com
Responses Re: buildfarm instance bichir stuck  (Thomas Munro <thomas.munro@gmail.com>)
Re: buildfarm instance bichir stuck  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

Bichir has been stuck for the past month: it has been unable to run regression tests since commit 6a2a70a02018d6362f9841cc2f499cc45405e86b.

Interestingly, that commit is a month old and no other buildfarm client seems to have complained since. Digging in, though, I can see that bichir has been unable to even start the regression tests since that commit went in.

Note that bichir runs on WSL1 (not WSL2), i.e. Windows Subsystem for Linux inside Windows 10, so it isn't really a production use case. The only run that actually got submitted to the buildfarm was from a few days back, when I killed it after a long wait - see [1].

Since yesterday, I have another run that is again stuck on CREATE DATABASE (see outputs below). pstack does not work, which may be a limitation of the architecture / installation (unsure), but a gdb backtrace shows the psql client stuck in poll(), waiting for the server's response.

Tracing commits, it seems that commit 6a2a70a02018d6362f9841cc2f499cc45405e86b broke things: I can confirm that 'make check' works if I roll back to the preceding commit (83709a0d5a46559db016c50ded1a95fd3b0d3be6).
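
Tracing commits like this amounts to searching for the first bad commit between a known-good and a known-bad revision. A minimal Python sketch of that search follows. `first_bad` and the `is_bad` callback are hypothetical names for illustration, the two `later-*` entries are placeholders rather than real commits, and in practice `git bisect` does this job directly.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    that once a commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # breakage is at mid or earlier
        else:
            lo = mid   # breakage is after mid
    return commits[hi]


if __name__ == "__main__":
    # Abbreviated hashes from this report plus two placeholder entries.
    history = ["83709a0d5a", "6a2a70a020", "later-1", "later-2"]
    # Stand-in for "does 'make check' hang at CREATE DATABASE?".
    hangs = lambda commit: commit != "83709a0d5a"
    print(first_bad(history, hangs))
```

In a real run, `is_bad` would check out the commit and run 'make check' under a time limit, treating a hang as bad.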

Not sure if many agree, but two things stood out here:
1) The buildfarm never got the message that a commit broke an instance. Ideally, I'd have expected the buildfarm client to have an optimistic timeout that could have helped; e.g. right now, the CREATE DATABASE has been stuck for 18 hours.

2) bichir is clearly not a production use case (it takes 5 hours to complete a HEAD run!), so let me know if this change is intentional (I guess I'll stop maintaining it if so), but I thought I'd still put this out in case it interests someone.
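
The kind of timeout suggested in point 1 could look roughly like the following Python sketch. It is not the buildfarm client's code: the function name `run_with_timeout` and the status strings are made up for illustration. It only shows a wall-clock limit around a step such as 'make check', killing the whole process group so that a hung psql child does not survive. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd with a wall-clock limit.

    Returns (status, returncode), where status is 'ok', 'failed' or
    'timeout'. On timeout the whole process group is killed.
    """
    # start_new_session=True gives the child its own process group, so
    # grandchildren (e.g. psql started by make) can be killed with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # group id equals the child's pid
        proc.wait()                          # reap the killed child
        return ("timeout", None)
    return ("ok" if rc == 0 else "failed", rc)


if __name__ == "__main__":
    # A child that sleeps far longer than the limit stands in for the
    # hung CREATE DATABASE step.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(sleeper, timeout_s=1))
```

A caller that gets 'timeout' back could then report the step as failed instead of waiting indefinitely.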

-
thanks
robins

Reference:
1) Last run that I had to kill - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bichir&dt=2021-03-31%2012%3A00%3A05

#####################################################
The current run has been running since yesterday.


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ tail -2 lastcommand.log
running on port 5678 with PID 8715
============== creating database "regression"         ==============


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ date
Wed Apr  7 12:48:26 AEST 2021


postgres@WSLv1:/opt/postgres/bf/v11/buildroot/HEAD/bichir.lastrun-logs$ ls -la
total 840
drwxrwxr-x 1 postgres postgres   4096 Apr  6 09:00 .
drwxrwxr-x 1 postgres postgres   4096 Apr  6 08:55 ..
-rw-rw-r-- 1 postgres postgres   1358 Apr  6 08:55 SCM-checkout.log
-rw-rw-r-- 1 postgres postgres  91546 Apr  6 08:56 configure.log
-rw-rw-r-- 1 postgres postgres     40 Apr  6 08:55 githead.log
-rw-rw-r-- 1 postgres postgres   2890 Apr  6 09:01 lastcommand.log
-rw-rw-r-- 1 postgres postgres 712306 Apr  6 09:00 make.log


root@WSLv1:~# pstack 8729
8729: psql -X -c CREATE DATABASE "regression" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C' postgres
pstack: Bad address
failed to read target.


root@WSLv1:~# gdb -batch -ex bt -p 8729
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x00007f41a8ea4c84 in __GI___poll (fds=fds@entry=0x7fffe13d7be8, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f41a9bc8eb1 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffe13d7be8) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  pqSocketPoll (end_time=-1, forWrite=0, forRead=1, sock=<optimized out>) at fe-misc.c:1133
#3  pqSocketCheck (conn=0x7fffd979a0b0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1075
#4  0x00007f41a9bc8ff0 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x7fffd979a0b0, finish_time=<optimized out>) at fe-misc.c:1007
#5  0x00007f41a9bc5ac9 in PQgetResult (conn=0x7fffd979a0b0) at fe-exec.c:1963
#6  0x00007f41a9bc5ea3 in PQexecFinish (conn=0x7fffd979a0b0) at fe-exec.c:2306
#7  0x00007f41a9bc5ef2 in PQexec (conn=<optimized out>, query=query@entry=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at fe-exec.c:2148
#8  0x00007f41aa21e7a0 in SendQuery (query=0x7fffd9799f70 "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'") at common.c:1303
#9  0x00007f41aa2160a6 in main (argc=<optimized out>, argv=<optimized out>) at startup.c:369
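
Frame #0 of the backtrace shows poll() called with timeout=-1, which means wait indefinitely, so psql cannot notice on its own that the server never replies. The Python sketch below illustrates only that timeout behaviour, using a local socket pair. `wait_readable` is a hypothetical helper, and this is not libpq code.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms milliseconds.

    A negative timeout_ms (or None) blocks until data arrives, which is
    what timeout=-1 means in the backtrace above.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    return len(poller.poll(timeout_ms)) > 0


if __name__ == "__main__":
    client, server = socket.socketpair()
    print(wait_readable(client, 200))  # peer has sent nothing yet
    server.sendall(b"x")
    print(wait_readable(client, 200))  # one byte is now pending
    client.close()
    server.close()
```

With a finite timeout the first call returns instead of hanging. With -1 it would block until the peer wrote something, which matches the state psql is in here.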



#####################################################



Here we can see that 83709a0d5a46559db016c50ded1a95fd3b0d3be6 goes past 'CREATE DATABASE'
=======================
robins@WSLv1:~/proj/postgres/postgres$ git checkout 83709a0d5a46559db016c50ded1a95fd3b0d3be6
Previous HEAD position was 6a2a70a020 Use signalfd(2) for epoll latches.
HEAD is now at 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.

robins@WSLv1:~/proj/postgres/postgres$ cd src/test/regress/

robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1  checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/lib"  ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir=   --no-locale  --dlpath=. --max-concurrent-tests=20  --schedule=./parallel_schedule
============== removing existing temp instance        ==============
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 58080 with PID 25879
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries        ==============
test tablespace                   ... ok         1239 ms
parallel group (20 tests):  boolean char varchar name text int2 int4 int8 oid float4 float8 bit^CGNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt



But checking out 6a2a70a02018d6362f9841cc2f499cc45405e86b we can see that it hangs at 'CREATE DATABASE'
=======================================
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ git checkout 6a2a70a02018d6362f9841cc2f499cc45405e86b
Previous HEAD position was 83709a0d5a Use SIGURG rather than SIGUSR1 for latches.
HEAD is now at 6a2a70a020 Use signalfd(2) for epoll latches.
robins@WSLv1:~/proj/postgres/postgres/src/test/regress$ make -j4 NO_LOCALE=1 check
make -C ../../../src/backend generated-headers
rm -rf ./testtablespace
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/backend'
make -C catalog distprep generated-header-symlinks
make -C utils distprep generated-header-symlinks
mkdir ./testtablespace
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/utils'
make[2]: Entering directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend/catalog'
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/backend'
make -C ../../../src/port all
rm -rf '/home/robins/proj/postgres/postgres'/tmp_install
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/port'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/port'
make -C ../../../src/common all
make[1]: Entering directory '/home/robins/proj/postgres/postgres/src/common'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/src/common'
make -C ../../../contrib/spi
make[1]: Entering directory '/home/robins/proj/postgres/postgres/contrib/spi'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/robins/proj/postgres/postgres/contrib/spi'
/bin/mkdir -p '/home/robins/proj/postgres/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/robins/proj/postgres/postgres'/tmp_install install >'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
make -j1  checkprep >>'/home/robins/proj/postgres/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/bin:$PATH" LD_LIBRARY_PATH="/home/robins/proj/postgres/postgres/tmp_install/opt/postgres/master/lib"  ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir=   --no-locale  --dlpath=. --max-concurrent-tests=20  --schedule=./parallel_schedule
============== removing existing temp instance        ==============
============== creating temporary instance            ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 58080 with PID 26702
============== creating database "regression"         ==============
stuck here ^^^
^CCancel request sent
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
command failed: "psql" -X -c "CREATE DATABASE \"regression\" TEMPLATE=template0 LC_COLLATE='C' LC_CTYPE='C'" "postgres"
pg_ctl: PID file "/home/robins/proj/postgres/postgres/src/test/regress/./tmp_check/data/postmaster.pid" does not exist
Is server running?

pg_regress: could not stop postmaster: exit code was 256
GNUmakefile:132: recipe for target 'check' failed
make: *** [check] Interrupt
