Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: conchuela timeouts since 2021-10-09 system upgrade
Date
Msg-id CA+hUKG+jTMdxkyZJbUGyG7ggFKABWsBwSxSzLNVm-X87DeLfqg@mail.gmail.com
Whole thread Raw
In response to Re: conchuela timeouts since 2021-10-09 system upgrade  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: conchuela timeouts since 2021-10-09 system upgrade
List pgsql-bugs
I managed to reproduce the 017_shm.pl hang using my Vagrant file (see
earlier).  It happens for "gmake -s check-world", but not "gmake -s -C
src/test/recovery check", which might have confused me last time.  The
symptoms are as described already, though this time I learned with
truss that it's in a retry loop waiting 1 second at a time.

I don't know how to get debug symbols for libc on this operating
system (well, probably needed to select debug option while installing
the OS, but I didn't install the OS, I'm using a lazy bones pre-rolled
Vagrant image).  So in order to be able to inspect the arguments to
select, I made my own LD_PRELOAD wrapper for select (see end), and I
observed that it was calling select() with an empty fd_set:

(gdb) bt
#0  0x00000008014c1e6c in select () from /lib/libc.so.8
#1  0x00000008009f0fe6 in select () from /usr/lib/libpthread.so.0
#2  0x000000080044c622 in select (nfds=16, readfds=0x8017f6de0,
writefds=0x8017f6d50, exceptfds=0x0,
    timeout=0x7fffffdfd4a0) at select.c:13
#3  0x00000008007bfaf9 in Perl_pp_sselect () from
/usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#4  0x000000080076b036 in Perl_runops_standard ()
   from /usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#5  0x00000008006da3b7 in perl_run () from
/usr/local/lib/perl5/5.32/mach/CORE/libperl.so.5.32
#6  0x0000000000400de4 in main ()
(gdb) f 2
#2  0x000000080044c622 in select (nfds=16, readfds=0x8017f6de0,
writefds=0x8017f6d50, exceptfds=0x0,
    timeout=0x7fffffdfd4a0) at select.c:13
13        return real_function(nfds, readfds, writefds, exceptfds, timeout);
(gdb) print nfds
$1 = 16
(gdb) print *readfds
$2 = {fds_bits = {0 <repeats 16 times>}}
(gdb) print *writefds
$3 = {fds_bits = {0 <repeats 16 times>}}

So it looks a lot like something on the perl side has lost track of
the pipe it's supposed to be selecting on.  If I understand correctly,
it's supposed to be waiting for one of the following pipes to appear
as readable, whichever is the one that the zombie psql process
previously held the other end of:

$ fstat -p 73032
USER     CMD          PID   FD PATH                  INUM MODE
  SZ|DV R/W
vagrant  perl       73032 root /                        1 drwxr-xr-x
 offset:0  r
vagrant  perl       73032   wd
/home/vagrant/postgres/src/test/recovery     4434524058 drwxr-xr-x
offset:0  r
vagrant  perl       73032 text /pfs/@@-1:00001/local/bin/perl
     4329919842 -rwxr-xr-x    offset:0  r
vagrant  perl       73032    0 /dev/pts/0
          1335 crw--w----     pts/0:155171 rw
vagrant  perl       73032    1
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
  4477427235 -rw-r--r--    offset:9457  w
vagrant  perl       73032    2
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
  4477427235 -rw-r--r--    offset:9457  w
vagrant  perl       73032    3
/home/vagrant/postgres/src/test/recovery/tmp_check/log/regress_log_017_shm
  4477427235 -rw-r--r--    offset:9457  w
vagrant  perl       73032    4* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032    5* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032    6* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032    7* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032    8* pipe fffff800aac1dca0 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032    9* pipe fffff800ab6e9020 (B<->A) ravail 0
wavail 0 rw
vagrant  perl       73032   13* pipe fffff800ab157560 (B<->A) ravail 0
wavail 0 rw


=== select wrapper ===

$ cat select.c
#include <dlfcn.h>
#include <sys/select.h>

int
select(int nfds,fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
    struct timeval *timeout)
{
    static int (*real_function)(int, fd_set *, fd_set *, fd_set *,
        struct timeval *);

    if (!real_function)
        real_function = dlsym(RTLD_NEXT, "select");

    return real_function(nfds, readfds, writefds, exceptfds, timeout);
}

$ cc -Wall -fPIC -shared -g -o myselect.so select.c -ldl
$ cd postgres
$ LD_PRELOAD=/home/vagrant/myselect.so gmake check-world -s



pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Next
From: Дмитрий Иванов
Date:
Subject: Re: pg_restore depending on user functions