Re: src/test/recovery regression failure on bionic - Mailing list pgsql-hackers

From Tom Lane
Subject Re: src/test/recovery regression failure on bionic
Msg-id 1462.1578522666@sss.pgh.pa.us
In response to Re: src/test/recovery regression failure on bionic  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: src/test/recovery regression failure on bionic  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
I wrote:
> This would happen if anything is causing the postmaster to have
> a few more open files than the test added by commit
> d207038053837ae9365df2776371632387f6f655 is allowing for.  It's
> a test bug and nothing more.
> Why sidewinder is not showing this in HEAD too is an interesting
> question, but it isn't.  However, it could be that on another
> platform (ie bionic) the problem does manifest in HEAD.

I set up a NetBSD 7 installation locally, and while I have not
directly reproduced the failure, I believe I understand all the
components of it now.

(1) d20703805's test will clearly fall over if there are more than six
FDs open in the postmaster when set_max_safe_fds is called, because it
sets max_files_per_process = 26 while set_max_safe_fds requires at
least 20 usable FDs to be available; that leaves room for at most
26 - 20 = 6 FDs already open.

(2) The postmaster's stdin/stdout/stderr will surely eat up three of
those.

(3) In HEAD, that's actually all the FDs there are normally, but in the
back branches there is one more (under the conditions of this test),
because in the back branches we open the postmaster's listen sockets
before we run set_max_safe_fds.  (9a86f03b4 changed this.)

(4) NetBSD 7.0's cron leaves three extra open FDs in processes that
it spawns.  I have not looked into why, but I have experimentally
observed this.  For example, lsof on a "sleep" launched from cron
shows

COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
sleep   7824  tgl  cwd   VDIR                0,0      512  795201 /home/tgl
sleep   7824  tgl  txt   VREG                0,0    10431 1613152 /bin/sleep
sleep   7824  tgl  txt   VREG                0,0  1616564   22726 /lib/libc.so.12.193.1
sleep   7824  tgl  txt   VREG                0,0    55295   22747 /lib/libgcc_s.so.1.0
sleep   7824  tgl  txt   VREG                0,0   187183   22762 /lib/libm.so.0.11
sleep   7824  tgl  txt   VREG                0,0    92195 1499524 /libexec/ld.elf_so
sleep   7824  tgl    0r  PIPE 0xfffffe803131eb58    16384
sleep   7824  tgl    1w  PIPE 0xfffffe8007ec4a30        0         ->0xfffffe800cc0d2c0
sleep   7824  tgl    2w  PIPE 0xfffffe8007ec4a30        0         ->0xfffffe800cc0d2c0
sleep   7824  tgl    7u                                           unknown file system type: 0
sleep   7824  tgl    8u                                           unknown file system type: 0
sleep   7824  tgl    9w  PIPE 0xfffffe80036c4dc0        0

while of course "sleep" launched by hand has only 0/1/2 open.
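To repeat this comparison on a machine without lsof, a process can probe its own descriptor table. The following is a minimal sketch, assuming only that os.fstat() raises OSError on a closed descriptor; the name open_fds and the scan limit of 256 are illustrative choices, not part of any PostgreSQL tooling. The interpreter may hold a descriptor or two of its own, so compare a by-hand run against a from-cron run rather than against a bare 0/1/2.

```python
import os
from typing import List


def open_fds(limit: int = 256) -> List[int]:
    """Return the descriptors in [0, limit) that are currently open.

    Hypothetical helper: os.fstat() fails with OSError (EBADF) on a
    closed descriptor, so this needs neither /proc nor lsof.
    The limit of 256 is an arbitrary assumption, not a system value.
    """
    fds = []
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not open in this process
        fds.append(fd)
    return fds


if __name__ == "__main__":
    # Run once from a shell and once from a crontab entry, then diff.
    print(open_fds())
```

Any descriptor that shows up only in the cron-launched run is one that cron leaked into the child, as with FDs 7, 8 and 9 in the lsof listing.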

We may conclude that when the regression tests are launched from cron,
as would be typical for a buildfarm animal, HEAD has exactly zero FDs
to spare in this test, while the back branches are one FD underwater
and fail.  This matches the observed results from sidewinder.
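The FD budget can be checked with a few lines of arithmetic. This is a simplified model of the rule stated in point (1), not PostgreSQL's actual fd.c code: it assumes startup needs 20 usable FDs out of max_files_per_process = 26, and it takes the per-scenario FD counts from points (2) through (4). The function spare_fds and the constant names are hypothetical.

```python
# Simplified model of the set_max_safe_fds check described in point (1).
MAX_FILES_PER_PROCESS = 26  # value set by the d20703805 test
MIN_USABLE_FDS = 20         # usable FDs that set_max_safe_fds insists on

STDIO = 3          # stdin/stdout/stderr, point (2)
LISTEN_SOCKET = 1  # back branches only, point (3)
CRON_EXTRA = 3     # leaked by NetBSD 7.0 cron, point (4)


def spare_fds(already_open: int) -> int:
    """FDs left over after the requirement; negative means startup fails."""
    return MAX_FILES_PER_PROCESS - already_open - MIN_USABLE_FDS


scenarios = {
    "HEAD, by hand": STDIO,
    "HEAD, from cron": STDIO + CRON_EXTRA,
    "back branch, by hand": STDIO + LISTEN_SOCKET,
    "back branch, from cron": STDIO + LISTEN_SOCKET + CRON_EXTRA,
}

if __name__ == "__main__":
    for name, already_open in scenarios.items():
        spare = spare_fds(already_open)
        status = "FAIL" if spare < 0 else "ok"
        print(f"{name}: open={already_open} spare={spare} {status}")
```

Under this model HEAD launched from cron comes out at 0 spare and a back branch launched from cron at -1, matching the conclusion above; a failure on HEAD would need STDIO + 4 open FDs, i.e. four excess descriptors rather than three.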

It's not clear whether any of this info applies to Christoph's trouble
with bionic.  If the extra FDs are an old cron bug, it could be that
bionic shares that bug --- but to explain failure on HEAD, you'd have to
posit four excess FDs, not three.  I'm not convinced that what Christoph
is seeing matches this anyway; he hasn't shown the telltale
"insufficient file descriptors" message, at least.  Still, maybe
launched-by-cron vs launched-by-hand is a relevant point there.

            regards, tom lane

