On Sun, Oct 24, 2021 at 02:45:38PM +0300, Andrey Borodin wrote:
> > On Oct 24, 2021, at 08:00, Noah Misch <noah@leadboat.com> wrote:
> > Buildfarm member conchuela (DragonFly BSD 6.0) has gotten multiple
> > "IPC::Run: timeout on timer" in the new tests. No other animal has.
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-10-24%2003%3A05%3A09
> > is an example run. The pgbench queries finished quickly, but the
> > $pgbench_h->finish() apparently timed out after 180s. I guess this would be
> > consistent with pgbench blocking in write(), waiting for something to empty a
> > pipe buffer so it can write more. I thought finish() would drain any incoming
> > I/O, though. This phenomenon has been appearing regularly via
> > src/test/recovery/t/017_shm.pl[1], so this thread doesn't have a duty to
> > resolve it. A stack trace of the stuck pgbench should be informative, though.
>
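To make the pipe-buffer theory concrete, here is a minimal sketch in Python
(not the Perl the tests use; fill_pipe is a name made up for illustration).
It writes to a pipe that nobody reads, in non-blocking mode, until the kernel
refuses more data. A blocking writer, such as the theorized stuck pgbench,
would sleep at that point until the reader drains the pipe. The capacity it
reports is platform-dependent.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe nobody reads until the kernel refuses more data.

    Returns the number of bytes the pipe buffer accepted.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full buffer raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x" * chunk_size)
            except BlockingIOError:
                # The buffer is full. A blocking writer would sleep here
                # until something reads from the other end of the pipe.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe accepted", fill_pipe(), "bytes before write() would block")
```

If finish() really drains the child's output, the child should never stay in
this state, which is why a stack trace of the stuck process would help.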
> Some thoughts:
> 0. I doubt that psql/pgbench is stuck in these failures.

Got it. If pgbench is a zombie, the fault does lie in IPC::Run or the kernel.

> 1. All observed similar failures seem to be related to the finish() sub of the IPC::Run harness.
> 2. finish() must pump any pending data from the process [0], but it can hang if the process is waiting for something.
> 3. There is a reported bug in finish() [1], but its description is slightly different.

Since that report is about a Perl-process child on Linux, I think we can treat
it as unrelated.

These failures started on 2021-10-09, the day conchuela updated from DragonFly
v4.4.3-RELEASE to DragonFly v6.0.0-RELEASE. It smells like a kernel bug.

Since the theorized kernel bug seems not to affect
src/test/subscription/t/015_stream.pl, I wonder if we can borrow a workaround
from other tests. One thing that src/test/recovery/t/017_shm.pl and the newest
failure sites have in common is that they don't write anything to the child's
stdin. Does writing e.g. a single byte (that the child doesn't use) work around
the problem? If not, does passing the script via stdin, like "pgbench -f-
<script.sql", work around the problem?
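For clarity, the two experiments have roughly the following shape. This is
only a sketch in Python, not the Perl/IPC::Run code the tests use. The child
processes are stand-ins for pgbench, and the function names are hypothetical.
The first variant sends one byte the child never reads. The second feeds the
whole script through stdin, as "pgbench -f-" would.

```python
import subprocess
import sys

# Stand-in children (assumptions, not pgbench itself).
IGNORES_STDIN = "print('done')"
READS_STDIN = "import sys; sys.stdout.write(sys.stdin.read())"


def run_child(child_code, stdin_bytes, timeout=10):
    """Run a child with piped stdin/stdout and wait for it to finish."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # communicate() writes the input, closes stdin, drains stdout, and waits
    # for exit. It raises subprocess.TimeoutExpired if the child hangs.
    out, _ = proc.communicate(input=stdin_bytes, timeout=timeout)
    return proc.returncode, out.decode().strip()


def run_with_unused_byte():
    """Variant 1: write a single byte that the child does not use."""
    return run_child(IGNORES_STDIN, b"\n")


def run_with_script_on_stdin(script):
    """Variant 2: pass the script itself via stdin."""
    return run_child(READS_STDIN, script.encode())


if __name__ == "__main__":
    print(run_with_unused_byte())
    print(run_with_script_on_stdin("SELECT 1;"))
```

In the real tests the equivalent change would go into the IPC::Run harness
setup. Whether either variant avoids the timeout on conchuela is exactly the
open question.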
> [0] https://metacpan.org/dist/IPC-Run/source/lib/IPC/Run.pm#L3481
> [1] https://github.com/toddr/IPC-Run/issues/57