Re: buildfarm instance bichir stuck - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: buildfarm instance bichir stuck
Date
Msg-id CA+hUKG+Sm8ZDiyW5Sr-5QZAK377dy=WHoFSU4vu=tgHOqS5JQQ@mail.gmail.com
In response to buildfarm instance bichir stuck  (Robins Tharakan <tharakan@gmail.com>)
Responses Re: buildfarm instance bichir stuck  (Robins Tharakan <tharakan@gmail.com>)
Re: buildfarm instance bichir stuck  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
> Bichir's been stuck for the past month and is unable to run regression tests since
> 6a2a70a02018d6362f9841cc2f499cc45405e86b.

Hrmph.  That's "Use signalfd(2) for epoll latches."  I had a similar
report from an illumos user (but it was intermittent).  I have never
seen such a failure on Linux.  My first guess is that these two
systems that are doing Linux system call emulation have implemented
subtly different semantics, and something is going wrong like this: a
SIGUSR1 arrives to tell you some important news about a procsignal and
the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
SIGURG), but somehow that fails to wake up the epoll() you are
sleeping in which contains the signalfd that should receive the signal
and report it by being readable, due to some internal race.  Or
something like that.  But I haven't been able to verify that theory
because I don't have any of those computers.  If it is indeed
something like that and not a bug in my code, then I was thinking that
the main tool available to deal with it would be to set WAIT_USE_POLL
in the relevant template file, so that we don't use the combination of
epoll + signalfd on illumos, but then WSL1 throws a spanner in the
works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
from an Ubuntu package with a freaky kernel.  Hmm.
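
For anyone who wants to check the basic mechanism on one of those
emulated kernels, here is a minimal standalone sketch, not PostgreSQL
code.  It blocks SIGURG, registers a signalfd for it with epoll, sends
itself SIGURG with kill(), and then checks whether epoll_wait() reports
the signalfd as readable.  The function name self_signal_wakes_epoll()
and the timeout parameter are made up for illustration.  Note that it
sends the signal before going to sleep, so it only tests the simple
case and would not by itself reproduce a race with a signal that
arrives while already sleeping.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from PostgreSQL): returns 1 if a self-directed
 * SIGURG makes epoll_wait() report the signalfd as readable, 0 if the wait
 * times out without a wakeup, and -1 on a setup error.
 */
int
self_signal_wakes_epoll(int timeout_ms)
{
    sigset_t    mask;
    struct epoll_event ev = {0};
    struct epoll_event out;
    struct signalfd_siginfo info;
    int         sfd;
    int         epfd;
    int         n;
    int         result = -1;

    /* SIGURG must be blocked so it is delivered via the signalfd only. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (sfd < 0)
        return -1;

    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
    {
        close(sfd);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = sfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev) < 0)
        goto done;

    /* Stand-in for what the email describes as kill(MyProcPid, SIGURG). */
    if (kill(getpid(), SIGURG) < 0)
        goto done;

    /* If the emulation is correct, this returns immediately with 1 event. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    if (n == 1 &&
        read(sfd, &info, sizeof(info)) == (ssize_t) sizeof(info) &&
        info.ssi_signo == (unsigned int) SIGURG)
        result = 1;
    else if (n == 0)
        result = 0;             /* timed out: the wakeup was lost */

done:
    close(epfd);
    close(sfd);
    return result;
}
```

Calling self_signal_wakes_epoll(1000) from a small main() should return
1 on a real Linux kernel.  A return of 0 on WSL1 or illumos would point
at the emulation rather than at the latch code, though a 1 would not
rule out a race in the sleeping case.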

> It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I
> can see that it's been unable to even start regression tests after that commit went in.

Oh, well at least it's easily reproducible then, that's something!

> Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't
> really production use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed
> it after a long wait - see [1].
>
> Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not
> working may be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll.

That's actually the client.  I guess there is also a backend process
stuck somewhere in epoll_wait()?


