On 4/7/21 2:16 AM, Thomas Munro wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan@gmail.com> wrote:
>> Bichir's been stuck for the past month and is unable to run regression tests since
>> 6a2a70a02018d6362f9841cc2f499cc45405e86b.
> Hrmph. That's "Use signalfd(2) for epoll latches." I had a similar
> report from an illumos user (but it was intermittent). I have never
> seen such a failure on Linux. My first guess is that these two
> systems that are doing Linux system call emulation have implemented
> subtly different semantics, and something is going wrong like this: a
> SIGUSR1 arrives to tell you some important news about a procsignal and
> the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
> SIGURG), but somehow that fails to wake up the epoll() you are
> sleeping in which contains the signalfd that should receive the signal
> and report it by being readable, due to some internal race. Or
> something like that. But I haven't been able to verify that theory
> because I don't have any of those computers. If it is indeed
> something like that and not a bug in my code, then I was thinking that
> the main tool available to deal with it would be to set WAIT_USE_POLL
> in the relevant template file, so that we don't use the combination of
> epoll + signalfd on illumos, but then WSL1 throws a spanner in the
> works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> from an Ubuntu package with a freaky kernel. Hmm.
>
To test this, the OP could just add

    CPPFLAGS => '-DWAIT_USE_POLL',

to the config_env stanza of his animal's config.
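
Separately, the theory quoted above could be checked outside PostgreSQL with a small standalone program. The following is only a rough sketch of the pattern Thomas describes, not PostgreSQL's latch code: a SIGUSR1 handler sends SIGURG to its own process, and the main flow sleeps in epoll_wait() on a signalfd that should become readable. The function name selfwake_via_signalfd, the 50 ms delay, and the LATCH_DEMO_MAIN macro are made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a procsignal handler that ends in SetLatch(MyLatch). */
static void
sigusr1_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    kill(getpid(), SIGURG);     /* self-wakeup; kill() is async-signal-safe */
    errno = save_errno;
}

/*
 * Returns 1 if epoll_wait() was woken by SIGURG arriving on the signalfd,
 * 0 if it timed out (the suspected failure mode), -1 on a setup error.
 */
int
selfwake_via_signalfd(int timeout_ms)
{
    sigset_t    mask;
    struct sigaction sa;
    struct epoll_event ev, out;
    struct signalfd_siginfo info;
    int         sfd, epfd, rc, result = -1;
    pid_t       child;

    /* Block SIGURG so it stays pending and is consumed only via the fd. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (sfd < 0)
        return -1;
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
    {
        close(sfd);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = sfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev) < 0)
        goto done;

    sa.sa_handler = sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        goto done;

    /* The child plays "another backend": it signals us after a short delay. */
    child = fork();
    if (child < 0)
        goto done;
    if (child == 0)
    {
        usleep(50 * 1000);
        kill(getppid(), SIGUSR1);
        _exit(0);
    }

    /* The handler interrupts the wait with EINTR; retry (timeout restarts). */
    do
        rc = epoll_wait(epfd, &out, 1, timeout_ms);
    while (rc < 0 && errno == EINTR);

    if (rc == 1 && read(sfd, &info, sizeof(info)) == sizeof(info) &&
        info.ssi_signo == SIGURG)
        result = 1;
    else if (rc == 0)
        result = 0;

    waitpid(child, NULL, 0);
done:
    close(epfd);
    close(sfd);
    return result;
}

#ifdef LATCH_DEMO_MAIN
int
main(void)
{
    printf("result: %d\n", selfwake_via_signalfd(2000));
    return 0;
}
#endif
```

Built with something like cc -DLATCH_DEMO_MAIN and run in a loop, a result of 0 on an emulated kernel would point at the emulation rather than at the latch code, though a single pass proves little if the failure is intermittent.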
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com