Re: Windows buildfarm members vs. new async-notify isolation test - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Windows buildfarm members vs. new async-notify isolation test
Msg-id 4412.1575748586@sss.pgh.pa.us
In response to Re: Windows buildfarm members vs. new async-notify isolation test  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Windows buildfarm members vs. new async-notify isolation test  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Windows buildfarm members vs. new async-notify isolation test  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Sat, Dec 7, 2019 at 5:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> A possible theory as to what's happening is that the kernel scheduler
>>> is discriminating against listener2's signal management thread(s)
>>> and not running them until everything else goes idle for a moment.

>> If we have to believe that theory then why the other similar test is
>> not showing the problem.

> There are fewer processes involved in that case, so I don't think
> it disproves the theory that this is a scheduler glitch.

So, just idly looking at the code in src/backend/port/win32/signal.c
and src/port/kill.c, I have to wonder why we have this baroque-looking
design of using *two* signal management threads.  And, if I'm
reading it right, we create an entire new pipe object and an entire
new instance of the second thread for each incoming signal.  Plus, the
signal senders use CallNamedPipe (hence, underneath, TransactNamedPipe)
which means they in effect wait for the recipient's signal-handling
thread to ack receipt of the signal.  Maybe there's a good reason for
all this but it sure seems like a lot of wasted cycles from here.

I have to wonder why we don't have a single named pipe that lasts as
long as the recipient process does, and a signal sender just writes
one byte to it, and considers the signal delivered if it is able to
do that.  The "message" semantics seem like overkill for that.
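To make the single-pipe idea concrete, here is a minimal sketch. It is not PostgreSQL code and not the Windows API: a POSIX pipe() stands in for the Windows named pipe so the example can run anywhere, and SignalChannel, channel_open, channel_send and channel_drain are hypothetical names invented for illustration. The 1..31 signal range and the pending bitmask are assumptions as well.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical sketch: one pipe per recipient, created once and kept for
 * the recipient's whole lifetime.  A POSIX pipe stands in for the Windows
 * named pipe discussed above.
 */
typedef struct SignalChannel
{
    int      read_fd;   /* recipient's end */
    int      write_fd;  /* senders' end */
    uint32_t pending;   /* bitmask of signals received but not yet handled */
} SignalChannel;

/* Create the long-lived pipe.  Returns 0 on success, -1 on failure. */
int
channel_open(SignalChannel *ch)
{
    int fds[2];
    int flags;

    assert(ch != NULL);
    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking read end, so draining an empty pipe never stalls. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    ch->read_fd = fds[0];
    ch->write_fd = fds[1];
    ch->pending = 0;
    return 0;
}

/*
 * Sender side: write one byte and treat the signal as delivered if the
 * write succeeds.  There is no reply to wait for.
 */
int
channel_send(const SignalChannel *ch, int signum)
{
    uint8_t b;
    ssize_t n;

    assert(ch != NULL);
    if (signum <= 0 || signum >= 32)
    {
        errno = EINVAL;
        return -1;
    }
    b = (uint8_t) signum;
    do
    {
        n = write(ch->write_fd, &b, 1);
    } while (n < 0 && errno == EINTR);
    return (n == 1) ? 0 : -1;
}

/*
 * Recipient side: read every queued byte and record it in the pending
 * mask.  Returns the number of bytes consumed.
 */
int
channel_drain(SignalChannel *ch)
{
    uint8_t buf[64];
    int     count = 0;

    assert(ch != NULL);
    for (;;)
    {
        ssize_t n = read(ch->read_fd, buf, sizeof(buf));

        if (n > 0)
        {
            for (ssize_t i = 0; i < n; i++)
            {
                if (buf[i] > 0 && buf[i] < 32)
                    ch->pending |= UINT32_C(1) << buf[i];
            }
            count += (int) n;
        }
        else if (n < 0 && errno == EINTR)
            continue;
        else
            break;      /* pipe empty (EAGAIN), EOF, or a real error */
    }
    return count;
}
```

The point of the sketch is only the shape of the protocol: the sender never blocks waiting for an acknowledgment, and the recipient needs no per-signal pipe instance or thread. Whether Windows named pipes can be used this way without other drawbacks is exactly the open question.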

I dug around in the contemporaneous archives and could only find
https://www.postgresql.org/message-id/303E00EBDD07B943924382E153890E5434AA47%40cuthbert.rcsinc.local
which describes the existing approach but fails to explain why we
should do it like that.

This might or might not have much to do with the immediate problem,
but I can't help wondering if there's some race-condition-ish behavior
in there that's contributing to what we're seeing.  We already had to
fix a couple of race conditions from doing it like this, cf commits
2e371183e, 04a4413c2, f27a4696f.  Perhaps 0ea1f2a3a is relevant
as well.

            regards, tom lane
