Re: Windows buildfarm members vs. new async-notify isolation test - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Windows buildfarm members vs. new async-notify isolation test
Date	December 7, 2019 19:56:26
Msg-id	4412.1575748586@sss.pgh.pa.us
In response to	Re: Windows buildfarm members vs. new async-notify isolation test (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Windows buildfarm members vs. new async-notify isolation test Re: Windows buildfarm members vs. new async-notify isolation test
List	pgsql-hackers

I wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Sat, Dec 7, 2019 at 5:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> A possible theory as to what's happening is that the kernel scheduler
>>> is discriminating against listener2's signal management thread(s)
>>> and not running them until everything else goes idle for a moment.

>> If we have to believe that theory, then why is the other similar test
>> not showing the problem?

> There are fewer processes involved in that case, so I don't think
> it disproves the theory that this is a scheduler glitch.

So, just idly looking at the code in src/backend/port/win32/signal.c
and src/port/kill.c, I have to wonder why we have this baroque-looking
design of using *two* signal management threads.  And, if I'm
reading it right, we create an entire new pipe object and an entire
new instance of the second thread for each incoming signal.  Plus, the
signal senders use CallNamedPipe (hence, underneath, TransactNamedPipe)
which means they in effect wait for the recipient's signal-handling
thread to ack receipt of the signal.  Maybe there's a good reason for
all this but it sure seems like a lot of wasted cycles from here.

I have to wonder why we don't have a single named pipe that lasts as
long as the recipient process does, and a signal sender just writes
one byte to it, and considers the signal delivered if it is able to
do that.  The "message" semantics seem like overkill for that.
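
To make the one-byte scheme concrete, here is a minimal sketch of it. It is
an illustration under stated assumptions, not the code in signal.c or kill.c:
it uses a POSIX pipe() as a stand-in for a Windows named pipe so that it
compiles anywhere, and the names SignalPipe, signal_pipe_create,
signal_pipe_send and signal_pipe_drain are hypothetical.  The pipe is created
once and lives as long as the recipient; the sender writes a single byte and
treats a successful write as delivery, with no ack from the recipient.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical type: one long-lived pipe per recipient process. */
typedef struct SignalPipe
{
    int read_fd;    /* held by the recipient for its whole lifetime */
    int write_fd;   /* stand-in for what a sender would open by name */
} SignalPipe;

/* Recipient side: create the pipe once, at process start. */
int
signal_pipe_create(SignalPipe *sp)
{
    int fds[2];
    int flags;

    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking reads let the drain loop stop when the pipe is empty. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    sp->read_fd = fds[0];
    sp->write_fd = fds[1];
    return 0;
}

/*
 * Sender side: write one byte holding the signal number.  The signal
 * counts as delivered as soon as the write succeeds; nothing is read back.
 */
int
signal_pipe_send(const SignalPipe *sp, int signum)
{
    unsigned char b = (unsigned char) signum;
    ssize_t n;

    if (signum <= 0 || signum >= 32)
        return -1;

    do
    {
        n = write(sp->write_fd, &b, 1);
    } while (n < 0 && errno == EINTR);

    return (n == 1) ? 0 : -1;
}

/* Recipient side: fold every queued byte into a pending-signal bitmask. */
uint32_t
signal_pipe_drain(const SignalPipe *sp)
{
    uint32_t pending = 0;
    unsigned char buf[64];

    for (;;)
    {
        ssize_t n = read(sp->read_fd, buf, sizeof(buf));

        if (n > 0)
        {
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] > 0 && buf[i] < 32)
                    pending |= (uint32_t) 1 << buf[i];
        }
        else if (n < 0 && errno == EINTR)
            continue;
        else
            break;              /* empty (EAGAIN), EOF, or a real error */
    }
    return pending;
}
```

In this sketch, repeated sends of the same signal before a drain collapse
into one pending bit, which matches ordinary non-queued signal semantics.
Whether a real Windows implementation could get the same effect from a
byte-mode named pipe is exactly the open question here.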

I dug around in the contemporaneous archives and could only find
https://www.postgresql.org/message-id/303E00EBDD07B943924382E153890E5434AA47%40cuthbert.rcsinc.local
which describes the existing approach but fails to explain why we
should do it like that.

This might or might not have much to do with the immediate problem,
but I can't help wondering if there's some race-condition-ish behavior
in there that's contributing to what we're seeing.  We already had to
fix a couple of race conditions from doing it like this, cf commits
2e371183e, 04a4413c2, f27a4696f.  Perhaps 0ea1f2a3a is relevant
as well.

            regards, tom lane
