Re: Windows buildfarm members vs. new async-notify isolation test - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Windows buildfarm members vs. new async-notify isolation test
Msg-id CAA4eK1KpRMRJG0krbiL8sUA9wZTVwvoHejEkJK2sVH2idG-rSQ@mail.gmail.com
In response to Re: Windows buildfarm members vs. new async-notify isolation test  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Windows buildfarm members vs. new async-notify isolation test  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Sun, Dec 8, 2019 at 1:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So, just idly looking at the code in src/backend/port/win32/signal.c
> and src/port/kill.c, I have to wonder why we have this baroque-looking
> design of using *two* signal management threads.  And, if I'm
> reading it right, we create an entire new pipe object and an entire
> new instance of the second thread for each incoming signal.  Plus, the
> signal senders use CallNamedPipe (hence, underneath, TransactNamedPipe)
> which means they in effect wait for the recipient's signal-handling
> thread to ack receipt of the signal.  Maybe there's a good reason for
> all this but it sure seems like a lot of wasted cycles from here.
>
> I have to wonder why we don't have a single named pipe that lasts as
> long as the recipient process does, and a signal sender just writes
> one byte to it, and considers the signal delivered if it is able to
> do that.  The "message" semantics seem like overkill for that.
>
> I dug around in the contemporaneous archives and could only find
> https://www.postgresql.org/message-id/303E00EBDD07B943924382E153890E5434AA47%40cuthbert.rcsinc.local
> which describes the existing approach but fails to explain why we
> should do it like that.
>
> This might or might not have much to do with the immediate problem,
> but I can't help wondering if there's some race-condition-ish behavior
> in there that's contributing to what we're seeing.
>

On the receiving side, the work we do after the 'notify' is finished
(or before CallNamedPipe gets control back) is as follows:

pg_signal_dispatch_thread()
{
    ..
    FlushFileBuffers(pipe);
    DisconnectNamedPipe(pipe);
    CloseHandle(pipe);

    pg_queue_signal(sigNum);
}

Most of these are system calls, which makes me think that they might
be slow enough on some Windows versions to lead to such a race
condition.

Now, coming back to the other theory, that the scheduler is not able
to schedule these signal management threads: I think if that were the
case, the notify could not have finished, because CallNamedPipe
returns only when the dispatch thread writes back to the pipe.
However, if the scheduler kicks this thread out right after it writes
back to the pipe, it is possible that we see such behavior; I am not
sure we can do anything about that.
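
To make the window I am describing concrete, here is a small
stand-alone model of the ordering. It is only a sketch, not PostgreSQL
code: it makes no Windows calls, and the names Model, DispatchStep,
dispatch_step, in_race_window and count_window_steps are hypothetical.
It assumes, as described above, that the sender's CallNamedPipe can
return as soon as the dispatch thread writes back to the pipe, and
that pg_queue_signal runs only after the three cleanup calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the receiving side; one enum value per step. */
typedef enum
{
    STEP_WRITE_ACK,     /* write back to pipe; sender may now return */
    STEP_FLUSH,         /* stands in for FlushFileBuffers(pipe) */
    STEP_DISCONNECT,    /* stands in for DisconnectNamedPipe(pipe) */
    STEP_CLOSE,         /* stands in for CloseHandle(pipe) */
    STEP_QUEUE_SIGNAL,  /* stands in for pg_queue_signal(sigNum) */
    STEP_DONE
} DispatchStep;

typedef struct
{
    DispatchStep next;      /* next step the dispatch thread will run */
    bool sender_returned;   /* sender believes the signal is delivered */
    bool signal_queued;     /* receiver has actually queued the signal */
} Model;

static void
model_init(Model *m)
{
    m->next = STEP_WRITE_ACK;
    m->sender_returned = false;
    m->signal_queued = false;
}

/* Run exactly one step of the modelled dispatch thread. */
static void
dispatch_step(Model *m)
{
    assert(m->next != STEP_DONE);
    if (m->next == STEP_WRITE_ACK)
        m->sender_returned = true;
    else if (m->next == STEP_QUEUE_SIGNAL)
        m->signal_queued = true;
    m->next = (DispatchStep) (m->next + 1);
}

/* True while the sender has returned but the signal is not yet queued. */
static bool
in_race_window(const Model *m)
{
    return m->sender_returned && !m->signal_queued;
}

/* Count the points at which a preempted dispatch thread leaves the window open. */
static int
count_window_steps(void)
{
    Model m;
    int n = 0;

    model_init(&m);
    while (m.next != STEP_DONE)
    {
        dispatch_step(&m);
        if (in_race_window(&m))
            n++;
    }
    return n;
}
```

In this model, if the thread is descheduled at any point between the
write-back and the queueing step, the sender has already moved on
while the signal is still not visible to the backend. The model says
nothing about how long each call takes on a real system.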

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


