Re: BF animal malleefowl reported an failure in 001_password.pl - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: BF animal malleefowl reported an failure in 001_password.pl
Msg-id: CA+hUKG+_hTYQRD9w-01-dgQO+cw6zLARGRtr1ziXTgbX=RxEjw@mail.gmail.com
In response to: Re: BF animal malleefowl reported an failure in 001_password.pl (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: BF animal malleefowl reported an failure in 001_password.pl
List: pgsql-hackers
On Sat, Jan 14, 2023 at 8:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com> writes:
> > I noticed one BF failure[1] when monitoring the BF for some other commit.
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=malleefowl&dt=2023-01-13%2009%3A54%3A51
> > ...
> > So it seems the connection happens before pg_ident.conf is actually reloaded ?
> > Not sure if we need to do something make sure the reload happen, because it's
> > looks like very rare failure which hasn't happen in last 90 days.
>
> That does look like a race condition between config reloading and
> new-backend launching.  However, I can't help being suspicious about
> the fact that we haven't seen this symptom before and now here it is
> barely a day after 7389aad63 (Use WaitEventSet API for postmaster's
> event loop).  It seems fairly plausible that that did something that
> causes the postmaster to preferentially process connection-accept ahead
> of SIGHUP.  I took a quick look through the code and did not see a
> smoking gun, but I'm way too tired to be sure I didn't miss something.

Yeah, I guess the scheduling might go something like this:

1. kill() runs and sets SIGHUP as pending in the postmaster process; the postmaster is now runnable but not yet running.
2. Meanwhile connect() starts.
3. The postmaster starts running, sees the pending signal and immediately runs the handler, which previously did the actual reload (before doing anything else) but now just sets our reload-pending flag and does kill(self, SIGURG), and then returns, so epoll_wait() is unblocked.
4. epoll_wait() returns, reporting two events: signalfd ready to read (or self-pipe, or EVFILT_SIGNAL), AND connection ready to accept.
5. The connection happens to be reported first, so we accept/fork the connection and reload later.
I think epoll will report fds in the order they became ready (undocumented, but apparently well known that it's a kind of FIFO linked list), but that itself is indeterminate, as steps 2 and 3 race. It looks like malleefowl is slow/overloaded (often ~3 hours to run the tests, sometimes ~half an hour and sometimes ~4 hours). Now that I think about it, it's surprising I haven't seen this before, though, implying that 3 always beats 2, so maybe I'm missing something else...

But if that's the general idea, I suppose there would be two ways to give higher priority to signals/latches that arrive in the same set of events: (1) scan the events array twice (for latches, then connections), or (2) check our pending flags every time through the output events loop, at the top, even for connection events (i.e. just move some lines up a bit). Probably 2 is the way to go (see also the discussion about whether we should do that anyway, to give priority to a shutdown request if it arrives while the server is looping over 64 server sockets that are all ready to accept).