Re: dynamic background workers, round two - Mailing list pgsql-hackers

From:           Andres Freund
Subject:        Re: dynamic background workers, round two
Msg-id:         20130726093447.GH15081@alap2.anarazel.de
In response to: Re: dynamic background workers, round two (Robert Haas <robertmhaas@gmail.com>)
Responses:      Re: dynamic background workers, round two
List:           pgsql-hackers
On 2013-07-25 12:35:30 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > This seems like a sensible idea to me. But, in the context of dynamic
> > query, don't we also need the reverse infrastructure of notifying a
> > bgworker that the client, that requested it to be started, has died?
> > Ending up with a dozen bgworkers running because the normal backend
> > FATALed doesn't seem nice.
>
> I think that will be desirable for many, although not all, uses of
> background workers.  For example, you might want to be able to do
> something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
> and then exit your session and have the background worker continue to
> run.  Or maybe somebody writes a trigger that starts a background
> worker (if there's not one already working) and queues up work for the
> trigger, and the background worker continues processing the queue
> until it is empty, and then exits.
>
> But for parallel query specifically, yeah, we'll probably want the
> untimely demise of either the original backend or any of its workers
> to result in the death of all the workers and an abort of the parent's
> transaction.  However, there's no need for the postmaster to be
> involved in any of that.  For example, once the worker process is up
> and running and the parent has its PID, it can use an on_shmem_exit
> hook or a PG_ENSURE_ERROR_CLEANUP() block to send SIGTERM to any
> still-living workers.

It doesn't need to be the postmaster, but I think we need to provide
central infrastructure for that. I don't want this to end up being
redone poorly in multiple places.

I just wanted to mention it, it obviously doesn't need to be implemented
at the same time.

> >> +#define InvalidPid (-1)
> >> +
> >
> > That value strikes me as a rather bad idea. As a pid that represents
> > init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
> > rather not spread that outside async.c.
> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> And if the postgres process has permission to send signals to init,
> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
> root).  So I'm not very worried about it.

I don't think somebody mistakenly kill()ing somebody with -1 is all too
likely, but it's a valid value to pass. That's why I dislike using it as
a constant for an invalid value.

> I've got two functions that need distinguishable return values for the
> not-yet-started case and the already-dead case, neither of which can
> be real PIDs.  If we don't use 0 and -1, the alternative is presumably
> to complicate the calling signature with another out parameter.  That
> does not seem like a net improvement.

I actually think those cases would be better served by an extensible
error code mechanism.

> >> +/*
> >> + * Report the PID of a newly-launched background worker in shared memory.
> >> + *
> >> + * This function should only be called from the postmaster.
> >> + */
> >> +void
> >> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
> >> +{
> >> +	BackgroundWorkerSlot *slot;
> >> +
> >> +	Assert(rw->rw_shmem_slot < max_worker_processes);
> >> +	slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
> >> +	slot->pid = rw->rw_pid;
> >> +
> >> +	if (rw->rw_worker.bgw_notify_pid != 0)
> >> +		kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
> >> +}
> >
> > Any reason not to use SendProcSignal() properly here? Just that you
> > don't know the BackendId? It seems unclean to reuse SIGUSR1 without
> > using the infrastructure built for that.
>
> We're in the postmaster here.  The postmaster currently uses kill
> directly in all cases, rather than SendProcSignal(), and I'd rather
> not change that.  Any code that the postmaster calls has got to be
> safe against arbitrary shared memory corruption, and I'd rather keep
> that footprint as small as possible.
> For example, if we added bgw_backend_id, the postmaster would have to
> bounds check it to make sure it didn't seg fault.  The more of that
> kind of stuff we add, the more chances there are to destabilize the
> postmaster.  I'd like to keep it as minimal as possible.

I can see the point in the argument, but it still seems architecturally
grotty.

> >> +	for (;;)
> >> +	{
> >> +		pid = GetBackgroundWorkerPid(handle);
> >> +		if (pid != InvalidPid)
> >> +			break;
> >> +		rc = WaitLatch(&MyProc->procLatch,
> >> +					   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> >> +		if (rc & WL_POSTMASTER_DEATH)
> >> +			return InvalidPid;
> >
> > This basically ignores postmaster death. With unix_latch.c that
> > shouldn't be a problem because afaics its implementation correctly
> > should report an error in the next WaitLatch without waiting as well.
> > I am not sure about the win32 implementation, is that ok there as
> > well?
>
> I don't understand what you mean by this.

Why does it ignore postmaster death? The other callers that check for
WL_POSTMASTER_DEATH actually perform drastic measures like exit()ing
upon postmaster death. Here you just return. With the unix latch
implementation it seems to be guaranteed that the next WaitLatch() done
somewhere else will also return WL_POSTMASTER_DEATH. I am not sure
whether the windows implementation might actually "consume" the death.

> >> +	set_latch_on_sigusr1 = false;
> >
> > So, the set_latch_on_sigusr1 flag exists so MyProc->procLatch doesn't
> > get SetLatch'ed unless we're in WaitForBackgroundWorkerStartup()?
>
> Right.  Noah suggested to me that we might do this unconditionally,
> which would certainly be simpler and cleaner.  I dunno if it'd risk
> too many spurious wake-ups in any scenario, though.  I kind of doubt
> it, but I don't know.

If we were using the procsignal mechanism, I wouldn't worry about doing
it unconditionally, but signals sent for the other reasons SIGUSR1 is
used might actually come in relatively frequently.
> >> +/*
> >> + * When a backend asks to be notified about worker state changes, we
> >> + * set a flag in its backend entry.  The background worker machinery
> >> + * needs to know when such backends exit.
> >> + */
> >> +bool
> >> +PostmasterMarkPIDForWorkerNotify(int pid)
> >> +{
> >> +	dlist_iter	iter;
> >> +	Backend    *bp;
> >> +
> >> +	dlist_foreach(iter, &BackendList)
> >> +	{
> >> +		bp = dlist_container(Backend, elem, iter.cur);
> >> +		if (bp->pid == pid)
> >> +		{
> >> +			bp->bgworker_notify = true;
> >> +			return true;
> >> +		}
> >> +	}
> >> +	return false;
> >> +}
> >
> > Maybe we should have MyBackend?
>
> How would that help?  This code is run in the postmaster, so that we
> only iterate through workers and clear bgw_notify_pid when the backend
> that has just exited has at some point appeared in a bgw_notify_pid.

Thinko.

> > Btw, you seem to want to support this in bgworkers started by a
> > bgworker. That's not going to work without some changes if the
> > "intermediate" bgworker is one without a backend since those don't
> > use procsignal_sigusr1_handler.
>
> Right.  I think it's OK for now to limit it to cases where the
> intermediate bgworker has a backend.  If someone else finds that
> restriction unacceptable, they can fix it.

I don't have a problem with the restriction, but I'd like to see a check
against it. Maybe check for MyBackendId != InvalidBackendId in
RegisterDynamicBackgroundWorker()? That would also prevent starting
further bgworkers before BackgroundWorkerInitializeConnection() is done
in a connected bgworker, which seems to be a good thing.

Greetings,

Andres Freund

--
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services