Thread: Missed check for too-many-children in bgworker spawning
Over in [1] we have a report of a postmaster shutdown that seems to have occurred because some client logic was overaggressively spawning connection requests, causing the postmaster's child-process arrays to be temporarily full, and then some parallel query tried to launch a new bgworker process. The postmaster's bgworker-spawning logic lacks any check for the arrays being full, so when AssignPostmasterChildSlot failed to find a free slot, kaboom!

The attached proposed patch fixes this by making bgworker spawning include a canAcceptConnections() test. That's perhaps overkill, since we really just need to check the CountChildren() total; but I judged that going through that function and having it decide what to test or not test was a better design than duplicating the CountChildren() test elsewhere.

I'd first imagined also replacing the one-size-fits-all check

    if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
        result = CAC_TOOMANY;

with something like

    switch (backend_type)
    {
        case BACKEND_TYPE_NORMAL:
            if (CountChildren(backend_type) >= 2 * MaxConnections)
                result = CAC_TOOMANY;
            break;
        case BACKEND_TYPE_AUTOVAC:
            if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
                result = CAC_TOOMANY;
            break;
        ...
    }

so as to subdivide the pool of child-process slots and prevent client requests from consuming slots meant for background processes. But on closer examination that's not really worth the trouble, because this pool is already considerably bigger than MaxBackends; so even if we prevented a failure here we could still have bgworker startup failure later on when it tries to acquire a PGPROC.

Barring objections, I'll apply and back-patch this soon.

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CADCf-WNZk_9680Q0YjfBzuiR0Oe8LzvDs2Ts3_tq6Tv1e8raQQ%40mail.gmail.com

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eb9e022..7947c96 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,7 +409,7 @@ static void SendNegotiateProtocolVersion(List *unrecognized_protocol_options);
 static void processCancelRequest(Port *port, void *pkt);
 static int	initMasks(fd_set *rmask);
 static void report_fork_failure_to_client(Port *port, int errnum);
-static CAC_state canAcceptConnections(void);
+static CAC_state canAcceptConnections(int backend_type);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
@@ -2401,13 +2401,15 @@ processCancelRequest(Port *port, void *pkt)
  * canAcceptConnections --- check to see if database state allows connections.
  */
 static CAC_state
-canAcceptConnections(void)
+canAcceptConnections(int backend_type)
 {
 	CAC_state	result = CAC_OK;
 
 	/*
 	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.
+	 * state.  We treat autovac workers the same as user backends for this
+	 * purpose.  However, bgworkers are excluded from this test; we expect
+	 * bgworker_should_start_now() decided whether the DB state allows them.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -2415,7 +2417,8 @@ canAcceptConnections(void)
 	 * that neither CAC_OK nor CAC_WAITBACKUP can safely be returned until we
 	 * have checked for too many children.
 	 */
-	if (pmState != PM_RUN)
+	if (pmState != PM_RUN &&
+		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
@@ -2435,9 +2438,9 @@ canAcceptConnections(void)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections than we can have backends here because some
+	 * We allow more connections here than we can have backends because some
 	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed. The exact
+	 * backend might exit before the auth cycle is completed.  The exact
 	 * MaxBackends limit is enforced when a new backend tries to join the
 	 * shared-inval backend array.
 	 *
@@ -4114,7 +4117,7 @@ BackendStartup(Port *port)
 	bn->cancel_key = MyCancelKey;
 
 	/* Pass down canAcceptConnections state */
-	port->canAcceptConnections = canAcceptConnections();
+	port->canAcceptConnections = canAcceptConnections(BACKEND_TYPE_NORMAL);
 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
 					port->canAcceptConnections != CAC_WAITBACKUP);
 
@@ -5486,7 +5489,7 @@ StartAutovacuumWorker(void)
 	 * we have to check to avoid race-condition problems during DB state
 	 * changes.
 	 */
-	if (canAcceptConnections() == CAC_OK)
+	if (canAcceptConnections(BACKEND_TYPE_AUTOVAC) == CAC_OK)
 	{
 		/*
 		 * Compute the cancel key that will be assigned to this session. We
@@ -5863,6 +5866,19 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
 	Backend    *bn;
 
 	/*
+	 * Check that database state allows another connection.  Currently the
+	 * only possible failure is CAC_TOOMANY, so we just log an error message
+	 * based on that rather than checking the error code precisely.
+	 */
+	if (canAcceptConnections(BACKEND_TYPE_BGWORKER) != CAC_OK)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("no slot available for new worker process")));
+		return false;
+	}
+
+	/*
 	 * Compute the cancel key that will be assigned to this session. We
 	 * probably don't need cancel keys for background workers, but we'd better
 	 * have something random in the field to prevent unfriendly people from
On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Over in [1] we have a report of a postmaster shutdown that seems to have occurred because some client logic was overaggressively spawning connection requests, causing the postmaster's child-process arrays to be temporarily full, and then some parallel query tried to launch a new bgworker process. The postmaster's bgworker-spawning logic lacks any check for the arrays being full, so when AssignPostmasterChildSlot failed to find a free slot, kaboom!
>
> The attached proposed patch fixes this by making bgworker spawning include a canAcceptConnections() test. That's perhaps overkill, since we really just need to check the CountChildren() total; but I judged that going through that function and having it decide what to test or not test was a better design than duplicating the CountChildren() test elsewhere.
>
> I'd first imagined also replacing the one-size-fits-all check
>
>     if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
>         result = CAC_TOOMANY;
>
> with something like
>
>     switch (backend_type)
>     {
>         case BACKEND_TYPE_NORMAL:
>             if (CountChildren(backend_type) >= 2 * MaxConnections)
>                 result = CAC_TOOMANY;
>             break;
>         case BACKEND_TYPE_AUTOVAC:
>             if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
>                 result = CAC_TOOMANY;
>             break;
>         ...
>     }
>
> so as to subdivide the pool of child-process slots and prevent client requests from consuming slots meant for background processes. But on closer examination that's not really worth the trouble, because this pool is already considerably bigger than MaxBackends; so even if we prevented a failure here we could still have bgworker startup failure later on when it tries to acquire a PGPROC.
>
> Barring objections, I'll apply and back-patch this soon.

I think it used to work this way -- not sure if it was ever committed this way, but it at least did during development -- and we ripped it out because somebody (Magnus?) pointed out that if you got close to the connection limit, you could see parallel queries start failing, and that would suck. Falling back to non-parallel seems more OK in that situation than actually failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The attached proposed patch fixes this by making bgworker spawning >> include a canAcceptConnections() test. > I think it used to work this way -- not sure if it was ever committed > this way, but it at least did during development -- and we ripped it > out because somebody (Magnus?) pointed out that if you got close to > the connection limit, you could see parallel queries start failing, > and that would suck. Falling back to non-parallel seems more OK in > that situation than actually failing. I'm not following your point? Whatever you might think the appropriate response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not it. Moreover, we have to --- and already do, I trust --- deal with other resource-exhaustion errors in exactly the same code path, notably fork(2) failure which we simply can't predict or prevent. Doesn't the parallel query logic already deal sanely with failure to obtain as many workers as it wanted? regards, tom lane
On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not following your point? Whatever you might think the appropriate > response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not > it. Moreover, we have to --- and already do, I trust --- deal with > other resource-exhaustion errors in exactly the same code path, notably > fork(2) failure which we simply can't predict or prevent. Doesn't the > parallel query logic already deal sanely with failure to obtain as many > workers as it wanted? If we fail to obtain workers because there are not adequate workers slots available, parallel query deals with that smoothly. But, once we have a slot, any further failure will trigger the parallel query to ERROR out. For the case where we get a slot but can't start the worker process, see WaitForParallelWorkersToFinish and/or WaitForParallelWorkersToAttach and comments therein. Once we're attached, any error messages thrown by the worker are propagated back to the master; see HandleParallelMessages and pq_redirect_to_shm_mq. Now you could argue that the master ought to selectively ignore certain kinds of errors and just continue on, while rethrowing others, say based on the errcode(). Such design ideas have been roundly panned in other contexts, though, so I'm not sure it would be a great idea to do it here either. But in any case, it's not how the current system behaves, or was designed to behave. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
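To make the contrast concrete: the graceful path Robert describes happens at registration time, inside the worker-launch loop in parallel.c. Roughly, and from memory rather than the exact source (a simplified sketch; worker setup and error-queue bookkeeping omitted):

	/* Simplified sketch of the registration loop in LaunchParallelWorkers() */
	for (i = 0; i < pcxt->nworkers; ++i)
	{
		if (RegisterDynamicBackgroundWorker(&worker,
											&pcxt->worker[i].bgwhandle))
		{
			/* Got a bgworker slot; the worker might still fail to fork later. */
			pcxt->nworkers_launched++;
		}
		else
		{
			/* No slot available: silently plan on running with fewer workers. */
			pcxt->worker[i].bgwhandle = NULL;
			pcxt->worker[i].error_mqh = NULL;
		}
	}

A fork() failure, by contrast, only becomes visible later, in WaitForParallelWorkersToFinish()/WaitForParallelWorkersToAttach(), after the slot has already been counted in nworkers_launched.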
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... Moreover, we have to --- and already do, I trust --- deal with >> other resource-exhaustion errors in exactly the same code path, notably >> fork(2) failure which we simply can't predict or prevent. Doesn't the >> parallel query logic already deal sanely with failure to obtain as many >> workers as it wanted? > If we fail to obtain workers because there are not adequate workers > slots available, parallel query deals with that smoothly. But, once > we have a slot, any further failure will trigger the parallel query to > ERROR out. Well, that means we have a not-very-stable system then. We could improve on matters so far as the postmaster's child-process arrays are concerned, by defining separate slot "pools" for the different types of child processes. But I don't see much point if the code is not prepared to recover from a fork() failure --- and if it is, that would a fortiori deal with out-of-child-slots as well. regards, tom lane
On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, that means we have a not-very-stable system then. > > We could improve on matters so far as the postmaster's child-process > arrays are concerned, by defining separate slot "pools" for the different > types of child processes. But I don't see much point if the code is > not prepared to recover from a fork() failure --- and if it is, that > would a fortiori deal with out-of-child-slots as well. I would say rather that if fork() is failing on your system, you have a not very stable system. The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well. Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate. Anyway, here's some previous discussion on this topic for your consideration: https://www.postgresql.org/message-id/flat/CAKJS1f_6H2Gh3QyORyRP%2BG3YB3gZiNms_8QdtO5gvitfY5N9ig%40mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could improve on matters so far as the postmaster's child-process arrays are concerned, by defining separate slot "pools" for the different types of child processes. But I don't see much point if the code is not prepared to recover from a fork() failure --- and if it is, that would a fortiori deal with out-of-child-slots as well.

> I would say rather that if fork() is failing on your system, you have a not very stable system. The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well.

True, it's not a situation you especially want to be in. However, I've lost count of the number of times that I've heard someone talk about how their system was overstressed to the point that everything else was failing, but Postgres kept chugging along. That's a good reputation to have and we shouldn't just walk away from it.

> Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate.

I find that argument to be utter bunkum. The parallel query code is *already* designed to silently degrade performance when its primary resource limit (shared bgworker slots) is exhausted. How can it be all right to do that but not all right to cope with fork failure similarly? If we think running up against the kernel limits is a case that we can roll over and die on, why don't we rip out the virtual-FD stuff in fd.c?

As for "might be buggy", if we ripped out every part of Postgres that's under-tested, I'm afraid there might not be much left. In any case, a sane design for this would make as much as possible of the code handle "out of shared bgworker slots" just the same as resource failures later on, so that there wouldn't be that big a gap in coverage.

Having said all that, I made a patch that causes the postmaster to reserve separate child-process-array slots for autovac workers and bgworkers, as per attached, so that excessive connection requests can't DOS those subsystems. But I'm not sure that it's worth the complication; it wouldn't be necessary if the parallel query launch code were more robust.

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 85f15a5..8c66578 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2441,17 +2441,43 @@ canAcceptConnections(int backend_type)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections here than we can have backends because some
-	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed.  The exact
-	 * MaxBackends limit is enforced when a new backend tries to join the
-	 * shared-inval backend array.
+	 * We allow more connections (including walsenders) here than we can have
+	 * backends because some might still be authenticating; they might fail
+	 * auth, or some existing backend might exit before the auth cycle is
+	 * completed.  The exact MaxBackends limit is enforced when a new backend
+	 * tries to join the shared-inval backend array.
 	 *
-	 * The limit here must match the sizes of the per-child-process arrays;
-	 * see comments for MaxLivePostmasterChildren().
+	 * On the other hand, there's no need to allow extra autovac workers or
+	 * bgworkers, because the requesting logic for those should never request
+	 * more processes than the corresponding limit.
+	 *
+	 * This logic must agree with MaxLivePostmasterChildren(), which sets the
+	 * total size of the per-child-process arrays.
+	 *
+	 * Note: currently, we can't see backend_type == BACKEND_TYPE_WALSND here,
+	 * but it seems less weird to include it in the switch than not.  We must
+	 * treat normal backends and walsenders alike for this purpose because one
+	 * type can metamorphose into the other later.
 	 */
-	if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
-		result = CAC_TOOMANY;
+	switch (backend_type)
+	{
+		case BACKEND_TYPE_NORMAL:
+		case BACKEND_TYPE_WALSND:
+			if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND) >=
+				2 * (MaxConnections + max_wal_senders))
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_AUTOVAC:
+			if (CountChildren(BACKEND_TYPE_AUTOVAC) >= autovacuum_max_workers)
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_BGWORKER:
+			if (CountChildren(BACKEND_TYPE_BGWORKER) >= max_worker_processes)
+				result = CAC_TOOMANY;
+			break;
+		default:
+			elog(ERROR, "unexpected backend_type: %d", backend_type);
+	}
 
 	return result;
 }
 
@@ -5334,8 +5360,18 @@ static int
 CountChildren(int target)
 {
 	dlist_iter	iter;
+	bool		check_walsender;
 	int			cnt = 0;
 
+	/*
+	 * In common cases we don't need to distinguish normal from walsender
+	 * children, so we can save a few cycles and shared-memory touches by not
+	 * checking for walsenders if not.
+	 */
+	check_walsender =
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != 0 &&
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND);
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend    *bp = dlist_container(Backend, elem, iter.cur);
@@ -5343,11 +5379,7 @@ CountChildren(int target)
 		if (bp->dead_end)
 			continue;
 
-		/*
-		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
-		 * it first and avoid touching shared memory for every child.
-		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (check_walsender)
 		{
 			/*
 			 * Assign bkend_type for any recently announced WAL Sender
 			 *
@@ -5356,11 +5388,11 @@ CountChildren(int target)
 			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
 				IsPostmasterChildWalSender(bp->child_slot))
 				bp->bkend_type = BACKEND_TYPE_WALSND;
-
-			if (!(target & bp->bkend_type))
-				continue;
 		}
 
+		if (!(target & bp->bkend_type))
+			continue;
+
 		cnt++;
 	}
 	return cnt;
 }
@@ -5626,15 +5658,31 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname)
  * (the PMChildFlags array, and if EXEC_BACKEND the ShmemBackendArray).
  * These arrays include regular backends, autovac workers, walsenders
  * and background workers, but not special children nor dead_end children.
- * This allows the arrays to have a fixed maximum size, to wit the same
- * too-many-children limit enforced by canAcceptConnections().  The exact value
- * isn't too critical as long as it's more than MaxBackends.
+ * This allows the arrays to have a fixed maximum size.  The total amount
+ * of space we reserve here mustn't be less than the number of child processes
+ * that canAcceptConnections() will allow.
 */
 int
 MaxLivePostmasterChildren(void)
 {
-	return 2 * (MaxConnections + autovacuum_max_workers + 1 +
-				max_wal_senders + max_worker_processes);
+	int			nchildren = 0;
+
+	/*
+	 * For children that are launched from asynchronous connection requests,
+	 * allow some extra space in the arrays so that we don't deny a connection
+	 * that would have succeeded by the time it is ready to join the shared-
+	 * memory data structures.  The 2X multiplier is arbitrary, and probably
+	 * overkill, but the amount of space wasted is minimal.
+	 */
+	nchildren += 2 * (MaxConnections + max_wal_senders);
+
+	/* Allow for the autovac launcher and its workers */
+	nchildren += 1 + autovacuum_max_workers;
+
+	/* Allow for bgworkers */
+	nchildren += max_worker_processes;
+
+	return nchildren;
 }
 
 /*
On 2019-Oct-09, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> We could improve on matters so far as the postmaster's child-process > >> arrays are concerned, by defining separate slot "pools" for the different > >> types of child processes. But I don't see much point if the code is > >> not prepared to recover from a fork() failure --- and if it is, that > >> would a fortiori deal with out-of-child-slots as well. > > > I would say rather that if fork() is failing on your system, you have > > a not very stable system. The fact that parallel query is going to > > fail is sad, but not as sad as the fact that connecting to the > > database is also going to fail, and that logging into the system to > > try to fix the problem may well fail as well. > > True, it's not a situation you especially want to be in. However, > I've lost count of the number of times that I've heard someone talk > about how their system was overstressed to the point that everything > else was failing, but Postgres kept chugging along. That's a good > reputation to have and we shouldn't just walk away from it. I agree with this point in principle. Everything else (queries, checkpointing) can fail, but it's critical that postmaster continues to run -- that way, once the high load episode is over, connections can be re-established as needed, auxiliary processes can be re-launched, and the system can be again working normally. If postmaster dies, all bets are off. Also: an idle postmaster is not using any resources; on its own, killing it or it dying would not free any useful resources for the system load to be back to low again. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > True, it's not a situation you especially want to be in. However, I've lost count of the number of times that I've heard someone talk about how their system was overstressed to the point that everything else was failing, but Postgres kept chugging along. That's a good reputation to have and we shouldn't just walk away from it.
>
> I agree with this point in principle. Everything else (queries, checkpointing) can fail, but it's critical that postmaster continues to run -- that way, once the high load episode is over, connections can be re-established as needed, auxiliary processes can be re-launched, and the system can be again working normally. If postmaster dies, all bets are off. Also: an idle postmaster is not using any resources; on its own, killing it or it dying would not free any useful resources for the system load to be back to low again.

Sure, I'm not arguing that the postmaster should blow up and die. I was, however, arguing that if the postmaster fails to launch workers for a parallel query due to process table exhaustion, it's OK for *that query* to error out. Tom finds that argument to be "utter bunkum," but I don't agree.

I think there might also be some implementation complexity there that is more than meets the eye. If a process trying to register workers finds out that no worker slots are available, it discovers this at the time it tries to perform the registration. But fork() failure happens later and in a different process. The original process just finds out that the worker is "stopped," not whether or not it ever got started in the first place.

We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out. Now you might think we could just fix that by having the postmaster record something in the slot, but that doesn't work either, because the slot could get reused before the original process checks the status information. The fact that the slot has been reused is sufficient evidence that the worker was unregistered, which means it either stopped or we gave up on starting it, but it doesn't tell us which one.

To be able to tell that, we'd have to have a mechanism to prevent slots from getting reused until any necessary exit status information had been read, sort of like the OS-level zombie process mechanism (which we all love, I guess, and therefore definitely want to reinvent...?). The postmaster logic would need to be made more complicated, so that zombies couldn't accumulate: if a process asked for status notifications, but then died, any zombies waiting for it would need to be cleared. And you'd also have to make sure that a process which didn't die was guaranteed to read the status from the zombie to clear it, and that it did so in a reasonably timely fashion, which is currently in no way guaranteed and does not appear at all straightforward to guarantee.

And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.
In short, I think Tom wants a pony. But that does not mean we should not fix this bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Nov-04, Robert Haas wrote: > On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > True, it's not a situation you especially want to be in. However, > > > I've lost count of the number of times that I've heard someone talk > > > about how their system was overstressed to the point that everything > > > else was failing, but Postgres kept chugging along. That's a good > > > reputation to have and we shouldn't just walk away from it. > > > > I agree with this point in principle. Everything else (queries, > > checkpointing) can fail, but it's critical that postmaster continues to > > run [...] > > Sure, I'm not arguing that the postmaster should blow up and die. I must have misinterpreted you, then. But then I also misinterpreted Tom, because I thought it was this stability problem that was "utter bunkum". > I was, however, arguing that if the postmaster fails to launch workers > for a parallel query due to process table exhaustion, it's OK for > *that query* to error out. That position makes sense to me. It would be nice [..ponies..] for the query to run regardless, but if it doesn't, it's not such a big deal; the query could have equally failed to run in a single process anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Nov-04, Robert Haas wrote: >> On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> I agree with this point in principle. Everything else (queries, >>> checkpointing) can fail, but it's critical that postmaster continues to >>> run [...] >> Sure, I'm not arguing that the postmaster should blow up and die. > I must have misinterpreted you, then. But then I also misinterpreted > Tom, because I thought it was this stability problem that was "utter > bunkum". I fixed the postmaster crash problem in commit 3887e9455. The residual issue that I think is entirely bogus is that the parallel query start code will silently continue without workers if it hits our internal resource limit of how many bgworker ProcArray slots there are, but not do the same when it hits the external resource limit of the kernel refusing to fork(). I grant that there might be implementation reasons for that being difficult, but I reject Robert's apparent opinion that it's somehow desirable to behave that way. As things stand, we have all of the disadvantages that you can't predict how many workers you'll get, and none of the advantages of robustness in the face of system resource exhaustion. regards, tom lane
Hi,

On 2019-10-09 12:29:18 -0400, Robert Haas wrote:
> I would say rather that if fork() is failing on your system, you have a not very stable system.

I don't think that's really true, fwiw. It's often a good idea to turn on strict memory overcommit accounting, and with that set, it's actually fairly common to see fork() fail with ENOMEM, even if there's practically a reasonable amount of resources. Especially with larger shared buffers and without huge pages, the amount of memory needed for a postmaster child in the worst case is not insubstantial.

> The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well.

Well, but parallel query also has the potential to lead to a lot of new backends being started much more quickly than you'd get new connections on an analytics DB.

> Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate.

I concede that you have a point here.

Greetings,

Andres Freund
Hi,

On 2019-11-04 12:14:53 -0500, Robert Haas wrote:
> If a process trying to register workers finds out that no worker slots are available, it discovers this at the time it tries to perform the registration. But fork() failure happens later and in a different process. The original process just finds out that the worker is "stopped," not whether or not it ever got started in the first place.

Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the

Funny (?) anecdote: I learned about this part of the system recently, after I had installed some crash handler inside postgres. Turns out that that diverted, as a side-effect, SIGUSR1 to its own signal handler. All tests in the main regression tests passed, except for ones getting stuck waiting for WaitForParallelWorkersToFinish(), which could be fixed by disabling parallelism aggressively. Took me like two hours to debug... Also, a bit sad that parallel query is the only visible failure (in the main tests) of breaking the sigusr1 infrastructure...

> We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.

Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?

> And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.

Fair enough. Although I think practically nearly everything has to be ready to handle workers just being slow to start up anyway, no? There are plenty of cases where we just finish before all workers get around to doing work.

Greetings,

Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-10-09 12:29:18 -0400, Robert Haas wrote: > > I would say rather that if fork() is failing on your system, you have > > a not very stable system. > > I don't think that's really true, fwiw. It's often a good idea to turn > on strict memory overcommit accounting, and with that set, it's actually > fairly common to see fork() fail with ENOMEM, even if there's > practically a reasonable amount of resources. Especially with larger > shared buffers and without huge pages, the amount of memory needed for a > postmaster child in the worst case is not insubstantial. I've not followed this thread very closely, but I agree with Andres here wrt fork() failing with ENOMEM in the field and not because the system isn't stable. Thanks, Stephen
On Mon, Nov 4, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the

One of us is confused here, because I don't think that helps. Consider three background workers Alice, Bob, and Charlie. Alice fails to launch because fork() fails. Bob launches but then exits unexpectedly. Charlie has no difficulties and carries out his assigned duties.

Now, the system you are proposing will say that Charlie is OK but Alice and Bob are a problem. However, that's the way it already works. What Tom wants is to distinguish Alice from Bob, and your proposal is of no help at all with that problem, so far as I can see.

> > We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.
>
> Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?

An unhandled ERROR in the worker is not a PANIC. I think it's just an ERROR that ends up being fatal in effect, but even if it's actually promoted to FATAL, it's not a PANIC. It is *generally* true that if a worker hits an ERROR, the error will be propagated back to the leader, but it is not an invariable rule. One pretty common way that it fails to happen - common in the sense that it comes up during development, not common on production systems I hope - is if a worker dies before reaching the call to pq_redirect_to_shm_mq(). Before that, there's no possibility of communicating anything. Granted, at that point we shouldn't yet have done any work that might mess up the query results. Similarly, once we reach that point, we are dependent on a certain amount of good behavior for things to work as expected; yeah, any code that calls proc_exit() is supposed to signal an ERROR or FATAL first, but what if it doesn't? Granted, in that case we'd probably fail to send an 'X' message, too, so the leader would still have a chance to realize something is wrong.

I guess I agree with you to this extent: I made a policy decision that if a worker is successfully registered but then fails to show up, that's an ERROR. It would be possible to adopt the opposite policy, namely that if a worker doesn't show up, that's an "oh well." You'd have to be very certain that the worker wasn't going to show up later, though. For instance, suppose you check all of the shared memory queues used for returning tuples and find that every queue is either in a state where (1) nobody's ever attached to it or (2) somebody attached and then detached. This is not good enough, because it's possible that after you checked queue #1, and found it in the former state, someone attached and read a block, which caused queue #2 to enter the latter state before you got around to checking it. If you decide that it's OK to decide that we're done at this point, you'll never return the tuples that are pushed through queue #1.
But, assuming you nailed the door shut so that such problems could not occur, I think we could make a decision to ignore workers that failed before doing anything interesting. Whether that would be a good policy decision is pretty questionable in my mind. In addition to what I mentioned before, I think there's a serious danger that errors that users would have really wanted to know about - or developers would really want to have known about - would get ignored. You could have some horrible problem that's making your workers fail to launch, and the system would just carry on as if everything were fine, except with bad query plans. I realize that you and others might say "oh, well, monitor your logs, then," but I think there is certainly some value in an ordinary user being able to know that things didn't go well without having to look into the PostgreSQL log for errors. Now, maybe you think that's not enough value to justify having it work the way it does today, and I certainly respect that, but I don't view it that way myself.

What I mostly want to emphasize here is that, while parallel query has had a number of bugs in this area that were the result of shoddy design or inadequate testing - principally by me - this isn't one of them. This decision was made consciously by me because I thought it gave us the best chance of having a system that would be reliable and have satisfying behavior for users. Sounds like not everybody agrees, and that's fine, but I just want to get it out there that this wasn't accidental on my part.

> > And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.
>
> Fair enough. Although I think practically nearly everything has to be ready to handle workers just being slow to start up anyway, no? There are plenty of cases where we just finish before all workers get around to doing work.

Because of the shutdown race mentioned above, we generally have to wait for workers to exit before we can shut down parallelism. See commit 2badb5afb89cd569500ef7c3b23c7a9d11718f2f (whose commit message also documents some of the behaviors now in question). So we tolerate slow startup in that it doesn't prevent us from getting started on query execution, but not to the extent that we can finish query execution without knowing definitively that every worker is either already gone or will never be showing up.

(Is it possible to do better there? Perhaps. If we could somehow throw up a brick wall to prevent new workers from doing anything that would cause problems, then verify that every worker which got past the brick wall has exited cleanly, then we could ignore the risk of more workers showing up later, because they'd hit the brick wall before causing any trouble.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-04 14:58:20 -0500, Robert Haas wrote:
> On Mon, Nov 4, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> > Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the
>
> One of us is confused here, because I don't think that helps. Consider three background workers Alice, Bob, and Charlie. Alice fails to launch because fork() fails. Bob launches but then exits unexpectedly. Charlie has no difficulties and carries out his assigned duties.
>
> Now, the system you are proposing will say that Charlie is OK but Alice and Bob are a problem. However, that's the way it already works. What Tom wants is to distinguish Alice from Bob, and your proposal is of no help at all with that problem, so far as I can see.

I don't see how what I'm saying treats Alice and Bob the same. What I'm saying is that if a worker has been started and shut down, without signalling an error via the error queue, and without an exit code that causes postmaster to worry, then we can just ignore the worker for the purpose of determining whether the query succeeded, without a meaningful loss in reliability. And we can detect such cases easily; we already do. We just have to remove an ereport(), and document things.

> > > We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.
> >
> > Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?
>
> An unhandled ERROR in the worker is not a PANIC. I think it's just an ERROR that ends up being fatal in effect, but even if it's actually promoted to FATAL, it's not a PANIC.

If it's an _exit without going through the PG machinery, it'll eventually be PANIC, albeit with a slight delay. And once we're actually executing the parallel query, we better have error reporting set up for parallel queries.

> It is *generally* true that if a worker hits an ERROR, the error will be propagated back to the leader, but it is not an invariable rule. One pretty common way that it fails to happen - common in the sense that it comes up during development, not common on production systems I hope - is if a worker dies before reaching the call to pq_redirect_to_shm_mq(). Before that, there's no possibility of communicating anything. Granted, at that point we shouldn't yet have done any work that might mess up the query results.

Right.

> Similarly, once we reach that point, we are dependent on a certain amount of good behavior for things to work as expected; yeah, any code that calls proc_exit() is supposed to signal an ERROR or FATAL first, but what if it doesn't? Granted, in that case we'd probably fail to send an 'X' message, too, so the leader would still have a chance to realize something is wrong.
I mean, in that case so many more things are screwed up, I don't buy that it's worth pessimizing ENOMEM handling for this. And if you're really concerned, we could add a before_shmem_exit hook or such that makes extra double sure that we've signalled something.

> I guess I agree with you to this extent: I made a policy decision that if a worker is successfully registered but then fails to show up, that's an ERROR. It would be possible to adopt the opposite policy, namely that if a worker doesn't show up, that's an "oh well." You'd have to be very certain that the worker wasn't going to show up later, though. For instance, suppose you check all of the shared memory queues used for returning tuples and find that every queue is either in a state where (1) nobody's ever attached to it or (2) somebody attached and then detached. This is not good enough, because it's possible that after you checked queue #1, and found it in the former state, someone attached and read a block, which caused queue #2 to enter the latter state before you got around to checking it. If you decide that it's OK to decide that we're done at this point, you'll never return the tuples that are pushed through queue #1.

That's why the code *already* waits for workers to attach, or for the slot to be marked unused/invalid/reused. I don't see how that applies to not explicitly erroring out when we know that the worker *failed* to start:

void
WaitForParallelWorkersToFinish(ParallelContext *pcxt)
...
			/*
			 * We didn't detect any living workers, but not all workers are
			 * known to have exited cleanly.  Either not all workers have
			 * launched yet, or maybe some of them failed to start or
			 * terminated abnormally.
			 */
			for (i = 0; i < pcxt->nworkers_launched; ++i)
			{
				pid_t		pid;
				shm_mq	   *mq;

				/*
				 * If the worker is BGWH_NOT_YET_STARTED or BGWH_STARTED, we
				 * should just keep waiting.  If it is BGWH_STOPPED, then
				 * further investigation is needed.
				 */
				if (pcxt->worker[i].error_mqh == NULL ||
					pcxt->worker[i].bgwhandle == NULL ||
					GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle,
										   &pid) != BGWH_STOPPED)
					continue;

				/*
				 * Check whether the worker ended up stopped without ever
				 * attaching to the error queue.  If so, the postmaster was
				 * unable to fork the worker or it exited without initializing
				 * properly.  We must throw an error, since the caller may
				 * have been expecting the worker to do some work before
				 * exiting.
				 */
				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
				if (shm_mq_get_sender(mq) == NULL)
					ereport(ERROR,
							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
							 errmsg("parallel worker failed to initialize"),
							 errhint("More details may be available in the server log.")));

				/*
				 * The worker is stopped, but is attached to the error queue.
				 * Unless there's a bug somewhere, this will only happen when
				 * the worker writes messages and terminates after the
				 * CHECK_FOR_INTERRUPTS() near the top of this function and
				 * before the call to GetBackgroundWorkerPid().  In that case,
				 * our latch should have been set as well and the right things
				 * will happen on the next pass through the loop.
				 */
			}
		}

> But, assuming you nailed the door shut so that such problems could not occur, I think we could make a decision to ignore workers that failed before doing anything interesting. Whether that would be a good policy decision is pretty questionable in my mind. In addition to what I mentioned before, I think there's a serious danger that errors that users would have really wanted to know about - or developers would really want to have known about - would get ignored.
> You could have some horrible problem that's making your workers fail to launch, and the system would just carry on as if everything were fine, except with bad query plans. I realize that you and others might say "oh, well, monitor your logs, then," but I think there is certainly some value in an ordinary user being able to know that things didn't go well without having to look into the PostgreSQL log for errors. Now, maybe you think that's not enough value to justify having it work the way it does today, and I certainly respect that, but I don't view it that way myself.

Yea, this is somewhat of a pickle. I'm inclined to think that the cure of unnecessarily ERRORing out queries is worse than the disease, though: causing unnecessary failures just to make debugging of some not-all-that-likely errors easier is a rather severe measure.

I think I mentioned this to you on chat, but I think the in-core use of bgworkers, at least as they currently are designed, is/was an architecturally bad idea. There are numerous problems flowing from that, with error handling being one big recurring theme.

Greetings,

Andres Freund
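For concreteness, the "remove an ereport()" change Andres describes would land in the WaitForParallelWorkersToFinish() block quoted above. A very rough illustrative sketch, not a proposed patch: the log message wording here is invented, and a real change would also need to release the worker's bookkeeping and consider WaitForParallelWorkersToAttach() callers.

	mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
	if (shm_mq_get_sender(mq) == NULL)
	{
		/*
		 * Sketch: the worker stopped without ever attaching to its error
		 * queue (e.g. the postmaster could not fork it), so it cannot have
		 * done any work.  Forget it and continue with fewer workers,
		 * instead of throwing ERROR "parallel worker failed to initialize".
		 */
		ereport(LOG,
				(errmsg("parallel worker failed to start; continuing without it")));
		pcxt->worker[i].error_mqh = NULL;	/* stop waiting on this worker */
		continue;
	}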