Thread: [HACKERS] Instability in select_parallel regression test
Buildfarm members gaur, pademelon, and gharial have all recently shown failures like this:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out	Thu Feb 16 20:35:14 2017
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out	Thu Feb 16 21:17:58 2017
***************
*** 163,167 ****
  -- provoke error in worker
  select stringu1::int2 from tenk1 where unique1 = 1;
  ERROR: invalid input syntax for integer: "BAAAAA"
- CONTEXT: parallel worker
  rollback;
--- 163,166 ----

======================================================================

The failure appears intermittent on gharial but is quite reproducible on gaur/pademelon. I inserted some debug elog calls and got this trace of events:

2017-02-17 00:28:32.641 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 1
2017-02-17 00:28:32.641 EST [18934] STATEMENT: select count(*) from a_star;
2017-02-17 00:28:32.643 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 1 workers
2017-02-17 00:28:32.643 EST [18934] STATEMENT: select count(*) from a_star;
2017-02-17 00:28:32.731 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 2
2017-02-17 00:28:32.731 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:32.824 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 3
2017-02-17 00:28:32.824 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG: LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:32.923 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 4
2017-02-17 00:28:32.923 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG: LaunchParallelWorkers: registered 3 of 4 workers
2017-02-17 00:28:32.989 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 5
2017-02-17 00:28:32.989 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.040 EST [18934] LOG: LaunchParallelWorkers: registered 4 of 4 workers
2017-02-17 00:28:33.040 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.093 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 6
2017-02-17 00:28:33.093 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:33.173 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 7
2017-02-17 00:28:33.173 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG: LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:33.253 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.253 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.254 EST [18934] LOG: LaunchParallelWorkers: failed to register 3 of 4 workers
2017-02-17 00:28:33.254 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.557 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.557 EST [18934] STATEMENT: select count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.557 EST [18934] LOG: LaunchParallelWorkers: failed to register 1 of 4 workers
2017-02-17 00:28:33.557 EST [18934] STATEMENT: select count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.703 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.703 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.703 EST [18934] LOG: LaunchParallelWorkers: failed to register 1 of 1 workers
2017-02-17 00:28:33.703 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.704 EST [18934] ERROR: invalid input syntax for integer: "BAAAAA"
2017-02-17 00:28:33.704 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 7
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 6
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 5
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 4
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 3
2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 2
2017-02-17 00:28:33.855 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 1

In short, it looks to me like ExecShutdownGatherWorkers doesn't actually wait for parallel workers to finish (as its comment suggests is necessary), so that on not-too-speedy machines the worker slots may all still be in use when the next command wants some. This does not seem like a good thing from a reproducibility or error-detection standpoint. It is *certainly* not what you would expect from reading any of the comments in the vicinity.

I don't know if this is a generic logic failure or if there's something platform-specific about it --- it's suspicious that it's only showing up on ia64 and hppa architectures.

I also find it curious that nobody had seen this at all until the past couple days. I suspect that 51ee6f316 changed the number of parallel workers demanded by the select_parallel regression test, which is, shall we say, not what its commit log message would suggest.

			regards, tom lane
On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually wait for parallel workers to finish (as its comment suggests is necessary), so that on not-too-speedy machines the worker slots may all still be in use when the next command wants some.
>

ExecShutdownGatherWorkers() does wait for the workers to exit/finish, but it doesn't wait for the postmaster to free the used slots, and that is how that API is supposed to work. There is a good chance that on slow machines the slots get freed by the postmaster only well after the workers have exited.

> This does not seem like a good thing from a reproducibility or error-detection standpoint. It is *certainly* not what you would expect from reading any of the comments in the vicinity.
>
> I don't know if this is a generic logic failure or if there's something platform-specific about it --- it's suspicious that it's only showing up on ia64 and hppa architectures.
>
> I also find it curious that nobody had seen this at all until the past couple days. I suspect that 51ee6f316 changed the number of parallel workers demanded by the select_parallel regression test, which is, shall we say, not what its commit log message would suggest.
>

Commit 51ee6f316 doesn't change the number of parallel workers demanded by select_parallel; rather, that changed in commit 5e6d8d2, which added a new test to select_parallel. I think that is somewhat clear from the logs below, where the first test for which LaunchParallelWorkers fails is the one introduced by commit 5e6d8d2.

> 2017-02-17 00:28:33.253 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.254 EST [18934] LOG: LaunchParallelWorkers: failed to register 3 of 4 workers
> 2017-02-17 00:28:33.254 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousand from tenk2 where thousand > 100);

Now, I think that the test added by commit 5e6d8d2 isn't itself at fault, because it is normally okay to run parallel-query tests even when the required workers are not available. I think what we need to do here is to move the test that does need workers so that it executes before the other parallel-query tests, which have no such requirement.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually wait for parallel workers to finish (as its comment suggests is necessary), so that on not-too-speedy machines the worker slots may all still be in use when the next command wants some.

> ExecShutdownGatherWorkers() does wait for the workers to exit/finish, but it doesn't wait for the postmaster to free the used slots, and that is how that API is supposed to work. There is a good chance that on slow machines the slots get freed by the postmaster only well after the workers have exited.

That seems like a seriously broken design to me, first because it can make for a significant delay in the slots becoming available (which is what's evidently causing these regression failures), and second because it's simply bad design to load extra responsibilities onto the postmaster. Especially ones that involve touching shared memory.

I think this needs to be changed, and promptly. Why in the world don't you simply have the workers clearing their slots when they exit?

We don't have an expectation that regular backends are incompetent to clean up after themselves. (Obviously, a crash exit is a different case.)

> I think what we need to do here is to move the test that does need workers so that it executes before the other parallel-query tests, which have no such requirement.

That's not fixing the problem, it's merely averting your eyes from the symptom.

			regards, tom lane
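(For concreteness, "workers clearing their slots when they exit" might look roughly like the following. This is a purely hypothetical sketch: nothing currently gives a worker a pointer to its own slot, the real slot struct is private to bgworker.c, and the postmaster's local RegisteredBgWorker bookkeeping would still need to be reconciled. MyBgworkerSlot and HypotheticalWorkerSlot are assumed names, not existing code.)

    #include "postgres.h"

    #include "port/atomics.h"
    #include "storage/ipc.h"

    /* Assumed stand-in for the worker's own bgworker slot in shared memory. */
    typedef struct HypotheticalWorkerSlot
    {
        pid_t   pid;        /* 0 means the worker is no longer running */
        bool    in_use;     /* false means the slot may be grabbed again */
    } HypotheticalWorkerSlot;

    static HypotheticalWorkerSlot *MyBgworkerSlot = NULL;   /* does not exist today */

    /*
     * Hypothetical on_shmem_exit callback: on a normal exit, the worker itself
     * marks its slot as stopped and free, instead of leaving that entirely to
     * the postmaster's ForgetBackgroundWorker().
     */
    static void
    ReleaseMyBackgroundWorkerSlot(int code, Datum arg)
    {
        MyBgworkerSlot->pid = 0;            /* publish "no longer running" */
        pg_write_barrier();                 /* order the pid store before the release */
        MyBgworkerSlot->in_use = false;     /* slot immediately reusable */
    }

    /* Registered early in the worker's main function, e.g.:
     *     on_shmem_exit(ReleaseMyBackgroundWorkerSlot, (Datum) 0);
     */

on_shmem_exit() callbacks do run on a normal worker exit; the counterarguments downthread are mostly about crashes, startup failures, and restartable workers.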
On Fri, Feb 17, 2017 at 9:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually wait for parallel workers to finish (as its comment suggests is necessary), so that on not-too-speedy machines the worker slots may all still be in use when the next command wants some.
>
>> ExecShutdownGatherWorkers() does wait for the workers to exit/finish, but it doesn't wait for the postmaster to free the used slots, and that is how that API is supposed to work. There is a good chance that on slow machines the slots get freed by the postmaster only well after the workers have exited.
>
> That seems like a seriously broken design to me, first because it can make for a significant delay in the slots becoming available (which is what's evidently causing these regression failures), and second because it's simply bad design to load extra responsibilities onto the postmaster. Especially ones that involve touching shared memory.
>
> I think this needs to be changed, and promptly. Why in the world don't you simply have the workers clearing their slots when they exit?
>

There seem to be many reasons why the exit of background workers is handled by the postmaster: they may have to be restarted after a crash, someone may terminate them (TerminateBackgroundWorker()), or a notification may have to be sent at exit (bgw_notify_pid). Moreover, there can be background workers without shared-memory access, which can't clear state from shared memory at all. Another point to consider is that background workers run user-supplied code, so I am not sure how good an idea it is to let them tinker with shared data structures of the backend. Now, one can argue that in the cases where it is possible, a background worker should clean up its shared-memory state at exit, but I think that is not directly related to parallel query. As far as parallel query is concerned, it just needs to wait for the workers to exit, to ensure that no more operations can be performed in workers after that point, so that it can accumulate stats and retrieve some fixed parallel state information.

> We don't have an expectation that regular backends are incompetent to clean up after themselves. (Obviously, a crash exit is a different case.)
>
>> I think what we need to do here is to move the test that does need workers so that it executes before the other parallel-query tests, which have no such requirement.
>
> That's not fixing the problem, it's merely averting your eyes from the symptom.
>

I am not sure why you think so. Parallel query is capable of running even if the planned number of workers is not available, and there are many reasons why that can happen. In general, I think the tests should not rely on the availability of background workers, and if a test does rely on it, then it should be the test's responsibility to ensure that they are available.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 17, 2017 at 9:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> That seems like a seriously broken design to me, first because it can make for a significant delay in the slots becoming available (which is what's evidently causing these regression failures), and second because it's simply bad design to load extra responsibilities onto the postmaster. Especially ones that involve touching shared memory.

It's a little surprising to me that the delay we're seeing here is significant, because the death of a child should cause an immediate SIGCHLD, resulting in a call to reaper(), resulting in a call to waitpid(). Why's that not working?

>> I think this needs to be changed, and promptly. Why in the world don't you simply have the workers clearing their slots when they exit?
>
> There seem to be many reasons why the exit of background workers is handled by the postmaster: they may have to be restarted after a crash, someone may terminate them (TerminateBackgroundWorker()), or a notification may have to be sent at exit (bgw_notify_pid). Moreover, there can be background workers without shared-memory access, which can't clear state from shared memory at all. Another point to consider is that background workers run user-supplied code, so I am not sure how good an idea it is to let them tinker with shared data structures of the backend. Now, one can argue that in the cases where it is possible, a background worker should clean up its shared-memory state at exit, but I think that is not directly related to parallel query. As far as parallel query is concerned, it just needs to wait for the workers to exit, to ensure that no more operations can be performed in workers after that point, so that it can accumulate stats and retrieve some fixed parallel state information.

That's a pretty good list of reasons with which I agree. To underscore one of Amit's points, bgw_notify_pid requires SIGUSR1 to be sent to the process that registered the worker, if that process is still running, both when the worker process is started and when it terminates. The workers have no way of keeping track of whether the process that did the registration is still running, and with bad luck might SIGUSR1 an unrelated process (possibly killing it). And those SIGUSR1s are absolutely necessary, because otherwise a process that is waiting for a worker to start or stop would have to poll, which would make the whole system more resource-intensive and a whole lot less responsive.

There are a few more reasons Amit didn't mention:

1. Workers can fail during startup before they can possibly have done enough work to be able to mark their slots as being free. On Linux, fork() can fail; on Windows, you can fail either to start a new process or to have it reattach to the shared memory segment. You could try to have the postmaster catch just those cases where the worker doesn't get far enough and have the worker do the others, but that's almost bound to make things more complex and fragile. I can't see us coming out ahead there.

2. Worker slots don't get freed when the process terminates; they get freed when the background worker is not running and will never again be restarted. So there's a subtle lifespan difference here: a background worker slot is consumed before any process is started, persists across possibly many restarts of that process, and is finally freed when the process is both not currently running and not ever going to be restarted. It would no doubt be a bit fragile if the worker tried to guess whether or not the postmaster was going to restart it and free the slot only if not -- what happens if the postmaster comes to a different conclusion?

>>> I think what we need to do here is to move the test that does need workers so that it executes before the other parallel-query tests, which have no such requirement.
>>
>> That's not fixing the problem, it's merely averting your eyes from the symptom.
>
> I am not sure why you think so. Parallel query is capable of running even if the planned number of workers is not available, and there are many reasons why that can happen. In general, I think the tests should not rely on the availability of background workers, and if a test does rely on it, then it should be the test's responsibility to ensure that they are available.

I agree with that, too. One thing that I'm a little concerned about, though, is that users could also see the kind of behavior Tom seems to be describing here and might find it surprising. For example, suppose your queries all use 4 workers and you have 4 workers configured. If you're the only user on the system and you run query after query after query, you expect that all of those queries are going to get all 4 workers. If the postmaster is so unresponsive that you don't get all of them, or worse you don't get any, you might be tempted to swear at the stupid PostgreSQL developer who engineered this system. (Hi, swearing people!) I'd like to understand whether this is something that happens once in a blue moon on CLOBBER_CACHE_RECURSIVELY critters or whether it's a pretty regular thing. If it's a regular thing, that's kinda disappointing, because it implies that the operating system feels free to take a little holiday before delivering SIGCHLD, or something like that.

I guess one possible solution to this problem might be to change ExecParallelFinish's call to WaitForParallelWorkersToFinish() to instead call WaitForParallelWorkersToExit() -- and maybe just remove the former, since at that point it would have no remaining callers and no real use, if we believe that waiting for the slot to become free is the right behavior. The downside of this is that it increases the lag to shut down a parallel query, for the sake of what's basically a corner case. Yeah, there might be some cases where not being able to reuse the worker slots you just freed hurts, but is it worth making all of the queries that don't care about that a little slower to cater to the cases where you do care? I'm not sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
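(For concreteness, the substitution floated in that last paragraph would be small in terms of code. The following is a sketch only, paraphrasing execParallel.c from memory, so details may differ; it also glosses over the fact that WaitForParallelWorkersToExit() is currently private to parallel.c and would have to be exported.)

    #include "postgres.h"

    #include "access/parallel.h"
    #include "executor/execParallel.h"
    #include "executor/instrument.h"

    /*
     * Sketch of ExecParallelFinish() with the proposed substitution; everything
     * except the wait call is meant to be unchanged and is abbreviated here.
     */
    void
    ExecParallelFinish(ParallelExecutorInfo *pei)
    {
        int     i;

        /*
         * Today this is WaitForParallelWorkersToFinish(pei->pcxt), which returns
         * as soon as every worker has sent its final "I'm done" message.  The
         * idea above is to wait until the workers have actually exited and their
         * bgworker slots are reusable.
         */
        WaitForParallelWorkersToExit(pei->pcxt);

        /* Accumulate the workers' buffer usage into the leader's totals. */
        for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
            InstrAccumParallelQuery(&pei->buffer_usage[i]);

        /* ... instrumentation retrieval etc. unchanged ... */
    }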
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2017 at 9:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> That seems like a seriously broken design to me, first because it can make for a significant delay in the slots becoming available (which is what's evidently causing these regression failures), and second because it's simply bad design to load extra responsibilities onto the postmaster. Especially ones that involve touching shared memory.

> It's a little surprising to me that the delay we're seeing here is significant, because the death of a child should cause an immediate SIGCHLD, resulting in a call to reaper(), resulting in a call to waitpid(). Why's that not working?

I can't say for sure about gharial, but gaur/pademelon is a single-CPU machine, and I have reason to believe that its scheduler discriminates pretty hard against processes that have been deemed to be background processes. The reason it's not working is simply that the postmaster doesn't get a chance to run until the active backend has already decided that there are no free slots for the next parallel query (indeed, the next several parallel queries). Yeah, it got the signal, and eventually it gets some actual cycles, but not soon enough.

On a sufficiently loaded machine this could be expected to happen, at least occasionally, even if you had lots of CPUs. It's merely more readily reproducible on this particular configuration.

Also, you're assuming that the parallel worker process gets enough more cycles to exit once it's woken the parent backend with an "I'm done" signal. I wouldn't care to bet on that happening promptly either. I have definitely seen HPUX do a process context swap *immediately* when it sees a lower-priority process signal a higher-priority one, and not give the first process any more cycles for a good long time.

>> There seem to be many reasons why the exit of background workers is handled by the postmaster: they may have to be restarted after a crash, someone may terminate them (TerminateBackgroundWorker()), or a notification may have to be sent at exit (bgw_notify_pid). Moreover, there can be background workers without shared-memory access, which can't clear state from shared memory at all. Another point to consider is that background workers run user-supplied code, so I am not sure how good an idea it is to let them tinker with shared data structures of the backend. Now, one can argue that in the cases where it is possible, a background worker should clean up its shared-memory state at exit, but I think that is not directly related to parallel query. As far as parallel query is concerned, it just needs to wait for the workers to exit, to ensure that no more operations can be performed in workers after that point, so that it can accumulate stats and retrieve some fixed parallel state information.

> That's a pretty good list of reasons with which I agree.

It seems largely bogus IMO, except possibly the "no access to shared memory" reason, which is irrelevant for the problem at hand; I see no very probable reason why backends would ever be interested in launching workers that they couldn't communicate with. In particular, I'll buy the "user-supplied code" one only if we rip out every possibility of executing user-supplied C or plperlu or plpythonu code in standard backends. There is *no* good reason for treating parallel workers as less reliable than standard backends; if anything, given the constraints on what we let them do, they should be more reliable.

Basically, I think we need to fix things so that WaitForParallelWorkersToFinish doesn't return until the slot is free and there are no impediments to launching another worker. Anything less is going to result in race conditions and intermittent misbehavior on heavily-loaded machines. Personally I think that it'd be a good idea to get the postmaster out of the loop while we're at it, but the minimum change is to wait for the slot to be marked free by the postmaster.

>>>> I think what we need to do here is to move the test that does need workers so that it executes before the other parallel-query tests, which have no such requirement.

>>> That's not fixing the problem, it's merely averting your eyes from the symptom.

>> I am not sure why you think so.

In other words, you are willing to accept a system in which we can never be sure that any query after the first one in select_parallel actually ran in parallel? That seems silly on its face, and an enormous restriction on what we'll actually be able to test.

			regards, tom lane
On Sat, Feb 18, 2017 at 11:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's a little surprising to me that the delay we're seeing here is significant, because the death of a child should cause an immediate SIGCHLD, resulting in a call to reaper(), resulting in a call to waitpid(). Why's that not working?
>
> I can't say for sure about gharial, but gaur/pademelon is a single-CPU machine, ...

Ah, OK. Makes sense.

>>> There seem to be many reasons why the exit of background workers is handled by the postmaster: they may have to be restarted after a crash, someone may terminate them (TerminateBackgroundWorker()), or a notification may have to be sent at exit (bgw_notify_pid). Moreover, there can be background workers without shared-memory access, which can't clear state from shared memory at all. Another point to consider is that background workers run user-supplied code, so I am not sure how good an idea it is to let them tinker with shared data structures of the backend. Now, one can argue that in the cases where it is possible, a background worker should clean up its shared-memory state at exit, but I think that is not directly related to parallel query. As far as parallel query is concerned, it just needs to wait for the workers to exit, to ensure that no more operations can be performed in workers after that point, so that it can accumulate stats and retrieve some fixed parallel state information.
>
>> That's a pretty good list of reasons with which I agree.
>
> It seems largely bogus IMO, except possibly the "no access to shared memory" reason, which is irrelevant for the problem at hand; I see no very probable reason why backends would ever be interested in launching workers that they couldn't communicate with. In particular, I'll buy the "user-supplied code" one only if we rip out every possibility of executing user-supplied C or plperlu or plpythonu code in standard backends. There is *no* good reason for treating parallel workers as less reliable than standard backends; if anything, given the constraints on what we let them do, they should be more reliable.

I don't understand this paragraph at all. There are certainly use cases for backends starting workers with which they can't communicate, like starting some kind of daemon -- that was the original purpose of the background worker facility. But even when the process is planning to communicate with the worker, there's a time after the worker is requested and before that communication link is established. Failures during the intervening time have to be handled cleanly, and if you think that's trivial to do without postmaster involvement then I think you don't really understand the problem yet.

> Basically, I think we need to fix things so that WaitForParallelWorkersToFinish doesn't return until the slot is free and there are no impediments to launching another worker. Anything less is going to result in race conditions and intermittent misbehavior on heavily-loaded machines. Personally I think that it'd be a good idea to get the postmaster out of the loop while we're at it, but the minimum change is to wait for the slot to be marked free by the postmaster.

Such a change can be made, but as I pointed out in the part you didn't quote, there are reasons to wonder whether that will be a constructive change in real life even if it's better for the regression tests. Optimizing PostgreSQL for the use case of running regression tests in the buildfarm at the expense of other use cases wouldn't be very smart. Maybe such a change is better in real-world applications too, but that deserves at least a little bit of thought and substantive discussion.

>>> I am not sure why you think so.
>
> In other words, you are willing to accept a system in which we can never be sure that any query after the first one in select_parallel actually ran in parallel? That seems silly on its face, and an enormous restriction on what we'll actually be able to test.

Without denying that we may want to change the behavior here, if you think this is the first place in the regression tests where the behavior is timing-dependent, you obviously didn't attend https://www.pgcon.org/2016/schedule/track/Hacking/927.en.html

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Such a change can be made, but as I pointed out in the part you didn't quote, there are reasons to wonder whether that will be a constructive change in real life even if it's better for the regression tests. Optimizing PostgreSQL for the use case of running regression tests in the buildfarm at the expense of other use cases wouldn't be very smart. Maybe such a change is better in real-world applications too, but that deserves at least a little bit of thought and substantive discussion.

Rewind. Wait a minute. Looking at this code again, it looks like we're supposed to ALREADY BE DOING THIS.

DestroyParallelContext() calls WaitForParallelWorkersToExit(), which calls WaitForBackgroundWorkerShutdown() for each worker. That function returns only when the postmaster dies (which causes an error with that specific complaint) or when GetBackgroundWorkerPid() sets the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns BGWH_STOPPED when either (a) handle->generation != slot->generation (meaning that the slot got reused, and therefore must have been freed) or (b) slot->pid == 0. The pid only gets set to 0 in BackgroundWorkerStateChange() when slot->terminate is set, or in ReportBackgroundWorkerPID() when it's called from CleanupBackgroundWorker(). So this function should not be returning until after all workers have actually exited.

However, it looks like there's a race condition here, because the slot doesn't get freed up at the same time that the PID gets set to 0. That actually happens later, when the postmaster calls maybe_start_bgworker() or DetermineSleepTime() and one of those functions calls ForgetBackgroundWorker(). We could tighten this up by changing CleanupBackgroundWorker() to also call ForgetBackgroundWorker() immediately after calling ReportBackgroundWorkerPID() if rw->rw_terminate || rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART. If we do that BEFORE sending the notification to the starting process, that closes this hole. Almost.

There's still a possibility that the waiting process might receive SIGUSR1 for some unrelated reason and check the state of shared memory just after slot->pid is set to 0 and before slot->in_use is cleared and (also relevantly) just before BackgroundWorkerData->parallel_terminate_count is incremented. Fixing that seems a bit tougher, since we're certainly not going to have the postmaster start taking locks on shared data structures, but just forgetting the worker sooner would probably improve things a lot. I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
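(To make the described change concrete, it might look roughly like the following. This is an untested sketch only, with names and structure paraphrased from bgworker.c/postmaster.c; it assumes a ForgetBackgroundWorker() variant taking the rw pointer, and it is not the actual patch attached in the next message.)

    /* (as it might appear in src/backend/postmaster/bgworker.c) */

    /*
     * Hypothetical postmaster-side helper, called once CleanupBackgroundWorker()
     * has reaped a worker: release the bgworker slot *before* signalling the
     * process that registered the worker, so that the registrant cannot see
     * "worker stopped" while the slot still looks busy.
     */
    static void
    ReportBackgroundWorkerExit(RegisteredBgWorker *rw)
    {
        BackgroundWorkerSlot *slot;

        Assert(rw->rw_shmem_slot < max_worker_processes);
        slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];

        /* Publish the stopped state (rw->rw_pid has already been reset to 0). */
        slot->pid = rw->rw_pid;

        /*
         * If the worker is slated for removal and will never be restarted, free
         * the slot now instead of waiting for maybe_start_bgworker() or
         * DetermineSleepTime() to reach ForgetBackgroundWorker() later.
         */
        if (rw->rw_terminate ||
            rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
            ForgetBackgroundWorker(rw);     /* assumed rw-pointer variant */

        /* Only now wake up the process that registered the worker, if any. */
        if (rw->rw_worker.bgw_notify_pid != 0)
            kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
    }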
On Sun, Feb 19, 2017 at 5:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, it looks like there's a race condition here, because the slot doesn't get freed up at the same time that the PID gets set to 0. That actually happens later, when the postmaster calls maybe_start_bgworker() or DetermineSleepTime() and one of those functions calls ForgetBackgroundWorker(). We could tighten this up by changing CleanupBackgroundWorker() to also call ForgetBackgroundWorker() immediately after calling ReportBackgroundWorkerPID() if rw->rw_terminate || rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART. If we do that BEFORE sending the notification to the starting process, that closes this hole. Almost.

And here's a patch implementing something along those lines.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sun, Feb 19, 2017 at 5:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Such a change can be made, but as I pointed out in the part you didn't quote, there are reasons to wonder whether that will be a constructive change in real life even if it's better for the regression tests. Optimizing PostgreSQL for the use case of running regression tests in the buildfarm at the expense of other use cases wouldn't be very smart. Maybe such a change is better in real-world applications too, but that deserves at least a little bit of thought and substantive discussion.
>
> Rewind. Wait a minute. Looking at this code again, it looks like we're supposed to ALREADY BE DOING THIS.
>
> DestroyParallelContext() calls WaitForParallelWorkersToExit(), which calls WaitForBackgroundWorkerShutdown() for each worker. That function returns only when the postmaster dies (which causes an error with that specific complaint) or when GetBackgroundWorkerPid() sets the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns BGWH_STOPPED when either (a) handle->generation != slot->generation (meaning that the slot got reused, and therefore must have been freed) or (b) slot->pid == 0. The pid only gets set to 0 in BackgroundWorkerStateChange() when slot->terminate is set, or in ReportBackgroundWorkerPID() when it's called from CleanupBackgroundWorker(). So this function should not be returning until after all workers have actually exited.
>

Yeah, I had also noticed this point and was thinking about how to close this gap.

> However, it looks like there's a race condition here, because the slot doesn't get freed up at the same time that the PID gets set to 0. That actually happens later, when the postmaster calls maybe_start_bgworker() or DetermineSleepTime() and one of those functions calls ForgetBackgroundWorker(). We could tighten this up by changing CleanupBackgroundWorker() to also call ForgetBackgroundWorker() immediately after calling ReportBackgroundWorkerPID() if rw->rw_terminate || rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART. If we do that BEFORE sending the notification to the starting process, that closes this hole. Almost.
>

To close the remaining gap, don't you think we can also check the slot->in_use flag when the generation numbers of the handle and the slot are the same?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
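(In GetBackgroundWorkerPid()'s terms, that extra check would amount to something like the following. This is purely illustrative; the real function also synchronizes access to the slot and reports a BgwHandleStatus rather than a boolean.)

    /* A purely illustrative reader-side check, as it might appear in bgworker.c: */
    static bool
    BackgroundWorkerSlotIsReleased(BackgroundWorkerHandle *handle)
    {
        BackgroundWorkerSlot *slot = &BackgroundWorkerData->slot[handle->slot];

        /* A different generation means the slot has already been recycled. */
        if (handle->generation != slot->generation)
            return true;

        /*
         * Same generation: don't rely only on pid == 0; consider the worker
         * fully gone only once the postmaster has actually cleared in_use,
         * i.e. the slot can really be grabbed by the next registration.
         */
        return !slot->in_use;
    }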
On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> To close the remaining gap, don't you think we can also check the slot->in_use flag when the generation numbers of the handle and the slot are the same?

That doesn't completely fix it either, because ForgetBackgroundWorker() also does BackgroundWorkerData->parallel_terminate_count++, which we might likewise fail to see, and that would cause RegisterDynamicBackgroundWorker() to bail out early. There are CPU ordering effects to think about here, not just the order in which the operations are actually performed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2017 at 8:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> To close the remaining gap, don't you think we can also check the slot->in_use flag when the generation numbers of the handle and the slot are the same?
>
> That doesn't completely fix it either, because ForgetBackgroundWorker() also does BackgroundWorkerData->parallel_terminate_count++, which we might likewise fail to see, and that would cause RegisterDynamicBackgroundWorker() to bail out early. There are CPU ordering effects to think about here, not just the order in which the operations are actually performed.
>

Sure, I think we can attempt to fix that as well by adding a write memory barrier in ForgetBackgroundWorker(). The main point is that if we leave any loose end in this area, there is a chance that the select_parallel regression test can still fail, if not now then in the future. Another way would be to minimize the race condition here and then adjust select_parallel as suggested above so that we don't see this failure.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
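(For concreteness, the barrier being suggested would presumably sit between the two shared-memory updates in ForgetBackgroundWorker(), roughly as below. The surrounding code is paraphrased, and a matching read barrier would presumably also be needed on the RegisterDynamicBackgroundWorker() side for this to be airtight.)

    /* Inside ForgetBackgroundWorker(), paraphrased, where the slot is released: */

    if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
        BackgroundWorkerData->parallel_terminate_count++;

    /*
     * Suggested addition: make the incremented terminate count visible before
     * the slot can be observed as free.  Otherwise a backend in
     * RegisterDynamicBackgroundWorker() could see in_use == false yet read a
     * stale parallel_terminate_count, overestimate the number of active
     * parallel workers, and bail out early.
     */
    pg_write_barrier();

    slot->in_use = false;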
On Mon, Feb 20, 2017 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 8:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> To close the remaining gap, don't you think we can also check the slot->in_use flag when the generation numbers of the handle and the slot are the same?
>>
>> That doesn't completely fix it either, because ForgetBackgroundWorker() also does BackgroundWorkerData->parallel_terminate_count++, which we might likewise fail to see, and that would cause RegisterDynamicBackgroundWorker() to bail out early. There are CPU ordering effects to think about here, not just the order in which the operations are actually performed.
>
> Sure, I think we can attempt to fix that as well by adding a write memory barrier in ForgetBackgroundWorker().

I don't think so.

> The main point is that if we leave any loose end in this area, there is a chance that the select_parallel regression test can still fail, if not now then in the future. Another way would be to minimize the race condition here and then adjust select_parallel as suggested above so that we don't see this failure.

My guess is that if we apply the fix I suggested above, it'll be good enough. If that turns out not to be true, then I guess we'll have to deal with that, but why not do the easy thing first?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 26, 2017 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The main point is that if we leave any loose end in this area, there is a chance that the select_parallel regression test can still fail, if not now then in the future. Another way would be to minimize the race condition here and then adjust select_parallel as suggested above so that we don't see this failure.
>
> My guess is that if we apply the fix I suggested above, it'll be good enough. If that turns out not to be true, then I guess we'll have to deal with that, but why not do the easy thing first?
>

Okay, that is also a sensible approach. Your patch looks good to me, though I have not tested it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 27, 2017 at 8:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> My guess is that if we apply the fix I suggested above, it'll be good enough. If that turns out not to be true, then I guess we'll have to deal with that, but why not do the easy thing first?
>
> Okay, that is also a sensible approach. Your patch looks good to me, though I have not tested it.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company