Thread: [HACKERS] Instability in select_parallel regression test

[HACKERS] Instability in select_parallel regression test

From: Tom Lane
Buildfarm members gaur, pademelon, and gharial have all recently shown
failures like this:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out    Thu Feb 16 20:35:14 2017
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out    Thu Feb 16 21:17:58 2017
***************
*** 163,167 ****
  -- provoke error in worker
  select stringu1::int2 from tenk1 where unique1 = 1;
  ERROR:  invalid input syntax for integer: "BAAAAA"
- CONTEXT:  parallel worker
  rollback;
--- 163,166 ----

======================================================================

The failure appears intermittent on gharial but is quite reproducible
on gaur/pademelon.  I inserted some debug elog calls and got this trace
of events:

2017-02-17 00:28:32.641 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 1
2017-02-17 00:28:32.641 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.643 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 1 workers
2017-02-17 00:28:32.643 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.731 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 2
2017-02-17 00:28:32.731 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 3
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 4
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  LaunchParallelWorkers: registered 3 of 4 workers
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 5
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.040 EST [18934] LOG:  LaunchParallelWorkers: registered 4 of 4 workers
2017-02-17 00:28:33.040 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.093 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 6
2017-02-17 00:28:33.093 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 7
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
                (select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.557 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.557 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 4 workers
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.703 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.703 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 1 workers
2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.704 EST [18934] ERROR:  invalid input syntax for integer: "BAAAAA"
2017-02-17 00:28:33.704 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 7
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 6
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 5
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 4
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 3
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 2
2017-02-17 00:28:33.855 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 1

In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
wait for parallel workers to finish (as its comment suggests is
necessary), so that on not-too-speedy machines the worker slots may all
still be in use when the next command wants some.  This does not seem
like a good thing from a reproducibility or error-detection standpoint.
It is *certainly* not what you would expect from reading any of the
comments in the vicinity.

I don't know if this is a generic logic failure or if there's something
platform-specific about it --- it's suspicious that it's only showing up
on ia64 and hppa architectures.

I also find it curious that nobody had seen this at all until the past
couple days.  I suspect that 51ee6f316 changed the number of parallel
workers demanded by the select_parallel regression test, which is, shall
we say, not what its commit log message would suggest.
        regards, tom lane



Re: [HACKERS] Instability in select_parallel regression test

From: Amit Kapila
On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm members gaur, pademelon, and gharial have all recently shown
> failures like this:
>
> *** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out  Thu Feb 16 20:35:14 2017
> --- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out   Thu Feb 16 21:17:58 2017
> ***************
> *** 163,167 ****
>   -- provoke error in worker
>   select stringu1::int2 from tenk1 where unique1 = 1;
>   ERROR:  invalid input syntax for integer: "BAAAAA"
> - CONTEXT:  parallel worker
>   rollback;
> --- 163,166 ----
>
> ======================================================================
>
> The failure appears intermittent on gharial but is quite reproducible
> on gaur/pademelon.  I inserted some debug elog calls and got this trace
> of events:
>
> 2017-02-17 00:28:32.641 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 1
> 2017-02-17 00:28:32.641 EST [18934] STATEMENT:  select count(*) from a_star;
> 2017-02-17 00:28:32.643 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 1 workers
> 2017-02-17 00:28:32.643 EST [18934] STATEMENT:  select count(*) from a_star;
> 2017-02-17 00:28:32.731 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 2
> 2017-02-17 00:28:32.731 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.824 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
> 2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.824 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 3
> 2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.923 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
> 2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.923 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 4
> 2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.989 EST [18934] LOG:  LaunchParallelWorkers: registered 3 of 4 workers
> 2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.989 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 5
> 2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:33.040 EST [18934] LOG:  LaunchParallelWorkers: registered 4 of 4 workers
> 2017-02-17 00:28:33.040 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:33.093 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 6
> 2017-02-17 00:28:33.093 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.173 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
> 2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.173 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 7
> 2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.253 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
> 2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.557 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
> 2017-02-17 00:28:33.557 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 4 workers
> 2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
> 2017-02-17 00:28:33.703 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.703 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 1 workers
> 2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.704 EST [18934] ERROR:  invalid input syntax for integer: "BAAAAA"
> 2017-02-17 00:28:33.704 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 7
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 6
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 5
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 4
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 3
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 2
> 2017-02-17 00:28:33.855 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 1
>
> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
> wait for parallel workers to finish (as its comment suggests is
> necessary), so that on not-too-speedy machines the worker slots may all
> still be in use when the next command wants some.
>

ExecShutdownGatherWorkers() does wait for the workers to exit/finish,
but it doesn't wait for the postmaster to free the used slots, and
that is how that API is supposed to work.  There is a good chance that
on slow machines the slots get freed up by the postmaster much later,
after the workers have exited.
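
To make the difference concrete, here is a simplified sketch of the two
conditions involved.  This is not the actual source; the field names
follow my memory of the BackgroundWorkerSlot layout in bgworker.c and
may not match it exactly.

    /* what the backend effectively waits for at the end of a Gather */
    static bool
    worker_has_exited(BackgroundWorkerSlot *slot)
    {
        return slot->pid == 0;      /* the worker process itself is gone */
    }

    /* what RegisterDynamicBackgroundWorker() needs before it can reuse
     * the slot; only the postmaster clears in_use, in
     * ForgetBackgroundWorker() */
    static bool
    slot_is_reusable(BackgroundWorkerSlot *slot)
    {
        return !slot->in_use;
    }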

>  This does not seem
> like a good thing from a reproducibility or error-detection standpoint.
> It is *certainly* not what you would expect from reading any of the
> comments in the vicinity.
>
> I don't know if this is a generic logic failure or if there's something
> platform-specific about it --- it's suspicious that it's only showing up
> on ia64 and hppa architectures.
>
> I also find it curious that nobody had seen this at all until the past
> couple days.  I suspect that 51ee6f316 changed the number of parallel
> workers demanded by the select_parallel regression test, which is, shall
> we say, not what its commit log message would suggest.
>

Commit 51ee6f316 doesn't change the number of parallel workers demanded
by select_parallel; rather, that was changed by commit 5e6d8d2, which
added a new test to select_parallel.  I think that is somewhat clear
from the logs below, where the first statement for which
LaunchParallelWorkers fails is the test introduced by commit 5e6d8d2.

> 2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
> 2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);


Now, I think that the test added by commit 5e6d8d2 doesn't itself have
a problem, because it is normally okay to run parallel query tests even
when the required workers are not available.  I think what we need to
do here is to move the test that needs workers so that it executes
before the other parallel query tests, which have no such requirement.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Instability in select_parallel regression test

From: Tom Lane
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
>> wait for parallel workers to finish (as its comment suggests is
>> necessary), so that on not-too-speedy machines the worker slots may all
>> still be in use when the next command wants some.

> ExecShutdownGatherWorkers() does wait for the workers to exit/finish,
> but it doesn't wait for the postmaster to free the used slots, and
> that is how that API is supposed to work.  There is a good chance that
> on slow machines the slots get freed up by the postmaster much later,
> after the workers have exited.

That seems like a seriously broken design to me, first because it can make
for a significant delay in the slots becoming available (which is what's
evidently causing these regression failures), and second because it's
simply bad design to load extra responsibilities onto the postmaster.
Especially ones that involve touching shared memory.

I think this needs to be changed, and promptly.  Why in the world don't
you simply have the workers clearing their slots when they exit?
We don't have an expectation that regular backends are incompetent to
clean up after themselves.  (Obviously, a crash exit is a different
case.)

> I think what we need to do
> here is to move the test that needs workers to execute before other
> parallel query tests where there is no such requirement.

That's not fixing the problem, it's merely averting your eyes from
the symptom.
        regards, tom lane



Re: [HACKERS] Instability in select_parallel regression test

From: Amit Kapila
On Fri, Feb 17, 2017 at 9:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
>>> wait for parallel workers to finish (as its comment suggests is
>>> necessary), so that on not-too-speedy machines the worker slots may all
>>> still be in use when the next command wants some.
>
>> ExecShutdownGatherWorkers() does wait for the workers to exit/finish,
>> but it doesn't wait for the postmaster to free the used slots, and
>> that is how that API is supposed to work.  There is a good chance that
>> on slow machines the slots get freed up by the postmaster much later,
>> after the workers have exited.
>
> That seems like a seriously broken design to me, first because it can make
> for a significant delay in the slots becoming available (which is what's
> evidently causing these regression failures), and second because it's
> simply bad design to load extra responsibilities onto the postmaster.
> Especially ones that involve touching shared memory.
>
> I think this needs to be changed, and promptly.  Why in the world don't
> you simply have the workers clearing their slots when they exit?
>

There seem to be many reasons why the cleanup at exit of background
workers is done by the postmaster, such as when they have to restart
after a crash, or if someone terminates them
(TerminateBackgroundWorker()), or if a notification has to be sent at
exit (bgw_notify_pid).  Moreover, there can also be background workers
without shared memory access, which can't clear state from shared
memory.  Another point to consider is that background workers contain
user-supplied code, so I am not sure how good an idea it is to give
them control over tinkering with the backend's shared data structures.
Now, one can argue that in some of the cases where it is possible, a
background worker should clean up its shared memory state at exit, but
I think that is not directly related to parallel query.  As far as
parallel query is concerned, it just needs to wait for the workers to
exit, to ensure that no more operations can be performed in the
workers after that point, so that it can accumulate stats and retrieve
some fixed parallel state information.

> We don't have an expectation that regular backends are incompetent to
> clean up after themselves.  (Obviously, a crash exit is a different
> case.)
>
>> I think what we need to do
>> here is to move the test that needs workers to execute before other
>> parallel query tests where there is no such requirement.
>
> That's not fixing the problem, it's merely averting your eyes from
> the symptom.
>

I am not sure why you think so.  Parallel query is capable of running
without workers when the planned number of workers is not available,
and there are many reasons for that.  In general, I think the tests
should not rely on the availability of background workers, and if
there is a test that does, then it should be the responsibility of
that test to ensure they are available.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Fri, Feb 17, 2017 at 9:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> That seems like a seriously broken design to me, first because it can make
>> for a significant delay in the slots becoming available (which is what's
>> evidently causing these regression failures), and second because it's
>> simply bad design to load extra responsibilities onto the postmaster.
>> Especially ones that involve touching shared memory.

It's a little surprising to me that the delay we're seeing here is
significant, because the death of a child should cause an immediate
SIGCHLD, resulting in a call to reaper(), resulting in a call to
waitpid().  Why's that not working?

>> I think this needs to be changed, and promptly.  Why in the world don't
>> you simply have the workers clearing their slots when they exit?
>
> There seem to be many reasons why the cleanup at exit of background
> workers is done by the postmaster, such as when they have to restart
> after a crash, or if someone terminates them
> (TerminateBackgroundWorker()), or if a notification has to be sent at
> exit (bgw_notify_pid).  Moreover, there can also be background workers
> without shared memory access, which can't clear state from shared
> memory.  Another point to consider is that background workers contain
> user-supplied code, so I am not sure how good an idea it is to give
> them control over tinkering with the backend's shared data structures.
> Now, one can argue that in some of the cases where it is possible, a
> background worker should clean up its shared memory state at exit, but
> I think that is not directly related to parallel query.  As far as
> parallel query is concerned, it just needs to wait for the workers to
> exit, to ensure that no more operations can be performed in the
> workers after that point, so that it can accumulate stats and retrieve
> some fixed parallel state information.

That's a pretty good list of reasons with which I agree.  To
underscore one of Amit's points, bgw_notify_pid requires SIGUSR1 to be
sent to the process that registered the worker if that process is
still running, both when the process is started and when it
terminates.  The workers have no way of keeping track of whether the
process that did the registration is still running, and with bad luck
might SIGUSR1 an unrelated process (possibly killing it).  And those
SIGUSR1s are absolutely necessary, because otherwise a process that is
waiting for a worker to start or stop would have to poll, which would
make the whole system more resource-intensive and a whole lot less
responsive.

There are a few more reasons Amit didn't mention:

1. Workers can fail during startup before they can possibly have done
enough work to be able to mark their slots as being free.  On Linux,
fork() can fail; on Windows, you can fail either to start a new
process or to have it reattach to the shared memory segment.  You
could try to have the postmaster catch just those cases where the
worker doesn't get far enough and have the worker do the others, but
that's almost bound to make things more complex and fragile.  I can't
see us coming out ahead there.

2. Worker slots don't get freed when the process terminates; they get
freed when the background worker is not running and will never again
be restarted.  So there's a subtle lifespan difference here: a
background worker slot is consumed before any process is started, and
persists across possibly many restarts of that process, and is finally
freed when the process is both not currently running and not ever
going to be restarted.  It would no doubt be a bit fragile if the
worker tried to guess whether or not the postmaster was going to
restart it and free the slot only if not -- what happens if the
postmaster comes to a different conclusion?
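
For reference, the slot whose lifespan we're talking about looks
roughly like this -- reconstructed from memory, so see the real
definition in src/backend/postmaster/bgworker.c for the details:

    typedef struct BackgroundWorkerSlot
    {
        bool        in_use;      /* consumed at registration time; cleared
                                  * only by the postmaster, in
                                  * ForgetBackgroundWorker(), once the
                                  * worker will never run again */
        bool        terminate;   /* set by TerminateBackgroundWorker() */
        pid_t       pid;         /* InvalidPid = not yet started, 0 = exited */
        uint64      generation;  /* bumped each time the slot is recycled */
        BackgroundWorker worker; /* copy of the registration data */
    } BackgroundWorkerSlot;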

>>> I think what we need to do
>>> here is to move the test that needs workers to execute before other
>>> parallel query tests where there is no such requirement.
>>
>> That's not fixing the problem, it's merely averting your eyes from
>> the symptom.
>>
> I am not sure why you think so.  Parallel query is capable of running
> without workers when the planned number of workers is not available,
> and there are many reasons for that.  In general, I think the tests
> should not rely on the availability of background workers, and if
> there is a test that does, then it should be the responsibility of
> that test to ensure they are available.

I agree with that, too.  One thing that I'm a little concerned about,
though, is that users could also see the kind of behavior Tom seems to
be describing here and might find it surprising.  For example, suppose
your queries all use 4 workers and you have 4 workers configured.  If
you're the only user on the system and you run query after query after
query, you expect that all of those queries are going to get all 4
workers.  If the postmaster is so unresponsive that you don't get all
of them, or worse you don't get any, you might be tempted to swear at
the stupid PostgreSQL developer who engineered this system.  (Hi,
swearing people!)  I'd like to understand whether this is something
that happens once-in-a-blue-moon on CLOBBER_CACHE_RECURSIVELY critters
or whether it's a pretty regular thing.  If it's a regular thing,
that's kinda disappointing, because it implies that the operating
system feels free to take a little holiday before delivering SIGCHLD,
or something like that.

I guess one possible solution to this problem might be to change
ExecParallelFinish's call to WaitForParallelWorkersToFinish() to
instead call WaitForParallelWorkersToExit() -- and maybe just removing
the former, since at that point it would have no remaining callers and
no real use, if we believe that waiting for the slot to become free is
the right behavior.  The downside of this is that it increases the lag
to shut down a parallel query over what's basically a corner case.
Yeah, there might be some cases where not being able to reuse the
worker slots you just freed hurts, but is it worth making all of the
queries that don't care about that a little slower to cater to the
cases where you do care?  I'm not sure.
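
Concretely, the change I have in mind would amount to something like
this in ExecParallelFinish() -- just a sketch, not a tested patch, and
WaitForParallelWorkersToExit() is currently static in parallel.c, I
believe, so it would also need to be exported:

    /* today: returns once the workers have finished their work and sent
     * their instrumentation, but possibly before they have exited */
    WaitForParallelWorkersToFinish(pei->pcxt);

    /* proposed: don't return until the workers are actually gone */
    WaitForParallelWorkersToExit(pei->pcxt);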

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Instability in select_parallel regression test

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2017 at 9:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> That seems like a seriously broken design to me, first because it can make
>>> for a significant delay in the slots becoming available (which is what's
>>> evidently causing these regression failures), and second because it's
>>> simply bad design to load extra responsibilities onto the postmaster.
>>> Especially ones that involve touching shared memory.

> It's a little surprising to me that the delay we're seeing here is
> significant, because the death of a child should cause an immediate
> SIGCHLD, resulting in a call to reaper(), resulting in a call to
> waitpid().  Why's that not working?

I can't say for sure about gharial, but gaur/pademelon is a single-CPU
machine, and I have reason to believe that its scheduler discriminates
pretty hard against processes that have been deemed to be background
processes.  The reason it's not working is simply that the postmaster
doesn't get a chance to run until the active backend has already decided
that there are no free slots for the next parallel query (indeed, the
next several parallel queries).  Yeah, it got the signal, and eventually
it gets some actual cycles, but not soon enough.  On a sufficiently loaded
machine this could be expected to happen, at least occasionally, even if
you had lots of CPUs.  It's merely more readily reproducible on this
particular configuration.

Also, you're assuming that the parallel worker process gets enough more
cycles to exit once it's woken the parent backend with an "I'm done"
signal.  I wouldn't care to bet on that happening promptly either.
I have definitely seen HPUX do a process context swap *immediately*
when it sees a lower-priority process signal a higher-priority one,
and not give the first process any more cycles for a good long time.

>> There seem to be many reasons why the cleanup at exit of background
>> workers is done by the postmaster, such as when they have to restart
>> after a crash, or if someone terminates them
>> (TerminateBackgroundWorker()), or if a notification has to be sent at
>> exit (bgw_notify_pid).  Moreover, there can also be background workers
>> without shared memory access, which can't clear state from shared
>> memory.  Another point to consider is that background workers contain
>> user-supplied code, so I am not sure how good an idea it is to give
>> them control over tinkering with the backend's shared data structures.
>> Now, one can argue that in some of the cases where it is possible, a
>> background worker should clean up its shared memory state at exit, but
>> I think that is not directly related to parallel query.  As far as
>> parallel query is concerned, it just needs to wait for the workers to
>> exit, to ensure that no more operations can be performed in the
>> workers after that point, so that it can accumulate stats and retrieve
>> some fixed parallel state information.

> That's a pretty good list of reasons with which I agree.

It seems largely bogus IMO, except possibly the "no access to shared
memory" reason, which is irrelevant for the problem at hand; I see no very
probable reason why backends would ever be interested in launching workers
that they couldn't communicate with.  In particular, I'll buy the
"user-supplied code" one only if we rip out every possibility of executing
user-supplied C or plperlu or plpythonu code in standard backends.  There
is *no* good reason for treating parallel workers as less reliable than
standard backends; if anything, given the constraints on what we let them
do, they should be more reliable.

Basically, I think we need to fix things so that
WaitForParallelWorkersToFinish doesn't return until the slot is free
and there are no impediments to launching another worker.  Anything
less is going to result in race conditions and intermittent misbehavior
on heavily-loaded machines.  Personally I think that it'd be a good idea
to get the postmaster out of the loop while we're at it, but the minimum
change is to wait for the slot to be marked free by the postmaster.

>>>> I think what we need to do
>>>> here is to move the test that needs workers to execute before other
>>>> parallel query tests where there is no such requirement.

>>> That's not fixing the problem, it's merely averting your eyes from
>>> the symptom.

>> I am not sure why you think so.

In other words, you are willing to accept a system in which we can never
be sure that any query after the first one in select_parallel actually ran
in parallel?  That seems silly on its face, and an enormous restriction on
what we'll actually be able to test.
        regards, tom lane



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Sat, Feb 18, 2017 at 11:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's a little surprising to me that the delay we're seeing here is
>> significant, because the death of a child should cause an immediate
>> SIGCHLD, resulting in a call to reaper(), resulting in a call to
>> waitpid().  Why's that not working?
>
> I can't say for sure about gharial, but gaur/pademelon is a single-CPU
> machine, ...

Ah, OK.  Makes sense.

>>> There seem to be many reasons why the cleanup at exit of background
>>> workers is done by the postmaster, such as when they have to restart
>>> after a crash, or if someone terminates them
>>> (TerminateBackgroundWorker()), or if a notification has to be sent at
>>> exit (bgw_notify_pid).  Moreover, there can also be background workers
>>> without shared memory access, which can't clear state from shared
>>> memory.  Another point to consider is that background workers contain
>>> user-supplied code, so I am not sure how good an idea it is to give
>>> them control over tinkering with the backend's shared data structures.
>>> Now, one can argue that in some of the cases where it is possible, a
>>> background worker should clean up its shared memory state at exit, but
>>> I think that is not directly related to parallel query.  As far as
>>> parallel query is concerned, it just needs to wait for the workers to
>>> exit, to ensure that no more operations can be performed in the
>>> workers after that point, so that it can accumulate stats and retrieve
>>> some fixed parallel state information.
>
>> That's a pretty good list of reasons with which I agree.
>
> It seems largely bogus IMO, except possibly the "no access to shared
> memory" reason, which is irrelevant for the problem at hand; I see no very
> probable reason why backends would ever be interested in launching workers
> that they couldn't communicate with.  In particular, I'll buy the
> "user-supplied code" one only if we rip out every possibility of executing
> user-supplied C or plperlu or plpythonu code in standard backends.  There
> is *no* good reason for treating parallel workers as less reliable than
> standard backends; if anything, given the constraints on what we let them
> do, they should be more reliable.

I don't understand this paragraph at all.  There are certainly use
cases for backends starting workers with which they can't communicate,
like starting some kind of a daemon - that was the original purpose of
the background worker facility.  But even when the process is planning
to communicate with the worker, there's a time after the worker is
requested and before that communication link is established.  Failures
during the intervening time have to be handled cleanly, and if you
think that's trivial to do without postmaster involvement then I think
you don't really understand the problem yet.

> Basically, I think we need to fix things so that
> WaitForParallelWorkersToFinish doesn't return until the slot is free
> and there are no impediments to launching another worker.  Anything
> less is going to result in race conditions and intermittent misbehavior
> on heavily-loaded machines.  Personally I think that it'd be a good idea
> to get the postmaster out of the loop while we're at it, but the minimum
> change is to wait for the slot to be marked free by the postmaster.

Such a change can be made, but as I pointed out in the part you didn't
quote, there are reasons to wonder whether that will be a constructive
change in real life even if it's better for the regression tests.
Optimizing PostgreSQL for the use case of running regression tests in
the buildfarm at the expense of other use cases wouldn't be very
smart.  Maybe such a change is better in real-world applications too,
but that deserves at least a little bit of thought and substantive
discussion.

>>> I am not sure why you think so.
>
> In other words, you are willing to accept a system in which we can never
> be sure that any query after the first one in select_parallel actually ran
> in parallel?  That seems silly on its face, and an enormous restriction on
> what we'll actually be able to test.

Without denying that we may want to change the behavior here, if you
think this is the first place in the regression tests where the
behavior is timing-dependent, you obviously didn't attend
https://www.pgcon.org/2016/schedule/track/Hacking/927.en.html

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Such a change can be made, but as I pointed out in the part you didn't
> quote, there are reasons to wonder whether that will be a constructive
> change in real life even if it's better for the regression tests.
> Optimizing PostgreSQL for the use case of running regression tests in
> the buildfarm at the expense of other use cases wouldn't be very
> smart.  Maybe such a change is better in real-world applications too,
> but that deserves at least a little bit of thought and substantive
> discussion.

Rewind.  Wait a minute.  Looking at this code again, it looks like
we're supposed to ALREADY BE DOING THIS.

DestroyParallelContext() calls WaitForParallelWorkersToExit() which
calls WaitForBackgroundWorkerShutdown() for each worker.  That
function returns only when the postmaster dies (which causes an error
with that specific complaint) or when GetBackgroundWorkerPid() sets
the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns
BGWH_STOPPED when either (a) handle->generation != slot->generation
(meaning that the slot got reused, and therefore must have been freed)
or when (b) slot->pid == 0.  The pid only gets set to 0 in
BackgroundWorkerStateChange() when slot->terminate is set, or in
ReportBackgroundWorkerPID() when it's called from
CleanupBackgroundWorker.  So this function should not be returning
until after all workers have actually exited.
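
Paraphrasing that stopped-check in code (this is a restatement of the
logic, not the literal source of GetBackgroundWorkerPid()):

    if (handle->generation != slot->generation)
        return BGWH_STOPPED;    /* (a) slot was recycled, so it was freed */
    if (slot->pid == 0)
        return BGWH_STOPPED;    /* (b) postmaster has recorded the exit */
    /* otherwise the worker is still starting up or running */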

However, it looks like there's a race condition here, because the slot
doesn't get freed up at the same time that the PID gets set to 0.
That actually happens later, when the postmaster calls
maybe_start_bgworker() or DetermineSleepTime() and one of those
functions calls ForgetBackgroundWorker(). We could tighten this up by
changing CleanupBackgroundWorker() to also call
ForgetBackgroundWorker() immediately after calling
ReportBackgroundWorkerPID() if rw->rw_terminate ||
rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART.  If we do that
BEFORE sending the notification to the starting process, that closes
this hole.  Almost.
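
Roughly like this, I mean -- a sketch of the idea only, not the actual
patch; locking and error paths are omitted, and the local variable
names are just illustrative:

    /* in CleanupBackgroundWorker(), once the postmaster has reaped a
     * worker that can never run again */
    pid_t       notify_pid = rw->rw_worker.bgw_notify_pid;

    slot->pid = 0;                      /* the part ReportBackgroundWorkerPID() does */
    if (rw->rw_terminate ||
        rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
        ForgetBackgroundWorker(&iter);  /* clears in_use before anyone is woken */
    if (notify_pid != 0)
        kill(notify_pid, SIGUSR1);      /* only now wake the registering backend */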

There's still a possibility that the waiting process might receive
SIGUSR1 for some unrelated reason and check the state of shared memory
just after slot->pid has been set to 0 but before slot->in_use is
cleared and (also relevantly) before
BackgroundWorkerData->parallel_terminate_count is incremented.
Fixing that seems a bit tougher, since we're
certainly not going to have the postmaster start taking locks on
shared data structures, but just forgetting the worker sooner would
probably improve things a lot.  I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Sun, Feb 19, 2017 at 5:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, it looks like there's a race condition here, because the slot
> doesn't get freed up at the same time that the PID gets set to 0.
> That actually happens later, when the postmaster calls
> maybe_start_bgworker() or DetermineSleepTime() and one of those
> functions calls ForgetBackgroundWorker(). We could tighten this up by
> changing CleanupBackgroundWorker() to also call
> ForgetBackgroundWorker() immediately after calling
> ReportBackgroundWorkerPID() if rw->rw_terminate ||
> rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART.  If we do that
> BEFORE sending the notification to the starting process, that closes
> this hole.  Almost.

And here's a patch implementing something along those lines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Instability in select_parallel regression test

From: Amit Kapila
On Sun, Feb 19, 2017 at 5:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Such a change can be made, but as I pointed out in the part you didn't
>> quote, there are reasons to wonder whether that will be a constructive
>> change in real life even if it's better for the regression tests.
>> Optimizing PostgreSQL for the use case of running regression tests in
>> the buildfarm at the expense of other use cases wouldn't be very
>> smart.  Maybe such a change is better in real-world applications too,
>> but that deserves at least a little bit of thought and substantive
>> discussion.
>
> Rewind.  Wait a minute.  Looking at this code again, it looks like
> we're supposed to ALREADY BE DOING THIS.
>
> DestroyParallelContext() calls WaitForParallelWorkersToExit() which
> calls WaitForBackgroundWorkerShutdown() for each worker.  That
> function returns only when the postmaster dies (which causes an error
> with that specific complaint) or when GetBackgroundWorkerPid() sets
> the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns
> BGWH_STOPPED when either (a) handle->generation != slot->generation
> (meaning that the slot got reused, and therefore must have been freed)
> or when (b) slot->pid == 0.  The pid only gets set to 0 in
> BackgroundWorkerStateChange() when slot->terminate is set, or in
> ReportBackgroundWorkerPID() when it's called from
> CleanupBackgroundWorker.  So this function should not be returning
> until after all workers have actually exited.
>

Yeah, I have also noticed this point and was thinking of the way to
close this gap.

> However, it looks like there's a race condition here, because the slot
> doesn't get freed up at the same time that the PID gets set to 0.
> That actually happens later, when the postmaster calls
> maybe_start_bgworker() or DetermineSleepTime() and one of those
> functions calls ForgetBackgroundWorker(). We could tighten this up by
> changing CleanupBackgroundWorker() to also call
> ForgetBackgroundWorker() immediately after calling
> ReportBackgroundWorkerPID() if rw->rw_terminate ||
> rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART.  If we do that
> BEFORE sending the notification to the starting process, that closes
> this hole.  Almost.
>

To close the remaining gap, don't you think we can check the
slot->in_use flag when the generation numbers of the handle and the
slot are the same?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> To close the remaining gap, don't you think we can check the
> slot->in_use flag when the generation numbers of the handle and the
> slot are the same?

That doesn't completely fix it either, because
ForgetBackgroundWorker() also does
BackgroundWorkerData->parallel_terminate_count++, which we might also
fail to see, which would cause RegisterDynamicBackgroundWorker() to
bail out early.  There are CPU ordering effects to think about here,
not just the order in which the operations are actually performed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Instability in select_parallel regression test

From: Amit Kapila
On Sun, Feb 19, 2017 at 8:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> To close the remaining gap, don't you think we can check the
>> slot->in_use flag when the generation numbers of the handle and the
>> slot are the same?
>
> That doesn't completely fix it either, because
> ForgetBackgroundWorker() also does
> BackgroundWorkerData->parallel_terminate_count++, which we might also
> fail to see, which would cause RegisterDynamicBackgroundWorker() to
> bail out early.  There are CPU ordering effects to think about here,
> not just the order in which the operations are actually performed.
>

Sure, I think we can attempt to fix that as well by adding a write
memory barrier in ForgetBackgroundWorker().  The main point is that if
we keep any loose end in this area, there is a chance that the
select_parallel regression test can still fail, if not now, then in
the future.  Another way could be to try to minimize the race
condition here and then adjust select_parallel as suggested above so
that we don't see this failure.
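
Something like the below, perhaps -- just a sketch, with the placement
assumed; the reading side (e.g. RegisterDynamicBackgroundWorker())
would need a matching pg_read_barrier():

    /* in ForgetBackgroundWorker() */
    if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
        BackgroundWorkerData->parallel_terminate_count++;
    slot->in_use = false;
    pg_write_barrier();     /* make both stores visible before the slot
                             * can be noticed as free */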


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Mon, Feb 20, 2017 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Feb 19, 2017 at 8:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Feb 19, 2017 at 6:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> To close the remaining gap, don't you think we can check the
>>> slot->in_use flag when the generation numbers of the handle and the
>>> slot are the same?
>>
>> That doesn't completely fix it either, because
>> ForgetBackgroundWorker() also does
>> BackgroundWorkerData->parallel_terminate_count++, which we might also
>> fail to see, which would cause RegisterDynamicBackgroundWorker() to
>> bail out early.  There are CPU ordering effects to think about here,
>> not just the order in which the operations are actually performed.
>>
>
> Sure, I think we can attempt to fix that as well by adding a write
> memory barrier in ForgetBackgroundWorker().

I don't think so.

> The main point is that if we
> keep any loose end in this area, there is a chance that the
> select_parallel regression test can still fail, if not now, then in
> the future.  Another way could be to try to minimize the race
> condition here and then adjust select_parallel as suggested above
> so that we don't see this failure.

My guess is that if we apply the fix I suggested above, it'll be good
enough.  If that turns out not to be true, then I guess we'll have to
deal with that, but why not do the easy thing first?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Instability in select_parallel regression test

From: Amit Kapila
On Sun, Feb 26, 2017 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> The main point is that if we
>> keep any loose end in this area, there is a chance that the
>> select_parallel regression test can still fail, if not now, then in
>> the future.  Another way could be to try to minimize the race
>> condition here and then adjust select_parallel as suggested above
>> so that we don't see this failure.
>
> My guess is that if we apply the fix I suggested above, it'll be good
> enough.  If that turns out not to be true, then I guess we'll have to
> deal with that, but why not do the easy thing first?
>

Okay, that is also a sensible approach.  Your patch looks good to me,
though I have not tested it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Instability in select_parallel regression test

From: Robert Haas
On Mon, Feb 27, 2017 at 8:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> My guess is that if we apply the fix I suggested above, it'll be good
>> enough.  If that turns out not to be true, then I guess we'll have to
>> deal with that, but why not do the easy thing first?
>
> Okay, that is also a sensible approach.  Your patch looks good to me,
> though I have not tested it.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company