Re: [HACKERS] Instability in select_parallel regression test - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] Instability in select_parallel regression test
Date
Msg-id CAA4eK1+kwZavoPzC=N2NnzM0po_XuyLUJavHLN8f0FyPmiqocA@mail.gmail.com
Whole thread Raw
In response to [HACKERS] Instability in select_parallel regression test  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Instability in select_parallel regression test
List pgsql-hackers
On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm members gaur, pademelon, and gharial have all recently shown
> failures like this:
>
> *** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out  Thu Feb 16 20:35:14 2017
> --- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out   Thu Feb 16 21:17:58 2017
> ***************
> *** 163,167 ****
>   -- provoke error in worker
>   select stringu1::int2 from tenk1 where unique1 = 1;
>   ERROR:  invalid input syntax for integer: "BAAAAA"
> - CONTEXT:  parallel worker
>   rollback;
> --- 163,166 ----
>
> ======================================================================
>
> The failure appears intermittent on gharial but is quite reproducible
> on gaur/pademelon.  I inserted some debug elog calls and got this trace
> of events:
>
> 2017-02-17 00:28:32.641 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 1
> 2017-02-17 00:28:32.641 EST [18934] STATEMENT:  select count(*) from a_star;
> 2017-02-17 00:28:32.643 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 1 workers
> 2017-02-17 00:28:32.643 EST [18934] STATEMENT:  select count(*) from a_star;
> 2017-02-17 00:28:32.731 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 2
> 2017-02-17 00:28:32.731 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.824 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
> 2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.824 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 3
> 2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.923 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
> 2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.923 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 4
> 2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.989 EST [18934] LOG:  LaunchParallelWorkers: registered 3 of 4 workers
> 2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:32.989 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 5
> 2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:33.040 EST [18934] LOG:  LaunchParallelWorkers: registered 4 of 4 workers
> 2017-02-17 00:28:33.040 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
> 2017-02-17 00:28:33.093 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 6
> 2017-02-17 00:28:33.093 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.173 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
> 2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.173 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 7
> 2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.253 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
> 2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.557 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
> 2017-02-17 00:28:33.557 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 4 workers
> 2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
> 2017-02-17 00:28:33.703 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.703 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 1 workers
> 2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.704 EST [18934] ERROR:  invalid input syntax for integer: "BAAAAA"
> 2017-02-17 00:28:33.704 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 7
 
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 6
 
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 5
 
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 4
 
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 3
 
> 2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 2
 
> 2017-02-17 00:28:33.855 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker
forPID 18934" slot 1
 
>
> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
> wait for parallel workers to finish (as its comment suggests is
> necessary), so that on not-too-speedy machines the worker slots may all
> still be in use when the next command wants some.
>

ExecShutdownGatherWorkers() do wait for workers to exit/finish, but it
doesn't wait for the postmaster to free the used slots and that is how
that API is supposed to work.  There is good chance that on slow
machines the slots get freed up much later by postmaster after the
workers have exited.

>  This does not seem
> like a good thing from a reproducibility or error-detection standpoint.
> It is *certainly* not what you would expect from reading any of the
> comments in the vicinity.
>
> I don't know if this is a generic logic failure or if there's something
> platform-specific about it --- it's suspicious that it's only showing up
> on ia64 and hppa architectures.
>
> I also find it curious that nobody had seen this at all until the past
> couple days.  I suspect that 51ee6f316 changed the number of parallel
> workers demanded by the select_parallel regression test, which is, shall
> we say, not what its commit log message would suggest.
>

The commit 51ee6f316 doesn't change parallel workers demanded by
select_parallel, rather it is changed in commit 5e6d8d2 which has
added a new test in select_parallel.  I think that is somewhat clear
from the logs below where the first test where LaunchParallelWorkers
is failed is introduced by commit 5e6d8d2.

> 2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
> 2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);
> 2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
> 2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
>                 (select hundred, thousand from tenk2 where thousand > 100);


Now, I think that the test added by commit 5e6d8d2 doesn't have a
problem because it is normally okay to run parallel query tests even
when required workers are not available.  I think what we need to do
here is to move the test that needs workers to execute before other
parallel query tests where there is no such requirement.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andreas Karlsson
Date:
Subject: Re: [HACKERS] REINDEX CONCURRENTLY 2.0
Next
From: Rushabh Lathia
Date:
Subject: Re: [HACKERS] Gather Merge