[HACKERS] Instability in select_parallel regression test - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | [HACKERS] Instability in select_parallel regression test |
Date | |
Msg-id | 30673.1487310734@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: [HACKERS] Instability in select_parallel regression test
|
List | pgsql-hackers |
Buildfarm members gaur, pademelon, and gharial have all recently shown failures like this: *** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out Thu Feb 16 20:35:14 2017 --- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out Thu Feb 16 21:17:58 2017 *************** *** 163,167 **** -- provoke error in worker select stringu1::int2 from tenk1 where unique1 = 1; ERROR: invalid input syntaxfor integer: "BAAAAA" - CONTEXT: parallel worker rollback; --- 163,166 ---- ====================================================================== The failure appears intermittent on gharial but is quite reproducible on gaur/pademelon. I inserted some debug elog calls and got this trace of events: 2017-02-17 00:28:32.641 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 1 2017-02-17 00:28:32.641 EST [18934] STATEMENT: select count(*) from a_star; 2017-02-17 00:28:32.643 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 1 workers 2017-02-17 00:28:32.643 EST [18934] STATEMENT: select count(*) from a_star; 2017-02-17 00:28:32.731 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 2 2017-02-17 00:28:32.731 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.824 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 4 workers 2017-02-17 00:28:32.824 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.824 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 3 2017-02-17 00:28:32.824 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.923 EST [18934] LOG: LaunchParallelWorkers: registered 2 of 4 workers 2017-02-17 00:28:32.923 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.923 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 4 2017-02-17 00:28:32.923 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.989 EST [18934] LOG: LaunchParallelWorkers: registered 3 of 4 workers 2017-02-17 00:28:32.989 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:32.989 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 5 2017-02-17 00:28:32.989 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:33.040 EST [18934] LOG: LaunchParallelWorkers: registered 4 of 4 workers 2017-02-17 00:28:33.040 EST [18934] STATEMENT: select length(stringu1) from tenk1 group by length(stringu1); 2017-02-17 00:28:33.093 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 6 2017-02-17 00:28:33.093 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.173 EST [18934] LOG: LaunchParallelWorkers: registered 1 of 4 workers 2017-02-17 00:28:33.173 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.173 EST [18934] LOG: RegisterDynamicBackgroundWorker: grabbing slot 7 2017-02-17 00:28:33.173 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.253 EST [18934] LOG: LaunchParallelWorkers: registered 2 of 4 workers 2017-02-17 00:28:33.253 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.253 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8 2017-02-17 00:28:33.253 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.254 EST [18934] LOG: LaunchParallelWorkers: failed to register 3 of 4 workers 2017-02-17 00:28:33.254 EST [18934] STATEMENT: select count(*) from tenk1 where (two, four) not in (select hundred, thousandfrom tenk2 where thousand > 100); 2017-02-17 00:28:33.557 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8 2017-02-17 00:28:33.557 EST [18934] STATEMENT: select count((unique1)) from tenk1 where hundred > 1; 2017-02-17 00:28:33.557 EST [18934] LOG: LaunchParallelWorkers: failed to register 1 of 4 workers 2017-02-17 00:28:33.557 EST [18934] STATEMENT: select count((unique1)) from tenk1 where hundred > 1; 2017-02-17 00:28:33.703 EST [18934] LOG: RegisterDynamicBackgroundWorker: no free slots of 8 2017-02-17 00:28:33.703 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1; 2017-02-17 00:28:33.703 EST [18934] LOG: LaunchParallelWorkers: failed to register 1 of 1 workers 2017-02-17 00:28:33.703 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1; 2017-02-17 00:28:33.704 EST [18934] ERROR: invalid input syntax for integer: "BAAAAA" 2017-02-17 00:28:33.704 EST [18934] STATEMENT: select stringu1::int2 from tenk1 where unique1 = 1; 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 7 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 6 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 5 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 4 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 3 2017-02-17 00:28:33.854 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 2 2017-02-17 00:28:33.855 EST [18336] LOG: ForgetBackgroundWorker: unregistering background worker "parallel worker for PID18934" slot 1 In short, it looks to me like ExecShutdownGatherWorkers doesn't actually wait for parallel workers to finish (as its comment suggests is necessary), so that on not-too-speedy machines the worker slots may all still be in use when the next command wants some. This does not seem like a good thing from a reproducibility or error-detection standpoint. It is *certainly* not what you would expect from reading any of the comments in the vicinity. I don't know if this is a generic logic failure or if there's something platform-specific about it --- it's suspicious that it's only showing up on ia64 and hppa architectures. I also find it curious that nobody had seen this at all until the past couple days. I suspect that 51ee6f316 changed the number of parallel workers demanded by the select_parallel regression test, which is, shall we say, not what its commit log message would suggest. regards, tom lane
pgsql-hackers by date: