Thread: Instability in partition_prune test?
Hi, Apologies if this was already discussed, I didn't see it. One of my animals elver had a one-off failure this morning: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05 partition_prune ... FAILED Subplans Removed: 6 -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) ! -> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) -> Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) --- 1608,1614 ---- Subplans Removed: 6 -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) ! -> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) -> Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1) Filter: ((a >= $1) AND (a <= $2) AND (b < 4)) This is a Parallel Append with three processes working on three subplans. It looks like one of the subplans got executed twice? -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Apologies if this was already discussed, I didn't see it.  One of my
> animals elver had a one-off failure this morning:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05

Yeah, this looks very much like my recent complaint:

https://www.postgresql.org/message-id/1876.1523224471%40sss.pgh.pa.us

Yours is the third case so far.  But we've had little luck reproducing
it, so there's no way to be sure whether Rowley's proposed fix helps.
If you can help test that, please do.

			regards, tom lane
On 13 April 2018 at 10:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05
>
> partition_prune              ... FAILED
>
>    Subplans Removed: 6
>    ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !  ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>    ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> --- 1608,1614 ----
>    Subplans Removed: 6
>    ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !  ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>    ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>
> This is a Parallel Append with three processes working on three
> subplans.  It looks like one of the subplans got executed twice?

Hi Thomas,

Thanks for the report.  If you're able to run the scripts in [1],
please confirm whether you can reproduce the failure; if so, revert
the code back to 5c0675215 and see if you still get the additional
loop.

You'll need to update the expected results once back on 5c0675215, as
the 6 subplans will no longer be removed.

I've been unable to reproduce this, so I'm keen to get results from a
machine that can.

[1] https://www.postgresql.org/message-id/CAKJS1f8o2Yd%3DrOP%3DEt3A0FWgF%2BgSAOkFSU6eNhnGzTPV7nN8sQ%40mail.gmail.com

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 13, 2018 at 1:21 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 April 2018 at 10:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> This is a Parallel Append with three processes working on three
>> subplans.  It looks like one of the subplans got executed twice?
>
> Thanks for the report.  If you're able to run the scripts in [1] and
> confirm you can reproduce the failure, then revert the code back to
> 5c0675215 and see if you still get the additional loop.
>
> I've been unable to reproduce this, so I'm keen to get results from a
> machine that can.

I tried for some time with 1, 2 and 3 parallel copies of that shell
script you showed (after tweaking it to stick $$ into the filename
used for test.out so they didn't clobber each other).  No dice; I
couldn't make it fail.

So, as the option of last resort, I tried thinking about what's
actually going on here.  I think it might be working as designed!  The
participants are allowed to run other subplans, because they're
parallel subplans: it's just a question of whether any backend manages
to complete its subplan and then go looking for another subplan to
attack before that one is marked as complete.  If that's true, then
the loop count shown can't technically be made stable, can it?  Unless
you arrange for the scans not to be run by more than one worker.

--
Thomas Munro
http://www.enterprisedb.com
On 13 April 2018 at 14:37, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I think it might be working as designed!  The participants are allowed
> to run other subplans, because they're parallel subplans: it's just a
> question of whether any backend manages to complete its subplan and
> then go looking for another subplan to attack before it is marked as
> complete.  If that's true, then the loop number shown can't
> technically be made stable, can it?  Unless you make it so that the
> scans were not allowed to be run by more than one worker.

Yeah, you're right.  It's pretty easy to show this by giving Parallel
Append a small subnode and a larger subnode to work on:

drop table if exists t1;
drop table if exists t2;
create table t1 (a int);
create table t2 (a int);
insert into t1 select generate_series(1,1000000);
insert into t2 select generate_series(1,10000);

explain (analyze, costs off, summary off, timing off)
select count(*) from (select * from t1 union all select * from t2) t1_t2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=336667 loops=3)
                     ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
                     ->  Parallel Seq Scan on t2 (actual rows=10000 loops=1)
(8 rows)

I'll just need to go think about how we can make the test stable now.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
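[Editorial aside: the work-stealing race described above can be sketched as a toy model. This is purely illustrative; the function name, queue structure, and round-robin scheduling policy here are invented for the sketch and are not PostgreSQL's actual choose_next_subplan logic.]

```python
from collections import deque

# Toy model of Parallel Append scheduling (NOT PostgreSQL code): each
# subplan is a queue of row batches, and each worker starts on its own
# subplan.  A worker that exhausts its subplan moves to the first
# subplan that still has work, so a single subplan can be executed by
# more than one worker -- which EXPLAIN ANALYZE reports as loops > 1.

def parallel_append_loops(batches, n_workers):
    queues = [deque(range(n)) for n in batches]
    executors = [set() for _ in batches]   # workers that touched each subplan
    current = list(range(n_workers))       # worker w starts on subplan w
    progressed = True
    while progressed:
        progressed = False
        for w in range(n_workers):
            i = current[w]
            if i is not None and not queues[i]:
                # Current subplan exhausted: grab another incomplete one.
                remaining = [j for j, q in enumerate(queues) if q]
                i = current[w] = remaining[0] if remaining else None
            if i is not None and queues[i]:
                queues[i].popleft()        # process one batch
                executors[i].add(w)
                progressed = True
    return [len(s) for s in executors]     # the "loops" counts

# Worker 1 finishes the small middle subplan early and then helps with
# the others, so two subplans show loops=2 even though each batch is
# read exactly once:
print(parallel_append_loops([4, 1, 4], 3))   # -> [2, 1, 2]
```

Under this model the loop counts depend entirely on the relative progress of the workers, which is why no fixed expected output can be stable when multiple workers may run.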
On 13 April 2018 at 14:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll just need to go think about how we can make the test stable now.

Thomas and I discussed this a bit off-list.  The attached basically
adds:

set max_parallel_workers = 0;

before the Parallel Append tests.  All those tests were intended to
check that "(never executed)" appeared for the correct nodes.  They
still do that.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley wrote:
> On 13 April 2018 at 14:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I'll just need to go think about how we can make the test stable now.
>
> Thomas and I discussed this a bit off-list.
>
> The attached basically adds:
>
> set max_parallel_workers = 0;
>
> before the Parallel Append tests.
>
> All those tests were intended to check that "(never executed)"
> appeared for the correct nodes.  They still do that.

Makes sense -- pushed.  Thanks David and Thomas for your efforts
tracking this down.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> The attached basically adds:
> set max_parallel_workers = 0;

It seems quite silly to be asking for a parallel plan and then
insisting it not run in parallel.

Maybe the right solution is to strip out the loop count from what's
printed.  We've already done that sort of thing in at least one other
test, using some plpgsql code to "sed" the EXPLAIN output.

			regards, tom lane
Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> The attached basically adds:
>> set max_parallel_workers = 0;
>
> It seems quite silly to be asking for a parallel plan and then
> insisting it not run in parallel.

The idea is to use the parallel append code, but run it in the leader.
Now that you mention it, though, this probably decreases coverage for
the choose_next_subplan_for_worker function.

> Maybe the right solution is to strip out the loop_count from what's
> printed.  We've already done that sort of thing in at least one other
> test, using some plpgsql code to "sed" the EXPLAIN output.

Ah, explain_parallel_sort_stats() ... maybe that's an idea, yeah.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
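[Editorial aside: the "sed the EXPLAIN output" idea amounts to masking the timing-dependent fields of EXPLAIN ANALYZE text before comparing it to expected output. The regression tests do this with a plpgsql function such as explain_parallel_sort_stats(); the Python sketch below merely illustrates the kind of substitution involved, and the exact set of fields masked here is a guess, not the actual filter used by the tests.]

```python
import re

def normalize_explain(text: str) -> str:
    # Mask EXPLAIN ANALYZE fields that depend on how work happened to be
    # divided among parallel workers, so the output is stable across runs.
    text = re.sub(r'loops=\d+', 'loops=N', text)
    text = re.sub(r'Workers Launched: \d+', 'Workers Launched: N', text)
    return text

line = "->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2)"
print(normalize_explain(line))
# ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=N)
```

The tradeoff is the one discussed below: masking the output keeps the parallel executor paths exercised, whereas forcing zero workers stabilizes the output by giving up that coverage.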
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> It seems quite silly to be asking for a parallel plan and then
>> insisting it not run in parallel.
>
> Now that you mention it, this probably decreases coverage for the
> choose_next_subplan_for_worker function.

Yeah, loss of executor code coverage was what concerned me.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Now that you mention it, this probably decreases coverage for the
>> choose_next_subplan_for_worker function.
>
> Yeah, loss of executor code coverage was what concerned me.

Here's a proposed patch for this.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Yeah, loss of executor code coverage was what concerned me.
>
> Here's a proposed patch for this.

Seems reasonable.  I'm still uncomfortable with the assumption that if
we ask for two workers we will get two workers, but that's a
pre-existing problem in other parallel regression tests.

			regards, tom lane
Tom Lane wrote:
> Seems reasonable.  I'm still uncomfortable with the assumption that
> if we ask for two workers we will get two workers, but that's a
> pre-existing problem in other parallel regression tests.

Yeah, I was looking at that line and wondering.  But I think that'd
require a different approach (*if* we ever see it fail, which I'm not
sure we have), such as suppressing the Workers Launched lines without
a plpgsql function to do it, since that problem is much more prevalent
than this one.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Seems reasonable.  I'm still uncomfortable with the assumption that
>> if we ask for two workers we will get two workers, but that's a
>> pre-existing problem in other parallel regression tests.
>
> Yeah, I was looking at that line and wondering.  But I think that'd
> require a different approach (*if* we see it fail, which I'm not sure
> we have), such as suppressing the Workers Launched lines without a
> plpgsql function to do it, since it's much more prevalent than this
> problem.

At least in this case, some of the "row" counts also depend on the
number of workers, no?  So just hiding that line wouldn't do it.

Anyway, I agree that we shouldn't solve that problem until we see that
it's a problem in practice.

			regards, tom lane
On 17 April 2018 at 09:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> At least in this case, some of the "row" counts also depend on the
> number of workers, no?  So just hiding that line wouldn't do it.
>
> Anyway, I agree that we shouldn't solve that problem until we see
> that it's a problem in practice.

I agree.  The solution to that problem, if it ever comes up, may end
up just being to run the particular test in a parallel group with just
that test, or with fewer tests.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services