Thread: Instability in partition_prune test?

Instability in partition_prune test?

From: Thomas Munro

Hi,

Apologies if this was already discussed; I didn't see it.  One of my
buildfarm animals, elver, had a one-off failure this morning:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05

     partition_prune              ... FAILED

                       Subplans Removed: 6
                       ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
!                      ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1608,1614 ----
                       Subplans Removed: 6
                       ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
!                      ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

This is a Parallel Append with three processes working on three
subplans.  It looks like one of the subplans got executed twice?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Instability in partition_prune test?

From: Tom Lane

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Apologies if this was already discussed; I didn't see it.  One of my
> buildfarm animals, elver, had a one-off failure this morning:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05

Yeah, this looks very much like my recent complaint:

https://www.postgresql.org/message-id/1876.1523224471%40sss.pgh.pa.us

Yours is the third case so far.  But we've had little luck reproducing it,
so there's no way to be sure if Rowley's proposed fix helps.  If you can
help test that, please do.

            regards, tom lane


Re: Instability in partition_prune test?

From: David Rowley

On 13 April 2018 at 10:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05
>
>      partition_prune              ... FAILED
>
>                        Subplans Removed: 6
>                        ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !                      ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>                        ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> --- 1608,1614 ----
>                        Subplans Removed: 6
>                        ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !                      ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>                        ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>                              Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>
> This is a Parallel Append with three processes working on three
> subplans.  It looks like one of the subplans got executed twice?

Hi Thomas,

Thanks for the report. If you're able to run the scripts in [1], please
check whether you can reproduce the failure; if so, revert the code
back to 5c0675215 and see if you still get the additional loop.

You'll need to update the expected results once back in 5c0675215 as
the 6 subplans will no longer be removed.

I've been unable to reproduce this myself, so I'm keen to get results
from a machine that can.

[1] https://www.postgresql.org/message-id/CAKJS1f8o2Yd%3DrOP%3DEt3A0FWgF%2BgSAOkFSU6eNhnGzTPV7nN8sQ%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Instability in partition_prune test?

From: Thomas Munro

On Fri, Apr 13, 2018 at 1:21 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 April 2018 at 10:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> This is a Parallel Append with three processes working on three
>> subplans.  It looks like one of the subplans got executed twice?
>
> Hi Thomas,
>
> Thanks for the report. If you're able to run the scripts in [1], please
> check whether you can reproduce the failure; if so, revert the code
> back to 5c0675215 and see if you still get the additional loop.
>
> You'll need to update the expected results once back in 5c0675215 as
> the 6 subplans will no longer be removed.
>
> I've been unable to reproduce this myself, so I'm keen to get results
> from a machine that can.

I tried for some time with 1, 2 and 3 parallel copies of that shell
script you showed (after tweaking it to stick $$ into the filename
used for test.out so they didn't clobber each other).  No dice,
couldn't make it fail.

So, as the option of last resort, I tried thinking about what's
actually going on here.

I think it might be working as designed!  The participants are allowed
to run other subplans, because they're parallel sub-plans: it's just a
question of whether any backend manages to complete its subplan and
then go looking for another subplan to attack before it is marked as
complete.  If that's true, then the loop number shown can't
technically be made stable, can it?  Unless you make it so that the
scans are not allowed to be run by more than one worker.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Instability in partition_prune test?

From: David Rowley

On 13 April 2018 at 14:37, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I think it might be working as designed!  The participants are allowed
> to run other subplans, because they're parallel sub-plans: it's just a
> question of whether any backend manages to complete its subplan and
> then go looking for another subplan to attack before it is marked as
> complete.  If that's true, then the loop number shown can't
> technically be made stable, can it?  Unless you make it so that the
> scans are not allowed to be run by more than one worker.

Yeah, you're right.

It's pretty easy to show this by giving Parallel Append a small
subnode and a larger subnode to work on:

drop table if exists t1;
drop table if exists t2;
create table t1 (a int);
create table t2 (a int);
insert into t1 select generate_series(1,1000000);
insert into t2 select generate_series(1,10000);

explain (analyze, costs off, summary off, timing off) select count(*)
from (select * from t1 union all select * from t2) t1_t2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=336667 loops=3)
                     ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
                     ->  Parallel Seq Scan on t2 (actual rows=10000 loops=1)
(8 rows)

I'll just need to go think about how we can make the test stable now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Instability in partition_prune test?

From: David Rowley

On 13 April 2018 at 14:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll just need to go think about how we can make the test stable now.

Thomas and I discussed this a bit off-list.

The attached basically adds:

set max_parallel_workers = 0;

before the Parallel Append tests.

All those tests were intended to do was check that "(never executed)"
appeared for the correct nodes. They still do that.
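
For illustration, here's a minimal sketch of the idea, reusing the t1/t2
tables from my earlier example (this is just the principle, not the
patch itself):

-- Keep the parallel plan shape, but launch no workers: everything runs
-- in the leader, so the actual rows/loops counts become deterministic.
set max_parallel_workers = 0;

explain (analyze, costs off, summary off, timing off) select count(*)
from (select * from t1 union all select * from t2) t1_t2;

reset max_parallel_workers;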

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Instability in partition_prune test?

From: Alvaro Herrera

David Rowley wrote:
> On 13 April 2018 at 14:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > I'll just need to go think about how we can make the test stable now.
> 
> Thomas and I discussed this a bit off-list.
> 
> The attached basically adds:
> 
> set max_parallel_workers = 0;
> 
> before the Parallel Append tests.
> 
> All those tests were intended to do was check that "(never executed)"
> appeared for the correct nodes. They still do that.

Makes sense -- pushed.  Thanks David and Thomas for your efforts
tracking this down.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Instability in partition_prune test?

From: Tom Lane

David Rowley <david.rowley@2ndquadrant.com> writes:
> The attached basically adds:
> set max_parallel_workers = 0;

It seems quite silly to be asking for a parallel plan and then insisting
it not run in parallel.

Maybe the right solution is to strip out the loop_count from what's
printed.  We've already done that sort of thing in at least one other
test, using some plpgsql code to "sed" the EXPLAIN output.
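
Roughly along these lines (just a sketch; the function name and the
exact regexp here are made up, not taken from the existing test):

-- Run EXPLAIN ANALYZE for the given query and return its output with
-- the volatile actual rows/loops counts masked, so the expected file
-- no longer depends on which process ran which subplan.
create function explain_masked(query text) returns setof text
language plpgsql as
$$
declare
    ln text;
begin
    for ln in
        execute 'explain (analyze, costs off, summary off, timing off) ' || query
    loop
        return next regexp_replace(ln, 'actual rows=\d+ loops=\d+',
                                       'actual rows=N loops=N');
    end loop;
end;
$$;

-- usage: select explain_masked('select count(*) from t1');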

            regards, tom lane


Re: Instability in partition_prune test?

From: Alvaro Herrera

Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > The attached basically adds:
> > set max_parallel_workers = 0;
> 
> It seems quite silly to be asking for a parallel plan and then insisting
> it not run in parallel.

The idea is to use the parallel append code, but run it in the leader.

Now that you mention it, this probably decreases coverage for the
choose_next_subplan_for_worker function.

> Maybe the right solution is to strip out the loop_count from what's
> printed.  We've already done that sort of thing in at least one other
> test, using some plpgsql code to "sed" the EXPLAIN output.

Ah, explain_parallel_sort_stats() ... maybe that's an idea, yeah.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Instability in partition_prune test?

From: Tom Lane

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> It seems quite silly to be asking for a parallel plan and then insisting
>> it not run in parallel.

> Now that you mention it, this probably decreases coverage for the
> choose_next_subplan_for_worker function.

Yeah, loss of executor code coverage was what concerned me.

            regards, tom lane


Re: Instability in partition_prune test?

From: Alvaro Herrera

Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> It seems quite silly to be asking for a parallel plan and then insisting
> >> it not run in parallel.
> 
> > Now that you mention it, this probably decreases coverage for the
> > choose_next_subplan_for_worker function.
> 
> Yeah, loss of executor code coverage was what concerned me.

Here's a proposed patch for this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Instability in partition_prune test?

From: Tom Lane

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Yeah, loss of executor code coverage was what concerned me.

> Here's a proposed patch for this.

Seems reasonable.  I'm still uncomfortable with the assumption
that if we ask for two workers we will get two workers, but
that's a pre-existing problem in other parallel regression tests.

            regards, tom lane


Re: Instability in partition_prune test?

From: Alvaro Herrera

Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> Yeah, loss of executor code coverage was what concerned me.
> 
> > Here's a proposed patch for this.
> 
> Seems reasonable.  I'm still uncomfortable with the assumption
> that if we ask for two workers we will get two workers, but
> that's a pre-existing problem in other parallel regression tests.

Yeah, I was looking at that line and wondering.  But I think that'd
require a different approach (*if* we ever see it fail, which I'm not
sure we have), such as a way to suppress the Workers Launched lines
without needing a plpgsql function, since that issue is much more
prevalent than this one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Instability in partition_prune test?

From: Tom Lane

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Seems reasonable.  I'm still uncomfortable with the assumption
>> that if we ask for two workers we will get two workers, but
>> that's a pre-existing problem in other parallel regression tests.

> Yeah, I was looking at that line and wondering.  But I think that'd
> require a different approach (*if* we ever see it fail, which I'm not
> sure we have), such as a way to suppress the Workers Launched lines
> without needing a plpgsql function, since that issue is much more
> prevalent than this one.

At least in this case, some of the "row" counts also depend on the
number of workers, no?  So just hiding that line wouldn't do it.

Anyway, I agree that we shouldn't solve that problem until we see
that it's a problem in practice.

            regards, tom lane


Re: Instability in partition_prune test?

From: David Rowley

On 17 April 2018 at 09:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Tom Lane wrote:
>>> Seems reasonable.  I'm still uncomfortable with the assumption
>>> that if we ask for two workers we will get two workers, but
>>> that's a pre-existing problem in other parallel regression tests.
>
>> Yeah, I was looking at that line and wondering.  But I think that'd
>> require a different approach (*if* we ever see it fail, which I'm not
>> sure we have), such as a way to suppress the Workers Launched lines
>> without needing a plpgsql function, since that issue is much more
>> prevalent than this one.
>
> At least in this case, some of the "row" counts also depend on the
> number of workers, no?  So just hiding that line wouldn't do it.
>
> Anyway, I agree that we shouldn't solve that problem until we see
> that it's a problem in practice.

I agree. The solution to that problem, if it ever comes up, may end up
just being to run that particular test in a parallel group by itself,
or with fewer other tests.
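
For example, something like this in src/test/regress/parallel_schedule
(hypothetical, just to illustrate the idea):

# give partition_prune a group of its own so that concurrent tests
# don't compete with it for parallel workers
test: partition_prune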


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services