Thread: parallel.c is not marked as test covered

parallel.c is not marked as test covered

From
Clément Prévost
Date:
The entire parallel.c reported test coverage is zero: http://coverage.postgresql.org/src/backend/access/transam/parallel.c.gcov.html

It seems that it's not covered by the original 924bcf4f commit, but I don't know if that's on purpose. This feature being really new, that would be understandable.

If it's not, my first idea would be to fix this by adding a simple SQL test in /src/test/regress, in the "sql" and "expected" subdirectories, that runs explain (costs off) on some queries.
I'm also guessing here that we can force a query to have a parallel plan with some cost magic, to avoid dealing with sufficiently large datasets.

Re: parallel.c is not marked as test covered

From
David Rowley
Date:
On 9 May 2016 at 09:12, Clément Prévost <prevostclement@gmail.com> wrote:
> The entire parallel.c reported test coverage is zero:
> http://coverage.postgresql.org/src/backend/access/transam/parallel.c.gcov.html
>
> It seem that it's not covered by the original 924bcf4f commit but I don't
> know if it's on purpose. This feature being really new that would be
> understandable.
>
> If it's not, my first idea would be to fix this by adding a simple sql test
> in /src/test/regress, in the "sql" and "expected" subdirectories that
> explain (cost off) some queries.
> I'm also guessing here that we can force a query to have a parallel plan
> with some cost magic to avoid dealing with sufficiently large datasets.

I'm not entirely sure which machine generates that coverage output,
but the problem with it is that it's just one machine. We do have at
least one buildfarm member which runs with force_parallel_mode =
regress. See http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=mandrill&br=HEAD

If the coverage was run from that machine then you'd see something
higher than 0% on parallel.c.

It would be nice to improve this a bit and have the planner generate a
handful of parallel plans even without force_parallel_mode = regress.
I do believe we can now do this without having to create large tables
in the regression tests if we set parallel_setup_cost to something
low, perhaps 0 and set the table's parallel_degree to something higher
than 1. The problem with that is knowing what should actually be
tested. It seems like pretty much any plan which can generate a
parallel query is a good candidate for writing a test for, which makes
the choices for which tests to write quite hard. It makes you realise
why Robert went down the force_parallel_mode = regress path instead.
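For example, something along these lines might do it (untested sketch):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_degree = 2;                  -- must be above zero for workers to be planned
alter table tenk1 set (parallel_degree = 2);  -- per-table setting, avoids needing a large table
explain (costs off) select count(*) from tenk1;
alter table tenk1 reset (parallel_degree);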

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: parallel.c is not marked as test covered

From
Alvaro Herrera
Date:
David Rowley wrote:

> I'm not entirely sure which machine generates that coverage output,
> but the problem with it is that it's just one machine. We do have at
> least one buildfarm member which runs with force_parallel_mode =
> regress.

It's not a buildfarm machine, but a machine setup specifically for
coverage reports.  It runs "make check-world" only.  I can add some
additional command(s) to run, if somebody can suggest something.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
David Rowley
Date:
On 9 May 2016 at 13:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> David Rowley wrote:
>
>> I'm not entirely sure which machine generates that coverage output,
>> but the problem with it is that it's just one machine. We do have at
>> least one buildfarm member which runs with force_parallel_mode =
>> regress.
>
> It's not a buildfarm machine, but a machine setup specifically for
> coverage reports.  It runs "make check-world" only.  I can add some
> additional command(s) to run, if somebody can suggest something.

I'm not familiar with what's possible with make check-world, but would
it be reasonable to make that just do another regression test run with
force_parallel_mode set to regress?

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: parallel.c is not marked as test covered

From
Alvaro Herrera
Date:
David Rowley wrote:
> On 9 May 2016 at 13:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > It's not a buildfarm machine, but a machine setup specifically for
> > coverage reports.  It runs "make check-world" only.  I can add some
> > additional command(s) to run, if somebody can suggest something.
> 
> I'm not familiar with what's possible with make check-world, but would
> it be reasonable to make that just do another regression test run with
> force_parallel_mode set to regress?

My suggestion isn't to change what "make check-world" does; it's just to
change the script in the coverage machine so that it runs some other
command after "make check-world" and before "make coverage-html".

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
David Rowley
Date:
On 9 May 2016 at 14:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> David Rowley wrote:
>> On 9 May 2016 at 13:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> > It's not a buildfarm machine, but a machine setup specifically for
>> > coverage reports.  It runs "make check-world" only.  I can add some
>> > additional command(s) to run, if somebody can suggest something.
>>
>> I'm not familiar with what's possible with make check-world, but would
>> it be reasonable to make that just do another regression test run with
>> force_parallel_mode set to regress?
>
> My suggestion isn't to change what "make check-world" does; it's just to
> change the script in the coverage machine so that it runs some other
> command after "make check-world" and before "make coverage-html".

I understand what you meant. I was just taking the suggestion one step
further as I thought that if you were adding that for the coverage
test then it might actually also be a good idea that it occurred in
make check-world. The parallel tests are quite thin as it is, so
perhaps it's a good idea to get more machines running through the
parallel code.

pg_regress seems to support --temp-config, so we could just have a
config file which has max_parallel_degree = 2 and force_parallel_mode
= regress.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: parallel.c is not marked as test covered

From
Andres Freund
Date:
On 2016-05-08 22:20:55 -0300, Alvaro Herrera wrote:
> David Rowley wrote:
> 
> > I'm not entirely sure which machine generates that coverage output,
> > but the problem with it is that it's just one machine. We do have at
> > least one buildfarm member which runs with force_parallel_mode =
> > regress.
> 
> It's not a buildfarm machine, but a machine setup specifically for
> coverage reports.  It runs "make check-world" only.  I can add some
> additional command(s) to run, if somebody can suggest something.

I think it's a good idea to run a force-parallel run on some buildfarm
members. But I'm rather convinced that the core tests run by all animals
need some minimal coverage of parallel queries. Both because otherwise
it'll be hard to get some coverage of unusual platforms, and because
it's imo something rather relevant to test during development.

- Andres



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think it's a good idea to run a force-parallel run on some buildfarm
> members.

Noah's already doing that on at least one of his critters.  But some more
wouldn't hurt.

> But I'm rather convinced that the core tests run by all animals
> need some minimal coverage of parallel queries. Both because otherwise
> it'll be hard to get some coverage of unusual platforms, and because
> it's imo something rather relevant to test during development.

+1.  Experimenting with what we might do, it seems like it's harder to get
the planner to use a parallel plan than you would think.

regression=# explain select count(*) from tenk1;
                                            QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=295.29..295.30 rows=1 width=8)
   ->  Index Only Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..270.29 rows=10000 width=0)
(2 rows)

regression=# set enable_indexscan TO 0;
SET
regression=# explain select count(*) from tenk1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=483.00..483.01 rows=1 width=8)
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
(2 rows)

regression=# set force_parallel_mode TO on;
SET
regression=# explain select count(*) from tenk1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=483.00..483.01 rows=1 width=8)
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
(2 rows)

Methinks force_parallel_mode is a bit broken.

Also, once you *do* get it to make a parallel plan:

regression=# create table foo as select generate_series(1,1000000) g;
SELECT 1000000
regression=# analyze foo;
ANALYZE
regression=# explain select count(*) from foo;
                                      QUERY PLAN
----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8)
   ->  Gather  (cost=10633.33..10633.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=9633.33..9633.34 rows=1 width=8)
               ->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 width=0)
(5 rows)

regression=# explain (costs off) select count(*) from foo;
                 QUERY PLAN
--------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on foo
(5 rows)

That's not going to do for a regression-test case because it will break
anytime somebody changes the value of max_parallel_degree.  Maybe we
should suppress the "Workers Planned" field in costs-off display mode?
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Mon, May 9, 2016 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I think it's a good idea to run a force-parallel run on some buildfarm
>> members.
>
> Noah's already doing that on at least one of his critters.  But some more
> wouldn't hurt.

I agree.

>> But I'm rather convinced that the core tests run by all animals
>> need some minimal coverage of parallel queries. Both because otherwise
>> it'll be hard to get some coverage of unusual platforms, and because
>> it's imo something rather relevant to test during development.
>
> +1.  Experimenting with what we might do, it seems like it's harder to get
> the planner to use a parallel plan than you would think.
>
> regression=# explain select count(*) from tenk1;
>                                             QUERY PLAN
>
> --------------------------------------------------------------------------------
> -------------------
>  Aggregate  (cost=295.29..295.30 rows=1 width=8)
>    ->  Index Only Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..270.29 r
> ows=10000 width=0)
> (2 rows)
>
> regression=# set enable_indexscan TO 0;
> SET
> regression=# explain select count(*) from tenk1;
>                            QUERY PLAN
> -----------------------------------------------------------------
>  Aggregate  (cost=483.00..483.01 rows=1 width=8)
>    ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
> (2 rows)
>
> regression=# set force_parallel_mode TO on;
> SET
> regression=# explain select count(*) from tenk1;
>                            QUERY PLAN
> -----------------------------------------------------------------
>  Aggregate  (cost=483.00..483.01 rows=1 width=8)
>    ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
> (2 rows)
>
> Methinks force_parallel_mode is a bit broken.

Hmm, that is strange.  I would have expected that to stuff a Gather on
top of the Aggregate.  I wonder why it's not doing that.

> Also, once you *do* get it to make a parallel plan:
>
> regression=# create table foo as select generate_series(1,1000000) g;
> SELECT 1000000
> regression=# analyze foo;
> ANALYZE
> regression=# explain select count(*) from foo;
>                                       QUERY PLAN
> --------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8)
>    ->  Gather  (cost=10633.33..10633.54 rows=2 width=8)
>          Workers Planned: 2
>          ->  Partial Aggregate  (cost=9633.33..9633.34 rows=1 width=8)
>                ->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 width=0)
> (5 rows)
>
> regression=# explain (costs off) select count(*) from foo;
>                  QUERY PLAN
> --------------------------------------------
>  Finalize Aggregate
>    ->  Gather
>          Workers Planned: 2
>          ->  Partial Aggregate
>                ->  Parallel Seq Scan on foo
> (5 rows)
>
> That's not going to do for a regression-test case because it will break
> anytime somebody changes the value of max_parallel_degree.  Maybe we
> should suppress the "Workers Planned" field in costs-off display mode?

That seems reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 9, 2016 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> regression=# set force_parallel_mode TO on;
>> SET
>> regression=# explain select count(*) from tenk1;
>> QUERY PLAN
>> -----------------------------------------------------------------
>> Aggregate  (cost=483.00..483.01 rows=1 width=8)
>> ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
>> (2 rows)
>> 
>> Methinks force_parallel_mode is a bit broken.

> Hmm, that is strange.  I would have expected that to stuff a Gather on
> top of the Aggregate.  I wonder why it's not doing that.

The reason is that create_plain_partial_paths() contains a hard-wired
decision not to generate any partial paths for relations smaller than
1000 blocks, which means that no partial path will ever be generated
for *any* relation in the standard regression tests, force_parallel_mode
or no.

I would just go fix this, along the lines of

*************** create_plain_partial_paths(PlannerInfo *
*** 702,708 ****
         * with all of its inheritance siblings it may well pay off.
         */
        if (rel->pages < parallel_threshold &&
!           rel->reloptkind == RELOPT_BASEREL)
            return;

        /*
--- 703,710 ----
         * with all of its inheritance siblings it may well pay off.
         */
        if (rel->pages < parallel_threshold &&
!           rel->reloptkind == RELOPT_BASEREL &&
!           force_parallel_mode == FORCE_PARALLEL_OFF)
            return;

        /*

except that doing so and running the regression tests with
force_parallel_mode = on results in core dumps.  Some debugging seems
indicated ... and we should at this point assume that there has been no
useful testing of parallel query in the buildfarm, not even on Noah's
machines.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, May 11, 2016 at 12:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm, that is strange.  I would have expected that to stuff a Gather on
>> top of the Aggregate.  I wonder why it's not doing that.
>
> The reason is that create_plain_partial_paths() contains a hard-wired
> decision not to generate any partial paths for relations smaller than
> 1000 blocks, which means that no partial path will ever be generated
> for *any* relation in the standard regression tests, force_parallel_mode
> or no.

Well that's an interesting theory, except that you've completely
missed the point of force_parallel_mode.  force_parallel_mode pushes a
special Gather node on top of any plan that is not already parallel in
some way but which is parallel-safe.   That special Gather node runs
only in the worker, not the leader, and always uses just one worker.
The point is to test that queries run in the worker in the same way
that they do in the leader.  This has already found lots of bugs, so
it's clearly useful, despite all the confusion it's created.
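For illustration, the intended effect is roughly the following (approximate
plan shape, not verbatim EXPLAIN output):

set force_parallel_mode = on;
explain (costs off) select * from tenk1 where unique1 = 1;
--  Gather
--    Workers Planned: 1
--    ->  Index Scan using tenk1_unique1 on tenk1
--          Index Cond: (unique1 = 1)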

> I would just go fix this, along the lines of
>
> *************** create_plain_partial_paths(PlannerInfo *
> *** 702,708 ****
>          * with all of its inheritance siblings it may well pay off.
>          */
>         if (rel->pages < parallel_threshold &&
> !           rel->reloptkind == RELOPT_BASEREL)
>             return;
>
>         /*
> --- 703,710 ----
>          * with all of its inheritance siblings it may well pay off.
>          */
>         if (rel->pages < parallel_threshold &&
> !           rel->reloptkind == RELOPT_BASEREL &&
> !           force_parallel_mode == FORCE_PARALLEL_OFF)
>             return;
>
>         /*
>
> except that doing so and running the regression tests with
> force_parallel_mode = on results in core dumps.

Nonetheless, that is a bug.  (I only get one crash - do you get more?)

> Some debugging seems
> indicated ... and we should at this point assume that there has been no
> useful testing of parallel query in the buildfarm, not even on Noah's
> machines.

Yes and no.  The force_parallel_mode stuff is primarily there to tell
us about things like "you marked a function as parallel-safe, but it
actually blows up when run in a worker".  It won't catch a case where
a function is marked parallel-safe but silently does the wrong thing
instead of failing, but it will catch quite a few mistakes of that
kind, and I think that's good.

What it won't do, though, is actually run "real" parallel queries -
that is, multiple workers doing a Parallel Seq Scan with some other
stuff pushed on top, and then a Gather.  And we should have test
coverage for that, too, so that we're testing real concurrent
behavior, and not just our ability to run plans in a worker.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
"David G. Johnston"
Date:
On Wed, May 11, 2016 at 10:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 11, 2016 at 12:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hmm, that is strange.  I would have expected that to stuff a Gather on
>>> top of the Aggregate.  I wonder why it's not doing that.
>>
>> The reason is that create_plain_partial_paths() contains a hard-wired
>> decision not to generate any partial paths for relations smaller than
>> 1000 blocks, which means that no partial path will ever be generated
>> for *any* relation in the standard regression tests, force_parallel_mode
>> or no.
>
> Well that's an interesting theory, except that you've completely
> missed the point of force_parallel_mode.  force_parallel_mode pushes a
> special Gather node on top of any plan that is not already parallel in
> some way but which is parallel-safe.   That special Gather node runs
> only in the worker, not the leader, and always uses just one worker.

What happens when there are no workers available due to max_worker_processes already being assigned?

Related question, if max_parallel_degree is >1 and "the requested number of workers may not actually be available at runtime" is true, does the degree of parallelism minimize at 1 worker + leader or will the leader simply run the query by itself?

David J.

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, May 11, 2016 at 1:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I would just go fix this, along the lines of
>>
>> *************** create_plain_partial_paths(PlannerInfo *
>> *** 702,708 ****
>>          * with all of its inheritance siblings it may well pay off.
>>          */
>>         if (rel->pages < parallel_threshold &&
>> !           rel->reloptkind == RELOPT_BASEREL)
>>             return;
>>
>>         /*
>> --- 703,710 ----
>>          * with all of its inheritance siblings it may well pay off.
>>          */
>>         if (rel->pages < parallel_threshold &&
>> !           rel->reloptkind == RELOPT_BASEREL &&
>> !           force_parallel_mode == FORCE_PARALLEL_OFF)
>>             return;
>>
>>         /*
>>
>> except that doing so and running the regression tests with
>> force_parallel_mode = on results in core dumps.
>
> Nonetheless, that is a bug.  (I only get one crash - do you get more?)

This looks like a bug in the parallel aggregate code.

#0  make_partialgroup_input_target (root=0x7faa5f002d20,
final_target=0x7faa5f004270) at planner.c:4308
4308            Index        sgref = final_target->sortgrouprefs[i];
(gdb) p debug_query_string
$1 = 0x7faa5d00b638 "SELECT SUM(COUNT(f1)) OVER () FROM int4_tbl WHERE f1=42;"
Current language:  auto; currently minimal
(gdb) bt
#0  make_partialgroup_input_target (root=0x7faa5f002d20,
final_target=0x7faa5f004270) at planner.c:4308
#1  0x0000000101df4889 in create_grouping_paths (root=0x7faa5f002d20,
input_rel=0x7faa5f0034b0, target=0x7faa5f004270, rollup_lists=0x0,
rollup_groupclauses=0x0) at planner.c:3421
#2  0x0000000101df19a0 in grouping_planner (root=0x7faa5f002d20,
inheritance_update=0 '\0', tuple_fraction=0) at planner.c:1796
#3  0x0000000101def276 in subquery_planner (glob=0x7faa5f002b90,
parse=0x7faa5d00cb80, parent_root=0x0, hasRecursion=0 '\0',
tuple_fraction=0) at planner.c:758
#4  0x0000000101dee2de in standard_planner (parse=0x7faa5d00cb80,
cursorOptions=256, boundParams=0x0) at planner.c:307
#5  0x0000000101dedf81 in planner (parse=0x7faa5d00cb80,
cursorOptions=256, boundParams=0x0) at planner.c:177
#6  0x0000000101eed7b6 in pg_plan_query (querytree=0x7faa5d00cb80,
cursorOptions=256, boundParams=0x0) at postgres.c:798
#7  0x0000000101eed8a3 in pg_plan_queries (querytrees=0x7faa5f002cc0,
cursorOptions=256, boundParams=0x0) at postgres.c:857
#8  0x0000000101ef05ad in exec_simple_query
(query_string=0x7faa5d00b638 "SELECT SUM(COUNT(f1)) OVER () FROM
int4_tbl WHERE f1=42;") at postgres.c:1022
#9  0x0000000101eefa8f in PostgresMain (argc=1, argv=0x7faa5b005be0,
dbname=0x7faa5b005940 "regression", username=0x7faa5b005920 "rhaas")
at postgres.c:4059
#10 0x0000000101e45209 in BackendRun (port=0x7faa5ae01770) at postmaster.c:4258
#11 0x0000000101e44449 in BackendStartup (port=0x7faa5ae01770) at
postmaster.c:3932
#12 0x0000000101e433ef in ServerLoop () at postmaster.c:1690
#13 0x0000000101e40a23 in PostmasterMain (argc=8, argv=0x7faa5ac099b0)
at postmaster.c:1298
#14 0x0000000101d6e160 in main (argc=8, argv=0x7faa5ac099b0) at main.c:228
(gdb) p final_target->sortgrouprefs
$2 = (Index *) 0x0

I don't immediately understand what's going wrong here.  It looks to
me like make_group_input_target() already called, and that worked OK,
but now make_partialgroup_input_target() is failing using more-or-less
the same logic.  Presumably that's because make_group_input_target()
was called on final_target as returned by create_pathtarget(root,
tlist), but make_partialgroup_input_target() is being called on
grouping_target, which I'm guessing came from
make_window_input_target, which somehow lacks sortgroupref labeling.
But I don't immediately see how that would happen, so there's
obviously something I'm missing here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, May 11, 2016 at 1:48 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> What happens when there are no workers available due to max_worker_processes
> already being assigned?

Then the leader runs the plan after all.

> Related question, if max_parallel_degree is >1 and "the requested number of
> workers may not actually be available at runtime" is true, does the degree
> of parallelism minimize at 1 worker + leader or will the leader simply run
> the query by itself?

If the leader can get no workers at all, it will simply run the query
by itself.  Of course, it tries to get as many as it can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, May 11, 2016 at 1:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't immediately understand what's going wrong here.  It looks to
> me like make_group_input_target() already called, and that worked OK,
> but now make_partialgroup_input_target() is failing using more-or-less
> the same logic.  Presumably that's because make_group_input_target()
> was called on final_target as returned by create_pathtarget(root,
> tlist), but make_partialgroup_input_target() is being called on
> grouping_target, which I'm guessing came from
> make_window_input_target, which somehow lacks sortgroupref labeling.
> But I don't immediately see how that would happen, so there's
> obviously something I'm missing here.

So, it turns out you can reproduce this bug pretty easily without
force_parallel_mode, like this:

alter table int4_tbl set (parallel_degree = 4);
SELECT SUM(COUNT(f1)) OVER () FROM int4_tbl WHERE f1=42;

Or you can just use a query that involves a window function on a table
large enough for parallelism to be considered:

SELECT SUM(COUNT(aid)) OVER () FROM pgbench_accounts;

The crash goes away if the target list involves at least one plain
column that uses no aggregate or window function, because then the
PathTarget list has a sortgrouprefs array.  Trivial fix patch
attached, although some more review from you (Tom Lane, PathTarget
inventor and planner whiz) and David Rowley (author of this function)
would be appreciated in case there are deeper issues here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: parallel.c is not marked as test covered

From
David Rowley
Date:
<div dir="ltr">On 12 May 2016 at 07:04, Robert Haas <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br />> On Wed, May 11, 2016 at 1:57 PM,
RobertHaas <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>> wrote:<br />>> I don't
immediatelyunderstand what's going wrong here.  It looks to<br />>> me like make_group_input_target() already
called,and that worked OK,<br />>> but now make_partialgroup_input_target() is failing using more-or-less<br
/>>>the same logic.  Presumably that's because make_group_input_target()<br />>> was called on final_target
asreturned by create_pathtarget(root,<br />>> tlist), but make_partialgroup_input_target() is being called on<br
/>>>grouping_target, which I'm guessing came from<br />>> make_window_input_target, which somehow lacks
sortgroupreflabeling.<br />>> But I don't immediately see how that would happen, so there's<br />>>
obviouslysomething I'm missing here.<br />><br />> So, it turns out you can reproduce this bug pretty easily
without<br/>> force_parallel_mode, like this:<br />><br />> alter table int4_tbl set (parallel_degree = 4);<br
/>>SELECT SUM(COUNT(f1)) OVER () FROM int4_tbl WHERE f1=42;<br />><br />> Or you can just a query that
involvesa window function on a table<br />> large enough for parallelism to be considered:<br />><br />>
SELECTSUM(COUNT(aid)) OVER () FROM pgbench_accounts;<br />><br />> The crash goes way if the target list involves
atleast one plain<br />> column that uses no aggregate or window function, because then the<br />> PathTarget
listhas a sortgrouprefs array.  Trivial fix patch<br />> attached, although some more review from you (Tom Lane,
PathTarget<br/>> inventor and planner whiz) and David Rowley (author of this function)<br />> would be
appreciatedin case there are deeper issues here.<br /><br />The problem is make_group_input_target() only calls
add_column_to_pathtarget()(which allocates this array) when there's a GROUP BY, otherwise it just appends to the
non_group_collist. Since your query has no GROUP BY it means that add_column_to_pathtarget() is never called with a
non-zerosortgroupref.<br /><br />It looks like Tom has intended that PathTarget->sortgrouprefs can be NULL going by
boththe comment /* corresponding sort/group refnos, or 0 */, and the coding inside add_column_to_pathtarget(), which
doesnot allocate the array if given a 0 sortgroupref.<br /><br />It looks like make_sort_input_target(),
make_window_input_target()and make_group_input_target() all get away without this check because they're all using
final_target,which was built by make_pathtarget_from_tlist() which *always* allocates the sortgrouprefs array, even if
it'sleft filled with zeros.<br /><br />It might be better if this was all consistent. Perhaps it would be worth
modifyingmake_pathtarget_from_tlist() to only allocate the sortgrouprefs array if there's any non-zero
tle->ressortgroupref,then modify the other make_*_input_target() functions to handle a NULL array, similar to the
fixthat's in your patch. This saves an allocation which is likely much more expensive than the NULL check later.
Alternativelyadd_column_to_pathtarget() could be modified to allocate the array even if sortgroupref is zero.<br /><br
/>Ithink consistency is good here, as if this had been consistent this would not be a bug.<br /><br />-- <br /> David
Rowley                  <a href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a><br /> PostgreSQL
Development,24x7 Support, Training & Services</div> 

Re: parallel.c is not marked as test covered

From
Clément Prévost
Date:
On Mon, May 9, 2016 at 4:50 PM Andres Freund <andres@anarazel.de> wrote:
> I think it's a good idea to run a force-parallel run on some buildfarm
> members. But I'm rather convinced that the core tests run by all animals
> need some minimal coverage of parallel queries. Both because otherwise
> it'll be hard to get some coverage of unusual platforms, and because
> it's imo something rather relevant to test during development.
Good point. 

After some experiments, I found out that, for my setup (9b7bfc3a88ef7b), a parallel seq scan is used given both parallel_setup_cost and parallel_tuple_cost are set to 0 and given that the table is at least 3 times as large as the biggest test table tenk1.

The attached patch is a regression test using this method that is reasonably small and fast to run. I also hid the workers count from the explain output when costs are disabled, as suggested by Tom Lane and Robert Haas on this same thread (http://www.postgresql.org/message-id/CA+TgmobBQS4ss3+CwoZOKgbsBqKfRndwc=hLiALAep5aXQCTDA@mail.gmail.com).

Testing under these conditions does not test the planner job at all, but at least some parallel code can be run on the build farm and the test suite gets 2643 more lines and 188 more functions covered.

I don't know however if this test will be reliable on other platforms, some more feedback is needed here.
Attachment

Re: parallel.c is not marked as test covered

From
Noah Misch
Date:
On Sun, May 15, 2016 at 12:53:13PM +0000, Clément Prévost wrote:
> On Mon, May 9, 2016 at 4:50 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it's a good idea to run a force-parallel run on some buildfarm
> > members. But I'm rather convinced that the core tests run by all animals
> > need some minimal coverage of parallel queries. Both because otherwise
> > it'll be hard to get some coverage of unusual platforms, and because
> > it's imo something rather relevant to test during development.
> >
> Good point.
> 
> After some experiments, I found out that, for my setup (9b7bfc3a88ef7b), a
> parallel seq scan is used given both parallel_setup_cost
> and parallel_tuple_cost are set to 0 and given that the table is at least 3
> times as large as the biggest test table tenk1.
> 
> The attached patch is a regression test using this method that is
> reasonably small and fast to run. I also hid the workers count from the
> explain output when costs are disabled as suggested by Tom Lane and Robert
> Haas on this same thread (
> http://www.postgresql.org/message-id/CA+TgmobBQS4ss3+CwoZOKgbsBqKfRndwc=hLiALAep5aXQCTDA@mail.gmail.com
> ).
> 
> Testing under these conditions does not test the planner job at all but at
> least some parallel code can be run on the build farm and the test suite
> gets 2643 more lines and 188 more function covered.
> 
> I don't know however if this test will be reliable on other platforms, some
> more feedback is needed here.

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 5/9/16 10:50 AM, Andres Freund wrote:
> I think it's a good idea to run a force-parallel run on some buildfarm
> members. But I'm rather convinced that the core tests run by all animals
> need some minimal coverage of parallel queries. Both because otherwise
> it'll be hard to get some coverage of unusual platforms, and because
> it's imo something rather relevant to test during development.

I can confirm that with force_parallel_mode = on, parallel.c does get 
good test coverage.  But I agree that it is a problem that this does not 
happen in the default test setup.

I think we might want to have a new test file that sets 
force_parallel_mode = on and just runs a few queries to give parallel.c 
a small workout.  For example, I find that the following test file gives 
almost the same coverage as running the full test suite with f_p_m=on:

"""
SET force_parallel_mode = on;

EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 1;
SELECT * FROM tenk1 WHERE unique1 = 1;

-- provoke error in worker
SELECT stringu1::int2 FROM tenk1;
"""

While we're at it, one of the things that force_parallel_mode = regress 
does is remove an error context that is otherwise added to all messages:
    errcontext("parallel worker, PID %d", *(int32 *) arg);

I think we should get rid of that error context altogether, except 
possibly as a debugging tool, because it's an implementation detail that 
does not need to be shown to users by default.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Noah Misch
Date:
On Sun, May 29, 2016 at 01:31:24AM -0400, Noah Misch wrote:
> On Sun, May 15, 2016 at 12:53:13PM +0000, Clément Prévost wrote:
> > On Mon, May 9, 2016 at 4:50 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think it's a good idea to run a force-parallel run on some buildfarm
> > > members. But I'm rather convinced that the core tests run by all animals
> > > need some minimal coverage of parallel queries. Both because otherwise
> > > it'll be hard to get some coverage of unusual platforms, and because
> > > it's imo something rather relevant to test during development.
> > >
> > Good point.
> > 
> > After some experiments, I found out that, for my setup (9b7bfc3a88ef7b), a
> > parallel seq scan is used given both parallel_setup_cost
> > and parallel_tuple_cost are set to 0 and given that the table is at least 3
> > times as large as the biggest test table tenk1.
> > 
> > The attached patch is a regression test using this method that is
> > reasonably small and fast to run. I also hid the workers count from the
> > explain output when costs are disabled as suggested by Tom Lane and Robert
> > Haas on this same thread (
> > http://www.postgresql.org/message-id/CA+TgmobBQS4ss3+CwoZOKgbsBqKfRndwc=hLiALAep5aXQCTDA@mail.gmail.com
> > ).
> > 
> > Testing under these conditions does not test the planner job at all but at
> > least some parallel code can be run on the build farm and the test suite
> > gets 2643 more lines and 188 more function covered.
> > 
> > I don't know however if this test will be reliable on other platforms, some
> > more feedback is needed here.
> 
> [This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.
> 
> [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com

I enjoy reviewing automated test patches, so I persuaded Robert to transfer
ownership of this open item to me.  I will update this thread no later than
2016-06-07 09:00 UTC.  There is an 85% chance I will have reviewed the
proposed patch by then.



Re: parallel.c is not marked as test covered

From
Noah Misch
Date:
Thanks for this patch.  I have reviewed it:

On Sun, May 15, 2016 at 12:53:13PM +0000, Clément Prévost wrote:
> After some experiments, I found out that, for my setup (9b7bfc3a88ef7b), a
> parallel seq scan is used given both parallel_setup_cost
> and parallel_tuple_cost are set to 0 and given that the table is at least 3
> times as large as the biggest test table tenk1.

That worked because it pushed the table over the create_plain_partial_paths()
1000-block (8 MiB) threshold; see
http://www.postgresql.org/message-id/26842.1462941248@sss.pgh.pa.us

While the way you tested this is not wrong, I recommend instead querying an
inheritance parent if that achieves no less test coverage.  "SELECT count(*)
FROM a_star" gets a parallel plan under your cost parameters.  That avoids
building and discarding a relatively-large table.
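
That is, something along these lines (sketch; assumes max_parallel_degree
allows at least one worker):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
explain (costs off) select count(*) from a_star;
select count(*) from a_star;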

> The attached patch is a regression test using this method that is
> reasonably small and fast to run. I also hid the workers count from the
> explain output when costs are disabled as suggested by Tom Lane and Robert
> Haas on this same thread (
> http://www.postgresql.org/message-id/CA+TgmobBQS4ss3+CwoZOKgbsBqKfRndwc=hLiALAep5aXQCTDA@mail.gmail.com
> ).

It seems wrong to me for COSTS OFF to disable output that is not a cost
estimate.  file_fdw did that, but I don't want to proliferate that loose
interpretation of COSTS OFF.

Unlike the example in the linked-to message, this test won't experience
variable worker count.  For the query you used and for the a_star query, low
input size makes the planner request exactly one worker.  If later planner
enhancements give this query a configuration-specific worker count, the test
could capture EXPLAIN output with PL/pgSQL and compare it to a regex.
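For instance, a hypothetical sketch of that approach:

do $$
declare
    ln   text;
    plan text := '';
begin
    for ln in execute 'explain (costs off) select count(*) from a_star' loop
        plan := plan || ln || E'\n';
    end loop;
    -- accept any worker count rather than hard-coding it in expected output
    if plan !~ 'Workers Planned: \d+' or plan !~ 'Parallel Seq Scan' then
        raise exception 'unexpected plan: %', plan;
    end if;
end$$;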

> Testing under these conditions does not test the planner job at all but at
> least some parallel code can be run on the build farm and the test suite
> gets 2643 more lines and 188 more function covered.

Nice.

> --- a/src/test/regress/parallel_schedule
> +++ b/src/test/regress/parallel_schedule
> @@ -79,7 +79,7 @@ ignore: random
>  # ----------
>  # Another group of parallel tests
>  # ----------
> -test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete
> +test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete select_parallel

Add this to serial_schedule, too.

> --- /dev/null
> +++ b/src/test/regress/sql/select_parallel.sql
> @@ -0,0 +1,22 @@
> +--
> +-- PARALLEL
> +--
> +
> +-- setup parallel test
> +create unlogged table tenk_parallel as table tenk1 with no data;
> +set parallel_setup_cost=0;
> +set parallel_tuple_cost=0;
> +
> +-- create a table with enough data to trigger parallel behavior
> +insert into tenk_parallel
> +  select tenk1.* from tenk1, generate_series(0,2);
> +
> +explain (costs off)
> +  select count(*) from tenk_parallel;
> +select count(*) from tenk_parallel;

As of today, "make installcheck" passes with "default_transaction_isolation =
serializable" in postgresql.conf.  Let's preserve that property.  You could
wrap the parallel queries in "begin isolation level repeatable read;"
... "commit;", or you could SET default_transaction_isolation itself.

> +
> +-- cleanup
> +drop table tenk_parallel;
> +set parallel_setup_cost to default;
> +set parallel_tuple_cost to default;
> +

Remove the trailing blank line; it triggers a "git diff --check" diagnostic.


[Official open item status update: I expect to be able to review the next
patch version within a day or two of it becoming available.  I will update
this thread by 2016-06-14 09:00 UTC if not earlier.]



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/7/16 1:27 AM, Noah Misch wrote:
>> Testing under these conditions does not test the planner job at all but at
>> least some parallel code can be run on the build farm and the test suite
>> gets 2643 more lines and 188 more function covered.
>
> Nice.

Please see also my message 
<https://www.postgresql.org/message-id/92bec2a5-6d5b-6630-ce4d-2ed76f357973@2ndquadrant.com> 
with a similar solution.  That test also contains a query that raises an 
error, which is another code path that we should cover.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Clément Prévost
Date:
> On Sun, May 15, 2016 at 12:53:13PM +0000, Clément Prévost wrote:
>> After some experiments, I found out that, for my setup (9b7bfc3a88ef7b), a
>> parallel seq scan is used given both parallel_setup_cost
>> and parallel_tuple_cost are set to 0 and given that the table is at least 3
>> times as large as the biggest test table tenk1.
>
> That worked because it pushed the table over the create_plain_partial_paths()
> 1000-block (8 MiB) threshold; see
> http://www.postgresql.org/message-id/26842.1462941248@sss.pgh.pa.us
>
> While the way you tested this is not wrong, I recommend instead querying an
> inheritance parent if that achieves no less test coverage.  "SELECT count(*)
> FROM a_star" gets a parallel plan under your cost parameters.  That avoids
> building and discarding a relatively-large table.

Nice, this works well and is faster than copying tenk1 aggressively.

>> The attached patch is a regression test using this method that is
>> reasonably small and fast to run. I also hid the workers count from the
>> explain output when costs are disabled as suggested by Tom Lane and Robert
>> Haas on this same thread (
>> http://www.postgresql.org/message-id/CA+TgmobBQS4ss3+CwoZOKgbsBqKfRndwc=hLiALAep5aXQCTDA@mail.gmail.com
>> ).
>
> It seems wrong to me for COSTS OFF to disable output that is not a cost
> estimate.  file_fdw did that, but I don't want to proliferate that loose
> interpretation of COSTS OFF.

Agreed, I reverted it.

> Unlike the example in the linked-to message, this test won't experience
> variable worker count.  For the query you used and for the a_star query, low
> input size makes the planner request exactly one worker.  If later planner
> enhancements give this query a configuration-specific worker count, the test
> could capture EXPLAIN output with PL/pgSQL and compare it to a regex.

I also considered setting max_parallel_degree to 1 to make the test more
future-proof, but there is a rather long discussion on the setting name
(https://www.postgresql.org/message-id/20160424035859.GB29404@momjian.us)
so I can't decide on my own if it's worth the explicit dependency.

>> --- /dev/null
>> +++ b/src/test/regress/sql/select_parallel.sql
>> @@ -0,0 +1,22 @@
>> +--
>> +-- PARALLEL
>> +--
>> +
>> +-- setup parallel test
>> +create unlogged table tenk_parallel as table tenk1 with no data;
>> +set parallel_setup_cost=0;
>> +set parallel_tuple_cost=0;
>> +
>> +-- create a table with enough data to trigger parallel behavior
>> +insert into tenk_parallel
>> +  select tenk1.* from tenk1, generate_series(0,2);
>> +
>> +explain (costs off)
>> +  select count(*) from tenk_parallel;
>> +select count(*) from tenk_parallel;
>
> As of today, "make installcheck" passes with "default_transaction_isolation =
> serializable" in postgresql.conf.  Let's preserve that property.  You could
> wrap the parallel queries in "begin isolation level repeatable read;"
> ... "commit;", or you could SET default_transaction_isolation itself.

I did add the transaction, but I don't get why this specific test should use
this specific transaction isolation level.

On Tue, Jun 7, 2016 at 7:36 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Please see also my message
> <https://www.postgresql.org/message-id/92bec2a5-6d5b-6630-ce4d-2ed76f357973@2ndquadrant.com>
> with a similar solution.  That test also contains a query that raises an
> error, which is another code path that we should cover.

Added.
Attachment

Re: parallel.c is not marked as test covered

From
Noah Misch
Date:
On Tue, Jun 07, 2016 at 10:24:33PM +0000, Clément Prévost wrote:
> I also considered setting max_parallel_degree to 1 to make the test more
> futur-proof but there is a rather long discussion on the setting name (
> https://www.postgresql.org/message-id/20160424035859.GB29404@momjian.us) so
> I can't decide on my own if it's worth the explicit dependency.

It wouldn't be a problem to update this test when renaming the setting, but I
didn't see an impending need to use that setting.

> > As of today, "make installcheck" passes with
> > "default_transaction_isolation =
> > serializable" in postgresql.conf.  Let's preserve that property.  You could
> > wrap the parallel queries in "begin isolation level repeatable read;"
> > ... "commit;", or you could SET default_transaction_isolation itself.
> >
> 
> I did add the transaction, but I don't get why this specific test should
> use this specific transaction isolation level.

We disable all parallelism at serializable isolation.  Any other isolation
level would have worked; repeatable read was an arbitrary choice.  I added a
comment to that effect.
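
In other words, the parallel queries end up wrapped roughly like this (sketch;
the committed file may differ in detail):

-- any isolation level below serializable keeps parallelism available;
-- repeatable read is an arbitrary choice
begin isolation level repeatable read;
explain (costs off) select count(*) from a_star;
select count(*) from a_star;
commit;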

> -test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete
> +test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete select_parallel

I moved the test to a different group, in light of this parallel_schedule
header comment:

# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.

> +  exception
> +    -- raise custom exception, the original message contains
> +    -- a worker PID that must be hidden in the test output
> +    when others then raise exception 'Error in worker';

I changed this to keep the main message while overwriting the CONTEXT; a bug
in this area could very well produce some other error rather than no error.
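
The idea is roughly this (a sketch, not the exact committed test):

do $$
begin
  -- a query expected to fail, possibly inside a parallel worker depending
  -- on the surrounding parallel settings
  perform stringu1::int2 from tenk1 where unique1 = 1;
exception when others then
  -- re-raise the worker's message; the PL/pgSQL context replaces the
  -- original CONTEXT line, which carries an unstable worker PID
  raise '%', sqlerrm;
end$$;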


Committed that way.

Thanks,
nm



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Tue, Jun 7, 2016 at 11:43 PM, Noah Misch <noah@leadboat.com> wrote:
> Committed that way.

Thanks for the carefully-considered commit, Noah.  And thanks Clement,
Peter, and others involved in figuring out the best way to do this and
drawing attention to it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/7/16 11:43 PM, Noah Misch wrote:
> I changed this to keep the main message while overwriting the CONTEXT; a bug
> in this area could very well produce some other error rather than no error.

Regarding the patch that ended up being committed, I wonder if it is 
intentional that PL/pgSQL overwrites the context from the parallel 
worker.  Shouldn't the context effectively look like

ERROR:  message
CONTEXT:  parallel worker
CONTEXT:  PL/pgSQL function

Elsewhere in this thread I suggested getting rid of the parallel worker 
context by default (except for debugging), but if we do want to keep it, 
then it seems to be a bug that a PL/pgSQL function can just eliminate it.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Fri, Jun 10, 2016 at 1:49 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Regarding the patch that ended up being committed, I wonder if it is
> intentional that PL/pgSQL overwrites the context from the parallel worker.
> Shouldn't the context effectively look like
>
> ERROR:  message
> CONTEXT:  parallel worker
> CONTEXT:  PL/pgSQL function
>
> Elsewhere in this thread I suggested getting rid of the parallel worker
> context by default (except for debugging), but if we do want to keep it,
> then it seems to be a bug that a PL/pgSQL function can just eliminate it.

Several people have suggested getting rid of that now, so maybe we
should just go ahead and do it.

How would we make it available for debugging, though?  That seems like
something people will want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Fri, Jun 10, 2016 at 4:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 10, 2016 at 1:49 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> Regarding the patch that ended up being committed, I wonder if it is
>> intentional that PL/pgSQL overwrites the context from the parallel worker.
>> Shouldn't the context effectively look like
>>
>> ERROR:  message
>> CONTEXT:  parallel worker
>> CONTEXT:  PL/pgSQL function
>>
>> Elsewhere in this thread I suggested getting rid of the parallel worker
>> context by default (except for debugging), but if we do want to keep it,
>> then it seems to be a bug that a PL/pgSQL function can just eliminate it.
>
> Several people have suggested getting rid of that now, so maybe we
> should just go ahead and do it.
>
> How would we make it available for debugging, though?  That seems like
> something people will want.

This is currently listed as an open item, but it doesn't seem very
actionable to me.  The fact that PL/pgSQL chucks the existing
context instead of appending to it is presumably a property of
PL/pgSQL, not parallel query, and changing that seems like it ought
to be out of scope for 9.6.

As far as the context is concerned, the reason why that's there is
because I felt (and still feel, to some extent) that users might
experience errors that happen inside a parallel worker and it might be
important for debugging purposes to know that.  Suppose, for example,
that you mislabel a function as PARALLEL SAFE when in fact it relies
on some backend-local state (and should therefore be PARALLEL
RESTRICTED).  When it runs in a worker, it might generate an error
message like this:

ERROR: can't generate random numbers because you haven't specified a seed

...to which the user will reply, "oh yes I did; in fact I ran SELECT
magic_setseed(42) just before I ran the offending query!".  They'll
then contact an expert (hopefully) who will very possibly spend a lot
of time troubleshooting the wrong thing.  If the message says:

ERROR: can't generate random numbers because you haven't specified a seed
CONTEXT: parallel worker, PID 62413

...then the expert has a very good chance of guessing what has gone
wrong right away.

Now, against this scenario, there is every possibility that this
message will just annoy a lot of people.  It is certainly annoying for
regression test authors because the PID changes and, even if we took
the PID out, any given error might sometimes happen in the leader
rather than the worker, depending on timing.  I am not aware of a
concrete reason why it will annoy people in other situations, but
there may well be one (or more).

The simplest thing we could do here is nothing.  We can wait to see
what happens with PostgreSQL 9.6 and then decide whether to make a
change.  Or we can do something now.  We can eliminate the context
entirely and take our chances with scenarios like the one mentioned
above.  Or we can add a new GUC, something like hide_parallel_context,
which suppresses it and turn that off when running regression tests,
or even by default.  It's all just a matter of how much it bothers you
to sometimes get an extra context line vs. how much annoyance you
think will be caused by people wasting time troubleshooting because
they didn't realize parallel query was involved.

Personally, I'm very doubtful about our ability to paper over
artifacts like this.  Sophisticated users are going to end up having
to understand that there are parallel workers and that they propagate
their messages back to the leader, which appends its own context.  We
often seem to assume users won't find out about internal details like
that and so we don't make an effort to expose and document them, but I
don't think that often works out.  There are a whole lot of people who
know that they need to run EXPLAIN 6 times to find out what's really
going on.  I don't expect this case to be any different.  I'm actually
sort of flattered that parallel query is working well enough that
multiple people think they might not ever need to know whether an
error comes from the worker or from the leader, but I'm rather
doubtful that will be true in practice.

All that having been said, I don't know everything and this is just a
judgement call.  I exercised my judgement and that's how we got here.
If there's a consensus that my judgement should be overruled, I'm fine
with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 10, 2016 at 4:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jun 10, 2016 at 1:49 PM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> Elsewhere in this thread I suggested getting rid of the parallel worker
>>> context by default (except for debugging), but if we do want to keep it,
>>> then it seems to be a bug that a PL/pgSQL function can just eliminate it.

> This is currently listed as an open item, but it doesn't seem very
> actionable to me.  The fact that PL/pgSQL chucks the existing
> context instead of appending to it is presumably a property of
> PL/pgSQL, not parallel query, and changing that seems like it ought
> to be out of scope for 9.6.

FWIW, I follow all of your reasoning except this.  If we believe that the
parallel worker context line is useful, then it is a bug that plpgsql
suppresses it.  If we don't believe it's useful, then we should get
rid of it.  "Do nothing" is simply not a consistent stance here.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Tue, Jun 14, 2016 at 12:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jun 10, 2016 at 4:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, Jun 10, 2016 at 1:49 PM, Peter Eisentraut
>>> <peter.eisentraut@2ndquadrant.com> wrote:
>>>> Elsewhere in this thread I suggested getting rid of the parallel worker
>>>> context by default (except for debugging), but if we do want to keep it,
>>>> then it seems to be a bug that a PL/pgSQL function can just eliminate it.
>
>> This is currently listed as an open item, but it doesn't seem very
>> actionable to me.  The fact that PL/plpgsql chucks the existing
>> context instead of appending to it is presumably a property of
>> PL/plpgsql, not parallel query, and changing that seems like it ought
>> to be out of scope for 9.6.
>
> FWIW, I follow all of your reasoning except this.  If we believe that the
> parallel worker context line is useful, then it is a bug that plpgsql
> suppresses it.  If we don't believe it's useful, then we should get
> rid of it.  "Do nothing" is simply not a consistent stance here.

Well, if PL/pgsql suppresses context and nobody's complained about
that up until now, fixing it doesn't seem any more urgent than any
other bug we've had for N releases.  That would go on the 9.6 open
items list in the section entitled "Older Bugs", where it would have
plenty of company.  Any time somebody wants to fix one of those, they
can, and that would be great, but there's no more or less urgency
right now than, say, four months ago, or six months from now.  It
can't be said that this open item is holding up the release if it's
just a rediscovery of an existing behavior which somebody happens not
to like.

On the other hand, if PL/pgsql does not suppress context in general
but suppresses only this one particular bit of context from parallel
query, then that is probably a bug in new code which should be fixed
before release.  But I don't think that's what is being argued.

Am I confused?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 14, 2016 at 12:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> FWIW, I follow all of your reasoning except this.  If we believe that the
>> parallel worker context line is useful, then it is a bug that plpgsql
>> suppresses it.  If we don't believe it's useful, then we should get
>> rid of it.  "Do nothing" is simply not a consistent stance here.

> Well, if PL/pgsql suppresses context and nobody's complained about
> that up until now, fixing it doesn't seem any more urgent than any
> other bug we've had for N releases.

I have not dug into the code enough to find out exactly what's happening
in Peter's complaint, but it seems like it would be a good idea to find
that out before arguing along these lines.  It seems entirely likely
to me that this is somehow parallel-query-specific.  Even if it isn't,
I don't buy your argument.  Adding a new case in which context is
suppressed is a perfectly reasonable basis for thinking that an old
bug has acquired new urgency.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Tue, Jun 14, 2016 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jun 14, 2016 at 12:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> FWIW, I follow all of your reasoning except this.  If we believe that the
>>> parallel worker context line is useful, then it is a bug that plpgsql
>>> suppresses it.  If we don't believe it's useful, then we should get
>>> rid of it.  "Do nothing" is simply not a consistent stance here.
>
>> Well, if PL/pgsql suppresses context and nobody's complained about
>> that up until now, fixing it doesn't seem any more urgent than any
>> other bug we've had for N releases.
>
> I have not dug into the code enough to find out exactly what's happening
> in Peter's complaint, but it seems like it would be a good idea to find
> that out before arguing along these lines.  It seems entirely likely
> to me that this is somehow parallel-query-specific.  Even if it isn't,
> I don't buy your argument.  Adding a new case in which context is
> suppressed is a perfectly reasonable basis for thinking that an old
> bug has acquired new urgency.

OK.  I find this whole thing slightly puzzling because Noah wrote this
in the test file:
  -- Provoke error in worker.  The original message CONTEXT contains a worker
  -- PID that must be hidden in the test output.  PL/pgSQL conveniently
  -- substitutes its own CONTEXT.

The phrasing of the comment implies that (1) the behavior is desirable
at least for the purpose at hand and (2) the behavior is unrelated to
parallel query. However, it's surely possible that (2) is false and
(1) is not the consensus.  Noah is usually pretty careful about this
sort of thing, but he might have made a mistake.  Considering that, I
decided to take a look.

I wasn't able in a quick test to verify that this behavior doesn't
happen without parallel query.  I also wasn't able to figure out
exactly why it was happening.  I did verify that this statement
generates an error that is propagated to the leader.  The expected
output looks like this:

ERROR:  invalid input syntax for integer: "BAAAAA"
CONTEXT:  SQL statement "select stringu1::int2 from tenk1 where unique1 = 1"
PL/pgSQL function inline_code_block line 5 at SQL statement

The first line of context begins with "SQL statement" and I believe
that's coming from spi.c.  The second line of the context seems to
come from PL/pgsql. But what accounts for the failure of the "parallel
query" line to appear in the context message?  This could be explained
by _SPI_error_callback flushing the existing context in its handler,
but it doesn't seem to do that.  I guess this needs some more looking
at.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Wed, Jun 15, 2016 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 14, 2016 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I have not dug into the code enough to find out exactly what's happening
> > in Peter's complaint, but it seems like it would be a good idea to find
> > that out before arguing along these lines.  It seems entirely likely
> > to me that this is somehow parallel-query-specific.  Even if it isn't,
> > I don't buy your argument.  Adding a new case in which context is
> > suppressed is a perfectly reasonable basis for thinking that an old
> > bug has acquired new urgency.
>
> OK.  I find this whole thing slightly puzzling because Noah wrote this
> in the test file:
>
>   -- Provoke error in worker.  The original message CONTEXT contains a worker
>   -- PID that must be hidden in the test output.  PL/pgSQL conveniently
>   -- substitutes its own CONTEXT.
>

I have done an analysis of this issue and found that, as written, the test will never generate any sort of parallel plan, which is expected behaviour per the design.  The above error is generated in the backend itself, not in a worker, and that's why there is no "CONTEXT: parallel worker, PID <pid_of_worker>".  One might wonder why, in spite of setting force_parallel_mode = 1, this test won't generate a parallel plan.  The reason is that for SQL statements in plpgsql (PLPGSQL_STMT_EXECSQL), parallelModeOK will be false (we don't use CURSOR_OPT_PARALLEL_OK for such statements).  For unsafe statements (parallelModeOK = false), we don't force parallel plans even when force_parallel_mode is enabled.

In short, this test doesn't serve its purpose, which is to generate an error from a worker.  A modified version, shown below, does generate such an error, and the worker context is displayed as well.

do $$begin
  Perform stringu1::int2 from tenk1 where unique1 = 1;
end$$;

ERROR:  invalid input syntax for integer: "BAAAAA"
CONTEXT:  parallel worker, PID 4460
SQL statement "SELECT stringu1::int2 from tenk1 where unique1 = 1"
PL/pgSQL function inline_code_block line 2 at PERFORM

Assuming the above analysis is correct, we have the options below:
a. Modify the test so that it actually generates an error and, to hide the context, wrap it in an exception block and raise some generic error (a sketch follows this list).
b. Modify the test so that it actually generates an error and, to hide the context, use force_parallel_mode = regress.
c. Remove this test and think of a better way to exercise the error path in a worker.
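
For (a), a minimal sketch, reusing the tenk1 query from the example above (the generic message text is only a placeholder), could look like this:

do $$begin
  -- Provoke an error in a parallel worker, as in the example above.
  perform stringu1::int2 from tenk1 where unique1 = 1;
exception
  when others then
    -- Swallow the original CONTEXT (and the worker PID) by raising a
    -- generic error of our own from the exception block.
    raise exception 'error raised in parallel worker';
end$$;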

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Noah Misch
Date:
On Wed, Jun 15, 2016 at 11:50:33AM +0530, Amit Kapila wrote:
> In short, this test doesn't serve it's purpose which is to generate an
> error from worker.

That's bad.  Thanks for figuring out the problem.

> do $$begin
>   Perform stringu1::int2 from tenk1 where unique1 = 1;
> end$$;
> 
> ERROR:  invalid input syntax for integer: "BAAAAA"
> CONTEXT:  parallel worker, PID 4460
> SQL statement "SELECT stringu1::int2 from tenk1 where unique1 = 1"
> PL/pgSQL function inline_code_block line 2 at PERFORM
> 
> Considering above analysis is correct, we have below options:
> a. Modify the test such that it actually generates an error and to hide the
> context, we can exception block and raise some generic error.
> b. Modify the test such that it actually generates an error and to hide the
> context, we can use force_parallel_mode = regress;

Either of those sounds okay.  No need to raise a generic error; one can raise
SQLERRM to keep the main message and not the context.  I lean toward (a) so we
have nonzero test coverage of force_parallel_mode=on.



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Wed, Jun 15, 2016 at 12:07 PM, Noah Misch <noah@leadboat.com> wrote:
>
> On Wed, Jun 15, 2016 at 11:50:33AM +0530, Amit Kapila wrote:
> > do $$begin
> >   Perform stringu1::int2 from tenk1 where unique1 = 1;
> > end$$;
> >
> > ERROR:  invalid input syntax for integer: "BAAAAA"
> > CONTEXT:  parallel worker, PID 4460
> > SQL statement "SELECT stringu1::int2 from tenk1 where unique1 = 1"
> > PL/pgSQL function inline_code_block line 2 at PERFORM
> >
> > Considering above analysis is correct, we have below options:
> > a. Modify the test such that it actually generates an error and to hide the
> > context, we can exception block and raise some generic error.
> > b. Modify the test such that it actually generates an error and to hide the
> > context, we can use force_parallel_mode = regress;
>
> Either of those sounds okay.  No need to raise a generic error; one can raise
> SQLERRM to keep the main message and not the context.  I lean toward (a) so we
> have nonzero test coverage of force_parallel_mode=on.
>

Do you mean to say nonzero test coverage of force_parallel_mode=on for error paths?  I see that for force_parallel_mode=on, we have another test in select_parallel.sql

set force_parallel_mode=1;

explain (costs off)
  select stringu1::int2 from tenk1 where unique1 = 1;



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Wed, Jun 15, 2016 at 12:07 PM, Noah Misch <noah@leadboat.com> wrote:
>
> On Wed, Jun 15, 2016 at 11:50:33AM +0530, Amit Kapila wrote:
> > In short, this test doesn't serve it's purpose which is to generate an
> > error from worker.
>
> That's bad.  Thanks for figuring out the problem.
>
> > do $$begin
> >   Perform stringu1::int2 from tenk1 where unique1 = 1;
> > end$$;
> >
> > ERROR:  invalid input syntax for integer: "BAAAAA"
> > CONTEXT:  parallel worker, PID 4460
> > SQL statement "SELECT stringu1::int2 from tenk1 where unique1 = 1"
> > PL/pgSQL function inline_code_block line 2 at PERFORM
> >
> > Considering above analysis is correct, we have below options:
> > a. Modify the test such that it actually generates an error and to hide the
> > context, we can exception block and raise some generic error.
> > b. Modify the test such that it actually generates an error and to hide the
> > context, we can use force_parallel_mode = regress;
>
> Either of those sounds okay.  No need to raise a generic error; one can raise
> SQLERRM to keep the main message and not the context.  I lean toward (a) so we
> have nonzero test coverage of force_parallel_mode=on.

Patch implementing option (a) attached with this mail.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, Jun 15, 2016 at 5:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > Considering above analysis is correct, we have below options:
>> > a. Modify the test such that it actually generates an error and to hide
>> > the
>> > context, we can exception block and raise some generic error.
>> > b. Modify the test such that it actually generates an error and to hide
>> > the
>> > context, we can use force_parallel_mode = regress;
>>
>> Either of those sounds okay.  No need to raise a generic error; one can
>> raise
>> SQLERRM to keep the main message and not the context.  I lean toward (a)
>> so we
>> have nonzero test coverage of force_parallel_mode=on.
>
> Patch implementing option (a) attached with this mail.

OK, committed.  I also changed "select" to "perform" per your
analysis.  I wonder if we need to revisit the choices I made inside
PL/pgsql and see why CURSOR_OPT_PARALLEL_OK is not being set here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/14/16 12:37 PM, Robert Haas wrote:
> ERROR: can't generate random numbers because you haven't specified a seed
>
> ...to which the user will reply, "oh yes I did; in fact I ran SELECT
> magic_setseed(42) just before I ran the offending query!".  They'll
> then contact an expert (hopefully) who will very possibly spend a lot
> of time troubleshooting the wrong thing.  If the message says:
>
> ERROR: can't generate random numbers because you haven't specified a seed
> CONTEXT: parallel worker, PID 62413
>
> ...then the expert has a very good chance of guessing what has gone
> wrong right away.

My updated opinion is to keep the context but remove the PID.  I can see 
that as this feature is being introduced, we will want to know that 
something is happening inside a parallel worker.  Maybe in the future 
we'll have it all figured out and won't need the context anymore.  But 
the PID is not useful by the time you see the message.  The server log 
should contain a trace of the PIDs if you need to do deep digging.
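
In other words, reusing Robert's example from above (just a sketch of the idea), the report would become:

ERROR: can't generate random numbers because you haven't specified a seed
CONTEXT: parallel worker

which still tells you the error arose in a worker, without the PID that is unstable and no longer useful by the time you read it.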

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, Jun 15, 2016 at 4:59 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 6/14/16 12:37 PM, Robert Haas wrote:
>> ERROR: can't generate random numbers because you haven't specified a seed
>>
>> ...to which the user will reply, "oh yes I did; in fact I ran SELECT
>> magic_setseed(42) just before I ran the offending query!".  They'll
>> then contact an expert (hopefully) who will very possibly spend a lot
>> of time troubleshooting the wrong thing.  If the message says:
>>
>> ERROR: can't generate random numbers because you haven't specified a seed
>> CONTEXT: parallel worker, PID 62413
>>
>> ...then the expert has a very good chance of guessing what has gone
>> wrong right away.
>
> My updated opinion is to keep the context but remove the PID.  I can see
> that as this feature is being introduced, we will want to know that
> something is happening inside a parallel worker.  Maybe in the future we'll
> have it all figured out and won't need the context anymore.  But the PID is
> not useful by the time you see the message.  The server log should contain a
> trace of the PIDs if you need to do deep digging.

I'm comfortable with that.  Feel free to make it so, unless you want
me to do it for some reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Thu, Jun 16, 2016 at 12:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 5:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> > Considering above analysis is correct, we have below options:
> >> > a. Modify the test such that it actually generates an error and to hide
> >> > the
> >> > context, we can exception block and raise some generic error.
> >> > b. Modify the test such that it actually generates an error and to hide
> >> > the
> >> > context, we can use force_parallel_mode = regress;
> >>
> >> Either of those sounds okay.  No need to raise a generic error; one can
> >> raise
> >> SQLERRM to keep the main message and not the context.  I lean toward (a)
> >> so we
> >> have nonzero test coverage of force_parallel_mode=on.
> >
> > Patch implementing option (a) attached with this mail.
>
> OK, committed.

Thanks.

>I also changed "select" to "perform" per your
> analysis.

Oops, it seems I had forgotten to make that change in the patch.
 
>
>   I wonder if we need to revisit the choices I made inside
> PL/pgsql and see why CURSOR_OPT_PARALLEL_OK is not being set here.
>

exec_stmt_execsql() is used to execute SQL statements inside plpgsql, which includes DML statements as well, so you probably wanted to play it safe by not allowing the parallel option from that place.  However, I think there shouldn't be a problem in using CURSOR_OPT_PARALLEL_OK here, as we have a check in standard_planner which takes care of whether to choose parallel mode for a particular statement.  If you want, I can do a more detailed analysis and prepare a patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, Jun 15, 2016 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> exec_stmt_execsql() is used to execute SQL statements insider plpgsql which
> includes dml statements as well, so probably you wanted to play safe by not
> allowing parallel option from that place.  However, I think there shouldn't
> be a problem in using CURSOR_OPT_PARALLEL_OK from this place as we have a
> check in standard_planner which will take care of whether to choose parallel
> mode or not for a particular statement.  If you want, I can do more detailed
> analysis and prepare a patch.

That would be great.  I have a vague recollection that that function
might be called in some contexts where execution can be suspended,
which would be no good for parallel query.  But that might be wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Wed, Jun 15, 2016 at 5:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm comfortable with that.  Feel free to make it so, unless you want
> me to do it for some reason.

Time is short, so I did this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/17/16 9:26 AM, Robert Haas wrote:
> On Wed, Jun 15, 2016 at 5:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm comfortable with that.  Feel free to make it so, unless you want
>> me to do it for some reason.
>
> Time is short, so I did this.

With that, I think it would be preferable to undo the context-hiding
dance in the tests, as in the attached patch, no?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> With that, I think it would be preferable to undo the context-hiding 
> dance in the tests, as in the attached patch, no?

Would this not result in unstable test output depending on whether the
code executes in the leader or a worker?
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Sun, Jun 19, 2016 at 10:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > With that, I think it would be preferable to undo the context-hiding
> > dance in the tests, as in the attached patch, no?
>
> Would this not result in unstable test output depending on whether the
> code executes in the leader or a worker?
>

Before doing that test, we set force_parallel_mode=1, so it should always execute in a worker, which will ensure stable output.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Sun, Jun 19, 2016 at 10:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Would this not result in unstable test output depending on whether the
>> code executes in the leader or a worker?

> Before doing that test, we set force_parallel_mode=1, so it should always
> execute in worker which will ensure a stable output.

No, it *might* execute in a worker.  If you can get one.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Sun, Jun 19, 2016 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Sun, Jun 19, 2016 at 10:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Would this not result in unstable test output depending on whether the
>>> code executes in the leader or a worker?
>
>> Before doing that test, we set force_parallel_mode=1, so it should always
>> execute in worker which will ensure a stable output.
>
> No, it *might* execute in a worker.  If you can get one.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/19/16 3:09 PM, Robert Haas wrote:
> On Sun, Jun 19, 2016 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>> On Sun, Jun 19, 2016 at 10:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Would this not result in unstable test output depending on whether the
>>>> code executes in the leader or a worker?
>>
>>> Before doing that test, we set force_parallel_mode=1, so it should always
>>> execute in worker which will ensure a stable output.
>>
>> No, it *might* execute in a worker.  If you can get one.
>
> Right.

Well, the purpose of the test is to check the error passing between 
worker and leader.  If we just silently revert to not doing that, then 
we can't really be sure that we're testing the right thing.  We've 
already seen some instances in this thread where we figured out after 
some debugging that some construct is not actually going through the 
parallel infrastructure, so I'm afraid if we leave it like this it might 
silently change behavior at some point in the future.

Independent of that, it would help testing things like this if we 
allowed setting max_worker_processes to 0, instead of the current 
minimum of 1.  Is there a reason for the minimum of 1?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 6/19/16 3:09 PM, Robert Haas wrote:
>> On Sun, Jun 19, 2016 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, it *might* execute in a worker.  If you can get one.

>> Right.

> Well, the purpose of the test is to check the error passing between 
> worker and leader.  If we just silently revert to not doing that, then 
> we can't really be sure that we're testing the right thing.

I would guess that 90 to 95 times out of 100, that test *will* exercise
what it's supposed to, and that's enough to make it useful.  But we can't
assume that it'll happen 100% of the time.

> We've 
> already seen some instances in this thread where we figured out after 
> some debugging that some construct is not actually going through the 
> parallel infrastructure, so I'm afraid if we leave it like this it might 
> silently change behavior at some point in the future.

Agreed, if the percentage suddenly fell to 0 we wouldn't know it, and
that's concerning.  But wishing it were 100% doesn't make it so.

Depending on what the percentage actually is, maybe we could treat
this like the "random" test, and allow a failure to be disregarded
overall?  But that doesn't seem very nice either, in view of our
increasing reliance on automated testing.  If "random" were failing
90% of the time on some buildfarm critters, that would probably
indicate a real problem, but we'd likely not realize it for a long time.

> Independent of that, it would help testing things like this if we 
> allowed setting max_worker_processes to 0, instead of the current 
> minimum 1.  If there a reason for the minimum of 1?

Good question.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Sun, Jun 19, 2016 at 5:22 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Well, the purpose of the test is to check the error passing between worker
> and leader.  If we just silently revert to not doing that, then we can't
> really be sure that we're testing the right thing.  We've already seen some
> instances in this thread where we figured out after some debugging that some
> construct is not actually going through the parallel infrastructure, so I'm
> afraid if we leave it like this it might silently change behavior at some
> point in the future.

I've been thinking about renaming the "single_copy" flag for Gather
nodes to something like "pipeline" or some other term that might hope
to convey that the leader won't participate unless no workers can be
launched at all, and then adding an option to force that to be used
always.  That would allow better testing here, although I fear we
might be getting to a level of tinkering with parallel query that
starts to look more like feature development.

> Independent of that, it would help testing things like this if we allowed
> setting max_worker_processes to 0, instead of the current minimum 1.  If
> there a reason for the minimum of 1?

I believe that's pure brain fade on my part.  I think the minimum should be 0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jun 19, 2016 at 5:22 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> Well, the purpose of the test is to check the error passing between worker
>> and leader.  If we just silently revert to not doing that, then we can't
>> really be sure that we're testing the right thing.

> I've been thinking about renaming the "single_copy" flag for Gather
> nodes to something like "pipeline" or some other term that might hope
> to convey that the leader won't participate unless no workers can be
> launched at all, and then adding an option to force that to be used
> always.  That would allow better testing here,

How so?  You still have the fact that functions will run in the leader
when no workers are available, and the gold-plated certainty that that
case will arise routinely in the buildfarm.  Perhaps this would move
the probability of running in a worker from 90% to 95%, but I don't
think that'll make much difference from a testing standpoint.  Having
said that ...

> although I fear we
> might be getting to a level of tinkering with parallel query that
> starts to look more like feature development.

Personally, I'm +1 for such tinkering if it makes the feature either more
controllable or more understandable.  After reading the comments at the
head of nodeGather.c, though, I don't think that single_copy is either
understandable or useful, and merely renaming it won't help.  Apparently,
it runs code in the worker, except when it doesn't, and even when it does,
it's absolutely guaranteed to be a performance loss because the leader is
doing nothing.  What in the world is the point?
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Thu, Jun 16, 2016 at 8:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 15, 2016 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > exec_stmt_execsql() is used to execute SQL statements insider plpgsql which
> > includes dml statements as well, so probably you wanted to play safe by not
> > allowing parallel option from that place.  However, I think there shouldn't
> > be a problem in using CURSOR_OPT_PARALLEL_OK from this place as we have a
> > check in standard_planner which will take care of whether to choose parallel
> > mode or not for a particular statement.  If you want, I can do more detailed
> > analysis and prepare a patch.
>
> That would be great.  I have a vague recollection that that function
> might be called in some contexts where execution can be suspended,
> which would be no good for parallel query.  But that might be wrong.
>

I have done an analysis of this and didn't find any use case where passing CURSOR_OPT_PARALLEL_OK in exec_stmt_execsql() can help in parallelizing queries.  Basically, there seem to be three ways in which exec_stmt_execsql can be used inside plpgsql.

a. In execution of statement used in open cursor (via exec_stmt_open())
b. In execution of statement when used in for loop (via exec_stmt_forc())
c. In execution of SQL statement (via direct call to exec_stmt_execsql())

For cases (a) and (b), one can construct a case where execution needs to be suspended, so using parallel mode for those cases will not be meaningful.  For case (c), a SELECT statement executes successfully only when there is an *into* target; otherwise it will give an error after execution, as below:
ERROR:  query has no destination for result data
HINT:  If you want to discard the results of a SELECT, use PERFORM instead.

Now, if one has used an INTO clause, the number of rows fetched is restricted, in which case parallelism won't be enabled.
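
For instance, a sketch of case (c), reusing the tenk1 table from the earlier example (names are only for illustration):

do $$
declare v int;
begin
  -- Direct exec_stmt_execsql: the query needs an INTO target, and only a
  -- limited number of rows is fetched, so a parallel plan isn't chosen.
  select unique1 into v from tenk1 where unique1 = 1;
end$$;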

Do let me know if you see any case where generating a parallel path via this code path is meaningful.  I am of the opinion that we should leave this code as is, or maybe add a comment explaining why we have decided not to enable parallelism here.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Sun, Jun 19, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> although I fear we
>> might be getting to a level of tinkering with parallel query that
>> starts to look more like feature development.
>
> Personally, I'm +1 for such tinkering if it makes the feature either more
> controllable or more understandable.  After reading the comments at the
> head of nodeGather.c, though, I don't think that single_copy is either
> understandable or useful, and merely renaming it won't help.  Apparently,
> it runs code in the worker, except when it doesn't, and even when it does,
> it's absolutely guaranteed to be a performance loss because the leader is
> doing nothing.  What in the world is the point?

The single_copy flag allows a Gather node to have a child plan which
is not intrinsically parallel.  For example, consider these two plans:

Gather
-> Parallel Seq Scan

Gather
-> Seq Scan

The first plan is safe regardless of the setting of the single-copy
flag.  If the plan is executed in every worker, the results in
aggregate across all workers will add up to the results of a
non-parallel sequential scan of the table.  The second plan is safe
only if the # of workers is 1 and the single-copy flag is set.  If
either of those things is not true, then more than one process might
try to execute the sequential scan, and the result will be that you'll
get N copies of the output, where N = (# of parallel workers) +
(leader also participates ? 1 : 0).

For force_parallel_mode = {on, regress}, the single-copy behavior is
essential.  We can run all of those plans inside a worker, but only
because we know that the leader won't also try to run those same
plans.
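
(As a concrete illustration - only a sketch, using the regression database's tenk1 and an arbitrary query - something like

set force_parallel_mode = on;
explain (costs off) select count(*) from tenk1;

should put an otherwise non-parallel plan under a single-copy Gather, so the whole thing runs in a worker when one can be launched, with the leader merely gathering the result.)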

But it might be useful in other cases too.  For example, imagine a
plan like this:

Join
-> Join
  -> Join
    -> Join
      -> Gather (single copy)
        -> Join
          -> Join
            -> Join
              -> Join
                -> Scan (not parallel aware)
 

This is pipeline parallelism.  Instead of having one process do all of
the joins, you can have a worker do some subset of them and then send
the outputs back to the leader, which can do the rest and return the
results to the client.  This is actually kind of hard to get right -
according to the literature I've read on parallel query - because you
can get pipeline stalls that erase most or all of the benefit, but
it's a possible area to explore.

Actually, though, the behavior I really want the single_copy flag to
embody is not so much "only one process runs this" but "leader does
not participate unless there are no workers", which is the same thing
only when the budgeted number of workers is one.  This is useful
because of plans like this:

Finalize HashAggregate
-> Gather
  -> Partial HashAggregate
    -> Hash Join
       -> Parallel Seq Scan on large_table
       -> Hash
         -> Seq Scan on another_large_table

Unless the # of groups is very small, the leader actually won't
perform very much of the parallel-seq-scan on large_table, because
it'll be too busy aggregating the results from the other workers.
However, if it ever reaches a point where the Gather can't read a
tuple from one of the workers immediately, which is almost certain to
occur right at the beginning of execution, it's going to go build a
copy of the hash table so that it can "help" with the hash join.  By
the time it finishes, the workers will have done the same and be
feeding it results, and it will likely get little use out of the copy
that it built itself.  But it will still have gone to the effort of
building it.

For 10.0, Thomas Munro has already done a bunch of work, and will be
doing more work, so that we can build a shared hash table, rather than
one copy per worker.  That's going to be better when the table is
large anyway, so maybe this particular case won't matter so much.  But
in general when a partial path has a substantial startup cost, it may
be better for the leader not to get involved.  In a case like this,
it's hard to see how the leader's involvement can ever hurt:

Finalize HashAggregate
-> Gather
  -> Partial HashAggregate
    -> Nested Loop
       -> Parallel Seq Scan on large_table
       -> Index Scan on some_other_table

Even if the leader processes only one or two pages of
large_table, there's no real harm done unless, I suppose, the combine
function is fabulously expensive, which seems unlikely.  The lack of
harm stems directly from the fact that there's no startup cost here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Mon, Jun 20, 2016 at 1:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have done analysis on this and didn't found any use case where passing
> CURSOR_OPT_PARALLEL_OK in exec_stmt_execsql() can help in parallelizing the
> queries.  Basically, there seems to be three ways in which function
> exec_stmt_execsql can be used inside plpgsql.
>
> a. In execution of statement used in open cursor (via exec_stmt_open())
> b. In execution of statement when used in for loop (via exec_stmt_forc())
> c. In execution of SQL statement (via direct call to exec_stmt_execsql())
>
> For the cases (a) and (b), one can construct a case where execution needs to
> be suspended, so using parallel mode for those cases will not be meaningful.
> For case (c), the Select statement seems to execute successfully only when
> there is a *into* target, else it will give an error after execution as
> below:
> ERROR:  query has no destination for result data
> HINT:  If you want to discard the results of a SELECT, use PERFORM instead.
>
> Now, if one has used into clause, the number of rows will be restricted in
> which cases parallelism won't be enabled.

OK, sounds like I got that right, then.  Thanks for verifying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jun 19, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Personally, I'm +1 for such tinkering if it makes the feature either more
>> controllable or more understandable.  After reading the comments at the
>> head of nodeGather.c, though, I don't think that single_copy is either
>> understandable or useful, and merely renaming it won't help.  Apparently,
>> it runs code in the worker, except when it doesn't, and even when it does,
>> it's absolutely guaranteed to be a performance loss because the leader is
>> doing nothing.  What in the world is the point?

> The single_copy flag allows a Gather node to have a child plan which
> is not intrinsically parallel.

OK, but I'm very dubious of your claim that this has any use except for
testing purposes.  It certainly has no such use today.  Therefore, the
behavior of falling back to running in the leader seems like an
anti-feature to me.  If we want to test running in a worker, then we
want to test that, not maybe test it.

I propose that the behavior we actually want here is to execute in
a worker, full stop.  If we can't get one, wait till we can.  If we
can't get one because no slots have been allocated at all, fail.
That would make the behavior deterministic enough for the regression
tests to rely on it.

And while I don't know what this mode should be called, I'm pretty sure
that neither "single_copy" nor "pipeline" are useful descriptions.
        regards, tom lane



Re: parallel.c is not marked as test covered

From
"David G. Johnston"
Date:
On Mon, Jun 20, 2016 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jun 19, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> although I fear we
>> might be getting to a level of tinkering with parallel query that
>> starts to look more like feature development.
>
> Personally, I'm +1 for such tinkering if it makes the feature either more
> controllable or more understandable.  After reading the comments at the
> head of nodeGather.c, though, I don't think that single_copy is either
> understandable or useful, and merely renaming it won't help.  Apparently,
> it runs code in the worker, except when it doesn't, and even when it does,
> it's absolutely guaranteed to be a performance loss because the leader is
> doing nothing.  What in the world is the point?

The single_copy flag allows a Gather node to have a child plan which
is not intrinsically parallel.  For example, consider these two plans:

Gather
-> Parallel Seq Scan

Gather
-> Seq Scan

The first plan is safe regardless of the setting of the single-copy
flag.  If the plan is executed in every worker, the results in
aggregate across all workers will add up to the results of a
non-parallel sequential scan of the table.  The second plan is safe
only if the # of workers is 1 and the single-copy flag is set.  If
either of those things is not true, then more than one process might
try to execute the sequential scan, and the result will be that you'll
get N copies of the output, where N = (# of parallel workers) +
(leader also participates ? 1 : 0).

For force_parallel_mode = {on, regress}, the single-copy behavior is
essential.  We can run all of those plans inside a worker, but only
because we know that the leader won't also try to run those same
plans.


The entire theory here looks whacked - and seems to fall into the "GUCs controlling results" bucket of undesirable things.

Is this GUC enabled by a compile time directive, or otherwise protected from misuse in production?

I'm having trouble sounding smart here about what is bothering me, but basically the parallel infrastructure (i.e., additional workers) shouldn't even be used for "Seq Scan", and a "Seq Scan" under a Gather should behave no differently than a "Parallel Seq Scan" under a Gather where all work is done by the leader because no workers were available to help.

At worst this behavior should be an implementation artifact of force_parallel_mode={on,regress}; at best the Gather node would simply have this intelligence on, always, so as not to silently generate bogus results in a mis-configured or buggy setup.

 
[...]


Actually, though, the behavior I really want the single_copy flag to
embody is not so much "only one process runs this" but "leader does
not participate unless there are no workers", which is the same thing
only when the budgeted number of workers is one.

This sounds an awful lot like a planner hint, especially since it defaults to off.

  This is useful
because of plans like this:

Finalize HashAggregate
-> Gather
  -> Partial HashAggregate
    -> Hash Join
       -> Parallel Seq Scan on large_table
       -> Hash
         -> Seq Scan on another_large_table

Unless the # of groups is very small, the leader actually won't
perform very much of the parallel-seq-scan on large_table, because
it'll be too busy aggregating the results from the other workers.
However, if it ever reaches a point where the Gather can't read a
tuple from one of the workers immediately, which is almost certain to
occur right at the beginning of execution, it's going to go build a
copy of the hash table so that it can "help" with the hash join.  By
the time it finishes, the workers will have done the same and be
feeding it results, and it will likely get little use out of the copy
that it built itself.  But it will still have gone to the effort of
building it.

For 10.0, Thomas Munro has already done a bunch of work, and will be
doing more work, so that we can build a shared hash table, rather than
one copy per worker.  That's going to be better when the table is
large anyway, so maybe this particular case won't matter so much.  But
in general when a partial path has a substantial startup cost, it may
be better for the leader not to get involved.

So have the Gather node understand this and act accordingly.

This is also quite different from the "we'll get wrong results" problem described above, which this GUC also attempts to solve.

I'm inclined to believe three things:

1) We need a test mode whereby we guarantee at least one worker is used for processing.
2) Gather needs to be inherently smart enough to accept data from a non-parallel source.
3) Gather needs to use its knowledge (hopefully it has some) of partial plan startup costs and worker availability to decide whether it wants to participate in both hunting and gathering.  It should make this decision once at startup and live with it for the duration.

The first option seems logical but doesn't actually address either of the two scenarios described as motivation for the existing GUC.

The second and third options are presented to address the two scenarios via an alternative, non-GUC, solution.

As for "we must have exactly one worker and it must perform all of the hunting, the leader shall only gather": Peter's comment seems to drive this particular use case, and it does seem to be test oriented, to prove certain capabilities are functioning correctly in parallel mode.  That said I'm not positive that "and the leader does nothing" is a mandatory aspect of this need.  If not having a setting like: <min_parallel_degree^D^D^D^D^D^Dworkers_per_gather> may be sufficient.  I guess a <parallel_leader_gather_only=on> option could be added to control the leader's participation orthogonally.

David J.

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Mon, Jun 20, 2016 at 1:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Jun 19, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Personally, I'm +1 for such tinkering if it makes the feature either more
>>> controllable or more understandable.  After reading the comments at the
>>> head of nodeGather.c, though, I don't think that single_copy is either
>>> understandable or useful, and merely renaming it won't help.  Apparently,
>>> it runs code in the worker, except when it doesn't, and even when it does,
>>> it's absolutely guaranteed to be a performance loss because the leader is
>>> doing nothing.  What in the world is the point?
>
>> The single_copy flag allows a Gather node to have a child plan which
>> is not intrinsically parallel.
>
> OK, but I'm very dubious of your claim that this has any use except for
> testing purposes.  It certainly has no such use today.  Therefore, the
> behavior of falling back to running in the leader seems like an
> anti-feature to me.  If we want to test running in a worker, then we
> want to test that, not maybe test it.
>
> I propose that the behavior we actually want here is to execute in
> a worker, full stop.  If we can't get one, wait till we can.  If we
> can't get one because no slots have been allocated at all, fail.
> That would make the behavior deterministic enough for the regression
> tests to rely on it.

I agree that for force_parallel_mode testing, that behavior would be useful.

I am also pretty confident we're going to want the behavior where the
leader runs the plan if, and only if, no workers can be obtained for
other purposes.  However, I agree that's not essential today.

> And while I don't know what this mode should be called, I'm pretty sure
> that neither "single_copy" nor "pipeline" are useful descriptions.

Maybe we should make this an enum rather than a boolean, since there
seem to be more than two useful behaviors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Mon, Jun 20, 2016 at 3:29 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> The entire theory here looks whacked - and seems to fall into the "GUCs
> controlling results" bucket of undesirable things.

As far as I can see, this entire email is totally wrong and off-base,
because the whole thing seems to be written on the presumption that
single_copy is a GUC, when it's actually a structure member.  If there
was some confusion about that, you could have spent 5 seconds running
"git grep" before writing this email, or you could have tried "SET
single_copy" and discovered, hey, there's no such GUC.

Furthermore, I think that describing something that you obviously
haven't taken any time to understand as "whacked" is not very nice.
For that matter, I think that describing something you *have* taken
time to understand as "whacked" is not very nice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
"David G. Johnston"
Date:
On Mon, Jun 20, 2016 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 20, 2016 at 3:29 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> The entire theory here looks whacked - and seems to fall into the "GUCs
> controlling results" bucket of undesirable things.

As far as I can see, this entire email is totally wrong and off-base,
because the whole thing seems to be written on the presumption that
single_copy is a GUC, when it's actually a structure member.  If there
was some confusion about that, you could have spent 5 seconds running
"git grep" before writing this email, or you could have tried "SET
single_copy" and discovered, hey, there's no such GUC.

Furthermore, I think that describing something that you obviously
haven't taken any time to understand as "whacked" is not very nice.
For that matter, I think that describing something you *have* taken
time to understand as "whacked" is not very nice.


Point taken.

I don't think my entire post depends solely upon this being a GUC though.

I've burned too many brain cells on this already, though, to dive much deeper.

Internal or external, I do think the number and type of flags described here, for the purposes described, seem undesirable from an architectural standpoint.  I do not and cannot offer up more than that, generally due to knowledge and resource constraints.  I tried to frame things relative to my understanding of existing, non-parallel idioms, both to understand it better myself and to throw out another POV from a fresh perspective.  I'll admit it's one with some drawbacks, but it's offered in good faith.

Please do with it as you will and accept my apology for the poor choice of colloquialism.

David J.

Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Mon, Jun 20, 2016 at 4:52 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Internal or external I do think the number and type of flags described here,
> for the purposes described, seems undesirable from an architectural
> standpoint.

Well, that seems like a bold statement to me, because I think that one
really has to understand why flags got added before deciding whether
there are too many of them, and it's not clear to me that you do.
What we're talking about here is ONE flag, which currently controls
whether the leader will participate in running Gather's child plan.
What happens when the flag is set to "leader should not participate"
and no workers are available?  The current behavior is that the leader
runs the plan in lieu of the workers, and Tom is proposing to change
it so that we wait for a worker to become available instead.  That has
the advantage of being better for testing, but the disadvantage of
being possibly less useful for other purposes.  Neither of us are
proposing to eliminate the flag, which would only serve to cripple
test coverage, and I will venture to say that neither Tom nor anyone
else who has spent much time looking at the way any part of the system
works would argue that a Node type with ONE flag has too many flags.

I am usually quite happy when people other than senior hackers weigh
in on threads here, especially about user-visible behavior, because
senior hackers can be quite myopic.  We spend most of our time
developing the system rather than using it, and sometimes our notion
of what is useful or acceptable is distorted because of it.  Moreover,
none of us are so smart that we can't be wrong, so a gut check from
somebody else is often helpful.  But, of course, an opinion that
proceeds from a false premise doesn't provide useful guidance.  Many
people realize this, which is why we tend to get entirely too many
opinions on questions like "should we change the version number?" and
far too few on questions like "how should we refactor heap_update?".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
I wrote:
> Depending on what the percentage actually is, maybe we could treat
> this like the "random" test, and allow a failure to be disregarded
> overall?  But that doesn't seem very nice either, in view of our
> increasing reliance on automated testing.  If "random" were failing
> 90% of the time on some buildfarm critters, that would probably
> indicate a real problem, but we'd likely not realize it for a long time.

BTW, this is getting a bit off-topic for this thread, but: I got
curious about whether any such effect might actually exist, and decided
to grep the buildfarm logs to see how many times a failure of the
"random" test had been ignored.  The answer is that there are 35 such
occurrences in otherwise-successful buildfarm runs, out of something
like 650000 successful runs in the buildfarm database.  This is in the
noise, really.   There are more occurrences than that of cases where
"random                   ... failed (ignored)" appeared in a failed
run, which more than likely means that some other regression test script
caused a crash with collateral damage to this test.

This seems like pretty good evidence that we should remove the "ignored"
marking for the random test, and maybe remove that functionality from
pg_regress altogether.  We could probably adjust the test to decrease
its risk-of-failure by another factor of ten or so, if anyone feels like
0.005% failure probability is too high.

Raw data attached for amusement's sake.

            regards, tom lane

    sysname    |      snapshot       |      stage      | logline
---------------+---------------------+-----------------+----------------------------------------------------------------
 orangutan     | 2015-01-02 08:00:09 | OK              | random ... failed (ignored)
 dingo         | 2015-01-22 17:29:03 | Check           | random ... failed (ignored)
 fulmar        | 2015-01-26 03:03:17 | OK              | test random ... failed (ignored)
 nightjar      | 2015-02-03 22:27:13 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 dromedary     | 2015-03-09 18:57:28 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 olinguito     | 2015-03-09 18:59:08 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 rover_firefly | 2015-03-09 19:04:00 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 lapwing       | 2015-03-09 19:15:01 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 prairiedog    | 2015-03-09 19:20:35 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 dingo         | 2015-03-09 19:29:00 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 binturong     | 2015-03-09 19:42:28 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 shearwater    | 2015-03-09 19:58:15 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 magpie        | 2015-04-25 20:59:48 | OK              | test random ... failed (ignored)
 anole         | 2015-04-28 21:25:27 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 binturong     | 2015-05-07 16:55:15 | pg_upgradeCheck | test random ... failed (ignored) (test process exited with exit code 2)
 dingo         | 2015-05-07 16:58:01 | pg_upgradeCheck | test random ... failed (ignored) (test process exited with exit code 2)
 binturong     | 2015-05-08 18:56:02 | pg_upgradeCheck | test random ... failed (ignored) (test process exited with exit code 2)
 reindeer      | 2015-05-15 06:00:01 | OK              | random ... failed (ignored)
 anole         | 2015-05-28 17:58:54 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 anole         | 2015-06-01 00:30:15 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 crake         | 2015-06-01 18:53:02 | OK              | test random ... failed (ignored)
 sittella      | 2015-06-01 22:50:36 | OK              | test random ... failed (ignored)
 anole         | 2015-06-01 22:58:24 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 blesbok       | 2015-06-27 18:49:43 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 anole         | 2015-06-28 21:35:02 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 binturong     | 2015-07-03 23:45:49 | pg_upgradeCheck | test random ... failed (ignored) (test process exited with exit code 2)
 chipmunk      | 2015-07-13 09:29:10 | OK              | test random ... failed (ignored)
 kouprey       | 2015-07-19 04:43:10 | OK              | random ... failed (ignored)
 gharial       | 2015-07-19 18:54:12 | OK              | random ... failed (ignored)
 frogmouth     | 2015-07-21 01:11:35 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 frogmouth     | 2015-07-21 13:15:31 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 leech         | 2015-07-28 12:13:11 | pg_upgradeCheck | test random ... failed (ignored) (test process exited with exit code 2)
 frogmouth     | 2015-07-30 01:13:06 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-05 16:44:12 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-05 23:49:11 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-06 02:22:15 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-06 03:04:08 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-06 12:30:33 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-06 18:30:58 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-06 21:22:12 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-07 02:58:39 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-07 14:41:35 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 mandrill      | 2015-08-07 20:54:58 | Check           | random ... failed (ignored) (test process exited with exit code 2)
 capybara      | 2015-08-13 16:05:05 | OK              | test random ... failed (ignored)
 cockatiel     | 2015-08-26 05:36:47 | OK              | test random ... failed (ignored)
 mandrill      | 2015-09-07 20:47:19 | OK              | test random ... failed (ignored)
 treepie       | 2015-09-11 08:41:36 | OK              | random ... failed (ignored)
 guaibasaurus  | 2015-09-27 02:23:01 | OK              | random ... failed (ignored)
 axolotl       | 2015-10-13 18:01:18 | OK              | test random ... failed (ignored)
 okapi         | 2015-10-19 21:15:01 | OK              | random ... failed (ignored)
 rover_firefly | 2015-11-06 06:04:00 | OK              | test random ... failed (ignored)
 spurfowl      | 2015-11-07 21:21:03 | OK              | random ... failed (ignored)
 minisauripus  | 2015-11-26 03:42:28 | OK              | test random ... failed (ignored)
 jacana        | 2015-12-28 01:05:17 | OK              | test random ... failed (ignored)
 spurfowl      | 2016-01-07 15:27:01 | OK              | random ... failed (ignored)
 tern          | 2016-01-14 00:41:10 | OK              |      random                   ... failed (ignored)
 guaibasaurus  | 2016-01-25 03:43:01 | OK              |      random                   ... failed (ignored)
 longfin       | 2016-02-05 13:19:05 | OK              | test random                   ... failed (ignored)
 gharial       | 2016-02-11 03:15:43 | OK              | test random                   ... failed (ignored)
 quokka        | 2016-02-26 15:00:06 | OK              | test random                   ... failed (ignored)
 fulmar        | 2016-03-03 12:01:03 | OK              | test random                   ... failed (ignored)
 jaguarundi    | 2016-03-09 04:52:00 | OK              | test random                   ... failed (ignored)
 nightjar      | 2016-04-05 23:30:57 | OK              | test random                   ... failed (ignored)
 skink         | 2016-04-07 16:00:01 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 skink         | 2016-04-07 16:20:01 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 skink         | 2016-04-07 16:40:01 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 gull          | 2016-04-15 02:20:05 | OK              |      random                   ... failed (ignored)
 piculet       | 2016-04-15 02:30:01 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 pademelon     | 2016-04-16 02:47:16 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 pademelon     | 2016-04-16 03:30:47 | Check           |      random                   ... failed (ignored) (test
processexited with exit code 2) 
 hornet        | 2016-05-06 15:52:14 | OK              | test random                   ... failed (ignored)
 mastodon      | 2016-05-08 04:00:01 | OK              | test random                   ... failed (ignored)
 guaibasaurus  | 2016-06-01 02:58:01 | OK              | test random                   ... failed (ignored)
 handfish      | 2016-06-02 21:55:47 | InstallCheck-C  | test random                   ... failed (ignored)
 handfish      | 2016-06-12 02:01:32 | OK              | test random                   ... failed (ignored)
 spoonbill     | 2016-06-14 23:00:05 | OK              | test random                   ... failed (ignored)
(76 rows)


Re: parallel.c is not marked as test covered

From
Alvaro Herrera
Date:
Tom Lane wrote:

> This seems like pretty good evidence that we should remove the "ignored"
> marking for the random test, and maybe remove that functionality from
> pg_regress altogether.  We could probably adjust the test to decrease
> its risk-of-failure by another factor of ten or so, if anyone feels like
> 0.005% failure probability is too high.

I suppose that as far as the buildfarm goes it's okay that the test
fails from time to time, but it may be worse from packagers' points of
view, where a randomly failing test can wreck the whole building
process.  Is a 0.005% failure probability low enough that nobody will be
bothered by that?
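
(Purely as an illustrative back-of-the-envelope, with the run count below being an assumption rather than a figure taken from this thread: at a per-run failure probability of 0.005%, something on the order of 300 check runs a day across the buildfarm and packagers' builds would give

\[ 300 \;\text{runs/day} \times 365 \;\text{days} \times 5 \times 10^{-5} \approx 5.5 \;\text{spurious failures per year,} \]

i.e. well under one failure per month spread across the entire fleet.)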

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
"David G. Johnston"
Date:
On Mon, Jun 20, 2016 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 20, 2016 at 4:52 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Internal or external I do think the number and type of flags described here,
> for the purposes described, seems undesirable from an architectural
> standpoint.

Well, that seems like a bold statement to me, because I think that one
really has to understand why flags got added before deciding whether
there are too many of them, and it's not clear to me that you do.

I don't.  And I totally accept a response of: you are operating under a false premise here.  My response above was more defensive than constructive.

What we're talking about here is ONE flag, which currently controls
whether the leader will participate in running Gather's child plan.
What happens when the flag is set to "leader should not participate"
and no workers are available?  The current behavior is that the leader
runs the plan in lieu of the workers, and Tom is proposing to change
it so that we wait for a worker to become available instead.

I think my biggest (again, I'm not digging deep here) misunderstanding is testing mode versus production mode and what is or is not visible to the test framework compared to SQL/libpq.

I am usually quite happy when people other than senior hackers weigh
in on threads here, especially about user-visible behavior, because
senior hackers can be quite myopic.  We spend most of our time
developing the system rather than using it, and sometimes our notion
of what is useful or acceptable is distorted because of it.  Moreover,
none of us are so smart that we can't be wrong, so a gut check from
somebody else is often helpful. 
 
But, of course, an opinion that
proceeds from a false premise doesn't provide useful guidance. 
This is true - though I'd suspect it's not absolute.
 
Many
people realize this, which is why we tend to get entirely too many
opinions on questions like "should we change the version number?" and
far too few on questions like "how should we refactor heap_update?".


This is a non-sequitur - or I'm just too worn to follow it.

I did not intentionally set out to spend an hour of my own time to waste 20 minutes of yours.  I won't apologize for an earnest attempt at self-education and contribution, no matter how badly it turned out.  I will endeavor to learn from it and avoid similar errors in the future.

David J.
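
(To make the Gather behavior discussed above concrete: a minimal sketch, assuming a 9.6-era server where the force_parallel_mode GUC is available. The plan shape in the comments is the rough output one would expect, not verbatim regression-test output.)

    -- Force even a trivially cheap, non-parallel plan under a Gather node.
    -- 'on' leaves the Gather visible in EXPLAIN; 'regress' hides it so that
    -- expected regression-test output does not change.
    SET force_parallel_mode = on;
    EXPLAIN (COSTS OFF) SELECT 1;
    -- Roughly:
    --   Gather
    --     Workers Planned: 1
    --     Single Copy: true
    --     ->  Result
    -- In the single-copy case, if no worker can be obtained the leader
    -- currently runs the child plan itself; the proposal above is to wait
    -- for a worker instead, so the behavior is deterministic for testing.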

Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> This seems like pretty good evidence that we should remove the "ignored"
>> marking for the random test, and maybe remove that functionality from
>> pg_regress altogether.  We could probably adjust the test to decrease
>> its risk-of-failure by another factor of ten or so, if anyone feels like
>> 0.005% failure probability is too high.

> I suppose that as far as the buildfarm goes it's okay that the test
> fails from time to time, but it may be worse from packagers' points of
> view, where a randomly failing test can wreck the whole building
> process.  Is a 0.005% failure probability low enough that nobody will be
> bothered by that?

As an ex-packager, I think that's a couple orders of magnitude below where
anybody will notice it, let alone feel pain.  There are other causes of
failure that will dwarf this one.

(You may recall that I used to bitch regularly about the failure
probabilities for mysql's regression tests --- but that was because
the probability of failure was on the order of 50%, when building
in Red Hat's buildfarm.)
        regards, tom lane



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/19/16 5:55 PM, Tom Lane wrote:
> Depending on what the percentage actually is, maybe we could treat
> this like the "random" test, and allow a failure to be disregarded
> overall?  But that doesn't seem very nice either, in view of our
> increasing reliance on automated testing.  If "random" were failing
> 90% of the time on some buildfarm critters, that would probably
> indicate a real problem, but we'd likely not realize it for a long time.

I think this test would only fail if it runs out of workers, and that 
would only happen in an installcheck run against a server configured in 
a nonstandard way or that is doing something else -- which doesn't 
happen on the buildfarm.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: parallel.c is not marked as test covered

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 6/19/16 5:55 PM, Tom Lane wrote:
>> Depending on what the percentage actually is, maybe we could treat
>> this like the "random" test, and allow a failure to be disregarded
>> overall?  But that doesn't seem very nice either, in view of our
>> increasing reliance on automated testing.  If "random" were failing
>> 90% of the time on some buildfarm critters, that would probably
>> indicate a real problem, but we'd likely not realize it for a long time.

> I think this test would only fail if it runs out of workers, and that 
> would only happen in an installcheck run against a server configured in 
> a nonstandard way or that is doing something else -- which doesn't 
> happen on the buildfarm.

Um, if you're speaking of select_parallel, that already runs in parallel
with two other regression tests, and there is no annotation in the
parallel_schedule file suggesting that adding more scripts to that group
would be bad.  But yes, perhaps putting this test into its own standalone
group would be enough of a fix.
        regards, tom lane
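
(A hypothetical sketch of the kind of schedule change being discussed here; it is not the actual file contents nor any patch from this thread. In src/test/regress/parallel_schedule, all tests named on one "test:" line run concurrently, so giving select_parallel a line of its own keeps other scripts from competing with it for background workers.)

    # illustrative only: run select_parallel in a group by itself
    test: select_parallel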



Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Tue, Jun 21, 2016 at 1:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 20, 2016 at 1:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Sun, Jun 19, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Personally, I'm +1 for such tinkering if it makes the feature either more
> >>> controllable or more understandable.  After reading the comments at the
> >>> head of nodeGather.c, though, I don't think that single_copy is either
> >>> understandable or useful, and merely renaming it won't help.  Apparently,
> >>> it runs code in the worker, except when it doesn't, and even when it does,
> >>> it's absolutely guaranteed to be a performance loss because the leader is
> >>> doing nothing.  What in the world is the point?
> >
> >> The single_copy flag allows a Gather node to have a child plan which
> >> is not intrinsically parallel.
> >
> > OK, but I'm very dubious of your claim that this has any use except for
> > testing purposes.  It certainly has no such use today.  Therefore, the
> > behavior of falling back to running in the leader seems like an
> > anti-feature to me.  If we want to test running in a worker, then we
> > want to test that, not maybe test it.
> >
> > I propose that the behavior we actually want here is to execute in
> > a worker, full stop.  If we can't get one, wait till we can.  If we
> > can't get one because no slots have been allocated at all, fail.
> > That would make the behavior deterministic enough for the regression
> > tests to rely on it.
>

+1.

> I agree that for force_parallel_mode testing, that behavior would be useful.
>
> I am also pretty confident we're going to want the behavior where the
> leader runs the plan if, and only if, no workers can be obtained for
> other purposes.  However, I agree that's not essential today.
>
> > And while I don't know what this mode should be called, I'm pretty sure
> > that neither "single_copy" nor "pipeline" are useful descriptions.
>
> Maybe we should make this an enum rather than a boolean, since there
> seem to be more than two useful behaviors.
>

How about calling it control_parallel/control_parallelism or override_parallel/override_parallelism?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/19/16 10:00 PM, Robert Haas wrote:
>> Independent of that, it would help testing things like this if we allowed
>> > setting max_worker_processes to 0, instead of the current minimum 1.  Is
>> > there a reason for the minimum of 1?
> I believe that's pure brain fade on my part.  I think the minimum should be 0.

Fixed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
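
(With the minimum lowered to 0, background workers can be switched off entirely for a test setup. A minimal sketch using ordinary GUC machinery; nothing here is specific to the fix itself.)

    -- max_worker_processes can only be changed at server start, so either
    -- edit postgresql.conf or use ALTER SYSTEM and then restart the server.
    ALTER SYSTEM SET max_worker_processes = 0;
    -- ... restart ...
    SHOW max_worker_processes;  -- now 0: no background workers, and hence
                                -- no parallel workers, can ever be launched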



Re: parallel.c is not marked as test covered

From
Robert Haas
Date:
On Tue, Aug 2, 2016 at 1:17 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 6/19/16 10:00 PM, Robert Haas wrote:
>>> Independent of that, it would help testing things like this if we allowed
>>> > setting max_worker_processes to 0, instead of the current minimum 1.  Is
>>> > there a reason for the minimum of 1?
>> I believe that's pure brain fade on my part.  I think the minimum should be 0.
>
> Fixed.

Thank you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: parallel.c is not marked as test covered

From
Peter Eisentraut
Date:
On 6/20/16 11:16 PM, Tom Lane wrote:
>> > I think this test would only fail if it runs out of workers, and that
>> > would only happen in an installcheck run against a server configured in
>> > a nonstandard way or that is doing something else -- which doesn't
>> > happen on the buildfarm.
> Um, if you're speaking of select_parallel, that already runs in parallel
> with two other regression tests, and there is no annotation in the
> parallel_schedule file suggesting that adding more scripts to that group
> would be bad.  But yes, perhaps putting this test into its own standalone
> group would be enough of a fix.

Maybe now would be a good time to address this by applying the attached
patch to master and seeing what happens?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: parallel.c is not marked as test covered

From
Amit Kapila
Date:
On Wed, Aug 17, 2016 at 1:34 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 6/20/16 11:16 PM, Tom Lane wrote:
>>> > I think this test would only fail if it runs out of workers, and that
>>> > would only happen in an installcheck run against a server configured in
>>> > a nonstandard way or that is doing something else -- which doesn't
>>> > happen on the buildfarm.
>> Um, if you're speaking of select_parallel, that already runs in parallel
>> with two other regression tests, and there is no annotation in the
>> parallel_schedule file suggesting that adding more scripts to that group
>> would be bad.  But yes, perhaps putting this test into its own standalone
>> group would be enough of a fix.
>
> Maybe now would be a good time to address this by applying the attached
> patch to master and seeing what happens?
>

+1.  Your patch looks good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com