Thread: Parallel append plan instability/randomness

Parallel append plan instability/randomness

From
Tom Lane
Date:
According to buildfarm member silverfish,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=silverfish&dt=2018-01-06%2008%3A53%3A38

it's possible to sometimes get this failure in the regression tests:

*** /mnt/buildfarm/buildroot/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out    Tue Dec 19 20:24:02 2017
--- /mnt/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/select_parallel.out    Sat Jan  6 09:21:39 2018
***************
*** 75,84 ****
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
                       ->  Seq Scan on d_star
                       ->  Seq Scan on f_star
                       ->  Seq Scan on e_star
-                      ->  Seq Scan on b_star
                       ->  Seq Scan on c_star
                       ->  Seq Scan on a_star
  (11 rows)
--- 75,84 ----
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
+                      ->  Seq Scan on b_star
                       ->  Seq Scan on d_star
                       ->  Seq Scan on f_star
                       ->  Seq Scan on e_star
                       ->  Seq Scan on c_star
                       ->  Seq Scan on a_star
  (11 rows)

Irreproducible failures in the regression tests are not very acceptable.
Furthermore, considering that the query being tested is

explain (costs off)
  select round(avg(aa)), sum(aa) from a_star;

it seems to me that the "expected" order of the sub-scans is mighty
random to begin with.  A non-parallel implementation of the same
query will consistently scan these tables in their inheritance order:

 Aggregate
   ->  Append
         ->  Seq Scan on a_star
         ->  Seq Scan on b_star
         ->  Seq Scan on c_star
         ->  Seq Scan on d_star
         ->  Seq Scan on e_star
         ->  Seq Scan on f_star
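
For reference, one way to see that serial plan (not necessarily how the
expected output was generated, just a convenient check) is to disable
parallel query locally:

begin;
set local max_parallel_workers_per_gather = 0;  -- force the non-parallel plan
explain (costs off)
  select round(avg(aa)), sum(aa) from a_star;
rollback;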

Could we fix the parallel case to behave similarly please?

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Amit Kapila
Date:
On Sun, Jan 7, 2018 at 5:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> According to buildfarm member silverfish,
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=silverfish&dt=2018-01-06%2008%3A53%3A38
>
> it's possible to sometimes get this failure in the regression tests:
>
> *** /mnt/buildfarm/buildroot/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out    Tue Dec 19 20:24:02 2017
> --- /mnt/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/select_parallel.out      Sat Jan  6 09:21:39 2018
> ***************
> *** 75,84 ****
>            Workers Planned: 3
>            ->  Partial Aggregate
>                  ->  Parallel Append
>                        ->  Seq Scan on d_star
>                        ->  Seq Scan on f_star
>                        ->  Seq Scan on e_star
> -                      ->  Seq Scan on b_star
>                        ->  Seq Scan on c_star
>                        ->  Seq Scan on a_star
>   (11 rows)
> --- 75,84 ----
>            Workers Planned: 3
>            ->  Partial Aggregate
>                  ->  Parallel Append
> +                      ->  Seq Scan on b_star
>                        ->  Seq Scan on d_star
>                        ->  Seq Scan on f_star
>                        ->  Seq Scan on e_star
>                        ->  Seq Scan on c_star
>                        ->  Seq Scan on a_star
>   (11 rows)
>
> Irreproducible failures in the regression tests are not very acceptable.
> Furthermore, considering that the query being tested is
>
> explain (costs off)
>   select round(avg(aa)), sum(aa) from a_star;
>
> it seems to me that the "expected" order of the sub-scans is mighty
> random to begin with.
>

I think the order of sub-scans can be random if the number of rows in
the child relations varies across runs.  For the above case, the
subpaths (non-partial paths) are always in descending order of their
cost, and I can see that by running it locally.  On my local machine,
the output is as below:

regression=# explain select round(avg(aa)), sum(aa) from a_star;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Finalize Aggregate  (cost=2.30..2.31 rows=1 width=40)
   ->  Gather  (cost=2.28..2.29 rows=3 width=40)
         Workers Planned: 3
         ->  Partial Aggregate  (cost=2.28..2.29 rows=1 width=40)
               ->  Parallel Append  (cost=0.00..2.20 rows=15 width=4)
                     ->  Seq Scan on d_star  (cost=0.00..1.16 rows=16 width=4)
                     ->  Seq Scan on f_star  (cost=0.00..1.16 rows=16 width=4)
                     ->  Seq Scan on e_star  (cost=0.00..1.07 rows=7 width=4)
                     ->  Seq Scan on b_star  (cost=0.00..1.04 rows=4 width=4)
                     ->  Seq Scan on c_star  (cost=0.00..1.04 rows=4 width=4)
                     ->  Seq Scan on a_star  (cost=0.00..1.03 rows=3 width=4)
(11 rows)

The above indicates that the paths are listed in the expected order.
What makes you think that the order of sub-scans can be random?  Is it
possible that the number of rows in the child relations varies across
runs?

One theory that can explain the above failure is that the costs of
scanning some of the sub-paths are so close that the results can
sometimes vary.  If that is the case, then using a fuzz factor in the
cost comparison (as is done in the attached patch) can probably
improve the situation; maybe we also have to consider some other
factors like the number of rows in each subpath.  However, it might be
better to first somehow reproduce this case and see what is going
wrong.  Any ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel append plan instability/randomness

From
Amit Khandekar
Date:
On 8 January 2018 at 10:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
> regression=# explain select round(avg(aa)), sum(aa) from a_star;
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=2.30..2.31 rows=1 width=40)
>    ->  Gather  (cost=2.28..2.29 rows=3 width=40)
>          Workers Planned: 3
>          ->  Partial Aggregate  (cost=2.28..2.29 rows=1 width=40)
>                ->  Parallel Append  (cost=0.00..2.20 rows=15 width=4)
>                      ->  Seq Scan on d_star  (cost=0.00..1.16 rows=16 width=4)
>                      ->  Seq Scan on f_star  (cost=0.00..1.16 rows=16 width=4)
>                      ->  Seq Scan on e_star  (cost=0.00..1.07 rows=7 width=4)
>                      ->  Seq Scan on b_star  (cost=0.00..1.04 rows=4 width=4)
>                      ->  Seq Scan on c_star  (cost=0.00..1.04 rows=4 width=4)
>                      ->  Seq Scan on a_star  (cost=0.00..1.03 rows=3 width=4)
> (11 rows)
>
> The above indicates that paths are listed in the order as expected.
> What makes you think that the order of sub-scans can be random?  Is it
> possible that the number of rows in child relations can vary across
> runs?
>
> One theory that can explain above failure is that the costs of
> scanning some of the sub-paths is very close due to which sometimes
> the results can vary.  If that is the case, then probably using
> fuzz_factor in costs comparison (as is done in attached patch) can
> improve the situation, may be we have to consider some other factors
> like number of rows in each subpath.  However, it might be better to
> first somehow reproduce this case and see what is going wrong, any
> ideas?

The fact that b_star gets moved from 5th position to the first
position in the scans indicates that its cost shoots up from 1.04 to
a value greater than 1.16.  It does not look like a case where two
costs are almost the same and their positions sometimes swap.  I am
trying to figure out what else it can be ...

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> The fact that b_star gets moved from 5th position to  the first
> position in the scans, indicates that it's cost shoots up from 1.04 to
> a value greater than 1.16. It does not look like a case where two
> costs are almost same due to which their positions swap sometimes. I
> am trying to figure out what else can it be ...

The gut feeling I had upon seeing the failure was that the plan shape
depends on the order in which rows happen to be read from the system
catalogs by a heapscan.  I've not tried to run that idea to ground yet.

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Amit Kapila
Date:
On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>> The fact that b_star gets moved from 5th position to  the first
>> position in the scans, indicates that it's cost shoots up from 1.04 to
>> a value greater than 1.16. It does not look like a case where two
>> costs are almost same due to which their positions swap sometimes. I
>> am trying to figure out what else can it be ...
>

That occurred to me as well, but still, the change in plan can happen
due to similar costs.  Another possibility, as indicated in the
previous email, is that the stats of the table (reltuples, relpages)
are somehow not appropriate, say because for some reason ANALYZE
doesn't happen on the table.  For example, if you manually update the
value of reltuples for b_star in pg_class to 20 or so, you will see
the plan indicated in the failure.  If that is true, then doing an
ANALYZE before the Parallel Append test should probably do the trick.
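
Something like the following (an untested sketch; it needs superuser,
since it pokes pg_class directly) should reproduce the failing plan shape:

begin;
update pg_class set reltuples = 20 where relname = 'b_star';
explain (costs off)
  select round(avg(aa)), sum(aa) from a_star;
rollback;  -- discard the catalog tweak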

> The gut feeling I had upon seeing the failure was that the plan shape
> depends on the order in which rows happen to be read from the system
> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>

I don't see how something like that can happen because we internally
sort the subpaths for parallel append.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel append plan instability/randomness

From
Amit Khandekar
Date:
On 8 January 2018 at 13:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>>> The fact that b_star gets moved from 5th position to  the first
>>> position in the scans, indicates that it's cost shoots up from 1.04 to
>>> a value greater than 1.16. It does not look like a case where two
>>> costs are almost same due to which their positions swap sometimes. I
>>> am trying to figure out what else can it be ...
>>
>
> That occurred to me as well, but still, the change in plan can happen
> due to the similar costs.

Agreed. But I think we should first fix the issue that might be making
the test fail in this case.  BTW, for your patch, I am thinking we
could have a separate factor other than STD_FUZZ_FACTOR?  That way, we
can also make it much smaller than 1.01.  And anyway, STD_FUZZ_FACTOR
is used only for comparing paths on the same relation, whereas in our
case the comparison goal is different.

> Another possibility as indicated in the
> previous email is that if somehow the stats of table (reltuples,
> relpages) is not appropriate, say due to some reason analyze doesn't
> happen on the table.

Yes, I am also thinking along the same lines.  E.g., if relpages is 0
(because no ANALYZE has run yet), the tuple density calculation
follows different logic, due to which reltuples can come out quite a
bit bigger.  I suspect this might also be the reason.  So yes, I think
it's worth having an ANALYZE on *_star.
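
If stale or zero stats really are the culprit, a sanity-check query
along these lines (not part of the tests, just for poking at a local
install) would show what the planner is working from:

select relname, relpages, reltuples
from pg_class
where relname in ('a_star', 'b_star', 'c_star', 'd_star', 'e_star', 'f_star')
order by relname;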

> For example, if you just manually update the
> value of reltuples for b_star in pg_class to 20 or so, you will see
> the plan as indicated in the failure.  If that is true, then probably
> doing Analyze before Parallel Append should do the trick.

Or better still, we can have an ANALYZE in create_misc.sql and
create_table.sql, where the tables are populated.

>
>> The gut feeling I had upon seeing the failure was that the plan shape
>> depends on the order in which rows happen to be read from the system
>> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>>
>
> I don't see how something like that can happen because we internally
> sort the subpaths for parallel append.

True. Or maybe I didn't understand.  Tom, are you referring to reading
pg_class.reltuples or similar when you say "read from the system
catalogs"?

>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Parallel append plan instability/randomness

From
Amit Kapila
Date:
On Mon, Jan 8, 2018 at 2:11 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 8 January 2018 at 13:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>>>> The fact that b_star gets moved from 5th position to  the first
>>>> position in the scans, indicates that it's cost shoots up from 1.04 to
>>>> a value greater than 1.16. It does not look like a case where two
>>>> costs are almost same due to which their positions swap sometimes. I
>>>> am trying to figure out what else can it be ...
>>>
>>
>> That occurred to me as well, but still, the change in plan can happen
>> due to the similar costs.
>
> Agreed. But I think we should first fix the issue due to which the
> test might be failing in this case. BTW, for your patch, I am thinking
> we can have a separate factor other than STD_FUZZ_FACTOR ? This way,
> we can make it much smaller than 1.01 also.
>

Sounds good.  However, I am not sure what the right value for it
should be; apart from STD_FUZZ_FACTOR, we are also using 1.0000000001
as a fuzz factor in the code.  Any suggestions?

> And anyways,
> STD_FUZZ_FACTOR is used only for comparing paths on the same relation,
> whereas in our case, our comparision goal is different.
>
>> Another possibility as indicated in the
>> previous email is that if somehow the stats of table (reltuples,
>> relpages) is not appropriate, say due to some reason analyze doesn't
>> happen on the table.
>
> Yes, I am also thinking on the same lines. E.g., if the relpages is 0
> (due to no analyze run yet), tuple density calculation follows a
> different logic, due to which reltuples can be quite bigger. I suspect
> this also might be the reason. So yes, I think it's worth having
> ANALYZE on *_star.
>
>> For example, if you just manually update the
>> value of reltuples for b_star in pg_class to 20 or so, you will see
>> the plan as indicated in the failure.  If that is true, then probably
>> doing Analyze before Parallel Append should do the trick.
>
>  Or better still, we can have Analyze in create_misc.sql and
> create_table.sql where the table is populated.
>

Then we probably want to do it in the misc.sql file, as that ALTERs
some of these tables, which can lead to a rewrite of the table.  I
think we can do it either way, but let's first wait for Tom or Robert
to comment on whether they agree with this theory or whether they see
some other reason for this behavior.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel append plan instability/randomness

From
Robert Haas
Date:
On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> One theory that can explain above failure is that the costs of
> scanning some of the sub-paths is very close due to which sometimes
> the results can vary.  If that is the case, then probably using
> fuzz_factor in costs comparison (as is done in attached patch) can
> improve the situation, may be we have to consider some other factors
> like number of rows in each subpath.

This isn't an acceptable solution because sorting requires that the
comparison operator satisfy the transitive property; that is, if a = b
and b = c then a = c.  With your proposed comparator, you could have a
= b and b = c but a < c.  That will break stuff.
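
To make that concrete with a 1% fuzz factor (STD_FUZZ_FACTOR is 1.01):
costs of 1.000, 1.009, and 1.018 compare as 1.000 = 1.009 and
1.009 = 1.018, because each pair is within 1%, yet 1.000 < 1.018.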

It seems like the obvious fix here is to use a query where the
contents of the partitions are such that the sorting always produces
the same result.  We could do that either by changing the query or by
changing the data in the partitions or, maybe, by inserting ANALYZE
someplace.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel append plan instability/randomness

From
Jim Finnerty
Date:
Ordering the plan output elements by estimated cost will cause spurious plan
changes to be reported after table cardinalities change.  Can we choose an
explain output order that is stable to changes in cardinality, please?

thank you,

    /Jim




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Parallel append plan instability/randomness

From
Robert Haas
Date:
On Mon, Jan 8, 2018 at 11:28 AM, Jim Finnerty <jfinnert@amazon.com> wrote:
> Ordering the plan output elements by estimated cost will cause spurious plan
> changes to be reported after table cardinalities change.  Can we choose an
> explain output order that is stable to changes in cardinality, please?

No.  The order of the plans in the EXPLAIN output isn't some kind of
display artifact.  Parallel Append sorts the paths by cost and by
whether they are partial, for reasons explained in the comments in
create_append_path(), and EXPLAIN prints them in that order.

In general, if you expect to be able to change the contents of the
tables in the regression test database without changing the results of
the regression test queries, you are in for a lot of disappointment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
Jim Finnerty <jfinnert@amazon.com> writes:
> Ordering the plan output elements by estimated cost will cause spurious plan
> changes to be reported after table cardinalities change.  Can we choose an
> explain output order that is stable to changes in cardinality, please?

I found the code that's doing this, in create_append_path, and it says:

     * For parallel append, non-partial paths are sorted by descending total
     * costs. That way, the total time to finish all non-partial paths is
     * minimized.  Also, the partial paths are sorted by descending startup
     * costs.  There may be some paths that require to do startup work by a
     * single worker.  In such case, it's better for workers to choose the
     * expensive ones first, whereas the leader should choose the cheapest
     * startup plan.

There's some merit in that argument, although I'm not sure how much.
It's certainly pointless to sort that way if the expected number of
workers is >= the number of subpaths.  More generally, I wonder if
it wouldn't be better to implement this behavior at runtime rather
than plan time.  Something along the line of "workers choose the
highest-cost remaining subplan, but leader chooses the lowest-cost one".

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Robert Haas
Date:
On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Finnerty <jfinnert@amazon.com> writes:
>> Ordering the plan output elements by estimated cost will cause spurious plan
>> changes to be reported after table cardinalities change.  Can we choose an
>> explain output order that is stable to changes in cardinality, please?
>
> I found the code that's doing this, in create_append_path, and it says:
>
>      * For parallel append, non-partial paths are sorted by descending total
>      * costs. That way, the total time to finish all non-partial paths is
>      * minimized.  Also, the partial paths are sorted by descending startup
>      * costs.  There may be some paths that require to do startup work by a
>      * single worker.  In such case, it's better for workers to choose the
>      * expensive ones first, whereas the leader should choose the cheapest
>      * startup plan.
>
> There's some merit in that argument, although I'm not sure how much.
> It's certainly pointless to sort that way if the expected number of
> workers is >= the number of subpaths.

That is not completely true, because the leader will generally pick
first before the workers have finished starting up, and we want it to
pick something cheap and, ideally, partial, so that it's not committed
to doing a ton of work while the workers sit there idle after filling
their tuple queues.

> More generally, I wonder if
> it wouldn't be better to implement this behavior at runtime rather
> than plan time.  Something along the line of "workers choose the
> highest-cost remaining subplan, but leader chooses the lowest-cost one".

Ignoring some details around partial vs. non-partial plans, that's
pretty much what we ARE doing, but to make it efficient, we sort the
paths at plan time so that those choices are easy to make at runtime.
If we didn't do that, we could have every worker sort the paths at
execution time instead, or have the first process to arrive perform
the sort and store the results in shared memory while everyone else
waits, but that seems to be more complicated and less efficient, so I
don't understand why you're proposing it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> More generally, I wonder if
>> it wouldn't be better to implement this behavior at runtime rather
>> than plan time.

> Ignoring some details around partial vs. non-partial plans, that's
> pretty much what we ARE doing, but to make it efficient, we sort the
> paths at plan time so that those choices are easy to make at runtime.
> If we didn't do that, we could have every worker sort the paths at
> execution time instead, or have the first process to arrive perform
> the sort and store the results in shared memory while everyone else
> waits, but that seems to be more complicated and less efficient, so I
> don't understand why you're proposing it.

The main bit of info we'd have at runtime that we lack at plan time is
certainty about the number of available workers.  Maybe that doesn't
really add anything useful to the order in which subplans would be doled
out; not sure.

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Robert Haas
Date:
On Mon, Jan 8, 2018 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ignoring some details around partial vs. non-partial plans, that's
>> pretty much what we ARE doing, but to make it efficient, we sort the
>> paths at plan time so that those choices are easy to make at runtime.
>> If we didn't do that, we could have every worker sort the paths at
>> execution time instead, or have the first process to arrive perform
>> the sort and store the results in shared memory while everyone else
>> waits, but that seems to be more complicated and less efficient, so I
>> don't understand why you're proposing it.
>
> The main bit of info we'd have at runtime that we lack at plan time is
> certainty about the number of available workers.  Maybe that doesn't
> really add anything useful to the order in which subplans would be doled
> out; not sure.

The current algorithm doesn't have any use for that information; I'm
not sure whether some other algorithm might.  If we had perfect
information, each participant would always choose the next task in
such a way as to minimize overall execution time.  Is it possible that
the correct choice might depend on how many workers we have in total?

/me thinks for a bit.  Yes, that's possible. Suppose that the
execution times of 5 non-partial plans are 100 99 60 40 1 and that the
number of participants is either 2 or 3.  Suppose further that the
number of rows returned is nominal so that there's no issue of the
worker(s) having to wait.  If there are 2 participants, one process
should get the plans with execution times 99 and 60 and the other
should get the rest.  If there are 3 participants, one process should
get only 100, the second should get 99 and 1, and the last should get
60 and 40.

In the actually implemented algorithm, the leader will grab the cheap
plan at the end, the worker(s) will grab the expensive plans at the
beginning, and things probably won't actually work out very nicely.
With three participants, the leader will end up with 1 and 40, the
first worker with 100 only, and the second worker with 99 and 60.
Fail.

The choice of which plan the leader takes might also depend on the
number of rows being returned by the various plans.  If there's one
plan that returns a lot of rows, it would be nice to run that one in
the leader, because then we save the cost of transferring those rows
between processes.  On the other hand, that also risks starving all of
the workers; it could be faster to incur the overhead of transferring
those rows so that the leader focuses on servicing the tuple queues
rather than running plans.

There are also problems with partial plans having high startup costs.
In general, if a partial plan has a high startup cost, we'd like the
leader not to choose it, because then it's not available to read
tuples.  So, if we have one plan with a startup cost of 1000 and
another plan with a startup cost of 800, then we might think that the
leader should prefer the latter.  However, if the former plan is a
parallel hash join and the latter is a non-parallel-aware join, then
the reverse *may* be true, because the startup cost of parallel hash
can (mostly) be *shared* among as many participants as we have,
whereas non-parallel-aware nodes will *repeat* the startup work.  Of
course, the plan tree doesn't carry information about whether startup
costs of parallel plans are likely to be amortized across workers or
repeated in each one, and even if it did, it does not seem to be
trivial to put that information to good use.

Yet another problem with parallel append -- and really, with parallel
planning in general -- is that it ignores the cost of resource
contention between cooperating backends.  If we do a Parallel Append
over Parallel Seq Scan operations, spreading workers out figures to
win if (1) there are enough of them to generate significant contention
on the in-memory data structures, which is especially likely if the
data is resident in memory or if (2) the different partitions are on
different filesystems that can satisfy I/O requests in parallel or
even if (3) the different partitions are all on the same filesystem
but that filesystem has enough spindles to sequential-scan them all at
once.  But it could easily lose if, say, the partitions are on
different filesystems serviced by a single disk head, and the
alternating I/O pattern produces a continuous stream of long seeks.
On the third hand, that case could work out too if the per-row
processing is enough that we weren't I/O-bound in the first place.

I'd be quite happy if somebody wanted to dig into these issues more.
I don't pretend that the existing implementation is anything more than
a first approximation of what we really want here.  It's just based on
the observations that (1) in practice, the leader is very often the
bottleneck when executing parallel queries and (2) a task that takes a
long time to run and can't be sped up by adding workers must get
started as soon as possible.  I have a feeling that to do this really
well is going to require information that we don't have readily
available, like which resources which subplans will use at which times
and how many CPUs we need to use to render the query I/O-bound.
However, there may be clear improvements that can be made with only
localized changes, and I welcome ideas.  Query planning appears to be
a hard problem.  :-p

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ example where current parallel-append dispatch algorithm loses ]

I should clarify my thinking a bit more: I was imagining that we might
switch to a system where, as each worker comes free, it makes a dynamic
choice of which subplan to take next.  That means that not only do we have
more knowledge of the number of available workers than the planner did,
but we also have hard knowledge of which subplans actually finished first.
So even (or especially?) granted that the cost estimates are imperfect,
that might provide a leg up in choosing how to schedule the remaining
tasks.

I haven't thought about it hard enough to have a clear idea of how that
might win.  But for sure I don't think we should dismiss out of hand
the idea that it could be better than choosing a fixed dispatch order
at plan time.

Anyway, I'm merely suggesting that as an approach worth investigating in
future research; I doubt anyone is going to produce an implementation
right away.  So we still want to figure out exactly what happened on
silverfish and how we can nail down that expected plan shape.  (Or,
perhaps, decide that we don't need to see that plan in the output?)

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Jim Finnerty
Date:
I see.  Maybe the EXPLAIN output can be made more stable when "EXPLAIN (costs
FALSE)", though?



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> One theory that can explain above failure is that the costs of
>> scanning some of the sub-paths is very close due to which sometimes
>> the results can vary.  If that is the case, then probably using
>> fuzz_factor in costs comparison (as is done in attached patch) can
>> improve the situation, may be we have to consider some other factors
>> like number of rows in each subpath.

> This isn't an acceptable solution because sorting requires that the
> comparison operator satisfy the transitive property; that is, if a = b
> and b = c then a = c.  With your proposed comparator, you could have a
> = b and b = c but a < c.  That will break stuff.

> It seems like the obvious fix here is to use a query where the
> contents of the partitions are such that the sorting always produces
> the same result.  We could do that either by changing the query or by
> changing the data in the partitions or, maybe, by inserting ANALYZE
> someplace.

The foo_star tables are made in create_table.sql, filled in
create_misc.sql, and not modified thereafter.  The fact that we have
accurate rowcounts for them in select_parallel.sql is because of the
database-wide VACUUM that happens at the start of sanity_check.sql.
Given the lack of any WHERE condition, the costs in this particular query
depend only on the rowcount and physical table size, so inserting an
ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
concerned about side-effects on other queries anyway if we were to ANALYZE
tables that have never been ANALYZEd in the regression tests before.
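
Concretely, for these WHERE-less scans each per-child cost is more or
less relpages * seq_page_cost + reltuples * cpu_tuple_cost, which is
easy enough to cross-check by hand, for instance:

show seq_page_cost;   -- 1 by default
show cpu_tuple_cost;  -- 0.01 by default
explain select aa from only b_star;
-- roughly 1 page * 1.0 + 4 rows * 0.01 = total cost 1.04

so an ANALYZE that doesn't change those inputs can't move anything around.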

I remain pretty much at a loss for exactly what happened on silverfish,
but that doesn't mean I have no ideas for improving things.  Both
append_total_cost_compare and append_startup_cost_compare are seriously
deficient IMO, because they ignore the question of what to do when total
costs (resp. startup costs) are exactly equal, leaving the outcome to
the none too tender mercies of qsort().  Which you'll recall is not a
stable algorithm.

(For fewer than 7 items, our implementation of qsort falls back to an
insertion sort, which *is* stable, perhaps explaining why we've not
seen failures in this test case many times before.  So this line of
thought doesn't explain what happened on silverfish.  But there's
surely room for improvement here.)

The normal rule in the planner, as embodied in places like
compare_path_costs, is to break ties on total cost by comparing startup
cost, and vice versa, which these two functions fail to do; so my first
point is that they ought to be using compare_path_costs rather than
inventing their own approach.  This would especially affect
append_startup_cost_compare, because of the frequency with which plans
have identical startup costs (ie 0).  Right now I'm afraid we're getting
basically random ordering of the partial subpaths in most cases.

In the regression test case at hand, the startup costs are all zero so
this change wouldn't improve the test case's stability.  So I'm thinking
that in addition, it would be a good idea for these functions to break
exact compare_path_costs ties in some arbitrary but deterministic way,
rather than letting qsort() have the final say on what happens.  If the
subplans were all simple relation scans we could order them by planner
relid, but I'm not sure what to do if they're not.

BTW, I had not looked closely at list_qsort before, but now that I have
it seems seriously bizarre.  Why is it allocating a new List header
rather than reusing the old one?  Because it stomps on the next-links
of the list cells, the old header is effectively corrupted and can't
possibly be useful afterwards, so we might as well re-use it and save
a palloc cycle.  (By the same token, it's pretty ludicrous that it
claims the input List is "const".  Not stomping on the list header
is a pretty poor definition of not modifying the list.)

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
I wrote:
> In the regression test case at hand, the startup costs are all zero so
> this change wouldn't improve the test case's stability.  So I'm thinking
> that in addition, it would be a good idea for these functions to break
> exact compare_path_costs ties in some arbitrary but deterministic way,
> rather than letting qsort() have the final say on what happens.  If the
> subplans were all simple relation scans we could order them by planner
> relid, but I'm not sure what to do if they're not.

Ah, I have an idea --- let's create a bms_compare() qsort comparator for
bitmapsets, and use that on the paths' relid sets.  It hardly matters
what the exact semantics of the comparator are, as long as it behaves
sanely for equal sets.

            regards, tom lane


Re: Parallel append plan instability/randomness

From
Robert Haas
Date:
On Mon, Jan 8, 2018 at 1:26 PM, Jim Finnerty <jfinnert@amazon.com> wrote:
> I see.  Maybe the EXPLAIN output can be made more stable when "EXPLAIN (costs
> FALSE)", though?

I think that would be fixing the problem from the wrong end.  We want
to actually get a stable choice of plan, not display something other
than the actual plan because that other thing is more stable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
I wrote:
>> In the regression test case at hand, the startup costs are all zero so
>> this change wouldn't improve the test case's stability.  So I'm thinking
>> that in addition, it would be a good idea for these functions to break
>> exact compare_path_costs ties in some arbitrary but deterministic way,
>> rather than letting qsort() have the final say on what happens.  If the
>> subplans were all simple relation scans we could order them by planner
>> relid, but I'm not sure what to do if they're not.

> Ah, I have an idea --- let's create a bms_compare() qsort comparator for
> bitmapsets, and use that on the paths' relid sets.  It hardly matters
> what the exact semantics of the comparator are, as long as it behaves
> sanely for equal sets.

Concretely, I propose the attached.  I spent a little bit of extra effort
on the comparator in hopes that it might be useful for something else.

            regards, tom lane

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 733fe3c..edcd19a 100644
*** a/src/backend/nodes/bitmapset.c
--- b/src/backend/nodes/bitmapset.c
*************** bms_equal(const Bitmapset *a, const Bitm
*** 173,178 ****
--- 173,222 ----
  }

  /*
+  * bms_compare - qsort-style comparator for bitmapsets
+  *
+  * This guarantees to report values as equal iff bms_equal would say they are
+  * equal.  Otherwise, the highest-numbered bit that is set in one value but
+  * not the other determines the result.  (This rule means that, for example,
+  * {6} is greater than {5}, which seems plausible.)
+  */
+ int
+ bms_compare(const Bitmapset *a, const Bitmapset *b)
+ {
+     int            shortlen;
+     int            i;
+
+     /* Handle cases where either input is NULL */
+     if (a == NULL)
+         return bms_is_empty(b) ? 0 : -1;
+     else if (b == NULL)
+         return bms_is_empty(a) ? 0 : +1;
+     /* Handle cases where one input is longer than the other */
+     shortlen = Min(a->nwords, b->nwords);
+     for (i = shortlen; i < a->nwords; i++)
+     {
+         if (a->words[i] != 0)
+             return +1;
+     }
+     for (i = shortlen; i < b->nwords; i++)
+     {
+         if (b->words[i] != 0)
+             return -1;
+     }
+     /* Process words in common */
+     i = shortlen;
+     while (--i >= 0)
+     {
+         bitmapword    aw = a->words[i];
+         bitmapword    bw = b->words[i];
+
+         if (aw != bw)
+             return (aw > bw) ? +1 : -1;
+     }
+     return 0;
+ }
+
+ /*
   * bms_make_singleton - build a bitmapset containing a single member
   */
  Bitmapset *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761..5c2b996 100644
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_append_path(RelOptInfo *rel,
*** 1274,1311 ****

  /*
   * append_total_cost_compare
!  *      list_qsort comparator for sorting append child paths by total_cost
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
      Path       *path1 = (Path *) lfirst(*(ListCell **) a);
      Path       *path2 = (Path *) lfirst(*(ListCell **) b);

!     if (path1->total_cost > path2->total_cost)
!         return -1;
!     if (path1->total_cost < path2->total_cost)
!         return 1;
!
!     return 0;
  }

  /*
   * append_startup_cost_compare
!  *      list_qsort comparator for sorting append child paths by startup_cost
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
      Path       *path1 = (Path *) lfirst(*(ListCell **) a);
      Path       *path2 = (Path *) lfirst(*(ListCell **) b);

!     if (path1->startup_cost > path2->startup_cost)
!         return -1;
!     if (path1->startup_cost < path2->startup_cost)
!         return 1;
!
!     return 0;
  }

  /*
--- 1274,1317 ----

  /*
   * append_total_cost_compare
!  *      qsort comparator for sorting append child paths by total_cost descending
!  *
!  * For equal total costs, we fall back to comparing startup costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
      Path       *path1 = (Path *) lfirst(*(ListCell **) a);
      Path       *path2 = (Path *) lfirst(*(ListCell **) b);
+     int            cmp;

!     cmp = compare_path_costs(path1, path2, TOTAL_COST);
!     if (cmp == 0)
!         cmp = bms_compare(path1->parent->relids, path2->parent->relids);
!     return -cmp;
  }

  /*
   * append_startup_cost_compare
!  *      qsort comparator for sorting append child paths by startup_cost descending
!  *
!  * For equal startup costs, we fall back to comparing total costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
      Path       *path1 = (Path *) lfirst(*(ListCell **) a);
      Path       *path2 = (Path *) lfirst(*(ListCell **) b);
+     int            cmp;

!     cmp = compare_path_costs(path1, path2, STARTUP_COST);
!     if (cmp == 0)
!         cmp = bms_compare(path1->parent->relids, path2->parent->relids);
!     return -cmp;
  }

  /*
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 15397e9..67e8920 100644
*** a/src/include/nodes/bitmapset.h
--- b/src/include/nodes/bitmapset.h
*************** typedef enum
*** 65,70 ****
--- 65,71 ----

  extern Bitmapset *bms_copy(const Bitmapset *a);
  extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
+ extern int    bms_compare(const Bitmapset *a, const Bitmapset *b);
  extern Bitmapset *bms_make_singleton(int x);
  extern void bms_free(Bitmapset *a);

diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 7824ca5..5f73be4 100644
*** a/src/test/regress/expected/select_parallel.out
--- b/src/test/regress/expected/select_parallel.out
*************** explain (costs off)
*** 21,32 ****
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
!                      ->  Parallel Seq Scan on a_star
!                      ->  Parallel Seq Scan on b_star
!                      ->  Parallel Seq Scan on c_star
                       ->  Parallel Seq Scan on d_star
                       ->  Parallel Seq Scan on e_star
!                      ->  Parallel Seq Scan on f_star
  (11 rows)

  select round(avg(aa)), sum(aa) from a_star a1;
--- 21,32 ----
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
!                      ->  Parallel Seq Scan on f_star
                       ->  Parallel Seq Scan on d_star
                       ->  Parallel Seq Scan on e_star
!                      ->  Parallel Seq Scan on c_star
!                      ->  Parallel Seq Scan on b_star
!                      ->  Parallel Seq Scan on a_star
  (11 rows)

  select round(avg(aa)), sum(aa) from a_star a1;
*************** explain (costs off)
*** 49,58 ****
                 ->  Parallel Append
                       ->  Seq Scan on d_star
                       ->  Seq Scan on c_star
-                      ->  Parallel Seq Scan on a_star
-                      ->  Parallel Seq Scan on b_star
-                      ->  Parallel Seq Scan on e_star
                       ->  Parallel Seq Scan on f_star
  (11 rows)

  select round(avg(aa)), sum(aa) from a_star a2;
--- 49,58 ----
                 ->  Parallel Append
                       ->  Seq Scan on d_star
                       ->  Seq Scan on c_star
                       ->  Parallel Seq Scan on f_star
+                      ->  Parallel Seq Scan on e_star
+                      ->  Parallel Seq Scan on b_star
+                      ->  Parallel Seq Scan on a_star
  (11 rows)

  select round(avg(aa)), sum(aa) from a_star a2;
*************** explain (costs off)
*** 75,85 ****
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
-                      ->  Seq Scan on d_star
                       ->  Seq Scan on f_star
                       ->  Seq Scan on e_star
-                      ->  Seq Scan on b_star
                       ->  Seq Scan on c_star
                       ->  Seq Scan on a_star
  (11 rows)

--- 75,85 ----
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
                       ->  Seq Scan on f_star
+                      ->  Seq Scan on d_star
                       ->  Seq Scan on e_star
                       ->  Seq Scan on c_star
+                      ->  Seq Scan on b_star
                       ->  Seq Scan on a_star
  (11 rows)


Re: Parallel append plan instability/randomness

From
Tom Lane
Date:
I wrote:
> BTW, I had not looked closely at list_qsort before, but now that I have
> it seems seriously bizarre.  Why is it allocating a new List header
> rather than reusing the old one?  Because it stomps on the next-links
> of the list cells, the old header is effectively corrupted and can't
> possibly be useful afterwards, so we might as well re-use it and save
> a palloc cycle.  (By the same token, it's pretty ludicrous that it
> claims the input List is "const".  Not stomping on the list header
> is a pretty poor definition of not modifying the list.)

Actually, after looking closer, I think the current implementation
of list_qsort is actively dangerous.  If we left the code like this,
we would have to document that create_append_path corrupts its caller's
copies of subpaths and partial_subpaths when parallel_aware is true
(and by "corrupt" I don't just mean "reorder": the list headers the
caller has pointers to are now flat wrong).  Even with the change
I imagine above, we'd have to document that the caller's lists get
reordered.  Although the sole caller that's currently affected doesn't
currently do anything with those lists after the call, it's not terribly
hard to imagine that future improvements there would lead to trying to
build multiple paths with the same lists or something like that, so
that we'd have hidden interactions between different Paths.  The same
types of hazards would apply to every future use of list_qsort.

(In fact, I spent some time trying to think of a way that this could have
led to the silverfish failure we started the thread with.  I couldn't
convince myself that it was possible, but I'm not entirely convinced it's
not, either.)

So I think that re-using the caller's list cells is another penny-wise-
but-pound-foolish idea, and that we ought to just build a new list to
avoid the hazard.  As per attached.

            regards, tom lane

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 083538f..fe085d2 100644
*** a/src/backend/nodes/list.c
--- b/src/backend/nodes/list.c
*************** list_copy_tail(const List *oldlist, int
*** 1250,1290 ****
  }

  /*
!  * Sort a list using qsort. A sorted list is built but the cells of the
!  * original list are re-used.  The comparator function receives arguments of
!  * type ListCell **
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
-     ListCell   *cell;
-     int            i;
      int            len = list_length(list);
      ListCell  **list_arr;
!     List       *new_list;

      if (len == 0)
          return NIL;

      i = 0;
-     list_arr = palloc(sizeof(ListCell *) * len);
      foreach(cell, list)
          list_arr[i++] = cell;

      qsort(list_arr, len, sizeof(ListCell *), cmp);

!     new_list = (List *) palloc(sizeof(List));
!     new_list->type = list->type;
!     new_list->length = len;
!     new_list->head = list_arr[0];
!     new_list->tail = list_arr[len - 1];

!     for (i = 0; i < len - 1; i++)
!         list_arr[i]->next = list_arr[i + 1];

!     list_arr[len - 1]->next = NULL;
      pfree(list_arr);
!     return new_list;
  }

  /*
--- 1250,1312 ----
  }

  /*
!  * Sort a list as though by qsort.
!  *
!  * A new list is built and returned.
!  * The comparator function receives arguments of type ListCell **.
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
      int            len = list_length(list);
      ListCell  **list_arr;
!     List       *newlist;
!     ListCell   *newlist_prev;
!     ListCell   *cell;
!     int            i;

+     /* Empty list is easy */
      if (len == 0)
          return NIL;

+     /* Flatten list cells into an array, so we can use qsort */
+     list_arr = (ListCell **) palloc(sizeof(ListCell *) * len);
      i = 0;
      foreach(cell, list)
          list_arr[i++] = cell;

      qsort(list_arr, len, sizeof(ListCell *), cmp);

!     /* Construct new list (this code is much like list_copy) */
!     newlist = new_list(list->type);
!     newlist->length = len;

!     /*
!      * Copy over the data in the first cell; new_list() has already allocated
!      * the head cell itself
!      */
!     newlist->head->data = list_arr[0]->data;

!     newlist_prev = newlist->head;
!     for (i = 1; i < len; i++)
!     {
!         ListCell   *newlist_cur;
!
!         newlist_cur = (ListCell *) palloc(sizeof(*newlist_cur));
!         newlist_cur->data = list_arr[i]->data;
!         newlist_prev->next = newlist_cur;
!
!         newlist_prev = newlist_cur;
!     }
!
!     newlist_prev->next = NULL;
!     newlist->tail = newlist_prev;
!
!     /* Might as well free the workspace array */
      pfree(list_arr);
!
!     check_list_invariants(newlist);
!     return newlist;
  }

  /*

Re: Parallel append plan instability/randomness

From
Amit Kapila
Date:
On Tue, Jan 9, 2018 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> One theory that can explain above failure is that the costs of
>>> scanning some of the sub-paths is very close due to which sometimes
>>> the results can vary.  If that is the case, then probably using
>>> fuzz_factor in costs comparison (as is done in attached patch) can
>>> improve the situation, may be we have to consider some other factors
>>> like number of rows in each subpath.
>
>> This isn't an acceptable solution because sorting requires that the
>> comparison operator satisfy the transitive property; that is, if a = b
>> and b = c then a = c.  With your proposed comparator, you could have a
>> = b and b = c but a < c.  That will break stuff.
>
>> It seems like the obvious fix here is to use a query where the
>> contents of the partitions are such that the sorting always produces
>> the same result.  We could do that either by changing the query or by
>> changing the data in the partitions or, maybe, by inserting ANALYZE
>> someplace.
>
> The foo_star tables are made in create_table.sql, filled in
> create_misc.sql, and not modified thereafter.  The fact that we have
> accurate rowcounts for them in select_parallel.sql is because of the
> database-wide VACUUM that happens at the start of sanity_check.sql.
> Given the lack of any WHERE condition, the costs in this particular query
> depend only on the rowcount and physical table size, so inserting an
> ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
> concerned about side-effects on other queries anyway if we were to ANALYZE
> tables that have never been ANALYZEd in the regression tests before.
>

Fair point.  This seems to indicate that wrong rowcounts are probably
not the reason for the failure.  However, I think it might still be
good to use a different set of tables (probably create new tables with
appropriate data for these queries) and analyze them explicitly before
these queries, rather than relying on the execution order of some not
directly related tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com