Thread: [HACKERS] Partition-wise aggregation/grouping

[HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi all,

Declarative partitioning is supported in PostgreSQL 10 and work is already in
progress to support partition-wise joins. Here is a proposal for partition-wise
aggregation/grouping.  Our initial performance measurements have shown a 7x
performance improvement when partitions are on foreign servers and
approximately 15% when partitions are local.

Partition-wise aggregation/grouping computes aggregates for each partition
separately.  If the group clause contains the partition key, all the rows
belonging to a given group come from one partition, thus allowing aggregates
to be computed completely for each partition.  Otherwise, partial aggregates
computed for each partition are combined across the partitions to produce the
final aggregates. This technique improves performance because:
i. When partitions are located on a foreign server, we can push down the
aggregate to the foreign server.
ii. If the hash table for each partition fits in memory, but that for the whole
relation does not, each partition-wise aggregate can use an in-memory hash
table.
iii. Aggregation at the level of partitions can exploit properties of the
partitions, like their indexes, storage, etc.
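For intuition, the second (partial-aggregation) mode can be sketched outside
the planner. The following is an illustrative Python toy (not PostgreSQL code;
the data and function names are made up): each partition produces a per-group
transient state, and the states are merged and finalized across partitions,
here for avg():

```python
# Illustrative sketch (not PostgreSQL code): combining per-partition
# partial aggregates, as done when the GROUP BY clause does not
# contain the partition key.
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 2)],           # partition 1: (group key, value)
    [("a", 3), ("b", 4), ("a", 5)]  # partition 2
]

def partial_agg(rows):
    """Per-partition partial state for avg(): [sum, count] per group."""
    state = defaultdict(lambda: [0, 0])
    for key, val in rows:
        state[key][0] += val
        state[key][1] += 1
    return state

def finalize(partials):
    """Combine partial states across partitions and finalize avg()."""
    combined = defaultdict(lambda: [0, 0])
    for state in partials:
        for key, (s, c) in state.items():
            combined[key][0] += s
            combined[key][1] += c
    return {key: s / c for key, (s, c) in combined.items()}

result = finalize(partial_agg(p) for p in partitions)
print(result)  # {'a': 3.0, 'b': 3.0}
```

count() and sum() combine across partitions the same way; only the finalize
step differs per aggregate.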

Attached is an experimental patch for the same, based on the partition-wise
join patches posted in [1].

This patch currently implements partition-wise aggregation when the group
clause contains the partitioning key.  The query below, involving a partitioned
table with 3 partitions containing 1M rows each and producing 30 groups in
total, showed a 15% improvement over non-partition-wise aggregation. The same
query showed a 7x improvement when the partitions were located on foreign
servers.

Here is the sample plan:

postgres=# set enable_partition_wise_agg to true;
SET
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;
                                                  QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------------
 Append  (cost=5100.00..61518.90 rows=30 width=12) (actual time=324.837..944.804 rows=30 loops=1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=324.837..324.838 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p1 plt1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=309.954..309.956 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p2 plt1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=310.002..310.004 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p3 plt1)
 Planning time: 0.370 ms
 Execution time: 945.384 ms
(9 rows)

postgres=# set enable_partition_wise_agg to false;
SET
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=121518.01..121518.31 rows=30 width=12) (actual time=6498.452..6498.459 rows=30 loops=1)
   Group Key: plt1.a
   ->  Append  (cost=0.00..106518.00 rows=3000001 width=4) (actual time=0.595..5769.592 rows=3000000 loops=1)
         ->  Seq Scan on plt1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.007..0.007 rows=0 loops=1)
         ->  Foreign Scan on fplt1_p1  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.587..1844.506 rows=1000000 loops=1)
         ->  Foreign Scan on fplt1_p2  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.384..1839.633 rows=1000000 loops=1)
         ->  Foreign Scan on fplt1_p3  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.402..1876.505 rows=1000000 loops=1)
 Planning time: 0.251 ms
 Execution time: 6499.018 ms
(9 rows)

The patch needs a lot of improvement, including:
1. Support for partial partition-wise aggregation
2. Estimating the number of groups for every partition
3. Estimating the cost of partition-wise aggregation based on sample
partitions, similar to partition-wise join
and much more.

In order to support partial aggregation on foreign partitions, we need support
for fetching partially aggregated results from the foreign server. That can be
handled as a separate follow-on patch.

Though there is a lot of work to be done, I would like to get
suggestions/opinions from hackers.

I would like to thank Ashutosh Bapat for providing a draft patch and helping
me off-list on this feature while he is busy working on the partition-wise
join feature.

[1] https://www.postgresql.org/message-id/CAFjFpRcbY2QN3cfeMTzVEoyF5Lfku-ijyNR%3DPbXj1e%3D9a%3DqMoQ%40mail.gmail.com

Thanks

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Antonin Houska
Date:
Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

> Declarative partitioning is supported in PostgreSQL 10 and work is already in
> progress to support partition-wise joins. Here is a proposal for partition-wise
> aggregation/grouping. Our initial performance measurement has shown 7 times
> performance when partitions are on foreign servers and approximately 15% when
> partitions are local.
>
> Partition-wise aggregation/grouping computes aggregates for each partition
> separately. If the group clause contains the partition key, all the rows
> belonging to a given group come from one partition, thus allowing aggregates
> to be computed completely for each partition. Otherwise, partial aggregates
> computed for each partition are combined across the partitions to produce the
> final aggregates. This technique improves performance because:

> i. When partitions are located on foreign server, we can push down the
> aggregate to the foreign server.

> ii. If hash table for each partition fits in memory, but that for the whole
> relation does not, each partition-wise aggregate can use an in-memory hash
> table.

> iii. Aggregation at the level of partitions can exploit properties of
> partitions like indexes, their storage etc.

I suspect this overlaps with

https://www.postgresql.org/message-id/29111.1483984605%40localhost

I'm working on the next version of the patch, which will be able to aggregate
the result of both base relation scans and joins. I'm trying hard to make the
next version available before an urgent vacation that I'll have to take at a
random date between today and early April. I suggest that we coordinate the
effort; it's a lot of work in any case.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 21, 2017 at 1:47 PM, Antonin Houska <ah@cybertec.at> wrote:
> Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>
> > Declarative partitioning is supported in PostgreSQL 10 and work is already in
> > progress to support partition-wise joins. Here is a proposal for partition-wise
> > aggregation/grouping. Our initial performance measurement has shown 7 times
> > performance when partitions are on foreign servers and approximately 15% when
> > partitions are local.
> >
> > Partition-wise aggregation/grouping computes aggregates for each partition
> > separately. If the group clause contains the partition key, all the rows
> > belonging to a given group come from one partition, thus allowing aggregates
> > to be computed completely for each partition. Otherwise, partial aggregates
> > computed for each partition are combined across the partitions to produce the
> > final aggregates. This technique improves performance because:
>
> > i. When partitions are located on foreign server, we can push down the
> > aggregate to the foreign server.
>
> > ii. If hash table for each partition fits in memory, but that for the whole
> > relation does not, each partition-wise aggregate can use an in-memory hash
> > table.
>
> > iii. Aggregation at the level of partitions can exploit properties of
> > partitions like indexes, their storage etc.
>
> I suspect this overlaps with
>
> https://www.postgresql.org/message-id/29111.1483984605%40localhost
>
> I'm working on the next version of the patch, which will be able to aggregate
> the result of both base relation scans and joins. I'm trying hard to make the
> next version available before an urgent vacation that I'll have to take at
> random date between today and early April. I suggest that we coordinate the
> effort, it's lot of work in any case.

IIUC, it seems that you are trying to push down the aggregation into the
joining relations. So basically you are converting
Agg -> Join -> {scan1, scan2} into
FinalAgg -> Join -> {PartialAgg -> scan1, PartialAgg -> scan2}.
In addition to that, your patch pushes aggregates on a base rel down to its
children, if any.

Whereas what I propose here is pushing down the aggregation below the Append
node, keeping the join/scan as is. So basically I am converting
Agg -> Append -> Join -> {scan1, scan2} into
Append -> Agg -> Join -> {scan1, scan2}.
This will require partition-wise join as posted in [1].
But I am planning to make this work for partitioned relations and not for
generic inheritance.

I treat these two as separate strategies/paths to be considered while planning.

Our work will overlap when we are pushing down the aggregate on a partitioned
base relation to its children/partitions.

I think you should continue working on pushing down aggregates onto
joins/scans, whereas I will continue my work on pushing down aggregates to
partitions (joins as well as single tables). Once we are done with these tasks,
we may need to find a way to integrate them.

[1] https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com

 



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Partition-wise aggregation/grouping

From
Antonin Houska
Date:
The promised new version of my patch is here:

https://www.postgresql.org/message-id/9666.1491295317%40localhost

Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

> On Tue, Mar 21, 2017 at 1:47 PM, Antonin Houska <ah@cybertec.at> wrote:
>
>  Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>
> IIUC, it seems that you are trying to push down the aggregation into the
> joining relations. So basically you are converting
> Agg -> Join -> {scan1, scan2} into
> FinalAgg -> Join -> {PartialAgg -> scan1, PartialAgg -> scan2}.
> In addition to that your patch pushes aggregates on base rel to its children,
> if any.
>
> Where as what I propose here is pushing down aggregation below the append
> node keeping join/scan as is. So basically I am converting
> Agg -> Append-> Join -> {scan1, scan2} into
> Append -> Agg -> Join -> {scan1, scan2}.
> This will require partition-wise join as posted in [1].
> But I am planning to make this work for partitioned relations and not for
> generic inheritance.
>
> I treat these two as separate strategies/paths to be consider while planning.
>
> Our work will overlap when we are pushing down the aggregate on partitioned
> base relation to its children/partitions.
>
> I think you should continue working on pushing down aggregate onto the
> joins/scans where as I will continue my work on pushing down aggregates to
> partitions (joins as well as single table). Once we are done with these task,
> then we may need to find a way to integrate them.
>
> [1]
https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com

My patch also creates (partial) aggregation paths below the Append node, but
only expects a SeqScan as input. Please check if your patch can be based on
this or if there's any conflict.

(I'll probably be unable to respond before Monday 04/17.)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: Partition-wise aggregation/grouping

From
Antonin Houska
Date:
Antonin Houska <ah@cybertec.at> wrote:
>
> Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
>
> > Our work will overlap when we are pushing down the aggregate on partitioned
> > base relation to its children/partitions.
> >
> > I think you should continue working on pushing down aggregate onto the
> > joins/scans where as I will continue my work on pushing down aggregates to
> > partitions (joins as well as single table). Once we are done with these task,
> > then we may need to find a way to integrate them.
> >
> > [1]
https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
>
> My patch does also create (partial) aggregation paths below the Append node,
> but only expects SeqScan as input. Please check if you patch can be based on
> this or if there's any conflict.

Well, I haven't imposed any explicit restriction on the kind of path to be
aggregated below the Append path. Maybe the only thing to do is to merge my
patch with the "partition-wise join" patch (which I haven't checked yet).

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Partition-wise aggregation/grouping

From
Antonin Houska
Date:
Antonin Houska <ah@cybertec.at> wrote:

> Antonin Houska <ah@cybertec.at> wrote:
> >
> > Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
> >
> > > Our work will overlap when we are pushing down the aggregate on partitioned
> > > base relation to its children/partitions.
> > >
> > > I think you should continue working on pushing down aggregate onto the
> > > joins/scans where as I will continue my work on pushing down aggregates to
> > > partitions (joins as well as single table). Once we are done with these task,
> > > then we may need to find a way to integrate them.
> > >
> > > [1]
https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
> >
> > My patch does also create (partial) aggregation paths below the Append node,
> > but only expects SeqScan as input. Please check if you patch can be based on
> > this or if there's any conflict.
>
> Well, I haven't imposed any explicit restriction on the kind of path to be
> aggregated below the Append path. Maybe the only thing to do is to merge my
> patch with the "partition-wise join" patch (which I haven't checked yet).

Attached is a diff that contains both patches merged. This is just to prove my
assumption; details will be elaborated later. The attached scripts produce the
following plan in my environment:

                   QUERY PLAN
------------------------------------------------
 Parallel Finalize HashAggregate
   Group Key: b_1.j
   ->  Append
         ->  Parallel Partial HashAggregate
               Group Key: b_1.j
               ->  Hash Join
                     Hash Cond: (b_1.j = c_1.k)
                     ->  Seq Scan on b_1
                     ->  Hash
                           ->  Seq Scan on c_1
         ->  Parallel Partial HashAggregate
               Group Key: b_2.j
               ->  Hash Join
                     Hash Cond: (b_2.j = c_2.k)
                     ->  Seq Scan on b_2
                     ->  Hash
                           ->  Seq Scan on c_2


Note that I had no better idea how to enforce this plan than hard-wiring zero
costs for the partial aggregation paths. This simulates the use case of partial
aggregation performed on a remote node (postgres_fdw). Other use cases may
exist, but so far I only wanted to prove the concept in terms of coding.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Wed, Apr 26, 2017 at 6:28 AM, Antonin Houska <ah@cybertec.at> wrote:
> Attached is a diff that contains both patches merged. This is just to prove my
> assumption, details to be elaborated later. The scripts attached produce the
> following plan in my environment:
>
>                    QUERY PLAN
> ------------------------------------------------
>  Parallel Finalize HashAggregate
>    Group Key: b_1.j
>    ->  Append
>          ->  Parallel Partial HashAggregate
>                Group Key: b_1.j
>                ->  Hash Join
>                      Hash Cond: (b_1.j = c_1.k)
>                      ->  Seq Scan on b_1
>                      ->  Hash
>                            ->  Seq Scan on c_1
>          ->  Parallel Partial HashAggregate
>                Group Key: b_2.j
>                ->  Hash Join
>                      Hash Cond: (b_2.j = c_2.k)
>                      ->  Seq Scan on b_2
>                      ->  Hash
>                            ->  Seq Scan on c_2

Well, I'm confused.  I see that there's a relationship between what
Antonin is trying to do and what Jeevan is trying to do, but I can't
figure out whether one is a subset of the other, whether they're both
orthogonal, or something else.  This plan looks similar to what I
would expect Jeevan's patch to produce, except I have no idea what
"Parallel" would mean in a plan that contains no Gather node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Apr 26, 2017 at 6:28 AM, Antonin Houska <ah@cybertec.at> wrote:
> > Attached is a diff that contains both patches merged. This is just to prove my
> > assumption, details to be elaborated later. The scripts attached produce the
> > following plan in my environment:
> >
> >                    QUERY PLAN
> > ------------------------------------------------
> >  Parallel Finalize HashAggregate
> >    Group Key: b_1.j
> >    ->  Append
> >          ->  Parallel Partial HashAggregate
> >                Group Key: b_1.j
> >                ->  Hash Join
> >                      Hash Cond: (b_1.j = c_1.k)
> >                      ->  Seq Scan on b_1
> >                      ->  Hash
> >                            ->  Seq Scan on c_1
> >          ->  Parallel Partial HashAggregate
> >                Group Key: b_2.j
> >                ->  Hash Join
> >                      Hash Cond: (b_2.j = c_2.k)
> >                      ->  Seq Scan on b_2
> >                      ->  Hash
> >                            ->  Seq Scan on c_2
>
> Well, I'm confused.  I see that there's a relationship between what
> Antonin is trying to do and what Jeevan is trying to do, but I can't
> figure out whether one is a subset of the other, whether they're both
> orthogonal, or something else.  This plan looks similar to what I
> would expect Jeevan's patch to produce,

The point is that the patch Jeevan wanted to work on is actually a subset of
[1] combined with [2].

> except i have no idea what "Parallel" would mean in a plan that contains no
> Gather node.

The parallel_aware field was set mistakenly on the AggPath. A fixed patch is
attached below, producing this plan:

                   QUERY PLAN
------------------------------------------------
 Finalize HashAggregate
   Group Key: b_1.j
   ->  Append
         ->  Partial HashAggregate
               Group Key: b_1.j
               ->  Hash Join
                     Hash Cond: (b_1.j = c_1.k)
                     ->  Seq Scan on b_1
                     ->  Hash
                           ->  Seq Scan on c_1
         ->  Partial HashAggregate
               Group Key: b_2.j
               ->  Hash Join
                     Hash Cond: (b_2.j = c_2.k)
                     ->  Seq Scan on b_2
                     ->  Hash
                           ->  Seq Scan on c_2

[1] https://www.postgresql.org/message-id/9666.1491295317%40localhost

[2] https://commitfest.postgresql.org/14/994/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Apr 27, 2017 at 4:53 PM, Antonin Houska <ah@cybertec.at> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
> > On Wed, Apr 26, 2017 at 6:28 AM, Antonin Houska <ah@cybertec.at> wrote:
> > > Attached is a diff that contains both patches merged. This is just to prove my
> > > assumption, details to be elaborated later. The scripts attached produce the
> > > following plan in my environment:
> > >
> > >                    QUERY PLAN
> > > ------------------------------------------------
> > >  Parallel Finalize HashAggregate
> > >    Group Key: b_1.j
> > >    ->  Append
> > >          ->  Parallel Partial HashAggregate
> > >                Group Key: b_1.j
> > >                ->  Hash Join
> > >                      Hash Cond: (b_1.j = c_1.k)
> > >                      ->  Seq Scan on b_1
> > >                      ->  Hash
> > >                            ->  Seq Scan on c_1
> > >          ->  Parallel Partial HashAggregate
> > >                Group Key: b_2.j
> > >                ->  Hash Join
> > >                      Hash Cond: (b_2.j = c_2.k)
> > >                      ->  Seq Scan on b_2
> > >                      ->  Hash
> > >                            ->  Seq Scan on c_2
> >
> > Well, I'm confused.  I see that there's a relationship between what
> > Antonin is trying to do and what Jeevan is trying to do, but I can't
> > figure out whether one is a subset of the other, whether they're both
> > orthogonal, or something else.  This plan looks similar to what I
> > would expect Jeevan's patch to produce,
>
> The point is that the patch Jeevan wanted to work on is actually a subset of
> [1] combined with [2].

Seems like it, as you are targeting every relation whether or not it is
partitioned, whereas I am targeting only partitioned relations in my
patch.

> > except i have no idea what "Parallel" would mean in a plan that contains no
> > Gather node.
>
> parallel_aware field was set mistakenly on the AggPath. Fixed patch is
> attached below, producing this plan:
>
>                    QUERY PLAN
> ------------------------------------------------
>  Finalize HashAggregate
>    Group Key: b_1.j
>    ->  Append
>          ->  Partial HashAggregate
>                Group Key: b_1.j
>                ->  Hash Join
>                      Hash Cond: (b_1.j = c_1.k)
>                      ->  Seq Scan on b_1
>                      ->  Hash
>                            ->  Seq Scan on c_1
>          ->  Partial HashAggregate
>                Group Key: b_2.j
>                ->  Hash Join
>                      Hash Cond: (b_2.j = c_2.k)
>                      ->  Seq Scan on b_2
>                      ->  Hash
>                            ->  Seq Scan on c_2

With my patch, I am getting the following plan, where we push the entire
aggregation below the Append node.

                QUERY PLAN               
------------------------------------------
 Append
   ->  HashAggregate
         Group Key: b_1.j
         ->  Hash Join
               Hash Cond: (b_1.j = c_1.k)
               ->  Seq Scan on b_1
               ->  Hash
                     ->  Seq Scan on c_1
   ->  HashAggregate
         Group Key: b_2.j
         ->  Hash Join
               Hash Cond: (b_2.j = c_2.k)
               ->  Seq Scan on b_2
               ->  Hash
                     ->  Seq Scan on c_2
(15 rows)


Antonin, I have tried applying your patch on master but it doesn't apply.
Can you please provide the HEAD commit and any other changes required to be
applied first?

What does the plan look like when the GROUP BY key does not match the
partitioning key, i.e. GROUP BY b.v?
 

> [1] https://www.postgresql.org/message-id/9666.1491295317%40localhost
>
> [2] https://commitfest.postgresql.org/14/994/
>
> --
> Antonin Houska
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de, http://www.cybertec.at







--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb


Re: [HACKERS] Partition-wise aggregation/grouping

From
Antonin Houska
Date:
Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

> On Thu, Apr 27, 2017 at 4:53 PM, Antonin Houska <ah@cybertec.at> wrote:
>
> > Robert Haas <robertmhaas@gmail.com> wrote:
> > > Well, I'm confused. I see that there's a relationship between what
> > > Antonin is trying to do and what Jeevan is trying to do, but I can't
> > > figure out whether one is a subset of the other, whether they're both
> > > orthogonal, or something else. This plan looks similar to what I
> > > would expect Jeevan's patch to produce,

> >  The point is that the patch Jeevan wanted to work on is actually a subset of
> >  [1] combined with [2].

> Seems like, as you are targeting every relation whether or not it is
> partitioned.

Yes.

> With my patch, I am getting following plan where we push entire
> aggregation below append.
>
>                 QUERY PLAN
> ------------------------------------------
>  Append
>    ->  HashAggregate
>          Group Key: b_1.j
>          ->  Hash Join
>                Hash Cond: (b_1.j = c_1.k)
>                ->  Seq Scan on b_1
>                ->  Hash
>                      ->  Seq Scan on c_1
>    ->  HashAggregate
>          Group Key: b_2.j
>          ->  Hash Join
>                Hash Cond: (b_2.j = c_2.k)
>                ->  Seq Scan on b_2
>                ->  Hash
>                      ->  Seq Scan on c_2
> (15 rows)

I think this is not generic enough because the result of the Append plan can
be joined to another relation. As such a join can duplicate the
already-aggregated values, the aggregates should not be finalized below the
top-level plan.
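This hazard can be sketched with an illustrative Python toy (not planner code;
the data is made up): if count(*) is finalized below a join, a join partner
with duplicate keys repeats the already-aggregated rows, so any further
aggregation over the join result would be wrong.

```python
# Illustrative sketch: finalizing count(*) below a join is unsafe
# when the join can duplicate rows.
rows = [("a", 1), ("a", 2), ("b", 3)]   # relation to be grouped
other = [("a", "x"), ("a", "y")]        # two join partners for key 'a'

# count(*) per key, finalized before the join:
counts = {}
for key, _ in rows:
    counts[key] = counts.get(key, 0) + 1   # {'a': 2, 'b': 1}

# Join the finalized aggregate to `other` on the key:
joined = [(k, counts[k]) for k, _ in other if k in counts]
print(joined)  # [('a', 2), ('a', 2)] -- the count for 'a' appears twice

# Summing these joined counts per key would yield 4 for 'a', not 2,
# which is why the aggregate must stay partial below such a join and
# be finalized only at the top of the plan.
```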

> Antonin, I have tried applying your patch on master but it doesn't get
> apply. Can you please provide the HEAD and any other changes required
> to be applied first?

I've lost that information. I'll post a new version to the [1] thread asap.

> How the plan look like when GROUP BY key does not match with the
> partitioning key i.e. GROUP BY b.v ?

EXPLAIN (COSTS false)
SELECT          b.v, avg(b.v + c.v)
FROM            b               JOIN               c ON b.j = c.k
GROUP BY        b.v;
                  QUERY PLAN
------------------------------------------------
 Finalize HashAggregate
   Group Key: b_1.v
   ->  Append
         ->  Partial HashAggregate
               Group Key: b_1.v
               ->  Hash Join
                     Hash Cond: (b_1.j = c_1.k)
                     ->  Seq Scan on b_1
                     ->  Hash
                           ->  Seq Scan on c_1
         ->  Partial HashAggregate
               Group Key: b_2.v
               ->  Hash Join
                     Hash Cond: (b_2.j = c_2.k)
                     ->  Seq Scan on b_2
                     ->  Hash
                           ->  Seq Scan on c_2


> >  [1] https://www.postgresql.org/message-id/9666.1491295317%40localhost
> >
> >  [2] https://commitfest.postgresql.org/14/994/


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Apr 28, 2017 at 3:03 AM, Antonin Houska <ah@cybertec.at> wrote:
> I think this is not generic enough because the result of the Append plan can
> be joined to another relation. As such a join can duplicate the
> already-aggregated values, the aggregates should not be finalized below the
> top-level plan.

If the grouping key matches the partition key, then it's correct to
push the entire aggregate down, and there's probably a large
performance advantage from avoiding aggregating twice.  If the two
don't match, then pushing the aggregate down necessarily involves a
"partial" and a "finalize" stage, which may or may not be cheaper than
doing the aggregation all at once.  If you have lots of 2-row groups
with 1 row in the first branch of the append and 1 row in the second
branch of the append, breaking the aggregate into two steps is
probably going to be a loser.  If the overall number of groups is
small, it's probably going to win.  But when the grouping key matches
the partition key, so that two-stage aggregation isn't required, I
suspect the pushdown should almost always win.
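This intuition can be sketched with a toy row-count model (an illustrative
assumption, not the planner's actual cost model): the partial step helps only
insofar as it emits fewer rows into the Finalize step than it reads.

```python
# Toy model: rows flowing into the Finalize step under two-stage
# aggregation, for partitions of 1000 rows each.
def rows_after_partial(rows_per_partition, groups_per_partition, n_partitions=2):
    """Each partition emits at most one partial row per group it sees."""
    return n_partitions * min(rows_per_partition, groups_per_partition)

# Many tiny groups, each split across the partitions: no reduction at all,
# so the two-stage plan only adds overhead.
worst = rows_after_partial(1000, 1000)   # 2000 rows reach Finalize
# A small overall number of groups: a huge reduction.
best = rows_after_partial(1000, 30)      # 60 rows reach Finalize
print(worst, best)  # 2000 60
```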

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi,

Attached is the patch to implement partition-wise aggregation/grouping.

As explained earlier, we produce a full aggregation for each partition when the
partition keys are leading GROUP BY clauses, and an Append is then performed.
Otherwise, we do a partial aggregation on each partition, append the results,
and then add a finalization step on top.
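As a toy summary of that rule (a simplification for illustration; the actual
planner logic in the patch is more involved, e.g. regarding expression
matching, and the function name here is made up), the strategy choice looks
like:

```python
def partitionwise_agg_strategy(group_by_cols, partition_key_cols):
    """Pick full vs. partial partition-wise aggregation (simplified rule
    from the proposal; not the real planner code)."""
    if set(partition_key_cols) <= set(group_by_cols):
        return "full"     # groups cannot span partitions
    return "partial"      # combine partial aggregates across partitions

print(partitionwise_agg_strategy(["a"], ["a"]))   # full
print(partitionwise_agg_strategy(["v"], ["a"]))   # partial
```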

I have observed that the cost estimated for partition-wise aggregation and the
cost for plans without partition-wise aggregation are almost the same. However,
execution time shows significant improvement (as explained in my very first
email) with partition-wise aggregates. The planner chooses a plan according to
the costs, and thus most of the time the plan without partition-wise
aggregation is chosen. Hence, to force partition-wise plans and for the
regression runs, I have added a GUC named partition_wise_agg_cost_factor to
adjust the costing.

This feature is only used when enable_partition_wise_agg GUC is set to on.

Here are the details of the patches in the patch-set:

0001 - Refactors the final sort and hash grouping paths into separate functions.
Since partition-wise aggregation builds paths the same way as
create_grouping_paths(), path creation for sort and hash aggregation has been
separated into functions. These functions are later used by the main
partition-wise aggregation/grouping patch.

0002 - Passes the targetlist to get_number_of_groups().
We need to estimate groups for individual child relations and thus need to
pass the targetlist corresponding to the child rel.

0003 - Adds enable_partition_wise_agg and partition_wise_agg_cost_factor GUCs.

0004 - Implements partition-wise aggregation.

0005 - Adds test-cases.

0006 - postgres_fdw changes which enable pushing aggregation for other upper
relations.


Since this patch is highly dependent on partition-wise join [1], one needs to
apply all of those patches on HEAD (my repository HEAD was at
66ed3829df959adb47f71d7c903ac59f0670f3e1) before applying these patches in
order.

Suggestions / feedback / inputs ?

[1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com


On Tue, Mar 21, 2017 at 12:47 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi all,

Declarative partitioning is supported in PostgreSQL 10 and work is already in
progress to support partition-wise joins. Here is a proposal for partition-wise
aggregation/grouping.  Our initial performance measurement has shown 7 times
performance when partitions are on foreign servers and approximately 15% when
partitions are local.

Partition-wise aggregation/grouping computes aggregates for each partition
separately.  If the group clause contains the partition key, all the rows
belonging to a given group come from one partition, thus allowing aggregates
to be computed completely for each partition.  Otherwise, partial aggregates
computed for each partition are combined across the partitions to produce the
final aggregates. This technique improves performance because:
i. When partitions are located on foreign server, we can push down the
aggregate to the foreign server.
ii. If hash table for each partition fits in memory, but that for the whole
relation does not, each partition-wise aggregate can use an in-memory hash
table.
iii. Aggregation at the level of partitions can exploit properties of
partitions like indexes, their storage etc.

Attached an experimental patch for the same based on the partition-wise join
patches posted in [1].

This patch currently implements partition-wise aggregation when group clause
contains the partitioning key.  A query below, involving a partitioned table
with 3 partitions containing 1M rows each, producing total 30 groups showed
15% improvement over non-partition-wise aggregation. Same query showed 7 times
improvement when the partitions were located on the foreign servers.

Here is the sample plan:

postgres=# set enable_partition_wise_agg to true;
SET
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;
                                                  QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------------
 Append  (cost=5100.00..61518.90 rows=30 width=12) (actual time=324.837..944.804 rows=30 loops=1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=324.837..324.838 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p1 plt1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=309.954..309.956 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p2 plt1)
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual time=310.002..310.004 rows=10 loops=1)
         Relations: Aggregate on (public.fplt1_p3 plt1)
 Planning time: 0.370 ms
 Execution time: 945.384 ms
(9 rows)

postgres=# set enable_partition_wise_agg to false;
SET
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;
                                                              QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=121518.01..121518.31 rows=30 width=12) (actual time=6498.452..6498.459 rows=30 loops=1)
   Group Key: plt1.a
   ->  Append  (cost=0.00..106518.00 rows=3000001 width=4) (actual time=0.595..5769.592 rows=3000000 loops=1)
         ->  Seq Scan on plt1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.007..0.007 rows=0 loops=1)
         ->  Foreign Scan on fplt1_p1  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.587..1844.506 rows=1000000 loops=1)
         ->  Foreign Scan on fplt1_p2  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.384..1839.633 rows=1000000 loops=1)
         ->  Foreign Scan on fplt1_p3  (cost=100.00..35506.00 rows=1000000 width=4) (actual time=0.402..1876.505 rows=1000000 loops=1)
 Planning time: 0.251 ms
 Execution time: 6499.018 ms
(9 rows)

Patch needs a lot of improvement including:
1. Support for partial partition-wise aggregation
2. Estimating number of groups for every partition
3. Estimating cost of partition-wise aggregation based on sample partitions
similar to partition-wise join
and much more.

In order to support partial aggregation on foreign partitions, we need support
to fetch partially aggregated results from the foreign server. That can be
handled as a separate follow-on patch.

Though there is a lot of work to be done, I would like to get
suggestions/opinions from hackers.

I would like to thank Ashutosh Bapat for providing a draft patch and helping
me off-list on this feature while he is busy working on partition-wise join
feature.

[1] https://www.postgresql.org/message-id/CAFjFpRcbY2QN3cfeMTzVEoyF5Lfku-ijyNR%3DPbXj1e%3D9a%3DqMoQ%40mail.gmail.com

Thanks

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company




--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Aug 23, 2017 at 4:43 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi,

Attached is the patch to implement partition-wise aggregation/grouping.

As explained earlier, we produce a full aggregation for each partition when
partition keys are leading group by clauses and then append is performed.
Else we do a partial aggregation on each partition, append them and then add
finalization step over it.

I have observed that cost estimated for partition-wise aggregation and cost
for the plans without partition-wise aggregation is almost same. However,
execution time shows significant improvement (as explained my in the very
first email) with partition-wise aggregates. Planner chooses a plan according
to the costs, and thus most of the time plan without partition-wise
aggregation is chosen. Hence, to force partition-wise plans and for the
regression runs, I have added a GUC named partition_wise_agg_cost_factor to
adjust the costings.

This feature is only used when enable_partition_wise_agg GUC is set to on.

Here are the details of the patches in the patch-set:

Here is the new patch-set, rebased on HEAD (f0a0c17) and the
latest partition-wise join (v29) patches.
 

0001 - Refactors sort and hash final grouping paths into separate functions.
Since partition-wise aggregation too builds paths same as that of
create_grouping_paths(), separated path creation for sort and hash agg into
separate functions. These functions later used by main partition-wise
aggregation/grouping patch.

0002 - Passes targetlist to get_number_of_groups().
We need to estimate groups for individual child relations and thus need to
pass targetlist corresponding to the child rel.

0003 - Adds enable_partition_wise_agg and partition_wise_agg_cost_factor GUCs.

0004 - Implements partition-wise aggregation.

0005 - Adds test-cases.

0006 - postgres_fdw changes which enable pushing aggregation for other upper
relations.

0007 - Provides infrastructure to allow partial aggregation.
This will allow us to push partial aggregation down over the FDW.
With this, one can write SUM(PARTIAL x) to get a partial sum
result. Since PARTIAL is used in the syntax, I need to move it
to the reserved keywords category. This is a kind of PoC patch
and needs input on the approach and the way it is implemented.

0008 - Teaches postgres_fdw to push down partial aggregation.
With this, we can push an aggregate to the remote server even when
the GROUP BY key does not match the PARTITION key.
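The reason a partial-aggregate protocol is needed at all: some aggregates cannot be combined from per-server final values, only from their transition states. A minimal sketch for avg() (illustrative Python, invented names):

```python
# avg() over two servers: averaging the two averages is wrong in general,
# but combining (sum, count) transition states is always correct.
def partial_avg(rows):
    return (sum(rows), len(rows))      # the state a PARTIAL-style fetch would return

def finalize_avg(states):
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count
```

For SUM the partial state happens to equal the final value, which is why SUM(PARTIAL x) is the convenient PoC case.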
 


Since this patch is highly dependent on partition-wise join [1], one needs to
apply all those patches on HEAD (my repository head was at:
66ed3829df959adb47f71d7c903ac59f0670f3e1) before applying these patches in
order.

Suggestions / feedback / inputs ?

[1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com



--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Rajkumar Raghuwanshi
Date:

On Fri, Sep 8, 2017 at 5:47 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Here are the new patch-set re-based on HEAD (f0a0c17) and
latest partition-wise join (v29) patches.

Hi Jeevan,

I have started testing partition-wise aggregation and have one observation; please take a look.
With the v2 patch, if I change the target list order, the query does not pick the full partition-wise aggregate.

SET enable_partition_wise_agg TO true;
SET partition_wise_agg_cost_factor TO 0.5;
SET enable_partition_wise_join TO true;
SET max_parallel_workers_per_gather TO 0;

CREATE TABLE pagg_tab (a int, b int, c int) PARTITION BY RANGE(a);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES FROM (0) TO (10);
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES FROM (10) TO (20);
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES FROM (20) TO (30);
INSERT INTO pagg_tab SELECT i % 30, i % 30, i % 50 FROM generate_series(0, 299) i;
ANALYZE pagg_tab;

postgres=# explain (verbose, costs off) select a,b,count(*) from pagg_tab group by a,b order by 1,2;
                          QUERY PLAN                         
--------------------------------------------------------------
 Sort
   Output: pagg_tab_p1.a, pagg_tab_p1.b, (count(*))
   Sort Key: pagg_tab_p1.a, pagg_tab_p1.b
   ->  Append
         ->  HashAggregate
               Output: pagg_tab_p1.a, pagg_tab_p1.b, count(*)
               Group Key: pagg_tab_p1.a, pagg_tab_p1.b
               ->  Seq Scan on public.pagg_tab_p1
                     Output: pagg_tab_p1.a, pagg_tab_p1.b
         ->  HashAggregate
               Output: pagg_tab_p2.a, pagg_tab_p2.b, count(*)
               Group Key: pagg_tab_p2.a, pagg_tab_p2.b
               ->  Seq Scan on public.pagg_tab_p2
                     Output: pagg_tab_p2.a, pagg_tab_p2.b
         ->  HashAggregate
               Output: pagg_tab_p3.a, pagg_tab_p3.b, count(*)
               Group Key: pagg_tab_p3.a, pagg_tab_p3.b
               ->  Seq Scan on public.pagg_tab_p3
                     Output: pagg_tab_p3.a, pagg_tab_p3.b
(19 rows)

-- changing target list order
-- picking partial partition-wise aggregation path
postgres=# explain (verbose, costs off) select b,a,count(*) from pagg_tab group by a,b order by 1,2;
                                 QUERY PLAN                                
----------------------------------------------------------------------------
 Finalize GroupAggregate
   Output: pagg_tab_p1.b, pagg_tab_p1.a, count(*)
   Group Key: pagg_tab_p1.b, pagg_tab_p1.a
   ->  Sort
         Output: pagg_tab_p1.b, pagg_tab_p1.a, (PARTIAL count(*))
         Sort Key: pagg_tab_p1.b, pagg_tab_p1.a
         ->  Append
               ->  Partial HashAggregate
                     Output: pagg_tab_p1.b, pagg_tab_p1.a, PARTIAL count(*)
                     Group Key: pagg_tab_p1.b, pagg_tab_p1.a
                     ->  Seq Scan on public.pagg_tab_p1
                           Output: pagg_tab_p1.b, pagg_tab_p1.a
               ->  Partial HashAggregate
                     Output: pagg_tab_p2.b, pagg_tab_p2.a, PARTIAL count(*)
                     Group Key: pagg_tab_p2.b, pagg_tab_p2.a
                     ->  Seq Scan on public.pagg_tab_p2
                           Output: pagg_tab_p2.b, pagg_tab_p2.a
               ->  Partial HashAggregate
                     Output: pagg_tab_p3.b, pagg_tab_p3.a, PARTIAL count(*)
                     Group Key: pagg_tab_p3.b, pagg_tab_p3.a
                     ->  Seq Scan on public.pagg_tab_p3
                           Output: pagg_tab_p3.b, pagg_tab_p3.a
(22 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


 

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Fri, Sep 8, 2017 at 5:47 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Here are the new patch-set re-based on HEAD (f0a0c17) and
latest partition-wise join (v29) patches.

Hi Jeevan,

I have started testing partition-wise-aggregate and got one observation, please take a look.
with the v2 patch, here if I change target list order, query is not picking full partition-wise-aggregate.

Thanks Rajkumar for reporting this.

I am looking into this issue and will post updated patch with the fix.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


 

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

Hi Jeevan,

I have started testing partition-wise-aggregate and got one observation, please take a look.
with the v2 patch, here if I change target list order, query is not picking full partition-wise-aggregate.

Thanks Rajkumar for reporting this.

I am looking into this issue and will post updated patch with the fix.

The logic for checking whether the partition keys lead the GROUP BY keys needed
to be updated here. The GROUP BY expressions can appear in any order without
affecting the final result, so requiring the partition keys to be the leading
GROUP BY keys for full aggregation is not necessary. Instead, we must only
ensure that the partition keys are part of the GROUP BY keys to compute a full
aggregation per partition.
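The corrected check is set containment rather than a leading-prefix test; minimally (hypothetical names, not the patch's actual function):

```python
# Full per-partition aggregation is valid iff every partition key appears
# somewhere in the GROUP BY list; the order of GROUP BY items is irrelevant.
def can_aggregate_fully(partition_keys, group_by_keys):
    return set(partition_keys) <= set(group_by_keys)
```

This is exactly why GROUP BY a, b and GROUP BY b, a should both get the full partition-wise plan in Rajkumar's example.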

Attached, revised patch-set with above fix.

Also, in test-cases, I have removed DROP/ANALYZE commands on child
relations and also removed VERBOSE from the EXPLAIN.

Notes:
HEAD: 8edacab209957520423770851351ab4013cb0167
Partition-wise Join patch-set version: v32

Thanks

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Rajkumar Raghuwanshi
Date:


On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

Hi Jeevan,

I have started testing partition-wise-aggregate and got one observation, please take a look.
with the v2 patch, here if I change target list order, query is not picking full partition-wise-aggregate.

Thanks Rajkumar for reporting this.

I am looking into this issue and will post updated patch with the fix.

Logic for checking whether partition keys lead group by keys needs to be
updated here. The group by expressions can appear in any order without
affecting the final result. And thus, the need for partition keys should
be leading the group by keys to have full aggregation is not mandatory.
Instead we must ensure that the partition keys are part of the group by
keys to compute full aggregation on a partition.

Attached, revised patch-set with above fix.

Also, in test-cases, I have removed DROP/ANALYZE commands on child
relations and also removed VERBOSE from the EXPLAIN.

Notes:
HEAD: 8edacab209957520423770851351ab4013cb0167
Partition-wise Join patch-set version: v32

Thanks for the patch. I have tested it and issue is fixed now.  

Re: [HACKERS] Partition-wise aggregation/grouping

From
Rajkumar Raghuwanshi
Date:
On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:

Hi Jeevan,

I have started testing partition-wise-aggregate and got one observation, please take a look.
with the v2 patch, here if I change target list order, query is not picking full partition-wise-aggregate.

Thanks Rajkumar for reporting this.

I am looking into this issue and will post updated patch with the fix.

Logic for checking whether partition keys lead group by keys needs to be
updated here. The group by expressions can appear in any order without
affecting the final result. And thus, the need for partition keys should
be leading the group by keys to have full aggregation is not mandatory.
Instead we must ensure that the partition keys are part of the group by
keys to compute full aggregation on a partition.

Attached, revised patch-set with above fix.

Also, in test-cases, I have removed DROP/ANALYZE commands on child
relations and also removed VERBOSE from the EXPLAIN.

Notes:
HEAD: 8edacab209957520423770851351ab4013cb0167
Partition-wise Join patch-set version: v32

Hi Jeevan,

While testing the latest v3 patches, I am getting a server crash when I RESET partition_wise_agg_cost_factor; please take a look.

CREATE TABLE lp (a TEXT, b FLOAT, c INT) PARTITION BY LIST(c);
CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (10,20);
CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (30,40);

INSERT INTO lp VALUES ('a1',800, 20);
INSERT INTO lp VALUES ('a2',1250,30);
INSERT INTO lp VALUES ('a3',2975,20);
INSERT INTO lp VALUES ('a3',2850,30);

postgres=# SET enable_partition_wise_agg TO true;
SET
postgres=# SET partition_wise_agg_cost_factor TO 0.5;
SET
postgres=#
postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2;
 max  |  avg  
------+--------
 2975 | 2912.5
(1 row)

postgres=# RESET partition_wise_agg_cost_factor;
RESET
postgres=#
postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Sep 21, 2017 at 5:00 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Jeevan,

while testing latest v3 patches, I am  getting a server crash if I reset partition_wise_agg_cost_factor, please take a look.

CREATE TABLE lp (a TEXT, b FLOAT, c INT) PARTITION BY LIST(c);
CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (10,20);
CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (30,40);

INSERT INTO lp VALUES ('a1',800, 20);
INSERT INTO lp VALUES ('a2',1250,30);
INSERT INTO lp VALUES ('a3',2975,20);
INSERT INTO lp VALUES ('a3',2850,30);

postgres=# SET enable_partition_wise_agg TO true;
SET
postgres=# SET partition_wise_agg_cost_factor TO 0.5;
SET
postgres=#
postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2;
 max  |  avg  
------+--------
 2975 | 2912.5
(1 row)

postgres=# RESET partition_wise_agg_cost_factor;
RESET
postgres=#
postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>


Oops. Will fix this.

I have added these tests to the testcase, and there the testcase works as expected.
However, running the same steps in psql reproduces the crash (though not consistently).

Looking into it. Thanks for reporting.
 
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
Hi Jeevan,
I have started reviewing these patches.

0001 looks fine. There might be some changes that will be needed, but
those will be clear when I review the patch that uses this
refactoring.

0002
+ *
+ * If targetlist is provided, we use it else use targetlist from the root.
  */
 static double
 get_number_of_groups(PlannerInfo *root,
                     double path_rows,
-                    grouping_sets_data *gd)
+                    grouping_sets_data *gd,
+                    List *tlist)
 {
    Query      *parse = root->parse;
    double      dNumGroups;
+   List       *targetList = (tlist == NIL) ? parse->targetList : tlist;

Maybe we should just pass the targetlist always: instead of passing NIL,
pass parse->targetList directly. That would save us one conditional
assignment. Maybe passing NIL is required for the patches that use
this refactoring, but that's not clear as-is in this patch.

0003
In the documentation of enable_partition_wise_agg, we should
probably explain why the default is off, or, like the partition_wise_join
GUC, explain the consequences of turning it off. I doubt we could
accept something like partition_wise_agg_cost_factor. But we can
discuss that at a later stage. Meanwhile, it may be worthwhile to fix
the reason why we would require this GUC: if regular aggregation
costs less than partition-wise aggregation in most cases,
then we probably need to fix the cost model.

I will continue reviewing rest of the patches.

On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>>
>>
>>
>> On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi
>> <rajkumar.raghuwanshi@enterprisedb.com> wrote:
>>>
>>>
>>> Hi Jeevan,
>>>
>>> I have started testing partition-wise-aggregate and got one observation,
>>> please take a look.
>>> with the v2 patch, here if I change target list order, query is not
>>> picking full partition-wise-aggregate.
>>
>>
>> Thanks Rajkumar for reporting this.
>>
>> I am looking into this issue and will post updated patch with the fix.
>
>
> Logic for checking whether partition keys lead group by keys needs to be
> updated here. The group by expressions can appear in any order without
> affecting the final result. And thus, the need for partition keys should
> be leading the group by keys to have full aggregation is not mandatory.
> Instead we must ensure that the partition keys are part of the group by
> keys to compute full aggregation on a partition.
>
> Attached, revised patch-set with above fix.
>
> Also, in test-cases, I have removed DROP/ANALYZE commands on child
> relations and also removed VERBOSE from the EXPLAIN.
>
> Notes:
> HEAD: 8edacab209957520423770851351ab4013cb0167
> Partition-wise Join patch-set version: v32
>
> Thanks
>
> --
> Jeevan Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Thanks Ashutosh for reviewing.

Attached new patch-set with following changes:

1. Removed the earlier 0007 and 0008 patches, which were a PoC for supporting
partial aggregation over FDW. I removed them as that will be a different
issue altogether, and hence I will tackle it separately once this work is
done.

This patch-set now includes support for parallel plans within partitions.

Notes:
HEAD: 59597e6
Partition-wise Join Version: 34

(First six patches 0001 - 0006, remains the same functionality-wise)
0007 - Refactors partial grouping path creation into a separate function.
0008 - Enables parallelism within the partition-wise aggregation.

This patch-set also includes a fix for the crash reported by Rajkumar.
While forcibly applying the scan/join target to all the Paths for the scan/join
rel, I was earlier using apply_projection_to_path(), which modifies the path
in place; this caused the crash because the path finally chosen had already
been updated by the partition-wise aggregate path creation. Now I use
create_projection_path(), as we do for partial aggregation paths.
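The hazard being fixed is the classic shared-structure mutation problem; a sketch in Python with stand-ins for the two PostgreSQL functions (names and classes are invented for illustration):

```python
# apply_projection_to_path()-style: mutate the node that others still hold.
# create_projection_path()-style: wrap it in a new node and leave it alone.
class Path:
    def __init__(self, target, subpath=None):
        self.target = target
        self.subpath = subpath

def project_in_place(path, new_target):
    path.target = new_target   # every existing reference now sees new_target
    return path

def project_by_wrapping(path, new_target):
    return Path(new_target, subpath=path)   # original path is untouched
```

Wrapping costs one extra node but guarantees that earlier consumers of the path, such as an already-built partition-wise aggregate path, are not clobbered.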

Also, fixed issues reported by Ashutosh.

On Tue, Sep 26, 2017 at 6:16 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Hi Jeevan,
I have started reviewing these patches.

0001 looks fine. There might be some changes that will be needed, but
those will be clear when I review the patch that uses this
refactoring.

0002
+ *
+ * If targetlist is provided, we use it else use targetlist from the root.
  */
 static double
 get_number_of_groups(PlannerInfo *root,
                     double path_rows,
-                    grouping_sets_data *gd)
+                    grouping_sets_data *gd,
+                    List *tlist)
 {
    Query      *parse = root->parse;
    double      dNumGroups;
+   List       *targetList = (tlist == NIL) ? parse->targetList : tlist;

May be we should just pass targetlist always. Instead of passing NIL,
pass parse->targetList directly. That would save us one conditional
assignment. May be passing NIL is required for the patches that use
this refactoring, but that's not clear as is in this patch.

Done in attached patch-set.
 

0003
In the documenation of enable_partition_wise_aggregate, we should
probably explain why the default is off or like partition_wise_join
GUC, explain the consequences of turning it off.

I have updated this. Please have a look.
 
I doubt if we could
accept something like partition_wise_agg_cost_factor looks. But we can
discuss this at a later stage. Mean time it may be worthwhile to fix
the reason why we would require this GUC. If the regular aggregation
has cost lesser than partition-wise aggregation in most of the cases,
then probably we need to fix the cost model.

Yep. I will have a look meanwhile.
 

I will continue reviewing rest of the patches.


Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Sep 27, 2017 at 3:42 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Thanks Ashutosh for reviewing.
>
> Attached new patch-set with following changes:
>
> 1. Removed earlier 0007 and 0008 patches which were PoC for supporting
> partial aggregation over fdw. I removed them as it will be a different
> issue altogether and hence I will tackle them separately once this is
> done.
>
> This patch-set now includes support for parallel plans within partitions.
>
> Notes:
> HEAD: 59597e6
> Partition-wise Join Version: 34
>
> (First six patches 0001 - 0006, remains the same functionality-wise)
> 0007 - Refactors partial grouping paths creation into the separate function.
> 0008 - Enables parallelism within the partition-wise aggregation.
>
> This patch also includes a fix for the crash reported by Rajkumar.
> While forcibly applying scan/join target to all the Paths for the scan/join
> rel, earlier I was using apply_projection_to_path() which modifies the path
> in-place which causing this crash as the path finally chosen has been
> updated by partition-wise agg path creation. Now I have used
> create_projection_path() like we do in partial aggregation paths.
>
> Also, fixed issues reported by Ashutosh.

Thanks.

Here are comments on 0004 from the last patch set; most of the comments
still apply.

Patch 0001 adds the functions create_hash_agg_path() and create_sort_agg_path().
Patch 0004 adds a new argument to those functions for the conditions in the HAVING
clause. We should move those changes to 0001 and pass parse->havingQual to
these functions in 0001 itself. That will keep all the changes to those functions
together and also make 0003 smaller.

The prologue of try_partition_wise_grouping() mentions a restriction of the
partition keys being the leading GROUP BY clauses. This restriction is not
specified in the prologue of have_grouping_by_partkey(), which actually checks
for it; the requirement per that function's prologue is just to have the
partition keys in the GROUP BY clause. I think have_grouping_by_partkey() is
correct, and we should change the prologue of try_partition_wise_grouping() to
be in sync with it. The prologue explains why partition-wise
aggregation/grouping would be efficient with this restriction,
but it doesn't explain why partial aggregation/grouping per partition would be
efficient. Maybe it should do that as well. The same applies to the partial
aggregation/grouping discussion in the README.

+    /* Do not handle grouping sets for now. */
Is this a real restriction, or just a restriction for the first cut of this feature?
Can you please add more explanation? Maybe update the README as well?

+    grouped_rel->part_scheme = input_rel->part_scheme;
Is this true even for partial aggregates? I think not. Since the GROUP BY clause
does not contain the partition keys, rows from multiple partitions participate
in one group, and thus the partition keys of the input relation do not apply to
the grouped relation. In this case, it seems that the grouped rel will have
part_rels but will not be partitioned.

+        /*
+         * If there is no path for the child relation, then we cannot add
+         * aggregate path too.
+         */
+        if (input_child_rel->pathlist == NIL)
+            return;
When can this happen? I think, similar to partition-wise join, it happens when
there is a dummy parent relation. See [1]. If that's the case, you may want to
do things similar to what partition-wise join is doing. If there's some other
reason for doing this, returning from here half-way is actually a waste of memory
and planning time. Instead, we may want to loop over the part_rels to find whether
any of them has an empty pathlist and return from there before doing any actual
work.

+        extra.pathTarget = child_target;
+        extra.inputRows = input_child_rel->cheapest_startup_path->rows;
+        extra.havingQual = (Node *) adjust_appendrel_attrs(root,
+                                                           (Node *)
query->havingQual,
+                                                           nappinfos,
+                                                           appinfos);
These lines update some fields of the "extra" structure on every loop iteration.
The structure is passed to create_child_grouping_paths() inside the loop and to
add_paths_to_append_rel() outside the loop, so add_paths_to_append_rel() only
gets the member values set for the last child. Is that right? Should we split
"extra" into two structures, one to be used within the loop and one outside? Or
maybe pass the members being updated within the loop separately?

+        /*
+         * Forcibly apply scan/join target to all the Paths for the scan/join
+         * rel.
+         *
[ lines clipped ]
+                if (subpath == input_child_rel->cheapest_total_path)
+                    input_child_rel->cheapest_total_path = path;
+            }
+        }
This code seems to be copied from grouping_planner() almost verbatim. Is there
a way we can refactor it into a function and use it in both places?

have_grouping_by_partkey() may use match_expr_to_partition_keys() to find
whether a given clause expression matches any of the partition keys. Or you
could use list_intersection() instead of the following loop:
+        foreach(lc, partexprs)
+        {
+            Expr       *partexpr = lfirst(lc);
+
+            if (list_member(groupexprs, partexpr))
+            {
+                found = true;
+                break;
+            }
+        }
+        /*
+         * If none of the partition key matches with any of the GROUP BY
+         * expression, return false.
+         */
+        if (!found)
+            return false;

create_child_grouping_paths() and create_grouping_paths() have almost identical
code. Is there a way we could refactor the common code into a function called by
these two functions, or reuse create_grouping_paths() for children as well? I
don't think we will be able to do the latter.

+    /* Nothing to do if there is an empty pathlist */
+    if (grouped_rel->pathlist == NIL)
+        return false;
When would that happen? A similar condition in the case of the parent grouped rel
throws an error, so when this code is called, we know for sure that the parent
had a non-empty pathlist. So, we would expect the child to have a non-empty
pathlist as well.

+        grouped_rel = fetch_upper_rel(root, UPPERREL_GROUP_AGG,
+                                      input_rel->relids);
+
+        /* Mark this rel as "other upper rel" */
+        grouped_rel->reloptkind = RELOPT_OTHER_UPPER_REL;
I think we need to pass reloptkind as an argument to fetch_upper_rel(), now that
we have "upper" relations and "other upper" relations. relids will still be a
"key" to find an upper relation, but its reloptkind should match the given
reloptkind.  fetch_upper_rel() is used to create the upper relation if it
doesn't exist. So, with the above code, if some other function calls
fetch_upper_rel() with the given relids, it would get an upper rel with
RELOPT_UPPER_REL and then this code would change it to RELOPT_OTHER_UPPER_REL.
That looks odd. Maybe we should have written build_upper_rel() and
find_upper_rel() similar to build_join_rel() and find_join_rel() instead of
combining both functionalities in one function.

+        /*
+         * create_append_path() sets the path target from the given relation.
+         * However, in this case grouped_rel doesn't have a target set.  So we
+         * must set the pathtarget to the passed in target.
+         */
+        apath->pathtarget = target;
I think we should change the create_append_path() functions to accept the target
as an additional argument. For append rels other than aggregate and grouping,
the target will be the same as the relation's target. For agg/group append rels,
we will pass different targets for partial and non-partial grouping paths.

+        /*
+         * Since Append's output is always unsorted, we'll need to sort,
+         * unless there's no GROUP BY clause or a degenerate (constant) one,
+         * in which case there will only be a single group.
+         */
The append path here can be the output of either MergeAppend or Append. If it's
the output of MergeAppend, we don't need to sort it again, do we?

create_partition_agg_paths() creates append paths and then adds a finalization
path if necessary. The code to add the finalization path seems to be similar to
the code that adds the finalization path for parallel query. Maybe we could move
the common code into a function and call that function in both places. I see
this function as accepting a partial aggregation/grouping path and returning a
path that finalizes the partial aggregates/groups.

[1] https://www.postgresql.org/message-id/CAFjFpRd5+zroxY7UMGTR2M=rjBV4aBOCxQg3+1rBmTPLK5mpDg@mail.gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Sep 28, 2017 at 3:12 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

Here are comments on 0004 from the last patch set, but most of the comments
still apply.

Thank you, Ashutosh for reviewing.
 

Patch 0001 adds functions create_hash_agg_path() and create_sort_agg_path().
Patch 0004 adds a new argument to those functions for conditions in HAVING
clause. We should move those changes to 0001 and pass parse->havingQual to
these functions in 0001 itself. That will keep all changes to those functions
together and also make 0003 small.

Done.


The prologue of try_partition_wise_grouping() mentions a restriction of
partition keys being leading group by clauses. This restriction is not
specified in the prologue of have_grouping_by_partkey(), which actually checks
for this restriction. The requirement per prologue of that function is just to
have partition keys in group clause. I think have_grouping_by_partkey() is
correct, and we should change prologue of try_partition_wise_grouping() to be
in sync with have_grouping_by_partkey().

Done.

The prologue explains why
partition-wise aggregation/grouping would be efficient with this restriction,
but it doesn't explain why partial aggregation/grouping per partition would be
efficient. Maybe it should do that as well. Similar is the case with the partial
aggregation/grouping discussion in the README.

I have tried updating it. Please check.


+    /* Do not handle grouping sets for now. */
Is this a real restriction or just a restriction for the first cut of this
feature? Can you please add more explanation? Maybe update the README as well?

A grouping sets plan does not work with an inheritance subtree (see the notes in
create_groupingsets_plan), so grouping sets are not handled here.
 

+    grouped_rel->part_scheme = input_rel->part_scheme;
Is this true even for partial aggregates? I think not. Since group by clause
does not contain partition keys, the rows from multiple partitions participate
in one group and thus the partition keys of input relation do not apply to the
grouped relation. In this case, it seems that the grouped rel will have
part_rels but will not be partitioned.

I have removed this as your analysis is correct. grouped_rel is not
partitioned.


+        /*
+         * If there is no path for the child relation, then we cannot add
+         * aggregate path too.
+         */
+        if (input_child_rel->pathlist == NIL)
+            return;
When can this happen? I think, similar to partition-wise join, it happens when
there is a dummy parent relation. See [1]. If that's the case, you may want to
do things similar to what partition-wise join is doing. If there's some other
reason for doing this, returning from here half-way is actually a waste of memory
and planning time. Instead, we may want to loop over the part_rels to find if
any of those have an empty pathlist and return from there before doing any actual
work.

This is a kind of can't-happen scenario, so I have converted it to an Assert().
And yes, I am marking the grouped_rel as a dummy rel when the input rel is dummy.


+        extra.pathTarget = child_target;
+        extra.inputRows = input_child_rel->cheapest_startup_path->rows;
+        extra.havingQual = (Node *) adjust_appendrel_attrs(root,
+                                                           (Node *)
query->havingQual,
+                                                           nappinfos,
+                                                           appinfos);
These lines update some fields of the "extra" structure on every loop iteration.
The structure is passed to create_child_grouping_paths() inside the loop and to
add_paths_to_append_rel() outside the loop. Thus add_paths_to_append_rel() only
gets some member values for the last child. Is that right?

No. The patch does update those fields before calling add_paths_to_append_rel().

Should we split
extra into two structures, one to be used within the loop and one outside? Or
maybe pass the members being updated within the loop separately?

I don't see any point in splitting it. We need almost all the fields at child
path creation as well as at the finalization step. The patch is basically just
re-using the struct variable.


+        /*
+         * Forcibly apply scan/join target to all the Paths for the scan/join
+         * rel.
+         *
[ lines clipped ]
+                if (subpath == input_child_rel->cheapest_total_path)
+                    input_child_rel->cheapest_total_path = path;
+            }
+        }
This code seems to be copied from grouping_planner() almost verbatim. Is there
a way we can refactor it into a function and use it in both places?

Done.
Moved this in 0003-Refactor-code-applying-scanjoin-target-to-paths-into.patch


have_grouping_by_partkey() may use match_expr_to_partition_keys() to find
whether a given clause expression matches any of the partition keys. Or you
could use list_intersection() instead of the following loop:
+        foreach(lc, partexprs)
+        {
+            Expr       *partexpr = lfirst(lc);
+
+            if (list_member(groupexprs, partexpr))
+            {
+                found = true;
+                break;
+            }
+        }
+        /*
+         * If none of the partition key matches with any of the GROUP BY
+         * expression, return false.
+         */
+        if (!found)
+            return false;

Well, the logic in match_expr_to_partition_keys() does not exactly match the
scenarios here. It may match with a few alterations, but then it will become
complex, so it is better to keep them separate.

list_intersection() is a good suggestion, as it would remove this block
altogether and leave fewer lines of code to maintain. However, it returns a
list of all the matching cells from list1, which requires comparing all the
elements. In this case we don't need to keep matching after the very first
match, so this loop avoids those unnecessary comparisons.
 

create_child_grouping_paths() and create_grouping_paths() have almost identical
code. Is there a way we could refactor the common code into a function called by
these two functions, or reuse create_grouping_paths() for children as well? I
don't think we will be able to do the latter.

After refactoring most of the code in create_grouping_paths() (0001-0003),
very little duplicated code remains. Refactoring those few lines into another
function looks odd.
Let me know if you still think those few lines should be moved into a separate
function.
 

+    /* Nothing to do if there is an empty pathlist */
+    if (grouped_rel->pathlist == NIL)
+        return false;
When would that happen? A similar condition in the case of the parent grouped rel
throws an error, so when this code is called, we know for sure that the parent
had a non-empty pathlist. So, we would expect the child to have a non-empty
pathlist as well.

Yes, and I agree. This is a kind of unreachable return.
Do you mean we should also throw an error here, like in the case of the parent
grouped rel? I opted not to throw an error and instead fall back to the
non-partition-wise path.
 

+        grouped_rel = fetch_upper_rel(root, UPPERREL_GROUP_AGG,
+                                      input_rel->relids);
+
+        /* Mark this rel as "other upper rel" */
+        grouped_rel->reloptkind = RELOPT_OTHER_UPPER_REL;
I think we need to pass reloptkind as an argument to fetch_upper_rel(), now that
we have "upper" relations and "other upper" relations. relids will still be a
"key" to find an upper relation, but its reloptkind should match the given
reloptkind.  fetch_upper_rel() is used to create the upper relation if it
doesn't exist. So, with the above code, if some other function calls
fetch_upper_rel() with the given relids, it would get an upper rel with
RELOPT_UPPER_REL and then this code would change it to RELOPT_OTHER_UPPER_REL.
That looks odd. Maybe we should have written build_upper_rel() and
find_upper_rel() similar to build_join_rel() and find_join_rel() instead of
combining both functionalities in one function.

Makes sense. But I am reluctant to update fetch_upper_rel() and all its
callers.
However, do you think having a separate function for the other upper rel, named
fetch_other_upper_rel() in line with fetch_upper_rel(), is a good idea?
 

+        /*
+         * create_append_path() sets the path target from the given relation.
+         * However, in this case grouped_rel doesn't have a target set.  So we
+         * must set the pathtarget to the passed in target.
+         */
+        apath->pathtarget = target;
I think we should change the create_append_path() functions to accept the target
as an additional argument. For append rels other than aggregate and grouping,
the target will be the same as the relation's target. For agg/group append rels,
we will pass different targets for partial and non-partial grouping paths.

Done in 0005-Pass-pathtarget-to-create_-merge_-append_path.patch.


+        /*
+         * Since Append's output is always unsorted, we'll need to sort,
+         * unless there's no GROUP BY clause or a degenerate (constant) one,
+         * in which case there will only be a single group.
+         */
The append path here can be the output of either MergeAppend or Append. If it's
the output of MergeAppend, we don't need to sort it again, do we?

Yes, you are right, we don't need an explicit sort over merge-append.
Done those changes.


create_partition_agg_paths() creates append paths and then adds a finalization
path if necessary. The code to add the finalization path seems to be similar to
the code that adds the finalization path for parallel query. Maybe we could move
the common code into a function and call that function in both places. I see
this function as accepting a partial aggregation/grouping path and returning a
path that finalizes the partial aggregates/groups.

It seems that it will become messy. Per my understanding, the only common code
is the add_path() call with the appropriate create_agg_path() or
create_group_path(). Those are function calls anyway, and I don't see any
reason to split them into a separate function.
 


Attached is a new patch set, based on HEAD at 84ad4b0, with all these review
points fixed. Let me know if I missed any. Thanks.

I have merged the parallelism changes into the main patch, i.e. 0007, as most
of the changes in that patch modify the same lines added by 0007.

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
David Rowley
Date:
On 10 October 2017 at 01:10, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached new patch set having HEAD at 84ad4b0 with all these review points
> fixed. Let me know if I missed any thanks.

I've only really skimmed over this thread and only opened the code
enough to extract the following:

+ /* Multiply the costs by partition_wise_agg_cost_factor. */
+ apath->startup_cost *= partition_wise_agg_cost_factor;
+ apath->total_cost *= partition_wise_agg_cost_factor;

I've not studied how all the path plumbing is done, but I think
instead of doing this costing magic we should really stop pretending
that Append/MergeAppend nodes are cost-free. I think something like
charging cpu_tuple_cost per row expected through Append/MergeAppend
would be a better approach to this.

If you perform grouping or partial grouping before the Append, then in
most cases the Append will receive less rows, so come out cheaper than
if you perform the grouping after it. I've not learned the
partition-wise join code enough to know if this is going to affect
that too, but for everything else, there should be no plan change,
since there's normally no alternative paths. I see there's even a
comment in create_append_path() which claims the zero cost is a bit
optimistic.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Tue, Oct 10, 2017 at 3:15 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 10 October 2017 at 01:10, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>> Attached new patch set having HEAD at 84ad4b0 with all these review points
>> fixed. Let me know if I missed any thanks.
>
> I've only really skimmed over this thread and only opened the code
> enough to extract the following:
>
> + /* Multiply the costs by partition_wise_agg_cost_factor. */
> + apath->startup_cost *= partition_wise_agg_cost_factor;
> + apath->total_cost *= partition_wise_agg_cost_factor;
>
> I've not studied how all the path plumbing is done, but I think
> instead of doing this costing magic we should really stop pretending
> that Append/MergeAppend nodes are cost-free. I think something like
> charging cpu_tuple_cost per row expected through Append/MergeAppend
> would be a better approach to this.
>
> If you perform grouping or partial grouping before the Append, then in
> most cases the Append will receive less rows, so come out cheaper than
> if you perform the grouping after it. I've not learned the
> partition-wise join code enough to know if this is going to affect
> that too, but for everything else, there should be no plan change,
> since there's normally no alternative paths. I see there's even a
> comment in create_append_path() which claims the zero cost is a bit
> optimistic.
>

+1. Partition-wise join will also benefit from costing Append
processing. Comparing the number of rows * width of the join result with the
sum of that measure for the joining relations decides whether the Append node
processes more data in the Append->Join case than in the Join->Append case.

The Append node just returns the result of ExecProcNode(). Charging
cpu_tuple_cost may make it too expensive. In other places where we
charge cpu_tuple_cost there's some processing done to the tuple, like
ExecStoreTuple() in SeqNext(). Maybe we need some other measure for
Append's processing of the tuple.

Maybe we should try to measure the actual time spent in the Append node
as a fraction of, say, the time spent in the child seq scans. That might give us
a clue as to how Append processing can be charged in terms of costing.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Oct 10, 2017 at 10:27 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Tue, Oct 10, 2017 at 3:15 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 10 October 2017 at 01:10, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>> Attached new patch set having HEAD at 84ad4b0 with all these review points
>> fixed. Let me know if I missed any thanks.
>
> I've only really skimmed over this thread and only opened the code
> enough to extract the following:
>
> + /* Multiply the costs by partition_wise_agg_cost_factor. */
> + apath->startup_cost *= partition_wise_agg_cost_factor;
> + apath->total_cost *= partition_wise_agg_cost_factor;
>
> I've not studied how all the path plumbing is done, but I think
> instead of doing this costing magic we should really stop pretending
> that Append/MergeAppend nodes are cost-free. I think something like
> charging cpu_tuple_cost per row expected through Append/MergeAppend
> would be a better approach to this.
>
> If you perform grouping or partial grouping before the Append, then in
> most cases the Append will receive less rows, so come out cheaper than
> if you perform the grouping after it. I've not learned the
> partition-wise join code enough to know if this is going to affect
> that too, but for everything else, there should be no plan change,
> since there's normally no alternative paths. I see there's even a
> comment in create_append_path() which claims the zero cost is a bit
> optimistic.
>

+1.

Yes. Ashutosh and I had discussed this off-list and felt that we will need to cost the Append node too, as having an extra GUC to control this is not a good idea per se. Thanks for your vote too.

I will try doing this and see how the plan goes with it.

Partition-wise join will also benefit from costing Append
processing. Number of rows * width of join result compared with the
sum of that measure for joining relations decides whether Append node
processes more data in Append->Join case than Join->Append case.

Append node just returns the result of ExecProcNode(). Charging
cpu_tuple_cost may make it too expensive. In other places where we
charge cpu_tuple_cost there's some processing done to the tuple like
ExecStoreTuple() in SeqNext(). May be we need some other measure for
Append's processing of the tuple.

May be we should try to measure the actual time spent in Append node
as a fraction of say time spent in child seq scans. That might give us
a clue as to how Append processing can be charged in terms of costing.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb


Re: [HACKERS] Partition-wise aggregation/grouping

From
David Rowley
Date:
On 10 October 2017 at 17:57, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Append node just returns the result of ExecProcNode(). Charging
> cpu_tuple_cost may make it too expensive. In other places where we
> charge cpu_tuple_cost there's some processing done to the tuple like
> ExecStoreTuple() in SeqNext(). May be we need some other measure for
> Append's processing of the tuple.

I don't think there's any need to invent any new GUC. You could just
divide cpu_tuple_cost by something.

I did a quick benchmark on my laptop to see how much Append really
costs, and with the standard costs the actual cost seems to be about
cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
realistic. create_set_projection_path() does something similar and
brincostestimate() does some similar magic and applies 0.1 *
cpu_operator_cost to the total cost.
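As a sketch of what the suggestion amounts to (the 0.5 factor and the helper name are assumptions from this discussion, not committed PostgreSQL code), the Append run cost would simply gain a per-tuple term on top of the children's costs:

```c
#include <assert.h>

/*
 * Hypothetical helper: charge a fraction of cpu_tuple_cost for every row
 * pulled through an Append node, on top of the children's total cost.
 */
static double append_run_cost(double children_total_cost, double tuples,
                              double cpu_tuple_cost)
{
    return children_total_cost + 0.5 * cpu_tuple_cost * tuples;
}
```

With the default cpu_tuple_cost of 0.01, a million-row Append would gain 5000 cost units, enough to make pushing aggregation below the Append look cheaper without any cost-factor GUC.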

# create table p (a int, b int);
# create table p1 () inherits (p);
# insert into p1 select generate_series(1,1000000);
# vacuum analyze p1;
# \q
$ echo "select count(*) from p1;" > p1.sql
$ echo "select count(*) from p;" > p.sql
$ pgbench -T 60 -f p1.sql -n

latency average = 58.567 ms

$ pgbench -T 60 -f p.sql -n
latency average = 72.984 ms

$ psql
psql (11devel)
Type "help" for help.

# -- check the cost of the plan.
# explain select count(*) from p1;
                            QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(2 rows)

# -- selecting from the parent is the same due to zero Append cost.
# explain select count(*) from p;
                               QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Append  (cost=0.00..14425.00 rows=1000001 width=0)
         ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(4 rows)

# -- extrapolate the additional time taken for the Append scan and
# -- work out what the planner should add to the plan's cost, then
# -- divide by the number of rows in p1 to work out the tuple cost
# -- of pulling a row through the append.
# select (16925.01 * (72.984 / 58.567) - 16925.01)  / 1000000;
        ?column?
------------------------
 0.00416630302337493743
(1 row)

# show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

# -- How does that compare to the cpu_tuple_cost?
# select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
    ?column?
----------------
 2.400209476818
(1 row)

Maybe it's worth trying with different row counts to see if the
additional cost is consistent, but it's probably not worth being too
critical here.
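The extrapolation above reduces to a small formula. Here is a sketch (in plain C, with invented names, purely for illustration) that reproduces the 2.4 figure from the measured latencies:

```c
#include <assert.h>
#include <math.h>

/*
 * Given the plan cost of the child-only query, the measured latencies of
 * the parent and child queries, and the row count, estimate the implied
 * per-tuple cost of pulling a row through the Append node.
 */
static double implied_append_tuple_cost(double plan_cost,
                                        double parent_ms, double child_ms,
                                        double rows)
{
    /* scale the cost by the observed slowdown, take the excess per row */
    return (plan_cost * (parent_ms / child_ms) - plan_cost) / rows;
}
```

Plugging in the numbers from the transcript (plan cost 16925.01, latencies 72.984 ms vs 58.567 ms, 1M rows) and dividing cpu_tuple_cost = 0.01 by the result gives the ~2.4 factor quoted above.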

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Tue, Oct 10, 2017 at 1:31 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> I don't think there's any need to invent any new GUC. You could just
> divide cpu_tuple_cost by something.
>
> I did a quick benchmark on my laptop to see how much Append really
> costs, and with the standard costs the actual cost seems to be about
> cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
> realistic. create_set_projection_path() does something similar and
> brincostestimate() does some similar magic and applies 0.1 *
> cpu_operator_cost to the total cost.
>
>
> # -- How does that compare to the cpu_tuple_cost?
> # select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
>     ?column?
> ----------------
>  2.400209476818
> (1 row)
>
> Maybe it's worth trying with different row counts to see if the
> additional cost is consistent, but it's probably not worth being too
> critical here.
>

This looks good to me. I think it should be a separate, yet very small patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Tue, Oct 10, 2017 at 6:00 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> This looks good to me. I think it should be a separate, yet very small patch.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Oct 10, 2017 at 1:31 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 10 October 2017 at 17:57, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Append node just returns the result of ExecProcNode(). Charging
> cpu_tuple_cost may make it too expensive. In other places where we
> charge cpu_tuple_cost there's some processing done to the tuple like
> ExecStoreTuple() in SeqNext(). May be we need some other measure for
> Append's processing of the tuple.

I don't think there's any need to invent any new GUC. You could just
divide cpu_tuple_cost by something.

I did a quick benchmark on my laptop to see how much Append really
costs, and with the standard costs the actual cost seems to be about
cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
realistic. create_set_projection_path() does something similar and
brincostestimate() does some similar magic and applies 0.1 *
cpu_operator_cost to the total cost.

# create table p (a int, b int);
# create table p1 () inherits (p);
# insert into p1 select generate_series(1,1000000);
# vacuum analyze p1;
# \q
$ echo "select count(*) from p1;" > p1.sql
$ echo "select count(*) from p;" > p.sql
$ pgbench -T 60 -f p1.sql -n

latency average = 58.567 ms

$ pgbench -T 60 -f p.sql -n
latency average = 72.984 ms

$ psql
psql (11devel)
Type "help" for help.

# -- check the cost of the plan.
# explain select count(*) from p1;
                            QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(2 rows)

# -- selecting from the parent is the same due to zero Append cost.
# explain select count(*) from p;
                               QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Append  (cost=0.00..14425.00 rows=1000001 width=0)
         ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(4 rows)

# -- extrapolate the additional time taken for the Append scan and
work out what the planner
# -- should add to the plan's cost, then divide by the number of rows
in p1 to work out the
# -- tuple cost of pulling a row through the append.
# select (16925.01 * (72.984 / 58.567) - 16925.01)  / 1000000;
        ?column?
------------------------
 0.00416630302337493743
(1 row)

# show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

# -- How does that compare to the cpu_tuple_cost?
# select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
    ?column?
----------------
 2.400209476818
(1 row)

Maybe it's worth trying with different row counts to see if the
additional cost is consistent, but it's probably not worth being too
critical here.

I have tried exactly the same tests to get to this factor on my local development machine. With parallelism enabled I got this number as 7.9. However, if I disable parallelism (and I believe David disabled it too), I get this number as 1.8, whereas for 10000 rows I get 1.7.

-- With Gather
# select current_Setting('cpu_tuple_cost')::float8 / ((10633.56 * (81.035 / 72.450) - 10633.56)  / 1000000);
7.9

-- Without Gather
# select current_Setting('cpu_tuple_cost')::float8 / ((16925.01 * (172.838 / 131.400) - 16925.01)  / 1000000);
1.8

-- With 10000 rows (so no Gather too)
# select current_Setting('cpu_tuple_cost')::float8 / ((170.01 * (1.919 / 1.424) - 170.01)  / 10000);
1.7

So it is not so straightforward to come up with the correct heuristic here. Thus using 50% of cpu_tuple_cost looks good to me here.

As suggested by Ashutosh and Robert, I have attached a separate small WIP patch for it.

I think it will be better if we take this topic to another mail thread.
Do you agree?
 

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
David Rowley
Date:
On 13 October 2017 at 19:36, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I have tried exactly same tests to get to this factor on my local developer
> machine. And with parallelism enabled I got this number as 7.9. However, if
> I disable the parallelism (and I believe David too disabled that), I get
> this number as 1.8. Whereas for 10000 rows, I get this number to 1.7
>
> -- With Gather
> # select current_Setting('cpu_tuple_cost')::float8 / ((10633.56 * (81.035 /
> 72.450) - 10633.56)  / 1000000);
> 7.9
>
> -- Without Gather
> # select current_Setting('cpu_tuple_cost')::float8 / ((16925.01 * (172.838 /
> 131.400) - 16925.01)  / 1000000);
> 1.8
>
> -- With 10000 rows (so no Gather too)
> # select current_Setting('cpu_tuple_cost')::float8 / ((170.01 * (1.919 /
> 1.424) - 170.01)  / 10000);
> 1.7
>
> So it is not so straight forward to come up the correct heuristic here. Thus
> using 50% of cpu_tuple_cost look good to me here.
>
> As suggested by Ashutosh and Robert, attached separate small WIP patch for
> it.

Good to see it stays fairly consistent at different tuple counts, and
is not too far away from what I got on this machine.

I looked over the patch and saw this:

@@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
   */
  run_cost += cpu_operator_cost * tuples;

+ /* Add MergeAppend node overhead like we do it for the Append node */
+ run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+
  path->startup_cost = startup_cost + input_startup_cost;
  path->total_cost = startup_cost + run_cost + input_total_cost;
 }

You're doing that right after a comment that says we don't do that. It
also does look like the "run_cost += cpu_operator_cost * tuples" is
trying to do the same thing, so perhaps it's worth just replacing
that, which by default will double that additional cost, although
doing so would have the planner slightly prefer a MergeAppend to an
Append than previously.

+#define DEFAULT_APPEND_COST_FACTOR 0.5

I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. It
means very little by itself. It also seems that most of the other cost
functions just use the magic number.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Fri, Oct 13, 2017 at 1:13 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:

I looked over the patch and saw this:

@@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
  */
  run_cost += cpu_operator_cost * tuples;

+ /* Add MergeAppend node overhead like we do it for the Append node */
+ run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+
  path->startup_cost = startup_cost + input_startup_cost;
  path->total_cost = startup_cost + run_cost + input_total_cost;
 }

You're doing that right after a comment that says we don't do that. It
also does look like the "run_cost += cpu_operator_cost * tuples" is
trying to do the same thing, so perhaps it's worth just replacing
that, which by default will double that additional cost, although
doing so would have the planner slightly prefer a MergeAppend to an
Append than previously.

I think we can remove that code block entirely. I have already added the relevant
comments around DEFAULT_APPEND_COST_FACTOR.
However, I am not sure about doing this since, as you correctly said, it may make
the planner prefer a MergeAppend to an Append. Will it be fine if we remove that code block?


+#define DEFAULT_APPEND_COST_FACTOR 0.5

I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. it
means very little by itself. It also seems that most of the other cost
functions just use the magic number.

Agree, but those magic numbers are each used only once, at a single place, whereas here
there are two places. So if someone wants to update the value, (s)he needs to make sure
to update it in both places. To minimize that risk, having a #define seems
better.
 




--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Dilip Kumar
Date:
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
While playing around with the patch I have noticed one regression with
the partial partition-wise aggregate.

I am consistently able to reproduce this on my local machine.

Scenario: Group by on non-key column and only one tuple per group.

Complete Test:
--------------------
create table t(a int,b int) partition by range(a);
create table t1 partition of t for values from (1) to (100000);
create table t2 partition of t for values from (100000) to (200000);

insert into t values (generate_series(1,199999),generate_series(1, 199999));
postgres=# explain analyze select sum(a) from t group by b;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=20379.55..28379.51 rows=199999 width=12) (actual time=102.311..322.969 rows=199999 loops=1)
   Group Key: t1.b
   ->  Merge Append  (cost=20379.55..25379.53 rows=199999 width=12) (actual time=102.303..232.310 rows=199999 loops=1)
         Sort Key: t1.b
         ->  Partial GroupAggregate  (cost=10189.72..11939.70 rows=99999 width=12) (actual time=52.164..108.967 rows=99999 loops=1)
               Group Key: t1.b
               ->  Sort  (cost=10189.72..10439.72 rows=99999 width=8) (actual time=52.158..66.236 rows=99999 loops=1)
                     Sort Key: t1.b
                     Sort Method: external merge  Disk: 1768kB
                     ->  Seq Scan on t1  (cost=0.00..1884.99 rows=99999 width=8) (actual time=0.860..20.388 rows=99999 loops=1)
         ->  Partial GroupAggregate  (cost=10189.82..11939.82 rows=100000 width=12) (actual time=50.134..102.976 rows=100000 loops=1)
               Group Key: t2.b
               ->  Sort  (cost=10189.82..10439.82 rows=100000 width=8) (actual time=50.128..63.362 rows=100000 loops=1)
                     Sort Key: t2.b
                     Sort Method: external merge  Disk: 1768kB
                     ->  Seq Scan on t2  (cost=0.00..1885.00 rows=100000 width=8) (actual time=0.498..20.977 rows=100000 loops=1)
 Planning time: 0.190 ms
 Execution time: 339.929 ms
(18 rows)

postgres=# set enable_partition_wise_agg=off;
SET
postgres=# explain analyze select sum(a) from t group by b;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=26116.53..29616.51 rows=199999 width=12) (actual time=139.413..250.751 rows=199999 loops=1)
   Group Key: t1.b
   ->  Sort  (cost=26116.53..26616.52 rows=199999 width=8) (actual time=139.406..168.775 rows=199999 loops=1)
         Sort Key: t1.b
         Sort Method: external merge  Disk: 3544kB
         ->  Result  (cost=0.00..5769.98 rows=199999 width=8) (actual time=0.674..76.392 rows=199999 loops=1)
               ->  Append  (cost=0.00..3769.99 rows=199999 width=8) (actual time=0.672..40.291 rows=199999 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..1884.99 rows=99999 width=8) (actual time=0.672..12.408 rows=99999 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1885.00 rows=100000 width=8) (actual time=1.407..11.689 rows=100000 loops=1)
 Planning time: 0.146 ms
 Execution time: 263.678 ms
(11 rows)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Oct 17, 2017 at 7:13 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
While playing around with the patch I have noticed one regression with
the partial partition-wise aggregate.

I am consistently able to reproduce this on my local machine.

Scenario: Group by on non-key column and only one tuple per group.

I didn't get what you mean by regression here. Can you please explain?

I see that the PWA plan is selected over the regular plan, when enabled, on the basis of costing.
The regular plan needs a Result node, due to which its cost increases, whereas PWA doesn't need that and thus wins.


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Dilip Kumar
Date:
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>

> I didn't get what you mean by regression here. Can you please explain?
>
> I see that PWA plan is selected over regular plan when enabled on the basis
> of costing.
> Regular planning need a Result node due to which costing increases where as
> PWA don't need that and thus wins.

Sorry for not explaining clearly. I meant that with the normal plan the
execution time is 263.678 ms, whereas with PWA it is 339.929 ms.

I only set enable_partition_wise_agg=on and it switched to PWA, and the
execution time increased by 30%.
I understand that this is the worst case for PWA, where
FinalizeAggregate gets all the tuples.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi,

Attached is a new patch-set. Changes include:

1. Added separate patch for costing Append node as discussed up-front in the
patch-set.
2. Since we now cost the Append node, we don't need the partition_wise_agg_cost_factor
GUC, so removed that. The remaining patch is hence merged into the main implementation
patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks


On Wed, Oct 18, 2017 at 9:53 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>

> I didn't get what you mean by regression here. Can you please explain?
>
> I see that PWA plan is selected over regular plan when enabled on the basis
> of costing.
> Regular planning need a Result node due to which costing increases where as
> PWA don't need that and thus wins.

Sorry for not explaining clearly. I meant that with the normal plan the
execution time is 263.678 ms, whereas with PWA it is 339.929 ms.

I only set enable_partition_wise_agg=on and it switched to PWA, and the
execution time increased by 30%.
I understand that this is the worst case for PWA, where
FinalizeAggregate gets all the tuples.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Oct 27, 2017 at 1:01 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> 1. Added separate patch for costing Append node as discussed up-front in the
> patch-set.
> 2. Since we now cost Append node, we don't need
> partition_wise_agg_cost_factor
> GUC. So removed that. The remaining patch hence merged into main
> implementation
> patch.
> 3. Updated rows in test-cases so that we will get partition-wise plans.

With 0006 applied, cost_merge_append() is now a little bit confused:
   /*
    * Also charge a small amount (arbitrarily set equal to operator cost) per
    * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
    * node doesn't do qual-checking or projection, so it has less overhead
    * than most plan nodes.
    */
   run_cost += cpu_operator_cost * tuples;

   /* Add MergeAppend node overhead like we do it for the Append node */
   run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
 

The first comment says that we don't add cpu_tuple_cost, and the
second one then adds half of it anyway.

I think it's fine to have a #define for DEFAULT_APPEND_COST_FACTOR,
because as you say it's used twice, but I don't think that should be
exposed in cost.h; I'd make it private to costsize.c and rename it to
something like APPEND_CPU_COST_MULTIPLIER.  The word DEFAULT, in
particular, seems useless to me, since there's no provision for it to
be overridden by a different value.

What testing, if any, can we think about doing with this plan to make
sure it doesn't regress things?  For example, if we do a TPC-H run
with partitioned tables and partition-wise join enabled, will any
plans change with this patch?  Do they get faster or not?  Anyone have
other ideas for what to test?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Sat, Oct 28, 2017 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 27, 2017 at 1:01 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> 1. Added separate patch for costing Append node as discussed up-front in the
> patch-set.
> 2. Since we now cost Append node, we don't need
> partition_wise_agg_cost_factor
> GUC. So removed that. The remaining patch hence merged into main
> implementation
> patch.
> 3. Updated rows in test-cases so that we will get partition-wise plans.

With 0006 applied, cost_merge_append() is now a little bit confused:

    /*
     * Also charge a small amount (arbitrarily set equal to operator cost) per
     * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
     * node doesn't do qual-checking or projection, so it has less overhead
     * than most plan nodes.
     */
    run_cost += cpu_operator_cost * tuples;

    /* Add MergeAppend node overhead like we do it for the Append node */
    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;

The first comment says that we don't add cpu_tuple_cost, and the
second one then adds half of it anyway.

Yep.
But as David reported earlier, if we remove the first part, i.e. adding cpu_operator_cost per tuple, Merge Append will be preferred over an Append node, unlike before. And thus I thought it better to have both, but I am not so sure. Should we remove that part altogether, or add both in a single statement with updated comments?


I think it's fine to have a #define for DEFAULT_APPEND_COST_FACTOR,
because as you say it's used twice, but I don't think that should be
exposed in cost.h; I'd make it private to costsize.c and rename it to
something like APPEND_CPU_COST_MULTIPLIER.  The word DEFAULT, in
particular, seems useless to me, since there's no provision for it to
be overridden by a different value.

Agree. Will make that change.
 

What testing, if any, can we think about doing with this plan to make
sure it doesn't regress things?  For example, if we do a TPC-H run
with partitioned tables and partition-wise join enabled, will any
plans change with this patch?

I have tried doing this on my local developer machine. For a 1GB database (TPC-H scale factor 1), I see no plan change with and without this patch.

I have tried with scale factor 10, but the queries do not execute well due to space and memory constraints. Can someone try that out?
 
  Do they get faster or not?  Anyone have
other ideas for what to test?



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Wed, Nov 1, 2017 at 6:20 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Yep.
> But as David reported earlier, if we remove the first part i.e. adding
> cpu_operator_cost per tuple, Merge Append will be preferred over an Append
> node unlike before. And thus, I thought of better having both, but no so
> sure. Should we remove that part altogether, or add both in a single
> statement with updated comments?

I was only suggesting that you update the comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

From
Konstantin Knizhnik
Date:
On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
> Hi,
>
> Attached new patch-set here. Changes include:
>
> 1. Added separate patch for costing Append node as discussed up-front in the
> patch-set.
> 2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
> GUC. So removed that. The remaining patch hence merged into main implementation
> patch.
> 3. Updated rows in test-cases so that we will get partition-wise plans.
>
> Thanks

I applied the partition-wise-agg-v6.tar.gz patch to the master and used the shard.sh example from
https://www.postgresql.org/message-id/14577.1509723225%40localhost
The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0 (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1 (cost=101.00..50195.13 rows=5000 width=0)


We really calculate a partial aggregate for each partition, but to do so we still have to fetch all data from the remote host.
So for foreign partitions such plans are absolutely inefficient.
May be it should be combined with some other patch?
For example, with the agg_pushdown_v4.tgz patch https://www.postgresql.org/message-id/14577.1509723225%40localhost ?
But it does not apply after the partition-wise-agg-v6.tar.gz patch.
Also postgres_fdw in 11dev is able to push down aggregates without the agg_pushdown_v4.tgz patch.

In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch
there is the following check:
 /* Partial aggregates are not supported. */
+       if (extra->isPartial)
+           return;

If we just comment out this line then the produced plan will be the following:

shard=# explain select sum(product_id) from orders;
                           QUERY PLAN
----------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And it is actually the desired plan!
Obviously such an approach will not always work: FDW really doesn't support partial aggregates now.
But for the most frequently used aggregates (sum, min, max, count) aggtype==aggtranstype, so there is no difference
between partial and normal aggregate calculation.
So instead of the (extra->isPartial) condition we can add a more complex check which traverses the pathtarget expressions
and checks whether they can be evaluated this way. Or... extend the FDW API to support partial aggregation.

But even the last plan is not ideal: it will calculate predicates at each remote node sequentially.
There is a parallel append patch:
https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com
but... FDW doesn't support parallel scan, so parallel append can not be applied in this case.
And we actually do not need parallel append with all its dynamic workers here.
We just need to start the commands at all remote servers and only after that fetch the results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded tables (tables with remote partitions).
After reading all these threads and the corresponding patches, it seems to me
that we already have most parts of the puzzle; what we need is to put them in the right places and maybe add the missing ones.
I wonder if somebody is busy with it, and whether I can somehow help here?

Also I am not quite sure about the best approach to parallel execution of a distributed query at all nodes.
Should we make postgres_fdw parallel safe and use parallel append? How difficult would that be?
Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes?
It seems to be close to merge append, because it should simultaneously traverse all cursors.

The second approach looks easier to implement. But in the case of a sharded table, a distributed query may need to traverse both remote
and local shards, and this approach doesn't allow processing several local shards in parallel.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Partition-wise aggregation/grouping

From
Konstantin Knizhnik
Date:

On 11.11.2017 23:29, Konstantin Knizhnik wrote:
> On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
>> Hi,
>>
>> Attached new patch-set here. Changes include:
>>
>> 1. Added separate patch for costing Append node as discussed up-front 
>> in the
>> patch-set.
>> 2. Since we now cost Append node, we don't need 
>> partition_wise_agg_cost_factor
>> GUC. So removed that. The remaining patch hence merged into main 
>> implementation
>> patch.
>> 3. Updated rows in test-cases so that we will get partition-wise plans.
>>
>> Thanks
>
> I applied partition-wise-agg-v6.tar.gz patch to  the master and use 
> shard.sh example from 
> https://www.postgresql.org/message-id/14577.1509723225%40localhost
> Plan for count(*) is the following:
>
> shard=# explain select count(*) from orders;
>                                       QUERY PLAN
> --------------------------------------------------------------------------------------- 
>
>  Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
>    ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
>          ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
>                ->  Foreign Scan on orders_0 (cost=101.00..50195.13 
> rows=5000 width=0)
>          ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
>                ->  Foreign Scan on orders_1 (cost=101.00..50195.13 
> rows=5000 width=0)
>
>
> We really calculate partial aggregate for each partition, but to do we 
> still have to fetch all data from remote host.
> So for foreign partitions such plans is absolutely inefficient.
> Amy be it should be combined with some other patch?
> For example, with  agg_pushdown_v4.tgz patch 
> https://www.postgresql.org/message-id/14577.1509723225%40localhost ?
> But it is not applied after partition-wise-agg-v6.tar.gz patch.
> Also postgres_fdw in 11dev is able to push down aggregates without 
> agg_pushdown_v4.tgz patch.
>
> In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch
> there is the following check:
>
>  /* Partial aggregates are not supported. */
> +       if (extra->isPartial)
> +           return;
>
> If we just comment this line then produced plan will be the following:
>
> shard=# explain select sum(product_id) from orders;
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
>    ->  Append  (cost=144.18..308.41 rows=2 width=8)
>          ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
>                Relations: Aggregate on (public.orders_0 orders)
>          ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
>                Relations: Aggregate on (public.orders_1 orders)
> (6 rows)
>
> And it is actually desired plan!
> Obviously such approach will not always work. FDW really doesn't 
> support partial aggregates now.
> But for most frequently used aggregates: sum, min, max, count 
> aggtype==aggtranstype and there is no difference
> between partial and normal aggregate calculation.
> So instead of (extra->isPartial) condition we can add more complex 
> check which will traverse pathtarget expressions and
> check if it can be evaluated in this way. Or... extend FDW API to 
> support partial aggregation.
>
> But even the last plan is not ideal: it will calculate predicates at 
> each remote node sequentially.
> There is parallel append patch:
> https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com 
>
> but ... FDW doesn't support parallel scan, so parallel append can not 
> be applied in this case.
> And we actually do not need parallel append with all its dynamic 
> workers here.
> We just need to start commands at all remote servers and only after it 
> fetch results (which can be done sequentially).
>
> I am investigating problem of efficient execution of OLAP queries on 
> sharded tables (tables with remote partitions).
> After reading all this threads and corresponding  patches, it seems to me
> that we already have most of parts of the puzzle, what we need is to 
> put them on right places and may be add missed ones.
> I wonder if somebody is busy with it and can I somehow help here?
>
> Also I am not quite sure about the best approach with parallel 
> execution of distributed query at all nodes.
> Should we make postgres_fdw parallel safe and use parallel append? How 
> difficult it will be?
> Or in addition to parallel append we should also have "asynchronous 
> append" which will be able to initiate execution at all nodes?
> It seems to be close to merge append, because it should simultaneously 
> traverse all cursors.
>
> Looks like second approach is easier for implementation. But in case 
> of sharded table, distributed query may need to traverse both remote
> and local shards and this approach doesn't allow to processed several 
> local shards in parallel.
>

I attach a small patch for postgres_fdw.c which allows concurrent
execution of aggregates by all remote servers (when they are accessed
through postgres_fdw).
I have added a "postgres_fdw.use_prefetch" GUC to enable/disable
prefetching data in postgres_fdw.
This patch should be applied on top of the partition-wise-agg-v6.tar.gz patch.
With shard example and the following two GUCs set:

shard=# set postgres_fdw.use_prefetch=on;
shard=# set enable_partition_wise_agg=on;
shard=# select sum(product_id) from orders;
    sum
---------
  9965891
(1 row)

shard=# explain select sum(product_id) from orders;
                            QUERY PLAN
----------------------------------------------------------------
  Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
    ->  Append  (cost=144.18..308.41 rows=2 width=8)
          ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
                Relations: Aggregate on (public.orders_0 orders)
          ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
                Relations: Aggregate on (public.orders_1 orders)
(6 rows)


The sum aggregate is calculated in parallel by both servers.

I have not tested it much; it is just a proof of concept.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Sun, Nov 12, 2017 at 1:59 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
Hi,

Attached new patch-set here. Changes include:

1. Added separate patch for costing Append node as discussed up-front in the
patch-set.
2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
GUC. So removed that. The remaining patch hence merged into main implementation
patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks

I applied partition-wise-agg-v6.tar.gz patch to  the master and use shard.sh example from https://www.postgresql.org/message-id/14577.1509723225%40localhost
Plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0 (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1 (cost=101.00..50195.13 rows=5000 width=0)


We really calculate partial aggregate for each partition, but to do we still have to fetch all data from remote host.
So for foreign partitions such plans is absolutely inefficient.
Amy be it should be combined with some other patch?
For example, with  agg_pushdown_v4.tgz patch https://www.postgresql.org/message-id/14577.1509723225%40localhost ?
But it is not applied after partition-wise-agg-v6.tar.gz patch.
Also postgres_fdw in 11dev is able to push down aggregates without agg_pushdown_v4.tgz patch.

In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch
there is the following check:

 /* Partial aggregates are not supported. */
+       if (extra->isPartial)
+           return;

If we just comment this line then produced plan will be the following:

shard=# explain select sum(product_id) from orders;
                           QUERY PLAN
----------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And it is actually desired plan!
Obviously such approach will not always work. FDW really doesn't support partial aggregates now.
But for most frequently used aggregates: sum, min, max, count aggtype==aggtranstype and there is no difference
between partial and normal aggregate calculation.
So instead of (extra->isPartial) condition we can add more complex check which will traverse pathtarget expressions and
check if it can be evaluated in this way. Or... extend FDW API to support partial aggregation.

As explained by Ashutosh Bapat in his reply
https://www.postgresql.org/message-id/CAFjFpRdpeMTd8kYbM_x0769V-aEKst5Nkg3+coG=8ki7s8Zqjw@mail.gmail.com
we cannot rely on just aggtype==aggtranstype.

However, I have tried pushing partial aggregation to the remote server and have also
submitted a PoC patch here:
https://www.postgresql.org/message-id/CAM2+6=UakP9+TSJuh2fbhHWNJc7OYFL1_gvu7mt2fXtVt6GY3g@mail.gmail.com

I later removed these patches from the partition-wise aggregation patch set
as it is altogether a different issue from this mail thread. We might need to
discuss it separately.
 

But even the last plan is not ideal: it will calculate predicates at each remote node sequentially.
There is parallel append patch:
https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com
but ... FDW doesn't support parallel scan, so parallel append can not be applied in this case.
And we actually do not need parallel append with all its dynamic workers here.
We just need to start commands at all remote servers and only after it fetch results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded tables (tables with remote partitions).
After reading all these threads and the corresponding patches, it seems to me
that we already have most of the pieces of the puzzle; what we need is to put them in the right places and maybe add the missing ones.
I wonder whether somebody is already busy with this, and whether I can somehow help here?

Also I am not quite sure about the best approach to parallel execution of a distributed query on all nodes.
Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be?
Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution on all nodes?
It seems close to merge append, because it would have to traverse all cursors simultaneously.

The second approach looks easier to implement. But in the case of a sharded table, a distributed query may need to traverse both remote
and local shards, and this approach does not allow several local shards to be processed in parallel.


Interesting idea of "asynchronous append". However, IMHO it deserves its own
email-chain.
 
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Nov 2, 2017 at 7:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 1, 2017 at 6:20 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Yep.
> But as David reported earlier, if we remove the first part i.e. adding
> cpu_operator_cost per tuple, Merge Append will be preferred over an Append
> node unlike before. And thus, I thought of better having both, but no so
> sure. Should we remove that part altogether, or add both in a single
> statement with updated comments?

I was only suggesting that you update the comments.

OK. Done in the attached patch set.

I have rebased all my patches on latest HEAD which is at
7518049980be1d90264addab003476ae105f70d4

Thanks


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Konstantin Knizhnik
Date:


On 15.11.2017 13:35, Jeevan Chalke wrote:

As explained by Ashutosh Bapat in reply
https://www.postgresql.org/message-id/CAFjFpRdpeMTd8kYbM_x0769V-aEKst5Nkg3+coG=8ki7s8Zqjw@mail.gmail.com
we cannot rely on just aggtype==aggtranstype.

Obviously this check (aggtype==aggtranstype) is not a correct criterion for all user-defined aggregates.
I just did it as a temporary workaround for standard aggregates.


However, I have tried pushing partial aggregation over remote server and also
submitted a PoC patch here:
https://www.postgresql.org/message-id/CAM2+6=UakP9+TSJuh2fbhHWNJc7OYFL1_gvu7mt2fXtVt6GY3g@mail.gmail.com

I have later removed these patches from Partition-wise-Aggregation patch set
as it is altogether a different issue than this mail thread. We might need to
discuss on it separately.
...
Interesting idea of "asynchronous append". However, IMHO it deserves its own
email-chain.

The main problem IMHO is that there are a lot of different threads and patches related to this topic :(
It is very difficult to combine all of them together to achieve the final goal: efficient execution of OLAP queries on sharded tables.
It would be nice if somebody who is making the most contribution in this direction could somehow maintain it...
I just faced a particular problem with our pg_shardman extension, and now (thanks to your patch) I have a working solution for it.
But certainly I would prefer to have this support in the mainstream version of Postgres.

There are two open questions which I want to discuss (sorry, maybe once again this is not the right thread for them):

1. Parallel append and FDW/postgres_fdw: should the FDW support parallel scan, and do we really need it in order to support concurrent execution of a query on local and remote partitions?
"Asynchronous append" partly solves this problem, but only for remote partitions. I do not completely understand all the complexity of the alternative approaches.

2. Right now partition-wise aggregation/grouping works only for tables partitioned using the new PG 10 partitioning mechanism, but not for inherited tables, although
there seems to be little difference between the two cases. Do you think it will eventually be supported for the standard inheritance mechanism as well, or is there no sense in it?


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Partition-wise aggregation/grouping

From
David Rowley
Date:
On 16 November 2017 at 05:57, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> The main problem IMHO is that there are a lot of different threads and
> patches related with this topic:(
> And it is very difficult to combine all of them together to achieve the
> final goal: efficient execution of OLAP queries on sharded table.
> It will be nice if somebody who is making the most contribution in this
> direction can somehow maintain it...
> I just faced with particular problem with our pg_shardman extension and now
> (thanks to your patch) I have some working solution for it.
> But certainly I prefer to have this support in mainstream version of
> Postgres.

I don't think it's fair to be asking about additional features on this
thread. It seems to me you're asking about two completely separate
features, with the aim of trying to solve your own problems.

It also looks to me that Jeevan has been clear on what his goals are
for this patch. Perhaps what you're asking for is a logical direction
to travel once this patch is committed, so I think, probably, the best
way to conduct what you're after here is to either:

a) Wait until this is committed and spin up your own thread about
your proposed changes to allow the PARTIAL aggregate to be pushed
into the foreign server, or;
b) Spin up your own thread now, with reference to this patch as a
prerequisite to your own patch.

I agree that what you're talking about is quite exciting stuff, but
please, let's not talk about it here.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Nov 16, 2017 at 6:02 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 16 November 2017 at 05:57, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> The main problem IMHO is that there are a lot of different threads and
>> patches related with this topic:(
>> And it is very difficult to combine all of them together to achieve the
>> final goal: efficient execution of OLAP queries on sharded table.
>> It will be nice if somebody who is making the most contribution in this
>> direction can somehow maintain it...
>> I just faced with particular problem with our pg_shardman extension and now
>> (thanks to your patch) I have some working solution for it.
>> But certainly I prefer to have this support in mainstream version of
>> Postgres.
>
> I don't think it's fair to be asking about additional features on this
> thread. It seems to me you're asking about two completely separate
> features, with the aim of trying to solve your own problems.
>
> It also looks to me that Jeevan has been clear on what his goals are
> for this patch. Perhaps what you're asking for is a logical direction
> to travel once this patch is committed, so I think, probably, the best
> way to conduct what you're after here is to either:
>
> a) Wait until this is committed and spin up your own thread about
> you're proposed changes to allow the PARTIAL aggregate to be pushed
> into the foreign server, or;
> b) Spin up your own thread now, with reference to this patch as a
> prerequisite to your own patch.
>
> I agree that what you're talking about is quite exciting stuff, but
> please, let's not talk about it here.
>

+1 for all that.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> OK. Done in the attached patch set.
>
> I have rebased all my patches on latest HEAD which is at
> 7518049980be1d90264addab003476ae105f70d4
>
> Thanks

These are review comments for the last set and I think most of them
apply to the new set as well.

Patches 0001 - 0005 refactor existing code. I haven't reviewed them in
detail to check whether we have missed anything in moving the code, but they
mostly look fine.

Comments on 0006

 /*
+ * cost_append
+ *      Determines and returns the cost of an Append node.
+ *
... clipped portion
+
+    /* Add Append node overhead. */
+    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

I am wondering whether it's really worth creating a new function for a single
line addition to create_append_path(). I think all we need to change in
create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR *
tuples) to path->total_cost.


+    /* Add MergeAppend node overhead like we do it for the Append node */
+    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

With this change the following comment is no more true. Please remove it.

     * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
     * node doesn't do qual-checking or projection, so it has less overhead
     * than most plan nodes.
     */

+/*
+ * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note that

May be reword it as " ... to cost per tuple processing by an append node ..."

+ * this value should be multiplied with cpu_tuple_cost wherever applicable.
+ */
+#define DEFAULT_APPEND_COST_FACTOR 0.5

I am wondering whether we should just define
#define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
and use this macro everywhere. What else use DEFAULT_APPEND_COST_FACTOR would
have other than multiplying with cpu_tuple_cost?
 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
-                                      QUERY PLAN
---------------------------------------------------------------------------------------
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
  Sort
    Sort Key: t1.c, t3.c
    ->  HashAggregate
          Group Key: t1.c, t2.c, t3.c
-         ->  Result
+         ->  Hash Join
+               Hash Cond: (t1.c = t2.c)
                ->  Append
-                     ->  Hash Join
-                           Hash Cond: (t1.c = t2.c)

That's sad. Interestingly this query has an aggregate, so the plan will use
partition-wise join again when the partition-wise aggregation patch is
applied. So it may be fine.

- Append  (cost=0.00..0.04 rows=2 width=32)
+ Append  (cost=0.00..0.05 rows=2 width=32)

- Append  (cost=0.00..0.04 rows=2 width=4)
+ Append  (cost=0.00..0.05 rows=2 width=4)

We do have some testcases which print costs. Interesting :). I don't have
any objection to this change.

Comments on 0007

+       <para>
+        Enables or disables the query planner's use of partition-wise grouping
+        or aggregation, which allows  If partition-wise aggregation
does not result in the
+        cheapest path, it will still spend time in creating these paths and
+        consume memory which increase linearly with the number of partitions.
+        The default is <literal>off</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
May be we should word this in the same manner as partition-wise join like
        Enables or disables the query planner's use of partition-wise grouping
        or aggregation, which allows aggregation or grouping on partitioned
        tables to be spread across the partitions. If the <literal>GROUP
        BY</literal> clause includes partition keys, the rows are aggregated at
        each partition. Otherwise, partial aggregates computed for each
        partition are required to be combined. Because partition-wise
        aggregation/grouping can use significantly more CPU time and memory
        during planning, the default is <literal>off</literal>.

+
+Partition-wise aggregates/grouping
+----------------------------------

... clipped patch

+In above plan, aggregation is performed after append node which means that the
+whole table is an input for the aggregation node. However, with partition-wise
+aggregation, same query will have plane like:

s/plane/plan/

+ Append

... clipped patch

+PartialAggregate stage greatly reduces the number of groups and lose if we have
+lots of small groups.

To keep the discussion brief, I suggest we rewrite this paragraph as

----
If GROUP BY clause has all partition keys, all the rows that belong to a given
group come from a single partition and thus aggregates can be finalized
separately for each partition. When the number of groups is far less than the
number of rows being grouped, as is usually the case, this reduces the number
of rows processed by the Append node, apart from reducing the size of the hash
table or the size of the data to be sorted. This usually improves the
efficiency of the query. If GROUP BY doesn't contain all the partition keys,
partial aggregates can be computed for each partition, followed by combining
the partial aggregates from one or more partitions belonging to the same group
to compute the complete aggregate for each group. This improves the efficiency
of the query if the number of groups is far less than the number of rows
produced by the scan underneath.
---

I am not sure whether we should be discussing why this technique performs
better or when it performs better. We don't have similar discussion for
partition-wise join. That paragraph just describes the technique and may be we
want to do the same here.

+ *
+ * extra is the additional information required when we are doing aggregation
+ * or grouping below the append node. In case of partial partition-wise
+ * aggregation on a child node, we need to compute finalized step after the
+ * append, which cannot be done in this function. And thus if we have non-NULL
+ * value for extra, we call create_partition_agg_paths() to create an append
+ * node and finalization, if any.

May be we want to just say "extra provides more information about the
partitioned aggregation/grouping, e.g. path target, whether to use partial
aggregate and so on." When present, we call create_partition_agg_paths() to add
paths for partition-wise aggregates.

-        add_path(rel, (Path *) create_append_path(rel, subpaths,
-                                                  rel->reltarget, NULL, 0,
-                                                  partitioned_rels));
+    {
+        if (extra)
+            create_partition_agg_paths(root, rel, subpaths, NIL,
+                                       NIL, NULL, 0,
+                                       partitioned_rels, extra);
+        else
+            add_path(rel, (Path *) create_append_path(rel, subpaths,
+                                                      rel->reltarget, NULL, 0,
+                                                      partitioned_rels));
+    }

I am wondering whether we could write a function to call appropriate one out of
create_append_path(), create_partition_agg_paths() or
create_merge_append_path() based on the presence of extra and/or pathkeys and
use it everywhere such a change is made. I don't know whether that will be
worth the code. But there are a handful places where such diffs are required.

-
-    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+                                   IS_OTHER_REL(best_path->subpath->parent) ?
+                                   best_path->path.parent->relids : NULL);

While I can guess why this change is required, it may be better to separate it
into a patch of its own and adding some explanation in the commit message, for
other reviewers.

+    /* Copy input rels's relids to grouped rel */
+    grouped_rel->relids = input_rel->relids;

I am fine with this change, but Tom may not agree [1]. May be we should get his
opinion on this one.
    /*
+     * If input relation is partitioned, check if we can perform
+     * partition-wise grouping and/or aggregation.
+     */

Just like partition-wise join, a concise "Apply partition-wise aggregation
technique, if possible." would suffice.
     dNumPartialGroups = get_number_of_groups(root,
                                              cheapest_partial_path->rows,
                                              gd,
-                                             parse->targetList);
+                                             make_tlist_from_pathtarget(target));

Can we guarantee that the output of make_tlist_from_pathtarget() will be the
same as the translation of parse->targetList for the given child? Even if not,
may be it's fine to pass a slightly different tlist to get_number_of_groups()
since it doesn't depend upon the exact shape but on the right group column
references.
Nonetheless something to test and verify.
  *
- * Determines whether parallel grouping and/or aggrgation is possible, or not.
+ * Determines whether parallel grouping and/or aggregation is possible, or not.
  * Returns true when possible, false otherwise.

Does this hunk belong to one of the refactoring patches or as a separate patch
correcting a typo?

+/*
+ * try_partition_wise_grouping
+ *
+ * If the input relation is partitioned and the partition keys are part of the
+ * group by clauses, each partition produces a different set of groups.
+ * Aggregates within each such group can be computed partition-wise. This

While these sentences are correct, I think the reason why we could compute an
aggregate at the level of each partition is because rows from a given group
belong to a single partition. So, I guess, we have to reword this as

"If the partition keys of input relation are part of group by clause, all the
rows belonging to a given group come from a single partition, each partition
producing a different set of groups. This allows aggregation/grouping over a
partitioned relation to be broken down into aggregation/grouping on each
partition.

If group by clause does not contain all the partition keys, rows from a given
group may be spread across multiple partitions. In that case, we can combine
partial aggregates for a given group across partitions to produce the final
aggregate for that group.

+ * might be optimal because of presence of suitable paths with pathkeys or
+ * because the hash tables for most of the partitions fit into the memory.
+ * However, if partition keys are not part of the group by clauses, then we
+ * still able to compute the partial aggregation for each partition and then
+ * finalize them over append. This can either win or lose. It may win if the
+ * PartialAggregate stage greatly reduces the number of groups and lose if we
+ * have lots of small groups.

I have not seen prologue of a function implementing a query optimization
technique explain why that technique improves performance. So I am not sure
whether the comment should include this explanation. One of the reasons being
that the reasons why a technique works might change over the period of time
with the introduction of other techniques, thus obsoleting the comment.  But
may be it's good to have it here.

+    /*
+     * Grouping sets plan does not work with an inheritance subtree (see notes
+     * in create_groupingsets_plan). Thus do not handle grouping sets here.
+     */
+    if (query->groupingSets || gd)
+        return;

Even if that restriction is lifted, we won't be able to compute "whole"
grouping sets for each partition, since grouping sets imply multiple group by
clauses, each
of which may not have all partition keys. Those sets which have all partition
keys will be computed completely for each partition, but others will require
partial aggregation. I guess, we will need to apply partition-wise aggregation
at each derived group by clause and not as a whole-sale strategy.

Anyway, it doesn't look like a good idea to pass an argument (gd) only to
return from that function in case of its presence. May be we should handle it
outside this function.

+
+    /* Nothing to do, if the input relation is not partitioned. */
+    if (!input_rel->part_scheme)
+        return;
+
+    Assert(input_rel->part_rels);

For a join between two partitioned tables with one of them being a dummy
relation, the joinrel would have part_scheme set but not part_rels (see
try_partition_wise_join()). This assertion would fail in such a case. Have you
tested the case? May be we should just test whether input_rel->part_rels
exists, similar to generate_partition_wise_join_paths().
Also, how is a dummy input relation handled in this function? Do we need to
handle it?

+    nparts = input_rel->nparts;
+    part_rels = (RelOptInfo **) palloc(nparts * sizeof(RelOptInfo *));
+    grouped_rel->part_rels = part_rels;

For a partial aggregation, we can't say that the child rels produced here are
partitions of the top grouped relation, so setting part_rels looks wrong. We
should set this only when a full aggregate is obtained from each partition.

+        scanjoin_target =
+            copy_pathtarget(input_rel->cheapest_startup_path->pathtarget);
+        scanjoin_target->exprs = (List *) adjust_appendrel_attrs(root,
+                                               (Node *) scanjoin_target->exprs,
+                                                                 nappinfos,
+                                                                 appinfos);

Why can't we use input_child_rel->pathtarget? It should be same as the
translation of its parent's path target. I probably understand that's because
the input rel's path targets have been changed after the underlying join was
planned, a step which is not applied to the individual children. May be add a
comment here?

+        child_target->exprs = (List *) adjust_appendrel_attrs(root,
+                                                      (Node *) target->exprs,
+                                                              nappinfos,
+                                                              appinfos);
+        partial_target = make_partial_grouping_target(root, target);
+        partial_target->exprs = (List *) adjust_appendrel_attrs(root,
+                                              (Node *) partial_target->exprs,
+                                                                nappinfos,
+                                                                appinfos);

We need both of these steps for any aggregate since parallel paths will compute
partial paths anyway. If that's correct, may be we should add a comment?

+        extra.inputRows = 0;    /* Not needed at child paths creation */

Why? Comment should be on its own line.

+        if (!create_child_grouping_paths(root, input_child_rel, agg_costs, gd,
+                                         &extra))
+        {
+            /* Could not create path for childrel, return */
+            pfree(appinfos);
+            return;
+        }

Can we detect this condition and bail out even before planning any of the
children? It looks wasteful to try to plan children only to bail out in this
case.

+    /* Nothing to do if we have no live children */
+    if (live_children == NIL)
+        return;

A parent relation with all dummy children will also be dummy. May be we should
mark the parent dummy case using mark_dummy_rel() similar to
generate_partition_wise_join_paths().

+/*
+ * have_grouping_by_partkey
+ *

Somehow this name sounds like it would return true when GROUP BY contains only
the partition key. May be rename it as group_by_has_partkey? to indicate that
the GROUP BY clause contains the partition keys.

+ * Returns true, if partition keys of the given relation are part of the
+ * GROUP BY clauses, false otherwise.

Reword as " ... if all the partition keys of ... "

+static bool
+have_grouping_by_partkey(RelOptInfo *input_rel, PathTarget *target,
+                         List *groupClause)
+{
+    List       *tlist = make_tlist_from_pathtarget(target);
+    List       *groupexprs = get_sortgrouplist_exprs(groupClause, tlist);

Have we tested the case with multi-level partitioned table and children with
different order of partition key columns?

+        partexprs = input_rel->partexprs ? input_rel->partexprs[cnt] : NIL;
+
+        /* Rule out early, if there are no partition keys present */
+        if (partexprs == NIL)
+            return false;

If input_rel->partexprs is NIL, we should "bail" out even before the loop
starts.

+        foreach(lc, partexprs)
+        {
+            Expr       *partexpr = lfirst(lc);
+
+            if (list_member(groupexprs, partexpr))
+            {
+                found = true;
+                break;
+            }
+        }

This looks like a useful piece of general functionality,
list_has_intersection(), which would return a boolean instead of the whole
intersection. I am not sure whether we should add that function to list.c and
use it here.

+         * If none of the partition key matches with any of the GROUP BY

Reword as "... the partition key expressions match with ...."

This isn't a full review of 0007, but I think it covers most of the new
functionality.

[1] https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo+MZsPXidZ077h=gp9bvQw@mail.gmail.com



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Nov 17, 2017 at 7:24 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> + * this value should be multiplied with cpu_tuple_cost wherever applicable.
> + */
> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I am wondering whether we should just define
> #define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
> and use this macro everywhere. What else use DEFAULT_APPEND_COST_FACTOR would
> have other than multiplying with cpu_tuple_cost?

-1.  If you wrap it in a macro like that, future readers of the code
will have to go look up what the macro does.  If you just multiply by
DEFAULT_APPEND_COST_FACTOR it will be clear that's what being used is
a multiple of cpu_tuple_cost.

> I am not sure whether we should be discussing why this technique performs
> better or when it performs better. We don't have similar discussion for
> partition-wise join. That paragraph just describes the technique and may be we
> want to do the same here.

+1.

> + * might be optimal because of presence of suitable paths with pathkeys or
> + * because the hash tables for most of the partitions fit into the memory.
> + * However, if partition keys are not part of the group by clauses, then we
> + * still able to compute the partial aggregation for each partition and then
> + * finalize them over append. This can either win or lose. It may win if the
> + * PartialAggregate stage greatly reduces the number of groups and lose if we
> + * have lots of small groups.
>
> I have not seen prologue of a function implementing a query optimization
> technique explain why that technique improves performance. So I am not sure
> whether the comment should include this explanation. One of the reasons being
> that the reasons why a technique works might change over the period of time
> with the introduction of other techniques, thus obsoleting the comment.  But
> may be it's good to have it here.

+1 for keeping it.

> +        extra.inputRows = 0;    /* Not needed at child paths creation */
>
> Why? Comment should be on its own line.

Comments on same line are fine if they are short enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Fri, Nov 17, 2017 at 5:54 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> OK. Done in the attached patch set.
>
> I have rebased all my patches on latest HEAD which is at
> 7518049980be1d90264addab003476ae105f70d4
>
> Thanks

These are review comments for the last set and I think most of them
apply to the new set as well.

Patches 0001 - 0005 refactoring existing code. I haven't
reviewed them in detail, checking whether we have missed anything in moving the
code, but they mostly look fine.

Thanks.
 

Comments on 0006
 /*
+ * cost_append
+ *      Determines and returns the cost of an Append node.
+ *
... clipped portion
+
+    /* Add Append node overhead. */
+    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

I am wondering whether it's really worth creating a new function for a single
line addition to create_append_path(). I think all we need to change in
create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR *
tuples) to path->total_cost.

Agree. However, there was an existing comment in create_append_path() saying
"We don't bother with inventing a cost_append(), but just do it here", which
implies that at some point in the future we may need it; so why not now, when we are explicitly costing an append node? Having a function is good so that, if required in the future, we need to update only this function.
Let me know if you think otherwise; I will make those changes in the next patch set.



+    /* Add MergeAppend node overhead like we do it for the Append node */
+    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

With this change the following comment is no more true. Please remove it.
     * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
     * node doesn't do qual-checking or projection, so it has less overhead
     * than most plan nodes.
     */

 
This was already fixed in v7.

+/*
+ * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note that

May be reword it as " ... to cost per tuple processing by an append node ..."

Done.
 

+ * this value should be multiplied with cpu_tuple_cost wherever applicable.
+ */
+#define DEFAULT_APPEND_COST_FACTOR 0.5

I am wondering whether we should just define
#define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
and use this macro everywhere. What else use DEFAULT_APPEND_COST_FACTOR would
have other than multiplying with cpu_tuple_cost?

As suggested by Robert, I have renamed it to APPEND_CPU_COST_MULTIPLIER in the v7 patch set.
Also, I have retained the #define for just the multiplier, as suggested by Robert.


 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
-                                      QUERY PLAN
---------------------------------------------------------------------------------------
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
  Sort
    Sort Key: t1.c, t3.c
    ->  HashAggregate
          Group Key: t1.c, t2.c, t3.c
-         ->  Result
+         ->  Hash Join
+               Hash Cond: (t1.c = t2.c)
                ->  Append
-                     ->  Hash Join
-                           Hash Cond: (t1.c = t2.c)

That's sad. Interestingly, this query has an aggregate, so the plan will use
partition-wise join again once the partition-wise aggregation patch is
applied. So this may be fine.

Yep. I have modified this testcase and enabled partition-wise aggregation before this test, so that we will see the desired plan.


- Append  (cost=0.00..0.04 rows=2 width=32)
+ Append  (cost=0.00..0.05 rows=2 width=32)

- Append  (cost=0.00..0.04 rows=2 width=4)
+ Append  (cost=0.00..0.05 rows=2 width=4)

We do have some testcases which print costs. Interesting :). I don't have
any objection to this change.

OK. Thanks.


Comments on 0007

+       <para>
+        Enables or disables the query planner's use of partition-wise grouping
+        or aggregation, which allows  If partition-wise aggregation
does not result in the
+        cheapest path, it will still spend time in creating these paths and
+        consume memory which increase linearly with the number of partitions.
+        The default is <literal>off</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
May be we should word this in the same manner as partition-wise join like

        Enables or disables the query planner's use of partition-wise grouping
        or aggregation, which allows aggregation or grouping on partitioned
        tables to be spread across the partitions. If the <literal>GROUP
        BY</literal> clause includes partition keys, the rows are aggregated at
        each partition. Otherwise, partial aggregates computed for each
        partition are required to be combined. Because partition-wise
        aggregation/grouping can use significantly more CPU time and memory
        during planning, the default is <literal>off</literal>.

Done. Thanks for the new wordings.


+
+Partition-wise aggregates/grouping
+----------------------------------

... clipped patch

+In above plan, aggregation is performed after append node which means that the
+whole table is an input for the aggregation node. However, with partition-wise
+aggregation, same query will have plane like:

s/plane/plan/

Oops. Fixed.
 

+ Append

... clipped patch

+PartialAggregate stage greatly reduces the number of groups and lose if we have
+lots of small groups.

To keep the discussion brief, I suggest we rewrite this paragraph as

----
If the GROUP BY clause has all the partition keys, all the rows that belong to
a given group come from a single partition and thus aggregates can be finalized
separately for each partition. When the number of groups is far smaller than
the number of rows being grouped, as is usually the case, this reduces the
number of rows processed by the Append node, apart from reducing the size of
the hash table or of the data to be sorted. This usually improves the
efficiency of the query. If the GROUP BY clause doesn't contain all the
partition keys, partial aggregates can be computed for each partition, followed
by combining the partial aggregates from partitions belonging to the same group
to compute the complete aggregate for each group. This improves the efficiency
of the query if the number of groups is far smaller than the number of rows
produced by the scan underneath.
---

I am not sure whether we should be discussing why this technique performs
better or when it performs better. We don't have similar discussion for
partition-wise join. That paragraph just describes the technique and may be we
want to do the same here.

OK.
I have removed the text explaining when it performs better.
Please have a look over new text and let me know your views.
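
For what it's worth, both shapes described in the new text can be exercised with a small example (the table here is hypothetical; the GUC name is the one used elsewhere in this thread):

```sql
CREATE TABLE pagg (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE pagg_p1 PARTITION OF pagg FOR VALUES FROM (0) TO (10);
CREATE TABLE pagg_p2 PARTITION OF pagg FOR VALUES FROM (10) TO (20);

SET enable_partition_wise_agg = on;

-- GROUP BY contains the partition key: aggregates can be finalized
-- per partition, below the Append.
EXPLAIN (COSTS OFF) SELECT a, count(*) FROM pagg GROUP BY a;

-- GROUP BY does not contain the partition key: partial aggregates are
-- computed per partition and combined above the Append.
EXPLAIN (COSTS OFF) SELECT b, count(*) FROM pagg GROUP BY b;
```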
 

+ *
+ * extra is the additional information required when we are doing aggregation
+ * or grouping below the append node. In case of partial partition-wise
+ * aggregation on a child node, we need to compute finalized step after the
+ * append, which cannot be done in this function. And thus if we have non-NULL
+ * value for extra, we call create_partition_agg_paths() to create an append
+ * node and finalization, if any.

Maybe we want to just say "extra provides more information about the
partitioned aggregation/grouping, e.g. path target, whether to use partial
aggregation and so on." When present, we call create_partition_agg_paths() to
add paths for partition-wise aggregates.

Done.
 

-        add_path(rel, (Path *) create_append_path(rel, subpaths,
-                                                  rel->reltarget, NULL, 0,
-                                                  partitioned_rels));
+    {
+        if (extra)
+            create_partition_agg_paths(root, rel, subpaths, NIL,
+                                       NIL, NULL, 0,
+                                       partitioned_rels, extra);
+        else
+            add_path(rel, (Path *) create_append_path(rel, subpaths,
+                                                      rel->reltarget, NULL, 0,
+                                                      partitioned_rels));
+    }

I am wondering whether we could write a function to call appropriate one out of
create_append_path(), create_partition_agg_paths() or
create_merge_append_path() based on the presence of extra and/or pathkeys and
use it everywhere such a change is made. I don't know whether that will be
worth the code. But there are a handful places where such diffs are required.

Done.
Added a function named add_append_path() which does the same. The function name seems too generic; it would be good if you could suggest a few alternatives.


-
-    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+                                   IS_OTHER_REL(best_path->subpath->parent) ?
+                                   best_path->path.parent->relids : NULL);

While I can guess why this change is required, it may be better to separate it
into a patch of its own and add some explanation in the commit message for
other reviewers.

Done.
Tried adding comments.  See whether they make sense or need further improvement.


+    /* Copy input rels's relids to grouped rel */
+    grouped_rel->relids = input_rel->relids;

I am fine with this change, but Tom may not agree [1]. May be we should get his
opinion on this one.

Yep. Agree.
This is mainly required for FDW, as all_baserels does not have relids for the child relations. Another solution I can think of is to have, say, rel->fs_relids, set it appropriately, and use it in create_foreignscan_plan().


     /*
+     * If input relation is partitioned, check if we can perform
+     * partition-wise grouping and/or aggregation.
+     */

Just like partition-wise join a concise "Apply partition-wise aggregation
technique, if possible." would suffice.

Done.
 

     dNumPartialGroups = get_number_of_groups(root,
                                              cheapest_partial_path->rows,
                                              gd,
-                                             parse->targetList);
+
make_tlist_from_pathtarget(target));
Can we guarantee that the output of make_tlist_from_pathtarget() will be the
same as the translation of parse->targetList for the given child? Even if not,
maybe it's fine to pass a slightly different tlist to get_number_of_groups()
since it doesn't depend upon the exact shape but on the right group column
references. Nonetheless, something to test and verify.

We are interested here in getting the group expressions. get_number_of_groups() fetches the group expressions from the tlist by checking tle->ressortgroupref, which is presumably the same as the path target's sortgrouprefs. So I don't see any issue here.


  *
- * Determines whether parallel grouping and/or aggrgation is possible, or not.
+ * Determines whether parallel grouping and/or aggregation is possible, or not.
  * Returns true when possible, false otherwise.

Does this hunk belong to one of the refactoring patches or as a separate patch
correcting a typo?

Oops. Moved to the appropriate refactoring patch.


+/*
+ * try_partition_wise_grouping
+ *
+ * If the input relation is partitioned and the partition keys are part of the
+ * group by clauses, each partition produces a different set of groups.
+ * Aggregates within each such group can be computed partition-wise. This

While these sentences are correct, I think the reason why we could compute an
aggregate at the level of each partition is because rows from a given group
belong to a single partition. So, I guess, we have to reword this as

"If the partition keys of the input relation are part of the group by clause,
all the rows belonging to a given group come from a single partition, with each
partition producing a different set of groups. This allows aggregation/grouping
over a partitioned relation to be broken down into aggregation/grouping on each
partition.

If the group by clause does not contain all the partition keys, rows from a
given group may be spread across multiple partitions. In that case, we can
combine partial aggregates for a given group across partitions to produce the
final aggregate for that group."

Done. Thanks.


+ * might be optimal because of presence of suitable paths with pathkeys or
+ * because the hash tables for most of the partitions fit into the memory.
+ * However, if partition keys are not part of the group by clauses, then we
+ * still able to compute the partial aggregation for each partition and then
+ * finalize them over append. This can either win or lose. It may win if the
+ * PartialAggregate stage greatly reduces the number of groups and lose if we
+ * have lots of small groups.

I have not seen prologue of a function implementing a query optimization
technique explain why that technique improves performance. So I am not sure
whether the comment should include this explanation. One of the reasons being
that the reasons why a technique works might change over the period of time
with the introduction of other techniques, thus obsoleting the comment.  But
may be it's good to have it here.

Yep. Retained.


+    /*
+     * Grouping sets plan does not work with an inheritance subtree (see notes
+     * in create_groupingsets_plan). Thus do not handle grouping sets here.
+     */
+    if (query->groupingSets || gd)
+        return;

Even if that restriction is lifted, we won't be able to compute "whole"
grouping sets for each partition, since grouping sets imply multiple group by
clauses, each of which may not have all partition keys. Those sets which have
all the partition keys will be computed completely for each partition, but
others will require partial aggregation. I guess we will need to apply
partition-wise aggregation to each derived group by clause and not as a
wholesale strategy.
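
To illustrate (with a hypothetical table pagg partitioned by column a): in GROUPING SETS ((a, b), (b)), the set (a, b) contains the partition key and keeps each group within one partition, while the set (b) mixes rows from different partitions and would need partial aggregation.

```sql
-- Hypothetical: pagg is partitioned by (a).  The grouping set (a, b)
-- keeps each group within one partition; the set (b) does not.
EXPLAIN (COSTS OFF)
SELECT a, b, count(*) FROM pagg
GROUP BY GROUPING SETS ((a, b), (b));
```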

Done.
 

Anyway, it doesn't look like a good idea to pass an argument (gd) only to
return from that function in case of its presence. May be we should handle it
outside this function.

Well, I would like to have it inside the function itself. Let the function itself do all the necessary checking rather than doing some of it outside.


+
+    /* Nothing to do, if the input relation is not partitioned. */
+    if (!input_rel->part_scheme)
+        return;
+
+    Assert(input_rel->part_rels);

A join between two partitioned tables, with one of them being a dummy
relation, would have part_scheme set but not part_rels (see
try_partition_wise_join()). This assertion would fail in such a case. Have you
tested that case? Maybe we should just test whether input_rel->part_rels
exists, similar to generate_partition_wise_join_paths().

Yep. This was already fixed in v7 and also has a covering testcase.

Also, how is a dummy input relation handled in this function? Do we need to
handle it?

Yes, we need to handle. Need to return without doing PWA when input relation itself is dummy.
Added covering testcase for it.

 
+    nparts = input_rel->nparts;
+    part_rels = (RelOptInfo **) palloc(nparts * sizeof(RelOptInfo *));
+    grouped_rel->part_rels = part_rels;

For a partial aggregation, we can't say that the child rels produced here are
partitions of the top grouped relation, so setting part_rels looks wrong. We
should set this only when a full aggregate is obtained from each partition.

Done.
 

+        scanjoin_target =
copy_pathtarget(input_rel->cheapest_startup_path->pathtarget);
+        scanjoin_target->exprs = (List *) adjust_appendrel_attrs(root,
+
(Node *) scanjoin_target->exprs,
+                                                                 nappinfos,
+                                                                 appinfos);

Why can't we use input_child_rel->pathtarget? It should be the same as the
translation of its parent's path target. I probably understand: that's because
the input rel's path targets have been changed after the underlying join was
planned, a step which is not applied to the individual children. Maybe add a
comment here?

Done. Added comments.


+        child_target->exprs = (List *) adjust_appendrel_attrs(root,
+                                                              (Node
*) target->exprs,
+                                                              nappinfos,
+                                                              appinfos);
+        partial_target = make_partial_grouping_target(root, target);
+        partial_target->exprs = (List *) adjust_appendrel_attrs(root,
+                                                                (Node
*) partial_target->exprs,
+                                                                nappinfos,
+                                                                appinfos);

We need both of these steps for any aggregate since parallel paths will compute
partial paths anyway. If that's correct, maybe we should add a comment?

Done.
 

+        extra.inputRows = 0;    /* Not needed at child paths creation */

Why? Comment should be on its own line.

It is actually not used in create_child_grouping_paths(). But setting that value has no side effect, so I set it correctly and removed the comment.


+        if (!create_child_grouping_paths(root, input_child_rel, agg_costs, gd,
+                                         &extra))
+        {
+            /* Could not create path for childrel, return */
+            pfree(appinfos);
+            return;
+        }

Can we detect this condition and bail out even before planning any of the
children? It looks wasteful to try to plan children only to bail out in this
case.

I don't think so. It is effectively unreachable code, added just as a safety net in case we cannot create a child path. The bail-out conditions cannot be evaluated at the beginning. Do you think an Assert() would be good here? Am I missing something?


+    /* Nothing to do if we have no live children */
+    if (live_children == NIL)
+        return;

A parent relation with all dummy children will also be dummy. Maybe we should
mark the parent as dummy using mark_dummy_rel(), similar to
generate_partition_wise_join_paths().

If the parent is dummy, then we are not doing PWA at all. So I guess there is no need to mark the parent grouped_rel as dummy.
However, if some of the children are dummy, I am marking the corresponding upper rel as dummy too.
Actually, this condition is never going to be true since, as you said correctly, "A parent relation with all dummy children will also be dummy". Should we have an Assert() instead?


+/*
+ * have_grouping_by_partkey
+ *

Somehow this name sounds like it would return true when GROUP BY contains only
the partition key. Maybe rename it as group_by_has_partkey? to indicate that

OK. Renamed.


+ * Returns true, if partition keys of the given relation are part of the
+ * GROUP BY clauses, false otherwise.

Reword as " ... if all the partition keys of ... "

Done.
 

+static bool
+have_grouping_by_partkey(RelOptInfo *input_rel, PathTarget *target,
+                         List *groupClause)
+{
+    List       *tlist = make_tlist_from_pathtarget(target);
+    List       *groupexprs = get_sortgrouplist_exprs(groupClause, tlist);

Have we tested the case with multi-level partitioned table and children with
different order of partition key columns?

I have a testcase for a multi-level partitioned table.
However, I did not understand what you mean by "children with different order of partition key columns". I had a look over the tests in partition_join.sql and it seems that I have covered all those scenarios. Please have a look over the testcases added for PWA and let me know the missing scenarios; I will add them then.


+        partexprs = input_rel->partexprs ? input_rel->partexprs[cnt] : NIL;
+
+        /* Rule out early, if there are no partition keys present */
+        if (partexprs == NIL)
+            return false;

If input_rel->partexprs is NIL, we should "bail" out even before the loop
starts.

Done.
 

+        foreach(lc, partexprs)
+        {
+            Expr       *partexpr = lfirst(lc);
+
+            if (list_member(groupexprs, partexpr))
+            {
+                found = true;
+                break;
+            }
+        }

This looks like a useful piece of general functionality,
list_has_intersection(), which would return a boolean instead of the whole
intersection. I am not sure whether we should add that function to list.c and
use it here.

Sounds good.
But for now, I am keeping it as part of this feature itself.
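
As a sketch, the suggested helper could short-circuit as soon as a common member is found. The version below is written against a minimal stand-in for PostgreSQL's List so it is self-contained; the real thing would live in list.c and compare nodes with equal() rather than raw pointers:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for PostgreSQL's List of pointers; not pg_list.h. */
typedef struct ListCell
{
    void       *ptr;
    struct ListCell *next;
} ListCell;

/* Return true if datum appears in list (pointer equality, for this sketch). */
static bool
list_member_ptr(const ListCell *list, const void *datum)
{
    for (const ListCell *lc = list; lc != NULL; lc = lc->next)
    {
        if (lc->ptr == datum)
            return true;
    }
    return false;
}

/*
 * Return true as soon as any member of list1 also appears in list2,
 * without materializing the whole intersection.
 */
static bool
list_has_intersection(const ListCell *list1, const ListCell *list2)
{
    for (const ListCell *lc = list1; lc != NULL; lc = lc->next)
    {
        if (list_member_ptr(list2, lc->ptr))
            return true;
    }
    return false;
}
```

The early return is the whole point: list_intersection() builds and returns the full result list, which callers like have_grouping_by_partkey() would immediately discard.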
 

+         * If none of the partition key matches with any of the GROUP BY

Reword as "... the partition key expressions match with ...."

Done.
 

This isn't a full review of 0007, but I think it covers most of the new
functionality.

[1] https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo+MZsPXidZ077h=gp9bvQw@mail.gmail.com



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Thanks for the detailed review, Ashutosh.

Let me know if I missed any comment to be fixed.

Thanks


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> Agree. However, there was an existing comment in create_append_path() saying
> "We don't bother with inventing a cost_append(), but just do it here", which
> implies that at some time in the future we may need it; so why not now, where
> we are explicitly costing for an append node. Having a function is good so
> that, if required in future, we need to update only this function.
> Let me know if you think otherwise; I will make those changes in the next patchset.
>

I don't read that comment as something we will do in the future. I
don't think the amount of changes that this patch introduces just for
adding one more line of code is justified. There's anyway only one
place where we are costing append, so it's not that the new function
avoids code duplication. Although I am happy to defer this to the
committer, if you think that we need a separate function.

>
> As suggested by Robert, I have renamed it to APPEND_CPU_COST_MULTIPLIER in
> v7 patchset.
> Also, retained the #define for just multiplier as suggested by Robert.

Ok.


>
>>
>>
>> Anyway, it doesn't look like a good idea to pass an argument (gd) only to
>> return from that function in case of its presence. May be we should handle
>> it
>> outside this function.
>
>
> Well, I would like to have it inside the function itself. Let the function
> itself do all the necessary checking rather than doing some of them outside.

We will leave this to the committer. I don't like that style, but it's
also good to expect a function to do all related work.


>>
>> +        if (!create_child_grouping_paths(root, input_child_rel,
>> agg_costs, gd,
>> +                                         &extra))
>> +        {
>> +            /* Could not create path for childrel, return */
>> +            pfree(appinfos);
>> +            return;
>> +        }
>>
>> Can we detect this condition and bail out even before planning any of the
>> children? It looks wasteful to try to plan children only to bail out in
>> this
>> case.
>
>
> I don't think so. It is like non-reachable and added just for a safety in
> case we can't able to create a child path. The bail out conditions cannot be
> evaluated at the beginning. Do you this an Assert() will be good here? Am I
> missing something?

An Assert would help. If it's something that should not happen, we
should try catching that rather than silently ignoring it.

>
>>
>> +    /* Nothing to do if we have no live children */
>> +    if (live_children == NIL)
>> +        return;
>>
>> A parent relation with all dummy children will also be dummy. May be we
>> should
>> mark the parent dummy case using mark_dummy_rel() similar to
>> generate_partition_wise_join_paths().
>
>
> If parent is dummy, then we are not at all doing PWA. So no need to mark
> parent grouped_rel as dummy I guess.
> However, if some of the children are dummy, I am marking corresponding upper
> rel as dummy too.
> Actually, this condition will never going to be true as you said correctly
> that "A parent relation with all dummy children will also be dummy". Should
> we have an Assert() instead?

Yes.


>
>
> I have testcase for multi-level partitioned table.
> However, I did not understand by what you mean by "children with different
> order of partition key columns". I had a look over tests in
> partition_join.sql and it seems that I have cover all those scenarios.
> Please have a look over testcases added for PWA and let me know the
> scenarios missing, I will add them then.

By children with different order of partition key columns, I meant
something like this

parent(a int, b int, c int) partition by (a), child1(b int, c int, a
int) partition by b, child1_1 (c int, a int, b int);

where the attribute numbers of the partition keys in different
children are different.
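
In SQL, the shorthand above corresponds to something like the following (hypothetical names; attaching standalone tables is what lets the column order differ per child):

```sql
CREATE TABLE parent (a int, b int, c int) PARTITION BY LIST (a);

-- "a" is attnum 3 here, not 1 as in the parent.
CREATE TABLE child1 (b int, c int, a int) PARTITION BY LIST (b);
ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN (1);

-- Yet another column order at the next level down.
CREATE TABLE child1_1 (c int, a int, b int);
ALTER TABLE child1 ATTACH PARTITION child1_1 FOR VALUES IN (10);
```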

>> This looks like a useful piece of general functionality
>> list_has_intersection(), which would returns boolean instead of the whole
>> intersection. I am not sure whether we should add that function to list.c
>> and
>> use here.
>
>
> Sounds good.
> But for now, I am keeping it as part of this feature itself.

ok

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Rajkumar Raghuwanshi
Date:
On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Let me know if I missed any comment to be fixed.

Hi,

I have applied v8 patches on commit id 8735978e7aebfbc499843630131c18d1f7346c79,
and getting below observation, please take a look.

Observation:
"when joining a foreign partition table with local partition table
getting wrong output
with partition_wise_join enabled, same is working fine on PG-head
without aggregates patch."

Test-case:
CREATE EXTENSION postgres_fdw;
CREATE SERVER pwj_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(dbname 'postgres',port '5432',use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER pwj_server;

CREATE TABLE fplt1 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int,  c text);
CREATE TABLE fplt1_p2 (a int,  c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN
('0000', '0001', '0002', '0003') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p1');
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN
('0004', '0005', '0006', '0007') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p2');
INSERT INTO fplt1_p1 SELECT i, to_char(i%8, 'FM0000') FROM
generate_series(0, 199, 2) i;
INSERT INTO fplt1_p2 SELECT i, to_char(i%8, 'FM0000') FROM
generate_series(200, 398, 2) i;

CREATE TABLE lplt2 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE lplt2_p1 PARTITION OF lplt2 FOR VALUES IN ('0000',
'0001', '0002', '0003');
CREATE TABLE lplt2_p2 PARTITION OF lplt2 FOR VALUES IN ('0004',
'0005', '0006', '0007');
INSERT INTO lplt2 SELECT i, to_char(i%8, 'FM0000') FROM
generate_series(0, 398, 3) i;

SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c, t2.c;
  c   |  c   | count
------+------+-------
 0000 | 0000 |     1
 0004 | 0004 |     1
 0006 | 0006 |     1
(3 rows)

SET enable_partition_wise_join = on;
SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c, t2.c;
  c   |  c   | count
------+------+-------
 0000 | 0000 |     1
 0004 | 0004 |     1
(2 rows)


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
On Tue, Nov 28, 2017 at 12:37 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
> wrote:
> > Let me know if I missed any comment to be fixed.
>
> Hi,
>
> I have applied v8 patches on commit id 8735978e7aebfbc499843630131c18d1f7346c79,
> and getting below observation, please take a look.
>
> Observation:
> "when joining a foreign partition table with local partition table
> getting wrong output
> with partition_wise_join enabled, same is working fine on PG-head
> without aggregates patch."

I have observed the same behavior on the master branch too when the
partition-wise join path is selected, irrespective of this patch-set.
This is happening because the data on the foreign table is not
compliant with the partitioning constraints.

> Test-case:
> CREATE EXTENSION postgres_fdw;
> CREATE SERVER pwj_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
> (dbname 'postgres',port '5432',use_remote_estimate 'true');
> CREATE USER MAPPING FOR PUBLIC SERVER pwj_server;
>
> CREATE TABLE fplt1 (a int, c text) PARTITION BY LIST(c);
> CREATE TABLE fplt1_p1 (a int, c text);
> CREATE TABLE fplt1_p2 (a int, c text);
> CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN
> ('0000', '0001', '0002', '0003') SERVER pwj_server OPTIONS (TABLE_NAME
> 'fplt1_p1');
> CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN
> ('0004', '0005', '0006', '0007') SERVER pwj_server OPTIONS (TABLE_NAME
> 'fplt1_p2');
> INSERT INTO fplt1_p1 SELECT i, to_char(i%8, 'FM0000') FROM
> generate_series(0, 199, 2) i;
> INSERT INTO fplt1_p2 SELECT i, to_char(i%8, 'FM0000') FROM
> generate_series(200, 398, 2) i;
>
> CREATE TABLE lplt2 (a int, c text) PARTITION BY LIST(c);
> CREATE TABLE lplt2_p1 PARTITION OF lplt2 FOR VALUES IN ('0000',
> '0001', '0002', '0003');
> CREATE TABLE lplt2_p2 PARTITION OF lplt2 FOR VALUES IN ('0004',
> '0005', '0006', '0007');
> INSERT INTO lplt2 SELECT i, to_char(i%8, 'FM0000') FROM
> generate_series(0, 398, 3) i;
>
> SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
> and t1.a = t2.a) WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c, t2.c;
>   c   |  c   | count
> ------+------+-------
>  0000 | 0000 |     1
>  0004 | 0004 |     1
>  0006 | 0006 |     1
> (3 rows)
>
> SET enable_partition_wise_join = on;
> SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
> and t1.a = t2.a) WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c, t2.c;
>   c   |  c   | count
> ------+------+-------
>  0000 | 0000 |     1
>  0004 | 0004 |     1
> (2 rows)
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Michael Paquier
Date:
On Tue, Nov 28, 2017 at 5:50 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> [snip]

This is still a hot topic so I am moving it to next CF.
-- 
Michael


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
Continuing with review of 0007.

+
+    /* Copy input rels's relids to grouped rel */
+    grouped_rel->relids = input_rel->relids;

Isn't this done in fetch_upper_rel()? Why do we need it here?
There's also a similar hunk in create_grouping_paths() which doesn't look
appropriate. I guess, you need relids in grouped_rel->relids for FDW. There are
two ways to do this: 1. set grouped_rel->relids for parent upper rel as well,
but then we should pass relids to fetch_upper_rel() instead of setting those
later. 2. For a parent upper rel, in create_foreignscan_plan(), set relids to
all_baserels, if upper_rel->relids is NULL and don't set relids for a parent
upper rel. I am fine with either of those.

+            /* partial phase */
+            get_agg_clause_costs(root, (Node *) partial_target->exprs,
+                                 AGGSPLIT_INITIAL_SERIAL,
+                                 &agg_partial_costs);

IIUC, the costs for evaluating aggregates would not change per child. They
won't be different for parent and any of the children. So, we should be able to
calculate those once, save in "extra" and use repeatedly.

+        if (can_sort)
+        {
+            Path       *path = cheapest_path;
+
+            if (!(pathkeys_contained_in(root->group_pathkeys,
+                                        path->pathkeys)))
[ .. clipped patch .. ]
+                                           NIL,
+                                           dNumGroups));
+        }

We create two kinds of paths: partial paths for parallel query, and partially
aggregated paths when the group keys do not include the partition keys. The
comments and code use "partial" to mean both of these things, which is rather
confusing. Maybe we should use the term "partial aggregation" explicitly
wherever that is what is meant, in comments and in variable names.

I still feel that create_grouping_paths() and create_child_grouping_paths()
have a lot of common code. While I can see that there are some pockets in
create_grouping_paths() which are not required in create_child_grouping_paths()
and vice-versa, may be we should create only one function and move those
pockets under "if (child)" or "if (parent)" kind of conditions. It will be a
maintenance burden to keep these two functions in sync in future. If we do not
keep them in sync, that will introduce bugs.

+
+/*
+ * create_partition_agg_paths
+ *
+ * Creates append path for all the partitions and adds into the grouped rel.

I think you want to write "Creates an append path containing paths from all the
child grouped rels and adds into the given parent grouped rel".

+ * For partial mode we need to add a finalize agg node over append path before
+ * adding a path to grouped rel.
+ */
+void
+create_partition_agg_paths(PlannerInfo *root,
+                           RelOptInfo *grouped_rel,
+                           List *subpaths,
+                           List *partial_subpaths,

Why do we have these two as separate arguments? I don't see any call to
create_partition_agg_paths() through add_append_path() passing both of them
non-NULL simultaneously. Maybe you want to use a single subpaths list and
another boolean indicating whether it's a list of partial paths or regular
paths.

+
+    /* For non-partial path, just create a append path and we are done. */
This is the kind of confusion I am talking about above. Here you have
mentioned "non-partial path", which may mean a regular path, but what you
actually mean by that term is a path representing partial aggregates.

+    /*
+     * Partial partition-wise grouping paths.  Need to create final
+     * aggregation path over append relation.
+     *
+     * If there are partial subpaths, then we need to add gather path before we
+     * append these subpaths.

More confusion here.

+     */
+    if (partial_subpaths)
+    {
+        ListCell   *lc;
+
+        Assert(subpaths == NIL);
+
+        foreach(lc, partial_subpaths)
+        {
+            Path       *path = lfirst(lc);
+            double        total_groups = path->rows * path->parallel_workers;
+
+            /* Create gather paths for partial subpaths */
+            Path *gpath = (Path *) create_gather_path(root, grouped_rel, path,
+                                                      path->pathtarget, NULL,
+                                                      &total_groups);
+
+            subpaths = lappend(subpaths, gpath);

Using the argument variable this way is confusing, and it's why you need two
different List variables. Instead, you could probably use another variable,
local to this function, to hold the gather subpaths.

AFAIU, the Gather paths that this code creates have their parent set to the
parent grouped rel. That's not correct. These partial paths come from children
of the grouped rel, and each Gather produces rows corresponding to one child of
the grouped rel. So each Gather path's parent should be set to the
corresponding child, not the parent grouped rel.

This code creates plans where there are multiple Gather nodes under an Append
node. AFAIU, the workers assigned to one Gather node can be reused until that
Gather node finishes. Having multiple Gather nodes under an Append means that
every worker will be idle from the time it finishes its work till the last
worker finishes. That doesn't seem to be an optimal use of workers. The plan
that we create with a Gather on top of an Append seems better, so we should
avoid creating one Gather node per child plan. Have we tried to compare the
performance of these two plans?
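For illustration, the two plan shapes being compared look roughly like this (a sketch of the shapes only, not actual EXPLAIN output):

```
-- Shape the patch produces (one Gather per child):
Finalize Aggregate
  -> Append
       -> Gather
            -> Partial Aggregate -> Parallel Seq Scan on child1
       -> Gather
            -> Partial Aggregate -> Parallel Seq Scan on child2

-- Preferred shape (a single Gather over the Append):
Finalize Aggregate
  -> Gather
       -> Append
            -> Partial Aggregate -> Parallel Seq Scan on child1
            -> Partial Aggregate -> Parallel Seq Scan on child2
```

The second shape starts one set of workers for the whole Append rather than one set per child.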

+        if (!IsA(apath, MergeAppendPath) && root->group_pathkeys)
+        {
+            spath = (Path *) create_sort_path(root,
+                                              grouped_rel,
+                                              apath,
+                                              root->group_pathkeys,
+                                              -1.0);
+        }

The code here assumes that a MergeAppend path will always have pathkeys
matching group_pathkeys. I believe that's true, but probably we should have an
Assert and a comment to make it clear. If it's not true, we will need to sort
the output of the MergeAppend OR discard MergeAppend paths whose pathkeys do
not match group_pathkeys.


diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b422050..1941468 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2345,6 +2345,7 @@ UnlistenStmt
 UnresolvedTup
 UnresolvedTupData
 UpdateStmt
+UpperPathExtraData
 UpperRelationKind
 UpperUniquePath
 UserAuth

Do we commit this file as part of the feature?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Dec 1, 2017 at 7:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> This code creates plans where there are multiple Gather nodes under an Append
> node.

We should avoid that.  Starting and stopping workers is inefficient,
and precludes things like turning the Append into a Parallel Append.

> AFAIU, the workers assigned to one gather node can be reused until that
> Gather node finishes. Having multiple Gather nodes under an Append mean that
> every worker will be idle from the time that worker finishes the work till the
> last worker finishes the work.

No, workers will exit as soon as they finish.  They don't hang around idle.

> index b422050..1941468 100644
> --- a/src/tools/pgindent/typedefs.list
> +++ b/src/tools/pgindent/typedefs.list
> @@ -2345,6 +2345,7 @@ UnlistenStmt
>  UnresolvedTup
>  UnresolvedTupData
>  UpdateStmt
> +UpperPathExtraData
>  UpperRelationKind
>  UpperUniquePath
>  UserAuth
>
> Do we commit this file as part of the feature?

Andres and I regularly commit such changes; Tom rejects them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Sat, Dec 2, 2017 at 4:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 1, 2017 at 7:41 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> This code creates plans where there are multiple Gather nodes under an Append
>> node.
>
> We should avoid that.  Starting and stopping workers is inefficient,
> and precludes things like turning the Append into a Parallel Append.

Ah, I didn't think about it. Thanks for bringing it up.

>
>> AFAIU, the workers assigned to one gather node can be reused until that
>> Gather node finishes. Having multiple Gather nodes under an Append mean that
>> every worker will be idle from the time that worker finishes the work till the
>> last worker finishes the work.
>
> No, workers will exit as soon as they finish.  They don't hang around idle.

Sorry, I think I used the wrong word, "idle". I meant that if a worker
finishes and exits, the query can't use that worker slot until the
next Gather node starts. But as you pointed out, starting and stopping
a worker costs more than not using the slot. So we should avoid such
plans.

>
>> index b422050..1941468 100644
>> --- a/src/tools/pgindent/typedefs.list
>> +++ b/src/tools/pgindent/typedefs.list
>> @@ -2345,6 +2345,7 @@ UnlistenStmt
>>  UnresolvedTup
>>  UnresolvedTupData
>>  UpdateStmt
>> +UpperPathExtraData
>>  UpperRelationKind
>>  UpperUniquePath
>>  UserAuth
>>
>> Do we commit this file as part of the feature?
>
> Andres and I regularly commit such changes; Tom rejects them.
>

We will leave it to the committer to decide what to do with this hunk.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Partition-wise aggregation/grouping

From
legrand legrand
Date:
Hello,

I'm testing partition-wise aggregation and found a strange result with the V8 patch;
see  init_star_schema_agg.sql
<http://www.postgresql-archive.org/file/t348768/init_star_schema_agg.sql>  

Explain gives  
  ->  Seq Scan on facts_p2  (...) (actual time=0.012..0.012 rows=1 loops=1)
for partitions that are joined with empty partitions

I was expecting a message saying that partition facts_p2 was not accessed

Am I wrong ?
Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Fri, Dec 8, 2017 at 3:08 AM, legrand legrand <legrand_legrand@hotmail.com> wrote:
Hello,

I'm testing Partition wise and found a strange result in V8 patch
see  init_star_schema_agg.sql
<http://www.postgresql-archive.org/file/t348768/init_star_schema_agg.sql>

Explain gives
  ->  Seq Scan on facts_p2  (...) (actual time=0.012..0.012 rows=1 loops=1)
for partitions that are joined with empty partitions

I was expecting a message saying that partition facts_p2 was not accessed

Am I wrong ?

Is this related to partition-wise aggregation, since you say you found this behaviour with the v8 patch?

I have tried your testcase on master and I see a similar plan.

I had a look at the plan and did not see any empty relation at planning time.
 
Regards
PAscal







--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb


Re: Partition-wise aggregation/grouping

From
legrand legrand
Date:
Thank you for the answer
This was a misunderstanding of hash join behaviour on my side.

That means that at least one row is read from the facts_p2 partition even if
the other table's partition in the hash join is empty.

I will remember it now ;o)

Regards
PAscal





Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
Here are review comments for 0009
Only full aggregation is pushed on the remote server.

I think we can live with that for a while, but we need to be able to push down
partial aggregates to the foreign server. I agree that it needs some
infrastructure to serialize and deserialize the partial aggregate values,
support unpartitioned aggregation first, and then work on partitioned
aggregates. That is definitely a separate piece of work.

+-- ===================================================================
+-- test partition-wise-aggregates
+-- ===================================================================
+CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY RANGE(a);
+CREATE TABLE pagg_tab_p1 (a int, b int, c text);
+CREATE TABLE pagg_tab_p2 (a int, b int, c text);
+CREATE TABLE pagg_tab_p3 (a int, b int, c text);

Like the partition-wise join testcases, please use LIKE so that it's easy to
change the table schema if required.
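For instance, the child tables above could copy the parent's column list (a sketch of the suggestion, using the table names from the test):

```sql
-- Copy the parent's column definitions instead of repeating them,
-- so a schema change only has to be made in one place.
CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY RANGE(a);
CREATE TABLE pagg_tab_p1 (LIKE pagg_tab);
CREATE TABLE pagg_tab_p2 (LIKE pagg_tab);
CREATE TABLE pagg_tab_p3 (LIKE pagg_tab);
```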

+INSERT INTO pagg_tab_p1 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 10;
+INSERT INTO pagg_tab_p2 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 20 and (i %
30) >= 10;
+INSERT INTO pagg_tab_p3 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 30 and (i %
30) >= 20;

We have to do this because INSERT tuple routing to a foreign partition is not
supported right now. Somebody has to remember to change this to a single
statement once that's done.
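Once tuple routing to foreign partitions is supported, the three per-partition INSERTs above could presumably collapse into one statement routed through the parent (a hypothetical form, pending that support):

```sql
-- Single-statement form, routed through the partitioned parent
INSERT INTO pagg_tab
SELECT i % 30, i % 50, to_char(i / 30, 'FM0000')
  FROM generate_series(1, 3000) i;
```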

+ANALYZE fpagg_tab_p1;
+ANALYZE fpagg_tab_p2;
+ANALYZE fpagg_tab_p3;

I thought this was not needed: when you ANALYZE the partitioned table, it would
analyze the partitions as well. But I see that partition-wise join is also
ANALYZING the foreign partitions separately. When I ran ANALYZE on a
partitioned table with foreign partitions, statistics for only the local tables
(partitioned and partitions) were updated. Of course this is separate work, but
it probably needs to be fixed.

+-- When GROUP BY clause matches with PARTITION KEY.
+-- Plan when partition-wise-agg is disabled

s/when/with/

+-- Plan when partition-wise-agg is enabled

s/when/with/

+   ->  Append

Just like a ForeignScan node's Relations tell what kind of ForeignScan this is,
maybe we should annotate Append to tell whether the children are joins,
aggregates or relation scans. That might be helpful. Of course, as another
patch.

+SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING
avg(b) < 25 ORDER BY 1;
+ a  | sum  | min | count
+----+------+-----+-------
+  0 | 2000 |   0 |   100
+  1 | 2100 |   1 |   100
[ ... clipped ...]
+ 23 | 2300 |   3 |   100
+ 24 | 2400 |   4 |   100
+(15 rows)

Maybe we want to reduce the number of rows to a few by using a stricter HAVING
clause?

+
+-- When GROUP BY clause not matches with PARTITION KEY.

... clause does not match ...

+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+
QUERY PLAN

+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Sort
+   Output: fpagg_tab_p1.b, (avg(fpagg_tab_p1.a)),
(max(fpagg_tab_p1.a)), (count(*))
+               ->  Partial HashAggregate
[ ... clipped ... ]
+                     Output: fpagg_tab_p3.b, PARTIAL
avg(fpagg_tab_p3.a), PARTIAL max(fpagg_tab_p3.a), PARTIAL count(*),
PARTIAL sum(fpagg_tab_p3.a)
+                     Group Key: fpagg_tab_p3.b
+                     ->  Foreign Scan on public.fpagg_tab_p3
+                           Output: fpagg_tab_p3.b, fpagg_tab_p3.a
+                           Remote SQL: SELECT a, b FROM public.pagg_tab_p3
+(26 rows)

I think we are interested in the overall shape of the plan and not the details
of the Remote SQL etc. So maybe turn off VERBOSE. This comment also applies to
the earlier plan with enable_partition_wise_agg = false;

+
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+ b  |         avg         | max | count
+----+---------------------+-----+-------
+  0 | 10.0000000000000000 |  20 |    60
+  1 | 11.0000000000000000 |  21 |    60
[... clipped ...]
+ 42 | 12.0000000000000000 |  22 |    60
+ 43 | 13.0000000000000000 |  23 |    60
+(20 rows)

Since the aggregates were not pushed down, I doubt we should be testing the
output. But this test is good for checking partial aggregates over foreign
partition scans, which I think we don't have in postgres_fdw.sql. So maybe
add it as a separate patch?

Can you please add a test with a whole-row reference; that usually causes
trouble.

-    if (root->hasHavingQual && query->havingQual)
+    if (root->hasHavingQual && fpinfo->havingQual)

This is not exactly a problem with your patch, but why do we need to check both
the boolean and the actual clause? If the boolean is true, query->havingQual
should be non-NULL, and NULL otherwise.

     /* Grouping information */
     List       *grouped_tlist;
+    PathTarget *grouped_target;
+    Node       *havingQual;

I think we don't need havingQual as a separate member. foreign_grouping_ok()
separates the clauses in havingQual into shippable and non-shippable clauses
and saves them in local_conditions and remote_conditions. Probably we want to
use those instead of adding a new member.

index 04e43cc..c8999f6 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -62,7 +62,8 @@ typedef void (*GetForeignJoinPaths_function)
(PlannerInfo *root,
 typedef void (*GetForeignUpperPaths_function) (PlannerInfo *root,
                                                UpperRelationKind stage,
                                                RelOptInfo *input_rel,
-                                               RelOptInfo *output_rel);
+                                               RelOptInfo *output_rel,
+                                               UpperPathExtraData *extra);

Probably this comment belongs to 0007, but it's in this patch that it becomes
clear how invasive the UpperPathExtraData changes are. While UpperPathExtraData
has "upper paths" in its name, all of its members are related to grouping.
That's fine, since we only support partition-wise aggregate and not the other
upper operations. But if we were to do that in future, which of these members
would be applicable to other upper relations? inputRows, pathTarget and
partialPathTarget may be applicable to other upper rels as well; can_sort and
can_hash may be applicable to DISTINCT and SORT relations; isPartial and
havingQual will be applicable only to grouping/aggregation. So maybe it's ok,
and like RelOptInfo we may separate them by comments.

Another problem with that structure is that its name doesn't indicate that the
structure is used only for child upper relations, whereas the code assumes that
if extra is not present it's a parent upper relation. Maybe we want to rename
it to that effect, or always use it, whether for a parent or a child relation.

We may want to rename pathTarget and partialPathTarget to relTarget and
partialRelTarget, since those targets are not specific to any path but apply to
all the paths created for that rel.

On Mon, Dec 4, 2017 at 7:44 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Dec 2, 2017 at 4:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Dec 1, 2017 at 7:41 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> This code creates plans where there are multiple Gather nodes under an Append
>>> node.
>>
>> We should avoid that.  Starting and stopping workers is inefficient,
>> and precludes things like turning the Append into a Parallel Append.
>
> Ah, I didn't think about it. Thanks for bringing it up.
>
>>
>>> AFAIU, the workers assigned to one gather node can be reused until that
>>> Gather node finishes. Having multiple Gather nodes under an Append mean that
>>> every worker will be idle from the time that worker finishes the work till the
>>> last worker finishes the work.
>>
>> No, workers will exit as soon as they finish.  They don't hang around idle.
>
> Sorry, I think I used wrong word "idle". I meant that if a worker
> finishes and exists, the query can't use it that worker slot until the
> next Gather node starts. But as you pointed out, starting and stopping
> a worker is costlier than the cost of not using the slot. So we should
> avoid such plans.
>
>>
>>> index b422050..1941468 100644
>>> --- a/src/tools/pgindent/typedefs.list
>>> +++ b/src/tools/pgindent/typedefs.list
>>> @@ -2345,6 +2345,7 @@ UnlistenStmt
>>>  UnresolvedTup
>>>  UnresolvedTupData
>>>  UpdateStmt
>>> +UpperPathExtraData
>>>  UpperRelationKind
>>>  UpperUniquePath
>>>  UserAuth
>>>
>>> Do we commit this file as part of the feature?
>>
>> Andres and I regularly commit such changes; Tom rejects them.
>>
>
> We will leave it to the committer to decide what to do with this hunk.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Dec 12, 2017 at 3:43 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Here are review comments for 0009

Thank you, Ashutosh for the detailed review so far.

I am working on your reviews, but since parallel Append is now committed,
I need to rebase my changes over it and resolve the conflicts too.

Once done, I will submit the new patch-set fixing these and the earlier review
comments.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Dec 13, 2017 at 6:37 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Tue, Dec 12, 2017 at 3:43 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> Here are review comments for 0009
>
>
> Thank you, Ashutosh for the detailed review so far.
>
> I am working on your reviews but since parallel Append is now committed,
> I need to re-base my changes over it and need to resolve the conflicts too.
>
> Once done, I will submit the new patch-set fixing these and earlier review
> comments.
>

Sure, no problem. Take your time. Here's a set of comments for 0008. That
ends the first read of all the patches (second reading for the core
changes).

+-- Also, disable parallel paths.
+SET max_parallel_workers_per_gather TO 0;

With parallel aggregation enabled, partition-wise aggregation paths won't be
chosen for smaller data sets. I think this is why you are disabling parallel
query, but we should probably explain that in a comment. It would be better if
we could come up with testcases that don't require disabling parallel query.
Since parallel append is now committed, maybe it can help.

+
+-- Check with multiple columns in GROUP BY, order in target-list is reversed
+EXPLAIN (COSTS OFF)
+SELECT c, a, count(*) FROM pagg_tab GROUP BY a, c;
+                   QUERY PLAN
+-------------------------------------------------
+ Append
+   ->  HashAggregate
+         Group Key: pagg_tab_p1.a, pagg_tab_p1.c
+         ->  Seq Scan on pagg_tab_p1
[ ... clipped ... ]
+(10 rows)

Why do we need this testcase?

+
+-- Test when input relation for grouping is dummy
+EXPLAIN (COSTS OFF)
+SELECT c, sum(a) FROM pagg_tab WHERE 1 = 2 GROUP BY c;
+           QUERY PLAN
+--------------------------------
+ HashAggregate
+   Group Key: pagg_tab.c
+   ->  Result
+         One-Time Filter: false
+(4 rows)

This is not part of your patch, but I am wondering if we can further optimize
this plan by converting the HashAggregate to a Result (One-Time Filter: false)
with the aggregate target. Just an idea.

+
+SELECT c, sum(a) FROM pagg_tab WHERE 1 = 2 GROUP BY c;
+ c | sum
+---+-----
+(0 rows)

I think we also need a case where the child input relations are marked dummy
and then the parent is marked dummy. Just use a condition with partkey = <none
of list bounds>.
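A sketch of such a case, assuming pagg_tab is partitioned on c and 'XXXX' matches none of the partition bounds (both the condition and the bound value are illustrative):

```sql
-- Every child is excluded by partition pruning, so each child grouped
-- rel is dummy, and the parent grouped rel should be marked dummy too.
EXPLAIN (COSTS OFF)
SELECT c, sum(a) FROM pagg_tab WHERE c = 'XXXX' GROUP BY c;
```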

+
+-- Check with SORTED paths. Disable hashagg to get group aggregate

Suggest: "Test GroupAggregate paths by disabling hash aggregates."

+-- When GROUP BY clause matches with PARTITION KEY.

I don't think we need "with"; just extend the same sentence with "complete
aggregation is performed for each partition".

+-- Should choose full partition-wise aggregation path

suggest: "Should choose full partition-wise GroupAggregate plan", but I guess
with the above suggestion, this sentence is not needed.

+
+-- When GROUP BY clause not matches with PARTITION KEY.
+-- Should choose partial partition-wise aggregation path

Similar suggestions as above.

+-- No aggregates, but still able to perform partition-wise aggregates

That's a funny construction. Maybe "Test partition-wise grouping without any
aggregates".

We should also include some output for this query.

+
+EXPLAIN (COSTS OFF)
+SELECT a FROM pagg_tab GROUP BY a ORDER BY 1;
+                   QUERY PLAN
+-------------------------------------------------
+ Group
+   Group Key: pagg_tab_p1.a
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.a
+         ->  Group
+               Group Key: pagg_tab_p1.a
+               ->  Sort
+                     Sort Key: pagg_tab_p1.a
+                     ->  Seq Scan on pagg_tab_p1
[ ... clipped ... ]
+(19 rows)

It's strange that we do not annotate the partial grouping as Partial. It does
not look like a bug in your patch. Do we get similar output with parallel
grouping?

+
+-- ORDERED SET within aggregate
+EXPLAIN (COSTS OFF)
+SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
+   ->  GroupAggregate
+         Group Key: pagg_tab_p1.a
+         ->  Sort
+               Sort Key: pagg_tab_p1.a
+               ->  Append
+                     ->  Seq Scan on pagg_tab_p1
+                     ->  Seq Scan on pagg_tab_p2
+                     ->  Seq Scan on pagg_tab_p3
+(10 rows)

pagg_tab is partitioned by column c. So not having it in the GROUP BY might
itself produce this plan, if partial parallel aggregation is expensive. When
writing negative tests like this, the GROUP BY should always include the
partition key.

In case of full aggregation, since all the rows that belong to the same group
come from the same partition, having an ORDER BY doesn't make any difference.
We should support such a case.

+INSERT INTO pagg_tab1 SELECT i%30, i%20 FROM generate_series(0, 299, 2) i;
+INSERT INTO pagg_tab2 SELECT i%20, i%30 FROM generate_series(0, 299, 3) i;

spaces around % operator?

+-- When GROUP BY clause matches with PARTITION KEY.
+-- Should choose full partition-wise aggregation path.

Probably we should just club the single-table and join cases under one set of
comments rather than repeating them. Create the tables once at the beginning
of the test file and group the queries together under one comment head.

+-- Disable mergejoin to get hash aggregate.
+SET enable_mergejoin TO false;

Why? We have tested that once.

+
+-- When GROUP BY clause not matches with PARTITION KEY.
+-- Should choose partial partition-wise aggregation path.
+-- Also check with SORTED paths. Disable hashagg to get group aggregate.
+SET enable_hashagg TO false;

Same as above. Two of those clubbed together will produce one hash and one
group plan. That will cover it.

+-- Check with LEFT/RIGHT/FULL OUTER JOINs which produces NULL values for
+-- aggregation
+-- LEFT JOIN, should produce partial partition-wise aggregation plan as
+-- GROUP BY is on nullable column
+EXPLAIN (COSTS OFF)
+SELECT b.y, sum(a.y) FROM pagg_tab1 a LEFT JOIN pagg_tab2 b ON a.x =
b.y GROUP BY 1 ORDER BY 1 NULLS LAST;

Maybe you should explicitly use GROUP BY b.y in all of these queries.
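That is, spelling out the grouping column rather than using the ordinal (a sketch based on the query above):

```sql
-- Explicit GROUP BY on the nullable join column instead of "GROUP BY 1"
SELECT b.y, sum(a.y)
  FROM pagg_tab1 a LEFT JOIN pagg_tab2 b ON a.x = b.y
 GROUP BY b.y
 ORDER BY b.y NULLS LAST;
```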

+-- FULL JOIN, should produce partial partition-wise aggregation plan as
+-- GROUP BY is on nullable column

In case of a FULL JOIN, the partition keys from both joining relations land on
a nullable side; there is no key on a non-nullable side, so an aggregation on
top of a FULL JOIN will always be a partial partition-wise aggregation.

+
+-- Empty relations on LEFT side, no partition-wise agg plan.

Suggest: "Empty join relation because of empty outer side."  I don't think we
are writing a negative test to check that a partition-wise agg plan is not
chosen; we are testing the case where the join relation is empty.

+
+EXPLAIN (COSTS OFF)
+SELECT a, c, sum(b), avg(c), count(*) FROM pagg_tab GROUP BY c, a,
(a+b)/2 HAVING sum(b) = 50 AND avg(c) > 25 ORDER BY 1, 2, 3;

Keep this or the previous one; having both is overkill. I will vote for this
one, but it's up to you.

Maybe add a testcase with the partition keys themselves switched; output just
the plan.

+-- Test with multi-level partitioning scheme
+-- Partition-wise aggregation is tried only on first level.
[ ... clipped ... ]
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY

This seems to contradict the previous comment. Maybe club them together
and say "Partition-wise aggregation with full aggregation only at the first
level", and move that whole comment down.

+
+-- Partial aggregation as GROUP BY clause does not match with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT b, sum(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1, 2, 3;
+                           QUERY PLAN
+----------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.b, (sum(pagg_tab_p1.a)), (count(*))
+   ->  Finalize GroupAggregate
+         Group Key: pagg_tab_p1.b
+         ->  Sort
+               Sort Key: pagg_tab_p1.b
+               ->  Append
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p1.b
+                           ->  Seq Scan on pagg_tab_p1
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p2_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p2_s1
+                                 ->  Seq Scan on pagg_tab_p2_s2
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p3_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p3_s1
+                                 ->  Seq Scan on pagg_tab_p3_s2
+(20 rows)

Why aren't we seeing partial aggregation paths for the level-two and lower
partitions?

+
+-- Test on middle level partitioned table which is further partitioned on b.
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT b, sum(a), count(*) FROM pagg_tab_p3 GROUP BY b ORDER BY 1, 2, 3;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p3_s1.b, (sum(pagg_tab_p3_s1.a)), (count(*))
+   ->  Append
+         ->  HashAggregate
+               Group Key: pagg_tab_p3_s1.b
+               ->  Seq Scan on pagg_tab_p3_s1
+         ->  HashAggregate
+               Group Key: pagg_tab_p3_s2.b
+               ->  Seq Scan on pagg_tab_p3_s2
+(9 rows)
+
+SELECT b, sum(a), count(*) FROM pagg_tab_p3 GROUP BY b ORDER BY 1, 2, 3;
+ b | sum  | count
+---+------+-------
+ 0 | 2000 |   100
+ 1 | 2100 |   100
+ 2 | 2200 |   100
+ 3 | 2300 |   100
+ 4 | 2400 |   100
+ 5 | 2500 |   100
+ 6 | 2600 |   100
+ 7 | 2700 |   100
+ 8 | 2800 |   100
+ 9 | 2900 |   100
+(10 rows)

We should just remove this case; it's the same as testing top-level partitioned
tables.

+
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT a, sum(b), array_agg(distinct c), count(*) FROM pagg_tab GROUP
BY a, b HAVING avg(b) < 3 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b)), (array_agg(DISTINCT
pagg_tab_p1.c))
+   ->  Append
+         ->  GroupAggregate
+               Group Key: pagg_tab_p1.a, pagg_tab_p1.b
+               Filter: (avg(pagg_tab_p1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p1.a, pagg_tab_p1.b
+                     ->  Seq Scan on pagg_tab_p1
+         ->  GroupAggregate
+               Group Key: pagg_tab_p2_s1.a, pagg_tab_p2_s1.b
+               Filter: (avg(pagg_tab_p2_s1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p2_s1.a, pagg_tab_p2_s1.b
+                     ->  Append
+                           ->  Seq Scan on pagg_tab_p2_s1
+                           ->  Seq Scan on pagg_tab_p2_s2
+         ->  GroupAggregate
+               Group Key: pagg_tab_p3_s1.a, pagg_tab_p3_s1.b
+               Filter: (avg(pagg_tab_p3_s1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p3_s1.a, pagg_tab_p3_s1.b
+                     ->  Append
+                           ->  Seq Scan on pagg_tab_p3_s1
+                           ->  Seq Scan on pagg_tab_p3_s2
+(25 rows)

Instead of an Append node appearing under the GroupAggregate, I think we should
flatten all the partition scans for the subpartitions whose partition keys are
part of the group keys, and add a GroupAggregate on top of each such partition
scan.

+-- Parallelism within partition-wise aggregates
+RESET max_parallel_workers_per_gather;
+SET min_parallel_table_scan_size TO '8kB';
+SET parallel_setup_cost TO 0;
+INSERT INTO pagg_tab_para SELECT i%30, i%20 FROM generate_series(0, 29999) i;

spaces around % operator?

+SHOW max_parallel_workers_per_gather;
+ max_parallel_workers_per_gather
+---------------------------------
+ 2

Why do we need this?

+
+-- When GROUP BY clause matches with PARTITION KEY.
+EXPLAIN (COSTS OFF)
+SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x
HAVING avg(y) < 7 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)),
(avg(pagg_tab_para_p1.y))
+   ->  Append
[ ... clipped ...]
+         ->  Finalize GroupAggregate
+               Group Key: pagg_tab_para_p3.x
+               Filter: (avg(pagg_tab_para_p3.y) < '7'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_para_p3.x
+                     ->  Gather
+                           Workers Planned: 2
+                           ->  Partial HashAggregate
+                                 Group Key: pagg_tab_para_p3.x
+                                 ->  Parallel Seq Scan on pagg_tab_para_p3
[ ... clipped ... ]
+-- When GROUP BY clause not matches with PARTITION KEY.
+EXPLAIN (COSTS OFF)
+SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y
HAVING avg(x) < 12 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_para_p1.y, (sum(pagg_tab_para_p1.x)),
(avg(pagg_tab_para_p1.x))
+   ->  Finalize HashAggregate
+         Group Key: pagg_tab_para_p1.y
[ ... clipped ... ]
+               ->  Gather
+                     Workers Planned: 2
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_p3.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p3

Per a prior discussion on this thread, we shouldn't produce such plans; use a
Parallel Append instead?

+SET enable_partition_wise_agg to true;

Maybe just enable it at the beginning instead of enabling and disabling it twice?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Dec 14, 2017 at 4:01 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> +
> +EXPLAIN (COSTS OFF)
> +SELECT a FROM pagg_tab GROUP BY a ORDER BY 1;
> +                   QUERY PLAN
> +-------------------------------------------------
> + Group
> +   Group Key: pagg_tab_p1.a
> +   ->  Merge Append
> +         Sort Key: pagg_tab_p1.a
> +         ->  Group
> +               Group Key: pagg_tab_p1.a
> +               ->  Sort
> +                     Sort Key: pagg_tab_p1.a
> +                     ->  Seq Scan on pagg_tab_p1
> [ ... clipped ... ]
> +(19 rows)
>
> It's strange that we do not annotate partial grouping as Partial. Does not look
> like a bug in your patch. Do we get similar output with parallel grouping?

I am wrong here. It's not partial grouping; it's two-level grouping. I
think annotating the Group node as Partial would be misleading. Sorry.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Dec 14, 2017 at 4:01 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> +
> +-- Test when input relation for grouping is dummy
> +EXPLAIN (COSTS OFF)
> +SELECT c, sum(a) FROM pagg_tab WHERE 1 = 2 GROUP BY c;
> +           QUERY PLAN
> +--------------------------------
> + HashAggregate
> +   Group Key: pagg_tab.c
> +   ->  Result
> +         One-Time Filter: false
> +(4 rows)
>
> This is not part of your patch, but I wonder whether we can further optimize
> this plan by converting the HashAggregate into a Result (One-Time Filter:
> false) projecting the aggregate target. Just an idea.
>
This comment is also wrong. The finalization step of aggregates needs
to be executed irrespective of whether or not the underlying scan
produces any rows. It may, for example, add a constant value to the
transition result. We could apply this optimization only when none of
the aggregates has a finalization function, but it may not be worth the
code.
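To see why the finalize step matters even for an empty scan, here is a toy model of the transition/finalize split (plain Python; the function names are purely illustrative, not the executor's API):

```python
# Toy model of the aggregate transition/finalize split; names are
# illustrative, not PostgreSQL executor APIs.

def avg_trans(state, value):
    s, n = state
    return (s + value, n + 1)

def avg_final(state):
    s, n = state
    return None if n == 0 else s / n    # SQL avg() yields NULL on no input

def count_trans(state, value):
    return state + 1

def count_final(state):
    return state                        # count(*) yields 0, not NULL

def run_agg(rows, init, trans, final):
    state = init
    for row in rows:
        state = trans(state, row)
    return final(state)                 # finalize runs even for zero rows

# With an empty scan, the finalize step still determines the result:
empty_avg = run_agg([], (0, 0), avg_trans, avg_final)     # None (NULL)
empty_count = run_agg([], 0, count_trans, count_final)    # 0
```

Replacing the whole Aggregate with a bare Result would skip final(), which is why such an optimization would be safe only for aggregates without finalization functions.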

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:

Attached is a patchset with fixes for all the review comments reported so far.

On Fri, Dec 1, 2017 at 6:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Continuing with review of 0007.

+
+    /* Copy input rels's relids to grouped rel */
+    grouped_rel->relids = input_rel->relids;

Isn't this done in fetch_upper_rel()? Why do we need it here?
There's also a similar hunk in create_grouping_paths() which doesn't look
appropriate. I guess, you need relids in grouped_rel->relids for FDW. There are
two ways to do this: 1. set grouped_rel->relids for parent upper rel as well,
but then we should pass relids to fetch_upper_rel() instead of setting those
later. 2. For a parent upper rel, in create_foreignscan_plan(), set relids to
all_baserels, if upper_rel->relids is NULL and don't set relids for a parent
upper rel. I am fine with either of those.

Done. Opted for the second option.
 

+            /* partial phase */
+            get_agg_clause_costs(root, (Node *) partial_target->exprs,
+                                 AGGSPLIT_INITIAL_SERIAL,
+                                 &agg_partial_costs);

IIUC, the costs for evaluating aggregates would not change per child; they
will be the same for the parent and all of the children. So we should be able
to calculate them once, save them in "extra", and use them repeatedly.

Yep. Done.
 

+        if (can_sort)
+        {
+            Path       *path = cheapest_path;
+
+            if (!(pathkeys_contained_in(root->group_pathkeys,
+                                        path->pathkeys)))
[ .. clipped patch .. ]
+                                           NIL,
+                                           dNumGroups));
+        }

We create two kinds of partial paths: partial paths for parallel query, and
partial aggregation paths when the group keys do not contain the partition
keys. The comments and code use "partial" to mean both things, which is rather
confusing. Maybe we should use the term "partial aggregation" explicitly
wherever that is what is meant, in both comments and variable names.

Agreed. Used "partial aggregation" wherever applicable. Let me know if you see any other place needing this adjustment.


I still feel that create_grouping_paths() and create_child_grouping_paths()
have a lot of common code. While I can see that there are some pockets in
create_grouping_paths() which are not required in create_child_grouping_paths()
and vice-versa, may be we should create only one function and move those
pockets under "if (child)" or "if (parent)" kind of conditions. It will be a
maintenance burden to keep these two functions in sync in future. If we do not
keep them in sync, that will introduce bugs.

Agreed that keeping these two functions in sync in the future will be a maintenance burden, but I am not yet sure how to refactor them cleanly.
Will give it one more try and update those changes in the next patchset.


+
+/*
+ * create_partition_agg_paths
+ *
+ * Creates append path for all the partitions and adds into the grouped rel.

I think you want to write "Creates an append path containing paths from all the
child grouped rels and adds into the given parent grouped rel".

Reworded as you said.
 

+ * For partial mode we need to add a finalize agg node over append path before
+ * adding a path to grouped rel.
+ */
+void
+create_partition_agg_paths(PlannerInfo *root,
+                           RelOptInfo *grouped_rel,
+                           List *subpaths,
+                           List *partial_subpaths,

Why do we have these two as separate arguments? I don't see any call to
create_partition_agg_paths() through add_append_path() passing both of them
non-NULL simultaneously. Maybe you want to use a single subpaths list and
another boolean indicating whether it is a list of partial paths or regular
paths.


After redesigning the area that puts Gather over Append, I don't need to pass all the Append subpaths to this function at all; the Append is built by add_paths_to_append_rel() itself. This function now just adds finalization steps as needed.
So we don't have two lists now, and to indicate partial paths I passed a boolean instead. Please have a look and let me know if I missed anything.
 
+
+    /* For non-partial path, just create a append path and we are done. */
This is the kind of confusion I am talking about above. Here you have
written "non-partial path", which may mean a regular path, but what you
actually mean by that term is a path representing partial aggregates.

+    /*
+     * Partial partition-wise grouping paths.  Need to create final
+     * aggregation path over append relation.
+     *
+     * If there are partial subpaths, then we need to add gather path before we
+     * append these subpaths.

More confusion here.

Hopefully no more confusion in this new version.


+     */
+    if (partial_subpaths)
+    {
+        ListCell   *lc;
+
+        Assert(subpaths == NIL);
+
+        foreach(lc, partial_subpaths)
+        {
+            Path       *path = lfirst(lc);
+            double        total_groups = path->rows * path->parallel_workers;
+
+            /* Create gather paths for partial subpaths */
+            Path *gpath = (Path *) create_gather_path(root, grouped_rel, path,
+                                                      path->pathtarget, NULL,
+                                                      &total_groups);
+
+            subpaths = lappend(subpaths, gpath);

Using the argument variable is confusing and that's why you need two different
List variables. Instead probably you could have another variable local to this
function to hold the gather subpaths.

Done.
 

AFAIU, the Gather paths that this code creates have their parent set to the
parent grouped rel. That's not correct. These partial paths come from the
children of the grouped rel, and each Gather produces rows corresponding to
one child of the grouped rel. So a Gather path's parent should be set to the
corresponding child, not the parent grouped rel.

Yep.
 

This code creates plans where there are multiple Gather nodes under an Append
node. AFAIU, the workers assigned to one Gather node cannot be reused until
that Gather node finishes. Having multiple Gather nodes under an Append means
that every worker will be idle from the time it finishes its own work until
the last worker finishes. That doesn't seem to be an optimal use of workers.
The plan we create with Gather on top of Append seems better, so we should
avoid creating one Gather node per child plan. Have we compared the
performance of these two plans?

Agreed. Having Gather on top of the Append is better; done that way. It resolves your previous comment too.
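The worker-utilization argument can be sketched with a toy scheduling model (hypothetical durations, nothing to do with the actual executor): with one Gather per child, workers are confined to one partition's work at a time, while Gather over Append lets them pull work from any partition.

```python
# Toy model: chunk durations per partition, greedily scheduled on workers.
def makespan(tasks, nworkers):
    loads = [0] * nworkers
    for t in sorted(tasks, reverse=True):
        i = loads.index(min(loads))      # give chunk to least-loaded worker
        loads[i] += t
    return max(loads)

def append_of_gathers(partitions, nworkers):
    # One Gather per child: partitions processed one after another,
    # workers idle once their share of the current partition is done.
    return sum(makespan(p, nworkers) for p in partitions)

def gather_over_append(partitions, nworkers):
    # Gather over (Parallel) Append: one shared pool of chunks.
    return makespan([t for p in partitions for t in p], nworkers)

partitions = [[3], [3], [2]]             # a single chunk per partition
print(append_of_gathers(partitions, 2))  # 8: second worker always idle
print(gather_over_append(partitions, 2)) # 5: both workers stay busy
```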


+        if (!IsA(apath, MergeAppendPath) && root->group_pathkeys)
+        {
+            spath = (Path *) create_sort_path(root,
+                                              grouped_rel,
+                                              apath,
+                                              root->group_pathkeys,
+                                              -1.0);
+        }

The code here assumes that a MergeAppend path will always have pathkeys
matching group_pathkeys. I believe that's true but probably we should have an
Assert to make it clear and add comments. If that's not true, we will need to
sort the output of MergeAppend OR discard MergeAppend paths which do not have
pathkeys matching group_pathkeys.

Oops. Thanks for pointing that out. You are correct.
Added a check that verifies the required pathkeys are present.
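The shape of that check can be sketched as a prefix test, in the spirit of pathkeys_contained_in() (a simplified sketch; strings stand in for the canonical PathKey nodes the planner actually compares):

```python
# Simplified model of PostgreSQL's pathkeys_contained_in(): real pathkeys
# are canonical PathKey nodes compared by identity; strings stand in here.
def pathkeys_contained_in(keys1, keys2):
    # keys1 is satisfied if it forms a leading prefix of keys2
    return keys2[:len(keys1)] == keys1

def needs_explicit_sort(group_pathkeys, path_pathkeys):
    return not pathkeys_contained_in(group_pathkeys, path_pathkeys)

# A MergeAppend sorted on (a, b) satisfies grouping on (a) ...
assert not needs_explicit_sort(["a"], ["a", "b"])
# ... but a MergeAppend sorted on (b) does not satisfy grouping on (a):
assert needs_explicit_sort(["a"], ["b"])
```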
 


diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b422050..1941468 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2345,6 +2345,7 @@ UnlistenStmt
 UnresolvedTup
 UnresolvedTupData
 UpdateStmt
+UpperPathExtraData
 UpperRelationKind
 UpperUniquePath
 UserAuth

Do we commit this file as part of the feature?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

This patchset contains fixes for other review comments too.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Dec 12, 2017 at 3:43 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Here are review comments for 0009
Only full aggregation is pushed to the remote server.

I think we can live with that for a while, but we need to be able to push down
partial aggregates to the foreign server. I agree that it needs some
infrastructure to serialize and deserialize the partial aggregate values,
supporting unpartitioned aggregation first and then partitioned
aggregates. That is definitely a separate piece of work.
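For example, avg()'s partial state is essentially a (sum, count) pair; the missing infrastructure has to ship such states back from the foreign server and combine them locally. A rough sketch of the idea (illustrative only, not postgres_fdw code, with JSON standing in for a real serialization format):

```python
import json

# Illustrative partial-aggregate state for avg(): (sum, count).
def partial_avg(values):
    return (sum(values), len(values))

def serialize(state):
    return json.dumps(state)           # stands in for a serial function

def deserialize(text):
    s, n = json.loads(text)
    return (s, n)

def combine(a, b):                     # combine function for avg states
    return (a[0] + b[0], a[1] + b[1])

def finalize(state):
    s, n = state
    return None if n == 0 else s / n

# Each "foreign server" computes and serializes its partial state:
shipped = [serialize(partial_avg(part)) for part in ([1, 2, 3], [4, 5], [6])]

# The local server deserializes, combines, and finalizes:
total = (0, 0)
for text in shipped:
    total = combine(total, deserialize(text))
print(finalize(total))  # 3.5 == avg of 1..6
```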

Yep.
 

+CREATE TABLE pagg_tab_p3 (a int, b int, c text);

As in the partition-wise join testcases, please use LIKE so that it's easy to
change the table schema if required.

Done.
 

+INSERT INTO pagg_tab_p1 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 10;
+INSERT INTO pagg_tab_p2 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 20 and (i %
30) >= 10;
+INSERT INTO pagg_tab_p3 SELECT i % 30, i % 50, to_char(i/30,
'FM0000') FROM generate_series(1, 3000) i WHERE (i % 30) < 30 and (i %
30) >= 20;

We have to do this because INSERT tuple routing to a foreign partition is not
supported right now.

Yes.
 
Somebody has to remember to change this to a single
statement once that's done.

I don't know how we can keep track of it.
 

+ANALYZE fpagg_tab_p1;
+ANALYZE fpagg_tab_p2;
+ANALYZE fpagg_tab_p3;

I thought this was not needed: when you ANALYZE the partitioned table, it would
analyze the partitions as well. But I see that partition-wise join is also
ANALYZING the foreign partitions separately. When I ran ANALYZE on a
partitioned table with foreign partitions, statistics for only the local tables
(partitioned table and partitions) were updated. Of course this is separate
work, but it probably needs to be fixed.

Hmm.
 

+-- When GROUP BY clause matches with PARTITION KEY.
+-- Plan when partition-wise-agg is disabled

s/when/with/

+-- Plan when partition-wise-agg is enabled

s/when/with/

Done.
 

+   ->  Append

Just as a ForeignScan node's Relations field tells what kind of ForeignScan it
is, maybe we should annotate Append to tell whether the children are joins,
aggregates, or relation scans. That might be helpful. Of course, as another
patch.

OK.
 

+SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING
avg(b) < 25 ORDER BY 1;
+ a  | sum  | min | count
+----+------+-----+-------
+  0 | 2000 |   0 |   100
+  1 | 2100 |   1 |   100
[ ... clipped ...]
+ 23 | 2300 |   3 |   100
+ 24 | 2400 |   4 |   100
+(15 rows)

Maybe we want to reduce the number of rows to a few by using a stricter HAVING
clause?

Done.
 

+
+-- When GROUP BY clause not matches with PARTITION KEY.

... clause does not match ...

Done.
 

+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+
QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Sort
+   Output: fpagg_tab_p1.b, (avg(fpagg_tab_p1.a)),
(max(fpagg_tab_p1.a)), (count(*))
+               ->  Partial HashAggregate
[ ... clipped ... ]
+                     Output: fpagg_tab_p3.b, PARTIAL
avg(fpagg_tab_p3.a), PARTIAL max(fpagg_tab_p3.a), PARTIAL count(*),
PARTIAL sum(fpagg_tab_p3.a)
+                     Group Key: fpagg_tab_p3.b
+                     ->  Foreign Scan on public.fpagg_tab_p3
+                           Output: fpagg_tab_p3.b, fpagg_tab_p3.a
+                           Remote SQL: SELECT a, b FROM public.pagg_tab_p3
+(26 rows)

I think we are interested in the overall shape of the plan, not the details of
the Remote SQL etc., so maybe turn off VERBOSE. This comment also applies to an
earlier plan with enable_partition_wise_agg = false.

OK. Removed VERBOSE from all the queries, as we are interested in the overall shape of the plans.
 

+
+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING
sum(a) < 800 ORDER BY 1;
+ b  |         avg         | max | count
+----+---------------------+-----+-------
+  0 | 10.0000000000000000 |  20 |    60
+  1 | 11.0000000000000000 |  21 |    60
[... clipped ...]
+ 42 | 12.0000000000000000 |  22 |    60
+ 43 | 13.0000000000000000 |  23 |    60
+(20 rows)

Since the aggregates were not pushed down, I doubt we should be testing the
output. But this test is good for checking partial aggregates over foreign
partition scans, which I think we don't have in postgres_fdw.sql. So maybe
add it as a separate patch?

Agree. Removed SELECT query. EXPLAIN is enough here.
 

Can you please add a test that references a whole-row var? That usually causes
trouble.

Added.
 

-    if (root->hasHavingQual && query->havingQual)
+    if (root->hasHavingQual && fpinfo->havingQual)

This is not exactly a problem with your patch, but why do we need to check both
the boolean and the actual clauses? If the boolean is true, query->havingQual
should be non-NULL and NULL otherwise.

Agreed. But since this is not the fault of this patch, I have kept it as is.
 

     /* Grouping information */
     List       *grouped_tlist;
+    PathTarget *grouped_target;
+    Node       *havingQual;

I think we don't need havingQual as a separate member. foreign_grouping_ok()
separates the clauses in havingQual into shippable and non-shippable clauses
and saves in local_conditions and remote_conditions. Probably we want to use
those instead of adding a new member.

local_conditions/remote_conditions is a list of RestrictInfos, which cannot be used while costing the aggregates. So we need to store havingQual anyway.
 

index 04e43cc..c8999f6 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -62,7 +62,8 @@ typedef void (*GetForeignJoinPaths_function)
(PlannerInfo *root,
 typedef void (*GetForeignUpperPaths_function) (PlannerInfo *root,
                                                UpperRelationKind stage,
                                                RelOptInfo *input_rel,
-                                               RelOptInfo *output_rel);
+                                               RelOptInfo *output_rel,
+                                               UpperPathExtraData *extra);

Probably this comment belongs to 0007, but it's in this patch that it becomes
clear how invasive UpperPathExtraData changes are. While UpperPathExtraData has
upper paths in the name, all of its members are related to grouping. That's
fine since we only support partition-wise aggregate and not the other upper
operations. But if we were to do that in future, which of these members would
be applicable to other upper relations? inputRows, pathTarget,
partialPathTarget may be applicable to other upper rels as well. can_sort,
can_hash may be applicable to DISTINCT, SORT relations. isPartial and
havingQual will be applicable only to Grouping/Aggregation. So maybe it's OK,
and, as in RelOptInfo, we may separate them with comments.

I have grouped them as you said, but have not added any comments yet as I am not sure at this point which fields will be used by the other upper rel kinds.
Please have a look.
 

Another problem with that structure is its name doesn't mention that the
structure is used only for child upper relations, whereas the code assumes that
if extra is not present it's a parent upper relation. Maybe we want to rename
it to that effect, or always use it, whether for a parent or a child relation.

Renamed to OtherUpperPathExtraData.


We may want to rename pathTarget and partialPathTarget as relTarget and
partialRelTarget since those targets are not specific to any path, but will be
applicable to all the paths created for that rel.

Renamed.
 
These fixes are part of the v9 patchset.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Dec 14, 2017 at 4:01 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

Sure, no problem; take your time. Here's a set of comments for 0008. That
ends the first read of all the patches (second read for the core
changes).

+-- Also, disable parallel paths.
+SET max_parallel_workers_per_gather TO 0;

If you enable parallel aggregation for smaller data, partition-wise aggregation
paths won't be chosen. I think this is the reason you are disabling
parallel query, but we should probably explain that in a comment. Better if we
could come up with testcases without disabling parallel query; since Parallel
Append is now committed, maybe it can help.

Removed.
 

+
+-- Check with multiple columns in GROUP BY, order in target-list is reversed
+EXPLAIN (COSTS OFF)
+SELECT c, a, count(*) FROM pagg_tab GROUP BY a, c;
+                   QUERY PLAN
+-------------------------------------------------
+ Append
+   ->  HashAggregate
+         Group Key: pagg_tab_p1.a, pagg_tab_p1.c
+         ->  Seq Scan on pagg_tab_p1
[ ... clipped ... ]
+(10 rows)

Why do we need this testcase?

Rajkumar earlier reported an issue when the order in the target list is reversed. The fix required redesigning the GROUP BY key matching algorithm, so I think it is good to have this testcase.


+
+SELECT c, sum(a) FROM pagg_tab WHERE 1 = 2 GROUP BY c;
+ c | sum
+---+-----
+(0 rows)

I think we also need a case when the child input relations are marked dummy and
then the parent is marked dummy. Just use a condition with partkey = <none of
list bounds>.

I have added a testcase for that. But don't you think both are the same? When all input children are dummy, the parent too is marked dummy, i.e. the input relation is itself dummy.
Am I missing something here?
 

+
+-- Check with SORTED paths. Disable hashagg to get group aggregate

Suggest: "Test GroupAggregate paths by disabling hash aggregates."

+-- When GROUP BY clause matches with PARTITION KEY.

I don't think we need "with", and just extend the same sentence with "complete
aggregation is performed for each partition"

+-- Should choose full partition-wise aggregation path

suggest: "Should choose full partition-wise GroupAggregate plan", but I guess
with the above suggestion, this sentence is not needed.

+
+-- When GROUP BY clause not matches with PARTITION KEY.
+-- Should choose partial partition-wise aggregation path

Similar suggestions as above.

+-- No aggregates, but still able to perform partition-wise aggregates

That's a funny construction. Maybe "Test partition-wise grouping without any
aggregates".

We should try some output for this query.

+
+EXPLAIN (COSTS OFF)
+SELECT a FROM pagg_tab GROUP BY a ORDER BY 1;
+                   QUERY PLAN
+-------------------------------------------------
+ Group
+   Group Key: pagg_tab_p1.a
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.a
+         ->  Group
+               Group Key: pagg_tab_p1.a
+               ->  Sort
+                     Sort Key: pagg_tab_p1.a
+                     ->  Seq Scan on pagg_tab_p1
[ ... clipped ... ]
+(19 rows)

It's strange that we do not annotate partial grouping as Partial. Does not look
like a bug in your patch. Do we get similar output with parallel grouping?


It is partial aggregation, finalized at the top.
But since there are no aggregates involved we create a Group path and not an Agg path, and a Group path has no Partial annotation.
Yes, we see a similar plan for parallel grouping too.

+
+-- ORDERED SET within aggregate
+EXPLAIN (COSTS OFF)
+SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
+                               QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
+   ->  GroupAggregate
+         Group Key: pagg_tab_p1.a
+         ->  Sort
+               Sort Key: pagg_tab_p1.a
+               ->  Append
+                     ->  Seq Scan on pagg_tab_p1
+                     ->  Seq Scan on pagg_tab_p2
+                     ->  Seq Scan on pagg_tab_p3
+(10 rows)

pagg_tab is partitioned by column c, so not having it in the GROUP BY
might itself produce this plan if partial parallel aggregation is expensive.
When writing negative tests like this, the GROUP BY should always include the
partition key.

I deliberately wanted to test the case where the GROUP BY key does not match the partition key, so that partial aggregation is forced. But we do have some limitations on performing partial aggregation, i.e. an ORDERED SET aggregate cannot be computed in partial mode; this is the test that exercises that code path.


In case of full aggregation, since all the rows that belong to the same group
come from the same partition, having an ORDER BY doesn't make any difference.
We should support such a case.

We do support this.
Added a testcase for it.
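The restriction is easy to motivate with an order-sensitive aggregate such as a median (percentile_disc-style; plain Python, just to illustrate): combining per-partition results does not reproduce the result over all rows, whereas with full partition-wise aggregation each group's rows live in a single partition, so the whole aggregate can be computed there correctly.

```python
import statistics

# Two partitions holding rows of the same group.
p1, p2 = [1, 2, 9], [3, 4, 5]

# Correct: compute the median over all rows of the group.
full = statistics.median(p1 + p2)                       # 3.5

# A naive "partial" split -- median of per-partition medians -- is wrong:
partial = statistics.median([statistics.median(p1),
                             statistics.median(p2)])    # 3.0, not 3.5

print(full, partial)
```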
 

+INSERT INTO pagg_tab1 SELECT i%30, i%20 FROM generate_series(0, 299, 2) i;
+INSERT INTO pagg_tab2 SELECT i%20, i%30 FROM generate_series(0, 299, 3) i;

spaces around % operator?

+-- When GROUP BY clause matches with PARTITION KEY.
+-- Should choose full partition-wise aggregation path.

Probably we should just club single table and join cases under one set of
comments rather than repeating those? Create the tables once at the beginning
of the test file and group together the queries under one comment head.

I think the other way round: it is better to have the corresponding CREATE/INSERT statements near the test queries, to avoid lengthy scrolling to see the table structure and data. Each query has a comment describing what it does.
 

+-- Disable mergejoin to get hash aggregate.
+SET enable_mergejoin TO false;

Why? We have tested that once.


Removed.
 
+
+-- When GROUP BY clause not matches with PARTITION KEY.
+-- Should choose partial partition-wise aggregation path.
+-- Also check with SORTED paths. Disable hashagg to get group aggregate.
+SET enable_hashagg TO false;

Same as above. Clubbed together, those two will produce one hash and one
group plan. That will cover it.

For join queries, a GroupAggregate plan is not otherwise chosen, and I wanted that in the test coverage, so I kept this as is.
We have tested GroupAggregate for single partitioned relations, though. Let me know if you think this test is unnecessary and I will remove it.
 

+-- Check with LEFT/RIGHT/FULL OUTER JOINs which produces NULL values for
+-- aggregation
+-- LEFT JOIN, should produce partial partition-wise aggregation plan as
+-- GROUP BY is on nullable column
+EXPLAIN (COSTS OFF)
+SELECT b.y, sum(a.y) FROM pagg_tab1 a LEFT JOIN pagg_tab2 b ON a.x =
b.y GROUP BY 1 ORDER BY 1 NULLS LAST;

Maybe you should explicitly use GROUP BY b.y in all of these queries.

I actually wanted to test the positional GROUP BY case too. But, as you said, I have used b.y in these queries and modified some other queries to use positional notation in GROUP BY.
 

+-- FULL JOIN, should produce partial partition-wise aggregation plan as
+-- GROUP BY is on nullable column

In case of a FULL JOIN, the partition keys from both joining relations land on
the nullable side; there is no key on a non-nullable side, so an aggregation on
top of a FULL JOIN will always be partial partition-wise aggregation.


Yep.
Do you want me to add this explanation in the comment? I don't think so.

+
+-- Empty relations on LEFT side, no partition-wise agg plan.

Suggest: "Empty join relation because of empty outer side." I don't think we
are writing a negative test to check that a partition-wise agg plan is not
chosen; we are testing the case when the join relation is empty.

I didn't get exactly what you mean here. However, I updated the comment as per your suggestion.
 

+
+EXPLAIN (COSTS OFF)
+SELECT a, c, sum(b), avg(c), count(*) FROM pagg_tab GROUP BY c, a,
(a+b)/2 HAVING sum(b) = 50 AND avg(c) > 25 ORDER BY 1, 2, 3;

Keep this or the previous one; both is overkill. I vote for this one, but
it's up to you.

Removed previous one.
 

Maybe add a testcase with the partition keys themselves switched; output just
the plan.

I don't think we need this; instead I modified the earlier one. Please have a look.
 

+-- Test with multi-level partitioning scheme
+-- Partition-wise aggregation is tried only on first level.
[ ... clipped ... ]
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY

This seems to contradict the previous comment. Maybe club them together,
say "Partition-wise aggregation with full aggregation only at the first
level", and move that whole comment down.

+
+-- Partial aggregation as GROUP BY clause does not match with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT b, sum(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1, 2, 3;
+                           QUERY PLAN
+----------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.b, (sum(pagg_tab_p1.a)), (count(*))
+   ->  Finalize GroupAggregate
+         Group Key: pagg_tab_p1.b
+         ->  Sort
+               Sort Key: pagg_tab_p1.b
+               ->  Append
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p1.b
+                           ->  Seq Scan on pagg_tab_p1
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p2_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p2_s1
+                                 ->  Seq Scan on pagg_tab_p2_s2
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p3_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p3_s1
+                                 ->  Seq Scan on pagg_tab_p3_s2
+(20 rows)

Why aren't we seeing partial aggregation paths for level two and below
partitions?

In this version of the patch I have not recursed into the next level.
Will work on it and submit the changes in the next patchset.
 

+
+-- Test on middle level partitioned table which is further partitioned on b.
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT b, sum(a), count(*) FROM pagg_tab_p3 GROUP BY b ORDER BY 1, 2, 3;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p3_s1.b, (sum(pagg_tab_p3_s1.a)), (count(*))
+   ->  Append
+         ->  HashAggregate
+               Group Key: pagg_tab_p3_s1.b
+               ->  Seq Scan on pagg_tab_p3_s1
+         ->  HashAggregate
+               Group Key: pagg_tab_p3_s2.b
+               ->  Seq Scan on pagg_tab_p3_s2
+(9 rows)
+
+SELECT b, sum(a), count(*) FROM pagg_tab_p3 GROUP BY b ORDER BY 1, 2, 3;
+ b | sum  | count
+---+------+-------
+ 0 | 2000 |   100
+ 1 | 2100 |   100
+ 2 | 2200 |   100
+ 3 | 2300 |   100
+ 4 | 2400 |   100
+ 5 | 2500 |   100
+ 6 | 2600 |   100
+ 7 | 2700 |   100
+ 8 | 2800 |   100
+ 9 | 2900 |   100
+(10 rows)

We should just remove this case; it's the same as testing top-level partitioned
tables.

Removed.
 

+
+-- Full aggregation as GROUP BY clause matches with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT a, sum(b), array_agg(distinct c), count(*) FROM pagg_tab GROUP
BY a, b HAVING avg(b) < 3 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b)), (array_agg(DISTINCT
pagg_tab_p1.c))
+   ->  Append
+         ->  GroupAggregate
+               Group Key: pagg_tab_p1.a, pagg_tab_p1.b
+               Filter: (avg(pagg_tab_p1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p1.a, pagg_tab_p1.b
+                     ->  Seq Scan on pagg_tab_p1
+         ->  GroupAggregate
+               Group Key: pagg_tab_p2_s1.a, pagg_tab_p2_s1.b
+               Filter: (avg(pagg_tab_p2_s1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p2_s1.a, pagg_tab_p2_s1.b
+                     ->  Append
+                           ->  Seq Scan on pagg_tab_p2_s1
+                           ->  Seq Scan on pagg_tab_p2_s2
+         ->  GroupAggregate
+               Group Key: pagg_tab_p3_s1.a, pagg_tab_p3_s1.b
+               Filter: (avg(pagg_tab_p3_s1.b) < '3'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_p3_s1.a, pagg_tab_p3_s1.b
+                     ->  Append
+                           ->  Seq Scan on pagg_tab_p3_s1
+                           ->  Seq Scan on pagg_tab_p3_s2
+(25 rows)

Instead of an Append node appearing under GroupAggregate, I think we should
flatten all the partition scans for the subpartitions whose partition keys are
part of group keys and add GroupAggregate on top of each of such partition
scans.

Yes. As explained earlier, will do that as a separate patch.


+-- Parallelism within partition-wise aggregates
+RESET max_parallel_workers_per_gather;
+SET min_parallel_table_scan_size TO '8kB';
+SET parallel_setup_cost TO 0;
+INSERT INTO pagg_tab_para SELECT i%30, i%20 FROM generate_series(0, 29999) i;

spaces around % operator?

+SHOW max_parallel_workers_per_gather;
+ max_parallel_workers_per_gather
+---------------------------------
+ 2

Why do we need this?


Removed.
 
+
+-- When GROUP BY clause matches with PARTITION KEY.
+EXPLAIN (COSTS OFF)
+SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x
HAVING avg(y) < 7 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
+   ->  Append
[ ... clipped ... ]
+         ->  Finalize GroupAggregate
+               Group Key: pagg_tab_para_p3.x
+               Filter: (avg(pagg_tab_para_p3.y) < '7'::numeric)
+               ->  Sort
+                     Sort Key: pagg_tab_para_p3.x
+                     ->  Gather
+                           Workers Planned: 2
+                           ->  Partial HashAggregate
+                                 Group Key: pagg_tab_para_p3.x
+                                 ->  Parallel Seq Scan on pagg_tab_para_p3
[ ... clipped ... ]
+-- When GROUP BY clause not matches with PARTITION KEY.
+EXPLAIN (COSTS OFF)
+SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y
HAVING avg(x) < 12 ORDER BY 1, 2, 3;
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_para_p1.y, (sum(pagg_tab_para_p1.x)), (avg(pagg_tab_para_p1.x))
+   ->  Finalize HashAggregate
+         Group Key: pagg_tab_para_p1.y
[ ... clipped ... ]
+               ->  Gather
+                     Workers Planned: 2
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_p3.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p3

Per a prior discussion on this thread, we shouldn't produce such plans;
Parallel Append instead?

Yes. We do get a Parallel Append path now.
For full aggregation, a normal Append plan is chosen over Parallel Append, but we do create that path.


+SET enable_partition_wise_agg to true;

Maybe just enable it at the beginning instead of enabling and disabling it twice?

Done as you said. However, this affected one more testcase from partition_join.sql. Updated the expected output for that too.

Comments for which I have not responded are all done.

All these fixes are part of v9 patchset.

Thanks Ashutosh for detailed reviews so far.
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:

On Thu, Dec 14, 2017 at 4:01 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

+
+-- Partial aggregation as GROUP BY clause does not match with PARTITION KEY
+EXPLAIN (COSTS OFF)
+SELECT b, sum(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1, 2, 3;
+                           QUERY PLAN
+----------------------------------------------------------------
+ Sort
+   Sort Key: pagg_tab_p1.b, (sum(pagg_tab_p1.a)), (count(*))
+   ->  Finalize GroupAggregate
+         Group Key: pagg_tab_p1.b
+         ->  Sort
+               Sort Key: pagg_tab_p1.b
+               ->  Append
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p1.b
+                           ->  Seq Scan on pagg_tab_p1
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p2_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p2_s1
+                                 ->  Seq Scan on pagg_tab_p2_s2
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_p3_s1.b
+                           ->  Append
+                                 ->  Seq Scan on pagg_tab_p3_s1
+                                 ->  Seq Scan on pagg_tab_p3_s2
+(20 rows)

Why aren't we seeing partial aggregation paths for level two and below
partitions?


In this version of the patch I have not recursed into next level.
Will work on it and submit changes in the next patch-set.

 Attached a new set of patches adding this. Only patch 0007 (main patch) and 0008 (testcase patch) have changed.

Please have a look and let me know if I missed any.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Jan 11, 2018 at 6:00 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>  Attached new set of patches adding this. Only patch 0007 (main patch) and
> 0008 (testcase patch) has changed.
>
> Please have a look and let me know if I missed any.

I spent a little time studying 0001 and 0002 today, as well as their
relation with 0007.  I find the way that the refactoring has been done
slightly odd.  With 0001 and 0002 applied, we end up with three
functions for creating aggregate paths: create_partial_agg_path, which
handles the partial-path case for both sort and hash;
create_sort_agg_path, which handles the sort case for non-partial
paths only; and create_hash_agg_path, which handles the hash case for
non-partial paths only.  This leads to the following code in 0007:

+               /* Full grouping paths */
+
+               if (try_parallel_aggregation)
+               {
+                       Assert(extra->agg_partial_costs && extra->agg_final_costs);
+                       create_partial_agg_path(root, input_rel, grouped_rel, target,
+                                               partial_target, extra->agg_partial_costs,
+                                               extra->agg_final_costs, gd, can_sort,
+                                               can_hash, (List *) extra->havingQual);
+               }
+
+               if (can_sort)
+                       create_sort_agg_path(root, input_rel, grouped_rel, target,
+                                            partial_target, agg_costs,
+                                            extra->agg_final_costs, gd, can_hash,
+                                            dNumGroups, (List *) extra->havingQual);
+
+               if (can_hash)
+                       create_hash_agg_path(root, input_rel, grouped_rel, target,
+                                            partial_target, agg_costs,
+                                            extra->agg_final_costs, gd, dNumGroups,
+                                            (List *) extra->havingQual);

That looks strange -- you would expect to see either "sort" and "hash"
cases here, or maybe "partial" and "non-partial", or maybe all four
combinations, but seeing three things here looks surprising.  I think
the solution is just to create a single function that does both the
work of create_sort_agg_path and the work of create_hash_agg_path
instead of having two separate functions.

A related thing that is also surprising is that 0007 manages to reuse
create_partial_agg_path for both the isPartialAgg and non-isPartialAgg
cases -- in fact, the calls appear to be identical, and could be
hoisted out of the "if" statement -- but create_sort_agg_path and
create_hash_agg_path do not get reused.  I think you should see
whether you can define the new combo function that can be used for
both cases.  The logic looks very similar, and I'm wondering why it
isn't more similar than it is; for instance, create_sort_agg_path
loops over the input rel's pathlist, but the code for
isPartialAgg/can_sort seems to consider only the cheapest path.  If
this is correct, it needs a comment explaining it, but I don't see why
it should be correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Jan 16, 2018 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 11, 2018 at 6:00 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>  Attached new set of patches adding this. Only patch 0007 (main patch) and
> 0008 (testcase patch) has changed.
>
> Please have a look and let me know if I missed any.

I spent a little time studying 0001 and 0002 today, as well as their
relation with 0007.  I find the way that the refactoring has been done
slightly odd.  With 0001 and 0002 applied, we end up with three
functions for creating aggregate paths: create_partial_agg_path, which
handles the partial-path case for both sort and hash;
create_sort_agg_path, which handles the sort case for non-partial
paths only; and create_hash_agg_path, which handles the hash case for
non-partial paths only.  This leads to the following code in 0007:

+               /* Full grouping paths */
+
+               if (try_parallel_aggregation)
+               {
+                       Assert(extra->agg_partial_costs && extra->agg_final_costs);
+                       create_partial_agg_path(root, input_rel, grouped_rel, target,
+                                               partial_target, extra->agg_partial_costs,
+                                               extra->agg_final_costs, gd, can_sort,
+                                               can_hash, (List *) extra->havingQual);
+               }
+
+               if (can_sort)
+                       create_sort_agg_path(root, input_rel, grouped_rel, target,
+                                            partial_target, agg_costs,
+                                            extra->agg_final_costs, gd, can_hash,
+                                            dNumGroups, (List *) extra->havingQual);
+
+               if (can_hash)
+                       create_hash_agg_path(root, input_rel, grouped_rel, target,
+                                            partial_target, agg_costs,
+                                            extra->agg_final_costs, gd, dNumGroups,
+                                            (List *) extra->havingQual);

That looks strange -- you would expect to see either "sort" and "hash"
cases here, or maybe "partial" and "non-partial", or maybe all four
combinations, but seeing three things here looks surprising.  I think
the solution is just to create a single function that does both the
work of create_sort_agg_path and the work of create_hash_agg_path
instead of having two separate functions.

In the existing code (in create_grouping_paths()), I see the following pattern:
    if (try_parallel_aggregation)
        if (can_sort)
        if (can_hash)
    if (can_sort)
    if (can_hash)

And thus I created three functions to match the existing pattern.

I will make your suggested change, that is, merge create_sort_agg_path() and
create_hash_agg_path(). I will name that function
create_sort_and_hash_agg_paths().
 

A related thing that is also surprising is that 0007 manages to reuse
create_partial_agg_path for both the isPartialAgg and non-isPartialAgg
cases -- in fact, the calls appear to be identical, and could be
hoisted out of the "if" statement

Yes. We can do that as well and I think it is better too.
I was just trying to preserve the existing pattern. So for PWA I chose:
    if partialAgg
        if (try_parallel_aggregation)
            if (can_sort)
            if (can_hash)
        if (can_sort)
        if (can_hash)
    else fullAgg
        if (try_parallel_aggregation)
            if (can_sort)
            if (can_hash)
        if (can_sort)
        if (can_hash)

But since the if (try_parallel_aggregation) case is exactly the same in both
branches, I will pull that out of the if..else.

 
-- but create_sort_agg_path and
create_hash_agg_path do not get reused.  I think you should see
whether you can define the new combo function that can be used for
both cases.  The logic looks very similar, and I'm wondering why it
isn't more similar than it is; for instance, create_sort_agg_path
loops over the input rel's pathlist, but the code for
isPartialAgg/can_sort seems to consider only the cheapest path.  If
this is correct, it needs a comment explaining it, but I don't see why
it should be correct.

Oops. My mistake. Missed. We should loop over the input rel's pathlist.

Yep. With the above change, the logic is very similar except that
(1) the isPartialAgg/can_sort case creates the partial paths and
(2) the finalization step is not needed at this stage.

I think it can be done by passing a flag to create_sort_agg_path() (or the new
combo function) and making appropriate adjustments. Do you think the addition of
this new flag should go in the re-factoring patch or the main PWA patch?
I think the re-factoring patch.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Tue, Jan 16, 2018 at 3:56 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I will make your suggested changes that is merge create_sort_agg_path() and
> create_hash_agg_path(). Will name that function as
> create_sort_and_hash_agg_paths().

I suggest add_paths_to_grouping_rel() and
add_partial_paths_to_grouping_rel(), similar to what commit
c44c47a773bd9073012935a29b0264d95920412c did with
add_paths_to_append_rel().

> Oops. My mistake. Missed. We should loop over the input rel's pathlist.
>
> Yep. With above change, the logic is very similar except
> (1) isPartialAgg/can_sort case creates the partial paths and
> (2) finalization step is not needed at this stage.

I'm not sure what you mean by #1.

> I think it can be done by passing a flag to create_sort_agg_path() (or new
> combo function) and making appropriate adjustments. Do you think addition of
> this new flag should go in re-factoring patch or main PWA patch?
> I think re-factoring patch.

I think the refactoring patch should move the existing code into a new
function without any changes, and then the main patch should add an
additional argument to that function that allows for either behavior.

By the way, I'm also a bit concerned about this:

+               /*
+                * For full aggregation, we are done with the partial paths.  Just
+                * clear it out so that we don't try to create a parallel plan over it.
+                */
+               grouped_rel->partial_pathlist = NIL;

I think that's being done for the same reason as mentioned at the
bottom of the current code for create_grouping_paths().  They are only
partially aggregated and wouldn't produce correct final results if
some other planning step -- create_ordered_paths, or the code that
sets up final_rel -- used them as if they had been fully aggregated.
I'm worried that there might be an analogous danger for partition-wise
aggregation -- that is, that the paths being inserted into the partial
pathlists of the aggregate child rels might get reused by some later
planning step which doesn't realize that the output they produce
doesn't quite match up with the rel to which they are attached.  You
may have already taken care of that problem somehow, but we should
make sure that it's fully correct and clearly commented.  I don't
immediately see why the isPartialAgg case should be any different from
the !isPartialAgg case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Jan 17, 2018 at 1:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 16, 2018 at 3:56 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I will make your suggested changes that is merge create_sort_agg_path() and
> create_hash_agg_path(). Will name that function as
> create_sort_and_hash_agg_paths().

I suggest add_paths_to_grouping_rel() and
add_partial_paths_to_grouping_rel(), similar to what commit
c44c47a773bd9073012935a29b0264d95920412c did with
add_paths_to_append_rel().

> Oops. My mistake. Missed. We should loop over the input rel's pathlist.
>
> Yep. With above change, the logic is very similar except
> (1) isPartialAgg/can_sort case creates the partial paths and
> (2) finalization step is not needed at this stage.

I'm not sure what you mean by #1.

I mean that in the isPartialAgg=true case, we need to create a partial aggregation path which has aggsplit=AGGSPLIT_INITIAL_SERIAL and which does not perform finalization at this stage. Thus add_paths_to_grouping_rel() needs a flag to differentiate the two cases. Adding that flag allows us to reuse the same function in both cases, i.e. full and partial aggregation.
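To illustrate the AGGSPLIT_INITIAL_SERIAL split described above, here is a minimal sketch (Python with invented names, not PostgreSQL code) of an avg() aggregate where the partial step emits only a transition state, the states are combined, and finalization happens exactly once at the end:

```python
# Minimal sketch (invented names, not PostgreSQL code) of two-phase
# aggregation for avg(): partial step emits a transition state and
# skips finalization; a later step combines states and finalizes once.

def partial_avg(rows):
    """Partial aggregate: emit only the transition state (sum, count)."""
    return (sum(rows), len(rows))

def combine_avg(a, b):
    """Combine two transition states from different partitions/workers."""
    return (a[0] + b[0], a[1] + b[1])

def finalize_avg(state):
    """Finalize step: turn the combined state into the final avg."""
    total, count = state
    return total / count

# One partial state per partition; finalization happens once, at the end.
partitions = [[1, 2, 3], [4, 5], [6]]
state = (0, 0)
for part in partitions:
    state = combine_avg(state, partial_avg(part))
print(finalize_avg(state))  # 3.5
```

Emitting the bare (sum, count) state is the analogue of building a path without the finalize step; whoever consumes it must know it still needs combining and finalizing.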

Attached patch with other review points fixed.


> I think it can be done by passing a flag to create_sort_agg_path() (or new
> combo function) and making appropriate adjustments. Do you think addition of
> this new flag should go in re-factoring patch or main PWA patch?
> I think re-factoring patch.

I think the refactoring patch should move the existing code into a new
function without any changes, and then the main patch should add an
additional argument to that function that allows for either behavior.

By the way, I'm also a bit concerned about this:

+               /*
+                * For full aggregation, we are done with the partial paths.  Just
+                * clear it out so that we don't try to create a parallel plan over it.
+                */
+               grouped_rel->partial_pathlist = NIL;

I think that's being done for the same reason as mentioned at the
bottom of the current code for create_grouping_paths().  They are only
partially aggregated and wouldn't produce correct final results if
some other planning step -- create_ordered_paths, or the code that
sets up final_rel -- used them as if they had been fully aggregated.
I'm worried that there might be an analogous danger for partition-wise
aggregation -- that is, that the paths being inserted into the partial
pathlists of the aggregate child rels might get reused by some later
planning step which doesn't realize that the output they produce
doesn't quite match up with the rel to which they are attached.  You
may have already taken care of that problem somehow, but we should
make sure that it's fully correct and clearly commented.  I don't
immediately see why the isPartialAgg case should be any different from
the !isPartialAgg case.

Actually, I needed this because in the full-aggregation case we already build all the final aggregation paths before performing the Append operation. However, when we do the append in add_paths_to_append_rel(), it assumes that the partial_pathlist present in grouped_rel is a finalized one, exactly as you mentioned above, and thus we need to clear it out.

But yes, to be on the safe side, I think that once we are done with the partition-wise aggregation step, we need to go through the partitioning chain again and clear out each child grouped rel's partial_pathlist, for the reason mentioned at the bottom of the current code for create_grouping_paths().
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Jan 18, 2018 at 8:55 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached patch with other review points fixed.

Committed 0001 and 0002 together, with some cosmetic changes,
including fixing pgindent damage.  Please pgindent your patches before
submitting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Sat, Jan 27, 2018 at 1:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 18, 2018 at 8:55 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached patch with other review points fixed.

Committed 0001 and 0002 together, with some cosmetic changes,
including fixing pgindent damage.

Thanks Robert.

 
  Please pgindent your patches before
submitting.

Sure, will take care of this.

Attached new patch set and rebased it on latest HEAD.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Mon, Jan 29, 2018 at 3:42 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached new patch set and rebased it on latest HEAD.

I strongly dislike add_single_path_to_append_rel.  It adds branches
and complexity to code that is already very complex.  Most
importantly, why are we adding paths to fields in
OtherUpperPathExtraData *extra instead of adding them to the path list
of some RelOptInfo?  If we had an appropriate RelOptInfo to which we
could add the paths, then we could make this simpler.

If I understand correctly, the reason you're doing it this way is
because we have no place to put partially-aggregated, non-partial
paths.  If we only needed to worry about the parallel query case, we
could just build an append of partially-aggregated paths for each
child and stick it into the grouped rel's partial pathlist, just as we
already do for regular parallel aggregation.  There's no reason why
add_paths_to_grouping_rel() needs to care about the difference a
Partial Aggregate on top of whatever and an Append each branch of
which is a Partial Aggregate on top of whatever.  However, this won't
work for non-partial paths, because add_paths_to_grouping_rel() needs
to put paths into the grouped rel's pathlist -- and we can't mix
together partially-aggregated paths and fully-aggregated paths in the
same path list.

But, really, the way we're using grouped_rel->partial_pathlist right
now is an awful hack.  What I'm thinking we could do is introduce a
new UpperRelationKind called UPPERREL_PARTIAL_GROUP_AGG, coming just
before UPPERREL_GROUP_AGG.  Without partition-wise aggregate,
partially_grouped_rel's pathlist would always be NIL, and its partial
pathlist would be constructed using the logic in
add_partial_paths_to_grouping_rel, which would need renaming.  Then,
add_paths_to_grouping_rel would use paths from input_rel when doing
non-parallel aggregation and paths from partially_grouped_rel when
doing parallel aggregation.  This would eliminate the ugly
grouped_rel->partial_pathlist = NIL assignment at the bottom of
create_grouping_paths(), because the grouped_rel's partial_pathlist
would never have been (bogusly) populated in the first place, and
hence would not need to be reset.  All of these changes could be made
via a preparatory patch.

Then the main patch needs to worry about four cases:

1. Parallel partition-wise aggregate, grouping key doesn't contain
partition key.  This should just be a matter of adding additional
Append paths to partially_grouped_rel->partial_pathlist.  The existing
code already knows how to stick a Gather and FinalizeAggregate step on
top of that, and I don't see why that logic would need any
modification or addition.  An Append of child partial-grouping paths
should be producing the same output as a partial grouping of an
Append, except that the former case might produce more separate groups
that need to be merged; but that should be OK: we can just throw all
the paths into the same path list and let the cheapest one win.

2. Parallel partition-wise aggregate, grouping key contains partition
key.  For the most part, this is no different from case #1.  We won't
have groups spanning different partitions in this case, but we might
have groups spanning different workers, so we still need a
FinalizeAggregate step.  As an exception, Gather -> Parallel Append ->
[non-partial Aggregate path] would give us a way of doing aggregation
in parallel without a separate Finalize step.  I'm not sure if we want
to consider that to be in scope for this patch.  If we do, then we'd
add the Parallel Append path to grouped_rel->partial_pathlist.  Then,
we could stick Gather (Merge) on top of it to produce a path for
grouped_rel->pathlist using generate_gather_paths(); alternatively, it
can be used by upper planning steps -- something we currently can't
ever make work with parallel aggregation.

3. Non-parallel partition-wise aggregate, grouping key contains
partition key.  Build Append paths from the children of grouped_rel
and add them to grouped_rel->pathlist.

4. Non-parallel partition-wise aggregate, grouping key doesn't contain
partition key.  Build Append paths from the children of
partially_grouped_rel and add them to partially_grouped_rel->pathlist.
Also add code to generate paths for grouped_rel->pathlist by sticking
a FinalizeAggregate step on top of each path from
partially_grouped_rel->pathlist.
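
The two non-parallel cases above can be illustrated with a toy sketch (Python with invented data, not planner code): when the grouping key contains the partition key, fully aggregated children can simply be appended; when it does not, appending repeats group keys, so the children are only partial and a Finalize step must merge them.

```python
# Toy illustration (invented data, not planner code).  The table is
# partitioned by column a; rows are (a, b) pairs.
from collections import Counter

partitions = [
    [(0, 0), (0, 1)],          # partition holding a = 0
    [(1, 0), (1, 1), (1, 1)],  # partition holding a = 1
]

# Grouping key contains the partition key: every group lives in exactly
# one partition, so fully aggregated children can simply be appended.
full = {}
for part in partitions:
    full.update(Counter(a for a, _ in part))   # count(*) GROUP BY a
print(full)  # {0: 2, 1: 3}

# Grouping key does not contain the partition key: appending the
# per-partition results repeats group keys, so they are only partial
# and a FinalizeAggregate step must merge them.
appended = [(b, n) for part in partitions
            for b, n in Counter(b for _, b in part).items()]
print(appended)  # [(0, 1), (1, 1), (0, 1), (1, 2)] -- keys repeat
final = Counter()
for b, n in appended:
    final[b] += n  # the Finalize step
print(dict(final))  # {0: 2, 1: 3}
```

The repeated keys in `appended` are the "more separate groups that need to be merged" mentioned earlier in this email.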

Overall, what I'm trying to argue for here is making this feature look
less like its own separate thing and more like part of the general
mechanism we've already got: partial paths would turn into regular
paths via generate_gather_paths(), and partially aggregated paths
would turn into fully aggregated paths by adding FinalizeAggregate.
The existing special case that allows us to build a non-partial, fully
aggregated path from a partial, partially-aggregated path would be
preserved.

I think this would probably eliminate some other problems I see in the
existing design as well.  For example, create_partition_agg_paths
doesn't consider using Gather Merge, but that might be a win.  With
the design above, I think you never need to call create_gather_path()
anywhere.  In case #1, the existing code takes care of it.  In the
special case mentioned under #2, if we chose to support that,
generate_gather_paths() would take care of it.  Both of those places
already know about Gather Merge.

On another note, I found preferFullAgg to be wicked confusing.  To
"prefer" something is to like it better, but be willing to accept
other options if the preference can't be accommodated.  Here, it seems
like preferFullAgg = false prevents consideration of full aggregation.
So it's really more like allowFullAgg, or, maybe better,
try_full_aggregation.  Also, try_partition_wise_grouping has a
variable isPartialAgg which always ends up getting set to
!preferFullAgg.  Having two Boolean variables which are always set to
the opposite of each other isn't good.  To add to the confusion, the
code following the place where isPartialAgg is set sometimes refers to
isPartialAgg and sometimes refers to preferFullAgg.

I think the comments in this patch still need a good bit of work.
They tend to explain what the code does rather than the reason it does
it, and they tend to speak vaguely rather than precisely about things
happening in other places.  For example, consider the need to set
partial_pathlist = NIL in create_grouping_paths(). Here's the existing
comment:

    /*
     * We've been using the partial pathlist for the grouped relation to hold
     * partially aggregated paths, but that's actually a little bit bogus
     * because it's unsafe for later planning stages -- like ordered_rel ---
     * to get the idea that they can use these partial paths as if they didn't
     * need a FinalizeAggregate step.  Zap the partial pathlist at this stage
     * so we don't get confused.
     */

Here's your comment about the same hazard:

+        * For full aggregation, at this point we are already done with the
+        * finalization step and thus partial paths are no more needed. Keeping
+        * those will lead to some unwanted result later in the planning stage.
+        * Thus like create_grouping_paths(), clear them out.

Notice that your comment says that it will create an "unwanted
result" and that this result will happen "later", whereas the existing
comment is a lot more specific.  It says exactly what the problem is
(FinalizeAggregate needed) and where the confusion will happen
(ordered_rel).  Some other examples of comments with similar problems:

+        * If there are partial subpaths for parallelism, then we need to add
+        * gather path on top of the append. However, we only do this when full
+        * aggregation is required.  For partial aggregation this can be done at
+        * later stage.

Doesn't really explain why we're doing any of those things, just says
that we are.  Also, what later stage?

+        * For non-partial aggregation path, just need to add given append path to
+        * a grouped_rel.  Also, if caller requested a partial aggregation only,
+        * skip finalize step.

Again, why?

+        * Add all collected append paths into the grouped_rel.  For partial
+        * aggregation mode we need to add a finalize agg node over an append
+        * path.

Why?

+       /* Similarly, for partial paths.  Here we need to add gather node too. */

Why?

+                * However, in partial aggregation mode this is done at later stage,
+                * so skip it.

When?

Here's an example of a much better comment you wrote:

+        * In find_ec_member_for_tle(), child EC members are ignored if they don't
+        * belong to the given relids. Thus, if this sort path is based on a child
+        * relation, we must pass the relids of it. Otherwise, we will end-up into
+        * an error requiring pathkey item.

I haven't studied this patch in enough depth yet to figure out whether
I think that makes sense, but clearly when I go to do that this
comment is going to be a big help in figuring it out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Feb 1, 2018 at 1:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 29, 2018 at 3:42 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached new patch set and rebased it on latest HEAD.

I strongly dislike add_single_path_to_append_rel.  It adds branches
and complexity to code that is already very complex.  Most
importantly, why are we adding paths to fields in
OtherUpperPathExtraData *extra instead of adding them to the path list
of some RelOptInfo?  If we had an appropriate RelOptInfo to which we
could add the paths, then we could make this simpler.

If I understand correctly, the reason you're doing it this way is
because we have no place to put partially-aggregated, non-partial
paths.  If we only needed to worry about the parallel query case, we
could just build an append of partially-aggregated paths for each
child and stick it into the grouped rel's partial pathlist, just as we
already do for regular parallel aggregation.  There's no reason why
add_paths_to_grouping_rel() needs to care about the difference between a
Partial Aggregate on top of whatever and an Append each branch of
which is a Partial Aggregate on top of whatever.  However, this won't
work for non-partial paths, because add_paths_to_grouping_rel() needs
to put paths into the grouped rel's pathlist -- and we can't mix
together partially-aggregated paths and fully-aggregated paths in the
same path list.

Yes.
 

But, really, the way we're using grouped_rel->partial_pathlist right
now is an awful hack.  What I'm thinking we could do is introduce a
new UpperRelationKind called UPPERREL_PARTIAL_GROUP_AGG, coming just
before UPPERREL_GROUP_AGG.  Without partition-wise aggregate,
partially_grouped_rel's pathlist would always be NIL, and its partial
pathlist would be constructed using the logic in
add_partial_paths_to_grouping_rel, which would need renaming.  Then,
add_paths_to_grouping_rel would use paths from input_rel when doing
non-parallel aggregation and paths from partially_grouped_rel when
doing parallel aggregation.  This would eliminate the ugly
grouped_rel->partial_pathlist = NIL assignment at the bottom of
create_grouping_paths(), because the grouped_rel's partial_pathlist
would never have been (bogusly) populated in the first place, and
hence would not need to be reset.  All of these changes could be made
via a preparatory patch.

I wrote a patch for this (on current HEAD) and attached it separately here. Please have a look.

I do not yet fully understand how we are going to pass those to add_paths_to_append_rel(). I need to look at it more deeply, though.


Then the main patch needs to worry about four cases:

1. Parallel partition-wise aggregate, grouping key doesn't contain
partition key.  This should just be a matter of adding additional
Append paths to partially_grouped_rel->partial_pathlist.  The existing
code already knows how to stick a Gather and FinalizeAggregate step on
top of that, and I don't see why that logic would need any
modification or addition.  An Append of child partial-grouping paths
should be producing the same output as a partial grouping of an
Append, except that the former case might produce more separate groups
that need to be merged; but that should be OK: we can just throw all
the paths into the same path list and let the cheapest one win.

For any partial aggregation, we need to add a finalization step after we are done with the Append, i.e. after add_paths_to_append_rel(). Given that, we need to replicate the logic of sticking the Gather and FinalizeAggregate steps on top at a later stage. That is exactly what create_partition_agg_paths() does.
Am I missing something here?
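To make the partial-then-finalize flow concrete, here is a toy Python sketch (purely illustrative; the function names are invented and this is not PostgreSQL code). It shows why appending partially aggregated child results and finalizing afterwards yields the same answer as partially aggregating the appended input:

```python
# Illustrative only: per-partition partial aggregation + a Finalize step.
from collections import Counter

def partial_agg(rows):
    """Partial aggregation: per-group counts within one input stream."""
    return Counter(g for g, _ in rows)

def finalize(partials):
    """Finalize step: combine partial states across partitions/workers."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

p1 = [("a", 1), ("b", 2)]   # partition 1 (group key, payload)
p2 = [("a", 3), ("c", 4)]   # partition 2

# Partition-wise: partial-agg each child, then "Append" + Finalize.
partition_wise = finalize([partial_agg(p1), partial_agg(p2)])

# Conventional: partial-agg over the Append of raw rows, then Finalize.
conventional = finalize([partial_agg(p1 + p2)])

# Group "a" spans partitions; the Finalize step merges its partials.
assert partition_wise == conventional == {"a": 2, "b": 1, "c": 1}
```

The partition-wise path may feed more separate partial groups into Finalize, but the combined result is identical, which is why both kinds of paths can compete in the same path list on cost alone.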


2. Parallel partition-wise aggregate, grouping key contains partition
key.  For the most part, this is no different from case #1.  We won't
have groups spanning different partitions in this case, but we might
have groups spanning different workers, so we still need a
FinalizeAggregate step.  As an exception, Gather -> Parallel Append ->
[non-partial Aggregate path] would give us a way of doing aggregation
in parallel without a separate Finalize step.  I'm not sure if we want
to consider that to be in scope for this patch.  If we do, then we'd
add the Parallel Append path to grouped_rel->partial_pathlist.  Then,
we could stick Gather (Merge) on top of it to produce a path for
grouped_rel->pathlist using generate_gather_paths(); alternatively, it
can be used by upper planning steps -- something we currently can't
ever make work with parallel aggregation.

3. Non-parallel partition-wise aggregate, grouping key contains
partition key.  Build Append paths from the children of grouped_rel
and add them to grouped_rel->pathlist.

Yes.
 

4. Non-parallel partition-wise aggregate, grouping key doesn't contain
partition key.  Build Append paths from the children of
partially_grouped_rel and add them to partially_grouped_rel->pathlist.
Also add code to generate paths for grouped_rel->pathlist by sticking
a FinalizeAggregate step on top of each path from
partially_grouped_rel->pathlist.

Yes, this is done in create_partition_agg_paths().
create_partition_agg_paths() basically adds a Gather path if required, and then finalizes it, again if required. These steps are similar to those in its add_paths_to_grouping_rel() counterpart, which does gather + finalization.


Overall, what I'm trying to argue for here is making this feature look
less like its own separate thing and more like part of the general
mechanism we've already got: partial paths would turn into regular
paths via generate_gather_paths(), and partially aggregated paths
would turn into fully aggregated paths by adding FinalizeAggregate.
The existing special case that allows us to build a non-partial, fully
aggregated path from a partial, partially-aggregated path would be
preserved.

I think this would probably eliminate some other problems I see in the
existing design as well.  For example, create_partition_agg_paths
doesn't consider using Gather Merge, but that might be a win.

An Append path is always unsorted and has no pathkeys. Thus Gather Merge over an Append path seems infeasible, doesn't it?
 
  With
the design above, I think you never need to call create_gather_path()
anywhere.  In case #1, the existing code takes care of it.  In the
special case mentioned under #2, if we chose to support that,
generate_gather_paths() would take care of it.  Both of those places
already know about Gather Merge.

I don't understand exactly how yet; I will have a more careful look over this.


On another note, I found preferFullAgg to be wicked confusing.  To
"prefer" something is to like it better, but be willing to accept
other options if the preference can't be accommodated.  Here, it seems
like preferFullAgg = false prevents consideration of full aggregation.
So it's really more like allowFullAgg, or, maybe better,
try_full_aggregation.  Also, try_partition_wise_grouping has a
variable isPartialAgg which always ends up getting set to
!preferFullAgg.  Having two Boolean variables which are always set to
the opposite of each other isn't good.  To add to the confusion, the
code following the place where isPartialAgg is set sometimes refers to
isPartialAgg and sometimes refers to preferFullAgg.

I will have a look over this, and at the commenting part as well.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Feb 1, 2018 at 8:59 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I wrote a patch for this (on current HEAD) and attached separately here.
> Please have a look.

Yes, this is approximately what I had in mind, though it needs more
work (e.g. it doesn't remove the clearing of the grouped_rel's
partial_pathlist, which should no longer be necessary; also, it needs
substantial comment updates).

>> 1. Parallel partition-wise aggregate, grouping key doesn't contain
>> partition key.  This should just be a matter of adding additional
>> Append paths to partially_grouped_rel->partial_pathlist.  The existing
>> code already knows how to stick a Gather and FinalizeAggregate step on
>> top of that, and I don't see why that logic would need any
>> modification or addition.  An Append of child partial-grouping paths
>> should be producing the same output as a partial grouping of an
>> Append, except that the former case might produce more separate groups
>> that need to be merged; but that should be OK: we can just throw all
>> the paths into the same path list and let the cheapest one win.
>
> For any partial aggregation, we need to add a finalization step after we are
> done with the Append, i.e. after add_paths_to_append_rel(). Given that, we need
> to replicate the logic of sticking the Gather and FinalizeAggregate steps on
> top at a later stage. That is exactly what create_partition_agg_paths() does.
> Am I missing something here?

The problem is that create_partition_agg_paths() is doing *exactly* the
same thing that add_paths_to_grouping_rel() is already doing inside
the blocks that say if (grouped_rel->partial_pathlist).  We don't need
two copies of that code.  Both of those places expect to take a
partial path that has been partially aggregated and produce a
non-partial path that is fully aggregated.  We do not need or want two
copies of that code.

Here's another way to look at it.  We have four kinds of things.

1. Partially aggregated partial paths
2. Partially aggregated non-partial paths
3. Fully aggregated partial paths
4. Fully aggregated non-partial paths

The current code only ever generates paths of type #1 and #4; this
patch will add paths of type #2 as well, and maybe also type #3.  But
the way you've got it, the existing paths of type #1 go into the
grouping_rel's partial_pathlist, and the new paths of type #1 go into
the OtherUpperPathExtraData's partial_paths list.  Maybe there's a
good reason why we should keep them separate, but I'm inclined to
think they should all be going into the same list.

>> Overall, what I'm trying to argue for here is making this feature look
>> less like its own separate thing and more like part of the general
>> mechanism we've already got: partial paths would turn into regular
>> paths via generate_gather_paths(), and partially aggregated paths
>> would turn into fully aggregated paths by adding FinalizeAggregate.
>> The existing special case that allows us to build a non-partial, fully
>> aggregated path from a partial, partially-aggregated path would be
>> preserved.
>>
>> I think this would probably eliminate some other problems I see in the
>> existing design as well.  For example, create_partition_agg_paths
>> doesn't consider using Gather Merge, but that might be a win.
>
> An Append path is always unsorted and has no pathkeys. Thus Gather Merge over
> an Append path seems infeasible, doesn't it?

We currently never generate an Append path with pathkeys, but we do
generate MergeAppend paths with pathkeys, as in the following example:

rhaas=# create table foo (a int, b text) partition by range (a);
CREATE TABLE
rhaas=# create index on foo (a);
CREATE INDEX
rhaas=# create table foo1 partition of foo for values from (0) to (1000000);
CREATE TABLE
rhaas=# create table foo2 partition of foo for values from (1000000)
to (2000000);
CREATE TABLE
rhaas=# select * from foo foo order by a;
 a | b
---+---
(0 rows)
rhaas=# explain select * from foo foo order by a;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Merge Append  (cost=0.32..145.47 rows=2540 width=36)
   Sort Key: foo.a
   ->  Index Scan using foo1_a_idx on foo1 foo  (cost=0.15..63.20
rows=1270 width=36)
   ->  Index Scan using foo2_a_idx on foo2 foo_1  (cost=0.15..63.20
rows=1270 width=36)
(4 rows)

Actually, in this example, the MergeAppend could be safely converted
into an Append, because the partitions are in bound order, and
somebody already proposed a patch for that.
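The bound-order observation can be illustrated with a small sketch (hypothetical data; Python stands in for the executor here): when range partitions are listed in bound order and each child emits rows in key order, plain concatenation is already sorted, so Append and MergeAppend produce the same output:

```python
# Illustrative only: two range partitions with assumed bounds [0,10) and [10,∞).
import heapq

foo1 = [0, 5, 9]      # child scan output, sorted, all keys < 10
foo2 = [10, 12, 99]   # child scan output, sorted, all keys >= 10

appended = foo1 + foo2                       # plain Append in bound order
merge_append = list(heapq.merge(foo1, foo2)) # MergeAppend: heap-based merge

# Because the partitions' key ranges don't interleave, both are identical.
assert appended == merge_append == sorted(foo1 + foo2)
```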

The point is that we want to be able to get plans like this:

Finalize GroupAggregate
-> Gather Merge
  -> MergeAppend
    -> Partial GroupAggregate
      -> Parallel Index Scan on t1
    -> Partial GroupAggregate
      -> Parallel Index Scan on t2
    -> Partial GroupAggregate
      -> Parallel Index Scan on t3

If we only consider Gather, not Gather Merge, when turning a partially
aggregated partial path into a non-partial path, then we end up having
to insert a Sort node if we want to perform a Finalize GroupAggregate
step.
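A toy sketch of why that plan shape avoids a Sort (illustrative Python, not planner code; heapq.merge plays the role of Gather Merge): each worker's partially aggregated output is already sorted by group key, so an order-preserving merge lets the Finalize GroupAggregate combine adjacent partial states directly:

```python
# Illustrative only: "Gather Merge" over sorted partial aggregates.
import heapq
from itertools import groupby

w1 = [("a", 2), ("b", 1)]   # worker 1: sorted (group, partial count)
w2 = [("a", 1), ("c", 3)]   # worker 2: sorted (group, partial count)

# Gather Merge: order-preserving merge of the sorted worker streams.
merged = heapq.merge(w1, w2)

# Finalize GroupAggregate: adjacent rows with equal keys are combined,
# with no Sort node needed because the input is already ordered.
final = {g: sum(c for _, c in rows)
         for g, rows in groupby(merged, key=lambda t: t[0])}

assert final == {"a": 3, "b": 1, "c": 3}
```

With a plain Gather instead, the worker outputs would arrive in arbitrary interleaving, forcing a Sort before a Finalize GroupAggregate could run.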

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Fri, Feb 2, 2018 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 1, 2018 at 8:59 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I wrote a patch for this (on current HEAD) and attached separately here.
> Please have a look.

Yes, this is approximately what I had in mind, though it needs more
work (e.g. it doesn't remove the clearing of the grouped_rel's
partial_pathlist, which should no longer be necessary; also, it needs
substantial comment updates).

That was just a quick patch to make sure this is what you meant.
Yes, it needs some more work, as you suggested, along with comment updates.
 

>> 1. Parallel partition-wise aggregate, grouping key doesn't contain
>> partition key.  This should just be a matter of adding additional
>> Append paths to partially_grouped_rel->partial_pathlist.  The existing
>> code already knows how to stick a Gather and FinalizeAggregate step on
>> top of that, and I don't see why that logic would need any
>> modification or addition.  An Append of child partial-grouping paths
>> should be producing the same output as a partial grouping of an
>> Append, except that the former case might produce more separate groups
>> that need to be merged; but that should be OK: we can just throw all
>> the paths into the same path list and let the cheapest one win.
>
> For any partial aggregation, we need to add a finalization step after we are
> done with the Append, i.e. after add_paths_to_append_rel(). Given that, we need
> to replicate the logic of sticking the Gather and FinalizeAggregate steps on
> top at a later stage. That is exactly what create_partition_agg_paths() does.
> Am I missing something here?

The problem is that create_partition_agg_paths() is doing *exactly* the
same thing that add_paths_to_grouping_rel() is already doing inside
the blocks that say if (grouped_rel->partial_pathlist).  We don't need
two copies of that code.  Both of those places expect to take a
partial path that has been partially aggregated and produce a
non-partial path that is fully aggregated.  We do not need or want two
copies of that code.

OK. Got it.

Will try to find a common place for them and will also check how it goes with your suggested design change.
 

Here's another way to look at it.  We have four kinds of things.

1. Partially aggregated partial paths
2. Partially aggregated non-partial paths
3. Fully aggregated partial paths
4. Fully aggregated non-partial paths

The current code only ever generates paths of type #1 and #4; this
patch will add paths of type #2 as well, and maybe also type #3.  But
the way you've got it, the existing paths of type #1 go into the
grouping_rel's partial_pathlist, and the new paths of type #1 go into
the OtherUpperPathExtraData's partial_paths list.  Maybe there's a
good reason why we should keep them separate, but I'm inclined to
think they should all be going into the same list.

The new paths are specific to partition-wise aggregates, so I thought it better to keep them separate, without interfering with the grouped_rel pathlist/partial_pathlist. And, as you said, I didn't find a better place than its own structure.


>> Overall, what I'm trying to argue for here is making this feature look
>> less like its own separate thing and more like part of the general
>> mechanism we've already got: partial paths would turn into regular
>> paths via generate_gather_paths(), and partially aggregated paths
>> would turn into fully aggregated paths by adding FinalizeAggregate.
>> The existing special case that allows us to build a non-partial, fully
>> aggregated path from a partial, partially-aggregated path would be
>> preserved.
>>
>> I think this would probably eliminate some other problems I see in the
>> existing design as well.  For example, create_partition_agg_paths
>> doesn't consider using Gather Merge, but that might be a win.
>
> An Append path is always unsorted and has no pathkeys. Thus Gather Merge over
> an Append path seems infeasible, doesn't it?

We currently never generate an Append path with pathkeys, but we do
generate MergeAppend paths with pathkeys, as in the following example:

rhaas=# create table foo (a int, b text) partition by range (a);
CREATE TABLE
rhaas=# create index on foo (a);
CREATE INDEX
rhaas=# create table foo1 partition of foo for values from (0) to (1000000);
CREATE TABLE
rhaas=# create table foo2 partition of foo for values from (1000000)
to (2000000);
CREATE TABLE
rhaas=# select * from foo foo order by a;
 a | b
---+---
(0 rows)
rhaas=# explain select * from foo foo order by a;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Merge Append  (cost=0.32..145.47 rows=2540 width=36)
   Sort Key: foo.a
   ->  Index Scan using foo1_a_idx on foo1 foo  (cost=0.15..63.20
rows=1270 width=36)
   ->  Index Scan using foo2_a_idx on foo2 foo_1  (cost=0.15..63.20
rows=1270 width=36)
(4 rows)

Actually, in this example, the MergeAppend could be safely converted
into an Append, because the partitions are in bound order, and
somebody already proposed a patch for that.

The point is that we want to be able to get plans like this:

Finalize GroupAggregate
-> Gather Merge
  -> MergeAppend
    -> Partial GroupAggregate
      -> Parallel Index Scan on t1
    -> Partial GroupAggregate
      -> Parallel Index Scan on t2
    -> Partial GroupAggregate
      -> Parallel Index Scan on t3


add_paths_to_append_rel() -> generate_mergeappend_paths() does not consider the partial_pathlist. Thus we will never see a MergeAppend over the parallel scans given by the partial_pathlist, and so a plan like:
-> Gather Merge
  -> MergeAppend
is not possible with current HEAD.

Are you suggesting we should implement that here? I think that itself is a separate task.
 
If we only consider Gather, not Gather Merge, when turning a partially
aggregated partial path into a non-partial path, then we end up having
to insert a Sort node if we want to perform a Finalize GroupAggregate
step.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 8:25 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>> The problem is that create_partition_agg_paths() is doing *exactly* the
>> same thing that add_paths_to_grouping_rel() is already doing inside
>> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
>> two copies of that code.  Both of those places expect to take a
>> partial path that has been partially aggregated and produce a
>> non-partial path that is fully aggregated.  We do not need or want two
>> copies of that code.
>
> OK. Got it.
>
> Will try to find a common place for them and will also check how it goes
> with your suggested design change.
>
>> Here's another way to look at it.  We have four kinds of things.
>>
>> 1. Partially aggregated partial paths
>> 2. Partially aggregated non-partial paths
>> 3. Fully aggregated partial paths
>> 4. Fully aggregated non-partial paths

So in the new scheme I'm proposing, you've got a partially_grouped_rel
and a grouped_rel.  So all paths of type #1 go into
partially_grouped_rel->partial_pathlist, paths of type #2 go into
partially_grouped_rel->pathlist, type #3 (if we have any) goes into
grouped_rel->partial_pathlist, and type #4 goes into
grouped_rel->pathlist.
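That mapping can be restated compactly (an illustrative table only; these are planner fields, not a runnable API):

```python
# Illustrative summary of where each path kind lands in the new scheme.
path_destination = {
    ("partially aggregated", "partial"):     "partially_grouped_rel->partial_pathlist",
    ("partially aggregated", "non-partial"): "partially_grouped_rel->pathlist",
    ("fully aggregated",     "partial"):     "grouped_rel->partial_pathlist",
    ("fully aggregated",     "non-partial"): "grouped_rel->pathlist",
}

# Four kinds of paths, four destinations -- nothing goes anywhere else.
assert len(path_destination) == 4
```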

> add_paths_to_append_rel() -> generate_mergeappend_paths() does not consider
> partial_pathlist. Thus we will never see MergeAppend over parallel scan
> given by partial_pathlist. And thus plan like:
> -> Gather Merge
>   -> MergeAppend
> is not possible with current HEAD.
>
> Are you suggesting we should implement that here? I think that itself is a
> separate task.

Oh, I didn't realize that wasn't working already.  I agree that it's a
separate task from this patch, but it's really too bad that it doesn't
already work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi,

In this attached version, I have rebased my changes over the new design of partially_grouped_rel. The preparatory changes adding partially_grouped_rel are in 0001.

Also, to minimize finalization code duplication, I have refactored it into two separate functions, finalize_sorted_partial_agg_path() and finalize_hashed_partial_agg_path(). I needed two separate functions because the current path creation order is like:
    Sort Agg Path
    Sort Agg Path - Parallel Aware (Finalization needed here)
    Hash Agg Path
    Hash Agg Path - Parallel Aware (Finalization needed here)
And if we club those finalizations together, then the path creation order will change, which may result in changes to existing plans.
Let me know if that's OK; I will then merge them together, as they are distinct anyway. These changes are part of 0002.

0003 - 0006 are refactoring patches as before.

0007 is the main patch per the new design. I have removed create_partition_agg_paths() altogether, as the finalization code is now reused. Also, I renamed preferFullAgg to forcePartialAgg, as we forcefully need a partial path from the nested level if the parent is doing a partial aggregation. add_single_path_to_append_rel() no longer exists, and there is also no need to pass OtherUpperPathExtraData to add_paths_to_append_rel().

0008 - 0009 are the testcase and postgres_fdw changes.

Please have a look at new changes and let me know if I missed any.

Thanks


On Fri, Feb 2, 2018 at 7:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Feb 2, 2018 at 8:25 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>> The problem is that create_partition_agg_paths() is doing *exactly* the
>> same thing that add_paths_to_grouping_rel() is already doing inside
>> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
>> two copies of that code.  Both of those places expect to take a
>> partial path that has been partially aggregated and produce a
>> non-partial path that is fully aggregated.  We do not need or want two
>> copies of that code.
>
> OK. Got it.
>
> Will try to find a common place for them and will also check how it goes
> with your suggested design change.
>
>> Here's another way to look at it.  We have four kinds of things.
>>
>> 1. Partially aggregated partial paths
>> 2. Partially aggregated non-partial paths
>> 3. Fully aggregated partial paths
>> 4. Fully aggregated non-partial paths

So in the new scheme I'm proposing, you've got a partially_grouped_rel
and a grouped_rel.  So all paths of type #1 go into
partially_grouped_rel->partial_pathlist, paths of type #2 go into
partially_grouped_rel->pathlist, type #3 (if we have any) goes into
grouped_rel->partial_pathlist, and type #4 goes into
grouped_rel->pathlist.

> add_paths_to_append_rel() -> generate_mergeappend_paths() does not consider
> partial_pathlist. Thus we will never see MergeAppend over parallel scan
> given by partial_pathlist. And thus plan like:
> -> Gather Merge
>   -> MergeAppend
> is not possible with current HEAD.
>
> Are you suggesting we should implement that here? I think that itself is a
> separate task.

Oh, I didn't realize that wasn't working already.  I agree that it's a
separate task from this patch, but it's really too bad that it doesn't
already work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Rafia Sabih
Date:


On Thu, Feb 8, 2018 at 6:35 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi,

In this attached version, I have rebased my changes over the new design of partially_grouped_rel. The preparatory changes adding partially_grouped_rel are in 0001.

Also, to minimize finalization code duplication, I have refactored it into two separate functions, finalize_sorted_partial_agg_path() and finalize_hashed_partial_agg_path(). I needed two separate functions because the current path creation order is like:
    Sort Agg Path
    Sort Agg Path - Parallel Aware (Finalization needed here)
    Hash Agg Path
    Hash Agg Path - Parallel Aware (Finalization needed here)
And if we club those finalizations together, then the path creation order will change, which may result in changes to existing plans.
Let me know if that's OK; I will then merge them together, as they are distinct anyway. These changes are part of 0002.

0003 - 0006 are refactoring patches as before.

0007 is the main patch per the new design. I have removed create_partition_agg_paths() altogether, as the finalization code is now reused. Also, I renamed preferFullAgg to forcePartialAgg, as we forcefully need a partial path from the nested level if the parent is doing a partial aggregation. add_single_path_to_append_rel() no longer exists, and there is also no need to pass OtherUpperPathExtraData to add_paths_to_append_rel().

0008 - 0009 are the testcase and postgres_fdw changes.

Please have a look at new changes and let me know if I missed any.

Thanks
 
I was testing this patch for TPC-H benchmarking and came across the following results,

Q1 completes in 229 secs with the patch and in 66 secs without it. It looks like with this patch the time of the parallel seq scan itself is elevated for some of the partitions. Notice that for partitions lineitem_3, lineitem_7, lineitem_10, and lineitem_5 it is some 13 secs, which was somewhere around 5 secs on head.

Q6 completes in some 7 secs with the patch, while it takes 4 secs without it. This is mainly because, with the new parallel append, the parallel operator below it (a parallel index scan in this case) is not used; on head, however, it was an append of all the parallel index scans, which was saving quite some time.

Q18 takes some 390 secs with patch and some 147 secs without it. 

The experimental setup for these tests is as follows,
work_mem = 500MB
shared_buffers = 10GB
effective_cache_size = 4GB
seq_page_cost = random_page_cost = 0.01
enable_partition_wise_join = off

Partitioning info:
Total 10 partitions each on tables lineitem and orders, with the partitioning keys being l_orderkey and o_orderkey respectively.

Please find the attached file for the explain analyse outputs of each of the reported queries.
--
Regards,
Rafia Sabih

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Feb 13, 2018 at 12:37 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
 
I was testing this patch for TPC-H benchmarking and came across the following results,

Thanks Rafia for testing this with TPC-H benchmarking.
 

Q1 completes in 229 secs with the patch and in 66 secs without it. It looks like with this patch the time of the parallel seq scan itself is elevated for some of the partitions. Notice that for partitions lineitem_3, lineitem_7, lineitem_10, and lineitem_5 it is some 13 secs, which was somewhere around 5 secs on head.

Q6 completes in some 7 secs with the patch, while it takes 4 secs without it. This is mainly because, with the new parallel append, the parallel operator below it (a parallel index scan in this case) is not used; on head, however, it was an append of all the parallel index scans, which was saving quite some time.
 
I see that partition-wise aggregate plan too uses parallel index, am I missing something?
 

Q18 takes some 390 secs with patch and some 147 secs without it. 

This looks strange. This patch set does not touch parallel or seq scan as such. I am not sure why this is happening. The explain plans of all three queries show much higher execution times for the parallel/seq scans.

However, do you see similar behaviour with patches applied, "enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?
 
Also, does rest of the queries perform better with partition-wise aggregates?
 

The experimental setup for these tests is as follows,
work_mem = 500MB
shared_buffers = 10GB
effective_cache_size = 4GB
seq_page_cost = random_page_cost = 0.01
enable_partition_wise_join = off

Partitioning info:
Total 10 partitions on tables - lineitem and orders each with partitioning key being l_orderkey and o_orderkey respectively.

Please find the attached file for explain analyse outputs of each of the reported query.
--
Regards,
Rafia Sabih



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Rafia Sabih
Date:
On Tue, Feb 13, 2018 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
 
I see that partition-wise aggregate plan too uses parallel index, am I missing something?
 
You're right, I missed that, oops. 

Q18 takes some 390 secs with patch and some 147 secs without it. 

This looks strange. This patch set does not touch parallel or seq scan as such. I am not sure why this is happening. The explain plans of all three queries show much higher execution times for the parallel/seq scans.

Yeah strange it is.
However, do you see similar behaviour with patches applied, "enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?

I tried that for query 18: with the patch and enable_partition_wise_agg = off, the query completes in some 270 secs. You may find the explain analyse output for it in the attached file. I noticed that on head the query plan had a parallel hash join, however with the patch and no partition-wise agg it uses nested loop joins. This might be the issue.
 
Also, does rest of the queries perform better with partition-wise aggregates?
 
As far as this setting goes, there wasn't any other query using partition-wise-agg, so, no.

BTW, just an FYI, this experiment is on scale factor 20.

--
Regards,
Rafia Sabih

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Feb 14, 2018 at 12:17 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
On Tue, Feb 13, 2018 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
 
I see that partition-wise aggregate plan too uses parallel index, am I missing something?
 
You're right, I missed that, oops. 

Q18 takes some 390 secs with patch and some 147 secs without it. 

This looks strange. This patch set does not touch parallel or seq scan as such. I am not sure why this is happening. The explain plans of all three queries show much higher execution times for the parallel/seq scans.

Yeah strange it is.

Off-list, I have asked Rafia to give me access to the perf machine where she is doing this benchmarking, so I can see what's going wrong.
Thanks, Rafia, for the details.

What I have observed is that there are two source trees, one with HEAD and the other with HEAD+PWA. However, the configuration switches were different. The source with HEAD+PWA has the CFLAGS="-ggdb3 -O0" CXXFLAGS="-ggdb3 -O0" flags in addition to those of the other source, i.e. HEAD+PWA is configured with debugging enabled and optimization disabled, which accounts for the slowness.

I have run EXPLAIN for these three queries on both source trees with exactly the same configuration switches, and I don't find any slowness with the PWA patch set.

Thus, it would be good if you re-run the benchmark keeping the configuration switches the same on both source trees and share the results.

Thanks

 
However, do you see similar behaviour with patches applied, "enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?

I tried that for query 18: with the patch and enable_partition_wise_agg = off, the query completes in some 270 secs. You may find the explain analyse output for it in the attached file. I noticed that on head the query plan had a parallel hash join, however with the patch and no partition-wise agg it uses nested loop joins. This might be the issue.
 
Also, does rest of the queries perform better with partition-wise aggregates?
 
As far as this setting goes, there wasn't any other query using partition-wise-agg, so, no.

BTW, just an FYI, this experiment is on scale factor 20.

--
Regards,
Rafia Sabih


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
Commit 2fb1abaeb016aeb45b9e6d0b81b7a7e92bb251b9 changed
enable_partition_wise_join to enable_partitionwise_join. This patch
too should use enable_partitionwise_agg instead of
enable_partition_wise_agg.


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> 0003 - 0006 are refactoring patches as before.

I have committed 0006 with some modifications.  In particular, [1] I
revised the comments and formatting; [2] I made cost_merge_append()
add cpu_tuple_cost * APPEND_CPU_COST_MULTIPLIER in lieu of, rather
than in addition to, cpu_operator_cost; and [3] I modified the
regression test so that the overall plan shape didn't change.

[2] was proposed upthread, but not adopted.  I had the same thought
while reading the patch (having forgotten the previous discussion) and
that seemed like a good enough reason to do it according to the
previous proposal.  If there is a good reason to think MergeAppend
needs that extra cost increment to be fairly-costed, I don't see it on
this thread.

[3] was also remarked upon upthread -- Ashutosh mentioned that the
change in plan shape was "sad" but there was no further discussion of
the matter.  I also found it sad; hence the change.  This is, by the
way, an interesting illustration of how partition-wise join could
conceivably lose.  Up until now I've thought that it seemed to be a
slam dunk to always win or at least break even, but if you've got a
relatively unselective join, such that the output is much larger than
either input, then doing the join partition-wise means putting all of
the output rows through an Append node, whereas doing it the normal
way means putting only the input rows through Append nodes.  If the
smaller number of rows being joined at one time doesn't help -- e.g.
all of the inner rows across all partitions fit in a tiny little hash
table -- then we're just feeding more rows through the Append for no
gain.  Not a common case, perhaps, but not impossible.
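The "more rows through the Append" effect is easy to quantify with made-up numbers (hypothetical partition and row counts, not measurements from this thread):

```python
# Hypothetical figures: 10 partitions per side, 1,000 input rows per
# partition on each side, and an unselective join producing 50,000 output
# rows per partition pair.
partitions = 10
input_rows_per_partition = 1000
output_rows_per_partition_join = 50000

# Normal join: each side is assembled by its own Append, so only *input*
# rows flow through Append nodes; the join runs above them.
rows_through_append_normal = 2 * partitions * input_rows_per_partition

# Partition-wise join: the per-partition join results are concatenated,
# so every *output* row flows through the Append.
rows_through_append_pwj = partitions * output_rows_per_partition_join

assert rows_through_append_normal == 20000
assert rows_through_append_pwj == 500000
```

If joining fewer rows at a time buys nothing (say, all inner rows fit in one small hash table anyway), the partition-wise plan pushes 25x as many rows through Append nodes for no offsetting gain.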

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> In this attached version, I have rebased my changes over new design of
> partially_grouped_rel. The preparatory changes of adding
> partially_grouped_rel are in 0001.

I spent today hacking in 0001; results attached.  The big change from
your version is that this now uses generate_gather_paths() to add
Gather/Gather Merge nodes (except in the case where we sort by group
pathkeys and then Gather Merge) rather than keeping all of the bespoke
code.  That turned out to be a bit less elegant than I would have liked
-- I had to add an override_rows argument to generate_gather_paths to make
it work.  But overall I think this is still a big improvement, since
it lets us share code instead of duplicating it.  Also, it potentially
lets us add partially-aggregated but non-parallel paths into
partially_grouped_rel->pathlist and that should Just Work; they will
get the Finalize Aggregate step but not the Gather.  With your
arrangement that wouldn't work.

Please review/test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi Robert,

On Fri, Feb 23, 2018 at 2:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> In this attached version, I have rebased my changes over new design of
> partially_grouped_rel. The preparatory changes of adding
> partially_grouped_rel are in 0001.

I spent today hacking in 0001; results attached.  The big change from
your version is that this now uses generate_gather_paths() to add
Gather/Gather Merge nodes (except in the case where we sort by group
pathkeys and then Gather Merge) rather than keeping all of the bespoke
code.  That turned out to be a bit less elegant than I would have liked
-- I had to add an override_rows argument to generate_gather_paths to make
it work.  But overall I think this is still a big improvement, since
it lets us share code instead of duplicating it.  Also, it potentially
lets us add partially-aggregated but non-parallel paths into
partially_grouped_rel->pathlist and that should Just Work; they will
get the Finalize Aggregate step but not the Gather.  With your
arrangement that wouldn't work.

Please review/test.

I have reviewed and tested the patch and here are my couple of points:

     /*
-     * If the input rel belongs to a single FDW, so does the grouped rel.
+     * If the input rel belongs to a single FDW, so does the grouped rel. Same
+     * for the partially_grouped_rel.
      */
     grouped_rel->serverid = input_rel->serverid;
     grouped_rel->userid = input_rel->userid;
     grouped_rel->useridiscurrent = input_rel->useridiscurrent;
     grouped_rel->fdwroutine = input_rel->fdwroutine;
+    partially_grouped_rel->serverid = input_rel->serverid;
+    partially_grouped_rel->userid = input_rel->userid;
+    partially_grouped_rel->useridiscurrent = input_rel->useridiscurrent;
+    partially_grouped_rel->fdwroutine = input_rel->fdwroutine;


In my earlier mail, where I posted a patch for these partially grouped rel changes, I forgot to ask a question about this.
I was unclear about the above changes, and thus passed grouped_rel whenever we wanted to fetch these details while working on partially_grouped_rel.

One idea I thought about is to memcpy the struct once we have set all required fields for grouped_rel so that we don't have to do similar stuff for partially_grouped_rel.

---

+             * Insert a Sort node, if required.  But there's no point in
+             * sorting anything but the cheapest path.
              */
-            if (root->group_pathkeys)
+            if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+            {
+                if (path != linitial(partially_grouped_rel->pathlist))
+                    continue;

Paths in pathlist are added by add_path(). Though the pathlist is kept sorted with the cheapest total path first, we generally use RelOptInfo->cheapest_total_path instead of the first entry, unlike the partial paths case. But here you use the first entry, as we do for partial paths. Would it be better to use the cheapest total path from partially_grouped_rel? This will require calling set_cheapest on partially_grouped_rel before we call this function.
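For reference, the convention being appealed to can be sketched like this (a toy model whose names mimic the planner; none of it is PostgreSQL code):

```python
# Toy model: add_path() keeps pathlist ordered by total cost, and
# set_cheapest() caches the cheapest path in cheapest_total_path, which is
# what most callers consult rather than peeking at list position.

class Path:
    def __init__(self, name, total_cost):
        self.name = name
        self.total_cost = total_cost

class RelOptInfo:
    def __init__(self):
        self.pathlist = []
        self.cheapest_total_path = None

    def add_path(self, path):
        # Real add_path() inserts in cost order (and prunes); sorting here
        # models the same invariant.
        self.pathlist.append(path)
        self.pathlist.sort(key=lambda p: p.total_cost)

    def set_cheapest(self):
        assert self.pathlist, "rel has no paths"
        self.cheapest_total_path = self.pathlist[0]

partially_grouped_rel = RelOptInfo()
partially_grouped_rel.add_path(Path("Sort+GroupAgg", 120.0))
partially_grouped_rel.add_path(Path("HashAgg", 95.0))
partially_grouped_rel.set_cheapest()

assert partially_grouped_rel.cheapest_total_path.name == "HashAgg"
```

The suggestion above is simply to establish this invariant (call set_cheapest) and then consult the cached field, as non-partial-path code elsewhere does.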

Attached is a top-up patch doing this, along with a few indentation fixes.

Rest of the changes look good to me.

Once this gets in, I will re-base my other patches accordingly.

And, thanks for committing 0006.

Thanks
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Mon, Feb 26, 2018 at 6:38 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> One idea I thought about is to memcpy the struct once we have set all
> required fields for grouped_rel so that we don't have to do similar stuff
> for partially_grouped_rel.

I think that would be a poor idea.  We want to copy a few specific
fields, not everything, and copying those fields is cheap, because
they are just simple assignment statements.  I think memcpy()'ing the
whole structure would be using a sledgehammer to solve a problem for
which a scalpel is more suited.

> Paths in pathlist are added by add_path(). Though the pathlist is kept
> sorted with the cheapest total path first, we generally use
> RelOptInfo->cheapest_total_path instead of the first entry, unlike the
> partial paths case. But here you use the first entry, as we do for partial
> paths. Would it be better to use the cheapest total path from
> partially_grouped_rel? This will require calling set_cheapest on
> partially_grouped_rel before we call this function.

Hmm, I guess that seems like a reasonable approach, although I am not
sure it matters much either way.

> Attached top-up patch doing this along with few indentation fixes.

I don't see much point to the change in generate_gather_paths -- that
line is only 77 characters long.

Committed after incorporating your other fixes and updating the
optimizer README.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Rafia Sabih
Date:


On Wed, Feb 14, 2018 at 8:35 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:


On Wed, Feb 14, 2018 at 12:17 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
On Tue, Feb 13, 2018 at 6:21 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
 
I see that partition-wise aggregate plan too uses parallel index, am I missing something?
 
You're right, I missed that, oops. 

Q18 takes some 390 secs with patch and some 147 secs without it. 

This looks strange. This patch set does not touch parallel or seq scans as such, so I am not sure why this is happening. The EXPLAIN output for all three queries shows much higher execution time for the parallel/seq scans.

Yeah strange it is.

Off-list, I asked Rafia for access to the perf machine on which she is doing this benchmarking, to see what's going wrong.
Thanks Rafia for the details.

What I have observed is that there are two source trees, one with HEAD and the other with HEAD+PWA, but the configuration switches were different. The HEAD+PWA tree has CFLAGS="-ggdb3 -O0" CXXFLAGS="-ggdb3 -O0" in addition to the switches used for the other tree, i.e. HEAD+PWA is configured with debugging enabled and optimization disabled, which accounts for the slowness.

I have run EXPLAIN for these three queries on both source trees with exactly the same configuration switches, and I don't find any slowness with the PWA patch-set.

Thus, it would be good if you re-run the benchmark keeping the configuration switches the same on both source trees, and share the results.

Thanks

Interesting. I checked again keeping the configure flags the same for both repos,
on head:
CONFIGURE = '--enable-cassert' 'CFLAGS=-ggdb3 -O0' 'CXXFLAGS=-ggdb3 -O0' 'prefix=/data/rafia.sabih/pg_head/install/'
On head+PWA:
CONFIGURE = '--enable-cassert' 'CFLAGS=-ggdb3 -O0' 'CXXFLAGS=-ggdb3 -O0' 'prefix=/data/rafia.sabih/pg_part_pa/install/'

The queries I previously reported now perform the same as on head with the above-mentioned configuration. However, I experimented further with partitionwise_join set to true, and found the following cases of regression:
Q17 was taking some 1400 secs on head, but with PWA it takes some 1600 secs; it looks like an Append of scan+aggregate turns out to be costlier than an Append of just scans.
Q20 took 470 secs on head, and with PWA it takes 630 secs; the execution plan changed a lot, and in particular the patched plan is not using a parallel bitmap heap scan on the lineitem table.

The experimental settings were kept same as before with the change of partitionwise_join = 1.

Please find the attached zip for the explain analyse outputs.
 
However, do you see similar behaviour with patches applied, "enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?

I tried that for query 18: with the patch and enable_partition_wise_agg = off, the query completes in some 270 secs. You may find the EXPLAIN ANALYZE output for it in the attached file. I noticed that on head the query plan had a parallel hash join, whereas with the patch and no partition-wise aggregation it is using nested loop joins. This might be the issue.
 
Also, does rest of the queries perform better with partition-wise aggregates?
 
As far as this setting goes, there wasn't any other query using partition-wise-agg, so, no.

BTW, just an FYI, this experiment is on scale factor 20.

--
Regards,
Rafia Sabih


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company




--
Regards,
Rafia Sabih
Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi Robert,


On Mon, Feb 26, 2018 at 8:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed after incorporating your other fixes and updating the
optimizer README.

Thanks Robert.

Off-list, Rajkumar has reported an issue: when enable_hashagg is set to false and a Gather Merge path is chosen, planning ends up in the error "ERROR:  Aggref found in non-Agg plan node".

I had a look at his testcase and observed that when we create a Gather Merge path over the cheapest partial path by sorting it explicitly (since generate_gather_paths won't consider it), we accidentally used the cheapest partial path from the input_rel to create the Gather Merge; instead we need the cheapest partial path from the partially_grouped_rel.

Attached fix_aggref_in_non-agg_error.patch fixes this. test_for_aggref_in_non-agg_error.patch has a testcase reported by Rajkumar, which I have added in aggregates.sql.

While doing so, I noticed a few possible cleanups; those are in misc_cleanup.patch.

---

While rebasing my partitionwise aggregate changes, I observed that when we create partial aggregation paths for a child partition, we don't need to add a Gather or Gather Merge on top, since we first want to append them all and only then stick a Gather on the result. So it would be better to move that code into a separate function that we can call from the required places.

I have attached patch (create_non_partial_paths.patch) for it including all above fix.

Thanks


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Tue, Feb 27, 2018 at 4:29 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Hi Robert,
> I had a look over his provided testcase and observed that when we create a
> Gather Merge path over a cheapest partial path by sorting it explicitly as
> generate_gather_paths won't consider it, we accidentally used cheapest
> partial path from the input_rel to create a Gather Merge; instead we need a
> cheapest partial path from the partially_grouped_rel.
>
> Attached fix_aggref_in_non-agg_error.patch fixing this.

Oops.  Thanks, committed.

> test_for_aggref_in_non-agg_error.patch has a testcase reported by Rajkumar
> which I have added in a aggregates.sql.

Didn't commit this; I think that's overkill.

> While doing so, I have observed few cleanup changes, added those in
> misc_cleanup.patch.

Committed those.

> While re-basing my partitionwise aggregate changes, I observed that when we
> want to create partial aggregation paths for a child partition, we don't
> need to add Gather or Gather Merge on top of it as we first want to append
> them all and then want to stick a gather on it. So it will be better to have
> that code part in a separate function so that we can call it from required
> places.
>
> I have attached patch (create_non_partial_paths.patch) for it including all
> above fix.

I don't like that very much.  For one thing, the name
create_non_partial_paths() is not very descriptive at all.  For
another thing, it somewhat renders add_paths_to_partial_grouping_rel()
a misnomer, as that function then adds only partial paths.  I think
what you should just do is have the main patch add a test for
rel->reloptkind == RELOPT_UPPER_REL; if true, add the Gather paths; if
not, skip it.  Then it will be skipped for RELOPT_OTHER_UPPER_REL
which is what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Mon, Feb 26, 2018 at 8:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed after incorporating your other fixes and updating the
optimizer README.

Attached new patchset after rebasing my changes over these changes and on latest HEAD.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 5:34 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached new patchset after rebasing my changes over these changes and on
> latest HEAD.

+        * We have already created a Gather or Gather Merge path atop cheapest
+        * partial path. Thus the partial path referenced by the Gather node
+        * needs to be preserved as adding new partial paths in same rel may
+        * delete this referenced path. To do this we need to clear the
+        * partial_pathlist from the partially_grouped_rel as we may add
+        * partial paths again while doing partitionwise aggregation. Keeping
+        * older partial path intact seems reasonable too as it might possible
+        * that the final path chosen which is using it wins, but the
+        * underneath partial path is not the cheapest one.

This isn't a good design.  You shouldn't create a Gather or Gather
Merge node until all partial paths have been added.  I mean, the point
is to put a Gather node on top of the cheapest path, not the path that
is currently the cheapest but might not actually be the cheapest once
we've added them all.

+add_gather_or_gather_merge(PlannerInfo *root,

Please stop picking generic function names for functions that have
very specific purposes.  I don't really think that you need this to be
a separate function at all, but it is certainly NOT a
general-purpose function for adding a Gather or Gather Merge node.

+        /*
+         * Collect statistics about aggregates for estimating costs of
+         * performing aggregation in parallel or in partial, if not already
+         * done. We use same cost for all the children as they will be same
+         * anyways.
+         */

If it only needs to be done once, do we really have to have it inside
the loop?  I see that you're using the varno-translated
partial_target->exprs and target->exprs, but if the specific varnos
don't matter, why not just use the untranslated version of the targets
before entering the loop?  And if the specific varnos do matter, then
presumably you need to do it every time.

This is not a full review, but I'm out of time for right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This is not a full review, but I'm out of time for right now.

Another thing I see here now is that create_grouping_paths() and
create_child_grouping_paths() are extremely similar.  Isn't there some
way we can refactor things so that we can reuse the code instead of
duplicating it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Fri, Mar 2, 2018 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 1, 2018 at 5:34 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Attached new patchset after rebasing my changes over these changes and on
> latest HEAD.

+        * We have already created a Gather or Gather Merge path atop cheapest
+        * partial path. Thus the partial path referenced by the Gather node
+        * needs to be preserved as adding new partial paths in same rel may
+        * delete this referenced path. To do this we need to clear the
+        * partial_pathlist from the partially_grouped_rel as we may add
+        * partial paths again while doing partitionwise aggregation. Keeping
+        * older partial path intact seems reasonable too as it might possible
+        * that the final path chosen which is using it wins, but the
+        * underneath partial path is not the cheapest one.

This isn't a good design.  You shouldn't create a Gather or Gather
Merge node until all partial paths have been added.  I mean, the point
is to put a Gather node on top of the cheapest path, not the path that
is currently the cheapest but might not actually be the cheapest once
we've added them all.

To be honest, I didn't know that we should not generate a Gather or Gather Merge until all possible partial paths are in place. I realized it recently while debugging an issue reported by Rajkumar off-list. While working on that fix, what I observed is:
- I have the cheapest partial path with cost, say, 10; a Gather on it increases the total cost to 11.
- Later I add a partial path with a cost of, say, 9, but a Gather on it results in a total cost of 12.
This means the first Gather path is the cheapest one even though the partial path underneath it is not, and unfortunately that underlying path got removed when my new partial path was added to the partial_pathlist.

Due to this, I thought it better to keep both paths valid, and to avoid deleting the earlier cheapest partial path I chose to reset partially_grouped_rel->partial_pathlist.

But yes, per the comment in generate_gather_paths() and as you said, we should add a Gather or Gather Merge only after we are done with all partial path creation. Sorry for not knowing this before.
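The two bullet points above can be reproduced with a back-of-the-envelope Gather cost model (the cost parameters below are invented purely to match the 10/11 and 9/12 numbers; the real formula and defaults live in cost_gather()):

```python
# Invented cost parameters, chosen only to reproduce the example above.
parallel_setup_cost = 0.5
parallel_tuple_cost = 0.001

def gather_cost(subpath_cost, rows):
    # Gather pays the subpath cost plus a startup fee plus a per-row
    # transfer fee, so it depends on row count as well as subpath cost.
    return subpath_cost + parallel_setup_cost + parallel_tuple_cost * rows

gather_a = gather_cost(10.0, 500)    # cheapest-so-far partial path, few rows
gather_b = gather_cost(9.0, 2500)    # cheaper partial path, but more rows

assert abs(gather_a - 11.0) < 1e-9
assert abs(gather_b - 12.0) < 1e-9
# The Gather over the *costlier* partial path is the cheaper Gather.
assert gather_a < gather_b
```

This is why placing the Gather only after all partial paths exist is the right fix: the choice of which partial path to gather cannot be made reliably while the partial_pathlist is still growing.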
 

+add_gather_or_gather_merge(PlannerInfo *root,

Please stop picking generic function names for functions that have
very specific purposes.  I don't really think that you need this to be
a separate function at all, but it is certainly NOT a
general-purpose function for adding a Gather or Gather Merge node.

OK. Got it now.
 

+        /*
+         * Collect statistics about aggregates for estimating costs of
+         * performing aggregation in parallel or in partial, if not already
+         * done. We use same cost for all the children as they will be same
+         * anyways.
+         */

If it only needs to be done once, do we really have to have it inside
the loop?  I see that you're using the varno-translated
partial_target->exprs and target->exprs, but if the specific varnos
don't matter, why not just use the untranslated version of the targets
before entering the loop?  And if the specific varnos do matter, then
presumably you need to do it every time.

Yes. It can be pulled outside a loop.
 

This is not a full review, but I'm out of time for right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Sat, Mar 3, 2018 at 12:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 1, 2018 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This is not a full review, but I'm out of time for right now.

Another thing I see here now is that create_grouping_paths() and
create_child_grouping_paths() are extremely similar.  Isn't there some
way we can refactor things so that we can reuse the code instead of
duplicating it?

Yes. I too observed the same after our re-design.

To avoid code duplication, I am now calling create_grouping_paths() for child relation too.

However, to perform the Gather or Gather Merge once we have all partial paths ready, and to avoid rearranging too much existing code, I am calling try_partitionwise_grouping() before we do any aggregation/grouping on the whole relation. By doing this, we will have all partial paths in partially_grouped_rel, and then the existing code will do the required finalization along with any Gather or Gather Merge, if needed.

Please have a look over attached patch-set and let me know if it needs further changes.

Thanks


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Mon, Mar 5, 2018 at 3:56 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> However, to perform Gather or Gather Merge once we have all partial paths
> ready, and to avoid too many existing code rearrangement, I am calling
> try_partitionwise_grouping() before we do any aggregation/grouping on whole
> relation. By doing this, we will be having all partial paths in
> partially_grouped_rel and then existing code will do required finalization
> along with any Gather or Gather Merge, if required.
>
> Please have a look over attached patch-set and let me know if it needs
> further changes.

This does look better.

+      <term><varname>enable_partitionwise_agg</varname> (<type>boolean</type>)

Please don't abbreviate "aggregate" to "agg".

-       /* Build final grouping paths */
-       add_paths_to_grouping_rel(root, input_rel, grouped_rel, target,
-                                 partially_grouped_rel, agg_costs,
-                                 &agg_final_costs, gd, can_sort, can_hash,
-                                 dNumGroups, (List *) parse->havingQual);
+       if (isPartialAgg)
+       {
+               Assert(agg_partial_costs && agg_final_costs);
+               add_paths_to_partial_grouping_rel(root, input_rel,
+                                                 partially_grouped_rel,
+                                                 agg_partial_costs,
+                                                 gd, can_sort, can_hash,
+                                                 false, true);
+       }
+       else
+       {
+               double          dNumGroups;
+
+               /* Estimate number of groups. */
+               dNumGroups = get_number_of_groups(root,
+                                                 cheapest_path->rows,
+                                                 gd,
+                                                 child_data ?
+                                                 make_tlist_from_pathtarget(target) :
+                                                 parse->targetList);
+
+               /* Build final grouping paths */
+               add_paths_to_grouping_rel(root, input_rel, grouped_rel, target,
+                                         partially_grouped_rel, agg_costs,
+                                         agg_final_costs, gd, can_sort,
+                                         can_hash, dNumGroups,
+                                         (List *) havingQual);
+       }

This looks strange.  Why do we go down two completely different code
paths here?  It seems to me that the set of paths we add to the
partial_pathlist shouldn't depend at all on isPartialAgg.  I might be
confused, but it seems to me that any aggregate path we construct that
is going to run in parallel must necessarily be partial, because even
if each group will occur only in one partition, it might still occur
in multiple workers for that partition, so finalization would be
needed.  On the other hand, for non-partial paths, we can add then to
partially_grouped_rel when isPartialAgg = true and to grouped_rel when
isPartialAgg = false, with the only difference being AGGSPLIT_SIMPLE
vs. AGGSPLIT_INITIAL_SERIAL.  But that doesn't appear to be what this
is doing.
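The AGGSPLIT_SIMPLE vs. AGGSPLIT_INITIAL_SERIAL distinction amounts to where the final function runs; a toy avg() in the style of PostgreSQL's transition/combine/final functions (not executor code) shows the split:

```python
# Toy avg() split into transition, combine, and final phases, mimicking
# how a partially aggregated path emits transition state for a later
# Finalize Aggregate step.

def avg_transfn(state, value):      # per-row transition
    total, count = state
    return (total + value, count + 1)

def avg_combinefn(a, b):            # merge two partial states
    return (a[0] + b[0], a[1] + b[1])

def avg_finalfn(state):             # produce the final value
    total, count = state
    return total / count

partition1 = [2.0, 4.0]
partition2 = [6.0, 8.0]

# AGGSPLIT_INITIAL_SERIAL: each child runs only the transition phase and
# emits its partial state.
state1 = (0.0, 0)
for v in partition1:
    state1 = avg_transfn(state1, v)
state2 = (0.0, 0)
for v in partition2:
    state2 = avg_transfn(state2, v)

# Finalize Aggregate above the Append: combine the states, then finalize.
split_result = avg_finalfn(avg_combinefn(state1, state2))

# AGGSPLIT_SIMPLE: one node runs transition + final over all the rows.
state = (0.0, 0)
for v in partition1 + partition2:
    state = avg_transfn(state, v)
simple_result = avg_finalfn(state)

assert split_result == simple_result == 5.0
```

The two modes produce identical answers; the only difference is whether the finalization happens in the same node or in a separate Finalize step above, which is exactly why a parallel path must stay partial until after the Gather.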

+       /*
+        * If there are any fully aggregated partial paths present, may be because
+        * of parallel Append over partitionwise aggregates, we must stick a
+        * Gather or Gather Merge path atop the cheapest partial path.
+        */
+       if (grouped_rel->partial_pathlist)

This comment is copied from someplace where the code does what the
comment says, but here it doesn't do any such thing.

More tomorrow...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 6, 2018 at 2:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 5, 2018 at 3:56 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> However, to perform Gather or Gather Merge once we have all partial paths
> ready, and to avoid too many existing code rearrangement, I am calling
> try_partitionwise_grouping() before we do any aggregation/grouping on whole
> relation. By doing this, we will be having all partial paths in
> partially_grouped_rel and then existing code will do required finalization
> along with any Gather or Gather Merge, if required.
>
> Please have a look over attached patch-set and let me know if it needs
> further changes.

This does look better.

Thank you, Robert.
 

+      <term><varname>enable_partitionwise_agg</varname> (<type>boolean</type>)

Please don't abbreviate "aggregate" to "agg".

This is in line with the enable_hashagg GUC. Do you think enable_partitionwise_aggregate is better? But then it will not be consistent with other GUC names like enable_hashagg.
 

-       /* Build final grouping paths */
-       add_paths_to_grouping_rel(root, input_rel, grouped_rel, target,
-                                 partially_grouped_rel, agg_costs,
-                                 &agg_final_costs, gd, can_sort, can_hash,
-                                 dNumGroups, (List *) parse->havingQual);
+       if (isPartialAgg)
+       {
+               Assert(agg_partial_costs && agg_final_costs);
+               add_paths_to_partial_grouping_rel(root, input_rel,
+                                                 partially_grouped_rel,
+                                                 agg_partial_costs,
+                                                 gd, can_sort, can_hash,
+                                                 false, true);
+       }
+       else
+       {
+               double          dNumGroups;
+
+               /* Estimate number of groups. */
+               dNumGroups = get_number_of_groups(root,
+                                                 cheapest_path->rows,
+                                                 gd,
+                                                 child_data ?
+                                                 make_tlist_from_pathtarget(target) :
+                                                 parse->targetList);
+
+               /* Build final grouping paths */
+               add_paths_to_grouping_rel(root, input_rel, grouped_rel, target,
+                                         partially_grouped_rel, agg_costs,
+                                         agg_final_costs, gd, can_sort,
+                                         can_hash, dNumGroups,
+                                         (List *) havingQual);
+       }

This looks strange.  Why do we go down two completely different code
paths here?

It is because when isPartialAgg = true we need to create partially aggregated non-partial paths, which should be added to partially_grouped_rel->pathlist. And when isPartialAgg = false, we are creating fully aggregated paths, which go into grouped_rel->pathlist.
 
  It seems to me that the set of paths we add to the
partial_pathlist shouldn't depend at all on isPartialAgg.  I might be
confused, but it seems to me that any aggregate path we construct that
is going to run in parallel must necessarily be partial, because even
if each group will occur only in one partition, it might still occur
in multiple workers for that partition, so finalization would be
needed.

That's true. We are creating partially aggregated partial paths for this and keeping them in partially_grouped_rel->partial_pathlist.
 
  On the other hand, for non-partial paths, we can add then to
partially_grouped_rel when isPartialAgg = true and to grouped_rel when
isPartialAgg = false, with the only difference being AGGSPLIT_SIMPLE
vs. AGGSPLIT_INITIAL_SERIAL. 

Yes. As explained above, they go into the pathlist of the respective rel.
However, the PathTarget is different too; we need partial_pathtarget when isPartialAgg = true, and also need agg_partial_costs.
 
But that doesn't appear to be what this
is doing.

So the code for creating partially aggregated partial paths and partially aggregated non-partial paths is the same, except that partial paths go into the partial_pathlist whereas non-partial ones go into the pathlist of partially_grouped_rel. Thus, calling add_paths_to_partial_grouping_rel() when isPartialAgg = true seems correct. Also, as we decided, this function is responsible for creating all partially aggregated paths, both partial and non-partial.

Am I missing something?
 

+       /*
+        * If there are any fully aggregated partial paths present, may be because
+        * of parallel Append over partitionwise aggregates, we must stick a
+        * Gather or Gather Merge path atop the cheapest partial path.
+        */
+       if (grouped_rel->partial_pathlist)

This comment is copied from someplace where the code does what the
comment says, but here it doesn't do any such thing.

Well, these comments are not present anywhere other than this place. With Parallel Append and partitionwise aggregates, it is now possible to have fully aggregated partial paths, and thus we need to stick a Gather and/or Gather Merge atop the cheapest partial path. I believe the code does exactly that. Am I missing something?
 

More tomorrow...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
Hi Jeevan,
I am back reviewing this. Here are some comments.

@@ -1415,7 +1413,8 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
          * the unparameterized Append path we are constructing for the parent.
          * If not, there's no workable unparameterized path.
          */
-        if (childrel->cheapest_total_path->param_info == NULL)
+        if (childrel->pathlist != NIL &&
+            childrel->cheapest_total_path->param_info == NULL)
             accumulate_append_subpath(childrel->cheapest_total_path,
                                       &subpaths, NULL);
         else
@@ -1683,6 +1682,13 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
             RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
             Path       *subpath;

+            if (childrel->pathlist == NIL)
+            {
+                /* failed to make a suitable path for this child */
+                subpaths_valid = false;
+                break;
+            }
+
When can childrel->pathlist be NIL?

diff --git a/src/backend/optimizer/plan/createplan.c
b/src/backend/optimizer/plan/createplan.c
index 9ae1bf3..f90626c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1670,7 +1670,15 @@ create_sort_plan(PlannerInfo *root, SortPath
*best_path, int flags)
     subplan = create_plan_recurse(root, best_path->subpath,
                                   flags | CP_SMALL_TLIST);

-    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+    /*
+     * In find_ec_member_for_tle(), child EC members are ignored if they don't
+     * belong to the given relids. Thus, if this sort path is based on a child
+     * relation, we must pass the relids of it. Otherwise, we will end-up into
+     * an error requiring pathkey item.
+     */
+    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+                                   IS_OTHER_REL(best_path->subpath->parent) ?
+                                   best_path->path.parent->relids : NULL);

     copy_generic_path_info(&plan->plan, (Path *) best_path);

Please separate this small adjustment in a patch of its own, with some
explanation of why we need it i.e. now this function can see SortPaths from
child (other) relations.

+    if (child_data)
+    {
+        /* Must be other rel as all child relations are marked OTHER_RELs */
+        Assert(IS_OTHER_REL(input_rel));

I think we should check IS_OTHER_REL() and Assert(child_data). That way we know
that the code in the if block is executed for OTHER relation.

-    if ((root->hasHavingQual || parse->groupingSets) &&
+    if (!child_data && (root->hasHavingQual || parse->groupingSets) &&

Degenerate grouping will never see child relations, so instead of checking for
child_data, Assert (!IS_OTHER_REL()) inside this block. Add a comment there
explaining the assertion.

+     *
+     * If we are performing grouping for a child relation, fetch can_sort from
+     * the child_data to avoid re-calculating same.
      */
-    can_sort = (gd && gd->rollups != NIL)
-        || grouping_is_sortable(parse->groupClause);
+    can_sort = child_data ? child_data->can_sort :
+        ((gd && gd->rollups != NIL) ||
+         grouping_is_sortable(parse->groupClause));

Instead of adding a conditional here, we can compute these values before
create_grouping_paths() is called from grouping_planner() and then pass them
down to try_partitionwise_grouping(). I have attached a patch which refactors
this code. Please see if this refactoring is useful. In the attached patch, I
have handled can_sort, can_hash and partial aggregation costs. More on the last
component below.


     /*
      * Figure out whether a PartialAggregate/Finalize Aggregate execution
@@ -3740,10 +3789,8 @@ create_grouping_paths(PlannerInfo *root,
      * partial paths for partially_grouped_rel; that way, later code can
      * easily consider both parallel and non-parallel approaches to grouping.
      */
-    if (try_parallel_aggregation)
+    if (!child_data && !(agg_costs->hasNonPartial || agg_costs->hasNonSerial))
     {
-        PathTarget *partial_grouping_target;
-
[... clipped ...]
+            get_agg_clause_costs(root, havingQual,
                                  AGGSPLIT_FINAL_DESERIAL,
-                                 &agg_final_costs);
+                                 agg_final_costs);
         }
+    }

With this change, we are computing partial aggregation costs even in cases
where they will not be used, e.g. when there are no children and parallel
paths cannot be created. In the attached patch, I have refactored the code so
that they are computed the first time they are needed and re-used later.

+    if (child_data)
+    {
+        partial_grouping_target = child_data->partialRelTarget;
+        partially_grouped_rel->reltarget = partial_grouping_target;
+        agg_partial_costs = child_data->agg_partial_costs;
+        agg_final_costs = child_data->agg_final_costs;
+    }

I think, with the refactoring, we can get rid of the last two lines here. I
think we can get rid of this block entirely, but I have not reviewed the entire
code to confirm that.

 static PathTarget *
-make_partial_grouping_target(PlannerInfo *root, PathTarget *grouping_target)
+make_partial_grouping_target(PlannerInfo *root,
+                             PathTarget *grouping_target,
+                             Node *havingQual)
This looks like a refactoring change. Should go to one of the refactoring
patches or in a patch of its own.

This isn't a full review. I will continue reviewing this further.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 6, 2018 at 4:59 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Hi Jeevan,
I am back reviewing this. Here are some comments.

@@ -1415,7 +1413,8 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
          * the unparameterized Append path we are constructing for the parent.
          * If not, there's no workable unparameterized path.
          */
-        if (childrel->cheapest_total_path->param_info == NULL)
+        if (childrel->pathlist != NIL &&
+            childrel->cheapest_total_path->param_info == NULL)
             accumulate_append_subpath(childrel->cheapest_total_path,
                                       &subpaths, NULL);
         else
@@ -1683,6 +1682,13 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
             RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
             Path       *subpath;

+            if (childrel->pathlist == NIL)
+            {
+                /* failed to make a suitable path for this child */
+                subpaths_valid = false;
+                break;
+            }
+
When can childrel->pathlist be NIL?

diff --git a/src/backend/optimizer/plan/createplan.c
b/src/backend/optimizer/plan/createplan.c
index 9ae1bf3..f90626c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1670,7 +1670,15 @@ create_sort_plan(PlannerInfo *root, SortPath
*best_path, int flags)
     subplan = create_plan_recurse(root, best_path->subpath,
                                   flags | CP_SMALL_TLIST);

-    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+    /*
+     * In find_ec_member_for_tle(), child EC members are ignored if they don't
+     * belong to the given relids. Thus, if this sort path is based on a child
+     * relation, we must pass the relids of it. Otherwise, we will end-up into
+     * an error requiring pathkey item.
+     */
+    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+                                   IS_OTHER_REL(best_path->subpath->parent) ?
+                                   best_path->path.parent->relids : NULL);

     copy_generic_path_info(&plan->plan, (Path *) best_path);

Please separate this small adjustment in a patch of its own, with some
explanation of why we need it i.e. now this function can see SortPaths from
child (other) relations.

+    if (child_data)
+    {
+        /* Must be other rel as all child relations are marked OTHER_RELs */
+        Assert(IS_OTHER_REL(input_rel));

I think we should check IS_OTHER_REL() and Assert(child_data). That way we know
that the code in the if block is executed for OTHER relation.

-    if ((root->hasHavingQual || parse->groupingSets) &&
+    if (!child_data && (root->hasHavingQual || parse->groupingSets) &&

Degenerate grouping will never see child relations, so instead of checking for
child_data, Assert (!IS_OTHER_REL()) inside this block. Add a comment there
explaining the assertion.

+     *
+     * If we are performing grouping for a child relation, fetch can_sort from
+     * the child_data to avoid re-calculating same.
      */
-    can_sort = (gd && gd->rollups != NIL)
-        || grouping_is_sortable(parse->groupClause);
+    can_sort = child_data ? child_data->can_sort :
+        ((gd && gd->rollups != NIL) ||
+         grouping_is_sortable(parse->groupClause));

Instead of adding a conditional here, we can compute these values before
create_grouping_paths() is called from grouping_planner() and then pass them
down to try_partitionwise_grouping(). I have attached a patch which refactors
this code. Please see if this refactoring is useful. In the attached patch, I
have handled can_sort, can_hash and partial aggregation costs. More on the last
component below.

The changes look good to me, and the refactoring will be useful for the partitionwise patches.

However, would it be good to add agg_costs into GroupPathExtraData too?
Also, can we pass this to add_partial_paths_to_grouping_rel() and add_paths_to_grouping_rel(), to avoid passing can_sort, can_hash, and the cost-related details to them individually?
 


     /*
      * Figure out whether a PartialAggregate/Finalize Aggregate execution
@@ -3740,10 +3789,8 @@ create_grouping_paths(PlannerInfo *root,
      * partial paths for partially_grouped_rel; that way, later code can
      * easily consider both parallel and non-parallel approaches to grouping.
      */
-    if (try_parallel_aggregation)
+    if (!child_data && !(agg_costs->hasNonPartial || agg_costs->hasNonSerial))
     {
-        PathTarget *partial_grouping_target;
-
[... clipped ...]
+            get_agg_clause_costs(root, havingQual,
                                  AGGSPLIT_FINAL_DESERIAL,
-                                 &agg_final_costs);
+                                 agg_final_costs);
         }
+    }

With this change, we are computing partial aggregation costs even in
the cases when
those will not be used e.g. when there are no children and parallel paths can
not be created. In the attached patch, I have refactored the code such that
they are computed when they are needed the first time and re-used later.

+    if (child_data)
+    {
+        partial_grouping_target = child_data->partialRelTarget;
+        partially_grouped_rel->reltarget = partial_grouping_target;
+        agg_partial_costs = child_data->agg_partial_costs;
+        agg_final_costs = child_data->agg_final_costs;
+    }

I think, with the refactoring, we can get rid of the last two lines here. I
think we can get rid of this block entirely, but I have not reviewed the entire
code to confirm that.

 static PathTarget *
-make_partial_grouping_target(PlannerInfo *root, PathTarget *grouping_target)
+make_partial_grouping_target(PlannerInfo *root,
+                             PathTarget *grouping_target,
+                             Node *havingQual)
This looks like a refactoring change. Should go to one of the refactoring
patches or in a patch of its own.

This isn't full review. I will continue reviewing this further.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Tue, Mar 6, 2018 at 5:31 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> This is in-lined with enable_hashagg GUC. Do you think
> enable_partitionwise_aggregate seems better? But it will be not consistent
> with other GUC names like enable_hashagg then.

Well, if I had my way, enable_hashagg would be spelled
enable_hash_aggregate, too, but I wasn't involved in the project that
long ago.  100% consistency is hard to achieve here; the perfect
parallel of enable_hashagg would be enable_partitionwiseagg, but then
it would be inconsistent with enable_partitionwise_join unless we
renamed it to enable_partitionwisejoin, which I would rather not do.
I think the way the enable_blahfoo names were done was kinda
shortsighted -- it works OK as long as blahfoo is pretty short, like
mergejoin or hashagg or whatever, but if you have more or longer words
then I think it's hard to see where the word boundaries are without
any punctuation.  And if you start abbreviating then you end up with
things like enable_pwagg which are not very easy to understand.  So I
favor spelling everything out and punctuating it.

> So the code for doing partially aggregated partial paths and partially
> aggregated non-partial path is same except partial paths goes into
> partial_pathlist where as non-partial goes into pathlist of
> partially_grouped_rel. Thus, calling add_paths_to_partial_grouping_rel()
> when isPartialAgg = true seems correct. Also as we have decided, this
> function is responsible to create all partially aggregated paths including
> both partial and non-partial.
>
> Am I missing something?

Hmm.  I guess not.  I think I didn't read this code well enough
previously.  Please find attached proposed incremental patches (0001
and 0002) which hopefully make the code in this area a bit clearer.

>> +       /*
>> +        * If there are any fully aggregated partial paths present,
>> may be because
>> +        * of parallel Append over partitionwise aggregates, we must stick
>> a
>> +        * Gather or Gather Merge path atop the cheapest partial path.
>> +        */
>> +       if (grouped_rel->partial_pathlist)
>>
>> This comment is copied from someplace where the code does what the
>> comment says, but here it doesn't do any such thing.
>
> Well, these comments are not present anywhere else than this place. With
> Parallel Append and Partitionwise aggregates, it is now possible to have
> fully aggregated partial paths now. And thus we need to stick a Gather
> and/or Gather Merge atop cheapest partial path. And I believe the code does
> the same. Am I missing something?

I misread the code.  Sigh.  I should have waited until today to send
that email and taken time to study it more carefully.  But I still
don't think it's completely correct.  It will not consider using a
pre-sorted path; the only strategies it can consider are cheapest path
+ Gather and cheapest path + explicit Sort (even if the cheapest path
is already correctly sorted!) + Gather Merge.  It should really do
something similar to what add_paths_to_partial_grouping_rel() already
does: first call generate_gather_paths() and then, if the cheapest
partial path is not already correctly sorted, also try an explicit
Sort + Gather Merge.  In fact, it looks like we can actually reuse
that logic exactly.  See attached 0003 incremental patch; this changes
the outputs of one of your regression tests, but the new plan looks
better.

Some other notes:

There's a difference between performing partial aggregation in the
same process and performing it in a different process.  hasNonPartial
tells us that we can't perform partial aggregation *at all*;
hasNonSerial only tells us that partial and final aggregation must
happen in the same process.  This patch could possibly take advantage
of partial aggregation even when hasNonSerial is set.  Finalize
Aggregate -> Append -> N copies of { Partial Aggregate -> Whatever }
is OK with hasNonSerial = true as long as hasNonPartial = false.  Now,
the bad news is that for this to actually work we'd need to define new
values of AggSplit, like AGGSPLIT_INITIAL = AGGSPLITOP_SKIPFINAL and
AGGSPLIT_FINAL = AGGSPLITOP_COMBINE, and I'm not sure how much
complexity that adds.  However, if we're not going to do that, I think
we'd better at least add some comments about it suggesting that someone
might want to do something about it in the future.

I think that, in general, it's a good idea to keep the number of times
that create_grouping_paths() does something which is conditional on
whether child_data is NULL to a minimum.  I haven't looked at what
Ashutosh tried to do there so I don't know whether it's good or bad,
but I like the idea, if we can do it cleanly.

It strikes me that we might want to consider refactoring things so
that create_grouping_paths() takes the grouping_rel and
partial_grouping_rel as input arguments.  Right now, the
initialization of the child grouping and partial-grouping rels is
partly in try_partitionwise_aggregate(), which considers marking one
of them (but never both?) as dummy rels and create_grouping_paths()
which sets reloptkind, serverid, userid, etc.  The logic of all of
this is a little unclear to me.  Presumably, if the input rel is
dummy, then both the grouping_rel and the partial_grouping_rel are
also dummy.  Also, presumably we should set the reloptkind correctly
as soon as we create the rel, not at some later stage.

Or maybe what we should do is split create_grouping_paths() into two
functions.  Like this:

    if (child_data)
    {
        partial_grouping_target = child_data->partialRelTarget;
        partially_grouped_rel->reltarget = partial_grouping_target;
        agg_partial_costs = child_data->agg_partial_costs;
        agg_final_costs = child_data->agg_final_costs;
    }

--- SPLIT IT HERE ---

    /* Apply partitionwise aggregation technique, if possible. */
    try_partitionwise_grouping(root, input_rel, grouped_rel,
                               partially_grouped_rel, target,
                               partial_grouping_target, agg_costs,
                               agg_partial_costs, agg_final_costs, gd,
                               can_sort, can_hash, havingQual, isPartialAgg);

It seems to me that everything from that point to the end is doing the
path generation and it's all pretty much the same for the parent and
child cases.  But everything before that is either stuff that doesn't
apply to the child case at all (like the degenerate grouping case) or
stuff that should be done once and passed down (like
can_sort/can_hash).  The only exception I see is some of the stuff
that sets up the upper rel at the top of the function, but maybe that
logic could be refactored into a separate function as well (like
initialize_grouping_rel).  Then, instead of try_partitionwise_join()
actually calling create_grouping_paths(), it would call
initialize_grouping_rel() and then the path-adding function that we
split off from the bottom of the current create_grouping_paths(),
basically skipping all that stuff in the middle that we don't really
want to do in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Tue, Mar 6, 2018 at 7:52 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

>
>
> Changes look good to me and refactoring will be useful for partitionwise
> patches.
>
> However, will it be good if we add agg_costs into the GroupPathExtraData
> too?
> Also can we pass this to the add_partial_paths_to_grouping_rel() and
> add_paths_to_grouping_rel() to avoid passing can_sort, can_hash and costs
> related details individually to them?

I think so too.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Mar 7, 2018 at 10:04 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Tue, Mar 6, 2018 at 7:52 PM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>
>>
>>
>> Changes look good to me and refactoring will be useful for partitionwise
>> patches.
>>
>> However, will it be good if we add agg_costs into the GroupPathExtraData
>> too?
>> Also can we pass this to the add_partial_paths_to_grouping_rel() and
>> add_paths_to_grouping_rel() to avoid passing can_sort, can_hash and costs
>> related details individually to them?
>
> I think so too.

Here's a patch doing that. agg_costs is calculated well before we populate
the other members of GroupPathExtraData, which means that we either set a
pointer to agg_costs in GroupPathExtraData or memcpy its contents. The first
option would make GroupPathExtraData asymmetric about the costs it holds,
some as pointers and some as whole structures. Holding whole structures
allows us to compute them anywhere without worrying about memory allocation
or variable lifetime, so I am reluctant to turn all the costs into pointers.
Hence I have not added agg_costs to GroupPathExtraData yet.

We could make GroupPathExtraData a variable in grouping_planner() and
populate its members as we progress, but I think that's a digression from
the original purpose of the patch.

I observe that we are computing agg_costs, the number of groups, etc. again
in postgres_fdw, so there seems to be merit in passing those values to the
FDW as GroupPathExtraData as well, like what you have done with
OtherUpperExtraData. But we will come to that once we have straightened out
the partition-wise aggregate patches.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 6, 2018 at 4:59 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Hi Jeevan,
I am back reviewing this. Here are some comments.

@@ -1415,7 +1413,8 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
          * the unparameterized Append path we are constructing for the parent.
          * If not, there's no workable unparameterized path.
          */
-        if (childrel->cheapest_total_path->param_info == NULL)
+        if (childrel->pathlist != NIL &&
+            childrel->cheapest_total_path->param_info == NULL)
             accumulate_append_subpath(childrel->cheapest_total_path,
                                       &subpaths, NULL);
         else
@@ -1683,6 +1682,13 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
             RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
             Path       *subpath;

+            if (childrel->pathlist == NIL)
+            {
+                /* failed to make a suitable path for this child */
+                subpaths_valid = false;
+                break;
+            }
+
When can childrel->pathlist be NIL?

Done. Sorry, this was a leftover from an earlier attempt; it is not needed now, so I have removed it.
 

diff --git a/src/backend/optimizer/plan/createplan.c
b/src/backend/optimizer/plan/createplan.c
index 9ae1bf3..f90626c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1670,7 +1670,15 @@ create_sort_plan(PlannerInfo *root, SortPath
*best_path, int flags)
     subplan = create_plan_recurse(root, best_path->subpath,
                                   flags | CP_SMALL_TLIST);

-    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+    /*
+     * In find_ec_member_for_tle(), child EC members are ignored if they don't
+     * belong to the given relids. Thus, if this sort path is based on a child
+     * relation, we must pass the relids of it. Otherwise, we will end-up into
+     * an error requiring pathkey item.
+     */
+    plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+                                   IS_OTHER_REL(best_path->subpath->parent) ?
+                                   best_path->path.parent->relids : NULL);

     copy_generic_path_info(&plan->plan, (Path *) best_path);

Please separate this small adjustment in a patch of its own, with some
explanation of why we need it i.e. now this function can see SortPaths from
child (other) relations.

I am not sure whether it is good to split this out of the main patch. The main patch exposes this requirement, so it seems better to have these changes in the main patch itself.
However, I have no issue extracting it into a separate small patch. Let me know your views.
 

+    if (child_data)
+    {
+        /* Must be other rel as all child relations are marked OTHER_RELs */
+        Assert(IS_OTHER_REL(input_rel));

I think we should check IS_OTHER_REL() and Assert(child_data). That way we know
that the code in the if block is executed for OTHER relation.

Done.
 

-    if ((root->hasHavingQual || parse->groupingSets) &&
+    if (!child_data && (root->hasHavingQual || parse->groupingSets) &&

Degenerate grouping will never see child relations, so instead of checking for
child_data, Assert (!IS_OTHER_REL()) inside this block. Add a comment there
explaining the assertion.

Done.
 

+     *
+     * If we are performing grouping for a child relation, fetch can_sort from
+     * the child_data to avoid re-calculating same.
      */
-    can_sort = (gd && gd->rollups != NIL)
-        || grouping_is_sortable(parse->groupClause);
+    can_sort = child_data ? child_data->can_sort :
+        ((gd && gd->rollups != NIL) ||
+         grouping_is_sortable(parse->groupClause));

Instead of adding a conditional here, we can compute these values before
create_grouping_paths() is called from grouping_planner() and then pass them
down to try_partitionwise_grouping(). I have attached a patch which refactors
this code. Please see if this refactoring is useful. In the attached patch, I
have handled can_sort, can_hash and partial aggregation costs. More on the last
component below.


     /*
      * Figure out whether a PartialAggregate/Finalize Aggregate execution
@@ -3740,10 +3789,8 @@ create_grouping_paths(PlannerInfo *root,
      * partial paths for partially_grouped_rel; that way, later code can
      * easily consider both parallel and non-parallel approaches to grouping.
      */
-    if (try_parallel_aggregation)
+    if (!child_data && !(agg_costs->hasNonPartial || agg_costs->hasNonSerial))
     {
-        PathTarget *partial_grouping_target;
-
[... clipped ...]
+            get_agg_clause_costs(root, havingQual,
                                  AGGSPLIT_FINAL_DESERIAL,
-                                 &agg_final_costs);
+                                 agg_final_costs);
         }
+    }

With this change, we are computing partial aggregation costs even in
the cases when
those will not be used e.g. when there are no children and parallel paths can
not be created. In the attached patch, I have refactored the code such that
they are computed when they are needed the first time and re-used later.

+    if (child_data)
+    {
+        partial_grouping_target = child_data->partialRelTarget;
+        partially_grouped_rel->reltarget = partial_grouping_target;
+        agg_partial_costs = child_data->agg_partial_costs;
+        agg_final_costs = child_data->agg_final_costs;
+    }

I think, with the refactoring, we can get rid of the last two lines here. I
think we can get rid of this block entirely, but I have not reviewed the entire
code to confirm that.

I have added your patch as one of the refactoring patches and rebased my changes over it.
Yes, it removed this block and a few other conditions too.
 

 static PathTarget *
-make_partial_grouping_target(PlannerInfo *root, PathTarget *grouping_target)
+make_partial_grouping_target(PlannerInfo *root,
+                             PathTarget *grouping_target,
+                             Node *havingQual)
This looks like a refactoring change. Should go to one of the refactoring
patches or in a patch of its own.

OK. Refactored into separate patch.


Will post a new patchset with these changes included.
 

This isn't full review. I will continue reviewing this further.

Sure.
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Mar 7, 2018 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 6, 2018 at 5:31 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> This is in-lined with enable_hashagg GUC. Do you think
> enable_partitionwise_aggregate seems better? But it will be not consistent
> with other GUC names like enable_hashagg then.

Well, if I had my way, enable_hashagg would be spelled
enable_hash_aggregate, too, but I wasn't involved in the project that
long ago.  100% consistency is hard to achieve here; the perfect
parallel of enable_hashagg would be enable_partitionwiseagg, but then
it would be inconsistent with enable_partitionwise_join unless we
renamed it to enable_partitionwisejoin, which I would rather not do.
I think the way the enable_blahfoo names were done was kinda
shortsighted -- it works OK as long as blahfoo is pretty short, like
mergejoin or hashagg or whatever, but if you have more or longer words
then I think it's hard to see where the word boundaries are without
any punctuation.  And if you start abbreviating then you end up with
things like enable_pwagg which are not very easy to understand.  So I
favor spelling everything out and punctuating it.

Understood; that makes sense.
Updated.
 

> So the code for doing partially aggregated partial paths and partially
> aggregated non-partial path is same except partial paths goes into
> partial_pathlist where as non-partial goes into pathlist of
> partially_grouped_rel. Thus, calling add_paths_to_partial_grouping_rel()
> when isPartialAgg = true seems correct. Also as we have decided, this
> function is responsible to create all partially aggregated paths including
> both partial and non-partial.
>
> Am I missing something?

Hmm.  I guess not.  I think I didn't read this code well enough
previously.  Please find attached proposed incremental patches (0001
and 0002) which hopefully make the code in this area a bit clearer.

Yep. Thanks for these patches.
I have merged these changes into my main (0007) patch now.
 

>> +       /*
>> +        * If there are any fully aggregated partial paths present,
>> may be because
>> +        * of parallel Append over partitionwise aggregates, we must stick
>> a
>> +        * Gather or Gather Merge path atop the cheapest partial path.
>> +        */
>> +       if (grouped_rel->partial_pathlist)
>>
>> This comment is copied from someplace where the code does what the
>> comment says, but here it doesn't do any such thing.
>
> Well, these comments are not present anywhere else than this place. With
> Parallel Append and Partitionwise aggregates, it is now possible to have
> fully aggregated partial paths now. And thus we need to stick a Gather
> and/or Gather Merge atop cheapest partial path. And I believe the code does
> the same. Am I missing something?

I misread the code.  Sigh.  I should have waited until today to send
that email and taken time to study it more carefully.  But I still
don't think it's completely correct.  It will not consider using a
pre-sorted path; the only strategies it can consider are cheapest path
+ Gather and cheapest path + explicit Sort (even if the cheapest path
is already correctly sorted!) + Gather Merge.  It should really do
something similar to what add_paths_to_partial_grouping_rel() already
does: first call generate_gather_paths() and then, if the cheapest
partial path is not already correctly sorted, also try an explicit
Sort + Gather Merge.  In fact, it looks like we can actually reuse
that logic exactly.  See attached 0003 incremental patch; this changes
the outputs of one of your regression tests, but the new plan looks
better.

This seems like a refactoring patch, so I have added it as a separate patch (0005) in the patch-set. Changes related to the PWA patch have been merged accordingly too.

Attached is a new patch-set with these changes merged, fixing the review comments from Ashutosh Bapat, along with his GroupPathExtraData changes patch.


Some other notes:

There's a difference between performing partial aggregation in the
same process and performing it in a different process.  hasNonPartial
tells us that we can't perform partial aggregation *at all*;
hasNonSerial only tells us that partial and final aggregation must
happen in the same process.  This patch could possibly take advantage
of partial aggregation even when hasNonSerial is set.  Finalize
Aggregate -> Append -> N copies of { Partial Aggregate -> Whatever }
is OK with hasNonSerial = true as long as hasNonPartial = false.  Now,
the bad news is that for this to actually work we'd need to define new
values of AggSplit, like AGGSPLIT_INITIAL = AGGSPLITOP_SKIPFINAL and
AGGSPLIT_FINAL = AGGSPLITOP_COMBINE, and I'm not sure how much
complexity that adds.  However, if we're not going to do that, I think
we'd better at least add some comments about it suggesting that someone
might want to do something about it in the future.
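
The distinction between the two flags can be summarized in a tiny decision model. This is an illustrative Python sketch of the reasoning above, not PostgreSQL code; the shape names are invented labels.

```python
def allowed_shapes(has_non_partial, has_non_serial):
    """hasNonPartial forbids partial aggregation entirely; hasNonSerial
    only forbids moving partial state across a process boundary, so a
    same-process shape like Finalize Aggregate -> Append -> Partial
    Aggregate remains legal."""
    shapes = {"plain aggregate"}
    if not has_non_partial:
        # Partial + finalize in the same process: no serialization needed.
        shapes.add("finalize -> append -> partial (same process)")
        if not has_non_serial:
            # Crossing a process boundary requires serializable state.
            shapes.add("finalize -> gather -> partial (parallel)")
    return shapes
```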

I am not much familiar with these, so I will add a comment as you said.
 

I think that, in general, it's a good idea to keep the number of times
that create_grouping_paths() does something which is conditional on
whether child_data is NULL to a minimum.  I haven't looked at what
Ashutosh tried to do there so I don't know whether it's good or bad,
but I like the idea, if we can do it cleanly.

It strikes me that we might want to consider refactoring things so
that create_grouping_paths() takes the grouping_rel and
partial_grouping_rel as input arguments.  Right now, the
initialization of the child grouping and partial-grouping rels is
partly in try_partitionwise_aggregate(), which considers marking one
of them (but never both?) as dummy rels and create_grouping_paths()
which sets reloptkind, serverid, userid, etc.  The logic of all of
this is a little unclear to me.  Presumably, if the input rel is
dummy, then both the grouping_rel and the partial_grouping_rel are
also dummy.  Also, presumably we should set the reloptkind correctly
as soon as we create the rel, not at some later stage.

Or maybe what we should do is split create_grouping_paths() into two
functions.  Like this:

    if (child_data)
    {
        partial_grouping_target = child_data->partialRelTarget;
        partially_grouped_rel->reltarget = partial_grouping_target;
        agg_partial_costs = child_data->agg_partial_costs;
        agg_final_costs = child_data->agg_final_costs;
    }

--- SPLIT IT HERE ---

    /* Apply partitionwise aggregation technique, if possible. */
    try_partitionwise_grouping(root, input_rel, grouped_rel,
                               partially_grouped_rel, target,
                               partial_grouping_target, agg_costs,
                               agg_partial_costs, agg_final_costs, gd,
                               can_sort, can_hash, havingQual, isPartialAgg);

It seems to me that everything from that point to the end is doing the
path generation and it's all pretty much the same for the parent and
child cases.  But everything before that is either stuff that doesn't
apply to the child case at all (like the degenerate grouping case) or
stuff that should be done once and passed down (like
can_sort/can_hash).  The only exception I see is some of the stuff
that sets up the upper rel at the top of the function, but maybe that
logic could be refactored into a separate function as well (like
initialize_grouping_rel).  Then, instead of try_partitionwise_join()
actually calling create_grouping_paths(), it would call
initialize_grouping_rel() and then the path-adding function that we
split off from the bottom of the current create_grouping_paths(),
basically skipping all that stuff in the middle that we don't really
want to do in that case.

I will have a look at this proposal.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Mar 7, 2018 at 8:07 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
Here are some more review comments, esp. on the
try_partitionwise_grouping() function. BTW, the name of that function
is not in sync with enable_partitionwise_aggregation (which, btw, is
in sync with the enable_fooagg GUCs), but it is in sync with
create_grouping_paths(). It looks like we have already conflated
aggregation and grouping: e.g. enable_hashagg may affect path creation
when no aggregation is involved, i.e. only grouping, while
create_grouping_paths() will create paths when no grouping is
involved. Generally, the function names use grouping to mean both
aggregation and grouping, but the GUCs use aggregation to mean both.
So, the naming in this patch is consistent with the current naming
conventions.

+        grouped_rel->part_scheme = input_rel->part_scheme;
+        grouped_rel->nparts = nparts;
+        grouped_rel->boundinfo = input_rel->boundinfo;
+        grouped_rel->part_rels = part_rels;

You need to set the part_exprs which will provide partition keys for this
partitioned relation. I think, we should include all the part_exprs of
input_rel which are part of GROUP BY clause. Since any other expressions in
part_exprs are not part of GROUP BY clause, they can not appear in the
targetlist without an aggregate on top. So they can't be part of the partition
keys of the grouped relation.

In create_grouping_paths() we fetch both partial as well as fully grouped rel
for given input relation. But in case of partial aggregation, we don't need
fully grouped rel since we are not computing full aggregates for the children.
Since fetch_upper_rel() creates a relation when one doesn't exist, we are
unnecessarily creating fully grouped rels in this case. For thousands of
partitions that's a lot of memory wasted.

I see a similar issue with create_grouping_paths() when we are computing only
full aggregates (either because partial aggregation is not possible or because
parallelism is not possible). In that case, we unconditionally create partially
grouped rels. That too would waste a lot of memory.

I think that partially_grouped_rel, when required, is partitioned irrespective
of whether we do full aggregation per partition or not. So, we should have its
part_rels and other partitioning properties set. I agree that right now we
won't use this information anywhere. It may be useful in case we devise a way
to use partially_grouped_rel directly, without using grouped_rel, for planning
beyond grouping, which seems unlikely.

+
+        /*
+         * Parallel aggregation requires partial target, so compute it here
+         * and translate all vars. For partial aggregation, we need it
+         * anyways.
+         */
+        partial_target = make_partial_grouping_target(root, target,
+                                                      havingQual);

Don't we have this available in partially_grouped_rel?

That shows one asymmetry that Robert's refactoring has introduced. We don't set
reltarget of grouped_rel but set reltarget of partially_grouped_rel. If
reltarget of grouped_rel changes across paths (the reason why we have not set
it in grouped_rel), shouldn't reltarget of partially grouped paths change
accordingly?

+
+/*
+ * group_by_has_partkey
+ *
+ * Returns true, if all the partition keys of the given relation are part of
+ * the GROUP BY clauses, false otherwise.
+ */
+static bool
+group_by_has_partkey(RelOptInfo *input_rel, PathTarget *target,
+                     List *groupClause)

We could modify this function to return the list of part_exprs which are part
of group clause and use that as the partition keys of the grouped_rel if
required. If group by doesn't have all the partition keys, the function would
return a NULL list.
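
The suggested change could be sketched as follows. This is a Python model of the C function being discussed, written under stated assumptions: expressions are modeled as plain values, and matching is simple membership rather than real expression equality.

```python
def group_by_part_exprs(part_exprs, group_clause):
    """Modified group_by_has_partkey(): instead of a boolean, return the
    partition-key expressions when all of them appear in the GROUP BY
    clause (usable as the grouped rel's partition keys), and None (the
    NULL list) when any partition key is missing."""
    matched = [e for e in part_exprs if e in group_clause]
    if len(matched) < len(part_exprs):
        return None      # some partition key is not grouped on
    return matched       # all partition keys are part of GROUP BY
```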

Right now, in case of full aggregation, partially_grouped_rel is populated with
the partial paths created by adding partial aggregation to the partial paths of
input relation. But we are not trying to create partial paths by (parallel)
appending the (non)partial paths from the child partially_grouped_rel. Have we
thought about that? Would such paths have different shapes from the ones that
we create now and will they be better?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 2:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> +        grouped_rel->part_scheme = input_rel->part_scheme;
> +        grouped_rel->nparts = nparts;
> +        grouped_rel->boundinfo = input_rel->boundinfo;
> +        grouped_rel->part_rels = part_rels;
>
> You need to set the part_exprs which will provide partition keys for this
> partitioned relation. I think, we should include all the part_exprs of
> input_rel which are part of GROUP BY clause. Since any other expressions in
> part_exprs are not part of GROUP BY clause, they can not appear in the
> targetlist without an aggregate on top. So they can't be part of the partition
> keys of the grouped relation.
>
> In create_grouping_paths() we fetch both partial as well as fully grouped rel
> for given input relation. But in case of partial aggregation, we don't need
> fully grouped rel since we are not computing full aggregates for the children.
> Since fetch_upper_rel() creates a relation when one doesn't exist, we are
> unnecessarily creating fully grouped rels in this case. For thousands of
> partitions that's a lot of memory wasted.
>
> I see a similar issue with create_grouping_paths() when we are computing only
> full aggregates (either because partial aggregation is not possible or because
> parallelism is not possible). In that case, we unconditionally create partially
> grouped rels. That too would waste a lot of memory.

This kind of goes along with the suggestion I made yesterday to
introduce a new function, which at the time I proposed calling
initialize_grouping_rel(), to set up new grouped or partially grouped
relations.  By doing that it would be easier to ensure the
initialization is always done in a consistent way but only for the
relations we actually need.  But maybe we should call it
fetch_grouping_rel() instead.  The idea would be that instead of
calling fetch_upper_rel() we would call fetch_grouping_rel() when it
is a question of the grouped or partially grouped relation.  It would
either return the existing relation or initialize a new one for us.  I
think that would make it fairly easy to initialize only the ones we're
going to need.
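
The fetch-or-initialize behaviour being proposed amounts to memoized creation. The following is a minimal hypothetical Python sketch of that idea (the real code would hang these rels off PlannerInfo; the class and parameter names here are invented).

```python
class UpperRels:
    """Model of fetch_grouping_rel(): return the existing relation, or
    create and consistently initialize it exactly once on first use."""
    def __init__(self):
        self._rels = {}

    def fetch_grouping_rel(self, kind, init):
        # kind: e.g. "grouped" or "partially_grouped"
        # init: callable that builds a fully initialized rel
        if kind not in self._rels:
            self._rels[kind] = init()
        return self._rels[kind]
```

Because initialization happens only inside the helper, only the rels a caller actually asks for ever get created, which is the memory-saving property discussed above.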

Also, I don't think we should be paranoid about memory usage here.
It's good to avoid creating new rels that are obviously not needed,
not only because of memory consumption but also because of the CPU
consumption involved, but I don't want to contort the code to squeeze
every last byte of memory out of this.

On a related note, I'm not sure that this code is correct:

+       if (!isPartialAgg)
+       {
+               grouped_rel->part_scheme = input_rel->part_scheme;
+               grouped_rel->nparts = nparts;
+               grouped_rel->boundinfo = input_rel->boundinfo;
+               grouped_rel->part_rels = part_rels;
+       }

It's not obvious to me why this should be done only when
!isPartialAgg.  The comments claim that the partially grouped child
rels can't be considered partitions of the top-level partially
grouped rel, but it seems to me that we could consider them that way.
Maybe I'm missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 8, 2018 at 1:15 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Mar 7, 2018 at 8:07 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
Here are some more review comments, esp. on the
try_partitionwise_grouping() function. BTW, the name of that function
is not in sync with enable_partitionwise_aggregation (which, btw, is
in sync with the enable_fooagg GUCs), but it is in sync with
create_grouping_paths(). It looks like we have already conflated
aggregation and grouping: e.g. enable_hashagg may affect path creation
when no aggregation is involved, i.e. only grouping, while
create_grouping_paths() will create paths when no grouping is
involved. Generally, the function names use grouping to mean both
aggregation and grouping, but the GUCs use aggregation to mean both.
So, the naming in this patch is consistent with the current naming
conventions.

+        grouped_rel->part_scheme = input_rel->part_scheme;
+        grouped_rel->nparts = nparts;
+        grouped_rel->boundinfo = input_rel->boundinfo;
+        grouped_rel->part_rels = part_rels;

You need to set the part_exprs which will provide partition keys for this
partitioned relation. I think, we should include all the part_exprs of
input_rel which are part of GROUP BY clause. Since any other expressions in
part_exprs are not part of GROUP BY clause, they can not appear in the
targetlist without an aggregate on top. So they can't be part of the partition
keys of the grouped relation.

In create_grouping_paths() we fetch both partial as well as fully grouped rel
for given input relation. But in case of partial aggregation, we don't need
fully grouped rel since we are not computing full aggregates for the children.
Since fetch_upper_rel() creates a relation when one doesn't exist, we are
unnecessarily creating fully grouped rels in this case. For thousands of
partitions that's a lot of memory wasted.

I see a similar issue with create_grouping_paths() when we are computing only
full aggregates (either because partial aggregation is not possible or because
parallelism is not possible). In that case, we unconditionally create partially
grouped rels. That too would waste a lot of memory.

I think that partially_grouped_rel, when required, is partitioned irrespective
of whether we do full aggregation per partition or not. So, we should have its
part_rels and other partitioning properties set. I agree that right now we
won't use this information anywhere. It may be useful in case we devise a way
to use partially_grouped_rel directly, without using grouped_rel, for planning
beyond grouping, which seems unlikely.

+
+        /*
+         * Parallel aggregation requires partial target, so compute it here
+         * and translate all vars. For partial aggregation, we need it
+         * anyways.
+         */
+        partial_target = make_partial_grouping_target(root, target,
+                                                      havingQual);

Don't we have this available in partially_grouped_rel?

That shows one asymmetry that Robert's refactoring has introduced. We don't set
reltarget of grouped_rel but set reltarget of partially_grouped_rel. If
reltarget of grouped_rel changes across paths (the reason why we have not set
it in grouped_rel), shouldn't reltarget of partially grouped paths change
accordingly?

I am not sure why we don't set reltarget in the grouped_rel too.

But if we do so, as we did for partially_grouped_rel, then it will be a lot easier for partitionwise aggregate, as we then don't have to pass the target to path-creating functions like create_append_path(). For now we need to update generate_gather_paths() to take a target too, as it is now being called on grouped_rel, in which reltarget is not set.

But yes, if there is any specific reason we can't do so, then I think the same as Ashutosh said. I wasn't aware of such a reason, though.


+
+/*
+ * group_by_has_partkey
+ *
+ * Returns true, if all the partition keys of the given relation are part of
+ * the GROUP BY clauses, false otherwise.
+ */
+static bool
+group_by_has_partkey(RelOptInfo *input_rel, PathTarget *target,
+                     List *groupClause)

We could modify this function to return the list of part_exprs which are part
of group clause and use that as the partition keys of the grouped_rel if
required. If group by doesn't have all the partition keys, the function would
return a NULL list.

Right now, in case of full aggregation, partially_grouped_rel is populated with
the partial paths created by adding partial aggregation to the partial paths of
input relation. But we are not trying to create partial paths by (parallel)
appending the (non)partial paths from the child partially_grouped_rel. Have we
thought about that? Would such paths have different shapes from the ones that
we create now and will they be better?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I am not sure why we don't set reltarget into the grouped_rel too.
>
> But if we do so like we did in partially_grouped_rel, then it will be lot
> easier for partitionwise aggregate as then we don't have to pass target to
> functions creating paths like create_append_path. We now need to update
> generate_gather_paths() to take target too as it is now being called on
> grouped_rel in which reltarget is not set.
>
> But yes, if there is any specific reason we can't do so, then I think the
> same as Ashutosh said. I wasn't aware of such a reason, though.

I see no problem with setting reltarget for the grouped_rel.  Before
we added partially_grouped_rel, that rel computed paths with two
different targets: partial paths had the partial grouping target, and
non-partial paths had the ordinary grouping target.  But that's fixed
now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 8, 2018 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I am not sure why we don't set reltarget into the grouped_rel too.
>
> But if we do so like we did in partially_grouped_rel, then it will be lot
> easier for partitionwise aggregate as then we don't have to pass target to
> functions creating paths like create_append_path. We now need to update
> generate_gather_paths() to take target too as it is now being called on
> grouped_rel in which reltarget is not set.
>
> But yes, if there is any specific reason we can't do so, then I think the
> same as Ashutosh said. I wasn't aware of such a reason, though.

I see no problem with setting reltarget for the grouped_rel.  Before
we added partially_grouped_rel, that rel computed paths with two
different targets: partial paths had the partial grouping target, and
non-partial paths had the ordinary grouping target.  But that's fixed
now.

OK.
Will update my changes accordingly.
If we set reltarget in the grouped_rel now, then I don't need one of the refactoring patches, the one that passes the target to the path-creation functions.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Mar 8, 2018 at 8:02 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Thu, Mar 8, 2018 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
>> <jeevan.chalke@enterprisedb.com> wrote:
>> > I am not sure why we don't set reltarget into the grouped_rel too.
>> >
>> > But if we do so like we did in partially_grouped_rel, then it will be
>> > lot
>> > easier for partitionwise aggregate as then we don't have to pass target
>> > to
>> > functions creating paths like create_append_path. We now need to update
>> > generate_gather_paths() to take target too as it is now being called on
>> > grouped_rel in which reltarget is not set.
>> >
>> > But yes, if there is any specific reason we can't do so, then I think
>> > the
>> > same as Ashutosh said. I wasn't aware of such a reason, though.
>>
>> I see no problem with setting reltarget for the grouped_rel.  Before
>> we added partially_grouped_rel, that rel computed paths with two
>> different targets: partial paths had the partial grouping target, and
>> non-partial paths had the ordinary grouping target.  But that's fixed
>> now.
>
>
> OK.
> Will update my changes accordingly.
> If we set reltarget into the grouped_rel now, then I don't need one of the
> refactoring patches, the one that passes the target to the path-creation functions.
>

For some reason we do not set the reltarget of any of the upper
relations. I don't know why, and browsing through the comments in
grouping_planner() didn't tell me either, including the one below,
which appears before the code that creates an array of upper-relation
targets.
        /*
         * Save the various upper-rel PathTargets we just computed into
         * root->upper_targets[].  The core code doesn't use this, but it
         * provides a convenient place for extensions to get at the info.  For
         * consistency, we save all the intermediate targets, even though some
         * of the corresponding upperrels might not be needed for this query.
         */
Why don't we just set those in the corresponding RelOptInfos? Maybe
we should do that for all the upper rels, not just grouping_rel.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Mar 8, 2018 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> This kind of goes along with the suggestion I made yesterday to
> introduce a new function, which at the time I proposed calling
> initialize_grouping_rel(), to set up new grouped or partially grouped
> relations.  By doing that it would be easier to ensure the
> initialization is always done in a consistent way but only for the
> relations we actually need.  But maybe we should call it
> fetch_grouping_rel() instead.  The idea would be that instead of
> calling fetch_upper_rel() we would call fetch_grouping_rel() when it
> is a question of the grouped or partially grouped relation.  It would
> either return the existing relation or initialize a new one for us.  I
> think that would make it fairly easy to initialize only the ones we're
> going to need.

Hmm. I am working on refactoring the code to do something like this.

> On a related note, I'm not sure that this code is correct:
>
> +       if (!isPartialAgg)
> +       {
> +               grouped_rel->part_scheme = input_rel->part_scheme;
> +               grouped_rel->nparts = nparts;
> +               grouped_rel->boundinfo = input_rel->boundinfo;
> +               grouped_rel->part_rels = part_rels;
> +       }
>
> It's not obvious to me why this should be done only when
> !isPartialAgg.  The comments claim that the partially grouped child
> rels can't be considered partitions of the top-level partially
> grouped rel, but it seems to me that we could consider them that way.
> Maybe I'm missing something.

When we are performing partial aggregates, the GROUP BY clause does not
contain the partition keys. This means that the target lists of the
grouped relation and the partially grouped relation do not contain bare
partition keys. So, for a relation sitting on top of this (partially)
grouped relation, the partition keys do not exist, and hence we can't
consider the grouped or partially grouped relation as partitioned.
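
This argument can be illustrated with a toy model, under the assumption (stated in the explanation above) that a grouped rel's target list keeps only the bare grouping expressions, since everything else sits under an aggregate and cannot serve as a partition key. This is a Python sketch for illustration only, not planner code.

```python
def grouped_rel_is_partitioned(part_keys, group_keys):
    """A (partially) grouped rel can be treated as partitioned only if
    every partition key survives into its target list as a bare column,
    i.e. appears in the GROUP BY clause."""
    targetlist = set(group_keys)  # bare columns surviving aggregation
    return all(k in targetlist for k in part_keys)
```

So with GROUP BY covering the partition key the grouped rel keeps its partitioned shape, while a partial-aggregation GROUP BY that omits it does not.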


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Mar 8, 2018 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> This kind of goes along with the suggestion I made yesterday to
>> introduce a new function, which at the time I proposed calling
>> initialize_grouping_rel(), to set up new grouped or partially grouped
>> relations.  By doing that it would be easier to ensure the
>> initialization is always done in a consistent way but only for the
>> relations we actually need.  But maybe we should call it
>> fetch_grouping_rel() instead.  The idea would be that instead of
>> calling fetch_upper_rel() we would call fetch_grouping_rel() when it
>> is a question of the grouped or partially grouped relation.  It would
>> either return the existing relation or initialize a new one for us.  I
>> think that would make it fairly easy to initialize only the ones we're
>> going to need.
>
> Hmm. I am working on refactoring the code to do something like this.
>

Here's a patch doing the same. I have split create_grouping_paths() into
three functions: 1. one to handle degenerate grouping paths
(try_degenerate_grouping_paths()); 2. one to create the grouping rels,
i.e. the partially grouped rel and the grouped rel (make_grouping_rels()),
which also sets some properties in GroupPathExtraData; and 3. one to
populate the grouping rels with paths
(populate_grouping_rels_with_paths()). With those changes, I have been
able to get rid of partially grouped rels when they are not necessary,
but I haven't tried to get rid of grouped rels when they are not needed.

GroupPathExtraData now completely absorbs members from and replaces
OtherUpperPathExtraData. This means that we have to consider a way to
pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
haven't tried it in this patch.

With this patch there's a failure in partition_aggregation where the
patch is creating paths with MergeAppend with GatherMerge underneath.
I think this is related to the call
add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
true. But I didn't investigate it further.

With those two things remaining I am posting this patch, so that
Jeevan Chalke can merge this patch into his patches and also merge
some of his changes related to mine and Robert's changes. Let me know
if this refactoring looks good.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Mar 8, 2018 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> This kind of goes along with the suggestion I made yesterday to
>> introduce a new function, which at the time I proposed calling
>> initialize_grouping_rel(), to set up new grouped or partially grouped
>> relations.  By doing that it would be easier to ensure the
>> initialization is always done in a consistent way but only for the
>> relations we actually need.  But maybe we should call it
>> fetch_grouping_rel() instead.  The idea would be that instead of
>> calling fetch_upper_rel() we would call fetch_grouping_rel() when it
>> is a question of the grouped or partially grouped relation.  It would
>> either return the existing relation or initialize a new one for us.  I
>> think that would make it fairly easy to initialize only the ones we're
>> going to need.
>
> Hmm. I am working on refactoring the code to do something like this.
>

Here's a patch doing the same. I have split create_grouping_paths() into
three functions: 1. one to handle degenerate grouping paths
(try_degenerate_grouping_paths()); 2. one to create the grouping rels,
i.e. the partially grouped rel and the grouped rel (make_grouping_rels()),
which also sets some properties in GroupPathExtraData; and 3. one to
populate the grouping rels with paths
(populate_grouping_rels_with_paths()). With those changes, I have been
able to get rid of partially grouped rels when they are not necessary,
but I haven't tried to get rid of grouped rels when they are not needed.

GroupPathExtraData now completely absorbs members from and replaces
OtherUpperPathExtraData. This means that we have to consider a way to
pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
haven't tried it in this patch.

With this patch there's a failure in partition_aggregation where the
patch is creating paths with MergeAppend with GatherMerge underneath.
I think this is related to the call
add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
true. But I didn't investigate it further.

With those two things remaining I am posting this patch, so that
Jeevan Chalke can merge this patch into his patches and also merge
some of his changes related to mine and Robert's changes. Let me know
if this refactoring looks good.

Thanks, Ashutosh, for the refactoring patch.
I will rebase my changes and will also resolve the two issues you reported above.
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat

GroupPathExtraData now completely absorbs members from and replaces
OtherUpperPathExtraData. This means that we have to consider a way to
pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
haven't tried it in this patch.

Initially, I was passing OtherUpperPathExtraData to the FDW, but now we need to pass GroupPathExtraData.

However, since GetForeignUpperPaths() is a generic function for all upper relations, we might think of renaming this struct to UpperPathExtraData and adding an UpperRelationKind member to it, which would be used to distinguish the passed-in extra data. But since we currently have extra data only for grouping, I chose not to do that here; someone may choose this approach later, when needed.

 
With this patch there's a failure in partition_aggregation where the
patch is creating paths with MergeAppend with GatherMerge underneath.
I think this is related to the call
add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
true. But I didn't investigate it further.

I fixed it. We need to pass is_partial_agg instead of extra->partial_partitionwise_grouping when calling add_paths_to_partial_grouping_rel() in the parallelism case.


With those two things remaining I am posting this patch, so that
Jeevan Chalke can merge this patch into his patches and also merge
some of his changes related to mine and Robert's changes. Let me know
if this refactoring looks good.

Will rebase my changes tomorrow.
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Mon, Mar 12, 2018 at 7:49 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
>>
>> GroupPathExtraData now completely absorbs members from and replaces
>> OtherUpperPathExtraData. This means that we have to consider a way to
>> pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
>> haven't tried it in this patch.
>
>
> Initially, I was passing OtherUpperPathExtraData to FDW. But now we need to
> pass GroupPathExtraData.
>
> However, since GetForeignUpperPaths() is a generic function for all upper
> relations, we might think of renaming this struct to UpperPathExtraData and
> adding an UpperRelationKind member to it, which would be used to distinguish
> the passed-in extra data. But since we currently have extra data only for
> grouping, I chose not to do that here; someone may choose this approach
> later, when needed.

We don't need an UpperRelationKind member in that structure; that will
be provided by the RelOptInfo passed in.

The problem here is that the extra information required for grouping is
not going to be the same as that needed for window aggregation, and
certainly not for ordering. If we try to jam everything into the same
structure, it will become large, with many members useless for a given
operation, and a reader will have no idea which of them are useful and
which are not. So instead we should try some polymorphism: I think we
can pass a void * to GetForeignUpperPaths(), and the corresponding FDW
hook will know what to cast it to based on the UpperRelationKind
passed.

BTW, the patch has added an argument to GetForeignUpperPaths() but has
not documented the change in API. If we go the route of polymorphism,
we will need to document the mapping between UpperRelationKind and the
type of structure passed in.

>
>>
>> With this patch there's a failure in partition_aggregation where the
>> patch is creating paths with MergeAppend with GatherMerge underneath.
>> I think this is related to the call
>> add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
>> true. But I didn't investigate it further.
>
>
> I fixed it. We need to pass  is_partial_agg instead of
> extra->partial_partitionwise_grouping while calling
> add_paths_to_partial_grouping_rel() in case of parallelism.

Thanks for the investigation and the fix.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi,

I have resolved all the comments/issues reported in this new patch-set.

The changes Ashutosh Bapat made to split create_grouping_paths() into separate functions, so that the partitionwise aggregate code can use them, were based on my partitionwise aggregate changes. Since they are essentially refactoring changes, I have extracted them into a separate patch that precedes any partitionwise changes (see 0005-Split-create_grouping_paths-and-Add-GroupPathExtraDa.patch), and then rebased all the partitionwise changes, including all fixes, on top of it.

The patch-set is complete now, but there is still scope for some comment improvements resulting from all this refactoring; I will work on that. I also need to update some documentation, indentation, etc., and will post those changes in the next patch-set. Meanwhile, this patch-set is ready for review.


On Tue, Mar 13, 2018 at 9:12 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Mon, Mar 12, 2018 at 7:49 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
>>
We don't need an UpperRelationKind member in that structure; that will be
provided by the RelOptInfo passed in.

The problem here is that the extra information required for grouping is not
going to be the same as that needed for window aggregation, and certainly
not for ordering. If we try to jam everything into the same structure, it
will become large, with many members useless for a given operation, and a
reader will have no idea which of them are useful and which are not. So
instead we should try some polymorphism: I think we can pass a void * to
GetForeignUpperPaths(), and the corresponding FDW hook will know what to
cast it to based on the UpperRelationKind passed.

Yep. Done this way.
 

BTW, the patch has added an argument to GetForeignUpperPaths() but has
not documented the change in API. If we go the route of polymorphism,
we will need to document the mapping between UpperRelationKind and the
type of structure passed in.

Oops, will do that in the next patch-set.

Thanks for pointing it out; I missed it in the first place.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 13, 2018 at 7:43 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi,

The patch-set is complete now, but there is still scope for some comment improvements resulting from all this refactoring; I will work on that. I also need to update some documentation, indentation, etc., and will post those changes in the next patch-set. Meanwhile, this patch-set is ready for review.

Attached complete patch-set.

Fixed all remaining documentation, indentation, and comment issues.
Also incorporated a few comment improvements provided off-list by Ashutosh Bapat.

Thanks
--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Mar 14, 2018 at 7:51 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
>
> On Tue, Mar 13, 2018 at 7:43 PM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> The patch-set is complete now. But still, there is a scope of some comment
>> improvements due to all these refactorings. I will work on it. Also, need to
>> update few documentations and indentations etc. Will post those changes in
>> next patch-set. But meanwhile, this patch-set is ready to review.
>
>
> Attached complete patch-set.
>
> Fixed all remaining documentation, indentation and comments.
> Also incorporated few comments improvements provided off-list by Ashutosh
> Bapat.
>

The patchset needs a rebase.  I have rebased the patches on the latest head
and made the following changes.

The force_partial_agg argument was added after the output arguments of
make_grouping_rels(); I moved it before the output arguments to keep the input
and output arguments separate.

Also, the comment in make_grouping_rels() about creating partial aggregate
paths before full aggregate paths has been moved to
populate_grouping_rels_with_paths().

In create_grouping_paths(), I moved the call to try_degenerate_grouping_paths()
before the computation of the extra grouping information.

Some more comment changes in the attached patch set.

+         *
+         * With partitionwise aggregates, we may have childrel's pathlist empty
+         * if we are doing partial aggregation. Thus do this only if childrel's
+         * pathlist is not NIL.
          */
-        if (childrel->cheapest_total_path->param_info == NULL)
+        if (childrel->pathlist != NIL &&
+            childrel->cheapest_total_path->param_info == NULL)
             accumulate_append_subpath(childrel->cheapest_total_path,
                                       &subpaths, NULL);
I thought we got rid of this code. Why has it come back? Maybe the comment
should point to a function where this case happens.

In populate_grouping_rels_with_paths(), we need to pass extra to
extension hook create_upper_paths_hook similar to what
add_paths_to_joinrel() does.

Also, we aren't passing is_partial_agg to the FDW hook, so it won't know
whether to compute partial or full aggregates. I think the check
5296         /* Partial aggregates are not supported. */
5297         if (extra->partial_partitionwise_grouping)
5298             return;
in add_foreign_grouping_paths() is wrong. It's checking whether the
children of the given relation will produce partial aggregates or not.
But it is supposed to check whether the given relation should produce
partial aggregates or not. I think we need to include is_partial_agg
in GroupPathExtraData so that it gets passed to FDWs. If we do so, we
need to add a comment in the prologue of GroupPathExtraData to
disambiguate usage of is_partial_agg and
partial_partitionwise_grouping.

In current create_grouping_paths() (without any of your patches
applied) we first create partial paths in partially grouped rel and
then add parallel path to grouped rel using those partial paths. Then
we hand over this to FDW and extension hooks, which may add partial
paths, which might throw away a partial path used to create a parallel
path in grouped rel causing a segfault. I think this bug exists since
we introduced parallel aggregation or upper relation refactoring
whichever happened later. Introduction of partially grouped rel has
just made it visible.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 15, 2018 at 3:38 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

The patchset needs a rebase.  I have rebased the patches on the latest head
and made the following changes.

The force_partial_agg argument was added after the output arguments of
make_grouping_rels(); I moved it before the output arguments to keep the input
and output arguments separate.

Also, the comment in make_grouping_rels() about creating partial aggregate
paths before full aggregate paths has been moved to
populate_grouping_rels_with_paths().

In create_grouping_paths(), I moved the call to try_degenerate_grouping_paths()
before the computation of the extra grouping information.

Thanks for these changes.
 

Some more comment changes in the attached patch set.

+         *
+         * With partitionwise aggregates, we may have childrel's pathlist empty
+         * if we are doing partial aggregation. Thus do this only if childrel's
+         * pathlist is not NIL.
          */
-        if (childrel->cheapest_total_path->param_info == NULL)
+        if (childrel->pathlist != NIL &&
+            childrel->cheapest_total_path->param_info == NULL)
             accumulate_append_subpath(childrel->cheapest_total_path,
                                       &subpaths, NULL);
I thought we got rid of this code. Why has it come back? Maybe the comment
should point to a function where this case happens.

Oops, my mistake. I added it back while working on your comments; at that time we were creating partially_grouped_rel unconditionally, and I was getting a segfault here.

But now we create partially_grouped_rel only when needed, thanks to your refactoring work, so this guard is no longer necessary. Removed in the attached patch-set.


In populate_grouping_rels_with_paths(), we need to pass extra to
extension hook create_upper_paths_hook similar to what
add_paths_to_joinrel() does.

Hmm.. you are right. Done.
 

Also, we aren't passing is_partial_agg to FDW hook, so it won't know
whether to compute partial or full aggregates. I think the check
5296         /* Partial aggregates are not supported. */
5297         if (extra->partial_partitionwise_grouping)
5298             return;
in add_foreign_grouping_paths() is wrong. It's checking whether the
children of the given relation will produce partial aggregates or not.
But it is supposed to check whether the given relation should produce
partial aggregates or not. I think we need to include is_partial_agg
in GroupPathExtraData so that it gets passed to FDWs. If we do so, we
need to add a comment in the prologue of GroupPathExtraData to
disambiguate usage of is_partial_agg and
partial_partitionwise_grouping.

Sorry, I misinterpreted this.
I have reworked it and added is_partial_aggregation to GroupPathExtraData.


In current create_grouping_paths() (without any of your patches
applied) we first create partial paths in partially grouped rel and
then add parallel path to grouped rel using those partial paths. Then
we hand over this to FDW and extension hooks, which may add partial
paths, which might throw away a partial path used to create a parallel
path in grouped rel causing a segfault. I think this bug exists since
we introduced parallel aggregation or upper relation refactoring
whichever happened later. Introduction of partially grouped rel has
just made it visible.

Yes, that's why I needed to create partitionwise aggregation paths before creating any normal paths.
If this issue needs to be addressed, though, it should be done in a separate patch.

However, as Ashutosh Bapat noticed, after the refactoring we no longer merely try to create partitionwise paths; rather, we create them whenever possible. So, in line with create_grouping_paths(), I have renamed the function to create_partitionwise_grouping_paths() in the attached patch-set.

Thanks


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 15, 2018 at 6:08 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> In current create_grouping_paths() (without any of your patches
> applied) we first create partial paths in partially grouped rel and
> then add parallel path to grouped rel using those partial paths. Then
> we hand over this to FDW and extension hooks, which may add partial
> paths, which might throw away a partial path used to create a parallel
> path in grouped rel causing a segfault. I think this bug exists since
> we introduced parallel aggregation or upper relation refactoring
> whichever happened later. Introduction of partially grouped rel has
> just made it visible.

I don't think there's really a problem here; or at least if there is,
I'm not seeing it.  If an FDW wants to introduce partially grouped
paths, it should do so when it is called for
UPPERREL_PARTIAL_GROUP_AGG from within
add_paths_to_partial_grouping_rel.  If it wants to introduce fully
grouped paths, it should do so when it is called for
UPPERREL_GROUP_AGG from within create_grouping_paths.  By the time the
latter call is made, it's too late to add partially grouped paths; if
the FDW does, that's a bug in the FDW.

Admittedly, this means that commit
3bf05e096b9f8375e640c5d7996aa57efd7f240c subtly changed the API
contract for FDWs.  Before that, an FDW that wanted to support partial
aggregation would have needed to add partially grouped paths to
UPPERREL_GROUP_AGG when called for that relation; whereas now it would
need to add them to UPPERREL_PARTIAL_GROUP_AGG when called for that
relation. This doesn't actually falsify any documentation, though,
because this oddity wasn't documented before.  Possibly more
documentation could stand to be written in this area, but that's not
the topic of this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 15, 2018 at 9:46 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Hmm.. you are right. Done.

I don't see a reason to hold off on committing 0002 and 0003, so I've
done that now; since they are closely related changes, I pushed them
as a single commit.  It probably could've just been included in the
main patch, but it's fine.

I don't much like the code that 0001 refactors and am not keen to
propagate it into more places.  I've separately proposed patches to
restructure that code in
http://postgr.es/m/CA+TgmoakT5gmahbPWGqrR2nAdFOMAOnOXYoWHRdVfGWs34t6_A@mail.gmail.com
and if we end up deciding to adopt that approach then I think this
patch will also need to create rels for UPPERREL_TLIST.  I suspect
that approach would also remove the need for 0004, as that case would
also end up being handled in a different way.  However, the jury is
still out on whether or not the approach I've proposed there is any
good.  Feel free to opine over on that thread.

I'm going to go spend some time looking at 0005 next.  It looks to me
like it's generally going in a very promising direction, but I need to
study the details more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 15, 2018 at 11:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm going to go spend some time looking at 0005 next.  It looks to me
> like it's generally going in a very promising direction, but I need to
> study the details more.

On further study this patch is doing a number of things, some of which
seem like better ideas than others.  Splitting out the degenerate
grouping case looks like a great idea.  I've committed a patch to do
this loosely based on 0005, but I whacked it around quite a bit.  I
think the result is cleaner than what you had.

As far as the stuff in GroupPathExtraData extra is concerned:

- can_hash and can_sort look good; we can precompute them once and
reuse them for every grouping relation.  Cool.

- can_partial_agg looks a bit pointless.  You're not going to save
many CPU cycles by computing a value that is derived from two Booleans
and storing the result in another Boolean variable.

- partial_costs_set.  The comment in compute_group_path_extra_data
doesn't look good.  It says "Set partial aggregation costs if we are
going to calculate partial aggregates in make_grouping_rels()", but
what it means is something more like "agg_partial_costs and
agg_final_costs are not valid yet".  I also wonder if there's a way
that we can figure out in advance whether we're going to need to do
this and just do it at the appropriate place in the code, as opposed
to doing it lazily.  Even if there were rare cases where we did it
unnecessarily I'm not sure that would be any big deal.

- agg_partial_costs and agg_final_costs themselves seem OK.

- consider_parallel is only used in make_grouping_rels and could be
replaced with a local variable.  Its comment in relation.h doesn't
make a lot of sense either, as it is not used to double-check
anything.

- The remaining fields vary across the partitioning hierarchy, and it
seems a little strange to store them in this structure alongside the
pre-computed stuff that doesn't change.  I'm not quite sure what to do
about that; obviously passing around 15-20 arguments to a function
isn't too desirable either.

I wonder if we could simplify things by copying more information from
the parent grouping rel to the child grouping rels.  It seems to me
for example that the way you're computing consider_parallel for the
child relations is kind of pointless.  The parallel-safety of the
grouping_target can't vary across children, nor can that of the
havingQual; the only thing that can change is whether the input rel
has consider_parallel set.  You could cater to that by having
GroupPathExtraData do something like extra.grouping_is_parallel_safe =
target_parallel_safe && is_parallel_safe(root, havingQual) and then
set each child's consider parallel flag to
input_rel->consider_parallel && extra.grouping_is_parallel_safe.

Similarly, right now the way the patch sets the reltargets for
grouping rels and partially grouping rels is a bit complex.
make_grouping_rels() calls make_partial_grouping_target() separately
for each partial grouping rel, but for non-partial grouping rels it
gets the translated tlist as an argument.  Could we instead consider
always building the tlist by translation from the parent, that is, a
child grouped rel's tlist is the translation of the parent
grouped_rel's tlist, and the child partially grouped rel's tlist is
the translation of the parent partially_grouped_rel's tlist?  If you
could both make that work and find a different place to compute the
partial agg costs, make_grouping_rels() would get a lot simpler or
perhaps go away entirely.

I don't like this condition which appears in that function:

    if (extra->try_parallel_aggregation || force_partial_agg ||
        (extra->partitionwise_grouping &&
         extra->partial_partitionwise_grouping))

The problem with that is that it's got to exactly match the criteria
for whether we're going to need the partial_grouping_rel.  If it's
true when we are not using partial paths, then you've missed an
optimization; in the reverse case, we'll probably crash or fail to
consider paths we should have considered.  It is not entirely
straightforward to verify that this test is correct.
add_paths_to_partial_grouping_rel() gets called if
extra->try_parallel_aggregation is true or if
extra->is_partial_aggregation is true, but the condition doesn't test
extra->is_partial_aggregation at all. The other way that we can end up
using partially_grouped_rel is if create_partitionwise_grouping_paths
is called, but it just silently fails to do anything if we have no
partially_grouped_rel.  Putting all that together, do the conditions
under which a partially_grouped_rel gets created match the conditions
under which we want to have one?  Beats me!  Moreover, even if it's
correct now, I think that the chances that the next person who
modifies this code will manage to keep it correct are not great.  I
think we need to create the partial grouping rel somewhere in the code
that's closer to where it's actually needed, so that we don't have so
much action at a distance, or at least have a simpler and more
transparent set of tests.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 15, 2018 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I wonder if we could simplify things by copying more information from
> the parent grouping rel to the child grouping rels.

On further review, it seems like a better idea is to generate the
partial grouping relations from the grouping relations to which they
correspond.  Attached is a series of proposed further refactoring
patches.

0001 moves the creation of partially_grouped_rel downwards.  Instead
of happening in create_grouping_paths(), it gets moved downward to
add_paths_to_partial_grouping_rel(), which is renamed
create_partial_grouping_paths() and now returns a pointer to new
RelOptInfo.  This seems like a better design than what we've got now:
it avoids creating the partially grouped relation if we don't need it,
and it looks more like the other upper planner functions
(create_grouping_paths, create_ordered_paths, etc.) which all create
and return a new relation.  One possible objection to this line of
attack is that Jeevan's
0006-Implement-partitionwise-aggregation-paths-for-partit.patch patch
adds an additional Boolean argument to that function so that it can be
called twice, once for partial paths and a second time for non-partial
paths.  But it looks to me like we should instead just add separate
handling to this function for the pathlist in each place where we're
currently handling the partial_pathlist.  That's more like what we do
elsewhere and avoids complicating the code with a bunch of
conditionals.  To make the generation of partially_grouped_rel from
grouped_rel work cleanly, this also sets grouped_rel's reltarget.

0002 moves the determination of which grouping strategies are possible
upwards.  It represents them as a 'flags' variable with bits for
GROUPING_CAN_USE_SORT, GROUPING_CAN_USE_HASH, and
GROUPING_CAN_PARTIAL_AGG.  These are set in create_grouping_paths()
and passed down to create_ordinary_grouping_paths().  The idea is that
the flags value would be passed down to the partition-wise aggregate
code which in turn would call create_ordinary_grouping_paths() for the
child grouping relations, so that the relevant determinations are made
only at the top level.  This patch also renames can_parallel_agg to
can_partial_agg and removes the parallelism-specific bits from it.  To
compensate for this, create_ordinary_grouping_paths() now tests the
removed conditions instead.  This is all good stuff for partition-wise
aggregate, since the grouped_rel->consider_parallel &&
input_rel->partial_pathlist != NIL conditions can vary on a per-child
basis but the rest of the stuff can't.  In some subsequent patch, the
test should be pushed down inside create_partial_grouping_paths()
itself, so that this function can handle both partial and non-partial
paths as mentioned in the preceding paragraph.

0003 is a cleanup patch.  It removes the grouping target as a separate
argument from a bunch of places that no longer need it given that 0001
sets grouped_rel->reltarget.

I think 0001 and 0002 together get us pretty close to having the stuff
that should be done for every child rel
(create_ordinary_grouping_paths) disentangled from the stuff that
should be done only once (create_grouping_paths).  There are a couple
of exceptions:

- create_partial_grouping_paths() is still doing
get_agg_clause_costs() for the partial grouping target, which (I
think) only needs to be done once.  Possibly we could handle that by
having create_grouping_paths() do that work whenever it sets
GROUPING_CAN_PARTIAL_AGG and pass the value downward.  You might
complain that it won't get used unless either there are partial paths
available for the input rel OR partition-wise aggregate is used --
there's no point in partially aggregating a non-partial path at the
top level.  We could just accept that as not a big deal, or maybe we
can figure out how to make it conditional so that we only do it when
either the input_rel has a partial path list or we have child rels.
Or we could do as you did in your patches and save it when we compute
it first, reusing it on each subsequent call.  Or maybe there's some
other idea.

- These patches don't do anything about
create_partial_grouping_paths() and create_ordinary_grouping_paths()
directly referencing parse->targetList and parse->havingQual.  I think
that we could add those as additional arguments to
create_ordinary_grouping_paths().  create_grouping_paths() would pass
the values taken from "parse" and partition-wise join would pass down
translated versions.

I am sort of unclear whether we need/want GroupPathExtraData at all.
What determines whether something gets passed via GroupPathExtraData
or just as a separate argument?  If we have a rule that stuff that is
common to all child grouped rels goes in there and other stuff
doesn't, or stuff postgres_fdw needs goes in there and other stuff
doesn't, then that might be OK.  But I'm not sure that there is such a
rule in the v20 patches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Mar 15, 2018 at 7:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 15, 2018 at 6:08 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> In current create_grouping_paths() (without any of your patches
>> applied) we first create partial paths in partially grouped rel and
>> then add parallel path to grouped rel using those partial paths. Then
>> we hand over this to FDW and extension hooks, which may add partial
>> paths, which might throw away a partial path used to create a parallel
>> path in grouped rel causing a segfault. I think this bug exists since
>> we introduced parallel aggregation or upper relation refactoring
>> whichever happened later. Introduction of partially grouped rel has
>> just made it visible.
>
> I don't think there's really a problem here; or at least if there is,
> I'm not seeing it.  If an FDW wants to introduce partially grouped
> paths, it should do so when it is called for
> UPPERREL_PARTIAL_GROUP_AGG from within
> add_paths_to_partial_grouping_rel.  If it wants to introduce fully
> grouped paths, it should do so when it is called for
> UPPERREL_GROUP_AGG from within create_grouping_paths.  By the time the
> latter call is made, it's too late to add partially grouped paths; if
> the FDW does, that's a bug in the FDW.

Right.

>
> Admittedly, this means that commit
> 3bf05e096b9f8375e640c5d7996aa57efd7f240c subtly changed the API
> contract for FDWs.  Before that, an FDW that wanted to support partial
> aggregation would have needed to add partially grouped paths to
> UPPERREL_GROUP_AGG when called for that relation; whereas now it would
> need to add them to UPPERREL_PARTIAL_GROUP_AGG when called for that
> relation.

And when an FDW added partial paths to the grouped rel
(UPPERREL_GROUP_AGG), it might throw away the partial paths gathered by
the core code. Commit 3bf05e096b9f8375e640c5d7996aa57efd7f240c fixed that
bug by allowing the FDW to add partial paths to UPPERREL_PARTIAL_GROUP_AGG
before gather paths are added to UPPERREL_GROUP_AGG. But I agree that it
has subtly changed the API, and we need to add this to the documentation
(maybe we should have added it in the commit which introduced the
partially grouped relation).

> This doesn't actually falsify any documentation, though,
> because this oddity wasn't documented before.  Possibly more
> documentation could stand to be written in this area, but that's not
> the topic of this thread.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Mar 15, 2018 at 9:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 15, 2018 at 9:46 AM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>> Hmm.. you are right. Done.
>
> I don't see a reason to hold off on committing 0002 and 0003, so I've
> done that now; since they are closely related changes, I pushed them
> as a single commit.  It probably could've just been included in the
> main patch, but it's fine.

Thanks.

>
> I don't much like the code that 0001 refactors and am not keen to
> propagate it into more places.  I've separately proposed patches to
> restructure that code in
> http://postgr.es/m/CA+TgmoakT5gmahbPWGqrR2nAdFOMAOnOXYoWHRdVfGWs34t6_A@mail.gmail.com
> and if we end up deciding to adopt that approach then I think this
> patch will also need to create rels for UPPERREL_TLIST.  I suspect
> that approach would also remove the need for 0004, as that case would
> also end up being handled in a different way.  However, the jury is
> still out on whether or not the approach I've proposed there is any
> good.  Feel free to opine over on that thread.

Will take a look at that patch next week.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Fri, Mar 16, 2018 at 12:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> - partial_costs_set.  The comment in compute_group_path_extra_data
> doesn't look good.  It says "Set partial aggregation costs if we are
> going to calculate partial aggregates in make_grouping_rels()", but
> what it means is something more like "agg_partial_costs and
> agg_final_costs are not valid yet".  I also wonder if there's a way
> that we can figure out in advance whether we're going to need to do
> this and just do it at the appropriate place in the code, as opposed
> to doing it lazily.  Even if there were rare cases where we did it
> unnecessarily I'm not sure that would be any big deal.

I struggled with this one. In case of multi-level partitioning, we will
compute fully aggregated results per partition at the levels whose
partition keys are covered by the GROUP BY clause, but beyond those
levels we will compute partial aggregates. When none of the levels can
use parallelism, only the levels which compute partial aggregates need
these costs to be calculated. We would need to traverse the partitioning
hierarchy before even starting aggregation to decide whether we will
need partial aggregation further down. Instead of adding that walker, I
thought it better to use this flag. But more on this in my reply to your
next mail.

>
> I wonder if we could simplify things by copying more information from
> the parent grouping rel to the child grouping rels.  It seems to me
> for example that the way you're computing consider_parallel for the
> child relations is kind of pointless.  The parallel-safety of the
> grouping_target can't vary across children, nor can that of the
> havingQual; the only thing that can change is whether the input rel
> has consider_parallel set.  You could cater to that by having
> GroupPathExtraData do something like extra.grouping_is_parallel_safe =
> target_parallel_safe && is_parallel_safe(root, havingQual) and then
> set each child's consider parallel flag to
> input_rel->consider_parallel && extra.grouping_is_parallel_safe.

I am actually confused by the current code itself. What parallel
workers compute is a partial target, which is different from the full
target. The partial target would contain only expressions from the
GROUP BY clause and partial aggregate nodes; it will not contain any
expressions built from full aggregates. When the partial aggregate
nodes are parallel safe but the expressions using the full aggregates
are not, the current code will not allow parallel aggregation to take
place, whereas it should. That looks like an optimization we are
missing today.

That bug aside, the point is that the target of the grouped relation
and that of the partially grouped relation are different, giving rise
to the possibility that one is parallel safe and the other is not. In
fact, for the partially grouped relation we shouldn't check the
parallel safety of the havingQual at all, since the havingQual is
applied only over the fully aggregated result. That seems to be another
missing optimization. Or I am missing something here.

In the case of partition-wise aggregation, some levels may be
performing only partial aggregation, and if partial aggregation is
parallel safe, we should allow those levels to run parallel partial
aggregation. Checking the parallel safety of only the grouped relation
doesn't help here. But since this optimization is already missing right
now, I think this patch shouldn't bother about it.

>
> Similarly, right now the way the patch sets the reltargets for
> grouping rels and partially grouping rels is a bit complex.
> make_grouping_rels() calls make_partial_grouping_target() separately
> for each partial grouping rel, but for non-partial grouping rels it
> gets the translated tlist as an argument.  Could we instead consider
> always building the tlist by translation from the parent, that is, a
> child grouped rel's tlist is the translation of the parent
> grouped_rel's tlist, and the child partially grouped rel's tlist is
> the translation of the parent partially_grouped_rel's tlist?  If you
> could both make that work and find a different place to compute the
> partial agg costs, make_grouping_rels() would get a lot simpler or
> perhaps go away entirely.

Hmm, that's a thought. While translating we allocate new nodes, whereas
make_partial_grouping_target() reuses the nodes from the full target.
For a partially grouped child relation this means we would allocate new
nodes to create the partial target as well, and setrefs would then
spend cycles matching node trees instead of matching pointers. But I
think we can take that hit if it saves us some complexity in the code.

>
> I don't like this condition which appears in that function:
>
>     if (extra->try_parallel_aggregation || force_partial_agg ||
>         (extra->partitionwise_grouping &&
>          extra->partial_partitionwise_grouping))
>
> The problem with that is that it's got to exactly match the criteria
> for whether we're going to need the partial_grouping_rel.  If it's
> true when we are not using partial paths, then you've missed an
> optimization;  in the reverse case, we'll probably crash or fail to
> consider paths we should have considered.

>  It is not entirely
> straightforward to verify that this test is correct.
> add_paths_to_partial_grouping_rel() gets called if
> extra->try_parallel_aggregation is true or if
> extra->is_partial_aggregation is true, but the condition doesn't test
> extra->is_partial_aggregation at all.

Why do we need to test extra->is_partial_aggregation? We are testing
force_partial_agg. I agree that we should probably test
is_partial_aggregation as well, but that doesn't make this condition
wrong.

> The other way that we can end up
> using partially_grouped_rel is if create_partitionwise_grouping_paths
> is called, but it just silently fails to do anything if we have no
> partially_grouped_rel.

It will create partially_grouped_rel when partition-wise grouping
requires it. So this sentence seems to contradict itself; I am
confused.

> Moreover, even if it's
> correct now, I think that the chances that the next person who
> modifies this code will manage to keep it correct are not great.  I
> think we need to create the partial grouping rel somewhere in the code
> that's closer to where it's actually needed, so that we don't have so
> much action at a distance, or at least have a simpler and more
> transparent set of tests.

+1. I agree with that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Fri, Mar 16, 2018 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 15, 2018 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I wonder if we could simplify things by copying more information from
>> the parent grouping rel to the child grouping rels.
>
> On further review, it seems like a better idea is to generate the
> partial grouping relations from the grouping relations to which they
> correspond.  Attached is a series of proposed further refactoring
> patches.

Ok. That looks good.

>
> 0001 moves the creation of partially_grouped_rel downwards.  Instead
> of happening in create_grouping_paths(), it gets moved downward to
> add_paths_to_partial_grouping_rel(), which is renamed
> create_partial_grouping_paths() and now returns a pointer to new
> RelOptInfo.  This seems like a better design than what we've got now:
> it avoids creating the partially grouped relation if we don't need it,
> and it looks more like the other upper planner functions
> (create_grouping_paths, create_ordered_paths, etc.) which all create
> and return a new relation.

I liked that.

>
> 0002 moves the determination of which grouping strategies are possible
> upwards.  It represents them as a 'flags' variable with bits for
> GROUPING_CAN_USE_SORT, GROUPING_CAN_USE_HASH, and
> GROUPING_CAN_PARTIAL_AGG.  These are set in create_grouping_paths()
> and passed down to create_ordinary_grouping_paths().  The idea is that
> the flags value would be passed down to the partition-wise aggregate
> code which in turn would call create_ordinary_grouping_paths() for the
> child grouping relations, so that the relevant determinations are made
> only at the top level.

+1.

> This patch also renames can_parallel_agg to
> can_partial_agg and removes the parallelism-specific bits from it.

I think we need to update the comments in this function to use the
phrase "partial aggregation" instead of "parallel aggregation". And I
think we need to change the conditions as well. For example, if
parse->groupClause == NIL, why can't we do partial aggregation? This
is the classic case in which we will need partial aggregation. We
should probably test this with Jeevan's patches for partition-wise
aggregate to see whether it considers partition-wise aggregate or not.

Or, when parse->groupingSets is true, I can see why we can't use
parallel query, but we can still compute partial aggregates. This
condition doesn't hurt much, since partition-wise aggregation bails out
when there are grouping sets, so it's not that harmful here.

> To
> compensate for this, create_ordinary_grouping_paths() now tests the
> removed conditions instead.  This is all good stuff for partition-wise
> aggregate, since the grouped_rel->consider_parallel &&
> input_rel->partial_pathlist != NIL conditions can vary on a per-child
> basis but the rest of the stuff can't.  In some subsequent patch, the
> test should be pushed down inside create_partial_grouping_paths()
> itself, so that this function can handle both partial and non-partial
> paths as mentioned in the preceding paragraph.

I think can_parallel_agg() combines two sets of conditions: whether
partial aggregation is possible and whether parallel aggregation is
possible. can_partial_agg() should get the first set, and we should
retain can_parallel_agg() for the second. We may then split
can_parallel_agg() into variant and invariant conditions, i.e. the
conditions that change with input_rel and grouped_rel and those that
don't.



>
> - create_partial_grouping_paths() is still doing
> get_agg_clause_costs() for the partial grouping target, which (I
> think) only needs to be done once.  Possibly we could handle that by
> having create_grouping_paths() do that work whenever it sets
> GROUPING_CAN_PARTIAL_AGG and pass the value downward.  You might
> complain that it won't get used unless either there are partial paths
> available for the input rel OR partition-wise aggregate is used --
> there's no point in partially aggregating a non-partial path at the
> top level.  We could just accept that as not a big deal, or maybe we
> can figure out how to make it conditional so that we only do it when
> either the input_rel has a partial path list or we have child rels.
> Or we could do as you did in your patches and save it when we compute
> it first, reusing it on each subsequent call.  Or maybe there's some
> other idea.

I am good with anything as long as we avoid repeated computation.

>
> I am sort of unclear whether we need/want GroupPathExtraData at all.
> What determines whether something gets passed via GroupPathExtraData
> or just as a separate argument?  If we have a rule that stuff that is
> common to all child grouped rels goes in there and other stuff
> doesn't, or stuff postgres_fdw needs goes in there and other stuff
> doesn't, then that might be OK.  But I'm not sure that there is such a
> rule in the v20 patches.

We have a single FDW hook for all the upper relations, and that hook
cannot accept grouping-specific arguments. Either we need a separate
FDW hook for grouping, or we need some way of passing
upper-relation-specific information down to an FDW. I think some FDWs
and extensions will be happy if we provide them ready-made decisions
for can_sort, can_hash, can_partial_agg etc. It will be good if they
don't have to translate the grouping target and havingQual for every
child twice, once for core and a second time in the FDW. In all, it
looks like we need some structure to hold that information so that we
can pass it down through the hook. I am fine with two structures, one
variant and the other invariant; an upper operation can have one of
them or both.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Ok. That looks good.

Here's an updated version.  In this version, based on a voice
discussion with Ashutosh and Jeevan, I adjusted 0001 to combine it
with an earlier idea of splitting Gather/Gather Merge path generation
out of the function that creates partially aggregated paths.  The idea
here is that create_ordinary_gather_paths() could first call
create_partial_grouping_paths(), then add additional paths which might
be partial or non-partial by invoking the partition-wise aggregate
logic, then call gather_grouping_paths() and set_cheapest() to
finalize the partially grouped rel.  Also, I added draft commit
messages.

With this patch set applied, the key bit of logic in
create_ordinary_grouping_paths() ends up looking like this:

    if (grouped_rel->consider_parallel && input_rel->partial_pathlist != NIL
        && (flags & GROUPING_CAN_PARTIAL_AGG) != 0)
    {
        partially_grouped_rel =
            create_partial_grouping_paths(root,
                                          grouped_rel,
                                          input_rel,
                                          gd,
                                          can_sort,
                                          can_hash,
                                          &agg_final_costs);
        gather_grouping_paths(root, partially_grouped_rel);
        set_cheapest(partially_grouped_rel);
    }

I imagine that what the main partition-wise aggregate patch would do
is (1) change the conditions under which
create_partial_grouping_paths() gets called, (2) postpone
gather_grouping_paths() and set_cheapest() until after partition-wise
aggregate had been done, doing them only if partially_grouped_rel !=
NULL.  Partition-wise aggregate will need to happen before
add_paths_to_grouping_rel(), though, so that the latter function can
try a FinalizeAggregate node on top of an Append added by
partition-wise aggregate.

This is a bit strange, because it will mean that partition-wise
aggregate will be attempted BEFORE adding ordinary aggregate paths to
grouped_rel but AFTER adding them to partially_grouped_rel.  We could
fix that by splitting add_paths_to_grouping_rel() into two functions,
one of which performs full aggregation directly and the other of which
tries finishing partial aggregation.  I'm unsure that's a good idea
though: it would mean that we have very similar logic in two different
functions that could get out of sync as a result of future code
changes, and it's not really fixing any problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> This patch also renames can_parallel_agg to
>> can_partial_agg and removes the parallelism-specific bits from it.
>
> I think we need to update the comments in this function to use phrase
> "partial aggregation" instead of "parallel aggregation". And I think
> we need to change the conditions as well. For example if
> parse->groupClause == NIL, why can't we do partial aggregation? This
> is the classical case when we will need partial aggregation. Probably
> we should test this with Jeevan's patches for partition-wise aggregate
> to see if it considers partition-wise aggregate or not.

I think the case where we have neither any aggregates nor a grouping
clause is where we are doing SELECT DISTINCT.  Something like SELECT
COUNT(*) FROM ... is not this case because that has an aggregate.

I'm sort of on the fence as to whether and how to update the comments.
I agree that it's a little strange to leave the comments here
referencing parallel aggregation when the function has been renamed to
can_partial_agg(), but a simple search-and-replace doesn't necessarily
improve the situation very much.  Most notably, hasNonSerial is
irrelevant for partial but non-parallel aggregation, but we still have
the test because we haven't done the work to really do the right thing
here, which is to separately track whether we can do parallel partial
aggregation (either hasNonPartial or hasNonSerial is a problem) and
non-parallel partial aggregation (only hasNonPartial is a problem).
This needs a deeper reassessment, but I don't think that can or should
be something we try to do right now.

> OR When parse->groupingSets is true, I can see why we can't use
> parallel query, but we can still compute partial aggregates. This
> condition doesn't hurt since partition-wise aggregation bails out when
> there are grouping sets, so it's not that harmful here.

I haven't thought deeply about what will break when GROUPING SETS are
in use, but it's not the purpose of this patch set to make them work
where they didn't before.  The point of hoisting the first two tests
out of this function was just to avoid doing repeated work when
partition-wise aggregate is in use.  Those two tests could conceivably
produce different results for different children, whereas the
remaining tests won't give different answers.  Let's not get
distracted by the prospect of improving the tests.  I suspect that's
not anyway so simple to achieve as what you seem to be speculating
here...

>> I am sort of unclear whether we need/want GroupPathExtraData at all.
>> What determines whether something gets passed via GroupPathExtraData
>> or just as a separate argument?  If we have a rule that stuff that is
>> common to all child grouped rels goes in there and other stuff
>> doesn't, or stuff postgres_fdw needs goes in there and other stuff
>> doesn't, then that might be OK.  But I'm not sure that there is such a
>> rule in the v20 patches.
>
> We have a single FDW hook for all the upper relations and that hook
> can not accept grouping specific arguments. Either we need a separate
> FDW hook for grouping OR we need some way of passing upper relation
> specific information down to an FDW. I think some FDWs and extensions
> will be happy if we provide them readymade decisions for can_sort,
> can_hash, can_partial_agg etc. It will be good if they don't have to
> translate the grouping target and havingQual for every child twice,
> once for core and second time in the FDW. In all it looks like we need
> some structure to hold that information so that we can pass it down
> the hook. I am fine with two structures one variable and other
> invariable. An upper operation can have one of them or both.

I'm fine with using a structure to bundle details that need to be sent
to the FDW, but why should the FDW need to know about
can_sort/can_hash?  I suppose if it needs to know about
can_partial_agg then it doesn't really cost anything to pass through
all the flags, but I doubt whether the FDW has any use for the others.
Anyway, if that's the goal, let's just make sure that the set of
things we're passing that way are exactly the set of things that we
think the FDW might need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Mon, Mar 19, 2018 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> This patch also renames can_parallel_agg to
>>> can_partial_agg and removes the parallelism-specific bits from it.
>>
>> I think we need to update the comments in this function to use phrase
>> "partial aggregation" instead of "parallel aggregation". And I think
>> we need to change the conditions as well. For example if
>> parse->groupClause == NIL, why can't we do partial aggregation? This
>> is the classical case when we will need partial aggregation. Probably
>> we should test this with Jeevan's patches for partition-wise aggregate
>> to see if it considers partition-wise aggregate or not.
>
> I think the case where we have neither any aggregates nor a grouping
> clause is where we are doing SELECT DISTINCT.

That's a case that would also benefit from partial partition-wise
grouping: each partition produces its own distinct values, reducing the
number of rows that flow through an Append node (or, better, over the
network), and the finalization step removes duplicates. In fact, what
we are attempting here is partition-wise grouping, and partial
aggregation happens to be a requirement for doing that only if there
are aggregates. If there are no aggregates, we still benefit from
partition-wise grouping. So can_partial_agg should tell us only whether
we can calculate partial aggregates. If aggregates are present and
can_partial_agg is false, we cannot attempt partition-wise grouping;
but if there are no aggregates and can_partial_agg is false, that
shouldn't prohibit us from using partition-wise grouping.
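A toy illustration of that DISTINCT case (plain Python, not PostgreSQL
code): each partition de-duplicates locally, shrinking what flows
through the Append node (or over the network for foreign partitions),
and the finalization step removes values that appear in more than one
partition.

```python
p1 = [3, 3, 1, 2, 2, 2]          # rows of partition 1
p2 = [2, 4, 4, 1]                # rows of partition 2

local_distinct = [sorted(set(p)) for p in (p1, p2)]      # per-partition DISTINCT
appended = [v for part in local_distinct for v in part]  # rows flowing through Append
final = sorted(set(appended))                            # finalization step

# 10 input rows shrink to 6 rows through the Append; final == [1, 2, 3, 4].
# No aggregate functions are involved, yet partition-wise grouping helps.
```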

> Something like SELECT
> COUNT(*) FROM ... is not this case because that has an aggregate.
>
> I'm sort of on the fence as to whether and how to update the comments.
> I agree that it's a little strange to leave the comments here
> referencing parallel aggregation when the function has been renamed to
> is_partial_agg(), but a simple search-and-replace doesn't necessarily
> improve the situation very much.

Hmm, agreed. And as you have mentioned downthread, it's not the purpose
of this patch set to attempt that. Maybe I will try providing a patch
to update the comments once PWA is committed.

> Most notably, hasNonSerial is
> irrelevant for partial but non-parallel aggregation, but we still have
> the test because we haven't done the work to really do the right thing
> here, which is to separately track whether we can do parallel partial
> aggregation (either hasNonPartial or hasNonSerial is a problem) and
> non-parallel partial aggregation (only hasNonPartial is a problem).
> This needs a deeper reassessment, but I don't think that can or should
> be something we try to do right now.

I think this bit is easy to mention in the comments. We could say that
since the partial-aggregation code currently uses serialization and
deserialization while calculating partial aggregates, even where that
step is not strictly needed, we require hasNonSerial to be false for
partial aggregation to work. A future improvement that avoids
serialization and deserialization in partial aggregation when no
parallel query is involved should remove this condition from here and
treat it as a requirement for parallel aggregation only.

>
>> OR When parse->groupingSets is true, I can see why we can't use
>> parallel query, but we can still compute partial aggregates. This
>> condition doesn't hurt since partition-wise aggregation bails out when
>> there are grouping sets, so it's not that harmful here.
>
> I haven't thought deeply about what will break when GROUPING SETS are
> in use, but it's not the purpose of this patch set to make them work
> where they didn't before.  The point of hoisting the first two tests
> out of this function was just to avoid doing repeated work when
> partition-wise aggregate is in use.  Those two tests could conceivably
> produce different results for different children, whereas the
> remaining tests won't give different answers.  Let's not get
> distracted by the prospect of improving the tests.  I suspect that's
> not anyway so simple to achieve as what you seem to be speculating
> here...

+1.


>
> I'm fine with using a structure to bundle details that need to be sent
> to the FDW, but why should the FDW need to know about
> can_sort/can_hash?  I suppose if it needs to know about
> can_partial_agg then it doesn't really cost anything to pass through
> all the flags, but I doubt whether the FDW has any use for the others.
> Anyway, if that's the goal, let's just make sure that the set of
> things we're passing that way are exactly the set of things that we
> think the FDW might need.

I am speculating that an FDW or custom planner hook that does some
local processing, or hints something to the foreign server, could use
those flags. But I agree that unless we see such a requirement we
shouldn't expose those in the structure; it will be difficult to remove
them later.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi,

On Mon, Mar 19, 2018 at 10:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Ok. That looks good.

Here's an updated version.  In this version, based on a voice
discussion with Ashutosh and Jeevan, I adjusted 0001 to combine it
with an earlier idea of splitting Gather/Gather Merge path generation
out of the function that creates partially aggregated paths.  The idea
here is that create_ordinary_gather_paths() could first call
create_partial_grouping_paths(), then add additional paths which might
be partial or non-partial by invoking the partition-wise aggregate
logic, then call gather_grouping_paths() and set_cheapest() to
finalize the partially grouped rel.  Also, I added draft commit
messages.

I have added all three of these patches to the attached patch set and rebased my changes over them.

However, I have not yet made this patch set dependent on the UPPERREL_TLIST changes you have proposed on another mail thread, and thus it has patch 0001 refactoring the scan/join issue.
0002, 0003 and 0004 are your patches added in this patch set.
0005 and 0006 are further refactoring patches. 0006 adds GroupPathExtraData, which mostly stores child-variant data and costs.
0007 is the main partitionwise aggregation patch, rebased accordingly.
0008 contains test cases and 0009 contains the FDW changes.

Let me know if I missed any point to consider while rebasing.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Tue, Mar 20, 2018 at 10:46 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I have added all these three patches in the attached patch-set and rebased
> my changes over it.
>
> However, I have not yet made this patch-set dependent on UPPERREL_TLIST
> changes you have proposed on another mail-thread and thus it has 0001 patch
> refactoring the scanjoin issue.
> 0002, 0003 and 0004 are your patches added in this patchset.
> 0005 and 0006 are further refactoring patches. 0006 adds a
> GroupPathExtraData which stores mostly child variant data and costs.
> 0007 is main partitionwise aggregation patch which is then rebased
> accordingly.
> 0008 contains testcase and 0009 contains FDW changes.

Committed my refactoring patches (your 0002-0004).

Regarding apply_scanjoin_target_to_paths in 0001 and 0007, it seems
like what happens is: we first build an Append path for the topmost
scan/join rel.  That uses paths from the individual relations that
don't necessarily produce the final scan/join target.  Then we mutate
those relations in place during partition-wise aggregate so that they
now do produce the final scan/join target and generate some more paths
using the results.  So there's an ordering dependency, and the same
pathlist represents different things at different times.  That is, I
suppose, not technically any worse than what we're doing for the
scan/join rel's pathlist in general, but here there's the additional
complexity that the paths get used both before and after being
mutated.  The UPPERREL_TLIST proposal would clean this up, although I
realize that has unresolved issues.

In create_partial_grouping_paths, the loop that does "for (i = 0; i <
2; i++)" is not exactly what I had in mind when I said that we should
use two loops.  I did not mean a loop with two iterations.  I meant
adding a loop like foreach(lc, input_rel->pathlist) in each place
where we currently have a loop like
foreach(input_rel->partial_pathlist).  See 0001, attached.

Don't write if (a) Assert(b) but rather Assert(!a || b).  See 0002, attached.

In the patch as proposed, create_partial_grouping_paths() can get
called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
wrong.  If can_partial_agg() isn't accurately determining whether
partial aggregation is possible, and as Ashutosh and I have been
discussing, there's room for improvement in that area, then that's a
topic for some other set of patches.  Also, the test in
create_ordinary_grouping_paths for whether or not to call
create_partial_grouping_paths() is super-complicated and uncommented.
I think a simpler approach is to allow create_partial_grouping_paths()
the option of returning NULL.  See 0003, attached.

make_grouping_rel() claims that "For now, all aggregated paths are
added to the (GROUP_AGG, NULL) upperrel", but this is false: we no
longer have only one grouped upper rel.

I'm having a heck of a time understanding what is_partial_aggregation
and perform_partial_partitionwise_aggregation are supposed to be
doing.  It seems like is_partial_aggregation means that we should ONLY
do partial aggregation, which is not exactly what the name implies.
It also seems like perform_partial_partitionwise_aggregation and
is_partial_aggregation differ only in that they apply to the current
level and to the child level respectively; can't we merge these
somehow so that we don't need both of them?

I think that if the last test in can_partitionwise_grouping were moved
before the previous test, it could be simplified to test only
(extra->flags & GROUPING_CAN_PARTIAL_AGG) == 0 and not
*perform_partial_partitionwise_aggregation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Mar 21, 2018 at 2:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 20, 2018 at 10:46 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> I have added all these three patches in the attached patch-set and rebased
> my changes over it.
>
> However, I have not yet made this patch-set dependent on UPPERREL_TLIST
> changes you have proposed on another mail-thread and thus it has 0001 patch
> refactoring the scanjoin issue.
> 0002, 0003 and 0004 are your patches added in this patchset.
> 0005 and 0006 are further refactoring patches. 0006 adds a
> GroupPathExtraData which stores mostly child variant data and costs.
> 0007 is main partitionwise aggregation patch which is then rebased
> accordingly.
> 0008 contains testcase and 0009 contains FDW changes.

Committed my refactoring patches (your 0002-0004).

Thanks Robert.
 

Regarding apply_scanjoin_target_to_paths in 0001 and 0007, it seems
like what happens is: we first build an Append path for the topmost
scan/join rel.  That uses paths from the individual relations that
don't necessarily produce the final scan/join target.  Then we mutate
those relations in place during partition-wise aggregate so that they
now do produce the final scan/join target and generate some more paths
using the results.  So there's an ordering dependency, and the same
pathlist represents different things at different times.  That is, I
suppose, not technically any worse than what we're doing for the
scan/join rel's pathlist in general, but here there's the additional
complexity that the paths get used both before and after being
mutated.  The UPPERREL_TLIST proposal would clean this up, although I
realize that has unresolved issues.

In create_partial_grouping_paths, the loop that does "for (i = 0; i <
2; i++)" is not exactly what I had in mind when I said that we should
use two loops.  I did not mean a loop with two iterations.  I meant
adding a loop like foreach(lc, input_rel->pathlist) in each place
where we currently have a loop like
foreach(input_rel->partial_pathlist).

The path creation logic for partial_pathlist and pathlist was identical, and thus I thought of just looping over it twice, switching the pathlist, so that we have minimal code to maintain. But yes, I agree that it adds complexity.

  See 0001, attached.

Looks great. Thanks.
 

Don't write if (a) Assert(b) but rather Assert(!a || b).  See 0002, attached.

OK. Noted.


In the patch as proposed, create_partial_grouping_paths() can get
called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
wrong.

I don't think so. For the parallel case, we do check that. And for the partitionwise aggregation case, it was checked inside the can_partitionwise_grouping() function and the flags were set accordingly. Am I missing something?
 
  If can_partial_agg() isn't accurately determining whether
partial aggregation is possible,

I think it does determine that accurately.
if (!parse->hasAggs && parse->groupClause == NIL)
holds only for DISTINCT queries, which we are not handling here anyway; for partitionwise aggregate it won't be true, as otherwise we would be in the degenerate grouping case.
 
and as Ashutosh and I have been
discussing, there's room for improvement in that area, then that's a
topic for some other set of patches.  Also, the test in
create_ordinary_grouping_paths for whether or not to call
create_partial_grouping_paths() is super-complicated and uncommented.
I think a simpler approach is to allow create_partial_grouping_paths()
the option of returning NULL.  See 0003, attached.

Thanks for simplifying it.

However, after this simplification, we were unnecessarily creating non-parallel partial aggregation paths for the root input rel when they were not needed.
Consider a case where we need partial aggregation from a child. At the root-level entry, extra->is_partial_aggregation = 0, since the parent is still doing full aggregation, but perform_partial_partitionwise_aggregation is true, which tells the child to perform partial partitionwise aggregation. In this case cheapest_total_path will be set, and thus we would go ahead and create partial aggregate paths for the parent rel, which is not needed.

I have tweaked these conditions and posted them in a separate patch (0006).
However, I have merged all three of your patches into one (0005).
 

make_grouping_rel() claims that "For now, all aggregated paths are
added to the (GROUP_AGG, NULL) upperrel", but this is false: we no
longer have only one grouped upper rel.

Done.
 

I'm having a heck of a time understanding what is_partial_aggregation
and perform_partial_partitionwise_aggregation are supposed to be
doing.

As you correctly said, is_partial_aggregation denotes that we are doing ONLY partial aggregation at this level of the partitioning hierarchy, whereas perform_partial_partitionwise_aggregation instructs the child whether it should perform partial or full aggregation at its level. Since we need to create a partially_grouped_rel, we evaluate all these possibilities and pass them down so that the child does not need to compute them again.

  It seems like is_partial_aggregation means that we should ONLY
do partial aggregation, which is not exactly what the name implies.

I think it says that we are doing partial aggregation, which implies using partially_grouped_rel and skipping the finalization step.

It also seems like perform_partial_partitionwise_aggregation and
is_partial_aggregation differ only in that they apply to the current
level and to the child level respectively;

It's the other way around. is_partial_aggregation applies to the current level and perform_partial_partitionwise_aggregation applies to the child level.

can't we merge these
somehow so that we don't need both of them?

I don't think we can, as they apply at two different levels. Consider a scenario where we are doing full aggregation at level 1 and need partial aggregation at level 2: the two differ. They will only be the same when both levels are doing the same thing, so we can't merge them.

Can you suggest better names, as these seem confusing?


I think that if the last test in can_partitionwise_grouping were moved
before the previous test, it could be simplified to test only
(extra->flags & GROUPING_CAN_PARTIAL_AGG) == 0 and not
*perform_partial_partitionwise_aggregation.

I don't think we can do it that way. Only if *perform_partial_partitionwise_aggregation is found to be true do we need to check whether partial aggregation itself is possible. If we are going to perform full partitionwise aggregation, the can_partial_agg test is not needed. Have I misread your comments?
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Wed, Mar 21, 2018 at 8:01 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>> In the patch as proposed, create_partial_grouping_paths() can get
>> called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
>> wrong.
>
> I don't think so. For parallel case, we do check that. And for partitionwise
> aggregation check, it was checked inside can_partitionwise_grouping()
> function and flags were set accordingly. Am I missing something?

Well, one of us is missing something somewhere.  If
GROUPING_CAN_PARTIAL_AGG means that we're allowed to do partial
grouping, and if create_partial_grouping_paths() is where partial
grouping happens, then we should only be calling the latter if the
former is set.  I mean, how can it make sense to create
partially-grouped paths if we're not allowed to do partial grouping?

> I have tweaked these conditions and posted in a separate patch (0006).
> However, I have merged all your three patches in one (0005).

OK, thanks.  I wasn't sure I had understood what was going on, so
thanks for checking it.

Thanks also for keeping 0004-0006 separate here, but I think you can
flatten them into one patch in the next version.

>> I think that if the last test in can_partitionwise_grouping were moved
>> before the previous test, it could be simplified to test only
>> (extra->flags & GROUPING_CAN_PARTIAL_AGG) == 0 and not
>> *perform_partial_partitionwise_aggregation.
>
> I think we can't do this way. If *perform_partial_partitionwise_aggregation
> found to be true then only we need to check whether partial aggregation
> itself is possible or not. If we are going to perform a full partitionwise
> aggregation then test for can_partial_agg is not needed. Have I misread your
> comments?

It seems you're correct, because when I change it the tests fail.  I
don't yet understand why.

Basically, the main patch seems to use three Boolean signaling mechanisms:

1. GROUPING_CAN_PARTITIONWISE_AGG
2. is_partial_aggregation
3. perform_partial_partitionwise_aggregation

Stuff I don't understand:

- Why is one of them a Boolean shoved into "flags", even though it's
not static across the whole hierarchy like the other flags, and the
other two are separate Booleans?
- What do they all do, anyway?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Mar 21, 2018 at 7:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 21, 2018 at 8:01 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>> In the patch as proposed, create_partial_grouping_paths() can get
>> called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
>> wrong.
>
> I don't think so. For parallel case, we do check that. And for partitionwise
> aggregation check, it was checked inside can_partitionwise_grouping()
> function and flags were set accordingly. Am I missing something?

Well, one of us is missing something somewhere.  If
GROUPING_CAN_PARTIAL_AGG means that we're allowed to do partial
grouping, and if create_partial_grouping_paths() is where partial
grouping happens, then we should only be calling the latter if the
former is set.  I mean, how can it make sense to create
partially-grouped paths if we're not allowed to do partial grouping?

Yes, that's true. If we are not allowed to do partial grouping, partially grouped paths should not be created.

However, what I meant is that if the newly added partitionwise-related checks evaluate to true, that implies GROUPING_CAN_PARTIAL_AGG is also set, as it was checked earlier; thus an explicit check is not needed again.

Anyway, after your refactoring, it is more readable now.
 

> I have tweaked these conditions and posted in a separate patch (0006).
> However, I have merged all your three patches in one (0005).

OK, thanks.  I wasn't sure I had understood what was going on, so
thanks for checking it.

Thanks also for keeping 0004-0006 separate here, but I think you can
flatten them into one patch in the next version.

OK. Sure.
 

>> I think that if the last test in can_partitionwise_grouping were moved
>> before the previous test, it could be simplified to test only
>> (extra->flags & GROUPING_CAN_PARTIAL_AGG) == 0 and not
>> *perform_partial_partitionwise_aggregation.
>
> I think we can't do this way. If *perform_partial_partitionwise_aggregation
> found to be true then only we need to check whether partial aggregation
> itself is possible or not. If we are going to perform a full partitionwise
> aggregation then test for can_partial_agg is not needed. Have I misread your
> comments?

It seems you're correct, because when I change it the tests fail.  I
don't yet understand why.

Basically, the main patch seems to use three Boolean signaling mechanisms:

1. GROUPING_CAN_PARTITIONWISE_AGG
2. is_partial_aggregation
3. perform_partial_partitionwise_aggregation

Stuff I don't understand:

- Why is one of them a Boolean shoved into "flags", even though it's
not static across the whole hierarchy like the other flags, and the
other two are separate Booleans?
- What do they all do, anyway?

Let me try to explain this:

1. GROUPING_CAN_PARTITIONWISE_AGG
Tells us whether partitionwise grouping and/or aggregation is possible at all. If it is FALSE, the other two have no meaning and are useless. Only if it is TRUE do we attempt to create paths partitionwise.
I have kept it in "flags" as it behaves similarly to the other flag members like can_sort, can_hash, etc., and for a given grouped relation, whether parent or child, they all work alike. But yes, for a child relation we inherit can_sort/can_hash from the parent as they won't change, whereas this one needs to be evaluated for every child.
If required, I can move it to the GroupPathExtraData struct.

2. extra->is_partial_aggregation
This boolean is used to identify, at any given time, whether we are computing full or partial aggregation. It is necessary when doing partial aggregation, to skip the finalization step; it also tells us to use partially_grouped_rel when true.

3. extra->perform_partial_partitionwise_aggregation
This boolean instructs the child to create partially aggregated paths when TRUE. It is then transferred to child_extra->is_partial_aggregation in create_partitionwise_grouping_paths().

Basically, (3) is required because we want to create the partially_grouped_rel upfront, so that if a child is going to create partially aggregated paths, it can append them to the parent's partially grouped rel; thus we need to create that rel before we even start creating the children's paths.
Since (3) is only valid if (1) is true, we need to compute (1) upfront too.
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Wed, Mar 21, 2018 at 11:33 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Let me try to explain this:
> 1. GROUPING_CAN_PARTITIONWISE_AGG
> 2. extra->is_partial_aggregation
> 3. extra->perform_partial_partitionwise_aggregation

Please find attached an incremental patch that attempts to refactor
this logic into a simpler form.  What I've done is merged all three of
the above Booleans into a single state variable called 'patype', which
can be one of PARTITIONWISE_AGGREGATE_NONE,
PARTITIONWISE_AGGREGATE_FULL, and PARTITIONWISE_AGGREGATE_PARTIAL.
When create_ordinary_grouping_paths() is called, extra.patype is the
value for the parent relation; that function computes a new value and
passes it down to create_partitionwise_grouping_paths(), which inserts it
into the new 'extra' structure for the child.

Essentially, in your system, extra->is_partial_aggregation and
extra->perform_partial_partitionwise_aggregation both corresponded to
whether or not patype == PARTITIONWISE_AGGREGATE_PARTIAL, but the
former indicated whether the *parent* was doing partition-wise
aggregation (and thus we needed to generate only partial paths) while
the latter indicated whether the *current* relation was doing
partition-wise aggregation (and thus we needed to force creation of
partially_grouped_rel).  This took me a long time to understand
because of the way the fields were named; they didn't indicate that
one was for the parent and one for the current relation.  Meanwhile,
GROUPING_CAN_PARTITIONWISE_AGG indicated whether partition-wise
aggregate should be tried at all for the current relation; there was
no analogous indicator for the parent relation because we can't be
processing a child at all if the parent didn't decide to do
partition-wise aggregation.  So to know what was happening for the
current relation you had to look at GROUPING_CAN_PARTITIONWISE_AGG +
extra->perform_partial_partitionwise_aggregation, and to know what was
happening for the parent relation you just looked at
extra->is_partial_aggregation.  With this proposed refactoring patch,
there's just one patype value at each level, which at least to me
seems simpler.  I tried to improve the comments somewhat, too.
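To make the proposed state concrete, here is a toy standalone sketch: the enum names follow the patch, but the decision logic is a simplified illustration based on the thread's description (full partitionwise aggregation when the GROUP BY clause contains the partition key, partial otherwise), not the actual planner code:

```c
#include <stdbool.h>

/* Names follow the patch; values are illustrative. */
typedef enum PartitionwiseAggregateType
{
    PARTITIONWISE_AGGREGATE_NONE,    /* no partitionwise aggregation */
    PARTITIONWISE_AGGREGATE_FULL,    /* children aggregate fully */
    PARTITIONWISE_AGGREGATE_PARTIAL  /* children produce partial aggregates */
} PartitionwiseAggregateType;

/*
 * Simplified sketch of computing the patype to pass down to child
 * relations: full partitionwise aggregation is possible only when the
 * GROUP BY clause contains the partition key; otherwise the children can
 * at best produce partial aggregates, and only if partial aggregation is
 * possible at all.
 */
static PartitionwiseAggregateType
child_patype(bool groupby_has_partkey, bool can_partial_agg)
{
    if (groupby_has_partkey)
        return PARTITIONWISE_AGGREGATE_FULL;
    if (can_partial_agg)
        return PARTITIONWISE_AGGREGATE_PARTIAL;
    return PARTITIONWISE_AGGREGATE_NONE;
}
```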

You have some long lines that it would be good to break, like this:

         child_extra.targetList = (List *) adjust_appendrel_attrs(root,

(Node *) extra->targetList,
                                                                  nappinfos,
                                                                  appinfos);

If you put a newline after (List *), the formatting will come out
nicer -- it will fit within 80 columns.  Please go through the patches
and make these kinds of changes for lines over 80 columns where
possible.

I guess we'd better move the GROUPING_CAN_* constants to a header
file, if they're going to be exposed through GroupPathExtraData.  That
can go in some refactoring patch.

Is there a good reason not to use input_rel->relids as the input to
fetch_upper_rel() in all cases, rather than just at subordinate
levels?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Thu, Mar 22, 2018 at 3:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Is there a good reason not to use input_rel->relids as the input to
> fetch_upper_rel() in all cases, rather than just at subordinate
> levels?
>

That would simplify some code in these patches. We have set
upper_rel->relids to NULL for non-other upper relations since Tom
expected to use relids to mean something other than scan/join relids.
With these patch-sets for grouping rels, we are setting upper_rel->relids
to the relids of the underlying scan/join relation. So it does make sense
to set relids to input_rel->relids for all the grouping rels, whether
"other" or non-"other" grouping rels.

But with this change, we have to change all the existing code to pass
input_rel->relids to fetch_upper_rel(). If we don't do that, or if in
future somebody calls that function with relids = NULL, we will produce
two relations which are supposed to do the same thing but have
different relids set. That's because fetch_upper_rel() creates a
relation if one does not exist, whether or not the caller intends to
create one. We should probably create two functions: 1. to build an
upper relation, and 2. to search for one, similar to what we have done
for join relations and base relations. The other possibility is to pass
a flag to fetch_upper_rel() indicating whether the caller intends to
create a new relation when one doesn't exist. With this design a
caller can be sure that an upper relation will not be created when it
wants to just fetch an existing relation (and error out/assert if it
doesn't find one).
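A toy model of the suggested split, with build_upper_rel()/find_upper_rel() as named; the fixed-size array and integer keys stand in for the planner's real data structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: "kind" and "relids" are simplified to integers. */
typedef struct UpperRel
{
    int kind;
    int relids;
} UpperRel;

#define MAX_RELS 16
static UpperRel rels[MAX_RELS];
static int nrels = 0;

/*
 * Search only: returns NULL when no matching relation exists, so a caller
 * that merely wants to look one up can error out or assert instead of
 * silently creating a second relation with different relids.
 */
static UpperRel *
find_upper_rel(int kind, int relids)
{
    for (int i = 0; i < nrels; i++)
        if (rels[i].kind == kind && rels[i].relids == relids)
            return &rels[i];
    return NULL;
}

/* Build: creates the relation only if it does not already exist. */
static UpperRel *
build_upper_rel(int kind, int relids)
{
    UpperRel *rel = find_upper_rel(kind, relids);

    if (rel != NULL)
        return rel;
    assert(nrels < MAX_RELS);
    rels[nrels].kind = kind;
    rels[nrels].relids = relids;
    return &rels[nrels++];
}
```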

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 22, 2018 at 3:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 21, 2018 at 11:33 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Let me try to explain this:
> 1. GROUPING_CAN_PARTITIONWISE_AGG
> 2. extra->is_partial_aggregation
> 3. extra->perform_partial_partitionwise_aggregation

Please find attached an incremental patch that attempts to refactor
this logic into a simpler form.  What I've done is merged all three of
the above Booleans into a single state variable called 'patype', which
can be one of PARTITIONWISE_AGGREGATE_NONE,
PARTITIONWISE_AGGREGATE_FULL, and PARTITIONWISE_AGGREGATE_PARTIAL.
When create_ordinary_grouping_paths() is called, extra.patype is the
value for the parent relation; that function computes a new value and
passes it down to create_partitionwise_grouping_paths(), which inserts it
into the new 'extra' structure for the child.

Essentially, in your system, extra->is_partial_aggregation and
extra->perform_partial_partitionwise_aggregation both corresponded to
whether or not patype == PARTITIONWISE_AGGREGATE_PARTIAL, but the
former indicated whether the *parent* was doing partition-wise
aggregation (and thus we needed to generate only partial paths) while
the latter indicated whether the *current* relation was doing
partition-wise aggregation (and thus we needed to force creation of
partially_grouped_rel).  This took me a long time to understand
because of the way the fields were named; they didn't indicate that
one was for the parent and one for the current relation.  Meanwhile,
GROUPING_CAN_PARTITIONWISE_AGG indicated whether partition-wise
aggregate should be tried at all for the current relation; there was
no analogous indicator for the parent relation because we can't be
processing a child at all if the parent didn't decide to do
partition-wise aggregation.  So to know what was happening for the
current relation you had to look at GROUPING_CAN_PARTITIONWISE_AGG +
extra->perform_partial_partitionwise_aggregation, and to know what was
happening for the parent relation you just looked at
extra->is_partial_aggregation.  With this proposed refactoring patch,
there's just one patype value at each level, which at least to me
seems simpler.  I tried to improve the comments somewhat, too.

Looks cleaner now. Thanks for refactoring it.

I have merged these changes and flattened all previous changes into the main patch.
 

You have some long lines that it would be good to break, like this:

         child_extra.targetList = (List *) adjust_appendrel_attrs(root,

(Node *) extra->targetList,
                                                                  nappinfos,
                                                                  appinfos);

If you put a newline after (List *), the formatting will come out
nicer -- it will fit within 80 columns.  Please go through the patches
and make these kinds of changes for lines over 80 columns where
possible.

OK. Done.
I was under the impression that it is not good to split the first line. What if, in the split version, something gets inserted between the two lines while applying a new patch or during a merge? That would result in something unexpected. But I am not sure whether that's ever a possibility.


I guess we'd better move the GROUPING_CAN_* constants to a header
file, if they're going to be exposed through GroupPathExtraData.  That
can go in some refactoring patch.

Yep. Moved that into 0003 where we create GroupPathExtraData.


Is there a good reason not to use input_rel->relids as the input to
fetch_upper_rel() in all cases, rather than just at subordinate
levels?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 22, 2018 at 10:06 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Thu, Mar 22, 2018 at 3:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Is there a good reason not to use input_rel->relids as the input to
> fetch_upper_rel() in all cases, rather than just at subordinate
> levels?
>

That would simplify some code in these patches. We have set
upper_rel->relids to NULL for non-other upper relations since Tom
expected to use relids to mean something other than scan/join relids.
With these patch-sets for grouping rels, we are setting upper_rel->relids
to the relids of the underlying scan/join relation. So it does make sense
to set relids to input_rel->relids for all the grouping rels, whether
"other" or non-"other" grouping rels.

But with this change, we have to change all the existing code to pass
input_rel->relids to fetch_upper_rel(). If we don't do that, or if in
future somebody calls that function with relids = NULL, we will produce
two relations which are supposed to do the same thing but have
different relids set. That's because fetch_upper_rel() creates a
relation if one does not exist, whether or not the caller intends to
create one. We should probably create two functions: 1. to build an
upper relation, and 2. to search for one, similar to what we have done
for join relations and base relations. The other possibility is to pass
a flag to fetch_upper_rel() indicating whether the caller intends to
create a new relation when one doesn't exist. With this design a
caller can be sure that an upper relation will not be created when it
wants to just fetch an existing relation (and error out/assert if it
doesn't find one).

Like Ashutosh said, splitting fetch_upper_rel() into two functions, build_upper_rel() and find_upper_rel(), looks better.

However, I am not sure whether setting relids in the top-most grouped rel is a good idea or not. I remember we needed this while working on aggregate push-down, and in [1] Tom Lane opposed the idea of setting the relids in grouped_rel.

If we want to go with this, then I think it should be done as a separate stand-alone patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 22, 2018 at 6:15 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Looks cleaner now. Thanks for refactoring it.
>
> I have merged these changes and flattened all previous changes into the main
> patch.

Committed 0001-0005.  I made a few further modifications.  These were
mostly cosmetic, but with two exceptions:

1. I moved one set_cheapest() call to avoid doing that twice for the
top-level grouped_rel.

2. I removed the logic to set partition properties for grouped_rels.
As far as I can see, there's nothing that needs this.  It would be
important if we wanted subsequent planning stages to be able to do
partition-wise stuff, e.g. when doing window functions or setops, or
at higher query levels.  Maybe we'll have that someday; until then, I
think this is just a waste of cycles.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 22, 2018 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 22, 2018 at 6:15 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Looks cleaner now. Thanks for refactoring it.
>
> I have merged these changes and flattened all previous changes into the main
> patch.

Committed 0001-0005.

Thanks Robert.
 
  I made a few further modifications.  These were
mostly cosmetic, but with two exceptions:

1. I moved one set_cheapest() call to avoid doing that twice for the
top-level grouped_rel.

2. I removed the logic to set partition properties for grouped_rels.
As far as I can see, there's nothing that needs this.  It would be
important if we wanted subsequent planning stages to be able to do
partition-wise stuff, e.g. when doing window functions or setops, or
at higher query levels.  Maybe we'll have that someday; until then, I
think this is just a waste of cycles.

OK.

The postgres_fdw changes which allow pushing aggregates down to the foreign server are not yet committed. Due to this, we will end up getting an error when we have foreign partitions plus aggregation.

Attached here as 0001 (0006 from my earlier patch-set), which adds support for this and thus avoids that error.

Thanks


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:
Hi Robert,

On pgsql-committers, Andres reported a concern about a test-case failure with installcheck under local settings.
(Sorry, I have not subscribed to that mailing list and thus am not able to reply there.)

Attached is a patch which fixes that.

However, I am not sure whether a stable regression run is expected with installcheck when local settings are in effect.
For example, if I have enable_hashagg = false locally, I will definitely see failures.

ISTM that I am missing Andres' point here.

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Fri, Mar 23, 2018 at 4:35 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> Changes related to postgres_fdw which allows pushing aggregate on the
> foreign server is not yet committed. Due to this, we will end up getting an
> error when we have foreign partitions + aggregation.
>
> Attached 0001 patch here (0006 from my earlier patch-set) which adds support
> for this and thus will not have any error.


         else if (IS_UPPER_REL(foreignrel))
         {
             PgFdwRelationInfo *ofpinfo;
-            PathTarget *ptarget = root->upper_targets[UPPERREL_GROUP_AGG];
+            PathTarget *ptarget = fpinfo->grouped_target;

I think we need an assert there to make sure that the upper relation is a
grouping relation. That way any future push down will notice it.

-                get_agg_clause_costs(root, (Node *) root->parse->havingQual,
+                get_agg_clause_costs(root, fpinfo->havingQual,
                                      AGGSPLIT_SIMPLE, &aggcosts);
             }
Should we pass agg costs as well through GroupPathExtraData to avoid
calculating it again in this function?

 /*
+    /*
+     * Store passed-in target and havingQual in fpinfo. If its a foreign
+     * partition, then path target and HAVING quals fetched from the root are
+     * not correct as Vars within it won't match with this child relation.
+     * However, server passed them through extra and thus fetch from it.
+     */
+    if (extra)
+    {
+        /* Partial aggregates are not supported. */
+        Assert(extra->patype != PARTITIONWISE_AGGREGATE_PARTIAL);
+
+        fpinfo->grouped_target = extra->target;
+        fpinfo->havingQual = extra->havingQual;
+    }
+    else
+    {
+        fpinfo->grouped_target = root->upper_targets[UPPERREL_GROUP_AGG];
+        fpinfo->havingQual = parse->havingQual;
+    }
I think both these cases, extra should always be present whether a child
relation or a parent relation. Just pick from extra always.

     /* Grouping information */
     List       *grouped_tlist;
+    PathTarget *grouped_target;

We should use the target stored in the grouped rel directly.

+    Node       *havingQual;
I am wondering whether we could use remote_conds member for storing this.

     /*
      * Likewise, copy the relids that are represented by this foreign scan. An
-     * upper rel doesn't have relids set, but it covers all the base relations
-     * participating in the underlying scan, so use root's all_baserels.
+     * upper rel (but not the other upper rel) doesn't have relids set, but it
+     * covers all the base relations participating in the underlying scan, so
+     * use root's all_baserels.
      */

This is correct only for "other" grouping relations. We are yet to
decide what to do
for the other upper relations.

-    if (IS_UPPER_REL(rel))
+    if (IS_UPPER_REL(rel) && !IS_OTHER_REL(rel))
I guess, this condition boils down to rel->reloptkind == RELOPT_UPPER_REL. Use
it that way?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Andres Freund
Date:
Hi,

On 2018-03-23 17:01:54 +0530, Jeevan Chalke wrote:
> Attached patch which fixes that.

Thanks, will push. For the future, I'd be more likely to notice if you
CC me ;)


> However, I am not sure whether it is expected to have stable regression run
> with installcheck having local settings.
> For example, if I have enable_hashagg = false locally, I will definitely
> see failures.
> 
> ISTM, that I am missing Andres point here.

I don't think there's a hard and fast rule here. I personally often
during development disable parallelism because it makes some things
harder (you can't easily debug crashes with gdb, benchmarks show larger
variance, ...).  There doesn't seem to be an equivalent benefit to
support running e.g. with enable_hashagg = false.

- Andres


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Mar 27, 2018 at 3:33 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-03-23 17:01:54 +0530, Jeevan Chalke wrote:
> Attached patch which fixes that.

Thanks, will push. For the future, I'd be more likely to notice if you
CC me ;)

Sure. Thanks.
 


> However, I am not sure whether it is expected to have stable regression run
> with installcheck having local settings.
> For example, if I have enable_hashagg = false locally, I will definitely
> see failures.
>
> ISTM, that I am missing Andres point here.

I don't think there's a hard and fast rule here. I personally often
during development disable parallelism because it makes some things
harder (you can't easily debug crashes with gdb, benchmarks show larger
variance, ...).

Yep.
 
  There doesn't seem to be an equivalent benefit to
support running e.g. with enable_hashagg = false.

OK.
Noted.

Thanks for the explanation.
 

- Andres



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Mon, Mar 26, 2018 at 5:24 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Fri, Mar 23, 2018 at 4:35 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
>
> Changes related to postgres_fdw which allows pushing aggregate on the
> foreign server is not yet committed. Due to this, we will end up getting an
> error when we have foreign partitions + aggregation.
>
> Attached 0001 patch here (0006 from my earlier patch-set) which adds support
> for this and thus will not have any error.

I have observed that the target member in GroupPathExtraData is not needed, as we store the target in the grouped_rel itself.
Attached 0001 patch to remove that.
 


         else if (IS_UPPER_REL(foreignrel))
         {
             PgFdwRelationInfo *ofpinfo;
-            PathTarget *ptarget = root->upper_targets[UPPERREL_GROUP_AGG];
+            PathTarget *ptarget = fpinfo->grouped_target;

I think we need an assert there to make sure that the upper relation is a
grouping relation. That way any future push down will notice it.

I am not sure what we should Assert here. Note that we end up here only when doing grouping, and thus I don't think we need any Assert.
Let me know if I missed anything.
 

-                get_agg_clause_costs(root, (Node *) root->parse->havingQual,
+                get_agg_clause_costs(root, fpinfo->havingQual,
                                      AGGSPLIT_SIMPLE, &aggcosts);
             }
Should we pass agg costs as well through GroupPathExtraData to avoid
calculating it again in this function?

Adding an extra member to GroupPathExtraData just for the FDW's use does not look good to me.
But yes, if we do that, then we can save this calculation.
Let me know if it's OK to have an extra member for FDW use only; if so, I will prepare a separate patch for that.
 

 /*
+    /*
+     * Store the passed-in target and havingQual in fpinfo. If it's a foreign
+     * partition, the path target and HAVING quals fetched from the root are
+     * not correct, as Vars within them won't match this child relation.
+     * However, the caller passes them through extra, so fetch them from there.
+     */
+    if (extra)
+    {
+        /* Partial aggregates are not supported. */
+        Assert(extra->patype != PARTITIONWISE_AGGREGATE_PARTIAL);
+
+        fpinfo->grouped_target = extra->target;
+        fpinfo->havingQual = extra->havingQual;
+    }
+    else
+    {
+        fpinfo->grouped_target = root->upper_targets[UPPERREL_GROUP_AGG];
+        fpinfo->havingQual = parse->havingQual;
+    }
I think in both these cases extra should always be present, whether for a child
relation or a parent relation. Just pick from extra always.

Yes.
Done.
 

     /* Grouping information */
     List       *grouped_tlist;
+    PathTarget *grouped_target;

We should use the target stored in the grouped rel directly.

Yep.
 

+    Node       *havingQual;
I am wondering whether we could use remote_conds member for storing this.

This havingQual is later checked for shippability and classified into pushable and non-pushable quals, which are stored in remote_conds and local_conds respectively.
Storing it directly in remote_conds and then splitting it does not look good to me.
Also, remote_conds is a list of RestrictInfo nodes whereas havingQual is not, so using it to store havingQual does not make sense. It is better to have a separate member in PgFdwRelationInfo.


     /*
      * Likewise, copy the relids that are represented by this foreign scan. An
-     * upper rel doesn't have relids set, but it covers all the base relations
-     * participating in the underlying scan, so use root's all_baserels.
+     * upper rel (but not the other upper rel) doesn't have relids set, but it
+     * covers all the base relations participating in the underlying scan, so
+     * use root's all_baserels.
      */

This is correct only for "other" grouping relations. We are yet to
decide what to do
for the other upper relations.

I have removed this comment change, as the existing comments look good after making the following changes:
 

-    if (IS_UPPER_REL(rel))
+    if (IS_UPPER_REL(rel) && !IS_OTHER_REL(rel))
I guess, this condition boils down to rel->reloptkind == RELOPT_UPPER_REL. Use
it that way?

Done.

Attached 0002 for this.

Thanks
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Tue, Mar 27, 2018 at 2:43 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

>>          else if (IS_UPPER_REL(foreignrel))
>>          {
>>              PgFdwRelationInfo *ofpinfo;
>> -            PathTarget *ptarget =
>> root->upper_targets[UPPERREL_GROUP_AGG];
>> +            PathTarget *ptarget = fpinfo->grouped_target;
>>
>> I think we need an assert there to make sure that the upper relation is a
>> grouping relation. That way any future push down will notice it.
>
>
> I am not sure what we should Assert here. Note that we end up here only
> when doing grouping, and thus I don't think we need any Assert here.
> Let me know if I missed anything.

Since we are just checking whether it's an upper relation and directly
using root->upper_targets[UPPERREL_GROUP_AGG], I thought we could add
an assert to verify that it's really the grouping rel we are dealing
with. But I guess, we can't really check that from given relation. But
then for a grouped rel we can get its target from RelOptInfo. So, we
shouldn't need root->upper_targets[UPPERREL_GROUP_AGG]. Am I
missing something? For other upper relations we do not set the target
yet but then we could assert that there exists one in the grouped
relation.

>
>>
>>
>> -                get_agg_clause_costs(root, (Node *)
>> root->parse->havingQual,
>> +                get_agg_clause_costs(root, fpinfo->havingQual,
>>                                       AGGSPLIT_SIMPLE, &aggcosts);
>>              }
>> Should we pass agg costs as well through GroupPathExtraData to avoid
>> calculating it again in this function?
>
>
> Adding an extra member in GroupPathExtraData just for FDW does not look good
> to me.
> But yes, if we do that, then we can save this calculation.
> Let me know if it's OK to have an extra member for just FDW use, will prepare
> a separate patch for that.

I think that should be fine. A separate patch would be good, so that a
committer can decide whether or not to include it.



>>
>> +    Node       *havingQual;
>> I am wondering whether we could use remote_conds member for storing this.
>
>
> This havingQual is later checked for shippability and classified into
> pushable and non-pushable quals and stored in remote_conds and local_conds
> respectively.
> Storing it directly in remote_conds and then splitting it does not look good
> to me.
> Also, remote_conds is list of RestrictInfo nodes whereas havingQual is not.
> So using that for storing havingQual does not make sense. So better to have
> a separate member in PgFdwRelationInfo.

Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
are basically the conditions on the relation being pushed down.
havingQuals are conditions on a grouped relation so treating them like
baserestrictinfo or join conditions looks more straight forward,
rather than having a separate member in PgFdwRelationInfo. So, remote
havingQuals go into remote_conds and local havingQuals go to
local_conds.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Ashutosh Bapat
Date:
On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> are basically the conditions on the relation being pushed down.
> havingQuals are conditions on a grouped relation so treating them like
> baserestrictinfo or join conditions looks more straight forward,
> rather than having a separate member in PgFdwRelationInfo. So, remote
> havingQuals go into remote_conds and local havingQuals go to
> local_conds.

Looks like we already do that. Then we have remote_conds, local_conds
which together should be equivalent to havingQual. Storing all those
three doesn't make sense. In future someone may use havingQual instead
of remote_conds/local_conds just because it's available, and then there
is risk of these three lists going out of sync.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Tue, Mar 27, 2018 at 2:43 PM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

> I am not sure what we should Assert here. Note that we end up here only
> when doing grouping, and thus I don't think we need any Assert here.
> Let me know if I missed anything.

Since we are just checking whether it's an upper relation and directly
using root->upper_targets[UPPERREL_GROUP_AGG], I thought we could add
an assert to verify that it's really the grouping rel we are dealing
with. But I guess, we can't really check that from given relation. But
then for a grouped rel we can get its target from RelOptInfo. So, we
shouldn't need root->upper_targets[UPPERREL_GROUP_AGG]. Am I
missing something? For other upper relations we do not set the target
yet but then we could assert that there exists one in the grouped
relation.

Yes. We now fetch the target from the grouped_rel itself.
Added Assert() per our off-list discussion.
 

>
>>
>>
>> -                get_agg_clause_costs(root, (Node *)
>> root->parse->havingQual,
>> +                get_agg_clause_costs(root, fpinfo->havingQual,
>>                                       AGGSPLIT_SIMPLE, &aggcosts);
>>              }
>> Should we pass agg costs as well through GroupPathExtraData to avoid
>> calculating it again in this function?
>
>
> Adding an extra member in GroupPathExtraData just for FDW does not look good
> to me.
> But yes, if we do that, then we can save this calculation.
> Let me know if it's OK to have an extra member for just FDW use, will prepare
> a separate patch for that.

I think that should be fine. A separate patch would be good, so that a
committer can decide whether or not to include it.

Attached patch 0003 for this.




>>
>> +    Node       *havingQual;
>> I am wondering whether we could use remote_conds member for storing this.
>
>
> This havingQual is later checked for shippability and classified into
> pushable and non-pushable quals and stored in remote_conds and local_conds
> respectively.
> Storing it directly in remote_conds and then splitting it does not look good
> to me.
> Also, remote_conds is list of RestrictInfo nodes whereas havingQual is not.
> So using that for storing havingQual does not make sense. So better to have
> a separate member in PgFdwRelationInfo.

Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
are basically the conditions on the relation being pushed down.
havingQuals are conditions on a grouped relation so treating them like
baserestrictinfo or join conditions looks more straight forward,
rather than having a separate member in PgFdwRelationInfo. So, remote
havingQuals go into remote_conds and local havingQuals go to
local_conds.

OK, agreed.
In this version, I have not added anything to PgFdwRelationInfo.
The HAVING qual is needed in two places: (1) in foreign_grouping_ok(), to check shippability, so the translated HAVING qual is passed to it as a parameter; and (2) when estimating aggregate costs in estimate_path_cost_size(), where we can use havingQual from the root itself, since the costs won't change between parent and child.
Thus there is no need to store havingQual in PgFdwRelationInfo.

Thanks


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Thu, Mar 29, 2018 at 4:13 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> are basically the conditions on the relation being pushed down.
> havingQuals are conditions on a grouped relation so treating them like
> baserestrictinfo or join conditions looks more straight forward,
> rather than having a separate member in PgFdwRelationInfo. So, remote
> havingQuals go into remote_conds and local havingQuals go to
> local_conds.

Looks like we already do that. Then we have remote_conds, local_conds
which together should be equivalent to havingQual. Storing all those
three doesn't make sense. In future someone may use havingQual instead
of remote_conds/local_conds just because it's available, and then there
is risk of these three lists going out of sync.

Yep, I see the risk.
 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Robert Haas
Date:
On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Yep, I see the risk.

Committed 0001 last week and 0002 just now.  I don't really see 0003 as
a critical need.  If somebody demonstrates that this saves a
meaningful amount of planning time, we can consider that part for a
future release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

From
David Steele
Date:
Hi Jeevan,

On 4/2/18 10:57 AM, Robert Haas wrote:
> On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>> Yep, I see the risk.
> 
> Committed 0001 last week and 0002 just now.  I don't really see 0003 as
> a critical need.  If somebody demonstrates that this saves a
> meaningful amount of planning time, we can consider that part for a
> future release.

The bulk of this patch was committed so I have marked it that way.

If you would like to pursue patch 0003, I think it would be best to start a
new thread and demonstrate how the patch improves performance.

Regards,
-- 
-David
david@pgmasters.net


Re: [HACKERS] Partition-wise aggregation/grouping

From
Jeevan Chalke
Date:


On Tue, Apr 10, 2018 at 7:30 PM, David Steele <david@pgmasters.net> wrote:
Hi Jeevan,

On 4/2/18 10:57 AM, Robert Haas wrote:
> On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke
> <jeevan.chalke@enterprisedb.com> wrote:
>> Yep, I see the risk.
>
> Committed 0001 last week and 0002 just now.  I don't really see 0003 as
> a critical need.  If somebody demonstrates that this saves a
> meaningful amount of planning time, we can consider that part for a
> future release.

The bulk of this patch was committed so I have marked it that way.

Thanks, David.
 

If you would like to pursue patch 0003, I think it would be best to start a
new thread and demonstrate how the patch improves performance.

Sure.
 

Regards,
--
-David
david@pgmasters.net



--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] Partition-wise aggregation/grouping

From
Andrey Lepikhov
Date:
 > Regarding apply_scanjoin_target_to_paths in 0001 and 0007, it seems
 > like what happens is: we first build an Append path for the topmost
 > scan/join rel.  That uses paths from the individual relations that
 > don't necessarily produce the final scan/join target.  Then we mutate
 > those relations in place during partition-wise aggregate so that they
 > now do produce the final scan/join target and generate some more paths
 > using the results.  So there's an ordering dependency, and the same
 > pathlist represents different things at different times.  That is, I
 > suppose, not technically any worse than what we're doing for the
 > scan/join rel's pathlist in general, but here there's the additional
 > complexity that the paths get used both before and after being
 > mutated.  The UPPERREL_TLIST proposal would clean this up, although I
 > realize that has unresolved issues.

I am discouraged by this logic.
Currently I use set_rel_pathlist_hook to apply some optimizations to
partition scan paths, but apply_scanjoin_target_to_paths() discards the
pathlist and undoes all of those optimizations.
Maybe it is possible to introduce a flag that the hook can set to
prevent the pathlist from being cleared?

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company