Thread: Re: [HACKERS] WIP: Aggregation push-down

Re: [HACKERS] WIP: Aggregation push-down

From: Antonin Houska
Antonin Houska <ah@cybertec.at> wrote:

> This is a new version of the patch I presented in [1].

Rebased.

cat .git/refs/heads/master
b9a3ef55b253d885081c2d0e9dc45802cab71c7b

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WIP: Aggregation push-down

From: Antonin Houska
Antonin Houska <ah@cybertec.at> wrote:

> Antonin Houska <ah@cybertec.at> wrote:
>
> > This is a new version of the patch I presented in [1].
>
> Rebased.
>
> cat .git/refs/heads/master
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state type is equal
to the output type. For other aggregates (like avg()) the remote nodes will
have to return the transient state value in an appropriate form (maybe the
bytea type) which does not depend on the PG version.
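To illustrate the distinction, here is a toy sketch in Python (not PostgreSQL
code; the shard lists and function names are made up for the example): sum()'s
partial result already has the output's type and can be combined as-is, while
avg() carries a (sum, count) transient state that must be combined before the
final function runs:

```python
def avg_final(state):
    # Final function for avg(): turns the (sum, count) transient
    # state into the output value. sum() has no final function.
    s, n = state
    return s / n

# Two hypothetical "shard" nodes, each holding part of the data.
shard1, shard2 = [1, 2, 3], [4, 5]

# sum(): the partial result has the same type as the output, so the
# remote value can be shipped back and combined directly.
total = sum([sum(shard1), sum(shard2)])

# avg(): each shard's transient state is a (sum, count) pair.
state1 = (sum(shard1), len(shard1))   # (6, 3)
state2 = (sum(shard2), len(shard2))   # (9, 2)

# Correct: combine the transient states, then finalize once.
correct_avg = avg_final((state1[0] + state2[0], state1[1] + state2[1]))

# Wrong: finalizing per shard and averaging the results loses the counts.
wrong_avg = (avg_final(state1) + avg_final(state2)) / 2
```

This is why shipping avg()'s state needs a serialized representation (such as
bytea) rather than the output value: correct_avg is 15/5 = 3.0, while the
naive wrong_avg comes out as (2.0 + 4.5)/2 = 3.25.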

shard.tgz demonstrates the typical postgres_fdw use case. One query shows
scans of the base relation's partitions being pushed to shard nodes, the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, the query can only reference one particular partition until
the "partition-wise join" [1] patch gets committed and merged with my patch.

One thing I'm not sure about is whether the patch should remove the
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete, or whether it should only be deprecated for now. I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] WIP: Aggregation push-down

From: Jeevan Chalke
Hi Antonin,

To understand the feature you have proposed, I have tried to understand
your patch. Here are a few comments on it so far:

1.
+         if (aggref->aggvariadic ||
+             aggref->aggdirectargs || aggref->aggorder ||
+             aggref->aggdistinct || aggref->aggfilter)

I did not understand why you are not pushing the aggregate down in the above
cases. Can you please explain?

2. "make check" in contrib/postgres_fdw crashes.

  SELECT COUNT(*) FROM ft1 t1;
! server closed the connection unexpectedly
!   This probably means the server terminated abnormally
!   before or while processing the request.
! connection to server was lost

From your given setup, if I write a query like:
EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
it crashes.

It seems a few checks are missing.

3. In the case of pushing partial aggregation to the remote side, you use a
schema named "partial"; I did not get that change. If I have an AVG()
aggregate, then I end up getting an error saying
"ERROR:  schema "partial" does not exist".

4. I am not sure about the code changes related to pushing partial
aggregates to the remote side. Basically, you are pushing a whole aggregate
to the remote side, and its result is treated as partial on the basis of
aggtype = transtype. But I am not sure that is correct unless I am missing
something here: the types may be the same, but the result may not be.

5. There are quite a lot of TODO comments in the patch-set; are you working
on those?

Thanks


On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
Antonin Houska <ah@cybertec.at> wrote:

> Antonin Houska <ah@cybertec.at> wrote:
>
> > This is a new version of the patch I presented in [1].
>
> Rebased.
>
> cat .git/refs/heads/master
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables the aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state is equal to the
output type. For other aggregates (like avg()) the remote nodes will have to
return the transient state value in an appropriate form (maybe bytea type),
which does not depend on PG version.

shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
scans of base relation's partitions being pushed to shard nodes, the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, the query can only references one particular partition, until
the "partition-wise join" [1] patch gets committed and merged with this my
patch.

One thing I'm not sure about is whether the patch should remove
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete. Or should it only be deprecated so far? I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at







--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: [HACKERS] WIP: Aggregation push-down

From: Ashutosh Bapat
On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
> Antonin Houska <ah@cybertec.at> wrote:
>
>> Antonin Houska <ah@cybertec.at> wrote:
>>
>> > This is a new version of the patch I presented in [1].
>>
>> Rebased.
>>
>> cat .git/refs/heads/master
>> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
> This is another version of the patch.
>
> Besides other changes, it enables the aggregation push-down for postgres_fdw,
> although only for aggregates whose transient aggregation state is equal to the
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Having the transient aggregation state type the same as the output type
doesn't mean that we can feed the output of remote aggregation as-is to the
local finalization stage, or that finalization is not needed at all. I am
not sure why that test is being used to decide whether or not to push
an aggregate (I assume they are partial aggregates) to the foreign
server. postgres_fdw doesn't have a feature to fetch partial aggregate
results from the foreign server. Until we do that, I don't think this
patch can push any partial aggregates down to the foreign server.

>
> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only references one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with this my
> patch.

Right. Until then joins will not have children.

>
> One thing I'm not sure about is whether the patch should remove
> GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> obsolete. Or should it only be deprecated so far? I understand that
> deprecation is important for "ordinary SQL users", but FdwRoutine is an
> interface for extension developers, so the policy might be different.

I doubt that's correct. We can do that only when we know that all
kinds of aggregates/grouping can be pushed down the join tree, which
looks impossible to me. Apart from that, FdwRoutine handles all
kinds of upper relations, like windowing, distinct and ordered, which are
not pushed down by this set of patches.

Here's a review of the patchset.
The patches compile cleanly when applied together.

Testcase "limit" fails in make check.

This is a pretty large patchset, and the patches are divided by stages in the
planner rather than by functionality. IOW, no single patch provides full
functionality. Reviewing and probably committing such patches is not easy and
may take quite long. Instead, if the patches are divided by functionality, i.e.
each patch provides some functionality completely, it will be easier to review
and commit such patches, with each commit giving something useful. I had
suggested breaking up the patches along that line at [1]. The patches also
refactor existing code. It's better to move such refactoring into a patch (or
patches) by itself.

The patches need more comments explaining various decisions, e.g. in

    if (joinrel)
    {
        /* Create GatherPaths for any useful partial paths for rel */
-       generate_gather_paths(root, joinrel);
+       generate_gather_paths(root, joinrel, false);

        /* Find and save the cheapest paths for this joinrel */
        set_cheapest(joinrel);

For base relations, the patch calls generate_gather_paths() with grouped set
to both true and false, but for join relations it doesn't do that. There is no
comment explaining why.
Or in the following code:

+    /*
+     * Do partial aggregation at base relation level if the relation is
+     * eligible for it.
+     */
+    if (rel->gpi != NULL)
+        create_grouped_path(root, rel, path, false, true, AGG_HASHED);

Why not consider AGG_SORTED paths?
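For reference, the difference between the two strategies can be sketched in
Python (illustrative only; the function names mirror the planner's AGG_HASHED
and AGG_SORTED labels but are otherwise made up):

```python
from itertools import groupby

rows = [(1, 10), (2, 20), (1, 30), (2, 40), (3, 50)]  # (group key, value)

def agg_hashed(rows):
    # Hash aggregation: a single pass, keeping one transient state
    # per group in a hash table; no input ordering required.
    states = {}
    for key, val in rows:
        states[key] = states.get(key, 0) + val
    return states

def agg_sorted(rows):
    # Sorted aggregation: order the input by the grouping key, then
    # emit each group as it completes. Attractive when the input is
    # already sorted, or when a sorted result helps paths above.
    result = {}
    for key, group in groupby(sorted(rows), key=lambda r: r[0]):
        result[key] = sum(v for _, v in group)
    return result
```

Both return {1: 40, 2: 60, 3: 50} here; which one the planner should generate
is a matter of cost and of the ordering of the surrounding paths.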

The patch maintains an extra rel target and two pathlists (a partial pathlist
and a non-partial pathlist) for grouping in RelOptInfo. These two extra
pathlists require a "grouped" argument to be passed to a number of functions.
Instead, it probably makes sense to create a RelOptInfo for
aggregation/grouping and set pathlist and partial_pathlist in that RelOptInfo.
That will also make a place for keeping grouped estimates.

For placeholders we have two functions, add_placeholders_to_base_rels() and
add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but
no corresponding function for joinrels. How do we push aggregates involving
two or more relations down to the corresponding joinrels?

Some minor assorted comments:

Avoid a useless hunk:

@@ -279,8 +301,8 @@ add_paths_to_joinrel(PlannerInfo *root,
     * joins, because there may be no other alternative.
     */
    if (enable_hashjoin || jointype == JOIN_FULL)
-        hash_inner_and_outer(root, joinrel, outerrel, innerrel,
-                             jointype, &extra);
+        hash_inner_and_outer(root, joinrel, outerrel, innerrel, jointype,
+                             &extra);

In the last "regression" patch, there are some code changes (mostly because of
a pgindent run). Maybe you want to include those in the appropriate code
patches.

Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
populated as
insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;

1. The patch pushes aggregates down the join in the following query:
explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
t2.b;
but does not push the aggregates down the join in the following query:
explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
t2.b;
In fact, it doesn't use the optimization for any OUTER join. I think the reason
for this behaviour is that the patch uses equivalence classes to distribute the
aggregates and grouping to base relations and OUTER equi-joins do not form
equivalence classes. But I think it should be possible to push the aggregates
down the OUTER join by adding one row for NULL values if the grouping is pushed
to the inner side. I don't see much change for outer side. This also means that
we have to choose some means other than equivalence class for propagating the
aggregates.
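The hazard can be simulated outside the planner. The toy Python sketch below
(tables and the one-row-for-NULL idea follow the discussion above; everything
else is made up) runs count(*) grouped by t2.b under a LEFT JOIN, once the
straightforward way and once with grouping naively pushed to the inner side:

```python
t1 = [(1, 1), (2, 1), (3, 9)]   # (a, b); b = 9 has no match in t2
t2 = [(10, 1), (11, 1)]         # (a, b)

def join_then_aggregate(t1, t2):
    # count(*) FROM t1 LEFT JOIN t2 ON t1.b = t2.b GROUP BY t2.b,
    # computed the straightforward way: join first, then aggregate.
    counts = {}
    for a1, b1 in t1:
        matches = [r for r in t2 if r[1] == b1]
        if matches:
            for a2, b2 in matches:
                counts[b2] = counts.get(b2, 0) + 1
        else:
            counts[None] = counts.get(None, 0) + 1  # NULL-extended row
    return counts

def naive_pushdown(t1, t2):
    # Naive push-down: pre-aggregate t2 by b, then join the partial
    # counts. An unmatched t1 row finds no partial state, so the NULL
    # group silently disappears.
    partial = {}
    for a2, b2 in t2:
        partial[b2] = partial.get(b2, 0) + 1
    counts = {}
    for a1, b1 in t1:
        if b1 in partial:
            counts[b1] = counts.get(b1, 0) + partial[b1]
    return counts
```

Here join_then_aggregate() yields {1: 4, None: 1} while naive_pushdown()
yields only {1: 4}: the NULL group produced by the unmatched outer row is
lost, which is exactly what injecting an extra row for NULL values on the
inner side would repair.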

2. The following query throws an error with these patches, but not without
the patches.
explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a
= t2.a
group by t2.a, t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

[1] https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] WIP: Aggregation push-down

From: Merlin Moncure
On Thu, Aug 17, 2017 at 10:22 AM, Antonin Houska <ah@cybertec.at> wrote:
> Antonin Houska <ah@cybertec.at> wrote:
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Hm, that seems like an awful lot of work (a new version-agnostic
serialization format) for very little benefit (version-independent
type serialization for remote aggregate push-down).  How about forcing
the versions to be the same for the feature to be used?

merlin



Re: [HACKERS] WIP: Aggregation push-down

From: Antonin Houska
Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

> 1.
> + if (aggref->aggvariadic ||
> + aggref->aggdirectargs || aggref->aggorder ||
> + aggref->aggdistinct || aggref->aggfilter)
>
> I did not understand, why you are not pushing aggregate in above cases?
> Can you please explain?

Currently I'm trying to implement infrastructure to propagate result of
partial aggregation from base relation or join to the root of the join tree
and to use these as input for the final aggregation. Some special cases are
disabled because I haven't thought about them enough so far. Some of these
restrictions may be easy to relax, others may be hard. I'll get back to them
at some point.

> 2. "make check" in contrib/postgres_fdw crashes.
>
> SELECT COUNT(*) FROM ft1 t1;
> ! server closed the connection unexpectedly
> ! This probably means the server terminated abnormally
> ! before or while processing the request.
> ! connection to server was lost
>
> From your given setup, if I wrote a query like:
> EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
> it crashes.
>
> Seems like missing few checks.

Please consider this a temporary limitation.

> 3. In case of pushing partial aggregation on the remote side, you use schema
> named "partial", I did not get that change. If I have AVG() aggregate,
> then I end up getting an error saying
> "ERROR: schema "partial" does not exist".

Attached to this email is an extension that I hacked together quickly at some
point, for experimental purposes only. It's pretty raw but may be useful if
you just want to play with it.

> 4. I am not sure about the code changes related to pushing partial
> aggregate on the remote side. Basically, you are pushing a whole aggregate
> on the remote side and result of that is treated as partial on the
> basis of aggtype = transtype. But I am not sure that is correct unless
> I miss something here. The type may be same but the result may not.

You're right. I mostly used sum() and count() in my examples, but these are
special cases. It should also be checked whether aggfinalfn exists or not.
I'll fix it, thanks.

> 5. There are lot many TODO comments in the patch-set, are you working
> on those?

Yes. I wasn't too eager to complete all the TODOs soon because reviews of the
patch may result in major rework. And if large parts of the code have to be
discarded, many of the TODOs will go away as well.

> Thanks
>
> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
>
>  Antonin Houska <ah@cybertec.at> wrote:
>
>  > Antonin Houska <ah@cybertec.at> wrote:
>  >
>  > > This is a new version of the patch I presented in [1].
>  >
>  > Rebased.
>  >
>  > cat .git/refs/heads/master
>  > b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
>  This is another version of the patch.
>
>  Besides other changes, it enables the aggregation push-down for postgres_fdw,
>  although only for aggregates whose transient aggregation state is equal to the
>  output type. For other aggregates (like avg()) the remote nodes will have to
>  return the transient state value in an appropriate form (maybe bytea type),
>  which does not depend on PG version.
>
>  shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
>  scans of base relation's partitions being pushed to shard nodes, the other
>  pushes down a join and performs aggregation of the join result on the remote
>  node. Of course, the query can only references one particular partition, until
>  the "partition-wise join" [1] patch gets committed and merged with this my
>  patch.
>
>  One thing I'm not sure about is whether the patch should remove
>  GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>  obsolete. Or should it only be deprecated so far? I understand that
>  deprecation is important for "ordinary SQL users", but FdwRoutine is an
>  interface for extension developers, so the policy might be different.
>
>  [1] https://commitfest.postgresql.org/14/994/
>
>  Any feedback is appreciated.
>
>  --
>  Antonin Houska
>  Cybertec Schönig & Schönig GmbH
>  Gröhrmühlgasse 26
>  A-2700 Wiener Neustadt
>  Web: http://www.postgresql-support.de, http://www.cybertec.at
>
>
> --
> Jeevan Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] WIP: Aggregation push-down

From: Antonin Houska
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
> > Antonin Houska <ah@cybertec.at> wrote:
> >
> >> Antonin Houska <ah@cybertec.at> wrote:
> >>
> >> > This is a new version of the patch I presented in [1].
> >>
> >> Rebased.
> >>
> >> cat .git/refs/heads/master
> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
> >
> > This is another version of the patch.
> >
> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
> > although only for aggregates whose transient aggregation state is equal to the
> > output type. For other aggregates (like avg()) the remote nodes will have to
> > return the transient state value in an appropriate form (maybe bytea type),
> > which does not depend on PG version.
>
> Having transient aggregation state type same as output type doesn't
> mean that we can feed the output of remote aggregation as is to the
> finalization stage locally or finalization is not needed at all. I am
> not sure why is that test being used to decide whether or not to push
> an aggregate (I assume they are partial aggregates) to the foreign
> server.

I agree. This seems to be the same problem as reported in [2].

> postgres_fdw doesn't have a feature to fetch partial aggregate
> results from the foreign server. Till we do that, I don't think this
> patch can push any partial aggregates down to the foreign server.

Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
does not exist and the transient type can be transferred from the remote
server in textual form? (Of course, there's a question of whether such
behaviour is consistent from the user's perspective.)

> >
> > One thing I'm not sure about is whether the patch should remove
> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> > obsolete. Or should it only be deprecated so far? I understand that
> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
> > interface for extension developers, so the policy might be different.
>
> I doubt if that's correct. We can do that only when we know that all
> kinds aggregates/grouping can be pushed down the join tree, which
> looks impossible to me. Apart from that that FdwRoutine handles all
> kinds of upper relations like windowing, distinct, ordered which are
> not pushed down by this set of patches.

Good point.

> Here's review of the patchset
> The patches compile cleanly when applied together.
>
> Testcase "limit" fails in make check.

If I remember correctly, this test generates a different plan because the
aggregate push-down gets involved. I tend to ignore this failure until the
patch's cost estimates are refined. Then let's see if the test will pass or if
the expected output should be adjusted.

> This is a pretty large patchset and the patches are divided by stages in the
> planner rather than functionality. IOW, no single patch provides a full
> functionality. Reviewing and probably committing such patches is not easy and
> may take quite long. Instead, if the patches are divided by functionality, i.e.
> each patch provides some functionality completely, it will be easy to review
> and commit such patches with each commit giving something useful. I had
> suggested breaking patches on that line at [1]. The patches also refactor
> existing code. It's better to move such refactoring in a patch/es by itself.

I definitely saw commits whose purpose is preparation for something else, but
you're right that some splitting of the actual functionality wouldn't hurt.
I'll at least separate partial aggregation of base relations from that of
joins, and maybe some more preparation steps.

> The patches need more comments, explaining various decisions

o.k.

> The patch maintains an extra rel target, two pathlists, partial pathlist and
> non-partial pathlist for grouping in RelOptInfo. These two extra
> pathlists require "grouped" argument to be passed to a number of functions.
> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
> place for keeping grouped estimates.

If grouped paths require a separate RelOptInfo, why don't the existing partial
paths?

> For placeholders we have two function add_placeholders_to_base_rels() and
> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
> not have corresponding function for joinrels. How do we push aggregates
> involving two or more relations to the corresponding joinrels?

add_grouping_info_to_base_rels() calls prepare_rel_for_grouping(), which
actually adds the "grouping info". For a join, prepare_rel_for_grouping() is
called from build_join_rel().

While PlaceHolderVars need to be finalized before the planner starts to create
joins, the GroupedPathInfo has to fit a particular pair of joined relations.

Perhaps create_grouping_info_... would be a better name, to indicate that the
thing we're adding does not exist yet. I'll think about it.

> Some minor assorted comments ...

o.k., will fix.

> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
> populated as
> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>
> 1. The patch pushes aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
> t2.b;
> but does not push the aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
> t2.b;

> In fact, it doesn't use the optimization for any OUTER join. I think the reason
> for this behaviour is that the patch uses equivalence classes to distribute the
> aggregates and grouping to base relations and OUTER equi-joins do not form
> equivalence classes. But I think it should be possible to push the aggregates
> down the OUTER join by adding one row for NULL values if the grouping is pushed
> to the inner side. I don't see much change for outer side. This also means that
> we have to choose some means other than equivalence class for propagating the
> aggregates.

The problem is that aggregate pushed down to the nullable side of an outer
join receives different input values than the original aggregate at the top
level of the query. NULL values generated by the OJ make the difference. I
have no idea how to handle this problem. If the aggregation is performed on
the nullable side of the OJ, it can't predict which of the input values don't
match any value of the other side. Suggestions are appreciated.

> 2. Following query throws error with these patches, but not without the
> patches.
> explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a
> = t2.a
> group by t2.a, t1.a;
> ERROR:  ORDER/GROUP BY expression not found in targetlist

I'll check this. Thanks.

> [1]
https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAM2%2B6%3DW2J-iaQBgj-sdMERELQLUm5dvOQEWQ2ho%2BQ7KZgnonkQ%40mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] WIP: Aggregation push-down

From: Ashutosh Bapat
On Fri, Sep 8, 2017 at 7:04 PM, Antonin Houska <ah@cybertec.at> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
>> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
>> > Antonin Houska <ah@cybertec.at> wrote:
>> >
>> >> Antonin Houska <ah@cybertec.at> wrote:
>> >>
>> >> > This is a new version of the patch I presented in [1].
>> >>
>> >> Rebased.
>> >>
>> >> cat .git/refs/heads/master
>> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>> >
>> > This is another version of the patch.
>> >
>> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
>> > although only for aggregates whose transient aggregation state is equal to the
>> > output type. For other aggregates (like avg()) the remote nodes will have to
>> > return the transient state value in an appropriate form (maybe bytea type),
>> > which does not depend on PG version.
>>
>> Having transient aggregation state type same as output type doesn't
>> mean that we can feed the output of remote aggregation as is to the
>> finalization stage locally or finalization is not needed at all. I am
>> not sure why is that test being used to decide whether or not to push
>> an aggregate (I assume they are partial aggregates) to the foreign
>> server.
>
> I agree. This seems to be the same problem as reported in [2].
>
>> postgres_fdw doesn't have a feature to fetch partial aggregate
>> results from the foreign server. Till we do that, I don't think this
>> patch can push any partial aggregates down to the foreign server.
>
> Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
> does not exist and the transient type can be transfered from the remote server
> in textual form? (Of course there's a question is if such behaviour is
> consistent from user's perspective.)

Yes, those functions will work.

>
>> >
>> > One thing I'm not sure about is whether the patch should remove
>> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>> > obsolete. Or should it only be deprecated so far? I understand that
>> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
>> > interface for extension developers, so the policy might be different.
>>
>> I doubt if that's correct. We can do that only when we know that all
>> kinds aggregates/grouping can be pushed down the join tree, which
>> looks impossible to me. Apart from that that FdwRoutine handles all
>> kinds of upper relations like windowing, distinct, ordered which are
>> not pushed down by this set of patches.
>
> Good point.
>

I think this is where Jeevan Chalke's partition-wise aggregation helps. It
deals with the cases where your patch cannot push the aggregates down to the
children of a join.

>> The patch maintains an extra rel target, two pathlists, partial pathlist and
>> non-partial pathlist for grouping in RelOptInfo. These two extra
>> pathlists require "grouped" argument to be passed to a number of functions.
>> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
>> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
>> place for keeping grouped estimates.
>
> If grouped paths require a separate RelOptInfo, why the existing partial paths
> do not?

Partial paths produce the same targetlist and the same relation as
non-partial paths do. A RelOptInfo in the planner represents a set of rows,
and all paths added to that RelOptInfo produce the same whole result
(parameterized paths produce the same result collectively). Grouping paths,
however, produce a different result and a different targetlist, so maybe it's
better to have a separate RelOptInfo.

>
>> For placeholders we have two function add_placeholders_to_base_rels() and
>> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
>> not have corresponding function for joinrels. How do we push aggregates
>> involving two or more relations to the corresponding joinrels?
>
> add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which
> actually adds the "grouping info". For join, prepare_rel_for_grouping() is
> called from build_join_rel().

Ok.

>
> While PlaceHolderVars need to be finalized before planner starts to create
> joins, the GroupedPathInfo has to fit particular pair of joined relations.
>
> Perhaps create_grouping_info_... would be better, to indicate that the thing
> we're adding does not exist yet. I'll think about it.
>
>> Some minor assorted comments ...
>
> o.k., will fix.
>
>> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
>> populated as
>> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
>> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>>
>> 1. The patch pushes aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
>> t2.b;
>> but does not push the aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
>> t2.b;
>
>> In fact, it doesn't use the optimization for any OUTER join. I think the reason
>> for this behaviour is that the patch uses equivalence classes to distribute the
>> aggregates and grouping to base relations and OUTER equi-joins do not form
>> equivalence classes. But I think it should be possible to push the aggregates
>> down the OUTER join by adding one row for NULL values if the grouping is pushed
>> to the inner side. I don't see much change for outer side. This also means that
>> we have to choose some means other than equivalence class for propagating the
>> aggregates.
>
> The problem is that aggregate pushed down to the nullable side of an outer
> join receives different input values than the original aggregate at the top
> level of the query. NULL values generated by the OJ make the difference. I
> have no idea how to handle this problem. If the aggregation is performed on
> the nullable side of the OJ, it can't predict which of the input values don't
> match any value of the other side. Suggestions are appreciated.

I haven't thought this through fully either, but I think this can be
fixed by using some kind of all-NULL row on the nullable side. If we
are going to use equivalence classes, we won't be able to extend that
mechanism to OUTER joins, so you may want to rethink using
equivalence classes as the method of propagating grouping information.
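A toy Python sketch (not PostgreSQL code; the tables and values are made up) of why the pushed-down aggregate misses groups that the outer join would have produced, and why an all-NULL row would have to be synthesized:

```python
# t1 LEFT JOIN t2 ON t1.b = t2.b, then GROUP BY t1.b with count(t2.a):
# rows of t1 with no match in t2 still produce an output group, which a
# pre-aggregated inner side knows nothing about.
t1 = [(1, 1), (2, 2)]          # (a, b)
t2 = [(10, 1)]                 # (a, b); b = 2 has no match

# Correct result: count(t2.a) per t1.b, NULLs not counted.
correct = {}
for a1, b1 in t1:
    matches = [a2 for a2, b2 in t2 if b2 == b1]
    correct[b1] = len(matches)

# Naive push-down: aggregate t2 by b first, then join.
pre = {}
for a2, b2 in t2:
    pre[b2] = pre.get(b2, 0) + 1      # partial count per b
pushed = {b1: pre.get(b1) for a1, b1 in t1}
# The b = 2 group has no partial state at all (None here); some stand-in
# row (count 0) would have to be injected for the unmatched keys.
```

Running this gives `correct == {1: 1, 2: 0}` but `pushed == {1: 1, 2: None}`, i.e. the pushed-down plan lacks a state for the NULL-extended group.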

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Antonin Houska <ah@cybertec.at> wrote:

> This is another version of the patch.

> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only reference one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with my
> patch.

Since [1] is already there, the new version of shard.tgz shows what I consider
the typical use case. (I'm aware of the postgres_fdw regression test failures;
I'll try to fix them all in the next version.)

Besides that:

* A concept of "path unique keys" has been introduced. It helps to detect
  whether the final relation already generates a distinct set of grouping
  keys. If it does, the final aggregation is replaced by a mere call of the
  aggfinalfn() function on each transient state value.

* The FDW can sort rows by an aggregate.

* enable_agg_pushdown GUC was added. The default value is false.

* I fixed errors reported during the previous CF.

* Added a few more regression tests.
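The "path unique keys" idea above can be sketched outside the planner (toy Python; the keys and the (sum, count) pairs standing in for avg()'s transient state are made up):

```python
# If the input is known to produce at most one row per grouping key,
# the Finalize Aggregate degenerates to applying the final function to
# each transient state value -- no re-grouping or state merging needed.
def avg_final(state):
    # rough analogue of avg()'s aggfinalfn over a (sum, count) state
    s, n = state
    return s / n

rows = [("k1", (10, 2)), ("k2", (9, 3))]     # distinct keys, partial states
result = {key: avg_final(state) for key, state in rows}
```

Here `result` comes out as `{"k1": 5.0, "k2": 3.0}`; each state is finalized exactly once.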


I'm not about to add any other features now. Implementation of the missing
parts (see the TODO comments in the code) is the next step. But what I'd
appreciate most is a feedback on the design. Thanks.

> [1] https://commitfest.postgresql.org/14/994/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Michael Paquier
Date:
On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
> I'm not about to add any other features now. Implementation of the missing
> parts (see the TODO comments in the code) is the next step. But what I'd
> appreciate most is a feedback on the design. Thanks.

I am getting a conflict after applying patch 5 but this did not get
any reviews so moved to next CF with waiting on author as status.
-- 
Michael


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Michael Paquier <michael.paquier@gmail.com> wrote:

> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
> > I'm not about to add any other features now. Implementation of the missing
> > parts (see the TODO comments in the code) is the next step. But what I'd
> > appreciate most is a feedback on the design. Thanks.
>
> I am getting a conflict after applying patch 5 but this did not get
> any reviews so moved to next CF with waiting on author as status.

Attached is the next version.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Robert Haas
Date:
On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote:
> Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
>> > I'm not about to add any other features now. Implementation of the missing
>> > parts (see the TODO comments in the code) is the next step. But what I'd
>> > appreciate most is a feedback on the design. Thanks.
>>
>> I am getting a conflict after applying patch 5 but this did not get
>> any reviews so moved to next CF with waiting on author as status.
>
> Attached is the next version.

I've been a bit confused for a while about what this patch is trying
to do, so I spent some time today looking at it to try to figure it
out.  There's a lot I don't understand yet, but it seems like the
general idea is to build, potentially for each relation in the join
tree, not only the regular list of paths but also a list of "grouped"
paths.  If the pre-grouped path wins, then we can get a final path
that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate
-> Maybe Some More Joins -> Base Table Scan.  In some cases the patch
seems to replace that uppermost Finalize Aggregate with a Result node.

translate_expression_to_rels() looks unsafe.  Equivalence members are
known to be equal under the semantics of the relevant operator class,
but that doesn't mean that one can be freely substituted for another.
For example:

rhaas=# create table one (a numeric);
CREATE TABLE
rhaas=# create table two (a numeric);
CREATE TABLE
rhaas=# insert into one values ('0');
INSERT 0 1
rhaas=# insert into two values ('0.00');
INSERT 0 1
rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
 a | count
---+-------
 0 |     1
(1 row)

rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
  a   | count
------+-------
 0.00 |     1
(1 row)

There are, admittedly, a large number of data types for which such a
substitution would work just fine.  numeric will not, citext will not,
but many others are fine. Regrettably, we have no framework in the
system for identifying equality operators which actually test identity
versus some looser notion of equality.  Cross-type operators are a
problem, too; if we have foo.x = bar.y in the query, one might be int4
and the other int8, for example, but they can still belong to the same
equivalence class.

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
p.i GROUP BY p.i" you assume that it's OK to do a Partial
HashAggregate over c1.parent rather than p.i.  This will be false if,
say, c1.parent is of type citext and p.i is of type text; values will
get grouped together that shouldn't be.  It will also be false if the
grouping expression is something like GROUP BY length(p.i::text),
because one value could be '0'::numeric and the other '0.00'::numeric.
I can't think of a reason why it would be false if the grouping
expressions are both simple Vars of the same underlying data type, but
I'm a little nervous that I might be wrong even about that case.
Maybe you've handled all of this somehow, but it's not obvious to me
that it has been considered.
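The numeric behavior above can be reproduced outside SQL; Python's Decimal (used here purely as an illustration, not as anything PostgreSQL-specific) shows the same equal-but-distinguishable property:

```python
from decimal import Decimal

# '0' and '0.00' compare equal, so they fall into one group under "=",
# yet they remain visibly different values. Whichever representative
# survives grouping determines the output column -- which is why freely
# substituting one equivalence member for another is unsafe.
vals = [Decimal("0"), Decimal("0.00")]
groups = {}
for v in vals:
    groups.setdefault(v, []).append(str(v))
# one group, containing two distinct stored forms
```

After the loop, `len(groups)` is 1, but the single group holds both the `'0'` and the `'0.00'` form.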

Another thing I noticed is that the GroupedPathInfo looks a bit like a
stripped-down RelOptInfo, in that for example it has both a pathlist
and a partial_pathlist. I'm inclined to think that we should build new
RelOptInfos instead.  As you have it, there are an awful lot of things
that seem to know about the grouped/ungrouped distinction, many of
which are quite low-level functions like add_path(),
add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
some of this would go away if you had separate RelOptInfos instead of
GroupedPathInfo.

Some compiler noise:

allpaths.c:2794:11: error: variable 'partial_target' is used
uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path,
partial_target,

^~~~~~~~~~~~~~
allpaths.c:2794:7: note: remove the 'if' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2794:11: error: variable 'partial_target' is used
uninitialized whenever '&&' condition is false
      [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path,
partial_target,

^~~~~~~~~~~~~~
allpaths.c:2794:11: note: remove the '&&' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~~~~
allpaths.c:2773:28: note: initialize the variable 'partial_target' to
silence this warning
        PathTarget *partial_target;
                                  ^
                                   = NULL

Core dump running the regression tests:

  * frame #0: 0x00007fffe633ad42 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fffe6428457 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fffe62a0420 libsystem_c.dylib`abort + 129
    frame #3: 0x000000010e3ae90e
postgres`ExceptionalCondition(conditionName=<unavailable>,
errorType=<unavailable>, fileName=<unavailable>,
lineNumber=<unavailable>) at assert.c:54 [opt]
    frame #4: 0x000000010e17c23b
postgres`check_list_invariants(list=<unavailable>) at list.c:39 [opt]
    frame #5: 0x000000010e17cdff postgres`list_free [inlined]
list_free_private(list=0x00007fe6a483be40, deep='\0') at list.c:1107
[opt]
    frame #6: 0x000000010e17cdfa
postgres`list_free(list=0x00007fe6a483be40) at list.c:1135 [opt]
    frame #7: 0x000000010e1b365f
postgres`generate_bitmap_or_paths(root=0x00007fe6a68131c8,
rel=0x00007fe6a6815310, clauses=<unavailable>,
other_clauses=<unavailable>, agg_info=0x0000000000000000) at
indxpath.c:1580 [opt]
    frame #8: 0x000000010e1b305c
postgres`create_index_paths(root=0x00007fe6a68131c8,
rel=<unavailable>, agg_info=0x0000000000000000) at indxpath.c:334
[opt]
    frame #9: 0x000000010e1a7cfa postgres`set_rel_pathlist [inlined]
set_plain_rel_pathlist at allpaths.c:745 [opt]
    frame #10: 0x000000010e1a7bc1
postgres`set_rel_pathlist(root=0x00007fe6a68131c8,
rel=0x00007fe6a6815310, rti=1, rte=0x00007fe6a6803270) at
allpaths.c:454 [opt]
    frame #11: 0x000000010e1a4910 postgres`make_one_rel at allpaths.c:312 [opt]
    frame #12: 0x000000010e1a48cc
postgres`make_one_rel(root=0x00007fe6a68131c8,
joinlist=0x00007fe6a6815810) at allpaths.c:182 [opt]
    frame #13: 0x000000010e1cb806
postgres`query_planner(root=0x00007fe6a68131c8, tlist=<unavailable>,
qp_callback=<unavailable>, qp_extra=0x00007fff51ca24d8) at
planmain.c:268 [opt]
    frame #14: 0x000000010e1ce7b4
postgres`grouping_planner(root=<unavailable>, inheritance_update='\0',
tuple_fraction=<unavailable>) at planner.c:1789 [opt]
    frame #15: 0x000000010e1ccb12
postgres`subquery_planner(glob=<unavailable>,
parse=0x00007fe6a6803158, parent_root=<unavailable>,
hasRecursion=<unavailable>, tuple_fraction=0) at planner.c:901 [opt]
    frame #16: 0x000000010e1cba3e
postgres`standard_planner(parse=0x00007fe6a6803158, cursorOptions=256,
boundParams=0x0000000000000000) at planner.c:364 [opt]
    frame #17: 0x000000010e291b3b
postgres`pg_plan_query(querytree=0x00007fe6a6803158,
cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:807
[opt]
    frame #18: 0x000000010e291c6e
postgres`pg_plan_queries(querytrees=<unavailable>, cursorOptions=256,
boundParams=0x0000000000000000) at postgres.c:873 [opt]
    frame #19: 0x000000010e296335
postgres`exec_simple_query(query_string="SELECT p1.oid,
p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n
(p1.typlen <= 0 AND p1.typlen != -1 AND p1.typlen != -2) OR\n
(p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n    NOT
p1.typisdefined OR\n    (p1.typalign not in ('c', 's', 'i', 'd')) OR\n
   (p1.typstorage not in ('p', 'x', 'e', 'm'));") at postgres.c:1048
[opt]
    frame #20: 0x000000010e295099
postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>,
dbname="regression", username=<unavailable>) at postgres.c:0 [opt]
    frame #21: 0x000000010e20a910 postgres`PostmasterMain [inlined]
BackendRun at postmaster.c:4412 [opt]
    frame #22: 0x000000010e20a741 postgres`PostmasterMain [inlined]
BackendStartup at postmaster.c:4084 [opt]
    frame #23: 0x000000010e20a3bd postgres`PostmasterMain at
postmaster.c:1757 [opt]
    frame #24: 0x000000010e2098ff
postgres`PostmasterMain(argc=1516050552, argv=<unavailable>) at
postmaster.c:1365 [opt]
    frame #25: 0x000000010e177537 postgres`main(argc=<unavailable>,
argv=<unavailable>) at main.c:228 [opt]
    frame #26: 0x00007fffe620c235 libdyld.dylib`start + 1
(lldb) p debug_query_string
(const char *) $0 = 0x00007fe6a401e518 "SELECT p1.oid,
p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n
(p1.typlen <= 0 AND p1.typlen != -1 AND p1.typlen != -2) OR\n
(p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n    NOT
p1.typisdefined OR\n    (p1.typalign not in ('c', 's', 'i', 'd')) OR\n
   (p1.typstorage not in ('p', 'x', 'e', 'm'));"

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote:
> > Michael Paquier <michael.paquier@gmail.com> wrote:
> >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
> >> > I'm not about to add any other features now. Implementation of the missing
> >> > parts (see the TODO comments in the code) is the next step. But what I'd
> >> > appreciate most is a feedback on the design. Thanks.
> >>
> >> I am getting a conflict after applying patch 5 but this did not get
> >> any reviews so moved to next CF with waiting on author as status.
> >
> > Attached is the next version.
>
> I've been a bit confused for a while about what this patch is trying
> to do, so I spent some time today looking at it to try to figure it
> out.

Thanks.

> There's a lot I don't understand yet, but it seems like the general idea is
> to build, potentially for each relation in the join tree, not only the
> regular list of paths but also a list of "grouped" paths.  If the
> pre-grouped path wins, then we can get a final path that looks like Finalize
> Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins ->
> Base Table Scan.

Yes. The regression test output shows some more plans.

> In some cases the patch seems to replace that uppermost Finalize Aggregate
> with a Result node.

This is a feature implemented in 09_avoid_agg_finalization.diff. You can
ignore it for now, it's of lower priority than the preceding parts and needs
more work to be complete.

> translate_expression_to_rels() looks unsafe.  Equivalence members are
> known to be equal under the semantics of the relevant operator class,
> but that doesn't mean that one can be freely substituted for another.
> For example:
>
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+-------
>  0 |     1
> (1 row)
>
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> ------+-------
>  0.00 |     1
> (1 row)
>
> There are, admittedly, a large number of data types for which such a
> substitution would work just fine.  numeric will not, citext will not,
> but many others are fine. Regrettably, we have no framework in the
> system for identifying equality operators which actually test identity
> versus some looser notion of equality.  Cross-type operators are a
> problem, too; if we have foo.x = bar.y in the query, one might be int4
> and the other int8, for example, but they can still belong to the same
> equivalence class.
>
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
> p.i GROUP BY p.i" you assume that it's OK to do a Partial
> HashAggregate over c1.parent rather than p.i.  This will be false if,
> say, c1.parent is of type citext and p.i is of type text; values will
> get grouped together that shouldn't be.

I've really missed this problem, thanks for bringing it up! Your
considerations made me think of the specifics of the types like numeric or
citext. Although your example does not demonstrate what I consider the core
issue, I believe we eventually think of the same.

Consider this example query (using the tables you defined upthread):

SELECT      one.a
FROM        one, two
WHERE       one.a = two.a AND scale(two.a) > 1
GROUP BY    1

If we first try to aggregate "two", then evaluate the WHERE clause and then
finalize the aggregation, the partial aggregation of "two" can put various
binary values of the "a" column (0, 0.0, etc.) into the same group, so the
scale of the output (of the partial aggregation) will be undefined. Thus the
aggregation push-down would change the input of the WHERE clause.
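A rough analogue of this in Python (the `scale()` function below is a hypothetical stand-in for the SQL scale() function, not PostgreSQL code): the filter depends on the stored form of the value, which the "=" operator deliberately ignores.

```python
from decimal import Decimal

def scale(d):
    # stand-in for SQL scale(): number of digits after the decimal point
    return -d.as_tuple().exponent

a, b = Decimal("0"), Decimal("0.00")
assert a == b                            # same group under numeric "="
assert (scale(a), scale(b)) == (0, 2)    # but the filter distinguishes them
```

If partial aggregation merges `a` and `b` into one group before the `scale(...) > 1` condition runs, the condition's input is no longer well defined.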

So one problem is that the grouping expression can be inappropriate for
partial aggregation even if there's no type change during the
translation. What I consider typical for this case is that the equality
operator used to identify groups (SortGroupClause.eqop) can return true even
if binary (stored) values of the inputs differ. The partial aggregation could
only take place if we had a special equality operator which distinguishes the
*binary* values (I don't know yet how to store this operator in the catalog
--- in the pg_opclass records for the hash opclasses?).

On the other hand, if the grouping expression is not a plain Var and there's
no type change during the translation and the expression output type is not of
the "identity-unsafe type" (numeric, citext, etc.), input vars of the
"identity-unsafe type"should not prevent us from using the expression for
partial aggregation: in such a case the grouping keys are computed in the same
way they would be w/o the partial aggregation.

The other problem is which type changes are legal. We should not allow any
type-specific information (such as the scale or precision of the numeric type)
to get lost or become ambiguous. In this example

> It will also be false if the grouping expression is something like GROUP BY
> length(p.i::text), because one value could be '0'::numeric and the other
> '0.00'::numeric.

you show that not only numeric -> int is a problem but also int ->
numeric. This is the tricky part.

One of my ideas is to check whether the source and destination types are
binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of
the binary coercibility. Another idea is to allow only such changes that the
destination type is in the same operator class as the source, and explicitly
enumerate the "safe opclasses". But that would mean that user cannot define
new opclasses within which the translation is possible --- not sure this is a
serious issue.

Perhaps the most consistent solution is that the translation is not performed
if any input variable of the grouping expression should change its data type.

> Another thing I noticed is that the GroupedPathInfo looks a bit like a
> stripped-down RelOptInfo, in that for example it has both a pathlist
> and a partial_pathlist. I'm inclined to think that we should build new
> RelOptInfos instead.  As you have it, there are an awful lot of things
> that seem to know about the grouped/ungrouped distinction, many of
> which are quite low-level functions like add_path(),
> add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
> some of this would go away if you had separate RelOptInfos instead of
> GroupedPathInfo.

This was proposed by Ashutosh Bapat upthread. I actually did this in the early
(not published) prototype of the patch and I considered it too invasive for
standard_join_search() and subroutines. IIRC an array like simple_rel_array
had to be introduced for the grouped relation in which the meaning of NULL was
twofold: either the relation is not a base relation or it is a base relation
but no grouped relation exists for it.  Also I thought there would be too much
duplicate data if the grouped relation had a separate RelOptInfo. Now that I'm
done with one approach I can consider the original approach again.

> Some compiler noise:

> Core dump running the regression tests:

I'll fix these, thanks.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <ah@cybertec.at> wrote:
> So one problem is that the grouping expression can be inappropriate for
> partial aggregation even if there's no type change during the
> translation. What I consider typical for this case is that the equality
> operator used to identify groups (SortGroupClause.eqop) can return true even
> if binary (stored) values of the inputs differ. The partial aggregation could
> only take place if we had a special equality operator which distinguishes the
> > *binary* values (I don't know yet how to store this operator in the catalog ---
> > in the pg_opclass records for the hash opclasses?).

We don't have an operator that tests for binary equality, but it's
certainly testable from C; see datumIsEqual.  I'm not sure how much
this really helps, though.  I think it would be safe to build an
initial set of groups based on datumIsEqual comparisons and then
arrange to later merge some of the groups.  But that's not something
we ever do in the executor today, so it might require quite a lot of
hacking.  Also, it seems like it might really suck in some cases.  For
instance, consider something like SELECT scale(one.a), sum(two.a) FROM
one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
two.a using datumIsEqual semantics, joining to a, and then doing a
Finalize Aggregate looks legal, but the Partial Aggregate may produce
a tremendous number of groups compared to the Finalize Aggregate.  In
other words, this technique wouldn't merge any groups that shouldn't
be merged, but it might fail to merge groups that really need to be
merged to get good performance.
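The "build groups on binary identity, merge later" idea can be sketched as follows (toy Python; `str(v)` stands in for datumIsEqual-style identity, and Decimal's "==" stands in for the operator class equality):

```python
from decimal import Decimal

vals = [Decimal("0"), Decimal("0.00"), Decimal("0.0")]

# Partial step: key groups by the exact stored form (identity semantics).
partial = {}
for v in vals:
    partial[str(v)] = partial.get(str(v), 0) + 1
# three "binary" groups for what is logically a single group

# Finalize step: merge under the looser "=" semantics.
merged = {}
for k, n in partial.items():
    merged[Decimal(k)] = merged.get(Decimal(k), 0) + n
# Decimal's ==/hash collapse the three partial groups into one
```

`partial` ends up with 3 entries and `merged` with 1, illustrating the cost Robert describes: the partial step is correct but can produce far more groups than the finalize step keeps.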

> One of my ideas is to check whether the source and destination types are
> binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of
> the binary coercibility.

Yeah, binary coercibility has nothing to do with this; that tells you
whether the two types are the same on disk, not whether they have the
same equality semantics.  For instance, I think text and citext are
binary coercible, but their equality semantics are not the same.

> Another idea is to allow only such changes that the
> destination type is in the same operator class as the source, and explicitly
> enumerate the "safe opclasses". But that would mean that user cannot define
> new opclasses within which the translation is possible --- not sure this is a
> serious issue.

Enumerating specific opclasses in the source code is a non-starter --
Tom Lane would roll over in his grave if he weren't still alive.  What
we could perhaps consider doing is adding some mechanism for an
opclass or opfamily to say whether its equality semantics happen to be
exactly datumIsEqual() semantics.  That's a little grotty because it
leaves data types for which that isn't true out in the cold, but on
the other hand it seems like it would be useful as a way of optimizing
a bunch of things other than this.  Maybe it could also include a way
to specify that the comparator happens to have the semantics as C's
built-in < and > operators, which seems like a case that might also
lend itself to some optimizations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <ah@cybertec.at> wrote:
> > So one problem is that the grouping expression can be inappropriate for
> > partial aggregation even if there's no type change during the
> > translation. What I consider typical for this case is that the equality
> > operator used to identify groups (SortGroupClause.eqop) can return true even
> > if binary (stored) values of the inputs differ. The partial aggregation could
> > only take place if we had a special equality operator which distinguishes the
> > *binary* values (I don't know yet how to store this operator in the catalog ---
> > in the pg_opclass records for the hash opclasses?).
>
> We don't have an operator that tests for binary equality, but it's
> certainly testable from C; see datumIsEqual.  I'm not sure how much
> this really helps, though.  I think it would be safe to build an
> initial set of groups based on datumIsEqual comparisons and then
> arrange to later merge some of the groups.  But that's not something
> we ever do in the executor today, so it might require quite a lot of
> hacking.  Also, it seems like it might really suck in some cases.  For
> instance, consider something like SELECT scale(one.a), sum(two.a) FROM
> one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
> two.a using datumIsEqual semantics, joining to a, and then doing a
> Finalize Aggregate looks legal, but the Partial Aggregate may produce
> a tremendous number of groups compared to the Finalize Aggregate.  In
> other words, this technique wouldn't merge any groups that shouldn't
> be merged, but it might fail to merge groups that really need to be
> merged to get good performance.

I don't insist on doing the Partial Aggregate in every case. If we wanted to
group by the binary value, we'd probably have to enhance the statistics
accordingly. The important thing is to recognize a special case like this.
Rejecting the Partial Aggregate would be the default response.

> > Another idea is to allow only such changes that the
> > destination type is in the same operator class as the source, and explicitly
> > enumerate the "safe opclasses". But that would mean that user cannot define
> > new opclasses within which the translation is possible --- not sure this is a
> > serious issue.
>
> Enumerating specific opclasses in the source code is a non-starter --
> Tom Lane would roll over in his grave if he weren't still alive.  What
> we could perhaps consider doing is adding some mechanism for an
> opclass or opfamily to say whether its equality semantics happen to be
> exactly datumIsEqual() semantics.  That's a little grotty because it
> leaves data types for which that isn't true out in the cold, but on
> the other hand it seems like it would be useful as a way of optimizing
> a bunch of things other than this.  Maybe it could also include a way
> to specify that the comparator happens to have the semantics as C's
> built-in < and > operators, which seems like a case that might also
> lend itself to some optimizations.

I'm thinking of a variant of this: implement a universal function that tests
the binary values for equality (besides the actual arguments, the caller would
have to pass info like typlen, typalign and typbyval for each argument, and
cache these for repeated calls) and set pg_proc(oprcode) to 0 wherever this
function is sufficient. Thus the problematic cases like numeric, citext, etc.
would be detected by their non-zero oprcode.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: [HACKERS] WIP: Aggregation push-down

From
Chapman Flack
Date:
On 01/29/18 03:32, Antonin Houska wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:

>>> only take place if we had a special equality operator which distinguishes the
>>> *binary* values (I don't know yet how to store this operator in the catalog ---
...
>> We don't have an operator that tests for binary equality, but it's
>> certainly testable from C; see datumIsEqual.

Disclaimer: I haven't been following the whole thread closely. But could
there be some way to exploit the composite-type *= operator?

https://www.postgresql.org/docs/current/static/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON

-Chap


Re: [HACKERS] WIP: Aggregation push-down

From
Robert Haas
Date:
On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <ah@cybertec.at> wrote:
> I think of a variant of this: implement an universal function that tests the
> binary values for equality (besides the actual arguments, caller would have to
> pass info like typlen, typalign, typbyval for each argument, and cache these
> for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> sufficient. Thus the problematic cases like numeric, citext, etc. would be
> detected by their non-zero oprcode.

I don't think that's a good option, because we don't want int4eq to
have to go through a code path that has branches to support varlenas
and cstrings.  Andres is busy trying to speed up expression evaluation
by removing unnecessary code branches; adding new ones would be
undoing that valuable work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <ah@cybertec.at> wrote:
> > I think of a variant of this: implement an universal function that tests the
> > binary values for equality (besides the actual arguments, caller would have to
> > pass info like typlen, typalign, typbyval for each argument, and cache these
> > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> > sufficient. Thus the problematic cases like numeric, citext, etc. would be
> > detected by their non-zero oprcode.
>
> I don't think that's a good option, because we don't want int4eq to
> have to go through a code path that has branches to support varlenas
> and cstrings.  Andres is busy trying to speed up expression evaluation
> by removing unnecessary code branches; adding new ones would be
> undoing that valuable work.

I spent some more time thinking about this. What about adding a new strategy
number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
most types both the HTEqualStrategyNumber and HTBinaryEqualStrategyNumber
strategies would point to the same operator, but types like numeric would
naturally have them differ.

Thus the pushed-down partial aggregation can only use the
HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
initial version (until we have useful statistics for the binary values) we can
avoid the aggregation push-down if the grouping expression output type has the
two strategies implemented using different functions because, as you noted
upthread, grouping based on binary equality can result in an excessive number
of groups.

One open question is whether the binary equality operator needs a separate
operator class or not. If an opclass cares only about the binary equality, its
hash function(s) can be a lot simpler.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: [HACKERS] WIP: Aggregation push-down

From
Robert Haas
Date:
On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska <ah@cybertec.at> wrote:
> I spent some more time thinking about this. What about adding a new strategy
> number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
> most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
> would point to the same operator, but types like numeric would naturally have
> them different.
>
> Thus the pushed-down partial aggregation can only use the
> HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
> initial version (until we have useful statistics for the binary values) we can
> avoid the aggregation push-down if the grouping expression output type has the
> two strategies implemented using different functions because, as you noted
> upthread, grouping based on binary equality can result in excessive number of
> groups.
>
> One open question is whether the binary equality operator needs a separate
> operator class or not. If an opclass cares only about the binary equality, its
> hash function(s) can be a lot simpler.

Hmm.  How about instead adding another regproc field to pg_type which
stores the OID of a function that tests binary equality for that
datatype?  If that happens to be equal to the OID you got from the
opclass, then you're all set.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote:
> > Michael Paquier <michael.paquier@gmail.com> wrote:
> >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
> >> > I'm not about to add any other features now. Implementation of the missing
> >> > parts (see the TODO comments in the code) is the next step. But what I'd
> >> > appreciate most is a feedback on the design. Thanks.
> >>
> >> I am getting a conflict after applying patch 5 but this did not get
> >> any reviews so moved to next CF with waiting on author as status.
> >
> > Attached is the next version.

> translate_expression_to_rels() looks unsafe.  Equivalence members are
> known to be equal under the semantics of the relevant operator class,
> but that doesn't mean that one can be freely substituted for another.
> For example:
>
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+-------
>  0 |     1
> (1 row)
>
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> ------+-------
>  0.00 |     1
> (1 row)
>
> There are, admittedly, a large number of data types for which such a
> substitution would work just fine.  numeric will not, citext will not,
> but many others are fine. Regrettably, we have no framework in the
> system for identifying equality operators which actually test identity
> versus some looser notion of equality.  Cross-type operators are a
> problem, too; if we have foo.x = bar.y in the query, one might be int4
> and the other int8, for example, but they can still belong to the same
> equivalence class.
>
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
> p.i GROUP BY p.i" you assume that it's OK to do a Partial
> HashAggregate over c1.parent rather than p.i.  This will be false if,
say, c1.parent is of type citext and p.i is of type text; things will
get grouped together that shouldn't.  It will also be false if the
> grouping expression is something like GROUP BY length(p.i::text),
> because one value could be '0'::numeric and the other '0.00'::numeric.
> I can't think of a reason why it would be false if the grouping
> expressions are both simple Vars of the same underlying data type, but
> I'm a little nervous that I might be wrong even about that case.
> Maybe you've handled all of this somehow, but it's not obvious to me
> that it has been considered.

These problems are still being investigated; see [1].

> Another thing I noticed is that the GroupedPathInfo looks a bit like a
> stripped-down RelOptInfo, in that for example it has both a pathlist
> and a partial_pathlist. I'm inclined to think that we should build new
> RelOptInfos instead.  As you have it, there are an awful lot of things
> that seem to know about the grouped/ungrouped distinction, many of
> which are quite low-level functions like add_path(),
> add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
> some of this would go away if you had separate RelOptInfos instead of
> GroupedPathInfo.

I've reworked the patch so that separate RelOptInfo is used for grouped
relation. The attached patch is only the core part. Neither postgres_fdw
changes nor the part that tries to avoid the final aggregation is included
here. I'll add these when the patch does not seem to need another major rework.

[1] https://www.postgresql.org/message-id/CA+Tgmoa5Pp-DBJg=W8Xj8Czf-32PfxPgxwFPkA6qN2w_wPX8bg@mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska <ah@cybertec.at> wrote:
> > I spent some more time thinking about this. What about adding a new strategy
> > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
> > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
> > would point to the same operator, but types like numeric would naturally have
> > them different.
> >
> > Thus the pushed-down partial aggregation can only use the
> > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
> > initial version (until we have useful statistics for the binary values) we can
> > avoid the aggregation push-down if the grouping expression output type has the
> > two strategies implemented using different functions because, as you noted
> > upthread, grouping based on binary equality can result in excessive number of
> > groups.
> >
> > One open question is whether the binary equality operator needs a separate
> > operator class or not. If an opclass cares only about the binary equality, its
> > hash function(s) can be a lot simpler.
>
> Hmm.  How about instead adding another regproc field to pg_type which
> stores the OID of a function that tests binary equality for that
> datatype?  If that happens to be equal to the OID you got from the
> opclass, then you're all set.

I suppose you mean pg_operator, not pg_type. What I don't like about this is
that the new field would only be useful for a very small fraction of
operators.

On the other hand, the drawback of additional operator classes is that we'd
have to modify almost all the existing operator classes for the hash AM. (The
absence of the new strategy number in an operator class cannot mean that the
existing equality operator can be used to compare binary values too, because
then we couldn't guarantee correct behavior of already existing user-defined
operator classes.)


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Antonin Houska <ah@cybertec.at> wrote:

> I've reworked the patch so that separate RelOptInfo is used for grouped
> relation. The attached patch is only the core part. Neither postgres_fdw
> changes nor the part that tries to avoid the final aggregation is included
> here. I'll add these when the patch does not seem to need another major rework.

This is the next version. After having posted a few notes to the [1] thread,
I'm returning to the original one so it can be referenced from my entry in the
CF application. As proposed in the other thread, it uses only "simple
aggregation". The 2-stage aggregation, which gives more power to the feature
and which also enables parallel processing for it, will be coded in a separate
patch.

I eventually decided to abandon the Var substitution proposed by Heikki in
[2]. It's not necessary if we accept the restriction that, for now, the
feature accepts only a simple column reference as the grouping expression.

[1] https://www.postgresql.org/message-id/c165b72e-8dbb-2a24-291f-113aeb67b76a%40iki.fi

[2] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b%40iki.fi

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Tom Lane
Date:
Antonin Houska <ah@cybertec.at> writes:
> [ agg_pushdown_v8.tgz ]

I spent a few hours looking at this today.  It seems to me that at no
point has there been a clear explanation of what the patch is trying to
accomplish, so let me see whether I've got it straight or not.  (I suggest
that something like this ought to be included in optimizer/README; the
patch's lack of internal documentation is a serious deficiency.)

--

Conceptually, we'd like to be able to push aggregation down below joins
when that yields a win, which it could do by reducing the number of rows
that have to be processed at the join stage.  Thus consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z;

We can't simply aggregate during the scan of "a", because some values of
"x" might not appear at all in the input of the naively-computed aggregate
(if their rows have no join partner in "b"), or might appear multiple
times (if their rows have multiple join partners).  So at first glance,
aggregating before the join is impossible.  The key insight of the patch
is that we can make some progress by considering only grouped aggregation:
if we can group "a" into sets of rows that must all have the same join
partners, then we can do preliminary aggregation within each such group,
and take care of the number-of-repetitions problem when we join.  If the
groups are large then this reduces the number of rows processed by the
join, at the cost that we might spend time computing the aggregate for
some row sets that will just be discarded by the join.  So now we consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;

and ask what properties the grouping column(s) "?" must have to make this
work.  I believe we can say that it's OK to push down through a join if
its join clauses are all of the form "a.y = b.z", where either a.y or b.z
is listed as a GROUP BY column *and* the join operator is equality for
the btree opfamily specified by the SortGroupClause.  (Note: actually,
SortGroupClause per se mentions an equality operator, although I think the
planner quickly reduces that to an opfamily specification.)  The concerns
Robert had about equality semantics are certainly vital, but I believe
that the logic goes through correctly as long as the grouping equality
operator and the join equality operator belong to the same opfamily.

Case 1: the GROUP BY column is a.y.  Two rows of "a" whose y values are
equal per the grouping operator must join to exactly the same set of "b"
rows, else transitivity is failing.

Case 2: the GROUP BY column is b.z.  It still works, because the set of
"a" rows that are equal to a given z value is well-defined, and those
rows must join to exactly the "b" rows whose z entries are equal to
the given value, else transitivity is failing.

Robert postulated cases like "citext = text", but that doesn't apply
here because no cross-type citext = text operator could be part of a
well-behaved opfamily.  What we'd actually be looking at is either
"citextvar::text texteq textvar" or "citextvar citexteq textvar::citext",
and the casts prevent these expressions from matching GROUP BY entries
that have the wrong semantics.

However: what we have proven here is only that we can aggregate across
a set of rows that must share the same join partners.  We still have
to be able to handle the case that the rowset has more than one join
partner, which AFAICS means that we must use partial aggregation and
then apply an "aggmultifn" (or else apply the aggcombinefn N times).
We can avoid that and use plain aggregation when we can prove the "b"
side of the join is unique, so that no sets of rows will have to be merged
post-join; but ISTM that that reduces the set of use cases to be too small
to be worth such a complex patch.  So I'm really doubtful that we should
proceed forward with only that case available.
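To make the repetition arithmetic concrete, here is a toy Python model of the
preceding paragraph (not PostgreSQL code; the (sum, count) pair as avg()'s
transient state and the multiply-by-repetitions step standing in for a
hypothetical "aggmultifn" are illustrative assumptions):

```python
# Toy model of pushing avg() below a join using partial aggregation
# plus an "aggmultifn"-style repetition step.
from collections import defaultdict

a = [(1, 10.0), (1, 20.0), (2, 30.0)]   # (y, x) rows of "a"
b = [1, 1, 2]                           # z values of "b" (a.y = b.z)

# Naive plan: join first, then aggregate.
joined = [x for (y, x) in a for z in b if y == z]
naive_avg = sum(joined) / len(joined)

# Pushed-down plan: partially aggregate "a" by the join column y.
partial = defaultdict(lambda: (0.0, 0))          # y -> (sum, count)
for y, x in a:
    s, c = partial[y]
    partial[y] = (s + x, c + 1)

# Join each group to "b"; a group joining to n rows has its transient
# state multiplied by n (the role an aggmultifn would play for avg).
total_sum, total_count = 0.0, 0
for y, (s, c) in partial.items():
    n = sum(1 for z in b if z == y)              # number of join partners
    total_sum += s * n                           # aggmultifn: state * n
    total_count += c * n

pushed_avg = total_sum / total_count
assert pushed_avg == naive_avg
```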

Also, Tomas complained in the earlier thread that he didn't think
grouping on the join column was a very common use-case in the first
place.  I share that concern, but I think we could extend the logic
to the case that Tomas posited as being useful:

SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;

where the join column b.id is unique.  If we group on a.y (using semantics
compatible with the join operator and the uniqueness constraint), then all
"a" rows in a given group will join to exactly one "b" row that
necessarily has exactly one grouping value, so this group can safely be
aggregated together.  We might need to combine it post-join with other "b"
rows that have equal "z" values, but we can do that as long as we're okay
with partial aggregation.  I think this example shows why the idea is far
more powerful with partial aggregation than without.
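A sketch of this extended case in toy Python (sum() standing in for agg(),
dictionaries standing in for hash aggregation; all names and data are
illustrative, not PostgreSQL code):

```python
# Toy model of: SELECT agg(a.x) FROM a, b
#   WHERE a.y = b.id GROUP BY b.z, with b.id unique.
from collections import defaultdict

a = [(1, 5), (1, 7), (2, 9)]        # (y, x) rows of "a"
b = {1: 'red', 2: 'red'}            # unique id -> z

# Partial aggregation of "a" grouped on the join column y.
partial = defaultdict(int)          # y -> sum(x); sum() as the agg
for y, x in a:
    partial[y] += x

# Each group joins to exactly one "b" row (b.id is unique), so no
# repetition handling is needed; just combine the partial states that
# land under the same z value post-join.
final = defaultdict(int)
for y, s in partial.items():
    final[b[y]] += s                # combine step (aggcombinefn analog)

# Naive plan for comparison: join first, then group by z.
naive = defaultdict(int)
for y, x in a:
    naive[b[y]] += x

assert final == naive
```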

--

In short, then, I don't have much use for the patch as presented in this
thread, without "aggmultifn".  That might be OK as the first phase in a
multi-step patch, but not as a production feature.  I also have no use
for the stuff depending on bitwise equality rather than the user-visible
operators; that's just a hack, and an ugly one.

As far as the patch details go, I've not looked through it in great
detail, but it appears to me that you are still trying to cram the same
stuff into base relations as before; shoving it into a subsidiary struct
doesn't improve that.  Multiple people have said that's the wrong thing,
and I agree.  Conceptually it's a disaster: a single RelOptInfo should
represent one well-defined set of result rows, not more than one.  This
approach also cannot be extended to handle doing aggregations partway up
the join tree, which is at least theoretically interesting (though I'm
fine with not tackling it right away).  I think the right representation
is to create UPPERREL_GROUP_AGG RelOptInfos whose relids show the set
of baserels joined before aggregating.  Currently there's only one of
those and its relids is equal to all_baserels (or should be, anyway).
This patch would add instances of UPPERREL_GROUP_AGG RelOptInfos with
singleton relids, and later we might put in the ability to consider
aggregation across other relid subsets, and in any case we'd run join
planning on those level-one rels and create new UPPERREL_GROUP_AGG
RelOptInfos that represent the intermediate join steps leading up to
the final join.  The correct indexing structure for this collection of
RelOptInfos is certainly not simple_rel_array; it should look more like
the way we index the various regular join rels that we consider.  (Maybe
an appropriate preliminary patch is to refactor the support code
associated with join_rel_list + join_rel_hash so that we can treat those
fields as one instance of a structure we'll replicate later.)  Note that
I'm envisioning that we'd basically run join planning a second time on
this tree of join rels, rather than try to merge it with the primary
join planning.  Since the numbers of rows to be processed will be
completely different in this join tree, merging it with the standard one
seems hopeless.

BTW, I noticed some noise in the v8 patch: what's it doing touching
src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c?

            regards, tom lane


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Antonin Houska <ah@cybertec.at> writes:
> > [ agg_pushdown_v8.tgz ]
>
> I spent a few hours looking at this today.

Thanks!

> It seems to me that at no point has there been a clear explanation of what
> the patch is trying to accomplish, so let me see whether I've got it
> straight or not.  (I suggest that something like this ought to be included
> in optimizer/README; the patch's lack of internal documentation is a serious
> deficiency.)

An earlier version of the patch did update optimizer/README, but I forgot to
merge that into the current one. I'll fix this in the next version. (Some more
documentation will also be added, especially header comments for some
functions.)

> Conceptually, we'd like to be able to push aggregation down below joins
> when that yields a win, which it could do by reducing the number of rows
> that have to be processed at the join stage.  Thus consider
>
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
>
> We can't simply aggregate during the scan of "a", because some values of
> "x" might not appear at all in the input of the naively-computed aggregate
> (if their rows have no join partner in "b"), or might appear multiple
> times (if their rows have multiple join partners).  So at first glance,
> aggregating before the join is impossible.  The key insight of the patch
> is that we can make some progress by considering only grouped aggregation:

Yes, the patch does not handle queries with no GROUP BY clause. Primarily
because I don't know how the grouped "a" table in this example could emit the
"a.y" column w/o using it as a grouping key.

> if we can group "a" into sets of rows that must all have the same join
> partners, then we can do preliminary aggregation within each such group,
> and take care of the number-of-repetitions problem when we join.  If the
> groups are large then this reduces the number of rows processed by the
> join, at the cost that we might spend time computing the aggregate for
> some row sets that will just be discarded by the join.  So now we consider
>
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
>
> and ask what properties the grouping column(s) "?" must have to make this
> work.  I believe we can say that it's OK to push down through a join if
> its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> is listed as a GROUP BY column *and* the join operator is equality for
> the btree opfamily specified by the SortGroupClause.  (Note: actually,
> SortGroupClause per se mentions an equality operator, although I think the
> planner quickly reduces that to an opfamily specification.)

I suppose you mean make_pathkey_from_sortop().

> The concerns Robert had about equality semantics are certainly vital, but I
> believe that the logic goes through correctly as long as the grouping
> equality operator and the join equality operator belong to the same
> opfamily.

OK, a generic approach like this is always better. I just could not devise it
myself, nor can I prove your theory wrong.

> Robert postulated cases like "citext = text", but that doesn't apply
> here because no cross-type citext = text operator could be part of a
> well-behaved opfamily.

The reason I can see now is that the GROUP BY operator (opfamily) essentially
has the grouping column on both sides, so it does not match the cross-type
operator.

> What we'd actually be looking at is either "citextvar::text texteq textvar"
> or "citextvar citexteq textvar::citext", and the casts prevent these
> expressions from matching GROUP BY entries that have the wrong semantics.

Can you please explain this? If the expression is "citextvar::text texteq
textvar", then both operands of the operator should be of "text" type (because
the operator receives the output of the cast), so I'd expect a match to
SortGroupClause.eqop of clause like "GROUP BY <text expression>".

> However: what we have proven here is only that we can aggregate across
> a set of rows that must share the same join partners.  We still have
> to be able to handle the case that the rowset has more than one join
> partner, which AFAICS means that we must use partial aggregation and
> then apply an "aggmultifn" (or else apply the aggcombinefn N times).
> We can avoid that and use plain aggregation when we can prove the "b"
> side of the join is unique, so that no sets of rows will have to be merged
> post-join; but ISTM that that reduces the set of use cases to be too small
> to be worth such a complex patch.  So I'm really doubtful that we should
> proceed forward with only that case available.

I tried to follow Heikki's proposal in [1] and separated out the functionality
that does not need partial aggregation. I did not foresee that the patch would
not get much smaller this way, partly because I had to introduce functions
that check whether a particular join duplicates aggregated values or not.
(Even if we use partial aggregation, we can use this functionality to
detect that we only need to call aggfinalfn() in some cases, as opposed to
setting up regular Agg node for the final aggregation. However that's an
optimization that can probably be moved to later position in the patch
series.)

I'll re-introduce the parallel aggregation in the next patch version.

> Also, Tomas complained in the earlier thread that he didn't think
> grouping on the join column was a very common use-case in the first
> place.  I share that concern, but I think we could extend the logic
> to the case that Tomas posited as being useful:
>
> SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;
>
> where the join column b.id is unique.  If we group on a.y (using semantics
> compatible with the join operator and the uniqueness constraint), then all
> "a" rows in a given group will join to exactly one "b" row that
> necessarily has exactly one grouping value, so this group can safely be
> aggregated together.  We might need to combine it post-join with other "b"
> rows that have equal "z" values, but we can do that as long as we're okay
> with partial aggregation.  I think this example shows why the idea is far
> more powerful with partial aggregation than without.

Good idea. I'll try to incorporate it in the next version.

> In short, then, I don't have much use for the patch as presented in this
> thread, without "aggmultifn".  That might be OK as the first phase in a
> multi-step patch, but not as a production feature.

The version 8 actually implements only part of the functionality that earlier
versions did. I wanted to have more feedback from the community on the concept
before I rework the whole series again. So for version 9 I'm going to include
the partial aggregation. On the other hand, support for things like parallel
processing, append relations, or postgres_fdw doesn't seem necessary for the
initial version of the feature.

> I also have no use for the stuff depending on bitwise equality rather than
> the user-visible operators; that's just a hack, and an ugly one.

The purpose of this was to avoid aggregation push-down for data types like
numeric, see the example

SELECT one.a
FROM one, two
WHERE one.a = two.a AND scale(two.a) > 1
GROUP BY 1

in [2]. The problem is that aggregation can lose information that the
WHERE/ON clauses above still need. I'd appreciate any suggestions on how to
address this problem.
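To illustrate the information loss in Python terms (decimal.Decimal as a
stand-in for numeric; purely illustrative, not PostgreSQL code):

```python
# Toy illustration of why grouping values that are "equal" but not
# bitwise-identical (numeric '0' vs '0.00') can lose information that
# a later clause needs.
from decimal import Decimal

two_a = [Decimal('0'), Decimal('0.00')]

# The values compare equal, so equality-based grouping merges them
# into one group, keeping an arbitrary representative.
assert two_a[0] == two_a[1]

# But a clause like scale(two.a) > 1 distinguishes them: the scale
# (number of fractional digits) differs between the two inputs, and
# whichever representative survives the grouping, the other scale is
# lost.
scales = {-v.as_tuple().exponent for v in two_a}
assert scales == {0, 2}
```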

> As far as the patch details go, I've not looked through it in great
> detail, but it appears to me that you are still trying to cram the same
> stuff into base relations as before; shoving it into a subsidiary struct
> doesn't improve that.  Multiple people have said that's the wrong thing,
> and I agree.  Conceptually it's a disaster: a single RelOptInfo should
> represent one well-defined set of result rows, not more than one.

The first version of my patch did what you reject here, but then I started to
use a separate RelOptInfo for the grouped relations. First I introduced
simple_grouped_rel_array, but then (again per Heikki's suggestion) made the
"plain" RelOptInfo point to the grouped one.

> I think the right representation is to create UPPERREL_GROUP_AGG RelOptInfos

(Just a detail: since UPPERREL_PARTIAL_GROUP_AGG was added at some point, I
wonder if this is more appropriate than UPPERREL_GROUP_AGG for base relations
and joins. create_grouping_paths() can then add the paths to
UPPERREL_GROUP_AGG.)

> whose relids show the set of baserels joined before aggregating.  Currently
> there's only one of those and its relids is equal to all_baserels (or should
> be, anyway).  This patch would add instances of UPPERREL_GROUP_AGG
> RelOptInfos with singleton relids, and later we might put in the ability to
> consider aggregation across other relid subsets, and in any case we'd run
> join planning on those level-one rels and create new UPPERREL_GROUP_AGG
> RelOptInfos that represent the intermediate join steps leading up to the
> final join.  The correct indexing structure for this collection of
> RelOptInfos is certainly not simple_rel_array; it should look more like the
> way we index the various regular join rels that we consider.

Nothing like simple_rel_array (e.g. simple_grouped_rel_array) even for the
base relations? I think the fact that the extra base relations are grouped
should not affect the way they are retrieved.

> (Maybe an appropriate preliminary patch is to refactor the support code
> associated with join_rel_list + join_rel_hash so that we can treat those
> fields as one instance of a structure we'll replicate later.)

Do you mean that, instead of a single RelOptInfo, join_rel_list would contain
a structure that points to two RelOptInfo structures, where one is the "plain"
one and the other is the "grouped" one?

> Note that I'm envisioning that we'd basically run join planning a second
> time on this tree of join rels, rather than try to merge it with the primary
> join planning.  Since the numbers of rows to be processed will be completely
> different in this join tree, merging it with the standard one seems
> hopeless.

A separate run of make_join_rel() for the grouped relations is what I tried to
do so far.

> BTW, I noticed some noise in the v8 patch: what's it doing touching
> src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c?

It's definitely just noise. Something went wrong when I was extracting the
diff from my private repository.


[1] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b@iki.fi

[2] https://www.postgresql.org/message-id/20239.1516971866%40localhost

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
This is the next version. A few more comments below.

Antonin Houska <ah@cybertec.at> wrote:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Antonin Houska <ah@cybertec.at> writes:
>
> Thanks!
>
> > Conceptually, we'd like to be able to push aggregation down below joins
> > when that yields a win, which it could do by reducing the number of rows
> > that have to be processed at the join stage.  Thus consider
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
> >
> > We can't simply aggregate during the scan of "a", because some values of
> > "x" might not appear at all in the input of the naively-computed aggregate
> > (if their rows have no join partner in "b"), or might appear multiple
> > times (if their rows have multiple join partners).  So at first glance,
> > aggregating before the join is impossible.  The key insight of the patch
> > is that we can make some progress by considering only grouped aggregation:
>
> Yes, the patch does not handle queries with no GROUP BY clause. Primarily
> because I don't know how the grouped "a" table in this example could emit the
> "a.y" column w/o using it as a grouping key.
>
> > if we can group "a" into sets of rows that must all have the same join
> > partners, then we can do preliminary aggregation within each such group,
> > and take care of the number-of-repetitions problem when we join.  If the
> > groups are large then this reduces the number of rows processed by the
> > join, at the cost that we might spend time computing the aggregate for
> > some row sets that will just be discarded by the join.  So now we consider
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
> >
> > and ask what properties the grouping column(s) "?" must have to make this
> > work.  I believe we can say that it's OK to push down through a join if
> > its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> > is listed as a GROUP BY column *and* the join operator is equality for
> > the btree opfamily specified by the SortGroupClause.

I think the requirement for the join clauses to be of the form "a.y = b.z"
would only be valid if we wanted to avoid the 2-stage aggregation. For
example, in this query

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY b.z

uniqueness of "b.z" ensures that groups resulting from the pushed-down
aggregation (SELECT a.y, agg(a.x) FROM a GROUP BY a.y) do not get duplicated
by the join. However if there's a consensus that the 2-stage aggregation will
be used (because avoidance of it would limit usefulness of the feature too
much), I think we can accept generic join clauses. (Note that the patch does
not try to push the aggregate down below the nullable side of an outer join -
handling of generated NULL values by the join clauses would be another
problem.)

> > The concerns Robert had about equality semantics are certainly vital, but I
> > believe that the logic goes through correctly as long as the grouping
> > equality operator and the join equality operator belong to the same
> > opfamily.
>
> ok, generic approach like this is always better. I just could not devise it,
> nor can I prove that your theory is wrong.
>
> > Robert postulated cases like "citext = text", but that doesn't apply
> > here because no cross-type citext = text operator could be part of a
> > well-behaved opfamily.
>
> The reason I can see now is that the GROUP BY operator (opfamily) essentially
> has the grouping column on both sides, so it does not match the cross-type
> operator.

Even if my statement that the join clause can be of a kind other than
"<leftarg> = <rightarg>" was right, it might still be worth insisting on
"<leftarg> <op> <rightarg>", where <op> is of the same operator family as the
grouping operator?

> > Also, Tomas complained in the earlier thread that he didn't think
> > grouping on the join column was a very common use-case in the first
> > place.  I share that concern, but I think we could extend the logic
> > to the case that Tomas posited as being useful:
> >
> > SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;
> >
> > where the join column b.id is unique.  If we group on a.y (using semantics
> > compatible with the join operator and the uniqueness constraint), then all
> > "a" rows in a given group will join to exactly one "b" row that
> > necessarily has exactly one grouping value, so this group can safely be
> > aggregated together.  We might need to combine it post-join with other "b"
> > rows that have equal "z" values, but we can do that as long as we're okay
> > with partial aggregation.  I think this example shows why the idea is far
> > more powerful with partial aggregation than without.
>
> Good idea. I'll try to incorporate it in the next version.

The previous version of the patch (before I removed the partial aggregation)
could actually handle this case, so I restored it. It just does not check the
uniqueness of the b.id column, because the final aggregation will eliminate
the duplicated groups anyway.

The patch views the problem from this perspective: "a.y" is needed by the
join clause, and the only way to retrieve it from the partially aggregated
"a" is to use it as a grouping column. Such an "artificial" grouping
expression can only be used if the plan contains the final aggregate node,
which uses only the original GROUP BY clause.

> > I also have no use for the stuff depending on bitwise equality rather than
> > the user-visible operators; that's just a hack, and an ugly one.
>
> The purpose of this was to avoid aggregation push-down for data types like
> numeric, see the example
>
> SELECT one.a
> FROM one, two
> WHERE one.a = two.a AND scale(two.a) > 1
> GROUP BY 1
>
in [2]. The problem is that aggregation can lose information that the
WHERE/ON clauses above it still need. I appreciate any suggestions on how to
address this problem.

The new patch version does not contain this hack, but the problem remains
open. Alternatively, I'm thinking of a boolean flag to be added to the
pg_opfamily or pg_opclass catalog, saying whether equality also implies
"bitwise equality". If the value is false (the default) for the GROUP BY
operator, no aggregate push-down can take place.
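
The numeric type illustrates the problem concisely: two values can be equal
under the grouping operator while carrying different display scales, so
merging them into one group before a scale()-based filter runs would change
the query's result. For example:

```sql
-- equal under numeric's '=' operator, yet not interchangeable:
SELECT 1.0::numeric = 1.00::numeric;              -- t
SELECT scale(1.0::numeric), scale(1.00::numeric); -- 1, 2
```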

> > I think the right representation is to create UPPERREL_GROUP_AGG
> > RelOptInfos whose relids show the set of baserels joined before
> > aggregating.  Currently there's only one of those and its relids is equal
> > to all_baserels (or should be, anyway).  This patch would add instances of
> > UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might
> > put in the ability to consider aggregation across other relid subsets, and
> > in any case we'd run join planning on those level-one rels and create new
> > UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps
> > leading up to the final join.  The correct indexing structure for this
> > collection of RelOptInfos is certainly not simple_rel_array; it should
> > look more like the way we index the various regular join rels that we
> > consider.

> > (Maybe an appropriate preliminary patch is to refactor the support code
> > associated with join_rel_list + join_rel_hash so that we can treat those
> > fields as one instance of a structure we'll replicate later.)

> Do you mean that, instead of a single RelOptInfo, join_rel_list would contain
> a structure that points to two RelOptInfo structures, where one is the "plain"
> one and the other is the "grouped" one?

Eventually I thought that you proposed a separate instance of join_rel_list
(e.g. join_rel_list_grouped) for the second run of the joining process.
However, that would mean that some functions need to receive RelOptInfos as
arguments, and *also* search join_rel_list_grouped when joining a grouped
relation to a non-grouped one. So I decided to wrap both the grouped and the
non-grouped relation into a new structure (RelOptInfoSet). Thus all the
relations to be joined are passed as function arguments. make_join_rel()
shows best how instances of the new structure are used.
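
A rough sketch of what such a wrapper might look like (the field names here
are my guesses based on the description above, not necessarily the patch's
actual definition):

```c
struct RelOptInfo;               /* the planner's per-relation structure */

/* Sketch: the plain and the (partially) grouped variant of the same
 * relation travel together through the join search, so make_join_rel()
 * can consider joining either variant to another relation. */
typedef struct RelOptInfoSet
{
	struct RelOptInfo *rel_plain;	/* relation without aggregation */
	struct RelOptInfo *rel_grouped;	/* partially aggregated variant, or NULL */
} RelOptInfoSet;
```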

> [1] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b@iki.fi
>
> [2] https://www.postgresql.org/message-id/20239.1516971866%40localhost


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Michael Paquier
Date:
On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote:
> This is the next version. A few more comments below.

Latest patch set fails to apply, so moved to next CF, waiting on
author.
--
Michael

Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote:
> > This is the next version. A few more comments below.
>
> Latest patch set fails to apply, so moved to next CF, waiting on
> author.

Rebased.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Tom Lane
Date:
Antonin Houska <ah@cybertec.at> writes:
> Michael Paquier <michael@paquier.xyz> wrote:
>> Latest patch set fails to apply, so moved to next CF, waiting on
>> author.

> Rebased.

This is in need of rebasing again :-(.  I went ahead and pushed the 001
part, since that seemed fairly uncontroversial.  (Note that I changed
estimate_hashagg_tablesize's result type to double on the way; you
probably want to make corresponding adjustments in your patch.)

I did not spend a whole lot of time looking at the patch today, but
I'm still pretty distressed at the data structures you've chosen.
I remain of the opinion that a grouped relation and a base relation
are, er, unrelated, even if they happen to share the same relid set.
So I do not see the value of the RelOptInfoSet struct you propose here,
and I definitely don't think there's any value in having, eg,
create_seqscan_path or create_index_path dealing with this stuff.

I also don't like changing create_nestloop_path et al to take a PathTarget
rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong
for a path to generate a tlist different from what its parent RelOptInfo
says that the relation produces.

I think basically the way this ought to work is that we generate baserel
paths pretty much the same as today, and then we run through the baserels
and see which ones are potentially worth doing partial aggregation on,
and for each one that is, we create a separate "upper relation" RelOptInfo
describing that.  The paths for this RelOptInfo would be
partial-aggregation paths using paths from the corresponding baserel as
input.  Then we'd run a join search that only considers joining grouped
rels with plain rels (I concur that joining two grouped rels is not worth
coping with, at least for now).

            regards, tom lane


Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Antonin Houska <ah@cybertec.at> writes:
> > Michael Paquier <michael@paquier.xyz> wrote:
> >> Latest patch set fails to apply, so moved to next CF, waiting on
> >> author.
> 
> > Rebased.
> 
> This is in need of rebasing again :-(.  I went ahead and pushed the 001
> part, since that seemed fairly uncontroversial.

ok, thanks.

> I did not spend a whole lot of time looking at the patch today, but
> I'm still pretty distressed at the data structures you've chosen.
> I remain of the opinion that a grouped relation and a base relation
> are, er, unrelated, even if they happen to share the same relid set.
> So I do not see the value of the RelOptInfoSet struct you propose here,

ok. As you suggested upthread, I try now to reuse the join_rel_list /
join_rel_hash structures, see v11-001-Introduce_RelInfoList.patch.

> and I definitely don't think there's any value in having, eg,
> create_seqscan_path or create_index_path dealing with this stuff.

Originally I tried to aggregate any path that ever gets passed to add_path(),
but that's probably not worth the code complexity. Now the partial
aggregation is only applied to paths that survived add_path() on the plain
relation.

> I also don't like changing create_nestloop_path et al to take a PathTarget
> rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong
> for a path to generate a tlist different from what its parent RelOptInfo
> says that the relation produces.

Likewise, the current patch version is less invasive, so create_nestloop_path
et al are not touched at all.

> I think basically the way this ought to work is that we generate baserel
> paths pretty much the same as today, and then we run through the baserels
> and see which ones are potentially worth doing partial aggregation on,
> and for each one that is, we create a separate "upper relation" RelOptInfo
> describing that.  The paths for this RelOptInfo would be
> partial-aggregation paths using paths from the corresponding baserel as
> input.  Then we'd run a join search that only considers joining grouped
> rels with plain rels (I concur that joining two grouped rels is not worth
> coping with, at least for now).

make_join_rel() is the core of my implementation: besides joining two plain
relations, it tries to join a plain relation to a grouped one, and also to
aggregate the join of the two plain relations. I consider this approach less
invasive and more efficient than running the whole standard_join_search again
for the grouped rels.

The problem of numeric-like data types (i.e. types for which equality of two
values of the grouping key does not justify putting them into the same group,
because information like scale would be discarded that way) remains open. My
last idea was to add a boolean flag to the operator class which tells that
equality implies "bitwise equality", and to disallow aggregate push-down if
SortGroupClause.eqop belongs to an opclass which has this flag set to FALSE.
I'd like to hear your opinion before I do any coding.

-- 
Antonin Houska
https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Richard Guo
Date:


On Fri, Mar 1, 2019 at 12:01 AM Antonin Houska <ah@cybertec.at> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Antonin Houska <ah@cybertec.at> writes:
> > Michael Paquier <michael@paquier.xyz> wrote:
> >> Latest patch set fails to apply, so moved to next CF, waiting on
> >> author.
>
> > Rebased.
>
> This is in need of rebasing again :-(.  I went ahead and pushed the 001
> part, since that seemed fairly uncontroversial.

ok, thanks.


Another rebase is needed for the patches.

Thanks
Richard 

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Richard Guo <riguo@pivotal.io> wrote:

> Another rebase is needed for the patches.

Done.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Richard Guo
Date:


On Tue, Jul 9, 2019 at 9:47 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:

> Another rebase is needed for the patches.

Done.


I didn't fully follow the whole thread and mainly looked into the latest
patch set. So what were the considerations for abandoning the aggmultifn
concept? In my opinion, aggmultifn would enable many more types of
transformations. For example, consider the query below:

select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;

With the latest patch, the plan looks like:

Finalize HashAggregate    <------ sum(psum)
   Group Key: foo.a, bar.a
   ->  Hash Join
         Hash Cond: (bar.b = foo.b)
         ->  Seq Scan on bar
         ->  Hash
               ->  Partial HashAggregate    <------ sum(foo.c) as psum
                     Group Key: foo.a, foo.b
                     ->  Seq Scan on foo


If we have aggmultifn, we can perform the query this way:

Finalize HashAggregate    <------ sum(foo.c)*cnt
   Group Key: foo.a, bar.a
   ->  Hash Join
         Hash Cond: (foo.b = bar.b)
         ->  Seq Scan on foo
         ->  Hash
               ->  Partial HashAggregate    <------ count(*) as cnt
                     Group Key: bar.a, bar.b
                     ->  Seq Scan on bar


And this way:

Finalize HashAggregate    <------ sum(psum)*cnt
   Group Key: foo.a, bar.a
   ->  Hash Join
         Hash Cond: (foo.b = bar.b)
               ->  Partial HashAggregate    <------ sum(foo.c) as psum
                     Group Key: foo.a, foo.b
                     ->  Seq Scan on foo
         ->  Hash
               ->  Partial HashAggregate    <------ count(*) as cnt
                     Group Key: bar.a, bar.b
                     ->  Seq Scan on bar
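
The arithmetic behind the sum(foo.c)*cnt transformation can be checked with a
toy example (hypothetical data, plain C standing in for the executor, and the
GROUP BY reduced to a single group for brevity): each foo row joins to cnt
matching bar rows, so its contribution to the sum is multiplied by cnt.

```c
/* foo(b, c): join key b, measure c */
static const int foo_b[3] = {1, 1, 2};
static const int foo_c[3] = {10, 20, 5};
/* bar(b): join key only; two rows with b=1, three rows with b=2 */
static const int bar_b[5] = {1, 1, 2, 2, 2};

/* sum(foo.c) computed the straightforward way: join first, then sum */
static long
sum_after_join(void)
{
	long		total = 0;

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 5; j++)
			if (foo_b[i] == bar_b[j])
				total += foo_c[i];
	return total;
}

/* the pushed-down variant: partial count(*) on bar per join key, then
 * each foo.c contributes foo.c * cnt -- the aggmultifn idea */
static long
sum_with_pushdown(void)
{
	long		cnt[3] = {0, 0, 0};	/* indexed by join key 1..2 */
	long		total = 0;

	for (int j = 0; j < 5; j++)
		cnt[bar_b[j]]++;
	for (int i = 0; i < 3; i++)
		total += (long) foo_c[i] * cnt[foo_b[i]];
	return total;
}
```

Both functions return 75 for this data, which is exactly the invariant the
aggmultifn combination would have to preserve.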


Another question is about the function add_grouped_path(), when creating a
sorted aggregation path on top of a subpath: if the subpath is not sorted,
the sorted aggregation path is not generated. Why not, in this case, create a
sort path on top of the subpath first and then create the group aggregation
path on top of the sort path?


Core dump when running one query in agg_pushdown.sql

EXPLAIN ANALYZE
SELECT p.x, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1
AS c1 ON c1.parent = p.i GROUP BY p.i;


#0  0x00000000006def98 in CheckVarSlotCompatibility (slot=0x0, attnum=1, vartype=23) at execExprInterp.c:1850
#1  0x00000000006def5d in CheckExprStillValid (state=0x2b63a28, econtext=0x2ba4958) at execExprInterp.c:1814
#2  0x00000000006dee38 in ExecInterpExprStillValid (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37) at execExprInterp.c:1763
#3  0x00000000007144dd in ExecEvalExpr (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37)
    at ../../../src/include/executor/executor.h:288
#4  0x0000000000715475 in ExecIndexEvalRuntimeKeys (econtext=0x2ba4958, runtimeKeys=0x2b63910, numRuntimeKeys=1) at nodeIndexscan.c:630
#5  0x000000000071533b in ExecReScanIndexScan (node=0x2b62bf8) at nodeIndexscan.c:568
#6  0x00000000006d4ce6 in ExecReScan (node=0x2b62bf8) at execAmi.c:182
#7  0x00000000007152a0 in ExecIndexScan (pstate=0x2b62bf8) at nodeIndexscan.c:530


This is really a cool feature. Thank you for working on this.

Thanks
Richard 

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Richard Guo <riguo@pivotal.io> wrote:

> I didn't fully follow the whole thread and mainly looked into the
> latest
> patch set. So what are the considerations for abandoning the
> aggmultifn
> concept?

Originally the function was there to support joins where both relations are
grouped. My doubts about the usefulness of such joins started at [1]. (See
the thread referenced from [2].)

> In my opinion, aggmultifn would enable us to do a lot more
> types of transformation. For example, consider the query below:
> 
> select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a,
> bar.a;
> 
> With the latest patch, the plan looks like:
> 
> Finalize HashAggregate    <------ sum(psum)
>    Group Key: foo.a, bar.a
>    ->  Hash Join
>          Hash Cond: (bar.b = foo.b)
>          ->  Seq Scan on bar
>          ->  Hash
>                ->  Partial HashAggregate    <------ sum(foo.c) as
> psum
>                      Group Key: foo.a, foo.b
>                      ->  Seq Scan on foo
> 
> 
> If we have aggmultifn, we can perform the query this way:
> 
> Finalize HashAggregate    <------ sum(foo.c)*cnt
>    Group Key: foo.a, bar.a
>    ->  Hash Join
>          Hash Cond: (foo.b = bar.b)
>          ->  Seq Scan on foo
>          ->  Hash
>                ->  Partial HashAggregate    <------ count(*) as cnt
>                      Group Key: bar.a, bar.b
>                      ->  Seq Scan on bar

Perhaps, but that would require changes to nodeAgg.c, which I want to avoid
for now.

> And this way:
> 
> Finalize HashAggregate    <------ sum(psum)*cnt
>    Group Key: foo.a, bar.a
>    ->  Hash Join
>          Hash Cond: (foo.b = bar.b)
>                ->  Partial HashAggregate    <------ sum(foo.c) as
> psum
>                      Group Key: foo.a, foo.b
>                      ->  Seq Scan on foo
>          ->  Hash
>                ->  Partial HashAggregate    <------ count(*) as cnt
>                      Group Key: bar.a, bar.b
>                      ->  Seq Scan on bar

This looks like my idea presented in [1], for which there seems to be no
common use case.

> My another question is in function add_grouped_path(), when creating
> sorted aggregation path on top of subpath. If the subpath is not
> sorted,
> then the sorted aggregation path would not be generated. Why not in
> this
> case we create a sort path on top of subpath first and then create
> group
> aggregation path on top of the sort path?

I see no reason not to do it. (If you want to try, just go ahead.) However the
current patch version brings only the basic functionality and I'm not going to
add new functionality (such as parallel aggregation, partitioned tables or
postgres_fdw) until the open design questions are resolved. Otherwise there's
a risk that the additional work will be wasted due to major rework of the core
functionality.

> Core dump when running one query in agg_pushdown.sql

Thanks for the report! Fixed in the new version.

> This is really a cool feature. Thank you for working on this.

I appreciate hearing that :-) Let's see which postgres release will adopt it.

[1] https://www.postgresql.org/message-id/cc823e89-3fbc-f94e-b9d4-9c713b044b5d%402ndquadrant.com

[2] https://www.postgresql.org/message-id/flat/9666.1491295317%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Richard Guo
Date:
On Fri, Jul 12, 2019 at 4:42 PM Antonin Houska <ah@cybertec.at> wrote:
> [...]
>
> > Core dump when running one query in agg_pushdown.sql
>
> Thanks for the report! Fixed in the new version.

Another core dump for query below:

select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, t2.g2;

This is due to a small mistake:

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 10becc0..9c2deed 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -648,7 +648,7 @@ void
 add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
 {
        add_rel_info(root->grouped_rel_list, rel);
-       add_rel_info(root->agg_info_list, rel);
+       add_rel_info(root->agg_info_list, agg_info);
 }

Thanks
Richard
 

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Richard Guo <riguo@pivotal.io> wrote:

> Another core dump for query below:
> 
> select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, t2.g2;
> 
> This is due to a small mistake:
> 
> diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
> index 10becc0..9c2deed 100644
> --- a/src/backend/optimizer/util/relnode.c
> +++ b/src/backend/optimizer/util/relnode.c
> @@ -648,7 +648,7 @@ void
>  add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
>  {
>         add_rel_info(root->grouped_rel_list, rel);
> -       add_rel_info(root->agg_info_list, rel);
> +       add_rel_info(root->agg_info_list, agg_info);
>  }
> 

Thanks! Yes, this is what I had to fix.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Alvaro Herrera
Date:
This stuff seems very useful.  How come it has sat unreviewed for so long?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> This stuff seems very useful.  How come it sits unreviewed for so long?

I think the review is hard for people who are not much interested in the
planner. And as for further development, there are a few design decisions
that can hardly be resolved without Tom Lane's comments. Right now I recall
two problems: 1) is the way I currently store RelOptInfo for the grouped
relations correct? 2) how should we handle types for which logical equality
does not imply physical (byte-wise) equality?

Fortunately it seems now that I'm not the only one who cares about 2), so this
problem might be resolved soon:

https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com

But 1) still remains.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

From
Tomas Vondra
Date:
Hi,

I've been looking at the last version (v14) of this patch series, submitted
way back in July and unfortunately quiet since then. Antonin is probably
right that one of the reasons for the lack of reviews is that it requires
interest in and experience with the planner.

Initially it was also a bit hard to understand what the benefits are (and
the patch shifted a bit), which is now mostly addressed by the README in the
last patch. The trouble is that it's hidden in the patch and so not
immediately obvious to people considering reviewing it :-( Tom posted a nice
summary in November 2018, but it was perhaps too focused on the internals.

So my first suggestion is to write a short summary with an example of how it
affects practical queries (plan change, speedup) etc.

My second suggestion is to have meaningful commit messages for each part
of the patch series, explaining why we need that particular part. It
might have been explained somewhere in the thread, but as a reviewer I
really don't want to reverse-engineer the whole thread.


Now, regarding individual parts of the patch:


1) v14-0001-Introduce-RelInfoList-structure.patch
-------------------------------------------------

- I'm not entirely sure why we need this change. We had the list+hash
   before, so I assume we do this because we need the output functions?

- The RelInfoList naming is pretty confusing, because it's actually not
   a list but a combination of list+hash. And the embedded list is called
   items, which makes it yet a bit more confusing. I suggest we call this
   just RelInfo or RelInfoLookup, or something else not including List.

- RelInfoEntry seems severely under-documented, particularly the data
   part (which is void *, making it even harder to understand what it's
   for without having to read the functions). AFAICS it's simply a link
   to the embedded list, but maybe I'm wrong.

- I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
   right before build_join_rel. This makes diff clearer/smaller and
   visual diffs nicer.


2) v14-0002-Introduce-make_join_rel_common-function.patch
---------------------------------------------------------

I see no point in keeping this change in a separate patch, as it pretty
much just renames make_join_rel to make_join_rel_common and then adds 

   make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
   {
       return make_join_rel_common(root, rel1, rel2);
   }

which seems rather trivial and useless on its own. I'd just merge it
into 0003, where we use the _common function more extensively.


3) v14-0003-Aggregate-push-down-basic-functionality.patch
---------------------------------------------------------

I haven't done much review on this yet, but I've done some testing and
benchmarking so let me share at least that. The queries I used were of
the type I mentioned earlier in this thread - essentially a star schema,
i.e. fact table referencing dimensions, with aggregation per columns in
the dimension. So something like

   SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id)
   GROUP BY d.c;

where "fact" table is much much larger than the dimension.
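
For reference, a minimal setup of that shape might look like this (table,
column names and sizes are illustrative choices matching the query above,
not the exact benchmark script):

```sql
CREATE TABLE dimension (id int PRIMARY KEY, c text);
CREATE TABLE fact (d_id int REFERENCES dimension (id), f numeric);

INSERT INTO dimension
SELECT i, 'cat_' || (i % 10) FROM generate_series(1, 100) AS i;
INSERT INTO fact
SELECT 1 + (i % 100), random() FROM generate_series(1, 1000000) AS i;
ANALYZE;
```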

Attached is a script I used for testing with a bunch of queries and a
number of parameters (parallelism, size of dimension, size of fact, ...)
and a spreadsheed summarizing the results.

Overall, the results seem pretty good - in many cases the queries get
about 2x faster and I haven't seen any regressions. That's pretty nice.

One annoying thing is that this only works for non-parallel queries.
That is, I saw no improvements with parallel queries. It was still
faster than the non-parallel query with aggregate pushdown, but the
parallel query did not improve.

An example of timing looks like this:

   non-parallel (pushdown = off): 918 ms
   non-parallel (pushdown = on):  561 ms
   parallel (pushdown = off):     313 ms
   parallel (pushdown = on):      313 ms

The non-parallel query gets faster because after enabling push-down the
plan changes (and we get partial aggregate below the join). With
parallel query that does not happen, the plans stay the same.

I'm not claiming this is a bug - we end up picking the fastest execution
plan (i.e. we don't make a mistake by e.g. switching to the non-parallel
one with pushdown).

My understanding is that the aggregate pushdown can't (currently?) be
used with queries that use partial aggregation because of parallelism.
That is, we can either end up with a plan like this:

   Finalize Aggregate
     -> Join
       -> Partial Aggregate
       -> ...

or a parallel plan like this:

   Finalize Aggregate
     -> Gather
         -> Partial Aggregate
             -> Join
                 -> ...
                 -> ...

but we currently don't support a combination of both

   Finalize Aggregate
     -> Gather
         -> Join
             -> Partial Aggregate
             -> ...

I wonder if that's actually even possible and what would it take to make
it work?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> Hi,
> 
> I've been looking at the last version (v14) of this patch series,
> submitted way back in July and unfortunately quiet since then. Antonin
> is probably right one of the reasons for the lack of reviews is that it
> requires interest/experience with planner.
> 
> Initially it was also a bit hard to understand what are the benefits
> (and the patch shifted a bit), which is now mostly addressed by the
> README in the last patch. The trouble is that's hidden in the patch and
> so not immediately obvious to people considering reviewing it :-( Tom
> posted a nice summary in November 2018, but it was perhaps focused on
> the internals.
> 
> So my first suggestion it to write a short summary with example how it
> affects practical queries (plan change, speedup) etc.

I think the README plus the regression test output should be enough for
someone who is about to review a patch as complex as this.

> My second suggestion is to have meaningful commit messages for each part
> of the patch series, explaining why we need that particular part. It
> might have been explained somewhere in the thread, but as a reviewer I
> really don't want to reverse-engineer the whole thread.

ok, done.

> Now, regarding individual parts of the patch:
> 
> 
> 1) v14-0001-Introduce-RelInfoList-structure.patch
> -------------------------------------------------
> 
> - I'm not entirely sure why we need this change. We had the list+hash
>   before, so I assume we do this because we need the output functions?

I believe that this is what Tom proposed in [1], see "Maybe an appropriate
preliminary patch is to refactor the support code ..." near the end of that
message. The point is that now we need two lists: one for "plain" relations
and one for grouped ones.

> - The RelInfoList naming is pretty confusing, because it's actually not
>   a list but a combination of list+hash. And the embedded list is called
>   items, to make it yet a bit more confusing. I suggest we call this
>   just RelInfo or RelInfoLookup or something else not including List.

I think it can be considered a list by the caller of add_join_rel() etc. The
hash table is hidden from the user and stays empty until the list becomes
long enough. The word "list" in the structure name may just indicate that new
elements can be added to the end, which shouldn't be expected if the
structure were an array.

I searched a bit in tools/pgindent/typedefs.list and found at least a few
structures that also have "list" in the name but are not lists internally:
CatCList, FuncCandidateList, MCVList.

Actually I'm not opposed to renaming the structure, but I don't have a better
idea right now. As for *Lookup: the following is the only use of such a
structure name in the PG code, and it's not something to store data in:

typedef enum
{
    IDENTIFIER_LOOKUP_NORMAL,    /* normal processing of var names */
    IDENTIFIER_LOOKUP_DECLARE,    /* In DECLARE --- don't look up names */
    IDENTIFIER_LOOKUP_EXPR        /* In SQL expression --- special case */
} IdentifierLookup;

> - RelInfoEntry seems severely under-documented, particularly the data
>   part (which is void* making it even harder to understand what's it for
>   without having to read the functions). AFAICS it's just simply a link
>   to the embedded list, but maybe I'm wrong.

This is just JoinHashEntry (which was also undocumented) renamed. I've added a
short comment now.

> - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
>   right before build_join_rel. This makes diff clearer/smaller and
>   visual diffs nicer.

Hm, it might improve readability of the diff, but this API is exactly where I
still need feedback from Tom. I'm not eager to optimize the diff as long as
there's a risk that these functions will be removed or renamed.

> 2) v14-0002-Introduce-make_join_rel_common-function.patch
> ---------------------------------------------------------
> 
> I see no point in keeping this change in a separate patch, as it prety
> much just renames make_join_rel to make_join_rel_common and then adds 
> 
>   make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
>   {
>       return make_join_rel_common(root, rel1, rel2);
>   }
> 
> which seems rather trivial and useless on it's own. I'd just merge it
> into 0003 where we use the _common function more extensively.

ok. I thought that this improves readability of the diffs, but it doesn't look
that bad if this is included in 0003.

> 
> 3) v14-0003-Aggregate-push-down-basic-functionality.patch
> ---------------------------------------------------------
> 
> I haven't done much review on this yet, but I've done some testing and
> benchmarking so let me share at least that. The queries I used were of
> the type I mentioned earlier in this thread - essentially a star schema,
> i.e. fact table referencing dimensions, with aggregation per columns in
> the dimension. So something like
> 
>   SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id)
>   GROUP BY d.c;
> 
> where "fact" table is much much larger than the dimension.
> 
> Attached is a script I used for testing with a bunch of queries and a
> number of parameters (parallelism, size of dimension, size of fact, ...)
> and a spreadsheed summarizing the results.
> 
> Overall, the results seem pretty good - in many cases the queries get
> about 2x faster and I haven't seen any regressions. That's pretty nice.

Thanks for such an elaborate testing! The ultimate goal of this patch is to
improve sharding (using postgres_fdw), but it's nice to see significant
improvement even for local queries.
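To illustrate why the partial aggregate below the join helps in this star-schema scenario, here is a small self-contained Python sketch (toy data, not Tomas's benchmark script) that evaluates the query both ways and checks that the results agree:

```python
from collections import defaultdict

# Toy star schema: a large fact table referencing a small dimension.
fact = [(1, 10), (1, 20), (2, 5), (2, 15), (1, 7)]   # (d_id, f)
dimension = [(1, 'red'), (2, 'blue')]                # (id, c)

# Without push-down: join first, then aggregate by d.c.
naive = defaultdict(int)
for d_id, f in fact:
    for did, c in dimension:
        if did == d_id:
            naive[c] += f

# With push-down: partially aggregate the fact table by d_id first
# (few groups), join the much smaller result, then finalize by d.c.
partial = defaultdict(int)
for d_id, f in fact:
    partial[d_id] += f                     # partial aggregate below the join
pushed = defaultdict(int)
for d_id, s in partial.items():
    for did, c in dimension:
        if did == d_id:
            pushed[c] += s                 # finalize after the join

print(dict(naive))                         # {'red': 37, 'blue': 20}
print(dict(naive) == dict(pushed))         # True
```

The push-down plan joins one pre-aggregated row per d_id instead of every fact row, which is where the speedup on large fact tables comes from.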

> One annoying thing is that this only works for non-parallel queries.
> That is, I saw no improvements with parallel queries. It was still
> faster than the non-parallel query with aggregate pushdown, but the
> parallel query did not improve.

> An example of timing looks like this:
> 
>   non-parallel (pushdown = off): 918 ms
>   non-parallel (pushdown = on):  561 ms
>   parallel (pushdown = off):     313 ms
>   parallel (pushdown = on):      313 ms
> 
> The non-parallel query gets faster because after enabling push-down the
> plan changes (and we get partial aggregate below the join). With
> parallel query that does not happen, the plans stay the same.
> 
> I'm not claiming this is a bug - we end up picking the fastest execution
> plan (i.e. we don't make a mistake by e.g. switching to the non-parallel
> one with pushdown).
> 
> My understanding is that the aggregate pushdown can't (currently?) be
> used with queries that use partial aggregate because of parallelism.
> That is we can either end up with a plan like this:
> 
>   Finalize Aggregate
>     -> Join
>       -> Partial Aggregate
>       -> ...
> 
> or a parallel plan like this:
> 
>   Finalize Aggregate
>     -> Gather
>         -> Partial Aggregate
>             -> Join
>                 -> ...
>                 -> ...
> 
> but we currently don't support a combination of both
> 
>   Finalize Aggregate
>     -> Gather
>         -> Join
>             -> Partial Aggregate
>             -> ...
> 
> I wonder if that's actually even possible and what would it take to make
> it work?

I had a prototype of the feature that does affect parallel queries (as well as
partitioned tables and postgres_fdw), but didn't include these parts in the
recent patch versions. The point is that the API to store RelOptInfos can
still change and in such a case I'd have to rebase too much code.

v15 is attached. Actually there are no significant changes relative to v14,
but v14 does not apply to the current master:

error: patch failed: src/test/regress/serial_schedule:140
error: src/test/regress/serial_schedule: patch does not apply

This is interesting because no problem is currently reported at
http://commitfest.cputube.org/

[1] https://www.postgresql.org/message-id/9726.1542577439@sss.pgh.pa.us

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: WIP: Aggregation push-down

From
legrand legrand
Date:
Hello,

Thank you for this great feature!
I hope this will be reviewed/validated soon ;o)

Just a comment: after
set enable_agg_pushdown to true;
the setting isn't displayed by the EXPLAIN (SETTINGS) syntax.


The following modification seems to fix that: 

src/backend/utils/misc/guc.c
 
         {"enable_agg_pushdown", PGC_USERSET, QUERY_TUNING_METHOD,
            gettext_noop("Enables aggregation push-down."),
            NULL,
            GUC_EXPLAIN   <<<<--- line Added --->>>>
        },
        &enable_agg_pushdown,
        false,
        NULL, NULL, NULL
    },


then
postgres=# set enable_agg_pushdown = true;
SET
postgres=# explain (settings) select 1;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4)
 Settings: enable_agg_pushdown = 'on'
(2 rows)


Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: WIP: Aggregation push-down

From
Antonin Houska
Date:
legrand legrand <legrand_legrand@hotmail.com> wrote:

> set enable_agg_pushdown to true;
> isn't displayed in EXPLAIN (SETTINGS) syntax.
> 
> 
> The following modification seems to fix that: 
> 
> src/backend/utils/misc/guc.c
>  
>          {"enable_agg_pushdown", PGC_USERSET, QUERY_TUNING_METHOD,
>             gettext_noop("Enables aggregation push-down."),
>             NULL,
>             GUC_EXPLAIN   <<<<--- line Added --->>>>
>         },
>         &enable_agg_pushdown,
>         false,
>         NULL, NULL, NULL
>     },

Thanks. I'll include this change in the next version.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

From
Richard Guo
Date:
Hi,

I've been looking at the 'make_join_rel()' part of the patch, and I'm
aware that, if we are joining A and B, a 'grouped join rel (AB)' would
be created besides the 'plain join rel (AB)', and paths are added by 1)
applying partial aggregate to the paths of the 'plain join rel (AB)', or
2) joining grouped A to plain B or joining plain A to grouped B.

This is a smart idea. One issue I can see is that some logic would have
to be repeated several times. For example, the validity check for the
same proposed join would be performed at most three times by
join_is_legal().

I'm thinking of another idea that, instead of using a separate
RelOptInfo for the grouped rel, we add in RelOptInfo a
'grouped_pathlist' for the grouped paths, just like how we implement
parallel query via adding 'partial_pathlist'.

For base rel, if we decide it can produce grouped paths, we create the
grouped paths by applying partial aggregation to the paths in 'pathlist'
and add them into 'grouped_pathlist'.

For join rel (AB), we can create the grouped paths for it by:
1) joining a grouped path from 'A->grouped_pathlist' to a plain path
from 'B->pathlist', or vice versa;
2) applying partial aggregation to the paths in '(AB)->pathlist'.

This is basically the same idea, just moves the grouped paths from the
grouped rel to a 'grouped_pathlist'. With it we should not need to make
any code changes to 'make_join_rel()'. Most code changes would happen in
'add_paths_to_joinrel()'.
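Schematically, the idea might look like this toy Python model (illustrative only, not planner C code; the path names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class RelOptInfo:
    name: str
    pathlist: list = field(default_factory=list)          # plain paths
    grouped_pathlist: list = field(default_factory=list)  # grouped paths

a = RelOptInfo('A', pathlist=['SeqScan(A)'])
b = RelOptInfo('B', pathlist=['SeqScan(B)'])

# Base rel: grouped paths come from partial aggregation of the plain paths.
a.grouped_pathlist = [f'PartialAgg({p})' for p in a.pathlist]

ab = RelOptInfo('AB')
ab.pathlist = [f'Join({pa}, {pb})' for pa in a.pathlist for pb in b.pathlist]
# 1) join a grouped path to a plain path (and vice versa);
ab.grouped_pathlist = [f'Join({ga}, {pb})'
                       for ga in a.grouped_pathlist for pb in b.pathlist]
ab.grouped_pathlist += [f'Join({pa}, {gb})'
                        for pa in a.pathlist for gb in b.grouped_pathlist]
# 2) apply partial aggregation to the plain join paths.
ab.grouped_pathlist += [f'PartialAgg({p})' for p in ab.pathlist]

for p in ab.grouped_pathlist:
    print(p)
# Join(PartialAgg(SeqScan(A)), SeqScan(B))
# PartialAgg(Join(SeqScan(A), SeqScan(B)))
```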

Will this idea work? Is it better or worse?

Thanks
Richard

Re: [HACKERS] WIP: Aggregation push-down

From
Antonin Houska
Date:
Richard Guo <guofenglinux@gmail.com> wrote:

> Hi,
>
> I've been looking at the 'make_join_rel()' part of the patch, and I'm
> aware that, if we are joining A and B, a 'grouped join rel (AB)' would
> be created besides the 'plain join rel (AB)', and paths are added by 1)
> applying partial aggregate to the paths of the 'plain join rel (AB)', or
> 2) joining grouped A to plain B or joining plain A to grouped B.
>
> This is a smart idea. One issue I can see is that some logic would have
> to be repeated several times. For example, the validity check for the
> same proposed join would be performed at most three times by
> join_is_legal().
>
> I'm thinking of another idea that, instead of using a separate
> RelOptInfo for the grouped rel, we add in RelOptInfo a
> 'grouped_pathlist' for the grouped paths, just like how we implement
> parallel query via adding 'partial_pathlist'.
>
> For base rel, if we decide it can produce grouped paths, we create the
> grouped paths by applying partial aggregation to the paths in 'pathlist'
> and add them into 'grouped_pathlist'.
>
> For join rel (AB), we can create the grouped paths for it by:
> 1) joining a grouped path from 'A->grouped_pathlist' to a plain path
> from 'B->pathlist', or vice versa;
> 2) applying partial aggregation to the paths in '(AB)->pathlist'.
>
> This is basically the same idea, just moves the grouped paths from the
> grouped rel to a 'grouped_pathlist'. With it we should not need to make
> any code changes to 'make_join_rel()'. Most code changes would happen in
> 'add_paths_to_joinrel()'.
>
> Will this idea work? Is it better or worse?

I tried such an approach in an earlier version of the patch [1], and I think the
reason also was to avoid repeated calls of functions like
join_is_legal(). However, there were objections against that approach,
e.g. [2], and I admit that it was more invasive than what the current patch
version does.

Perhaps we can cache the result of join_is_legal() that we get for the plain
relation, and use it for the group relation. I'll consider that. Thanks.

[1] https://www.postgresql.org/message-id/18007.1513957437%40localhost
[2]
https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: WIP: Aggregation push-down

From
legrand legrand
Date:
Antonin Houska-2 wrote
> Alvaro Herrera <

> alvherre@

> > wrote:
> 
>> This stuff seems very useful.  How come it sits unreviewed for so long?
> 
> I think the review is hard for people who are not interested in the
> planner
> very much. And as for further development, there are a few design
> decisions
> that can hardly be resolved without Tom Lane's comments. Right now I
> recall
> two problems: 1) is the way I currently store RelOptInfo for the grouped
> relations correct?, 2) how should we handle types for which logical
> equality
> does not imply physical (byte-wise) equality?
> 
> Fortunately it seems now that I'm not the only one who cares about 2), so
> this
> problem might be resolved soon:
> 
> https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> 
> But 1) still remains.
> 
> -- 
> Antonin Houska
> Web: https://www.cybertec-postgresql.com

Hello
would "pgsql: Add equalimage B-Tree support functions."
https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org
help with 2)?

Regards
PAscal







Re: WIP: Aggregation push-down

From
Antonin Houska
Date:
legrand legrand <legrand_legrand@hotmail.com> wrote:

> Antonin Houska-2 wrote

> > Right now I recall two problems: 1) is the way I currently store
> > RelOptInfo for the grouped relations correct?, 2) how should we handle
> > types for which logical equality does not imply physical (byte-wise)
> > equality?
> >
> > Fortunately it seems now that I'm not the only one who cares about 2), so
> > this
> > problem might be resolved soon:
> >
> > https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> >
> > But 1) still remains.
> >
>
> Hello
> would "pgsql: Add equalimage B-Tree support functions."
> https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org

Yes, it seems so. I'll adapt the patch soon, hopefully next week. Thanks for the
reminder.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: WIP: Aggregation push-down

From
Andy Fan
Date:


On Thu, Feb 27, 2020 at 4:50 PM Antonin Houska <ah@cybertec.at> wrote:
legrand legrand <legrand_legrand@hotmail.com> wrote:

> Antonin Houska-2 wrote

> > Right now I recall two problems: 1) is the way I currently store
> > RelOptInfo for the grouped relations correct?, 2) how should we handle
> > types for which logical equality does not imply physical (byte-wise)
> > equality?
> >
> > Fortunately it seems now that I'm not the only one who cares about 2), so
> > this
> > problem might be resolved soon:
> >
> > https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> >
> > But 1) still remains.
> >
>
> Hello
> would "pgsql: Add equalimage B-Tree support functions."
> https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org

Yes, it seems so. I'll adapt the patch soon, hopefully next week. Thanks for the
reminder.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com


Hi Antonin:

The more I test your patch, the more powerful I find it! At the same time,
I think the most difficult part of your design to understand is that it can accept
any number of generic join clauses, so I guess more explanation of this part
may be helpful.

At the code level, I made some slight changes to init_grouping_targets which may
make the code easier to read. You are free to use them or not.

Hope your patch get more attention soon!

Best Regards
Andy Fan


Attachment

Re: WIP: Aggregation push-down

From
Andy Fan
Date:

> 1) v14-0001-Introduce-RelInfoList-structure.patch
> -------------------------------------------------
>
> - I'm not entirely sure why we need this change. We had the list+hash
> before, so I assume we do this because we need the output functions?

I believe that this is what Tom proposed in [1], see "Maybe an appropriate
preliminary patch is to refactor the support code ..." near the end of that
message. The point is that now we need two lists: one for "plain" relations
and one for grouped ones.

 
I guess what Tom suggested [1] is to store the grouped ones in
root->upper_rels rather than in a separate member; see fetch_upper_rel
or UpperRelationKind.  If we do need the list+hash method for long-list lookup,
we can merge it into upper_rels.  If we want this benefit somewhere other than
root->upper_rels, we can store a generic type in RelInfoList; currently
it is bound to RelAggInfo besides RelOptInfo.  But overall, personally I think we can
ignore such an optimization at the first stage, to save the attention of the core reviewers
since they are too precious :)    Just FYI


Best Regards
Andy Fan

Re: WIP: Aggregation push-down

From
Antonin Houska
Date:
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> The more I test your patch, the more powerful I find it!

Thanks for the appreciation. Given the poor progress it's increasingly hard
for me to find motivation to work on it. I'll try to be more professional :-)

> At the same time, I think the most difficult part to understand your design
> is you can accept any number of generic join clauses, so I guess more
> explanation on this part may be helpful.

ok, I'll consider adding some comments, although the concept is mentioned in
optimizer/README

+Furthermore, extra grouping columns can be added to the partial Agg node if a
+join clause above that node references a column which is not in the query
+GROUP BY clause and which could not be derived using equivalence class.
+
...

> At the code level, I made some slight changes to init_grouping_targets which may
> make the code easier to read.  You are free to use them or not.

I'm going to accept your change of create_rel_agg_info(), but I hesitate about
the changes to init_grouping_targets().

First, is it worth spending CPU cycles on the construction of an extra list,
grouping_columns? Is there a corner case in which we cannot simply pass
grouping_columns=target->exprs to check_functional_grouping()?

Second, it's obvious that you prefer the style

    foreach ()
    {
        if ()
           ...
        else if ()
           ...
        else
           ...
    }

over this

    foreach ()
    {
        if ()
        {
           ...
           continue;
        }

        if ()
        {
           ...
           continue;
        }

        ...
    }

I often prefer the latter and I see that the existing planner code uses this
style quite often too. I think the reason is that it allows for more complex
tests, while the "else-if style" requires all tests to take place inside the
"if ()" expression. However, if several (not necessarily tightly related)
tests become "compressed" this way, it's less obvious where to put
comments.  Indeed it seems that some comments got lost due to your changes.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: WIP: Aggregation push-down

From
Antonin Houska
Date:
Andy Fan <zhihui.fan1213@gmail.com> wrote:

>  > 1) v14-0001-Introduce-RelInfoList-structure.patch
>  > -------------------------------------------------
>  >
>  > - I'm not entirely sure why we need this change. We had the list+hash
>  > before, so I assume we do this because we need the output functions?
>
>  I believe that this is what Tom proposed in [1], see "Maybe an appropriate
>  preliminary patch is to refactor the support code ..." near the end of that
>  message. The point is that now we need two lists: one for "plain" relations
>  and one for grouped ones.
>
>
> I guess what Tom suggested [1] is to store the grouped ones in
> root->upper_rels rather than in a separate member; see fetch_upper_rel
> or UpperRelationKind.  If we do need the list+hash method for long-list lookup,
> we can merge it into upper_rels.  If we want this benefit somewhere other
> than root->upper_rels, we can store a generic type in RelInfoList; currently
> it is bound to RelAggInfo besides RelOptInfo.   But overall, personally I think we can
> ignore such an optimization at the first stage, to save the attention of the core reviewers
> since they are too precious :)    Just FYI
>
> [1] https://www.postgresql.org/message-id/9726.1542577439@sss.pgh.pa.us

Hm, you seem to be right, not sure why I missed the point. I thought that the
reason Tom doesn't like storing the grouped relations in simple_rel_array is
that we only need the grouped base relations inside query_planner(), but
simple_rel_array is used higher in the stack. So I introduced a new field and
used it only in query_planner() and subroutines.

Yes, it's better to use root->upper_rels than to introduce the new field. I'll
adjust the patch. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: WIP: Aggregation push-down

From
Andy Fan
Date:


On Fri, Apr 24, 2020 at 8:10 PM Antonin Houska <ah@cybertec.at> wrote:
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> The more I test your patch, the more powerful I find it!

Thanks for the appreciation. Given the poor progress it's increasingly hard
for me to find motivation to work on it. I'll try to be more professional :-)


Let's not give up :)  I see your patch can push down the aggregation in at least
3 cases:

1).  The group by clause exists in the join eq clause.
2).  The group by clause doesn't exist in the join eq clause, but we have a
PK on one side of the join eq clause.
3).  The group by clause doesn't exist in the join eq clause, and we don't have
the PK either.

Tom explained the correctness of the first 2 cases well [1], and case 3) is
probably correct as well, but it is a bit hard to understand.  If that is the case
for others too, it probably makes the review harder.

So my little suggestion is: can we split the patch into smaller commits
that handle each case?  Like: commits 1 & 2 handle cases 1 & 2, commit 3 handles
join relations as well, commit 4 handles case 3, commit 5 avoids the
two-phase aggregation for case 2, commit 6 introduces the aggmultifn, and
commit 7 handles the outer join case Ashutosh raised at [2].  However, not
all the cases need to be handled at one time.

Actually, when I read the code, I wanted to make such a separation myself to verify
that my understanding is correct, but I don't have the time for it, at least for now.
It may be much easier for you since you are the author.  I will be pleased to help
in any case after I close my current work item (hopefully in 2 months).

> At the code level, I made some slight changes to init_grouping_targets which may
> make the code easier to read.  You are free to use them or not.

I'm going to accept your change of create_rel_agg_info(), but I hesitate about
the changes to init_grouping_targets().

First, is it worth spending CPU cycles on the construction of an extra list,
grouping_columns? Is there a corner case in which we cannot simply pass
grouping_columns=target->exprs to check_functional_grouping()?

Second, it's obvious that you prefer the style

    foreach ()
    {
        if ()
           ...
        else if ()
           ...
        else
           ...
    }

over this

    foreach ()
    {
        if ()
        {
           ...
           continue;
        }

        if ()
        {
           ...
           continue;
        }

        ...
    }

I often prefer the latter and I see that the existing planner code uses this
style quite often too. I think the reason is that it allows for more complex
tests, while the "else-if style" requires all tests to take place inside the
"if ()" expression. However, if several (not necessarily tightly related)
tests become "compressed" this way, it's less obvious where to put
comments.  Indeed it seems that some comments got lost due to your changes.

Your explanation looks reasonable, thanks for that.  The changes also used
some built-in functions, like tlist_member, to avoid one more level of for-loop.

As for the high-level design, based on my current knowledge, there is probably
no need to change it.


 

Re: WIP: Aggregation push-down

From
Antonin Houska
Date:
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> On Fri, Apr 24, 2020 at 8:10 PM Antonin Houska <ah@cybertec.at> wrote:
>
> Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Let's not give up :)  I see your patch can push down the aggregation in at least
> 3 cases:
>
> 1).  The group by clause exists in the join eq clause.
> 2).  The group by clause doesn't exist in the join eq clause, but we have a
> PK on one side of the join eq clause.
> 3).  The group by clause doesn't exist in the join eq clause, and we don't have
> the PK either.
>
> Tom well explained the correctness of the first 2 cases [1]

I admit that I don't quite understand his statement

    We can't simply aggregate during the scan of "a", because some values of
    "x" might not appear at all in the input of the naively-computed aggregate
    (if their rows have no join partner in "b"), or might appear multiple
    times (if their rows have multiple join partners).

Consider the following tables

    =======         ======
       A              B
    =======         ======
     x | y            z
    -------         ------
     3 | 4            1
     1 | 4            2
     4 | 0            3
     1 | 0            4


and this query

    SELECT sum(a.x), a.y FROM a, b WHERE a.y > b.z GROUP BY a.y;

Without aggregate push-down, the aggregation input is

     a.x | a.y | b.z
    ----------------
      3  |  4  |  1
      3  |  4  |  2
      3  |  4  |  3
      1  |  4  |  1
      1  |  4  |  2
      1  |  4  |  3


and the query result is

     sum(a.x) | a.y
    -----------------
         12   |  4


With the push-down, "A" aggregated looks like

     sum(a.x) | a.y
    -----------------
          4   |  4
          5   |  0

and the join with "B" produces this:

     sum(a.x) | a.y | b.z
    ---------------------
          4   |  4  | 1
          4   |  4  | 2
          4   |  4  | 3

After that, the final aggregation yields the same result that we got w/o the
aggregate push-down:

     sum(a.x) | a.y
    -----------------
         12   |  4

So if a row of "A" has no / multiple join partners in "B", the same is true
for the group produced by the pushed-down aggregation. Of course, it's assumed
that the input rows belonging to a particular group can be distributed among the
PartialAgg nodes in an arbitrary way, for example:

    Agg
        <value 1>
        <value 1>
        <value 2>
        <value 2>

needs to be equivalent to

    FinalAgg
        PartialAgg
            <value 1>
            <value 2>
        PartialAgg
            <value 1>
            <value 2>

This should be OK because the partial aggregation currently performed by
parallel workers relies on the same "distributive property" of aggregation.
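As a sanity check of this equivalence, here is a short Python sketch using the sample tables above (sum() only; aggregates like avg() would additionally need their transient states combined) that computes the query both ways:

```python
from collections import defaultdict

a = [(3, 4), (1, 4), (4, 0), (1, 0)]   # rows of "A": (x, y)
b = [1, 2, 3, 4]                       # rows of "B": z

# Without push-down: join, then sum(a.x) GROUP BY a.y.
naive = defaultdict(int)
for x, y in a:
    for z in b:
        if y > z:
            naive[y] += x

# With push-down: PartialAgg on "A", join, then FinalAgg.
partial = defaultdict(int)
for x, y in a:
    partial[y] += x                    # {4: 4, 0: 5}
final = defaultdict(int)
for y, s in partial.items():
    for z in b:
        if y > z:
            final[y] += s              # each join partner re-adds the partial sum

print(dict(naive))                     # {4: 12}
print(dict(naive) == dict(final))      # True
```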

> So my little suggestion is can we split the patch into some smaller commit
> to handle each case?  like: commit 1 & 2 handles case 1 & 2,commit 3 handles
> join relation as well.   commit 4 handles the case 3.

To separate the patch this way effectively means reworking the feature almost
from scratch. I'm still not convinced that such a rework is necessary.

> Commit 5 can avoid the two-phase aggregation for case 2.

The two-phase aggregation can be avoided in a more generic way as soon as the
"unique key" feature (I think you participated in the related discussion)
is committed.

> Commit 6 introduces the aggmultifn.

I used this function in the initial version of the patch [3] to join two
grouped relations. The current version does not try to join two grouped
relations, and it's hard to imagine how to determine the "multiplication
factor". I suppose this is not really trivial. Since the current version of
the feature hasn't found its way into PG core for several years, I'm not
willing to make it even trickier soon, if at all.

> and commit 7 to handle some outer join case Ashutosh raised at [2].

That design was too high-level, as he admits in [4].

> However not all the cases need to be handled at one time.

Sure.

>
>  > At the code level, I did some slight changes on init_grouping_targets which may
>  > make the code easier to read.  You are free to to use/not use it.
>
>  I'm going to accept your change of create_rel_agg_info(), but I hesitate about
>  the changes to init_grouping_targets().
>
>  First, is it worth to spend CPU cycles on construction of an extra list
>  grouping_columns? Is there a corner case in which we cannot simply pass
>  grouping_columns=target->exprs to check_functional_grouping()?
>
>  Second, it's obvious that you prefer the style
>
>      foreach ()
>      {
>          if ()
>             ...
>          else if ()
>             ...
>          else
>             ...
>      }
>
>  over this
>
>      foreach ()
>      {
>          if ()
>          {
>             ...
>             continue;
>          }
>
>          if ()
>          {
>             ...
>             continue;
>          }
>
>          ...
>      }
>
>  I often prefer the latter and I see that the existing planner code uses this
>  style quite often too. I think the reason is that it allows for more complex
>  tests, while the "else-if style" requires all tests to take place inside the
>  "if ()" expression. However, if several (not necessarily tightly related)
>  tests become "compressed" this way, it's less obvious where to put
>  comments.  Indeed it seems that some comments got lost due to your changes.
>
> Your explanation looks reasonable,  thanks for that.  the changes also used
> some builtin function to avoid the one more level for-loop. like tlist_member.

Eventually I've reworked init_grouping_targets(), although not exactly as you
proposed. I hope it's easier to understand now.

> [1] https://www.postgresql.org/message-id/9726.1542577439%40sss.pgh.pa.us
> [2]
https://www.postgresql.org/message-id/flat/CAFjFpRdpeMTd8kYbM_x0769V-aEKst5Nkg3%2BcoG%3D8ki7s8Zqjw%40mail.gmail.com 

[3] https://www.postgresql.org/message-id/29111.1483984605@localhost
[4] https://www.postgresql.org/message-id/CAFjFpRfUKq3Asgtki1XctPkCN6YC4oA2vNWh66OgBo-H26ePWA%40mail.gmail.com

The next version is attached.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com


Attachment

Re: WIP: Aggregation push-down

From
Daniel Gustafsson
Date:
> On 19 May 2020, at 10:17, Antonin Houska <ah@cybertec.at> wrote:

> The next version is attached.

This version now fails to apply to HEAD, with what looks like a trivial
error in the expected test output.  Can you please submit a rebased version so
we can see it run in the patch tester CI?  I'm marking the entry as Waiting on
Author in the meantime.

cheers ./daniel


Re: WIP: Aggregation push-down

From
Laurenz Albe
Date:
On Thu, 2020-07-02 at 14:39 +0200, Daniel Gustafsson wrote:
> This version now fails to apply to HEAD, with what looks like a trivial
> error in the expected test output.  Can you please submit a rebased version so
> we can see it run in the patch tester CI?  I'm marking the entry as Waiting on
> Author in the meantime.

I have rebased the patch against current HEAD; it passes "make installcheck".

Yours,
Laurenz Albe

Attachment

RE: WIP: Aggregation push-down

From
"Fujii.Yuki@df.MitsubishiElectric.co.jp"
Date:
Hi everyone.

I develop PostgreSQL extensions, such as FDWs, in my work.
I'm interested in using PostgreSQL for OLAP.

I think that this patch is really useful for OLAP queries.
Furthermore, I think it would be even more useful if this patch worked on foreign tables.
Actually, I changed this patch a little and confirmed that my idea is true.
The following are the things I found and the differences between my prototype and this patch.
  1. Things I found
   I executed a query which contains a join of a postgres_fdw foreign table and a local table,
and an aggregation of the join result.
   In my setting, my prototype reduced this query's response time by 93%.
  2. Differences between my prototype and this patch
   (1) Push down aggregation on a foreign table if the FDW can push down partial aggregation
   (2) postgres_fdw pushes down some partial aggregations
I attached my prototype source code and the contents of my experiment.
I would like to resume development of this patch if there is some possibility of its
functionality being accepted. I contacted Mr. Houska about resuming development of this patch.
As a result, Mr. Houska advised me to ask on pgsql-hackers whether any reviewers / committers are
interested in working on the patch.
Is anyone interested in my work?
Sincerely yours.
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Attachment