Thread: Re: [HACKERS] WIP: Aggregation push-down
Antonin Houska <ah@cybertec.at> wrote:

> This is a new version of the patch I presented in [1].

Rebased.

cat .git/refs/heads/master
b9a3ef55b253d885081c2d0e9dc45802cab71c7b

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Antonin Houska <ah@cybertec.at> wrote:

> Antonin Houska <ah@cybertec.at> wrote:
>
> > This is a new version of the patch I presented in [1].
>
> Rebased.
>
> cat .git/refs/heads/master
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state is equal to the
output type. For other aggregates (like avg()) the remote nodes will have to
return the transient state value in an appropriate form (maybe bytea type)
which does not depend on the PG version.

shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
scans of a base relation's partitions being pushed to shard nodes; the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, a query can only reference one particular partition until the
"partition-wise join" [1] patch gets committed and merged with this patch.

One thing I'm not sure about is whether the patch should remove the
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete, or whether it should only be deprecated for now. I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Attachment
Hi Antonin,
To understand the feature you have proposed, I have tried to understand
your patch. Here are a few comments on it so far:
1.
+ if (aggref->aggvariadic ||
+ aggref->aggdirectargs || aggref->aggorder ||
+ aggref->aggdistinct || aggref->aggfilter)
I did not understand why you are not pushing the aggregate down in the
above cases. Can you please explain?
2. "make check" in contrib/postgres_fdw crashes.
SELECT COUNT(*) FROM ft1 t1;
! server closed the connection unexpectedly
! This probably means the server terminated abnormally
! before or while processing the request.
! connection to server was lost
From your given setup, if I write a query like:
EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
it crashes.
It seems a few checks are missing.
3. When pushing partial aggregation to the remote side, you use a schema
named "partial"; I did not understand that change. If I have an AVG()
aggregate, I end up getting an error saying
"ERROR: schema "partial" does not exist".
4. I am not sure about the code changes related to pushing a partial
aggregate to the remote side. Basically, you are pushing the whole aggregate
to the remote side, and its result is treated as partial on the
basis of aggtype = transtype. But I am not sure that is correct, unless
I am missing something here: the types may be the same, but the results may not be.
5. There are quite a lot of TODO comments in the patch set. Are you working
on those?
Thanks
On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
> Antonin Houska <ah@cybertec.at> wrote:
>
> > Antonin Houska <ah@cybertec.at> wrote:
> >
> > > This is a new version of the patch I presented in [1].
> >
> > Rebased.
> >
> > cat .git/refs/heads/master
> > b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
> This is another version of the patch.
>
> Besides other changes, it enables the aggregation push-down for postgres_fdw,
> although only for aggregates whose transient aggregation state is equal to the
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.
>
> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only reference one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with this
> patch.
>
> One thing I'm not sure about is whether the patch should remove
> GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> obsolete. Or should it only be deprecated so far? I understand that
> deprecation is important for "ordinary SQL users", but FdwRoutine is an
> interface for extension developers, so the policy might be different.
>
> [1] https://commitfest.postgresql.org/14/994/
>
> Any feedback is appreciated.
>
> --
> Antonin Houska
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de, http://www.cybertec.at
--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
> Antonin Houska <ah@cybertec.at> wrote:
>
>> Antonin Houska <ah@cybertec.at> wrote:
>>
>> > This is a new version of the patch I presented in [1].
>>
>> Rebased.
>>
>> cat .git/refs/heads/master
>> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
> This is another version of the patch.
>
> Besides other changes, it enables the aggregation push-down for postgres_fdw,
> although only for aggregates whose transient aggregation state is equal to the
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Having the transient aggregation state type the same as the output type
doesn't mean that we can feed the output of remote aggregation as-is to the
local finalization stage, or that finalization is not needed at all. I am not
sure why that test is being used to decide whether or not to push an
aggregate (I assume they are partial aggregates) to the foreign server.

postgres_fdw doesn't have a feature to fetch partial aggregate results from
the foreign server. Until we have that, I don't think this patch can push any
partial aggregates down to the foreign server.

> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only references one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with this my
> patch.

Right. Until then joins will not have children.

> One thing I'm not sure about is whether the patch should remove
> GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> obsolete. Or should it only be deprecated so far? I understand that
> deprecation is important for "ordinary SQL users", but FdwRoutine is an
> interface for extension developers, so the policy might be different.

I doubt that's correct. We can do that only when we know that all kinds of
aggregates/grouping can be pushed down the join tree, which looks impossible
to me. Apart from that, FdwRoutine handles all kinds of upper relations like
windowing, distinct and ordered, which are not pushed down by this set of
patches.

Here's a review of the patchset.

The patches compile cleanly when applied together. Testcase "limit" fails in
make check.

This is a pretty large patchset and the patches are divided by stages in the
planner rather than by functionality. IOW, no single patch provides full
functionality. Reviewing and probably committing such patches is not easy and
may take quite long. Instead, if the patches are divided by functionality,
i.e. each patch provides some functionality completely, it will be easy to
review and commit such patches, with each commit giving something useful. I
had suggested breaking the patches along that line at [1]. The patches also
refactor existing code. It's better to move such refactoring into a patch or
patches by itself.

The patches need more comments explaining various decisions, e.g.

     if (joinrel)
     {
         /* Create GatherPaths for any useful partial paths for rel */
-        generate_gather_paths(root, joinrel);
+        generate_gather_paths(root, joinrel, false);

         /* Find and save the cheapest paths for this joinrel */
         set_cheapest(joinrel);

For base relations, the patch calls generate_gather_paths() with grouped as
true and false, but for join relations it doesn't do that. There is no
comment explaining why. Or, in the following code

+    /*
+     * Do partial aggregation at base relation level if the relation is
+     * eligible for it.
+     */
+    if (rel->gpi != NULL)
+        create_grouped_path(root, rel, path, false, true, AGG_HASHED);

why not consider AGG_SORTED paths as well?

The patch maintains an extra rel target and two pathlists (a partial pathlist
and a non-partial pathlist) for grouping in RelOptInfo. These two extra
pathlists require a "grouped" argument to be passed to a number of functions.
Instead, it probably makes sense to create a RelOptInfo for
aggregation/grouping and set pathlist and partial_pathlist in that
RelOptInfo. That will also make a place for keeping grouped estimates.

For placeholders we have two functions, add_placeholders_to_base_rels() and
add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but
no corresponding function for joinrels. How do we push aggregates involving
two or more relations down to the corresponding joinrels?

Some minor assorted comments:

Avoid this useless hunk:

@@ -279,8 +301,8 @@ add_paths_to_joinrel(PlannerInfo *root,
      * joins, because there may be no other alternative.
      */
     if (enable_hashjoin || jointype == JOIN_FULL)
-        hash_inner_and_outer(root, joinrel, outerrel, innerrel,
-                             jointype, &extra);
+        hash_inner_and_outer(root, joinrel, outerrel, innerrel, jointype,
+                             &extra);

In the last "regression" patch, there are some code changes (mostly because
of a pg_indent run). Maybe you want to include those in the appropriate code
patches.

Some quick observations using two tables t1(a int, b int) and t2(a int, b
int), populated as

insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;

1. The patch pushes aggregates down the join in the following query

explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by t2.b;

but does not push the aggregates down the join in the following query

explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by t2.b;

In fact, it doesn't use the optimization for any OUTER join. I think the
reason for this behaviour is that the patch uses equivalence classes to
distribute the aggregates and grouping to base relations, and OUTER
equi-joins do not form equivalence classes. But I think it should be possible
to push the aggregates down the OUTER join by adding one row for NULL values
if the grouping is pushed to the inner side. I don't see much change for the
outer side. This also means that we have to choose some means other than
equivalence classes for propagating the aggregates.

2. The following query throws an error with these patches, but not without
them:

explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a = t2.a
group by t2.a, t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

[1] https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Aug 17, 2017 at 10:22 AM, Antonin Houska <ah@cybertec.at> wrote:
> Antonin Houska <ah@cybertec.at> wrote:
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Hm, that seems like an awful lot of work (a new version-agnostic
serialization format) for very little benefit (version-independent type
serialization for remote aggregate pushdown). How about forcing the versions
to be the same for the feature to be used?

merlin
Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:

> 1.
> + if (aggref->aggvariadic ||
> + aggref->aggdirectargs || aggref->aggorder ||
> + aggref->aggdistinct || aggref->aggfilter)
>
> I did not understand, why you are not pushing aggregate in above cases?
> Can you please explain?

Currently I'm trying to implement the infrastructure to propagate the results
of partial aggregation from a base relation or join to the root of the join
tree, and to use these as input for the final aggregation. Some special cases
are disabled because I haven't thought about them enough so far. Some of
these restrictions may be easy to relax, others may be hard. I'll get back to
them at some point.

> 2. "make check" in contrib/postgres_fdw crashes.
>
> SELECT COUNT(*) FROM ft1 t1;
> ! server closed the connection unexpectedly
> ! This probably means the server terminated abnormally
> ! before or while processing the request.
> ! connection to server was lost
>
> From your given setup, if I wrote a query like:
> EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
> it crashes.
>
> Seems like missing few checks.

Please consider this a temporary limitation.

> 3. In case of pushing partial aggregation on the remote side, you use schema
> named "partial", I did not get that change. If I have AVG() aggregate,
> then I end up getting an error saying
> "ERROR: schema "partial" does not exist".

Attached to this email is an extension that I hacked up quickly at some
point, for experimental purposes only. It's pretty raw, but may be useful if
you just want to play with it.

> 4. I am not sure about the code changes related to pushing partial
> aggregate on the remote side. Basically, you are pushing a whole aggregate
> on the remote side and result of that is treated as partial on the
> basis of aggtype = transtype. But I am not sure that is correct unless
> I miss something here. The type may be same but the result may not.

You're right. I mostly used sum() and count() in my examples, but these are
special cases. It should also be checked whether aggfinalfn exists or not.
I'll fix it, thanks.

> 5. There are lot many TODO comments in the patch-set, are you working
> on those?

Yes. I wasn't too eager to complete all the TODOs soon because reviews of the
patch may result in a major rework, and if large parts of the code have to be
thrown away, some or many TODOs will be gone as well.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Attachment
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote:
> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
> > although only for aggregates whose transient aggregation state is equal to the
> > output type. For other aggregates (like avg()) the remote nodes will have to
> > return the transient state value in an appropriate form (maybe bytea type),
> > which does not depend on PG version.
>
> Having transient aggregation state type same as output type doesn't
> mean that we can feed the output of remote aggregation as is to the
> finalization stage locally or finalization is not needed at all. I am
> not sure why is that test being used to decide whether or not to push
> an aggregate (I assume they are partial aggregates) to the foreign
> server.

I agree. This seems to be the same problem as reported in [2].

> postgres_fdw doesn't have a feature to fetch partial aggregate
> results from the foreign server. Till we do that, I don't think this
> patch can push any partial aggregates down to the foreign server.

Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
does not exist and the transient type can be transferred from the remote
server in textual form? (Of course, there's a question of whether such
behaviour is consistent from the user's perspective.)

> > One thing I'm not sure about is whether the patch should remove
> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> > obsolete. Or should it only be deprecated so far? I understand that
> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
> > interface for extension developers, so the policy might be different.
>
> I doubt if that's correct. We can do that only when we know that all
> kinds aggregates/grouping can be pushed down the join tree, which
> looks impossible to me. Apart from that that FdwRoutine handles all
> kinds of upper relations like windowing, distinct, ordered which are
> not pushed down by this set of patches.

Good point.

> Here's review of the patchset
> The patches compile cleanly when applied together.
>
> Testcase "limit" fails in make check.

If I remember correctly, this test generates a different plan because the
aggregate push-down gets involved. I tend to ignore this error until the
patch has its cost estimates refined. Then let's see if the test will pass or
if the expected output should be adjusted.

> This is a pretty large patchset and the patches are divided by stages in the
> planner rather than functionality. IOW, no single patch provides a full
> functionality. Reviewing and probably committing such patches is not easy and
> may take quite long. Instead, if the patches are divided by functionality, i.e.
> each patch provides some functionality completely, it will be easy to review
> and commit such patches with each commit giving something useful. I had
> suggested breaking patches on that line at [1]. The patches also refactor
> existing code. It's better to move such refactoring in a patch/es by itself.

I definitely saw commits whose purpose is preparation for something else, but
you're right that some splitting of the actual functionality wouldn't hurt.
I'll at least separate partial aggregation of base relations from that of
joins, and maybe some more preparation steps.

> The patches need more comments, explaining various decisions

o.k.

> The patch maintains an extra rel target, two pathlists, partial pathlist and
> non-partial pathlist for grouping in RelOptInfo. These two extra
> pathlists require "grouped" argument to be passed to a number of functions.
> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
> place for keeping grouped estimates.

If grouped paths require a separate RelOptInfo, why do the existing partial
paths not?

> For placeholders we have two function add_placeholders_to_base_rels() and
> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
> not have corresponding function for joinrels. How do we push aggregates
> involving two or more relations to the corresponding joinrels?

add_grouping_info_to_base_rels() calls prepare_rel_for_grouping(), which
actually adds the "grouping info". For a join, prepare_rel_for_grouping() is
called from build_join_rel().

While PlaceHolderVars need to be finalized before the planner starts to
create joins, the GroupedPathInfo has to fit a particular pair of joined
relations.

Perhaps create_grouping_info_... would be a better name, to indicate that the
thing we're adding does not exist yet. I'll think about it.

> Some minor assorted comments ...

o.k., will fix.

> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
> populated as
> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>
> 1. The patch pushes aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
> t2.b;
> but does not push the aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
> t2.b;
> In fact, it doesn't use the optimization for any OUTER join. I think the reason
> for this behaviour is that the patch uses equivalence classes to distribute the
> aggregates and grouping to base relations and OUTER equi-joins do not form
> equivalence classes. But I think it should be possible to push the aggregates
> down the OUTER join by adding one row for NULL values if the grouping is pushed
> to the inner side. I don't see much change for outer side. This also means that
> we have to choose some means other than equivalence class for propagating the
> aggregates.

The problem is that an aggregate pushed down to the nullable side of an outer
join receives different input values than the original aggregate at the top
level of the query. NULL values generated by the OJ make the difference. I
have no idea how to handle this problem. If the aggregation is performed on
the nullable side of the OJ, it can't predict which of the input values don't
match any value of the other side. Suggestions are appreciated.

> 2. Following query throws error with these patches, but not without the
> patches.
> explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a = t2.a
> group by t2.a, t1.a;
> ERROR: ORDER/GROUP BY expression not found in targetlist

I'll check this. Thanks.

> [1] https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAM2%2B6%3DW2J-iaQBgj-sdMERELQLUm5dvOQEWQ2ho%2BQ7KZgnonkQ%40mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On Fri, Sep 8, 2017 at 7:04 PM, Antonin Houska <ah@cybertec.at> wrote: > Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > >> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah@cybertec.at> wrote: >> > Antonin Houska <ah@cybertec.at> wrote: >> > >> >> Antonin Houska <ah@cybertec.at> wrote: >> >> >> >> > This is a new version of the patch I presented in [1]. >> >> >> >> Rebased. >> >> >> >> cat .git/refs/heads/master >> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b >> > >> > This is another version of the patch. >> > >> > Besides other changes, it enables the aggregation push-down for postgres_fdw, >> > although only for aggregates whose transient aggregation state is equal to the >> > output type. For other aggregates (like avg()) the remote nodes will have to >> > return the transient state value in an appropriate form (maybe bytea type), >> > which does not depend on PG version. >> >> Having transient aggregation state type same as output type doesn't >> mean that we can feed the output of remote aggregation as is to the >> finalization stage locally or finalization is not needed at all. I am >> not sure why is that test being used to decide whether or not to push >> an aggregate (I assume they are partial aggregates) to the foreign >> server. > > I agree. This seems to be the same problem as reported in [2]. > >> postgres_fdw doesn't have a feature to fetch partial aggregate >> results from the foreign server. Till we do that, I don't think this >> patch can push any partial aggregates down to the foreign server. > > Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn > does not exist and the transient type can be transfered from the remote server > in textual form? (Of course there's a question is if such behaviour is > consistent from user's perspective.) Yes, those functions will work. 
> >> > >> > One thing I'm not sure about is whether the patch should remove >> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes >> > obsolete. Or should it only be deprecated so far? I understand that >> > deprecation is important for "ordinary SQL users", but FdwRoutine is an >> > interface for extension developers, so the policy might be different. >> >> I doubt if that's correct. We can do that only when we know that all >> kinds aggregates/grouping can be pushed down the join tree, which >> looks impossible to me. Apart from that that FdwRoutine handles all >> kinds of upper relations like windowing, distinct, ordered which are >> not pushed down by this set of patches. > > Good point. > I think this is where Jeevan Chalke's partition-wise aggregation helps. It deals with the cases where your patch can not push the aggregates down to children of join. >> The patch maintains an extra rel target, two pathlists, partial pathlist and >> non-partial pathlist for grouping in RelOptInfo. These two extra >> pathlists require "grouped" argument to be passed to a number of functions. >> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping >> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a >> place for keeping grouped estimates. > > If grouped paths require a separate RelOptInfo, why the existing partial paths > do not? partial paths produce the same targetlist and the same relation that non-partial paths do. A RelOptInfo in planner represents a set of rows and all paths added to that RelOptInfo produce same whole result (parameterized paths produces the same result collectively). Grouping paths however are producing a different result and different targetlist, so may be it's better to have a separate RelOptInfo. > >> For placeholders we have two function add_placeholders_to_base_rels() and >> add_placeholders_to_joinrel(). 
We have add_grouping_info_to_base_rels(), but do >> not have corresponding function for joinrels. How do we push aggregates >> involving two or more relations to the corresponding joinrels? > > add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which > actually adds the "grouping info". For join, prepare_rel_for_grouping() is > called from build_join_rel(). Ok. > > While PlaceHolderVars need to be finalized before planner starts to create > joins, the GroupedPathInfo has to fit particular pair of joined relations. > > Perhaps create_grouping_info_... would be better, to indicate that the thing > we're adding does not exist yet. I'll think about it. > >> Some minor assorted comments ... > > o.k., will fix. > >> Some quick observations using two tables t1(a int, b int) and t2(a int, b int), >> populated as >> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0; >> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0; >> >> 1. The patch pushes aggregates down join in the following query >> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by >> t2.b; >> but does not push the aggregates down join in the following query >> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by >> t2.b; > >> In fact, it doesn't use the optimization for any OUTER join. I think the reason >> for this behaviour is that the patch uses equivalence classes to distribute the >> aggregates and grouping to base relations and OUTER equi-joins do not form >> equivalence classes. But I think it should be possible to push the aggregates >> down the OUTER join by adding one row for NULL values if the grouping is pushed >> to the inner side. I don't see much change for outer side. This also means that >> we have to choose some means other than equivalence class for propagating the >> aggregates. 
> The problem is that an aggregate pushed down to the nullable side of an
> outer join receives different input values than the original aggregate at
> the top level of the query. NULL values generated by the OJ make the
> difference. I have no idea how to handle this problem. If the aggregation
> is performed on the nullable side of the OJ, it can't predict which of the
> input values don't match any value of the other side. Suggestions are
> appreciated.

I haven't thought this through fully either, but I think it can be fixed by
using some kind of all-NULL row on the nullable side. If we are going to use
equivalence classes, we won't be able to extend that mechanism to OUTER
joins, so maybe you want to rethink using equivalence classes as the method
of propagating grouping information.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Antonin Houska <ah@cybertec.at> wrote:

> This is another version of the patch.
>
> shard.tgz demonstrates the typical postgres_fdw use case. One query shows
> base scans of a base relation's partitions being pushed to shard nodes, the
> other pushes down a join and performs aggregation of the join result on the
> remote node. Of course, the query can only reference one particular
> partition until the "partition-wise join" [1] patch gets committed and
> merged with my patch.

Since [1] is already there, the new version of shard.tgz shows what I
consider the typical use case. (I'm aware of the postgres_fdw regression
test failures; I'll try to fix them all in the next version.)

Besides that:

* A concept of "path unique keys" has been introduced. It helps to find out
  whether the final relation appears to generate a distinct set of grouping
  keys. If that happens, the final aggregation is replaced by a mere call of
  the aggfinalfn() function on each transient state value.

* FDW can sort rows by aggregate.

* enable_agg_pushdown GUC was added. The default value is false.

* I fixed errors reported during the previous CF.

* Added a few more regression tests.

I'm not about to add any other features now. Implementation of the missing
parts (see the TODO comments in the code) is the next step. But what I'd
appreciate most is feedback on the design. Thanks.

> [1] https://commitfest.postgresql.org/14/994/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
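The "path unique keys" item can be illustrated with a small Python sketch
(hypothetical data, not the patch's code; avg()'s transient aggregation
state is modeled as a (sum, count) pair): when the input is already unique
on the grouping key, there is nothing left to combine, so the Finalize
Aggregate degenerates to one final-function call per transient state.

```python
def agg_final_avg(state):
    """Final function for avg(): turn the (sum, count) transient state
    into the output value."""
    s, n = state
    return s / n

def finalize_unique(rows):
    """rows: (group_key, transient_state) pairs already unique on
    group_key, so no re-grouping/combine step is needed -- just one
    finalfn call per row, as the 'path unique keys' optimization does."""
    return {key: agg_final_avg(state) for key, state in rows}

# Two groups, each with a pre-computed partial state.
result = finalize_unique([(1, (10.0, 4)), (2, (9.0, 3))])
```

Here `finalize_unique` stands in for the plan node that replaces the final
aggregation; the names are illustrative only.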
Attachment
On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote:
> I'm not about to add any other features now. Implementation of the missing
> parts (see the TODO comments in the code) is the next step. But what I'd
> appreciate most is a feedback on the design. Thanks.

I am getting a conflict after applying patch 5, but this did not get any
reviews, so I have moved it to the next CF with "waiting on author" as
status.
--
Michael
Michael Paquier <michael.paquier@gmail.com> wrote: > On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote: > > I'm not about to add any other features now. Implementation of the missing > > parts (see the TODO comments in the code) is the next step. But what I'd > > appreciate most is a feedback on the design. Thanks. > > I am getting a conflict after applying patch 5 but this did not get > any reviews so moved to next CF with waiting on author as status. Attached is the next version. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Attachment
On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote: > Michael Paquier <michael.paquier@gmail.com> wrote: >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote: >> > I'm not about to add any other features now. Implementation of the missing >> > parts (see the TODO comments in the code) is the next step. But what I'd >> > appreciate most is a feedback on the design. Thanks. >> >> I am getting a conflict after applying patch 5 but this did not get >> any reviews so moved to next CF with waiting on author as status. > > Attached is the next version. I've been a bit confused for a while about what this patch is trying to do, so I spent some time today looking at it to try to figure it out. There's a lot I don't understand yet, but it seems like the general idea is to build, potentially for each relation in the join tree, not only the regular list of paths but also a list of "grouped" paths. If the pre-grouped path wins, then we can get a final path that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins -> Base Table Scan. In some cases the patch seems to replace that uppermost Finalize Aggregate with a Result node. translate_expression_to_rels() looks unsafe. Equivalence members are known to be equal under the semantics of the relevant operator class, but that doesn't mean that one can be freely substituted for another. For example: rhaas=# create table one (a numeric); CREATE TABLE rhaas=# create table two (a numeric); CREATE TABLE rhaas=# insert into one values ('0'); INSERT 0 1 rhaas=# insert into two values ('0.00'); INSERT 0 1 rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1; a | count ---+------- 0 | 1 (1 row) rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1; a | count ------+------- 0.00 | 1 (1 row) There are, admittedly, a large number of data types for which such a substitution would work just fine. 
numeric will not, citext will not, but many others are fine. Regrettably, we
have no framework in the system for identifying equality operators which
actually test identity versus some looser notion of equality. Cross-type
operators are a problem, too; if we have foo.x = bar.y in the query, one
might be int4 and the other int8, for example, but they can still belong to
the same equivalence class.

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent = p.i
GROUP BY p.i" you assume that it's OK to do a Partial HashAggregate over
c1.parent rather than p.i. This will be false if, say, c1.parent is of type
citext and p.i is of type text; values will get grouped together that
shouldn't be. It will also be false if the grouping expression is something
like GROUP BY length(p.i::text), because one value could be '0'::numeric and
the other '0.00'::numeric. I can't think of a reason why it would be false
if the grouping expressions are both simple Vars of the same underlying data
type, but I'm a little nervous that I might be wrong even about that case.
Maybe you've handled all of this somehow, but it's not obvious to me that it
has been considered.

Another thing I noticed is that the GroupedPathInfo looks a bit like a
stripped-down RelOptInfo, in that for example it has both a pathlist and a
partial_pathlist. I'm inclined to think that we should build new RelOptInfos
instead. As you have it, there are an awful lot of things that seem to know
about the grouped/ungrouped distinction, many of which are quite low-level
functions like add_path(), add_paths_to_append_rel(), try_nestloop_path(),
etc. I think that some of this would go away if you had separate RelOptInfos
instead of GroupedPathInfo.
Some compiler noise:

allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized
whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
        create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                             ^~~~~~~~~~~~~~
allpaths.c:2794:7: note: remove the 'if' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized
whenever '&&' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                                     ^~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
        create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                             ^~~~~~~~~~~~~~
allpaths.c:2794:11: note: remove the '&&' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                           ^~~~~~~~~~~~~~~~~~~
allpaths.c:2773:28: note: initialize the variable 'partial_target' to
silence this warning
        PathTarget *partial_target;
                                   ^
                                    = NULL

Core dump running the regression tests:

  * frame #0: 0x00007fffe633ad42 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fffe6428457 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fffe62a0420 libsystem_c.dylib`abort + 129
    frame #3: 0x000000010e3ae90e postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>, fileName=<unavailable>, lineNumber=<unavailable>) at assert.c:54 [opt]
    frame #4: 0x000000010e17c23b postgres`check_list_invariants(list=<unavailable>) at list.c:39 [opt]
    frame #5: 0x000000010e17cdff postgres`list_free [inlined] list_free_private(list=0x00007fe6a483be40, deep='\0') at list.c:1107 [opt]
    frame #6: 0x000000010e17cdfa postgres`list_free(list=0x00007fe6a483be40) at list.c:1135 [opt]
    frame #7: 0x000000010e1b365f
postgres`generate_bitmap_or_paths(root=0x00007fe6a68131c8, rel=0x00007fe6a6815310, clauses=<unavailable>, other_clauses=<unavailable>, agg_info=0x0000000000000000) at indxpath.c:1580 [opt] frame #8: 0x000000010e1b305c postgres`create_index_paths(root=0x00007fe6a68131c8, rel=<unavailable>, agg_info=0x0000000000000000) at indxpath.c:334 [opt] frame #9: 0x000000010e1a7cfa postgres`set_rel_pathlist [inlined] set_plain_rel_pathlist at allpaths.c:745 [opt] frame #10: 0x000000010e1a7bc1 postgres`set_rel_pathlist(root=0x00007fe6a68131c8, rel=0x00007fe6a6815310, rti=1, rte=0x00007fe6a6803270) at allpaths.c:454 [opt] frame #11: 0x000000010e1a4910 postgres`make_one_rel at allpaths.c:312 [opt] frame #12: 0x000000010e1a48cc postgres`make_one_rel(root=0x00007fe6a68131c8, joinlist=0x00007fe6a6815810) at allpaths.c:182 [opt] frame #13: 0x000000010e1cb806 postgres`query_planner(root=0x00007fe6a68131c8, tlist=<unavailable>, qp_callback=<unavailable>, qp_extra=0x00007fff51ca24d8) at planmain.c:268 [opt] frame #14: 0x000000010e1ce7b4 postgres`grouping_planner(root=<unavailable>, inheritance_update='\0', tuple_fraction=<unavailable>) at planner.c:1789 [opt] frame #15: 0x000000010e1ccb12 postgres`subquery_planner(glob=<unavailable>, parse=0x00007fe6a6803158, parent_root=<unavailable>, hasRecursion=<unavailable>, tuple_fraction=0) at planner.c:901 [opt] frame #16: 0x000000010e1cba3e postgres`standard_planner(parse=0x00007fe6a6803158, cursorOptions=256, boundParams=0x0000000000000000) at planner.c:364 [opt] frame #17: 0x000000010e291b3b postgres`pg_plan_query(querytree=0x00007fe6a6803158, cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:807 [opt] frame #18: 0x000000010e291c6e postgres`pg_plan_queries(querytrees=<unavailable>, cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:873 [opt] frame #19: 0x000000010e296335 postgres`exec_simple_query(query_string="SELECT p1.oid, p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n (p1.typlen <= 0 AND 
p1.typlen != -1 AND p1.typlen != -2) OR\n (p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n NOT p1.typisdefined OR\n (p1.typalign not in ('c', 's', 'i', 'd')) OR\n (p1.typstorage not in ('p', 'x', 'e', 'm'));") at postgres.c:1048 [opt] frame #20: 0x000000010e295099 postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname="regression", username=<unavailable>) at postgres.c:0 [opt] frame #21: 0x000000010e20a910 postgres`PostmasterMain [inlined] BackendRun at postmaster.c:4412 [opt] frame #22: 0x000000010e20a741 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:4084 [opt] frame #23: 0x000000010e20a3bd postgres`PostmasterMain at postmaster.c:1757 [opt] frame #24: 0x000000010e2098ff postgres`PostmasterMain(argc=1516050552, argv=<unavailable>) at postmaster.c:1365 [opt] frame #25: 0x000000010e177537 postgres`main(argc=<unavailable>, argv=<unavailable>) at main.c:228 [opt] frame #26: 0x00007fffe620c235 libdyld.dylib`start + 1 (lldb) p debug_query_string (const char *) $0 = 0x00007fe6a401e518 "SELECT p1.oid, p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n (p1.typlen <= 0 AND p1.typlen != -1 AND p1.typlen != -2) OR\n (p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n NOT p1.typisdefined OR\n (p1.typalign not in ('c', 's', 'i', 'd')) OR\n (p1.typstorage not in ('p', 'x', 'e', 'm'));" -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
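The two-phase plan shape described at the top of this mail (Finalize
Aggregate -> Joins -> Partial Aggregate -> Base Table Scan) can be sketched
in Python. The tables, column names and the (sum, count) representation of
avg()'s transient state below are illustrative only, not taken from the
patch:

```python
from collections import defaultdict

# Hypothetical tables for:
#   SELECT p.i, avg(c.v) FROM parent p JOIN child c ON c.parent = p.i
#   GROUP BY p.i
parent = [{"i": 1}, {"i": 2}]
child = [{"parent": 1, "v": 10}, {"parent": 1, "v": 20}, {"parent": 2, "v": 30}]

# Partial Aggregate below the join: one (sum, count) transient state per
# value of the pushed-down grouping key c.parent.
partial = defaultdict(lambda: [0, 0])
for row in child:
    state = partial[row["parent"]]
    state[0] += row["v"]   # running sum
    state[1] += 1          # running count

# Join: the pre-grouped child rows (carrying transient states) meet parent.
joined = [(p["i"], partial[p["i"]]) for p in parent if p["i"] in partial]

# Finalize Aggregate above the join: combine states per group, then apply
# the final function.
combined = defaultdict(lambda: [0, 0])
for key, (s, n) in joined:
    combined[key][0] += s
    combined[key][1] += n
result = {key: s / n for key, (s, n) in combined.items()}
```

The point of the sketch is only the dataflow: transient states survive the
join and are merged once at the top, instead of aggregating the full join
output.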
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote: > > Michael Paquier <michael.paquier@gmail.com> wrote: > >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote: > >> > I'm not about to add any other features now. Implementation of the missing > >> > parts (see the TODO comments in the code) is the next step. But what I'd > >> > appreciate most is a feedback on the design. Thanks. > >> > >> I am getting a conflict after applying patch 5 but this did not get > >> any reviews so moved to next CF with waiting on author as status. > > > > Attached is the next version. > > I've been a bit confused for a while about what this patch is trying > to do, so I spent some time today looking at it to try to figure it > out. Thanks. > There's a lot I don't understand yet, but it seems like the general idea is > to build, potentially for each relation in the join tree, not only the > regular list of paths but also a list of "grouped" paths. If the > pre-grouped path wins, then we can get a final path that looks like Finalize > Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins -> > Base Table Scan. Yes. The regression test output shows some more plans. > In some cases the patch seems to replace that uppermost Finalize Aggregate > with a Result node. This is a feature implemented in 09_avoid_agg_finalization.diff. You can ignore it for now, it's of lower priority than the preceding parts and needs more work to be complete. > translate_expression_to_rels() looks unsafe. Equivalence members are > known to be equal under the semantics of the relevant operator class, > but that doesn't mean that one can be freely substituted for another. 
> For example:
>
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+-------
>  0 |     1
> (1 row)
>
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> ------+-------
>  0.00 |     1
> (1 row)
>
> There are, admittedly, a large number of data types for which such a
> substitution would work just fine. numeric will not, citext will not, but
> many others are fine. Regrettably, we have no framework in the system for
> identifying equality operators which actually test identity versus some
> looser notion of equality. Cross-type operators are a problem, too; if we
> have foo.x = bar.y in the query, one might be int4 and the other int8, for
> example, but they can still belong to the same equivalence class.
>
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent = p.i
> GROUP BY p.i" you assume that it's OK to do a Partial HashAggregate over
> c1.parent rather than p.i. This will be false if, say, c1.parent is of
> type citext and p.i is of type text; values will get grouped together that
> shouldn't be.

I'd really missed this problem, thanks for bringing it up!

Your considerations made me think of the specifics of types like numeric or
citext. Although your example does not demonstrate what I consider the core
issue, I believe we're eventually thinking of the same thing. Consider this
example query (using the tables you defined upthread):

SELECT one.a
FROM one, two
WHERE one.a = two.a AND scale(two.a) > 1
GROUP BY 1

If we first try to aggregate "two", then evaluate the WHERE clause and then
finalize the aggregation, the partial aggregation of "two" can put various
binary values of the "a" column (0, 0.0, etc.)
into the same group, so the scale of the output (of the partial aggregation)
will be undefined. Thus the aggregation push-down will change the input of
the WHERE clause.

So one problem is that the grouping expression can be inappropriate for
partial aggregation even if there's no type change during the translation.
What I consider typical for this case is that the equality operator used to
identify groups (SortGroupClause.eqop) can return true even if the binary
(stored) values of the inputs differ. The partial aggregation could only
take place if we had a special equality operator which distinguishes the
*binary* values (I don't know yet how to store this operator in the catalog
--- in the pg_opclass records for the hash opclasses?).

On the other hand, if the grouping expression is not a plain Var, there's no
type change during the translation, and the expression output type is not an
"identity-unsafe type" (numeric, citext, etc.), then input vars of an
"identity-unsafe type" should not prevent us from using the expression for
partial aggregation: in such a case the grouping keys are computed in the
same way they would be w/o the partial aggregation.

The other problem is which type changes are legal. We should not allow any
type-specific information (such as scale or precision of the numeric type)
to be lost or become ambiguous. In this example

> It will also be false if the grouping expression is something like GROUP
> BY length(p.i::text), because one value could be '0'::numeric and the
> other '0.00'::numeric.

you show that not only numeric -> int is a problem but also int -> numeric.
This is the tricky part. One of my ideas is to check whether the source and
destination types are binary coercible (i.e. pg_cast(castfunc) = 0), but
this might be a misuse of binary coercibility. Another idea is to allow only
such changes where the destination type is in the same operator class as the
source, and explicitly enumerate the "safe opclasses".
But that would mean that a user cannot define new opclasses within which the
translation is possible --- not sure this is a serious issue. Perhaps the
most consistent solution is not to perform the translation if any input
variable of the grouping expression would change its data type.

> Another thing I noticed is that the GroupedPathInfo looks a bit like a
> stripped-down RelOptInfo, in that for example it has both a pathlist and a
> partial_pathlist. I'm inclined to think that we should build new
> RelOptInfos instead. As you have it, there are an awful lot of things that
> seem to know about the grouped/ungrouped distinction, many of which are
> quite low-level functions like add_path(), add_paths_to_append_rel(),
> try_nestloop_path(), etc. I think that some of this would go away if you
> had separate RelOptInfos instead of GroupedPathInfo.

This was proposed by Ashutosh Bapat upthread. I actually did this in an
early (not published) prototype of the patch and considered it too invasive
for standard_join_search() and its subroutines. IIRC an array like
simple_rel_array had to be introduced for the grouped relations, in which
the meaning of NULL was twofold: either the relation is not a base relation,
or it is a base relation but no grouped relation exists for it. Also I
thought there would be too much duplicate data if the grouped relation had a
separate RelOptInfo. Now that I'm done with one approach I can consider the
original approach again.

> Some compiler noise:

> Core dump running the regression tests:

I'll fix these, thanks.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
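To make the scale() hazard above concrete, here is a small sketch using
Python's standard decimal module as a stand-in for PostgreSQL's numeric
(illustrative only; PostgreSQL's internal numeric representation differs):
two values can be equal under the type's equality operator, and therefore
fall into the same group during partial aggregation, while carrying
different scale, so any scale-sensitive expression evaluated above the
partial aggregation sees an unpredictable representative.

```python
from decimal import Decimal

a, b = Decimal("0"), Decimal("0.00")

def scale(d):
    # Analogue of numeric's scale(): digits after the decimal point.
    return -d.as_tuple().exponent

equal_by_operator = (a == b)           # same group under '='
same_scale = (scale(a) == scale(b))    # but scale differs: 0 vs 2
```

So if the partial aggregation keeps only one of `a` and `b` as the group
key, a later `scale(...) > 1` qual can change its answer depending on which
representative survived.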
On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <ah@cybertec.at> wrote: > So one problem is that the grouping expression can be inappropriate for > partial aggregation even if there's no type change during the > translation. What I consider typical for this case is that the equality > operator used to identify groups (SortGroupClause.eqop) can return true even > if binary (stored) values of the inputs differ. The partial aggregation could > only take place if we had a special equality operator which distinguishes the > *binary* values (I don't know yet how to store this operator the catalog --- > in pg_opclass recors for the hash opclasses?).. We don't have an operator that tests for binary equality, but it's certainly testable from C; see datumIsEqual. I'm not sure how much this really helps, though. I think it would be safe to build an initial set of groups based on datumIsEqual comparisons and then arrange to later merge some of the groups. But that's not something we ever do in the executor today, so it might require quite a lot of hacking. Also, it seems like it might really suck in some cases. For instance, consider something like SELECT scale(one.a), sum(two.a) FROM one, two WHERE one.a = two.a GROUP BY 1. Doing a Partial Aggregate on two.a using datumIsEqual semantics, joining to a, and then doing a Finalize Aggregate looks legal, but the Partial Aggregate may produce a tremendous number of groups compared to the Finalize Aggregate. In other words, this technique wouldn't merge any groups that shouldn't be merged, but it might fail to merge groups that really need to be merged to get good performance. > One of my ideas is to check whether the source and destination types are > binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of > the binary coercibility. Yeah, binary coercibility has nothing to do with this; that tells you whether the two types are the same on disk, not whether they have the same equality semantics. 
For instance, I think text and citext are binary coercible, but their equality semantics are not the same. > Another idea is to allow only such changes that the > destination type is in the same operator class as the source, and explicitly > enumerate the "safe opclasses". But that would mean that user cannot define > new opclasses within which the translation is possible --- not sure this is a > serious issue. Enumerating specific opclasses in the source code is a non-starter -- Tom Lane would roll over in his grave if he weren't still alive. What we could perhaps consider doing is adding some mechanism for an opclass or opfamily to say whether its equality semantics happen to be exactly datumIsEqual() semantics. That's a little grotty because it leaves data types for which that isn't true out in the cold, but on the other hand it seems like it would be useful as a way of optimizing a bunch of things other than this. Maybe it could also include a way to specify that the comparator happens to have the semantics as C's built-in < and > operators, which seems like a case that might also lend itself to some optimizations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
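The "build initial groups on binary equality, merge later" idea can be
sketched in Python, with a Decimal's raw (sign, digits, exponent) tuple
standing in for a datumIsEqual-style comparison (an illustration of the
technique, not proposed executor code): the partial step never merges
anything it shouldn't, but it can produce many more groups than the
finalize step, which is exactly the performance concern above.

```python
from decimal import Decimal
from collections import defaultdict

values = [Decimal("0"), Decimal("0.00"), Decimal("0.0"), Decimal("1")]

# Partial step: group by the exact representation (binary equality),
# keeping a count(*) transient state per group.
partial = defaultdict(int)
for v in values:
    partial[v.as_tuple()] += 1

# Finalize step: merge partial groups whose keys are equal under the
# operator's looser '=' semantics (Decimal('0') == Decimal('0.00')).
final = defaultdict(int)
for key, n in partial.items():
    final[Decimal(key)] += n

n_partial_groups = len(partial)   # one group per distinct representation
n_final_groups = len(final)       # the three zeros collapse into one group
```

No rows end up in a wrong group, and the counts are preserved across the
merge, but the partial step here holds twice as many groups as the final
one; with many distinct scales the blow-up could be much worse.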
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <ah@cybertec.at> wrote: > > So one problem is that the grouping expression can be inappropriate for > > partial aggregation even if there's no type change during the > > translation. What I consider typical for this case is that the equality > > operator used to identify groups (SortGroupClause.eqop) can return true even > > if binary (stored) values of the inputs differ. The partial aggregation could > > only take place if we had a special equality operator which distinguishes the > > *binary* values (I don't know yet how to store this operator the catalog --- > > in pg_opclass recors for the hash opclasses?).. > > We don't have an operator that tests for binary equality, but it's > certainly testable from C; see datumIsEqual. I'm not sure how much > this really helps, though. I think it would be safe to build an > initial set of groups based on datumIsEqual comparisons and then > arrange to later merge some of the groups. But that's not something > we ever do in the executor today, so it might require quite a lot of > hacking. Also, it seems like it might really suck in some cases. For > instance, consider something like SELECT scale(one.a), sum(two.a) FROM > one, two WHERE one.a = two.a GROUP BY 1. Doing a Partial Aggregate on > two.a using datumIsEqual semantics, joining to a, and then doing a > Finalize Aggregate looks legal, but the Partial Aggregate may produce > a tremendous number of groups compared to the Finalize Aggregate. In > other words, this technique wouldn't merge any groups that shouldn't > be merged, but it might fail to merge groups that really need to be > merged to get good performance. I don't insist on doing Partial Aggregate in any case. If we wanted to group by the binary value, we'd probably have to enhance statistics accordingly. The important thing is to recognize the special case like this. 
Rejection of the Partial Aggregate would be the default response.

>> Another idea is to allow only such changes that the destination type is
>> in the same operator class as the source, and explicitly enumerate the
>> "safe opclasses". But that would mean that user cannot define new
>> opclasses within which the translation is possible --- not sure this is a
>> serious issue.
>
> Enumerating specific opclasses in the source code is a non-starter --
> Tom Lane would roll over in his grave if he weren't still alive. What we
> could perhaps consider doing is adding some mechanism for an opclass or
> opfamily to say whether its equality semantics happen to be exactly
> datumIsEqual() semantics. That's a little grotty because it leaves data
> types for which that isn't true out in the cold, but on the other hand it
> seems like it would be useful as a way of optimizing a bunch of things
> other than this. Maybe it could also include a way to specify that the
> comparator happens to have the same semantics as C's built-in < and >
> operators, which seems like a case that might also lend itself to some
> optimizations.

I'm thinking of a variant of this: implement a universal function that tests
the binary values for equality (besides the actual arguments, the caller
would have to pass info like typlen, typalign and typbyval for each
argument, and cache these for repeated calls) and set pg_operator(oprcode)
to 0 wherever this function is sufficient. Thus the problematic cases like
numeric, citext, etc. would be detected by their non-zero oprcode.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
On 01/29/18 03:32, Antonin Houska wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >>> only take place if we had a special equality operator which distinguishes the >>> *binary* values (I don't know yet how to store this operator the catalog --- ... >> We don't have an operator that tests for binary equality, but it's >> certainly testable from C; see datumIsEqual. Disclaimer: I haven't been following the whole thread closely. But could there be some way to exploit the composite-type *= operator? https://www.postgresql.org/docs/current/static/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON -Chap
On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <ah@cybertec.at> wrote: > I think of a variant of this: implement an universal function that tests the > binary values for equality (besides the actual arguments, caller would have to > pass info like typlen, typalign, typbyval for each argument, and cache these > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is > sufficient. Thus the problematic cases like numeric, citext, etc. would be > detected by their non-zero oprcode. I don't think that's a good option, because we don't want int4eq to have to go through a code path that has branches to support varlenas and cstrings. Andres is busy trying to speed up expression evaluation by removing unnecessary code branches; adding new ones would be undoing that valuable work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <ah@cybertec.at> wrote: > > I think of a variant of this: implement an universal function that tests the > > binary values for equality (besides the actual arguments, caller would have to > > pass info like typlen, typalign, typbyval for each argument, and cache these > > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is > > sufficient. Thus the problematic cases like numeric, citext, etc. would be > > detected by their non-zero oprcode. > > I don't think that's a good option, because we don't want int4eq to > have to go through a code path that has branches to support varlenas > and cstrings. Andres is busy trying to speed up expression evaluation > by removing unnecessary code branches; adding new ones would be > undoing that valuable work. I spent some more time thinking about this. What about adding a new strategy number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy would point to the same operator, but types like numeric would naturally have them different. Thus the pushed-down partial aggregation can only use the HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the initial version (until we have useful statistics for the binary values) we can avoid the aggregation push-down if the grouping expression output type has the two strategies implemented using different functions because, as you noted upthread, grouping based on binary equality can result in excessive number of groups. One open question is whether the binary equality operator needs a separate operator class or not. If an opclass cares only about the binary equality, its hash function(s) can be a lot simpler. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska <ah@cybertec.at> wrote: > I spent some more time thinking about this. What about adding a new strategy > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy > would point to the same operator, but types like numeric would naturally have > them different. > > Thus the pushed-down partial aggregation can only use the > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the > initial version (until we have useful statistics for the binary values) we can > avoid the aggregation push-down if the grouping expression output type has the > two strategies implemented using different functions because, as you noted > upthread, grouping based on binary equality can result in excessive number of > groups. > > One open question is whether the binary equality operator needs a separate > operator class or not. If an opclass cares only about the binary equality, its > hash function(s) can be a lot simpler. Hmm. How about instead adding another regproc field to pg_type which stores the OID of a function that tests binary equality for that datatype? If that happens to be equal to the OID you got from the opclass, then you're all set. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <ah@cybertec.at> wrote: > > Michael Paquier <michael.paquier@gmail.com> wrote: > >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <ah@cybertec.at> wrote: > >> > I'm not about to add any other features now. Implementation of the missing > >> > parts (see the TODO comments in the code) is the next step. But what I'd > >> > appreciate most is a feedback on the design. Thanks. > >> > >> I am getting a conflict after applying patch 5 but this did not get > >> any reviews so moved to next CF with waiting on author as status. > > > > Attached is the next version. > translate_expression_to_rels() looks unsafe. Equivalence members are > known to be equal under the semantics of the relevant operator class, > but that doesn't mean that one can be freely substituted for another. > For example: > > rhaas=# create table one (a numeric); > CREATE TABLE > rhaas=# create table two (a numeric); > CREATE TABLE > rhaas=# insert into one values ('0'); > INSERT 0 1 > rhaas=# insert into two values ('0.00'); > INSERT 0 1 > rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1; > a | count > ---+------- > 0 | 1 > (1 row) > > rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1; > a | count > ------+------- > 0.00 | 1 > (1 row) > > There are, admittedly, a large number of data types for which such a > substitution would work just fine. numeric will not, citext will not, > but many others are fine. Regrettably, we have no framework in the > system for identifying equality operators which actually test identity > versus some looser notion of equality. Cross-type operators are a > problem, too; if we have foo.x = bar.y in the query, one might be int4 > and the other int8, for example, but they can still belong to the same > equivalence class. 
> > Concretely, in your test query "SELECT p.i, avg(c1.v) FROM > agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent = > p.i GROUP BY p.i" you assume that it's OK to do a Partial > HashAggregate over c1.parent rather than p.i. This will be false if, > say, c1.parent is of type citext and p.i is of type text; values will > get grouped together that shouldn't be. It will also be false if the > grouping expression is something like GROUP BY length(p.i::text), > because one value could be '0'::numeric and the other '0.00'::numeric. > I can't think of a reason why it would be false if the grouping > expressions are both simple Vars of the same underlying data type, but > I'm a little nervous that I might be wrong even about that case. > Maybe you've handled all of this somehow, but it's not obvious to me > that it has been considered. These problems are still being investigated, see [1]. > Another thing I noticed is that the GroupedPathInfo looks a bit like a > stripped-down RelOptInfo, in that for example it has both a pathlist > and a partial_pathlist. I'm inclined to think that we should build new > RelOptInfos instead. As you have it, there are an awful lot of things > that seem to know about the grouped/ungrouped distinction, many of > which are quite low-level functions like add_path(), > add_paths_to_append_rel(), try_nestloop_path(), etc. I think that > some of this would go away if you had separate RelOptInfos instead of > GroupedPathInfo. I've reworked the patch so that a separate RelOptInfo is used for the grouped relation. The attached patch is only the core part. Neither postgres_fdw changes nor the part that tries to avoid the final aggregation is included here. I'll add these once the patch no longer seems to need another major rework. 
[1] https://www.postgresql.org/message-id/CA+Tgmoa5Pp-DBJg=W8Xj8Czf-32PfxPgxwFPkA6qN2w_wPX8bg@mail.gmail.com -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska <ah@cybertec.at> wrote: > > I spent some more time thinking about this. What about adding a new strategy > > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For > > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy > > would point to the same operator, but types like numeric would naturally have > > them different. > > > > Thus the pushed-down partial aggregation can only use the > > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the > > initial version (until we have useful statistics for the binary values) we can > > avoid the aggregation push-down if the grouping expression output type has the > > two strategies implemented using different functions because, as you noted > > upthread, grouping based on binary equality can result in excessive number of > > groups. > > > > One open question is whether the binary equality operator needs a separate > > operator class or not. If an opclass cares only about the binary equality, its > > hash function(s) can be a lot simpler. > > Hmm. How about instead adding another regproc field to pg_type which > stores the OID of a function that tests binary equality for that > datatype? If that happens to be equal to the OID you got from the > opclass, then you're all set. I suppose you mean pg_operator, not pg_type. What I don't like about this is that the new field would only be useful for a very small fraction of operators. On the other hand, the drawback of additional operator classes is that we'd have to modify almost all the existing operator classes for the hash AM. (The absence of the new strategy number in an operator class cannot mean that the existing equality operator can be used to compare binary values too, because then we couldn't guarantee correct behavior of already existing user-defined operator classes.) 
-- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > I've reworked the patch so that separate RelOptInfo is used for grouped > relation. The attached patch is only the core part. Neither postgres_fdw > changes nor the part that tries to avoid the final aggregation is included > here. I'll add these when the patch does not seem to need another major rework. This is the next version. After having posted a few notes to the [1] thread, I'm returning to the original one so it can be referenced from my entry in the CF application. As proposed in the other thread, it uses only "simple aggregation". The 2-stage aggregation, which gives more power to the feature and which also enables parallel processing for it, will be coded in a separate patch. I eventually decided to abandon the Var substitution proposed by Heikki in [2]. It's not necessary if we accept the restriction that the feature accepts only a simple column reference as the grouping expression so far. [1] https://www.postgresql.org/message-id/c165b72e-8dbb-2a24-291f-113aeb67b76a%40iki.fi [2] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b%40iki.fi -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> writes: > [ agg_pushdown_v8.tgz ] I spent a few hours looking at this today. It seems to me that at no point has there been a clear explanation of what the patch is trying to accomplish, so let me see whether I've got it straight or not. (I suggest that something like this ought to be included in optimizer/README; the patch's lack of internal documentation is a serious deficiency.) -- Conceptually, we'd like to be able to push aggregation down below joins when that yields a win, which it could do by reducing the number of rows that have to be processed at the join stage. Thus consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z; We can't simply aggregate during the scan of "a", because some values of "x" might not appear at all in the input of the naively-computed aggregate (if their rows have no join partner in "b"), or might appear multiple times (if their rows have multiple join partners). So at first glance, aggregating before the join is impossible. The key insight of the patch is that we can make some progress by considering only grouped aggregation: if we can group "a" into sets of rows that must all have the same join partners, then we can do preliminary aggregation within each such group, and take care of the number-of-repetitions problem when we join. If the groups are large then this reduces the number of rows processed by the join, at the cost that we might spend time computing the aggregate for some row sets that will just be discarded by the join. So now we consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; and ask what properties the grouping column(s) "?" must have to make this work. I believe we can say that it's OK to push down through a join if its join clauses are all of the form "a.y = b.z", where either a.y or b.z is listed as a GROUP BY column *and* the join operator is equality for the btree opfamily specified by the SortGroupClause. 
(Note: actually, SortGroupClause per se mentions an equality operator, although I think the planner quickly reduces that to an opfamily specification.) The concerns Robert had about equality semantics are certainly vital, but I believe that the logic goes through correctly as long as the grouping equality operator and the join equality operator belong to the same opfamily. Case 1: the GROUP BY column is a.y. Two rows of "a" whose y values are equal per the grouping operator must join to exactly the same set of "b" rows, else transitivity is failing. Case 2: the GROUP BY column is b.z. It still works, because the set of "a" rows that are equal to a given z value is well-defined, and those rows must join to exactly the "b" rows whose z entries are equal to the given value, else transitivity is failing. Robert postulated cases like "citext = text", but that doesn't apply here because no cross-type citext = text operator could be part of a well-behaved opfamily. What we'd actually be looking at is either "citextvar::text texteq textvar" or "citextvar citexteq textvar::citext", and the casts prevent these expressions from matching GROUP BY entries that have the wrong semantics. However: what we have proven here is only that we can aggregate across a set of rows that must share the same join partners. We still have to be able to handle the case that the rowset has more than one join partner, which AFAICS means that we must use partial aggregation and then apply an "aggmultifn" (or else apply the aggcombinefn N times). We can avoid that and use plain aggregation when we can prove the "b" side of the join is unique, so that no sets of rows will have to be merged post-join; but ISTM that that reduces the set of use cases to be too small to be worth such a complex patch. So I'm really doubtful that we should proceed forward with only that case available. 
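The number-of-repetitions handling can be sketched concretely. The following is a Python model of the idea, not PostgreSQL code; sum() stands in for an arbitrary aggregate whose combine step is simple addition, so "apply the aggcombinefn N times" reduces to adding the partial state once per join partner:

```python
from collections import defaultdict

# Model of: SELECT sum(a.x) FROM a, b WHERE a.y = b.z GROUP BY a.y
a_rows = [(1, 10), (1, 20), (2, 5)]   # (y, x) pairs
b_rows = [1, 1, 2]                    # z values: y=1 has two join partners

# Naive plan: join first, aggregate afterwards.
naive = defaultdict(int)
for y, x in a_rows:
    for z in b_rows:
        if y == z:
            naive[y] += x

# Pushed-down plan: partially aggregate "a" per grouping key...
partial = defaultdict(int)
for y, x in a_rows:
    partial[y] += x                   # one partial state per group of "a"

# ...then let the join repeat each partial state once per join partner
# (the "apply the aggcombinefn N times" fallback for a missing aggmultifn).
pushed = defaultdict(int)
for z in b_rows:
    if z in partial:
        pushed[z] += partial[z]

assert naive == pushed                # both plans yield {1: 60, 2: 5}
```

The join here processes two partially aggregated rows instead of three base rows; with larger groups that reduction is what makes the push-down worthwhile.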
Also, Tomas complained in the earlier thread that he didn't think grouping on the join column was a very common use-case in the first place. I share that concern, but I think we could extend the logic to the case that Tomas posited as being useful: SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; where the join column b.id is unique. If we group on a.y (using semantics compatible with the join operator and the uniqueness constraint), then all "a" rows in a given group will join to exactly one "b" row that necessarily has exactly one grouping value, so this group can safely be aggregated together. We might need to combine it post-join with other "b" rows that have equal "z" values, but we can do that as long as we're okay with partial aggregation. I think this example shows why the idea is far more powerful with partial aggregation than without. -- In short, then, I don't have much use for the patch as presented in this thread, without "aggmultifn". That might be OK as the first phase in a multi-step patch, but not as a production feature. I also have no use for the stuff depending on bitwise equality rather than the user-visible operators; that's just a hack, and an ugly one. As far as the patch details go, I've not looked through it in great detail, but it appears to me that you are still trying to cram the same stuff into base relations as before; shoving it into a subsidiary struct doesn't improve that. Multiple people have said that's the wrong thing, and I agree. Conceptually it's a disaster: a single RelOptInfo should represent one well-defined set of result rows, not more than one. This approach also cannot be extended to handle doing aggregations partway up the join tree, which is at least theoretically interesting (though I'm fine with not tackling it right away). I think the right representation is to create UPPERREL_GROUP_AGG RelOptInfos whose relids show the set of baserels joined before aggregating. 
Currently there's only one of those and its relids is equal to all_baserels (or should be, anyway). This patch would add instances of UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might put in the ability to consider aggregation across other relid subsets, and in any case we'd run join planning on those level-one rels and create new UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps leading up to the final join. The correct indexing structure for this collection of RelOptInfos is certainly not simple_rel_array; it should look more like the way we index the various regular join rels that we consider. (Maybe an appropriate preliminary patch is to refactor the support code associated with join_rel_list + join_rel_hash so that we can treat those fields as one instance of a structure we'll replicate later.) Note that I'm envisioning that we'd basically run join planning a second time on this tree of join rels, rather than try to merge it with the primary join planning. Since the numbers of rows to be processed will be completely different in this join tree, merging it with the standard one seems hopeless. BTW, I noticed some noise in the v8 patch: what's it doing touching src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Antonin Houska <ah@cybertec.at> writes: > > [ agg_pushdown_v8.tgz ] > > I spent a few hours looking at this today. Thanks! > It seems to me that at no point has there been a clear explanation of what > the patch is trying to accomplish, so let me see whether I've got it > straight or not. (I suggest that something like this ought to be included > in optimizer/README; the patch's lack of internal documentation is a serious > deficiency.) An earlier version of the patch did update optimizer/README, but I forgot to merge that into the current one. I'll fix that in the next version. (Some more documentation will be added as well, especially header comments for some functions.) > Conceptually, we'd like to be able to push aggregation down below joins > when that yields a win, which it could do by reducing the number of rows > that have to be processed at the join stage. Thus consider > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z; > > We can't simply aggregate during the scan of "a", because some values of > "x" might not appear at all in the input of the naively-computed aggregate > (if their rows have no join partner in "b"), or might appear multiple > times (if their rows have multiple join partners). So at first glance, > aggregating before the join is impossible. The key insight of the patch > is that we can make some progress by considering only grouped aggregation: Yes, the patch does not handle queries with no GROUP BY clause, primarily because I don't know how the grouped "a" table in this example could emit the "a.y" column without using it as a grouping key. > if we can group "a" into sets of rows that must all have the same join > partners, then we can do preliminary aggregation within each such group, > and take care of the number-of-repetitions problem when we join. 
> If the > groups are large then this reduces the number of rows processed by the > join, at the cost that we might spend time computing the aggregate for > some row sets that will just be discarded by the join. So now we consider > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; > > and ask what properties the grouping column(s) "?" must have to make this > work. I believe we can say that it's OK to push down through a join if > its join clauses are all of the form "a.y = b.z", where either a.y or b.z > is listed as a GROUP BY column *and* the join operator is equality for > the btree opfamily specified by the SortGroupClause. (Note: actually, > SortGroupClause per se mentions an equality operator, although I think the > planner quickly reduces that to an opfamily specification.) I suppose you mean make_pathkey_from_sortop(). > The concerns Robert had about equality semantics are certainly vital, but I > believe that the logic goes through correctly as long as the grouping > equality operator and the join equality operator belong to the same > opfamily. OK, a generic approach like this is always better; I just could not devise it myself, nor can I prove your theory wrong. > Robert postulated cases like "citext = text", but that doesn't apply > here because no cross-type citext = text operator could be part of a > well-behaved opfamily. The reason I can see now is that the GROUP BY operator (opfamily) essentially has the grouping column on both sides, so it does not match the cross-type operator. > What we'd actually be looking at is either "citextvar::text texteq textvar" > or "citextvar citexteq textvar::citext", and the casts prevent these > expressions from matching GROUP BY entries that have the wrong semantics. Can you please explain this? 
If the expression is "citextvar::text texteq textvar", then both operands of the operator should be of "text" type (because the operator receives the output of the cast), so I'd expect a match to SortGroupClause.eqop of a clause like "GROUP BY <text expression>". > > However: what we have proven here is only that we can aggregate across > a set of rows that must share the same join partners. We still have > to be able to handle the case that the rowset has more than one join > partner, which AFAICS means that we must use partial aggregation and > then apply an "aggmultifn" (or else apply the aggcombinefn N times). > We can avoid that and use plain aggregation when we can prove the "b" > side of the join is unique, so that no sets of rows will have to be merged > post-join; but ISTM that that reduces the set of use cases to be too small > to be worth such a complex patch. So I'm really doubtful that we should > proceed forward with only that case available. I tried to follow Heikki's proposal in [1] and separated the functionality that does not need the partial aggregation. I did not foresee that the patch would not get much smaller this way, partly because I had to introduce functions that check whether a particular join duplicates aggregated values or not. (Even if we use partial aggregation, we can use this functionality to detect that we only need to call aggfinalfn() in some cases, as opposed to setting up a regular Agg node for the final aggregation. However, that's an optimization that can probably be moved to a later position in the patch series.) I'll re-introduce the parallel aggregation in the next patch version. > Also, Tomas complained in the earlier thread that he didn't think > grouping on the join column was a very common use-case in the first > place. 
> I share that concern, but I think we could extend the logic > to the case that Tomas posited as being useful: > > SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; > > where the join column b.id is unique. If we group on a.y (using semantics > compatible with the join operator and the uniqueness constraint), then all > "a" rows in a given group will join to exactly one "b" row that > necessarily has exactly one grouping value, so this group can safely be > aggregated together. We might need to combine it post-join with other "b" > rows that have equal "z" values, but we can do that as long as we're okay > with partial aggregation. I think this example shows why the idea is far > more powerful with partial aggregation than without. Good idea. I'll try to incorporate it in the next version. > In short, then, I don't have much use for the patch as presented in this > thread, without "aggmultifn". That might be OK as the first phase in a > multi-step patch, but not as a production feature. The version 8 actually implements only part of the functionality that earlier versions did. I wanted more feedback from the community on the concept before reworking the whole series again. So for version 9 I'm going to include the partial aggregation. On the other hand, support for things like parallel processing, append relations, or postgres_fdw doesn't seem necessary for the initial version of the feature. > I also have no use for the stuff depending on bitwise equality rather than > the user-visible operators; that's just a hack, and an ugly one. The purpose of this was to avoid aggregation push-down for data types like numeric, see the example SELECT one.a FROM one, two WHERE one.a = two.a AND scale(two.a) > 1 GROUP BY 1 in [2]. The problem is that aggregation can lose information that WHERE/ON clauses above it still need. I'd appreciate any suggestions on how to address this problem. 
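The information-loss hazard can be modeled outside PostgreSQL with Python's Decimal, which, like numeric, compares values of different scale as equal. This is a simplified illustration of the hazard only, not of the planner code; scale() below is a hypothetical stand-in for the SQL function:

```python
from decimal import Decimal

rows = [Decimal("0"), Decimal("0.00")]   # equal by value, different scale

def scale(d):
    # stand-in for SQL scale(): number of digits after the decimal point
    return max(0, -d.as_tuple().exponent)

# Evaluating the predicate on the base rows keeps only '0.00'.
filtered = [d for d in rows if scale(d) > 1]
assert [str(d) for d in filtered] == ["0.00"]

# Grouping by value equality first collapses both rows into one group,
# represented by whichever row came first; a predicate evaluated above
# the aggregation then sees only '0' and wrongly returns nothing.
groups = {}
for d in rows:
    groups.setdefault(d, d)              # keeps the first representative
survivors = [d for d in groups.values() if scale(d) > 1]
assert survivors == []                   # '0.00' was lost in the grouping
```

Grouping by binary (bitwise) representation instead of value equality would keep '0' and '0.00' in separate groups and avoid the loss, which is the motivation behind the bitwise-equality checks discussed above.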
> As far as the patch details go, I've not looked through it in great > detail, but it appears to me that you are still trying to cram the same > stuff into base relations as before; shoving it into a subsidiary struct > doesn't improve that. Multiple people have said that's the wrong thing, > and I agree. Conceptually it's a disaster: a single RelOptInfo should > represent one well-defined set of result rows, not more than one. The first version of my patch did what you reject here, but then I started to use a separate RelOptInfo for the grouped relations. First I introduced simple_grouped_rel_array, but then (again per Heikki's suggestion) made the "plain" RelOptInfo point to the grouped one. > I think the right representation is to create UPPERREL_GROUP_AGG RelOptInfos (Just a detail: since UPPERREL_PARTIAL_GROUP_AGG was added at some point, I wonder if this is more appropriate than UPPERREL_GROUP_AGG for base relations and joins. create_grouping_paths() can then add the paths to UPPERREL_GROUP_AGG.) > whose relids show the set of baserels joined before aggregating. Currently > there's only one of those and its relids is equal to all_baserels (or should > be, anyway). This patch would add instances of UPPERREL_GROUP_AGG > RelOptInfos with singleton relids, and later we might put in the ability to > consider aggregation across other relid subsets, and in any case we'd run > join planning on those level-one rels and create new UPPERREL_GROUP_AGG > RelOptInfos that represent the intermediate join steps leading up to the > final join. The correct indexing structure for this collection of > RelOptInfos is certainly not simple_rel_array; it should look more like the > way we index the various regular join rels that we consider. Nothing like simple_rel_array (e.g. simple_grouped_rel_array) even for the base relations? I think the fact that the extra base relations are grouped should not affect the way they are retrieved. 
> (Maybe an appropriate preliminary patch is to refactor the support code > associated with join_rel_list + join_rel_hash so that we can treat those > fields as one instance of a structure we'll replicate later.) Do you mean that, instead of a single RelOptInfo, join_rel_list would contain a structure that points to two RelOptInfo structures, where one is the "plain" one and the other is the "grouped" one? > Note that I'm envisioning that we'd basically run join planning a second > time on this tree of join rels, rather than try to merge it with the primary > join planning. Since the numbers of rows to be processed will be completely > different in this join tree, merging it with the standard one seems > hopeless. A separate run of make_join_rel() for the grouped relations is what I tried to do so far. > BTW, I noticed some noise in the v8 patch: what's it doing touching > src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c? It's definitely just noise. Something went wrong when I was extracting the diff from my private repository. [1] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b@iki.fi [2] https://www.postgresql.org/message-id/20239.1516971866%40localhost -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
This is the next version. A few more comments below. Antonin Houska <ah@cybertec.at> wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Antonin Houska <ah@cybertec.at> writes: > > Thanks! > > > Conceptually, we'd like to be able to push aggregation down below joins > > when that yields a win, which it could do by reducing the number of rows > > that have to be processed at the join stage. Thus consider > > > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z; > > > > We can't simply aggregate during the scan of "a", because some values of > > "x" might not appear at all in the input of the naively-computed aggregate > > (if their rows have no join partner in "b"), or might appear multiple > > times (if their rows have multiple join partners). So at first glance, > > aggregating before the join is impossible. The key insight of the patch > > is that we can make some progress by considering only grouped aggregation: > > Yes, the patch does not handle queries with no GROUP BY clause. Primarily > because I don't know how the grouped "a" table in this example could emit the > "a.y" column w/o using it as a grouping key. > > > if we can group "a" into sets of rows that must all have the same join > > partners, then we can do preliminary aggregation within each such group, > > and take care of the number-of-repetitions problem when we join. If the > > groups are large then this reduces the number of rows processed by the > > join, at the cost that we might spend time computing the aggregate for > > some row sets that will just be discarded by the join. So now we consider > > > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; > > > > and ask what properties the grouping column(s) "?" must have to make this > > work. 
> > I believe we can say that it's OK to push down through a join if > > its join clauses are all of the form "a.y = b.z", where either a.y or b.z > > is listed as a GROUP BY column *and* the join operator is equality for > > the btree opfamily specified by the SortGroupClause. I think the requirement for the join clauses to be of the form "a.y = b.z" would only be valid if we wanted to avoid the 2-stage aggregation. For example, in this query SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY b.z uniqueness of "b.z" ensures that groups resulting from the pushed-down aggregation (SELECT a.y, agg(a.x) FROM a GROUP BY a.y) do not get duplicated by the join. However, if there's a consensus that the 2-stage aggregation will be used (because avoiding it would limit the usefulness of the feature too much), I think we can accept generic join clauses. (Note that the patch does not try to push an aggregate down below the nullable side of an outer join - handling of generated NULL values by join clauses would be another problem.) > > The concerns Robert had about equality semantics are certainly vital, but I > > believe that the logic goes through correctly as long as the grouping > > equality operator and the join equality operator belong to the same > > opfamily. > > ok, generic approach like this is always better. I just could not devise it, > nor can I prove that your theory is wrong. > > > Robert postulated cases like "citext = text", but that doesn't apply > > here because no cross-type citext = text operator could be part of a > > well-behaved opfamily. > > The reason I can see now is that the GROUP BY operator (opfamily) essentially > has the grouping column on both sides, so it does not match the cross-type > operator. Even if my statement that the join clause can be of some kind other than "<leftarg> = <rightarg>" was right, it might still be worth insisting on "<leftarg> <op> <rightarg>" where <op> is of the same operator family as the grouping operator? 
> > Also, Tomas complained in the earlier thread that he didn't think > > grouping on the join column was a very common use-case in the first > > place. I share that concern, but I think we could extend the logic > > to the case that Tomas posited as being useful: > > > > SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; > > > > where the join column b.id is unique. If we group on a.y (using semantics > > compatible with the join operator and the uniqueness constraint), then all > > "a" rows in a given group will join to exactly one "b" row that > > necessarily has exactly one grouping value, so this group can safely be > > aggregated together. We might need to combine it post-join with other "b" > > rows that have equal "z" values, but we can do that as long as we're okay > > with partial aggregation. I think this example shows why the idea is far > > more powerful with partial aggregation than without. > > Good idea. I'll try to incorporate it in the next version. The previous version of the patch (before I removed the partial aggregation) could actually handle this case, so I restored it. It just does not check the uniqueness of the b.id column because the final aggregation will eliminate the duplicated groups anyway. The patch views the problem from this perspective: "a.y" is needed by the join clause, and the only way to retrieve it from the partially aggregated relation "a" is to use it as a grouping column. Such an "artificial" grouping expression can only be used if the plan contains the final aggregate node, which uses only the original GROUP BY clause. > > I also have no use for the stuff depending on bitwise equality rather than > > the user-visible operators; that's just a hack, and an ugly one. > > The purpose of this was to avoid aggregation push-down for data types like > numeric, see the example > > SELECT one.a > FROM one, two > WHERE one.a = two.a AND scale(two.a) > 1 > GROUP BY 1 > > in [2]. 
> The problem is that aggregation can make some information lost, which > is needed above by WHERE/ON clauses. I appreciate any suggestions how to > address this problem. The new patch version does not contain this hack, but the problem is still open. Alternatively, I'm thinking of a boolean flag added to the pg_opfamily or pg_opclass catalog, saying whether equality also means "bitwise equality". If the value is false (the default) for the GROUP BY operator, no aggregate push-down can take place. > > I think the right representation is to create UPPERREL_GROUP_AGG > > RelOptInfos whose relids show the set of baserels joined before > > aggregating. Currently there's only one of those and its relids is equal > > to all_baserels (or should be, anyway). This patch would add instances of > > UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might > > put in the ability to consider aggregation across other relid subsets, and > > in any case we'd run join planning on those level-one rels and create new > > UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps > > leading up to the final join. The correct indexing structure for this > > collection of RelOptInfos is certainly not simple_rel_array; it should > > look more like the way we index the various regular join rels that we > > consider. > > (Maybe an appropriate preliminary patch is to refactor the support code > > associated with join_rel_list + join_rel_hash so that we can treat those > > fields as one instance of a structure we'll replicate later.) > Do you mean that, instead of a single RelOptInfo, join_rel_list would contain > a structure that points to two RelOptInfo structures, where one is the "plain" > one and the other is the "grouped" one? I eventually took you to be proposing a separate instance of join_rel_list (e.g. 
join_rel_list_grouped) for the 2nd run of the joining process; however, that would mean that some functions need to receive RelOptInfos as arguments, and *also* search in the join_rel_list_grouped when joining a grouped relation to a non-grouped one. So I decided to wrap both the grouped and non-grouped relations into a new structure (RelOptInfoSet). Thus all the relations to be joined are passed as function arguments. make_join_rel() shows best how instances of the new structure are used.

> [1] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b@iki.fi
>
> [2] https://www.postgresql.org/message-id/20239.1516971866%40localhost

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
Attachment
On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote: > This is the next version. A few more comments below. Latest patch set fails to apply, so moved to next CF, waiting on author. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote: > > This is the next version. A few more comments below. > > Latest patch set fails to apply, so moved to next CF, waiting on > author. Rebased. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Attachment
Antonin Houska <ah@cybertec.at> writes: > Michael Paquier <michael@paquier.xyz> wrote: >> Latest patch set fails to apply, so moved to next CF, waiting on >> author. > Rebased. This is in need of rebasing again :-(. I went ahead and pushed the 001 part, since that seemed fairly uncontroversial. (Note that I changed estimate_hashagg_tablesize's result type to double on the way; you probably want to make corresponding adjustments in your patch.) I did not spend a whole lot of time looking at the patch today, but I'm still pretty distressed at the data structures you've chosen. I remain of the opinion that a grouped relation and a base relation are, er, unrelated, even if they happen to share the same relid set. So I do not see the value of the RelOptInfoSet struct you propose here, and I definitely don't think there's any value in having, eg, create_seqscan_path or create_index_path dealing with this stuff. I also don't like changing create_nestloop_path et al to take a PathTarget rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong for a path to generate a tlist different from what its parent RelOptInfo says that the relation produces. I think basically the way this ought to work is that we generate baserel paths pretty much the same as today, and then we run through the baserels and see which ones are potentially worth doing partial aggregation on, and for each one that is, we create a separate "upper relation" RelOptInfo describing that. The paths for this RelOptInfo would be partial-aggregation paths using paths from the corresponding baserel as input. Then we'd run a join search that only considers joining grouped rels with plain rels (I concur that joining two grouped rels is not worth coping with, at least for now). regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Antonin Houska <ah@cybertec.at> writes: > > Michael Paquier <michael@paquier.xyz> wrote: > >> Latest patch set fails to apply, so moved to next CF, waiting on > >> author. > > > Rebased. > > This is in need of rebasing again :-(. I went ahead and pushed the 001 > part, since that seemed fairly uncontroversial. ok, thanks. > I did not spend a whole lot of time looking at the patch today, but > I'm still pretty distressed at the data structures you've chosen. > I remain of the opinion that a grouped relation and a base relation > are, er, unrelated, even if they happen to share the same relid set. > So I do not see the value of the RelOptInfoSet struct you propose here, ok. As you suggested upthread, I try now to reuse the join_rel_list / join_rel_hash structures, see v11-001-Introduce_RelInfoList.patch. > and I definitely don't think there's any value in having, eg, > create_seqscan_path or create_index_path dealing with this stuff. Originally I tried to aggregate any path that ever gets passed to agg_path(), but that's probably not worth the code complexity. Now the partial aggregation is only applied to paths that survived agg_path() on the plain relation. > I also don't like changing create_nestloop_path et al to take a PathTarget > rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong > for a path to generate a tlist different from what its parent RelOptInfo > says that the relation produces. Likewise, the current patch version is less invasive, so create_nestloop_path et al are not touched at all. > I think basically the way this ought to work is that we generate baserel > paths pretty much the same as today, and then we run through the baserels > and see which ones are potentially worth doing partial aggregation on, > and for each one that is, we create a separate "upper relation" RelOptInfo > describing that. 
> The paths for this RelOptInfo would be partial-aggregation paths using
> paths from the corresponding baserel as input. Then we'd run a join
> search that only considers joining grouped rels with plain rels (I
> concur that joining two grouped rels is not worth coping with, at least
> for now).

make_join_rel() is the core of my implementation: besides joining two plain relations, it tries to join a plain relation to a grouped one, and also to aggregate the join of the two plain relations. I consider this approach less invasive and more efficient than running the whole standard_join_search again for the grouped rels.

The problem of numeric-like data types (i.e. types for which equality of two values of the grouping key does not justify putting them into the same group, because information like scale would be discarded this way) remains open. My last idea was to add a boolean flag to the operator class which tells that equality implies "bitwise equality", and to disallow aggregate push-down if SortGroupClause.eqop is in an opclass which has this field set to FALSE. I'd like to hear your opinion before I do any coding.

--
Antonin Houska
https://www.cybertec-postgresql.com
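The information-loss problem described above can be reproduced with Python's Decimal, whose equality likewise ignores the scale (the digits-after-the-point analogue of PostgreSQL's scale() is approximated here by the negated exponent; the values are made up):

```python
from decimal import Decimal

# Equal under "=", yet carrying different user-visible information:
x, y = Decimal("1.0"), Decimal("1.00")
assert x == y                    # one group under GROUP BY semantics
assert str(x) != str(y)          # but the scale differs

# If a pushed-down GROUP BY collapses both values into one group, a later
# filter resembling "scale(two.a) > 1" sees only one representative:
rows = [Decimal("1.0"), Decimal("1.00")]
groups = {}
for v in rows:
    groups.setdefault(v, v)      # keeps the first representative, 1.0
survivors = [v for v in groups.values() if -v.as_tuple().exponent > 1]
print(survivors)                 # [] -- 1.00 would have matched, but was lost
```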
Attachment
On Fri, Mar 1, 2019 at 12:01 AM Antonin Houska <ah@cybertec.at> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Antonin Houska <ah@cybertec.at> writes:
> > Michael Paquier <michael@paquier.xyz> wrote:
> >> Latest patch set fails to apply, so moved to next CF, waiting on
> >> author.
>
> > Rebased.
>
> This is in need of rebasing again :-(. I went ahead and pushed the 001
> part, since that seemed fairly uncontroversial.
ok, thanks.
Another rebase is needed for the patches.
Thanks
Richard
Richard Guo <riguo@pivotal.io> wrote: > Another rebase is needed for the patches. Done. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Tue, Jul 9, 2019 at 9:47 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:
> Another rebase is needed for the patches.
Done.
I didn't fully follow the whole thread and mainly looked into the latest
patch set. So what are the considerations for abandoning the aggmultifn
concept? In my opinion, aggmultifn would enable us to do a lot more
types of transformation. For example, consider the query below:
select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;
With the latest patch, the plan looks like:
Finalize HashAggregate <------ sum(psum)
Group Key: foo.a, bar.a
-> Hash Join
Hash Cond: (bar.b = foo.b)
-> Seq Scan on bar
-> Hash
-> Partial HashAggregate <------ sum(foo.c) as psum
Group Key: foo.a, foo.b
-> Seq Scan on foo
If we have aggmultifn, we can perform the query this way:
Finalize HashAggregate <------ sum(foo.c)*cnt
Group Key: foo.a, bar.a
-> Hash Join
Hash Cond: (foo.b = bar.b)
-> Seq Scan on foo
-> Hash
-> Partial HashAggregate <------ count(*) as cnt
Group Key: bar.a, bar.b
-> Seq Scan on bar
And this way:
Finalize HashAggregate <------ sum(psum)*cnt
Group Key: foo.a, bar.a
-> Hash Join
Hash Cond: (foo.b = bar.b)
-> Partial HashAggregate <------ sum(foo.c) as psum
Group Key: foo.a, foo.b
-> Seq Scan on foo
-> Hash
-> Partial HashAggregate <------ count(*) as cnt
Group Key: bar.a, bar.b
-> Seq Scan on bar
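The sum(foo.c)*cnt idea can be checked outside the planner with a toy simulation (tables and values are made up; Python stands in for the executor, and the final multiplication is what an aggmultifn would encapsulate):

```python
from collections import defaultdict

foo = [{"a": 1, "b": 10, "c": 2}, {"a": 1, "b": 10, "c": 3},
       {"a": 2, "b": 20, "c": 5}]
bar = [{"a": 7, "b": 10}, {"a": 7, "b": 10}, {"a": 8, "b": 20}]

def direct(foo, bar):
    # Plain join, then sum(foo.c) GROUP BY foo.a, bar.a.
    g = defaultdict(int)
    for f in foo:
        for r in bar:
            if f["b"] == r["b"]:
                g[(f["a"], r["a"])] += f["c"]
    return dict(g)

def with_count_pushdown(foo, bar):
    # Partial aggregate on bar: count(*) GROUP BY bar.a, bar.b.
    cnt = defaultdict(int)
    for r in bar:
        cnt[(r["a"], r["b"])] += 1
    # Each joined row now represents "cnt" original bar rows, so its
    # contribution is foo.c * cnt -- the aggmultifn-style combination.
    g = defaultdict(int)
    for f in foo:
        for (bar_a, bar_b), n in cnt.items():
            if f["b"] == bar_b:
                g[(f["a"], bar_a)] += f["c"] * n
    return dict(g)

print(direct(foo, bar) == with_count_pushdown(foo, bar))  # True
```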
Another question I have is about function add_grouped_path(), when creating
a sorted aggregation path on top of a subpath. If the subpath is not sorted,
then the sorted aggregation path would not be generated. Why not, in this
case, create a sort path on top of the subpath first and then create a group
aggregation path on top of the sort path?
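The shape asked about here (an explicit sort feeding a one-pass group aggregate) can be sketched outside the planner; the data and names are illustrative only:

```python
from itertools import groupby

rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# "Sort path" first, then a GroupAggregate-style single pass over the
# sorted input -- the plan shape that would result from adding the sort.
sorted_rows = sorted(rows, key=lambda r: r[0])
agg = {key: sum(v for _, v in grp)
       for key, grp in groupby(sorted_rows, key=lambda r: r[0])}
print(agg)  # {'a': 5, 'b': 5}
```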
Core dump when running one query in agg_pushdown.sql
EXPLAIN ANALYZE
SELECT p.x, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1
AS c1 ON c1.parent = p.i GROUP BY p.i;
#0 0x00000000006def98 in CheckVarSlotCompatibility (slot=0x0, attnum=1, vartype=23) at execExprInterp.c:1850
#1 0x00000000006def5d in CheckExprStillValid (state=0x2b63a28, econtext=0x2ba4958) at execExprInterp.c:1814
#2 0x00000000006dee38 in ExecInterpExprStillValid (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37) at execExprInterp.c:1763
#3 0x00000000007144dd in ExecEvalExpr (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37)
at ../../../src/include/executor/executor.h:288
#4 0x0000000000715475 in ExecIndexEvalRuntimeKeys (econtext=0x2ba4958, runtimeKeys=0x2b63910, numRuntimeKeys=1) at nodeIndexscan.c:630
#5 0x000000000071533b in ExecReScanIndexScan (node=0x2b62bf8) at nodeIndexscan.c:568
#6 0x00000000006d4ce6 in ExecReScan (node=0x2b62bf8) at execAmi.c:182
#7 0x00000000007152a0 in ExecIndexScan (pstate=0x2b62bf8) at nodeIndexscan.c:530
This is really a cool feature. Thank you for working on this.
Thanks
Richard
Richard Guo <riguo@pivotal.io> wrote:

> I didn't fully follow the whole thread and mainly looked into the latest
> patch set. So what are the considerations for abandoning the aggmultifn
> concept?

Originally the function was there to support joins where both relations are grouped. My doubts about the usefulness of such a join started at [1]. (See the thread referenced from [2].)

> In my opinion, aggmultifn would enable us to do a lot more
> types of transformation. For example, consider the query below:
>
> select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;
>
> With the latest patch, the plan looks like:
>
> Finalize HashAggregate <------ sum(psum)
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (bar.b = foo.b)
> -> Seq Scan on bar
> -> Hash
> -> Partial HashAggregate <------ sum(foo.c) as psum
> Group Key: foo.a, foo.b
> -> Seq Scan on foo
>
> If we have aggmultifn, we can perform the query this way:
>
> Finalize HashAggregate <------ sum(foo.c)*cnt
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (foo.b = bar.b)
> -> Seq Scan on foo
> -> Hash
> -> Partial HashAggregate <------ count(*) as cnt
> Group Key: bar.a, bar.b
> -> Seq Scan on bar

Perhaps, but that would require changes to nodeAgg.c, which I want to avoid for now.

> And this way:
>
> Finalize HashAggregate <------ sum(psum)*cnt
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (foo.b = bar.b)
> -> Partial HashAggregate <------ sum(foo.c) as psum
> Group Key: foo.a, foo.b
> -> Seq Scan on foo
> -> Hash
> -> Partial HashAggregate <------ count(*) as cnt
> Group Key: bar.a, bar.b
> -> Seq Scan on bar

This looks like my idea presented in [1], for which there seems to be no common use case.

> Another question I have is about function add_grouped_path(), when creating
> a sorted aggregation path on top of a subpath. If the subpath is not sorted,
> then the sorted aggregation path would not be generated.
> Why not, in this case, create a sort path on top of the subpath first and
> then create a group aggregation path on top of the sort path?

I see no reason not to do it. (If you want to try, just go ahead.) However, the current patch version brings only the basic functionality, and I'm not going to add new functionality (such as parallel aggregation, partitioned tables or postgres_fdw) until the open design questions are resolved. Otherwise there's a risk that the additional work will be wasted due to a major rework of the core functionality.

> Core dump when running one query in agg_pushdown.sql

Thanks for the report! Fixed in the new version.

> This is really a cool feature. Thank you for working on this.

I'm glad to hear that :-) Let's see which postgres release will adopt it.

[1] https://www.postgresql.org/message-id/cc823e89-3fbc-f94e-b9d4-9c713b044b5d%402ndquadrant.com

[2] https://www.postgresql.org/message-id/flat/9666.1491295317%40localhost

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachment
On Fri, Jul 12, 2019 at 4:42 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:
> I didn't fully follow the whole thread and mainly looked into the
> latest
> patch set. So what are the considerations for abandoning the
> aggmultifn
> concept?
Originally the function was there to support join where both relations are
grouped. My doubts about usefulness of such join started at [1]. (See the
thread referenced from [2].)
> In my opinion, aggmultifn would enable us to do a lot more
> types of transformation. For example, consider the query below:
>
> select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a,
> bar.a;
>
> With the latest patch, the plan looks like:
>
> Finalize HashAggregate <------ sum(psum)
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (bar.b = foo.b)
> -> Seq Scan on bar
> -> Hash
> -> Partial HashAggregate <------ sum(foo.c) as
> psum
> Group Key: foo.a, foo.b
> -> Seq Scan on foo
>
>
> If we have aggmultifn, we can perform the query this way:
>
> Finalize HashAggregate <------ sum(foo.c)*cnt
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (foo.b = bar.b)
> -> Seq Scan on foo
> -> Hash
> -> Partial HashAggregate <------ count(*) as cnt
> Group Key: bar.a, bar.b
> -> Seq Scan on bar
Perhaps, but it that would require changes to nodeAgg.c, which I want to avoid
for now.
> And this way:
>
> Finalize HashAggregate <------ sum(psum)*cnt
> Group Key: foo.a, bar.a
> -> Hash Join
> Hash Cond: (foo.b = bar.b)
> -> Partial HashAggregate <------ sum(foo.c) as
> psum
> Group Key: foo.a, foo.b
> -> Seq Scan on foo
> -> Hash
> -> Partial HashAggregate <------ count(*) as cnt
> Group Key: bar.a, bar.b
> -> Seq Scan on bar
This looks like my idea presented in [1], for which there seems to be no
common use case.
> My another question is in function add_grouped_path(), when creating
> sorted aggregation path on top of subpath. If the subpath is not
> sorted,
> then the sorted aggregation path would not be generated. Why not in
> this
> case we create a sort path on top of subpath first and then create
> group
> aggregation path on top of the sort path?
I see no reason not to do it. (If you want to try, just go ahead.) However the
current patch version brings only the basic functionality and I'm not going to
add new functionality (such as parallel aggregation, partitioned tables or
postgres_fdw) until the open design questions are resolved. Otherwise there's
a risk that the additional work will be wasted due to major rework of the core
functionality.
> Core dump when running one query in agg_pushdown.sql
Thanks for the report! Fixed in the new version.
Another core dump for query below:
select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, t2.g2;
This is due to a small mistake:
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 10becc0..9c2deed 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -648,7 +648,7 @@ void
add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
{
add_rel_info(root->grouped_rel_list, rel);
- add_rel_info(root->agg_info_list, rel);
+ add_rel_info(root->agg_info_list, agg_info);
}
Thanks
Richard
Richard Guo <riguo@pivotal.io> wrote: > Another core dump for query below: > > select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, t2.g2; > > This is due to a small mistake: > > diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c > index 10becc0..9c2deed 100644 > --- a/src/backend/optimizer/util/relnode.c > +++ b/src/backend/optimizer/util/relnode.c > @@ -648,7 +648,7 @@ void > add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info) > { > add_rel_info(root->grouped_rel_list, rel); > - add_rel_info(root->agg_info_list, rel); > + add_rel_info(root->agg_info_list, agg_info); > } > Thanks! Yes, this is what I had to fix. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
This stuff seems very useful. How come it sits unreviewed for so long? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> This stuff seems very useful. How come it sits unreviewed for so long?

I think the review is hard for people who are not much interested in the planner. And as for further development, there are a few design decisions that can hardly be resolved without Tom Lane's comments. Right now I recall two problems: 1) is the way I currently store RelOptInfo for the grouped relations correct? 2) how should we handle types for which logical equality does not imply physical (byte-wise) equality?

Fortunately it seems now that I'm not the only one who cares about 2), so this problem might be resolved soon:

https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com

But 1) still remains.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi,

I've been looking at the last version (v14) of this patch series, submitted way back in July and unfortunately quiet since then. Antonin is probably right that one of the reasons for the lack of reviews is that it requires interest/experience with the planner.

Initially it was also a bit hard to understand what the benefits are (and the patch shifted a bit), which is now mostly addressed by the README in the last patch. The trouble is that it's hidden in the patch and so not immediately obvious to people considering reviewing it :-( Tom posted a nice summary in November 2018, but it was perhaps focused on the internals.

So my first suggestion is to write a short summary with examples of how it affects practical queries (plan change, speedup) etc.

My second suggestion is to have meaningful commit messages for each part of the patch series, explaining why we need that particular part. It might have been explained somewhere in the thread, but as a reviewer I really don't want to reverse-engineer the whole thread.

Now, regarding individual parts of the patch:

1) v14-0001-Introduce-RelInfoList-structure.patch
-------------------------------------------------

- I'm not entirely sure why we need this change. We had the list+hash before, so I assume we do this because we need the output functions?

- The RelInfoList naming is pretty confusing, because it's actually not a list but a combination of list+hash. And the embedded list is called items, to make it yet a bit more confusing. I suggest we call this just RelInfo or RelInfoLookup or something else not including List.

- RelInfoEntry seems severely under-documented, particularly the data part (which is void*, making it even harder to understand what it's for without having to read the functions). AFAICS it's simply a link to the embedded list, but maybe I'm wrong.

- I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c right before build_join_rel. This makes the diff clearer/smaller and visual diffs nicer.
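For reference, the list-plus-lazy-hash behavior under discussion can be mimicked in a few lines; the threshold, names and types below are illustrative, not PostgreSQL's actual code:

```python
class RelInfoList:
    """A list of planner items with a lazily built hash index.

    Lookups scan the embedded list while it is short and switch to a
    dict keyed by relids once the list grows past a threshold.
    """
    THRESHOLD = 32

    def __init__(self):
        self.items = []
        self._hash = None            # built only when the list gets long

    def add(self, relids, item):
        self.items.append((relids, item))
        if self._hash is not None:
            self._hash[relids] = item
        elif len(self.items) > self.THRESHOLD:
            self._hash = dict(self.items)

    def find(self, relids):
        if self._hash is not None:
            return self._hash.get(relids)
        for r, it in self.items:     # linear scan while small
            if r == relids:
                return it
        return None

rels = RelInfoList()
rels.add(frozenset({1, 2}), "join rel (1,2)")
print(rels.find(frozenset({1, 2})))  # join rel (1,2)
```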
2) v14-0002-Introduce-make_join_rel_common-function.patch
---------------------------------------------------------

I see no point in keeping this change in a separate patch, as it pretty much just renames make_join_rel to make_join_rel_common and then adds

make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
{
    return make_join_rel_common(root, rel1, rel2);
}

which seems rather trivial and useless on its own. I'd just merge it into 0003 where we use the _common function more extensively.

3) v14-0003-Aggregate-push-down-basic-functionality.patch
---------------------------------------------------------

I haven't done much review on this yet, but I've done some testing and benchmarking, so let me share at least that. The queries I used were of the type I mentioned earlier in this thread - essentially a star schema, i.e. a fact table referencing dimensions, with aggregation on columns in the dimension. So something like

SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id) GROUP BY d.c;

where the "fact" table is much, much larger than the dimension.

Attached is a script I used for testing with a bunch of queries and a number of parameters (parallelism, size of dimension, size of fact, ...) and a spreadsheet summarizing the results.

Overall, the results seem pretty good - in many cases the queries get about 2x faster and I haven't seen any regressions. That's pretty nice.

One annoying thing is that this only works for non-parallel queries. That is, I saw no improvements with parallel queries. It was still faster than the non-parallel query with aggregate pushdown, but the parallel query did not improve. An example of timing looks like this:

non-parallel (pushdown = off): 918 ms
non-parallel (pushdown = on): 561 ms
parallel (pushdown = off): 313 ms
parallel (pushdown = on): 313 ms

The non-parallel query gets faster because after enabling push-down the plan changes (and we get a partial aggregate below the join).
With parallel query that does not happen, the plans stay the same.

I'm not claiming this is a bug - we end up picking the fastest execution plan (i.e. we don't make a mistake by e.g. switching to the non-parallel one with pushdown).

My understanding is that the aggregate pushdown can't (currently?) be used with queries that use partial aggregate because of parallelism. That is, we can either end up with a plan like this:

Finalize Aggregate
-> Join
   -> Partial Aggregate
      -> ...

or a parallel plan like this:

Finalize Aggregate
-> Gather
   -> Partial Aggregate
      -> Join
         -> ...
         -> ...

but we currently don't support a combination of both:

Finalize Aggregate
-> Gather
   -> Join
      -> Partial Aggregate
         -> ...

I wonder if that's actually even possible and what it would take to make it work?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
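The combined shape (Finalize Aggregate above Gather above Join above Partial Aggregate) can at least be simulated outside the executor. Assuming the aggregate's partial states combine by simple addition (true for sum()), each worker can run the partial aggregate below its own join and the Gather step merely merges states; the tables and values here are made up:

```python
from collections import defaultdict

fact = [{"d_id": i % 3, "f": i} for i in range(12)]
dim = [{"id": 0, "c": "x"}, {"id": 1, "c": "y"}, {"id": 2, "c": "y"}]

def worker(chunk):
    # Partial Aggregate below the join, run on this worker's chunk of "fact".
    partial = defaultdict(int)
    for row in chunk:
        partial[row["d_id"]] += row["f"]       # sum(f) GROUP BY d_id
    # Join the partial states to the dimension inside the worker.
    out = defaultdict(int)
    for d_id, psum in partial.items():
        for d in dim:
            if d["id"] == d_id:
                out[d["c"]] += psum
    return out

def parallel_plan(fact, nworkers=3):
    # "Gather": merge the workers' states, then finalize per group.
    final = defaultdict(int)
    for w in range(nworkers):
        for c, psum in worker(fact[w::nworkers]).items():
            final[c] += psum                   # Finalize Aggregate
    return dict(final)

def serial_plan(fact):
    g = defaultdict(int)
    for row in fact:
        for d in dim:
            if d["id"] == row["d_id"]:
                g[d["c"]] += row["f"]
    return dict(g)

print(parallel_plan(fact) == serial_plan(fact))  # True
```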
Attachment
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Hi, > > I've been looking at the last version (v14) of this patch series, > submitted way back in July and unfortunately quiet since then. Antonin > is probably right one of the reasons for the lack of reviews is that it > requires interest/experience with planner. > > Initially it was also a bit hard to understand what are the benefits > (and the patch shifted a bit), which is now mostly addressed by the > README in the last patch. The trouble is that's hidden in the patch and > so not immediately obvious to people considering reviewing it :-( Tom > posted a nice summary in November 2018, but it was perhaps focused on > the internals. > > So my first suggestion it to write a short summary with example how it > affects practical queries (plan change, speedup) etc. I think README plus regression test output should be enough for someone who is about to review patch as complex as this. > My second suggestion is to have meaningful commit messages for each part > of the patch series, explaining why we need that particular part. It > might have been explained somewhere in the thread, but as a reviewer I > really don't want to reverse-engineer the whole thread. ok, done. > Now, regarding individual parts of the patch: > > > 1) v14-0001-Introduce-RelInfoList-structure.patch > ------------------------------------------------- > > - I'm not entirely sure why we need this change. We had the list+hash > before, so I assume we do this because we need the output functions? I believe that this is what Tom proposed in [1], see "Maybe an appropriate preliminary patch is to refactor the support code ..." near the end of that message. The point is that now we need two lists: one for "plain" relations and one for grouped ones. > - The RelInfoList naming is pretty confusing, because it's actually not > a list but a combination of list+hash. And the embedded list is called > items, to make it yet a bit more confusing. 
> I suggest we call this just RelInfo or RelInfoLookup or something else
> not including List.

I think it can be considered a list by the caller of add_join_rel() etc. The hash table is hidden from the user and is empty until the list becomes long enough. The word "list" in the structure name may just indicate that new elements can be added to the end, which shouldn't be expected if the structure were an array.

I searched a bit in tools/pgindent/typedefs.list and found at least a few structures that also have "list" in the name but are not lists internally: CatCList, FuncCandidateList, MCVList. Actually I'm not opposed to renaming the structure, but I don't have a better idea right now. As for *Lookup: the following is the only use of such a structure name in PG code, and it's not something to store data in:

typedef enum
{
    IDENTIFIER_LOOKUP_NORMAL,   /* normal processing of var names */
    IDENTIFIER_LOOKUP_DECLARE,  /* In DECLARE --- don't look up names */
    IDENTIFIER_LOOKUP_EXPR      /* In SQL expression --- special case */
} IdentifierLookup;

> - RelInfoEntry seems severely under-documented, particularly the data
>   part (which is void* making it even harder to understand what's it for
>   without having to read the functions). AFAICS it's just simply a link
>   to the embedded list, but maybe I'm wrong.

This is just JoinHashEntry (which was also undocumented) renamed. I've added a short comment now.

> - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
>   right before build_join_rel. This makes diff clearer/smaller and
>   visual diffs nicer.

Hm, it might improve readability of the diff, but this API is exactly where I still need feedback from Tom. I'm not eager to optimize the diff as long as there's a risk that these functions will be removed or renamed.
> 2) v14-0002-Introduce-make_join_rel_common-function.patch > --------------------------------------------------------- > > I see no point in keeping this change in a separate patch, as it prety > much just renames make_join_rel to make_join_rel_common and then adds > > make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2) > { > return make_join_rel_common(root, rel1, rel2); > } > > which seems rather trivial and useless on it's own. I'd just merge it > into 0003 where we use the _common function more extensively. ok. I thought that this improves readability of the diffs, but it doesn't look that bad if this is included in 0003. > > 3) v14-0003-Aggregate-push-down-basic-functionality.patch > --------------------------------------------------------- > > I haven't done much review on this yet, but I've done some testing and > benchmarking so let me share at least that. The queries I used were of > the type I mentioned earlier in this thread - essentially a star schema, > i.e. fact table referencing dimensions, with aggregation per columns in > the dimension. So something like > > SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id) > GROUP BY d.c; > > where "fact" table is much much larger than the dimension. > > Attached is a script I used for testing with a bunch of queries and a > number of parameters (parallelism, size of dimension, size of fact, ...) > and a spreadsheed summarizing the results. > > Overall, the results seem pretty good - in many cases the queries get > about 2x faster and I haven't seen any regressions. That's pretty nice. Thanks for such an elaborate testing! The ultimate goal of this patch is to improve sharding (using postgres_fdw), but it's nice to see significant improvement even for local queries. > One annoying thing is that this only works for non-parallel queries. > That is, I saw no improvements with parallel queries. 
It was still > faster than the non-parallel query with aggregate pushdown, but the > parallel query did not improve. > An example of timing looks like this: > > non-parallel (pushdown = off): 918 ms > non-parallel (pushdown = on): 561 ms > parallel (pushdown = off): 313 ms > parallel (pushdown = on): 313 ms > > The non-parallel query gets faster because after enabling push-down the > plan changes (and we get partial aggregate below the join). With > parallel query that does not happen, the plans stay the same. > > I'm not claiming this is a bug - we end up picking the fastest execution > plan (i.e. we don't make a mistake by e.g. switching to the non-parallel > one with pushdown). > > My understanding is that the aggregate pushdown can't (currently?) be > used with queries that use partial aggregate because of parallelism. > That is we can either end up with a plan like this: > > Finalize Aggregate > -> Join > -> Partial Aggregate > -> ... > > or a parallel plan like this: > > Finalize Aggregate > -> Gather > -> Partial Aggregate > -> Join > -> ... > -> ... > > but we currently don't support a combination of both > > Finalize Aggregate > -> Gather > -> Join > -> Partial Aggregate > -> ... > > I wonder if that's actually even possible and what would it take to make > it work? I had a prototype of the feature that does affect parallel queries (as well as partitioned tables and postgres_fdw), but didn't include these parts in the recent patch versions. The point is that the API to store RelOptInfos can still change and in such a case I'd have to rebase too much code. v15 is attached. 
Actually there are no significant changes relative to v14, but v14 does not apply to the current master: error: patch failed: src/test/regress/serial_schedule:140 error: src/test/regress/serial_schedule: patch does not apply This is interesting because no problem is currently reported at http://commitfest.cputube.org/ [1] https://www.postgresql.org/message-id/9726.1542577439@sss.pgh.pa.us -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Hello,

Thank you for this great feature! I hope this will be reviewed/validated soon ;o)

Just a comment: "set enable_agg_pushdown to true;" isn't displayed in EXPLAIN (SETTINGS) output. The following modification seems to fix that:

src/backend/utils/misc/guc.c

    {"enable_agg_pushdown", PGC_USERSET, QUERY_TUNING_METHOD,
        gettext_noop("Enables aggregation push-down."),
        NULL,
        GUC_EXPLAIN    <<<<--- line Added --->>>>
    },
    &enable_agg_pushdown,
    false,
    NULL, NULL, NULL
    },

then

postgres=# set enable_agg_pushdown = true;
SET
postgres=# explain (settings) select 1;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4)
 Settings: enable_agg_pushdown = 'on'
(2 rows)

Regards
PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
legrand legrand <legrand_legrand@hotmail.com> wrote: > set enable_agg_pushdown to true; > isn't displayed in EXPLAIN (SETTINGS) syntax. > > > The following modification seems to fix that: > > src/backend/utils/misc/guc.c > > {"enable_agg_pushdown", PGC_USERSET, QUERY_TUNING_METHOD, > gettext_noop("Enables aggregation push-down."), > NULL, > GUC_EXPLAIN <<<<--- line Added --->>>> > }, > &enable_agg_pushdown, > false, > NULL, NULL, NULL > }, Thanks. I'll include this change in the next version. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hi,
I've been looking at the 'make_join_rel()' part of the patch, and I'm
aware that, if we are joining A and B, a 'grouped join rel (AB)' would
be created besides the 'plain join rel (AB)', and paths are added by 1)
applying partial aggregate to the paths of the 'plain join rel (AB)', or
2) joining grouped A to plain B or joining plain A to grouped B.
This is a smart idea. One issue I can see is that some logic would have
to be repeated several times. For example, the validity check for the
same proposed join would be performed at most three times by
join_is_legal().
I'm thinking of another idea that, instead of using a separate
RelOptInfo for the grouped rel, we add in RelOptInfo a
'grouped_pathlist' for the grouped paths, just like how we implement
parallel query via adding 'partial_pathlist'.
For base rel, if we decide it can produce grouped paths, we create the
grouped paths by applying partial aggregation to the paths in 'pathlist'
and add them into 'grouped_pathlist'.
For join rel (AB), we can create the grouped paths for it by:
1) joining a grouped path from 'A->grouped_pathlist' to a plain path
from 'B->pathlist', or vice versa;
2) applying partial aggregation to the paths in '(AB)->pathlist'.
This is basically the same idea, just moves the grouped paths from the
grouped rel to a 'grouped_pathlist'. With it we should not need to make
any code changes to 'make_join_rel()'. Most code changes would happen in
'add_paths_to_joinrel()'.
Will this idea work? Is it better or worse?
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> wrote:

> Hi,
>
> I've been looking at the 'make_join_rel()' part of the patch, and I'm
> aware that, if we are joining A and B, a 'grouped join rel (AB)' would
> be created besides the 'plain join rel (AB)', and paths are added by 1)
> applying partial aggregate to the paths of the 'plain join rel (AB)', or
> 2) joining grouped A to plain B or joining plain A to grouped B.
>
> This is a smart idea. One issue I can see is that some logic would have
> to be repeated several times. For example, the validity check for the
> same proposed join would be performed at most three times by
> join_is_legal().
>
> I'm thinking of another idea that, instead of using a separate
> RelOptInfo for the grouped rel, we add in RelOptInfo a
> 'grouped_pathlist' for the grouped paths, just like how we implement
> parallel query via adding 'partial_pathlist'.
>
> For base rel, if we decide it can produce grouped paths, we create the
> grouped paths by applying partial aggregation to the paths in 'pathlist'
> and add them into 'grouped_pathlist'.
>
> For join rel (AB), we can create the grouped paths for it by:
> 1) joining a grouped path from 'A->grouped_pathlist' to a plain path
> from 'B->pathlist', or vice versa;
> 2) applying partial aggregation to the paths in '(AB)->pathlist'.
>
> This is basically the same idea, just moves the grouped paths from the
> grouped rel to a 'grouped_pathlist'. With it we should not need to make
> any code changes to 'make_join_rel()'. Most code changes would happen in
> 'add_paths_to_joinrel()'.
>
> Will this idea work? Is it better or worse?

I tried such an approach in an earlier version of the patch [1], and I think
the reason also was to avoid repeated calls of functions like
join_is_legal(). However, there were objections against that approach,
e.g. [2], and I admit that it was more invasive than what the current patch
version does.
Perhaps we can cache the result of join_is_legal() that we get for the plain
relation, and use it for the grouped relation. I'll consider that. Thanks.

[1] https://www.postgresql.org/message-id/18007.1513957437%40localhost
[2] https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska-2 wrote
> Alvaro Herrera <alvherre@> wrote:
>
>> This stuff seems very useful. How come it sits unreviewed for so long?
>
> I think the review is hard for people who are not interested in the planner
> very much. And as for further development, there are a few design decisions
> that can hardly be resolved without Tom Lane's comments. Right now I recall
> two problems: 1) is the way I currently store RelOptInfo for the grouped
> relations correct?, 2) how should we handle types for which logical equality
> does not imply physical (byte-wise) equality?
>
> Fortunately it seems now that I'm not the only one who cares about 2), so
> this problem might be resolved soon:
>
> https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
>
> But 1) still remains.
>
> --
> Antonin Houska
> Web: https://www.cybertec-postgresql.com

Hello

Would "pgsql: Add equalimage B-Tree support functions."
https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org
help for 2)?

Regards
PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
legrand legrand <legrand_legrand@hotmail.com> wrote:

> Antonin Houska-2 wrote
> > Right now I recall two problems: 1) is the way I currently store
> > RelOptInfo for the grouped relations correct?, 2) how should we handle
> > types for which logical equality does not imply physical (byte-wise)
> > equality?
> >
> > Fortunately it seems now that I'm not the only one who cares about 2), so
> > this problem might be resolved soon:
> >
> > https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> >
> > But 1) still remains.
>
> Hello
> would "pgsql: Add equalimage B-Tree support functions."
> https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org
> help for 2)?

Yes, it seems so. I'll adapt the patch soon, hopefully next week. Thanks for
the reminder.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Thu, Feb 27, 2020 at 4:50 PM Antonin Houska <ah@cybertec.at> wrote:
legrand legrand <legrand_legrand@hotmail.com> wrote:
> Antonin Houska-2 wrote
> > Right now I recall two problems: 1) is the way I currently store
> > RelOptInfo for the grouped relations correct?, 2) how should we handle
> > types for which logical equality does not imply physical (byte-wise)
> > equality?
> >
> > Fortunately it seems now that I'm not the only one who cares about 2), so
> > this
> > problem might be resolved soon:
> >
> > https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> >
> > But 1) still remains.
> >
>
> Hello
> would "pgsql: Add equalimage B-Tree support functions."
> https://www.postgresql.org/message-id/E1j72NY-0002gi-2B@gemulon.postgresql.org
Yes, it seems so. I'll adapt the patch soon, hopefully next week. Thanks for
the reminder.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi Antonin:
The more tests on your patch, the more powerful I feel it is! At the same time,
I think the most difficult part of understanding your design is that it can
accept any number of generic join clauses, so I guess more explanation on this
part may be helpful.
At the code level, I did some slight changes on init_grouping_targets which may
make the code easier to read. You are free to use/not use it.
Hope your patch gets more attention soon!
Best Regards
Andy Fan
Attachment
> 1) v14-0001-Introduce-RelInfoList-structure.patch
> -------------------------------------------------
>
> - I'm not entirely sure why we need this change. We had the list+hash
> before, so I assume we do this because we need the output functions?
I believe that this is what Tom proposed in [1], see "Maybe an appropriate
preliminary patch is to refactor the support code ..." near the end of that
message. The point is that now we need two lists: one for "plain" relations
and one for grouped ones.
I guess what Tom suggested [1] is to store the grouped ones into
root->upper_rels rather than a separate member; see fetch_upper_rel
or UpperRelationKind. If we do need the list+hash method for long list lookup,
we can merge it into upper_rels. If we want these benefits at places other
than root->upper_rels, we can store a generic type in RelInfoList; currently
it looks bound to RelAggInfo besides RelOptInfo. But overall, personally I
think we can ignore such optimization at the first stage to save the attention
of the core reviewers, since it is too precious :) Just FYI

[1] https://www.postgresql.org/message-id/9726.1542577439@sss.pgh.pa.us
Best Regards
Andy Fan
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> The more tests on your patch, the more powerful I feel it is!

Thanks for the appreciation. Given the poor progress it's increasingly hard
for me to find motivation to work on it. I'll try to be more professional :-)

> At the same time, I think the most difficult part to understand your design
> is you can accept any number of generic join clauses, so I guess more
> explanation on this part may be helpful.

ok, I'll consider adding some comments, although the concept is mentioned in
optimizer/README:

+Furthermore, extra grouping columns can be added to the partial Agg node if a
+join clause above that node references a column which is not in the query
+GROUP BY clause and which could not be derived using equivalence class.
+ ...

> At the code level, I did some slight changes on init_grouping_targets which
> may make the code easier to read. You are free to use/not use it.

I'm going to accept your change of create_rel_agg_info(), but I hesitate about
the changes to init_grouping_targets().

First, is it worth it to spend CPU cycles on construction of an extra list
grouping_columns? Is there a corner case in which we cannot simply pass
grouping_columns=target->exprs to check_functional_grouping()?

Second, it's obvious that you prefer the style

    foreach ()
    {
        if ()
            ...
        else if ()
            ...
        else
            ...
    }

over this

    foreach ()
    {
        if ()
        {
            ...
            continue;
        }

        if ()
        {
            ...
            continue;
        }

        ...
    }

I often prefer the latter, and I see that the existing planner code uses this
style quite often too. I think the reason is that it allows for more complex
tests, while the "else-if style" requires all tests to take place inside the
"if ()" expression. However, if several (not necessarily tightly related)
tests become "compressed" this way, it's less obvious where to put
comments. Indeed it seems that some comments got lost due to your changes.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> > 1) v14-0001-Introduce-RelInfoList-structure.patch
> > -------------------------------------------------
> >
> > - I'm not entirely sure why we need this change. We had the list+hash
> > before, so I assume we do this because we need the output functions?
>
> > I believe that this is what Tom proposed in [1], see "Maybe an appropriate
> > preliminary patch is to refactor the support code ..." near the end of
> > that message. The point is that now we need two lists: one for "plain"
> > relations and one for grouped ones.
>
> I guess what Tom suggested [1] is to store the grouped ones into
> root->upper_rels rather than a separate member; see fetch_upper_rel
> or UpperRelationKind. If we do need the list+hash method for long list
> lookup, we can merge it into upper_rels. If we want these benefits at places
> other than root->upper_rels, we can store a generic type in RelInfoList;
> currently it looks bound to RelAggInfo besides RelOptInfo. But overall,
> personally I think we can ignore such optimization at the first stage to
> save the attention of the core reviewers, since it is too precious :)
> Just FYI
>
> [1] https://www.postgresql.org/message-id/9726.1542577439@sss.pgh.pa.us

Hm, you seem to be right; not sure why I missed the point. I thought that the
reason Tom doesn't like storing the grouped relations in simple_rel_array is
that we only need the grouped base relations inside query_planner(), but
simple_rel_array is used higher in the stack. So I introduced a new field and
used it only in query_planner() and subroutines.

Yes, it's better to use root->upper_rels than to introduce the new field.
I'll adjust the patch. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Fri, Apr 24, 2020 at 8:10 PM Antonin Houska <ah@cybertec.at> wrote:
Andy Fan <zhihui.fan1213@gmail.com> wrote:
> The more tests on your patch, the more powerful I feel it is!
Thanks for the appreciation. Given the poor progress it's increasingly hard
for me to find motivation to work on it. I'll try to be more professional :-)
Let's not give up :) I see your patch can push down the aggregation in 3 cases
at least:
1). The group by clause exists in the join eq clause.
2). The group by clause doesn't exist in the join eq clause, but we have a
pk on one side of the join eq clause.
3). The group by clause doesn't exist in the join eq clause, and we don't have
the pk either.
Tom explained the correctness of the first 2 cases well [1], and probably case
3) is correct as well, but it is a bit hard to understand. If this is the case
for others as well, that probably makes the review harder.
So my little suggestion is: can we split the patch into some smaller commits
to handle each case? Like: commits 1 & 2 handle cases 1 & 2, commit 3 handles
join relations as well, commit 4 handles case 3. Commit 5 can avoid the
two-phase aggregation for case 2. Commit 6 introduces the aggmultifn, and
commit 7 handles some outer join cases Ashutosh raised at [2]. However, not
all the cases need to be handled at one time.
Actually when I read the code, I wanted such a separation on my own to verify
that my understanding is correct, but I don't have the time, at least for now.
It may be much easier for you since you are the author. I will be pretty
pleased to help in any case after I close my current working item (hopefully
in 2 months).
> At the code level, I did some slight changes on init_grouping_targets which may
> make the code easier to read. You are free to use/not use it.
I'm going to accept your change of create_rel_agg_info(), but I hesitate about
the changes to init_grouping_targets().
First, is it worth it to spend CPU cycles on construction of an extra list
grouping_columns? Is there a corner case in which we cannot simply pass
grouping_columns=target->exprs to check_functional_grouping()?
Second, it's obvious that you prefer the style
foreach ()
{
if ()
...
else if ()
...
else
...
}
over this
foreach ()
{
if ()
{
...
continue;
}
if ()
{
...
continue;
}
...
}
I often prefer the latter and I see that the existing planner code uses this
style quite often too. I think the reason is that it allows for more complex
tests, while the "else-if style" requires all tests to take place inside the
"if ()" expression. However, if several (not necessarily tightly related)
tests become "compressed" this way, it's less obvious how where to put
comments. Indeed it seems that some comments got lost due to your changes.
Your explanation looks reasonable, thanks for that. The changes also used
some built-in functions, like tlist_member, to avoid one more level of for-loop.
As for the high-level design, based on my current knowledge, there is probably
no need to change it.
Andy Fan <zhihui.fan1213@gmail.com> wrote:

> Let's not give up :) I see your patch can push down the aggregation in 3
> cases at least:
>
> 1). The group by clause exists in the join eq clause.
> 2). The group by clause doesn't exist in the join eq clause, but we have a
> pk on one side of the join eq clause.
> 3). The group by clause doesn't exist in the join eq clause, and we don't
> have the pk either.
>
> Tom well explained the correctness of the first 2 cases [1]

I admit that I don't quite understand his statement

    We can't simply aggregate during the scan of "a", because some values of
    "x" might not appear at all in the input of the naively-computed aggregate
    (if their rows have no join partner in "b"), or might appear multiple
    times (if their rows have multiple join partners).

Consider the following tables

    =======    ======
       A         B
    =======    ======
     x | y       z
    -------    ------
     3 | 4       1
     1 | 4       2
     4 | 0       3
     1 | 0       4

and this query

    SELECT sum(a.x), a.y
    FROM a, b
    WHERE a.y > b.z
    GROUP BY a.y;

Without aggregate push-down, the aggregation input is

    a.x | a.y | b.z
    ----------------
      3 |   4 |   1
      3 |   4 |   2
      3 |   4 |   3
      1 |   4 |   1
      1 |   4 |   2
      1 |   4 |   3

and the query result is

    sum(a.x) | a.y
    ---------------
          12 |   4

With the push-down, "A" aggregated looks like

    sum(a.x) | a.y
    ---------------
           4 |   4
           5 |   0

and the join with "B" produces this:

    sum(a.x) | a.y | b.z
    ---------------------
           4 |   4 |   1
           4 |   4 |   2
           4 |   4 |   3

After that, the final aggregation yields the same result that we got w/o the
aggregate push-down:

    sum(a.x) | a.y
    ---------------
          12 |   4

So if a row of "A" has no / multiple join partners in "B", the same is true
for the group produced by the pushed-down aggregation.
Of course it's assumed that the input rows belonging to a particular group can
be distributed among the PartialAgg nodes in an arbitrary way. For example:

    Agg
        <value 1>
        <value 1>
        <value 2>
        <value 2>

needs to be equivalent to

    FinalAgg
        PartialAgg
            <value 1>
            <value 2>
        PartialAgg
            <value 1>
            <value 2>

This should be o.k. because the partial aggregation performed currently by
parallel workers relies on the same "distributive property" of aggregation.

> So my little suggestion is: can we split the patch into some smaller
> commits to handle each case? Like: commits 1 & 2 handle cases 1 & 2,
> commit 3 handles join relations as well, commit 4 handles case 3.

To separate the patch this way effectively means to rework the feature almost
from scratch. I'm still not convinced that such a rework is necessary.

> Commit 5 can avoid the two-phase aggregation for case 2.

The two-phase aggregation can be avoided in a more generic way as soon as the
"unique key" feature (I think you did participate in the related discussion)
is committed.

> Commit 6 introduces the aggmultifn.

I used this function in the initial version of the patch [3], to join two
grouped relations. The current version does not try to join two grouped
relations, and it's hard to imagine how to determine the "multiplication
factor". I suppose this is not really trivial. Since the current version of
the feature hasn't found its way into PG core for several years, I'm not
willing to make it even trickier soon, if at all.

> and commit 7 to handle some outer join case Ashutosh raised at [2].

That was too high-level design, as he admits in [4].

> However not all the cases need to be handled at one time.

Sure.

> > At the code level, I did some slight changes on init_grouping_targets
> > which may make the code easier to read. You are free to use/not use it.
>
> I'm going to accept your change of create_rel_agg_info(), but I hesitate
> about the changes to init_grouping_targets().
> > First, is it worth it to spend CPU cycles on construction of an extra
> > list grouping_columns? Is there a corner case in which we cannot simply
> > pass grouping_columns=target->exprs to check_functional_grouping()?
> >
> > Second, it's obvious that you prefer the style
> >
> >     foreach ()
> >     {
> >         if ()
> >             ...
> >         else if ()
> >             ...
> >         else
> >             ...
> >     }
> >
> > over this
> >
> >     foreach ()
> >     {
> >         if ()
> >         {
> >             ...
> >             continue;
> >         }
> >
> >         if ()
> >         {
> >             ...
> >             continue;
> >         }
> >
> >         ...
> >     }
> >
> > I often prefer the latter, and I see that the existing planner code uses
> > this style quite often too. I think the reason is that it allows for more
> > complex tests, while the "else-if style" requires all tests to take place
> > inside the "if ()" expression. However, if several (not necessarily
> > tightly related) tests become "compressed" this way, it's less obvious
> > where to put comments. Indeed it seems that some comments got lost due to
> > your changes.
>
> Your explanation looks reasonable, thanks for that. The changes also used
> some built-in functions, like tlist_member, to avoid one more level of
> for-loop.

Eventually I've reworked init_grouping_targets(), although not exactly as you
proposed. I hope it's easier to understand now.

> [1] https://www.postgresql.org/message-id/9726.1542577439%40sss.pgh.pa.us
> [2] https://www.postgresql.org/message-id/flat/CAFjFpRdpeMTd8kYbM_x0769V-aEKst5Nkg3%2BcoG%3D8ki7s8Zqjw%40mail.gmail.com

[3] https://www.postgresql.org/message-id/29111.1483984605@localhost
[4] https://www.postgresql.org/message-id/CAFjFpRfUKq3Asgtki1XctPkCN6YC4oA2vNWh66OgBo-H26ePWA%40mail.gmail.com

The next version is attached.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachment
> On 19 May 2020, at 10:17, Antonin Houska <ah@cybertec.at> wrote:
>
> The next version is attached.

This version now fails to apply to HEAD, with what looks like a trivial error
in the expected test output. Can you please submit a rebased version so we can
see it run in the patch tester CI? I'm marking the entry as Waiting on Author
in the meantime.

cheers ./daniel
On Thu, 2020-07-02 at 14:39 +0200, Daniel Gustafsson wrote:
> This version now fails to apply to HEAD, with what looks like a trivial
> error in the expected test output. Can you please submit a rebased version
> so we can see it run in the patch tester CI? I'm marking the entry as
> Waiting on Author in the meantime.

I have rebased the patch against current HEAD; it passes "make installcheck".

Yours,
Laurenz Albe
Attachment
Hi everyone,

I develop PostgreSQL extensions, such as FDWs, in my work. I'm interested in
using PostgreSQL for OLAP. I think that this patch is really useful for OLAP
queries. Furthermore, I think it would be even more useful if this patch
worked on foreign tables. Actually, I changed this patch a little and
confirmed that my idea is true. The following are things I found, and the
differences between my prototype and this patch.

1. Things I found

I executed a query which contains a join of a postgres_fdw foreign table and a
local table, and aggregation of the join result. In my setting, my prototype
reduced this query's response time by 93%.

2. Differences between my prototype and this patch

(1) Aggregation on a foreign table is pushed down if the FDW can push down
    partial aggregation.
(2) postgres_fdw pushes down some partial aggregations.

I attached my prototype source code and the content of my experiment. I want
to resume development of this patch if there is some possibility of its
functionality being accepted. I contacted Mr. Houska about resuming
development of this patch. As a result, Mr. Houska advised me to ask in
pgsql-hackers whether any reviewers / committers are interested in working on
the patch.

Is anyone interested in my work?

Sincerely yours.

Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation