From: Antonin Houska
Subject: Re: [HACKERS] PoC: Grouped base relation
Date:
Msg-id: 23667.1484824945@localhost
In response to: Re: [HACKERS] PoC: Grouped base relation (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] PoC: Grouped base relation; Re: [HACKERS] PoC: Grouped base relation
List: pgsql-hackers
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> On 01/17/2017 08:05 PM, Antonin Houska wrote:
> > Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >
> >> Another thing is that in my experience most queries do joins on
> >> foreign keys (so the PK side is unique by definition), so the benefit
> >> on practical examples is likely much smaller.
> >
> > ok. So in some cases David's approach might be better.
>
> In which cases would David's approach be more efficient? But even if
> there are such cases, I assume we could generate both paths and decide
> based on cost, just like with all other alternative paths.

Sorry, it was my thinko - I somehow confused David's CROSS JOIN example
with this one. If one side of the join clause is unique and the other
becomes unique due to aggregation (and if parallel processing is not
engaged), then neither combinefn nor multiplyfn should be necessary
before the finalfn.
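To make that concrete, here is a plain-SQL sketch of the two situations
(my illustration, using the products / sales / reviews tables from the
example quoted below; the rewrites merely mimic what the grouped
relations compute):

-- Unique join partner: products.id is a PK and the grouping makes
-- sales unique on product_id, so each partial sum joins exactly one
-- row and can go straight to the final function.
SELECT p.id, s.price_sum
FROM products p
JOIN (SELECT product_id, sum(price) AS price_sum
      FROM sales
      GROUP BY product_id) s ON s.product_id = p.id;

-- Non-unique join partner: each partial sum matches cnt rows on the
-- other side, so it has to be scaled up - the role of multiplyfn.
SELECT s.product_id, s.price_sum * r.cnt AS price_sum
FROM (SELECT product_id, sum(price) AS price_sum
      FROM sales
      GROUP BY product_id) s
JOIN (SELECT product_id, count(*) AS cnt
      FROM reviews
      GROUP BY product_id) r ON r.product_id = s.product_id;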
> > However I think the ability to join 2 grouped (originally non-unique)
> > relations is still important. Consider a query involving "sales" as
> > well as another table which also has a many-to-one relationship to
> > "products".
>
> Well, can you give a practical example? What you describe seems like a
> combination of two fact tables + a dimension, something like this:
>
> CREATE TABLE products (
>     id SERIAL PRIMARY KEY,
>     name TEXT,
>     category_id INT,
>     producer_id INT
> );
>
> CREATE TABLE sales (
>     product_id INT REFERENCES products (id),
>     nitems INT,
>     price NUMERIC
> );
>
> CREATE TABLE reviews (
>     product_id INT REFERENCES products (id),
>     stars INT
> );
>
> But how exactly do you join that together? Because
>
> SELECT * FROM products p JOIN sales s ON (p.id = s.product_id)
>     JOIN reviews r ON (p.id = r.product_id)
>
> is clearly wrong - it's essentially an M:N join between the two fact
> tables, increasing the number of rows.

Without elaborating the details, I imagined a join condition between the
2 fact tables which references their non-unique columns, but that does
not make sense here. Sorry.

> It'd be helpful to have an example of a practical query optimized by
> the patch. I'm not claiming it does not exist, but I've been unable to
> come up with something reasonable at the moment.

As I mentioned elsewhere, remote aggregation is the primary use case.
Besides that, I can imagine another one - a join of partitioned tables
(costs are not displayed, just to make the plans easier to read). For
this query

SELECT p.id, sum(price)
FROM products AS p
JOIN sales AS s ON s.product_id = p.id
GROUP BY p.id

I get this plan under normal circumstances:

HashAggregate
  Group Key: p.id
  ->  Hash Join
        Hash Cond: (s.product_id = p.id)
        ->  Gather
              Workers Planned: 2
              ->  Append
                    ->  Parallel Seq Scan on sales s
                    ->  Parallel Seq Scan on sales_2015 s_1
                    ->  Parallel Seq Scan on sales_2016 s_2
                    ->  Parallel Seq Scan on sales_2017 s_3
        ->  Hash
              ->  Gather
                    Workers Planned: 2
                    ->  Append
                          ->  Parallel Seq Scan on products p
                          ->  Parallel Seq Scan on products_01 p_1
                          ->  Parallel Seq Scan on products_02 p_2
                          ->  Parallel Seq Scan on products_03 p_3
                          ->  Parallel Seq Scan on products_04 p_4

but if work_mem is too low for this hash join to be efficient, the
aggregation can be moved to the individual partitions:

Gather
  Workers Planned: 1
  Single Copy: true
  ->  Finalize HashAggregate
        Group Key: p.id
        ->  Hash Join
              Hash Cond: (p.id = s.product_id)
              ->  Append
                    ->  Partial HashAggregate
                          Group Key: p.id
                          ->  Seq Scan on products p
                    ->  Partial HashAggregate
                          Group Key: p_1.id
                          ->  Seq Scan on products_01 p_1
                    ->  Partial HashAggregate
                          Group Key: p_2.id
                          ->  Seq Scan on products_02 p_2
                    ->  Partial HashAggregate
                          Group Key: p_3.id
                          ->  Seq Scan on products_03 p_3
                    ->  Partial HashAggregate
                          Group Key: p_4.id
                          ->  Seq Scan on products_04 p_4
              ->  Hash
                    ->  Append
                          ->  Partial HashAggregate
                                Group Key: s.product_id
                                ->  Seq Scan on sales s
                          ->  Partial HashAggregate
                                Group Key: s_1.product_id
                                ->  Seq Scan on sales_2015 s_1
                          ->  Partial HashAggregate
                                Group Key: s_2.product_id
                                ->  Seq Scan on sales_2016 s_2
                          ->  Partial HashAggregate
                                Group Key: s_3.product_id
                                ->  Seq Scan on sales_2017 s_3

Finally, the patch should help if the aggregation argument is rather
expensive to evaluate. I refer to this part of the example plan that I
showed in the initial message of this thread (where the argument was
actually simple ...):

->  Gather
      Workers Planned: 2
      ->  Partial HashAggregate
            Group Key: a.i
            ->  Parallel Seq Scan on a

The first 2 cases show where the base relation grouping can help alone;
the last one combines the base relation grouping with parallel query
processing.

> >> But I guess my main question is if there are actual examples of
> >> queries the patch is trying to improve, or whether the general
> >> benefit is allowing parallel plans for queries where it would not be
> >> possible otherwise.
> >
> > In fact I did all this with postgres_fdw in mind.
>
> I assume there's not much difference between pushing down aggregates to
> local workers and to remote nodes. There'll be costing differences, but
> are there any other differences?

It's easier for me to see what these 2 things have in common than to
point out differences. One common thing is that aggregation takes place
in multiple stages (see the CREATE AGGREGATE sketch below). They can also
be similar in terms of implementation (e.g. by keeping paths in separate
lists), although some choices still have to be made.

As for doing the aggregation in parallel, I personally don't think that
"pushing the aggregate down to a worker" is the exact description. If the
parallel worker aggregates the base relation and the nearest Gather node
is at the top of the plan, then the worker actually performs the whole
plan, except for the Gather node.

And yes, costing needs to be considered. Besides estimating the inputs
(table width, but especially row count), some "policy decisions" might be
necessary, similar to those the planner applies to parameterized paths -
see the comments of add_path(). The grouped path should not be used where
a relatively small error in the estimates of the base relation
aggregation can lead to a significant error in the total plan costs.
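As an aside on those stages: since PostgreSQL 9.6, CREATE AGGREGATE can
declare the combine step explicitly. A minimal sketch mirroring the
built-in sum(int4) (my example; the multiplyfn discussed in this thread
would be the patch's addition on top - core only has the combine stage):

CREATE AGGREGATE my_sum (int4) (
    SFUNC = int4_sum,      -- transition: accumulate input into int8 state
    STYPE = int8,
    COMBINEFUNC = int8pl,  -- merge two partial states (the combinefn)
    PARALLEL = SAFE
);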
>>> CREATE TABLE products (
>>>     id SERIAL PRIMARY KEY,
>>>     name TEXT,
>>>     category_id INT,
>>>     producer_id INT
>>> );
>>>
>>> CREATE TABLE sales (
>>>     product_id INT REFERENCES products (id),
>>>     nitems INT,
>>>     price NUMERIC
>>> );
>>>
>>> A typical query then looks like this:
>>>
>>> SELECT category_id, SUM(nitems), SUM(price)
>>> FROM products p JOIN sales s ON (p.id = s.product_id)
>>> GROUP BY p.category_id;
>>>
>>> which obviously uses different columns for the grouping and join, and
>>> so the patch won't help with that. Of course, a query grouping by
>>> product_id would allow the patch to work.
>>
>> Right, the current version does not handle this. Thanks for the
>> suggestion.
>
> So you're saying it's merely a limitation of the initial patch version
> and not an inherent limitation?

I just haven't thought about this so far, but now that I try, I seem to
be missing something: p.category_id as the grouping column gives the
patch no chance to do anything useful, because only one table of the join
has that column. "product_id" does help, but gives a different result ...
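To spell out the distinction: grouping the query output by product_id
changes the result, but grouping sales by product_id as an intermediate
step below the join does not - a sum of per-product sums equals the
direct sum. A sketch of such a two-stage form (mine, not what the current
patch produces):

SELECT p.category_id, sum(s.nitems_sum), sum(s.price_sum)
FROM products p
JOIN (SELECT product_id,
             sum(nitems) AS nitems_sum,
             sum(price) AS price_sum
      FROM sales
      GROUP BY product_id) s ON p.id = s.product_id
GROUP BY p.category_id;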
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at