Partial Aggregation / GROUP BY before JOIN - Mailing list pgsql-hackers

From David Rowley
Subject Partial Aggregation / GROUP BY before JOIN
Msg-id CAKJS1f9kw95K2pnCKAoPmNw==7fgjSjC-82cy1RB+-x-Jz0QHA@mail.gmail.com
List pgsql-hackers
I've been spending time working on allowing the planner to perform aggregation before the final join relation is created.

Now, I should warn that at the moment the patch's status is WIP. There are still many things to work out, although many parts of it are, I think, fairly close to final.

The main purpose of this email is to get community approval on my proposed implementation. I'd also like to gather up any ideas that people may have about better ways to do certain things.

Let me start with a very quick example and overview of the problem that this is trying to solve (please skip to the proposed implementation section if you already know it):

This is likely best done by example, so let's say we have a "sale" table which contains millions of records of sold products. We also have a "product" table, which contains a small number of products that are sold.

A query such as:

SELECT p.product_id,p.description,sum(s.quantity)
FROM sale s
INNER JOIN product p ON p.product_id = s.product_id
GROUP BY p.product_id;

Currently, the planner will first join sale and product, and perform the grouping on the final joined relation. Now, if we assume that we have a small number of products, and that the sale table stores multiple records for each product, then this query would be better executed as if it had been written as:

SELECT p.product_id,p.description,s.quantity
FROM (SELECT product_id,SUM(quantity) AS quantity FROM sale GROUP BY product_id) s
INNER JOIN product p ON p.product_id = s.product_id;
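To make the intuition concrete, here is a small illustrative sketch in Python (not PostgreSQL code; the miniature tables are made up) showing that grouping before the join produces the same totals, while the join itself handles one row per product instead of one row per sale:

```python
from collections import defaultdict

# Hypothetical miniature tables: many sale rows, few products.
sale = [  # (sale_id, product_id, quantity)
    (1, 1, 5), (2, 1, 7), (3, 2, 3), (4, 2, 4), (5, 2, 1),
]
product = {1: "widget", 2: "gadget"}  # product_id -> description

# Join first, then group (what the planner does today).
join_then_group = defaultdict(int)
for _, pid, qty in sale:
    if pid in product:                  # the join
        join_then_group[pid] += qty     # the grouping

# Group first, then join (the proposed rewrite).
pre_grouped = defaultdict(int)
for _, pid, qty in sale:
    pre_grouped[pid] += qty             # grouping on "sale" alone
group_then_join = {pid: q for pid, q in pre_grouped.items() if pid in product}

assert join_then_group == group_then_join
# The join in the second plan sees 2 rows (one per product),
# not 5 rows (one per sale).
```

With millions of sale rows and a hundred products, the second plan's join processes a hundred rows instead of millions.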

However, I also want to demonstrate that this is not always the case. Imagine:

SELECT p.product_id,p.description,sum(s.quantity)
FROM sale s
INNER JOIN product p ON p.product_id = s.product_id
WHERE p.description = 'something highly selective'
GROUP BY s.product_id;

In this case, since the qual on p.description could not be pushed down into the subquery, it is more likely that the query is better executed by performing the join and then performing the grouping.

For this reason the planner must take into account the cost of both methods before it decides which one to use.

Proposed Implementation
----------------------------------

Given the costing considerations explained above, I'd like to propose that this patch introduce a new RelOptKind named RELOPT_ALTREL, which will allow the rel to be stored in PlannerInfo's simple_rel_array, but not have the relation considered in make_one_rel(). Instead, I propose that this "alternative relation" be tagged onto the RelOptInfo that it is an alternative of, in a new List field; when we perform the standard join search, whenever we consider a relation which has alternatives listed, we also consider each alternative relation. Having this as RELOPT_ALTREL instead of RELOPT_DEADREL will allow relation sizes to be gathered for these relations (e.g. in set_base_rel_sizes()). I believe this method may also help in the future with allowing materialized views to be used instead of base tables (once we have auto-updating MVs).

The idea here is that during planning, we will analyze the query, and check if all aggregate function parameters belong to just 1 relation, and we'll also check that all those aggregate functions have a combine function set (which is a concept that this patch introduces). We'll also check to ensure that the aggregates don't have an ORDER BY or DISTINCT. If the aggregates are found to be suitable, then we'll construct a subquery which performs the required grouping on the single relation.

Now, there are also more complex cases where the GROUP BY clause contains Vars from a relation other than the one seen in the aggregate function's parameters. I plan to handle this too. I don't want to bloat this email with the complexities of that case, but I do need to mention that the subquery may only be able to partially group the tuples, and that a final aggregate stage may need to happen too. For example:

SELECT p.description,sum(s.quantity)
FROM sale s
INNER JOIN product p ON p.product_id = s.product_id
GROUP BY p.description;

Here we might want to group sales on s.product_id, pass those partially grouped results down to be joined to "product" and then perform the final grouping on p.description. This may or may not further reduce the number of groups. For this reason we need a partial grouping stage and also a final grouping stage, and to allow this we need to be able to instruct the subquery to only perform partial grouping (i.e. send us the non-finalised aggregate states without the aggregate's final function being called on them).  To implement this I've made various changes to nodeAgg.c to allow 4 possible modes of aggregation, although only 3 of these are currently used.
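The partial-then-final pipeline described above can be sketched in Python (an illustrative model, not the patch's C code; the data is made up so that two product_ids share one description and the final stage genuinely has to combine states):

```python
from collections import defaultdict

# Hypothetical data: product_ids 1 and 2 share the description "red",
# so the final grouping on description must merge partial states.
sale = [(1, 10), (1, 20), (2, 5), (3, 7)]   # (product_id, quantity)
product = {1: "red", 2: "red", 3: "blue"}   # product_id -> description

# Partial Aggregate: group "sale" on product_id, keeping raw sum()
# states (no final function applied).
partial = defaultdict(int)
for pid, qty in sale:
    partial[pid] += qty                     # transition function

# Join the partially grouped rows to "product".
joined = [(product[pid], state) for pid, state in partial.items()]

# Finalize Aggregate: regroup on description, calling the combine
# function (for sum, simply addition of two states) rather than
# the transition function.
final = defaultdict(int)
for desc, state in joined:
    final[desc] += state                    # combine function

assert final == {"red": 35, "blue": 7}
```

For sum() the combine function happens to look like the transition function, but for something like avg() the two are genuinely different, which is why the combine function is a distinct concept in the patch.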

These modes are controlled by 2 new boolean parameters:

bool combineStates; /* input tuples contain transition states */
bool finalizeAggs; /* should we call the finalfn on agg states? */
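The two booleans can be modelled in Python as follows (an illustrative sketch, not the nodeAgg.c code; the function names transfn/combinefn/finalfn mirror PostgreSQL's aggregate-support terminology, and avg() is used because its transition state, a (sum, count) pair, differs from its final result):

```python
def transfn(state, value):        # consume a raw input value
    s, c = state
    return (s + value, c + 1)

def combinefn(state, other):      # merge two transition states
    return (state[0] + other[0], state[1] + other[1])

def finalfn(state):               # produce the user-visible result
    s, c = state
    return s / c if c else None

def agg(inputs, combine_states, finalize_aggs):
    """Toy aggregate node driven by the two proposed flags."""
    state = (0, 0)
    step = combinefn if combine_states else transfn
    for x in inputs:
        state = step(state, x)
    return finalfn(state) if finalize_aggs else state

values = [1, 2, 3, 4, 5, 6]

# Plain aggregate: combineStates=false, finalizeAggs=true.
plain = agg(values, False, True)

# Two Partial Aggregates (finalizeAggs=false) feeding a
# Finalize Aggregate (combineStates=true).
part1 = agg(values[:3], False, False)
part2 = agg(values[3:], False, False)
final = agg([part1, part2], True, True)

assert plain == final == 3.5
```

The fourth combination (combineStates=true, finalizeAggs=false) would combine states without finalizing, which is the mode the email notes is not yet used.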

So, when we generate this subquery to perform the partial aggregation for us, we need to tell it not to perform the finalise stage. In the attached patch this is done by adding a new parameter to subquery_planner():

Plan *
subquery_planner(PlannerGlobal *glob, Query *parse,
                 PlannerInfo *parent_root,
                 bool hasRecursion, bool finalize_aggregates,
                 double tuple_fraction, PlannerInfo **subroot)

The main query must set combineStates to true if a subquery RTE exists which has been marked finalize_aggregates = false. This instructs the top-level nodeAgg to call the combine function rather than the transition function.

Status of current patch
------------------------------

1. I believe that I have made all the changes required to nodeAgg.c which supports this partial and final aggregation concept. (Likely this needs more testing, but there's other priorities)

2. The EXPLAIN changes are complete. When combineStates is true we see "Finalize Aggregate", and when finalizeAggs is false we see "Partial Aggregate". If finalizeAggs is true and combineStates is false, then we don't see any changes to the EXPLAIN output for Aggregate nodes.

3. The Alternative relation stuff and the changes to the standard join search are not done yet. I want to get 4 finished before I start on this, and also want some feedback about the idea.

4. Currently, the code which constructs the subquery that is to perform the aggregation is very much a dumping ground of unfinished and quite broken code that only works for a handful of cases so far. I keep hacking away at this, trying to find a nice neat way to do it, but so far I've only managed to get what you see in joinpath.c (which of course is almost certainly not the final resting place for this code). At the moment I'm just adding the new subquery rel and removing the original version from the joinlist. The final version won't do that part, but doing it this way is good for testing, as it always performs Group before Join when that's possible.

The patch is however so far capable of giving us extremely nice performance improvements for some (likely artificial) queries.

Let's look at a quick example:

CREATE TABLE product (product_id INT NOT NULL,product_code VARCHAR(64) NOT NULL, PRIMARY KEY(product_id));
CREATE UNIQUE INDEX product_product_code_uidx ON product (product_code);
-- create small list of products
INSERT INTO product SELECT g.id,'ABC' || CAST(g.id AS TEXT) FROM generate_series(1,100) g(id);

CREATE TABLE sale (sale_id INT NOT NULL, product_id INT NOT NULL, quantity INT NOT NULL);

INSERT INTO sale (sale_id, product_id,quantity) SELECT x.x,x.x%100+1,CAST(random() * 1000 AS INT) FROM generate_series(1,100000000) x(x);

ALTER TABLE sale ADD CONSTRAINT sale_pkey PRIMARY KEY(sale_id);
 
test=# SELECT count(sale.sale_id) FROM sale, product;
    count
-------------
 10000000000
(1 row)
Time: 10323.053 ms


And if I disable the optimisation:

test=# set enable_earlygrouping = off;
SET
Time: 0.302 ms
test=# SELECT count(sale.sale_id) FROM sale, product;
    count
-------------
 10000000000
(1 row)
Time: 775790.764 ms

So, for this probably rather unlikely query, we get roughly a 75-fold (~7500%) performance increase. Of course, as the ratio of groups to underlying tuples increases, the performance gain will tail off.

The explain output from the optimised version is as follows:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1790544.37..1790544.38 rows=1 width=4)
   ->  Nested Loop  (cost=1790541.10..1790544.12 rows=100 width=4)
         ->  Partial Aggregate  (cost=1790541.10..1790541.11 rows=1 width=4)
               ->  Seq Scan on sale  (cost=0.00..1540541.08 rows=100000008 width=4)
         ->  Seq Scan on product  (cost=0.00..2.00 rows=100 width=0)


I also know that Tom is making a seriously big change to allow subquery path-ification. I believe my changes in the areas he'll be touching are quite small, and I'm not predicting too big a conflict when his changes are pushed. I may of course be wrong about that.

Any constructive comments are welcome.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
