Re: [HACKERS] PoC: Grouped base relation - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PoC: Grouped base relation
Msg-id cc823e89-3fbc-f94e-b9d4-9c713b044b5d@2ndquadrant.com
In response to Re: [HACKERS] PoC: Grouped base relation  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On 01/17/2017 12:42 AM, David Rowley wrote:
> On 10 January 2017 at 06:56, Antonin Houska <ah@cybertec.at> wrote:
>> Before performing the final aggregation, we need to multiply sum(a.x) by
>> count(j.*) because w/o the aggregation at base relation level the input
>> of the query-level aggregation would look like
>>
>>  a.i | a.x | b.j
>> ----------------
>>  1   |   3 |  1
>>  1   |   4 |  1
>>  1   |   3 |  1
>>  1   |   4 |  1
>>
>> In other words, grouping of the base relation "b" below the join prevents the
>> join from bringing per-group input set to the aggregate input multiple
>> times. To compensate for this effect, I've added a new field "aggtransmultifn"
>> to pg_aggregate catalog. It "multiplies" the aggregate state w/o repeated
>> processing of the same input set many times. sum() is an example of an
>> aggregate that needs such processing, avg() is one that does not.
>
> First off, I'd like to say that making improvements in this area
> sounds like a great thing. I'm very keen to see progress here.
>

+1

Pushing down aggregates would be very useful for analytical queries.
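
As an aside, the distinction Antonin draws between sum() and avg() is easy 
to see by duplicating the input rows, which is exactly what a join does to 
a group (plain SQL, no patch needed to try this):

```sql
-- a join that repeats each group's rows n times scales sum() by n, but
-- leaves avg() unchanged (its state scales sum and count together),
-- which is why sum() needs aggtransmultifn while avg() does not
SELECT sum(x), avg(x) FROM (VALUES (3), (4)) v(x);
-- sum = 7, avg = 3.5

SELECT sum(x), avg(x) FROM (VALUES (3), (4), (3), (4)) v(x);
-- sum = 14, avg = 3.5
```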
>
> I've been thinking about this aggtransmultifn and I'm not sure if it's
> really needed. Adding a whole series of new transition functions is
> quite a pain. At least I think so, and I have a feeling Robert might
> agree with me.
>
> Let's imagine some worst case (and somewhat silly) aggregate query:
>
> SELECT count(*)
> FROM million_row_table
> CROSS JOIN another_million_row_table;
>
> Today that's going to cause 1 TRILLION transitions! Performance will
> be terrible.
>
> If we pushed the aggregate down into one of those tables and performed
> a Partial Aggregate on that, then a Finalize Aggregate on that single
> row result (after the join), then that's 1 million transfn calls, and
> 1 million combinefn calls, one for each row produced by the join.
>
> If we did it your way (providing I understand your proposal correctly)
> there's 1 million transfn calls on one relation, then 1 million on the
> other and then 1 multiplyfn call. which does 1000000 * 1000000
>
> What did we save vs. using the existing aggcombinefn infrastructure
> which went into 9.6? Using this actually costs us 1 extra function
> call, right? I'd imagine the size of the patch to use aggcombinefn
> instead would be a fraction of the size of the one which included all
> the new aggmultiplyfns and pg_aggregate.h changes.
>

I think the patch relies on the assumption that the grouping reduces 
cardinality, so a CROSS JOIN without a GROUP BY clause may not be the 
best counterexample.

Let's instead use an example similar to what Antonin mentioned in the 
initial post - two tables, with two columns each.
    CREATE TABLE t1 (a INT, b INT);
    CREATE TABLE t2 (c INT, d INT);

And let's assume each table has 100.000 rows, but only 100 groups in the 
first column, with 1000 rows per group. Something like
    INSERT INTO t1
    SELECT mod(i,100), i FROM generate_series(1, 1e5) s(i);

    INSERT INTO t2
    SELECT mod(i,100), i FROM generate_series(1, 1e5) s(i);

And let's assume this query

SELECT t1.a, count(t2.d) FROM t1 JOIN t2 ON (t1.a = t2.c) GROUP BY t1.a;

On master, EXPLAIN (COSTS OFF, TIMING OFF, ANALYZE) looks like this:
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate (actual rows=100 loops=1)
   Group Key: t1.a
   ->  Hash Join (actual rows=100000000 loops=1)
         Hash Cond: (t2.c = t1.a)
         ->  Seq Scan on t2 (actual rows=100000 loops=1)
         ->  Hash (actual rows=100000 loops=1)
               Buckets: 131072  Batches: 2  Memory Usage: 2716kB
               ->  Seq Scan on t1 (actual rows=100000 loops=1)
 Planning time: 0.167 ms
 Execution time: 17005.300 ms
(10 rows)

while with the patch it looks like this
                             QUERY PLAN
---------------------------------------------------------------------
 Finalize HashAggregate (actual rows=100 loops=1)
   Group Key: t1.a
   ->  Hash Join (actual rows=100 loops=1)
         Hash Cond: (t1.a = t2.c)
         ->  Partial HashAggregate (actual rows=100 loops=1)
               Group Key: t1.a
               ->  Seq Scan on t1 (actual rows=100000 loops=1)
         ->  Hash (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 14kB
               ->  Partial HashAggregate (actual rows=100 loops=1)
                     Group Key: t2.c
                     ->  Seq Scan on t2 (actual rows=100000 loops=1)
 Planning time: 0.105 ms
 Execution time: 31.609 ms
(14 rows)

This of course happens because with the patch we run the transition 
function 100k-times on each side of the join (200k calls in total) and 
aggtransmultifn on the 100 rows produced by the join, while on master 
the join produces 100.000.000 rows (which already takes much more time), 
and we then have to run the transition function on all those rows.

The performance difference is pretty obvious, I guess.
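
The plan above is roughly what one would get by rewriting the query by 
hand, aggregating each relation separately and multiplying the partial 
counts at the join:

```sql
-- count(t2.d) per output group = (t1 rows in the group) * (non-null
-- t2.d values in the group); aggtransmultifn computes this in one step
SELECT t1g.a, t1g.n * t2g.n AS count
FROM (SELECT a, count(*) AS n FROM t1 GROUP BY a) t1g
JOIN (SELECT c, count(d) AS n FROM t2 GROUP BY c) t2g ON (t1g.a = t2g.c);
```

With the tables above this produces 100 rows with count = 1.000.000 each 
(1000 * 1000), matching the original query.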
>
> There's already a lot of infrastructure in there to help you detect
> when this optimisation can be applied. For example,
> AggClauseCosts.hasNonPartial will be true if any aggregates don't have
> a combinefn, or if there's any DISTINCT or ORDER BY aggregates,
> which'll also mean you can't apply the optimisation.
>
> It sounds like a much more manageable project by using aggcombinefn
> instead. Then maybe one day when we can detect if a join did not cause
> any result duplication (i.e Unique Joins), we could finalise the
> aggregates on the first call, and completely skip the combine state
> altogether.
>

I don't quite see how the patch could use aggcombinefn without 
sacrificing a lot of the benefits. Sure, it's possible to run the 
aggcombinefn in a loop (with number of iterations determined by the 
group size on the other side of the join), but that sounds pretty 
expensive and eliminates the reduction of transition function calls. The 
join cardinality would still be reduced, though.
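
For sum() the two approaches are arithmetically equivalent anyway; the 
loop merely turns one multiplication into n additions. With a group 
multiplicity of 3, for example:

```sql
-- "multiplying" the partial state once vs. combining it three times
SELECT sum(x) * 3              AS via_aggtransmultifn,
       sum(x) + sum(x) + sum(x) AS via_combinefn_loop
FROM (VALUES (1), (2), (4)) v(x);
-- both columns yield 21
```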

I do have another question about the patch, however. It seems to rely on 
the fact that the grouping and the join both reference the same columns. 
I wonder how uncommon such queries are.

To give a reasonable example, imagine the typical star schema, which is 
pretty standard for large analytical databases. The dimension table is 
"products" and the fact table is "sales", and the schema might look like 
this:

CREATE TABLE products (
    id            SERIAL PRIMARY KEY,
    name          TEXT,
    category_id   INT,
    producer_id   INT
);

CREATE TABLE sales (
    product_id    INT REFERENCES products (id),
    nitems        INT,
    price         NUMERIC
);

A typical query then looks like this:
    SELECT category_id, SUM(nitems), SUM(price)
    FROM products p JOIN sales s ON (p.id = s.product_id)
    GROUP BY p.category_id;

which obviously uses different columns for the grouping and the join, and 
so the patch won't help with that. Of course, a query grouping by the 
join column product_id would allow the patch to work:

    SELECT s.product_id, SUM(nitems), SUM(price)
    FROM products p JOIN sales s ON (p.id = s.product_id)
    GROUP BY s.product_id;
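
A hand-written version of that plan would aggregate "sales" below the 
join; and because products.id is unique, each partial group joins to 
exactly one product row, so no multiplication of the aggregate state is 
needed:

```sql
-- pre-aggregate the fact table on the join column, then join the
-- (already unique) dimension table on its primary key
SELECT s.product_id, s.nitems_sum, s.price_sum
FROM (SELECT product_id,
             SUM(nitems) AS nitems_sum,
             SUM(price)  AS price_sum
      FROM sales
      GROUP BY product_id) s
JOIN products p ON (p.id = s.product_id);
```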

Another thing is that in my experience most queries do joins on foreign 
keys (so the PK side is unique by definition), so the benefit in 
practical examples is likely much smaller.

But I guess my main question is if there are actual examples of queries 
the patch is trying to improve, or whether the general benefit is 
allowing parallel plans for queries where it would not be possible 
otherwise.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


