Re: Push down Aggregates below joins - Mailing list pgsql-hackers

From Antonin Houska
Subject Re: Push down Aggregates below joins
Date
Msg-id 18751.1533307811@localhost
Whole thread Raw
In response to Re: Push down Aggregates below joins  (Antonin Houska <ah@cybertec.at>)
Responses Re: Push down Aggregates below joins
List pgsql-hackers
Antonin Houska <ah@cybertec.at> wrote:

> I didn't have enough time to separate "your functionality" and can do it when
> I'm back from vacation.

So I've separated the code that does not use the 2-stage replication (and
therefore the feature is not involved in parallel queries).

Note on coding: so far most of the functions to which I added the "grouped"
argument can get the same information from rel->agg_info (it's set iff the
relation is grouped). So we can remove it for this "simple aggregation", but
the next (more generic) part of the patch will have to add some argument
anyway, to indicate whether AGGSPLIT_SIMPLE, AGGSPLIT_INITIAL_SERIAL (or no
aggregation) should be performed.

Besides splitting the code I worked on the determination whether join
duplicates grouping keys or not. Currently it still fails to combine
"uniquekeys" of joined relation in some (important) cases when the information
needed is actually available. I think ECs should be used here, like it is for
pathkeys.

So currently you should comment out this code

    if (!match_uniquekeys_to_group_pathkeys(root, result, target))
        *keys_ok = false;

in pathnode.c:make_uniquekeys_for_join() if you want the patch to at leat
produce interesting EXPLAIN output.

One example:

CREATE TABLE a(i int primary key);
CREATE TABLE b(j int primary key, k int);

SET enable_agg_pushdown TO true;

EXPLAIN
SELECT          j, sum(k)
FROM            a, b
WHERE           i = j
GROUP BY        j

                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=94.75..162.43 rows=2260 width=12)
   Hash Cond: (a.i = b.j)
   ->  Seq Scan on a  (cost=0.00..35.50 rows=2550 width=4)
   ->  Hash  (cost=66.50..66.50 rows=2260 width=12)
         ->  HashAggregate  (cost=43.90..66.50 rows=2260 width=12)
               Group Key: b.j
               ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=8)

However there are cases like this

EXPLAIN
SELECT          i, sum(k)
FROM            a, b
WHERE           i = j
GROUP BY        i

which currently does not work. The reason is that the column b.j which is not
in the GROUP BY clause needs to be in the grouped output of the "b" (grouped)
table output, otherwise the join condition cannot be evaluated.

While separating the code that only uses 1-stage aggregation I removed the
code that adds such extra grouping keys to per-relation AggPath because in
general this is only safe if the final aggregation is performed, and the final
aggregation uses no added columns. However I forgot that grouping keys can be
added in cases like shown above, i.e. the grouping expression b.j is derived
from GROUP BY using equivalence class.

I'll fix this (and various other problems) asap. I believe it's worth to at
least show the current code. I'm curious if it's something we can build on or
if another rework will be needed.

(I'll be off next week.)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Explain buffers wrong counter with parallel plans
Next
From: Tom Lane
Date:
Subject: Re: Fallout from PQhost() semantics changes