WIP: Aggregation push-down - Mailing list pgsql-hackers

From Antonin Houska
Subject WIP: Aggregation push-down
Msg-id 9666.1491295317@localhost
List pgsql-hackers
This is a new version of the patch I presented in [1]. A new thread seems
appropriate because the current version can aggregate both base relations and
joins, so the original subject would no longer match.

There's still work to do but I'd consider the patch complete in terms of
concept. A few things worth attention for those who want to look into the
code:

* I've abandoned the concept of aggmultifn proposed in [1], as it doesn't
  appear to be very useful. That implies that a "grouped join" can be formed
  in 2 ways: 1) join a grouped relation to a "plain" (i.e. non-grouped) one,
  2) join 2 plain relations and aggregate the result. However, without
  aggmultifn we can't join 2 grouped relations.

* The GroupedVar type is used to propagate the result of partial aggregation
  up to the top-level join. It's conceptually very similar to PlaceHolderVar.

* Although I intended to use the "unique join" feature [2], I've postponed
  that for now. The point is that [2] conflicts with my patch and thus I'd
  have to rebase the patch more often. Anyway, the impact of [2] on aggregation
  finalization (i.e. possible avoidance of the "finalize aggregate node"
  setup) is not really specific to my patch.

* A scan of a base relation or a join result can be partially aggregated for
  2 reasons: 1) it makes the whole plan cheaper because the aggregation takes
  place on a remote node and thus the amount of data to be transferred over
  the network is significantly reduced, 2) aggregate functions are rather
  expensive, so it makes sense to evaluate them in multiple parallel workers.

  The patch contains both of these features as they are hard to
  separate from each other.

  While 1) needs additional work on postgres_fdw, scripts to simulate 2) are
  attached. Planner settings are such that the cost of expression evaluation
  is significant, so that it's worthwhile to engage multiple parallel workers.

  In my environment it yields the following output:

 Parallel Finalize HashAggregate
   Group Key: a.i
   ->  Gather Merge
         Workers Planned: 4
         ->  Merge Join
               Merge Cond: (b.parent = a.i)
               ->  Sort
                     Sort Key: b.parent
                     ->  Parallel Partial HashAggregate
                           Group Key: b.parent
                           ->  Hash Join
                                 Hash Cond: ((b.parent = c.parent) AND (b.j = c.k))
                                 ->  Parallel Seq Scan on b
                                 ->  Hash
                                       ->  Seq Scan on c
               ->  Sort
                     Sort Key: a.i
                     ->  Seq Scan on a
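
To make the shape of the plan above easier to follow, here is a minimal Python
sketch of the idea. It is not code from the patch. The sample rows, the column
b.v, the choice of sum() as the aggregate and all function names are assumptions
made up for illustration. It compares "join everything, then aggregate" with
"partially aggregate the join of b and c, join the grouped result to the plain
relation a, then finalize", i.e. way 1) of forming a grouped join.

```python
from collections import defaultdict

# Hypothetical sample data (not from the attached scripts).
a = [1, 2, 3]                                   # a.i
b = [(1, 10, 5), (1, 11, 7), (2, 10, 1),
     (2, 12, 4), (4, 10, 9)]                    # (b.parent, b.j, b.v)
c = [(1, 10), (1, 11), (2, 10), (4, 10)]        # (c.parent, c.k)


def join_b_c(b_rows, c_rows):
    """Hash join on (b.parent = c.parent) AND (b.j = c.k)."""
    matches = defaultdict(int)
    for parent, k in c_rows:
        matches[(parent, k)] += 1
    joined = []
    for parent, j, v in b_rows:
        # Emit one output row per matching row of c.
        joined.extend([(parent, v)] * matches.get((parent, j), 0))
    return joined


def aggregate_after_join(a_rows, b_rows, c_rows):
    """Baseline: join a, b and c first, aggregate once at the top."""
    result = defaultdict(int)
    for parent, v in join_b_c(b_rows, c_rows):
        for i in a_rows:
            if i == parent:          # join condition b.parent = a.i
                result[i] += v       # sum(b.v) grouped by a.i
    return dict(result)


def aggregate_pushed_down(a_rows, b_rows, c_rows):
    """Partial aggregate below the join to a, finalize above it."""
    # Partial aggregation: one state per b.parent.
    partial = defaultdict(int)
    for parent, v in join_b_c(b_rows, c_rows):
        partial[parent] += v
    # Join the grouped relation to the plain relation a and combine
    # the partial states per a.i (the "finalize" step).
    final = defaultdict(int)
    for i in a_rows:
        if i in partial:
            final[i] += partial[i]
    return dict(final)


assert aggregate_after_join(a, b, c) == aggregate_pushed_down(a, b, c)
print(aggregate_pushed_down(a, b, c))
```

In this sketch only one partial state per b.parent has to cross the join to a,
instead of every joined row, which is where the saving would come from.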

Feedback is appreciated.

[1] https://www.postgresql.org/message-id/29111.1483984605%40localhost

[2] https://commitfest.postgresql.org/13/859/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

