Re: Eager aggregation, take 3
From: jian he
Subject: Re: Eager aggregation, take 3
Msg-id: CACJufxEwZqZcjay_2wEJfCy8fCxUJK_NOknxEpd4M+udkyp_0A@mail.gmail.com
In response to: Re: Eager aggregation, take 3 (Richard Guo <guofenglinux@gmail.com>)
List: pgsql-hackers
On Thu, Aug 29, 2024 at 10:26 AM Richard Guo <guofenglinux@gmail.com> wrote:
>
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear promising.
> > For example, if the proposed grouping column is already unique, the
> > exercise is pointless (I think). Ideally we'd like to detect that
> > without even creating the grouped_rel. But the proposed grouping
> > column might also be *mostly* unique. For example, consider a table
> > with a million rows and a column with 500,000 distinct values. I
> > suspect it will be difficult for partial aggregation to work out to a
> > win in a case like this, because I think that the cost of performing
> > the partial aggregation will not reduce the cost either of the final
> > aggregation or of the intervening join steps by enough to compensate.
> > It would be best to find a way to avoid generating a lot of rels and
> > paths in cases where there's really not much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been built,
> > so that we have good cardinality estimates. Suppose the query joins a
> > single fact table to a series of dimension tables. The final plan thus
> > uses the fact table as the driving table and joins to the dimension
> > tables one by one. Do we really need to consider partial aggregation
> > at every level? Perhaps just where there's been a significant row
> > count reduction since the last time we tried it, but at the next level
> > the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort. I'll work on that next and see what happens.

In v13, the latest version, we can do:

/* ... and initialize these targets */
if (!init_grouping_targets(root, rel, target, agg_input,
                           &group_clauses, &group_exprs))
    return NULL;

if (rel->reloptkind == RELOPT_BASEREL && group_exprs != NIL)
{
    foreach_node(Var, var, group_exprs)
    {
        if (var->varno == rel->relid &&
            has_unique_index(rel, var->varattno))
            return NULL;
    }
}

since in init_grouping_targets we have already asserted that group_exprs
is a list of Vars.

--------------------------------------------------------------------------------

Also, in create_rel_agg_info, estimate_num_groups is called as:

result->group_exprs = group_exprs;
result->grouped_rows = estimate_num_groups(root, result->group_exprs,
                                           rel->rows, NULL, NULL);

/*
 * The grouped paths for the given relation are considered useful iff
 * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
 */
agg_info->agg_useful =
    (agg_info->grouped_rows <= rel->rows * (1 - EAGER_AGGREGATE_RATIO));

If group_exprs contains too many Vars, then result->grouped_rows will be
less accurate, and therefore agg_info->agg_useful will be less accurate.
Should we limit the number of Vars in group_exprs?
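As a minimal sketch of such a cap, something like the following could go
right after init_grouping_targets returns. MAX_EAGER_AGG_GROUP_VARS is a
hypothetical threshold, not something in v13; the right value, if any,
would need benchmarking:

/*
 * Hypothetical sketch, not part of the patch: give up on eager
 * aggregation when the grouping key contains too many Vars, on the
 * theory that estimate_num_groups gets less reliable as the number
 * of grouping columns grows.
 */
#define MAX_EAGER_AGG_GROUP_VARS 4

if (list_length(group_exprs) > MAX_EAGER_AGG_GROUP_VARS)
    return NULL;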
For example:

SET enable_eager_aggregate TO on;
drop table if exists eager_agg_t1, eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2)i;

explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c)
FROM eager_agg_t1 t1
JOIN eager_agg_t2 t2 ON abs(t1.b) = abs(t2.b % 10 + t2.a)
group by 1;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) = (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b % 10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b % 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate = 'on'
 Query Identifier: -734044107933323262
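To make the agg_useful threshold concrete with the million-row case
described upthread, here is a minimal standalone sketch. It assumes
EAGER_AGGREGATE_RATIO = 0.5, which may not match the constant actually
used in the patch:

#include <stdio.h>
#include <stdbool.h>

/* Assumed value for illustration; the patch's actual constant may differ. */
#define EAGER_AGGREGATE_RATIO 0.5

int main(void)
{
    double rel_rows = 1000000.0;    /* a million-row table */
    double grouped_rows = 500000.0; /* ~500,000 distinct grouping values */

    /* The same test as agg_info->agg_useful above. */
    bool agg_useful = grouped_rows <= rel_rows * (1.0 - EAGER_AGGREGATE_RATIO);

    printf("agg_useful = %s\n", agg_useful ? "true" : "false");
    return 0;
}

Under these assumptions the test passes exactly at the boundary
(500000 <= 500000), so even a case expected to lose would still be
tried; an inaccurate grouped_rows estimate from a multi-Var grouping
key can shift cases across this boundary in either direction.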