From: jian he
Subject: Re: Eager aggregation, take 3
Msg-id: CACJufxEwZqZcjay_2wEJfCy8fCxUJK_NOknxEpd4M+udkyp_0A@mail.gmail.com
In response to: Re: Eager aggregation, take 3 (Richard Guo <guofenglinux@gmail.com>)
List: pgsql-hackers
On Thu, Aug 29, 2024 at 10:26 AM Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear promising.
> > For example, if the proposed grouping column is already unique, the
> > exercise is pointless (I think). Ideally we'd like to detect that
> > without even creating the grouped_rel. But the proposed grouping
> > column might also be *mostly* unique. For example, consider a table
> > with a million rows and a column with 500,000 distinct values. I suspect it
> > will be difficult for partial aggregation to work out to a win in a
> > case like this, because I think that the cost of performing the
> > partial aggregation will not reduce the cost either of the final
> > aggregation or of the intervening join steps by enough to compensate.
> > It would be best to find a way to avoid generating a lot of rels and
> > paths in cases where there's really not much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been built,
> > so that we have good cardinality estimates. Suppose the query joins a
> > single fact table to a series of dimension tables. The final plan thus
> > uses the fact table as the driving table and joins to the dimension
> > tables one by one. Do we really need to consider partial aggregation
> > at every level? Perhaps just where there's been a significant row
> > count reduction since the last time we tried it, but at the next level
> > the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort.  I'll work on that next and see what happens.
>

In v13, the latest version, we can add a check like this:

    /* ... and initialize these targets */
    if (!init_grouping_targets(root, rel, target, agg_input,
                               &group_clauses, &group_exprs))
        return NULL;

    if (rel->reloptkind == RELOPT_BASEREL && group_exprs != NIL)
    {
        foreach_node(Var, var, group_exprs)
        {
            if (var->varno == rel->relid &&
                has_unique_index(rel, var->varattno))
                return NULL;
        }
    }

This is safe since init_grouping_targets already asserts that group_exprs
is a list of plain Vars.
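
For example, if the join key on the aggregated side is unique, a partial
aggregate grouped by it emits one row per input row, so no row reduction is
possible. A minimal illustration (hypothetical tables, not from the patch's
tests):

CREATE TABLE t1 (a int, b int);
CREATE TABLE t2 (b int UNIQUE, c double precision);
-- Eager aggregation would group t2 by its join key b; since b has a
-- unique index, the Partial Aggregate emits one row per t2 row and
-- cannot shrink the join's input, so the check above skips the rel.
SELECT t1.a, avg(t2.c)
FROM t1 JOIN t2 ON t1.b = t2.b
GROUP BY t1.a;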


--------------------------------------------------------------------------------
Also, in create_rel_agg_info, the number of grouped rows is estimated via
estimate_num_groups:

    result->group_exprs = group_exprs;
    result->grouped_rows = estimate_num_groups(root, result->group_exprs,
                                               rel->rows, NULL, NULL);

    /*
     * The grouped paths for the given relation are considered useful iff
     * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
     */
    agg_info->agg_useful =
        (agg_info->grouped_rows <= rel->rows * (1 - EAGER_AGGREGATE_RATIO));
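
For instance, if EAGER_AGGREGATE_RATIO were 0.5 (an illustrative value, not
necessarily what the patch uses), a relation with 1,000 rows would need
grouped_rows <= 500 for its grouped paths to be considered useful.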

If group_exprs contains too many Vars, result->grouped_rows will be less
accurate, and therefore agg_info->agg_useful will be less accurate as well.
Should we limit the number of Vars in group_exprs? In the example below,
the join clause references two Vars of t2, so the partial aggregation has
to group by both (t2.b, t2.a).
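
As a rough illustration of the estimation problem (a hypothetical example,
not from this thread): without extended statistics, estimate_num_groups
combines the per-column ndistinct estimates by multiplication, so for
correlated grouping columns the estimate can be orders of magnitude too
high, which in turn makes the agg_useful decision unreliable:

CREATE TABLE wide (a int, b int, c int, d int);
-- all four columns are perfectly correlated: only 100 distinct groups
INSERT INTO wide SELECT i % 100, i % 100, i % 100, i % 100
FROM generate_series(1, 100000) i;
ANALYZE wide;
-- the planner multiplies the per-column ndistincts (clamped to the
-- table's row count), while the actual number of groups is just 100
EXPLAIN ANALYZE SELECT a, b, c, d FROM wide GROUP BY a, b, c, d;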


For example:
SET enable_eager_aggregate TO on;
DROP TABLE IF EXISTS eager_agg_t1, eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2)i;
explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c) FROM eager_agg_t1 t1 JOIN eager_agg_t2 t2 ON
abs(t1.b) = abs(t2.b % 10 + t2.a) GROUP BY 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) = (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b % 10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b % 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate = 'on'
 Query Identifier: -734044107933323262


