Thread: [HACKERS] Hash support for grouping sets
Herewith a patch for doing grouping sets via hashing or mixed hashing
and sorting.

The principal objective is to pick whatever combination of grouping sets
has an estimated size that fits in work_mem, and minimizes the number of
sorting passes we need to do over the data, and hash those. (Yes, this
is a knapsack problem.)

This is achieved by adding another strategy to the Agg node; AGG_MIXED
means that both hashing and (sorted or plain) grouping happens. In
addition, support for multiple hash tables in AGG_HASHED mode is added.

Sample plans look like this:

explain select two,ten from onek group by cube(two,ten);
                          QUERY PLAN
--------------------------------------------------------------
 MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
   Hash Key: two, ten
   Hash Key: two
   Hash Key: ten
   Group Key: ()
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)

explain select two,ten from onek group by two, cube(ten,twenty);
                           QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=86.50..100.62 rows=162 width=12)
   Hash Key: two, ten, twenty
   Hash Key: two, ten
   Hash Key: two
   Hash Key: two, twenty
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)

set work_mem='64kB';
explain select count(*) from tenk1
  group by grouping sets (unique1,thousand,hundred,ten,two);
                               QUERY PLAN
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..3023.89 rows=11112 width=28)
   Hash Key: two
   Hash Key: ten
   Hash Key: hundred
   Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: unique1
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
(10 rows)

Columns with unhashable or unsortable data types are handled
appropriately, though obviously every individual grouping set must end
up containing compatible column types.
There is one current weakness which I don't see a good solution for: the
planner code still has to pick a single value for group_pathkeys before
planning the input path. This means that we sometimes can't choose a
minimal set of sorts, because we may have chosen a sort order for a
grouping set that should have been hashed, for example:

explain select count(*) from tenk1
  group by grouping sets (two,unique1,thousand,hundred,ten);
                               QUERY PLAN
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..4126.28 rows=11112 width=28)
   Hash Key: ten
   Hash Key: hundred
   Group Key: two
   Sort Key: unique1
     Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: two
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)

(3 sorts, vs. 2 in the previous example that listed the same grouping
sets in a different order)

Current status of the work: probably still needs cleanup, maybe some
better regression tests, and Andres has expressed some concern over the
performance impact of additional complexity in advance_aggregates; it
would be useful if this could be tested. But it should be reviewable
as-is.

-- 
Andrew (irc:RhodiumToad)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
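[Editorial note: to make the "knapsack problem" remark above concrete, here is a minimal standalone 0/1 knapsack sketch. It is illustrative only, not the patch's DiscreteKnapsack (which uses Bitmapsets and palloc): items stand in for candidate grouping sets, weights for estimated hash-table sizes scaled to a work_mem-derived capacity, and values for the benefit of hashing each set (e.g. sorts avoided). All names are hypothetical.]

```c
#include <assert.h>

#define MAX_CAP 64

/*
 * Minimal 0/1 knapsack sketch: choose a subset of items (candidate
 * grouping sets) whose total weight (scaled hash table size estimate)
 * fits within "capacity" (derived from work_mem), maximizing total
 * value.  Returns a bitmask of chosen items; the bitmask representation
 * and bounds are illustrative only.
 */
unsigned int
knapsack_pick(int capacity, int num_items,
			  const int *weights, const double *values)
{
	double		best[MAX_CAP + 1] = {0};
	unsigned int chosen[MAX_CAP + 1] = {0};
	int			i, w;

	assert(capacity <= MAX_CAP && num_items < 32);

	for (i = 0; i < num_items; i++)
	{
		/* scan weights downward so each item is taken at most once */
		for (w = capacity; w >= weights[i]; w--)
		{
			int			ow = w - weights[i];	/* room left if we take i */

			if (best[w] < best[ow] + values[i])
			{
				best[w] = best[ow] + values[i];
				chosen[w] = chosen[ow] | (1u << i);
			}
		}
	}
	return chosen[capacity];
}
```

The actual patch (see the DiscreteKnapsack function quoted later in the thread) tracks the chosen subsets as Bitmapsets rather than machine-word masks, but the dynamic program is the same shape.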
On Thu, Jan 5, 2017 at 10:48 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The ability to exploit hashed aggregation within sorted groups, when the
order of the input stream can be exploited this way, is potentially a
useful way to improve aggregation performance more generally.  This
would potentially be beneficial when the input size is expected to be
larger than the amount of working memory available for hashed
aggregation, but where there is enough memory to hash-aggregate just the
unsorted grouping key combinations, and when the cumulative cost of
rebuilding the hash table for each sorted subgroup is less than the cost
of sorting the entire input.

In other words, if most of the grouping key combinations are already
segregated by virtue of the input order, then hashing the remaining
combinations within each sorted group might be done in memory, at the
cost of rebuilding the hash table for each sorted subgroup.

I haven't looked at the code for this change yet (I hope I will have the
time to do that).  Ideally the decision to choose the aggregation method
as sorted, hashed, or mixed hash/sort should be integrated into the cost
model, but given the notorious difficulty of estimating intermediate
cardinalities accurately it would be difficult to develop a cardinality
model and a cost model accurate enough to choose among these options
consistently well.

Jim Finnerty
Amazon Corp.

On 1/10/17, 10:22 AM, "pgsql-hackers-owner@postgresql.org on behalf of
Robert Haas" <pgsql-hackers-owner@postgresql.org on behalf of
robertmhaas@gmail.com> wrote:

    On Thu, Jan 5, 2017 at 10:48 PM, Andrew Gierth
    <andrew@tao11.riddles.org.uk> wrote:
    > Herewith a patch for doing grouping sets via hashing or mixed hashing
    > and sorting.

    Cool.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
On Mon, Jan 16, 2017 at 10:59 AM, Finnerty, Jim <jfinnert@amazon.com> wrote:
> The ability to exploit hashed aggregation within sorted groups, when
> the order of the input stream can be exploited this way, is potentially
> a useful way to improve aggregation performance more generally.  This
> would potentially be beneficial when the input size is expected to be
> larger than the amount of working memory available for hashed
> aggregation, but where there is enough memory to hash-aggregate just
> the unsorted grouping key combinations, and when the cumulative cost of
> rebuilding the hash table for each sorted subgroup is less than the
> cost of sorting the entire input.
>
> In other words, if most of the grouping key combinations are already
> segregated by virtue of the input order, then hashing the remaining
> combinations within each sorted group might be done in memory, at the
> cost of rebuilding the hash table for each sorted subgroup.

Neat idea.

> I haven't looked at the code for this change yet (I hope I will have
> the time to do that).  Ideally the decision to choose the aggregation
> method as sorted, hashed, or mixed hash/sort should be integrated into
> the cost model, but given the notorious difficulty of estimating
> intermediate cardinalities accurately it would be difficult to develop
> a cardinality model and a cost model accurate enough to choose among
> these options consistently well.

Yes, that might be a little tricky.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
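[Editorial note: a minimal sketch of the idea Jim describes, under assumed simplifications that are not in the thread: keys are small non-negative integers, and a direct-indexed array stands in for a real hash table. The input is sorted on key `a`, and key `b` is hash-aggregated within each run of equal `a`, rebuilding the table at every run boundary instead of sorting the whole input on `(a, b)`.]

```c
#include <string.h>

#define NKEYS 8					/* illustrative: "b" values are ints < NKEYS */

typedef struct { int a; int b; } Row;

/*
 * Sketch of hash aggregation within sorted runs (not patch code): rows
 * are sorted on "a", so we can count groups of "b" inside each run of
 * equal "a", resetting the (small) table at every run boundary.  Emitted
 * (a, b, count) groups go to the out_* arrays; returns the group count.
 */
int
hash_within_sorted_runs(const Row *rows, int nrows,
						int *out_a, int *out_b, int *out_count)
{
	int			counts[NKEYS];
	int			ngroups = 0;
	int			i, run_start = 0;

	for (i = 0; i <= nrows; i++)
	{
		/* run boundary: end of input, or the "a" value changed */
		if (i == nrows || (i > run_start && rows[i].a != rows[run_start].a))
		{
			int			j, b;

			/* "rebuild the hash table" for the run [run_start, i) */
			memset(counts, 0, sizeof(counts));
			for (j = run_start; j < i; j++)
				counts[rows[j].b]++;
			for (b = 0; b < NKEYS; b++)
				if (counts[b] > 0)
				{
					out_a[ngroups] = rows[run_start].a;
					out_b[ngroups] = b;
					out_count[ngroups] = counts[b];
					ngroups++;
				}
			run_start = i;
		}
	}
	return ngroups;
}
```

The memory observation in the email falls out directly: the table only ever needs to hold one run's worth of `b` values, not the whole input's.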
On 6 January 2017 at 03:48, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.
>
> The principal objective is to pick whatever combination of grouping sets
> has an estimated size that fits in work_mem, and minimizes the number of
> sorting passes we need to do over the data, and hash those. (Yes, this
> is a knapsack problem.)
>
> This is achieved by adding another strategy to the Agg node; AGG_MIXED
> means that both hashing and (sorted or plain) grouping happens. In
> addition, support for multiple hash tables in AGG_HASHED mode is added.
>
> Sample plans look like this:
>
> explain select two,ten from onek group by cube(two,ten);
>                           QUERY PLAN
> --------------------------------------------------------------
>  MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
>    Hash Key: two, ten
>    Hash Key: two
>    Hash Key: ten
>    Group Key: ()
>    ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)
>
> explain select two,ten from onek group by two, cube(ten,twenty);
>                            QUERY PLAN
> ---------------------------------------------------------------
>  HashAggregate  (cost=86.50..100.62 rows=162 width=12)
>    Hash Key: two, ten, twenty
>    Hash Key: two, ten
>    Hash Key: two
>    Hash Key: two, twenty
>    ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)
>
> set work_mem='64kB';
> explain select count(*) from tenk1
>   group by grouping sets (unique1,thousand,hundred,ten,two);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  MixedAggregate  (cost=1535.39..3023.89 rows=11112 width=28)
>    Hash Key: two
>    Hash Key: ten
>    Hash Key: hundred
>    Group Key: unique1
>    Sort Key: thousand
>      Group Key: thousand
>    ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
>          Sort Key: unique1
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
> (10 rows)
>
> Columns with unhashable or unsortable data types are handled
> appropriately, though obviously every individual grouping set must end
> up containing compatible column types.
>
> There is one current weakness which I don't see a good solution for: the
> planner code still has to pick a single value for group_pathkeys before
> planning the input path. This means that we sometimes can't choose a
> minimal set of sorts, because we may have chosen a sort order for a
> grouping set that should have been hashed, for example:
>
> explain select count(*) from tenk1
>   group by grouping sets (two,unique1,thousand,hundred,ten);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  MixedAggregate  (cost=1535.39..4126.28 rows=11112 width=28)
>    Hash Key: ten
>    Hash Key: hundred
>    Group Key: two
>    Sort Key: unique1
>      Group Key: unique1
>    Sort Key: thousand
>      Group Key: thousand
>    ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
>          Sort Key: two
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
>
> (3 sorts, vs. 2 in the previous example that listed the same grouping
> sets in a different order)
>
> Current status of the work: probably still needs cleanup, maybe some
> better regression tests, and Andres has expressed some concern over the
> performance impact of additional complexity in advance_aggregates; it
> would be useful if this could be tested. But it should be reviewable
> as-is.

This doesn't apply cleanly to latest master.  Could you please post a
rebased patch?

Thanks

Thom
>>>>> "Thom" == Thom Brown <thom@linux.com> writes:

 Thom> This doesn't apply cleanly to latest master.  Could you please
 Thom> post a rebased patch?

Sure.

-- 
Andrew (irc:RhodiumToad)
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

On my MacBook, `make check-world` gives differences in the contrib modules:

cat contrib/postgres_fdw/regression.diffs

*** /Users/mark/hydra/postgresql.review/contrib/postgres_fdw/expected/postgres_fdw.out	2017-03-03 13:33:47.000000000 -0800
--- /Users/mark/hydra/postgresql.review/contrib/postgres_fdw/results/postgres_fdw.out	2017-03-07 13:27:56.000000000 -0800
***************
*** 3148,3163 ****
  -- Grouping sets
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
!                                             QUERY PLAN
! ---------------------------------------------------------------------------------------------------
!  GroupAggregate
!    Output: c2, sum(c1)
!    Group Key: ft1.c2
!    Group Key: ()
!    ->  Foreign Scan on public.ft1
!          Output: c2, c1
!          Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (7 rows)
  
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
   c2 | sum  
--- 3148,3166 ----
  -- Grouping sets
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
!                                   QUERY PLAN
! ------------------------------------------------------------------------------
!  Sort
!    Output: c2, (sum(c1))
!    Sort Key: ft1.c2
!    ->  MixedAggregate
!          Output: c2, sum(c1)
!          Hash Key: ft1.c2
!          Group Key: ()
!          ->  Foreign Scan on public.ft1
!                Output: c2, c1
!                Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
   c2 | sum  
***************
*** 3170,3185 ****
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
!                                             QUERY PLAN
! ---------------------------------------------------------------------------------------------------
!  GroupAggregate
!    Output: c2, sum(c1)
!    Group Key: ft1.c2
!    Group Key: ()
!    ->  Foreign Scan on public.ft1
!          Output: c2, c1
!          Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (7 rows)
  
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
   c2 | sum  
--- 3173,3191 ----
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
!                                   QUERY PLAN
! ------------------------------------------------------------------------------
!  Sort
!    Output: c2, (sum(c1))
!    Sort Key: ft1.c2
!    ->  MixedAggregate
!          Output: c2, sum(c1)
!          Hash Key: ft1.c2
!          Group Key: ()
!          ->  Foreign Scan on public.ft1
!                Output: c2, c1
!                Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
   c2 | sum  
***************
*** 3192,3211 ****
  explain (verbose, costs off)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
!                                                  QUERY PLAN
! -------------------------------------------------------------------------------------------------------------
   Sort
     Output: c2, c6, (sum(c1))
     Sort Key: ft1.c2, ft1.c6
!    ->  GroupAggregate
           Output: c2, c6, sum(c1)
!          Group Key: ft1.c2
!          Sort Key: ft1.c6
!            Group Key: ft1.c6
           ->  Foreign Scan on public.ft1
                 Output: c2, c6, c1
!                Remote SQL: SELECT "C 1", c2, c6 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (11 rows)
  
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
   c2 | c6 | sum  
--- 3198,3216 ----
  explain (verbose, costs off)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
!                                    QUERY PLAN
! ----------------------------------------------------------------------------------
   Sort
     Output: c2, c6, (sum(c1))
     Sort Key: ft1.c2, ft1.c6
!    ->  HashAggregate
           Output: c2, c6, sum(c1)
!          Hash Key: ft1.c2
!          Hash Key: ft1.c6
           ->  Foreign Scan on public.ft1
                 Output: c2, c6, c1
!                Remote SQL: SELECT "C 1", c2, c6 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
   c2 | c6 | sum  
======================================================================
> On Mar 7, 2017, at 1:43 PM, Mark Dilger <hornschnorter@gmail.com> wrote: > > On my MacBook, `make check-world` gives differences in the contrib modules: I get the same (or similar -- didn't check) regression failure on CentOS, so this doesn't appear to be MacBook / hardware specific. Mark Dilger
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

On linux/gcc the patch generates a warning in nodeAgg.c that is fairly
easy to fix.  Using -Werror to make catching the error easier, I get:

make[3]: Entering directory `/home/mark/postgresql.clean/src/backend/executor'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -O2 -Werror -I../../../src/include -D_GNU_SOURCE -c -o nodeAgg.o nodeAgg.c -MMD -MP -MF .deps/nodeAgg.Po
cc1: warnings being treated as errors
nodeAgg.c: In function ‘ExecAgg’:
nodeAgg.c:2053: error: ‘result’ may be used uninitialized in this function
make[3]: *** [nodeAgg.o] Error 1
make[3]: Leaving directory `/home/mark/postgresql.clean/src/backend/executor'
make[2]: *** [executor-recursive] Error 2
make[2]: Leaving directory `/home/mark/postgresql.clean/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/mark/postgresql.clean/src'
make: *** [all-src-recurse] Error 2

There are two obvious choices to silence the warning:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f059560..8959f5b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2066,6 +2066,7 @@ ExecAgg(AggState *node)
 			break;
 		case AGG_PLAIN:
 		case AGG_SORTED:
+		default:
 			result = agg_retrieve_direct(node);
 			break;
 	}

which I like better, because I don't expect it would create an extra
instruction in the compiled code, but also:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f059560..eab2605 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2050,7 +2050,7 @@ lookup_hash_entries(AggState *aggstate)
 TupleTableSlot *
 ExecAgg(AggState *node)
 {
-	TupleTableSlot *result;
+	TupleTableSlot *result = NULL;
 
 	if (!node->agg_done)
 	{
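[Editorial note: a stripped-down reproduction of the warning pattern, with hypothetical enum and function names rather than the actual nodeAgg.c code. With no `default:` label, older gcc cannot prove `result` is assigned on every path, even though the cases cover the whole enum; this sketch applies both of the proposed fixes at once, though either one alone silences the warning.]

```c
enum agg_strategy { STRAT_PLAIN, STRAT_SORTED, STRAT_HASHED, STRAT_MIXED };

/*
 * Stand-in for the ExecAgg pattern that triggers the warning.  The two
 * fixes from the diffs above: initialize the variable, and/or add a
 * "default:" label so the compiler sees that every path assigns it.
 */
int
pick_retrieve_path(enum agg_strategy s)
{
	int			result = 0;		/* fix 2: initialize (= NULL in the patch) */

	switch (s)
	{
		case STRAT_HASHED:
		case STRAT_MIXED:
			result = 1;			/* stands in for the hash-table path */
			break;
		case STRAT_PLAIN:
		case STRAT_SORTED:
		default:				/* fix 1: default label */
			result = 2;			/* stands in for agg_retrieve_direct() */
			break;
	}
	return result;
}
```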
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> On linux/gcc the patch generates a warning in nodeAgg.c that is
 Mark> fairly easy to fix.  Using -Werror to make catching the error
 Mark> easier, I get:

what gcc version is this exactly?

-- 
Andrew (irc:RhodiumToad)
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> On my MacBook, `make check-world` gives differences in the
 Mark> contrib modules:

Thanks!  Latest cleaned up version of patch is attached.

-- 
Andrew (irc:RhodiumToad)
> On Mar 8, 2017, at 2:30 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>
> Mark> On linux/gcc the patch generates a warning in nodeAgg.c that is
> Mark> fairly easy to fix.  Using -Werror to make catching the error
> Mark> easier, I get:
>
> what gcc version is this exactly?
>

Linux version 2.6.32-573.18.1.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org)
(gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) )
#1 SMP Tue Feb 9 22:46:17 UTC 2016
> On Mar 8, 2017, at 5:47 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > >>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes: > > Mark> On my MacBook, `make check-world` gives differences in the > Mark> contrib modules: > > Thanks! Latest cleaned up version of patch is attached. This fixes both the warning and the contrib tests on linux and mac.
Hi Andrew,

Reviewing the patch a bit more, I find it hard to understand the comment
about passing -1 as a flag for finalize_aggregates.  Any chance you can
spend a bit more time word-smithing that code comment?

@@ -1559,7 +1647,9 @@ prepare_projection_slot(AggState *aggstate, TupleTableSlot *slot, int currentSet
 /*
  * Compute the final value of all aggregates for one group.
  *
- * This function handles only one grouping set at a time.
+ * This function handles only one grouping set at a time.  But in the hash
+ * case, it's the caller's responsibility to have selected the set already, and
+ * we pass in -1 here to flag that and to control the indexing into pertrans.
  *
  * Results are stored in the output econtext aggvalues/aggnulls.
  */
@@ -1575,10 +1665,11 @@ finalize_aggregates(AggState *aggstate,
 	int			aggno;
 	int			transno;
 
-	Assert(currentSet == 0 ||
-		   ((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
-
-	aggstate->current_set = currentSet;
+	/* ugly hack for hash */
+	if (currentSet >= 0)
+		select_current_set(aggstate, currentSet, false);
+	else
+		currentSet = 0;

> On Mar 8, 2017, at 8:00 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
> 
> 
>> On Mar 8, 2017, at 5:47 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>> 
>>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>> 
>> Mark> On my MacBook, `make check-world` gives differences in the
>> Mark> contrib modules:
>> 
>> Thanks!  Latest cleaned up version of patch is attached.
> 
> This fixes both the warning and the contrib tests on linux and mac.
> 
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Hi Andrew,

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?

Sure.

How does this look for wording (I'll incorporate it into an updated
patch later):

/*
 * Compute the final value of all aggregates for one group.
 *
 * This function handles only one grouping set at a time.  But in the hash
 * case, it's the caller's responsibility to have selected the set already, and
 * we pass in -1 as currentSet here to flag that; this also changes how we
 * handle the indexing into AggStatePerGroup as explained below.
 *
 * Results are stored in the output econtext aggvalues/aggnulls.
 */
static void
finalize_aggregates(AggState *aggstate,
					AggStatePerAgg peraggs,
					AggStatePerGroup pergroup,
					int currentSet)
{
	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
	Datum	   *aggvalues = econtext->ecxt_aggvalues;
	bool	   *aggnulls = econtext->ecxt_aggnulls;
	int			aggno;
	int			transno;

	/*
	 * If currentSet >= 0, then we're doing sorted grouping, and pergroup is an
	 * array of size numTrans*numSets which we have to index into using
	 * currentSet in addition to transno.  The caller may not have selected the
	 * set, so we do that.
	 *
	 * If currentSet < 0, then we're doing hashed grouping, and pergroup is an
	 * array of only numTrans items (since for hashed grouping, each grouping
	 * set is in a separate hashtable).  We rely on the caller having done
	 * select_current_set, and we fudge currentSet to 0 in order to make the
	 * same indexing calculations work as for the grouping case.
	 */
	if (currentSet >= 0)
		select_current_set(aggstate, currentSet, false);
	else
		currentSet = 0;

-- 
Andrew (irc:RhodiumToad)
> On Mar 8, 2017, at 9:40 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > >>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes: > > Mark> Hi Andrew, > > Mark> Reviewing the patch a bit more, I find it hard to understand the > Mark> comment about passing -1 as a flag for finalize_aggregates. Any > Mark> chance you can spend a bit more time word-smithing that code > Mark> comment? > > Sure. > > How does this look for wording (I'll incorporate it into an updated > patch later): Much better. Thanks! > /* > * Compute the final value of all aggregates for one group. > * > * This function handles only one grouping set at a time. But in the hash > * case, it's the caller's responsibility to have selected the set already, and > * we pass in -1 as currentSet here to flag that; this also changes how we > * handle the indexing into AggStatePerGroup as explained below. > * > * Results are stored in the output econtext aggvalues/aggnulls. > */ > static void > finalize_aggregates(AggState *aggstate, > AggStatePerAgg peraggs, > AggStatePerGroup pergroup, > int currentSet) > { > ExprContext *econtext = aggstate->ss.ps.ps_ExprContext; > Datum *aggvalues = econtext->ecxt_aggvalues; > bool *aggnulls = econtext->ecxt_aggnulls; > int aggno; > int transno; > > /* > * If currentSet >= 0, then we're doing sorted grouping, and pergroup is an > * array of size numTrans*numSets which we have to index into using > * currentSet in addition to transno. The caller may not have selected the > * set, so we do that. > * > * If currentSet < 0, then we're doing hashed grouping, and pergroup is an > * array of only numTrans items (since for hashed grouping, each grouping > * set is in a separate hashtable). We rely on the caller having done > * select_current_set, and we fudge currentSet to 0 in order to make the > * same indexing calculations work as for the grouping case. 
> */ > if (currentSet >= 0) > select_current_set(aggstate, currentSet, false); > else > currentSet = 0; > > > -- > Andrew (irc:RhodiumToad)
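[Editorial note: the indexing convention that comment describes can be condensed into a tiny sketch. The helper below is hypothetical; the real code indexes AggStatePerGroup directly rather than through a function. Sorted grouping keeps one array of numTrans * numSets transition states, indexed by both currentSet and transno; hashed grouping keeps a separate numTrans-sized array per hash table, so currentSet is fudged to 0 and the same formula applies.]

```c
/*
 * Illustrative sketch of the pergroup indexing rule described in the
 * quoted comment (hypothetical helper, not nodeAgg.c code).
 */
int
pergroup_index(int transno, int currentSet, int numTrans)
{
	if (currentSet < 0)			/* hashed case, flagged by -1 */
		currentSet = 0;			/* per-table array holds one set's worth */
	return transno + (currentSet * numTrans);
}
```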
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Hi Andrew,

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?

Actually, ignore that prior response, I'll refactor it a bit to remove
the need for the comment at all.

-- 
Andrew (irc:RhodiumToad)
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?

 >> Sure.
 >>
 >> How does this look for wording (I'll incorporate it into an updated
 >> patch later):

 Mark> Much better.  Thanks!

Here's a version of the patch that obviates that comment entirely, by
moving the differences between the hashed and sorted cases out to the
separate callers of finalize_aggregates.

-- 
Andrew (irc:RhodiumToad)
Another small update to the patch, this time to eliminate any
possibility of integer overflow when handling extremely large estimated
groupings.

Key change:

-				k_weights[i] = (int) floor(sz / scale);
+
+				/*
+				 * If sz is enormous, but work_mem (and hence scale) is
+				 * small, avoid integer overflow here.
+				 */
+				k_weights[i] = (int) Min(floor(sz / scale),
+										 k_capacity + 1.0);

-- 
Andrew (irc:RhodiumToad)
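[Editorial note: the effect of the clamp can be shown with a small standalone sketch. The wrapper function and MIN macro are hypothetical stand-ins (PostgreSQL's own Min macro behaves the same way for this purpose), and truncation via the cast matches floor() for the non-negative values involved.]

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Sketch of the overflow hazard the change addresses: if sz / scale
 * exceeds INT_MAX, casting the double straight to int is undefined
 * behavior.  Clamping to capacity + 1 first is safe, because any weight
 * above the knapsack capacity can never be chosen anyway.
 */
int
clamped_weight(double sz, double scale, int capacity)
{
	double		w = sz / scale;		/* may vastly exceed INT_MAX */

	return (int) MIN(w, capacity + 1.0);	/* cast now always in range */
}
```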
[snip] This thread seems to have gone quiet - is it time for me to just go ahead and commit the thing anyway? Anyone else want to weigh in? -- Andrew (irc:RhodiumToad)
Hi,

> +/*
> + * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
>   * current_phase + 1. Juggle the tuplesorts accordingly.
>   */
>  static void
>  initialize_phase(AggState *aggstate, int newphase)
>  {
> -	Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
> +	Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

I think this needs an expanded explanation somewhere in the file header
about what these "special" phases mean.

> @@ -838,17 +897,21 @@ advance_transition_function(AggState *aggstate,
>  /*
>   * Advance each aggregate transition state for one input tuple.  The input
>   * tuple has been stored in tmpcontext->ecxt_outertuple, so that it is
> - * accessible to ExecEvalExpr.  pergroup is the array of per-group structs to
> - * use (this might be in a hashtable entry).
> + * accessible to ExecEvalExpr.
> + *
> + * We have two sets of transition states to handle: one for sorted aggregation
> + * and one for hashed; we do them both here, to avoid multiple evaluation of
> + * the inputs.
>   *
>   * When called, CurrentMemoryContext should be the per-query context.
>   */

Changes to advance_aggregates() are, in my experience, quite likely to
have performance effects.  This needs some performance tests.
I scheduled a small TPCH comparison run:

master q01 min: 28924.766   dev-hash min: 28893.923   [diff -0.11]
master q02 min: 2448.135    dev-hash min: 2430.589    [diff -0.72]
master q03 min: 14596.292   dev-hash min: 14421.054   [diff -1.22]
master q04 min: 1999.684    dev-hash min: 1958.727    [diff -2.09]
master q05 min: 10638.148   dev-hash min: 10753.075   [diff +1.07]
master q06 min: 3679.955    dev-hash min: 3649.114    [diff -0.85]
master q07 min: 10097.272   dev-hash min: 10358.089   [diff +2.52]
master q08 min: 3103.698    dev-hash min: 3078.108    [diff -0.83]
master q09 min: 13034.87    dev-hash min: 13049.402   [diff +0.11]
master q10 min: 11702.966   dev-hash min: 11535.05    [diff -1.46]
master q11 min: 577.114     dev-hash min: 586.767     [diff +1.65]
master q12 min: 9087.413    dev-hash min: 9272.874    [diff +2.00]
master q13 min: 16353.813   dev-hash min: 16473.882   [diff +0.73]
master q14 min: 1564.321    dev-hash min: 1552.31     [diff -0.77]
master q15 min: 3958.565    dev-hash min: 3946.728    [diff -0.30]
master q16 min: 3793.345    dev-hash min: 3784.996    [diff -0.22]
master q17 min: 828.286     dev-hash min: 834.929     [diff +0.80]
master q18 min: 13473.084   dev-hash min: 13777.327   [diff +2.21]
master q19 min: 654.241     dev-hash min: 673.863     [diff +2.91]
master q20 min: 2906.811    dev-hash min: 2882.793    [diff -0.83]
master q22 min: 1226.397    dev-hash min: 1262.045    [diff +2.82]

master total min: 154649.176   dev-hash min: 155175.645   [diff +0.34]

Looks like it could all be noise, but it seems worthwhile to look into
some.

>   * GROUP BY columns.  The per-group data is allocated in lookup_hash_entry(),
>   * for each entry.
>   *
> - * The hash table always lives in the aggcontext memory context.
> + * We have a separate hashtable and associated perhash data structure for each
> + * grouping set for which we're doing hashing.
> + *
> + * The hash tables always live in the hashcontext's per-tuple memory context
> + * (there is only one of these for all tables together, since they are all
> + * reset at the same time).
>   */

I've not yet read up on the thread, or the whole patch, but an
explanation of the memory management for all those tables would be good
somewhere (maybe I've just not read that far).

> @@ -2388,7 +2597,36 @@ agg_retrieve_hash_table(AggState *aggstate)
>   * ExecInitAgg
>   *
>   * Creates the run-time information for the agg node produced by the
> - * planner and initializes its outer subtree
> + * planner and initializes its outer subtree.
> + *
> + * What we get from the planner is actually one "real" Agg node which is part
> + * of the plan tree proper, but which optionally has an additional list of Agg
> + * nodes hung off the side via the "chain" field.  This is because an Agg node
> + * happens to be a convenient representation of all the data we need for
> + * grouping sets.
> + *
> + * For many purposes, we treat the "real" node as if it were just the first
> + * node in the chain.  The chain must be ordered such that hashed entries come
> + * before sorted/plain entries; the real node is marked AGG_MIXED if there are
> + * mixed types (in which case the real node describes one of the hashed

"mixed types" sounds a bit weird.

> + * We collect all hashed nodes into a single "phase", numbered 0, and create a
> + * sorted phase (numbered 1..n) for each AGG_SORTED or AGG_PLAIN node.  Phase
> + * 0 is allocated even if there are no hashes, but remains unused in that
> + * case.

Ah, here's the explanation I'd been looking for ;)

> +Bitmapset *
> +DiscreteKnapsack(int max_weight, int num_items,
> +				 int *item_weights, double *item_values)
> +{
> +	MemoryContext local_ctx = AllocSetContextCreate(CurrentMemoryContext,
> +													"Knapsack",
> +													ALLOCSET_SMALL_MINSIZE,
> +													ALLOCSET_SMALL_INITSIZE,
> +													ALLOCSET_SMALL_MAXSIZE);
> +	MemoryContext oldctx = MemoryContextSwitchTo(local_ctx);
> +	double	   *values;
> +	Bitmapset **sets;
> +	Bitmapset  *result;
> +	int			i,
> +				j;
> +
> +	Assert(max_weight >= 0);
> +	Assert(num_items > 0 && item_weights);
> +
> +	values = palloc((1 + max_weight) * sizeof(double));
> +	sets = palloc((1 + max_weight) * sizeof(Bitmapset *));

We usually cast the result of palloc.

> +	for (i = 0; i <= max_weight; ++i)
> +	{
> +		values[i] = 0;
> +		sets[i] = bms_make_singleton(num_items);
> +	}
> +
> +	for (i = 0; i < num_items; ++i)
> +	{
> +		int			iw = item_weights[i];
> +		double		iv = item_values ? item_values[i] : 1;
> +
> +		for (j = max_weight; j >= iw; --j)
> +		{
> +			int			ow = j - iw;
> +
> +			if (values[j] <= values[ow] + iv)
> +			{
> +				/* copy sets[ow] to sets[j] without realloc */
> +				if (j != ow)
> +				{
> +					sets[j] = bms_del_members(sets[j], sets[j]);
> +					sets[j] = bms_add_members(sets[j], sets[ow]);
> +				}
> +
> +				sets[j] = bms_add_member(sets[j], i);
> +
> +				values[j] = values[ow] + iv;
> +			}
> +		}
> +	}
> +
> +	MemoryContextSwitchTo(oldctx);
> +
> +	result = bms_del_member(bms_copy(sets[max_weight]), num_items);
> +
> +	MemoryContextDelete(local_ctx);
> +
> +	return result;
> +}

> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 252af5c870..ea1c538287 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -21,7 +21,7 @@
>  #include "postgres.h"
>  
>  #include "access/hash.h"
> -
> +#include "nodes/pg_list.h"

Pedanticism^7: You're deleting a line here...

> +/*
>   * bms_nonempty_difference - do sets have a nonempty difference?
> */ > bool > diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c > index 7418fbeded..d2d66f915b 100644 > --- a/src/backend/nodes/outfuncs.c > +++ b/src/backend/nodes/outfuncs.c > @@ -1936,6 +1936,28 @@ _outAggPath(StringInfo str, const AggPath *node) > } > > static void > +_outRollupData(StringInfo str, const RollupData *node) > +{ > + WRITE_NODE_TYPE("ROLLUP"); > + > + WRITE_NODE_FIELD(groupClause); > + WRITE_NODE_FIELD(gsets); > + WRITE_NODE_FIELD(gsets_data); > + WRITE_FLOAT_FIELD(numGroups, "%.0f"); > + WRITE_BOOL_FIELD(hashable); > + WRITE_BOOL_FIELD(is_hashed); > +} > + > +static void > +_outGroupingSetData(StringInfo str, const GroupingSetData *node) > +{ > + WRITE_NODE_TYPE("GSDATA"); > + > + WRITE_NODE_FIELD(set); > + WRITE_FLOAT_FIELD(numGroups, "%.0f"); > +} .0f? > +static grouping_sets_data * > +preprocess_grouping_sets(PlannerInfo *root) > +{ It'd not hurt to add a comment as to what this function is actually doing. Btw, this would be a lot easier to review if the move to preprocess_grouping_sets were done in a separate preliminary commit. > + if (!bms_is_empty(gd->unsortable_refs)) > + { ... > + } > + else > + sets = extract_rollup_sets(parse->groupingSets); I hope there's tests for both these branches. > + /* > + * Is it hashable? We pretend empty sets are hashable even though we > + * actually force them not to be hashed later. But don't bother if > + * there's nothing but empty sets. > + */ Why? > @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root, > * estimate_hashagg_tablesize > * estimate the number of bytes that a hash aggregate hashtable will > * require based on the agg_costs, path width and dNumGroups. > + * > + * XXX this may be over-estimating the size now that hashagg knows to omit > + * unneeded columns from the hashtable. Also for mixed-mode grouping sets, > + * grouping columns not in the hashed set are counted here even though hashagg > + * won't store them. Is this a problem? > */ Hm. 
This seems like it could move us to use sorting, even if not actually warranted. > +/* > + * For a given input path, consider the possible ways of doing grouping sets on > + * it, by combinations of hashing and sorting. This can be called multiple > + * times, so it's important that it not scribble on input. No result is > + * returned, but any generated paths are added to grouped_rel. > + */ > + > +static void > +consider_groupingsets_paths(PlannerInfo *root, What's the newline after the comment about? > + RelOptInfo *grouped_rel, > + Path *path, > + bool is_sorted, > + bool can_hash, > + PathTarget *target, > + grouping_sets_data *gd, > + const AggClauseCosts *agg_costs, > + double dNumGroups) > +{ > + Query *parse = root->parse; > + > + /* > + * If we're not being offered sorted input, then only consider plans that > + * can be done entirely by hashing. > + * > + * We can hash everything if it looks like it'll fit in work_mem. But if > + * the input is actually sorted despite not being advertised as such, we > + * prefer to make use of that in order to use less memory. > + * If none of the grouping sets are sortable, then ignore the work_mem > + * limit and generate a path anyway, since otherwise we'll just fail. > + */ > + if (!is_sorted) > + { > + List *new_rollups = NIL; > + RollupData *unhashable_rollup = NULL; > + List *sets_data; > + List *empty_sets_data = NIL; > + List *empty_sets = NIL; > + ListCell *lc; > + ListCell *l_start = list_head(gd->rollups); > + AggStrategy strat = AGG_HASHED; > + Size hashsize; > + double exclude_groups = 0.0; > + > + Assert(can_hash); > + > + if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys)) > + { > + unhashable_rollup = lfirst(l_start); a) cast result of lfirst/lnext/whatnot. b) Why is this guaranteed to be unhashable? > + exclude_groups = unhashable_rollup->numGroups; > + l_start = lnext(l_start); > + } What guarantees that such a rollup would be at the start? 
> + hashsize = estimate_hashagg_tablesize(path, > + agg_costs, > + dNumGroups - exclude_groups); > + > + if (hashsize > work_mem * 1024L && gd->rollups) > + return; /* nope, won't fit */ Didn't you above talk about generating a path anyway? Or is rollups guaranteed not to be set in that case (if so, not obvious on a casual read)? > + /* > + * If we have sorted input but nothing we can do with it, bail. > + */ > + > + if (list_length(gd->rollups) == 0) > + return; Postgres style usually doesn't add a newline before comment and the block it concerns - you do that all over. > + if (can_hash && gd->any_hashable) > + { > + List *rollups = NIL; > + List *hash_sets = list_copy(gd->unsortable_sets); > + double availspace = (work_mem * 1024.0); Why the parens, and why the .0? > + ListCell *lc; > + > + /* > + * Account first for space needed for groups we can't sort at all. > + */ > + availspace -= (double) estimate_hashagg_tablesize(path, > + agg_costs, > + gd->dNumHashGroups); > + > + if (availspace > 0 && list_length(gd->rollups) > 1) > + { > + double scale; > + int num_rollups = list_length(gd->rollups); > + int k_capacity; > + int *k_weights = palloc(num_rollups * sizeof(int)); > + Bitmapset *hash_items = NULL; > + int i; > + > + /* > + * To use the discrete knapsack, we need to scale the values to a > + * reasonably small bounded range. We choose to allow a 5% error > + * margin; we have no more than 4096 rollups in the worst possible > + * case, which with a 5% error margin will require a bit over 42MB > + * of workspace. (Anyone wanting to plan queries that complex had > + * better have the memory for it. In more reasonable cases, with > + * no more than a couple of dozen rollups, the memory usage will > + * be negligible.) > + * > + * k_capacity is naturally bounded, but we clamp the values for > + * scale and weight (below) to avoid overflows or underflows (or > + * uselessly trying to use a scale factor less than 1 byte). 
> + */

I think you need a higher level explanation of how you're mapping the
work_mem issue to knapsack somewhere.

> +        /*
> +         * Apply knapsack algorithm; compute the set of items which
> +         * maximizes the value stored (in this case the number of sorts
> +         * saved) while keeping the total size (approximately) within
> +         * capacity.
> +         */

Hm. So we're solely optimizing for the number of sorts, rather than the
cost of the sorts. Probably an acceptable tradeoff.

> +}

Phew, this isn't simple. Need to be more awake than I am, 8 hours into
a flight.

> @@ -737,7 +739,8 @@ typedef enum AggStrategy
>  {
>      AGG_PLAIN,        /* simple agg across all input rows */
>      AGG_SORTED,       /* grouped agg, input must be sorted */
> -    AGG_HASHED        /* grouped agg, use internal hashtable */
> +    AGG_HASHED,       /* grouped agg, use internal hashtable */
> +    AGG_MIXED         /* grouped agg, hash and sort concurrently */
>  } AggStrategy;

"Concurrently" sounds a bit like using multiple processes / parallelism...

- Andres
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: Andres> Changes to advance_aggregates() are, in my experience, quiteAndres> likely to have performance effects. This needssomeAndres> performance tests.[...]Andres> Looks like it could all be noise, but it seems worthwhile toAndres> lookinto some. Trying to sort out signal from noise when dealing with performance impacts of no more than a few percent is _extremely hard_ these days. Remember this, from a couple of years back? http://tinyurl.com/op9qg8a That's a graph of performance results from tests where the only change being made was adding padding bytes to a function that was never called during the test. Performance differences on the order of 3% were being introduced by doing nothing more than changing the actual memory locations of functions. My latest round of tests on this patch shows a similar effect. Here's one representative test timing (a select count(*) with no grouping sets involved): master: 5727ms gsets_hash: 5949msgsets_hash + 500 padding bytes: 5680ms Since the effect of padding is going to vary over time as other, unrelated, patches happen, I think I can safely claim that the performance of the patch at least overlaps the noise range of the performance of current code. To get a more definitive result, it would be necessary to run at least some dozens of tests, with different padding sizes, and determine whether the average changes detectably between the two versions. I will go ahead and do this, out of sheer curiosity if nothing else, but the preliminary results suggest there's probably nothing worth holding up the patch for. -- Andrew (irc:RhodiumToad)
> On Mar 22, 2017, at 8:09 AM, Mark Dilger <hornschnorter@gmail.com> wrote: > > >> On Mar 22, 2017, at 7:55 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: >> >> [snip] >> >> This thread seems to have gone quiet - is it time for me to just go >> ahead and commit the thing anyway? Anyone else want to weigh in? > > Sorry for the delay. I'll try to look at this today. Others are most welcome > to weigh in. You define DiscreteKnapsack to take integer weights and double values, and perform the usual Dynamic Programming algorithm to solve. But the only place you call this, you pass in NULL for the values, indicating that all the values are equal. I'm confused why in this degenerate case you bother with the DP at all. Can't you just load the knapsack from lightest to heaviest after sorting, reducing the runtime complexity to O(nlogn)? It's been a long day. Sorry if I am missing something obvious. I'm guessing you intend to extend the code at some later date to have different values for items. Is that right? I reapplied your patch and reran the regression tests on my laptop and they all pass. mark
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: >> - Assert(newphase == 0 || newphase == aggstate->current_phase + 1);>> + Assert(newphase <= 1 || newphase == aggstate->current_phase+ 1); Andres> I think this somewhere in the file header needs an expandedAndres> explanation about what these "special" phasesmean. I could move most of the block comment for ExecInitAgg up into the file header; would that be better? Andres> I've not yet read up on the thread, or the whole patch, but anAndres> explanation of the memory management for allthose tables wouldAndres> be good somewhere (maybe I've just not read that far). I can expand on that. Andres> "mixed types" sounds a bit weird. changing to "if there are both types present" >> + values = palloc((1 + max_weight) * sizeof(double));>> + sets = palloc((1 + max_weight) * sizeof(Bitmapset *)); Andres> We usually cast the result of palloc. Rough count in the backend has ~400 without casts to ~1350 with, so this doesn't seem to have been consistently enforced. >> +static void>> +_outGroupingSetData(StringInfo str, const GroupingSetData *node)>> +{>> + WRITE_NODE_TYPE("GSDATA");>>+>> + WRITE_NODE_FIELD(set);>> + WRITE_FLOAT_FIELD(numGroups, "%.0f");>> +} Andres> .0f? Copied from every other node that uses "%.0f" to write out a numGroups value. >> +static grouping_sets_data *>> +preprocess_grouping_sets(PlannerInfo *root)>> +{ Andres> It'd not hurt to add a comment as to what this function isAndres> actually doing. Sure. Andres> I hope there's tests for both these branches. A number of tests for both unsortable and unhashable cases are in the regression test. >> + /*>> + * Is it hashable? We pretend empty sets are hashable even though we>> + * actually forcethem not to be hashed later. But don't bother if>> + * there's nothing but empty sets.>> + */ Andres> Why? If there's no non-empty grouping set then there's nothing to hash (the hash code assumes at least one column). I'll put that in the comment. 
 >> @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root,
 >>  * estimate_hashagg_tablesize
 >>  *	  estimate the number of bytes that a hash aggregate hashtable will
 >>  *	  require based on the agg_costs, path width and dNumGroups.
 >> + *
 >> + * XXX this may be over-estimating the size now that hashagg knows to omit
 >> + * unneeded columns from the hashtable. Also for mixed-mode grouping sets,
 >> + * grouping columns not in the hashed set are counted here even though hashagg
 >> + * won't store them. Is this a problem?
 >>  */

 Andres> Hm. This seems like it could move us to use sorting, even if
 Andres> not actually warranted.

This is largely a pre-existing issue that this patch doesn't try and
fix. That would be an issue for a separate patch, I think. I merely
added the comment to point it out.

 Andres> What's the newline after the comment about?

Old style habits die hard.

 >> +						RelOptInfo *grouped_rel,
 >> +						Path *path,
 >> +						bool is_sorted,
 >> +						bool can_hash,
 >> +						PathTarget *target,
 >> +						grouping_sets_data *gd,
 >> +						const AggClauseCosts *agg_costs,
 >> +						double dNumGroups)
 >> +{
 >> +	Query	   *parse = root->parse;
 >> +
 >> +	/*
 >> +	 * If we're not being offered sorted input, then only consider plans that
 >> +	 * can be done entirely by hashing.
 >> +	 *
 >> +	 * We can hash everything if it looks like it'll fit in work_mem. But if
 >> +	 * the input is actually sorted despite not being advertised as such, we
 >> +	 * prefer to make use of that in order to use less memory.
 >> +	 * If none of the grouping sets are sortable, then ignore the work_mem
 >> +	 * limit and generate a path anyway, since otherwise we'll just fail.
 >> +	 */
 >> +	if (!is_sorted)
 >> +	{
 >> +		List	   *new_rollups = NIL;
 >> +		RollupData *unhashable_rollup = NULL;
 >> +		List	   *sets_data;
 >> +		List	   *empty_sets_data = NIL;
 >> +		List	   *empty_sets = NIL;
 >> +		ListCell   *lc;
 >> +		ListCell   *l_start = list_head(gd->rollups);
 >> +		AggStrategy strat = AGG_HASHED;
 >> +		Size		hashsize;
 >> +		double		exclude_groups = 0.0;
 >> +
 >> +		Assert(can_hash);
 >> +
 >> +		if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 >> +		{
 >> +			unhashable_rollup = lfirst(l_start);

 Andres> a) cast result of lfirst/lnext/whatnot.

casting lfirst is defensible (and only about 10% of existing uses don't
cast it), but why would you ever cast the result of lnext?

 Andres> b) Why is this guaranteed to be unhashable?

It might not be unhashable; that probably needs a better name. It's the
rollup that we aren't going to hash because we're being presented with
input already in correct order. Also, it's the only rollup that can
contain empty grouping sets, which the hash code can't cope with.
Obviously this needs a comment and some consideration of naming.

 >> +	if (hashsize > work_mem * 1024L && gd->rollups)
 >> +		return;				/* nope, won't fit */

 Andres> Didn't you above talk about generating a path anyway? Or is
 Andres> rollups guaranteed not to be set in that case (if so, not
 Andres> obvious on a casual read)?

Unsortable groups can't participate in rollups. The presence of a
rollup implies that we're going to be offered at least one is_sorted
input path.

 >> +	if (can_hash && gd->any_hashable)
 >> +	{
 >> +		List	   *rollups = NIL;
 >> +		List	   *hash_sets = list_copy(gd->unsortable_sets);
 >> +		double		availspace = (work_mem * 1024.0);

 Andres> Why the parens, and why the .0?
The .0 is because the * should be done as a double, not an int. This
isn't an uncommon idiom, there are quite a few similar uses in the
planner alone. The parens because I thought it was clearer that way.

 Andres> I think you need a higher level explanation of how you're
 Andres> mapping the work_mem issue to knapsack somewhere.

The general overview ("work_mem is the knapsack size, and we need to
figure out how best to pack it with hashtables"), or the specific issue
of the scale factor for discrete chunking of the size?

 Andres> Hm. So we're solely optimizing for the number of sorts, rather
 Andres> the cost of the sorts. Probably an acceptable tradeoff.

The existing cost model for sorts doesn't actually seem to help us here;
all of our sorts sort the same data, just with different comparison
columns, but cost_sort doesn't actually account for that (all its
callers except the CLUSTER planning code pass in 0.0 for
comparison_cost).

If the sort costs were correct, we could compute a "value" for each
rollup that reflected the cost difference between the sort+unique
comparisons for it, and the hash functions that would be called if we
hashed instead; and just pass that to the knapsack function.

-- 
Andrew (irc:RhodiumToad)
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes: Mark> You define DiscreteKnapsack to take integer weights and doubleMark> values, and perform the usual Dynamic Programmingalgorithm toMark> solve. But the only place you call this, you pass in NULL forMark> the values, indicating thatall the values are equal. I'mMark> confused why in this degenerate case you bother with the DP atMark> all. Can't youjust load the knapsack from lightest to heaviestMark> after sorting, reducing the runtime complexity to O(nlogn)? It'sMark>been a long day. Sorry if I am missing something obvious. That's actually a very good question. (Though in passing I should point out that the runtime cost is going to be negligible in all practical cases. Even an 8-way cube (256 grouping sets) has only 70 rollups, and a 4-way cube has only 6; the limit of 4096 grouping sets was only chosen because it was a nice round number that was larger than comparable limits in other database products. Other stuff in the grouping sets code has worse time bounds; the bipartite-match code used to minimize the number of rollups is potentially O(n^2.5) in the number of grouping sets.) Thinking about it, it's not at all obvious what we should be optimizing for here in the absence of accurate sort costs. Right now what matters is saving as many sort steps as possible, since that's a saving likely to be many orders of magnitude greater than the differences between two sorts. The one heuristic that might be useful is to prefer the smaller estimated size if other factors are equal, just on memory usage grounds, but even that seems a bit tenuous. I'm inclined to leave things as they are in the current patch, and maybe revisit it during beta if we get any relevant feedback from people actually using grouping sets? Mark> I'm guessing you intend to extend the code at some later date toMark> have different values for items. Is that right? 
Well, as I mentioned in a reply to Andres, we probably should eventually optimize for greatest reduction in cost, and the code as it stands would allow that to be added easily, but that would require having more accurate cost info than we have now. cost_sort doesn't take into consideration the number or types of sort columns or the costs of their comparison functions, so all sorts of the same input data are costed equally. -- Andrew (irc:RhodiumToad)
On 2017-03-23 03:43:57 +0000, Andrew Gierth wrote:
> >>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
>
>  Andres> Changes to advance_aggregates() are, in my experience, quite
>  Andres> likely to have performance effects. This needs some
>  Andres> performance tests.
> [...]
>  Andres> Looks like it could all be noise, but it seems worthwhile to
>  Andres> look into some.
>
> Trying to sort out signal from noise when dealing with performance
> impacts of no more than a few percent is _extremely hard_ these days.

Indeed. But that doesn't mean we needn't try. With some determination
and profiling you can often separate signal from noise - I managed to
track down 0.12% regressions in the expression evaluation work...

> I will go ahead and do this, out of sheer curiosity if nothing else,
> but the preliminary results suggest there's probably nothing worth
> holding up the patch for.

Agreed. I'd want to run one more profile, checking whether the profiles
indicate new hotspots, but other than that...

Greetings,

Andres Freund
> On Mar 23, 2017, at 11:22 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > >>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes: > > Mark> You define DiscreteKnapsack to take integer weights and double > Mark> values, and perform the usual Dynamic Programming algorithm to > Mark> solve. But the only place you call this, you pass in NULL for > Mark> the values, indicating that all the values are equal. I'm > Mark> confused why in this degenerate case you bother with the DP at > Mark> all. Can't you just load the knapsack from lightest to heaviest > Mark> after sorting, reducing the runtime complexity to O(nlogn)? It's > Mark> been a long day. Sorry if I am missing something obvious. > > That's actually a very good question. > > (Though in passing I should point out that the runtime cost is going to > be negligible in all practical cases. Even an 8-way cube (256 grouping > sets) has only 70 rollups, and a 4-way cube has only 6; the limit of > 4096 grouping sets was only chosen because it was a nice round number > that was larger than comparable limits in other database products. Other > stuff in the grouping sets code has worse time bounds; the > bipartite-match code used to minimize the number of rollups is > potentially O(n^2.5) in the number of grouping sets.) > > Thinking about it, it's not at all obvious what we should be optimizing > for here in the absence of accurate sort costs. Is there a performance test case where this patch should shine brightest? I'd like to load a schema with lots of data, and run a grouping sets query, both before and after applying the patch, to see what the performance advantage is. mark
On 2017-03-23 05:09:46 +0000, Andrew Gierth wrote: > >>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: > > >> - Assert(newphase == 0 || newphase == aggstate->current_phase + 1); > >> + Assert(newphase <= 1 || newphase == aggstate->current_phase + 1); > > Andres> I think this somewhere in the file header needs an expanded > Andres> explanation about what these "special" phases mean. > > I could move most of the block comment for ExecInitAgg up into the file > header; would that be better? Yes, I think so. > >> + values = palloc((1 + max_weight) * sizeof(double)); > >> + sets = palloc((1 + max_weight) * sizeof(Bitmapset *)); > > Andres> We usually cast the result of palloc. > > Rough count in the backend has ~400 without casts to ~1350 with, so this > doesn't seem to have been consistently enforced. Yea, but we're still trying. > >> + RelOptInfo *grouped_rel, > >> + Path *path, > >> + bool is_sorted, > >> + bool can_hash, > >> + PathTarget *target, > >> + grouping_sets_data *gd, > >> + const AggClauseCosts *agg_costs, > >> + double dNumGroups) > >> +{ > >> + Query *parse = root->parse; > >> + > >> + /* > >> + * If we're not being offered sorted input, then only consider plans that > >> + * can be done entirely by hashing. > >> + * > >> + * We can hash everything if it looks like it'll fit in work_mem. But if > >> + * the input is actually sorted despite not being advertised as such, we > >> + * prefer to make use of that in order to use less memory. > >> + * If none of the grouping sets are sortable, then ignore the work_mem > >> + * limit and generate a path anyway, since otherwise we'll just fail. 
> >> + */ > >> + if (!is_sorted) > >> + { > >> + List *new_rollups = NIL; > >> + RollupData *unhashable_rollup = NULL; > >> + List *sets_data; > >> + List *empty_sets_data = NIL; > >> + List *empty_sets = NIL; > >> + ListCell *lc; > >> + ListCell *l_start = list_head(gd->rollups); > >> + AggStrategy strat = AGG_HASHED; > >> + Size hashsize; > >> + double exclude_groups = 0.0; > >> + > >> + Assert(can_hash); > >> + > >> + if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys)) > >> + { > >> + unhashable_rollup = lfirst(l_start); > > Andres> a) cast result of lfirst/lnext/whatnot. > > casting lfirst is defensible (and only about 10% of existing uses don't > cast it), but why would you ever cast the result of lnext? Obviously lnext is bogus, was thinking of linitial... > >> + if (can_hash && gd->any_hashable) > >> + { > >> + List *rollups = NIL; > >> + List *hash_sets = list_copy(gd->unsortable_sets); > >> + double availspace = (work_mem * 1024.0); > > Andres> Why the parens, and why the .0? > > The .0 is because the * should be done as a double, not an int. I was just missing why - but now that I've not already been awake for 20 hours, 8 of those crammed into a plane, it's obviously beneficial because of overflow concerns. > Andres> I think you need a higher level explanation of how you're > Andres> mapping the work_mem issue to knapsack somewhere. > > The general overview ("work_mem is the knapsack size, and we need to > figure out how best to pack it with hashtables"), or the specific issue > of the scale factor for discrete chunking of the size? The general overview - the scaling thing doesn't seem that important to understand in detail, to understand the approach. Briefly explaining what we're trying to minimize (sort steps), what the constraint is (memory), that the problem is representable as the classic "knapsack problem". > Andres> Hm. So we're solely optimizing for the number of sorts, rather > Andres> the cost of the sorts. 
> Andres> Probably an acceptable tradeoff.
>
> The existing cost model for sorts doesn't actually seem to help us here;
> all of our sorts sort the same data, just with different comparison
> columns, but cost_sort doesn't actually account for that (all its
> callers except the CLUSTER planning code pass in 0.0 for
> comparison_cost).
>
> If the sort costs were correct, we could compute a "value" for each
> rollup that reflected the cost difference between the sort+unique
> comparisons for it, and the hash functions that would be called if we
> hashed instead; and just pass that to the knapsack function.

That'd indeed be desirable - but I think we can punt on that for now.

Greetings,

Andres Freund
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes: Mark> Is there a performance test case where this patch should shineMark> brightest? I'd like to load a schema with lotsof data, and runMark> a grouping sets query, both before and after applying the patch,Mark> to see what the performanceadvantage is. The area which I think is most important for performance is the handling of small cubes; without this patch, a 2d cube needs 2 full sorts, a 3d one needs 3, and a 4d one needs 6. In many real-world data sets these would all hash entirely in memory. So here's a very simple example (deliberately using integers for grouping to minimize the advantage; grouping by text columns in a non-C locale would show a much greater speedup for the patch): create table sales ( id serial, product_id integer, store_id integer, customer_id integer, qty integer); -- random integer function create function d(integer) returns integer language sqlas $f$ select floor(random()*$1)::integer + 1; $f$; -- 10 million random rows insert into sales (product_id,store_id,customer_id,qty) select d(20), d(6), d(10), d(100) from generate_series(1,10000000); -- example 2d cube: select product_id, store_id, count(*), sum(qty) from salesgroup by cube(product_id, store_id); -- example 3d cube: select product_id, store_id, customer_id, count(*), sum(qty) from salesgroup by cube(product_id, store_id, customer_id); -- example 4d cube with a computed column: select product_id, store_id, customer_id, (qty / 10), count(*), sum(qty) from salesgroup by cube(product_id, store_id, customer_id,(qty / 10)); On my machine, the 2d cube is about 3.6 seconds with the patch, and about 8 seconds without it; the 4d is about 18 seconds with the patch and about 32 seconds without it (all with work_mem=1GB, compiled with -O2 and assertions off). -- Andrew (irc:RhodiumToad)
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: Andres> We usually cast the result of palloc. >> Rough count in the backend has ~400 without casts to ~1350 with, so>> this doesn't seem to have been consistently enforced. Andres> Yea, but we're still trying. Well, a lot of the uncasted ones are in fairly new code, from quite a number of different committers. So if this is a big deal, why don't we already have #define palloc_array(etype,ecount) (((etype) *) palloc((ecount) * sizeof(etype))) #define palloc_object(otype) (((otype) *) palloc(sizeof(otype))) or something of that ilk? -- Andrew (irc:RhodiumToad)
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: Andres> a) cast result of lfirst/lnext/whatnot. Again, what we need here is something like #define lfirst_node(_type_, l) (castNode(_type_, lfirst(l))) etc. -- Andrew (irc:RhodiumToad)
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: >> I could move most of the block comment for ExecInitAgg up into the >> file header; would that be better? Andres> Yes, I think so. Done that and some other clarifications. Attached is the latest patch; no substantive changes, only comments, formatting and a change to a variable name. The performance side of things I'm still working on; I think that would be best handled as a followup patch. -- Andrew (irc:RhodiumToad)