[HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
Herewith a patch for doing grouping sets via hashing or mixed hashing
and sorting.

The principal objective is to pick whatever combination of grouping sets
has an estimated size that fits in work_mem, and minimizes the number of
sorting passes we need to do over the data, and hash those.  (Yes, this
is a knapsack problem.)
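
For illustration only, a minimal standalone sketch of the kind of 0/1
knapsack selection involved; this is not the patch's DiscreteKnapsack.
Estimated hashtable sizes play the weights, work_mem the capacity, and
each item packed is one sort pass saved:

#include <stdio.h>

#define MAX_ITEMS 64
#define MAX_CAP   1024

/* Returns how many items were packed; chosen[] receives their indexes. */
static int
knapsack(int capacity, int nitems, const int weights[], int chosen[])
{
    static unsigned char pick[MAX_ITEMS][MAX_CAP + 1];  /* zeroed once */
    int         best[MAX_CAP + 1] = {0};    /* best[w]: max items in w */
    int         i, w, count = 0;

    for (i = 0; i < nitems; i++)
        for (w = capacity; w >= weights[i]; w--)
            if (best[w - weights[i]] + 1 > best[w])
            {
                best[w] = best[w - weights[i]] + 1;
                pick[i][w] = 1;
            }

    /* walk backwards to recover which items were packed */
    for (i = nitems - 1, w = capacity; i >= 0; i--)
        if (pick[i][w])
        {
            chosen[count++] = i;
            w -= weights[i];
        }
    return count;
}

int
main(void)
{
    int         weights[] = {400, 300, 500, 200};   /* est. table size, kB */
    int         chosen[4];
    int         n = knapsack(1000, 4, weights, chosen); /* work_mem = 1000kB */

    printf("hash %d of 4 grouping sets\n", n);  /* 3; the 500kB one sorts */
    return 0;
}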

This is achieved by adding another strategy to the Agg node; AGG_MIXED
means that both hashing and (sorted or plain) grouping happens.  In
addition, support for multiple hash tables in AGG_HASHED mode is added.

Sample plans look like this:

explain select two,ten from onek group by cube(two,ten);
                          QUERY PLAN                          
--------------------------------------------------------------
 MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
   Hash Key: two, ten
   Hash Key: two
   Hash Key: ten
   Group Key: ()
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)

explain select two,ten from onek group by two, cube(ten,twenty);
                          QUERY PLAN                           
---------------------------------------------------------------
 HashAggregate  (cost=86.50..100.62 rows=162 width=12)
   Hash Key: two, ten, twenty
   Hash Key: two, ten
   Hash Key: two
   Hash Key: two, twenty
   ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)

set work_mem='64kB';
explain select count(*) from tenk1
  group by grouping sets (unique1,thousand,hundred,ten,two);
                               QUERY PLAN                               
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..3023.89 rows=11112 width=28)
   Hash Key: two
   Hash Key: ten
   Hash Key: hundred
   Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: unique1
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
(10 rows)

Columns with unhashable or unsortable data types are handled
appropriately, though obviously every individual grouping set must end
up containing compatible column types.

There is one current weakness which I don't see a good solution for: the
planner code still has to pick a single value for group_pathkeys before
planning the input path.  This means that we sometimes can't choose a
minimal set of sorts, because we may have chosen a sort order for a
grouping set that should have been hashed, for example:

explain select count(*) from tenk1
  group by grouping sets (two,unique1,thousand,hundred,ten);
                               QUERY PLAN                               
------------------------------------------------------------------------
 MixedAggregate  (cost=1535.39..4126.28 rows=11112 width=28)
   Hash Key: ten
   Hash Key: hundred
   Group Key: two
   Sort Key: unique1
     Group Key: unique1
   Sort Key: thousand
     Group Key: thousand
   ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
         Sort Key: two
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)

(3 sorts, vs. 2 in the previous example that listed the same grouping
sets in different order)

Current status of the work: probably still needs cleanup, maybe some
better regression tests, and Andres has expressed some concern over the
performance impact of additional complexity in advance_aggregates; it
would be useful if this could be tested.  But it should be reviewable
as-is.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Robert Haas
Date:
On Thu, Jan 5, 2017 at 10:48 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash support for grouping sets

From
"Finnerty, Jim"
Date:
The ability to exploit hashed aggregation within sorted groups, when the
order of the input stream can be exploited this way, is potentially a
useful way to improve aggregation performance more generally.  This would
potentially be beneficial when the input size is expected to be larger
than the amount of working memory available for hashed aggregation, but
where there is enough memory to hash-aggregate just the unsorted grouping
key combinations, and when the cumulative cost of rebuilding the hash
table for each sorted subgroup is less than the cost of sorting the
entire input.  In other words, if most of the grouping key combinations
are already segregated by virtue of the input order, then hashing the
remaining combinations within each sorted group might be done in memory,
at the cost of rebuilding the hash table for each sorted subgroup.
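
A runnable toy illustrating that idea (not code from the patch; rows are
assumed pre-sorted on key "a" and hash-aggregated on key "b", and every
name here is invented):

#include <stdio.h>
#include <string.h>

#define HASHSZ 8                    /* toy table: one subgroup's worth */

typedef struct { int used, b; long sum; } Slot;

static Slot table[HASHSZ];

static void
flush(int a)                        /* emit and reset one subgroup */
{
    int         i;

    for (i = 0; i < HASHSZ; i++)
        if (table[i].used)
            printf("a=%d b=%d sum=%ld\n", a, table[i].b, table[i].sum);
    memset(table, 0, sizeof table);
}

static void
add(int b, long v)                  /* open-addressing hash on b */
{
    int         i = (unsigned) b % HASHSZ;

    while (table[i].used && table[i].b != b)
        i = (i + 1) % HASHSZ;
    table[i].used = 1;
    table[i].b = b;
    table[i].sum += v;
}

int
main(void)
{
    /* input pre-sorted on .a; .b arrives in any order within a run */
    struct { int a, b; long v; } rows[] = {
        {1, 7, 10}, {1, 3, 5}, {1, 7, 1},
        {2, 3, 2}, {2, 9, 4},
    };
    int         n = sizeof rows / sizeof rows[0];
    int         i, cur = rows[0].a;

    for (i = 0; i < n; i++)
    {
        if (rows[i].a != cur)       /* sorted-group boundary */
        {
            flush(cur);
            cur = rows[i].a;
        }
        add(rows[i].b, rows[i].v);
    }
    flush(cur);
    return 0;
}

The hash table need never hold more than one sorted subgroup's distinct
keys, which is exactly the memory saving described above.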
 

I haven't looked at the code for this change yet (I hope I will have the
time to do that).  Ideally the decision to choose the aggregation method
as sorted, hashed, or mixed hash/sort should be integrated into the cost
model, but given the notorious difficulty of estimating intermediate
cardinalities accurately, it would be difficult to develop a cardinality
model and a cost model accurate enough to choose among these options
consistently well.
 

Jim Finnerty
Amazon Corp.

On 1/10/17, 10:22 AM, "pgsql-hackers-owner@postgresql.org on behalf of Robert Haas" <pgsql-hackers-owner@postgresql.org on behalf of robertmhaas@gmail.com> wrote:

    On Thu, Jan 5, 2017 at 10:48 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
    > Herewith a patch for doing grouping sets via hashing or mixed hashing
    > and sorting.

    Cool.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company


Re: [HACKERS] Hash support for grouping sets

From
Robert Haas
Date:
On Mon, Jan 16, 2017 at 10:59 AM, Finnerty, Jim <jfinnert@amazon.com> wrote:
> The ability to exploit hashed aggregation within sorted groups, when the
> order of the input stream can be exploited this way, is potentially a
> useful way to improve aggregation performance more generally.  This would
> potentially be beneficial when the input size is expected to be larger
> than the amount of working memory available for hashed aggregation, but
> where there is enough memory to hash-aggregate just the unsorted grouping
> key combinations, and when the cumulative cost of rebuilding the hash
> table for each sorted subgroup is less than the cost of sorting the
> entire input.  In other words, if most of the grouping key combinations
> are already segregated by virtue of the input order, then hashing the
> remaining combinations within each sorted group might be done in memory,
> at the cost of rebuilding the hash table for each sorted subgroup.

Neat idea.

> I haven't looked at the code for this change yet (I hope I will have the
> time to do that).  Ideally the decision to choose the aggregation method
> as sorted, hashed, or mixed hash/sort should be integrated into the cost
> model, but given the notorious difficulty of estimating intermediate
> cardinalities accurately, it would be difficult to develop a cardinality
> model and a cost model accurate enough to choose among these options
> consistently well.

Yes, that might be a little tricky.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash support for grouping sets

From
Thom Brown
Date:
On 6 January 2017 at 03:48, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.
>
> [...]
>
> Current status of the work: probably still needs cleanup, maybe some
> better regression tests, and Andres has expressed some concern over the
> performance impact of additional complexity in advance_aggregates; it
> would be useful if this could be tested.  But it should be reviewable
> as-is.

This doesn't apply cleanly to latest master.  Could you please post a
rebased patch?

Thanks

Thom



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Thom" == Thom Brown <thom@linux.com> writes:

 Thom> This doesn't apply cleanly to latest master.  Could you please
 Thom> post a rebased patch?

Sure.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

On my MacBook, `make check-world` gives differences in the contrib modules:

cat contrib/postgres_fdw/regression.diffs
*** /Users/mark/hydra/postgresql.review/contrib/postgres_fdw/expected/postgres_fdw.out    2017-03-03 13:33:47.000000000 -0800
--- /Users/mark/hydra/postgresql.review/contrib/postgres_fdw/results/postgres_fdw.out    2017-03-07 13:27:56.000000000 -0800
***************
*** 3148,3163 ****
  -- Grouping sets
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
!                                             QUERY PLAN
! ---------------------------------------------------------------------------------------------------
!  GroupAggregate
!    Output: c2, sum(c1)
!    Group Key: ft1.c2
!    Group Key: ()
!    ->  Foreign Scan on public.ft1
!          Output: c2, c1
!          Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (7 rows)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
   c2 |  sum
--- 3148,3166 ----
  -- Grouping sets
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
!                                   QUERY PLAN
! ------------------------------------------------------------------------------
!  Sort
!    Output: c2, (sum(c1))
!    Sort Key: ft1.c2
!    ->  MixedAggregate
!          Output: c2, sum(c1)
!          Hash Key: ft1.c2
!          Group Key: ()
!          ->  Foreign Scan on public.ft1
!                Output: c2, c1
!                Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls last;
   c2 |  sum
***************
*** 3170,3185 ****
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
!                                             QUERY PLAN
! ---------------------------------------------------------------------------------------------------
!  GroupAggregate
!    Output: c2, sum(c1)
!    Group Key: ft1.c2
!    Group Key: ()
!    ->  Foreign Scan on public.ft1
!          Output: c2, c1
!          Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (7 rows)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
   c2 |  sum
--- 3173,3191 ----
  explain (verbose, costs off)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
!                                   QUERY PLAN
! ------------------------------------------------------------------------------
!  Sort
!    Output: c2, (sum(c1))
!    Sort Key: ft1.c2
!    ->  MixedAggregate
!          Output: c2, sum(c1)
!          Hash Key: ft1.c2
!          Group Key: ()
!          ->  Foreign Scan on public.ft1
!                Output: c2, c1
!                Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last;
   c2 |  sum
***************
*** 3192,3211 ****
  explain (verbose, costs off)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
!                                                  QUERY PLAN
! -------------------------------------------------------------------------------------------------------------
   Sort
     Output: c2, c6, (sum(c1))
     Sort Key: ft1.c2, ft1.c6
!    ->  GroupAggregate
           Output: c2, c6, sum(c1)
!          Group Key: ft1.c2
!          Sort Key: ft1.c6
!            Group Key: ft1.c6
           ->  Foreign Scan on public.ft1
                 Output: c2, c6, c1
!                Remote SQL: SELECT "C 1", c2, c6 FROM "S 1"."T 1" WHERE ((c2 < 3)) ORDER BY c2 ASC NULLS LAST
! (11 rows)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
   c2 | c6 |  sum
--- 3198,3216 ----
  explain (verbose, costs off)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
!                                     QUERY PLAN
! ----------------------------------------------------------------------------------
   Sort
     Output: c2, c6, (sum(c1))
     Sort Key: ft1.c2, ft1.c6
!    ->  HashAggregate
           Output: c2, c6, sum(c1)
!          Hash Key: ft1.c2
!          Hash Key: ft1.c6
           ->  Foreign Scan on public.ft1
                 Output: c2, c6, c1
!                Remote SQL: SELECT "C 1", c2, c6 FROM "S 1"."T 1" WHERE ((c2 < 3))
! (10 rows)
  select c2, c6, sum(c1) from ft1 where c2 < 3 group by grouping sets(c2, c6) order by 1 nulls last, 2 nulls last;
   c2 | c6 |  sum

======================================================================


Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 7, 2017, at 1:43 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
> On my MacBook, `make check-world` gives differences in the contrib modules:

I get the same (or similar -- didn't check) regression failure on CentOS, so this
doesn't appear to be MacBook / hardware specific.

Mark Dilger


Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

On linux/gcc the patch generates a warning in nodeAgg.c that is fairly easy
to fix.  Using -Werror to make catching the error easier, I get:

make[3]: Entering directory `/home/mark/postgresql.clean/src/backend/executor'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -O2 -Werror -I../../../src/include -D_GNU_SOURCE -c -o nodeAgg.o nodeAgg.c -MMD -MP -MF .deps/nodeAgg.Po
 
cc1: warnings being treated as errors
nodeAgg.c: In function ‘ExecAgg’:
nodeAgg.c:2053: error: ‘result’ may be used uninitialized in this function
make[3]: *** [nodeAgg.o] Error 1
make[3]: Leaving directory `/home/mark/postgresql.clean/src/backend/executor'
make[2]: *** [executor-recursive] Error 2
make[2]: Leaving directory `/home/mark/postgresql.clean/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/mark/postgresql.clean/src'
make: *** [all-src-recurse] Error 2


There are two obvious choices to silence the warning:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f059560..8959f5b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2066,6 +2066,7 @@ ExecAgg(AggState *node)
                 break;
             case AGG_PLAIN:
             case AGG_SORTED:
+            default:
                 result = agg_retrieve_direct(node);
                 break;
         }
 

which I like better, because I don't expect it would create
an extra instruction in the compiled code, but also:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f059560..eab2605 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2050,7 +2050,7 @@ lookup_hash_entries(AggState *aggstate)
 TupleTableSlot *
 ExecAgg(AggState *node)
 {
-    TupleTableSlot *result;
+    TupleTableSlot *result = NULL;
 
     if (!node->agg_done)
     {

Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> On linux/gcc the patch generates a warning in nodeAgg.c that is
 Mark> fairly easy to fix.  Using -Werror to make catching the error
 Mark> easier, I get:
 

what gcc version is this exactly?

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> On my MacBook, `make check-world` gives differences in the
 Mark> contrib modules:

Thanks! Latest cleaned up version of patch is attached.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 8, 2017, at 2:30 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>
> Mark> On linux/gcc the patch generates a warning in nodeAgg.c that is
> Mark> fairly easy to fix.  Using -Werror to make catching the error
> Mark> easier, I get:
>
> what gcc version is this exactly?
>

Linux version 2.6.32-573.18.1.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Tue Feb 9 22:46:17 UTC 2016




Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 8, 2017, at 5:47 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>
> Mark> On my MacBook, `make check-world` gives differences in the
> Mark> contrib modules:
>
> Thanks! Latest cleaned up version of patch is attached.

This fixes both the warning and the contrib tests on linux and mac.




Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
Hi Andrew,

Reviewing the patch a bit more, I find it hard to understand the comment about
passing -1 as a flag for finalize_aggregates.  Any chance you can spend a bit
more time word-smithing that code comment?


@@ -1559,7 +1647,9 @@ prepare_projection_slot(AggState *aggstate, TupleTableSlot *slot, int currentSet)
 /*
  * Compute the final value of all aggregates for one group.
  *
- * This function handles only one grouping set at a time.
+ * This function handles only one grouping set at a time.  But in the hash
+ * case, it's the caller's responsibility to have selected the set already, and
+ * we pass in -1 here to flag that and to control the indexing into pertrans.
  *
  * Results are stored in the output econtext aggvalues/aggnulls.
  */
@@ -1575,10 +1665,11 @@ finalize_aggregates(AggState *aggstate,
     int         aggno;
     int         transno;
 
-    Assert(currentSet == 0 ||
-           ((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
-
-    aggstate->current_set = currentSet;
+    /* ugly hack for hash */
+    if (currentSet >= 0)
+        select_current_set(aggstate, currentSet, false);
+    else
+        currentSet = 0;


> On Mar 8, 2017, at 8:00 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>
>> On Mar 8, 2017, at 5:47 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>
>>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>>
>> Mark> On my MacBook, `make check-world` gives differences in the
>> Mark> contrib modules:
>>
>> Thanks! Latest cleaned up version of patch is attached.
>
> This fixes both the warning and the contrib tests on linux and mac.
>




Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Hi Andrew,

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?

Sure.

How does this look for wording (I'll incorporate it into an updated
patch later):

/*
 * Compute the final value of all aggregates for one group.
 *
 * This function handles only one grouping set at a time.  But in the hash
 * case, it's the caller's responsibility to have selected the set already, and
 * we pass in -1 as currentSet here to flag that; this also changes how we
 * handle the indexing into AggStatePerGroup as explained below.
 *
 * Results are stored in the output econtext aggvalues/aggnulls.
 */
static void
finalize_aggregates(AggState *aggstate,
                    AggStatePerAgg peraggs,
                    AggStatePerGroup pergroup,
                    int currentSet)
{
    ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
    Datum       *aggvalues = econtext->ecxt_aggvalues;
    bool        *aggnulls = econtext->ecxt_aggnulls;
    int         aggno;
    int         transno;

    /*
     * If currentSet >= 0, then we're doing sorted grouping, and pergroup is an
     * array of size numTrans*numSets which we have to index into using
     * currentSet in addition to transno. The caller may not have selected the
     * set, so we do that.
     *
     * If currentSet < 0, then we're doing hashed grouping, and pergroup is an
     * array of only numTrans items (since for hashed grouping, each grouping
     * set is in a separate hashtable).  We rely on the caller having done
     * select_current_set, and we fudge currentSet to 0 in order to make the
     * same indexing calculations work as for the grouping case.
     */
    if (currentSet >= 0)
        select_current_set(aggstate, currentSet, false);
    else
        currentSet = 0;
 


-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 8, 2017, at 9:40 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>
> Mark> Hi Andrew,
>
> Mark> Reviewing the patch a bit more, I find it hard to understand the
> Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
> Mark> chance you can spend a bit more time word-smithing that code
> Mark> comment?
>
> Sure.
>
> How does this look for wording (I'll incorporate it into an updated
> patch later):

Much better.  Thanks!

> [...]




Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Hi Andrew,

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?
 

Actually, ignore that prior response, I'll refactor it a bit to remove
the need for the comment at all.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> Reviewing the patch a bit more, I find it hard to understand the
 Mark> comment about passing -1 as a flag for finalize_aggregates.  Any
 Mark> chance you can spend a bit more time word-smithing that code
 Mark> comment?
 
 >> Sure.
 >> 
 >> How does this look for wording (I'll incorporate it into an updated
 >> patch later):

 Mark> Much better.  Thanks!

Here's a version of the patch that obviates that comment entirely, by
moving the differences between the hashed and sorted cases out to the
separate callers of finalize_aggregates. 

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
Another small update to the patch, this time to eliminate any
possibility of integer overflow when handling extremely large estimated
groupings.

Key change:

-                   k_weights[i] = (int) floor(sz / scale);
+                   /*
+                    * If sz is enormous, but work_mem (and hence scale) is
+                    * small, avoid integer overflow here.
+                    */
+                   k_weights[i] = (int) Min(floor(sz / scale),
+                                            k_capacity + 1.0);
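
To see why the clamp suffices, a standalone illustration with assumed
numbers (Min() is PostgreSQL's macro; standard fmin() stands in for it
here, and a weight greater than k_capacity just means "never fits"):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      sz = 1e18;              /* enormous estimated table size */
    double      scale = 64.0;           /* small work_mem => small scale */
    double      k_capacity = 1000.0;    /* knapsack capacity */

    /* (int) floor(sz / scale) alone would overflow; the clamp cannot */
    int         w = (int) fmin(floor(sz / scale), k_capacity + 1.0);

    printf("weight = %d\n", w);         /* 1001: safely "too big to fit" */
    return 0;
}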

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
[snip]

This thread seems to have gone quiet - is it time for me to just go
ahead and commit the thing anyway? Anyone else want to weigh in?

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andres Freund
Date:
Hi,

> +/*
> + * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
>   * current_phase + 1. Juggle the tuplesorts accordingly.
>   */
>  static void
>  initialize_phase(AggState *aggstate, int newphase)
>  {
> -    Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
> +    Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

I think this somewhere in the file header needs an expanded explanation
about what these "special" phases mean.


> @@ -838,17 +897,21 @@ advance_transition_function(AggState *aggstate,
>  /*
>   * Advance each aggregate transition state for one input tuple.  The input
>   * tuple has been stored in tmpcontext->ecxt_outertuple, so that it is
> - * accessible to ExecEvalExpr.  pergroup is the array of per-group structs to
> - * use (this might be in a hashtable entry).
> + * accessible to ExecEvalExpr.
> + *
> + * We have two sets of transition states to handle: one for sorted aggregation
> + * and one for hashed; we do them both here, to avoid multiple evaluation of
> + * the inputs.
>   *
>   * When called, CurrentMemoryContext should be the per-query context.
>   */

Changes to advance_aggregates() are, in my experience, quite likely to
have performance effects.  This needs some performance tests.

I scheduled a small TPCH comparison run:

master q01 min: 28924.766 dev-hash min: 28893.923 [diff -0.11]
master q02 min: 2448.135 dev-hash min: 2430.589 [diff -0.72]
master q03 min: 14596.292 dev-hash min: 14421.054 [diff -1.22]
master q04 min: 1999.684 dev-hash min: 1958.727 [diff -2.09]
master q05 min: 10638.148 dev-hash min: 10753.075 [diff +1.07]
master q06 min: 3679.955 dev-hash min: 3649.114 [diff -0.85]
master q07 min: 10097.272 dev-hash min: 10358.089 [diff +2.52]
master q08 min: 3103.698 dev-hash min: 3078.108 [diff -0.83]
master q09 min: 13034.87 dev-hash min: 13049.402 [diff +0.11]
master q10 min: 11702.966 dev-hash min: 11535.05 [diff -1.46]
master q11 min: 577.114 dev-hash min: 586.767 [diff +1.65]
master q12 min: 9087.413 dev-hash min: 9272.874 [diff +2.00]
master q13 min: 16353.813 dev-hash min: 16473.882 [diff +0.73]
master q14 min: 1564.321 dev-hash min: 1552.31 [diff -0.77]
master q15 min: 3958.565 dev-hash min: 3946.728 [diff -0.30]
master q16 min: 3793.345 dev-hash min: 3784.996 [diff -0.22]
master q17 min: 828.286 dev-hash min: 834.929 [diff +0.80]
master q18 min: 13473.084 dev-hash min: 13777.327 [diff +2.21]
master q19 min: 654.241 dev-hash min: 673.863 [diff +2.91]
master q20 min: 2906.811 dev-hash min: 2882.793 [diff -0.83]
master q22 min: 1226.397 dev-hash min: 1262.045 [diff +2.82]

master total min: 154649.176 dev-hash min: 155175.645 [diff +0.34]

Looks like it could all be noise, but it seems worthwhile to look into
some.



>   * GROUP BY columns.  The per-group data is allocated in lookup_hash_entry(),
>   * for each entry.
>   *
> - * The hash table always lives in the aggcontext memory context.
> + * We have a separate hashtable and associated perhash data structure for each
> + * grouping set for which we're doing hashing.
> + *
> + * The hash tables always live in the hashcontext's per-tuple memory context
> + * (there is only one of these for all tables together, since they are all
> + * reset at the same time).
>   */

I've not yet read up on the thread, or the whole patch, but an
explanation of the memory management for all those tables would be good
somewhere (maybe I've just not read that far).


> @@ -2388,7 +2597,36 @@ agg_retrieve_hash_table(AggState *aggstate)
>   * ExecInitAgg
>   *
>   *    Creates the run-time information for the agg node produced by the
> - *    planner and initializes its outer subtree
> + *    planner and initializes its outer subtree.
> + *
> + *    What we get from the planner is actually one "real" Agg node which is part
> + *    of the plan tree proper, but which optionally has an additional list of Agg
> + *    nodes hung off the side via the "chain" field.  This is because an Agg node
> + *    happens to be a convenient representation of all the data we need for
> + *    grouping sets.
> + *
> + *    For many purposes, we treat the "real" node as if it were just the first
> + *    node in the chain.  The chain must be ordered such that hashed entries come
> + *    before sorted/plain entries; the real node is marked AGG_MIXED if there are
> + *    mixed types (in which case the real node describes one of the hashed

"mixed types" sounds a bit weird.


> + *    We collect all hashed nodes into a single "phase", numbered 0, and create a
> + *    sorted phase (numbered 1..n) for each AGG_SORTED or AGG_PLAIN node.  Phase
> + *    0 is allocated even if there are no hashes, but remains unused in that
> + *    case.

Ah, here's the explanation I'd been looking for ;)



> +Bitmapset *
> +DiscreteKnapsack(int max_weight, int num_items,
> +                 int *item_weights, double *item_values)
> +{
> +    MemoryContext local_ctx = AllocSetContextCreate(CurrentMemoryContext,
> +                                                    "Knapsack",
> +                                                    ALLOCSET_SMALL_MINSIZE,
> +                                                    ALLOCSET_SMALL_INITSIZE,
> +                                                    ALLOCSET_SMALL_MAXSIZE);
> +    MemoryContext oldctx = MemoryContextSwitchTo(local_ctx);
> +    double       *values;
> +    Bitmapset **sets;
> +    Bitmapset  *result;
> +    int            i,
> +                j;
> +
> +    Assert(max_weight >= 0);
> +    Assert(num_items > 0 && item_weights);
> +
> +    values = palloc((1 + max_weight) * sizeof(double));
> +    sets = palloc((1 + max_weight) * sizeof(Bitmapset *));

We usually cast the result of palloc.


> +    for (i = 0; i <= max_weight; ++i)
> +    {
> +        values[i] = 0;
> +        sets[i] = bms_make_singleton(num_items);
> +    }
> +
> +    for (i = 0; i < num_items; ++i)
> +    {
> +        int            iw = item_weights[i];
> +        double        iv = item_values ? item_values[i] : 1;
> +
> +        for (j = max_weight; j >= iw; --j)
> +        {
> +            int            ow = j - iw;
> +
> +            if (values[j] <= values[ow] + iv)
> +            {
> +                /* copy sets[ow] to sets[j] without realloc */
> +                if (j != ow)
> +                {
> +                    sets[j] = bms_del_members(sets[j], sets[j]);
> +                    sets[j] = bms_add_members(sets[j], sets[ow]);
> +                }
> +
> +                sets[j] = bms_add_member(sets[j], i);
> +
> +                values[j] = values[ow] + iv;
> +            }
> +        }
> +    }
> +
> +    MemoryContextSwitchTo(oldctx);
> +
> +    result = bms_del_member(bms_copy(sets[max_weight]), num_items);
> +
> +    MemoryContextDelete(local_ctx);
> +
> +    return result;
> +}

> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 252af5c870..ea1c538287 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -21,7 +21,7 @@
>  #include "postgres.h"
>  
>  #include "access/hash.h"
> -
> +#include "nodes/pg_list.h"

Pedanticism^7: You're deleting a line here...


> +/*
>   * bms_nonempty_difference - do sets have a nonempty difference?
>   */
>  bool
> diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> index 7418fbeded..d2d66f915b 100644
> --- a/src/backend/nodes/outfuncs.c
> +++ b/src/backend/nodes/outfuncs.c
> @@ -1936,6 +1936,28 @@ _outAggPath(StringInfo str, const AggPath *node)
>  }
>  
>  static void
> +_outRollupData(StringInfo str, const RollupData *node)
> +{
> +    WRITE_NODE_TYPE("ROLLUP");
> +
> +    WRITE_NODE_FIELD(groupClause);
> +    WRITE_NODE_FIELD(gsets);
> +    WRITE_NODE_FIELD(gsets_data);
> +    WRITE_FLOAT_FIELD(numGroups, "%.0f");
> +    WRITE_BOOL_FIELD(hashable);
> +    WRITE_BOOL_FIELD(is_hashed);
> +}
> +
> +static void
> +_outGroupingSetData(StringInfo str, const GroupingSetData *node)
> +{
> +    WRITE_NODE_TYPE("GSDATA");
> +
> +    WRITE_NODE_FIELD(set);
> +    WRITE_FLOAT_FIELD(numGroups, "%.0f");
> +}

.0f?


> +static grouping_sets_data *
> +preprocess_grouping_sets(PlannerInfo *root)
> +{

It'd not hurt to add a comment as to what this function is actually
doing.


Btw, this would be a lot easier to review if the move to
preprocess_grouping_sets were done in a separate preliminary commit.


> +    if (!bms_is_empty(gd->unsortable_refs))
> +    {
...
> +    }
> +    else
> +        sets = extract_rollup_sets(parse->groupingSets);

I hope there's tests for both these branches.


> +        /*
> +         * Is it hashable? We pretend empty sets are hashable even though we
> +         * actually force them not to be hashed later. But don't bother if
> +         * there's nothing but empty sets.
> +         */

Why?


> @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root,
>   * estimate_hashagg_tablesize
>   *      estimate the number of bytes that a hash aggregate hashtable will
>   *      require based on the agg_costs, path width and dNumGroups.
> + *
> + * XXX this may be over-estimating the size now that hashagg knows to omit
> + * unneeded columns from the hashtable. Also for mixed-mode grouping sets,
> + * grouping columns not in the hashed set are counted here even though hashagg
> + * won't store them. Is this a problem?
>   */

Hm. This seems like it could move us to use sorting, even if not
actually warranted.


> +/*
> + * For a given input path, consider the possible ways of doing grouping sets on
> + * it, by combinations of hashing and sorting.  This can be called multiple
> + * times, so it's important that it not scribble on input.  No result is
> + * returned, but any generated paths are added to grouped_rel.
> + */
> +
> +static void
> +consider_groupingsets_paths(PlannerInfo *root,

What's the newline after the comment about?

> +                            RelOptInfo *grouped_rel,
> +                            Path *path,
> +                            bool is_sorted,
> +                            bool can_hash,
> +                            PathTarget *target,
> +                            grouping_sets_data *gd,
> +                            const AggClauseCosts *agg_costs,
> +                            double dNumGroups)
> +{
> +    Query       *parse = root->parse;
> +
> +    /*
> +     * If we're not being offered sorted input, then only consider plans that
> +     * can be done entirely by hashing.
> +     *
> +     * We can hash everything if it looks like it'll fit in work_mem. But if
> +     * the input is actually sorted despite not being advertised as such, we
> +     * prefer to make use of that in order to use less memory.
> +     * If none of the grouping sets are sortable, then ignore the work_mem
> +     * limit and generate a path anyway, since otherwise we'll just fail.
> +     */
> +    if (!is_sorted)
> +    {
> +        List       *new_rollups = NIL;
> +        RollupData *unhashable_rollup = NULL;
> +        List       *sets_data;
> +        List       *empty_sets_data = NIL;
> +        List       *empty_sets = NIL;
> +        ListCell   *lc;
> +        ListCell   *l_start = list_head(gd->rollups);
> +        AggStrategy strat = AGG_HASHED;
> +        Size        hashsize;
> +        double        exclude_groups = 0.0;
> +
> +        Assert(can_hash);
> +
> +        if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
> +        {
> +            unhashable_rollup = lfirst(l_start);

a) cast result of lfirst/lnext/whatnot. b) Why is this guaranteed to be
unhashable?


> +            exclude_groups = unhashable_rollup->numGroups;
> +            l_start = lnext(l_start);
> +        }


What guarantees that such a rollup would be at the start?


> +        hashsize = estimate_hashagg_tablesize(path,
> +                                              agg_costs,
> +                                              dNumGroups - exclude_groups);
> +
> +        if (hashsize > work_mem * 1024L && gd->rollups)
> +            return;                /* nope, won't fit */

Didn't you above talk about generating a path anyway?  Or is rollups
guaranteed not to be set in that case (if so, not obvious on a casual
read)?


> +    /*
> +     * If we have sorted input but nothing we can do with it, bail.
> +     */
> +
> +    if (list_length(gd->rollups) == 0)
> +        return;

Postgres style usually doesn't add a newline before comment and the
block it concerns - you do that all over.


> +    if (can_hash && gd->any_hashable)
> +    {
> +        List       *rollups = NIL;
> +        List       *hash_sets = list_copy(gd->unsortable_sets);
> +        double        availspace = (work_mem * 1024.0);

Why the parens, and why the .0?

> +        ListCell   *lc;
> +
> +        /*
> +         * Account first for space needed for groups we can't sort at all.
> +         */
> +        availspace -= (double) estimate_hashagg_tablesize(path,
> +                                                          agg_costs,
> +                                                          gd->dNumHashGroups);
> +
> +        if (availspace > 0 && list_length(gd->rollups) > 1)
> +        {
> +            double        scale;
> +            int            num_rollups = list_length(gd->rollups);
> +            int            k_capacity;
> +            int           *k_weights = palloc(num_rollups * sizeof(int));
> +            Bitmapset  *hash_items = NULL;
> +            int            i;
> +
> +            /*
> +             * To use the discrete knapsack, we need to scale the values to a
> +             * reasonably small bounded range.  We choose to allow a 5% error
> +             * margin; we have no more than 4096 rollups in the worst possible
> +             * case, which with a 5% error margin will require a bit over 42MB
> +             * of workspace. (Anyone wanting to plan queries that complex had
> +             * better have the memory for it.  In more reasonable cases, with
> +             * no more than a couple of dozen rollups, the memory usage will
> +             * be negligible.)
> +             *
> +             * k_capacity is naturally bounded, but we clamp the values for
> +             * scale and weight (below) to avoid overflows or underflows (or
> +             * uselessly trying to use a scale factor less than 1 byte).
> +             */

I think you need a higher level explanation of how you're mapping the
work_mem issue to knapsack somewhere.



> +            /*
> +             * Apply knapsack algorithm; compute the set of items which
> +             * maximizes the value stored (in this case the number of sorts
> +             * saved) while keeping the total size (approximately) within
> +             * capacity.
> +             */

Hm.  So we're solely optimizing for the number of sorts, rather than the
cost of the sorts.  Probably an acceptable tradeoff.

> +}

Phew, this isn't simple. Need to be more awake than I am, 8 hours into a
flight.


> @@ -737,7 +739,8 @@ typedef enum AggStrategy
>  {
>      AGG_PLAIN,                    /* simple agg across all input rows */
>      AGG_SORTED,                    /* grouped agg, input must be sorted */
> -    AGG_HASHED                    /* grouped agg, use internal hashtable */
> +    AGG_HASHED,                    /* grouped agg, use internal hashtable */
> +    AGG_MIXED                    /* grouped agg, hash and sort concurrently */
>  } AggStrategy;

Concurrently sounds a bit like using multiple processes / parallelism...


- Andres



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 Andres> Changes to advance_aggregates() are, in my experience, quite
 Andres> likely to have performance effects.  This needs some
 Andres> performance tests.
 [...]
 Andres> Looks like it could all be noise, but it seems worthwhile to
 Andres> look into some.
 

Trying to sort out signal from noise when dealing with performance
impacts of no more than a few percent is _extremely hard_ these days.

Remember this, from a couple of years back?  http://tinyurl.com/op9qg8a

That's a graph of performance results from tests where the only change
being made was adding padding bytes to a function that was never called
during the test. Performance differences on the order of 3% were being
introduced by doing nothing more than changing the actual memory
locations of functions.
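
(For anyone wanting to reproduce the effect: a never-called dummy function
of tunable size is all it takes. With GCC, for instance, something along
these lines, purely illustrative:

static void __attribute__((used))
padding_dummy(void)
{
    __asm__(".space 500");      /* 500 padding bytes; adjust to taste */
}

Any never-executed code of adjustable size shifts the addresses of
everything linked after it.)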

My latest round of tests on this patch shows a similar effect. Here's
one representative test timing (a select count(*) with no grouping sets
involved):
                        master: 5727ms
                    gsets_hash: 5949ms
gsets_hash + 500 padding bytes: 5680ms

Since the effect of padding is going to vary over time as other,
unrelated, patches happen, I think I can safely claim that the
performance of the patch at least overlaps the noise range of the
performance of current code.  To get a more definitive result, it would
be necessary to run at least some dozens of tests, with different
padding sizes, and determine whether the average changes detectably
between the two versions.  I will go ahead and do this, out of sheer
curiosity if nothing else, but the preliminary results suggest there's
probably nothing worth holding up the patch for.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 22, 2017, at 8:09 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>
>> On Mar 22, 2017, at 7:55 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>
>> [snip]
>>
>> This thread seems to have gone quiet - is it time for me to just go
>> ahead and commit the thing anyway? Anyone else want to weigh in?
>
> Sorry for the delay.  I'll try to look at this today.  Others are most welcome
> to weigh in.

You define DiscreteKnapsack to take integer weights and double values,
and perform the usual Dynamic Programming algorithm to solve.  But
the only place you call this, you pass in NULL for the values, indicating
that all the values are equal.  I'm confused why in this degenerate case
you bother with the DP at all.  Can't you just load the knapsack from
lightest to heaviest after sorting, reducing the runtime complexity to O(nlogn)?
It's been a long day.  Sorry if I am missing something obvious.

I'm guessing you intend to extend the code at some later date to have
different values for items.  Is that right?

I reapplied your patch and reran the regression tests on my laptop and they
all pass.

mark




Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
 >> -    Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
 >> +    Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

 Andres> I think this somewhere in the file header needs an expanded
 Andres> explanation about what these "special" phases mean.
 

I could move most of the block comment for ExecInitAgg up into the file
header; would that be better?

 Andres> I've not yet read up on the thread, or the whole patch, but an
 Andres> explanation of the memory management for all those tables would
 Andres> be good somewhere (maybe I've just not read that far).
 

I can expand on that.
Andres> "mixed types" sounds a bit weird.

changing to "if there are both types present"

 >> +    values = palloc((1 + max_weight) * sizeof(double));
 >> +    sets = palloc((1 + max_weight) * sizeof(Bitmapset *));

 Andres> We usually cast the result of palloc.

Rough count in the backend has ~400 without casts to ~1350 with, so this
doesn't seem to have been consistently enforced.

 >> +static void
 >> +_outGroupingSetData(StringInfo str, const GroupingSetData *node)
 >> +{
 >> +    WRITE_NODE_TYPE("GSDATA");
 >> +
 >> +    WRITE_NODE_FIELD(set);
 >> +    WRITE_FLOAT_FIELD(numGroups, "%.0f");
 >> +}

 Andres> .0f?

Copied from every other node that uses "%.0f" to write out a numGroups
value.

 >> +static grouping_sets_data *
 >> +preprocess_grouping_sets(PlannerInfo *root)
 >> +{

 Andres> It'd not hurt to add a comment as to what this function is
 Andres> actually doing.

Sure.
Andres> I hope there's tests for both these branches.

A number of tests for both unsortable and unhashable cases are in the
regression test.

 >> +        /*
 >> +         * Is it hashable? We pretend empty sets are hashable even though we
 >> +         * actually force them not to be hashed later. But don't bother if
 >> +         * there's nothing but empty sets.
 >> +         */

 Andres> Why?

If there's no non-empty grouping set then there's nothing to hash (the
hash code assumes at least one column). I'll put that in the comment.

 >> @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root,
 >>   * estimate_hashagg_tablesize
 >>   *      estimate the number of bytes that a hash aggregate hashtable will
 >>   *      require based on the agg_costs, path width and dNumGroups.
 >> + *
 >> + * XXX this may be over-estimating the size now that hashagg knows to omit
 >> + * unneeded columns from the hashtable. Also for mixed-mode grouping sets,
 >> + * grouping columns not in the hashed set are counted here even though hashagg
 >> + * won't store them. Is this a problem?
 >>   */

 Andres> Hm. This seems like it could move us to use sorting, even if
 Andres> not actually warranted.

This is largely a pre-existing issue that this patch doesn't try and
fix.  That would be an issue for a separate patch, I think.  I merely
added the comment to point it out.
Andres> What's the newline after the comment about?

Old style habits die hard.

 >> +                            RelOptInfo *grouped_rel,
 >> +                            Path *path,
 >> +                            bool is_sorted,
 >> +                            bool can_hash,
 >> +                            PathTarget *target,
 >> +                            grouping_sets_data *gd,
 >> +                            const AggClauseCosts *agg_costs,
 >> +                            double dNumGroups)
 >> +{
 >> +    Query       *parse = root->parse;
 >> +
 >> +    /*
 >> +     * If we're not being offered sorted input, then only consider plans that
 >> +     * can be done entirely by hashing.
 >> +     *
 >> +     * We can hash everything if it looks like it'll fit in work_mem. But if
 >> +     * the input is actually sorted despite not being advertised as such, we
 >> +     * prefer to make use of that in order to use less memory.
 >> +     * If none of the grouping sets are sortable, then ignore the work_mem
 >> +     * limit and generate a path anyway, since otherwise we'll just fail.
 >> +     */
 >> +    if (!is_sorted)
 >> +    {
 >> +        List       *new_rollups = NIL;
 >> +        RollupData *unhashable_rollup = NULL;
 >> +        List       *sets_data;
 >> +        List       *empty_sets_data = NIL;
 >> +        List       *empty_sets = NIL;
 >> +        ListCell   *lc;
 >> +        ListCell   *l_start = list_head(gd->rollups);
 >> +        AggStrategy strat = AGG_HASHED;
 >> +        Size        hashsize;
 >> +        double        exclude_groups = 0.0;
 >> +
 >> +        Assert(can_hash);
 >> +
 >> +        if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 >> +        {
 >> +            unhashable_rollup = lfirst(l_start);

 Andres> a) cast result of lfirst/lnext/whatnot.

casting lfirst is defensible (and only about 10% of existing uses don't
cast it), but why would you ever cast the result of lnext?
Andres> b) Why is this guaranteed to be unhashable?

It might not be unhashable; that probably needs a better name. It's the
rollup that we aren't going to hash because we're being presented with
input already in correct order. Also, it's the only rollup that can
contain empty grouping sets, which the hash code can't cope with.

Obviously this needs a comment and some consideration of naming.

 >> +        if (hashsize > work_mem * 1024L && gd->rollups)
 >> +            return;                /* nope, won't fit */

 Andres> Didn't you above talk about generating a path anyway?  Or is
 Andres> rollups guaranteed not to be set in that case (if so, not
 Andres> obvious on a casual read)?
 

Unsortable groups can't participate in rollups. The presence of a rollup
implies that we're going to be offered at least one is_sorted input
path.

 >> +    if (can_hash && gd->any_hashable)
 >> +    {
 >> +        List       *rollups = NIL;
 >> +        List       *hash_sets = list_copy(gd->unsortable_sets);
 >> +        double        availspace = (work_mem * 1024.0);

 Andres> Why the parens, and why the .0?

The .0 is because the * should be done as a double, not an int.  This
isn't an uncommon idiom, there are quite a few similar uses in the
planner alone.

The parens because I thought it was clearer that way.

 Andres> I think you need a higher level explanation of how you're
 Andres> mapping the work_mem issue to knapsack somewhere.

The general overview ("work_mem is the knapsack size, and we need to
figure out how best to pack it with hashtables"), or the specific issue
of the scale factor for discrete chunking of the size?

 Andres> Hm.  So we're solely optimizing for the number of sorts, rather
 Andres> than the cost of the sorts.  Probably an acceptable tradeoff.
 

The existing cost model for sorts doesn't actually seem to help us here;
all of our sorts sort the same data, just with different comparison
columns, but cost_sort doesn't actually account for that (all its
callers except the CLUSTER planning code pass in 0.0 for
comparison_cost).

If the sort costs were correct, we could compute a "value" for each
rollup that reflected the cost difference between the sort+unique
comparisons for it, and the hash functions that would be called if we
hashed instead; and just pass that to the knapsack function.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:

 Mark> You define DiscreteKnapsack to take integer weights and double
 Mark> values, and perform the usual Dynamic Programming algorithm to
 Mark> solve.  But the only place you call this, you pass in NULL for
 Mark> the values, indicating that all the values are equal.  I'm
 Mark> confused why in this degenerate case you bother with the DP at
 Mark> all.  Can't you just load the knapsack from lightest to heaviest
 Mark> after sorting, reducing the runtime complexity to O(nlogn)?
 Mark> It's been a long day.  Sorry if I am missing something obvious.
 

That's actually a very good question.

(Though in passing I should point out that the runtime cost is going to
be negligible in all practical cases. Even an 8-way cube (256 grouping
sets) has only 70 rollups, and a 4-way cube has only 6; the limit of
4096 grouping sets was only chosen because it was a nice round number
that was larger than comparable limits in other database products. Other
stuff in the grouping sets code has worse time bounds; the
bipartite-match code used to minimize the number of rollups is
potentially O(n^2.5) in the number of grouping sets.)

Thinking about it, it's not at all obvious what we should be optimizing
for here in the absence of accurate sort costs. Right now what matters
is saving as many sort steps as possible, since that's a saving likely
to be many orders of magnitude greater than the differences between two
sorts. The one heuristic that might be useful is to prefer the smaller
estimated size if other factors are equal, just on memory usage grounds,
but even that seems a bit tenuous.

I'm inclined to leave things as they are in the current patch, and maybe
revisit it during beta if we get any relevant feedback from people
actually using grouping sets?
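
(For completeness, the greedy version Mark describes would look roughly
like the following sketch - plain C, names illustrative, not code from
the patch:)

#include <stdlib.h>

static int
cmp_weight(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

/*
 * Greedy loading for the equal-values case: sort items by weight and
 * take the lightest ones until the capacity is exhausted.  O(n log n)
 * rather than the DP's O(n * capacity).
 */
static int
greedy_knapsack(int *weights, int nitems, int capacity)
{
    int         i;
    int         taken = 0;

    qsort(weights, nitems, sizeof(int), cmp_weight);

    for (i = 0; i < nitems; i++)
    {
        if (weights[i] > capacity)
            break;
        capacity -= weights[i];
        taken++;
    }
    return taken;               /* number of items packed */
}
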
Mark> I'm guessing you intend to extend the code at some later date to
Mark> have different values for items.  Is that right?

Well, as I mentioned in a reply to Andres, we probably should eventually
optimize for greatest reduction in cost, and the code as it stands would
allow that to be added easily, but that would require having more
accurate cost info than we have now. cost_sort doesn't take into
consideration the number or types of sort columns or the costs of their
comparison functions, so all sorts of the same input data are costed
equally.

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andres Freund
Date:
On 2017-03-23 03:43:57 +0000, Andrew Gierth wrote:
> >>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
> 
>  Andres> Changes to advance_aggregates() are, in my experience, quite
>  Andres> likely to have performance effects.  This needs some
>  Andres> performance tests.
>  [...]
>  Andres> Looks like it could all be noise, but it seems worthwhile to
>  Andres> look into some.
> 
> Trying to sort out signal from noise when dealing with performance
> impacts of no more than a few percent is _extremely hard_ these days.

Indeed. But that doesn't mean we needn't try.  With some determination
and profiling you can often separate signal from noise - I managed to
track down 0.12% regressions in the expression evaluation work...


> I will go ahead and do this, out of sheer curiosity if nothing else,
> but the preliminary results suggest there's probably nothing worth
> holding up the patch for.

Agreed. I'd want to run one more profile, checking whether the profiles
indicate new hotspots, but other than that...

Greetings,

Andres Freund



Re: [HACKERS] Hash support for grouping sets

From
Mark Dilger
Date:
> On Mar 23, 2017, at 11:22 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>
>>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
>
> Mark> You define DiscreteKnapsack to take integer weights and double
> Mark> values, and perform the usual Dynamic Programming algorithm to
> Mark> solve.  But the only place you call this, you pass in NULL for
> Mark> the values, indicating that all the values are equal.  I'm
> Mark> confused why in this degenerate case you bother with the DP at
> Mark> all.  Can't you just load the knapsack from lightest to heaviest
> Mark> after sorting, reducing the runtime complexity to O(nlogn)?  It's
> Mark> been a long day.  Sorry if I am missing something obvious.
>
> That's actually a very good question.
>
> (Though in passing I should point out that the runtime cost is going to
> be negligible in all practical cases. Even an 8-way cube (256 grouping
> sets) has only 70 rollups, and a 4-way cube has only 6; the limit of
> 4096 grouping sets was only chosen because it was a nice round number
> that was larger than comparable limits in other database products. Other
> stuff in the grouping sets code has worse time bounds; the
> bipartite-match code used to minimize the number of rollups is
> potentially O(n^2.5) in the number of grouping sets.)
>
> Thinking about it, it's not at all obvious what we should be optimizing
> for here in the absence of accurate sort costs.

Is there a performance test case where this patch should shine
brightest?  I'd like to load a schema with lots of data, and run
a grouping sets query, both before and after applying the patch,
to see what the performance advantage is.

mark


Re: [HACKERS] Hash support for grouping sets

From
Andres Freund
Date:
On 2017-03-23 05:09:46 +0000, Andrew Gierth wrote:
> >>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
> 
>  >> -    Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
>  >> +    Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);
> 
>  Andres> I think this somewhere in the file header needs an expanded
>  Andres> explanation about what these "special" phases mean.
> 
> I could move most of the block comment for ExecInitAgg up into the file
> header; would that be better?

Yes, I think so.


>  >> +    values = palloc((1 + max_weight) * sizeof(double));
>  >> +    sets = palloc((1 + max_weight) * sizeof(Bitmapset *));
> 
>  Andres> We usually cast the result of palloc.
> 
> Rough count in the backend has ~400 without casts to ~1350 with, so this
> doesn't seem to have been consistently enforced.

Yea, but we're still trying.

>  >> +                            RelOptInfo *grouped_rel,
>  >> +                            Path *path,
>  >> +                            bool is_sorted,
>  >> +                            bool can_hash,
>  >> +                            PathTarget *target,
>  >> +                            grouping_sets_data *gd,
>  >> +                            const AggClauseCosts *agg_costs,
>  >> +                            double dNumGroups)
>  >> +{
>  >> +    Query       *parse = root->parse;
>  >> +
>  >> +    /*
>  >> +     * If we're not being offered sorted input, then only consider plans that
>  >> +     * can be done entirely by hashing.
>  >> +     *
>  >> +     * We can hash everything if it looks like it'll fit in work_mem. But if
>  >> +     * the input is actually sorted despite not being advertised as such, we
>  >> +     * prefer to make use of that in order to use less memory.
>  >> +     * If none of the grouping sets are sortable, then ignore the work_mem
>  >> +     * limit and generate a path anyway, since otherwise we'll just fail.
>  >> +     */
>  >> +    if (!is_sorted)
>  >> +    {
>  >> +        List       *new_rollups = NIL;
>  >> +        RollupData *unhashable_rollup = NULL;
>  >> +        List       *sets_data;
>  >> +        List       *empty_sets_data = NIL;
>  >> +        List       *empty_sets = NIL;
>  >> +        ListCell   *lc;
>  >> +        ListCell   *l_start = list_head(gd->rollups);
>  >> +        AggStrategy strat = AGG_HASHED;
>  >> +        Size        hashsize;
>  >> +        double        exclude_groups = 0.0;
>  >> +
>  >> +        Assert(can_hash);
>  >> +
>  >> +        if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
>  >> +        {
>  >> +            unhashable_rollup = lfirst(l_start);
> 
>  Andres> a) cast result of lfirst/lnext/whatnot.
> 
> casting lfirst is defensible (and only about 10% of existing uses don't
> cast it), but why would you ever cast the result of lnext?

Obviously lnext is bogus, was thinking of linitial...

>  >> +    if (can_hash && gd->any_hashable)
>  >> +    {
>  >> +        List       *rollups = NIL;
>  >> +        List       *hash_sets = list_copy(gd->unsortable_sets);
>  >> +        double        availspace = (work_mem * 1024.0);
> 
>  Andres> Why the parens, and why the .0?
> 
> The .0 is because the * should be done as a double, not an int.

I was just missing why - but now that I've not already been awake for 20
hours, 8 of those crammed into a plane, it's obviously beneficial
because of overflow concerns.
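
(To spell the overflow issue out: work_mem is an int holding kilobytes,
so the multiplication can exceed int range before any widening happens.
A minimal illustration, not code from the patch:)

int         work_mem = 3 * 1024 * 1024;     /* 3GB expressed in kB */

long        bad = work_mem * 1024;      /* int * int overflows before the
                                         * result is widened to long */
double      good = work_mem * 1024.0;   /* promoted to double first, so
                                         * no overflow */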


>  Andres> I think you need a higher level explanation of how you're
>  Andres> mapping the work_mem issue to knapsack somewhere.
> 
> The general overview ("work_mem is the knapsack size, and we need to
> figure out how best to pack it with hashtables"), or the specific issue
> of the scale factor for discrete chunking of the size?

The general overview - the scaling thing doesn't seem that important to
understand in detail, to understand the approach.  Briefly explaining
what we're trying to minimize (sort steps), what the constraint is
(memory), that the problem is representable as the classic "knapsack
problem".


>  Andres> Hm.  So we're solely optimizing for the number of sorts, rather
>  Andres> the cost of the sorts.  Probably an acceptable tradeoff.
> 
> The existing cost model for sorts doesn't actually seem to help us here;
> all of our sorts sort the same data, just with different comparison
> columns, but cost_sort doesn't actually account for that (all its
> callers except the CLUSTER planning code pass in 0.0 for
> comparison_cost).
> 
> If the sort costs were correct, we could compute a "value" for each
> rollup that reflected the cost difference between the sort+unique
> comparisons for it, and the hash functions that would be called if we
> hashed instead; and just pass that to the knapsack function.

That'd indeed be desirable - but I think we can punt on that for now.

Greetings,

Andres Freund



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
Mark> Is there a performance test case where this patch should shine
Mark> brightest?  I'd like to load a schema with lots of data, and run
Mark> a grouping sets query, both before and after applying the patch,
Mark> to see what the performance advantage is.

The area which I think is most important for performance is the handling
of small cubes; without this patch, a 2d cube needs 2 full sorts, a 3d
one needs 3, and a 4d one needs 6. In many real-world data sets these
would all hash entirely in memory.

So here's a very simple example (deliberately using integers for
grouping to minimize the advantage; grouping by text columns in a non-C
locale would show a much greater speedup for the patch):

create table sales ( id serial, product_id integer, store_id integer, customer_id integer, qty integer);

-- random integer function
create function d(integer) returns integer language sql
  as $f$ select floor(random()*$1)::integer + 1; $f$;

-- 10 million random rows
insert into sales (product_id,store_id,customer_id,qty)
  select d(20), d(6), d(10), d(100) from generate_series(1,10000000);

-- example 2d cube:
select product_id, store_id, count(*), sum(qty) from sales
  group by cube(product_id, store_id);

-- example 3d cube:
select product_id, store_id, customer_id, count(*), sum(qty) from sales
  group by cube(product_id, store_id, customer_id);

-- example 4d cube with a computed column:
select product_id, store_id, customer_id, (qty / 10), count(*), sum(qty) from sales
  group by cube(product_id, store_id, customer_id, (qty / 10));

On my machine, the 2d cube is about 3.6 seconds with the patch, and
about 8 seconds without it; the 4d is about 18 seconds with the patch
and about 32 seconds without it (all with work_mem=1GB, compiled with
-O2 and assertions off).

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
Andres> We usually cast the result of palloc.
>> Rough count in the backend has ~400 without casts to ~1350 with, so
>> this doesn't seem to have been consistently enforced.
Andres> Yea, but we're still trying.

Well, a lot of the uncasted ones are in fairly new code, from quite a
number of different committers.

So if this is a big deal, why don't we already have

#define palloc_array(etype, ecount) ((etype *) palloc((ecount) * sizeof(etype)))
#define palloc_object(otype) ((otype *) palloc(sizeof(otype)))

or something of that ilk?
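
With those, the two pallocs quoted earlier would become, say:

    values = palloc_array(double, 1 + max_weight);
    sets = palloc_array(Bitmapset *, 1 + max_weight);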

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
Andres> a) cast result of lfirst/lnext/whatnot.

Again, what we need here is something like

#define lfirst_node(_type_, l) (castNode(_type_, lfirst(l)))

etc.
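
which, applied to the assignment questioned earlier, would read
something like:

    unhashable_rollup = lfirst_node(RollupData, l_start);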

-- 
Andrew (irc:RhodiumToad)



Re: Hash support for grouping sets

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 >> I could move most of the block comment for ExecInitAgg up into the
 >> file header; would that be better?

 Andres> Yes, I think so.

Done that and some other clarifications. Attached is the latest patch;
no substantive changes, only comments, formatting and a change to a
variable name.

The performance side of things I'm still working on; I think that would
be best handled as a followup patch.

-- 
Andrew (irc:RhodiumToad)

