Re: Spilling hashed SetOps and aggregates to disk - Mailing list pgsql-hackers

From David Gershuni
Subject Re: Spilling hashed SetOps and aggregates to disk
Date
Msg-id 49FB34E0-337F-4FEE-AB3D-97879E4E6954@cs.cmu.edu
Whole thread Raw
In response to Re: Spilling hashed SetOps and aggregates to disk  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Spilling hashed SetOps and aggregates to disk
List pgsql-hackers
> On Jun 19, 2018, at 10:36 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> But I am worried that I am missing something, because it appears that
> for AGG_MIXED, we wait until the end of the last phase to emit the
> contents of the hash tables. Wouldn't they be complete after the first
> phase?

You're right. They're complete after the first phase phase, but I believe
they're not emitted until the end in order to minimize the complexity of
the code. I don't think the possibility of exceeding work_mem was taken
into account when this decision was made.


> I was imagining that (in the simplest approach) each hash table would
> have its own set of spilled tuples. If grouping set A can be advanced
> (because it's present in the hash table) but B cannot (because it's not
> in the hash table and we are at the memory limit), then it goes into
> the spill file for B but not A. That means that A is still finished
> after the first pass.
...
> I realize that this simple approach could be inefficient if multiple
> hash tables are spilling because we'd duplicate the spilled tuples. But
> let me know if it won't work at all.

This approach seems functionally correct, but I don't like the idea of
transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
(in the worst case). As of yesterday, I think I have a better approach
that will work even for non-sortable grouping keys. (See below).


> I am having a hard time trying to satisfy all of the constraints that
> have been raised so far:
>
> * handle data types with hash ops but no btree ops
> * handle many different group size distributions and input orders
> efficiently
> * avoid regressions or inefficiencies for grouping sets
> * handle variable-size (particularly O(n)-space) transition states
>
> without taking on a lot of complexity. I like to work toward the right
> design, but I think simplicity also factors in to the right design.
> NodeAgg.c is already one of the larger and more complicated files that
> we have.

Yes, there are a lot of moving pieces here! And it will be non-trivial to
satisfy them all with one approach. I'm working on a new aggstrategy
that could be used in conjunction with AGG_SORT. This approach could
replace the current AGG_HASH and would satisfy all of the requirements
you mentioned for aggregates that have serial/deserial functions.

My approach is based on the HashSort algorithm that I mentioned before
but with new modifications to a) handle multiple grouping sets and b) not
require btree ops on the grouping keys.

a) is accomplished by hashing all grouping sets simultaneously and putting
each set in its own memory context. Whenever work_mem is exceeded, we
sort and spill the hash table with the largest memory context.

b) is accomplished by not sorting groups by their actual grouping keys, but
instead sorting by their hash values. This works because we don't need a
true sort. We just need to process identical groups in a consistent order
during the merge step. As long as we're careful about collisions during the
merge, this should work.

> A lot of the complexity we already have is that grouping sets are
> performed in one giant executor node rather than exposing the
> individual phases as separate executor nodes. I suppose the reason
> (correct me if I'm wrong) is that there are two outputs from a phase:
> the aggregated groups, and the original input tuples in a new sort
> order. That seems solvable -- the executor passes bitmaps around until
> BitmapHeapScan turns them into tuples.

That is an interesting idea, but it's somewhat orthogonal to spill implementation.
Still, it could be a nice way to decompose aggregate nodes that require multiple
phases/strategies. I'm curious to hear what others think.

Best,
David

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Fast default stuff versus pg_upgrade
Next
From: Nico Williams
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)