Re: Spilling hashed SetOps and aggregates to disk - Mailing list pgsql-hackers

From David Gershuni
Subject Re: Spilling hashed SetOps and aggregates to disk
Msg-id 4DE7C251-86E3-4F13-85D4-F95DB9ED8AA6@cs.cmu.edu
In response to Re: Spilling hashed SetOps and aggregates to disk  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
> On Jun 13, 2018, at 12:53 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
>>
>> An adaptive hash agg node would start as a hash agg, and turn into a
>> "partial hash agg + sort + final group agg" when OOM is detected.
>>
>> The benefit over ordinary sort+group agg is that the sort is
>> happening on a potentially much smaller data set. When the buffer is
>> large enough to capture enough key repetitions, the output of the
>> partial hash agg can be orders of magnitude smaller than the scan.
>>
>> My proposal was to use this for all group aggs, not only when the
>> planner chooses a hash agg, because it can speed up the sort and
>> reduce temp storage considerably. I guess the trick lies in
>> estimating that cardinality reduction to avoid applying this when
>> there's no hope of getting a payoff. The overhead of such a lazy hash
>> table isn't much, really. But yes, its cache locality is terrible, so
>> it needs to be considered.
>
> I think this is a promising approach because it means the planner has
> one less decision to make. And the planner doesn't have great information
> to make its decision on, anyway (ndistinct estimates are hard enough on
> plain tables, and all bets are off a few levels up in the plan).
>
> Unfortunately, it means that either all data types must support hashing
> and sorting[1], or we need to have a fallback path that can get by with
> hashing alone (which would not be tested nearly as well as more typical
> paths).

Jeff, I agree with you, but we should also take grouping sets into consideration
because they can cause the executor to create multiple hash tables in memory
simultaneously, each growing at a different rate. Not only does this make eviction
more complicated, but it actually prevents your original approach from working
because it violates the assumption that all entries left in the hash table at the end
of a phase are complete and can be flushed to output.

For example, imagine that a tuple belonging to group G in grouping set A had to be
spilled because it also belonged to an evicted group from grouping set B. Group G
would then remain in set A’s hash table at the end of the phase, but it wouldn’t have
aggregated the values from the spilled tuple. Of course, workarounds are possible,
but the current approach would not work as is.
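
To make the shape of one possible workaround concrete, here is a toy sketch (purely
illustrative; GroupEntry, mark_tainted, and flush_phase are invented names, not
executor code) that tracks per-group completeness so that only untainted groups are
flushed at the end of a phase:

#include <stdbool.h>
#include <stdio.h>

typedef struct GroupEntry
{
    long    sum;        /* running aggregate state for the group */
    bool    tainted;    /* a contributing tuple was spilled for the sake
                         * of another grouping set, so this state is
                         * incomplete */
} GroupEntry;

/* Whenever a tuple is spilled because some other grouping set evicted
 * the group it maps to there, the groups that tuple maps to in the
 * remaining sets must be marked incomplete. */
static void
mark_tainted(GroupEntry *group)
{
    group->tainted = true;
}

/* At the end of the phase, only untainted groups can be emitted; the
 * rest must wait until the spilled tuples are replayed and merged. */
static void
flush_phase(GroupEntry *groups, int ngroups)
{
    for (int i = 0; i < ngroups; i++)
    {
        if (!groups[i].tainted)
            printf("emit group %d: sum=%ld\n", i, groups[i].sum);
        else
            printf("defer group %d until spilled tuples are merged\n", i);
    }
}

int
main(void)
{
    GroupEntry groups[2] = {{10, false}, {7, false}};

    /* Pretend a tuple for group 1 in set A was spilled because set B had
     * already evicted the group that same tuple belongs to there. */
    mark_tainted(&groups[1]);
    flush_phase(groups, 2);
    return 0;
}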

Claudio’s proposal mostly solves this problem. In fact, it is very similar to the
HashSort algorithm from [1] that I mentioned. We would still need to think about
how to choose which hash table to evict from. For example, we could evict a group
from the largest hash table (if we tracked memory usage independently for each one).
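
As a trivial sketch of what "evict from the largest table" could look like with
per-set memory accounting (SetHashTable and choose_eviction_target are hypothetical
names, just to illustrate the bookkeeping):

#include <stddef.h>
#include <stdio.h>

typedef struct SetHashTable
{
    size_t  mem_used;   /* bytes attributed to this grouping set's table */
} SetHashTable;

/* Pick the grouping set whose hash table currently uses the most memory;
 * that is the table we would evict a group from next. */
static int
choose_eviction_target(const SetHashTable *tables, int ntables)
{
    int     best = 0;

    for (int i = 1; i < ntables; i++)
        if (tables[i].mem_used > tables[best].mem_used)
            best = i;
    return best;
}

int
main(void)
{
    SetHashTable    tables[3] = {{4096}, {65536}, {16384}};

    printf("evict from table %d\n", choose_eviction_target(tables, 3));
    return 0;
}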

But as you mentioned, this solution relies on all grouping keys being sortable, so we
would need a fallback plan. For this, we could use your hash-based approach, but we
would have to make modifications. One idea would be to split each grouping set into its
own aggregate phase, so that only one hash table is in memory at a time. This would
eliminate the performance benefits of grouping sets when keys aren’t sortable, but at
least they would still work.
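
Roughly, the control flow I have in mind for that fallback would be something like
the sketch below (all function names are placeholders; each phase would be a hash
agg using whatever hash-based spill strategy we settle on):

#include <stdio.h>

#define NUM_GROUPING_SETS 3

/* Stubs standing in for executor work; the names are invented. */
static void rescan_input(void)       { printf("rescan (or replay) the input\n"); }
static void hashagg_one_set(int set) { printf("hash-aggregate grouping set %d\n", set); }
static void emit_groups(int set)     { printf("emit groups for set %d\n", set); }
static void free_hash_table(void)    { printf("free the hash table\n"); }

int
main(void)
{
    /* One phase per grouping set: only one hash table is resident at a
     * time, at the cost of rescanning the input once per set. */
    for (int set = 0; set < NUM_GROUPING_SETS; set++)
    {
        rescan_input();
        hashagg_one_set(set);
        emit_groups(set);
        free_hash_table();
    }
    return 0;
}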

Best,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf



