Re: Spilling hashed SetOps and aggregates to disk - Mailing list pgsql-hackers
From: David Gershuni
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date:
Msg-id: 4DE7C251-86E3-4F13-85D4-F95DB9ED8AA6@cs.cmu.edu
In response to: Re: Spilling hashed SetOps and aggregates to disk (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Spilling hashed SetOps and aggregates to disk
List: pgsql-hackers
> On Jun 13, 2018, at 12:53 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
>> An adaptive hash agg node would start as a hash agg, and turn into a
>> "partial hash agg + sort + final group agg" when OOM is detected.
>>
>> The benefit over ordinary sort+group agg is that the sort is happening
>> on a potentially much smaller data set. When the buffer is large
>> enough to capture enough key repetitions, the output of the partial
>> hash agg can be orders of magnitude smaller than the scan.
>>
>> My proposal was to use this for all group aggs, not only when the
>> planner chooses a hash agg, because it can speed up the sort and
>> reduce temp storage considerably. I guess the trick lies in estimating
>> that cardinality reduction to avoid applying this when there's no hope
>> of getting a payoff. The overhead of such a lazy hash table isn't
>> much, really. But yes, its cache locality is terrible, so it needs to
>> be considered.
>
> I think this is a promising approach because it means the planner has
> one less decision to make. And the planner doesn't have great information
> to make its decision on, anyway (ndistinct estimates are hard enough on
> plain tables, and all bets are off a few levels up in the plan).
>
> Unfortunately, it means that either all data types must support hashing
> and sorting[1], or we need to have a fallback path that can get by with
> hashing alone (which would not be tested nearly as well as more typical
> paths).

Jeff, I agree with you, but we should also take grouping sets into consideration, because they can cause the executor to create multiple hash tables in memory simultaneously, each growing at a different rate. Not only does this make eviction more complicated, it actually prevents your original approach from working, because it violates the assumption that all entries left in a hash table at the end of a phase are complete and can be flushed to output.

For example, imagine a tuple that belongs to a group G in grouping set A had to be spilled because it also belonged to an evicted group from grouping set B. Then group G would remain in set A's hash table at the end of the phase, but it wouldn't have aggregated the values from the spilled tuple. Of course, work-arounds are possible, but the current approach would not work as is.

Claudio's proposal mostly solves this problem. In fact, his proposal is very similar to the HashSort algorithm from [1] that I mentioned. We would still need to think about how to choose which hash table to evict from. For example, we could evict a group from the largest hash table, if we tracked memory usage independently for each one (a rough sketch of what I mean is at the end of this mail).

But as you mentioned, this solution relies on all grouping keys being sortable, so we would need a fallback plan. For this, we could use your hash-based approach, but we would have to make modifications. One idea would be to split each grouping set into its own aggregate phase, so that only one hash table is in memory at a time. This would eliminate the performance benefits of grouping sets when keys aren't sortable, but at least they would still work.

Best,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf
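P.S. To make the eviction-victim idea above concrete, here is a minimal, standalone C sketch (not actual PostgreSQL executor code; SetHashTable and choose_eviction_victim are made-up names for illustration). The assumption is simply that we keep a per-grouping-set byte counter and, when we run out of memory, pick the table that is currently consuming the most.

/*
 * Standalone sketch only: pick the largest of several per-grouping-set
 * hash tables as the eviction victim, assuming each table's memory use
 * is tracked independently.
 */
#include <stddef.h>
#include <stdio.h>

typedef struct SetHashTable
{
    int     setno;      /* which grouping set this table belongs to */
    size_t  mem_used;   /* bytes currently consumed by this table */
} SetHashTable;

/* Return the index of the table using the most memory, or -1 if none. */
static int
choose_eviction_victim(const SetHashTable *tables, int ntables)
{
    int     victim = -1;
    size_t  max_mem = 0;

    for (int i = 0; i < ntables; i++)
    {
        if (tables[i].mem_used > max_mem)
        {
            max_mem = tables[i].mem_used;
            victim = i;
        }
    }
    return victim;
}

int
main(void)
{
    SetHashTable tables[] = {
        {0, 64 * 1024},
        {1, 512 * 1024},
        {2, 128 * 1024},
    };

    printf("evict from grouping set %d\n",
           choose_eviction_victim(tables, 3));
    return 0;
}

In the real executor this would have to hook into the existing per-set hash table bookkeeping in nodeAgg.c, and the policy could be refined (e.g., evicting individual cold groups rather than draining whole tables), but the only extra state it needs is a per-table memory counter.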