Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Memory-Bounded Hash Aggregation
Msg-id: 20190703001753.s5e5cme3ucjgeu6c@development
In response to: Memory-Bounded Hash Aggregation (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Memory-Bounded Hash Aggregation (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

Hi Jeff,

On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote:
>This is for design review. I have a patch (WIP) for Approach 1, and if
>this discussion starts to converge on that approach I will polish and
>post it.
>

Thanks for working on this.

>Let's start at the beginning: why do we have two strategies -- hash
>and sort -- for aggregating data? The two are more similar than they
>first appear. A partitioned hash strategy writes randomly among the
>partitions, and later reads the partitions sequentially; a sort will
>write sorted runs sequentially, but then read the among the runs
>randomly during the merge phase. A hash is a convenient small
>representation of the data that is cheaper to operate on; sort uses
>abbreviated keys for the same reason.
>

What does "partitioned hash strategy" do? It's probably explained in one
of the historical discussions, but I'm not sure which one. I assume it
simply hashes the group keys, uses the hash to partition the data, and
then passes each partition to hash aggregate.
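
If my assumption is right, I imagine something like this (a toy sketch just
to make sure we're talking about the same thing - made-up names, certainly
not the actual executor code):

/*
 * Toy sketch of "partition by hash of the group key": use a few bits of
 * the hash to pick a partition and append the raw tuple to that
 * partition's temp file.  Each partition is later re-read and
 * hash-aggregated on its own, with roughly 1/NPARTITIONS of the original
 * group cardinality.
 */
#include <stdio.h>
#include <stdint.h>

#define NPARTITIONS 32                  /* must be a power of two */

typedef struct SpillPartition
{
    FILE   *file;                       /* temp file with raw spilled tuples */
    long    ntuples;                    /* tuples written so far */
} SpillPartition;

static SpillPartition partitions[NPARTITIONS];

static void
init_partitions(void)
{
    for (int i = 0; i < NPARTITIONS; i++)
    {
        partitions[i].file = tmpfile();
        partitions[i].ntuples = 0;
    }
}

/* pick a partition from the hash of the grouping key */
static int
partition_for(uint32_t hashvalue)
{
    return hashvalue & (NPARTITIONS - 1);
}

/* append one raw tuple (length + payload) to its partition's file */
static void
spill_tuple(uint32_t hashvalue, const void *tuple, size_t len)
{
    SpillPartition *p = &partitions[partition_for(hashvalue)];

    fwrite(&len, sizeof(len), 1, p->file);
    fwrite(tuple, len, 1, p->file);
    p->ntuples++;
}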

>Hash offers:
>
>* Data is aggregated on-the-fly, effectively "compressing" the amount
>  of data that needs to go to disk. This is particularly important
>  when the data contains skewed groups (see below).
>
>* Can output some groups after the first pass of the input data even
>  if other groups spilled.
>
>* Some data types only support hashing; not sorting.
>
>Sort+Group offers:
>
>* Only one group is accumulating at once, so if the transition state
>  grows (like with ARRAY_AGG), it minimizes the memory needed.
>
>* The input may already happen to be sorted.
>
>* Some data types only support sorting; not hashing.
>
>Currently, Hash Aggregation is only chosen if the optimizer believes
>that all the groups (and their transition states) fit in
>memory. Unfortunately, if the optimizer is wrong (often the case if the
>input is not a base table), the hash table will
>keep growing beyond work_mem, potentially bringing the entire system
>to OOM. This patch fixes that problem by extending the Hash
>Aggregation strategy to spill to disk when needed.
>

OK, makes sense.

>Previous discussions:
>
>
>https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop
>
>https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop
>
>https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi
>
>A lot was discussed, which I will try to summarize and address here.
>
>Digression: Skewed Groups:
>
>Imagine the input tuples have the following grouping keys:
>
>  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N
>
>Group 0 is a skew group because it consists of 50% of all tuples in
>the table, whereas every other group has a single tuple. If the
>algorithm is able to keep group 0 in memory the whole time until
>finalized, that means that it doesn't have to spill any group-0
>tuples. In this example, that would amount to a 50% savings, and is a
>major advantage of Hash Aggregation versus Sort+Group.
>

Right. I agree efficiently handling skew is important and may be crucial
for achieving good performance.

>High-level approaches:
>
>1. When the in-memory hash table fills, keep existing entries in the
>hash table, and spill the raw tuples for all new groups in a
>partitioned fashion. When all input tuples are read, finalize groups
>in memory and emit. Now that the in-memory hash table is cleared (and
>memory context reset), process a spill file the same as the original
>input, but this time with a fraction of the group cardinality.
>
>2. When the in-memory hash table fills, partition the hash space, and
>evict the groups from all partitions except one by writing out their
>partial aggregate states to disk. Any input tuples belonging to an
>evicted partition get spilled to disk. When the input is read
>entirely, finalize the groups remaining in memory and emit. Now that
>the in-memory hash table is cleared, process the next partition by
>loading its partial states into the hash table, and then processing
>its spilled tuples.
>
>3. Use some kind of hybrid[1][2] of hashing and sorting.
>

Unfortunately the second link does not work :-(

>Evaluation of approaches:
>
>Approach 1 is a nice incremental improvement on today's code. The
>final patch may be around 1KLOC. It's a single kind of on-disk data
>(spilled tuples), and a single algorithm (hashing). It also handles
>skewed groups well because the skewed groups are likely to be
>encountered before the hash table fills up the first time, and
>therefore will stay in memory.
>

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.
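
Just to check that I read the control flow of approach 1 correctly, here is
how I'd paraphrase it as a self-contained toy (int keys, count(*) as the
only aggregate, a single spill file instead of hash partitions - an
illustration of the flow, not the patch, which I haven't seen):

/*
 * Approach 1, as I understand it: existing groups keep getting advanced
 * in memory; once the table is "full", tuples belonging to new groups
 * are spilled; after the input is exhausted the in-memory groups are
 * finalized and emitted, the table is reset, and the spill file is
 * processed the same way (now with fewer distinct groups).
 */
#include <stdio.h>

#define MAX_GROUPS 4                    /* stand-in for the work_mem limit */

typedef struct { int key; long count; } Group;

static Group groups[MAX_GROUPS];
static int   ngroups;

static Group *
lookup_group(int key)
{
    for (int i = 0; i < ngroups; i++)
        if (groups[i].key == key)
            return &groups[i];

    /* new group: add it only while there is room, otherwise caller spills */
    if (ngroups < MAX_GROUPS)
    {
        groups[ngroups].key = key;
        groups[ngroups].count = 0;
        return &groups[ngroups++];
    }
    return NULL;
}

static void
agg_pass(FILE *input)
{
    FILE *spill = NULL;
    int   key;

    while (fread(&key, sizeof(key), 1, input) == 1)
    {
        Group *g = lookup_group(key);

        if (g)
            g->count++;                 /* advance the transition state */
        else
        {
            if (!spill)
                spill = tmpfile();
            fwrite(&key, sizeof(key), 1, spill);    /* spill raw tuple */
        }
    }

    /* finalize and emit everything kept in memory, then reset */
    for (int i = 0; i < ngroups; i++)
        printf("group %d: count %ld\n", groups[i].key, groups[i].count);
    ngroups = 0;

    /* recurse into the spill file, which has fewer distinct groups */
    if (spill)
    {
        rewind(spill);
        agg_pass(spill);
        fclose(spill);
    }
}

int
main(void)
{
    /* skewed input from your digression: group 0 is half of all tuples */
    FILE *input = tmpfile();

    for (int i = 1; i <= 10; i++)
    {
        int zero = 0;

        fwrite(&zero, sizeof(zero), 1, input);
        fwrite(&i, sizeof(i), 1, input);
    }
    rewind(input);
    agg_pass(input);
    return 0;
}

With MAX_GROUPS = 4 the skewed group 0 never gets spilled here, which is
exactly the property you describe in the skew digression.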

>Approach 2 is nice because it resembles the approach of Hash Join, and
>it can determine whether a tuple should be spilled without a hash
>lookup. Unfortunately, those upsides are fairly mild, and it has
>significant downsides:
>
>* It doesn't handle skew values well because it's likely to evict
>  them.
>
>* If we leave part of the hash table in memory, it's difficult to
>  ensure that we will be able to actually use the space freed by
>  eviction, because the freed memory may be fragmented. That could
>  force us to evict the entire in-memory hash table as soon as we
>  partition, reducing a lot of the benefit of hashing.
>

Yeah, and it may not work well with the memory accounting if we only track
the size of allocated blocks, not chunks (because pfree likely won't free
the blocks).

>* It requires eviction for the algorithm to work. That may be
>  necessary for handling cases like ARRAY_AGG (see below) anyway, but
>  this approach constrains the specifics of eviction.
>
>Approach 3 is interesting because it unifies the two approaches and
>can get some of the benefits of both. It's only a single path, so it
>avoids planner mistakes. I really like this idea and it's possible we
>will end up with approach 3. However:
>
>* It requires that all data types support sorting, or that we punt
>  somehow.
>
>* Right now we are in a weird state because hash aggregation cheats,
>  so it's difficult to evaluate whether Approach 3 is moving us in the
>  right direction because we have no other correct implementation to
>  compare against. Even if Approach 3 is where we end up, it seems
>  like we should fix hash aggregation as a stepping stone first.
>

Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".


>* It means we have a hash table and sort running concurrently, each
>  using memory. Andres said this might not be a problem[3], but I'm
>  not convinced that the problem is zero. If you use small work_mem
>  for the write phase of sorting, you'll end up with a lot of runs to
>  merge later and that has some kind of cost.
>

Why would we need to do both concurrently? I thought we'd empty the hash
table before doing the sort, no?

>* The simplicity might start to evaporate when we consider grouping
>  sets and eviction strategy.
>

Hmm, yeah :-/

>Main topics to consider:
>
>ARRAY_AGG:
>
>Some aggregates, like ARRAY_AGG, have a transition state that grows
>proportionally with the group size. In other words, it is not a
>summary like COUNT or AVG, it contains all of the input data in a new
>form.
>

Strictly speaking, the state may grow even for count/avg aggregates (e.g.
for numeric types), but it's far less serious than for array_agg etc.

>These aggregates are not a good candidate for hash aggregation. Hash
>aggregation is about keeping many transition states running in
>parallel, which is just a bad fit for large transition states. Sorting
>is better because it advances one transition state at a time. We could:
>
>* Let ARRAY_AGG continue to exceed work_mem like today.
>
>* Block or pessimize use of hash aggregation for such aggregates.
>
>* Evict groups from the hash table when it becomes too large. This
>  requires the ability to serialize and deserialize transition states,
>  and some approaches here might also need combine_func
>  specified. These requirements seem reasonable, but we still need
>  some answer of what to do for aggregates that grow like ARRAY_AGG
>  but don't have the required serialfunc, deserialfunc, or
>  combine_func.
>

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what we do
now), and require serial/deserial functions for anything better.
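
In other words, the check could be as simple as this (made-up structs, not
nodeAgg.c; whether combine support is needed depends on the exact eviction
design):

#include <stdbool.h>

/* made-up summary of the support functions an aggregate provides */
typedef struct AggSupport
{
    bool    has_serialfn;       /* can write a transition state to disk */
    bool    has_deserialfn;     /* can read it back */
    bool    has_combinefn;      /* can merge two partial states */
} AggSupport;

/*
 * Eviction of transition states is only possible with serialization
 * support; some designs additionally need combine support to merge a
 * reloaded state with a freshly started one.  Everything else keeps
 * today's behaviour (stay in memory, possibly exceeding work_mem).
 */
static bool
can_evict_transition_state(const AggSupport *agg, bool need_combine)
{
    if (!agg->has_serialfn || !agg->has_deserialfn)
        return false;
    if (need_combine && !agg->has_combinefn)
        return false;
    return true;
}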

>GROUPING SETS:
>
>With grouping sets, there are multiple hash tables and each hash table
>has its own hash function, so that makes partitioning more
>complex. In Approach 1, that means we need to either (a) not partition
>the spilled tuples; or (b) have a different set of partitions for each
>hash table and spill the same tuple multiple times. In Approach 2, we
>would be required to partition each hash table separately and spill
>tuples multiple times. In Approach 3 (depending on the exact approach
>but taking a guess here) we would need to add a set of phases (one
>extra phase for each hash table) for spilled tuples.
>

No thoughts about this yet.

>MEMORY TRACKING:
>
>I have a patch to track the total allocated memory by
>incrementing/decrementing it when blocks are malloc'd/free'd. This
>doesn't do bookkeeping for each chunk, only each block. Previously,
>Robert Haas raised some concerns[4] about performance, which were
>mitigated[5] but perhaps not entirely eliminated (but did become
>elusive).
>
>The only alternative is estimation, which is ugly and seems like a bad
>idea. Memory usage isn't just driven by inputs, it's also driven by
>patterns of use. Misestimates in the planner are fine (within reason)
>because we don't have any other choice, and a small-factor misestimate
>might not change the plan anyway. But in the executor, a small-factor
>misestimate seems like it's just not doing the job. If a user found
>that hash aggregation was using 3X work_mem, and my only explanation
>is "well, it's just an estimate", I would be pretty embarrassed and
>the user would likely lose confidence in the feature. I don't mean
>that we must track memory perfectly everywhere, but using an estimate
>seems like a mediocre improvement of the current state.

I agree estimates are not the right tool here.

>
>We should proceed with memory context tracking and try to eliminate or
>mitigate performance concerns. I would not like to make any herculean
>effort as a part of the hash aggregation work though; I think it's
>basically just something a memory manager in a database system should
>have supported all along. I think we will find other uses for it as
>time goes on. We have more and more things happening in the executor
>and having a cheap way to check "how much memory is this thing using?"
>seems very likely to be useful.
>

IMO we should just use the cheapest memory accounting (tracking the amount
of memory allocated for blocks). I agree it's a feature we need, and I
don't think we can devise anything cheaper than this.
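
To spell out what I mean by the cheapest accounting (a simplified sketch,
not the actual mcxt/aset code): the counter is only touched when whole
blocks are malloc'd or free'd, so palloc/pfree of chunks inside a block
costs nothing extra.

#include <stdlib.h>
#include <stddef.h>

typedef struct ToyMemoryContext
{
    size_t  mem_allocated;      /* total size of blocks owned by context */
} ToyMemoryContext;

static void *
context_alloc_block(ToyMemoryContext *cxt, size_t blksize)
{
    void *block = malloc(blksize);

    if (block)
        cxt->mem_allocated += blksize;      /* accounting at block level */
    return block;
}

static void
context_free_block(ToyMemoryContext *cxt, void *block, size_t blksize)
{
    free(block);
    cxt->mem_allocated -= blksize;
}

The hashagg code would then only need to compare that counter against
work_mem (in bytes) to decide when to start spilling.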

>Other points:
>
>* Someone brought up the idea of using logtapes.c instead of writing
>  separate files for each partition. That seems reasonable, but it's
>  using logtapes.c slightly outside of its intended purpose. Also,
>  it's awkward to need to specify the number of tapes up-front. Worth
>  experimenting with to see if it's a win.
>
>* Tomas did some experiments regarding the number of batches to choose
>  and how to choose them. It seems like there's room for improvement
>  over the simple calculation I'm doing now.
>

Me? I don't recall such benchmarks, but maybe I did. But I think we'll
need to repeat those with the new patches etc. I think the question is
whether we see this as an emergency solution - in that case I wouldn't
obsess about getting the best possible parameters.
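
FWIW I don't know what calculation the WIP patch actually uses, but as a
baseline for those experiments I'd expect something along these lines
(purely illustrative, all names and numbers made up):

/*
 * First cut at choosing the number of spill partitions: aim for the
 * groups of each partition to fit into work_mem on the next pass, round
 * up to a power of two, and clamp to something sane.
 */
static int
choose_num_partitions(double ngroups_to_spill, double avg_group_bytes,
                      double work_mem_bytes)
{
    double  target = (ngroups_to_spill * avg_group_bytes) / work_mem_bytes;
    int     npartitions = 1;

    while (npartitions < target && npartitions < 256)
        npartitions *= 2;       /* power of two, capped at 256 */

    return npartitions;
}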

>* A lot of discussion about a smart eviction strategy. I don't see
>  strong evidence that it's worth the complexity at this time. The
>  smarter we try to be, the more bookkeeping and memory fragmentation
>  problems we will have. If we evict something, we should probably
>  evict the whole hash table or some large part of it.
>

Maybe. For each "smart" eviction strategy there is a (trivial) example
of data on which it performs poorly.

I think it's the same thing as with the number of partitions - if we
consider this to be an emergency solution, it's OK if the performance is
not entirely perfect when it kicks in.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



