Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: Memory-Bounded Hash Aggregation
Date |
Msg-id | 20190703001753.s5e5cme3ucjgeu6c@development
In response to | Memory-Bounded Hash Aggregation (Jeff Davis <pgsql@j-davis.com>)
Responses | Re: Memory-Bounded Hash Aggregation
List | pgsql-hackers
Hi Jeff,

On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote:
>This is for design review. I have a patch (WIP) for Approach 1, and if
>this discussion starts to converge on that approach I will polish and
>post it.
>

Thanks for working on this.

>Let's start at the beginning: why do we have two strategies -- hash
>and sort -- for aggregating data? The two are more similar than they
>first appear. A partitioned hash strategy writes randomly among the
>partitions, and later reads the partitions sequentially; a sort will
>write sorted runs sequentially, but then read among the runs
>randomly during the merge phase. A hash is a convenient small
>representation of the data that is cheaper to operate on; sort uses
>abbreviated keys for the same reason.
>

What does "partitioned hash strategy" do? It's probably explained in
one of the historical discussions, but I'm not sure which one. I
assume it simply hashes the group keys, uses that to partition the
data, and then passes each partition to hash aggregate.
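(To make that reading concrete, here is a minimal standalone sketch of
hash-based partitioning, i.e. route each tuple to a partition by the hash
of its group key so each partition can be aggregated independently. This
is only an illustration, not code from the patch; the hash function and
partition count are arbitrary choices.)

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_PARTITIONS 4

/* Toy hash of a group key. A real implementation would use the
 * datatype's hash support function, not this constant multiply. */
static uint32_t
hash_key(int64_t key)
{
    return (uint32_t) (((uint64_t) key * 0x9E3779B97F4A7C15ULL) >> 32);
}

int
main(void)
{
    int64_t keys[] = {0, 1, 0, 2, 0, 3, 0, 4};
    int     ntuples = (int) (sizeof(keys) / sizeof(keys[0]));
    int     counts[NUM_PARTITIONS] = {0};

    /*
     * Route each input tuple to a partition based on the hash of its
     * group key. All tuples of a group land in the same partition, so
     * each partition can later be hash-aggregated independently.
     */
    for (int i = 0; i < ntuples; i++)
    {
        unsigned p = hash_key(keys[i]) % NUM_PARTITIONS;

        counts[p]++;
        printf("key %lld -> partition %u\n", (long long) keys[i], p);
    }

    for (int p = 0; p < NUM_PARTITIONS; p++)
        printf("partition %d: %d tuples\n", p, counts[p]);

    return 0;
}
```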
>Hash offers:
>
>* Data is aggregated on-the-fly, effectively "compressing" the amount
>  of data that needs to go to disk. This is particularly important
>  when the data contains skewed groups (see below).
>
>* Can output some groups after the first pass of the input data even
>  if other groups spilled.
>
>* Some data types only support hashing; not sorting.
>
>Sort+Group offers:
>
>* Only one group is accumulating at once, so if the transition state
>  grows (like with ARRAY_AGG), it minimizes the memory needed.
>
>* The input may already happen to be sorted.
>
>* Some data types only support sorting; not hashing.
>
>Currently, Hash Aggregation is only chosen if the optimizer believes
>that all the groups (and their transition states) fit in memory.
>Unfortunately, if the optimizer is wrong (often the case if the input
>is not a base table), the hash table will keep growing beyond
>work_mem, potentially bringing the entire system to OOM. This patch
>fixes that problem by extending the Hash Aggregation strategy to spill
>to disk when needed.
>

OK, makes sense.

>Previous discussions:
>
>https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop
>
>https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop
>
>https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi
>
>A lot was discussed, which I will try to summarize and address here.
>
>Digression: Skewed Groups:
>
>Imagine the input tuples have the following grouping keys:
>
>  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N
>
>Group 0 is a skew group because it consists of 50% of all tuples in
>the table, whereas every other group has a single tuple. If the
>algorithm is able to keep group 0 in memory the whole time until
>finalized, that means that it doesn't have to spill any group-0
>tuples. In this example, that would amount to a 50% savings, and is a
>major advantage of Hash Aggregation versus Sort+Group.
>

Right. I agree efficiently handling skew is important and may be
crucial for achieving good performance.

>High-level approaches:
>
>1. When the in-memory hash table fills, keep existing entries in the
>hash table, and spill the raw tuples for all new groups in a
>partitioned fashion. When all input tuples are read, finalize groups
>in memory and emit. Now that the in-memory hash table is cleared (and
>memory context reset), process a spill file the same as the original
>input, but this time with a fraction of the group cardinality.
>
>2. When the in-memory hash table fills, partition the hash space, and
>evict the groups from all partitions except one by writing out their
>partial aggregate states to disk. Any input tuples belonging to an
>evicted partition get spilled to disk. When the input is read
>entirely, finalize the groups remaining in memory and emit. Now that
>the in-memory hash table is cleared, process the next partition by
>loading its partial states into the hash table, and then processing
>its spilled tuples.
>
>3. Use some kind of hybrid[1][2] of hashing and sorting.
>

Unfortunately the second link does not work :-(
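(Purely to make the control flow of Approach 1 concrete, here is a toy,
self-contained model of it, not the actual patch: the "tuples" are just
int64 group keys, the only aggregate is a count, TABLE_CAPACITY stands
in for work_mem, and the spill partitions are in-memory arrays instead
of files. The skewed group 0 never spills because it enters the table
before it fills.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TABLE_CAPACITY 3      /* stand-in for the work_mem limit */
#define NUM_PARTITIONS 2

typedef struct { int64_t key; int64_t count; } Group;

static uint32_t
hash_key(int64_t key)
{
    return (uint32_t) (((uint64_t) key * 0x9E3779B97F4A7C15ULL) >> 32);
}

/* linear-search "hash table": fine for a toy with tiny capacity */
static Group *
lookup(Group *groups, int ngroups, int64_t key)
{
    for (int i = 0; i < ngroups; i++)
        if (groups[i].key == key)
            return &groups[i];
    return NULL;
}

static void
agg_pass(const int64_t *input, int ninput, int depth)
{
    Group    groups[TABLE_CAPACITY];
    int      ngroups = 0;
    int64_t *spill[NUM_PARTITIONS] = {NULL};
    int      nspill[NUM_PARTITIONS] = {0};

    for (int i = 0; i < ninput; i++)
    {
        Group *g = lookup(groups, ngroups, input[i]);

        if (g != NULL)
            g->count++;                     /* group already in memory */
        else if (ngroups < TABLE_CAPACITY)
        {
            groups[ngroups].key = input[i]; /* room for a new group */
            groups[ngroups].count = 1;
            ngroups++;
        }
        else
        {
            /* table full: spill the raw tuple, partitioned by hash
             * (error checking omitted for brevity) */
            int p = (int) (hash_key(input[i]) % NUM_PARTITIONS);

            spill[p] = realloc(spill[p], (nspill[p] + 1) * sizeof(int64_t));
            spill[p][nspill[p]++] = input[i];
        }
    }

    /* finalize and emit everything that stayed in memory */
    for (int i = 0; i < ngroups; i++)
        printf("%*sgroup %lld: count %lld\n", depth * 2, "",
               (long long) groups[i].key, (long long) groups[i].count);

    /* process each spill partition as a fresh, smaller input */
    for (int p = 0; p < NUM_PARTITIONS; p++)
        if (nspill[p] > 0)
            agg_pass(spill[p], nspill[p], depth + 1);

    for (int p = 0; p < NUM_PARTITIONS; p++)
        free(spill[p]);
}

int
main(void)
{
    /* skewed input: group 0 is half the tuples, like the example above */
    int64_t input[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6};

    agg_pass(input, (int) (sizeof(input) / sizeof(input[0])), 0);
    return 0;
}
```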
>Evaluation of approaches:
>
>Approach 1 is a nice incremental improvement on today's code. The
>final patch may be around 1KLOC. It's a single kind of on-disk data
>(spilled tuples), and a single algorithm (hashing). It also handles
>skewed groups well because the skewed groups are likely to be
>encountered before the hash table fills up the first time, and
>therefore will stay in memory.
>

I'm not going to block Approach 1, although I'd really like to see
something that helps with array_agg.

>Approach 2 is nice because it resembles the approach of Hash Join, and
>it can determine whether a tuple should be spilled without a hash
>lookup. Unfortunately, those upsides are fairly mild, and it has
>significant downsides:
>
>* It doesn't handle skew values well because it's likely to evict
>  them.
>
>* If we leave part of the hash table in memory, it's difficult to
>  ensure that we will be able to actually use the space freed by
>  eviction, because the freed memory may be fragmented. That could
>  force us to evict the entire in-memory hash table as soon as we
>  partition, reducing a lot of the benefit of hashing.
>

Yeah, and it may not work well with the memory accounting if we only
track the size of allocated blocks, not chunks (because pfree likely
won't free the blocks).

>* It requires eviction for the algorithm to work. That may be
>  necessary for handling cases like ARRAY_AGG (see below) anyway, but
>  this approach constrains the specifics of eviction.
>
>Approach 3 is interesting because it unifies the two approaches and
>can get some of the benefits of both. It's only a single path, so it
>avoids planner mistakes. I really like this idea and it's possible we
>will end up with approach 3. However:
>
>* It requires that all data types support sorting, or that we punt
>  somehow.
>
>* Right now we are in a weird state because hash aggregation cheats,
>  so it's difficult to evaluate whether Approach 3 is moving us in the
>  right direction because we have no other correct implementation to
>  compare against. Even if Approach 3 is where we end up, it seems
>  like we should fix hash aggregation as a stepping stone first.
>

Aren't all three approaches a way to "fix" hash aggregate? In any
case, it's certainly reasonable to make incremental changes. The
question is whether "approach 1" is a sensible step towards some form
of "approach 3".

>* It means we have a hash table and sort running concurrently, each
>  using memory. Andres said this might not be a problem[3], but I'm
>  not convinced that the problem is zero. If you use small work_mem
>  for the write phase of sorting, you'll end up with a lot of runs to
>  merge later and that has some kind of cost.
>

Why would we need to do both concurrently? I thought we'd empty the
hash table before doing the sort, no?

>* The simplicity might start to evaporate when we consider grouping
>  sets and eviction strategy.
>

Hmm, yeah :-/

>Main topics to consider:
>
>ARRAY_AGG:
>
>Some aggregates, like ARRAY_AGG, have a transition state that grows
>proportionally with the group size. In other words, it is not a
>summary like COUNT or AVG, it contains all of the input data in a new
>form.
>

Strictly speaking the state may grow even for count/avg aggregates,
e.g. for numeric types, but it's far less serious than array_agg etc.

>These aggregates are not a good candidate for hash aggregation. Hash
>aggregation is about keeping many transition states running in
>parallel, which is just a bad fit for large transition states. Sorting
>is better because it advances one transition state at a time. We could:
>
>* Let ARRAY_AGG continue to exceed work_mem like today.
>
>* Block or pessimize use of hash aggregation for such aggregates.
>
>* Evict groups from the hash table when it becomes too large. This
>  requires the ability to serialize and deserialize transition states,
>  and some approaches here might also need combine_func
>  specified. These requirements seem reasonable, but we still need
>  some answer of what to do for aggregates that grow like ARRAY_AGG
>  but don't have the required serialfunc, deserialfunc, or
>  combine_func.
>

Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what
we do now), and require serial/deserial functions for anything better.

>GROUPING SETS:
>
>With grouping sets, there are multiple hash tables and each hash table
>has its own hash function, so that makes partitioning more complex. In
>Approach 1, that means we need to either (a) not partition the spilled
>tuples; or (b) have a different set of partitions for each hash table
>and spill the same tuple multiple times. In Approach 2, we would be
>required to partition each hash table separately and spill tuples
>multiple times. In Approach 3 (depending on the exact approach but
>taking a guess here) we would need to add a set of phases (one extra
>phase for each hash table) for spilled tuples.
>

No thoughts about this yet.

>MEMORY TRACKING:
>
>I have a patch to track the total allocated memory by
>incrementing/decrementing it when blocks are malloc'd/free'd. This
>doesn't do bookkeeping for each chunk, only each block. Previously,
>Robert Haas raised some concerns[4] about performance, which were
>mitigated[5] but perhaps not entirely eliminated (but did become
>elusive).
>
>The only alternative is estimation, which is ugly and seems like a bad
>idea. Memory usage isn't just driven by inputs, it's also driven by
>patterns of use. Misestimates in the planner are fine (within reason)
>because we don't have any other choice, and a small-factor misestimate
>might not change the plan anyway. But in the executor, a small-factor
>misestimate seems like it's just not doing the job. If a user found
>that hash aggregation was using 3X work_mem, and my only explanation
>is "well, it's just an estimate", I would be pretty embarrassed and
>the user would likely lose confidence in the feature. I don't mean
>that we must track memory perfectly everywhere, but using an estimate
>seems like a mediocre improvement of the current state.

I agree estimates are not the right tool here.

>We should proceed with memory context tracking and try to eliminate or
>mitigate performance concerns. I would not like to make any herculean
>effort as a part of the hash aggregation work though; I think it's
>basically just something a memory manager in a database system should
>have supported all along. I think we will find other uses for it as
>time goes on. We have more and more things happening in the executor
>and having a cheap way to check "how much memory is this thing using?"
>seems very likely to be useful.
>

IMO we should just use the cheapest memory accounting (tracking the
amount of memory allocated for blocks). I agree it's a feature we
need, I don't think we can devise anything cheaper than this.
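(For illustration only, a minimal sketch of that kind of block-level
accounting: a single counter per context, adjusted only when whole
blocks are malloc'd or free'd, never per chunk. The names here
(MemAccounting, block_alloc, block_free) are made up, not from the
patch. Note this is exactly the block-vs-chunk distinction above: the
counter only drops when a block is actually released, not when
individual chunks are pfree'd.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* One counter per "memory context": updated only when blocks are
 * malloc'd or free'd, never per chunk, so the overhead is tiny. */
typedef struct
{
    size_t total_allocated;   /* bytes currently held in blocks */
} MemAccounting;

static void *
block_alloc(MemAccounting *acct, size_t size)
{
    void *block = malloc(size);

    if (block != NULL)
        acct->total_allocated += size;
    return block;
}

static void
block_free(MemAccounting *acct, void *block, size_t size)
{
    free(block);
    acct->total_allocated -= size;
}

int
main(void)
{
    MemAccounting acct = {0};
    size_t        work_mem = 64 * 1024;   /* pretend limit, in bytes */

    void *a = block_alloc(&acct, 8192);
    void *b = block_alloc(&acct, 65536);

    /* the aggregation code can cheaply ask "how much memory is this
     * thing using?" and decide to spill once the limit is exceeded */
    if (acct.total_allocated > work_mem)
        printf("over budget (%zu bytes), time to spill\n",
               acct.total_allocated);

    block_free(&acct, b, 65536);
    block_free(&acct, a, 8192);
    printf("remaining: %zu bytes\n", acct.total_allocated);
    return 0;
}
```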
>Other points:
>
>* Someone brought up the idea of using logtapes.c instead of writing
>  separate files for each partition. That seems reasonable, but it's
>  using logtapes.c slightly outside of its intended purpose. Also,
>  it's awkward to need to specify the number of tapes up-front. Worth
>  experimenting with to see if it's a win.
>
>* Tomas did some experiments regarding the number of batches to choose
>  and how to choose them. It seems like there's room for improvement
>  over the simple calculation I'm doing now.
>

Me? I don't recall such benchmarks, but maybe I did. But I think we'll
need to repeat those with the new patches etc. I think the question is
whether we see this as an emergency solution - in that case I wouldn't
obsess about getting the best possible parameters.

>* A lot of discussion about a smart eviction strategy. I don't see
>  strong evidence that it's worth the complexity at this time. The
>  smarter we try to be, the more bookkeeping and memory fragmentation
>  problems we will have. If we evict something, we should probably
>  evict the whole hash table or some large part of it.
>

Maybe. For each "smart" eviction strategy there is a (trivial) example
of data on which it performs poorly. I think it's the same thing as
with the number of partitions - if we consider this to be an emergency
solution, it's OK if the performance is not entirely perfect when it
kicks in.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services