Thread: Memory-Bounded Hash Aggregation
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs randomly
during the merge phase. A hash is a convenient small representation of
the data that is cheaper to operate on; sort uses abbreviated keys for
the same reason.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
  of data that needs to go to disk. This is particularly important
  when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
  if other groups spilled.

* Some data types only support hashing; not sorting.

Sort+Group offers:

* Only one group is accumulating at once, so if the transition state
  grows (like with ARRAY_AGG), it minimizes the memory needed.

* The input may already happen to be sorted.

* Some data types only support sorting; not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in memory.
Unfortunately, if the optimizer is wrong (often the case if the input
is not a base table), the hash table will keep growing beyond
work_mem, potentially bringing the entire system to OOM. This patch
fixes that problem by extending the Hash Aggregation strategy to spill
to disk when needed.

Previous discussions:

https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop

https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop

https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.

3. Use some kind of hybrid[1][2] of hashing and sorting.
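To make Approach 1 concrete, here is a toy, self-contained model of the
control flow: plain C, made-up names, COUNT(*) over int keys, a tiny
fixed array standing in for work_mem, and in-memory arrays standing in
for spill files. It is an illustration of the recursion, not the actual
patch:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_GROUPS   4          /* stands in for work_mem */
    #define NPARTITIONS  4          /* spill partitions per level */

    struct group { int key; long count; };

    static int
    find_group(const struct group *groups, int ngroups, int key)
    {
        for (int i = 0; i < ngroups; i++)
            if (groups[i].key == key)
                return i;
        return -1;
    }

    /* Aggregate COUNT(*) per key; spill tuples of unseen groups and recurse. */
    static void
    agg(const int *keys, int nkeys, int depth)
    {
        struct group groups[MAX_GROUPS];
        int     ngroups = 0;
        int    *part[NPARTITIONS];
        int     partlen[NPARTITIONS] = {0};

        for (int p = 0; p < NPARTITIONS; p++)
            part[p] = malloc(sizeof(int) * nkeys);

        for (int i = 0; i < nkeys; i++)
        {
            int g = find_group(groups, ngroups, keys[i]);

            if (g >= 0)
                groups[g].count++;              /* existing group stays in memory */
            else if (ngroups < MAX_GROUPS)
            {
                groups[ngroups].key = keys[i];  /* room left: start a new group */
                groups[ngroups].count = 1;
                ngroups++;
            }
            else
            {
                /* "hash table" is full: spill the raw tuple, partitioned by hash */
                unsigned h = (unsigned) (keys[i] ^ (depth * 0x9e3779b9u)) * 2654435761u;
                int p = h % NPARTITIONS;

                part[p][partlen[p]++] = keys[i];
            }
        }

        /* finalize and emit the groups we kept in memory */
        for (int i = 0; i < ngroups; i++)
            printf("key=%d count=%ld\n", groups[i].key, groups[i].count);

        /* then treat each spill partition exactly like the original input */
        for (int p = 0; p < NPARTITIONS; p++)
        {
            if (partlen[p] > 0)
                agg(part[p], partlen[p], depth + 1);
            free(part[p]);
        }
    }

    int
    main(void)
    {
        int keys[40];

        /* the skewed input from the digression: 0, 1, 0, 2, 0, 3, ... */
        for (int i = 0; i < 40; i += 2)
        {
            keys[i] = 0;
            keys[i + 1] = i / 2 + 1;
        }
        agg(keys, 40, 0);
        return 0;
    }

Note how group 0 gets an in-memory entry before the table fills and
stays there for the whole run, so none of its tuples are ever written
out; that is the skew benefit described in the digression above.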
Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skew values well because it's likely to evict
  them.

* If we leave part of the hash table in memory, it's difficult to
  ensure that we will be able to actually use the space freed by
  eviction, because the freed memory may be fragmented. That could
  force us to evict the entire in-memory hash table as soon as we
  partition, reducing a lot of the benefit of hashing.

* It requires eviction for the algorithm to work. That may be
  necessary for handling cases like ARRAY_AGG (see below) anyway, but
  this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data types support sorting, or that we punt
  somehow.

* Right now we are in a weird state because hash aggregation cheats,
  so it's difficult to evaluate whether Approach 3 is moving us in the
  right direction because we have no other correct implementation to
  compare against. Even if Approach 3 is where we end up, it seems
  like we should fix hash aggregation as a stepping stone first.

* It means we have a hash table and sort running concurrently, each
  using memory. Andres said this might not be a problem[3], but I'm
  not convinced that the problem is zero. If you use small work_mem
  for the write phase of sorting, you'll end up with a lot of runs to
  merge later and that has some kind of cost.

* The simplicity might start to evaporate when we consider grouping
  sets and eviction strategy.

Main topics to consider:

ARRAY_AGG:

Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.

These aggregates are not a good candidate for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We
could:

* Let ARRAY_AGG continue to exceed work_mem like today.

* Block or pessimize use of hash aggregation for such aggregates.

* Evict groups from the hash table when it becomes too large. This
  requires the ability to serialize and deserialize transition states,
  and some approaches here might also need combine_func specified.
  These requirements seem reasonable, but we still need some answer of
  what to do for aggregates that grow like ARRAY_AGG but don't have
  the required serialfunc, deserialfunc, or combine_func.

GROUPING SETS:

With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more complex.
In Approach 1, that means we need to either (a) not partition the
spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.

MEMORY TRACKING:

I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (but did become
elusive).

The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement of the current state.

We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.

Other points:

* Someone brought up the idea of using logtapes.c instead of writing
  separate files for each partition. That seems reasonable, but it's
  using logtapes.c slightly outside of its intended purpose. Also,
  it's awkward to need to specify the number of tapes up-front. Worth
  experimenting with to see if it's a win.

* Tomas did some experiments regarding the number of batches to choose
  and how to choose them. It seems like there's room for improvement
  over the simple calculation I'm doing now.

* A lot of discussion about a smart eviction strategy. I don't see
  strong evidence that it's worth the complexity at this time. The
  smarter we try to be, the more bookkeeping and memory fragmentation
  problems we will have. If we evict something, we should probably
  evict the whole hash table or some large part of it.

Regards,
	Jeff Davis

[1] https://postgr.es/m/20180604185205.epue25jzpavokupf%40alap3.anarazel.de
[2] https://postgr.es/m/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/20180605175209.vavuqe4idovcpeie%40alap3.anarazel.de
[4] https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com
[5] https://www.postgresql.org/message-id/1413422787.18615.18.camel%40jeff-desktop
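As an aside, the block-level accounting described above is cheap
precisely because it touches only block allocation and release, never
each palloc chunk. A toy sketch of that idea (made-up names, not the
actual memory-context patch):

    #include <stdlib.h>

    /* Toy model: account at block boundaries only, never per chunk. */
    static size_t total_block_bytes = 0;

    static void *
    block_alloc(size_t size)
    {
        void *block = malloc(size);

        if (block != NULL)
            total_block_bytes += size;      /* one increment per block */
        return block;
    }

    static void
    block_free(void *block, size_t size)
    {
        free(block);
        total_block_bytes -= size;          /* one decrement per block */
    }

The memory-context tracking discussed in the thread does this inside
the context machinery, but the cost profile is the same idea: one add
or subtract per block, and "how much memory is this thing using?"
becomes a single read of a counter.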
Hi Jeff, On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote: >This is for design review. I have a patch (WIP) for Approach 1, and if >this discussion starts to converge on that approach I will polish and >post it. > Thanks for working on this. >Let's start at the beginning: why do we have two strategies -- hash >and sort -- for aggregating data? The two are more similar than they >first appear. A partitioned hash strategy writes randomly among the >partitions, and later reads the partitions sequentially; a sort will >write sorted runs sequentially, but then read the among the runs >randomly during the merge phase. A hash is a convenient small >representation of the data that is cheaper to operate on; sort uses >abbreviated keys for the same reason. > What does "partitioned hash strategy" do? It's probably explained in one of the historical discussions, but I'm not sure which one. I assume it simply hashes the group keys and uses that to partition the data, and then passing it to hash aggregate. >Hash offers: > >* Data is aggregated on-the-fly, effectively "compressing" the amount > of data that needs to go to disk. This is particularly important > when the data contains skewed groups (see below). > >* Can output some groups after the first pass of the input data even > if other groups spilled. > >* Some data types only support hashing; not sorting. > >Sort+Group offers: > >* Only one group is accumulating at once, so if the transition state > grows (like with ARRAY_AGG), it minimizes the memory needed. > >* The input may already happen to be sorted. > >* Some data types only support sorting; not hashing. > >Currently, Hash Aggregation is only chosen if the optimizer believes >that all the groups (and their transition states) fit in >memory. Unfortunately, if the optimizer is wrong (often the case if the >input is not a base table), the hash table will >keep growing beyond work_mem, potentially bringing the entire system >to OOM. This patch fixes that problem by extending the Hash >Aggregation strategy to spill to disk when needed. > OK, makes sense. >Previous discussions: > > >https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop > >https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop > >https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi > >A lot was discussed, which I will try to summarize and address here. > >Digression: Skewed Groups: > >Imagine the input tuples have the following grouping keys: > > 0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N > >Group 0 is a skew group because it consists of 50% of all tuples in >the table, whereas every other group has a single tuple. If the >algorithm is able to keep group 0 in memory the whole time until >finalized, that means that it doesn't have to spill any group-0 >tuples. In this example, that would amount to a 50% savings, and is a >major advantage of Hash Aggregation versus Sort+Group. > Right. I agree efficiently handling skew is important and may be crucial for achieving good performance. >High-level approaches: > >1. When the in-memory hash table fills, keep existing entries in the >hash table, and spill the raw tuples for all new groups in a >partitioned fashion. When all input tuples are read, finalize groups >in memory and emit. Now that the in-memory hash table is cleared (and >memory context reset), process a spill file the same as the original >input, but this time with a fraction of the group cardinality. > >2. 
When the in-memory hash table fills, partition the hash space, and >evict the groups from all partitions except one by writing out their >partial aggregate states to disk. Any input tuples belonging to an >evicted partition get spilled to disk. When the input is read >entirely, finalize the groups remaining in memory and emit. Now that >the in-memory hash table is cleared, process the next partition by >loading its partial states into the hash table, and then processing >its spilled tuples. > >3. Use some kind of hybrid[1][2] of hashing and sorting. > Unfortunately the second link does not work :-( >Evaluation of approaches: > >Approach 1 is a nice incremental improvement on today's code. The >final patch may be around 1KLOC. It's a single kind of on-disk data >(spilled tuples), and a single algorithm (hashing). It also handles >skewed groups well because the skewed groups are likely to be >encountered before the hash table fills up the first time, and >therefore will stay in memory. > I'm not going to block Approach 1, althought I'd really like to see something that helps with array_agg. >Approach 2 is nice because it resembles the approach of Hash Join, and >it can determine whether a tuple should be spilled without a hash >lookup. Unfortunately, those upsides are fairly mild, and it has >significant downsides: > >* It doesn't handle skew values well because it's likely to evict > them. > >* If we leave part of the hash table in memory, it's difficult to > ensure that we will be able to actually use the space freed by > eviction, because the freed memory may be fragmented. That could > force us to evict the entire in-memory hash table as soon as we > partition, reducing a lot of the benefit of hashing. > Yeah, and it may not work well with the memory accounting if we only track the size of allocated blocks, not chunks (because pfree likely won't free the blocks). >* It requires eviction for the algorithm to work. That may be > necessary for handling cases like ARRAY_AGG (see below) anyway, but > this approach constrains the specifics of eviction. > >Approach 3 is interesting because it unifies the two approaches and >can get some of the benfits of both. It's only a single path, so it >avoids planner mistakes. I really like this idea and it's possible we >will end up with approach 3. However: > >* It requires that all data types support sorting, or that we punt > somehow. > >* Right now we are in a weird state because hash aggregation cheats, > so it's difficult to evaluate whether Approach 3 is moving us in the > right direction because we have no other correct implementation to > compare against. Even if Approach 3 is where we end up, it seems > like we should fix hash aggregation as a stepping stone first. > Aren't all three approaches a way to "fix" hash aggregate? In any case, it's certainly reasonable to make incremental changes. The question is whether "approach 1" is sensible step towards some form of "approach 3" >* It means we have a hash table and sort running concurrently, each > using memory. Andres said this might not be a problem[3], but I'm > not convinced that the problem is zero. If you use small work_mem > for the write phase of sorting, you'll end up with a lot of runs to > merge later and that has some kind of cost. > Why would we need to do both concurrently? I thought we'd empty the hash table before doing the sort, no? >* The simplicity might start to evaporate when we consider grouping > sets and eviction strategy. 
> Hmm, yeah :-/ >Main topics to consider: > >ARRAY_AGG: > >Some aggregates, like ARRAY_AGG, have a transition state that grows >proportionally with the group size. In other words, it is not a >summary like COUNT or AVG, it contains all of the input data in a new >form. > Strictly speaking the state may grow even for count/avg aggregates, e.g. for numeric types, but it's far less serious than array_agg etc. >These aggregates are not a good candidate for hash aggregation. Hash >aggregation is about keeping many transition states running in >parallel, which is just a bad fit for large transition states. Sorting >is better because it advances one transition state at a time. We could: > >* Let ARRAY_AGG continue to exceed work_mem like today. > >* Block or pessimize use of hash aggregation for such aggregates. > >* Evict groups from the hash table when it becomes too large. This > requires the ability to serialize and deserialize transition states, > and some approaches here might also need combine_func > specified. These requirements seem reasonable, but we still need > some answer of what to do for aggregates that grow like ARRAY_AGG > but don't have the required serialfunc, deserialfunc, or > combine_func. > Do we actually need to handle that case? How many such aggregates are there? I think it's OK to just ignore that case (and keep doing what we do now), and require serial/deserial functions for anything better. >GROUPING SETS: > >With grouping sets, there are multiple hash tables and each hash table >has it's own hash function, so that makes partitioning more >complex. In Approach 1, that means we need to either (a) not partition >the spilled tuples; or (b) have a different set of partitions for each >hash table and spill the same tuple multiple times. In Approach 2, we >would be required to partition each hash table separately and spill >tuples multiple times. In Approach 3 (depending on the exact approach >but taking a guess here) we would need to add a set of phases (one >extra phase for each hash table) for spilled tuples. > No thoughts about this yet. >MEMORY TRACKING: > >I have a patch to track the total allocated memory by >incrementing/decrementing it when blocks are malloc'd/free'd. This >doesn't do bookkeeping for each chunk, only each block. Previously, >Robert Haas raised some concerns[4] about performance, which were >mitigated[5] but perhaps not entirely eliminated (but did become >elusive). > >The only alternative is estimation, which is ugly and seems like a bad >idea. Memory usage isn't just driven by inputs, it's also driven by >patterns of use. Misestimates in the planner are fine (within reason) >because we don't have any other choice, and a small-factor misestimate >might not change the plan anyway. But in the executor, a small-factor >misestimate seems like it's just not doing the job. If a user found >that hash aggregation was using 3X work_mem, and my only explanation >is "well, it's just an estimate", I would be pretty embarrassed and >the user would likely lose confidence in the feature. I don't mean >that we must track memory perfectly everywhere, but using an estimate >seems like a mediocre improvement of the current state. I agree estimates are not the right tool here. > >We should proceed with memory context tracking and try to eliminate or >mitigate performance concerns. 
I would not like to make any hurculean >effort as a part of the hash aggregation work though; I think it's >basically just something a memory manager in a database system should >have supported all along. I think we will find other uses for it as >time goes on. We have more and more things happening in the executor >and having a cheap way to check "how much memory is this thing using?" >seems very likely to be useful. > IMO we should just use the cheapest memory accounting (tracking the amount of memory allocated for blocks). I agree it's a feature we need, I don't think we can devise anything cheaper than this. >Other points: > >* Someone brought up the idea of using logtapes.c instead of writing > separate files for each partition. That seems reasonable, but it's > using logtapes.c slightly outside of its intended purpose. Also, > it's awkward to need to specify the number of tapes up-front. Worth > experimenting with to see if it's a win. > >* Tomas did some experiments regarding the number of batches to choose > and how to choose them. It seems like there's room for improvement > over ths simple calculation I'm doing now. > Me? I don't recall such benchmarks, but maybe I did. But I think we'll need to repeat those with the new patches etc. I think the question is whether we see this as an emergency solution - in that case I wouldn't obsess about getting the best possible parameters. >* A lot of discussion about a smart eviction strategy. I don't see > strong evidence that it's worth the complexity at this time. The > smarter we try to be, the more bookkeeping and memory fragmentation > problems we will have. If we evict something, we should probably > evict the whole hash table or some large part of it. > Maybe. For each "smart" eviction strategy there is a (trivial) example of data on which it performs poorly. I think it's the same thing as with the number of partitions - if we consider this to be an emergency solution, it's OK if the performance is not entirely perfect when it kicks in. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2019-07-01 at 12:13 -0700, Jeff Davis wrote:
> This is for design review. I have a patch (WIP) for Approach 1, and
> if this discussion starts to converge on that approach I will polish
> and post it.

WIP patch attached (based on 9a81c9fa); targeting September CF. Not
intended for detailed review yet, but it seems to work in enough cases
(including grouping sets and JIT) to be a good proof-of-concept for
the algorithm and its complexity.

Initial performance numbers put it at 2X slower than sort for grouping
10M distinct integers. There are quite a few optimizations I haven't
tried yet and quite a few tunables I haven't tuned yet, so hopefully I
can close the gap a bit for the small-groups case. I will offer more
details soon when I have more confidence in the numbers.

It does not attempt to spill ARRAY_AGG at all yet.

Regards,
	Jeff Davis
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
> What does "partitioned hash strategy" do? It's probably explained in
> one of the historical discussions, but I'm not sure which one. I
> assume it simply hashes the group keys and uses that to partition the
> data, and then passing it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple
times.

Previous discussions:

> Unfortunately the second link does not work :-(

It's supposed to be:

https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com

> I'm not going to block Approach 1, althought I'd really like to see
> something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later
combining them.

> Aren't all three approaches a way to "fix" hash aggregate? In any
> case, it's certainly reasonable to make incremental changes. The
> question is whether "approach 1" is sensible step towards some form
> of "approach 3"

Disk-based hashing certainly seems like a reasonable algorithm on
paper that has some potential advantages over sorting. It certainly
seems sensible to me that we explore the disk-based hashing strategy
first, and then we would at least know what we are missing (if
anything) by going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

> > * It means we have a hash table and sort running concurrently, each
> >   using memory. Andres said this might not be a problem[3], but I'm
> >   not convinced that the problem is zero. If you use small work_mem
> >   for the write phase of sorting, you'll end up with a lot of runs
> >   to merge later and that has some kind of cost.
>
> Why would we need to do both concurrently? I thought we'd empty the
> hash table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

> Do we actually need to handle that case? How many such aggregates are
> there? I think it's OK to just ignore that case (and keep doing what
> we do now), and require serial/deserial functions for anything
> better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

Regards,
	Jeff Davis
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote: >On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote: >> What does "partitioned hash strategy" do? It's probably explained in >> one >> of the historical discussions, but I'm not sure which one. I assume >> it >> simply hashes the group keys and uses that to partition the data, and >> then >> passing it to hash aggregate. > >Yes. When spilling, it is cheap to partition on the hash value at the >same time, which dramatically reduces the need to spill multiple times. >Previous discussions: > > >> Unfortunately the second link does not work :-( > >It's supposed to be: > > >https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com > > >> I'm not going to block Approach 1, althought I'd really like to see >> something that helps with array_agg. > >I have a WIP patch that I just posted. It doesn't yet work with >ARRAY_AGG, but I think it can be made to work by evicting the entire >hash table, serializing the transition states, and then later combining >them. > >> Aren't all three approaches a way to "fix" hash aggregate? In any >> case, >> it's certainly reasonable to make incremental changes. The question >> is >> whether "approach 1" is sensible step towards some form of "approach >> 3" > >Disk-based hashing certainly seems like a reasonable algorithm on paper >that has some potential advantages over sorting. It certainly seems >sensible to me that we explore the disk-based hashing strategy first, >and then we would at least know what we are missing (if anything) by >going with the hybrid approach later. > >There's also a fair amount of design space to explore in the hybrid >strategy. That could take a while to converge, especially if we don't >have anything in place to compare against. > Makes sense. I haven't thought about how the hybrid approach would be implemented very much, so I can't quite judge how complicated would it be to extend "approach 1" later. But if you think it's a sensible first step, I trust you. And I certainly agree we need something to compare the other approaches against. >> > * It means we have a hash table and sort running concurrently, each >> > using memory. Andres said this might not be a problem[3], but I'm >> > not convinced that the problem is zero. If you use small work_mem >> > for the write phase of sorting, you'll end up with a lot of runs >> > to >> > merge later and that has some kind of cost. >> > >> >> Why would we need to do both concurrently? I thought we'd empty the >> hash >> table before doing the sort, no? > >So you are saying we spill the tuples into a tuplestore, then feed the >tuplestore through a tuplesort? Seems inefficient, but I guess we can. > I think the question is whether we see this as "emergency fix" (for cases that are misestimated and could/would fail with OOM at runtime), or as something that is meant to make "hash agg" more widely applicable. I personally see it as an emergency fix, in which cases it's perfectly fine if it's not 100% efficient, assuming it kicks in only rarely. Effectively, we're betting on hash agg, and from time to time we lose. But even if we see it as a general optimization technique it does not have to be perfectly efficient, as long as it's properly costed (so the planner only uses it when appropriate). If we have a better solution (in terms of efficiency, code complexity, etc.) then sure - let's use that. 
But considering we've started this discussion in ~2015 and we still don't have anything, I wouldn't hold my breath. Let's do something good enough, and maybe improve it later. >> Do we actually need to handle that case? How many such aggregates are >> there? I think it's OK to just ignore that case (and keep doing what >> we do >> now), and require serial/deserial functions for anything better. > >Punting on a few cases is fine with me, if the user has a way to fix >it. > +1 to doing that regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote: >On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote: >> What does "partitioned hash strategy" do? It's probably explained in >> one >> of the historical discussions, but I'm not sure which one. I assume >> it >> simply hashes the group keys and uses that to partition the data, and >> then >> passing it to hash aggregate. > >Yes. When spilling, it is cheap to partition on the hash value at the >same time, which dramatically reduces the need to spill multiple times. >Previous discussions: > > >> Unfortunately the second link does not work :-( > >It's supposed to be: > > >https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com > > >> I'm not going to block Approach 1, althought I'd really like to see >> something that helps with array_agg. > >I have a WIP patch that I just posted. It doesn't yet work with >ARRAY_AGG, but I think it can be made to work by evicting the entire >hash table, serializing the transition states, and then later combining >them. > >> Aren't all three approaches a way to "fix" hash aggregate? In any >> case, >> it's certainly reasonable to make incremental changes. The question >> is >> whether "approach 1" is sensible step towards some form of "approach >> 3" > >Disk-based hashing certainly seems like a reasonable algorithm on paper >that has some potential advantages over sorting. It certainly seems >sensible to me that we explore the disk-based hashing strategy first, >and then we would at least know what we are missing (if anything) by >going with the hybrid approach later. > >There's also a fair amount of design space to explore in the hybrid >strategy. That could take a while to converge, especially if we don't >have anything in place to compare against. > Makes sense. I haven't thought about how the hybrid approach would be implemented very much, so I can't quite judge how complicated would it be to extend "approach 1" later. But if you think it's a sensible first step, I trust you. And I certainly agree we need something to compare the other approaches against. >> > * It means we have a hash table and sort running concurrently, each >> > using memory. Andres said this might not be a problem[3], but I'm >> > not convinced that the problem is zero. If you use small work_mem >> > for the write phase of sorting, you'll end up with a lot of runs >> > to >> > merge later and that has some kind of cost. >> > >> >> Why would we need to do both concurrently? I thought we'd empty the >> hash >> table before doing the sort, no? > >So you are saying we spill the tuples into a tuplestore, then feed the >tuplestore through a tuplesort? Seems inefficient, but I guess we can. > I think the question is whether we see this as "emergency fix" (for cases that are misestimated and could/would fail with OOM at runtime), or as something that is meant to make "hash agg" more widely applicable. I personally see it as an emergency fix, in which cases it's perfectly fine if it's not 100% efficient, assuming it kicks in only rarely. Effectively, we're betting on hash agg, and from time to time we lose. But even if we see it as a general optimization technique it does not have to be perfectly efficient, as long as it's properly costed (so the planner only uses it when appropriate). If we have a better solution (in terms of efficiency, code complexity, etc.) then sure - let's use that. 
But considering we've started this discussion in ~2015 and we still don't have anything, I wouldn't hold my breath. Let's do something good enough, and maybe improve it later. >> Do we actually need to handle that case? How many such aggregates are >> there? I think it's OK to just ignore that case (and keep doing what >> we do >> now), and require serial/deserial functions for anything better. > >Punting on a few cases is fine with me, if the user has a way to fix >it. > +1 to doing that regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
> Makes sense. I haven't thought about how the hybrid approach would be
> implemented very much, so I can't quite judge how complicated would
> it be to extend "approach 1" later. But if you think it's a sensible
> first step, I trust you. And I certainly agree we need something to
> compare the other approaches against.

Is this a duplicate of your previous email?

I'm slightly confused but I will use the opportunity to put out
another WIP patch. The patch could use a few rounds of cleanup and
quality work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of
one, I'm seeing HashAgg get close to Sort+Group in some cases.

You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache
misses, because sometimes reducing work_mem improves performance. I'll
try tuning the number of buckets for the hash table and see if that
helps. If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.

Regards,
	Jeff Davis
On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote: >On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote: >> Makes sense. I haven't thought about how the hybrid approach would be >> implemented very much, so I can't quite judge how complicated would >> it be >> to extend "approach 1" later. But if you think it's a sensible first >> step, >> I trust you. And I certainly agree we need something to compare the >> other >> approaches against. > >Is this a duplicate of your previous email? > Yes. I don't know how I managed to send it again. Sorry. >I'm slightly confused but I will use the opportunity to put out another >WIP patch. The patch could use a few rounds of cleanup and quality >work, but the funcionality is there and the performance seems >reasonable. > >I rebased on master and fixed a few bugs, and most importantly, added >tests. > >It seems to be working with grouping sets fine. It will take a little >longer to get good performance numbers, but even for group size of one, >I'm seeing HashAgg get close to Sort+Group in some cases. > Nice! That's a very nice progress! >You are right that the missed lookups appear to be costly, at least >when the data all fits in system memory. I think it's the cache misses, >because sometimes reducing work_mem improves performance. I'll try >tuning the number of buckets for the hash table and see if that helps. >If not, then the performance still seems pretty good to me. > >Of course, HashAgg can beat sort for larger group sizes, but I'll try >to gather some more data on the cross-over point. > Yes, makes sense. I think it's acceptable as long as we consider this during costing (when we know in advance we'll need this) or treat it to be emergency measure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> High-level approaches:
>
> 1. When the in-memory hash table fills, keep existing entries in the
> hash table, and spill the raw tuples for all new groups in a
> partitioned fashion. When all input tuples are read, finalize groups
> in memory and emit. Now that the in-memory hash table is cleared (and
> memory context reset), process a spill file the same as the original
> input, but this time with a fraction of the group cardinality.
>
> 2. When the in-memory hash table fills, partition the hash space, and
> evict the groups from all partitions except one by writing out their
> partial aggregate states to disk. Any input tuples belonging to an
> evicted partition get spilled to disk. When the input is read
> entirely, finalize the groups remaining in memory and emit. Now that
> the in-memory hash table is cleared, process the next partition by
> loading its partial states into the hash table, and then processing
> its spilled tuples.

I'm late to the party.

These two approaches both spill the input tuples, but what if the
skewed groups are not encountered before the hash table fills up? The
spill files' size and disk I/O could be downsides.

Greenplum spills all the groups by writing out their partial aggregate
states, resets the memory context, processes incoming tuples and
builds a new in-memory hash table, then reloads and combines the
spilled partial states at the end. How does this sound?

--
Adam Lee
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> I'm late to the party.

You are welcome to join any time!

> These two approaches both spill the input tuples, what if the skewed
> groups are not encountered before the hash table fills up? The spill
> files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of
size one first; just enough to fill up memory. Then, we encounter a
single additional group of size 20 million, and need to write out all
of those 20 million raw tuples. That's still not worse than
Sort+GroupAgg which would need to write out all 30 million raw tuples
(in practice Sort is pretty fast so may still win in some cases, but
not by any huge amount).

> Greenplum spills all the groups by writing the partial aggregate
> states, reset the memory context, process incoming tuples and build
> in-memory hash table, then reload and combine the spilled partial
> states at last, how does this sound?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

It does add to the complexity though, and would only work for the
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.

Regards,
	Jeff Davis
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote: >On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote: >> I'm late to the party. > >You are welcome to join any time! > >> These two approaches both spill the input tuples, what if the skewed >> groups are not encountered before the hash table fills up? The spill >> files' size and disk I/O could be downsides. > >Let's say the worst case is that we encounter 10 million groups of size >one first; just enough to fill up memory. Then, we encounter a single >additional group of size 20 million, and need to write out all of those >20 million raw tuples. That's still not worse than Sort+GroupAgg which >would need to write out all 30 million raw tuples (in practice Sort is >pretty fast so may still win in some cases, but not by any huge >amount). > >> Greenplum spills all the groups by writing the partial aggregate >> states, >> reset the memory context, process incoming tuples and build in-memory >> hash table, then reload and combine the spilled partial states at >> last, >> how does this sound? > >That can be done as an add-on to approach #1 by evicting the entire >hash table (writing out the partial states), then resetting the memory >context. > >It does add to the complexity though, and would only work for the >aggregates that support serializing and combining partial states. It >also might be a net loss to do the extra work of initializing and >evicting a partial state if we don't have large enough groups to >benefit. > >Given that the worst case isn't worse than Sort+GroupAgg, I think it >should be left as a future optimization. That would give us time to >tune the process to work well in a variety of cases. > +1 to leaving that as a future optimization I think it's clear there's no perfect eviction strategy - for every algorithm we came up with we can construct a data set on which it performs terribly (I'm sure we could do that for the approach used by Greenplum, for example). So I think it makes sense to do what Jeff proposed, and then maybe try improving that in the future with a switch to different eviction strategy based on some heuristics. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I started to review this patch yesterday with Melanie Plageman, so we
rebased this patch over the current master. The main conflicts were
due to a simplehash patch that has been committed separately[1]. I've
attached the rebased patch.
I was playing with the code, and if one of the table's most common
values isn't placed into the initial hash table it spills a whole lot
of tuples to disk that might have been avoided if we had some way to
'seed' the hash table with MCVs from the statistics. Seems to me that
you would need some way of dealing with values that are in the MCV
list, but ultimately don't show up in the scan. I imagine that this
kind of optimization would be most useful for aggregates on a full table
scan.
Some questions:
Right now the patch always initializes 32 spill partitions. Have you given
any thought into how to intelligently pick an optimal number of
partitions yet?
> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.
By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be implemented
as a separate strategy?
> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.
I agree. It definitely feels like both spilling strategies have their
own use case.
That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.
[1] https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> I started to review this patch yesterday with Melanie Plageman, so we
> rebased this patch over the current master. The main conflicts were
> due to a simplehash patch that has been committed separately[1]. I've
> attached the rebased patch.

Great, thanks!

> I was playing with the code, and if one of the table's most common
> values isn't placed into the initial hash table it spills a whole lot
> of tuples to disk that might have been avoided if we had some way to
> 'seed' the hash table with MCVs from the statistics. Seems to me that
> you would need some way of dealing with values that are in the MCV
> list, but ultimately don't show up in the scan. I imagine that this
> kind of optimization would be most useful for aggregates on a full
> table scan.

Interesting idea, I didn't think of that.

> Some questions:
>
> Right now the patch always initializes 32 spill partitions. Have you
> given any thought into how to intelligently pick an optimal number of
> partitions yet?

Yes. The idea is to guess how many groups are remaining, then guess
how much space they will need in memory, then divide by work_mem. I
just didn't get around to it yet. (Same with the costing work.)

> By add-on approach, do you mean to say that you have something in
> mind to combine the two strategies? Or do you mean that it could be
> implemented as a separate strategy?

It would be an extension of the existing patch, but would add a fair
amount of complexity (dealing with partial states, etc.) and the
benefit would be fairly modest. We can do it later if justified.

> That said, I think it's worth mentioning that with parallel
> aggregates it might actually be more useful to spill the trans values
> instead, and have them combined in a Gather or Finalize stage.

That's a good point.

Regards,
	Jeff Davis
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> Right now the patch always initializes 32 spill partitions. Have you
> given any thought into how to intelligently pick an optimal number of
> partitions yet?

Attached a new patch that addresses this.

1. Divide hash table memory used by the number of groups in the hash
   table to get the average memory used per group.

2. Multiply by the number of groups spilled -- which I pessimistically
   estimate as the number of tuples spilled -- to get the total amount
   of memory that we'd like to have to process all spilled tuples at
   once.

3. Divide the desired amount of memory by work_mem to get the number
   of partitions we'd like to have such that each partition can be
   processed in work_mem without spilling.

4. Apply a few sanity checks, fudge factors, and limits.

Using this runtime information should be substantially better than
using estimates and projections.

Additionally, I removed some branches from the common path. I think I
still have more work to do there.

I also rebased of course, and fixed a few other things.

Regards,
	Jeff Davis
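Spelled out, steps 1-4 above amount to roughly the following. This is a
sketch with made-up names; the 4/256 clamps echo the
HASH_MIN_PARTITIONS/HASH_MAX_PARTITIONS values discussed later in the
thread and are not necessarily the patch's exact constants:

    #include <stddef.h>

    /* Illustrative only: pick the number of spill partitions at runtime. */
    static int
    choose_num_spill_partitions(size_t hash_mem_used, long ngroups_in_table,
                                long ntuples_spilled, size_t work_mem_bytes)
    {
        double  mem_per_group;
        double  mem_wanted;
        long    npartitions;

        if (ngroups_in_table < 1)
            ngroups_in_table = 1;               /* defensive; table is non-empty */

        /* step 1: average memory per group actually observed so far */
        mem_per_group = (double) hash_mem_used / (double) ngroups_in_table;

        /* step 2: pessimistically assume every spilled tuple is a new group */
        mem_wanted = mem_per_group * (double) ntuples_spilled;

        /* step 3: how many work_mem-sized pieces would hold all of that */
        npartitions = (long) (mem_wanted / (double) work_mem_bytes) + 1;

        /* step 4: sanity limits */
        if (npartitions < 4)
            npartitions = 4;
        if (npartitions > 256)
            npartitions = 256;

        return (int) npartitions;
    }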
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>> Right now the patch always initializes 32 spill partitions. Have you
>> given any thought into how to intelligently pick an optimal number
>> of partitions yet?
>
>Attached a new patch that addresses this.
>
>1. Divide hash table memory used by the number of groups in the hash
>table to get the average memory used per group.
>2. Multiply by the number of groups spilled -- which I pessimistically
>estimate as the number of tuples spilled -- to get the total amount of
>memory that we'd like to have to process all spilled tuples at once.

Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?

Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the
estimate of number of groups? I mean, if we know we've consumed 1000
tuples which created 100 groups, then we know there's ~1:10 ratio.

>3. Divide the desired amount of memory by work_mem to get the number
>of partitions we'd like to have such that each partition can be
>processed in work_mem without spilling.
>4. Apply a few sanity checks, fudge factors, and limits.
>
>Using this runtime information should be substantially better than
>using estimates and projections.
>
>Additionally, I removed some branches from the common path. I think I
>still have more work to do there.
>
>I also rebased of course, and fixed a few other things.

A couple of comments based on eye-balling the patch:


1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?


2) I'm a bit puzzled by this code in ExecInterpExpr (there are
multiple such blocks, this is just an example)

    aggstate = op->d.agg_init_trans.aggstate;
    pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
    pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

    /* If transValue has not yet been initialized, do so now. */
    if (pergroup_allaggs != NULL && pergroup->noTransValue)
    { ... }

How could the (pergroup_allaggs != NULL) protect against anything?
Let's assume the pointer really is NULL. Surely we'll get a segfault
on the preceding line which does dereference it

    pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

Or am I missing anything?


3) execGrouping.c

A couple of functions would deserve a comment, explaining what they
do.

  - LookupTupleHashEntryHash
  - prepare_hash_slot
  - calculate_hash

And it's not clear to me why we should remove part of the comment
before TupleHashTableHash.


4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that
was generally the case we'd just use small hash tables all the time.
It's a bit annoying to give the user the capability to set work_mem
and then kinda override that.

    * ... Another benefit of having more, smaller partitions is that small
    * hash tables may perform better than large ones due to memory caching
    * effects.


5) Not sure what "directly" means in this context?

    * partitions at the time we need to spill, and because this algorithm
    * shouldn't depend too directly on the internal memory needs of a
    * BufFile.

    #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?
6) I think we should have some protection against underflows in this
piece of code:

    - this would probably deserve some protection against underflow
      if HASH_PARTITION_MEM gets too big

    if (hashagg_mem_overflow)
        aggstate->hash_mem_limit = SIZE_MAX;
    else
        aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
to bump HASH_MIN_PARTITIONS up, this can underflow.


7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?


8) The comment before lookup_hash_entries says:

    ...
    * Return false if hash table has exceeded its memory limit.
    ..

But that's clearly bogus, because that's a void function.


9) Shouldn't the hash_finish_initial_spills calls in
agg_retrieve_direct have a comment, similar to the surrounding code?
Might be overkill, not sure.


10) The comment for agg_refill_hash_table says

    * Should only be called after all in memory hash table entries have been
    * consumed.

Can we enforce that with an assert, somehow?


11) The hash_spill_npartitions naming seems a bit confusing, because
it seems to imply it's about the "spill" while in practice it just
chooses the number of spill partitions. Maybe
hash_choose_num_spill_partitions would be better?


12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

    if (npartitions > HASH_MAX_PARTITIONS)
        npartitions = HASH_MAX_PARTITIONS;


13) As for this:

    /* make sure that we don't exhaust the hash bits */
    if (partition_bits + input_bits >= 32)
        partition_bits = 32 - input_bits;

We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in
both places (not the same code, just general approach).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
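On point 13, the general idea is that each recursive spill level must
carve its partition number out of hash bits that earlier levels have
not already consumed. A minimal sketch of that bit bookkeeping follows
(uint32_t stands in for the hash type, names are made up, not the
patch's actual code):

    #include <stdint.h>

    /*
     * Illustrative only: pick a partition from hash bits that previous
     * spill levels ("input_bits" already consumed) have not used yet.
     */
    static int
    choose_partition(uint32_t hash, int partition_bits, int input_bits)
    {
        /* don't run off the end of the 32-bit hash value */
        if (partition_bits + input_bits >= 32)
            partition_bits = 32 - input_bits;

        if (partition_bits <= 0)
            return 0;           /* hash bits exhausted; single partition */

        return (int) ((hash >> input_bits) & (((uint32_t) 1 << partition_bits) - 1));
    }

Once partition_bits drops to zero there is nothing left to split on,
which is the same situation hashjoin batching has to guard against.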
On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>> Right now the patch always initializes 32 spill partitions. Have you
>> given
>> any thought into how to intelligently pick an optimal number of
>> partitions yet?
>
>Attached a new patch that addresses this.
>
>1. Divide hash table memory used by the number of groups in the hash
>table to get the average memory used per group.
>2. Multiply by the number of groups spilled -- which I pessimistically
>estimate as the number of tuples spilled -- to get the total amount of
>memory that we'd like to have to process all spilled tuples at once.
Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?
Could we instead track how many tuples we actually consumed for the the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.
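(For what it's worth, that ratio-based estimate could be as simple as
something like the following -- purely illustrative, with made-up
names:)

    /*
     * Illustrative only: estimate the number of groups hiding in the
     * spilled tuples from the tuples-per-group ratio observed while
     * filling the in-memory hash table (groups_in_memory > 0 once we
     * start spilling).
     */
    static double
    estimate_spilled_groups(double tuples_consumed, double groups_in_memory,
                            double tuples_spilled)
    {
        double tuples_per_group = tuples_consumed / groups_in_memory;

        return tuples_spilled / tuples_per_group;
    }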
What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.
Also, number of groups = number of tuples is only for re-spilling.
This is a little bit unclear from the variable naming.
It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups which is the number of tuples in the
spill file). So, input_tuples is groups and input_groups is
input_tuples. It may be helpful to rename this.
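As a rough illustration of the arithmetic being discussed here, a
standalone sketch with invented names and constants (not the patch's
actual partition-choosing function):

```c
#include <stdio.h>

/*
 * Sketch of choosing the number of spill partitions, following the steps
 * described upthread: estimate memory per group from the in-memory hash
 * table, pessimistically assume one group per spilled tuple, and aim for
 * each partition to later fit in work_mem.  Names and limits are made up.
 */
static int
choose_num_partitions(double mem_used, double ngroups_in_table,
                      double ntuples_spilled, double work_mem_bytes)
{
    double      mem_per_group = mem_used / ngroups_in_table;
    double      mem_wanted = mem_per_group * ntuples_spilled;  /* groups ~= tuples */
    int         npartitions = (int) (mem_wanted / work_mem_bytes) + 1;

    if (npartitions < 4)        /* a HASH_MIN_PARTITIONS-style floor */
        npartitions = 4;
    if (npartitions > 256)      /* a HASH_MAX_PARTITIONS-style cap */
        npartitions = 256;
    return npartitions;
}

int
main(void)
{
    /* 4MB used by 100k in-memory groups, 1M tuples spilled, work_mem = 4MB */
    printf("%d partitions\n",
           choose_num_partitions(4e6, 1e5, 1e6, 4e6));
    return 0;
}
```

The ratio-based idea above would amount to replacing the "groups ~=
tuples" line with something like ntuples_spilled divided by the
tuples-per-group ratio observed so far.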
> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that was
> generally the case we'd just use small hash tables all the time. It's a
> bit annoying to give user the capability to set work_mem and then kinda
> override that.
>
>   * ... Another benefit of having more, smaller partitions is that small
>   * hash tables may perform better than large ones due to memory caching
>   * effects.
So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
The comment does seem to point to some other reason, though...
> 11) The hash_spill_npartitions naming seems a bit confusing, because it
> seems to imply it's about the "spill" while in practice it just choses
> number of spill partitions. Maybe hash_choose_num_spill_partitions would
> be better?
Agreed that a name with "choose" or "calculate" as the verb would be
more clear.
> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
>   if (npartitions > HASH_MAX_PARTITIONS)
>       npartitions = HASH_MAX_PARTITIONS;
256 actually seems very large. hash_spill_npartitions() will be called
for every respill, so, HASH_MAX_PARTITIONS is not the total number of
spill files permitted, but, actually, it is the number of respill
files in a given spill (a spill set). So if you made X partitions
initially and every partition re-spills, now you would have (at most)
X * 256 partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
files take up a lot of memory at that point?
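A back-of-the-envelope check on that concern, with assumed constants
(one BLCKSZ buffer per open spill file; the real per-file overhead may
differ):

```c
#include <stdio.h>

int
main(void)
{
    const long  blcksz = 8192;          /* assumed per-file buffer size */
    const int   max_partitions = 256;   /* HASH_MAX_PARTITIONS-style cap */

    /* Buffer memory if one spill level keeps the maximum number of files open. */
    printf("one level: %ld kB of file buffers\n",
           (long) max_partitions * blcksz / 1024);

    /* Total files created can multiply if every initial partition re-spills,
     * even though only one level's worth needs buffers at a time. */
    for (int x = 4; x <= 32; x *= 2)
        printf("X = %2d initial partitions -> up to %5d files after one re-spill\n",
               x, x * max_partitions);
    return 0;
}
```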
Melanie & Adam Lee
Thanks very much for a great review! I've attached a new patch.

There are some significant changes in the new version also:

In the non-spilling path, removed the extra nullcheck branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.

I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.

I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres)
the performance is looking pretty good. It's pretty easy to beat
Sort+Group when the group size is 10+. Even for average group size of
~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.

My answers to your questions inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> the in-memory groups, and then use this information to improve the
> estimate of number of groups? I mean, if we know we've consumed 1000
> tuples which created 100 groups, then we know there's ~1:10 ratio.

That would be a good estimate for an even distribution, but not
necessarily for a skewed distribution. I'm not opposed to it, but it's
generally my philosophy to overpartition as it seems there's not a big
downside.

> A couple of comments based on eye-balling the patch:
>
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).

> assume the pointer really is NULL. Surely we'll get a segfault on the
> preceding line which does dereference it
>
>     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
>
> Or am I missing anything?

That's not actually dereferencing anything, it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, or at least not quite as readable, so I changed the order to put
the NULL check first.

> 3) execGrouping.c
>
> A couple of functions would deserve a comment, explaining what it
> does.
>
>   - LookupTupleHashEntryHash
>   - prepare_hash_slot
>   - calculate_hash

Done, thank you.

> And it's not clear to me why we should remove part of the comment
> before TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.

> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that
> was generally the case we'd just use small hash tables all the time.
> It's a bit annoying to give user the capability to set work_mem and
> then kinda override that.

I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.

> 5) Not sure what "directly" means in this context?
>
>   * partitions at the time we need to spill, and because this algorithm
>   * shouldn't depend too directly on the internal memory needs of a
>   * BufFile.
>
>   #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
>
> Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

> 6) I think we should have some protection against underflows in this
> piece of code:
>
>   - this would probably deserve some protection against underflow if
>     HASH_PARTITION_MEM gets too big
>
>   if (hashagg_mem_overflow)
>       aggstate->hash_mem_limit = SIZE_MAX;
>   else
>       aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;
>
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
> to bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

> 8) The comment before lookup_hash_entries says:
>
>   ...
>   * Return false if hash table has exceeded its memory limit.
>   ..
>
> But that's clearly bogus, because that's a void function.

Thank you, improved comment.

> 9) Shouldn't the hash_finish_initial_spills calls in
> agg_retrieve_direct have a comment, similar to the surrounding code?
> Might be an overkill, not sure.

Sure, done.

> 10) The comment for agg_refill_hash_table says
>
>   * Should only be called after all in memory hash table entries have been
>   * consumed.
>
> Can we enforce that with an assert, somehow?

It's a bit awkward. Simplehash doesn't expose the number of groups, and
we would also have to check each hash table. Not a bad idea to add an
interface to simplehash to make that work, though.

> 11) The hash_spill_npartitions naming seems a bit confusing, because
> it seems to imply it's about the "spill" while in practice it just
> choses number of spill partitions. Maybe
> hash_choose_num_spill_partitions would be better?

Done.

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
>   if (npartitions > HASH_MAX_PARTITIONS)
>       npartitions = HASH_MAX_PARTITIONS;

Added a comment. There's no deep reasoning there -- I just don't want
it to choose to create 5000 files and surprise a user.

> 13) As for this:
>
>   /* make sure that we don't exhaust the hash bits */
>   if (partition_bits + input_bits >= 32)
>       partition_bits = 32 - input_bits;
>
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching, we should be careful to use the same approach in
> both places (not the same code, just general approach).

Didn't investigate this yet, but will do.

Regards,
	Jeff Davis
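A tiny standalone illustration of the pointer-arithmetic point: &p[i]
only computes an address and loads nothing, but forming it from a NULL
pointer is still undefined behavior, which is why testing the base
pointer first is the safer ordering. The struct below is made up, not
the executor's.

```c
#include <stdio.h>
#include <stddef.h>

typedef struct PerGroupSketch
{
    int     noTransValue;
    long    transValue;
} PerGroupSketch;

int
main(void)
{
    PerGroupSketch  groups[4] = {{1, 0}, {0, 42}, {0, 7}, {1, 0}};
    PerGroupSketch *all = groups;
    PerGroupSketch *maybe_null = NULL;

    /* Address computation only -- no memory access happens on this line. */
    PerGroupSketch *pergroup = &all[2];

    /* The actual dereference happens here. */
    printf("transValue = %ld\n", pergroup->transValue);

    /*
     * Safer ordering: test the base pointer before doing the arithmetic;
     * &maybe_null[2] would already be undefined behavior if evaluated.
     */
    if (maybe_null != NULL && (&maybe_null[2])->noTransValue)
        printf("unreachable\n");

    return 0;
}
```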
Attachment
On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote: > > Thanks very much for a great review! I've attached a new patch. Hi, About the `TODO: project needed attributes only` in your patch, when would the input tuple contain columns not needed? It seems like anything you can project has to be in the group or aggregates. -- Melanie Plageman & Adam
On Wed, 2019-12-04 at 19:50 -0800, Adam Lee wrote: > On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote: > > > > Thanks very much for a great review! I've attached a new patch. > > Hi, > > About the `TODO: project needed attributes only` in your patch, when > would the input tuple contain columns not needed? It seems like > anything > you can project has to be in the group or aggregates. If you have a table like: CREATE TABLE foo(i int, j int, x int, y int, z int); And do: SELECT i, SUM(j) FROM foo GROUP BY i; At least from a logical standpoint, you might expect that we project only the attributes we need from foo before feeding them into the HashAgg. But that's not quite how postgres works. Instead, it leaves the tuples intact (which, in this case, means they have 5 attributes) until after aggregation and lazily fetches whatever attributes are referenced. Tuples are spilled from the input, at which time they still have 5 attributes; so naively copying them is wasteful. I'm not sure how often this laziness is really a win in practice, especially after the expression evaluation has changed so much in recent releases. So it might be better to just project all the attributes eagerly, and then none of this would be a problem. If we still wanted to be lazy about attribute fetching, that should still be possible even if we did a kind of "logical" projection of the tuple so that the useless attributes would not be relevant. Regardless, that's outside the scope of the patch I'm currently working on. What I'd like to do is copy just the attributes needed into a new virtual slot, leave the unneeded ones NULL, and then write it out to the tuplestore as a MinimalTuple. I just need to be sure to get the right attributes. Regards, Jeff Davis
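A plain-C schematic of that idea, deliberately not using the executor's
slot API (the types and the needed-column mask are hypothetical; the
real code would fill a virtual TupleTableSlot and store it as a
MinimalTuple):

```c
#include <stdbool.h>
#include <stdio.h>

#define NATTS 5             /* foo(i, j, x, y, z) from the example above */

typedef struct RowSketch
{
    int     values[NATTS];
    bool    isnull[NATTS];
} RowSketch;

/*
 * Keep only the attributes the aggregate actually references (i and j for
 * "SELECT i, SUM(j) ... GROUP BY i"); everything else is spilled as NULL so
 * the on-disk tuple carries no useless payload.
 */
static RowSketch
project_for_spill(const RowSketch *in, const bool *needed)
{
    RowSketch   out;

    for (int att = 0; att < NATTS; att++)
    {
        out.isnull[att] = !needed[att] || in->isnull[att];
        out.values[att] = out.isnull[att] ? 0 : in->values[att];
    }
    return out;
}

int
main(void)
{
    bool        needed[NATTS] = {true, true, false, false, false};
    RowSketch   in = {{1, 10, 100, 1000, 10000}, {false}};
    RowSketch   spilled = project_for_spill(&in, needed);

    for (int att = 0; att < NATTS; att++)
        printf("att %d: %s\n", att, spilled.isnull[att] ? "NULL" : "kept");
    return 0;
}
```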
On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:
> It looks like the parameter input_tuples passed to hash_spill_init()
> in lookup_hash_entries() is the number of groups estimated by planner.
> However, when reloading a spill file, if we run out of memory and
> re-spill, hash_spill_init() is passed batch->input_groups (which is
> actually set from input_ngroups which is the number of tuples in the
> spill file). So, input_tuples is groups and input_groups is
> input_tuples. It may be helpful to rename this.

You're right; this is confusing. I will clarify this in the next patch.

> So, it looks like the HASH_PARTITION_FACTOR is only used when
> re-spilling. The initial hashtable will use work_mem.
> It seems like the reason for using it when re-spilling is to be very
> conservative to avoid more than one re-spill and make sure each spill
> file fits in a hashtable in memory.

It's used any time a spill happens, even the first spill. I'm flexible
on the use of HASH_PARTITION_FACTOR though... it seems not everyone
thinks it's a good idea. To me it's just a knob to tune and I tend to
think over-partitioning is the safer bet most of the time.

> The comment does seem to point to some other reason, though...

I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result in faster runtime. The only
explanation I have is caching effects.

> 256 actually seems very large. hash_spill_npartitions() will be called
> for every respill, so, HASH_MAX_PARTITIONS is not the total number of
> spill files permitted, but, actually, it is the number of respill
> files in a given spill (a spill set). So if you made X partitions
> initially and every partition re-spills, now you would have (at most)
> X * 256 partitions.

Right. Though I'm not sure there's any theoretical max... given enough
input tuples, it will just keep getting deeper. If this is a serious
concern maybe I should make it depth-first recursion by prepending new
work items rather than appending. That would still not bound the
theoretical max, but it would slow the growth.

> If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
> files take up a lot of memory at that point?

Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And it
does create a file, so it's offloading some work to the OS to manage
that new file.

It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but we
don't know how many partitions we want until it comes time to spill.
And for that matter, we don't even know whether we will need to spill
or not.

There are two alternative approaches which sidestep this problem:

1. Reserve a fixed fraction of work_mem, say, 1/8 to make space for
however many partitions that memory can handle. We would still have a
min and max, but the logic for reserving the space would be easy and so
would choosing the number of partitions to create.

   * Pro: simple
   * Con: lose the ability to choose the number of partitions

2. Use logtape.c instead (suggestion from Heikki). Supporting more
logical tapes doesn't impose costs on the OS, and we can potentially
use a lot of logical tapes.

   * Pro: can use lots of partitions without making lots of files
   * Con: buffering still needs to happen somewhere, so we still need
     memory for each logical tape. Also, we risk losing locality of
     read access when reading the tapes, or perhaps confusing
     readahead.
Fundamentally, logtape.c was designed for sequential write, random
read; but we are going to do random write and sequential read.

Regards,
	Jeff Davis
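The arithmetic behind alternative 1 above is simple enough to sketch
standalone (all numbers assumed):

```c
#include <stdio.h>

int
main(void)
{
    const long  work_mem_bytes = 4L * 1024 * 1024;  /* work_mem = 4MB */
    const long  per_file_buffer = 8192;             /* one BLCKSZ buffer per file */

    /* Reserve a fixed 1/8 of work_mem for partition buffers... */
    long        reserved = work_mem_bytes / 8;

    /* ...which then dictates how many partitions we can afford, rather than
     * letting the spill code choose based on the data. */
    long        npartitions = reserved / per_file_buffer;

    printf("reserve %ld kB -> at most %ld partitions\n",
           reserved / 1024, npartitions);
    return 0;
}
```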
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > And it's not clear to me why we should remove part of the comment > before > TupleHashTableHash. It looks like 5dfc1981 changed the signature of TupleHashTableHash without updating the comment, so it doesn't really make sense any more. I just updated the comment as a part of my patch, but it's not related. Andres, comments? Maybe we can just commit a fix for that comment and take it out of my patch. Regards, Jeff Davis
On Thu, Dec 05, 2019 at 12:55:51PM -0800, Jeff Davis wrote: >On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: >> And it's not clear to me why we should remove part of the comment >> before >> TupleHashTableHash. > >It looks like 5dfc1981 changed the signature of TupleHashTableHash >without updating the comment, so it doesn't really make sense any more. >I just updated the comment as a part of my patch, but it's not related. > >Andres, comments? Maybe we can just commit a fix for that comment and >take it out of my patch. > +1 to push that as an independent fix regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-12-05 12:55:51 -0800, Jeff Davis wrote: > On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > > And it's not clear to me why we should remove part of the comment > > before > > TupleHashTableHash. > > It looks like 5dfc1981 changed the signature of TupleHashTableHash > without updating the comment, so it doesn't really make sense any more. > I just updated the comment as a part of my patch, but it's not related. > > Andres, comments? Maybe we can just commit a fix for that comment and > take it out of my patch. Fine with me! - Andres
On Wed, Dec 04, 2019 at 10:57:51PM -0800, Jeff Davis wrote: > > About the `TODO: project needed attributes only` in your patch, when > > would the input tuple contain columns not needed? It seems like > > anything > > you can project has to be in the group or aggregates. > > If you have a table like: > > CREATE TABLE foo(i int, j int, x int, y int, z int); > > And do: > > SELECT i, SUM(j) FROM foo GROUP BY i; > > At least from a logical standpoint, you might expect that we project > only the attributes we need from foo before feeding them into the > HashAgg. But that's not quite how postgres works. Instead, it leaves > the tuples intact (which, in this case, means they have 5 attributes) > until after aggregation and lazily fetches whatever attributes are > referenced. Tuples are spilled from the input, at which time they still > have 5 attributes; so naively copying them is wasteful. > > I'm not sure how often this laziness is really a win in practice, > especially after the expression evaluation has changed so much in > recent releases. So it might be better to just project all the > attributes eagerly, and then none of this would be a problem. If we > still wanted to be lazy about attribute fetching, that should still be > possible even if we did a kind of "logical" projection of the tuple so > that the useless attributes would not be relevant. Regardless, that's > outside the scope of the patch I'm currently working on. > > What I'd like to do is copy just the attributes needed into a new > virtual slot, leave the unneeded ones NULL, and then write it out to > the tuplestore as a MinimalTuple. I just need to be sure to get the > right attributes. > > Regards, > Jeff Davis Melanie and I tried this, had a installcheck passed patch. The way how we verify it is composing a wide table with long unnecessary text columns, then check the size it writes on every iteration. Please check out the attachment, it's based on your 1204 version. -- Adam Lee
Attachment
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > 13) As for this: > > /* make sure that we don't exhaust the hash bits */ > if (partition_bits + input_bits >= 32) > partition_bits = 32 - input_bits; > > We already ran into this issue (exhausting bits in a hash value) in > hashjoin batching, we should be careful to use the same approach in > both > places (not the same code, just general approach). I assume you're talking about ExecHashIncreaseNumBatches(), and in particular, commit 8442317b. But that's a 10-year-old commit, so perhaps you're talking about something else? It looks like that code in HJ is protecting against having a very large number of batches, such that we can't allocate an array of pointers for each batch. And it seems like the concern is more related to a planner error causing such a large nbatch. I don't quite see the analogous case in HashAgg. npartitions is already constrained to a maximum of 256. And the batches are individually allocated, held in a list, not an array. It could perhaps use some defensive programming to make sure that we don't run into problems if the max is set very high. Can you clarify what you're looking for here? Perhaps I can also add a comment saying that we can have less than HASH_MIN_PARTITIONS when running out of bits. Regards, Jeff Davis
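For reference, a minimal standalone sketch of the bit-budget issue
under discussion -- the clamp mirrors the quoted snippet, while the
helper and constants around it are invented:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Take the next 'partition_bits' of the 32-bit hash above the 'input_bits'
 * already consumed by earlier spill levels, clamping so we never read past
 * the end of the hash value.
 */
static uint32_t
spill_partition(uint32_t hash, int input_bits, int partition_bits)
{
    uint32_t    mask;

    /* make sure that we don't exhaust the hash bits */
    if (partition_bits + input_bits >= 32)
        partition_bits = 32 - input_bits;   /* may leave fewer than the minimum */

    mask = (partition_bits >= 32) ? UINT32_MAX : ((1u << partition_bits) - 1);
    return (hash >> input_bits) & mask;
}

int
main(void)
{
    uint32_t    h = 0xDEADBEEF;

    printf("first spill:    partition %u\n", spill_partition(h, 0, 8));
    printf("deep recursion: partition %u (clamped to 4 bits)\n",
           spill_partition(h, 28, 8));
    return 0;
}
```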
On Thu, Dec 12, 2019 at 06:10:50PM -0800, Jeff Davis wrote: >On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: >> 13) As for this: >> >> /* make sure that we don't exhaust the hash bits */ >> if (partition_bits + input_bits >= 32) >> partition_bits = 32 - input_bits; >> >> We already ran into this issue (exhausting bits in a hash value) in >> hashjoin batching, we should be careful to use the same approach in >> both >> places (not the same code, just general approach). > >I assume you're talking about ExecHashIncreaseNumBatches(), and in >particular, commit 8442317b. But that's a 10-year-old commit, so >perhaps you're talking about something else? > >It looks like that code in HJ is protecting against having a very large >number of batches, such that we can't allocate an array of pointers for >each batch. And it seems like the concern is more related to a planner >error causing such a large nbatch. > >I don't quite see the analogous case in HashAgg. npartitions is already >constrained to a maximum of 256. And the batches are individually >allocated, held in a list, not an array. > >It could perhaps use some defensive programming to make sure that we >don't run into problems if the max is set very high. > >Can you clarify what you're looking for here? > I'm talking about this recent discussion on pgsql-bugs: https://www.postgresql.org/message-id/CA%2BhUKGLyafKXBMFqZCSeYikPbdYURbwr%2BjP6TAy8sY-8LO0V%2BQ%40mail.gmail.com I.e. when number of batches/partitions and buckets is high enough, we may end up with very few bits in one of the parts. >Perhaps I can also add a comment saying that we can have less than >HASH_MIN_PARTITIONS when running out of bits. > Maybe. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>> Right now the patch always initializes 32 spill partitions. Have you
>> given any thought into how to intelligently pick an optimal number of
>> partitions yet?
>
>Attached a new patch that addresses this.
>
>1. Divide hash table memory used by the number of groups in the hash
>table to get the average memory used per group.
>2. Multiply by the number of groups spilled -- which I pessimistically
>estimate as the number of tuples spilled -- to get the total amount of
>memory that we'd like to have to process all spilled tuples at once.
>3. Divide the desired amount of memory by work_mem to get the number of
>partitions we'd like to have such that each partition can be processed
>in work_mem without spilling.
>4. Apply a few sanity checks, fudge factors, and limits.
>
>Using this runtime information should be substantially better than
>using estimates and projections.
>
>Additionally, I removed some branches from the common path. I think I
>still have more work to do there.
>
>I also rebased of course, and fixed a few other things.

I've done a bit more testing on this, after resolving a couple of minor
conflicts due to recent commits (rebased version attached).

In particular, I've made a comparison with different dataset sizes,
group sizes, GUC settings etc. The script and results from two
different machines are available here:

  * https://bitbucket.org/tvondra/hashagg-tests/src/master/

The script essentially runs a simple grouping query with different
number of rows, groups, work_mem and parallelism settings. There's
nothing particularly magical about it.

I did run it both on master and patched code, allowing us to compare
results and assess impact of the patch. Overall, the changes are
expected and either neutral or beneficial, i.e. the timings are the
same or faster.

The number of cases that regressed is fairly small, but sometimes the
regressions are annoyingly large - up to 2x in some cases. Consider for
example this trivial example with 100M rows:

  CREATE TABLE t AS
  SELECT (100000000 * random())::int AS a
    FROM generate_series(1,100000000) s(i);

On the master, the plan with default work_mem (i.e. 4MB) and
SET max_parallel_workers_per_gather = 8; looks like this:

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                              QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=16037474.49..16037474.49 rows=1 width=12)
   ->  Finalize GroupAggregate  (cost=2383745.73..16037474.49 rows=60001208 width=12)
         Group Key: t.a
         ->  Gather Merge  (cost=2383745.73..14937462.25 rows=100000032 width=12)
               Workers Planned: 8
               ->  Partial GroupAggregate  (cost=2382745.59..2601495.66 rows=12500004 width=12)
                     Group Key: t.a
                     ->  Sort  (cost=2382745.59..2413995.60 rows=12500004 width=4)
                           Sort Key: t.a
                           ->  Parallel Seq Scan on t  (cost=0.00..567478.04 rows=12500004 width=4)
(10 rows)

Which kinda makes sense - we can't do hash aggregate, because there are
100M distinct values, and that won't fit into 4MB of memory (and the
planner knows about that). And it completes in about 108381 ms, give or
take.

With the patch, the plan changes like this:

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=2371037.74..2371037.74 rows=1 width=12)
   ->  HashAggregate  (cost=1942478.48..2371037.74 rows=42855926 width=12)
         Group Key: t.a
         ->  Seq Scan on t  (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

i.e. it's way cheaper than the master plan, it's not parallel, but when
executed it takes much longer (about 147442 ms). After forcing a
parallel query (by setting parallel_setup_cost = 0) the plan changes to
a parallel one, but without a partial aggregate, and it's even slower.

The explain analyze for the non-parallel plan looks like this:

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2371037.74..2371037.74 rows=1 width=12) (actual time=160180.718..160180.718 rows=0 loops=1)
   ->  HashAggregate  (cost=1942478.48..2371037.74 rows=42855926 width=12) (actual time=54462.728..157594.756 rows=63215980 loops=1)
         Group Key: t.a
         Memory Usage: 4096kB Batches: 8320 Disk Usage:4529172kB
         ->  Seq Scan on t  (cost=0.00..1442478.32 rows=100000032 width=4) (actual time=0.014..12198.044 rows=100000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 160183.517 ms
(7 rows)

So the cost is about 7x lower than for master, but the duration is much
higher. I don't know how much of this is preventable, but it seems
there might be something missing in the costing, because when I set
work_mem to 1TB on the master, and I tweak the n_distinct estimates for
the column to be exactly the same on the two clusters, I get this:

master:
-------

SET work_mem = '1TB';
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=2574638.28..2574638.28 rows=1 width=12)
   ->  HashAggregate  (cost=1942478.48..2574638.28 rows=63215980 width=12)
         Group Key: t.a
         ->  Seq Scan on t  (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

patched:
--------

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=2574638.28..2574638.28 rows=1 width=12)
   ->  HashAggregate  (cost=1942478.48..2574638.28 rows=63215980 width=12)
         Group Key: t.a
         ->  Seq Scan on t  (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

That is, the cost is exactly the same, except that in the second case
we expect to do quite a bit of batching - there are 8320 batches (and
we know that, because on master we'd not use hash aggregate without the
work_mem tweak).

So I think we're not costing the batching properly / at all.

A couple more comments:

1) IMHO we should rename hashagg_mem_overflow to
enable_hashagg_overflow or something like that. I think that describes
the GUC purpose better (and it's more consistent with
enable_hashagg_spill).

2) show_hashagg_info

I think there's a missing space after ":" here:

  " Batches: %d  Disk Usage:%ldkB",

and maybe we should use just "Disk:" just like we do for sort:

  ->  Sort (actual time=662.136..911.558 rows=1000000 loops=1)
        Sort Key: t2.a
        Sort Method: external merge  Disk: 13800kB

3) I'm not quite sure what to think about the JIT recompile we do for
EEOP_AGG_INIT_TRANS_SPILLED etc. I'm no llvm/jit expert, but do we do
that for some other existing cases?
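To make the "costing the batching" point concrete, here is the rough
shape such a disk term could take -- an invented sketch, not a proposal
for the actual cost model, and it ignores recursive re-spilling
entirely:

```c
#include <math.h>
#include <stdio.h>

/*
 * Very rough disk-cost term for a spilling hash aggregate: the fraction of
 * the hash table that does not fit in work_mem forces roughly that fraction
 * of the input to be written once and read back once.
 */
static double
hashagg_disk_cost(double input_tuples, double tuple_bytes,
                  double ngroups, double entry_bytes,
                  double work_mem_bytes, double seq_page_cost)
{
    double      hashtable_bytes = ngroups * entry_bytes;
    double      spill_fraction = 1.0 - work_mem_bytes / hashtable_bytes;
    double      spilled_bytes;
    double      pages;

    if (spill_fraction <= 0.0)
        return 0.0;             /* everything fits; no disk component */

    spilled_bytes = input_tuples * spill_fraction * tuple_bytes;
    pages = ceil(spilled_bytes / 8192.0);

    return 2.0 * pages * seq_page_cost;     /* one write pass + one read pass */
}

int
main(void)
{
    /* ~100M input tuples, ~63M groups, work_mem = 4MB, as in the plan above */
    printf("extra disk cost ~= %.0f\n",
           hashagg_disk_cost(1e8, 24.0, 63e6, 40.0, 4.0 * 1024 * 1024, 1.0));
    return 0;
}
```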
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sat, Dec 14, 2019 at 06:32:25PM +0100, Tomas Vondra wrote: > I've done a bit more testing on this, after resolving a couple of minor > conflicts due to recent commits (rebased version attached). > > In particular, I've made a comparison with different dataset sizes, > group sizes, GUC settings etc. The script and results from two different > machines are available here: > > The script essentially runs a simple grouping query with different > number of rows, groups, work_mem and parallelism settings. There's > nothing particularly magical about it. Nice! > I did run it both on master and patched code, allowing us to compare > results and assess impact of the patch. Overall, the changes are > expected and either neutral or beneficial, i.e. the timing are the same > or faster. > > The number of cases that regressed is fairly small, but sometimes the > regressions are annoyingly large - up to 2x in some cases. Consider for > example this trivial example with 100M rows: I suppose this is because the patch has no costing changes yet. I hacked a little to give hash agg a spilling punish, just some value based on (groups_in_hashtable * num_of_input_tuples)/num_groups_from_planner, it would not choose hash aggregate in this case. However, that punish is wrong, because comparing to the external sort algorithm, hash aggregate has the respilling, which involves even more I/O, especially with a very large number of groups but a very small number of tuples in a single group like the test you did. It would be a challenge. BTW, Jeff, Greenplum has a test for hash agg spill, I modified a little to check how many batches a query uses, it's attached, not sure if it would help. -- Adam Lee
Attachment
On Tue, 2019-12-10 at 13:34 -0800, Adam Lee wrote: > Melanie and I tried this, had a installcheck passed patch. The way > how > we verify it is composing a wide table with long unnecessary text > columns, then check the size it writes on every iteration. > > Please check out the attachment, it's based on your 1204 version. Thank you. Attached a new patch that incorporates your projection work. A few comments: * You are only nulling out up to tts_nvalid, which means that you can still end up storing more on disk if the wide column comes at the end of the table and hasn't been deserialized yet. I fixed this by copying needed attributes to the hash_spill_slot and making it virtual. * aggregated_columns does not need to be a member of AggState; nor does it need to be computed inside of the perhash loop. Aside: if adding a field to AggState is necessary, you need to bump the field numbers of later fields that are labeled for JIT use, otherwise it will break JIT. * I used an array rather than a bitmapset. It makes it easier to find the highest column (to do a slot_getsomeattrs), and it might be a little more efficient for wide tables with mostly useless columns. * Style nitpick: don't mix code and declarations The updated patch also saves the transitionSpace calculation in the Agg node for better hash table size estimating. This is a good way to choose an initial number of buckets for the hash table, and also to cap the number of groups we permit in the hash table when we expect the groups to grow. Regards, Jeff Davis
Attachment
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote: > So I think we're not costing the batching properly / at all. Thank you for all of the testing! I think the results are good: even for cases where HashAgg is the wrong choice, it's not too bad. You're right that costing is not done, and when it is, I think it will avoid these bad choices most of the time. > A couple more comments: > > 1) IMHO we should rename hashagg_mem_overflow to > enable_hashagg_overflow > or something like that. I think that describes the GUC purpose better > (and it's more consistent with enable_hashagg_spill). The other enable_* GUCs are all planner GUCs, so I named this one differently to stand out as an executor GUC. > 2) show_hashagg_info > > I think there's a missing space after ":" here: > > " Batches: %d Disk Usage:%ldkB", > > and maybe we should use just "Disk:" just like in we do for sort: Done, thank you. > 3) I'm not quite sure what to think about the JIT recompile we do for > EEOP_AGG_INIT_TRANS_SPILLED etc. I'm no llvm/jit expert, but do we do > that for some other existing cases? Andres asked for that explicitly to avoid branches in the non-spilling code path (or at least branches that are likely to be mispredicted). Regards, Jeff Davis
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote: > So I think we're not costing the batching properly / at all. Hi, I've attached a new patch that adds some basic costing for disk during hashagg. The accuracy is unfortunately not great, especially at smaller work_mem sizes and smaller entry sizes. The biggest discrepency seems to be the estimate for the average size of an entry in the hash table is significantly smaller than the actual average size. I'm not sure how big of a problem this accuracy is or how it compares to sort, for instance (it's a bit hard to compare because sort works with theoretical memory usage while hashagg looks at actual allocated memory). Costing was the last major TODO, so I'm considering this feature complete, though it still needs some work on quality. Regards, Jeff Davis
Attachment
Hi, Jeff

I tried to use the logical tape APIs for hash agg spilling, based on
your 1220 version.

Turns out it doesn't make much of a performance difference with the
default 8K block size (might be my patch's problem), but the disk space
(not I/O) would be saved a lot because I force the respilling to use
the same LogicalTapeSet.

Logtape APIs with default block size 8K:
```
postgres=# EXPLAIN ANALYZE SELECT avg(g) FROM generate_series(0,5000000) g GROUP BY g;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=75000.02..75002.52 rows=200 width=36) (actual time=7701.706..24473.002 rows=5000001 loops=1)
   Group Key: g
   Memory Usage: 4096kB Batches: 516 Disk: 116921kB
   ->  Function Scan on generate_series g  (cost=0.00..50000.01 rows=5000001 width=4) (actual time=1611.829..3253.150 rows=5000001 loops=1)
 Planning Time: 0.194 ms
 Execution Time: 25129.239 ms
(6 rows)
```

Bare BufFile APIs:
```
postgres=# EXPLAIN ANALYZE SELECT avg(g) FROM generate_series(0,5000000) g GROUP BY g;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=75000.02..75002.52 rows=200 width=36) (actual time=7339.835..24472.466 rows=5000001 loops=1)
   Group Key: g
   Memory Usage: 4096kB Batches: 516 Disk: 232773kB
   ->  Function Scan on generate_series g  (cost=0.00..50000.01 rows=5000001 width=4) (actual time=1580.057..3128.749 rows=5000001 loops=1)
 Planning Time: 0.769 ms
 Execution Time: 26696.502 ms
(6 rows)
```

That said, I'm not sure which API is better, because we should avoid
the respilling as much as we could in the planner, and hash join uses
the bare BufFile.

Attached my hacky and probably not robust diff for your reference.

--
Adam Lee
Attachment
On 28/12/2019 01:35, Jeff Davis wrote:
> I've attached a new patch that adds some basic costing for disk during
> hashagg.

This patch (hashagg-20191227.patch) doesn't compile:

nodeAgg.c:3379:7: error: ‘hashagg_mem_overflow’ undeclared (first use in this function)
   if (hashagg_mem_overflow)
       ^~~~~~~~~~~~~~~~~~~~

Looks like the new GUCs got lost somewhere between hashagg-20191220.patch
and hashagg-20191227.patch.

> /*
>  * find_aggregated_cols
>  *    Construct a bitmapset of the column numbers of aggregated Vars
>  *    appearing in our targetlist and qual (HAVING clause)
>  */
> static Bitmapset *
> find_aggregated_cols(AggState *aggstate)
> {
>     Agg        *node = (Agg *) aggstate->ss.ps.plan;
>     Bitmapset  *colnos = NULL;
>     ListCell   *temp;
>
>     /*
>      * We only want the columns used by aggregations in the targetlist or qual
>      */
>     if (node->plan.targetlist != NULL)
>     {
>         foreach(temp, (List *) node->plan.targetlist)
>         {
>             if (IsA(lfirst(temp), TargetEntry))
>             {
>                 Node *node = (Node *)((TargetEntry *)lfirst(temp))->expr;
>                 if (IsA(node, Aggref) || IsA(node, GroupingFunc))
>                     find_aggregated_cols_walker(node, &colnos);
>             }
>         }
>     }

This makes the assumption that all Aggrefs or GroupingFuncs are at the
top of the TargetEntry. That's not true, e.g.:

select 0+sum(a) from foo group by b;

I think find_aggregated_cols() and find_unaggregated_cols() should be
merged into one function that scans the targetlist once, and returns
two Bitmapsets. They're always used together, anyway.

- Heikki
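A toy standalone illustration of that point: a check that only inspects
the top node of each target entry misses an Aggref nested under an
operator, while a recursive walk finds it. The node types here are
toys, not the real parse-tree nodes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { T_Const, T_Var, T_OpExpr, T_Aggref } ToyTag;

typedef struct ToyNode
{
    ToyTag          tag;
    struct ToyNode *left;
    struct ToyNode *right;
} ToyNode;

/* What the questioned code effectively does: look only at the top node. */
static bool
toplevel_is_aggref(const ToyNode *expr)
{
    return expr->tag == T_Aggref;
}

/* What a proper walker does: recurse through the whole expression. */
static bool
walker_contains_aggref(const ToyNode *expr)
{
    if (expr == NULL)
        return false;
    if (expr->tag == T_Aggref)
        return true;
    return walker_contains_aggref(expr->left) ||
           walker_contains_aggref(expr->right);
}

int
main(void)
{
    /* select 0+sum(a) ... : the Aggref sits underneath an OpExpr. */
    ToyNode     agg = {T_Aggref, NULL, NULL};
    ToyNode     zero = {T_Const, NULL, NULL};
    ToyNode     plus = {T_OpExpr, &zero, &agg};

    printf("top-level check: %s\n", toplevel_is_aggref(&plus) ? "found" : "missed");
    printf("recursive walk:  %s\n", walker_contains_aggref(&plus) ? "found" : "missed");
    return 0;
}
```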
On Wed, 2020-01-08 at 12:38 +0200, Heikki Linnakangas wrote: > This makes the assumption that all Aggrefs or GroupingFuncs are at > the > top of the TargetEntry. That's not true, e.g.: > > select 0+sum(a) from foo group by b; > > I think find_aggregated_cols() and find_unaggregated_cols() should > be > merged into one function that scans the targetlist once, and returns > two > Bitmapsets. They're always used together, anyway. I cut the projection out for now, because there's some work in that area in another thread[1]. If that work doesn't pan out, I can reintroduce the projection logic to this one. New patch attached. It now uses logtape.c (thanks Adam for prototyping this work) instead of buffile.c. This gives better control over the number of files and the memory consumed for buffers, and reduces waste. It requires two changes to logtape.c though: * add API to extend the number of tapes * lazily allocate buffers for reading (buffers for writing were already allocated lazily) so that the total number of buffers needed at any time is bounded Unfortunately, I'm seeing some bad behavior (at least in some cases) with logtape.c, where it's spending a lot of time qsorting the list of free blocks. Adam, did you also see this during your perf tests? It seems to be worst with lower work_mem settings and a large number of input groups (perhaps there are just too many small tapes?). It also has some pretty major refactoring that hopefully makes it simpler to understand and reason about, and hopefully I didn't introduce too many bugs/regressions. A list of other changes: * added test that involves rescan * tweaked some details and tunables so that I think memory usage tracking and reporting (EXPLAIN ANALYZE) is better, especially for smaller work_mem * simplified quite a few function signatures Regards, Jeff Davis [1] https://postgr.es/m/CAAKRu_Yj=Q_ZxiGX+pgstNWMbUJApEJX-imvAEwryCk5SLUebg@mail.gmail.com
Attachment
On Fri, Jan 24, 2020 at 5:01 PM Jeff Davis <pgsql@j-davis.com> wrote: > Unfortunately, I'm seeing some bad behavior (at least in some cases) > with logtape.c, where it's spending a lot of time qsorting the list of > free blocks. Adam, did you also see this during your perf tests? It > seems to be worst with lower work_mem settings and a large number of > input groups (perhaps there are just too many small tapes?). That sounds weird. Might be pathological in some sense. I have a wild guess for you. Maybe this has something to do with the "test for presorted input" added by commit a3f0b3d68f9. That can perform very badly when the input is almost sorted, but has a few tuples that are out of order towards the end. (I have called these "banana skin tuples" in the past.) -- Peter Geoghegan
On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote:
> That sounds weird. Might be pathological in some sense.
>
> I have a wild guess for you. Maybe this has something to do with the
> "test for presorted input" added by commit a3f0b3d68f9. That can
> perform very badly when the input is almost sorted, but has a few
> tuples that are out of order towards the end. (I have called these
> "banana skin tuples" in the past.)

My simple test case is: 'explain analyze select i from big group by
i;', where "big" has 20M tuples.

I tried without that change and it helped (brought the time from 55s to
45s). But if I completely remove the sorting of the freelist, it goes
down to 12s. So it's something about the access pattern.

After digging a bit more, I see that, for Sort, the LogicalTapeSet's
freelist hovers around 300 entries and doesn't grow larger than that.
For HashAgg, it gets up to almost 60K.

The pattern in HashAgg is that the space required is at a maximum after
the first spill, and after that point the used space declines with each
batch (because the groups that fit in the hash table were finalized and
emitted, and only the ones that didn't fit were written out). As the
amount of required space declines, the size of the freelist grows.

That leaves a few options:

1) Cap the size of the LogicalTapeSet's freelist. If the freelist is
growing large, that's probably because it will never actually be used.
I'm not quite sure how to pick the cap though, and it seems a bit hacky
to just leak the freed space.

2) Use a different structure more capable of handling a large fraction
of free space. A compressed bitmap might make sense, but that seems
like overkill to waste effort tracking a lot of space that is unlikely
to ever be used.

3) Don't bother tracking free space for HashAgg at all. There's already
an API for that so I don't need to further hack logtape.c.

4) Try to be clever and shrink the file (or at least the tracked
portion of the file) if the freed blocks are at the end. This wouldn't
be very useful in the first recursive level, but the problem is worst
for the later levels anyway. Unfortunately, I think this requires a
breadth-first strategy to make sure that blocks at the end get freed.
If I do change it to breadth-first also, this does amount to a
significant speedup.

I am leaning toward #1 or #3.

As an aside, I'm curious why the freelist is managed the way it is.
Newly-released blocks are likely to be higher in number (or at least
not the lowest in number), but they are added to the end of an array.
The array is therefore likely to require repeated re-sorting to get
back to descending order. Wouldn't a minheap or something make more
sense?

Regards,
	Jeff Davis
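For reference, the freelist-as-minheap idea from the aside is small
enough to sketch standalone -- generic heap code with invented names,
not the logtape.c change itself: release pushes in O(log n), reuse pops
the lowest block number in O(log n), and nothing ever needs re-sorting.

```c
#include <stdio.h>

typedef struct
{
    long    blocks[1024];   /* fixed size for the sketch only */
    int     nblocks;
} FreeListSketch;

/* Add a freed block number, sifting it up to keep the min-heap property. */
static void
freelist_release(FreeListSketch *fl, long blockno)
{
    int         i = fl->nblocks++;

    fl->blocks[i] = blockno;
    while (i > 0 && fl->blocks[(i - 1) / 2] > fl->blocks[i])
    {
        long        tmp = fl->blocks[i];

        fl->blocks[i] = fl->blocks[(i - 1) / 2];
        fl->blocks[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* Remove and return the lowest free block number, sifting the root down. */
static long
freelist_reuse(FreeListSketch *fl)
{
    long        lowest = fl->blocks[0];
    int         i = 0;

    fl->blocks[0] = fl->blocks[--fl->nblocks];
    for (;;)
    {
        int         smallest = i;
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;
        long        tmp;

        if (l < fl->nblocks && fl->blocks[l] < fl->blocks[smallest])
            smallest = l;
        if (r < fl->nblocks && fl->blocks[r] < fl->blocks[smallest])
            smallest = r;
        if (smallest == i)
            break;
        tmp = fl->blocks[i];
        fl->blocks[i] = fl->blocks[smallest];
        fl->blocks[smallest] = tmp;
        i = smallest;
    }
    return lowest;
}

int
main(void)
{
    FreeListSketch fl = {{0}, 0};
    long        first;
    long        second;

    freelist_release(&fl, 42);
    freelist_release(&fl, 7);
    freelist_release(&fl, 19);

    first = freelist_reuse(&fl);
    second = freelist_reuse(&fl);
    printf("reuse block %ld, then block %ld\n", first, second);
    return 0;
}
```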
On Wed, 2020-01-29 at 14:48 -0800, Jeff Davis wrote:
> 2) Use a different structure more capable of handling a large
> fraction of free space. A compressed bitmap might make sense, but
> that seems like overkill to waste effort tracking a lot of space that
> is unlikely to ever be used.

I ended up converting the freelist to a min heap.

Attached is a patch which makes three changes to better support
HashAgg:

1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:

 * sorted. This is an efficient way to handle it because we expect cycles
 * of releasing many blocks followed by re-using many blocks, due to
 * the larger read buffer.

But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.

Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).

2. Lazily-allocate the read buffer. The write buffer was always
lazily-allocated, so this patch creates better symmetry. More
importantly, it means freshly-rewound tapes don't have any buffer
allocated, so it greatly expands the number of tapes that can be
managed efficiently as long as only a limited number are active at
once.

3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.

Regards,
	Jeff Davis
Attachment
On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote: > I ended up converting the freelist to a min heap. > > Attached is a patch which makes three changes to better support > HashAgg: And now I'm attaching another version of the main Hash Aggregation patch to be applied on top of the logtape.c patch. Not a lot of changes from the last version; mostly some cleanup and rebasing. But it's faster now with the logtape.c changes. Regards, Jeff Davis
Attachment
On Mon, Feb 03, 2020 at 06:24:14PM -0800, Jeff Davis wrote: > On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote: > > I ended up converting the freelist to a min heap. > > > > Attached is a patch which makes three changes to better support > > HashAgg: > > And now I'm attaching another version of the main Hash Aggregation > patch to be applied on top of the logtape.c patch. > > Not a lot of changes from the last version; mostly some cleanup and > rebasing. But it's faster now with the logtape.c changes. Nice! Just back from the holiday. I had the perf test with Tomas's script, didn't notice the freelist sorting regression at that time. The minheap looks good, have you tested the performance and aggregate validation? About the "Cap the size of the LogicalTapeSet's freelist" and "Don't bother tracking free space for HashAgg at all" you mentioned in last mail, I suppose these two options will lost the disk space saving benefit since some blocks are not reusable then? -- Adam Lee
On 03/02/2020 20:29, Jeff Davis wrote: > 1. Use a minheap for the freelist. The original design used an array > that had to be sorted between a read (which frees a block) and a write > (which needs to sort the array to consume the lowest block number). The > comments said: > > * sorted. This is an efficient way to handle it because we expect > cycles > * of releasing many blocks followed by re-using many blocks, due to > * the larger read buffer. > > But I didn't find a case where that actually wins over a simple > minheap. With that in mind, a minheap seems closer to what one might > expect for that purpose, and more robust when the assumptions don't > hold up as well. If someone knows of a case where the re-sorting > behavior is important, please let me know. A minheap certainly seems more natural for that. I guess re-sorting the array would be faster in the extreme case that you free almost all of the blocks, and then consume almost all of the blocks, but I don't think the usage pattern is ever that extreme. Because if all the data fit in memory, we wouldn't be spilling in the first place. I wonder if a more advanced heap like the pairing heap or fibonacci heap would perform better? Probably doesn't matter in practice, so better keep it simple... > Changing to a minheap effectively solves the problem for HashAgg, > though in theory the memory consumption of the freelist itself could > become significant (though it's only 0.1% of the free space being > tracked). We could fairly easily spill parts of the freelist to disk, too, if necessary. But it's probably not worth the trouble. > 2. Lazily-allocate the read buffer. The write buffer was always lazily- > allocated, so this patch creates better symmetry. More importantly, it > means freshly-rewound tapes don't have any buffer allocated, so it > greatly expands the number of tapes that can be managed efficiently as > long as only a limited number are active at once. Makes sense. > 3. Allow expanding the number of tapes for an existing tape set. This > is useful for HashAgg, which doesn't know how many tapes will be needed > in advance. I'd love to change the LogicalTape API so that you could allocate and free tapes more freely. I wrote a patch to do that, as part of replacing tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I never got around to committing it. Maybe the time is ripe to do that now? [1] https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83@iki.fi - Heikki
On Mon, Feb 3, 2020 at 6:24 PM Jeff Davis <pgsql@j-davis.com> wrote: > And now I'm attaching another version of the main Hash Aggregation > patch to be applied on top of the logtape.c patch. Have you tested this against tuplesort.c, particularly parallel CREATE INDEX? It would be worth trying to measure any performance impact. Note that most parallel CREATE INDEX tuplesorts will do a merge within each worker, and one big merge in the leader. It's much more likely to have multiple passes than a regular serial external sort. Parallel CREATE INDEX is currently accidentally disabled on the master branch. That should be fixed in the next couple of days. You can temporarily revert 74618e77 if you want to get it back for testing purposes today. Have you thought about integer overflow in your heap related routines? This isn't as unlikely as you might think. See commit 512f67c8, for example. Have you thought about the MaxAllocSize restriction as it concerns lts->freeBlocks? Will that be okay when you have many more tapes than before? > Not a lot of changes from the last version; mostly some cleanup and > rebasing. But it's faster now with the logtape.c changes. LogicalTapeSetExtend() seems to work in a way that assumes that the tape is frozen. It would be good to document that assumption, and possible enforce it by way of an assertion. The same remark applies to any other assumptions you're making there. -- Peter Geoghegan
On Wed, Feb 5, 2020 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > Parallel CREATE INDEX is currently accidentally disabled on the master > branch. That should be fixed in the next couple of days. You can > temporarily revert 74618e77 if you want to get it back for testing > purposes today. (Fixed -- sorry for the disruption.)
On Tue, 2020-02-04 at 18:42 +0800, Adam Lee wrote: > The minheap looks good, have you tested the performance and aggregate > validation? Not sure exactly what you mean, but I tested the min heap with both Sort and HashAgg and it performs well. > About the "Cap the size of the LogicalTapeSet's freelist" and "Don't > bother tracking free space for HashAgg at all" you mentioned in last > mail, I suppose these two options will lost the disk space saving > benefit since some blocks are not reusable then? No freelist at all will, of course, leak the blocks and not reuse the space. A capped freelist is not bad in practice; it seems to still work as long as the cap is reasonable. But it feels too arbitrary, and could cause unexpected leaks when our assumptions change. I think a minheap just makes more sense unless the freelist just becomes way too large. Regards, Jeff Davis
On Tue, 2020-02-04 at 15:08 -0800, Peter Geoghegan wrote: > Have you tested this against tuplesort.c, particularly parallel > CREATE > INDEX? It would be worth trying to measure any performance impact. > Note that most parallel CREATE INDEX tuplesorts will do a merge > within > each worker, and one big merge in the leader. It's much more likely > to > have multiple passes than a regular serial external sort. I did not observe any performance regression when creating an index in parallel over 20M ints (random ints in random order). I tried 2 parallel workers with work_mem=4MB and also 4 parallel workers with work_mem=256kB. > Have you thought about integer overflow in your heap related > routines? > This isn't as unlikely as you might think. See commit 512f67c8, for > example. It's dealing with blocks rather than tuples, so it's a bit less likely. But changed it to use "unsigned long" instead. > Have you thought about the MaxAllocSize restriction as it concerns > lts->freeBlocks? Will that be okay when you have many more tapes than > before? I added a check. If it exceeds MaxAllocSize, before trying to perform the allocation, just leak the block rather than adding it to the freelist. Perhaps there's a usecase for an extraordinarily-long freelist, but it's outside the scope of this patch. > LogicalTapeSetExtend() seems to work in a way that assumes that the > tape is frozen. It would be good to document that assumption, and > possible enforce it by way of an assertion. The same remark applies > to > any other assumptions you're making there. Can you explain? I am not freezing any tapes in Hash Aggregation, so what about LogicalTapeSetExtend() assumes the tape is frozen? Attached new logtape.c patches. Regards, Jeff Davis
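The MaxAllocSize guard described above is simple in shape; a hedged
standalone sketch (invented names; the real code works with
palloc/repalloc and logtape.c's own structures):

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_ALLOC_SIZE 0x3fffffffL     /* roughly palloc's MaxAllocSize limit */

typedef struct
{
    long       *blocks;
    size_t      nblocks;
    size_t      capacity;
} FreeListSketch;

/*
 * Append a freed block number, doubling the array as needed.  If growing
 * would exceed the allocation limit, simply forget ("leak") the block on
 * disk rather than failing -- it just won't be reused.
 */
static void
freelist_push(FreeListSketch *fl, long blockno)
{
    if (fl->nblocks == fl->capacity)
    {
        size_t      newcap = fl->capacity ? fl->capacity * 2 : 128;
        long       *newblocks;

        if (newcap * sizeof(long) > MAX_ALLOC_SIZE)
            return;                     /* leak the block instead of growing */
        newblocks = realloc(fl->blocks, newcap * sizeof(long));
        if (newblocks == NULL)
            return;
        fl->blocks = newblocks;
        fl->capacity = newcap;
    }
    fl->blocks[fl->nblocks++] = blockno;
}

int
main(void)
{
    FreeListSketch fl = {NULL, 0, 0};

    for (long b = 0; b < 1000; b++)
        freelist_push(&fl, b);
    printf("tracking %zu free blocks\n", fl.nblocks);
    free(fl.blocks);
    return 0;
}
```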
Attachment
On Wed, Feb 5, 2020 at 10:37 AM Jeff Davis <pgsql@j-davis.com> wrote: > > LogicalTapeSetExtend() seems to work in a way that assumes that the > > tape is frozen. It would be good to document that assumption, and > > possible enforce it by way of an assertion. The same remark applies > > to > > any other assumptions you're making there. > > Can you explain? I am not freezing any tapes in Hash Aggregation, so > what about LogicalTapeSetExtend() assumes the tape is frozen? Sorry, I was very unclear. I meant to write just the opposite: you assume that the tapes are *not* frozen. If you're adding a new capability to logtape.c, it makes sense to be clear on the requirements on tapeset state or individual tape state. -- Peter Geoghegan
On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote: > I'd love to change the LogicalTape API so that you could allocate > and > free tapes more freely. I wrote a patch to do that, as part of > replacing > tuplesort.c's polyphase algorithm with a simpler one (see [1]), but > I > never got around to committing it. Maybe the time is ripe to do that > now? It's interesting that you wrote a patch to pause the tapes a while ago. Did it just fall through the cracks or was there a problem with it? Is pause/resume functionality required, or is it good enough that rewinding a tape frees the buffer, to be lazily allocated later? Regarding the API, I'd like to change it, but I'm running into some performance challenges when adding a layer of indirection. If I apply the very simple attached patch, which simply makes a separate allocation for the tapes array, it seems to slow down sort by ~5%. Regards, Jeff Davis
Attachment
On Wed, 2020-02-05 at 11:56 -0800, Jeff Davis wrote: > Regarding the API, I'd like to change it, but I'm running into some > performance challenges when adding a layer of indirection. If I apply > the very simple attached patch, which simply makes a separate > allocation for the tapes array, it seems to slow down sort by ~5%. I tried a few different approaches to allow a flexible number of tapes without regressing normal Sort performance. I found some odd hacks, but I can't explain why they perform better than the more obvious approach. The LogicalTapeSetExtend() API is a natural evolution of what's already there, so I think I'll stick with that to keep the scope of Hash Aggregation under control. If we improve the API later I'm happy to adapt the HashAgg work to use it -- anything to take more code out of nodeAgg.c! Regards, Jeff Davis
On Fri, 2020-01-24 at 17:01 -0800, Jeff Davis wrote: > New patch attached. Three minor independent refactoring patches: 1. Add new entry points for the tuple hash table: TupleHashTableHash() LookupTupleHashEntryHash() which are useful for saving and reusing hash values to avoid recomputing. 2. Refactor hash_agg_entry_size() so that the callers don't need to do as much work. 3. Save calculated aggcosts->transitionSpace in the Agg node for later use, rather than discarding it. These are helpful for the upcoming Hash Aggregation work. Regards, Jeff Davis
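The motivation for the first of those refactors is just "hash once, use
the value twice"; a generic sketch of the pattern, with an FNV-style
toy hash standing in for the real per-column hashing (none of these
names are execGrouping.c's API):

```c
#include <stdint.h>
#include <stdio.h>

/* Toy FNV-1a hash standing in for the tuple hashing done by the hash table. */
static uint32_t
toy_hash(const char *key)
{
    uint32_t    h = 2166136261u;

    for (; *key; key++)
    {
        h ^= (uint8_t) *key;
        h *= 16777619u;
    }
    return h;
}

int
main(void)
{
    const char *key = "group-42";

    /* Compute the hash once... */
    uint32_t    hashvalue = toy_hash(key);

    /* ...use it to pick a spill partition when the table is full... */
    uint32_t    partition = hashvalue & 0x1F;      /* 32 partitions */

    /* ...and hand the same value to the lookup instead of re-hashing,
     * which is the point of exposing a LookupTupleHashEntryHash-style
     * entry point. */
    uint32_t    bucket = hashvalue % 1024;

    printf("hash=%08x partition=%u bucket=%u\n", hashvalue, partition, bucket);
    return 0;
}
```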
Attachment
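A toy, self-contained illustration of the compute-once/reuse pattern
these entry points enable: hash once, then reuse the value for both the
in-memory lookup and, if the group must be spilled, the choice of spill
partition. The fixed-size open-addressing table and helper names below
are stand-ins, not the real execGrouping.c code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 256				/* power of two, so we can mask */
#define MAX_GROUPS_IN_MEMORY 4		/* pretend work_mem allows 4 groups */

typedef struct Entry
{
	bool		used;
	uint32_t	hash;
	int			key;
	long		count;
} Entry;

static Entry table[NBUCKETS];
static int	ngroups = 0;

/* stands in for TupleHashTableHash(): hash the key once */
static uint32_t
hash_key(int key)
{
	return (uint32_t) key * 2654435761u;
}

/* stands in for LookupTupleHashEntryHash(): lookup with a known hash */
static Entry *
lookup_with_hash(int key, uint32_t hash, bool allow_insert)
{
	for (uint32_t i = hash & (NBUCKETS - 1);; i = (i + 1) & (NBUCKETS - 1))
	{
		Entry	   *e = &table[i];

		if (!e->used)
		{
			if (!allow_insert || ngroups >= MAX_GROUPS_IN_MEMORY)
				return NULL;	/* caller must spill this tuple */
			e->used = true;
			e->hash = hash;
			e->key = key;
			e->count = 0;
			ngroups++;
			return e;
		}
		if (e->hash == hash && e->key == key)
			return e;
	}
}

int
main(void)
{
	int			input[] = {1, 2, 3, 4, 5, 1, 6, 2};

	for (int i = 0; i < 8; i++)
	{
		uint32_t	h = hash_key(input[i]);	/* computed exactly once */
		Entry	   *e = lookup_with_hash(input[i], h, true);

		if (e != NULL)
			e->count++;			/* group is (now) in memory */
		else					/* spill partition chosen from the same hash */
			printf("spill key %d to partition %u\n", input[i], h % 4);
	}
	return 0;
}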
On 05/02/2020 21:56, Jeff Davis wrote: > On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote: >> I'd love to change the LogicalTape API so that you could allocate >> and >> free tapes more freely. I wrote a patch to do that, as part of >> replacing >> tuplesort.c's polyphase algorithm with a simpler one (see [1]), but >> I >> never got around to committing it. Maybe the time is ripe to do that >> now? > > It's interesting that you wrote a patch to pause the tapes a while ago. > Did it just fall through the cracks or was there a problem with it? > > Is pause/resume functionality required, or is it good enough that > rewinding a tape frees the buffer, to be lazily allocated later? It wasn't strictly required for what I was hacking on then. IIRC it would have saved some memory during sorting, but Peter G felt that it wasn't worth the trouble, because he made some other changes around the same time, which made it less important (https://www.postgresql.org/message-id/CAM3SWZS0nwOPoJQHvxugA9kKPzky2QC2348TTWdSStZOkke5tg%40mail.gmail.com). I dropped the ball on both patches then, but I still think they would be worthwhile. - Heikki
On Thu, Feb 6, 2020 at 12:01 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It wasn't strictly required for what I was hacking on then. IIRC it > would have saved some memory during sorting, but Peter G felt that it > wasn't worth the trouble, because he made some other changes around the > same time, which made it less important FWIW, I am not opposed to the patch at all. I would be quite happy to get rid of a bunch of code in tuplesort.c that apparently isn't really necessary anymore (by removing polyphase merge). All I meant back in 2016 was that "pausing" tapes was orthogonal to my own idea of capping the number of tapes that could be used by tuplesort.c. The 500 MAXORDER cap thing hadn't been committed yet when I explained this in the message you linked to, and it wasn't clear if it would ever be committed (Robert committed it about a month afterwards, as it turned out). Capping the size of the merge heap made marginal sorts faster overall, since a more cache efficient merge heap more than made up for having more than one merge pass overall (thanks to numerous optimizations added in 2016, some of which were your work). I also said that the absolute overhead of tapes was not that important back in 2016. Using many tapes within tuplesort.c can never happen anyway (with the 500 MAXORDER cap). Maybe the use of logtape.c by hash aggregate changes the picture there now. Even if it doesn't, I still think that your patch is a good idea. -- Peter Geoghegan
On Mon, 2020-02-03 at 18:24 -0800, Jeff Davis wrote:
> On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote:
> > I ended up converting the freelist to a min heap.
> >
> > Attached is a patch which makes three changes to better support
> > HashAgg:
>
> And now I'm attaching another version of the main Hash Aggregation
> patch to be applied on top of the logtape.c patch.
>
> Not a lot of changes from the last version; mostly some cleanup and
> rebasing. But it's faster now with the logtape.c changes.

Attaching latest version (combined logtape changes along with main
HashAgg patch). I believe I've addressed all of the comments, except for
Heikki's question about changing the logtape.c API.

I think big changes to the API (such as Heikki's proposal) are out of
scope for this patch, although I do favor the changes in general. This
patch just includes the LogicalTapeSetExtend() API by Adam Lee, which is
less intrusive.

I noticed (and fixed) a small regression for some in-memory hashagg
queries due to the way I was choosing the number of buckets when
creating the hash table. I don't think that it is necessarily worse in
general, but given that there is at least one case of a regression, I
made it more closely match the old behavior, and the regression
disappeared.

I improved costing by taking into account the actual number of
partitions and the memory limits, at least for the first pass (in
recursive passes the number of partitions can change).

Aside from that, just some cleanup and rebasing.

Regards,
Jeff Davis
Attachment
On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote: > Attaching latest version (combined logtape changes along with main > HashAgg patch). I ran a matrix of small performance tests to look for regressions. The goal was to find out if the refactoring or additional branches introduced by this patch caused regressions in in-memory HashAgg, Sort, or the JIT paths. Fortunately, I didn't find any. This is *not* supposed to represent the performance benefits of the patch, only to see if I regressed somewhere else. The performance benefits will be shown in the next round of tests. I tried with JIT on/off, work_mem='4MB' and also a value high enough to fit the entire working set, enable_hashagg on/off, and 4 different tables. The 4 tables are (each containing 20 million tuples): t1k_20k_int4: 1K groups of 20K tuples each (randomly generated and ordered) t20m_1_int4: 20M groups of 1 tuple each (randomly generated and ordered) t1k_20k_text: the same as t1k_20k_int4 but cast to text (collation C.UTF-8) t20m_1_text: the same as t20m_1_int4 but cast to text (collation C.UTF-8) The query is: select count(*) from (select i, count(*) from $TABLE group by i) s; I just did 3 runs in psql and took the median result. I ran against master (cac8ce4a, slightly older, before any of my patches went in) and my dev branch (attached patch applied against 0973f560). Results were pretty boring, in a good way. All results within the noise, and about as many results were better on dev than master as there were better on master than dev. I also did some JIT-specific tests against only t1k_20k_int4. For that, the hash table fits in memory anyway, so I didn't vary work_mem. The query I ran included more aggregates to better test JIT: select i, sum(i), avg(i), min(i) from t1k_20k_int4 group by i offset 1000000; -- offset so it doesn't return result I know these tests are simplistic, but I also think they represent a lot of areas where regressions could have potentially been introduced. If someone else can find a regression, please let me know. The new patch is basically just rebased -- a few other very minor changes. Regards, Jeff Davis
Attachment
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote: > The new patch is basically just rebased -- a few other very minor > changes. I extracted out some minor refactoring of nodeAgg.c that I can commit separately. That will make the main patch a little easier to review. Attached. * split build_hash_table() into two functions * separated hash calculation from lookup * changed lookup_hash_entry to return AggStatePerGroup directly instead of the TupleHashEntryData (which the caller only used to get the AggStatePerGroup, anyway) Regards, Jeff Davis
Attachment
On Wed, Jan 8, 2020 at 2:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This makes the assumption that all Aggrefs or GroupingFuncs are at the
top of the TargetEntry. That's not true, e.g.:
select 0+sum(a) from foo group by b;
I think find_aggregated_cols() and find_unaggregated_cols() should be
merged into one function that scans the targetlist once, and returns two
Bitmapsets. They're always used together, anyway.
So, I've attached a patch that does what Heikki recommended and gets
both aggregated and unaggregated columns in two different bitmapsets.
I think it works for more cases than the other patch.
I'm not sure it is the ideal interface, but, since there aren't many
consumers, I don't know.
Also, it needs some formatting/improved naming/etc.
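A hedged sketch of what such a single-pass walker can look like,
illustrative only: the names are assumptions rather than the actual
patch code, and details such as offsetting varattno for system columns
are omitted.

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

typedef struct FindColsContext
{
	bool		is_aggregated;	/* currently under an Aggref/GroupingFunc? */
	Bitmapset  *aggregated;		/* columns referenced inside aggregates */
	Bitmapset  *unaggregated;	/* columns referenced outside aggregates */
} FindColsContext;

static bool
find_cols_walker(Node *node, FindColsContext *context)
{
	if (node == NULL)
		return false;

	if (IsA(node, Var))
	{
		Var		   *var = (Var *) node;

		if (context->is_aggregated)
			context->aggregated = bms_add_member(context->aggregated,
												 var->varattno);
		else
			context->unaggregated = bms_add_member(context->unaggregated,
												   var->varattno);
		return false;
	}

	if (IsA(node, Aggref) || IsA(node, GroupingFunc))
	{
		bool		saved = context->is_aggregated;

		/* everything below this point counts as aggregated */
		context->is_aggregated = true;
		expression_tree_walker(node, find_cols_walker, (void *) context);
		context->is_aggregated = saved;
		return false;
	}

	return expression_tree_walker(node, find_cols_walker, (void *) context);
}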
Per Jeff's comment in [1] I started looking into using the scanCols
patch from the thread on extracting scanCols from PlannerInfo [2] to
get the aggregated and unaggregated columns for this patch.
Since we only make one bitmap for scanCols containing all of the
columns that need to be scanned, there is no context about where the
columns came from in the query.
That is, once the bit is set in the bitmapset, we have no way of
knowing if that column was needed for aggregation or if it is filtered
out immediately.
We could solve this by creating multiple bitmaps at the time that we
create the scanCols field -- one for aggregated columns, one for
unaggregated columns, and, potentially more if useful to other
consumers.
The initial problem with this is that we extract scanCols from the
PlannerInfo->simple_rel_array and PlannerInfo->simple_rte_array.
If we wanted more context about where those columns were from in the
query, we would have to either change how we construct the scanCols or
construct them early and add to the bitmap when adding columns to the
simple_rel_array and simple_rte_array (which, I suppose, is the same
thing as changing how we construct scanCols).
This might decentralize the code for the benefit of one consumer.
Also, looping through the simple_rel_array and simple_rte_array a
couple times per query seems like it would add negligible overhead.
I'm more hesitant to add code that, most likely, would involve a
walker to the codepath everybody uses if only agg will leverage the
two distinct bitmapsets.
Overall, I think it seems like a good idea to leverage scanCols for
determining what columns hashagg needs to spill, but I can't think of
a way of doing it that doesn't seem bad. scanCols are currently just
that -- columns that will need to be scanned.
[1] https://www.postgresql.org/message-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel%40j-davis.com
[2] https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DQ_ZxiGX%2BpgstNWMbUJApEJX-imvAEwryCk5SLUebg%40mail.gmail.com
--
Melanie Plageman
Attachment
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote:
> On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote:
> > Attaching latest version (combined logtape changes along with main
> > HashAgg patch).
>
> I ran a matrix of small performance tests to look for regressions.

I ran some more tests, this time comparing Hash Aggregation to
Sort+Group.

Summary of trends:

  group key complexity : favors Hash
  group key size       : favors Hash
  group size           : favors Hash
  higher work_mem      : favors Sort[1]
  data set size        : favors Sort[1]
  number of aggregates : favors Hash[2]

[1] I have closed the gap a bit with some post-experiment tuning. I have
just begun to analyze this case, so I think there is quite a bit more
room for improvement.
[2] Could use more exploration -- I don't have an explanation.

Data sets:

  t20m_1_int4:  ~20 million groups of size ~1 (uniform)
  t1m_20_int4:  ~1 million groups of size ~20 (uniform)
  t1k_20k_int4: ~1k groups of size ~20k (uniform)

  also, text versions of each of those with collate "C.UTF-8"

Results:

1. A general test to vary the group size, key type, and work_mem.

Query: select i from $TABLE group by i offset 100000000;

work_mem='4MB'

+----------------+----------+-------------+--------------+
|                | sort(ms) | hashagg(ms) | sort/hashagg |
+----------------+----------+-------------+--------------+
| t20m_1_int4    |    11852 |       10640 |         1.11 |
| t1m_20_int4    |    11108 |        8109 |         1.37 |
| t1k_20k_int4   |     8575 |        2732 |         3.14 |
| t20m_1_text    |    80463 |       12902 |         6.24 |
| t1m_20_text    |    58586 |        9252 |         6.33 |
| t1k_20k_text   |    21781 |        5739 |         3.80 |
+----------------+----------+-------------+--------------+

work_mem='32MB'

+----------------+----------+-------------+--------------+
|                | sort(ms) | hashagg(ms) | sort/hashagg |
+----------------+----------+-------------+--------------+
| t20m_1_int4    |     9656 |       11702 |         0.83 |
| t1m_20_int4    |     8870 |        9804 |         0.90 |
| t1k_20k_int4   |     6359 |        1852 |         3.43 |
| t20m_1_text    |    74266 |       14434 |         5.15 |
| t1m_20_text    |    56549 |       10180 |         5.55 |
| t1k_20k_text   |    21407 |        3989 |         5.37 |
+----------------+----------+-------------+--------------+

2. Test group key size

Data set: 20m rows, four int4 columns. Columns a,b,c are all the
constant value 1, forcing each comparison to look at all four columns.

Query: select a,b,c,d from wide group by a,b,c,d offset 100000000;

work_mem='4MB'

  Sort         : 30852ms
  HashAgg      : 12343ms
  Sort/HashAgg : 2.50

In theory, if the first grouping column is highly selective, then Sort
may have a slight advantage because it can look at only the first
column, while HashAgg needs to look at all 4. But HashAgg only needs to
perform this calculation once, and it seems hard enough to show this in
practice that I consider it an edge case. In "normal" cases, it appears
that more grouping columns significantly favor Hash Agg.

3. Test number of aggregates

Data set: same as for test #2 (group key size).

Query: select d, count(a),sum(b),avg(c),min(d) from wide group by d
offset 100000000;

work_mem='4MB'

  Sort         : 22373ms
  HashAgg      : 17338ms
  Sort/HashAgg : 1.29

I don't have an explanation of why HashAgg is doing better here. Both of
them are using JIT and essentially doing the same number of
advancements. This could use more exploration, but the effect isn't
major.

4. Test data size

Data set: 400 million rows of four random int8s. Group size of one.

Query: select a from t400m_1_int8 group by a offset 1000000000;

work_mem='32MB'

  Sort         : 300675ms
  HashAgg      : 560740ms
  Sort/HashAgg : 0.54

I tried increasing the max number of partitions and brought the HashAgg
runtime down to 481985ms (using 1024 partitions), which closes the gap
to 0.62. That's not too bad for HashAgg considering this is a group size
of one with a simple group key. A bit more tuning might be able to close
the gap further.

Conclusion:

HashAgg is winning in a lot of cases, and this will be an important
improvement for many workloads. Not only is it faster in a lot of cases,
but it's also less risky. When an input has unknown group size, it's
much easier for the planner to choose HashAgg -- a small downside and a
big upside.

Regards,
Jeff Davis
Hi, I wanted to take a look at this thread and do a review, but it's not very clear to me if the recent patches posted here are independent or how exactly they fit together. I see 1) hashagg-20200212-1.patch (2020/02/13 by Jeff) 2) refactor.patch (2020/02/13 by Jeff) 3) v1-0001-aggregated-unaggregated-cols-together.patch (2020/02/14 by Melanie) I suppose this also confuses the cfbot - it's probably only testing (3) as it's the last thing posted here, at least I think that's the case. And it fails: nodeAgg.c: In function ‘find_aggregated_cols_walker’: nodeAgg.c:1208:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] FindColsContext *find_cols_context = (FindColsContext *) context; ^ nodeAgg.c: In function ‘find_unaggregated_cols_walker’: nodeAgg.c:1225:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] FindColsContext *find_cols_context = (FindColsContext *) context; ^ cc1: all warnings being treated as errors <builtin>: recipe for target 'nodeAgg.o' failed make[3]: *** [nodeAgg.o] Error 1 make[3]: *** Waiting for unfinished jobs.... It's probably a good idea to either start a separate thread for patches that are only loosely related to the main topic, or always post the whole patch series. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2020-02-18 at 19:57 +0100, Tomas Vondra wrote: > Hi, > > I wanted to take a look at this thread and do a review, but it's not > very clear to me if the recent patches posted here are independent or > how exactly they fit together. I see Attached latest version rebased on master. > It's probably a good idea to either start a separate thread for > patches > that are only loosely related to the main topic, or always post the > whole patch series. Will do, sorry for the confusion. Regards, Jeff Davis
Attachment
On Tue, Feb 18, 2020 at 10:57 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
I wanted to take a look at this thread and do a review, but it's not
very clear to me if the recent patches posted here are independent or
how exactly they fit together. I see
1) hashagg-20200212-1.patch (2020/02/13 by Jeff)
2) refactor.patch (2020/02/13 by Jeff)
3) v1-0001-aggregated-unaggregated-cols-together.patch (2020/02/14 by
Melanie)
I suppose this also confuses the cfbot - it's probably only testing (3)
as it's the last thing posted here, at least I think that's the case.
And it fails:
nodeAgg.c: In function ‘find_aggregated_cols_walker’:
nodeAgg.c:1208:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
nodeAgg.c: In function ‘find_unaggregated_cols_walker’:
nodeAgg.c:1225:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
FindColsContext *find_cols_context = (FindColsContext *) context;
^
cc1: all warnings being treated as errors
<builtin>: recipe for target 'nodeAgg.o' failed
make[3]: *** [nodeAgg.o] Error 1
make[3]: *** Waiting for unfinished jobs....
Oops! Sorry, I would fix the code that those compiler warnings are
complaining about, but that would confuse the cfbot more. So, I'll let
Jeff decide what he wants to do about the patch at all (e.g. include
it in his overall patch or exclude it for now). Anyway it is trivial
to move those declarations up, were he to decide to include it.
--
Melanie Plageman
Hi, I've started reviewing the 20200218 version of the patch. In general it seems fine, but I have a couple minor comments and two crashes. 1) explain.c currently does this: I wonder if we could show something for plain explain (without analyze). At least the initial estimate of partitions, etc. I know not showing those details until after execution is what e.g. sort does, but I find it a bit annoying. A related comment is that maybe this should report also the initial number of partitions, not just the total number. With just the total it's impossible to say if there were any repartitions, etc. 2) The ExecBuildAggTrans comment should probably explain "spilled". 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to just add a new flag to the agg_* structs instead? I haven't tried hacking this, so maybe it's a silly idea. 4) lookup_hash_entries says /* check to see if we need to spill the tuple for this grouping set */ But that seems bogus, because AFAIK we can't spill tuples for grouping sets. So maybe this should say just "grouping"? 5) Assert(nbuckets > 0); I was curious what happens in case of extreme skew, when a lot/all rows consistently falls into a single partition. So I did this: create table t (a int, b real); insert into t select i, random() from generate_series(-2000000000, 2000000000) s(i) where mod(hashint4(i), 16384) = 0; analyze t; set work_mem = '64kB'; set max_parallel_workers_per_gather = 0; set enable_sort = 0; explain select a, sum(b) from t group by a; QUERY PLAN --------------------------------------------------------------- HashAggregate (cost=23864.26..31088.52 rows=244631 width=8) Group Key: a -> Seq Scan on t (cost=0.00..3529.31 rows=244631 width=8) (3 rows) This however quickly fails on this assert in BuildTupleHashTableExt (see backtrace1.txt): Assert(nbuckets > 0); The value is computed in hash_choose_num_buckets, and there seem to be no protections against returning bogus values like 0. So maybe we should return Min(nbuckets, 1024) or something like that, similarly to hash join. OTOH maybe it's simply due to agg_refill_hash_table() passing bogus values to the function? 6) Another thing that occurred to me was what happens to grouping sets, which we can't spill to disk. So I did this: create table t2 (a int, b int, c int); -- run repeatedly, until there are about 20M rows in t2 (1GB) with tx as (select array_agg(a) as a, array_agg(b) as b from (select a, b from t order by random()) foo), ty as (select array_agg(a) AS a from (select a from t order by random()) foo) insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b) from tx, ty; analyze t2; This produces a table with two independent columns, skewed the same as the column t.a. I don't know which of this actually matters, considering grouping sets don't spill, so maybe the independence is sufficient and the skew may be irrelevant? 
And then do this: set work_mem = '200MB'; set max_parallel_workers_per_gather = 0; set enable_sort = 0; explain select a, b, sum(c) from t2 group by cube (a,b);; QUERY PLAN --------------------------------------------------------------------- MixedAggregate (cost=0.00..833064.27 rows=2756495 width=16) Hash Key: a, b Hash Key: a Hash Key: b Group Key: () -> Seq Scan on t2 (cost=0.00..350484.44 rows=22750744 width=12) (6 rows) which fails with segfault at execution time: tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340) 870 for (i = 0; i < tb->size; i++) (gdb) bt #0 tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340) #1 0x0000000000654e49 in agg_retrieve_hash_table_in_memory ... That's not surprising, because 0x18 pointer is obviously bogus. I guess this is simply an offset 18B added to a NULL pointer? Disabling hashagg spill (setting both GUCs to off) makes no difference, but on master it fails like this: ERROR: out of memory DETAIL: Failed on request of size 3221225472 in memory context "ExecutorState". which is annoying, but expected with an under-estimate and hashagg. And much better than just crashing the whole cluster. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote: > 4) lookup_hash_entries says > > /* check to see if we need to spill the tuple for this grouping set */ > > But that seems bogus, because AFAIK we can't spill tuples for grouping > sets. So maybe this should say just "grouping"? As I see it, it does traverse all hash sets, fill the hash table and spill if needed, for each tuple. The segfault is probably related to this and MixedAggregate, I'm looking into it. -- Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote:
> 5) Assert(nbuckets > 0);
> ...
> This however quickly fails on this assert in BuildTupleHashTableExt
> (see backtrace1.txt):
>
> Assert(nbuckets > 0);
>
> The value is computed in hash_choose_num_buckets, and there seem to be
> no protections against returning bogus values like 0. So maybe we
> should return
>
> Min(nbuckets, 1024)
>
> or something like that, similarly to hash join. OTOH maybe it's simply
> due to agg_refill_hash_table() passing bogus values to the function?
>
> 6) Another thing that occurred to me was what happens to grouping
> sets, which we can't spill to disk. So I did this:
>
> create table t2 (a int, b int, c int);
>
> -- run repeatedly, until there are about 20M rows in t2 (1GB)
> with tx as (select array_agg(a) as a, array_agg(b) as b
>             from (select a, b from t order by random()) foo),
>      ty as (select array_agg(a) AS a
>             from (select a from t order by random()) foo)
> insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
> from tx, ty;
>
> analyze t2;
> ...
>
> which fails with segfault at execution time:
>
> tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
> 870   for (i = 0; i < tb->size; i++)
> (gdb) bt
> #0 tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
> #1 0x0000000000654e49 in agg_retrieve_hash_table_in_memory ...
>
> That's not surprising, because 0x18 pointer is obviously bogus. I
> guess this is simply an offset 18B added to a NULL pointer?

I did some investigation; had you disabled the assert when this panic
happened? If so, it's the same issue as "5) nbuckets == 0", which passes
a zero size to the allocator when it creates the hash table that ends up
at 0x18.

Sorry, my testing environment is misbehaving right now; I haven't
reproduced it yet.

--
Adam Lee
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote: > 1) explain.c currently does this: > > I wonder if we could show something for plain explain (without > analyze). > At least the initial estimate of partitions, etc. I know not showing > those details until after execution is what e.g. sort does, but I > find > it a bit annoying. Looks like you meant to include some example explain output, but I think I understand what you mean. I'll look into it. > 2) The ExecBuildAggTrans comment should probably explain "spilled". Done. > 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler > to > just add a new flag to the agg_* structs instead? I haven't tried > hacking > this, so maybe it's a silly idea. There was a reason I didn't do it this way, but I'm trying to remember why. I'll look into this, also. > 4) lookup_hash_entries says > > /* check to see if we need to spill the tuple for this grouping > set */ > > But that seems bogus, because AFAIK we can't spill tuples for > grouping > sets. So maybe this should say just "grouping"? Yes, we can spill tuples for grouping sets. Unfortunately, I think my tests (which covered this case previously) don't seem to be exercising that path well now. I am going to improve my tests, too. > 5) Assert(nbuckets > 0); I did not repro this issue, but I did set a floor of 256 buckets. > which fails with segfault at execution time: Fixed. I was resetting the hash table context without setting the pointers to NULL. Thanks! I attached a new, rebased version. The fixes are quick fixes for now and I will revisit them after I improve my test cases (which might find more issues). Regards, Jeff Davis
Attachment
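A sketch of that clamping, for illustration. This is not the committed
hash_choose_num_buckets(); the names and the exact limits are
assumptions used only to show the floor.

#include <stddef.h>

static long
choose_num_buckets(double hashentrysize, long estimated_groups,
				   size_t memory)
{
	long		max_buckets = (long) (memory / hashentrysize);
	long		nbuckets = estimated_groups;

	if (nbuckets > max_buckets)
		nbuckets = max_buckets;		/* don't overshoot available memory */
	if (nbuckets < 256)
		nbuckets = 256;				/* floor: never ask for 0 buckets */

	return nbuckets;
}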
Hi, On 2020-02-19 20:16:36 +0100, Tomas Vondra wrote: > 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to > just add a new flag to the agg_* structs instead? I haven't tried hacking > this, so maybe it's a silly idea. New opcodes don't really cost that much - it's a jump table based dispatch already (yes, it increases the table size slightly, but not by much). But adding branches inside opcode implementation does add cost - and we're already bottlenecked by stalls. I assume code duplication is your primary concern here? If so, I think the patch 0008 in https://postgr.es/m/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de would improve the situation. I'll try to rebase that onto master. I'd also like to apply something like 0013 from that thread, I find the whole curperagg, select_current_set, curaggcontext logic confusing as hell. I'd so far planned to put this on the backburner until this patch has been committed, to avoid breaking it. But perhaps that's not the right call? Greetings, Andres Freund
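A toy stand-in for the dispatch argument -- not ExecInterpExpr(), just a
minimal interpreter contrasting a separate opcode with a flag check that
every execution of an existing opcode would have to pay for.

#include <stdio.h>

typedef enum { OP_ADVANCE, OP_ADVANCE_SPILLABLE, OP_DONE } OpCode;

typedef struct Step { OpCode op; int arg; } Step;

static long
run(const Step *steps)
{
	long		acc = 0;

	for (;;)
	{
		switch (steps->op)
		{
			case OP_ADVANCE:			/* hot path: no spill check at all */
				acc += steps->arg;
				break;
			case OP_ADVANCE_SPILLABLE:	/* separate opcode, not a flag */
				if (acc > 100)			/* pretend: "group was spilled" */
					break;				/* skip advancing a spilled group */
				acc += steps->arg;
				break;
			case OP_DONE:
				return acc;
		}
		steps++;
	}
}

int
main(void)
{
	Step		prog[] = {{OP_ADVANCE, 5}, {OP_ADVANCE_SPILLABLE, 7}, {OP_DONE, 0}};

	printf("%ld\n", run(prog));
	return 0;
}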
On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote: > I'd also like to apply something like 0013 from that thread, I find > the > whole curperagg, select_current_set, curaggcontext logic confusing as > hell. I'd so far planned to put this on the backburner until this > patch > has been committed, to avoid breaking it. But perhaps that's not the > right call? At least for now, I appreciate you holding off on those a bit. Regards, Jeff Davis
Hi, On 2020-02-22 09:55:26 -0800, Jeff Davis wrote: > On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote: > > I'd also like to apply something like 0013 from that thread, I find > > the > > whole curperagg, select_current_set, curaggcontext logic confusing as > > hell. I'd so far planned to put this on the backburner until this > > patch > > has been committed, to avoid breaking it. But perhaps that's not the > > right call? > > At least for now, I appreciate you holding off on those a bit. Both patches, or just 0013? Seems the earlier one might make the addition of the opcodes you add less verbose? Greetings, Andres Freund
On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > Both patches, or just 0013? Seems the earlier one might make the > addition of the opcodes you add less verbose? Just 0013, thank you. 0008 looks like it will simplify things. Regards, Jeff Davis
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote: > 5) Assert(nbuckets > 0); ... > 6) Another thing that occurred to me was what happens to grouping > sets, > which we can't spill to disk. So I did this: ... > which fails with segfault at execution time: The biggest problem was that my grouping sets test was not testing multiple hash tables spilling, so a couple bugs crept in. I fixed them, thank you. To fix the tests, I also had to fix the GUCs and the way the planner uses them with my patch. In master, grouping sets are planned by generating a path that tries to do as many grouping sets with hashing as possible (limited by work_mem). But with my patch, the notion of fitting hash tables in work_mem is not necessarily important. If we ignore work_mem during path generation entirely (and only consider it during costing and execution), it will change quite a few plans and undermine the concept of mixed aggregates entirely. That may be a good thing to do eventually as a simplification, but for now it seems like too much, so I am preserving the notion of trying to fit hash tables in work_mem to create mixed aggregates. But that created the testing problem: I need a reliable way to get grouping sets with several hash tables in memory that are all spilling, but the planner is trying to avoid exactly that. So, I am introducing a new GUC called enable_groupingsets_hash_disk (better name suggestions welcome), defaulting it to "off" (and turned on during the test). Additionally, I removed the other GUCs I introduced in earlier versions of this patch. They were basically designed around the idea to revert back to the previous hash aggregation behavior if desired (by setting enable_hashagg_spill=false and hashagg_mem_overflow=true). That makes some sense, but that was already covered pretty well by existing GUCs. If you want to use HashAgg without spilling, just set work_mem higher; and if you want to avoid the planner from choosing HashAgg at all, you set enable_hashagg=false. So I just got rid of enable_hashagg_spill and hashagg_mem_overflow. I didn't forget about your explain-related suggestions. I'll address them in the next patch. Regards, Jeff Davis
Attachment
On Thu, Feb 20, 2020 at 04:56:38PM -0800, Jeff Davis wrote: >On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote: >> 1) explain.c currently does this: >> >> I wonder if we could show something for plain explain (without >> analyze). >> At least the initial estimate of partitions, etc. I know not showing >> those details until after execution is what e.g. sort does, but I >> find >> it a bit annoying. > >Looks like you meant to include some example explain output, but I >think I understand what you mean. I'll look into it. > Oh, right. What I wanted to include is this code snippet: if (es->analyze) show_hashagg_info((AggState *) planstate, es); but I forgot to do the copy-paste. >> 2) The ExecBuildAggTrans comment should probably explain "spilled". > >Done. > >> 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler >> to >> just add a new flag to the agg_* structs instead? I haven't tried >> hacking >> this, so maybe it's a silly idea. > >There was a reason I didn't do it this way, but I'm trying to remember >why. I'll look into this, also. > >> 4) lookup_hash_entries says >> >> /* check to see if we need to spill the tuple for this grouping >> set */ >> >> But that seems bogus, because AFAIK we can't spill tuples for >> grouping >> sets. So maybe this should say just "grouping"? > >Yes, we can spill tuples for grouping sets. Unfortunately, I think my >tests (which covered this case previously) don't seem to be exercising >that path well now. I am going to improve my tests, too. > >> 5) Assert(nbuckets > 0); > >I did not repro this issue, but I did set a floor of 256 buckets. > Hmmm. I can reproduce it reliably (on the patch from 2020/02/18) but it seems to only happen when the table is large enough. For me, doing insert into t select * from t; until the table has ~7.8M rows does the trick. I can't reproduce it on the current patch, so ensuring there are at least 256 buckets seems to have helped. If I add an elog() to print nbuckets at the beginning of hash_choose_num_buckets, I see it starts as 0 from time to time (and then gets tweaked to 256). I suppose this is due to how the input data is generated, i.e. all hash values should fall to the first batch, so all other batches should be empty. But in agg_refill_hash_table we use the number of input tuples as a starting point for, which is how we get nbuckets = 0. I think enforcing nbuckets to be at least 256 is OK. >> which fails with segfault at execution time: > >Fixed. I was resetting the hash table context without setting the >pointers to NULL. > Yep, can confirm it's no longer crashing for me. >Thanks! I attached a new, rebased version. The fixes are quick fixes >for now and I will revisit them after I improve my test cases (which >might find more issues). > OK, sounds good. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-02-22 11:02:16 -0800, Jeff Davis wrote: > On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > > Both patches, or just 0013? Seems the earlier one might make the > > addition of the opcodes you add less verbose? > > Just 0013, thank you. 0008 looks like it will simplify things. Pushed 0008.
On Mon, 2020-02-24 at 15:29 -0800, Andres Freund wrote: > On 2020-02-22 11:02:16 -0800, Jeff Davis wrote: > > On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > > > Both patches, or just 0013? Seems the earlier one might make the > > > addition of the opcodes you add less verbose? > > > > Just 0013, thank you. 0008 looks like it will simplify things. > > Pushed 0008. Rebased on your change. This simplified the JIT and interpretation code quite a bit. Also: * caching the compiled expressions so I can switch between the variants cheaply * added "Planned Partitions" to explain output * included tape buffers in the "Memory Used" output * Simplified the way I try to track memory usage and trigger spilling. * Reset hash tables always rather than rebuilding them from scratch. I will do another round of performance tests and see if anything changed from last time. Regards, Jeff Davis
Attachment
On Wed, 2020-02-26 at 19:14 -0800, Jeff Davis wrote: > Rebased on your change. This simplified the JIT and interpretation > code > quite a bit. Attached another version. * tweaked EXPLAIN output some more * rebased and cleaned up * Added back the enable_hashagg_disk flag (defaulting to on). I've gone back and forth on this, but it seems like a good idea to have it there. So now there are a total of two GUCs: enable_hashagg_disk and enable_groupingsets_hash_disk Unless I (or someone else) finds something significant, this is close to commit. Regards, Jeff Davis
Attachment
On Wed, Mar 11, 2020 at 11:55:35PM -0700, Jeff Davis wrote: > * tweaked EXPLAIN output some more > Unless I (or someone else) finds something significant, this is close > to commit. Thanks for working on this ; I finally made a pass over the patch. +++ b/doc/src/sgml/config.sgml + <term><varname>enable_groupingsets_hash_disk</varname> (<type>boolean</type>) + Enables or disables the query planner's use of hashed aggregation for + grouping sets when the size of the hash tables is expected to exceed + <varname>work_mem</varname>. See <xref + linkend="queries-grouping-sets"/>. Note that this setting only + affects the chosen plan; execution time may still require using + disk-based hash aggregation. ... ... + <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>) + ... This only affects the planner choice; + execution time may still require using disk-based hash + aggregation. The default is <literal>on</literal>. I don't understand what's meant by "the chosen plan". Should it say, "at execution ..." instead of "execution time" ? + Enables or disables the query planner's use of hashed aggregation plan + types when the memory usage is expected to exceed Either remove "plan types" for consistency with enable_groupingsets_hash_disk, Or add it there. Maybe it should say "when the memory usage would OTHERWISE BE expected to exceed.." +show_hashagg_info(AggState *aggstate, ExplainState *es) +{ + Agg *agg = (Agg *)aggstate->ss.ps.plan; + long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024; I see this partially duplicates my patch [0] to show memory stats for (at Andres' suggestion) all of execGrouping.c. Perhaps you'd consider naming the function something more generic in case my patch progresses ? I'm using: |show_tuplehash_info(HashTableInstrumentation *inst, ExplainState *es); Mine also shows: |ExplainPropertyInteger("Original Hash Buckets", NULL, |ExplainPropertyInteger("Peak Memory Usage (hashtable)", "kB", |ExplainPropertyInteger("Peak Memory Usage (tuples)", "kB", [0] https://www.postgresql.org/message-id/20200306213310.GM684%40telsasoft.com You added hash_mem_peak and hash_batches_used to struct AggState. In my 0001 patch, I added instrumentation to struct TupleHashTable, and in my 0005 patch I move it into AggStatePerHashData and other State nodes. + if (from_tape) + partition_mem += HASHAGG_READ_BUFFER_SIZE; + partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE; => That looks wrong ; should say += ? + gettext_noop("Enables the planner's use of hashed aggregation plans that are expected to exceed work_mem."), should say: "when the memory usage is otherwise be expected to exceed.." -- Justin
On Thu, 2020-03-12 at 16:01 -0500, Justin Pryzby wrote: > I don't understand what's meant by "the chosen plan". > Should it say, "at execution ..." instead of "execution time" ? I removed that wording; hopefully it's more clear without it? > Either remove "plan types" for consistency with > enable_groupingsets_hash_disk, > Or add it there. Maybe it should say "when the memory usage would > OTHERWISE BE > expected to exceed.." I added "plan types". I don't think "otherwise be..." would quite work there. "Otherwise" sounds to me like it's referring to another plan type (e.g. Sort+GroupAgg), and that doesn't fit. It's probably best to leave that level of detail out of the docs. I think the main use case for enable_hashagg_disk is for users who experience some plan changes and want the old behavior which favors Sort when there are a lot of groups. > +show_hashagg_info(AggState *aggstate, ExplainState *es) > +{ > + Agg *agg = (Agg *)aggstate->ss.ps.plan; > + long memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024; > > I see this partially duplicates my patch [0] to show memory stats for ... > You added hash_mem_peak and hash_batches_used to struct AggState. > In my 0001 patch, I added instrumentation to struct TupleHashTable I replied in that thread and I'm not sure that tracking the memory in the TupleHashTable is the right approach. The group keys and the transition state data can't be estimated easily that way. Perhaps we can do that if the THT owns the memory contexts (and can call MemoryContextMemAllocated()), rather than using passed-in ones, but that might require more discussion. (I'm open to that idea, by the way.) Also, my patch also considers the buffer space, so would that be a third memory number? For now, I think I'll leave the way I report it in a simpler form and we can change it later as we sort out these details. That leaves mine specific to HashAgg, but we can always refactor it later. I did change my code to put the metacontext in a child context of its own so that I could call MemoryContextMemAllocated() on it to include it in the memory total, and that will make reporting it separately easier when we want to do so. > + if (from_tape) > + partition_mem += HASHAGG_READ_BUFFER_SIZE; > + partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE; > > => That looks wrong ; should say += ? Good catch! Fixed. Regards, Jeff Davis
Attachment
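A hedged sketch of that arrangement, assuming the
MemoryContextMemAllocated(context, recurse) accounting helper introduced
as part of this work; the context and function names below are made up,
not the actual nodeAgg.c fields.

#include "postgres.h"
#include "utils/memutils.h"

static MemoryContext hash_metacxt;	/* hash table metadata lives here */

static void
setup_hash_metacxt(MemoryContext parent)
{
	/* a dedicated child context, so its total can be read cheaply */
	hash_metacxt = AllocSetContextCreate(parent,
										 "hashagg meta context",
										 ALLOCSET_DEFAULT_SIZES);
}

static Size
hashagg_memory_used(MemoryContext hash_tablecxt)
{
	/* recurse = true includes any child contexts in the totals */
	return MemoryContextMemAllocated(hash_metacxt, true) +
		MemoryContextMemAllocated(hash_tablecxt, true);
}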
Committed.

There's some future work that would be nice (some of these are just
ideas and may not be worth it):

* Refactor MemoryContextMemAllocated() to be a part of
MemoryContextStats(), but allow it to avoid walking through the blocks
and freelists.

* Improve the choice of the initial number of buckets in the hash
table. For this patch, I tried to preserve the existing behavior of
estimating the number of groups and trying to initialize with that many
buckets. But my performance tests seem to indicate this is not the best
approach. More work is needed to find what we should really do here.

* For workloads that are not in work_mem *or* system memory, and need
to actually go to storage, I see poor CPU utilization because it's not
effectively overlapping CPU and IO work. Perhaps buffering or readahead
changes can improve this, or async IO (even better).

* Project unnecessary attributes away before spilling tuples to disk.

* Improve the logtape.c API so that the caller doesn't need to manage a
bunch of tape numbers.

* Improve the estimate of the hash entry size. This patch doesn't
change the way the planner estimates it, but I observe that the actual
size as seen at runtime is significantly different. This is connected
to the initial number of buckets for the hash table.

* In recursive steps, I don't have a good estimate for the number of
groups, so I just estimate it as the number of tuples in that spill
tape (which is pessimistic). That could be improved by doing a real
cardinality estimate as the tuples are spilling (perhaps with HLL?).

* Many aggregates with pass-by-ref transition states don't provide a
great aggtransspace. We should consider doing something smarter, like
having negative numbers represent a number that should be multiplied by
the size of the group (e.g. ARRAY_AGG would have a size dependent on
the group size, not a constant); see the sketch after this message.

* If we want to handle ARRAY_AGG (and the like) well, we can consider
spilling the partial states in the hash table when the memory is full.
That would add a fair amount of complexity because there would be two
types of spilled data (tuples and partial states), but it could be
useful in some cases.

Regards,
Jeff Davis
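A hypothetical sketch of the negative-aggtransspace convention mentioned
in the list above. No such convention exists today; this is purely
illustrative of how the estimate could be interpreted.

/*
 * A non-negative value is a flat per-group estimate; a negative value is
 * treated as a per-input-row multiplier (hypothetical convention).
 */
static double
estimated_transition_space(int aggtransspace, double avg_group_size)
{
	if (aggtransspace >= 0)
		return (double) aggtransspace;

	return (double) (-aggtransspace) * avg_group_size;
}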
On Wed, Mar 18, 2020 at 04:35:57PM -0700, Jeff Davis wrote: > >Committed. > \o/ -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Mar 15, 2020 at 04:05:37PM -0700, Jeff Davis wrote: > > + if (from_tape) > > + partition_mem += HASHAGG_READ_BUFFER_SIZE; > > + partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE; > > > > => That looks wrong ; should say += ? > > Good catch! Fixed. > +++ b/src/backend/executor/nodeAgg.c > @@ -2518,9 +3499,36 @@ ExecInitAgg(Agg *node, EState *estate, int eflags) > */ > if (use_hashing) > { > + Plan *outerplan = outerPlan(node); > + uint64 totalGroups = 0; > + for (i = 0; i < aggstate->num_hashes; i++) > + totalGroups = aggstate->perhash[i].aggnode->numGroups; > + > + hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0, I realize that I missed the train but .. that looks like another += issue? Also, Andres was educating me about the range of behavior of "long" type, and I see now while rebasing that you did the same thing. https://www.postgresql.org/message-id/20200306175859.d56ohskarwldyrrw%40alap3.anarazel.de -- Justin
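Spelled out, the "+=" reading Justin is suggesting would be the
following (whether that is the intended fix is for Jeff to confirm; only
the quoted identifiers are reused here):

	uint64		totalGroups = 0;

	for (i = 0; i < aggstate->num_hashes; i++)
		totalGroups += aggstate->perhash[i].aggnode->numGroups;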
Hi,
I happened to notice that "set enable_sort to false" does not guarantee that the planner will use hashagg in test groupingsets.sql,
so the subsequent comparison of sortagg and hashagg results seems meaningless.
Thanks,
Pengzhou
On Fri, Mar 20, 2020 at 1:20 PM Pengzhou Tang <ptang@pivotal.io> wrote:
Hi, I happened to notice that "set enable_sort to false" does not guarantee that the planner will use hashagg in test groupingsets.sql, so the subsequent comparison of sortagg and hashagg results seems meaningless.
Please forget my comment, I should set enable_groupingsets_hash_disk too.
Hello,
When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:
> depth = ceil( log(nbatches - 1) / log(num_partitions) );
If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].
I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
[1] https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com
Thanks
Richard
On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote: >Hello, > >When calculating the disk costs of hash aggregation that spills to disk, >there is something wrong with how we determine depth: > >> depth = ceil( log(nbatches - 1) / log(num_partitions) ); > >If nbatches is some number between 1.0 and 2.0, we would have a negative >depth. As a result, we may have a negative cost for hash aggregation >plan node, as described in [1]. > >I don't think 'log(nbatches - 1)' is what we want here. Should it be >just '(nbatches - 1)'? > I think using log() is correct, but why should we allow fractional nbatches values between 1.0 and 2.0? You either have 1 batch or 2 batches, you can't have 1.5 batches. So I think the issue is here nbatches = Max((numGroups * hashentrysize) / mem_limit, numGroups / ngroups_limit ); and we should probably do nbatches = ceil(nbatches); right after it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-03-27 at 02:31 +0100, Tomas Vondra wrote: > On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote: > > If nbatches is some number between 1.0 and 2.0, we would have a > > negative > > depth. As a result, we may have a negative cost for hash > > aggregation > > plan node, as described in [1]. > > numGroups / ngroups_limit ); > > and we should probably do > > nbatches = ceil(nbatches); > Thank you both. I also protected against nbatches == 0 (shouldn't happen), and against num_partitions <= 1. That allowed me to remove the conditional and simplify a bit. Regards, Jeff Davis
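A rough sketch of the guarded computation, for illustration; the
parameter names are assumptions and the committed cost code may differ
in detail.

#include <math.h>

static double
spill_depth(double numGroups, double hashentrysize,
			double mem_limit, double ngroups_limit, int num_partitions)
{
	double		nbatches;

	nbatches = fmax((numGroups * hashentrysize) / mem_limit,
					numGroups / ngroups_limit);
	nbatches = fmax(ceil(nbatches), 1.0);	/* integral, and at least 1 */

	if (num_partitions < 2)
		num_partitions = 2;		/* keep the log() denominator sane */

	/*
	 * Number of recursive spill passes.  log(1) == 0, so a single batch
	 * naturally yields depth 0 with no special case needed.
	 */
	return ceil(log(nbatches) / log(num_partitions));
}

With nbatches rounded up to at least 1 and num_partitions forced to at
least 2, log() never sees an argument below 1, so the depth -- and
therefore the disk cost -- can no longer go negative.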