Thread: Large Scale Aggregation (HashAgg Enhancement)
A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s.

PostgreSQL ran out of memory with its Hash Aggregator, and doing an old style Sort & Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine).

Spilling to disk would be nice, but I suspect the obvious method would thrash quite badly with non-sorted input.

One solution is to partially sort the data into various buckets. If we know how many keys can fit into sort_mem and what the upper and lower bounds of our keys are, then (range of keys) / (keys per sort_mem) temporary files can be created. A sequential scan of the source data would sort each tuple into the appropriate temporary file. From there we can loop through a temporary file, HashAgg the contents, present the results, and move to the next temporary file.

For my particular problem the lower bound is 1 and the upper bound is about 100M. The sort_mem setting allows HashAgg to handle 1M keys at a time, so the first pass through the 3B tuples would create 100 temporary files on disk. Temp file 1 would get keys 1 through 1M, temp file 2 gets keys 1M + 1 through 2M, etc. From there it is pretty easy.

This would allow for a 1000 fold increase in the number of distinct keys PostgreSQL can simultaneously HashAgg in the default configuration, at a reasonable speed.

I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the Sort & Sum method PostgreSQL falls back to.
--
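To make the bucketing scheme above concrete, here is a rough Python sketch of the two passes (illustrative only, not PostgreSQL code; the function name, the toy key range and the per-batch key budget are all invented for the example). A first scan routes each tuple to a temporary file chosen by key range, then each file is hash-aggregated on its own, so the in-memory table never holds more than one file's worth of keys.

    import tempfile
    from collections import defaultdict

    def range_partition_aggregate(rows, low_key, high_key, keys_per_batch, n_values=6):
        """Pass 1: route (key, v1..v6) rows to temp files by key range.
        Pass 2: hash-aggregate one temp file at a time."""
        n_batches = (high_key - low_key) // keys_per_batch + 1
        batch_files = [tempfile.TemporaryFile(mode="w+") for _ in range(n_batches)]

        # Pass 1: a single sequential scan of the source, routing by key range.
        for row in rows:
            batch = (row[0] - low_key) // keys_per_batch
            batch_files[batch].write(",".join(map(str, row)) + "\n")

        # Pass 2: aggregate each file independently; at most keys_per_batch
        # distinct keys are ever held in memory at once.
        for f in batch_files:
            f.seek(0)
            sums = defaultdict(lambda: [0] * n_values)
            for line in f:
                fields = [int(x) for x in line.split(",")]
                key, values = fields[0], fields[1:]
                for i, v in enumerate(values):
                    sums[key][i] += v
            for key in sorted(sums):
                yield (key, *sums[key])
            f.close()

    # Toy numbers: keys 1..20, at most 10 keys per batch -> 2 temp files.
    rows = [(k % 20 + 1, 1, 2, 3, 4, 5, 6) for k in range(100)]
    for result in range_partition_aggregate(rows, 1, 20, 10):
        print(result)

With the numbers in the message (keys 1 through roughly 100M, about 1M keys per sort_mem), the same shape gives the 100 temporary files described above.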
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s.
>
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old style Sort & Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine).

> Spilling to disk would be nice but I suspect the obvious method would thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance.

A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to pre-aggregate the rows using a plan like

  HashAgg
    SortedAgg

or

  SortedAgg
    Sort
      SortedAgg

The first SortedAgg seems superfluous, but would reduce the row volume considerably if incoming rows were frequently naturally adjacent, even if the values were not actually sorted. (This could also be done during sorting, but it's much easier to slot the extra executor step into the plan.) That might then reduce the size of the later sort, or allow it to become a HashAgg.

I could make that manually enabled using "enable_pre_agg" to allow us to measure the effectiveness of that technique and decide what cost model we'd use to make it automatic. Would that help?

> I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the Sort&Sum method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with a WHERE clause and then putting them back together with a UNION. If the table is partitioned, then do this by partitions.

This should also help when it comes to recalculating the sums again in the future, since you'll only need to rescan the rows that have been added since the last summation.

Best Regards, Simon Riggs
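The pre-aggregation step described above can be pictured with a small sketch (plain Python, not executor code; the function name and sample rows are made up): runs of adjacent rows that share a key are collapsed into partial sums, which a later sort or hash aggregation step can combine safely because the partial sums can simply be summed again.

    from itertools import groupby
    from operator import itemgetter

    def pre_aggregate_adjacent(rows):
        """Collapse runs of adjacent rows that share a key into partial sums.
        This shrinks the row volume when equal keys tend to arrive next to
        each other, and never changes the final result, because the partial
        sums can be summed again by the real aggregation step downstream."""
        for key, run in groupby(rows, key=itemgetter(0)):
            run = list(run)
            partial = [sum(col) for col in zip(*(r[1:] for r in run))]
            yield (key, *partial)

    # "Clumped" but unsorted input: 6 rows collapse to 3 partial rows.
    clumped = [(1, 10), (1, 5), (1, 1), (2, 7), (2, 3), (1, 4)]
    print(list(pre_aggregate_adjacent(clumped)))
    # [(1, 16), (2, 10), (1, 4)] -- key 1 appears twice since the input wasn't sorted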
On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> > A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s.
> >
> > PostgreSQL ran out of memory with its Hash Aggregator and doing an old style Sort & Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine).
> >
> > Spilling to disk would be nice but I suspect the obvious method would thrash quite badly with non-sorted input.
>
> There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance.
>
> A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to

They are randomly distributed. Fully sorting the data is quite painful.

> pre-aggregate the rows using a plan like
>
>   HashAgg
>     SortedAgg
>
> or
>
>   SortedAgg
>     Sort
>       SortedAgg
>
> The first SortedAgg seems superfluous, but would reduce the row volume considerably if incoming rows were frequently naturally adjacent, even if the values were not actually sorted. (This could also be done during sorting, but it's much easier to slot the extra executor step into the plan.) That might then reduce the size of the later sort, or allow it to become a HashAgg.
>
> I could make that manually enabled using "enable_pre_agg" to allow us to measure the effectiveness of that technique and decide what cost model we'd use to make it automatic. Would that help?

I don't understand how this helps. The problem isn't the 3B data source rows but rather the 100M destination keys that are being aggregated against. The memory constraints of HashAgg are a result of the large number of target keys and should be the same whether it was 100M rows or 10B rows.

I think I need something closer to:

  HashAgg
    -> HashSort (to disk)

HashSort would create a number of files on disk with "similar" data, grouping all similar keys into a single temporary file which HashAgg can deal with individually (100 loops by 1M target keys instead of 1 loop by 100M target keys). The results would be the same as partitioning by keyblock and running a HashAgg on each partition, but it would be handled by the Executor rather than by client side code.

> > I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the Sort&Sum method PostgreSQL falls back to.
>
> You can get round this now by chopping the larger table into pieces with a WHERE clause and then putting them back together with a UNION. If the table is partitioned, then do this by partitions.

True, except this results in several sequential scans over the source data. I can extract and sort in a single pass at client side, but it would be far better if I could get PostgreSQL to do the same. I could probably write a plpgsql function to do that logic, but it would be quite messy.

> This should also help when it comes to recalculating the sums again in the future, since you'll only need to rescan the rows that have been added since the last summation.

We store the aggregated results and never do this type of calculation on that dataset again. The original dataset comes from about 300 partitions (time and source) and they are removed upon completion.
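The "HashSort" idea reads like hash partitioning: every occurrence of a given key is routed to the same temporary file, so each file can be hash-aggregated independently with a bounded number of groups in memory. A rough Python sketch (names and the batch count are invented; comma-separated lines in temp files stand in for real spill files):

    import tempfile

    def hash_partition_aggregate(rows, n_batches=4):
        """Route rows to one of n_batches temp files by hash of the key, then
        hash-aggregate each file on its own.  All rows for a given key land
        in the same file, so the per-batch hash table stays small."""
        files = [tempfile.TemporaryFile(mode="w+") for _ in range(n_batches)]
        for key, *values in rows:
            files[hash(key) % n_batches].write(",".join(map(str, [key] + values)) + "\n")

        for f in files:
            f.seek(0)
            sums = {}
            for line in f:
                key, *values = (int(x) for x in line.split(","))
                acc = sums.setdefault(key, [0] * len(values))
                for i, v in enumerate(values):
                    acc[i] += v
            for key, acc in sums.items():
                yield (key, *acc)
            f.close()

    rows = [(k % 10, 1, 1, 1, 1, 1, 1) for k in range(50)]
    print(sorted(hash_partition_aggregate(rows, n_batches=4)))
    # Each of the 10 keys sums to (5, 5, 5, 5, 5, 5).

Unlike the key-range version sketched earlier, this needs no knowledge of the key bounds, only a guess at how many batches are required.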
While this calculation is being performed, additional partitions are added. I suppose I could store the source data in 300 * 1000 partitions (approx. 300 batches times 1000 segments), but that would probably run into other problems; PostgreSQL probably has issues with that many tables.
--
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
>> A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s.

> There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance.

Yeah, I proposed something similar awhile back in conjunction with fixing the spill logic for hash joins (which was always there, but was not adaptive until recently). I got the join part done but got distracted before fixing HashAgg :-(

In principle, you just reduce the range of currently-in-memory hash codes whenever you run low on memory. The so-far-accumulated working state for aggregates that are not in the range anymore goes into a temp file, and subsequently any incoming tuples that hash outside the range go into another temp file. After you've completed the scan, you finalize and emit the aggregates that are still in memory, then you pick up the first set of dropped aggregates, rescan the associated "TODO" file of unprocessed tuples, lather rinse repeat till done.

The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order. (This does not matter for the standard aggregates but it definitely does for custom aggregates, and there will be unhappy villagers appearing on our doorsteps if we break it.) I think this can work correctly under the above sketch but it needs to be verified. It might require different handling of the TODO files than what hashjoin does.

regards, tom lane
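A deliberately simplified, list-backed illustration of the spill scheme sketched above, for a plain sum() only (illustrative Python, not PostgreSQL code): when the group table fills, the partial states whose hash has the current split bit set are pushed out, later rows for out-of-range keys go to a TODO list, and the (spilled states, TODO) pair is reprocessed in a later pass. Real code would shrink the range repeatedly, account for memory in bytes, and cope with skewed hash distributions; none of that is attempted here.

    def spilling_hashagg(rows, max_groups):
        """Spill-to-disk HashAgg sketch for sum(): lists stand in for temp
        files, the memory limit is a simple group count, and only one
        "shrink the in-memory range" event happens per pass."""
        results = {}
        # Work item: (carried partial states, unprocessed rows, hash bit to split on).
        work = [({}, list(rows), 0)]
        while work:
            carried, pending, bit = work.pop()
            groups = dict(carried)            # key -> running sum, lives "in memory"
            spilled, todo = {}, []
            shrunk = False
            for key, value in pending:
                if key in groups:
                    groups[key] += value
                elif not shrunk and len(groups) < max_groups:
                    groups[key] = value       # room for a new group
                else:
                    if not shrunk:
                        # Memory pressure: push out the partial states whose hash
                        # has the split bit set; later rows for those hash codes
                        # are deferred to the TODO list.
                        shrunk = True
                        for k in [k for k in groups if (hash(k) >> bit) & 1]:
                            spilled[k] = groups.pop(k)
                    if (hash(key) >> bit) & 1 or len(groups) >= max_groups:
                        todo.append((key, value))
                    else:
                        groups[key] = value
            results.update(groups)            # these aggregates are final now
            if shrunk:
                work.append((spilled, todo, bit + 1))
        return results

    # 12 distinct keys but a pretend memory limit of only 4 groups.
    rows = [(k % 12, 1) for k in range(120)]
    print(sorted(spilling_hashagg(rows, max_groups=4).items()))  # every key sums to 10

The arrival-order caveat raised at the end of the message is real: nothing in this sketch guarantees that a key's rows are replayed in their original order once they have been deferred, which is fine for sum() but not for order-sensitive custom aggregates.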
On Mon, 2006-01-16 at 09:42 -0500, Rod Taylor wrote:
> On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> > On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> >
> > A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to
>
> They are randomly distributed. Fully sorting the data is quite painful.

...

> I don't understand how this helps.

It wouldn't, since your rows are randomly distributed. The idea was not related to improving HashAgg, but to improving Aggregation for the case of naturally grouped data.

> I think I need something closer to:
>
>   HashAgg
>     -> HashSort (to disk)
>
> HashSort would create a number of files on disk with "similar" data, grouping all similar keys into a single temporary file which HashAgg can deal with individually (100 loops by 1M target keys instead of 1 loop by 100M target keys). The results would be the same as partitioning by keyblock and running a HashAgg on each partition, but it would be handled by the Executor rather than by client side code.
>
> > > I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the Sort&Sum method PostgreSQL falls back to.

That is exactly how the spill to disk logic works for HashJoin (and incidentally, identical to an Oracle one-pass hash join, since both are based upon the hybrid hash join algorithm). Multi-pass would only be required to handle very skewed hash distributions, which HJ doesn't do yet.

So yes, this can be done.

Best Regards, Simon Riggs
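For reference, the shape of the existing HashJoin spill logic being pointed to here is roughly a hybrid hash join: one batch of the inner side is kept in memory, the other batches of both inputs are written to per-batch temp files, and the spilled batches are then joined one at a time. A toy Python rendering (lists instead of temp files; all names are invented; this is a sketch of the general algorithm, not of nodeHashjoin.c):

    from collections import defaultdict

    def hybrid_hash_join(inner, outer, n_batches=4):
        """Toy one-pass hybrid hash join on the first field of each tuple.
        Within every batch, rows are appended and replayed in arrival order,
        which is the ordering property discussed later in this thread."""
        def batch_of(key):
            return hash(key) % n_batches

        in_memory = defaultdict(list)                 # hash table for batch 0
        inner_batches = [[] for _ in range(n_batches)]
        for row in inner:
            b = batch_of(row[0])
            if b == 0:
                in_memory[row[0]].append(row)
            else:
                inner_batches[b].append(row)          # spill inner batch

        results = []
        outer_batches = [[] for _ in range(n_batches)]
        for row in outer:
            b = batch_of(row[0])
            if b == 0:
                results.extend((row, m) for m in in_memory.get(row[0], []))
            else:
                outer_batches[b].append(row)          # spill outer batch

        for b in range(1, n_batches):                 # now join the spilled batches
            table = defaultdict(list)
            for row in inner_batches[b]:
                table[row[0]].append(row)
            for row in outer_batches[b]:
                results.extend((row, m) for m in table.get(row[0], []))
        return results

    inner = [(i, "inner%d" % i) for i in range(8)]
    outer = [(i % 8, "outer%d" % i) for i in range(16)]
    print(len(hybrid_hash_join(inner, outer)))        # 16 matches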
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> >> A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s.
>
> > There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance.
>
> Yeah, I proposed something similar awhile back in conjunction with fixing the spill logic for hash joins (which was always there, but was not adaptive until recently). I got the join part done but got distracted before fixing HashAgg :-(

You've done the main work. :-)

> The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order. (This does not matter for the standard aggregates but it definitely does for custom aggregates, and there will be unhappy villagers appearing on our doorsteps if we break it.) I think this can work correctly under the above sketch but it needs to be verified. It might require different handling of the TODO files than what hashjoin does.

For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see?

With dynamic adjustment of the hash partitioning, some tuples will go through multiple temp files before they ultimately get eaten, and different tuples destined for the same aggregate may take different paths through the temp files depending on when they arrive. It's not immediately obvious that ordering is preserved when that happens. I think it can be made to work but it may take different management of the temp files than hashjoin uses. (Worst case, we could use just a single temp file for all unprocessed tuples, but this would result in extra I/O.)

regards, tom lane
On Mon, 2006-01-16 at 14:43 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see?
>
> With dynamic adjustment of the hash partitioning, some tuples will go through multiple temp files before they ultimately get eaten, and different tuples destined for the same aggregate may take different paths through the temp files depending on when they arrive. It's not immediately obvious that ordering is preserved when that happens. I think it can be made to work but it may take different management of the temp files than hashjoin uses. (Worst case, we could use just a single temp file for all unprocessed tuples, but this would result in extra I/O.)

Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ).

Why would we continue to dynamically build the hash table after the start of the outer scan? (I see that we do this, as you say, but I am surprised).

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ).

But our idea of the number of batches needed can change during that process, resulting in some inner tuples being initially assigned to the wrong temp file. This would also be true for hashagg.

> Why would we continue to dynamically build the hash table after the start of the outer scan?

The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a "concat" aggregate for instance).

regards, tom lane
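The point about aggregate states not being fixed-size is easy to see with a toy comparison (illustrative Python; the helper and the row shapes are made up): a sum() state stays one number per group, while a concat-style state grows with every input row, so two hash tables with the same number of groups can need wildly different amounts of memory.

    import sys

    def build_states(rows, transition, init):
        """Run a transition function over (key, value) rows, keeping one
        aggregate state per key -- a stripped-down picture of a HashAgg table."""
        states = {}
        for key, value in rows:
            states[key] = transition(states.get(key, init()), value)
        return states

    rows = [(k % 10, "x" * 100) for k in range(1000)]   # 10 groups, 1000 rows

    # sum(length): the state is a single integer per group.
    sum_states = build_states(((k, len(v)) for k, v in rows), lambda s, v: s + v, int)
    # concat: the state is the list of every value seen so far for the group.
    concat_states = build_states(rows, lambda s, v: s + [v], list)

    print(sum(sys.getsizeof(s) for s in sum_states.values()))
    print(sum(sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)
              for s in concat_states.values()))          # much larger, and still growing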
Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Why would we continue to dynamically build the hash table after the start of the outer scan?
>
> The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a "concat" aggregate for instance).

For a hash aggregate would it be possible to rescan the original table instead of spilling to temporary files?

Then when you run out of working memory you simply throw out half the hash table and ignore subsequent tuples that fall in those hash buckets. Then you rescan for the discarded hash bucket regions. This avoids having to do any disk writes at the expense possibly of additional reads. I think in terms of i/o it would be much faster in most cases.

The downsides are:

a) volatile aggregates or aggregates with side-effects would be confused by being executed twice. I'm not clear that volatile aggregate functions make any sense anyways though.

b) I'm unclear whether rescanning the table could potentially find tuples in a different state than previous scans. If so then the idea doesn't work at all. But I don't think that's possible is it?

The main problem is c) it may lose in terms of i/o for cases where the cardinality is low (ie, it's overflowing despite having low cardinality because the table is really really big too). But most cases will be spilling because the cardinality is high. So the table may be big but the spill files are nearly as big anyways and having to write and then read them means double the i/o.

The upside of not having to write out temporary files is big. I find queries that require temporary sort files get hit with a *huge* performance penalty. Often an order of magnitude. Part of that could probably be mitigated by having the sort files on a separate spindle but I think it's always going to hurt especially if there are multiple operations spilling to disk simultaneously.

-- greg
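The rescan idea can be sketched as follows (rough Python, not PostgreSQL code; the region count, the group limit and the victim-selection rule are arbitrary): aggregate only the hash regions that still fit, throw out a region's partial states when memory fills, and rescan the source for the regions that were thrown out. scan_source() must return the same rows on every call, which is exactly the concern raised in point (b) above; and when only one region remains the sketch simply lets it exceed the pretend limit.

    def rescan_hashagg(scan_source, max_groups, n_regions=8):
        """Multi-pass aggregation with no spill files: discarded hash regions
        are simply recomputed from a fresh scan of the source."""
        results = {}
        remaining = set(range(n_regions))
        while remaining:
            active = set(remaining)
            groups = {}
            for key, value in scan_source():
                region = hash(key) % n_regions
                if region not in active:
                    continue                  # some other pass's problem
                if key not in groups and len(groups) >= max_groups and len(active) > 1:
                    # Memory pressure: throw away one unfinished region,
                    # including any partial states already built for it.
                    victim = next(r for r in sorted(active) if r != region)
                    active.discard(victim)
                    groups = {k: v for k, v in groups.items()
                              if hash(k) % n_regions in active}
                if region in active:
                    groups[key] = groups.get(key, 0) + value
            results.update(groups)
            remaining -= active               # these regions are now final
        return results

    rows = [(k % 50, 1) for k in range(500)]  # 50 groups, 10 rows each
    agg = rescan_hashagg(lambda: iter(rows), max_groups=20)
    print(len(agg), all(v == 10 for v in agg.values()))   # 50 True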
Greg Stark <gsstark@mit.edu> writes:
> For a hash aggregate would it be possible to rescan the original table instead of spilling to temporary files?

Sure, but the possible performance gain is finite and the possible performance loss is not. The "original table" could be an extremely expensive join.

We'd like to think that the planner gets the input size estimate approximately right and so the amount of extra I/O caused by hash table resizing should normally be minimal. The cases where it is not right are *especially* not likely to be a trivial table scan as you are supposing.

regards, tom lane
On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ).
>
> But our idea of the number of batches needed can change during that process, resulting in some inner tuples being initially assigned to the wrong temp file. This would also be true for hashagg.

So we correct that before we start reading the outer table.

> > Why would we continue to dynamically build the hash table after the start of the outer scan?
>
> The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a "concat" aggregate for instance).

OK, I see what you mean.

Sounds like we should have a new definition for Aggregates, "Sort Insensitive", that allows them to work when the input ordering does not affect the result, since that case can be optimised much better when using HashAgg. Since we know that applies to the common cases of SUM, AVG etc this will certainly help people.

For sort-sensitive aggregates it sounds like we either:

1. Write to a single file, while we remember the start offset of the first row of each batch.

2. Write to multiple files, adding a globally incrementing sequenceid. Batches are then resorted on the sequenceid before processing.

3. We give up, delete the existing batches and restart the scan from the beginning of the outer table.

Sounds like (1) is best, since the overflow just becomes a SortedAgg. But all of them sound ugly.

Best Regards, Simon Riggs
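Option 2 above can be illustrated with a small sketch (plain Python; the file names, row layout and batch predicate are invented): each spilled row carries a globally increasing sequence number, so when one logical batch ends up spread across several spill files because the batch count changed part way through, sorting the gathered rows on that number restores the original arrival order before the batch is aggregated.

    from itertools import count

    def reorder_spilled_batch(spill_files, belongs_to_batch):
        """Gather every spilled row that belongs to one target batch and sort
        on the global sequence id (the first field) to recover arrival order."""
        gathered = [row for rows in spill_files.values() for row in rows
                    if belongs_to_batch(row[1])]
        return sorted(gathered)               # seqid comes first => arrival order

    seq = count()
    # Rows for key 'k7' spilled while there were 2 batches, then again after a
    # split to 4 batches: the same key now sits in two different files.
    spill_files = {
        "batch1.of2": [(next(seq), "k7", "first"), (next(seq), "k3", "x")],
        "batch3.of4": [(next(seq), "k7", "second"), (next(seq), "k7", "third")],
    }
    for seqid, key, value in reorder_spilled_batch(spill_files, lambda k: k == "k7"):
        print(seqid, key, value)              # 0 first, 2 second, 3 third -- in order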
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
>> But our idea of the number of batches needed can change during that process, resulting in some inner tuples being initially assigned to the wrong temp file. This would also be true for hashagg.

> So we correct that before we start reading the outer table.

Why? That would require a useless additional pass over the data. With the current design, we can process and discard at least *some* of the data in a temp file when we read it, but a reorganization pass would mean that it *all* goes back out to disk a second time.

Also, you assume that we can accurately tell how many tuples will fit in memory in advance of actually processing them --- a presumption clearly false in the hashagg case, and not that easy to do even for hashjoin. (You can tell the overall size of a temp file, sure, but how do you know how it will split when the batch size changes? A perfectly even split is unlikely.)

> OK, I see what you mean. Sounds like we should have a new definition for Aggregates, "Sort Insensitive", that allows them to work when the input ordering does not affect the result, since that case can be optimised much better when using HashAgg.

Please don't propose pushing this problem onto the user until it's demonstrated that there's no other way. I don't want to become the next Oracle, with forty zillion knobs that it takes a highly trained DBA to deal with.

> But all of them sound ugly.

I was thinking along the lines of having multiple temp files per hash bucket. If you have a tuple that needs to migrate from bucket M to bucket N, you know that it arrived before every tuple that was assigned to bucket N originally, so put such tuples into a separate temp file and process them before the main bucket-N temp file. This might get a little tricky to manage after multiple hash resizings, but in principle it seems doable.

regards, tom lane
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order. (This does not matter for the standard aggregates but it definitely does for custom aggregates, and there will be unhappy villagers appearing on our doorsteps if we break it.) I think this can work correctly under the above sketch but it needs to be verified. It might require different handling of the TODO files than what hashjoin does.

You almost had me there... but there isn't any "arrival order".

The sort that precedes an aggregation only sorts on the GROUP BY columns, not on additional columns - so by the SQL standard there is not a guaranteed ordering of the data into an aggregate. That is exactly what windowed aggregates are for. (There isn't any way of specifying an ORDER BY yet either.)

The only way of doing this is by using a derived table:

  select a, sum(b) from (select a, b order by a, b);

but AFAICS this is not part of the standard??

It is highly likely that rows are clumped together, but there just isn't any guarantee that is the case. Any update of any row would change the arrival order. Should we support something that has worked by luck?

I've been looking into windowed aggregates; these will provide this functionality should people require it. I don't see how we'd be able to do windowed aggregates and hashAgg at the same time, so this seems less relevant.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
>> The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order.

> You almost had me there... but there isn't any "arrival order".

The fact that it's not in the spec doesn't mean we don't support it. Here are a couple of threads on the subject:

http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php

Per the second message, this has worked since 7.4, and it was requested fairly often before that.

> Should we support something that has worked by luck?

No luck about it, and yes people are depending on it. You don't get to break it just because it's not in the spec.

regards, tom lane
On Tue, 2006-01-17 at 14:41 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> >> The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order.
>
> > You almost had me there... but there isn't any "arrival order".
>
> The fact that it's not in the spec doesn't mean we don't support it. Here are a couple of threads on the subject:
> http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
> http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php
>
> Per the second message, this has worked since 7.4, and it was requested fairly often before that.

OK.... My interest was in expanding the role of HashAgg, which as Rod says can be used to avoid the sort, so the overlap between those ideas was low anyway.

On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> I was thinking along the lines of having multiple temp files per hash bucket. If you have a tuple that needs to migrate from bucket M to bucket N, you know that it arrived before every tuple that was assigned to bucket N originally, so put such tuples into a separate temp file and process them before the main bucket-N temp file. This might get a little tricky to manage after multiple hash resizings, but in principle it seems doable.

OK, so we do need to do this when we have a defined arrival order: this idea is the best one so far. I don't see any optimization of this by ignoring the arrival order, so it seems best to preserve the ordering this way in all cases.

You can manage that with file naming. Rows moved from batch N to batch M would be renamed N.M, so you'd be able to use file ordering to retrieve all files for *.M

That scheme would work for multiple splits too, so that filenames could grow yet retain their sort order and final target batch properties.

Best Regards, Simon Riggs
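A tiny sketch of the naming idea (illustrative Python; the helper and the sample names are invented): a spill file is named by the chain of batches its rows have passed through, ending in the batch that will finally process them, so all files for target batch M can be picked out as "*.M" and sorted on their numeric components. Which order those files must then be replayed in is exactly the subtle part being discussed; the sketch only shows the selection and the name-based sort.

    def files_for_target_batch(filenames, target):
        """Select every spill file whose final component is the target batch
        and sort on the numeric name components, so the migration history
        stays visible and the ordering of names is stable across splits."""
        def components(name):
            return tuple(int(part) for part in name.split("."))

        chosen = [f for f in filenames if components(f)[-1] == target]
        return sorted(chosen, key=components)

    names = ["1", "3", "1.3", "2.3", "1.2.3", "2", "1.2"]
    print(files_for_target_batch(names, 3))   # ['1.2.3', '1.3', '2.3', '3']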
On Tue, 2006-01-17 at 21:43 +0000, Simon Riggs wrote:
> OK.... My interest was in expanding the role of HashAgg, which as Rod says can be used to avoid the sort, so the overlap between those ideas was low anyway.

Am I right in thinking that HashAgg would almost always be quicker than SortAgg, even for large (> memory) aggregation sets? (Except where the prior ordering has already been forced via an ORDER BY).

If that is so, then I will probably look to work on this sooner, especially since we seem to have a clear design. I'd originally viewed the spill-to-disk logic as a safety measure rather than as a performance feature.

Best Regards, Simon Riggs
On Tue, 2006-01-17 at 21:43 +0000, Simon Riggs wrote:
> On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> > I was thinking along the lines of having multiple temp files per hash bucket. If you have a tuple that needs to migrate from bucket M to bucket N, you know that it arrived before every tuple that was assigned to bucket N originally, so put such tuples into a separate temp file and process them before the main bucket-N temp file. This might get a little tricky to manage after multiple hash resizings, but in principle it seems doable.
>
> You can manage that with file naming. Rows moved from batch N to batch M would be renamed N.M, so you'd be able to use file ordering to retrieve all files for *.M
>
> That scheme would work for multiple splits too, so that filenames could grow yet retain their sort order and final target batch properties.

This seems to lead to a super-geometric progression in the number of files required, if we assume that the current batch could be redistributed to all future batches, each of which could be similarly redistributed:

  batches    files
  1          no files
  2          1
  4          7
  8          64
  16         64,000
  32         4 billion-ish

So it does seem important whether we demand sorted input or not. Or at least it requires us to provide the executor with a starting point for the number of batches, so we could manage that.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> This seems to lead to a super-geometric progression in the number of files required,

But we double the number of batches at each step, so there are going to be at most 20 or so levels, and that's only assuming a *horridly* wrong initial guess by the planner. In practice I think it's reasonable to assume at most a couple rounds of doubling. If you have more than that, the extra data-shuffling is going to exhaust your patience anyway.

regards, tom lane
On Thu, 2006-01-19 at 18:38 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > This seems to lead to a super-geometric progression in the number of files required,
>
> But we double the number of batches at each step, so there are going to be at most 20 or so levels, and that's only assuming a *horridly* wrong initial guess by the planner. In practice I think it's reasonable to assume at most a couple rounds of doubling. If you have more than that, the extra data-shuffling is going to exhaust your patience anyway.

What I'm saying is that if we start from 1 batch and move dynamically upwards we quickly get an unmanageable number of files. However, if we start at a particular number N, then we start with N-1 files, then move to at most 2N(N-1) files, etc. So we can only "get it wrong" and double the number of batches about twice before we get swamped with files; i.e. if we start at 1 we can only reasonably get to 8 batches.

So we should start at a number higher than 1, attempting to make an accurate guess about the number of batches (N) required. If we have R rows to aggregate and we get N correct, then the cost of the HashAgg is 2*R*(N-1)/N I/Os, which is cheaper than a sort, for *any* value of R, for both CPU and I/O costs.

If we get it wrong, we have to read and re-write more and more rows, which could eventually surpass the sort costs, especially if we have growing transition state data from the aggregate. I think the cost will be to re-write half of all rows already written when we double N. If we fail early because we got Ndistinct wrong then this could be cheap, though if we fail later on because of a growing aggregate then this could easily be very expensive and quickly exceed the cost of a sort.

My thought is to collect statistics about an aggregate at CREATE AGGREGATE time. Simply send the aggregate 100 data values and see if the output varies in size according to the input; if it does, we take much greater care about selecting HashAgg plans with that aggregate. ...and that way we don't need the user to define the aggregate type directly. This would only work with aggregates that return well known datatypes such as int or char.

So getting the number of groups correct would be critical to making this work, but HashAgg could be effective for even very large aggregates.

Any holes in that thinking?

Best Regards, Simon Riggs
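Plugging the thread's own example numbers into that formula (nothing more than arithmetic, shown here as a quick Python check): rows that fall into the in-memory batch are never written, and every other row is written once and read back once.

    R = 3_000_000_000                 # rows to aggregate (the 3B-row example)
    for N in (1, 2, 10, 100):         # number of batches, assuming it was guessed correctly
        io = 2 * R * (N - 1) / N      # row writes + reads for the spilled fraction
        print("N = %3d: %.2f billion row I/Os" % (N, io / 1e9))
    # N = 1 costs nothing extra, and the penalty flattens out near 2*R as N grows,
    # which is the comparison with sort costs being made above.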
Simon Riggs <simon@2ndquadrant.com> writes: > Any holes in that thinking? Only that it's about five times more complicated than is currently known to be necessary ;-). How about we just implement the dynamic spill to disk first, and not bother with the other stuff until we see problems in the field? Saying "we have to do all this" is a good recipe for not getting any of it done. regards, tom lane