Thread: Group-count estimation statistics
I got a complaint from a fellow Red Hatter that PG 8.0 is way slower
than 7.4 on some statistical analysis tasks he was doing.  Investigation
showed that the primary problem was selection of Sort/GroupAgg in place
of HashAgg to compute some grouped aggregates.  Essentially he was doing

    select sum(), avg() ... from temp_table
    group by customer_id, item_id, month, year;

where the temp table had just been constructed and hadn't been analyzed
in any way.  (Of course I told him "so analyze" ... but it didn't help.)

In 7.4, the utter lack of any information about the source table caused
the planner to assume only 1000 rows, so of course the estimated hash
table size was plenty small enough to fit in sort_mem.  In 8.0, the
planner looks at the true physical size of the table, and even without
any stats comes out with a rowcount estimate tolerably close to the
actual value (about 10M rows).  With or without stats it then estimates
that there will also be about 10M groups, under which conditions even
raising sort_mem to the max won't persuade it to hash-aggregate.
(However the actual number of groups is about 0.2M, and hash aggregation
is really a good bit faster than sorting.)

The reason this happens even with stats is that the algorithm for
estimating the number of groups in a multi-group-column situation is
basically "take the product of the number of distinct values of each
grouping column, but clamp to the number of rows estimated for the whole
table".  It turns out that customer_id and item_id are pretty highly
correlated, enough that the actual number of distinct groups is barely a
hundredth of what you get from the product rule.

The only real solution, of course, is to acquire cross-column
statistics, but I don't see that happening in the near future.

As a short-term hack, I am thinking that the "clamp to size of table"
part of the rule is overly pessimistic, and that we should consider
something like "clamp to size of table / 10" instead.  The justification
for this is the thought that you aren't going to bother grouping unless
it actually reduces the data volume.  We have used similar rules in the
past --- for example, before the logic for trying to estimate actual
group counts was put in, the estimate for the number of output rows from
an Agg or Group node was just the number of input rows over 10.

Essentially I think that that rule wasn't too bad, and that we should
allow the statistics-based estimation code to reduce that estimate but
not increase it --- at least not when there's more than one variable
involved.  I have some faith in the stats-based approach for a single
column but hardly any for multiple columns, because of the correlation
issue.

Comments?

			regards, tom lane
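For concreteness, the estimation rule and the proposed change can be
sketched as follows.  This is an illustrative sketch only, not the
planner's actual code; the function name and arguments are invented for
the example.

    #include <stddef.h>

    /*
     * Sketch of the multi-column group-count estimate: product of the
     * per-column distinct counts, clamped.  The proposed short-term hack
     * is to clamp to ntuples / 10 rather than ntuples when more than one
     * grouping column is involved.
     */
    static double
    estimate_num_groups_sketch(const double *ndistinct, size_t ncols,
                               double ntuples)
    {
        double      groups = 1.0;
        double      clamp = ntuples;
        size_t      i;

        /* product rule: assumes the grouping columns are independent */
        for (i = 0; i < ncols; i++)
            groups *= ndistinct[i];

        /* proposed: divide the clamp by 10 for multiple grouping columns */
        if (ncols > 1)
            clamp *= 0.1;

        if (groups > clamp)
            groups = clamp;

        return groups;
    }

On the 10M-row table above, the old rule lets the estimate run all the
way up to 10M groups; the proposed clamp caps a multi-column estimate at
1M, which is still above the true 0.2M but an order of magnitude closer.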
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

That'd be nice, but sounds like a lot of work.

> As a short-term hack, I am thinking that the "clamp to size of table"
> part of the rule is overly pessimistic, and that we should consider
> something like "clamp to size of table / 10" instead.  The justification
> for this is the thought that you aren't going to bother grouping unless
> it actually reduces the data volume.  We have used similar rules in the

I definitely agree with this.

	Stephen
Tom Lane <tgl@sss.pgh.pa.us> writes:
> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table".  It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.

So why is it any more reasonable for Postgres to assume 0 correlation
than any other value?  Perhaps Postgres should calculate these cases
assuming some arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the most
selective value, then multiply -- instead of by the next most selective --
just the square root of the next value, then the cube root of the third
value, and the fourth root of the fourth value, etc.

So for 10M records grouped over three columns, each of which has 1,000
distinct values, you would get 1,000 * 1,000 ^ 1/2 * 1,000 ^ 1/3 or
316,228 distinct values.  Which seems like not a bad guess actually.

To be honest I took the "successive roots" thing out of nowhere.  I
suspect there's a more rigorous statistics approach which would have some
actual motivation.  For a given assumed correlation and distribution
there ought to be a way to calculate the expected number of distinct
combinations.  Then when we have some mechanism for providing real
correlation we can just plug that in in place of the arbitrarily assumed
correlation.

Actually the formula would be quite complex.  As the total number of
records goes up the expected number of distinct values should approach
the total number of records, even if the number of distinct values of
each column doesn't change.

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution: if Postgres kept statistics on the
actual results of the query it could later use that feedback to come up
with better guesses even if it doesn't know *why* they're better.

-- 
greg
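Greg's "successive roots" heuristic, taken purely as an illustration
(none of this exists in the tree; the function and its name are made up),
might look like this: sort the per-column distinct counts in descending
order and damp the i-th one by taking its i-th root before multiplying.

    #include <math.h>
    #include <stdlib.h>

    /* sort doubles in descending order */
    static int
    cmp_desc(const void *a, const void *b)
    {
        double      da = *(const double *) a;
        double      db = *(const double *) b;

        return (da < db) - (da > db);
    }

    /*
     * Illustrative only: the most-distinct column is counted in full,
     * the square root of the second, the cube root of the third, etc.
     */
    static double
    successive_roots_estimate(double *ndistinct, size_t ncols)
    {
        double      groups = 1.0;
        size_t      i;

        qsort(ndistinct, ncols, sizeof(double), cmp_desc);
        for (i = 0; i < ncols; i++)
            groups *= pow(ndistinct[i], 1.0 / (double) (i + 1));
        return groups;
    }

For three columns of 1,000 distinct values each, this gives
1000 * 1000^(1/2) * 1000^(1/3), i.e. roughly 316,228, matching the figure
in the message above.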
Greg Stark <gsstark@mit.edu> writes:
> So why is it any more reasonable for Postgres to assume 0 correlation
> than any other value?  Perhaps Postgres should calculate these cases
> assuming some arbitrary level of correlation.

[ shrug... ]  Sure, if you want to do the legwork to develop something
credible.  But I think I'd still throw in the number-of-rows-over-10
clamp, or something much like it.

> As the total number of records goes up the expected number of distinct
> values should approach the total number of records, even if the number
> of distinct values of each column doesn't change.

Well, that's what I thought when I wrote the existing code, but it's
wrong: you don't GROUP BY unique combinations of columns over huge
tables --- or at least, you shouldn't expect great performance if you do.
It'd probably be more reasonable to use a heuristic that expects a
*smaller* fraction of distinct combinations, instead of a larger one, as
the table size goes up.

> There's another possible solution: if Postgres kept statistics on the
> actual results of the query it could later use that feedback to come up
> with better guesses even if it doesn't know *why* they're better.

That's been proposed before but I think it's a blind alley.  In most
cases (certainly with anything as complex as a multiply grouped query)
you're not going to be able to derive any trustworthy corrections to
your original statistical estimates.  There are too many variables, and
their relationships to the end costs are not simple.

			regards, tom lane
On Fri, 28 Jan 2005, Tom Lane wrote:

> you don't GROUP BY unique combinations of columns over huge
> tables --- or at least, you shouldn't expect great performance if you do.

The proposed change biases towards a hash plan, which has no provision
for spilling to disk.  Slow is one thing, but excessive memory usage and
possibly failing is another thing.

Kris Jurka
Kris Jurka <books@ejurka.com> writes:
> The proposed change biases towards a hash plan, which has no provision
> for spilling to disk.  Slow is one thing, but excessive memory usage and
> possibly failing is another thing.

Keep in mind that we are replacing 7.4 code that had a serious tendency
to select hash plans when it really shouldn't, because of underestimated
table sizes.  Now that we have the physical-size-driven estimate of table
rowcounts, I think we've gone too far over in the other direction.

Which is not to say that spill-to-disk logic wouldn't be a nice thing to
add, but I'm not going to panic about its not being there today.  7.4
presented a much greater hazard than we have now, but we got through that
cycle without a large number of complaints.  There's always the
enable_hashagg switch if you really find yourself backed into a corner.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

    Tom> The only real solution, of course, is to acquire cross-column
    Tom> statistics, but I don't see that happening in the near
    Tom> future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory, at which time we start spilling to disk.  In
other words, no longer use SortAgg at all ..

Under what circumstances will a SortAgg consume more IOs than a hybrid
hash strategy?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
> From: Sailesh Krishnamurthy <sailesh@cs.berkeley.edu>
>
> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>     Tom> The only real solution, of course, is to acquire cross-column
>     Tom> statistics, but I don't see that happening in the near
>     Tom> future.
>
> Another approach is a hybrid hashing scheme where we use a hash table
> until we run out of memory, at which time we start spilling to disk.
> In other words, no longer use SortAgg at all ..
>
> Under what circumstances will a SortAgg consume more IOs than a hybrid
> hash strategy?

Goetz Graefe did a heck of a lot of analysis of this, prior to his being
snapped up by Microsoft.  He also worked out a lot of the nitty-gritty
for hybrid hash algorithms, extending the Grace hash for spill-to-disk,
and adding a kind of recursion for really huge sets.  The figures say
that hybrid hash beats sort-aggregate, across the board.
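To make the hybrid-hash idea concrete, here is a toy sketch (counting
occurrences of integer keys) of the general spill strategy being
discussed.  It is emphatically not the executor code: a real
implementation would hash-partition the spilled tuples and recurse,
rather than hope that a single second pass fits in memory.

    #include <stdio.h>

    #define MAX_GROUPS 4        /* stand-in for the sort_mem limit */

    typedef struct { int key; long count; } Group;

    static Group    groups[MAX_GROUPS];
    static int      ngroups = 0;

    /* Return the group for key, creating it if there is room; NULL if full. */
    static Group *
    lookup_or_add(int key)
    {
        int     i;

        for (i = 0; i < ngroups; i++)
            if (groups[i].key == key)
                return &groups[i];
        if (ngroups < MAX_GROUPS)
        {
            groups[ngroups].key = key;
            groups[ngroups].count = 0;
            return &groups[ngroups++];
        }
        return NULL;
    }

    int
    main(void)
    {
        int     input[] = {1, 2, 1, 3, 4, 5, 6, 2, 5, 7, 1};
        size_t  ninput = sizeof(input) / sizeof(input[0]);
        FILE   *spill = tmpfile();
        size_t  i;
        int     key;

        if (spill == NULL)
            return 1;

        /* Pass 1: aggregate in memory; once the table is full, write
         * tuples belonging to new groups to the spill file. */
        for (i = 0; i < ninput; i++)
        {
            Group  *g = lookup_or_add(input[i]);

            if (g)
                g->count++;
            else
                fwrite(&input[i], sizeof(int), 1, spill);
        }
        for (i = 0; i < (size_t) ngroups; i++)
            printf("key %d: count %ld\n", groups[i].key, groups[i].count);

        /* Pass 2: aggregate the spilled tuples.  (Real code would
         * partition and recurse if they still didn't fit.) */
        ngroups = 0;
        rewind(spill);
        while (fread(&key, sizeof(int), 1, spill) == 1)
        {
            Group  *g = lookup_or_add(key);

            if (g)
                g->count++;
        }
        for (i = 0; i < (size_t) ngroups; i++)
            printf("key %d: count %ld\n", groups[i].key, groups[i].count);

        fclose(spill);
        return 0;
    }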
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> we should consider
> something like "clamp to size of table / 10" instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> we should consider
>> something like "clamp to size of table / 10" instead.

> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Already done that way.

        /*
         * Clamp to size of rel, or size of rel / 10 if multiple Vars.
         * The fudge factor is because the Vars are probably correlated
         * but we don't know by how much.
         */
        double      clamp = rel->tuples;

        if (relvarcount > 1)
            clamp *= 0.1;
        if (reldistinct > clamp)
            reldistinct = clamp;

			regards, tom lane
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Already done that way.
>         if (relvarcount > 1)
>             clamp *= 0.1;

That's not what I meant.  I tried to say that if we have a GROUP BY on
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get fewer than that many groups.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> That's not what I meant.  I tried to say that if we have a GROUP BY on
> several columns and one of these columns alone has more than N/10
> distinct values, there's no way to get fewer than that many groups.

Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
Any objections?

			regards, tom lane
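As a sketch of what the "max" refinement might look like (illustration
only, not a patch -- the real logic is the fragment quoted two messages
up, and max_var_ndistinct here is a hypothetical input holding the
largest single-column distinct count among the grouping Vars):

    static double
    clamp_group_estimate(double reldistinct, double tuples,
                         int relvarcount, double max_var_ndistinct)
    {
        double      clamp = tuples;

        if (relvarcount > 1)
        {
            clamp *= 0.1;       /* existing fudge factor for correlation */
            /* can't have fewer groups than the most distinct column */
            if (clamp < max_var_ndistinct)
                clamp = max_var_ndistinct;
        }
        if (reldistinct > clamp)
            reldistinct = clamp;
        return reldistinct;
    }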
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Manfred Koizar <mkoi-pg@aon.at> writes:
>> That's not what I meant.  I tried to say that if we have a GROUP BY on
>> several columns and one of these columns alone has more than N/10
>> distinct values, there's no way to get fewer than that many groups.
>
> Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
> Any objections?

Yes. :-(  What I said is only true in the absence of any WHERE clause
(or join).  Otherwise the same cross-column correlation issues you tried
to work around with the N/10 clamping might come back through the
backdoor.  I'm not sure whether coding for such a narrow use case is
worth the trouble.  Forget my idea.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
>> Any objections?

> Yes. :-(  What I said is only true in the absence of any WHERE clause
> (or join).  Otherwise the same cross-column correlation issues you tried
> to work around with the N/10 clamping might come back through the
> backdoor.  I'm not sure whether coding for such a narrow use case is
> worth the trouble.  Forget my idea.

No, I think it's still good.  The WHERE clauses are factored in
separately (essentially by assuming their selectivity on the grouped
rows is the same as it would be on the raw rows, which is pretty bogus,
but it's hard to do better).  The important point is that the group
count before WHERE filtering certainly does behave as you suggest, and
so the clamp is going to be overoptimistic if it clamps to less than the
largest individual number-of-distinct-values.

			regards, tom lane