Group-count estimation statistics

From
Tom Lane
Date:
I got a complaint from a fellow Red Hatter that PG 8.0 is way slower
than 7.4 on some statistical analysis tasks he was doing.  Investigation
showed that the primary problem was selection of Sort/GroupAgg in place
of HashAgg to compute some grouped aggregates.  Essentially he was doing
select sum(), avg() ... from temp_table
group by customer_id, item_id, month, year;

where the temp table had just been constructed and hadn't been analyzed
in any way.  (Of course I told him "so analyze" ... but it didn't help.)

In 7.4, the utter lack of any information about the source table caused
the planner to assume only 1000 rows, so of course the estimated hash
table size was plenty small enough to fit in sort_mem.  In 8.0, the
planner looks at the true physical size of the table, and even without
any stats comes out with a rowcount estimate tolerably close to the
actual value (about 10M rows).  With or without stats it then estimates
that there will also be about 10M groups, under which conditions even
raising sort_mem to the max won't persuade it to hash-aggregate.
(However the actual number of groups is about 0.2M, and hash aggregation
is really a good bit faster than sorting.)

The reason this happens even with stats is that the algorithm for
estimating the number of groups in a multi-group-column situation
is basically "take the product of the number of distinct values of
each grouping column, but clamp to the number of rows estimated for
the whole table".  It turns out that customer_id and item_id are
pretty highly correlated, enough that the actual number of distinct
groups is barely a hundredth of what you get from the product rule.
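To make the failure mode concrete, here is a toy standalone C program (not
the real selfuncs.c code; the per-column ndistinct figures are made up for
illustration) that applies the product-and-clamp rule to the example:

#include <stdio.h>

int
main(void)
{
    double      reltuples = 10000000.0; /* ~10M rows in the temp table */
    /* per-column ndistinct estimates (illustrative numbers only) */
    double      ndistinct[] = {50000.0, 20000.0, 12.0, 5.0};
    double      est = 1.0;
    int         i;

    /* product rule: assume the grouping columns are independent */
    for (i = 0; i < 4; i++)
        est *= ndistinct[i];

    /* clamp to the number of rows estimated for the whole table */
    if (est > reltuples)
        est = reltuples;

    /* prints 10000000: the clamp is all that's left of the estimate,
     * versus roughly 200000 actual groups in the complaint */
    printf("estimated groups: %.0f\n", est);
    return 0;
}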

The only real solution, of course, is to acquire cross-column
statistics, but I don't see that happening in the near future.

As a short-term hack, I am thinking that the "clamp to size of table"
part of the rule is overly pessimistic, and that we should consider
something like "clamp to size of table / 10" instead.  The justification
for this is the thought that you aren't going to bother grouping unless
it actually reduces the data volume.  We have used similar rules in the
past --- for example, before the logic for trying to estimate actual
group counts was put in, the estimate for the number of output rows
from an Agg or Group node was just the number of input rows over 10.
Essentially I think that that rule wasn't too bad, and that we should
allow the statistics-based estimation code to reduce that estimate but
not increase it --- at least not when there's more than one variable
involved.  I have some faith in the stats-based approach for a single
column but hardly any for multi columns, because of the correlation
issue.
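In terms of the toy program above, the proposed hack would just replace the
clamp step with something like this (sketch only; ngroupcols standing in for
the number of GROUP BY columns):

    /* proposed rule: with more than one grouping column, clamp to a
     * tenth of the table instead of the whole table, so the 10M-row
     * example would be capped at 1M estimated groups */
    double      clamp = reltuples;

    if (ngroupcols > 1)
        clamp *= 0.1;
    if (est > clamp)
        est = clamp;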

Comments?
        regards, tom lane


Re: Group-count estimation statistics

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

That'd be nice, but it sounds like a lot of work.

> As a short-term hack, I am thinking that the "clamp to size of table"
> part of the rule is overly pessimistic, and that we should consider
> something like "clamp to size of table / 10" instead.  The justification
> for this is the thought that you aren't going to bother grouping unless
> it actually reduces the data volume.  We have used similar rules in the

I definitely agree with this.
Stephen

Re: Group-count estimation statistics

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table".  It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.

So why is it any more reasonable for Postgres to assume 0 correlation than any
other value?  Perhaps Postgres should calculate these cases assuming some
arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the number of
distinct values of the most selective column, then multiply not by the next
column's distinct count itself but by its square root, then by the cube root
of the third column's count, the fourth root of the fourth, and so on.

So for 10M records grouped over three columns, each of which has 1,000
distinct values, you would get 1,000 * 1,000^(1/2) * 1,000^(1/3), or about
316,228 distinct values, which seems like not a bad guess actually.
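For what it's worth, a quick standalone C check of that arithmetic (just the
"successive roots" rule as described, nothing more):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    /* three grouping columns, 1,000 distinct values each, ordered from
     * most distinct to least (here they're all equal anyway) */
    double      ndistinct[] = {1000.0, 1000.0, 1000.0};
    double      est = 1.0;
    int         i;

    /* first column counted in full, the second by its square root,
     * the third by its cube root, and so on */
    for (i = 0; i < 3; i++)
        est *= pow(ndistinct[i], 1.0 / (i + 1));

    /* prints ~316228 */
    printf("estimated distinct combinations: %.0f\n", est);
    return 0;
}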

To be honest I took the "successive roots" thing out of nowhere. I suspect
there's a more rigorous statistics approach which would have some actual
motivation. For a given assumed correlation and distribution there ought to be a
way to calculate the expected number of distinct combinations. Then when we
have some mechanism for providing real correlation we can just plug that in in
place of the arbitrarily assumed correlation.

Actually the formula would be quite complex. As the total number of records
goes up the expected number of distinct values should approach the total
number of records, even if the number of distinct values of each column
doesn't change.


> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution: if Postgres kept statistics on the actual
results of the query it could later use that feedback to come up with better
guesses even if it doesn't know *why* they're better.


-- 
greg



Re: Group-count estimation statistics

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> So why is it any more reasonable for Postgres to assume 0 correlation than any
> other value? Perhaps Postgres should calculate these cases assuming some
> arbitrary level of correlation.

[ shrug... ]  Sure, if you want to do the legwork to develop something
credible.  But I think I'd still throw in the number-of-rows-over-10
clamp, or something much like it.

> As the total number of records
> goes up the expected number of distinct values should approach the total
> number of records, even if the number of distinct values of each column
> doesn't change.

Well, that's what I thought when I wrote the existing code, but it's
wrong: you don't GROUP BY unique combinations of columns over huge
tables --- or at least, you shouldn't expect great performance if you do.
It'd probably be more reasonable to use a heuristic that expects a
*smaller* fraction of distinct combinations, instead of a larger one,
as the table size goes up.

> There's another possible solution, if Postgres kept statistics on the actual
> results of the query it could later use that feedback to come up with better
> guesses even if it doesn't know *why* they're better.

That's been proposed before but I think it's a blind alley.  In most
cases (certainly with anything as complex as a multiply grouped query)
you're not going to be able to derive any trustworthy corrections to
your original statistical estimates.  There are too many variables and
their relationships to the end costs are not simple.
        regards, tom lane


Re: Group-count estimation statistics

From
Kris Jurka
Date:

On Fri, 28 Jan 2005, Tom Lane wrote:

> you don't GROUP BY unique combinations of columns over huge
> tables --- or at least, you shouldn't expect great performance if you do.

The proposed change biases towards a hash plan which has no provision for
spilling to disk.  Slow is one thing, but excessive memory usage and
possibly failing is another thing.

Kris Jurka


Re: Group-count estimation statistics

From
Tom Lane
Date:
Kris Jurka <books@ejurka.com> writes:
> The proposed change biases towards a hash plan which has no provision for
> spilling to disk.  Slow is one thing, but excessive memory usage and
> possibly failing is another thing.

Keep in mind that we are replacing 7.4 code that had a serious tendency
to select hash plans when it really shouldn't, because of underestimated
table sizes.  Now that we have the physical-size-driven estimate of
table rowcounts, I think we've gone too far over in the other direction.

Which is not to say that spill-to-disk logic wouldn't be a nice thing to
add, but I'm not going to panic about its not being there today.  7.4
presented a much greater hazard than we have now, but we got through
that cycle without a large number of complaints.  There's always the
enable_hashagg switch if you really find yourself backed into a corner.
        regards, tom lane


Re: Group-count estimation statistics

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
    Tom> The only real solution, of course, is to acquire cross-column
    Tom> statistics, but I don't see that happening in the near future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory, at which time we start spilling to disk.  In
other words, no longer use SortAgg at all ...

Under what circumstances will a SortAgg consume more IOs than a
hybrid hash strategy?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: Group-count estimation statistics

From
Mischa
Date:
> From: Sailesh Krishnamurthy <sailesh@cs.berkeley.edu>
> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
>     Tom> The only real solution, of course, is to acquire cross-column
>     Tom> statistics, but I don't see that happening in the near
>     Tom> future.
> 
> Another approach is a hybrid hashing scheme where we use a hash table
> until we run out of memory at which time we start spilling to disk. In
> other words, no longer use SortAgg at all ..
> 
> Under what circumstances will a SortAgg consume more IOs than a
> hybrid hash strategy?

Goetz Graefe did a heck of a lot of analysis of this, prior to his being snapped
up by Microsoft. He also worked out a lot of the nitty-gritty for hybrid hash
algorithms, extending the Grace hash for spill-to-disk, and adding a kind of
recursion for really huge sets. The figures say that hybrid hash beats
sort-aggregate, across the board. 



Re: Group-count estimation statistics

From
Manfred Koizar
Date:
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> we should consider
>something like "clamp to size of table / 10" instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred


Re: Group-count estimation statistics

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> we should consider
>> something like "clamp to size of table / 10" instead.

> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Already done that way.
            /*
             * Clamp to size of rel, or size of rel / 10 if multiple Vars.
             * The fudge factor is because the Vars are probably correlated
             * but we don't know by how much.
             */
            double      clamp = rel->tuples;

            if (relvarcount > 1)
                clamp *= 0.1;
            if (reldistinct > clamp)
                reldistinct = clamp;

           regards, tom lane


Re: Group-count estimation statistics

From
Manfred Koizar
Date:
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Already done that way.
>            if (relvarcount > 1)
>                clamp *= 0.1;

That's not what I meant.  I tried to say that if we have a GROUP BY
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get less than that many groups.

Servus
 Manfred


Re: Group-count estimation statistics

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> That's not what I meant.  I tried to say that if we have a GROUP BY
> several columns and one of these columns alone has more than N/10
> distinct values, there's no way to get less than that many groups.

Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
Any objections?
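Roughly, against the snippet quoted upthread (sketch only, not a tested
patch; max_single_ndistinct standing in for the largest per-column estimate,
which the current code doesn't track separately):

            double      clamp = rel->tuples;

            if (relvarcount > 1)
            {
                clamp *= 0.1;
                /* never clamp below the largest individual
                 * number-of-distinct-values among the grouping Vars */
                if (clamp < max_single_ndistinct)
                    clamp = max_single_ndistinct;
            }
            if (reldistinct > clamp)
                reldistinct = clamp;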
        regards, tom lane


Re: Group-count estimation statistics

From
Manfred Koizar
Date:
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Manfred Koizar <mkoi-pg@aon.at> writes:
>> That's not what I meant.  I tried to say that if we have a GROUP BY
>> several columns and one of these columns alone has more than N/10
>> distinct values, there's no way to get less than that many groups.
>
>Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
>Any objections?

Yes.  :-(  What I said is only true in the absence of any WHERE clause
(or join).  Otherwise the same cross-column correlation issues you tried
to work around with the N/10 clamping might come back through the
backdoor.  I'm not sure whether coding for such a narrow use case is
worth the trouble.  Forget my idea.

Servus
 Manfred


Re: Group-count estimation statistics

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, I see, you want a "max" calculation in there too.  Seems reasonable.
>> Any objections?

> Yes.  :-(  What I said is only true in the absence of any WHERE clause
> (or join).  Otherwise the same cross-column correlation issues you tried
> to work around with the N/10 clamping might come back through the
> backdoor.  I'm not sure whether coding for such a narrow use case is
> worth the trouble.  Forget my idea.

No, I think it's still good.  The WHERE clauses are factored in
separately (essentially by assuming their selectivity on the grouped
rows is the same as it would be on the raw rows, which is pretty bogus
but it's hard to do better).  The important point is that the group
count before WHERE filtering certainly does behave as you suggest,
and so the clamp is going to be overoptimistic if it clamps to less than
the largest individual number-of-distinct-values.
        regards, tom lane