
When to bump up statistics?

From: Dawid Kuroczko
ALTER TABLE foo ALTER COLUMN bar SET STATISTICS n; .....

I wonder what the implications of using this statement are.
I know that by using, say, n=100, ANALYZE will take more time,
pg_statistics will be bigger, and the planner will take longer,
but on the other hand it will make better decisions... Etc, etc.
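
For concreteness, a minimal example of what I mean (foo, bar and the
target value 100 are just placeholders):

    -- raise the per-column statistics target from the default (10)
    ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 100;
    -- the new target only takes effect at the next ANALYZE
    ANALYZE foo;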

I wonder, however, when it is most useful to bump it up.
Please tell me what you think about it:

Is bumping up statistics only useful for indexed columns?

When is it most useful/beneficial to bump them up:

1) huge table with huge number of distinct values (_almost_
    unique ;))

2) huge table with relatively equally distributed values
    (like each value is in between, say, 30-50 rows).

3) huge table with unequally distributed values (some
    values are in 1-5 rows, some are in 1000-5000 rows).

4) huge table with a small number of values (around ~100
    distinct values, equally or unequally distributed).

5) boolean column.

I think SET STATISTICS 100 is very useful for the case with
unequally distributed values, but I wonder about
the other cases.  And as a side note -- what are the
reasonable bounds for statistics (between 10 and 100?)

What are the runtime implications of setting statistics
too large -- how much can it affect queries?

And finally -- how do other RDBMSes and RDBMS-like systems
deal with this issue? :)

   Regards,
      Dawid

Re: When to bump up statistics?

From: Josh Berkus
Dawid,

> I wonder what the implications of using this statement are.
> I know that by using, say, n=100, ANALYZE will take more time,
> pg_statistics will be bigger, and the planner will take longer,
> but on the other hand it will make better decisions... Etc, etc.

Yep.   And pg_statistics will need to be vacuumed more often.

> Is bumping up statistics only useful for indexed columns?

No.   It's potentially useful for any queried column.

> 1) huge table with huge number of distinct values (_almost_
>     unique ;))

Yes.

> 2) huge table with relatively equally distributed values
>     (like each value is in between, say, 30-50 rows).

Not usually.

> 3) huge table with unequally distributed values (some
>     values are in 1-5 rows, some are in 1000-5000 rows).

Yes.

> 4) huge table with a small number of values (around ~100
>     distinct values, equally or unequally distributed).

Not usually, especially if they are equally distributed.

> 5) boolean column.

Almost never, just as it is seldom useful to index a boolean column.

> I think SET STATISTICS 100 is very useful for the case with
> unequally distributed values, but I wonder about
> the other cases.  And as a side note -- what are the
> reasonable bounds for statistics (between 10 and 100?)

Oh, no, I've used values up to 500 in production, and we've tested up to the
max on DBT-3.    In my experience, if the default (10) isn't sufficient, you
often have to go up to > 250 to get a different plan.
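
For example, to try that on a single column (the table and column names
here are just hypothetical):

    -- bump one column well past the default, re-gather stats,
    -- then see whether the plan actually changes
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 250;
    ANALYZE orders;
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;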

> What are the runtime implications of setting statistics
> too large -- how much can it affect queries?

It won't affect select queries.   It will affect ANALYZE time (substantially
in the aggregate) and maintenance on the pg_statistics table.

> And finally -- how do other RDBMSes and RDBMS-like systems
> deal with this issue? :)

Most don't allow such fine-tuned adjustment.   MSSQL, for example, allows only
setting it per-table or maybe even database-wide, and on that platform it
doesn't seem to have much effect on query plans.    Oracle prefers to use
HINTS, which are a brute-force method to manage query plans.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: When to bump up statistics?

From: Chris Browne
qnex42@gmail.com (Dawid Kuroczko) writes:
> ALTER TABLE foo ALTER COLUMN bar SET STATISTICS n; .....
>
> I wonder what the implications of using this statement are.
> I know that by using, say, n=100, ANALYZE will take more time,
> pg_statistics will be bigger, and the planner will take longer,
> but on the other hand it will make better decisions... Etc, etc.
>
> I wonder, however, when it is most useful to bump it up.
> Please tell me what you think about it:
>
> Is bumping up statistics only useful for indexed columns?

The main changes in planner decisions that result from this would occur
when the column is indexed, yes...

> When is it most useful/beneficial to bump them up:
>
> 1) huge table with huge number of distinct values (_almost_
>     unique ;))
>
> 2) huge table with relatively equally distributed values
>     (like each value is in between, say, 30-50 rows).
>
> 3) huge table with unequally distributed values (some
>     values are in 1-5 rows, some are in 1000-5000 rows).
>
> 4) huge table with a small number of values (around ~100
>     distinct values, equally or unequally distributed).

A hard and fast rule hasn't emerged, certainly not one that
distinguishes precisely between these cases.

There are two effects that come out of changing the numbers:

 1.  They increase the number of tuples examined.

     This particularly affects cases 3 and 4, increasing the
     likelihood that the statistics are representative.

 2.  They increase the number of samples that are kept, increasing the
     number of items recorded in the histogram.

     If you have on the order of 100 unique values (it would not be
     unusual for a company to have 100 "main" customers or suppliers),
     that allows there to be nearly a bin apiece, which makes
     estimates _way_ more representative both for common and less
     common cases amongst the "top 100."

Both of those properties are useful for pretty much all of the above
cases.
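
You can watch both effects by looking at the pg_stats view before and
after raising the target and re-running ANALYZE (the table and column
names here are made up):

    -- n_distinct: how many distinct values the planner assumes;
    -- most_common_vals/freqs and histogram_bounds grow with the target
    SELECT n_distinct, most_common_vals, most_common_freqs,
           histogram_bounds
      FROM pg_stats
     WHERE tablename = 'orders' AND attname = 'customer_id';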

> 5) boolean column.

A boolean column would more or less indicate SET STATISTICS 2; the only
point in having more would be if one of the values almost never
occurred, so that you'd need to collect more stats even to pick up
instances of the "rare" case.

A boolean column is seldom much use for indices anyways...

> I think SET STATISTICS 100 is very useful for the case with unequally
> distributed values, but I wonder about the other cases.  And as
> a side note -- what are the reasonable bounds for statistics
> (between 10 and 100?)

If there are, say, 200 unique values, then increasing from 10 to 100
would seem likely to be useful in making the histogram MUCH more
representative...
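
So, sticking with made-up names, if pg_stats shows n_distinct at around
200 for the column, something like

    ALTER TABLE orders ALTER COLUMN status_code SET STATISTICS 200;
    ANALYZE orders;

gives room for nearly every distinct value to get its own entry in the
collected stats.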

> What are the runtime implications of setting statistics too large --
> how much can it affect queries?

More stats would mean a bit more time evaluating query plans, but the
quality of the plans should be better.

> And finally -- how do other RDBMSes and RDBMS-like systems deal with
> this issue? :)

For Oracle and DB2, the issues are not dissimilar.  Oracle somewhat
prefers the notion of collecting comprehensive statistics on the whole
table, which will be even more costly than PostgreSQL's sampling.
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.