Thread: Expected accuracy of planner statistics
I have some databases that have grown significantly over time (as
databases do). As the databases have grown, I have noticed that the
statistics have grown less and less accurate. In particular, the
n_distinct values have become many orders of magnitude (OOM) too small
for certain foreign key columns. Predictably this leads to poor query
plans. The databases in question were all using the default stats
target value, so naturally the thing to do is to increase that and see
what happens.

First I'll show you one table in question:

qa_full=# \d fk
                           Table "public.fk"
    Column     |            Type             |   Modifiers
---------------+-----------------------------+---------------
 fk_id         | bigint                      | not null
 st_id         | bigint                      | not null
 is_positive   | boolean                     | not null
 mc_id         | character varying(20)       | not null
 matching_seed | character varying(20)       |
 ft_id         | character varying(20)       |
 s_title       | text                        | not null
 a_summary     | text                        | not null
 date_created  | timestamp without time zone | default now()
 qx_id         | bigint                      |
Indexes:
    "fk_pkey" PRIMARY KEY, btree (fk_id)
    "fk_st_mc_id_idx" UNIQUE, btree (st_id, mc_id)
    "fk_date_created_is_positive_idx" btree (is_positive, date_created)
    "fk_st_id_idx" btree (st_id)
Foreign-key constraints:
    "fk_qx_id_fkey" FOREIGN KEY (qx_id) REFERENCES st(st_id) ON DELETE RESTRICT
    "fk_st_id_fkey" FOREIGN KEY (st_id) REFERENCES st(st_id) ON DELETE RESTRICT

qa_full=# select count(*) from fk;
   count
-----------
 195555889

Here are the n_distinct stats on the st_id column with stock stats
settings:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |      14910

And here's the actual distinct count:

qa_full=# select count(distinct st_id) from fk;
  count
----------
 15191387
(1 row)

Here's what it looks like after turning the stats target up to 100:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     136977

Still way off (3 OOM), so let's pull out the stops and
go for 1000:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     860796

Better, but still way off. Here's more of the pg_stats row for the
curious, with the stats target at 1000:

schemaname        | public
tablename         | fk
attname           | st_id
null_frac         | 0
avg_width         | 8
n_distinct        | 860796
most_common_vals  | {9822972459012807,81553350123749183,50260420266636724,
                    16953859416556337,57992478091506908,6789385517968759,
                    13155841310992808,4649594156182905,11950505984130111,
                    19815690615418387,23232929805154508,24940819255590358,
                    25304517086243633,30084673952005845,33845252828401578,
                    36510232790970904,44301350711321256,47572440754042499,
                    66302045808587415,106949745150210138,7948257888859857,
                    11709841786637953,12034360925626832,17311819170902574,
                    21933556169120032,31401742852411043,37178443803282644,
                    39714175315169346,42699954975194688,63648700912541567,
                    73785794393665562,...many elided...}
most_common_freqs | {7.33333e-05,6.66667e-05,5.33333e-05,5e-05,5e-05,
                    4.66667e-05,4.66667e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4e-05,4e-05,
                    4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,
                    4e-05,4e-05,4e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3e-05,3e-05,
                    3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,
                    ...many elided...}
histogram_bounds  | {9474697855526,186642098833097,425502410065792,
                    655064117100237,917344884999940,1135224280975580,
                    1510900775316064,1919850381534192,2391828632704465,
                    2773745714634569,3197981109338899,3601128214604953,
                    3887435029566307,4289757501117626,4604286546172963,
                    5030605000015434,5410915764179364,5712662986537560,
                    6096452674229658,6531206443844232,6761515475182966,
                    6924281850823004,7145897868348599,7357502317108796,
                    7537560231072453,7737194605867515,7923617661480232,
                    8094845122681350,8304911973154200,8504211340608556,
                    8735469559703009,9008968782181381,9233161779966219,
                    ...many elided...}
correlation       | 0.770339

The correlation is likely high here because this table has been
clustered on this column in the past. I don't know if that contributes
to the n_distinct inaccuracy, and I don't know if I have the patience
to reorder the table to find out ;^)

Note that new st_ids are also being added all the time, at a rate
roughly proportional to fk rows (fk rows being added more frequently).
So a fractional value for n_distinct here would actually be more ideal.
The docs hint that analyze will sometimes decide to use a fractional
(negative) value. What triggers that?

I was also trying to figure out how big the sample really is. Does a
stats target of 1000 mean 1000 rows sampled? If the sample really is a
fixed number of rows, it would seem to my naive eyes that sampling a
fraction of the rows (like 0.1% or something) would be better
(especially in cases like this), but maybe it already tries to do that.

Any insights appreciated.

-Casey
On Thu, Sep 28, 2006 at 03:19:46PM -0700, Casey Duncan wrote:
> I have some databases that have grown significantly over time (as
> databases do). As the databases have grown, I have noticed that the
> statistics have grown less and less accurate. In particular, the
> n_distinct values have become many OOM too small for certain foreign
> key columns. Predictably this leads to poor query plans.

Search the -hackers archives. The problem is that you can't actually
get a good n_distinct estimate if you're sampling less than a very
large chunk of the table. Since our sampling maxes out at something
like 30k pages, at some point the n_distinct estimates just degrade. :(

Patches/solutions welcome. :)
--
Jim Nasby                                  jim@nasby.net
EnterpriseDB     http://enterprisedb.com   512.569.9461 (cell)
Casey Duncan <casey@pandora.com> writes:
> I was also trying to figure out how big the sample really is. Does a
> stats target of 1000 mean 1000 rows sampled?

No. From memory, the sample size is 300 times the stats target (eg,
3000 rows sampled for the default target of 10). This is based on some
math that says that's enough for a high probability of getting good
histogram estimates. Unfortunately, that math promises nothing about
n_distinct.

The information we've seen says that the only statistically reliable
way to arrive at an accurate n_distinct estimate is to examine most of
the table :-(. Which seems infeasible for extremely large tables, which
is exactly where the problem is worst. Marginal increases in the sample
size seem unlikely to help much ... as indeed your experiment shows.

We could also diddle the estimator equation to inflate the estimate.
I'm not sure whether such a cure would be worse than the disease, but
certainly the current code was not given to us on stone tablets. IIRC
I picked an equation out of the literature partially on the basis of
it being simple and fairly cheap to compute...

			regards, tom lane
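To put those numbers in perspective against the fk table above, here is
a quick back-of-the-envelope sketch of the "300 rows per unit of stats
target" rule (the actual sampling logic in analyze.c is more involved;
this is only the arithmetic):

```python
# ANALYZE samples roughly 300 rows per unit of statistics target.
# The 300x constant comes from histogram error bounds, which is why
# it promises nothing about n_distinct accuracy.
table_rows = 195_555_889  # the fk table from the original post

for target in (10, 100, 1000):
    sample = 300 * target
    print(f"target={target:5d}  sample={sample:7d}  "
          f"fraction of table={sample / table_rows:.5%}")
```

Even at a stats target of 1000, the sample is about 300k rows, i.e.
roughly 0.15% of this table — close to the "0.1% or something" fraction
Casey speculated about, but fixed in absolute size rather than scaling
with the table.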
On Sep 28, 2006, at 8:51 PM, Tom Lane wrote:
[..]
> The information we've seen says that the only statistically reliable
> way to arrive at an accurate n_distinct estimate is to examine most
> of the table :-(. Which seems infeasible for extremely large tables,
> which is exactly where the problem is worst. Marginal increases in
> the sample size seem unlikely to help much ... as indeed your
> experiment shows.

I think a first step might be to introduce a new analyze command, such
as ANALYZE FULL. This would be executed deliberately (IOW not by
autovacuum), like CLUSTER or VACUUM FULL, when deemed necessary by the
dba. The command as implied would scan the entire table and fill in the
stats based on that (as analyze used to do, IIRC).

It would also be useful if this command froze the stats so that
autovacuum didn't clobber them with inaccurate ones shortly thereafter.
Perhaps an explicit ANALYZE FULL FREEZE command would be useful for
that case, the behavior being that a normal ANALYZE would not overwrite
the stats for a stats-frozen table; another ANALYZE FULL would,
however. Such a frozen state would also be useful if you wanted to
hand-tweak stats for a single table, have it stick, and still use
autovac. As I understand it now, with autovac on, you cannot do that
unless you hack the pg_autovacuum table (i.e., set anl_base_thresh to
an artificially high value).

Another option (that I think others have suggested) would be to make
this the behavior of VACUUM ANALYZE. That saves the baggage of a new
command, at least. Another advantage would be that the autovac daemon
could run it.

Perhaps some smarts could also be built in. What if VACUUM ANALYZE
first runs a normal (sampled) ANALYZE, then performs the VACUUM with a
full ANALYZE pass? The stats gathered by the latter full pass are
compared to those of the first sampled pass.
If the full ANALYZE statistics are sufficiently different from the
sampled pass, then the table is flagged so that a normal ANALYZE is not
performed by the autovac daemon on that table. Also, a global ANALYZE
could ignore it (though this seems more magical).

A more pie-in-the-sky idea could take advantage of the fact that the
larger a table is, the less likely its statistics are to change much
over time. If we cannot afford to sample many rows in a given analyze
pass, then perhaps we should use a "Newton's method" approach, where we
attempt to converge on an accurate value over time, with each analyze
pass contributing more samples to the statistics and honing them
incrementally rather than simply replacing the old ones.

I'm no statistician, so it's not clear to me how much more state you
would need to keep between analyze passes to make this viable, but in
order for this to work the following would need to be true:

1) Analyze would need to be run on a regular basis (luckily we have
   autovacuum to help). You would want to analyze the table
   periodically even if nothing much changed, however. Perhaps tuning
   the autovac parameters is enough here.

2) Each analyze pass would need to sample randomly, so that multiple
   passes tend to sample different rows.

3) The stats would need to somehow be cumulative. Perhaps this means
   storing sample values between passes, or some other statistical
   voodoo.

4) It needs to be smart enough to realize when a table has changed
   drastically, and toss out the old stats in that case. Either that,
   or we require a human to tell us via ANALYZE FULL/VACUUM ANALYZE.

I think that the incremental stats approach would more or less depend
on the full ANALYZE functionality for bootstrapping. I think when you
first load the table, you want to get the stats right immediately and
not wait some indeterminate amount of time for them to "converge" on
the right value.

-Casey
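One classical way to get the "cumulative sample" behavior sketched in
points 2 and 3 is reservoir sampling, which maintains a fixed-size
uniform sample that can absorb new rows incrementally. The following is
only an illustrative sketch (the Reservoir class is invented here; this
is not how ANALYZE maintains its sample):

```python
import random

class Reservoir:
    """Fixed-size uniform random sample over all rows ever offered
    (Vitter's Algorithm R). Each "analyze pass" can feed rows in, and
    the sample remains uniform over everything seen so far, rather
    than being discarded and rebuilt on every pass."""

    def __init__(self, k):
        self.k = k        # sample size to maintain
        self.seen = 0     # total rows offered so far
        self.sample = []

    def offer(self, rows):
        for row in rows:
            self.seen += 1
            if len(self.sample) < self.k:
                self.sample.append(row)
            else:
                # Keep each incoming row with probability k / seen,
                # evicting a uniformly chosen current member.
                j = random.randrange(self.seen)
                if j < self.k:
                    self.sample[j] = row

r = Reservoir(k=1000)
r.offer(range(50_000))            # first pass over some rows
r.offer(range(50_000, 100_000))   # later pass; sample now covers both
```

The hard part the mail alludes to in point 4 remains: a reservoir has
no notion of deleted or updated rows, so detecting when the accumulated
history should be thrown away is a separate problem.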
In article <20060929010011.GQ34238@nasby.net>, jim@nasby.net
("Jim C. Nasby") wrote:
> The problem is that you can't actually get a good n_distinct estimate
> if you're sampling less than a very large chunk of the table. Since
> our sampling maxes out at something like 30k pages, at some point the
> n_distinct estimates just degrade. :(

Can the DBA just set n_distinct? Sometimes s/he just knows what the
value should be.

Then, of course, the question becomes how to keep vacuum et al. from
messing it up.

-arturo
Arturo Perez <aperez@hayesinc.com> writes:
> Can the DBA just set n_distinct? Sometimes s/he just knows what the
> value should be.

Having an expensive process run once in a while to set this value also
sounds interesting. If it has to be calculated every time, that's a bad
thing, but having some kind of command or function to update it that
could be called when the database has a lower load would be
interesting. For companies that work from 8am to 5pm, this could be
scheduled to run every night...

> Then, of course, the question becomes how to keep vacuum et al from
> messing it up.

Vacuum would not touch these settings unless the specific command is
called; it could gain a new parameter "VACUUM FULL N_DISTINCT ..." to
touch them (and then we could probably discard the extra command /
function), or it could update these settings only when called with
ANALYZE...

--
Jorge Godoy <jgodoy@gmail.com>
> [snip]
> Having an expensive process run once in a while and setting this
> value also sounds interesting.
> [snip]

What about externalizing the statistics calculation? I mean, would it
make sense to have, e.g., a WAL-fed standby with an additional process
that keeps the statistics in sync based on the incoming WAL records,
and feeds the stats back to the master as soon as they change
significantly? The standby would be able to crunch ALL the data, in
almost real time... with almost no overhead for the master.

It would require another server, though... but I guess that where
analyze is a problem, throwing another server at it is not a problem.

Cheers,
Csaba.
Tom Lane wrote:
> The information we've seen says that the only statistically reliable
> way to arrive at an accurate n_distinct estimate is to examine most
> of the table :-(.
> IIRC I picked an equation out of the literature partially on the
> basis of it being simple and fairly cheap to compute...

I'm very curious about this - can you recall where you got this, or at
least point me to where in the code this happens?

Thanks.

- John Burger
  MITRE
"John D. Burger" <john@mitre.org> writes:
> Tom Lane wrote:
>> IIRC I picked an equation out of the literature partially on the
>> basis of it being simple and fairly cheap to compute...

> I'm very curious about this - can you recall where you got this, or
> at least point me to where in the code this happens?

src/backend/commands/analyze.c, around line 1930 as of CVS HEAD:

	/*----------
	 * Estimate the number of distinct values using the estimator
	 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
	 *		n*d / (n - f1 + f1*n/N)
	 * where f1 is the number of distinct values that occurred
	 * exactly once in our sample of n rows (from a total of N),
	 * and d is the total number of distinct values in the sample.
	 * This is their Duj1 estimator; the other estimators they
	 * recommend are considerably more complex, and are numerically
	 * very unstable when n is much smaller than N.
	 *
	 * Overwidth values are assumed to have been distinct.
	 *----------
	 */

			regards, tom lane
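For anyone who wants to play with that estimator outside the backend,
here is a direct transcription of the Duj1 formula from that comment (a
sketch for experimentation, not the analyze.c code itself):

```python
from collections import Counter

def duj1(sample, total_rows):
    """Haas/Stokes Duj1 estimator: n*d / (n - f1 + f1*n/N), where n is
    the sample size, N the table row count, d the number of distinct
    values in the sample, and f1 the number of values that occurred
    exactly once in the sample."""
    n, N = len(sample), total_rows
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / N)

# Two boundary cases that follow directly from the formula:
print(duj1(list(range(100)), 10_000))  # all sample values unique -> 10000.0 (= N)
print(duj1([42] * 100, 10_000))        # one value everywhere -> 1.0
```

The boundary cases show the estimator's behavior: when every sampled
value is unique (n = d = f1), the estimate collapses to N, and when the
sample holds a single repeated value it collapses to 1 — the trouble
Casey hit lies in between, where a small sample makes many moderately
frequent values look rarer than they are.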
> src/backend/commands/analyze.c, around line 1930 as of CVS HEAD:
>
> 	/*----------
> 	 * Estimate the number of distinct values using the estimator
> 	 * proposed by Haas and Stokes in IBM Research Report RJ 10025:

Thanks for the pointer, Tom. I shouldn't have been surprised to find
such a nice comment pointing me at the literature.

- John D. Burger
  MITRE