Thread: n_distinct way off, but following a pattern.

From: Nick Fankhauser

Hi-

I'm seeing estimates for n_distinct that are way off for a large table
(8,700,000 rows). They get better as the stats target is raised, but are
still low by roughly a factor of 17 with the target set to 1000. I've
noticed and reported a similar pattern on another table before. Because
this follows the same very consistent pattern, I figured it was worth
reporting again. It looks more like a bug than randomness: if the poor
result were simply due to a small sample, the estimates should be all
over the map, but these are consistently low, and the shortfall varies
in almost exact inverse proportion to the stats target:

                                        run 1:    run 2:    run 3:
n_distinct estimate, statistics = 10:   3168      3187      3212
n_distinct estimate, statistics = 100:  23828     24059     23615
n_distinct estimate, statistics = 1000: 194690    194516    194081
Actual distinct values:                 3340724

Or to put it another way, if you were to take the estimate from analyze,
divide by the stats target and multiply by 10000, the result would be pretty
close to exact. (Within a factor of 2, which ought to be plenty close for
planning purposes.)
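For instance, using the run-1 figures above: 3168/10 * 10000 = 3,168,000
and 194690/1000 * 10000 = 1,946,900, against an actual count of 3,340,724.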

I'm running version 7.3.2.

Any thoughts from folks familiar with this part of the source code?

Regards,
     -Nick

PS-
Here's a log of the session that I got this from.

alpha=# select count(distinct actor_id) from actor_case_assignment;
-[ RECORD 1 ]--
count | 3340724
alpha=# analyze;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-----+--------------------------------------------------------
schemaname        | public
tablename         | actor_case_assignment
attname           | actor_id
null_frac         | 0
avg_width         | 16
n_distinct        | 3168
most_common_vals  | {18105XS,18115XS,18106XS,18113JD02,18115JD02,18106J27,18113XS,18113A10656,18115LST,18108XS}
most_common_freqs | {0.0206667,0.0206667,0.0196667,0.019,0.0176667,0.0173333,0.0163333,0.015,0.014,0.0136667}
histogram_bounds  | {18067A000-07P,18067PD397SC1574-1,18105LBPD,18106A2119-49,18106PD399IF845-1,18108A03068-20,18108LECS207,18108PTW03737278-2,18111A19788-77,18115A50420,18115XC}
correlation       | 0.876795
alpha=#
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]
<Header snipped>
schemaname        | public
tablename         | actor_case_assignment
attname           | actor_id
null_frac         | 0
avg_width         | 17
n_distinct        | 23828
most_common_vals  | {18115XS,18113JD02,18106XS,1....
<Rest of values snipped>
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# SELECT * FROM pg_stats
alpha-#  WHERE tablename='actor_case_assignment' AND attname='actor_id';
-[ RECORD 1 ]-----
<Header snipped>
schemaname        | public
tablename         | actor_case_assignment
attname           | actor_id
null_frac         | 0
avg_width         | 16
n_distinct        | 194690
most_common_vals  | {18106XS,18115XS,18115...
<Rest of values snipped>

alpha=# \x
Expanded display is off.
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
       3187
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 10;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
       3212
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
      24059
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# alter table actor_case_assignment alter column actor_id set
statistics 100;
ALTER TABLE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
      23615
(1 row)

alpha=# alter table actor_case_assignment alter column actor_id set
statistics 1000;
ALTER TABLE
alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
     194516
(1 row)

alpha=# analyze actor_case_assignment;
ANALYZE
alpha=# select n_distinct from pg_stats where
tablename='actor_case_assignment' and attname='actor_id';
 n_distinct
------------
     194081
(1 row)

alpha=#



Re: n_distinct way off, but following a pattern.

From: Tom Lane

"Nick Fankhauser" <nickf@ontko.com> writes:
> I'm seeing estimates for n_distinct that are way off for a large table

Estimating n_distinct from a small sample is inherently a hard problem.
I'm not surprised that the estimates would get better as the sample size
increases.  But maybe we can do better.  The method we are currently
using is this:

            /*----------
             * Estimate the number of distinct values using the estimator
             * proposed by Haas and Stokes in IBM Research Report RJ 10025:
             *        n*d / (n - f1 + f1*n/N)
             * where f1 is the number of distinct values that occurred
             * exactly once in our sample of n rows (from a total of N),
             * and d is the total number of distinct values in the sample.
             * This is their Duj1 estimator; the other estimators they
             * recommend are considerably more complex, and are numerically
             * very unstable when n is much smaller than N.

It would be interesting to see exactly what inputs are going into this
equation.  Do you feel like adding some debug printouts into this code?
Or just looking at the variables with a debugger?  In 7.3 it's about
line 1060 in src/backend/commands/analyze.c.
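
For illustration, here is what that formula gives with some made-up
inputs (the d and f1 below are hypothetical, not the actual sample
statistics; ANALYZE samples roughly 300 rows per stats-target unit, so a
target of 10 reads about n = 3000 rows out of N = 8,700,000):

    -- Duj1 with n = 3000, N = 8700000, and hypothetical d = 2500
    -- distinct values in the sample, f1 = 2200 of them seen just once:
    SELECT (3000.0 * 2500.0)
         / (3000.0 - 2200.0 + 2200.0 * 3000.0 / 8700000.0) AS duj1;
    -- => roughly 9366.  When n is much smaller than N the f1*n/N term
    -- is negligible, so the result is essentially n*d/(n - f1), which
    -- can sit far below the true number of distinct values.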

BTW, this is already our second try at this problem; the original 7.2
equation didn't last long at all ...

            regards, tom lane

Re: n_distinct way off, but following a pattern.

From: Nick Fankhauser

> It would be interesting to see exactly what inputs are going into this
> equation.  Do you feel like adding some debug printouts into this code?
> Or just looking at the variables with a debugger?  In 7.3 it's about
> line 1060 in src/backend/commands/analyze.c.

Tom-

I don't really have time to follow up at this moment, but this would be
interesting to look into, so I plan to dig into it over the Thanksgiving
holiday, when I'll have a little free time for fun projects. Your
pointers should let me get into it pretty quickly.

In the meantime, I'll just set up a cron job that runs behind my nightly
analyze and puts the correct numbers into pg_statistic for the affected
tables, along the lines of the sketch below.
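
A minimal sketch of that fix-up, hard-coding the count from the session
above (pg_statistic can only be written by a superuser, and the next
ANALYZE of the table will overwrite the value again):

    -- Hypothetical nightly correction for actor_case_assignment.actor_id;
    -- a real job would recompute the count rather than hard-code it.
    UPDATE pg_statistic
       SET stadistinct = 3340724
     WHERE starelid = (SELECT oid FROM pg_class
                        WHERE relname = 'actor_case_assignment')
       AND staattnum = (SELECT attnum FROM pg_attribute
                         WHERE attrelid = (SELECT oid FROM pg_class
                                            WHERE relname = 'actor_case_assignment')
                           AND attname = 'actor_id');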

Thanks-
    -Nick