Thread: Odd statistics behaviour in 7.2

Odd statistics behaviour in 7.2

From: "Gordon A. Runkle"

Hello, all,

I'm having a strange problem with v7.2 relating to statistics collection
and plan calculation.  I'm not sure if this relates to the problems Marc
was seeing, but here goes.

I have a table with 1,066,673 rows.  The column I'm interested in has
this distribution of values:
 tdnr_ct |   ct
---------+--------
      16 |      1
       4 |      1
       3 |     58
       2 |  68904
       1 | 928171

This means that 'ct' records have 'tdnr_ct' duplicate values.  As you can
see, the index I have on this column is highly selective, and should be
used to look up records based on this column.  In v7.1.3, the planner
always uses it.

Under v7.2, it only sometimes does.  I've looked at the statistics,
thanks to what I learned from Tom and Marc's discussion, and I see that
sometimes when I VACUUM ANALYZE the table, 'n_distinct' for this column
gets a value of '-1' (desirable), and other times a value such as 56596
or something.

This is with the default setting for the statistics.

Doing a 'SET STATISTICS 40' on the column got me to '-0.106047', which is
better.  But even so, the values do change somewhat over subsequent runs
of VACUUM ANALYZE.  And sometimes I get the coveted '-1'.

The query I'm running is fairly complex.  The difference between getting
the index lookup versus the sequential scan causes an order of magnitude
difference in run time.

The query plans are below.  Same query, no changes, just the difference
in statistics.

The desirable query plan:

Unique  (cost=176572.08..177673.89 rows=3673 width=176)
  ->  Sort  (cost=176572.08..176572.08 rows=36727 width=176)
        ->  Merge Join  (cost=172982.30..173787.35 rows=36727 width=176)
              ->  Sort  (cost=169436.41..169436.41 rows=27883 width=142)
                    ->  Nested Loop  (cost=0.00..167377.66 rows=27883 width=142)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3345.83 rows=27883 width=125)
                          ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id c24  (cost=0.00..5.87 rows=1 width=17)
              ->  Sort  (cost=3545.89..3545.89 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pe  (cost=0.00..734.48 rows=37048 width=34)
              SubPlan
                ->  Aggregate  (cost=5.87..5.87 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.87 rows=1 width=17)
                ->  Aggregate  (cost=5.88..5.88 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..5.88 rows=1 width=17)




The undesirable query plan:

Unique  (cost=1129322.57..1187392.58 rows=193567 width=176)
  ->  Sort  (cost=1129322.57..1129322.57 rows=1935667 width=176)
        ->  Merge Join  (cost=204226.57..249046.32 rows=1935667 width=176)
              ->  Merge Join  (cost=200135.91..209436.90 rows=525268 width=142)
                    ->  Sort  (cost=6435.89..6435.89 rows=27883 width=125)
                          ->  Seq Scan on pprv_ticket ptk  (cost=0.00..3335.83 rows=27883 width=125)
                    ->  Sort  (cost=193700.02..193700.02 rows=1066173 width=17)
                          ->  Seq Scan on cat24_ticket_doc_id c24  (cost=0.00..50164.73 rows=1066173 width=17)
              ->  Sort  (cost=4090.66..4090.66 rows=37048 width=34)
                    ->  Seq Scan on pprv_violation pv  (cost=0.00..734.48 rows=37048 width=34)
              SubPlan
                ->  Aggregate  (cost=74.72..74.72 rows=1 width=17)
                      ->  Index Scan using xie01_cat24 on cat24_ticket_doc_id  (cost=0.00..74.67 rows=19 width=17)
                ->  Aggregate  (cost=29.12..29.12 rows=1 width=17)
                      ->  Index Scan using xie07_cat24 on cat24_ticket_doc_id  (cost=0.00..29.12 rows=1 width=17)


I hope I've given enough information that it makes sense.  If there's
anything I can do on my end to help figure this out, let me know.

Thanks,

Gordon.
-- 
"Far and away the best prize that life has to offer is the chance to work hard at work worth doing."      -- Theodore
Roosevelt




Re: Odd statistics behaviour in 7.2

From: Thomas Lockhart

> ... sometimes ... 'n_distinct' for this column
> gets a value of '-1' (desirable), and other times a value such as 56596
> or something.
> This is with the default setting for the statistics.
> Doing a 'SET STATISTICS 40' on the column got me to '-0.106047', which
> is better.  But even so, the values do change somewhat over subsequent
> runs of VACUUM ANALYZE.  And sometimes I get the coveted '-1'.

I'm guessing that your table is not randomly populated, so that the new
"statistically sampled ANALYZE" is sometimes getting a good sample (or
at least one with the result you want) and sometimes getting a terrible
one.

Tom?
                    - Thomas


Re: Odd statistics behaviour in 7.2

From: Tom Lane

Gordon Runkle <gar@www.integrated-dynamics.com> writes:
> You can retrieve the dump of my data at: [snipped]

Thanks.  Indeed, the first time I did an ANALYZE I got:
 tablename         | tom_help
 attname           | tdnr
 null_frac         | 0
 avg_width         | 17
 n_distinct        | 56484
 most_common_vals  | {0557088000957,0557700369880}
 most_common_freqs | {0.000666667,0.000666667}
 histogram_bounds  | {0551000386411,0551504108858,0557011074656,0557050633939,0557111430036,0557151012769,0557179871119,0557698138188,0557750158740,0557783053444,0558980779763}
 correlation       | -0.199108

Now, I happen to know that the default size of ANALYZE's statistical
sample is 3000 rows.  What evidently happened here is that the
statistical sampling picked up two values appearing twice (namely,
0557088000957 and 0557700369880); the given frequencies for these two
values establish that they appeared twice in the 3000-row sample.
Since no other values are mentioned in most_common_vals, the remaining
2996 samples must have been values that appeared only once.

Having computed these raw statistics, ANALYZE has to try to extrapolate the
number of distinct values in the whole table.  What it's using for the
purpose is an equation I found in
"Random sampling for histogram construction: how much is enough?"by Surajit Chaudhuri, Rajeev Motwani and Vivek
Narasayya,inProceedings of ACM SIGMOD International Conference on Managementof Data, 1998, Pages 436-447.
 

namely
    sqrt(n/r) * max(f1,1) + f2 + f3 + ...
 where fk is the number of distinct values that occurred exactly k times in our sample of r rows (from a total of n).
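
To make that concrete, here is a small standalone C sketch (illustrative
only, not code from analyze.c) that plugs in the numbers inferred above:
the 3000-row sample held two values appearing twice each (0.000666667 *
3000 = 2) and 2996 values appearing only once.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1066673.0;   /* rows in the whole table */
    double r = 3000.0;      /* rows in the ANALYZE sample */
    double f1 = 2996.0;     /* distinct values seen exactly once */
    double f2 = 2.0;        /* distinct values seen exactly twice */

    /* Chaudhuri et al estimator: sqrt(n/r) * max(f1,1) + f2 + f3 + ... */
    double estimate = sqrt(n / r) * (f1 > 1.0 ? f1 : 1.0) + f2;

    /* Prints a figure in the mid-50000s, matching the n_distinct shown
     * above to within rounding of the inputs, while the table actually
     * holds 997135 distinct values. */
    printf("Chaudhuri estimate: %.0f\n", estimate);
    return 0;
}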

And indeed you get 56484 when you plug in these numbers.  So the code is
operating as designed, and we can't complain that the sample is an
unreasonable sample given the true underlying distribution.  We have to
blame the equation: evidently this estimation equation doesn't apply
very well to nearly-unique columns.

I had already modified the Chaudhuri approach slightly: if the ANALYZE
sample contains no duplicate values at all (ie, f1=r, f2=f3=f4=...=0)
then their equation reduces to sqrt(n*r), but actually ANALYZE assumes
the column is unique (ie, n distinct values, not sqrt(n*r)), which seems
a much better assumption in practice.  The runs where you got n_distinct
equal to -1 are presumably those where the ANALYZE sample chanced to
contain no duplicates.

I am thinking that we need to find another estimator equation, or at
least shade away from their equation when f1 is close to r.  Ideally the
estimate for f1=r should be a logical extrapolation of the curve for f1
close to r, but right now it's quite discontinuous.

There was some previous discussion about this, cf the thread at
http://archives.postgresql.org/pgsql-general/2001-10/msg01032.php
but nothing really emerged on how to do better.  The Chaudhuri paper
points out that estimating the total number of distinct values from a
sample is inherently a hard problem and subject to large estimation
errors, so it may be that we can't do a lot better.  I think we should be wary
of ad-hoc answers, anyhow.  Something with a little math behind it would
make me feel more comfortable.

Anyone have any thoughts on how to do better?
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From: "Gordon A. Runkle"

On Wed, 2002-02-13 at 13:02, Tom Lane wrote:

[ Lots of enlightening info ]

Thanks, Tom!

Would it be fair to say that the correct workaround for now would
be to use ALTER TABLE SET STATISTICS on columns of interest which have
this near-unique characteristic?

Does ALTER TABLE SET STATISTICS only increase the histogram size, or
does it also cause more rows to be sampled?  If not, how would one
increase the sample size (assuming this would be desirable)?

I have quite a few tables that have columns of interest with near-unique
values, so I'm keenly interested.  I'm sorry I didn't get to do all of
this testing during the beta phase; there just wasn't time.  :-(

Thanks again,

Gordon.
-- 
"Far and away the best prize that life has to offer is the chance to work hard at work worth doing."      -- Theodore
Roosevelt




Re: Odd statistics behaviour in 7.2

From: Tom Lane

"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> Would it be fair to say that the correct workaround for now would
> be to use ALTER TABLE SET STATISTICS on columns of interest which have
> this near-unique characteristic?

Yeah, that's probably the best we can do until we can think of a better
estimation equation.

> Does ALTER TABLE SET STATISTICS only increase the histogram size, or
> does it also cause more rows to be sampled?

Both.  The Chaudhuri paper I referred to has some math purporting to
prove that the required sample size is directly proportional to the
histogram size, for fixed relative error in the histogram boundaries.
So I made the same parameter control both.
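
(As a rough sketch of that relationship: I'm assuming here a sizing of
about 300 sample rows per unit of statistics target, which falls out of
Chaudhuri's bound and squares with the default target of 10 giving the
3000-row sample mentioned earlier.  The function name below is
illustrative, not analyze.c's.)

#include <stdio.h>

static int sample_rows_for_target(int stats_target)
{
    /* assumed multiplier; larger targets mean proportionally more rows */
    return 300 * stats_target;
}

int main(void)
{
    printf("target 10 -> %d sample rows\n", sample_rows_for_target(10));
    printf("target 40 -> %d sample rows\n", sample_rows_for_target(40));
    return 0;
}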

Actually the sample size is driven by the largest SET STATISTICS value
for any column of the table.  So you can pick which one you think a
larger histogram would be most useful for; it doesn't have to be the
same column that's got the bad-number-of-distinct-values problem.
Which columns, if any, do you do range queries on?  Those would be the
ones where a bigger histogram would be useful.
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From: Tom Lane

"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> [ tale of poor statistical results in a near-unique column ]

After some further digging in the literature I have come across a
number-of-distinct-values estimator that I like better than Chaudhuri's.
It runs like this:
    n*d / (n - f1 + f1*n/N)

where f1 is the number of distinct values that occurred exactly once
in our sample of n rows (from a total of N), and d is the total number
of distinct values in the sample.

The nice properties this has include:

1. At f1=0 (all values in sample occurred more than once), the estimate
reduces to d, which I had already determined to be a sensible result.

2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.

3. The equation is numerically stable even when n is much smaller than
N, because the only cancellation is in the term (n - f1) which we can
compute exactly.  A lot of the other equations I looked at depend on
series like (1 - n/N)**i which are going to be really nasty when n/N
is tiny.

In particular, point 2 means that this equation should perform
reasonably well for nearly-unique columns (f1 approaching n),
which was the case you were seeing bad results for.
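
Here is a quick numerical check of those boundary claims (illustrative
only; duj1() below is just a throwaway helper, not code from the patch):

#include <stdio.h>

static double duj1(double N, double n, double f1, double d)
{
    /* Haas/Stokes Duj1 estimator: n*d / (n - f1 + f1*n/N) */
    return n * d / (n - f1 + f1 * n / N);
}

int main(void)
{
    double N = 1066673.0;   /* total rows */
    double n = 3000.0;      /* sample rows */

    /* Point 1: no singletons in the sample (f1 = 0) -> estimate equals d */
    printf("f1 = 0,    d = 50:   %.1f\n", duj1(N, n, 0.0, 50.0));

    /* Point 2: all sampled values distinct (f1 = n, d = n) -> estimate = N */
    printf("f1 = n,    d = n:    %.1f\n", duj1(N, n, n, n));

    /* A near-unique sample like Gordon's (2996 singletons plus 2 doubled
     * values) lands around 0.68 * N rather than collapsing to ~56000. */
    printf("f1 = 2996, d = 2998: %.1f\n", duj1(N, n, 2996.0, 2998.0));
    return 0;
}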

Attached is a patch that implements this revised equation.  Would
appreciate any feedback...
        regards, tom lane

*** src/backend/commands/analyze.c.orig    Sat Jan  5 19:37:44 2002
--- src/backend/commands/analyze.c    Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****
          {
              /*----------
               * Estimate the number of distinct values using the estimator
!              * proposed by Chaudhuri et al (see citation above).  This is
!              *        sqrt(n/r) * max(f1,1) + f2 + f3 + ...
!              * where fk is the number of distinct values that occurred
!              * exactly k times in our sample of r rows (from a total of n).
               * We assume (not very reliably!) that all the multiply-occurring
               * values are reflected in the final track[] list, and the other
               * nonnull values all appeared but once.  (XXX this usually
--- 1009,1023 ----
          {
              /*----------
               * Estimate the number of distinct values using the estimator
!              * proposed by Haas and Stokes in IBM Research Report RJ 10025:
!              *        n*d / (n - f1 + f1*n/N)
!              * where f1 is the number of distinct values that occurred
!              * exactly once in our sample of n rows (from a total of N),
!              * and d is the total number of distinct values in the sample.
!              * This is their Duj1 estimator; the other estimators they
!              * recommend are considerably more complex, and are numerically
!              * very unstable when n is much smaller than N.
!              *
               * We assume (not very reliably!) that all the multiply-occurring
               * values are reflected in the final track[] list, and the other
               * nonnull values all appeared but once.  (XXX this usually
***************
*** 1021,1032 ****
               *----------
               */
              int            f1 = nonnull_cnt - summultiple;
!             double         term1;
  
!             if (f1 < 1)
!                 f1 = 1;
!             term1 = sqrt(totalrows / (double) numrows) * f1;
!             stats->stadistinct = floor(term1 + nmultiple + 0.5);
          }
  
          /*
--- 1026,1044 ----
               *----------
               */
              int            f1 = nonnull_cnt - summultiple;
!             int            d = f1 + nmultiple;
!             double         numer, denom, stadistinct;
  
!             numer = (double) numrows * (double) d;
!             denom = (double) (numrows - f1) +
!                 (double) f1 * (double) numrows / totalrows;
!             stadistinct = numer / denom;
!             /* Clamp to sane range in case of roundoff error */
!             if (stadistinct < (double) d)
!                 stadistinct = (double) d;
!             if (stadistinct > totalrows)
!                 stadistinct = totalrows;
!             stats->stadistinct = floor(stadistinct + 0.5);
          }
  
          /*
***************
*** 1313,1332 ****
          {
              /*----------
               * Estimate the number of distinct values using the estimator
!              * proposed by Chaudhuri et al (see citation above).  This is
!              *        sqrt(n/r) * max(f1,1) + f2 + f3 + ...
!              * where fk is the number of distinct values that occurred
!              * exactly k times in our sample of r rows (from a total of n).
               * Overwidth values are assumed to have been distinct.
               *----------
               */
              int            f1 = ndistinct - nmultiple + toowide_cnt;
!             double         term1;
  
!             if (f1 < 1)
!                 f1 = 1;
!             term1 = sqrt(totalrows / (double) numrows) * f1;
!             stats->stadistinct = floor(term1 + nmultiple + 0.5);
          }
  
          /*
--- 1325,1356 ----
          {
              /*----------
               * Estimate the number of distinct values using the estimator
!              * proposed by Haas and Stokes in IBM Research Report RJ 10025:
!              *        n*d / (n - f1 + f1*n/N)
!              * where f1 is the number of distinct values that occurred
!              * exactly once in our sample of n rows (from a total of N),
!              * and d is the total number of distinct values in the sample.
!              * This is their Duj1 estimator; the other estimators they
!              * recommend are considerably more complex, and are numerically
!              * very unstable when n is much smaller than N.
!              *
               * Overwidth values are assumed to have been distinct.
               *----------
               */
              int            f1 = ndistinct - nmultiple + toowide_cnt;
!             int            d = f1 + nmultiple;
!             double         numer, denom, stadistinct;
  
!             numer = (double) numrows * (double) d;
!             denom = (double) (numrows - f1) +
!                 (double) f1 * (double) numrows / totalrows;
!             stadistinct = numer / denom;
!             /* Clamp to sane range in case of roundoff error */
!             if (stadistinct < (double) d)
!                 stadistinct = (double) d;
!             if (stadistinct > totalrows)
!                 stadistinct = totalrows;
!             stats->stadistinct = floor(stadistinct + 0.5);
          }
  
          /*


Re: Odd statistics behaviour in 7.2

From: "Gordon A. Runkle"

On Fri, 2002-02-15 at 19:09, Tom Lane wrote:
> "Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> > [ tale of poor statistical results in a near-unique column ]
> [ new algorithm and patch ]

Hi Tom,

This works much better (I did SET STATISTICS back to 10).  Not always
'-1', but almost always negative.

Is "-0.503824" the same as "503824 with a predicted increase in the
number of distinct values" (as opposed to using "-503824")?

I checked a couple of other tables for which my field is a bit less
nearly-unique, and they have better stats now also.  I've attached a
file with distributions and stats, if you're interested.

And, getting to where the rubber meets the road, my queries are running
well again!

I'm running a batch of other queries (which were unaffected by this
problem) overnight to see if they still work well.

Are you planning to include this patch in v7.2.1, or would it require
too much testing by others?

Thanks for all your hard work, it is greatly appreciated,

Gordon.
--
"Far and away the best prize that life has to offer
  is the chance to work hard at work worth doing."
       -- Theodore Roosevelt



Re: Odd statistics behaviour in 7.2

From: Tom Lane

"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> Is "-0.503824" the same as "503824 with a predicted increase in the
> number of distinct values" (as opposed to using "-503824")?

No, it means "0.503824 times the number of rows in the table".
Although your table was ~ 1 million rows, so that's approximately
right in your case.

Given the stats you cited, the exactly correct stadistinct value would
be -0.9348085.  In testing I got -1, -0.808612, -0.678641, or once
-0.584611 from your data, depending on whether the sample chanced to
find none, one, two, or three repeated values.  Any of these strike me
as plenty close enough for statistical purposes.  But the Chaudhuri
estimator was off by more than a factor of 10.
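
For anyone following along, a minimal sketch of the convention and the
arithmetic (the function name is made up for illustration; this is not
the planner's code):

#include <stdio.h>

/* Negative stadistinct = fraction of the current row count;
 * positive = absolute number of distinct values. */
static double ndistinct_from_stadistinct(double stadistinct, double reltuples)
{
    return stadistinct < 0.0 ? -stadistinct * reltuples : stadistinct;
}

int main(void)
{
    double reltuples = 1066673.0;

    /* Gordon's distribution has 1 + 1 + 58 + 68904 + 928171 = 997135
     * distinct values, and 997135 / 1066673 = 0.9348085. */
    printf("-0.9348085 -> %.0f distinct values\n",
           ndistinct_from_stadistinct(-0.9348085, reltuples));
    printf("-0.503824  -> %.0f distinct values\n",
           ndistinct_from_stadistinct(-0.503824, reltuples));
    return 0;
}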

> Are you planning to include this patch in v7.2.1, or would it require
> too much testing by others?

I'm going to put it in 7.2.1 unless there are objections.
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From: Tom Lane

BTW, while we're thinking about this, there's another aspect of the
number-of-distinct-values estimator that could use some peer review.
That's the decision whether to assume that the number of distinct
values in a column is fixed, or will vary with the size of the
table.  (For example, in a boolean column, ndistinct should clearly
be 2 no matter how large the table gets; but in any unique column
ndistinct should equal the table size.)  This is important since there
are times when we update the table size estimate (pg_class.reltuples)
without recomputing the statistics in pg_statistic.  The "negative
stadistinct" convention in pg_statistic is used to signal which case
ANALYZE thinks applies.

Presently the decision is pretty simplistic: if the estimated number
of distinct values is more than 10% of the number of rows, then assume
the number of distinct values scales with the number of rows.
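
A minimal sketch of that rule (names are illustrative, not the actual
analyze.c variables):

/* Estimates above 10% of the row count are stored as a negative
 * fraction of the table size, so they scale as the table grows;
 * smaller estimates are stored as absolute counts. */
double finalize_stadistinct(double estimate, double totalrows)
{
    if (estimate > 0.1 * totalrows)
        return -(estimate / totalrows); /* assumed to scale with table */
    return estimate;                    /* assumed fixed */
}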

I believe that some rule of this form is reasonable, but the 10%
threshold was just picked out of the air.  Can anyone suggest an
argument in favor of some other value, or a better way to look at it?
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From: Bruno Wolff III

On Fri, Feb 15, 2002 at 07:09:42PM -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> 3. The equation is numerically stable even when n is much smaller than
> N, because the only cancellation is in the term (n - f1) which we can
> compute exactly.  A lot of the other equations I looked at depend on
> series like (1 - n/N)**i which are going to be really nasty when n/N
> is tiny.

You can work with the above for n << N by using a power series.  For n << N,
(1 - n/N)**i ~= 1 - i*n/N.


Re: Odd statistics behaviour in 7.2

From: Tom Lane

Bruno Wolff III <bruno@wolff.to> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A lot of the other equations I looked at depend on
>> series like (1 - n/N)**i which are going to be really nasty when n/N
>> is tiny.

> You can work with the above for n << N by using a power series.  For n << N,
> (1 - n/N)**i ~= 1 - i*n/N.

Yeah, but the pain-in-the-neck aspect comes from the fact that n/N isn't
necessarily tiny --- it could approach 1.  So you'd need code smart
enough to handle both cases with accuracy.

We could probably do this if it proves necessary, but I'd like to stick
to simpler equations if at all possible.
        regards, tom lane


Re: Odd statistics behaviour in 7.2

From: Tom Lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It would seem that if you could determine if the number of distinct
> values is _increasing_ as you scan more rows, that an increase in table
> size would also cause an increase, e.g. if you have X distinct values
> looking at N rows, and 2X distinct values looking at 2N rows, that
> clearly would show a scale.

[ thinks for awhile... ]  I don't think that'll help.  You could not
expect an exact 2:1 increase, except in the case of a simple unique
column, which isn't the problem anyway.  So the above would really
have to be coded as "count the number of distinct values in the sample
(d1) and the number in half of the sample (d2); then if d1/d2 >= X
assume the number of distinct values scales".  X is a constant somewhere
between 1 and 2, but where?  I think you've only managed to trade one
arbitrary threshold for another one.

A more serious problem is that the above could easily be fooled by a
distribution that contains a few very-popular values and a larger number
of seldom-seen ones.  Consider for example a column "number of children"
over a database of families.  In a sample of a thousand or so, you might
well see only values 0..4 (or so); if you double the size of the sample,
and find a few rows with 5 to 10 kids, are you then correct to label the
column as scaling with the size of the database?
        regards, tom lane