Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From pgsql@mohawksoft.com
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 16759.24.91.171.78.1107814616.squirrel@mail.mohawksoft.com
In response to Re: Query optimizer 8.0.1 (and 8.0)  (Bruno Wolff III <bruno@wolff.to>)
Responses Re: Query optimizer 8.0.1 (and 8.0)  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
List pgsql-hackers
> On Mon, Feb 07, 2005 at 13:28:04 -0500,
>
> What you are saying here is that if you want more accurate statistics, you
> need to sample more rows. That is true. However, the size of the sample
> is essentially only dependent on the accuracy you need and not the size
> of the population, for large populations.
>
That's nonsense.

If your total data size is 100 elements in a set, then a sample size of
100 elements will cover 100% of your data.

If your total data size is 10,000 elements in a set, then a sample size of
100 elements will cover 1% of your data.

In the case of the TIGER database, the base of 100 samples means that only
about .002% of the data is sampled. Think about that: that is an average of
one sample about every 50,000 records. You could have substantial but
irregular trends in the data that never get detected, and this is EXACTLY
what we see. If we increase the sample size (targrows), the statistics
suddenly work better.
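
(As an aside: one way to make targrows bigger without touching the code is
the per-column statistics target. This is only a sketch; the target value of
1000 below is just for illustration, not something I am recommending:)

-- Raise the per-column statistics target; if I remember right, analyze
-- samples roughly 300 * target rows, so a target of 1000 means a sample
-- of about 300,000 rows instead of 3,000.
ALTER TABLE rt1 ALTER COLUMN zipl SET STATISTICS 1000;
ALTER TABLE rt1 ALTER COLUMN zipr SET STATISTICS 1000;
ANALYZE VERBOSE rt1;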

For instance, look at the data below.

The first analyze / select from pg_stats is with an analyze of 3000 sample
rows. The statistics for the zipl and zipr columns come out poorly and can
cause the planner to use a table scan instead of an index scan.
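
(You can see which plan it picks with something like the query below. The
zip value is just one of the most-common values from pg_stats, and whether
an index scan is even possible depends on having an index on zipl, which I'm
assuming here:)

-- Check whether the planner chooses an index scan or a sequential scan
-- for an equality lookup on zipl.
EXPLAIN SELECT * FROM rt1 WHERE zipl = 76240;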

The second analyze / select from pg_stats is with an analyze of 10000 sample
rows. The zipl and zipr n_distinct values are still off by a factor of 10,
but close enough for the planner to deal with.

If the premise is that sample size doesn't make a difference, I think we've
shown that this is not true.


tiger=# analyze verbose rt1;
INFO:  analyzing "public.rt1"
INFO:  "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows
and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%';
-[ RECORD 1 ]-----+--------------------------------------------------------------------
schemaname        | public
tablename         | rt1
attname           | zipl
null_frac         | 0.672
avg_width         | 4
n_distinct        | 960
most_common_vals  | {76240,52601,55746,71730,74604,92705,93117,95818}
most_common_freqs | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
histogram_bounds  | {1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006}
correlation       | -0.119519
-[ RECORD 2 ]-----+--------------------------------------------------------------------
schemaname        | public
tablename         | rt1
attname           | zipr
null_frac         | 0.677
avg_width         | 4
n_distinct        | 960
most_common_vals  | {76240,52601,55746,71730,74604,78577,92705,93117,95818}
most_common_freqs | {0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
histogram_bounds  | {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006}
correlation       | -0.104158

Now this:
tiger=# analyze verbose rt1;
INFO:  analyzing "public.rt1"
INFO:  "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows
and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like 'zip%';
-[ RECORD 1 ]-----+--------------------------------------------------------------------
schemaname        | public
tablename         | rt1
attname           | zipl
null_frac         | 0.6807
avg_width         | 4
n_distinct        | 2942
most_common_vals  | {61832,13090,17404,30907,31204,45342,47714,63050,80918,93726}
most_common_freqs | {0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005}
histogram_bounds  | {654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687}
correlation       | -0.137744
-[ RECORD 2 ]-----+--------------------------------------------------------------------
schemaname        | public
tablename         | rt1
attname           | zipr
null_frac         | 0.684
avg_width         | 4
n_distinct        | 2921
most_common_vals  | {13090,61832,30907,31204,45342,47714,63050,70122,80918,93726}
most_common_freqs | {0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005}
histogram_bounds  | {731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654}
correlation       | -0.140663
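
(The "off by a factor of 10" figure can be checked directly, though this is
a full scan over 46 million rows and takes a while:)

-- Actual distinct counts, to compare against n_distinct in pg_stats.
SELECT count(DISTINCT zipl) AS zipl_distinct,
       count(DISTINCT zipr) AS zipr_distinct
FROM rt1;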



