Thread: Incorrect expected rows by ANALYZE

Incorrect expected rows by ANALYZE

From
Kenji Sugita
Date:
Assuming that following queries are executed on PostgreSQL 7.3.
create table c1 (i integer);create table c2 (i integer);insert into c1 values (1 .. 100000);insert into c2 values (1 ..
100000);  (A) delete from c1;   (B)    insert into c1 values (1 .. 100000);analyze;select * from c1, c2 where c1.i =
c2.i;
   =# explain analyze select * from c1, c2 where c1.i = c2.i;                        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
HashJoin  (cost=950.90..3713.21 rows=5512 width=8) (actual time=1381.88..26452.33 rows=100000 loops=1)      Hash Cond:
("outer".i= "inner".i)      ->  Seq Scan on c2  (cost=0.00..1442.07 rows=100107 width=4) (actual time=0.28..1038.02
rows=100000loops=1)      ->  Hash  (cost=937.12..937.12 rows=5512 width=4) (actual time=1381.29..1381.29 rows=0
loops=1)    ->  Seq Scan on c1  (cost=0.00..937.12 rows=5512 width=4) (actual time=76.83..858.99 rows=100000 loops=1)
Total runtime: 26600.80 msec   (6 rows)
 
   =# 

Expected rows of 'Seq Scan on c1' does not coincide with actual rows.  These
two values coincide after vacuum and expected rows coincide with actual rows
when queries (A) and (B) are not executed.
   =# explain analyze select * from c1, c2 where c1.i = c2.i;                         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
HashJoin  (cost=2035.34..7259.35 rows=100107 width=8) (actual time=1367.11..8604.49 rows=100000 loops=1)      Hash
Cond:("outer".i = "inner".i)      ->  Seq Scan on c1  (cost=0.00..1442.07 rows=100107 width=4) (actual
time=0.28..556.03rows=100000 loops=1)      ->  Hash  (cost=1442.07..1442.07 rows=100107 width=4) (actual
time=1365.79..1365.79rows=0 loops=1)     ->  Seq Scan on c2  (cost=0.00..1442.07 rows=100107 width=4) (actual
time=0.30..626.40rows=100000 loops=1)    Total runtime: 9113.14 msec   (6 rows)
 
   =# 

Why do dead tuples cause incorrect expected rows by ANALYZE?


Kenji Sugita                                      



Re: Incorrect expected rows by ANALYZE

From
Tom Lane
Date:
Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> Why do dead tuples cause incorrect expected rows by ANALYZE?

Because it extrapolates the density of live rows in the whole table from
their density in the first few pages.

There has been some previous discussion of how to improve this, but no
really workable ideas AFAIR.  Ultimately, unless you want to abandon
the notion that ANALYZE only samples the table and doesn't read the
whole thing, there will be situations where it gets fooled.  But
possibly we can make those situations less likely to correspond to
common real-world cases.
        regards, tom lane



Re: Incorrect expected rows by ANALYZE

From
pgsql@mohawksoft.com
Date:
> Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
>> Why do dead tuples cause incorrect expected rows by ANALYZE?
>
> Because it extrapolates the density of live rows in the whole table
> from their density in the first few pages.
>
> There has been some previous discussion of how to improve this, but no
> really workable ideas AFAIR.  Ultimately, unless you want to abandon
> the notion that ANALYZE only samples the table and doesn't read the
> whole thing, there will be situations where it gets fooled.  But
> possibly we can make those situations less likely to correspond to
> common real-world cases.

Couldn't there be an ANALYZE FULL option?



Re: Incorrect expected rows by ANALYZE

From
"scott.marlowe"
Date:
On Mon, 7 Apr 2003 pgsql@mohawksoft.com wrote:

> > Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> >> Why do dead tuples cause incorrect expected rows by ANALYZE?
> > 
> > Because it extrapolates the density of live rows in the whole table
> > from their density in the first few pages.
> > 
> > There has been some previous discussion of how to improve this, but no
> > really workable ideas AFAIR.  Ultimately, unless you want to abandon
> > the notion that ANALYZE only samples the table and doesn't read the
> > whole thing, there will be situations where it gets fooled.  But
> > possibly we can make those situations less likely to correspond to
> > common real-world cases.
> 
> Couldn't there be an ANALYZE FULL option?

There's something close, that is, you can change the statistics target for 
a column in a table like so:

ALTER TABLE [ ONLY ] table [ * ]   ALTER [ COLUMN ] column SET STATISTICS integer

which can make it so that more of the rows are analyzed.



Re: Incorrect expected rows by ANALYZE

From
Tom Lane
Date:
On Mon, 7 Apr 2003 pgsql@mohawksoft.com wrote:
>> Couldn't there be an ANALYZE FULL option?

It already gets the row count right if you did VACUUM ANALYZE.  I don't
think we need ANALYZE FULL (at least, not for this purpose).

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> There's something close, that is, you can change the statistics target for 
> a column in a table like so:
> ALTER TABLE [ ONLY ] table [ * ]
>     ALTER [ COLUMN ] column SET STATISTICS integer
> which can make it so that more of the rows are analyzed.

I'm not sure though how effective that will be at avoiding the problem
Kenji is complaining of.
        regards, tom lane



Re: Incorrect expected rows by ANALYZE

From
Kevin Brown
Date:
Tom Lane wrote:
> Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> > Why do dead tuples cause incorrect expected rows by ANALYZE?
> 
> Because it extrapolates the density of live rows in the whole table from
> their density in the first few pages.

Shouldn't it select random pages within the table for sampling?  It
would be almost as fast and would probably yield much more reliable
statistics, because it would be much more likely to reflect
recently-made changes, particularly on tables that haven't been
VACUUMed recently..


-- 
Kevin Brown                          kevin@sysexperts.com