Thread: Incorrect expected rows by ANALYZE
Assume the following queries are executed on PostgreSQL 7.3:

  create table c1 (i integer);
  create table c2 (i integer);
  insert into c1 values (1 .. 100000);
  insert into c2 values (1 .. 100000);
  (A) delete from c1;
  (B) insert into c1 values (1 .. 100000);
  analyze;
  select * from c1, c2 where c1.i = c2.i;

=# explain analyze select * from c1, c2 where c1.i = c2.i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=950.90..3713.21 rows=5512 width=8) (actual time=1381.88..26452.33 rows=100000 loops=1)
   Hash Cond: ("outer".i = "inner".i)
   ->  Seq Scan on c2  (cost=0.00..1442.07 rows=100107 width=4) (actual time=0.28..1038.02 rows=100000 loops=1)
   ->  Hash  (cost=937.12..937.12 rows=5512 width=4) (actual time=1381.29..1381.29 rows=0 loops=1)
         ->  Seq Scan on c1  (cost=0.00..937.12 rows=5512 width=4) (actual time=76.83..858.99 rows=100000 loops=1)
 Total runtime: 26600.80 msec
(6 rows)

The expected row count for 'Seq Scan on c1' does not coincide with the actual rows. The two values coincide after a vacuum, and the expected rows also coincide with the actual rows when queries (A) and (B) are not executed:

=# explain analyze select * from c1, c2 where c1.i = c2.i;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2035.34..7259.35 rows=100107 width=8) (actual time=1367.11..8604.49 rows=100000 loops=1)
   Hash Cond: ("outer".i = "inner".i)
   ->  Seq Scan on c1  (cost=0.00..1442.07 rows=100107 width=4) (actual time=0.28..556.03 rows=100000 loops=1)
   ->  Hash  (cost=1442.07..1442.07 rows=100107 width=4) (actual time=1365.79..1365.79 rows=0 loops=1)
         ->  Seq Scan on c2  (cost=0.00..1442.07 rows=100107 width=4) (actual time=0.30..626.40 rows=100000 loops=1)
 Total runtime: 9113.14 msec
(6 rows)

Why do dead tuples cause incorrect expected rows by ANALYZE?

Kenji Sugita
Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> Why do dead tuples cause incorrect expected rows by ANALYZE?

Because it extrapolates the density of live rows in the whole table
from their density in the first few pages.

There has been some previous discussion of how to improve this, but no
really workable ideas AFAIR.  Ultimately, unless you want to abandon
the notion that ANALYZE only samples the table and doesn't read the
whole thing, there will be situations where it gets fooled.  But
possibly we can make those situations less likely to correspond to
common real-world cases.

			regards, tom lane
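Tom's point can be illustrated with a toy numeric sketch. The page layout below is an assumption chosen to resemble the reported case (DELETE leaves the old pages full of dead tuples, and the re-inserted rows land in freshly appended pages), and the row counts are invented for illustration; this is not the actual ANALYZE algorithm, just the "extrapolate from the first few pages" idea:

```python
# Toy model: ANALYZE-style row estimation that looks only at the
# first few pages and extrapolates live-row density to the whole table.
# Hypothetical layout after DELETE + re-INSERT with no intervening VACUUM:
# the old pages at the front are mostly dead, the reloaded rows sit in
# freshly appended pages at the end.

DEAD_PAGES = 1000          # old pages, mostly dead tuples
FRESH_PAGES = 1000         # new pages, fully packed with live rows
LIVE_PER_DEAD_PAGE = 5     # a few live rows per old page (assumed)
LIVE_PER_FRESH_PAGE = 100  # live rows per freshly filled page (assumed)

# live-row count per page, front to back
pages = [LIVE_PER_DEAD_PAGE] * DEAD_PAGES + [LIVE_PER_FRESH_PAGE] * FRESH_PAGES

def estimate_rows(sampled_pages, total_pages):
    """Extrapolate the table's live-row count from the sampled pages."""
    density = sum(sampled_pages) / len(sampled_pages)
    return density * total_pages

actual = sum(pages)                            # true live-row count
front = estimate_rows(pages[:10], len(pages))  # sample only the first 10 pages

print(actual, front)  # the front-page sample badly underestimates
```

Because every sampled page comes from the mostly-dead front of the table, the extrapolated count is roughly an order of magnitude too low, much like the rows=5512 vs. 100000 mismatch in Kenji's plan.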
> Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
>> Why do dead tuples cause incorrect expected rows by ANALYZE?
>
> Because it extrapolates the density of live rows in the whole table
> from their density in the first few pages.
>
> There has been some previous discussion of how to improve this, but no
> really workable ideas AFAIR.  Ultimately, unless you want to abandon
> the notion that ANALYZE only samples the table and doesn't read the
> whole thing, there will be situations where it gets fooled.  But
> possibly we can make those situations less likely to correspond to
> common real-world cases.

Couldn't there be an ANALYZE FULL option?
On Mon, 7 Apr 2003 pgsql@mohawksoft.com wrote:
> > Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> >> Why do dead tuples cause incorrect expected rows by ANALYZE?
> >
> > Because it extrapolates the density of live rows in the whole table
> > from their density in the first few pages.
> >
> > There has been some previous discussion of how to improve this, but no
> > really workable ideas AFAIR.  Ultimately, unless you want to abandon
> > the notion that ANALYZE only samples the table and doesn't read the
> > whole thing, there will be situations where it gets fooled.  But
> > possibly we can make those situations less likely to correspond to
> > common real-world cases.
>
> Couldn't there be an ANALYZE FULL option?

There's something close, that is, you can change the statistics target for
a column in a table like so:

  ALTER TABLE [ ONLY ] table [ * ]
      ALTER [ COLUMN ] column SET STATISTICS integer

which can make it so that more of the rows are analyzed.
On Mon, 7 Apr 2003 pgsql@mohawksoft.com wrote:
>> Couldn't there be an ANALYZE FULL option?

It already gets the row count right if you did VACUUM ANALYZE.  I don't
think we need ANALYZE FULL (at least, not for this purpose).

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> There's something close, that is, you can change the statistics target for
> a column in a table like so:
>   ALTER TABLE [ ONLY ] table [ * ]
>       ALTER [ COLUMN ] column SET STATISTICS integer
> which can make it so that more of the rows are analyzed.

I'm not sure though how effective that will be at avoiding the problem
Kenji is complaining of.

			regards, tom lane
Tom Lane wrote:
> Kenji Sugita <sugita@srapc1327.sra.co.jp> writes:
> > Why do dead tuples cause incorrect expected rows by ANALYZE?
>
> Because it extrapolates the density of live rows in the whole table from
> their density in the first few pages.

Shouldn't it select random pages within the table for sampling?  It would
be almost as fast and would probably yield much more reliable statistics,
because it would be much more likely to reflect recently-made changes,
particularly on tables that haven't been VACUUMed recently.

-- 
Kevin Brown                                           kevin@sysexperts.com
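Kevin's suggestion can be sketched numerically. The page layout below is a hypothetical worst case for front-page sampling (all dead tuples concentrated at the front of the table, as after a bulk DELETE and reload); the comparison shows why a random page sample is far less sensitive to where the dead tuples happen to sit:

```python
import random

# Hypothetical layout: 1000 mostly-dead old pages (5 live rows each)
# followed by 1000 freshly filled pages (100 live rows each).
DEAD_PAGES, FRESH_PAGES = 1000, 1000
pages = [5] * DEAD_PAGES + [100] * FRESH_PAGES  # live rows per page
actual = sum(pages)                             # true live-row count

random.seed(42)  # fixed seed so the sketch is repeatable

def estimate(sampled):
    """Extrapolate total live rows from a sample of pages."""
    return sum(sampled) / len(sampled) * len(pages)

# Old behaviour: density taken from the first 100 pages only.
first_pages = estimate(pages[:100])

# Proposed behaviour: density taken from 100 randomly chosen pages.
random_pages = estimate(random.sample(pages, 100))

print(first_pages, random_pages, actual)
```

In this layout the front-page estimate is off by roughly a factor of ten, while the random sample lands close to the true count, because the random pages reflect the table's overall mix of dead and live pages rather than just its stale front.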