Re: BUG #7853: Incorrect statistics in table with many dead rows. - Mailing list pgsql-bugs
From: James R Skaggs
Subject: Re: BUG #7853: Incorrect statistics in table with many dead rows.
Date:
Msg-id: CAHqB0wSpxSXMG-7VXXbXoSLgOkSPWRbfrRzLEwDmhvmRbWcegA@mail.gmail.com
In response to: Re: BUG #7853: Incorrect statistics in table with many dead rows. (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: BUG #7853: Incorrect statistics in table with many dead rows.
List: pgsql-bugs
Okay, I have some more info.

Some background: this one table gets so many changes that I CLUSTER it each night. However, after I do this, the statistics still appear to be incorrect, even after a "select pg_stat_reset();" followed by three ANALYZE runs with default_statistics_target set to 1, 10, and 100:

select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
from pg_stat_all_tables;

('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)

Is it possible that there are still dead tuples after a CLUSTER?

EXPLAIN ANALYZE thinks we have 112M rows (in fact there are about 10M), so it needs to do a disk sort with work_mem=1GB:

"Aggregate  (cost=23622814.39..23622814.40 rows=1 width=12) (actual time=57512.462..57512.462 rows=1 loops=1)"
"  ->  GroupAggregate  (cost=21536989.70..23425506.64 rows=11274728 width=53) (actual time=50825.396..57457.421 rows=202394 loops=1)"
"        Output: max(sf.current_code_line_count), max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
"        ->  Sort  (cost=21536989.70..21818857.90 rows=112747280 width=53) (actual time=50825.386..55004.916 rows=7683730 loops=1)"
"              Output: sf.current_code_line_count, sf.current_comment_line_count, sf.current_blank_line_count, sf.file_path_id, sf.current_source_md5"
"              Sort Key: sf.file_path_id, sf.current_source_md5"
"              Sort Method: external merge  Disk: 502288kB"
"              ->  Seq Scan on stream_file sf  (cost=0.00..2604208.80 rows=112747280 width=53) (actual time=0.033..27922.485 rows=7683730 loops=1)"
"                    Output: sf.current_code_line_count, sf.current_comment_line_count, sf.current_blank_line_count, sf.file_path_id, sf.current_source_md5"
"Total runtime: 57693.835 ms"

Now we run many, many ANALYZE VERBOSE and converge on the correct value, which is known to be about 10M rows:

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 1476736 pages, containing 158846 live rows and 2175512 dead rows; 30000 rows in sample, 112747282 estimated total rows
Query returned successfully with no result in 9172 ms.

... 200x later ...

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 1480611 pages, containing 158776 live rows and 2170410 dead rows; 30000 rows in sample, 9769236 estimated total rows
Query returned successfully with no result in 441 ms.

Now the optimizer thinks it can do a hash aggregate in memory, and we get better performance:

"Aggregate  (cost=1734729.12..1734729.14 rows=1 width=12) (actual time=33816.049..33816.049 rows=1 loops=1)"
"  ->  HashAggregate  (cost=1700534.50..1717631.81 rows=976989 width=53) (actual time=33535.083..33712.787 rows=202404 loops=1)"
"        Output: max(sf.current_code_line_count), max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
"        ->  Seq Scan on stream_file sf  (cost=0.00..1578410.89 rows=9769889 width=53) (actual time=392.435..26278.143 rows=7710223 loops=1)"
"              Output: sf.id, sf.current_blank_line_count, sf.current_code_line_count, sf.current_comment_line_count, sf.current_source_md5, sf.component_id, sf.current_file_instance_id, sf.current_file_state_id, sf.file_path_id, sf.stream_element_id"
"Total runtime: 33822.707 ms"

But later in the day, the statistics revert back to the 100M number! Any ideas? Is there some kind of cache that is remembering the old statistics?
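For anyone following along, here is a quick cross-check one could run (a sketch only, using the stream_file table name from this thread; note the count(*) does a full scan on a table this size):

-- Compare the planner's stored estimate and the stats collector's
-- live/dead counters against an actual row count.
select c.reltuples::bigint          as planner_estimate,
       s.n_live_tup,
       s.n_dead_tup,
       (select count(*) from stream_file) as actual_live_rows
from pg_class c
join pg_stat_all_tables s on s.relid = c.oid
where c.relname = 'stream_file';
-- If the pgstattuple extension is installed, SELECT * FROM pgstattuple('stream_file')
-- also reports dead_tuple_count directly, which answers the "dead tuples after
-- CLUSTER?" question without relying on the statistics views.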
Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224

On Sun, Feb 10, 2013 at 1:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > "jimbob@seagate.com" <jimbob@seagate.com> wrote:
> >
> >> INFO: analyzing "public.stream_file"
> >> INFO: "stream_file": scanned 30000 of 2123642 pages, containing
> >> 184517 live rows and 2115512 dead rows; 30000 rows in sample,
> >> 158702435 estimated total rows
> >
> > 184517 live rows in 30000 randomly sampled pages out of 2123642
> > total pages, means that the statistics predict that a select
> > count(*) will find about 13 million live rows to count.
> >
> >> After "analyze verbose", the table shows 158 million rows. A
> >> select count(1) yields 13.8 million rows.
> >
> > OK, the estimate was 13 million and there were actually 13.8
> > million, but it is a random sample used to generate estimates.
> > That seems worse than average, but close enough to be useful.
> > The 158.7 million total rows includes dead rows, which must be
> > visited to determine visibility, but will not be counted because
> > they are not visible to the counting transaction.
>
> To clarify here, the 158.7 million estimate does not *intentionally*
> include dead rows. As you say, the ANALYZE did get a very good
> instantaneous estimate of the number of live rows. However, ANALYZE
> doesn't over-write the old estimate, it averages its estimate into the
> old one. After the table shape changes dramatically, the ANALYZE
> needs to be run repeatedly before the estimate will converge to the
> new reality. (Of course a cluster or vacuum full will blow away the
> old statistics, so the next analyze after that will solely determine
> the new statistics.)
>
> I agree, not a bug.
>
> Cheers,
>
> Jeff
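For reference, Kevin's "about 13 million" figure falls straight out of the sampled numbers quoted above; a one-line sanity check (plain arithmetic, no table access needed):

-- Extrapolate live rows per sampled page to the whole table,
-- using the numbers from the quoted ANALYZE VERBOSE output.
select round(184517.0 / 30000 * 2123642) as estimated_live_rows;
-- roughly 13.06 million, versus the ~13.8 million found by select count(1)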