Re: BUG #7853: Incorrect statistics in table with many dead rows. - Mailing list pgsql-bugs

From James R Skaggs
Subject Re: BUG #7853: Incorrect statistics in table with many dead rows.
Msg-id CAHqB0wSpxSXMG-7VXXbXoSLgOkSPWRbfrRzLEwDmhvmRbWcegA@mail.gmail.com
In response to Re: BUG #7853: Incorrect statistics in table with many dead rows.  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: BUG #7853: Incorrect statistics in table with many dead rows.  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-bugs
Okay, I have some more info.

Some background info: this one table gets so many changes that I CLUSTER it
each night. However, even after I do this, the statistics still appear to be
incorrect. They stay wrong even after a "select pg_stat_reset();" followed by
three ANALYZE runs with default_statistics_target set to 1, 10, and 100;
pg_stat_all_tables still shows the old numbers (query and result below).
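For reference, that nightly sequence looks roughly like this (just a sketch; it
assumes a clustering index has already been set on the table):

    -- Sketch of the nightly maintenance plus the re-analyze attempts above.
    CLUSTER stream_file;          -- rewrites the table on its clustering index
    SELECT pg_stat_reset();       -- clears the activity counters

    SET default_statistics_target = 1;    ANALYZE stream_file;
    SET default_statistics_target = 10;   ANALYZE stream_file;
    SET default_statistics_target = 100;  ANALYZE stream_file;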

    select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
        from pg_stat_all_tables

    ('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)

Is it possible that there are still dead tuples after a CLUSTER?
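One direct way to check that, in case it helps (a sketch; it assumes the
contrib module pgstattuple can be installed):

    -- Physical tuple counts straight from the heap, right after the nightly
    -- CLUSTER, independent of the counters in pg_stat_all_tables.
    CREATE EXTENSION pgstattuple;
    SELECT tuple_count, dead_tuple_count, dead_tuple_percent
      FROM pgstattuple('stream_file');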

Explain Analyze thinks we have 112M rows (in fact there are about 10M), so it
has to do a disk sort even with work_mem=1GB:

"Aggregate  (cost=23622814.39..23622814.40 rows=1 width=12) (actual
time=57512.462..57512.462 rows=1 loops=1)"
"  ->  GroupAggregate  (cost=21536989.70..23425506.64 rows=11274728
width=53) (actual time=50825.396..57457.421 rows=202394 loops=1)"
"        Output: max(sf.current_code_line_count),
max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
"        ->  Sort  (cost=21536989.70..21818857.90 rows=112747280 width=53)
(actual time=50825.386..55004.916 rows=7683730 loops=1)"
"              Output: sf.current_code_line_count,
sf.current_comment_line_count, sf.current_blank_line_count,
sf.file_path_id, sf.current_source_md5"
"              Sort Key: sf.file_path_id, sf.current_source_md5"
"              Sort Method:  external merge  Disk: 502288kB"
"              ->  Seq Scan on stream_file sf  (cost=0.00..2604208.80
rows=112747280 width=53) (actual time=0.033..27922.485 rows=7683730
loops=1)"
"                    Output: sf.current_code_line_count,
sf.current_comment_line_count, sf.current_blank_line_count,
sf.file_path_id, sf.current_source_md5"
"Total runtime: 57693.835 ms"

Now we run ANALYZE VERBOSE many, many times, and the estimate converges on the
correct value, which is known to be about 10M rows.

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 1476736 pages, containing 158846
live rows and 2175512 dead rows; 30000 rows in sample, 112747282 estimated
total rows
Query returned successfully with no result in 9172 ms.

... about 200 repetitions later! ...

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 1480611 pages, containing 158776
live rows and 2170410 dead rows; 30000 rows in sample, 9769236 estimated
total rows
Query returned successfully with no result in 441 ms.
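A quick way to watch the estimate converge between runs (a sketch) is to
re-analyze and check the rows= figure on a trivial plan:

    -- Repeat until the Seq Scan row estimate stops moving; per Jeff's note
    -- below, each ANALYZE only averages its sample into the old estimate.
    ANALYZE stream_file;
    EXPLAIN SELECT * FROM stream_file;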

Now the optimizer thinks we can do a hash aggregate in memory, and we get
better performance.

"Aggregate  (cost=1734729.12..1734729.14 rows=1 width=12) (actual
time=33816.049..33816.049 rows=1 loops=1)"
"  ->  HashAggregate  (cost=1700534.50..1717631.81 rows=976989 width=53)
(actual time=33535.083..33712.787 rows=202404 loops=1)"
"        Output: max(sf.current_code_line_count),
max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
"        ->  Seq Scan on stream_file sf  (cost=0.00..1578410.89
rows=9769889 width=53) (actual time=392.435..26278.143 rows=7710223
loops=1)"
"              Output: sf.id, sf.current_blank_line_count,
sf.current_code_line_count, sf.current_comment_line_count,
sf.current_source_md5, sf.component_id, sf.current_file_instance_id,
sf.current_file_state_id, sf.file_path_id, sf.stream_element_id"
"Total runtime: 33822.707 ms"

But later in the day, the statistics revert back to the 100M number!  Any
ideas?  Is there some kind of cache that is remembering the old statistics?
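
One way to catch the moment it jumps back might be to watch the planner
estimate alongside the analyze timestamps (a sketch):

    -- Watch reltuples together with the analyze timestamps to see whether
    -- autoanalyze (or something else) resets the estimate during the day.
    SELECT c.relname,
           c.reltuples::bigint AS planner_row_estimate,
           s.n_live_tup, s.n_dead_tup,
           s.last_analyze, s.last_autoanalyze
      FROM pg_class c
      JOIN pg_stat_all_tables s ON s.relid = c.oid
     WHERE c.relname = 'stream_file';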


Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


On Sun, Feb 10, 2013 at 1:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > "jimbob@seagate.com" <jimbob@seagate.com> wrote:
> >
> >> INFO:  analyzing "public.stream_file"
> >> INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
> >> 184517 live rows and 2115512 dead rows; 30000 rows in sample,
> >> 158702435 estimated total rows
> >
> > 184517 live rows in 30000 randomly sampled pages out of 2123642
> > total pages, means that the statistics predict that a select
> > count(*) will find about  13 million live rows to count.
> >
> >> After "analyze verbose", the table shows 158 million rows. A
> >> select count(1) yields 13.8 million rows.
> >
> > OK, the estimate was 13 million and there were actually 13.8
> > million, but it is a random sample used to generate estimates.
> > That seems worse than average, but close enough to be useful.
> > The 158.7 million total rows includes dead rows, which must be
> > visited to determine visibility, but will not be counted because
> > they are not visible to the counting transaction.
>
> To clarify here, the 158.7 million estimate does not *intentionally*
> include dead rows.  As you say, the ANALYZE did get a very good
> instantaneous estimate of the number of live rows.  However, ANALYZE
> doesn't over-write the old estimate, it averages its estimate into the
> old one.  After the table shape changes dramatically, the ANALYZE
> needs to be run repeatedly before the estimate will converge to the
> new reality.  (Of course a cluster or vacuum full will blow away the
> old statistics, so the next analyze after that will solely determine
> the new statistics.)
>
> I agree, not a bug.
>
> Cheers,
>
> Jeff
>
