Re: BUG #7853: Incorrect statistics in table with many dead rows. - Mailing list pgsql-bugs

From James R Skaggs
Subject Re: BUG #7853: Incorrect statistics in table with many dead rows.
Date
Msg-id CAHqB0wQhGzjzbtK1t0t+xixcdJ8L3FxX+XouJw4-LCaG+B8Sjw@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7853: Incorrect statistics in table with many dead rows.  (Kevin Grittner <kgrittn@ymail.com>)
Responses Re: BUG #7853: Incorrect statistics in table with many dead rows.
List pgsql-bugs
Kevin -

Agreed, we shouldn't have so many dead rows.  Our autovacuum is set on but
has default parameters.   We are clustering today.  This, of course, means
downtime and inconvenience to the users.

Here is the troublesome query:

select

    sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,

    sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,

    sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt

  from

    ( select

        max(sf.current_code_line_count) as cdlc,

        max(sf.current_comment_line_count) as cmlc,

        max(sf.current_blank_line_count) as bllc

      from

        stream_file sf

      group by sf.file_path_id, sf.current_source_md5

    ) as t1;


And the explain plan.  Against a optimized database, the plan was better,
as well as execution

Aggregate  (cost=32073953.42..32073953.43 rows=1 width=12) (actual
time=77394.354..77394.354 rows=1 loops=1)
  ->  GroupAggregate  (cost=29271854.25..31808889.97 rows=15146482
width=53) (actual time=62490.514..77333.105 rows=168591 loops=1)
        ->  Sort  (cost=29271854.25..29650516.30 rows=151464819 width=53)
(actual time=62490.492..73098.009 rows=14403547 loops=1)
              Sort Key: sf.file_path_id, sf.current_source_md5
              Sort Method:  external merge  Disk: 942440kB
              ->  Seq Scan on stream_file sf  (cost=0.00..3514999.19
rows=151464819 width=53) (actual time=0.327..20620.230 rows=14403547
loops=1)
Total runtime: 77836.949 ms

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


On Tue, Feb 5, 2013 at 3:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

> "jimbob@seagate.com" <jimbob@seagate.com> wrote:
>
> > INFO:  analyzing "public.stream_file"
> > INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
> > 184517 live rows and 2115512 dead rows; 30000 rows in sample,
> > 158702435 estimated total rows
>
> 184517 live rows in 30000 randomly sampled pages out of 2123642
> total pages, means that the statistics predict that a select
> count(*) will find about  13 million live rows to count.
>
> > After "analyze verbose", the table shows 158 million rows. A
> > select count(1) yields 13.8 million rows.
>
> OK, the estimate was 13 million and there were actually 13.8
> million, but it is a random sample used to generate estimates.
> That seems worse than average, but close enough to be useful.
> The 158.7 million total rows includes dead rows, which must be
> visited to determine visibility, but will not be counted because
> they are not visible to the counting transaction.  Having over 90%
> of your table filled with dead rows is a bad situation to be in,
> from a performance standpoint.  You should use aggressive
> maintenance (like VACUUM FULL or CLUSTER) to fix the existing
> extreme bloat, and then review your autovacuum settings and overall
> vacuum regimen to prevent future bloat.
>
> This does not look like a bug from the information provided so far.
>
> -Kevin
>

pgsql-bugs by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: BUG #7853: Incorrect statistics in table with many dead rows.
Next
From: Kevin Grittner
Date:
Subject: Re: BUG #7853: Incorrect statistics in table with many dead rows.