Re: BUG #7853: Incorrect statistics in table with many dead rows. - Mailing list pgsql-bugs

From: James R Skaggs
Subject: Re: BUG #7853: Incorrect statistics in table with many dead rows.
Msg-id: CAHqB0wQZUzLU_12MEm9pGbB8sYdck4m+dTKwaDTiQ4ZE55SFgg@mail.gmail.com
In response to: Re: BUG #7853: Incorrect statistics in table with many dead rows. (Kevin Grittner <kgrittn@ymail.com>)
List: pgsql-bugs
Kevin -

Here is the plan *after* we clustered on the PK index. It was probably that
way before the bloat, but I don't know.

Aggregate  (cost=514557.38..514557.40 rows=1 width=12) (actual time=8823.973..8823.974 rows=1 loops=1)
  ->  HashAggregate  (cost=471029.03..492793.20 rows=1243667 width=53) (actual time=8725.324..8789.926 rows=159369 loops=1)
        ->  Seq Scan on stream_file sf  (cost=0.00..315570.68 rows=12436668 width=53) (actual time=0.013..1152.600 rows=12436753 loops=1)
Total runtime: 8838.395 ms
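For anyone who wants to see how many dead rows the planner is wading
through, the standard statistics views expose the live/dead row counts and
the on-disk size estimates (a sketch against the stock catalogs, not actual
output from our database):

    select n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
    from pg_stat_user_tables
    where relname = 'stream_file';

    select relpages, reltuples
    from pg_class
    where relname = 'stream_file';

Comparing n_dead_tup to n_live_tup gives a rough measure of how much bloat
the statistics collector is aware of.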

I guess the statistics should reflect the true cost of running the query, so
if the "bloated" statistics are representative of that cost, I'm all for them.

We are clustering the database as we speak, so I can't force a hash
aggregate to see if that is the right choice or not.
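Once the CLUSTER finishes, it should be possible to force the comparison in
a throwaway transaction instead of touching the global config. A sketch
using the standard planner GUCs (turning enable_sort off discourages the
sorted GroupAggregate; enable_hashagg = off does the reverse):

    begin;
    set local enable_sort = off;  -- lasts only until commit/rollback
    explain analyze
    select
        sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
        sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
        sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
    from
        ( select
            max(sf.current_code_line_count) as cdlc,
            max(sf.current_comment_line_count) as cmlc,
            max(sf.current_blank_line_count) as bllc
          from stream_file sf
          group by sf.file_path_id, sf.current_source_md5
        ) as t1;
    rollback;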

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


On Tue, Feb 5, 2013 at 4:12 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

> James R Skaggs <james.r.skaggs@seagate.com> wrote:
>
> > Agreed, we shouldn't have so many dead rows.  Our autovacuum is
> > enabled but uses the default parameters.  We are clustering today.
> > This, of course, means downtime and inconvenience to the users.
>
> Right, which is why it's important to figure out why the bloat
> happened.  Sometimes it is unavoidable, like when you delete 90% of
> the rows in your table or a long-lived "idle in transaction"
> connection prevents autovacuum from being able to do its work
> normally.  To prevent further downtime it is important to figure
> out what happened and make appropriate changes to your monitoring
> or vacuuming.
>
> > Here is the troublesome query:
> >
> >> select
> >>    sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
> >>    sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
> >>    sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
> >>  from
> >>    ( select
> >>        max(sf.current_code_line_count) as cdlc,
> >>        max(sf.current_comment_line_count) as cmlc,
> >>        max(sf.current_blank_line_count) as bllc
> >>      from
> >>        stream_file sf
> >>      group by sf.file_path_id, sf.current_source_md5
> >>    ) as t1;
>
> Well, I don't see that the planner has a lot of choice there
> besides whether to use a sort or a hash to do the inner
> aggregation.  Are you saying that prior to the bloat it used a hash
> aggregation, and that was faster?  And that you feel that it should
> be using that even with the bloat?  That the dead rows seem to be
> getting included in the statistics, driving it to the slower plan, and
> you feel they should be omitted?
>
> Note that I'm not arguing one way or another on these points at the
> moment; I'm just trying to understand your point clearly.
>
> -Kevin
>
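Two things are worth checking, prompted by Kevin's notes above. First, the
long-lived "idle in transaction" sessions he mentions (a sketch against
pg_stat_activity; the state and pid columns are the 9.2 names, older
releases expose this through current_query and procpid):

    select pid, usename, xact_start, now() - xact_start as xact_age
    from pg_stat_activity
    where state = 'idle in transaction'
    order by xact_start;

Second, if autovacuum simply isn't keeping up on this one table, its
thresholds can be tightened per table rather than globally:

    -- 2% is only an illustrative value, not a recommendation from the
    -- thread; pick it based on table size and churn.
    alter table stream_file set (autovacuum_vacuum_scale_factor = 0.02);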
