On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> "jimbob@seagate.com" <jimbob@seagate.com> wrote:
>
>> INFO: analyzing "public.stream_file"
>> INFO: "stream_file": scanned 30000 of 2123642 pages, containing
>> 184517 live rows and 2115512 dead rows; 30000 rows in sample,
>> 158702435 estimated total rows
>
> 184517 live rows in 30000 randomly sampled pages out of 2123642
> total pages, means that the statistics predict that a select
> count(*) will find about 13 million live rows to count.
>
>> After "analyze verbose", the table shows 158 million rows. A
>> select count(1) yields 13.8 million rows.
>
> OK, the estimate was 13 million and there were actually 13.8
> million, but it is a random sample used to generate estimates.
> That seems worse than average, but close enough to be useful.
> The 158.7 million total rows includes dead rows, which must be
> visited to determine visibility, but will not be counted because
> they are not visible to the counting transaction.

To clarify here, the 158.7 million estimate does not *intentionally*
include dead rows. As you say, the ANALYZE did get a very good
instantaneous estimate of the number of live rows. However, ANALYZE
doesn't overwrite the old estimate; it averages its new estimate into
the old one. After the table's shape changes dramatically, ANALYZE
needs to be run repeatedly before the estimate converges on the new
reality. (Of course, a CLUSTER or VACUUM FULL will blow away the old
statistics, so the next ANALYZE after that will solely determine the
new statistics.)
I agree, not a bug.
Cheers,
Jeff