Thread: BUG #7853: Incorrect statistics in table with many dead rows.
The following bug has been logged on the website:

Bug reference:      7853
Logged by:          James Skaggs
Email address:      jimbob@seagate.com
PostgreSQL version: 8.4.14
Operating system:   RHEL6

Description:

After "analyze verbose", the table shows 158 million rows. A
select count(1) yields 13.8 million rows.

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
184517 live rows and 2115512 dead rows; 30000 rows in sample,
158702435 estimated total rows

Here are the table statistics:

    Sequential Scans          81853
    Sequential Tuples Read    578848425234
    Index Scans               1976513672
    Index Tuples Fetched      2183339860
    Tuples Inserted           65122575
    Tuples Updated            308883671
    Tuples Deleted            51238760
    Tuples HOT Updated        2242897
    Live Tuples               163981972
    Dead Tuples               7056493
    Heap Blocks Read          43483331819
    Heap Blocks Hit           43121456487
    Index Blocks Read         134539277
    Index Blocks Hit          13606451182
    Toast Blocks Read
    Toast Blocks Hit
    Toast Index Blocks Read
    Toast Index Blocks Hit
    Last Vacuum               2013-02-04 10:06:44.058743-07
    Last Autovacuum           2013-02-04 16:11:34.289823-07
    Last Analyze              2013-02-04 14:22:27.848547-07
    Last Autoanalyze          2013-02-01 17:37:29.855553-07
    Table Size                17 GB
    Toast Table Size          none
    Indexes Size              34 GB

Query returned successfully with no result in 4094 ms.

Bad statistics led to a bad plan. We will cluster the table today to see if
that fixes it, but I think statistics should be correct regardless of the
state of a table. BTW, the Coverity product requires 8.x, and we'll upgrade
to 8.4.15 today. I didn't see anything about better statistics in the
8.4.15 changelog.
"jimbob@seagate.com" <jimbob@seagate.com> wrote:=0A=0A> INFO:=A0 analyzing = "public.stream_file"=0A> INFO:=A0 "stream_file": scanned 30000 of 2123642 p= ages, containing=0A> 184517 live rows and 2115512 dead rows; 30000 rows in = sample,=0A> 158702435 estimated total rows=0A=0A184517 live rows in 30000 r= andomly sampled pages out of 2123642=0Atotal pages, means that the statisti= cs predict that a select=0Acount(*) will find about=A0 13 million live rows= to count.=0A=0A> After "analyze verbose", the table shows 158 million rows= . A=0A> select count(1) yields 13.8 million rows.=0A=0AOK, the estimate was= 13 million and there were actually 13.8=0Amillion, but it is a random samp= le used to generate estimates. =0AThat seems worse than average, but close = enough to be useful.=0AThe 158.7 million total rows includes dead rows, whi= ch must be=0Avisited to determine visibility, but will not be counted becau= se=0Athey are not visible to the counting transaction.=A0 Having over 90%= =0Aof your table filled with dead rows is a bad situation to be in,=0Afrom = a performance standpoint.=A0 You should use aggressive=0Amaintenance (like = VACUUM FULL or CLUSTER) to fix the existing=0Aextreme bloat, and then revie= w your autovacuum settings and overall=0Avacuum regimen to prevent future b= loat.=0A=0AThis does not look like a bug from the information provided so f= ar.=0A=0A-Kevin
Kevin -

Agreed, we shouldn't have so many dead rows.  Our autovacuum is set on but
has default parameters.  We are clustering today.  This, of course, means
downtime and inconvenience to the users.

Here is the troublesome query:

    select
        sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
        sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
        sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
    from
        ( select
              max(sf.current_code_line_count) as cdlc,
              max(sf.current_comment_line_count) as cmlc,
              max(sf.current_blank_line_count) as bllc
          from
              stream_file sf
          group by sf.file_path_id, sf.current_source_md5
        ) as t1;

And the explain plan.  Against an optimized database, both the plan and the
execution time were better:

    Aggregate  (cost=32073953.42..32073953.43 rows=1 width=12) (actual time=77394.354..77394.354 rows=1 loops=1)
      ->  GroupAggregate  (cost=29271854.25..31808889.97 rows=15146482 width=53) (actual time=62490.514..77333.105 rows=168591 loops=1)
            ->  Sort  (cost=29271854.25..29650516.30 rows=151464819 width=53) (actual time=62490.492..73098.009 rows=14403547 loops=1)
                  Sort Key: sf.file_path_id, sf.current_source_md5
                  Sort Method:  external merge  Disk: 942440kB
                  ->  Seq Scan on stream_file sf  (cost=0.00..3514999.19 rows=151464819 width=53) (actual time=0.327..20620.230 rows=14403547 loops=1)
    Total runtime: 77836.949 ms

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224

On Tue, Feb 5, 2013 at 3:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> "jimbob@seagate.com" <jimbob@seagate.com> wrote:
>
> > INFO:  analyzing "public.stream_file"
> > INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
> > 184517 live rows and 2115512 dead rows; 30000 rows in sample,
> > 158702435 estimated total rows
>
> 184517 live rows in 30000 randomly sampled pages out of 2123642
> total pages, means that the statistics predict that a select
> count(*) will find about 13 million live rows to count.
>
> > After "analyze verbose", the table shows 158 million rows. A
> > select count(1) yields 13.8 million rows.
>
> OK, the estimate was 13 million and there were actually 13.8
> million, but it is a random sample used to generate estimates.
> That seems worse than average, but close enough to be useful.
> The 158.7 million total rows includes dead rows, which must be
> visited to determine visibility, but will not be counted because
> they are not visible to the counting transaction.  Having over 90%
> of your table filled with dead rows is a bad situation to be in,
> from a performance standpoint.  You should use aggressive
> maintenance (like VACUUM FULL or CLUSTER) to fix the existing
> extreme bloat, and then review your autovacuum settings and overall
> vacuum regimen to prevent future bloat.
>
> This does not look like a bug from the information provided so far.
>
> -Kevin
James R Skaggs <james.r.skaggs@seagate.com> wrote:

> Agreed, we shouldn't have so many dead rows.  Our autovacuum is
> set on but has default parameters.  We are clustering today.
> This, of course, means downtime and inconvenience to the users.

Right, which is why it's important to figure out why the bloat
happened.  Sometimes it is unavoidable, like when you delete 90% of
the rows in your table or a long-lived "idle in transaction"
connection prevents autovacuum from being able to do its work
normally.  To prevent further downtime it is important to figure
out what happened and make appropriate changes to your monitoring
or vacuuming.

> Here is the troublesome query:
>
>> select
>>     sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
>>     sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
>>     sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
>>   from
>>     ( select
>>         max(sf.current_code_line_count) as cdlc,
>>         max(sf.current_comment_line_count) as cmlc,
>>         max(sf.current_blank_line_count) as bllc
>>       from
>>         stream_file sf
>>       group by sf.file_path_id, sf.current_source_md5
>>     ) as t1;

Well, I don't see that the planner has a lot of choice there
besides whether to use a sort or a hash to do the inner
aggregation.  Are you saying that prior to the bloat it used a hash
aggregation, and that was faster?  And that you feel that it should
be using that even with the bloat?  That the dead rows seem to be
getting included in the statistics, driving to the slower plan, and
you feel they should be omitted?

Note that I'm not arguing one way or another on these points at the
moment; I'm just trying to understand your point clearly.

-Kevin
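As a practical aside (not part of the original messages): on 8.4, the long-lived "idle in transaction" sessions Kevin mentions can usually be spotted with a query along these lines; the column names procpid and current_query are the 8.4-era ones and were renamed in later releases:

    SELECT procpid, usename, xact_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
    ORDER BY xact_start;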
Kevin -

Here is the plan *after* we clustered on the PK index.  It was probably that
way before the bloat, but I don't know.

    Aggregate  (cost=514557.38..514557.40 rows=1 width=12) (actual time=8823.973..8823.974 rows=1 loops=1)
      ->  HashAggregate  (cost=471029.03..492793.20 rows=1243667 width=53) (actual time=8725.324..8789.926 rows=159369 loops=1)
            ->  Seq Scan on stream_file sf  (cost=0.00..315570.68 rows=12436668 width=53) (actual time=0.013..1152.600 rows=12436753 loops=1)
    Total runtime: 8838.395 ms

I guess the statistics should reflect the true costs.  So if the "bloated"
statistics are representative of the true cost, I'm all for them.  We are
clustering the database as we speak, so I can't force a hash aggregate to
see if that is the right choice or not.

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224

On Tue, Feb 5, 2013 at 4:12 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> James R Skaggs <james.r.skaggs@seagate.com> wrote:
>
> > Agreed, we shouldn't have so many dead rows.  Our autovacuum is
> > set on but has default parameters.  We are clustering today.
> > This, of course, means downtime and inconvenience to the users.
>
> Right, which is why it's important to figure out why the bloat
> happened.  Sometimes it is unavoidable, like when you delete 90% of
> the rows in your table or a long-lived "idle in transaction"
> connection prevents autovacuum from being able to do its work
> normally.  To prevent further downtime it is important to figure
> out what happened and make appropriate changes to your monitoring
> or vacuuming.
>
> > Here is the troublesome query:
> >
> >> select
> >>     sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
> >>     sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
> >>     sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
> >>   from
> >>     ( select
> >>         max(sf.current_code_line_count) as cdlc,
> >>         max(sf.current_comment_line_count) as cmlc,
> >>         max(sf.current_blank_line_count) as bllc
> >>       from
> >>         stream_file sf
> >>       group by sf.file_path_id, sf.current_source_md5
> >>     ) as t1;
>
> Well, I don't see that the planner has a lot of choice there
> besides whether to use a sort or a hash to do the inner
> aggregation.  Are you saying that prior to the bloat it used a hash
> aggregation, and that was faster?  And that you feel that it should
> be using that even with the bloat?  That the dead rows seem to be
> getting included in the statistics, driving to the slower plan, and
> you feel they should be omitted?
>
> Note that I'm not arguing one way or another on these points at the
> moment; I'm just trying to understand your point clearly.
>
> -Kevin
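For what it's worth (an aside, not from the thread): a one-off comparison of the two plan shapes does not require fixing the statistics first. Penalizing the sort path in a single session will usually push the planner toward the HashAggregate plan, for example:

    BEGIN;
    SET LOCAL enable_sort = off;   -- discourage the Sort + GroupAggregate path
    EXPLAIN ANALYZE
    select
        sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
        sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
        sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
    from
        ( select
              max(sf.current_code_line_count) as cdlc,
              max(sf.current_comment_line_count) as cmlc,
              max(sf.current_blank_line_count) as bllc
          from stream_file sf
          group by sf.file_path_id, sf.current_source_md5
        ) as t1;
    ROLLBACK;

SET LOCAL confines the change to the transaction, so the setting cannot leak into other sessions.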
On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> "jimbob@seagate.com" <jimbob@seagate.com> wrote:
>
>> INFO:  analyzing "public.stream_file"
>> INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
>> 184517 live rows and 2115512 dead rows; 30000 rows in sample,
>> 158702435 estimated total rows
>
> 184517 live rows in 30000 randomly sampled pages out of 2123642
> total pages, means that the statistics predict that a select
> count(*) will find about 13 million live rows to count.
>
>> After "analyze verbose", the table shows 158 million rows. A
>> select count(1) yields 13.8 million rows.
>
> OK, the estimate was 13 million and there were actually 13.8
> million, but it is a random sample used to generate estimates.
> That seems worse than average, but close enough to be useful.
> The 158.7 million total rows includes dead rows, which must be
> visited to determine visibility, but will not be counted because
> they are not visible to the counting transaction.

To clarify here, the 158.7 million estimate does not *intentionally*
include dead rows.  As you say, the ANALYZE did get a very good
instantaneous estimate of the number of live rows.  However, ANALYZE
doesn't overwrite the old estimate, it averages its estimate into the
old one.  After the table shape changes dramatically, the ANALYZE
needs to be run repeatedly before the estimate will converge to the
new reality.  (Of course a cluster or vacuum full will blow away the
old statistics, so the next analyze after that will solely determine
the new statistics.)

I agree, not a bug.

Cheers,

Jeff
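A hedged way to observe the convergence Jeff describes (not part of his message) is to compare the planner's stored estimate with the stats collector's live-tuple count after each ANALYZE; something along these lines should work on 8.4:

    ANALYZE stream_file;
    SELECT c.reltuples::bigint AS planner_row_estimate,
           s.n_live_tup,
           s.n_dead_tup
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.relname = 'stream_file';
    -- repeat both statements; reltuples should drift toward the true live-row count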
Okay, I have some more info.

Some background info.  This one table gets so many changes, I CLUSTER it
each night.  However, after I do this, the statistics still appear to be
incorrect.  Even after I do a "select pg_stat_reset();" followed by 3
ANALYZE runs with default_statistics_target set to 1, 10, and 100:

    select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
    from pg_stat_all_tables

    ('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)

Is it possible that there are still dead tuples after a CLUSTER?

EXPLAIN ANALYZE thinks we have 112M rows (in fact there are 10M), so it
needs to do a disk sort with work_mem=1GB:

    Aggregate  (cost=23622814.39..23622814.40 rows=1 width=12) (actual time=57512.462..57512.462 rows=1 loops=1)
      ->  GroupAggregate  (cost=21536989.70..23425506.64 rows=11274728 width=53) (actual time=50825.396..57457.421 rows=202394 loops=1)
            Output: max(sf.current_code_line_count), max(sf.current_comment_line_count), max(sf.current_blank_line_count)
            ->  Sort  (cost=21536989.70..21818857.90 rows=112747280 width=53) (actual time=50825.386..55004.916 rows=7683730 loops=1)
                  Output: sf.current_code_line_count, sf.current_comment_line_count, sf.current_blank_line_count, sf.file_path_id, sf.current_source_md5
                  Sort Key: sf.file_path_id, sf.current_source_md5
                  Sort Method:  external merge  Disk: 502288kB
                  ->  Seq Scan on stream_file sf  (cost=0.00..2604208.80 rows=112747280 width=53) (actual time=0.033..27922.485 rows=7683730 loops=1)
                        Output: sf.current_code_line_count, sf.current_comment_line_count, sf.current_blank_line_count, sf.file_path_id, sf.current_source_md5
    Total runtime: 57693.835 ms

Now we do many, many ANALYZE VERBOSE runs and converge on the correct value,
which is known to be about 10M rows:

    INFO:  analyzing "public.stream_file"
    INFO:  "stream_file": scanned 30000 of 1476736 pages, containing 158846 live rows
    and 2175512 dead rows; 30000 rows in sample, 112747282 estimated total rows
    Query returned successfully with no result in 9172 ms.

    ... 200X !...

    INFO:  analyzing "public.stream_file"
    INFO:  "stream_file": scanned 30000 of 1480611 pages, containing 158776 live rows
    and 2170410 dead rows; 30000 rows in sample, 9769236 estimated total rows
    Query returned successfully with no result in 441 ms.

Now the optimizer thinks we can do a hash aggregate in memory and we get
better performance:

    Aggregate  (cost=1734729.12..1734729.14 rows=1 width=12) (actual time=33816.049..33816.049 rows=1 loops=1)
      ->  HashAggregate  (cost=1700534.50..1717631.81 rows=976989 width=53) (actual time=33535.083..33712.787 rows=202404 loops=1)
            Output: max(sf.current_code_line_count), max(sf.current_comment_line_count), max(sf.current_blank_line_count)
            ->  Seq Scan on stream_file sf  (cost=0.00..1578410.89 rows=9769889 width=53) (actual time=392.435..26278.143 rows=7710223 loops=1)
                  Output: sf.id, sf.current_blank_line_count, sf.current_code_line_count, sf.current_comment_line_count, sf.current_source_md5, sf.component_id, sf.current_file_instance_id, sf.current_file_state_id, sf.file_path_id, sf.stream_element_id
    Total runtime: 33822.707 ms

But later in the day, the statistics revert back to the 100M number!

Any ideas?  Is there some kind of cache that is remembering the old
statistics?
Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224

On Sun, Feb 10, 2013 at 1:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > "jimbob@seagate.com" <jimbob@seagate.com> wrote:
> >
> >> INFO:  analyzing "public.stream_file"
> >> INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
> >> 184517 live rows and 2115512 dead rows; 30000 rows in sample,
> >> 158702435 estimated total rows
> >
> > 184517 live rows in 30000 randomly sampled pages out of 2123642
> > total pages, means that the statistics predict that a select
> > count(*) will find about 13 million live rows to count.
> >
> >> After "analyze verbose", the table shows 158 million rows. A
> >> select count(1) yields 13.8 million rows.
> >
> > OK, the estimate was 13 million and there were actually 13.8
> > million, but it is a random sample used to generate estimates.
> > That seems worse than average, but close enough to be useful.
> > The 158.7 million total rows includes dead rows, which must be
> > visited to determine visibility, but will not be counted because
> > they are not visible to the counting transaction.
>
> To clarify here, the 158.7 million estimate does not *intentionally*
> include dead rows.  As you say, the ANALYZE did get a very good
> instantaneous estimate of the number of live rows.  However, ANALYZE
> doesn't overwrite the old estimate, it averages its estimate into the
> old one.  After the table shape changes dramatically, the ANALYZE
> needs to be run repeatedly before the estimate will converge to the
> new reality.  (Of course a cluster or vacuum full will blow away the
> old statistics, so the next analyze after that will solely determine
> the new statistics.)
>
> I agree, not a bug.
>
> Cheers,
>
> Jeff
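Relating to James's question above about residual statistics after CLUSTER (an editorial aside, not part of the thread): one quick way to see how far the two bookkeeping paths have drifted from reality is to compare an actual count against both the planner's reltuples and the stats collector's n_live_tup. The count is slow on a table this size, but it is the ground truth:

    SELECT (SELECT count(*) FROM stream_file)            AS actual_live_rows,
           (SELECT reltuples::bigint FROM pg_class
             WHERE relname = 'stream_file')              AS planner_estimate,
           (SELECT n_live_tup FROM pg_stat_user_tables
             WHERE relname = 'stream_file')              AS stats_collector_estimate;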
On Sun, Feb 10, 2013 at 12:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> >
> > OK, the estimate was 13 million and there were actually 13.8
> > million, but it is a random sample used to generate estimates.
> > That seems worse than average, but close enough to be useful.
> > The 158.7 million total rows includes dead rows, which must be
> > visited to determine visibility, but will not be counted because
> > they are not visible to the counting transaction.
>
> To clarify here, the 158.7 million estimate does not *intentionally*
> include dead rows.  As you say, the ANALYZE did get a very good
> instantaneous estimate of the number of live rows.  However, ANALYZE
> doesn't overwrite the old estimate, it averages its estimate into the
> old one.  After the table shape changes dramatically, the ANALYZE
> needs to be run repeatedly before the estimate will converge to the
> new reality.  (Of course a cluster or vacuum full will blow away the
> old statistics, so the next analyze after that will solely determine
> the new statistics.)

I was incredibly wrong here.  The cluster or vacuum do not blow away the
stats so that the next analyze gets to solely determine them.  Rather, they
impose their own idea of live tuples, and then analyze can only update that
incrementally as it averages itself into the old value.

Worse, the two methods have very different ideas of what constitutes a live
tuple.  ANALYZE thinks tuples that are visible to a current/recent snapshot
are live, while CLUSTER and VACUUM think tuples that are possibly visible to
anyone are live.

I would say that this is a bug, or at least approaching to being one.  It is
not obvious whether reltuples and n_live_tuples should count the "recently
dead", but it should either be one way or the other and not an unholy
mixture of the two.  As it is now, a cluster or simple vacuum will snap
n_live_tuples so that it counts recently dead, then analyze will slowly
converge it to exclude recently dead, and then the next vacuum will snap it
back again.

Of course, all of this only comes into play in the presence of very
long-lived transactions that prevent tuples from going away.  Otherwise the
number recently dead is small enough not to matter.

    create table foo as select (random()*1000000)::integer as val
        from generate_series(1,50000000);

In a different session, open a transaction and leave it open:

    begin;
    create temp table adlfkj (x serial);

Back in the main session:

    delete from foo where val > 100;

Run this repeatedly and watch the rows estimate slowly decay:

    ANALYZE verbose foo;
    explain select count(*) from foo;

Then run this and watch it instantly spring back:

    VACUUM VERBOSE foo;
    explain select count(*) from foo;

Cheers,

Jeff
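To make the back-and-forth Jeff describes explicit, the same experiment can be instrumented by reading reltuples directly after each step. This is a small addition to Jeff's recipe, not part of his original message, and assumes the other session's transaction is still open:

    ANALYZE foo;
    SELECT reltuples::bigint AS rows_after_analyze
    FROM pg_class WHERE relname = 'foo';
    -- drifts toward the few thousand rows that remain visible

    VACUUM foo;
    SELECT reltuples::bigint AS rows_after_vacuum
    FROM pg_class WHERE relname = 'foo';
    -- snaps back toward 50 million, since the deleted tuples are still
    -- possibly visible to the long-open transaction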
So, I have some observations.  Is this what you are seeing as well?

When we CLUSTER a heavily-updated table:

- CLUSTER does appear to reset *n_dead_tup*, *n_tup_ins*, *n_tup_del*, and
  *n_tup_hot_upd*, but NOT *n_live_tup*.
- pg_stat_reset() truly clears out all the statistics counters.  I tried
  this because *n_live_tup* is not correct.
- A subsequent ANALYZE will update *n_dead_tup* and *n_live_tup* to some
  values that could not possibly be based on the newly CLUSTERed table.

So, how do we get correct statistics for a heavily updated table?  In my
experience, we only need to get the exponent correct, but we're not even
getting that.

BTW, I've upgraded to 8.4.15.
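A hedged way to reproduce the observation above (not from the thread; it assumes the table already has a clustering index set by an earlier CLUSTER ... USING, and that the stats collector has had a moment to absorb the counter messages before the second read):

    SELECT n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
    FROM pg_stat_user_tables WHERE relname = 'stream_file';

    CLUSTER stream_file;

    -- re-check which counters came back as zero and which were carried over
    SELECT n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
    FROM pg_stat_user_tables WHERE relname = 'stream_file';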
BTW "jimbob" and "James.R.Skaggs" are the same person. I just didn't want to use my "work" email for this.... -- View this message in context: http://postgresql.1045698.n5.nabble.com/BUG-7853-Incorrect-statistics-in-table-with-many-dead-rows-tp5743845p5747000.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
On Fri, Feb 22, 2013 at 3:41 PM, James R Skaggs
<james.r.skaggs@seagate.com> wrote:
> Okay, I have some more info.
>
> Some background info.  This one table gets so many changes, I CLUSTER it
> each night.  However, after I do this, the statistics still appear to be
> incorrect.  Even after I do a "select pg_stat_reset();" followed by 3
> ANALYZE at default_statistics_target as 1, 10, and 100
>
> select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del, n_tup_hot_upd
> from pg_stat_all_tables
>
> ('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)
>
> Is it possible that there are still dead tuples after a CLUSTER?

Yes.  A cluster must bring along any tuples which are possibly visible to
any open transaction.

Your root problem seems to be that you have long-open transactions which
are preventing vacuum from doing its thing, which leads you to try
clustering, but the long-open transaction prevents that from doing its
thing effectively as well.

Perhaps PG could deal with this situation more gracefully, but fundamentally
you have to figure out why you have these ancient transactions lying around,
and fix them or kill them.

Cheers,

Jeff
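As a closing aside (not from the thread): finding the ancient transactions Jeff mentions is usually a matter of sorting pg_stat_activity by transaction start time. A sketch for 8.4, where the columns are procpid and current_query (newer releases use pid and query):

    SELECT procpid, usename, xact_start,
           now() - xact_start AS xact_age,
           current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;

    -- once identified, a stuck backend can be cancelled or, if necessary, terminated
    -- (pg_terminate_backend is available from 8.4 onward):
    -- SELECT pg_cancel_backend(<procpid>);
    -- SELECT pg_terminate_backend(<procpid>);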