Re: Peculiar performance observation.... - Mailing list pgsql-general
From | Net Virtual Mailing Lists |
---|---|
Subject | Re: Peculiar performance observation.... |
Date | |
Msg-id | 20050315205431.20070@mail.net-virtual.com Whole thread Raw |
In response to | Re: Peculiar performance observation.... (Scott Marlowe <smarlowe@g2switchworks.com>) |
List | pgsql-general |
>On Mon, 2005-03-14 at 21:14, Net Virtual Mailing Lists wrote: >> >On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote: >> >> Hello, >> >> >> >> >> >> I am sorry to bring this up again.... Does anyone have any idea what >> >> might be going on here?... I'm very worried about this situation.. ;-( >> > >> >It looks to me like either you're not analyzing often enough, or your >> >statistics target is too low to get a good sample. Note your estimated >> >versus real rows are off by a factor of 70 (28 est. versus 1943 actual >> >rows). That's a pretty big difference, and where you should be looking. >> > >> >> > -> Seq Scan on jobdata (cost=0.00..7567.88 rows=28 width=52) >(actual >> >> >time=11.498..4800.907 rows=1943 loops=1) >> > >> >Yes, this is because PostgreSQL is using an index to approximate a >> >sequential scan, which is not a good thing since PostgreSQL can't get >> >all the information it needs from just an index, but has to visit the >> >table to check visibility. >> > >> >> >> All of these were after a vacuum full analyze, which I actually do >> nightly on the database. >> >> I probably confused the issue with all of my posts, this is the query >> which has me concerned. When running it on my system here, the disk >> thrashes (and I mean *THRASHES*) the entire 12-20 seconds it takes to >> run... WHen running on our production servers, I can't hear the disk, >> but see an equally troubling performance loss when using the index. > >I'll call this query 1: > >> >> database=> explain analyze select id from table1 where category <@ 'a.b'; >> QUERY >> PLAN >> ------------------------------------- >> ------------------------------------- >> ------------------------------------------------------------------------- >> Index Scan using table1_category_full_gist_idx on jobdata >> (cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528 >> rows=1943 loops=1) >> Index Cond: (category <@ 'a.b'::ltree) >> Filter: (category <@ 'a.b'::ltree) >> Total runtime: 12222.258 ms >> >> >> I can do this to speed things up (this results in very little disk >> activity, certainly not the thrashing the original query did): >> >> >> create table yuck (id integer, category ltree[]); >> insert into yuck select id, category from table1; >> create index category_idx on yuck using gist(category); >> vacuum analyze yuck; >> jobs=> explain analyze select id from table1 where id in (select id from >> yuck where category <@ 'a.b'); >> QUERY PLAN >> >> ------------------------------------- >> ------------------------------------- >> ------------------------------------------------------------- >> Nested Loop (cost=108.64..114.28 rows=1 width=52) (actual >> time=654.645..1245.212 rows=1943 loops=1) >> -> HashAggregate (cost=108.64..108.64 rows=1 width=4) (actual >> time=654.202..690.709 rows=1943 loops=1) >> -> Index Scan using category_idx on yuck (cost=0.00..108.57 >> rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1) >> Index Cond: (category <@ 'a.b'::ltree) >> Filter: (category <@ 'a.b'::ltree) >> -> Index Scan using table1_pkey on jobdata (cost=0.00..5.64 rows=1 >> width=52) (actual time=0.219..0.235 rows=1 loops=1943) >> Index Cond: (table1.id = "outer".id) >> Total runtime: 1261.551 ms >> (8 rows) > > >> If I drop the index "table1_category_full_gist_idx", the query speeds up >> dramatically (10-15 times faster on both dev and prod uction systems). >> >> So my concern, in short: why is it so much slower when actually using an >> index and why is it trying to make mince meat out of my hard drive? > >I'll explain it again, sorry if my quoting originally was a bit of a >mess. I meant to post the last comment I made after some other comment >in your original post that I think I deleted. > >Anyway, the reason it's slow is that PostgreSQL, unlike most other >databases, cannot get the answers from an index. It can only get a >pointer to the right place in the table to look for the answer. After >that, due to visibility issues caused by the way postgresql implements >MVCC, it then has to look IN THE TABLE to find out if the value is >visible to your transaction or not. So it's going Index then table, >then index, then table, then index, then table, for however many rows >it's gonna grab. In this case 1943. > >In query 1, the number of rows being returned by the index scan is 1943, >but the planner only thinks it's gonna get back 28. So, with a 70:1 >ratio of incorrectness here, the planner thinks an index scan is a good >idea. It's not, it's a terrible idea for your table. The problem is >likely that the query planner is not getting the right numbers for this >table, and I'm not even sure how accurate statistics can be for ltrees, >as I've only ever used btree indexes in postgresql. But, upping the >statistics target for the column producing this bad behavior and >rerunning analyze might help. > >Have you upped the statistics target for this column yet? > Ahh, I should be the one apologizing. I did not understand what you meant.. :-) Your explanation makes sense and certainly seems to be what is going on here, I will see what I can do to adjust the statistics on that column to make it do a table scan instead. I am curious, why can't it do the index scan then when it retrieves the records from the database remove any which are not visible to the transaction?.... It seems like the index would always return more possible matches then the table, right? As a final question, do have an equally astute explanation as to why with the creation of my "yuck" table it is so much faster? It still does an index scan, still uses ltree[], etc?... - Greg
pgsql-general by date: