Re: Peculiar performance observation.... - Mailing list pgsql-general

From Scott Marlowe
Subject Re: Peculiar performance observation....
Msg-id 1110858250.28555.122.camel@state.g2switchworks.com
In response to Re: Peculiar performance observation....  ("Net Virtual Mailing Lists" <mailinglists@net-virtual.com>)
List pgsql-general
On Mon, 2005-03-14 at 21:14, Net Virtual Mailing Lists wrote:
> >On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote:
> >> Hello,
> >>
> >>
> >> I am sorry to bring this up again.... Does anyone have any idea what
> >> might be going on here?...   I'm very worried about this situation.. ;-(
> >
> >It looks to me like either you're not analyzing often enough, or your
> >statistics target is too low to get a good sample.  Note your estimated
> >versus real rows are off by a factor of 70 (28 est. versus 1943 actual
> >rows). That's a pretty big difference, and where you should be looking.
> >
> >> >   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual time=11.498..4800.907 rows=1943 loops=1)
> >
> >Yes, this is because PostgreSQL is using an index to approximate a
> >sequential scan, which is not a good thing since PostgreSQL can't get
> >all the information it needs from just an index, but has to visit the
> >table to check visibility.
> >
>
>
> All of these were after a vacuum full analyze, which I actually do
> nightly on the database.
>
> I probably confused the issue with all of my posts; this is the query
> which has me concerned.  When running it on my system here, the disk
> thrashes (and I mean *THRASHES*) for the entire 12-20 seconds it takes to
> run...  When running on our production servers, I can't hear the disk,
> but I see an equally troubling performance loss when using the index.

I'll call this query 1:

>
> database=> explain analyze select id from table1 where category <@ 'a.b';
>                                        QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Index Scan using table1_category_full_gist_idx on jobdata  (cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528 rows=1943 loops=1)
>    Index Cond: (category <@ 'a.b'::ltree)
>    Filter: (category <@ 'a.b'::ltree)
>  Total runtime: 12222.258 ms
>
>
> I can do this to speed things up (this results in very little disk
> activity, certainly not the thrashing the original query did):
>
>
> create table yuck (id integer, category ltree[]);
> insert into yuck select id, category from table1;
> create index category_idx on yuck using gist(category);
> vacuum analyze yuck;
> jobs=> explain analyze select id from table1 where id in (select id from yuck where category <@ 'a.b');
>                                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual time=654.645..1245.212 rows=1943 loops=1)
>    ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual time=654.202..690.709 rows=1943 loops=1)
>          ->  Index Scan using category_idx on yuck  (cost=0.00..108.57 rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>                Index Cond: (category <@ 'a.b'::ltree)
>                Filter: (category <@ 'a.b'::ltree)
>    ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1 width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>          Index Cond: (table1.id = "outer".id)
>  Total runtime: 1261.551 ms
> (8 rows)


> If I drop the index "table1_category_full_gist_idx", the query speeds up
> dramatically (10-15 times faster on both dev and production systems).
>
> So my concern, in short: why is it so much slower when actually using an
> index and why is it trying to make mince meat out of my hard drive?

I'll explain it again; sorry, my quoting in the original reply was a bit
of a mess.  I meant to post that last comment after a different comment
in your original post, one that I think I deleted.

Anyway, the reason it's slow is that PostgreSQL, unlike most other
databases, cannot get the answer from an index alone.  The index only
gives it a pointer to the right place in the table to look.  After
that, due to visibility issues caused by the way PostgreSQL implements
MVCC, it has to look IN THE TABLE to find out whether each row is
visible to your transaction or not.  So it's going index, then table,
then index, then table, then index, then table, for however many rows
it's gonna grab.  In this case, 1943.
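
You can see the effect for yourself without dropping the index by
turning index scans off for just your session and comparing the two
plans.  Something like this (only a rough sketch, reusing the table and
column names from your query):

  -- let the planner do what it wants (the index scan you showed above)
  EXPLAIN ANALYZE SELECT id FROM table1 WHERE category <@ 'a.b';

  -- forbid index scans for this session only, forcing the seq scan
  SET enable_indexscan = off;
  EXPLAIN ANALYZE SELECT id FROM table1 WHERE category <@ 'a.b';

  -- put the setting back
  RESET enable_indexscan;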

In query 1, the number of rows being returned by the index scan is 1943,
but the planner only thinks it's gonna get back 28.  So, with a 70:1
ratio of incorrectness here, the planner thinks an index scan is a good
idea.  It's not; it's a terrible idea for your table.  The problem is
likely that the query planner is not getting the right numbers for this
table, and I'm not even sure how accurate statistics can be for ltrees,
as I've only ever used btree indexes in PostgreSQL.  But upping the
statistics target for the column producing this bad behavior and
rerunning ANALYZE might help.

Have you upped the statistics target for this column yet?
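
If not, the usual recipe looks something like this (again just a
sketch, with the table and column names taken from your query; the
default statistics target is 10, so 100 is a reasonable first thing to
try, though how much it helps for an ltree column is an open question
as noted above):

  -- raise the per-column statistics target, then re-gather stats
  ALTER TABLE table1 ALTER COLUMN category SET STATISTICS 100;
  ANALYZE table1;

  -- then see whether the estimate gets closer to the 1943 actual rows
  EXPLAIN ANALYZE SELECT id FROM table1 WHERE category <@ 'a.b';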
