Thread: Peculiar performance observation....

Peculiar performance observation....

From
"Net Virtual Mailing Lists"
Date:
I have a rather peculiar performance observation and would welcome any
feedback on this.....

First off, the main table (well, part of it.. it is quite large..):


                                             Table "table1"
       Column       |           Type           |
   Modifiers
--------------------+--------------------------
+-----------------------------------------------------------------
 id                 | integer                  | not null default
nextval('master.id_seq'::text)
 user_id            | integer                  |
 ... (skipping about 20 columns)
 category           | ltree[]                  |
 somedata           | text                     | not null



Indexes:
    "table1_pkey" primary key, btree (id)
    "table1_category_full_gist_idx" gist (category)
    "table1_id_idx" btree (id)
    "table1_fti_idx" gist (fti) WHERE ((status)::text = 'open'::text)
    "table1_user_id_idx" btree (user_id)


database=> explain analyze select id from table1 where category <@ 'a.b';
                                                                    QUERY
PLAN
-------------------------------------
-------------------------------------
-------------------------------------------------------------------------
 Index Scan using table1_category_full_gist_idx on jobdata
(cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
rows=1943 loops=1)
   Index Cond: (category <@ 'a.b'::ltree)
   Filter: (category <@ 'a.b'::ltree)
 Total runtime: 12222.258 ms


If I do this:

create table yuck (id integer, category ltree[]);
insert into yuck select id, category from table1;
create index category_idx on yuck using gist(category);
vacuum analyze yuck;
jobs=> explain analyze select id from table1 where id in (select id from
yuck where category <@ 'a.b');
                                                              QUERY PLAN

-------------------------------------
-------------------------------------
-------------------------------------------------------------
 Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
time=654.645..1245.212 rows=1943 loops=1)
   ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
time=654.202..690.709 rows=1943 loops=1)
         ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
               Index Cond: (category <@ 'a.b'::ltree)
               Filter: (category <@ 'a.b'::ltree)
   ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
width=52) (actual time=0.219..0.235 rows=1 loops=1943)
         Index Cond: (table1.id = "outer".id)
 Total runtime: 1261.551 ms
(8 rows)


In the first query, my hard disk trashes audibly the entire 12 seconds
(this is actually the best run I could get, it is usually closer to 20
seconds), the second query runs almost effortlessly..  I've tried
reindexing, even dropping the index and recreating it but nothing I do
helps at all.

Now keep in mind that I do all of my development on painfully slow
hardware in order to make any performance issues really stand out.  But,
I've done this on production servers too with an equal performance
improvement noticed.

I just can't figure out why this second query is so much faster, I feel
like I must have done something very wrong in my schema design or
something to be suffering this sort of a performance loss.   Any idea
what I can do about this?

Thanks as always!

- Greg


Re: Peculiar performance observation....

From
"Net Virtual Mailing Lists"
Date:
Something even more peculiar (at least it seems to me..)...


If I drop the index table1_category_gist_idx, I get this:


jobs=> explain analyze select id from table1 where category <@ 'a.b'
ORDER BY category;
                                                    QUERY PLAN

-------------------------------------
------------------------------------------------------------------------------
 Sort  (cost=7568.55..7568.62 rows=28 width=52) (actual
time=4842.691..4854.468 rows=1943 loops=1)
   Sort Key: category
   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual
time=11.498..4800.907 rows=1943 loops=1)
         Filter: (category <@ 'a.b'::ltree)
 Total runtime: 4871.076 ms
(5 rows)


.. no disk thrashing all over the place..

I'm really perplexed about this one..;-(

- Greg

>I have a rather peculiar performance observation and would welcome any
>feedback on this.....
>
>First off, the main table (well, part of it.. it is quite large..):
>
>
>                                             Table "table1"
>       Column       |           Type           |
>   Modifiers
>--------------------+--------------------------
>+-----------------------------------------------------------------
> id                 | integer                  | not null default
>nextval('master.id_seq'::text)
> user_id            | integer                  |
> ... (skipping about 20 columns)
> category           | ltree[]                  |
> somedata           | text                     | not null
>
>
>
>Indexes:
>    "table1_pkey" primary key, btree (id)
>    "table1_category_full_gist_idx" gist (category)
>    "table1_id_idx" btree (id)
>    "table1_fti_idx" gist (fti) WHERE ((status)::text = 'open'::text)
>    "table1_user_id_idx" btree (user_id)
>
>
>database=> explain analyze select id from table1 where category <@ 'a.b';
>                                                                    QUERY
>PLAN
>-------------------------------------
>-------------------------------------
>-------------------------------------------------------------------------
> Index Scan using table1_category_full_gist_idx on jobdata
>(cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
>rows=1943 loops=1)
>   Index Cond: (category <@ 'a.b'::ltree)
>   Filter: (category <@ 'a.b'::ltree)
> Total runtime: 12222.258 ms
>
>
>If I do this:
>
>create table yuck (id integer, category ltree[]);
>insert into yuck select id, category from table1;
>create index category_idx on yuck using gist(category);
>vacuum analyze yuck;
>jobs=> explain analyze select id from table1 where id in (select id from
>yuck where category <@ 'a.b');
>                                                              QUERY PLAN
>
>-------------------------------------
>-------------------------------------
>-------------------------------------------------------------
> Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
>time=654.645..1245.212 rows=1943 loops=1)
>   ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
>time=654.202..690.709 rows=1943 loops=1)
>         ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
>rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>               Index Cond: (category <@ 'a.b'::ltree)
>               Filter: (category <@ 'a.b'::ltree)
>   ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
>width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>         Index Cond: (table1.id = "outer".id)
> Total runtime: 1261.551 ms
>(8 rows)
>
>
>In the first query, my hard disk trashes audibly the entire 12 seconds
>(this is actually the best run I could get, it is usually closer to 20
>seconds), the second query runs almost effortlessly..  I've tried
>reindexing, even dropping the index and recreating it but nothing I do
>helps at all.
>
>Now keep in mind that I do all of my development on painfully slow
>hardware in order to make any performance issues really stand out.  But,
>I've done this on production servers too with an equal performance
>improvement noticed.
>
>I just can't figure out why this second query is so much faster, I feel
>like I must have done something very wrong in my schema design or
>something to be suffering this sort of a performance loss.   Any idea
>what I can do about this?
>
>Thanks as always!
>
>- Greg


Re: Peculiar performance observation....

From
"Net Virtual Mailing Lists"
Date:
Hello,


I am sorry to bring this up again.... Does anyone have any idea what
might be going on here?...   I'm very worried about this situation.. ;-(

- Greg


>Something even more peculiar (at least it seems to me..)...
>
>
>If I drop the index table1_category_gist_idx, I get this:
>
>
>jobs=> explain analyze select id from table1 where category <@ 'a.b'
>ORDER BY category;
>                                                    QUERY PLAN
>
>-------------------------------------
>------------------------------------
------------------------------------------
> Sort  (cost=7568.55..7568.62 rows=28 width=52) (actual
>time=4842.691..4854.468 rows=1943 loops=1)
>   Sort Key: category
>   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual
>time=11.498..4800.907 rows=1943 loops=1)
>         Filter: (category <@ 'a.b'::ltree)
> Total runtime: 4871.076 ms
>(5 rows)
>
>
>.. no disk thrashing all over the place..
>
>I'm really perplexed about this one..;-(
>
>- Greg
>
>>I have a rather peculiar performance observation and would welcome any
>>feedback on this.....
>>
>>First off, the main table (well, part of it.. it is quite large..):
>>
>>
>>                                             Table "table1"
>>       Column       |           Type           |
>>   Modifiers
>>--------------------+--------------------------
>>+-----------------------------------------------------------------
>> id                 | integer                  | not null default
>>nextval('master.id_seq'::text)
>> user_id            | integer                  |
>> ... (skipping about 20 columns)
>> category           | ltree[]                  |
>> somedata           | text                     | not null
>>
>>
>>
>>Indexes:
>>    "table1_pkey" primary key, btree (id)
>>    "table1_category_full_gist_idx" gist (category)
>>    "table1_id_idx" btree (id)
>>    "table1_fti_idx" gist (fti) WHERE ((status)::text = 'open'::text)
>>    "table1_user_id_idx" btree (user_id)
>>
>>
>>database=> explain analyze select id from table1 where category <@ 'a.b';
>>                                                                    QUERY
>>PLAN
>>-------------------------------------
>>-------------------------------------
>>-------------------------------------------------------------------------
>> Index Scan using table1_category_full_gist_idx on jobdata
>>(cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
>>rows=1943 loops=1)
>>   Index Cond: (category <@ 'a.b'::ltree)
>>   Filter: (category <@ 'a.b'::ltree)
>> Total runtime: 12222.258 ms
>>
>>
>>If I do this:
>>
>>create table yuck (id integer, category ltree[]);
>>insert into yuck select id, category from table1;
>>create index category_idx on yuck using gist(category);
>>vacuum analyze yuck;
>>jobs=> explain analyze select id from table1 where id in (select id from
>>yuck where category <@ 'a.b');
>>                                                              QUERY PLAN
>>
>>-------------------------------------
>>-------------------------------------
>>-------------------------------------------------------------
>> Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
>>time=654.645..1245.212 rows=1943 loops=1)
>>   ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
>>time=654.202..690.709 rows=1943 loops=1)
>>         ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
>>rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>>               Index Cond: (category <@ 'a.b'::ltree)
>>               Filter: (category <@ 'a.b'::ltree)
>>   ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
>>width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>>         Index Cond: (table1.id = "outer".id)
>> Total runtime: 1261.551 ms
>>(8 rows)
>>
>>
>>In the first query, my hard disk trashes audibly the entire 12 seconds
>>(this is actually the best run I could get, it is usually closer to 20
>>seconds), the second query runs almost effortlessly..  I've tried
>>reindexing, even dropping the index and recreating it but nothing I do
>>helps at all.
>>
>>Now keep in mind that I do all of my development on painfully slow
>>hardware in order to make any performance issues really stand out.  But,
>>I've done this on production servers too with an equal performance
>>improvement noticed.
>>
>>I just can't figure out why this second query is so much faster, I feel
>>like I must have done something very wrong in my schema design or
>>something to be suffering this sort of a performance loss.   Any idea
>>what I can do about this?
>>
>>Thanks as always!
>>
>>- Greg


Re: Peculiar performance observation....

From
Thomas F.O'Connell
Date:
Well, your expected vs. actual rows are off, so analyzing might help.
Otherwise, what is your sort_mem set to?

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 14, 2005, at 6:11 AM, Net Virtual Mailing Lists wrote:

> Hello,
>
>
> I am sorry to bring this up again.... Does anyone have any idea what
> might be going on here?...   I'm very worried about this situation..
> ;-(
>
> - Greg
>
>
>> Something even more peculiar (at least it seems to me..)...
>>
>>
>> If I drop the index table1_category_gist_idx, I get this:
>>
>>
>> jobs=> explain analyze select id from table1 where category <@ 'a.b'
>> ORDER BY category;
>>                                                    QUERY PLAN
>>
>> -------------------------------------
>> ------------------------------------
> ------------------------------------------
>> Sort  (cost=7568.55..7568.62 rows=28 width=52) (actual
>> time=4842.691..4854.468 rows=1943 loops=1)
>>   Sort Key: category
>>   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52)
>> (actual
>> time=11.498..4800.907 rows=1943 loops=1)
>>         Filter: (category <@ 'a.b'::ltree)
>> Total runtime: 4871.076 ms
>> (5 rows)
>>
>>
>> .. no disk thrashing all over the place..
>>
>> I'm really perplexed about this one..;-(
>>
>> - Greg
>>
>>> I have a rather peculiar performance observation and would welcome
>>> any
>>> feedback on this.....
>>>
>>> First off, the main table (well, part of it.. it is quite large..):
>>>
>>>
>>>                                             Table "table1"
>>>       Column       |           Type           |
>>>   Modifiers
>>> --------------------+--------------------------
>>> +-----------------------------------------------------------------
>>> id                 | integer                  | not null default
>>> nextval('master.id_seq'::text)
>>> user_id            | integer                  |
>>> ... (skipping about 20 columns)
>>> category           | ltree[]                  |
>>> somedata           | text                     | not null
>>>
>>>
>>>
>>> Indexes:
>>>    "table1_pkey" primary key, btree (id)
>>>    "table1_category_full_gist_idx" gist (category)
>>>    "table1_id_idx" btree (id)
>>>    "table1_fti_idx" gist (fti) WHERE ((status)::text = 'open'::text)
>>>    "table1_user_id_idx" btree (user_id)
>>>
>>>
>>> database=> explain analyze select id from table1 where category <@
>>> 'a.b';
>>>
>>> QUERY
>>> PLAN
>>> -------------------------------------
>>> -------------------------------------
>>> ---------------------------------------------------------------------
>>> ----
>>> Index Scan using table1_category_full_gist_idx on jobdata
>>> (cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
>>> rows=1943 loops=1)
>>>   Index Cond: (category <@ 'a.b'::ltree)
>>>   Filter: (category <@ 'a.b'::ltree)
>>> Total runtime: 12222.258 ms
>>>
>>>
>>> If I do this:
>>>
>>> create table yuck (id integer, category ltree[]);
>>> insert into yuck select id, category from table1;
>>> create index category_idx on yuck using gist(category);
>>> vacuum analyze yuck;
>>> jobs=> explain analyze select id from table1 where id in (select id
>>> from
>>> yuck where category <@ 'a.b');
>>>                                                              QUERY
>>> PLAN
>>>
>>> -------------------------------------
>>> -------------------------------------
>>> -------------------------------------------------------------
>>> Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
>>> time=654.645..1245.212 rows=1943 loops=1)
>>>   ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
>>> time=654.202..690.709 rows=1943 loops=1)
>>>         ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
>>> rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>>>               Index Cond: (category <@ 'a.b'::ltree)
>>>               Filter: (category <@ 'a.b'::ltree)
>>>   ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64
>>> rows=1
>>> width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>>>         Index Cond: (table1.id = "outer".id)
>>> Total runtime: 1261.551 ms
>>> (8 rows)
>>>
>>>
>>> In the first query, my hard disk trashes audibly the entire 12
>>> seconds
>>> (this is actually the best run I could get, it is usually closer to
>>> 20
>>> seconds), the second query runs almost effortlessly..  I've tried
>>> reindexing, even dropping the index and recreating it but nothing I
>>> do
>>> helps at all.
>>>
>>> Now keep in mind that I do all of my development on painfully slow
>>> hardware in order to make any performance issues really stand out.
>>> But,
>>> I've done this on production servers too with an equal performance
>>> improvement noticed.
>>>
>>> I just can't figure out why this second query is so much faster, I
>>> feel
>>> like I must have done something very wrong in my schema design or
>>> something to be suffering this sort of a performance loss.   Any idea
>>> what I can do about this?
>>>
>>> Thanks as always!
>>>
>>> - Greg
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 8: explain analyze is your friend


Re: Peculiar performance observation....

From
Scott Marlowe
Date:
On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote:
> Hello,
>
>
> I am sorry to bring this up again.... Does anyone have any idea what
> might be going on here?...   I'm very worried about this situation.. ;-(

It looks to me like either you're not analyzing often enough, or your
statistics target is too low to get a good sample.  Note your estimated
versus real rows are off by a factor of 70 (28 est. versus 1943 actual
rows). That's a pretty big difference, and where you should be looking.

> >   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual
> >time=11.498..4800.907 rows=1943 loops=1)

Yes, this is because PostgreSQL is using an index to approximate a
sequential scan, which is not a good thing since PostgreSQL can't get
all the information it needs from just an index, but has to visit the
table to check visibility.

Re: Peculiar performance observation....

From
"Net Virtual Mailing Lists"
Date:
>On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote:
>> Hello,
>>
>>
>> I am sorry to bring this up again.... Does anyone have any idea what
>> might be going on here?...   I'm very worried about this situation.. ;-(
>
>It looks to me like either you're not analyzing often enough, or your
>statistics target is too low to get a good sample.  Note your estimated
>versus real rows are off by a factor of 70 (28 est. versus 1943 actual
>rows). That's a pretty big difference, and where you should be looking.
>
>> >   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual
>> >time=11.498..4800.907 rows=1943 loops=1)
>
>Yes, this is because PostgreSQL is using an index to approximate a
>sequential scan, which is not a good thing since PostgreSQL can't get
>all the information it needs from just an index, but has to visit the
>table to check visibility.
>


All of these were after a vacuum full analyze, which I actually do
nightly on the database.

I probably confused the issue with all of my posts, this is the query
which has me concerned.  When running it on my system here, the disk
thrashes (and I mean *THRASHES*) the entire 12-20 seconds it takes to
run...  WHen running on our production servers, I can't hear the disk,
but see an equally troubling performance loss when using the index.

database=> explain analyze select id from table1 where category <@ 'a.b';
                                                                    QUERY
PLAN
-------------------------------------
-------------------------------------
-------------------------------------------------------------------------
 Index Scan using table1_category_full_gist_idx on jobdata
(cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
rows=1943 loops=1)
   Index Cond: (category <@ 'a.b'::ltree)
   Filter: (category <@ 'a.b'::ltree)
 Total runtime: 12222.258 ms


I can do this to speed things up (this results in very little disk
activity, certainly not the thrashing the original query did):


create table yuck (id integer, category ltree[]);
insert into yuck select id, category from table1;
create index category_idx on yuck using gist(category);
vacuum analyze yuck;
jobs=> explain analyze select id from table1 where id in (select id from
yuck where category <@ 'a.b');
                                                              QUERY PLAN

-------------------------------------
-------------------------------------
-------------------------------------------------------------
 Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
time=654.645..1245.212 rows=1943 loops=1)
   ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
time=654.202..690.709 rows=1943 loops=1)
         ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
               Index Cond: (category <@ 'a.b'::ltree)
               Filter: (category <@ 'a.b'::ltree)
   ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
width=52) (actual time=0.219..0.235 rows=1 loops=1943)
         Index Cond: (table1.id = "outer".id)
 Total runtime: 1261.551 ms
(8 rows)



If I drop the index "table1_category_full_gist_idx", the query speeds up
dramatically (10-15 times faster on both dev and prod uction systems).

So my concern, in short: why is it so much slower when actually using an
index and why is it trying to make mince meat out of my hard drive?


- Greg


Re: Peculiar performance observation....

From
Scott Marlowe
Date:
On Mon, 2005-03-14 at 21:14, Net Virtual Mailing Lists wrote:
> >On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote:
> >> Hello,
> >>
> >>
> >> I am sorry to bring this up again.... Does anyone have any idea what
> >> might be going on here?...   I'm very worried about this situation.. ;-(
> >
> >It looks to me like either you're not analyzing often enough, or your
> >statistics target is too low to get a good sample.  Note your estimated
> >versus real rows are off by a factor of 70 (28 est. versus 1943 actual
> >rows). That's a pretty big difference, and where you should be looking.
> >
> >> >   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52) (actual
> >> >time=11.498..4800.907 rows=1943 loops=1)
> >
> >Yes, this is because PostgreSQL is using an index to approximate a
> >sequential scan, which is not a good thing since PostgreSQL can't get
> >all the information it needs from just an index, but has to visit the
> >table to check visibility.
> >
>
>
> All of these were after a vacuum full analyze, which I actually do
> nightly on the database.
>
> I probably confused the issue with all of my posts, this is the query
> which has me concerned.  When running it on my system here, the disk
> thrashes (and I mean *THRASHES*) the entire 12-20 seconds it takes to
> run...  WHen running on our production servers, I can't hear the disk,
> but see an equally troubling performance loss when using the index.

I'll call this query 1:

>
> database=> explain analyze select id from table1 where category <@ 'a.b';
>                                                                     QUERY
> PLAN
> -------------------------------------
> -------------------------------------
> -------------------------------------------------------------------------
>  Index Scan using table1_category_full_gist_idx on jobdata
> (cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
> rows=1943 loops=1)
>    Index Cond: (category <@ 'a.b'::ltree)
>    Filter: (category <@ 'a.b'::ltree)
>  Total runtime: 12222.258 ms
>
>
> I can do this to speed things up (this results in very little disk
> activity, certainly not the thrashing the original query did):
>
>
> create table yuck (id integer, category ltree[]);
> insert into yuck select id, category from table1;
> create index category_idx on yuck using gist(category);
> vacuum analyze yuck;
> jobs=> explain analyze select id from table1 where id in (select id from
> yuck where category <@ 'a.b');
>                                                               QUERY PLAN
>
> -------------------------------------
> -------------------------------------
> -------------------------------------------------------------
>  Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
> time=654.645..1245.212 rows=1943 loops=1)
>    ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
> time=654.202..690.709 rows=1943 loops=1)
>          ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
> rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>                Index Cond: (category <@ 'a.b'::ltree)
>                Filter: (category <@ 'a.b'::ltree)
>    ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
> width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>          Index Cond: (table1.id = "outer".id)
>  Total runtime: 1261.551 ms
> (8 rows)


> If I drop the index "table1_category_full_gist_idx", the query speeds up
> dramatically (10-15 times faster on both dev and prod uction systems).
>
> So my concern, in short: why is it so much slower when actually using an
> index and why is it trying to make mince meat out of my hard drive?

I'll explain it again, sorry if my quoting originally was a bit of a
mess.  I meant to post the last comment I made after some other comment
in your original post that I think I deleted.

Anyway, the reason it's slow is that PostgreSQL, unlike most other
databases, cannot get the answers from an index.  It can only get a
pointer to the right place in the table to look for the answer.  After
that, due to visibility issues caused by the way postgresql implements
MVCC, it then has to look IN THE TABLE to find out if the value is
visible to your transaction or not.  So it's going Index then table,
then index, then table, then index, then table, for however many rows
it's gonna grab.  In this case 1943.

In query 1, the number of rows being returned by the index scan is 1943,
but the planner only thinks it's gonna get back 28.  So, with a 70:1
ratio of incorrectness here, the planner thinks an index scan is a good
idea.  It's not, it's a terrible idea for your table.  The problem is
likely that the query planner is not getting the right numbers for this
table, and I'm not even sure how accurate statistics can be for ltrees,
as I've only ever used btree indexes in postgresql.  But, upping the
statistics target for the column producing this bad behavior and
rerunning analyze might help.

Have you upped the statistics target for this column yet?

Re: Peculiar performance observation....

From
"Net Virtual Mailing Lists"
Date:
>On Mon, 2005-03-14 at 21:14, Net Virtual Mailing Lists wrote:
>> >On Mon, 2005-03-14 at 06:11, Net Virtual Mailing Lists wrote:
>> >> Hello,
>> >>
>> >>
>> >> I am sorry to bring this up again.... Does anyone have any idea what
>> >> might be going on here?...   I'm very worried about this situation.. ;-(
>> >
>> >It looks to me like either you're not analyzing often enough, or your
>> >statistics target is too low to get a good sample.  Note your estimated
>> >versus real rows are off by a factor of 70 (28 est. versus 1943 actual
>> >rows). That's a pretty big difference, and where you should be looking.
>> >
>> >> >   ->  Seq Scan on jobdata  (cost=0.00..7567.88 rows=28 width=52)
>(actual
>> >> >time=11.498..4800.907 rows=1943 loops=1)
>> >
>> >Yes, this is because PostgreSQL is using an index to approximate a
>> >sequential scan, which is not a good thing since PostgreSQL can't get
>> >all the information it needs from just an index, but has to visit the
>> >table to check visibility.
>> >
>>
>>
>> All of these were after a vacuum full analyze, which I actually do
>> nightly on the database.
>>
>> I probably confused the issue with all of my posts, this is the query
>> which has me concerned.  When running it on my system here, the disk
>> thrashes (and I mean *THRASHES*) the entire 12-20 seconds it takes to
>> run...  WHen running on our production servers, I can't hear the disk,
>> but see an equally troubling performance loss when using the index.
>
>I'll call this query 1:
>
>>
>> database=> explain analyze select id from table1 where category <@ 'a.b';
>>                                                                     QUERY
>> PLAN
>> -------------------------------------
>> -------------------------------------
>> -------------------------------------------------------------------------
>>  Index Scan using table1_category_full_gist_idx on jobdata
>> (cost=0.00..113.48 rows=28 width=4) (actual time=43.814..12201.528
>> rows=1943 loops=1)
>>    Index Cond: (category <@ 'a.b'::ltree)
>>    Filter: (category <@ 'a.b'::ltree)
>>  Total runtime: 12222.258 ms
>>
>>
>> I can do this to speed things up (this results in very little disk
>> activity, certainly not the thrashing the original query did):
>>
>>
>> create table yuck (id integer, category ltree[]);
>> insert into yuck select id, category from table1;
>> create index category_idx on yuck using gist(category);
>> vacuum analyze yuck;
>> jobs=> explain analyze select id from table1 where id in (select id from
>> yuck where category <@ 'a.b');
>>                                                               QUERY PLAN
>>
>> -------------------------------------
>> -------------------------------------
>> -------------------------------------------------------------
>>  Nested Loop  (cost=108.64..114.28 rows=1 width=52) (actual
>> time=654.645..1245.212 rows=1943 loops=1)
>>    ->  HashAggregate  (cost=108.64..108.64 rows=1 width=4) (actual
>> time=654.202..690.709 rows=1943 loops=1)
>>          ->  Index Scan using category_idx on yuck  (cost=0.00..108.57
>> rows=28 width=4) (actual time=2.046..623.436 rows=1943 loops=1)
>>                Index Cond: (category <@ 'a.b'::ltree)
>>                Filter: (category <@ 'a.b'::ltree)
>>    ->  Index Scan using table1_pkey on jobdata  (cost=0.00..5.64 rows=1
>> width=52) (actual time=0.219..0.235 rows=1 loops=1943)
>>          Index Cond: (table1.id = "outer".id)
>>  Total runtime: 1261.551 ms
>> (8 rows)
>
>
>> If I drop the index "table1_category_full_gist_idx", the query speeds up
>> dramatically (10-15 times faster on both dev and prod uction systems).
>>
>> So my concern, in short: why is it so much slower when actually using an
>> index and why is it trying to make mince meat out of my hard drive?
>
>I'll explain it again, sorry if my quoting originally was a bit of a
>mess.  I meant to post the last comment I made after some other comment
>in your original post that I think I deleted.
>
>Anyway, the reason it's slow is that PostgreSQL, unlike most other
>databases, cannot get the answers from an index.  It can only get a
>pointer to the right place in the table to look for the answer.  After
>that, due to visibility issues caused by the way postgresql implements
>MVCC, it then has to look IN THE TABLE to find out if the value is
>visible to your transaction or not.  So it's going Index then table,
>then index, then table, then index, then table, for however many rows
>it's gonna grab.  In this case 1943.
>
>In query 1, the number of rows being returned by the index scan is 1943,
>but the planner only thinks it's gonna get back 28.  So, with a 70:1
>ratio of incorrectness here, the planner thinks an index scan is a good
>idea.  It's not, it's a terrible idea for your table.  The problem is
>likely that the query planner is not getting the right numbers for this
>table, and I'm not even sure how accurate statistics can be for ltrees,
>as I've only ever used btree indexes in postgresql.  But, upping the
>statistics target for the column producing this bad behavior and
>rerunning analyze might help.
>
>Have you upped the statistics target for this column yet?
>


Ahh, I should be the one apologizing.  I did not understand what you
meant.. :-)

Your explanation makes sense and certainly seems to be what is going on
here, I will see what I can do to adjust the statistics on that column to
make it do a table scan instead.

I am curious, why can't it do the index scan then when it retrieves the
records from the database remove any which are not visible to the
transaction?....  It seems like the index would always return more
possible matches then the table, right?

As a final question, do have an equally astute explanation as to why with
the creation of my "yuck" table it is so much faster?  It still does an
index scan, still uses ltree[], etc?...


- Greg