Thread: Query using SeqScan instead of IndexScan

Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Hi,

I have a query that is using a sequential scan instead of an index scan. I've turned off sequential scans and it is in fact faster with the index scan.

Here's my before and after.

Before:

ssdev=# SET enable_seqscan TO DEFAULT;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;

                                                                                  QUERY PLAN                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual time=4.521..2580.520 rows=19695 loops=1)
   Hash Cond: ("outer".product_id = "inner".product_id)
   ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12 rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846 loops=1)
   ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual time=2.267..2.267 rows=1140 loops=1)
         ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.122..1.395 rows=1140 loops=1)
               Index Cond: (category_id = 1001082)
Total runtime: 2584.221 ms
(7 rows)


After:

ssdev=# SET enable_seqscan TO false;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;

                                                                                     QUERY PLAN                                                                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual time=0.373..71.177 rows=19695 loops=1)
   ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.129..1.438 rows=1140 loops=1)
         Index Cond: (category_id = 1001082)
   ->  Index Scan using product_attribute_value__product_id_fk_idx on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4) (actual time=0.016..0.053 rows=17 loops=1140)
         Index Cond: ("outer".product_id = pav.product_id)
Total runtime: 74.747 ms
(6 rows)

There's quite a big difference in speed there. 2584.221 ms vs. 74.747 ms.

Any ideas what I can do to improve this without turning sequential scanning off?

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com 

Attachment

Re: Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Oops. I forgot to mention that I was using PostgreSQL 8.1.3 on Mac OS X.

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com 

On Mar 29, 2006, at 8:12 PM, Brendan Duddridge wrote:

Hi,

I have a query that is using a sequential scan instead of an index scan. I've turned off sequential scans and it is in fact faster with the index scan.

Here's my before and after.

Before:

ssdev=# SET enable_seqscan TO DEFAULT;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;

                                                                                  QUERY PLAN                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual time=4.521..2580.520 rows=19695 loops=1)
   Hash Cond: ("outer".product_id = "inner".product_id)
   ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12 rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846 loops=1)
   ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual time=2.267..2.267 rows=1140 loops=1)
         ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.122..1.395 rows=1140 loops=1)
               Index Cond: (category_id = 1001082)
Total runtime: 2584.221 ms
(7 rows)


After:

ssdev=# SET enable_seqscan TO false;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;

                                                                                     QUERY PLAN                                                                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual time=0.373..71.177 rows=19695 loops=1)
   ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.129..1.438 rows=1140 loops=1)
         Index Cond: (category_id = 1001082)
   ->  Index Scan using product_attribute_value__product_id_fk_idx on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4) (actual time=0.016..0.053 rows=17 loops=1140)
         Index Cond: ("outer".product_id = pav.product_id)
Total runtime: 74.747 ms
(6 rows)

There's quite a big difference in speed there. 2584.221 ms vs. 74.747 ms.

Any ideas what I can do to improve this without turning sequential scanning off?

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com 


Attachment

Re: Query using SeqScan instead of IndexScan

From
Josh Berkus
Date:
Brenden,

> Any ideas what I can do to improve this without turning sequential  
> scanning off?

Hmmm, looks like your row estimates are good.  Which means it's probably your
postgresql.conf parameters which are off.  Try the following, in the order
below:

1) Raise effective_cache_size to 2/3 of your RAM (remember that ecs is in 8k
pages).  Test again.

2) Multiply all of the cpu_* costs by 0.3.  Test again.

3) Lower random_page_cost by steps to 3.5, then 3.0, then 2.5, then 2.0,
testing each time.

These are all runtime-settable parameters, so you can test them in one query
window, then set them in the main postgresql.conf if they work.

--
Josh Berkus
Sun Microsystems
San Francisco

Re: Query using SeqScan instead of IndexScan

From
"Jim C. Nasby"
Date:
What's the correlation of category_id? The current index scan cost
estimator places a heavy penalty on anything with a correlation much
below about 90%.

On Wed, Mar 29, 2006 at 08:12:28PM -0700, Brendan Duddridge wrote:
> Hi,
>
> I have a query that is using a sequential scan instead of an index
> scan. I've turned off sequential scans and it is in fact faster with
> the index scan.
>
> Here's my before and after.
>
> Before:
>
> ssdev=# SET enable_seqscan TO DEFAULT;
> ssdev=# explain analyze select cp.product_id
>         from category_product cp, product_attribute_value pav
>         where cp.category_id = 1001082 and cp.product_id =
>         pav.product_id;
>
>
>            QUERY PLAN
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ------------------------------
> Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual
> time=4.521..2580.520 rows=19695 loops=1)
>    Hash Cond: ("outer".product_id = "inner".product_id)
>    ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12
> rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846 loops=1)
>    ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual
> time=2.267..2.267 rows=1140 loops=1)
>          ->  Index Scan using x_category_product__category_id_fk_idx
> on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
> time=0.122..1.395 rows=1140 loops=1)
>                Index Cond: (category_id = 1001082)
> Total runtime: 2584.221 ms
> (7 rows)
>
>
> After:
>
> ssdev=# SET enable_seqscan TO false;
> ssdev=# explain analyze select cp.product_id
>         from category_product cp, product_attribute_value pav
>         where cp.category_id = 1001082 and cp.product_id =
>         pav.product_id;
>
>
>               QUERY PLAN
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> -------------------------------------
> Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual
> time=0.373..71.177 rows=19695 loops=1)
>    ->  Index Scan using x_category_product__category_id_fk_idx on
> category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
> time=0.129..1.438 rows=1140 loops=1)
>          Index Cond: (category_id = 1001082)
>    ->  Index Scan using product_attribute_value__product_id_fk_idx
> on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4)
> (actual time=0.016..0.053 rows=17 loops=1140)
>          Index Cond: ("outer".product_id = pav.product_id)
> Total runtime: 74.747 ms
> (6 rows)
>
> There's quite a big difference in speed there. 2584.221 ms vs. 74.747
> ms.
>
> Any ideas what I can do to improve this without turning sequential
> scanning off?
>
> Thanks,
>
> ____________________________________________________________________
> Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com
>
> ClickSpace Interactive Inc.
> Suite L100, 239 - 10th Ave. SE
> Calgary, AB  T2G 0V9
>
> http://www.clickspace.com
>



--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Hi Jim,

I'm not quite sure what you mean by the correlation of category_id?
The category_id is part of a compound primary key in the
category_product
table. The primary key on category_product is (category_id, product_id).

Here's the definitions of the two tables involved in the join:

             Table "public.category_product"
        Column        |         Type         | Modifiers
---------------------+----------------------+-----------
category_id         | integer              | not null
product_id          | integer              | not null
en_name_sort_order  | integer              |
fr_name_sort_order  | integer              |
merchant_sort_order | integer              |
price_sort_order    | integer              |
merchant_count      | integer              |
is_active           | character varying(5) |
Indexes:
     "x_category_product_pk" PRIMARY KEY, btree (category_id,
product_id)
     "category_product__is_active_idx" btree (is_active)
     "category_product__merchant_sort_order_idx" btree
(merchant_sort_order)
     "x_category_product__category_id_fk_idx" btree (category_id)
CLUSTER
     "x_category_product__product_id_fk_idx" btree (product_id)
Foreign-key constraints:
     "x_category_product_category_fk" FOREIGN KEY (category_id)
REFERENCES category(category_id) DEFERRABLE INITIALLY DEFERRED
     "x_category_product_product_fk" FOREIGN KEY (product_id)
REFERENCES product(product_id) DEFERRABLE INITIALLY DEFERRED



              Table "public.product_attribute_value"
            Column           |         Type          | Modifiers
----------------------------+-----------------------+-----------
attribute_id               | integer               | not null
attribute_unit_id          | integer               |
attribute_value_id         | integer               |
boolean_value              | character varying(5)  |
decimal_value              | numeric(30,10)        |
product_attribute_value_id | integer               | not null
product_id                 | integer               | not null
product_reference_id       | integer               |
status_code                | character varying(32) |
Indexes:
     "product_attribute_value_pk" PRIMARY KEY, btree
(product_attribute_value_id)
     "product_attribute_value__attribute_id_fk_idx" btree (attribute_id)
     "product_attribute_value__attribute_unit_id_fk_idx" btree
(attribute_unit_id)
     "product_attribute_value__attribute_value_id_fk_idx" btree
(attribute_value_id)
     "product_attribute_value__product_id_fk_idx" btree (product_id)
     "product_attribute_value__product_reference_id_fk_idx" btree
(product_reference_id)
Foreign-key constraints:
     "product_attribute_value_attribute_fk" FOREIGN KEY
(attribute_id) REFERENCES attribute(attribute_id) DEFERRABLE
INITIALLY DEFERRED
     "product_attribute_value_attributeunit_fk" FOREIGN KEY
(attribute_unit_id) REFERENCES attribute_unit(attribute_unit_id)
DEFERRABLE INITIALLY DEFERRED
     "product_attribute_value_attributevalue_fk" FOREIGN KEY
(attribute_value_id) REFERENCES attribute_value(attribute_value_id)
DEFERRABLE INITIALLY DEFERRED
     "product_attribute_value_product_fk" FOREIGN KEY (product_id)
REFERENCES product(product_id) DEFERRABLE INITIALLY DEFERRED
     "product_attribute_value_productreference_fk" FOREIGN KEY
(product_reference_id) REFERENCES product(product_id) DEFERRABLE
INITIALLY DEFERRED


Not sure if that helps answer your question, but the query is pretty
slow. Sometimes it takes 5 - 15 seconds depending on the category_id
specified.

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Mar 31, 2006, at 8:59 AM, Jim C. Nasby wrote:

> What's the correlation of category_id? The current index scan cost
> estimator places a heavy penalty on anything with a correlation much
> below about 90%.
>
> On Wed, Mar 29, 2006 at 08:12:28PM -0700, Brendan Duddridge wrote:
>> Hi,
>>
>> I have a query that is using a sequential scan instead of an index
>> scan. I've turned off sequential scans and it is in fact faster with
>> the index scan.
>>
>> Here's my before and after.
>>
>> Before:
>>
>> ssdev=# SET enable_seqscan TO DEFAULT;
>> ssdev=# explain analyze select cp.product_id
>>         from category_product cp, product_attribute_value pav
>>         where cp.category_id = 1001082 and cp.product_id =
>>         pav.product_id;
>>
>>
>>            QUERY PLAN
>> ---------------------------------------------------------------------
>> ---
>> ---------------------------------------------------------------------
>> ---
>> ------------------------------
>> Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual
>> time=4.521..2580.520 rows=19695 loops=1)
>>    Hash Cond: ("outer".product_id = "inner".product_id)
>>    ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12
>> rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846
>> loops=1)
>>    ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual
>> time=2.267..2.267 rows=1140 loops=1)
>>          ->  Index Scan using x_category_product__category_id_fk_idx
>> on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
>> time=0.122..1.395 rows=1140 loops=1)
>>                Index Cond: (category_id = 1001082)
>> Total runtime: 2584.221 ms
>> (7 rows)
>>
>>
>> After:
>>
>> ssdev=# SET enable_seqscan TO false;
>> ssdev=# explain analyze select cp.product_id
>>         from category_product cp, product_attribute_value pav
>>         where cp.category_id = 1001082 and cp.product_id =
>>         pav.product_id;
>>
>>
>>               QUERY PLAN
>> ---------------------------------------------------------------------
>> ---
>> ---------------------------------------------------------------------
>> ---
>> -------------------------------------
>> Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual
>> time=0.373..71.177 rows=19695 loops=1)
>>    ->  Index Scan using x_category_product__category_id_fk_idx on
>> category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
>> time=0.129..1.438 rows=1140 loops=1)
>>          Index Cond: (category_id = 1001082)
>>    ->  Index Scan using product_attribute_value__product_id_fk_idx
>> on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4)
>> (actual time=0.016..0.053 rows=17 loops=1140)
>>          Index Cond: ("outer".product_id = pav.product_id)
>> Total runtime: 74.747 ms
>> (6 rows)
>>
>> There's quite a big difference in speed there. 2584.221 ms vs. 74.747
>> ms.
>>
>> Any ideas what I can do to improve this without turning sequential
>> scanning off?
>>
>> Thanks,
>>
>> ____________________________________________________________________
>> Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com
>>
>> ClickSpace Interactive Inc.
>> Suite L100, 239 - 10th Ave. SE
>> Calgary, AB  T2G 0V9
>>
>> http://www.clickspace.com
>>
>
>
>
> --
> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> Pervasive Software      http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that
> your
>        message can get through to the mailing list cleanly
>


Attachment

Re: Query using SeqScan instead of IndexScan

From
"chris smith"
Date:
On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> Hi Jim,
>
> I'm not quite sure what you mean by the correlation of category_id?

It means how many distinct values does it have (at least that's my
understanding of it ;) ).

select category_id, count(*) from category_product group by category_id;

will show you how many category_id's there are and how many products
are in each category.

Having a lot of products in one category (or having a small amount of
categories) can slow things down because the db can't use the index
effectively.. which might be what you're seeing (hence why it's fast
for some categories, slow for others).


> On Mar 31, 2006, at 8:59 AM, Jim C. Nasby wrote:
>
> > What's the correlation of category_id? The current index scan cost
> > estimator places a heavy penalty on anything with a correlation much
> > below about 90%.
> >
> > On Wed, Mar 29, 2006 at 08:12:28PM -0700, Brendan Duddridge wrote:
> >> Hi,
> >>
> >> I have a query that is using a sequential scan instead of an index
> >> scan. I've turned off sequential scans and it is in fact faster with
> >> the index scan.
> >>
> >> Here's my before and after.
> >>
> >> Before:
> >>
> >> ssdev=# SET enable_seqscan TO DEFAULT;
> >> ssdev=# explain analyze select cp.product_id
> >>              from category_product cp, product_attribute_value pav
> >>              where cp.category_id = 1001082 and cp.product_id =
> >>              pav.product_id;
> >>
> >>
> >>            QUERY PLAN
> >> ---------------------------------------------------------------------
> >> ---
> >> ---------------------------------------------------------------------
> >> ---
> >> ------------------------------
> >> Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual
> >> time=4.521..2580.520 rows=19695 loops=1)
> >>    Hash Cond: ("outer".product_id = "inner".product_id)
> >>    ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12
> >> rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846
> >> loops=1)
> >>    ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual
> >> time=2.267..2.267 rows=1140 loops=1)
> >>          ->  Index Scan using x_category_product__category_id_fk_idx
> >> on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
> >> time=0.122..1.395 rows=1140 loops=1)
> >>                Index Cond: (category_id = 1001082)
> >> Total runtime: 2584.221 ms
> >> (7 rows)
> >>
> >>
> >> After:
> >>
> >> ssdev=# SET enable_seqscan TO false;
> >> ssdev=# explain analyze select cp.product_id
> >>              from category_product cp, product_attribute_value pav
> >>              where cp.category_id = 1001082 and cp.product_id =
> >>              pav.product_id;
> >>
> >>
> >>               QUERY PLAN
> >> ---------------------------------------------------------------------
> >> ---
> >> ---------------------------------------------------------------------
> >> ---
> >> -------------------------------------
> >> Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual
> >> time=0.373..71.177 rows=19695 loops=1)
> >>    ->  Index Scan using x_category_product__category_id_fk_idx on
> >> category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
> >> time=0.129..1.438 rows=1140 loops=1)
> >>          Index Cond: (category_id = 1001082)
> >>    ->  Index Scan using product_attribute_value__product_id_fk_idx
> >> on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4)
> >> (actual time=0.016..0.053 rows=17 loops=1140)
> >>          Index Cond: ("outer".product_id = pav.product_id)
> >> Total runtime: 74.747 ms
> >> (6 rows)
> >>
> >> There's quite a big difference in speed there. 2584.221 ms vs. 74.747
> >> ms.
> >>
> >> Any ideas what I can do to improve this without turning sequential
> >> scanning off?
> >>
> >> Thanks,
> >>
> >> ____________________________________________________________________
> >> Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com
> >>
> >> ClickSpace Interactive Inc.
> >> Suite L100, 239 - 10th Ave. SE
> >> Calgary, AB  T2G 0V9
> >>
> >> http://www.clickspace.com
> >>
> >
> >
> >
> > --
> > Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> > Pervasive Software      http://pervasive.com    work: 512-231-6117
> > vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> >
> > ---------------------------(end of
> > broadcast)---------------------------
> > TIP 1: if posting/reading through Usenet, please send an appropriate
> >        subscribe-nomail command to majordomo@postgresql.org so that
> > your
> >        message can get through to the mailing list cleanly
> >
>
>
>
>


--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Ah I see. Ok, well we have a very wide variety here...

category_id | count
-------------+-------
      1000521 | 31145
      1001211 | 22991
      1001490 | 22019
      1001628 | 12472
      1000046 | 10480
      1000087 | 10338
      1001223 | 10020
      1001560 |  9532
      1000954 |  8633
      1001314 |  8191
      1001482 |  8140
      1001556 |  7959
      1001481 |  7850
[snip...]
      1001133 |     1
      1000532 |     1
      1000691 |     1
      1000817 |     1
      1000783 |     1
      1000689 |     1

(1157 rows)

So what's the best kind of query to handle this kind of data to make
it fast in all cases? I'd like get down to sub-second response times.

currently we have:

select cp.product_id
              from category_product cp, product_attribute_value pav
              where cp.category_id = 1001082 and cp.product_id =
              pav.product_id;


Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Mar 31, 2006, at 6:23 PM, chris smith wrote:

> On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
>> Hi Jim,
>>
>> I'm not quite sure what you mean by the correlation of category_id?
>
> It means how many distinct values does it have (at least that's my
> understanding of it ;) ).
>
> select category_id, count(*) from category_product group by
> category_id;
>
> will show you how many category_id's there are and how many products
> are in each category.
>
> Having a lot of products in one category (or having a small amount of
> categories) can slow things down because the db can't use the index
> effectively.. which might be what you're seeing (hence why it's fast
> for some categories, slow for others).
>
>
>> On Mar 31, 2006, at 8:59 AM, Jim C. Nasby wrote:
>>
>>> What's the correlation of category_id? The current index scan cost
>>> estimator places a heavy penalty on anything with a correlation much
>>> below about 90%.
>>>
>>> On Wed, Mar 29, 2006 at 08:12:28PM -0700, Brendan Duddridge wrote:
>>>> Hi,
>>>>
>>>> I have a query that is using a sequential scan instead of an index
>>>> scan. I've turned off sequential scans and it is in fact faster
>>>> with
>>>> the index scan.
>>>>
>>>> Here's my before and after.
>>>>
>>>> Before:
>>>>
>>>> ssdev=# SET enable_seqscan TO DEFAULT;
>>>> ssdev=# explain analyze select cp.product_id
>>>>              from category_product cp, product_attribute_value pav
>>>>              where cp.category_id = 1001082 and cp.product_id =
>>>>              pav.product_id;
>>>>
>>>>
>>>>            QUERY PLAN
>>>> -------------------------------------------------------------------
>>>> --
>>>> ---
>>>> -------------------------------------------------------------------
>>>> --
>>>> ---
>>>> ------------------------------
>>>> Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual
>>>> time=4.521..2580.520 rows=19695 loops=1)
>>>>    Hash Cond: ("outer".product_id = "inner".product_id)
>>>>    ->  Seq Scan on product_attribute_value pav
>>>> (cost=0.00..40127.12
>>>> rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846
>>>> loops=1)
>>>>    ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual
>>>> time=2.267..2.267 rows=1140 loops=1)
>>>>          ->  Index Scan using
>>>> x_category_product__category_id_fk_idx
>>>> on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
>>>> time=0.122..1.395 rows=1140 loops=1)
>>>>                Index Cond: (category_id = 1001082)
>>>> Total runtime: 2584.221 ms
>>>> (7 rows)
>>>>
>>>>
>>>> After:
>>>>
>>>> ssdev=# SET enable_seqscan TO false;
>>>> ssdev=# explain analyze select cp.product_id
>>>>              from category_product cp, product_attribute_value pav
>>>>              where cp.category_id = 1001082 and cp.product_id =
>>>>              pav.product_id;
>>>>
>>>>
>>>>               QUERY PLAN
>>>> -------------------------------------------------------------------
>>>> --
>>>> ---
>>>> -------------------------------------------------------------------
>>>> --
>>>> ---
>>>> -------------------------------------
>>>> Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual
>>>> time=0.373..71.177 rows=19695 loops=1)
>>>>    ->  Index Scan using x_category_product__category_id_fk_idx on
>>>> category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual
>>>> time=0.129..1.438 rows=1140 loops=1)
>>>>          Index Cond: (category_id = 1001082)
>>>>    ->  Index Scan using product_attribute_value__product_id_fk_idx
>>>> on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4)
>>>> (actual time=0.016..0.053 rows=17 loops=1140)
>>>>          Index Cond: ("outer".product_id = pav.product_id)
>>>> Total runtime: 74.747 ms
>>>> (6 rows)
>>>>
>>>> There's quite a big difference in speed there. 2584.221 ms vs.
>>>> 74.747
>>>> ms.
>>>>
>>>> Any ideas what I can do to improve this without turning sequential
>>>> scanning off?
>>>>
>>>> Thanks,
>>>>
>>>> ___________________________________________________________________
>>>> _
>>>> Brendan Duddridge | CTO | 403-277-5591 x24 |
>>>> brendan@clickspace.com
>>>>
>>>> ClickSpace Interactive Inc.
>>>> Suite L100, 239 - 10th Ave. SE
>>>> Calgary, AB  T2G 0V9
>>>>
>>>> http://www.clickspace.com
>>>>
>>>
>>>
>>>
>>> --
>>> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
>>> Pervasive Software      http://pervasive.com    work: 512-231-6117
>>> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
>>>
>>> ---------------------------(end of
>>> broadcast)---------------------------
>>> TIP 1: if posting/reading through Usenet, please send an appropriate
>>>        subscribe-nomail command to majordomo@postgresql.org so that
>>> your
>>>        message can get through to the mailing list cleanly
>>>
>>
>>
>>
>>
>
>
> --
> Postgresql & php tutorials
> http://www.designmagick.com/
>


Attachment

Re: Query using SeqScan instead of IndexScan

From
"Jim C. Nasby"
Date:
On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
> On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> > Hi Jim,
> >
> > I'm not quite sure what you mean by the correlation of category_id?
>
> It means how many distinct values does it have (at least that's my
> understanding of it ;) ).

Your understanding is wrong. :) What you're discussing is n_distinct.

http://www.postgresql.org/docs/8.1/interactive/view-pg-stats.html

correlation: "Statistical correlation between physical row ordering and
logical ordering of the column values. This ranges from -1 to +1. When
the value is near -1 or +1, an index scan on the column will be
estimated to be cheaper than when it is near zero, due to reduction of
random access to the disk. (This column is NULL if the column data type
does not have a < operator.)"

In other words, the following will have a correlation of 1:

1
2
3
...
998
999
1000

And this is -1...

1000
999
...
2
1

While this would have a very low correlation:

1
1000
2
999
...

The lower the correlation, the more expensive an index scan is, because
it's more random. As I mentioned, I believe that the current index scan
cost estimator is flawed though, because it will bias heavily against
correlations that aren't close to 1 or -1.

So, what does

SELECT * FROM pg_stats WHERE tablename='table' AND attname='category_id';

show?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Hi Jim,

from SELECT * FROM pg_stats WHERE tablename='table' AND
attname='category_id'

I find correlation on category_product for category_id is 0.643703

Would setting the index on category_id to be clustered help with this?

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 1, 2006, at 8:32 AM, Jim C. Nasby wrote:

> On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
>> On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
>>> Hi Jim,
>>>
>>> I'm not quite sure what you mean by the correlation of category_id?
>>
>> It means how many distinct values does it have (at least that's my
>> understanding of it ;) ).
>
> Your understanding is wrong. :) What you're discussing is n_distinct.
>
> http://www.postgresql.org/docs/8.1/interactive/view-pg-stats.html
>
> correlation: "Statistical correlation between physical row ordering
> and
> logical ordering of the column values. This ranges from -1 to +1. When
> the value is near -1 or +1, an index scan on the column will be
> estimated to be cheaper than when it is near zero, due to reduction of
> random access to the disk. (This column is NULL if the column data
> type
> does not have a < operator.)"
>
> In other words, the following will have a correlation of 1:
>
> 1
> 2
> 3
> ...
> 998
> 999
> 1000
>
> And this is -1...
>
> 1000
> 999
> ...
> 2
> 1
>
> While this would have a very low correlation:
>
> 1
> 1000
> 2
> 999
> ...
>
> The lower the correlation, the more expensive an index scan is,
> because
> it's more random. As I mentioned, I believe that the current index
> scan
> cost estimator is flawed though, because it will bias heavily against
> correlations that aren't close to 1 or -1.
>
> So, what does
>
> SELECT * FROM pg_stats WHERE tablename='table' AND
> attname='category_id';
>
> show?
> --
> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> Pervasive Software      http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>


Attachment

Re: Query using SeqScan instead of IndexScan

From
"chris smith"
Date:
On 4/2/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
> On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
> > On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> > > Hi Jim,
> > >
> > > I'm not quite sure what you mean by the correlation of category_id?
> >
> > It means how many distinct values does it have (at least that's my
> > understanding of it ;) ).
>
> Your understanding is wrong. :) What you're discussing is n_distinct.

Geez, I'm going well this week ;)

Thanks for the detailed info.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query using SeqScan instead of IndexScan

From
"chris smith"
Date:
On 4/2/06, chris smith <dmagick@gmail.com> wrote:
> On 4/2/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
> > On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
> > > On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> > > > Hi Jim,
> > > >
> > > > I'm not quite sure what you mean by the correlation of category_id?
> > >
> > > It means how many distinct values does it have (at least that's my
> > > understanding of it ;) ).
> >
> > Your understanding is wrong. :) What you're discussing is n_distinct.

<rant>
It'd be nice if the database developers agreed on what terms meant.

http://dev.mysql.com/doc/refman/5.1/en/myisam-index-statistics.html

The SHOW INDEX statement displays a cardinality value based on N/S,
where N is the number of rows in the table and S is the average value
group size. That ratio yields an approximate number of value groups in
the table.
</rant>

A work colleague found that information a few weeks ago so that's
where my misunderstanding came from - if I'm reading that right they
use n_distinct as their "cardinality" basis.. then again I could be
reading that completely wrong too.

I believe postgres (because it's a lot more standards compliant).. but
sheesh - what a difference!

This week's task - stop reading mysql documentation.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query using SeqScan instead of IndexScan

From
Alvaro Herrera
Date:
chris smith wrote:

> I believe postgres (because it's a lot more standards compliant).. but
> sheesh - what a difference!
>
> This week's task - stop reading mysql documentation.

You don't _have_ to believe Postgres -- this is stuff taught in any
statistics course.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Query using SeqScan instead of IndexScan

From
Mark Kirkwood
Date:
chris smith wrote:

> <rant>
> It'd be nice if the database developers agreed on what terms meant.
>
> http://dev.mysql.com/doc/refman/5.1/en/myisam-index-statistics.html
>
> The SHOW INDEX statement displays a cardinality value based on N/S,
> where N is the number of rows in the table and S is the average value
> group size. That ratio yields an approximate number of value groups in
> the table.
> </rant>
>
> A work colleague found that information a few weeks ago so that's
> where my misunderstanding came from - if I'm reading that right they
> use n_distinct as their "cardinality" basis.. then again I could be
> reading that completely wrong too.
>

Yeah that's right - e.g using the same table in postgres and mysql:

pgsql> SELECT attname,n_distinct,correlation
        FROM pg_stats
        WHERE tablename='fact0'
        AND attname LIKE 'd%key';
  attname | n_distinct | correlation
---------+------------+-------------
  d0key   |      10000 |  -0.0211169
  d1key   |        100 |    0.124012
  d2key   |         10 |    0.998393
(3 rows)


mysql> SHOW INDEX FROM fact0
     -> ;

+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name    | Seq_in_index | Column_name |
Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |

+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| fact0 |          1 | fact0_d0key |            1 | d0key       | A
     |       10000 |     NULL | NULL   |      | BTREE      |         |
| fact0 |          1 | fact0_d1key |            1 | d1key       | A
     |         100 |     NULL | NULL   |      | BTREE      |         |
| fact0 |          1 | fact0_d2key |            1 | d2key       | A
     |          10 |     NULL | NULL   |      | BTREE      |         |

+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
3 rows in set (0.00 sec)


It is a bit confusing - '(distinct) cardinality' might be a better
heading for their 'cardinality' column!

On the correlation business - I don't think Mysql calculates it (or if
it does, its not displayed).


> I believe postgres (because it's a lot more standards compliant).. but
> sheesh - what a difference!
>

Well yes - however, to be fair to the Mysql guys, AFAICS the capture and
display of index stats (and any other optimizer related data) is not
part of any standard.


Cheers

Mark

Re: Query using SeqScan instead of IndexScan

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> It is a bit confusing - '(distinct) cardinality' might be a better
> heading for their 'cardinality' column!

The usual mathematical meaning of "cardinality" is "the number of
members in a set".  That isn't real helpful for the point at hand,
because the mathematical definition of a set disallows duplicate
members, so if you're dealing with non-unique values you could argue it
either way about whether to count duplicates or not.  However, I read in
the SQL99 spec (3.1 Definitions)

         d) cardinality (of a value of a collection type): The number of
            elements in that value. Those elements need not necessarily have
            distinct values.

so ... as all too often ... the mysql boys have not got a clue about
standards compliance.  They are using this term in the opposite way
from how the SQL committee uses it.

            regards, tom lane

Re: Query using SeqScan instead of IndexScan

From
Josh Berkus
Date:
Brendan,

> But just as a follow up question to your #1 suggestion, I have 8 GB
> of ram in my production server. You're saying to set the
> effective_cache_size then to 5 GB roughly? Somewhere around 655360?
> Currently it is set to 65535. Is that something that's OS dependent?
> I'm not sure how much memory my server sets aside for disk caching.

Yes, about.  It's really a judgement call; you're looking for the approximate
combined RAM available for disk caching and shared mem.  However, this is
just used as a way of estimating the probability that the data you want is
cached in memory, so you're just trying to be order-of-magnitude accurate,
not to-the-MB accurate.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Query using SeqScan instead of IndexScan

From
Brendan Duddridge
Date:
Hi Josh,

Thanks. I've adjusted my effective_cache_size to 5 GB, so we'll see
how that goes.

I'm also doing some query and de-normalization optimizations so we'll
see how those go too.

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 |  brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Apr 2, 2006, at 4:30 PM, Josh Berkus wrote:

> Brendan,
>
>> But just as a follow up question to your #1 suggestion, I have 8 GB
>> of ram in my production server. You're saying to set the
>> effective_cache_size then to 5 GB roughly? Somewhere around 655360?
>> Currently it is set to 65535. Is that something that's OS dependent?
>> I'm not sure how much memory my server sets aside for disk caching.
>
> Yes, about.  It's really a judgement call; you're looking for the
> approximate
> combined RAM available for disk caching and shared mem.  However,
> this is
> just used as a way of estimating the probability that the data you
> want is
> cached in memory, so you're just trying to be order-of-magnitude
> accurate,
> not to-the-MB accurate.
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>


Attachment

Re: Query using SeqScan instead of IndexScan

From
Jim Nasby
Date:
On Apr 2, 2006, at 6:30 PM, Josh Berkus wrote:
>> But just as a follow up question to your #1 suggestion, I have 8 GB
>> of ram in my production server. You're saying to set the
>> effective_cache_size then to 5 GB roughly? Somewhere around 655360?
>> Currently it is set to 65535. Is that something that's OS dependent?
>> I'm not sure how much memory my server sets aside for disk caching.
>
> Yes, about.  It's really a judgement call; you're looking for the
> approximate
> combined RAM available for disk caching and shared mem.  However,
> this is
> just used as a way of estimating the probability that the data you
> want is
> cached in memory, so you're just trying to be order-of-magnitude
> accurate,
> not to-the-MB accurate.

FWIW, I typically set effective_cache_size to the amount of memory in
the machine minus 1G for the OS and various other daemons, etc. But
as Josh said, as long as your somewhere in the ballpark it's probably
good enough.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: Query using SeqScan instead of IndexScan

From
Jim Nasby
Date:
On Apr 1, 2006, at 12:51 PM, Brendan Duddridge wrote:
> from SELECT * FROM pg_stats WHERE tablename='table' AND
> attname='category_id'
>
> I find correlation on category_product for category_id is 0.643703
>
> Would setting the index on category_id to be clustered help with this?

It would absolutely help on the query in question. In my experience,
a correlation of 0.64 is too low to allow an index scan to be used
for anything but a tiny number of rows.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461