Thread: Query using SeqScan instead of IndexScan
Hi,
I have a query that is using a sequential scan instead of an index scan. I've turned off sequential scans and it is in fact faster with the index scan.
Here's my before and after.
Before:
ssdev=# SET enable_seqscan TO DEFAULT;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=25.52..52140.59 rows=5139 width=4) (actual time=4.521..2580.520 rows=19695 loops=1)
   Hash Cond: ("outer".product_id = "inner".product_id)
   ->  Seq Scan on product_attribute_value pav  (cost=0.00..40127.12 rows=2387312 width=4) (actual time=0.039..1469.295 rows=2385846 loops=1)
   ->  Hash  (cost=23.10..23.10 rows=970 width=4) (actual time=2.267..2.267 rows=1140 loops=1)
         ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.122..1.395 rows=1140 loops=1)
               Index Cond: (category_id = 1001082)
 Total runtime: 2584.221 ms
(7 rows)
After:
ssdev=# SET enable_seqscan TO false;
ssdev=# explain analyze select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082 and cp.product_id = pav.product_id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..157425.22 rows=5139 width=4) (actual time=0.373..71.177 rows=19695 loops=1)
   ->  Index Scan using x_category_product__category_id_fk_idx on category_product cp  (cost=0.00..23.10 rows=970 width=4) (actual time=0.129..1.438 rows=1140 loops=1)
         Index Cond: (category_id = 1001082)
   ->  Index Scan using product_attribute_value__product_id_fk_idx on product_attribute_value pav  (cost=0.00..161.51 rows=61 width=4) (actual time=0.016..0.053 rows=17 loops=1140)
         Index Cond: ("outer".product_id = pav.product_id)
 Total runtime: 74.747 ms
(6 rows)
There's quite a big difference in speed there. 2584.221 ms vs. 74.747 ms.
Any ideas what I can do to improve this without turning sequential scanning off?
Thanks,
____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com
ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9
http://www.clickspace.com
Oops. I forgot to mention that I was using PostgreSQL 8.1.3 on Mac OS X.
Thanks,
____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com
ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9
http://www.clickspace.com
Brendan,

> Any ideas what I can do to improve this without turning sequential
> scanning off?

Hmmm, looks like your row estimates are good. Which means it's probably
your postgresql.conf parameters which are off.

Try the following, in the order below:

1) Raise effective_cache_size to 2/3 of your RAM (remember that ecs is
in 8k pages). Test again.

2) Multiply all of the cpu_* costs by 0.3. Test again.

3) Lower random_page_cost by steps to 3.5, then 3.0, then 2.5, then 2.0,
testing each time.

These are all runtime-settable parameters, so you can test them in one
query window, then set them in the main postgresql.conf if they work.

--
Josh Berkus
Sun Microsystems
San Francisco
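Since all three of these are runtime-settable, one way to try them is in a single psql session; this is a sketch only, the numbers below are illustrative rather than values from this thread, and the cpu_* settings should be 0.3 times whatever SHOW reports on the actual server:

    -- Check the current values first, so the 0.3 multiplier is applied to the real settings.
    SHOW effective_cache_size;
    SHOW cpu_tuple_cost;
    SHOW cpu_index_tuple_cost;
    SHOW cpu_operator_cost;
    SHOW random_page_cost;

    -- Session-local changes; nothing is written to postgresql.conf.
    SET effective_cache_size = 655360;   -- roughly 5 GB on an 8 GB box, expressed in 8 kB pages
    SET random_page_cost = 3.5;          -- then 3.0, 2.5, 2.0, re-testing each step

    -- Re-run the problem query after each change and compare plans and timings.
    EXPLAIN ANALYZE
    SELECT cp.product_id
    FROM category_product cp, product_attribute_value pav
    WHERE cp.category_id = 1001082
      AND cp.product_id = pav.product_id;

If a combination produces good plans across several category_id values, the same settings can then go into postgresql.conf and be picked up with a reload.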
What's the correlation of category_id? The current index scan cost
estimator places a heavy penalty on anything with a correlation much
below about 90%.

On Wed, Mar 29, 2006 at 08:12:28PM -0700, Brendan Duddridge wrote:
> There's quite a big difference in speed there. 2584.221 ms vs. 74.747
> ms.
>
> Any ideas what I can do to improve this without turning sequential
> scanning off?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Hi Jim,

I'm not quite sure what you mean by the correlation of category_id?

The category_id is part of a compound primary key in the
category_product table. The primary key on category_product is
(category_id, product_id).

Here are the definitions of the two tables involved in the join:

              Table "public.category_product"
       Column        |         Type         | Modifiers
---------------------+----------------------+-----------
 category_id         | integer              | not null
 product_id          | integer              | not null
 en_name_sort_order  | integer              |
 fr_name_sort_order  | integer              |
 merchant_sort_order | integer              |
 price_sort_order    | integer              |
 merchant_count      | integer              |
 is_active           | character varying(5) |
Indexes:
    "x_category_product_pk" PRIMARY KEY, btree (category_id, product_id)
    "category_product__is_active_idx" btree (is_active)
    "category_product__merchant_sort_order_idx" btree (merchant_sort_order)
    "x_category_product__category_id_fk_idx" btree (category_id) CLUSTER
    "x_category_product__product_id_fk_idx" btree (product_id)
Foreign-key constraints:
    "x_category_product_category_fk" FOREIGN KEY (category_id) REFERENCES category(category_id) DEFERRABLE INITIALLY DEFERRED
    "x_category_product_product_fk" FOREIGN KEY (product_id) REFERENCES product(product_id) DEFERRABLE INITIALLY DEFERRED

              Table "public.product_attribute_value"
           Column            |         Type          | Modifiers
-----------------------------+-----------------------+-----------
 attribute_id                | integer               | not null
 attribute_unit_id           | integer               |
 attribute_value_id          | integer               |
 boolean_value               | character varying(5)  |
 decimal_value               | numeric(30,10)        |
 product_attribute_value_id  | integer               | not null
 product_id                  | integer               | not null
 product_reference_id        | integer               |
 status_code                 | character varying(32) |
Indexes:
    "product_attribute_value_pk" PRIMARY KEY, btree (product_attribute_value_id)
    "product_attribute_value__attribute_id_fk_idx" btree (attribute_id)
    "product_attribute_value__attribute_unit_id_fk_idx" btree (attribute_unit_id)
    "product_attribute_value__attribute_value_id_fk_idx" btree (attribute_value_id)
    "product_attribute_value__product_id_fk_idx" btree (product_id)
    "product_attribute_value__product_reference_id_fk_idx" btree (product_reference_id)
Foreign-key constraints:
    "product_attribute_value_attribute_fk" FOREIGN KEY (attribute_id) REFERENCES attribute(attribute_id) DEFERRABLE INITIALLY DEFERRED
    "product_attribute_value_attributeunit_fk" FOREIGN KEY (attribute_unit_id) REFERENCES attribute_unit(attribute_unit_id) DEFERRABLE INITIALLY DEFERRED
    "product_attribute_value_attributevalue_fk" FOREIGN KEY (attribute_value_id) REFERENCES attribute_value(attribute_value_id) DEFERRABLE INITIALLY DEFERRED
    "product_attribute_value_product_fk" FOREIGN KEY (product_id) REFERENCES product(product_id) DEFERRABLE INITIALLY DEFERRED
    "product_attribute_value_productreference_fk" FOREIGN KEY (product_reference_id) REFERENCES product(product_id) DEFERRABLE INITIALLY DEFERRED

Not sure if that helps answer your question, but the query is pretty
slow. Sometimes it takes 5 - 15 seconds depending on the category_id
specified.

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9

http://www.clickspace.com

On Mar 31, 2006, at 8:59 AM, Jim C. Nasby wrote:

> What's the correlation of category_id? The current index scan cost
> estimator places a heavy penalty on anything with a correlation much
> below about 90%.
On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> Hi Jim,
>
> I'm not quite sure what you mean by the correlation of category_id?

It means how many distinct values does it have (at least that's my
understanding of it ;) ).

select category_id, count(*) from category_product group by category_id;

will show you how many category_id's there are and how many products
are in each category.

Having a lot of products in one category (or having a small amount of
categories) can slow things down because the db can't use the index
effectively.. which might be what you're seeing (hence why it's fast
for some categories, slow for others).

--
Postgresql & php tutorials
http://www.designmagick.com/
Ah I see. Ok, well we have a very wide variety here...

 category_id | count
-------------+-------
     1000521 | 31145
     1001211 | 22991
     1001490 | 22019
     1001628 | 12472
     1000046 | 10480
     1000087 | 10338
     1001223 | 10020
     1001560 |  9532
     1000954 |  8633
     1001314 |  8191
     1001482 |  8140
     1001556 |  7959
     1001481 |  7850
[snip...]
     1001133 |     1
     1000532 |     1
     1000691 |     1
     1000817 |     1
     1000783 |     1
     1000689 |     1
(1157 rows)

So what's the best kind of query to handle this kind of data to make it
fast in all cases? I'd like to get down to sub-second response times.

Currently we have:

select cp.product_id
from category_product cp, product_attribute_value pav
where cp.category_id = 1001082
and cp.product_id = pav.product_id;

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9

http://www.clickspace.com

On Mar 31, 2006, at 6:23 PM, chris smith wrote:

> It means how many distinct values does it have (at least that's my
> understanding of it ;) ).
>
> Having a lot of products in one category (or having a small amount of
> categories) can slow things down because the db can't use the index
> effectively.. which might be what you're seeing (hence why it's fast
> for some categories, slow for others).
On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
> On 4/1/06, Brendan Duddridge <brendan@clickspace.com> wrote:
> > I'm not quite sure what you mean by the correlation of category_id?
>
> It means how many distinct values does it have (at least that's my
> understanding of it ;) ).

Your understanding is wrong. :) What you're discussing is n_distinct.

http://www.postgresql.org/docs/8.1/interactive/view-pg-stats.html

correlation: "Statistical correlation between physical row ordering and
logical ordering of the column values. This ranges from -1 to +1. When
the value is near -1 or +1, an index scan on the column will be
estimated to be cheaper than when it is near zero, due to reduction of
random access to the disk. (This column is NULL if the column data type
does not have a < operator.)"

In other words, the following will have a correlation of 1:

1
2
3
...
998
999
1000

And this is -1...

1000
999
...
2
1

While this would have a very low correlation:

1
1000
2
999
...

The lower the correlation, the more expensive an index scan is, because
it's more random. As I mentioned, I believe that the current index scan
cost estimator is flawed though, because it will bias heavily against
correlations that aren't close to 1 or -1.

So, what does

SELECT * FROM pg_stats WHERE tablename='table' AND attname='category_id';

show?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
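Jim's ordered-versus-shuffled example can be reproduced directly. The sketch below uses two throwaway demo tables (the names are invented, not part of Brendan's schema) to show how physical row order drives pg_stats.correlation while n_distinct stays the same:

    -- Same data, two physical orderings.
    CREATE TABLE corr_demo_sorted AS
        SELECT g AS id FROM generate_series(1, 100000) g;                    -- physical order matches logical order
    CREATE TABLE corr_demo_shuffled AS
        SELECT g AS id FROM generate_series(1, 100000) g ORDER BY random();  -- same values, random physical order

    ANALYZE corr_demo_sorted;
    ANALYZE corr_demo_shuffled;

    -- n_distinct is the same for both; correlation should be close to 1 for the
    -- sorted copy and near 0 for the shuffled one.
    SELECT tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename IN ('corr_demo_sorted', 'corr_demo_shuffled');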
Hi Jim,

from

SELECT * FROM pg_stats WHERE tablename='table' AND attname='category_id'

I find correlation on category_product for category_id is 0.643703.

Would setting the index on category_id to be clustered help with this?

Thanks,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9

http://www.clickspace.com

On Apr 1, 2006, at 8:32 AM, Jim C. Nasby wrote:

> The lower the correlation, the more expensive an index scan is, because
> it's more random. As I mentioned, I believe that the current index scan
> cost estimator is flawed though, because it will bias heavily against
> correlations that aren't close to 1 or -1.
On 4/2/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
> On Sat, Apr 01, 2006 at 11:23:37AM +1000, chris smith wrote:
> > It means how many distinct values does it have (at least that's my
> > understanding of it ;) ).
>
> Your understanding is wrong. :) What you're discussing is n_distinct.

Geez, I'm going well this week ;)

Thanks for the detailed info.

--
Postgresql & php tutorials
http://www.designmagick.com/
On 4/2/06, chris smith <dmagick@gmail.com> wrote:
> On 4/2/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
> > Your understanding is wrong. :) What you're discussing is n_distinct.

<rant>
It'd be nice if the database developers agreed on what terms meant.

http://dev.mysql.com/doc/refman/5.1/en/myisam-index-statistics.html

  The SHOW INDEX statement displays a cardinality value based on N/S,
  where N is the number of rows in the table and S is the average value
  group size. That ratio yields an approximate number of value groups
  in the table.
</rant>

A work colleague found that information a few weeks ago so that's
where my misunderstanding came from - if I'm reading that right they
use n_distinct as their "cardinality" basis.. then again I could be
reading that completely wrong too.

I believe postgres (because it's a lot more standards compliant).. but
sheesh - what a difference!

This week's task - stop reading mysql documentation.

--
Postgresql & php tutorials
http://www.designmagick.com/
chris smith wrote:
> I believe postgres (because it's a lot more standards compliant).. but
> sheesh - what a difference!
>
> This week's task - stop reading mysql documentation.

You don't _have_ to believe Postgres -- this is stuff taught in any
statistics course.

--
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
chris smith wrote:
> A work colleague found that information a few weeks ago so that's
> where my misunderstanding came from - if I'm reading that right they
> use n_distinct as their "cardinality" basis.. then again I could be
> reading that completely wrong too.

Yeah, that's right - e.g. using the same table in postgres and mysql:

pgsql> SELECT attname, n_distinct, correlation FROM pg_stats
       WHERE tablename='fact0' AND attname LIKE 'd%key';

 attname | n_distinct | correlation
---------+------------+-------------
 d0key   |      10000 |  -0.0211169
 d1key   |        100 |    0.124012
 d2key   |         10 |    0.998393
(3 rows)

mysql> SHOW INDEX FROM fact0;

+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name    | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| fact0 |          1 | fact0_d0key |            1 | d0key       | A         |       10000 |     NULL | NULL   |      | BTREE      |         |
| fact0 |          1 | fact0_d1key |            1 | d1key       | A         |         100 |     NULL | NULL   |      | BTREE      |         |
| fact0 |          1 | fact0_d2key |            1 | d2key       | A         |          10 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
3 rows in set (0.00 sec)

It is a bit confusing - '(distinct) cardinality' might be a better
heading for their 'cardinality' column!

On the correlation business - I don't think Mysql calculates it (or if
it does, it's not displayed).

> I believe postgres (because it's a lot more standards compliant).. but
> sheesh - what a difference!

Well yes - however, to be fair to the Mysql guys, AFAICS the capture and
display of index stats (and any other optimizer-related data) is not
part of any standard.

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:
> It is a bit confusing - '(distinct) cardinality' might be a better
> heading for their 'cardinality' column!

The usual mathematical meaning of "cardinality" is "the number of
members in a set". That isn't real helpful for the point at hand,
because the mathematical definition of a set disallows duplicate
members, so if you're dealing with non-unique values you could argue it
either way about whether to count duplicates or not. However, I read in
the SQL99 spec (3.1 Definitions)

         d) cardinality (of a value of a collection type): The number
            of elements in that value. Those elements need not
            necessarily have distinct values.

so ... as all too often ... the mysql boys have not got a clue about
standards compliance. They are using this term in the opposite way from
how the SQL committee uses it.

			regards, tom lane
Brendan,

> But just as a follow up question to your #1 suggestion, I have 8 GB
> of ram in my production server. You're saying to set the
> effective_cache_size then to 5 GB roughly? Somewhere around 655360?
> Currently it is set to 65535. Is that something that's OS dependent?
> I'm not sure how much memory my server sets aside for disk caching.

Yes, about. It's really a judgement call; you're looking for the
approximate combined RAM available for disk caching and shared mem.
However, this is just used as a way of estimating the probability that
the data you want is cached in memory, so you're just trying to be
order-of-magnitude accurate, not to-the-MB accurate.

--
Josh Berkus
Aglio Database Solutions
San Francisco
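For reference, the 655360 figure Brendan mentions is exactly 5 GB expressed in 8 kB pages, and the setting can be tried per session before editing postgresql.conf. A quick sanity check of the arithmetic (a sketch, not a tuned recommendation):

    -- 5 GB = 5 * 1024 * 1024 kB = 5242880 kB; divided by 8 kB per page = 655360 pages.
    SELECT (5 * 1024 * 1024) / 8 AS pages_for_5gb;   -- returns 655360

    SET effective_cache_size = 655360;   -- session-local trial value
    SHOW effective_cache_size;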
Hi Josh,

Thanks. I've adjusted my effective_cache_size to 5 GB, so we'll see how
that goes. I'm also doing some query and de-normalization optimizations
so we'll see how those go too.

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan@clickspace.com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9

http://www.clickspace.com

On Apr 2, 2006, at 4:30 PM, Josh Berkus wrote:

> Yes, about. It's really a judgement call; you're looking for the
> approximate combined RAM available for disk caching and shared mem.
On Apr 2, 2006, at 6:30 PM, Josh Berkus wrote:
> Yes, about. It's really a judgement call; you're looking for the
> approximate combined RAM available for disk caching and shared mem.
> However, this is just used as a way of estimating the probability that
> the data you want is cached in memory, so you're just trying to be
> order-of-magnitude accurate, not to-the-MB accurate.

FWIW, I typically set effective_cache_size to the amount of memory in
the machine minus 1G for the OS and various other daemons, etc. But as
Josh said, as long as you're somewhere in the ballpark it's probably
good enough.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
On Apr 1, 2006, at 12:51 PM, Brendan Duddridge wrote:
> from SELECT * FROM pg_stats WHERE tablename='table' AND
> attname='category_id'
>
> I find correlation on category_product for category_id is 0.643703
>
> Would setting the index on category_id to be clustered help with this?

It would absolutely help on the query in question. In my experience, a
correlation of 0.64 is too low to allow an index scan to be used for
anything but a tiny number of rows.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
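Since the \d output earlier in the thread already shows x_category_product__category_id_fk_idx marked as the table's CLUSTER index, acting on this would mean re-clustering and re-analyzing. A sketch in 8.1 syntax (CLUSTER takes an exclusive lock and rewrites the table, so it needs a quiet window, and the physical ordering decays as new rows arrive, so it has to be repeated periodically):

    -- Physically reorder category_product by category_id using its designated cluster index.
    CLUSTER x_category_product__category_id_fk_idx ON category_product;
    -- Because the index is already marked CLUSTER, plain "CLUSTER category_product;" does the same thing.

    -- Refresh statistics so the planner sees the new correlation.
    ANALYZE category_product;

    -- Verify: correlation for category_id should now be close to 1.
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'category_product' AND attname = 'category_id';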