Re: faster way to calculate top "tags" for a "resource" based on a column - Mailing list pgsql-general
From | Marc Mamin |
---|---|
Subject | Re: faster way to calculate top "tags" for a "resource" based on a column |
Date | |
Msg-id | B6F6FD62F2624C4C9916AC0175D56D8828AF19D3@jenmbs01.ad.intershop.net Whole thread Raw |
In response to | faster way to calculate top "tags" for a "resource" based on a column (Jonathan Vanasco <postgres@2xlp.com>) |
Responses |
Re: faster way to calculate top "tags" for a "resource" based on a column
|
List | pgsql-general |
>I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any timeoff this one. I'm hoping someone has another strategy. > >I have 2 tables: > resource > resource_2_tag > >I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource"table. > >both tables have around 1.6million records. > >If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk. >If the database doesn't need to warm up, it averages 1.76seconds. > >The 1.76s time is troubling me. >Searching for the discrete elements of this is pretty lightweight. > >here's an explain -- http://explain.depesz.com/s/PndC > >I tried a subquery instead of a join, and the query optimized the plan to the same. > >i'm hoping someone will see something that I just don't see. > > > > Table "public.resource_2_tag" > Column | Type | Modifiers >-----------------------+---------+----------- > resource_id | integer | > tag_id | integer | >Indexes: > "_idx_speed_resource_2_tag__resource_id" btree (resource_id) > "_idx_speed_resource_2_tag__tag_id" btree (tag_id) > > Table "public.resource" > Column | Type | Modifiers >-------------------------------------+-----------------------------+---------------------------------------------------------- > id | integer | not null default nextval('resource_id_seq'::regclass) >resource_attribute1_id | integer | >lots of other columns | | >Indexes: > "resource_attribute1_idx" btree (resource_attribute1_id) > >-------------------------------------------------------------------------------- > >select count(*) from resource; >-- 1669729 > >select count(*) from resource_2_tag; >-- 1676594 > >select count(*) from resource where resource_attribute1_id = 614; >-- 5184 >-- 4.386ms > >select id from resource where resource_attribute1_id = 614; >-- 5184 >-- 87.303ms > >popping the 5k elements into an "in" clause, will run the query in around 100ms. > > >EXPLAIN ANALYZE >SELECT > resource_2_tag.tag_id AS resource_2_tag_tag_id, > count(resource_2_tag.tag_id) AS counted >FROM > resource_2_tag >JOIN resource ON resource.id = resource_2_tag.resource_id >WHERE > resource.resource_attribute1_id = 614 >GROUP BY resource_2_tag.tag_id >ORDER BY counted DESC >LIMIT 25 OFFSET 0; Hi, it seems to me that your subquery may deliver duplicate ids. And with the selectivity of your example, I would expect an index usage instead of a table scan. You may check how up to date your statistics are and try to raise the statistic target on the column resource_2_tag.tag_id. Also try a CTE form for your query: WITH A as (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614 ) SELECT resource_2_tag.tag_id AS resource_2_tag_tag_id, count(resource_2_tag.tag_id) AS counted FROM resource_2_tag JOIN A ON A.id = resource_2_tag.resource_id ORDER BY counted DESC LIMIT 25; regards, Marc Mamin
pgsql-general by date: