Home > mailing lists

faster way to calculate top "tags" for a "resource" based on a column - Mailing list pgsql-general

From	Jonathan Vanasco
Subject	faster way to calculate top "tags" for a "resource" based on a column
Date	October 3, 2014 16:21:57
Msg-id	E6CD86FB-6006-460C-B84A-E1D731A3C996@2xlp.com Whole thread Raw
In response to	Re: How to find greatest record before known values fast (Merlin Moncure <mmoncure@gmail.com>)
Responses	Re: faster way to calculate top "tags" for a "resource" based on a column Re: faster way to calculate top "tags" for a "resource" based on a column
List	pgsql-general

Tree view

I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time
offthis one.  I'm hoping someone has another strategy. 

I have 2 tables:
    resource
    resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the
"resource"table. 

both tables have around 1.6million records.

If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
If the database doesn't need to warm up, it averages 1.76seconds.

The 1.76s time is troubling me.
Searching for the discrete elements of this is pretty lightweight.

here's an explain --  http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the query optimized the plan to the same.

i'm hoping someone will see something that I just don't see.



      Table "public.resource_2_tag"
        Column         |  Type   | Modifiers
-----------------------+---------+-----------
 resource_id           | integer |
 tag_id                | integer |
Indexes:
    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)

                                                  Table "public.resource"
               Column                |            Type             |                        Modifiers
      

-------------------------------------+-----------------------------+----------------------------------------------------------
 id                                  | integer                     | not null default
nextval('resource_id_seq'::regclass)
resource_attribute1_id               | integer                     |
lots of other columns                |                             |
Indexes:
    "resource_attribute1_idx" btree (resource_attribute1_id)

--------------------------------------------------------------------------------

select count(*) from resource;
-- 1669729

select count(*) from resource_2_tag;
-- 1676594

select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms

select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms

popping the 5k elements into an "in" clause, will run the query in around 100ms.


EXPLAIN ANALYZE
SELECT
    resource_2_tag.tag_id AS resource_2_tag_tag_id,
    count(resource_2_tag.tag_id) AS counted
FROM
    resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
    resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

--------------------------------------------------------------------------------

 Limit  (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
   ->  Sort  (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
         Sort Key: (count(resource_2_tag.tag_id))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887
loops=1)
               ->  Hash Join  (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068
loops=1)
                     Hash Cond: (resource_2_tag.resource_id = resource.id)
                     ->  Seq Scan on resource_2_tag  (cost=0.00..25847.94 rows=1676594 width=8) (actual
time=0.032..513.046rows=1676594 loops=1) 
                     ->  Hash  (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184
loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 183kB
                           ->  Bitmap Heap Scan on resource  (cost=272.68..35803.88 rows=5120 width=4) (actual
time=5.911..90.264rows=5184 loops=1) 
                                 Recheck Cond: (resource_attribute1_id = 614)
                                 ->  Bitmap Index Scan on resource_attribute1_idx  (cost=0.00..271.40 rows=5120
width=0)(actual time=3.575..3.575 rows=5184 loops=1) 
                                       Index Cond: (resource_attribute1_id = 614)
 Total runtime: 1503.146 ms

pgsql-general by date:

From: Tim Mickelson
Date: 03 October 2014, 16:17:50
Subject: Really strange foreign key constraint problem blocking delete

From: Vick Khera
Date: 03 October 2014, 18:38:34
Subject: Re: Really strange foreign key constraint problem blocking delete

faster way to calculate top "tags" for a "resource" based on a column - Mailing list pgsql-general

Previous

Next