Re: Self-referencing table question - Mailing list pgsql-sql

From Sean Davis
Subject Re: Self-referencing table question
Date
Msg-id 7e154914b76fb1a7c55912b621400860@mail.nih.gov
Whole thread Raw
In response to Self-referencing table question  (Sean Davis <sdavis2@mail.nih.gov>)
Responses Re: Self-referencing table question
List pgsql-sql
I answer my own question, if only for my own records.  The following  
query is about 5-6 times faster than the original.  Of course, if  
anyone else has other ideas, I'd be happy to hear them.

Sean

explain analyze select from_id,to_id,val from exprsdb.correlation where  
from_id in (select to_id from exprsdb.correlation where from_id=2424  
order by val desc limit 100) and to_id in (select to_id from  
exprsdb.correlation where from_id=2424 order by val desc limit 100) and  
val>0.6 and to_id<from_id;                                                                                      QUERY
PLAN
------------------------------------------------------------------------ 
------------------------------------------------------------------------ 
------------------------------------ Hash IN Join  (cost=4709.94..74758.01 rows=555 width=17) (actual  
time=110.291..1671.767 rows=973 loops=1)   Hash Cond: ("outer".to_id = "inner".to_id)   ->  Nested Loop
(cost=2354.97..72181.72rows=43154 width=17)  
 
(actual time=54.036..1612.746 rows=1482 loops=1)         ->  HashAggregate  (cost=2354.97..2354.97 rows=100 width=4)  
(actual time=53.656..54.062 rows=100 loops=1)               ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72  
rows=100 width=4) (actual time=53.473..53.595 rows=100 loops=1)                     ->  Limit  (cost=2353.47..2353.72
rows=100 
 
width=13) (actual time=53.469..53.507 rows=100 loops=1)                           ->  Sort  (cost=2353.47..2415.03
rows=24624 
 
width=13) (actual time=53.467..53.481 rows=100 loops=1)                                 Sort Key: val
             ->  Index Scan using  
 
correlation_from_id_idx on correlation  (cost=0.00..557.42 rows=24624  
width=13) (actual time=0.199..17.717 rows=7788 loops=1)                                       Index Cond: (from_id =
2424)        ->  Index Scan using correlation_from_id_idx on correlation   
 
(cost=0.00..692.87 rows=432 width=17) (actual time=2.765..15.560  
rows=15 loops=100)               Index Cond: (correlation.from_id = "outer".to_id)               Filter: ((val > 0.6)
AND(to_id < from_id))   ->  Hash  (cost=2354.72..2354.72 rows=100 width=4) (actual  
 
time=56.239..56.239 rows=0 loops=1)         ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72  
rows=100 width=4) (actual time=56.004..56.121 rows=100 loops=1)               ->  Limit  (cost=2353.47..2353.72
rows=100width=13)  
 
(actual time=56.001..56.038 rows=100 loops=1)                     ->  Sort  (cost=2353.47..2415.03 rows=24624  
width=13) (actual time=55.999..56.012 rows=100 loops=1)                           Sort Key: val
 ->  Index Scan using correlation_from_id_idx  
 
on correlation  (cost=0.00..557.42 rows=24624 width=13) (actual  
time=0.517..20.307 rows=7788 loops=1)                                 Index Cond: (from_id = 2424) Total runtime:
1676.966ms
 


On Mar 22, 2005, at 2:33 PM, Sean Davis wrote:

> I have a table that looks like:
>
>  Column  |     Type     | Modifiers | Description
> ---------+--------------+-----------+-------------
>  from_id | integer      | not null  |
>  to_id   | integer      | not null  |
>  val     | numeric(4,3) |           |
> Indexes:
>     "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
>     "correlation_from_id_idx" btree (from_id)
>     "correlation_to_id_idx" btree (to_id)
>     "correlation_val_idx" btree (val)
> Has OIDs: yes
>
> The table describes a pairwise correlation matrix between about 7700  
> vectors (so the table has n^2= 60652944 rows, to be exact).  I am  
> trying to choose the top 100 correlated vectors with a seed vector;  
> this is easily:
>
> select to_id from correlation where from_id=623 order by val desc  
> limit 100;
>
> Then, I want to take those 100 values and find all from_id,to_id  
> tuples where val>0.5 (to construct a graph where all "ids" are nodes  
> and are connected to each other when their correlation is >0.5).  I  
> can do this like:
>
> explain analyze select
>     from_id,to_id,val
>     from exprsdb.correlation
>     where from_id in  
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2 
> 7,28,29,30)
>     and to_id in  
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2 
> 7,28,29,30)
>     and from_id>to_id
>     and val>0.5;
>
> However, this does not scale well AT ALL.  The actual (very messy)  
> explain analyze output is below.  The thing I notice is that the index  
> on to_id is not used.  Also, the primary key index on (from_id, to_id  
> is not used, it seems.  Finally, with only 30 values, this already  
> takes 2.6 seconds and I am proposing to do this on 100-200 values.   
> Any hints on how better to accomplish this set of tasks?
>
>  Index Scan using correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx,  
> correlation_from_id_idx, correlation_from_id_idx on correlation   
> (cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967  
> rows=19 loops=1)
>    Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR  
> (from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR  
> (from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR  
> (from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15)  
> OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id =  
> 19) OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id  
> = 23) OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR  
> (from_id = 27) OR (from_id = 28) OR (from_id = 29) OR (from_id = 30))
>    Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4)  
> OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id  
> = 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13)  
> OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR  
> (to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id  
> = 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26)  
> OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND  
> (from_id > to_id) AND (val > 0.5))
>  Total runtime: 2604.383 ms
>
> Thanks,
> Sean
>
>
> ---------------------------(end of  
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to  
> majordomo@postgresql.org



pgsql-sql by date:

Previous
From: Sean Davis
Date:
Subject: Self-referencing table question
Next
From: Richard Huxton
Date:
Subject: Re: Self-referencing table question