Re: Self-referencing table question - Mailing list pgsql-sql
From | Sean Davis |
---|---|
Subject | Re: Self-referencing table question |
Date | |
Msg-id | 7e154914b76fb1a7c55912b621400860@mail.nih.gov Whole thread Raw |
In response to | Self-referencing table question (Sean Davis <sdavis2@mail.nih.gov>) |
Responses |
Re: Self-referencing table question
|
List | pgsql-sql |
I answer my own question, if only for my own records. The following query is about 5-6 times faster than the original. Of course, if anyone else has other ideas, I'd be happy to hear them. Sean explain analyze select from_id,to_id,val from exprsdb.correlation where from_id in (select to_id from exprsdb.correlation where from_id=2424 order by val desc limit 100) and to_id in (select to_id from exprsdb.correlation where from_id=2424 order by val desc limit 100) and val>0.6 and to_id<from_id; QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------ Hash IN Join (cost=4709.94..74758.01 rows=555 width=17) (actual time=110.291..1671.767 rows=973 loops=1) Hash Cond: ("outer".to_id = "inner".to_id) -> Nested Loop (cost=2354.97..72181.72rows=43154 width=17) (actual time=54.036..1612.746 rows=1482 loops=1) -> HashAggregate (cost=2354.97..2354.97 rows=100 width=4) (actual time=53.656..54.062 rows=100 loops=1) -> Subquery Scan "IN_subquery" (cost=2353.47..2354.72 rows=100 width=4) (actual time=53.473..53.595 rows=100 loops=1) -> Limit (cost=2353.47..2353.72 rows=100 width=13) (actual time=53.469..53.507 rows=100 loops=1) -> Sort (cost=2353.47..2415.03 rows=24624 width=13) (actual time=53.467..53.481 rows=100 loops=1) Sort Key: val -> Index Scan using correlation_from_id_idx on correlation (cost=0.00..557.42 rows=24624 width=13) (actual time=0.199..17.717 rows=7788 loops=1) Index Cond: (from_id = 2424) -> Index Scan using correlation_from_id_idx on correlation (cost=0.00..692.87 rows=432 width=17) (actual time=2.765..15.560 rows=15 loops=100) Index Cond: (correlation.from_id = "outer".to_id) Filter: ((val > 0.6) AND(to_id < from_id)) -> Hash (cost=2354.72..2354.72 rows=100 width=4) (actual time=56.239..56.239 rows=0 loops=1) -> Subquery Scan "IN_subquery" (cost=2353.47..2354.72 rows=100 width=4) (actual time=56.004..56.121 rows=100 loops=1) -> Limit (cost=2353.47..2353.72 rows=100width=13) (actual time=56.001..56.038 rows=100 loops=1) -> Sort (cost=2353.47..2415.03 rows=24624 width=13) (actual time=55.999..56.012 rows=100 loops=1) Sort Key: val -> Index Scan using correlation_from_id_idx on correlation (cost=0.00..557.42 rows=24624 width=13) (actual time=0.517..20.307 rows=7788 loops=1) Index Cond: (from_id = 2424) Total runtime: 1676.966ms On Mar 22, 2005, at 2:33 PM, Sean Davis wrote: > I have a table that looks like: > > Column | Type | Modifiers | Description > ---------+--------------+-----------+------------- > from_id | integer | not null | > to_id | integer | not null | > val | numeric(4,3) | | > Indexes: > "correlation_pkey" PRIMARY KEY, btree (from_id, to_id) > "correlation_from_id_idx" btree (from_id) > "correlation_to_id_idx" btree (to_id) > "correlation_val_idx" btree (val) > Has OIDs: yes > > The table describes a pairwise correlation matrix between about 7700 > vectors (so the table has n^2= 60652944 rows, to be exact). I am > trying to choose the top 100 correlated vectors with a seed vector; > this is easily: > > select to_id from correlation where from_id=623 order by val desc > limit 100; > > Then, I want to take those 100 values and find all from_id,to_id > tuples where val>0.5 (to construct a graph where all "ids" are nodes > and are connected to each other when their correlation is >0.5). I > can do this like: > > explain analyze select > from_id,to_id,val > from exprsdb.correlation > where from_id in > (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2 > 7,28,29,30) > and to_id in > (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2 > 7,28,29,30) > and from_id>to_id > and val>0.5; > > However, this does not scale well AT ALL. The actual (very messy) > explain analyze output is below. The thing I notice is that the index > on to_id is not used. Also, the primary key index on (from_id, to_id > is not used, it seems. Finally, with only 30 values, this already > takes 2.6 seconds and I am proposing to do this on 100-200 values. > Any hints on how better to accomplish this set of tasks? > > Index Scan using correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx, > correlation_from_id_idx, correlation_from_id_idx on correlation > (cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967 > rows=19 loops=1) > Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR > (from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR > (from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR > (from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15) > OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id = > 19) OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id > = 23) OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR > (from_id = 27) OR (from_id = 28) OR (from_id = 29) OR (from_id = 30)) > Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4) > OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id > = 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13) > OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR > (to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id > = 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26) > OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND > (from_id > to_id) AND (val > 0.5)) > Total runtime: 2604.383 ms > > Thanks, > Sean > > > ---------------------------(end of > broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to > majordomo@postgresql.org