Thread: Self-referencing table question
I have a table that looks like:

  Column  |     Type     | Modifiers | Description
 ---------+--------------+-----------+-------------
  from_id | integer      | not null  |
  to_id   | integer      | not null  |
  val     | numeric(4,3) |           |
 Indexes:
     "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
     "correlation_from_id_idx" btree (from_id)
     "correlation_to_id_idx" btree (to_id)
     "correlation_val_idx" btree (val)
 Has OIDs: yes

The table describes a pairwise correlation matrix between about 7700 vectors (n = 7788, so the table has n^2 = 60652944 rows, to be exact). I am trying to choose the top 100 vectors correlated with a seed vector; this is easy:

select to_id from correlation where from_id=623 order by val desc limit 100;

Then, I want to take those 100 values and find all from_id,to_id tuples where val>0.5 (to construct a graph in which all "ids" are nodes that are connected to each other when their correlation is >0.5). I can do this like:

explain analyze select from_id,to_id,val
from exprsdb.correlation
where from_id in (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and to_id in (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and from_id>to_id
and val>0.5;

However, this does not scale well AT ALL. The actual (very messy) explain analyze output is below. The thing I notice is that the index on to_id is not used. Also, the primary key index on (from_id, to_id) does not seem to be used. Finally, with only 30 values this already takes 2.6 seconds, and I am proposing to do this on 100-200 values. Any hints on how better to accomplish this set of tasks?

Index Scan using correlation_from_id_idx, correlation_from_id_idx, [the same index name repeated 30 times in all] on correlation  (cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967 rows=19 loops=1)
  Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR ... OR (from_id = 29) OR (from_id = 30))
  Filter: (((to_id = 1) OR (to_id = 2) OR ... OR (to_id = 30)) AND (from_id > to_id) AND (val > 0.5))
Total runtime: 2604.383 ms
Thanks, Sean
I answer my own question, if only for my own records. The following query is about 5-6 times faster than the original. Of course, if anyone else has other ideas, I'd be happy to hear them.

Sean

explain analyze select from_id,to_id,val
from exprsdb.correlation
where from_id in (select to_id from exprsdb.correlation where from_id=2424 order by val desc limit 100)
and to_id in (select to_id from exprsdb.correlation where from_id=2424 order by val desc limit 100)
and val>0.6 and to_id<from_id;

                                    QUERY PLAN
------------------------------------------------------------------------------------
 Hash IN Join  (cost=4709.94..74758.01 rows=555 width=17) (actual time=110.291..1671.767 rows=973 loops=1)
   Hash Cond: ("outer".to_id = "inner".to_id)
   ->  Nested Loop  (cost=2354.97..72181.72 rows=43154 width=17) (actual time=54.036..1612.746 rows=1482 loops=1)
         ->  HashAggregate  (cost=2354.97..2354.97 rows=100 width=4) (actual time=53.656..54.062 rows=100 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72 rows=100 width=4) (actual time=53.473..53.595 rows=100 loops=1)
                     ->  Limit  (cost=2353.47..2353.72 rows=100 width=13) (actual time=53.469..53.507 rows=100 loops=1)
                           ->  Sort  (cost=2353.47..2415.03 rows=24624 width=13) (actual time=53.467..53.481 rows=100 loops=1)
                                 Sort Key: val
                                 ->  Index Scan using correlation_from_id_idx on correlation  (cost=0.00..557.42 rows=24624 width=13) (actual time=0.199..17.717 rows=7788 loops=1)
                                       Index Cond: (from_id = 2424)
         ->  Index Scan using correlation_from_id_idx on correlation  (cost=0.00..692.87 rows=432 width=17) (actual time=2.765..15.560 rows=15 loops=100)
               Index Cond: (correlation.from_id = "outer".to_id)
               Filter: ((val > 0.6) AND (to_id < from_id))
   ->  Hash  (cost=2354.72..2354.72 rows=100 width=4) (actual time=56.239..56.239 rows=0 loops=1)
         ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72 rows=100 width=4) (actual time=56.004..56.121 rows=100 loops=1)
               ->  Limit  (cost=2353.47..2353.72 rows=100 width=13) (actual time=56.001..56.038 rows=100 loops=1)
                     ->  Sort  (cost=2353.47..2415.03 rows=24624 width=13) (actual time=55.999..56.012 rows=100 loops=1)
                           Sort Key: val
                           ->  Index Scan using correlation_from_id_idx on correlation  (cost=0.00..557.42 rows=24624 width=13) (actual time=0.517..20.307 rows=7788 loops=1)
                                 Index Cond: (from_id = 2424)
 Total runtime: 1676.966 ms
Sean Davis wrote:
> explain analyze select from_id,to_id,val from exprsdb.correlation where
> from_id in (select to_id from exprsdb.correlation where from_id=2424
> order by val desc limit 100) and to_id in (select to_id from
> exprsdb.correlation where from_id=2424 order by val desc limit 100) and
> val>0.6 and to_id<from_id;

Might not be any faster, but you can do this as a self-join with subquery:

SELECT c1.from_id, c1.to_id, c1.val
FROM
  correlation c1,
  (
    SELECT to_id FROM correlation WHERE from_id=2424
    ORDER BY val DESC LIMIT 100
  ) AS c2,
  (
    SELECT to_id FROM correlation WHERE from_id=2424
    ORDER BY val DESC LIMIT 100
  ) AS c3
WHERE
  c1.from_id = c2.to_id
  AND c1.to_id = c3.to_id
  AND c1.val > 0.5
  AND c1.to_id < c1.from_id
;

I think PG should be smart enough nowadays to figure out these two queries are basically the same.

-- 
Richard Huxton
Archonet Ltd
Richard,

In another email, I posted what I did (which was what you suggest), along with explain analyze output. It looks like the subquery version is 4-6 times faster, which gets into the acceptable range for my little web application. Thanks for the help.

Sean
Oops, I DID do a different query in my previous email than what you suggest in your email. Testing both against each other, the two queries -- using subselects in 'in' and doing a self-join via subquery -- have basically the same performance.

Thanks again for the help.
Sean
Sometimes using a temp table is a better idea:

e.g.

-- start by creating a temp table 'tids' that holds the to_ids that
-- we are interested in.
SELECT to_id
INTO TEMP TABLE tids
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC LIMIT 100;

-- The following temp table makes use of the primary key on
-- the correlation table, and the stated goal from the original
-- question that:
--     from_id > to_id
--     and from_id in (tids.to_id)
--     and to_id in (tids.to_id)
SELECT t1.to_id AS from_id, t2.to_id
INTO TEMP TABLE from_to
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

-- Now we can use the from_to table as an index into the correlation
-- table.
SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING (from_id, to_id)
WHERE val > 0.5;

The explain analyze for the final select works out to:

 Nested Loop  (cost=0.00..50692.00 rows=8488 width=16) (actual time=0.171..150.095 rows=2427 loops=1)
   ->  Seq Scan on from_to  (cost=0.00..79.38 rows=5238 width=8) (actual time=0.006..7.660 rows=4950 loops=1)
   ->  Index Scan using correlation_pkey on correlation c  (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0 loops=4950)
         Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id = c.to_id))
         Filter: (val > 0.5::double precision)
 Total runtime: 152.261 ms

-- 
Edmund Bacon <ebacon@onesystem.com>
Thanks. I thought about that a bit, and it seems highly likely to be expensive for a single query (though I should probably try it at some point). If I do find myself reformatting results in response to user input (i.e., reusing the query), though, then your solution is likely to be very useful.

Sean
Sean Davis wrote:
> Thanks. I thought about that a bit, and it seems highly likely to be
> expensive for a single query (though I should probably try it at some
> point).

You might think so, however, consider:

$ cat temptable.sql
SELECT to_id
INTO TEMP TABLE tids
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC LIMIT 100;

SELECT t1.to_id AS from_id, t2.to_id
INTO TEMP TABLE from_to
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING (from_id, to_id)
WHERE val > 0.5
ORDER BY from_id, to_id;

$ cat subselect.sql
select from_id, to_id, val
from correlation
where from_id in (select to_id from correlation
                  where from_id = 1234 order by val desc limit 100)
and to_id in (select to_id from correlation
              where from_id = 1234 order by val desc limit 100)
and from_id > to_id
and val > 0.5
order by from_id, to_id;

$ psql -q -c "select count(1) from correlation" test
  count
----------
 64000000
(1 row)

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m3.093s
user    0m0.000s
sys     0m0.010s

$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.238s
user    0m0.000s
sys     0m0.010s

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.945s
user    0m0.010s
sys     0m0.010s

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.949s
user    0m0.010s
sys     0m0.010s

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.953s
user    0m0.010s
sys     0m0.000s

$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.237s
user    0m0.020s
sys     0m0.000s
$

Note that the subselect version takes about 10 times as long as the temptable version, and does not seem to be dependent on what data might be cached. I added the sort so that the results would be in the same order - you can see that the two query sets are producing the same output (at least md5sum thinks they are the same).

One thing to be aware of is the size of your returned data set - if it's fairly large, then the transfer time from your web server to the pgsql box might overwhelm any "small" optimization in query time.

-- 
Edmund Bacon <ebacon@onesystem.com>
On Mar 24, 2005, at 1:11 PM, Edmund Bacon wrote:
> Note that the subselect version takes about 10 times as long as the
> temptable version, and does not seem to be dependent on what data
> might be cached.

Nice. Thanks for doing my work for me! I guess I will have to think about it more seriously.

It could be slightly complicated because my code is running under mod_perl, so connections are cached. As I understand it, the temp table will stick around, so I will have to be careful to explicitly drop it if I don't want it to persist? Also, each table will need a unique name (I have a session_id I can use), as it is possible that multiple temp tables will exist and be visible to each other?

Thanks again,
Sean
Sean Davis wrote:
> Nice. Thanks for doing my work for me!

Yeah, well, put it down to a certain amount of curiosity and a slack period at work ...

> It could be slightly complicated because my code is running under
> mod_perl, so connections are cached. As I understand it, the temp
> table will stick around, so I will have to be careful to explicitly
> drop it if I don't want it to persist?

I'm guessing so. However, you could put everything in a transaction and use CREATE TEMP TABLE ... ON COMMIT DROP, with INSERT INTO rather than SELECT INTO - see the sketch below. The speed should be about equivalent - but you'd have to test to make sure.

> Also each table will need a unique name (I have a session_id I can
> use), as it is possible that multiple temp tables will exist and be
> visible to each other?

Each session (connection in your case?) has its own temporary table space, so you shouldn't have to worry about that.
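Something like this is what I mean - a rough, untested sketch, reusing the table and seed value from my earlier example, so nothing is left behind on a pooled connection:

BEGIN;

-- both temp tables vanish automatically when the transaction ends
CREATE TEMP TABLE tids (to_id integer) ON COMMIT DROP;

-- top-100 to_ids for the seed vector, as before
INSERT INTO tids
SELECT to_id
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC LIMIT 100;

CREATE TEMP TABLE from_to (from_id integer, to_id integer) ON COMMIT DROP;

-- all candidate pairs, honouring from_id > to_id
INSERT INTO from_to
SELECT t1.to_id, t2.to_id
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING (from_id, to_id)
WHERE c.val > 0.5;

COMMIT;

-- 
Edmund Bacon <ebacon@onesystem.com>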
On Mar 24, 2005, at 2:37 PM, Edmund Bacon wrote:
> Each session (connection in your case?) has its own temporary table
> space, so you shouldn't have to worry about that.

Sessions don't map 1-to-1 with connections in the web environment. It is possible that a connection to the database would be simultaneously serving multiple users (sessions), if I understand Apache::DBI correctly. In any case, this is probably a viable solution.

Sean