Re: Self-referencing table question - Mailing list pgsql-sql
From | Edmund Bacon |
---|---|
Subject | Re: Self-referencing table question |
Date | |
Msg-id | 424302D1.7000602@onesystem.com Whole thread Raw |
In response to | Re: Self-referencing table question (Sean Davis <sdavis2@mail.nih.gov>) |
Responses |
Re: Self-referencing table question
|
List | pgsql-sql |
Sean Davis wrote: > Thanks. I thought about that a bit and it seems like it is highly > likely to be expensive for a single query (though I should probably > try it at some point). If I do find myself reformatting results after > response to user input (i.e., reusing the query), though, then your > solution is likely to be very useful. > You might think so, however, consider: $ cat temptable.sql SELECT to_id INTO TEMP TABLE tids FROM correlation WHERE from_id = 1234 ORDER BY val DESC limit100; SELECT t1.to_id AS from_id, t2.to_id INTO TEMP TABLE from_to FROM tids t1, tids t2 WHERE t1.to_id > t2.to_id; SELECT c.from_id, c.to_id, c.val FROM from_to JOIN correlation c USING(from_id, to_id) WHERE val > 0.5 order by from_id,to_id; $ cat subselect.sql select from_id, to_id, val from correlation where from_id in (select to_id from correlation where from_id = 1234 order by val desc limit 100) and to_id in (select to_id from correlation where from_id = 1234 order by val desc limit 100) and from_id > to_id and val> 0.5 order by from_id, to_id; $ psql -q -c "select count(1) from correlation" test count ----------64000000 (1 row) $ time psql -q -f subselect.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m3.093s user 0m0.000s sys 0m0.010s $ time psql -q -f temptable.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m0.238s user 0m0.000s sys 0m0.010s $ time psql -q -f subselect.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m1.945s user 0m0.010s sys 0m0.010s $ time psql -q -f subselect.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m1.949s user 0m0.010s sys 0m0.010s $ time psql -q -f subselect.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m1.953s user 0m0.010s sys 0m0.000s $ time psql -q -f temptable.sql test | md5sum - f289b259ce8a2a4a45f9e0eca6e31957 - real 0m0.237s user 0m0.020s sys 0m0.000s $ Note that the subselect version takes about 10 times as long as the temptable version, and does not seem to be dependent on what data might be cached. I added the sort so that the results would be in the same order - you can see that the two query sets are producing the same output (at least md5sum thinks they are the same). One thing to be aware of is the size of your returned data set - If it's fairly large, then the transfer time from your web-server to the pgsql box might overwhelm any "small" optimization in query time. > Sean > > On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote: > >> Sometimes using a temp table is a better idea: >> >> e.g. >> >> -- start by creating a temp table 'tids' that hold the to_ids that >> -- we are interested in. >> SELECT to_id >> INTO TEMP TABLE tids >> FROM correlation >> WHERE from_id = 1234 >> ORDER BY val DESC limit 100; >> >> -- The following temp table makes use of the primary key on >> -- the correlation table, and the stated goal from the original >> -- question that: >> -- from_id > to_id >> -- and from_id in (tids.to_id) >> -- and to_id in (tids.to_id) >> >> SELECT t1.to_id AS from_id, t2.to_id >> INTO TEMP TABLE from_to >> FROM tids t1, tids t2 >> WHERE t1.to_id > t2.to_id; >> >> -- Now we can use the from_to table as an index into the correlation >> -- table. >> >> SELECT c.from_id, c.to_id, c.val >> FROM from_to >> JOIN correlation c USING(from_id, to_id) >> WHERE val > 0.5; >> >> >> The explain analyze for the final select works out to: >> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual >> time=0.171..150.095 rows=2427 loops=1) >> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8) >> (actual time=0.006..7.660 rows=4950 loops=1) >> -> Index Scan using correlation_pkey on correlation c >> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0 >> loops=4950) >> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id >> = c.to_id)) >> Filter: (val > 0.5::double precision) >> Total runtime: 152.261 ms >> >> >> Richard Huxton wrote: >> >>> Sean Davis wrote: >>> >>>> I answer my own question, if only for my own records. The >>>> following query is about 5-6 times faster than the original. Of >>>> course, if anyone else has other ideas, I'd be happy to hear them. >>>> >>>> Sean >>>> >>>> explain analyze select from_id,to_id,val from exprsdb.correlation >>>> where from_id in (select to_id from exprsdb.correlation where >>>> from_id=2424 order by val desc limit 100) and to_id in (select >>>> to_id from exprsdb.correlation where from_id=2424 order by val >>>> desc limit 100) and val>0.6 and to_id<from_id; >>> >>> >>> >>> Might not be any faster, but you can do this as a self-join with >>> subquery: >>> >>> SELECT c1.from_id, c1.to_id, c1.val >>> FROM >>> correlation c1, >>> ( >>> SELECT to_id FROM correlation WHERE from_id=2424 >>> ORDER BY val DESC LIMIT 100 >>> ) AS c2 >>> ( >>> SELECT to_id FROM correlation WHERE from_id=2424 >>> ORDER BY val DESC LIMIT 100 >>> ) AS c3 >>> WHERE >>> c1.from_id = c2.to_id >>> AND c1.to_id = c3.to_id >>> AND c1.val > 0.5 >>> AND c1.to_id < from_id >>> ; >>> >>> I think PG should be smart enough nowadays to figure out these two >>> queries are basically the same. >> >> >> >> -- >> Edmund Bacon <ebacon@onesystem.com> >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 3: if posting/reading through Usenet, please send an appropriate >> subscribe-nomail command to majordomo@postgresql.org so that your >> message can get through to the mailing list cleanly > > -- Edmund Bacon <ebacon@onesystem.com>