Thread: any hope for my big query?
Hey guys, I've got a query that is inherently expensive, because it has to do some joins against some large tables. But it's currently *very* expensive (at least for a web app), and I've been struggling in vain all day to knock the cost down. Annoyingly, the least costly version I've come up with remains my first attempt, and is the most straightforward:

    explain select
        distinct public.album.id
    from
        public.album, public.albumjoin, public.track, umdb.node
    where
        node.dir = 2811
        and albumjoin.album = public.album.id
        and public.albumjoin.track = public.track.id
        and levenshtein(substring(public.track.name for 75),
                        substring(node.file for 75)) <= 10
        and public.album.id in
            (select album from albumjoin group by album
             having count(*) between 15 and 25)
    group by public.album.id
    having count(*) >= 5;

    Unique  (cost=991430.53..1013711.74 rows=425772 width=4)
      ->  GroupAggregate  (cost=991430.53..1012647.31 rows=425772 width=4)
            Filter: (count(*) >= 5)
            ->  Sort  (cost=991430.53..996373.93 rows=1977360 width=4)
                  Sort Key: album.id
                  ->  Nested Loop  (cost=513549.06..737866.68 rows=1977360 width=4)
                        Join Filter: (levenshtein("substring"(("inner".name)::text, 1, 75), "substring"("outer".file, 1, 75)) <= 10)
                        ->  Index Scan using node_dir on node  (cost=0.00..3.22 rows=16 width=40)
                              Index Cond: (dir = 2811)
                        ->  Materialize  (cost=513549.06..520153.61 rows=370755 width=25)
                              ->  Hash Join  (cost=271464.72..510281.31 rows=370755 width=25)
                                    Hash Cond: ("outer".id = "inner".track)
                                    ->  Seq Scan on track  (cost=0.00..127872.69 rows=5111469 width=25)
                                    ->  Hash  (cost=268726.83..268726.83 rows=370755 width=8)
                                          ->  Hash Join  (cost=150840.51..268726.83 rows=370755 width=8)
                                                Hash Cond: ("outer".album = "inner".id)
                                                ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=8)
                                                ->  Hash  (cost=150763.24..150763.24 rows=30908 width=8)
                                                      ->  Hash Join  (cost=127951.57..150763.24 rows=30908 width=8)
                                                            Hash Cond: ("outer".id = "inner".album)
                                                            ->  Seq Scan on album  (cost=0.00..12922.72 rows=425772 width=4)
                                                            ->  Hash  (cost=127874.30..127874.30 rows=30908 width=4)
                                                                  ->  HashAggregate  (cost=126947.06..127565.22 rows=30908 width=4)
                                                                        Filter: ((count(*) >= 15) AND (count(*) <= 25))
                                                                        ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=4)

I've tried adding a length(public.track.name) index and filtering public.track to those rows where length(name) is within a few characters of node.file, but that actually makes the plan more expensive.

Is there any hope to make things much cheaper? Unfortunately, I can't filter out anything from the album or albumjoin tables.
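[Editor's note: the length-based prefilter Ben tried is sound in principle, because edit distance is bounded below by the difference in string lengths, so pairs whose lengths differ by more than the threshold can never match. A minimal Python sketch of that bound; the function names are illustrative, not from the thread.]

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def maybe_close(a: str, b: str, threshold: int = 10) -> bool:
    """Cheap prefilter: levenshtein(a, b) >= abs(len(a) - len(b)),
    so a length difference above the threshold rules the pair out
    before paying for the O(len(a) * len(b)) computation."""
    if abs(len(a) - len(b)) > threshold:
        return False
    return levenshtein(a, b) <= threshold
```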
Instead of the IN, see if this is better:

    AND (SELECT count(*) FROM albumjoin aj WHERE aj.album = album.id)
        BETWEEN 15 AND 25

From a design standpoint, it probably makes sense to have a track_count field in the album table that is kept up-to-date by triggers on albumjoin.

And some nits. :) I find it's a lot easier to call all id fields by the object name, i.e. album_id, track_id, etc. That lets you do things like:

    FROM album a
    JOIN albumjoin aj USING (album_id)
    JOIN track t USING (track_id)

Unless you've got a lot of tables in a query, table aliases (t, aj, and a above) are your friend. :) Face it... camelCase just doesn't work worth anything in databases... underscores_make_it_much_easier_to_read_things. :)

On Thu, Sep 28, 2006 at 03:18:56PM -0700, Ben wrote:
> Hey guys, I've got a query that is inherently expensive, because it has to
> do some joins against some large tables. But it's currently *very*
> expensive (at least for a web app), and I've been struggling in vain all
> day to knock the cost down.
> [query and EXPLAIN plan snipped]

-- 
Jim Nasby            jim@nasby.net
EnterpriseDB         http://enterprisedb.com
512.569.9461 (cell)
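[Editor's note: the trigger-maintained track_count idea can be sketched on a toy schema. This is a hedged illustration in Python against SQLite; table and column names are stand-ins for the poster's real schema, and a Postgres version would write the triggers in PL/pgSQL instead.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album (id INTEGER PRIMARY KEY,
                    track_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE albumjoin (album INTEGER REFERENCES album(id), track INTEGER);

-- Keep the denormalized count in sync on insert and delete.
CREATE TRIGGER albumjoin_ins AFTER INSERT ON albumjoin BEGIN
    UPDATE album SET track_count = track_count + 1 WHERE id = NEW.album;
END;
CREATE TRIGGER albumjoin_del AFTER DELETE ON albumjoin BEGIN
    UPDATE album SET track_count = track_count - 1 WHERE id = OLD.album;
END;
CREATE INDEX album_track_count ON album (track_count);
""")

conn.execute("INSERT INTO album (id) VALUES (1), (2)")
conn.executemany("INSERT INTO albumjoin VALUES (?, ?)",
                 [(1, t) for t in range(20)] + [(2, t) for t in range(3)])

# The expensive "GROUP BY album HAVING count(*) BETWEEN 15 AND 25"
# subquery becomes a plain indexed range predicate:
rows = conn.execute(
    "SELECT id FROM album WHERE track_count BETWEEN 15 AND 25").fetchall()
print(rows)
```

The cost of the aggregation is paid incrementally at write time instead of on every query.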
You have 2 seqscans on the albumjoin table. You first make a simple join:

    ... and albumjoin.album = public.album.id ...

that generates the first

    ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=8)

and then you group values from the same table, counting them, with

    ... (select album from albumjoin group by album
         having count(*) between 15 and 25) ...

that generates the second

    ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=4)

Given the complexity of the query, maybe you could create an intermediate table with only one seqscan and use that one in the final query, but I don't know if that's possible with the db structure you have.

Can I ask what exactly the albumjoin table is? Is it an n-n relation?

> [query and EXPLAIN plan snipped]
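[Editor's note: the intermediate-table suggestion can be sketched like this on a toy SQLite schema — aggregate albumjoin once into a temp table, then join against it instead of re-aggregating inside the IN subquery. All names and data are illustrative.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albumjoin (album INTEGER, track INTEGER)")
conn.executemany("INSERT INTO albumjoin VALUES (?, ?)",
                 [(1, t) for t in range(20)] + [(2, t) for t in range(3)])

# One aggregation pass over albumjoin, materialized for reuse.
conn.executescript("""
CREATE TEMP TABLE album_sizes AS
    SELECT album, count(*) AS n FROM albumjoin GROUP BY album;
CREATE INDEX album_sizes_album ON album_sizes (album);
""")

# The rest of the query joins the temp table rather than rescanning
# albumjoin a second time for the 15-25 filter.
rows = conn.execute("""
    SELECT aj.album, s.n
    FROM albumjoin aj JOIN album_sizes s ON s.album = aj.album
    WHERE s.n BETWEEN 15 AND 25
    GROUP BY aj.album
""").fetchall()
```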
On Thu, 2006-09-28 at 15:18 -0700, Ben wrote:
>     distinct public.album.id
> group by public.album.id

You can remove the distinct clause for starters...

-- 
Simon Riggs
EnterpriseDB    http://www.enterprisedb.com
There's no join criteria for umdb.node... is that really what you want?

On Thu, Sep 28, 2006 at 03:18:56PM -0700, Ben wrote:
> Hey guys, I've got a query that is inherently expensive, because it has to
> do some joins against some large tables.
> [query and EXPLAIN plan snipped]

-- 
Jim Nasby            jim@nasby.net
EnterpriseDB         http://enterprisedb.com
512.569.9461 (cell)
On Thursday 28 September 2006 17:18, Ben wrote:

> explain select
>     distinct public.album.id
> from
>     public.album, public.albumjoin, public.track, umdb.node
> where
>     node.dir = 2811
>     and albumjoin.album = public.album.id
>     and public.albumjoin.track = public.track.id
>     and levenshtein(substring(public.track.name for 75),
>                     substring(node.file for 75)) <= 10
>     and public.album.id in
>         (select album from albumjoin group by album
>          having count(*) between 15 and 25)
> group by public.album.id
> having count(*) >= 5;

If I'm reading this right, you want all the albums with 15-25 entries in albumjoin having 5 or more tracks that are (soundex type) similar to other nodes. Knowing that, you can also try something like this:

    select aj.album
    from (select album from albumjoin
          group by album
          having count(1) between 15 and 25) a
    join public.albumjoin aj on (aj.album = a.album)
    join public.track t on (aj.track = t.id)
    join umdb.node n on (levenshtein(substring(t.name for 75),
                                     substring(n.file for 75)) <= 10)
    where n.dir = 2811
    group by aj.album
    having count(1) >= 5;

This removes the album table from the query entirely, since you were only interested in albums with 15-25 albumjoins and weren't actually using any album data other than the ID, which albumjoin supplies. Your subselect is now an integral part of the whole query, being treated like a temp table that only supplies album IDs with 15-25 albumjoins. From there, add on the track information, and use that to restrict the matching nodes.

Your explain should be better with the above. Just remember that with the levenshtein in there, you're forcing a sequential scan on the node table. Depending on how big that table is, you may not be able to truly optimize this.

-- 
Shaun Thomas
Database Administrator
Leapfrog Online
807 Greenwood Street, Evanston, IL 60201
Tel. 847-440-8253  Fax. 847-570-5750
www.leapfrogonline.com
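[Editor's note: this restructuring — a derived table of qualifying album IDs driving the rest of the joins — can be exercised end to end on a toy SQLite database. A hedged sketch follows; SQLite has no levenshtein(), so a Python implementation is registered, and all table names and data are illustrative stand-ins for the poster's schema.]

```python
import sqlite3

def levenshtein(a, b):
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)
conn.executescript("""
CREATE TABLE track (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE albumjoin (album INTEGER, track INTEGER REFERENCES track(id));
CREATE TABLE node (dir INTEGER, file TEXT);
""")

# Album 1: 16 tracks (passes the 15-25 filter); album 2: only 3 tracks.
conn.executemany("INSERT INTO track VALUES (?, ?)",
                 [(i, "album one track number %02d" % i) for i in range(1, 17)]
                 + [(100 + i, "other album short song %d" % i) for i in range(1, 4)])
conn.executemany("INSERT INTO albumjoin VALUES (?, ?)",
                 [(1, i) for i in range(1, 17)] + [(2, 100 + i) for i in range(1, 4)])
# Five files in the target dir that resemble album 1's track names.
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(2811, "album one track number %02d" % i) for i in range(1, 6)]
                 + [(9999, "album one track number 01")])

rows = conn.execute("""
    SELECT aj.album
    FROM (SELECT album FROM albumjoin
          GROUP BY album HAVING count(1) BETWEEN 15 AND 25) a
    JOIN albumjoin aj ON aj.album = a.album
    JOIN track t ON aj.track = t.id
    JOIN node n ON levenshtein(substr(t.name, 1, 75),
                               substr(n.file, 1, 75)) <= 10
    WHERE n.dir = 2811
    GROUP BY aj.album
    HAVING count(1) >= 5
""").fetchall()
print(rows)
```

Only album 1 survives: album 2 is dropped by the 15-25 derived table before its tracks are ever compared.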
On Fri, 29 Sep 2006, Jim C. Nasby wrote:

> There's no join criteria for umdb.node... is that really what you want?

Unfortunately, yes, it is. I've taken in all of everybody's helpful advice (thanks!) and reworked things a little, and now I'm left with this expensive nugget:

    select aj.album
    from (select seconds-1 as a, seconds+1 as b
          from node where node.dir = 6223) n
    join public.track t on (t.length between n.a*1000 and n.b*1000)
    join public.albumjoin aj on (aj.track = t.id)
    join (select id from public.albummeta am
          where tracks between 3 and 7) lam on (lam.id = aj.album)
    group by aj.album
    having count(*) >= 4;

...which comes out to be:

    HashAggregate  (cost=904444.69..904909.99 rows=31020 width=4)
      Filter: (count(*) >= 4)
      ->  Nested Loop  (cost=428434.81..897905.17 rows=1307904 width=4)
            Join Filter: (("inner".length >= (("outer".seconds - 1) * 1000)) AND ("inner".length <= (("outer".seconds + 1) * 1000)))
            ->  Index Scan using node_dir on node  (cost=0.00..3.46 rows=17 width=4)
                  Index Cond: (dir = 6223)
            ->  Materialize  (cost=428434.81..438740.01 rows=692420 width=8)
                  ->  Hash Join  (cost=210370.58..424361.39 rows=692420 width=8)
                        Hash Cond: ("outer".id = "inner".track)
                        ->  Seq Scan on track t  (cost=0.00..128028.41 rows=5123841 width=8)
                        ->  Hash  (cost=205258.53..205258.53 rows=692420 width=8)
                              ->  Hash Join  (cost=6939.10..205258.53 rows=692420 width=8)
                                    Hash Cond: ("outer".album = "inner".id)
                                    ->  Seq Scan on albumjoin aj  (cost=0.00..88918.41 rows=5123841 width=8)
                                    ->  Hash  (cost=6794.51..6794.51 rows=57834 width=4)
                                          ->  Bitmap Heap Scan on albummeta am  (cost=557.00..6794.51 rows=57834 width=4)
                                                Recheck Cond: ((tracks >= 3) AND (tracks <= 7))
                                                ->  Bitmap Index Scan on albummeta_tracks_index  (cost=0.00..557.00 rows=57834 width=0)
                                                      Index Cond: ((tracks >= 3) AND (tracks <= 7))
    (19 rows)

I'm surprised (though probably just because I'm ignorant) that there is so much sequential scanning in there. For instance, because n is going to have at most a couple dozen rows, it seems that instead of scanning all of public.track, the planner should be able to convert my "t.length between n.a and n.b" clause into a handful of between statements or'd together. Or at least, it would be nice if it could do that. :)
On Oct 4, 2006, at 4:40 PM, Ben wrote:

> I'm surprised (though probably just because I'm ignorant) that it
> would have so much sequential scanning in there. For instance,
> because n is going to have at most a couple dozen rows, it seems
> that instead of scanning all of public.track, it should be able to
> convert my "t.length between a and b" clause to some between
> statements or'd together. Or at least, it would be nice if the
> planner could do that.

That would require the planner having that knowledge at plan time, which it can't without actually querying the database.

One thing that might work wonders is performing the n query ahead of time and then sticking the results in an array... that might speed things up. Worst case, you could run the n query, and then run the rest of the query once for each row of n you get back.

Better yet... send us a patch that allows the planner to look into what a subselect will return. ;)

-- 
Jim Nasby            jim@nasby.net
EnterpriseDB         http://enterprisedb.com
512.569.9461 (cell)
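[Editor's note: the "run the n query ahead of time" idea amounts to fetching the handful of node rows in the application and inlining their ranges as OR'd BETWEEN clauses, so the planner sees constants instead of a join. A hedged sketch on a toy SQLite database follows; names and data are illustrative, and it assumes the n query returns at least one row.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (dir INTEGER, seconds INTEGER);
CREATE TABLE track (id INTEGER PRIMARY KEY, length INTEGER);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)", [(6223, 180), (6223, 240)])
conn.executemany("INSERT INTO track VALUES (?, ?)",
                 [(1, 179500), (2, 200000), (3, 240900)])

# Step 1: run the tiny n query first, in the application.
ranges = [(s - 1, s + 1) for (s,) in
          conn.execute("SELECT seconds FROM node WHERE dir = 6223")]

# Step 2: build one BETWEEN clause per row of n and bind the bounds,
# so the main query carries constants rather than a join against node.
clauses = " OR ".join("t.length BETWEEN ? * 1000 AND ? * 1000" for _ in ranges)
params = [bound for pair in ranges for bound in pair]
rows = conn.execute(
    "SELECT t.id FROM track t WHERE %s ORDER BY t.id" % clauses,
    params).fetchall()
print(rows)
```

With constant bounds, an index on track.length becomes usable for each range instead of forcing a scan of the whole table.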