Thread: any hope for my big query?

From: Ben
Hey guys, I've got a query that is inherently expensive, because it has to
do some joins against some large tables. But it's currently *very*
expensive (at least for a web app), and I've been struggling in vain all
day to knock the cost down. Annoyingly, the least costly version I've come
up with remains my first attempt, and is the most straightforward:

explain select
     distinct public.album.id
from
     public.album,public.albumjoin,public.track,umdb.node
where
     node.dir=2811
     and albumjoin.album = public.album.id
     and public.albumjoin.track = public.track.id
     and levenshtein(substring(public.track.name for 75),
         substring(node.file for 75)) <= 10
     and public.album.id in
         (select album from albumjoin group by album having count(*) between 15 and 25)
group by public.album.id
having count(*) >= 5;


  Unique  (cost=991430.53..1013711.74 rows=425772 width=4)
    ->  GroupAggregate  (cost=991430.53..1012647.31 rows=425772 width=4)
          Filter: (count(*) >= 5)
          ->  Sort  (cost=991430.53..996373.93 rows=1977360 width=4)
                Sort Key: album.id
                ->  Nested Loop  (cost=513549.06..737866.68 rows=1977360 width=4)
                      Join Filter: (levenshtein("substring"(("inner".name)::text, 1, 75), "substring"("outer".file, 1, 75)) <= 10)
                      ->  Index Scan using node_dir on node  (cost=0.00..3.22 rows=16 width=40)
                            Index Cond: (dir = 2811)
                      ->  Materialize  (cost=513549.06..520153.61 rows=370755 width=25)
                            ->  Hash Join  (cost=271464.72..510281.31 rows=370755 width=25)
                                  Hash Cond: ("outer".id = "inner".track)
                                  ->  Seq Scan on track  (cost=0.00..127872.69 rows=5111469 width=25)
                                  ->  Hash  (cost=268726.83..268726.83 rows=370755 width=8)
                                        ->  Hash Join  (cost=150840.51..268726.83 rows=370755 width=8)
                                              Hash Cond: ("outer".album = "inner".id)
                                              ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=8)
                                              ->  Hash  (cost=150763.24..150763.24 rows=30908 width=8)
                                                    ->  Hash Join  (cost=127951.57..150763.24 rows=30908 width=8)
                                                          Hash Cond: ("outer".id = "inner".album)
                                                          ->  Seq Scan on album  (cost=0.00..12922.72 rows=425772 width=4)
                                                          ->  Hash  (cost=127874.30..127874.30 rows=30908 width=4)
                                                                ->  HashAggregate  (cost=126947.06..127565.22 rows=30908 width=4)
                                                                      Filter: ((count(*) >= 15) AND (count(*) <= 25))
                                                                      ->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=4)


I've tried adding a length(public.track.name) index and filtering
public.track to those rows where length(name) is within a few characters
of node.file, but that actually makes the plan more expensive.

Is there any hope to make things much cheaper? Unfortunately, I can't
filter out anything from the album or albumjoin tables.

Re: any hope for my big query?

From: "Jim C. Nasby"
Instead of the IN, see if this is better:

AND (SELECT count(*) FROM albumjoin aj WHERE aj.album = album.id)
    BETWEEN 15 AND 25

From a design standpoint, it probably makes sense to have a track_count
field in the album table that is kept up-to-date by triggers on
albumjoin.
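
A minimal sketch of that trigger-maintained track_count idea, using SQLite in Python so it runs self-contained (in Postgres you'd write a PL/pgSQL trigger function instead; the table and column names are stand-ins for the actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE album (id INTEGER PRIMARY KEY,
                    track_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE albumjoin (album INTEGER NOT NULL, track INTEGER NOT NULL);

-- Keep album.track_count in sync with the rows in albumjoin.
CREATE TRIGGER albumjoin_ins AFTER INSERT ON albumjoin
BEGIN
    UPDATE album SET track_count = track_count + 1 WHERE id = NEW.album;
END;

CREATE TRIGGER albumjoin_del AFTER DELETE ON albumjoin
BEGIN
    UPDATE album SET track_count = track_count - 1 WHERE id = OLD.album;
END;
""")

con.execute("INSERT INTO album (id) VALUES (1)")
con.executemany("INSERT INTO albumjoin VALUES (1, ?)",
                [(t,) for t in range(17)])

count, = con.execute("SELECT track_count FROM album WHERE id = 1").fetchone()
print(count)  # 17
```

With that in place, the expensive IN-subselect collapses to a plain indexable predicate: `WHERE track_count BETWEEN 15 AND 25`.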

And some nits. :)

I find it's a lot easier to name all id fields after the object, i.e.
album_id, track_id, etc. That lets you do things like:

FROM album a
    JOIN albumjoin aj USING (album_id)
    JOIN track t USING (track_id)

Unless you've got a lot of tables in a query, table aliases (t, aj, and
a above) are your friend. :)

Face it... camelCase just doesn't work worth anything in databases...
underscoresmakeitmucheasiertoreadthings. :)

On Thu, Sep 28, 2006 at 03:18:56PM -0700, Ben wrote:
> Hey guys, I've got a query that is inherently expensive, because it has to
> do some joins against some large tables. But it's currently *very*
> expensive (at least for a web app), and I've been struggling in vain all
> day to knock the cost down. [...]
> [original query and EXPLAIN output snipped]

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: any hope for my big query?

From: Edoardo Ceccarelli
You have two seq scans on the albumjoin table. First you make a simple join:

...and albumjoin.album = public.album.id ...

which generates the first:

->  Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=8)

and then you group values from the same table, counting them, with

... (select album from albumjoin group by album having count(*) between 15 and 25) ...

which generates the second:

Seq Scan on albumjoin  (cost=0.00..88642.18 rows=5107318 width=4)

Given the complexity of the query, maybe you could create an
intermediate table with only one seq scan and use that in the final
query, but I don't know if that's possible with the db structure you
have.
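
As a rough illustration of the intermediate-table idea (SQLite stand-in with made-up data; in Postgres this would be `CREATE TEMP TABLE ... AS SELECT`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE albumjoin (album INTEGER, track INTEGER)")
# Invented data: albums 1-4 with 10, 20, 30, and 40 tracks respectively.
rows = [(a, t) for a in range(1, 5) for t in range(a * 10)]
con.executemany("INSERT INTO albumjoin VALUES (?, ?)", rows)

# One scan of albumjoin builds the intermediate table; the final query
# then joins against it instead of re-aggregating albumjoin a second time.
con.execute("""
CREATE TEMP TABLE mid_albums AS
SELECT album, COUNT(*) AS n
  FROM albumjoin
 GROUP BY album
HAVING COUNT(*) BETWEEN 15 AND 25
""")

picked = [r[0] for r in con.execute("SELECT album FROM mid_albums ORDER BY album")]
print(picked)  # [2] -- only album 2 has 15-25 rows
```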


Can I ask what exactly the albumjoin table is? Is it an n-to-n relation?

> [original query and EXPLAIN output snipped]

Re: any hope for my big query?

From: Simon Riggs
On Thu, 2006-09-28 at 15:18 -0700, Ben wrote:

>      distinct public.album.id

> group by public.album.id

You can remove the distinct clause for starters; the GROUP BY on
public.album.id already makes the output unique.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: any hope for my big query?

From: "Jim C. Nasby"
There's no join criteria for umdb.node... is that really what you want?

On Thu, Sep 28, 2006 at 03:18:56PM -0700, Ben wrote:
> [original query and EXPLAIN output snipped]

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: any hope for my big query?

From: Shaun Thomas
On Thursday 28 September 2006 17:18, Ben wrote:

> explain select
>      distinct public.album.id
> from
>      public.album,public.albumjoin,public.track,umdb.node
> where
>      node.dir=2811
>      and albumjoin.album = public.album.id
>      and public.albumjoin.track = public.track.id
>      and levenshtein(substring(public.track.name for 75),
>          substring(node.file for 75)) <= 10
>      and public.album.id in
>          (select album from albumjoin group by album having count(*) between 15 and 25)
> group by public.album.id
> having count(*) >= 5;

If I'm reading this right, you want all the albums with 15-25 entries in
albumjoin having 5 or more tracks whose names are within edit distance
10 of the files in that node directory.  Knowing that, you can also try
something like this:

select aj.album
  from (select album from albumjoin group by album
        having count(*) between 15 and 25) a
  join public.albumjoin aj on (aj.album = a.album)
  join public.track t on (aj.track = t.id)
  join umdb.node n on (levenshtein(substring(t.name for 75),
                                   substring(n.file for 75)) <= 10)
 where n.dir = 2811
 group by aj.album
 having count(*) >= 5;

This removes the album table from the join entirely, since you weren't
actually using any album data other than the ID, which albumjoin
supplies.  Your subselect is now an integral part of the whole query,
treated like a temp table that only supplies album IDs with 15-25
albumjoins.  From there, add on the track information, and use that to
restrict the matching nodes.  Your explain should be better with the
above.

Just remember that with the levenshtein() in there, you're forcing a
sequential scan on the node table.  Depending on how big that table is,
you may not be able to truly optimize this.
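
For reference, the distance levenshtein() computes is the classic dynamic-programming edit distance. Because it depends on *both* arguments at once, no ordinary index on either column can pre-filter it, which is why the planner has nothing better than a scan plus a join filter. A sketch of the computation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to each b prefix
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```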

--

Shaun Thomas
Database Administrator

Leapfrog Online
807 Greenwood Street
Evanston, IL 60201
Tel. 847-440-8253
Fax. 847-570-5750
www.leapfrogonline.com

Re: any hope for my big query?

From: Ben
On Fri, 29 Sep 2006, Jim C. Nasby wrote:

> There's no join criteria for umdb.node... is that really what you want?
>

Unfortunately, yes, it is.

I've taken in all of everybody's helpful advice (thanks!) and reworked
things a little, and now I'm left with this expensive nugget:

select aj.album from
(select seconds-1 as a,seconds+1 as b from node where node.dir = 6223) n
join public.track t
on (t.length between n.a*1000 and n.b*1000)
join public.albumjoin aj
on (aj.track = t.id)
join (select id from public.albummeta am where tracks between 3 and 7) lam
on (lam.id = aj.album)
group by aj.album having count(*) >= 4;

...which comes out to be:

  HashAggregate  (cost=904444.69..904909.99 rows=31020 width=4)
    Filter: (count(*) >= 4)
    ->  Nested Loop  (cost=428434.81..897905.17 rows=1307904 width=4)
          Join Filter: (("inner".length >= (("outer".seconds - 1) * 1000)) AND ("inner".length <= (("outer".seconds + 1) * 1000)))
          ->  Index Scan using node_dir on node  (cost=0.00..3.46 rows=17 width=4)
                Index Cond: (dir = 6223)
          ->  Materialize  (cost=428434.81..438740.01 rows=692420 width=8)
                ->  Hash Join  (cost=210370.58..424361.39 rows=692420 width=8)
                      Hash Cond: ("outer".id = "inner".track)
                      ->  Seq Scan on track t  (cost=0.00..128028.41 rows=5123841 width=8)
                      ->  Hash  (cost=205258.53..205258.53 rows=692420 width=8)
                            ->  Hash Join  (cost=6939.10..205258.53 rows=692420 width=8)
                                  Hash Cond: ("outer".album = "inner".id)
                                  ->  Seq Scan on albumjoin aj  (cost=0.00..88918.41 rows=5123841 width=8)
                                  ->  Hash  (cost=6794.51..6794.51 rows=57834 width=4)
                                        ->  Bitmap Heap Scan on albummeta am  (cost=557.00..6794.51 rows=57834 width=4)
                                              Recheck Cond: ((tracks >= 3) AND (tracks <= 7))
                                              ->  Bitmap Index Scan on albummeta_tracks_index  (cost=0.00..557.00 rows=57834 width=0)
                                                    Index Cond: ((tracks >= 3) AND (tracks <= 7))
(19 rows)


I'm surprised (though probably just because I'm ignorant) that it would
have so much sequential scanning in there. For instance, because n is
going to have at most a couple dozen rows, it seems that instead of
scanning all of public.track, it should be able to convert my "t.length
between a and b" clause to some between statements or'd together. Or at
least, it would be nice if the planner could do that. :)


Re: any hope for my big query?

From: Jim Nasby
On Oct 4, 2006, at 4:40 PM, Ben wrote:
> I'm surprised (though probably just because I'm ignorant) that it
> would have so much sequential scanning in there. For instance,
> because n is going to have at most a couple dozen rows, it seems
> that instead of scanning all of public.track, it should be able to
> convert my "t.length between a and b" clause to some between
> statements or'd together. Or at least, it would be nice if the
> planner could do that.

That would require the planner to have that knowledge at plan time,
which it can't get without actually querying the database. One thing
that might work wonders is performing the n query ahead of time and
then sticking the results in an array... that might speed things up.

Worst case, you could run the n query, and then run the rest of the
query for each row of n you get back.
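
A sketch of that two-step approach, with SQLite in Python as a stand-in (the table and column names mirror the query above, but the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node (dir INTEGER, seconds INTEGER);
CREATE TABLE track (id INTEGER PRIMARY KEY, length INTEGER);
CREATE INDEX track_length ON track (length);
""")
con.executemany("INSERT INTO node VALUES (?, ?)", [(6223, 180), (6223, 300)])
con.executemany("INSERT INTO track VALUES (?, ?)",
                [(1, 179500), (2, 250000), (3, 300900), (4, 500000)])

# Step 1: fetch the handful of n rows up front...
ranges = [((s - 1) * 1000, (s + 1) * 1000)
          for (s,) in con.execute("SELECT seconds FROM node WHERE dir = 6223")]

# Step 2: ...then run one indexable BETWEEN per range, instead of making
# the planner join every track row against every node row.
hits = set()
for lo, hi in ranges:
    hits.update(tid for (tid,) in con.execute(
        "SELECT id FROM track WHERE length BETWEEN ? AND ?", (lo, hi)))

print(sorted(hits))  # [1, 3]
```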

Better yet... send us a patch that allows the planner to look into
what a subselect will return to us. ;)
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

