Thread: Can you help with this JOIN?

Can you help with this JOIN?

From: Wei Weng
Date:
I have a massive join that takes something like 10 seconds to execute in
PostgreSQL 7.2. Can any of the gurus here help me improve it?

It is as follows:

SELECT DISTINCT cs.ContentID AS cscid, ct.Name AS ctname, pb.Name AS
pbname, cs.Author AS author, cs.CreationDate AS fromtime, cs.ExpireDate
AS totime, ct.Name AS media 
FROM 
ContentSummaries AS cs JOIN ContentTypes AS ct ON (ct.ContentTypeID =
cs.ContentTypeID) 
JOIN ContentAttributes AS cab ON (cab.ContentID = cs.ContentID) 
JOIN Attributes AS ab ON (ab.AttributeID = cab.AttributeID) 
JOIN Categories AS cat ON (cat.CategoryID = cs.CategoryID) 
LEFT OUTER JOIN Publishers AS pb ON (pb.PublisherID = cs.PublisherID)
WHERE cs.CreationDate IS NOT NULL

I already created indexes on all possible fields in those tables.
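
For reference, those index definitions on the join columns would look
something like the statements below. These are purely illustrative
guesses: every name except contentattributes_contentid_ind (which
appears in the EXPLAIN ANALYZE output later in the thread) is
hypothetical.

CREATE INDEX contentattributes_contentid_ind
    ON ContentAttributes (ContentID);
CREATE INDEX contentattributes_attributeid_ind     -- hypothetical name
    ON ContentAttributes (AttributeID);
CREATE INDEX contentsummaries_contenttypeid_ind    -- hypothetical name
    ON ContentSummaries (ContentTypeID);
CREATE INDEX contentsummaries_categoryid_ind       -- hypothetical name
    ON ContentSummaries (CategoryID);
CREATE INDEX contentsummaries_publisherid_ind      -- hypothetical name
    ON ContentSummaries (PublisherID);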

the query plan:


Limit  (cost=3170.75..3173.00 rows=15 width=172)
  ->  Unique  (cost=3170.75..3203.77 rows=220 width=172)
        ->  Sort  (cost=3170.75..3170.75 rows=2201 width=172)
              ->  Nested Loop  (cost=611.64..3048.54 rows=2201 width=172)
                    ->  Hash Join  (cost=611.64..3021.02 rows=2201 width=148)
                          ->  Hash Join  (cost=610.61..2931.93 rows=2201 width=136)
                                ->  Hash Join  (cost=609.44..2842.53 rows=2201 width=124)
                                      ->  Seq Scan on contentattributes cab  (cost=0.00..867.05 rows=33005 width=24)
                                      ->  Hash  (cost=600.27..600.27 rows=3667 width=100)
                                            ->  Hash Join  (cost=1.02..600.27 rows=3667 width=100)
                                                  ->  Seq Scan on contentsummaries cs  (cost=0.00..452.52 rows=3667 width=76)
                                                  ->  Hash  (cost=1.02..1.02 rows=2 width=24)
                                                        ->  Seq Scan on contenttypes ct  (cost=0.00..1.02 rows=2 width=24)
                                ->  Hash  (cost=1.14..1.14 rows=14 width=12)
                                      ->  Seq Scan on attributes ab  (cost=0.00..1.14 rows=14 width=12)
                          ->  Hash  (cost=1.02..1.02 rows=2 width=12)
                                ->  Seq Scan on categories cat  (cost=0.00..1.02 rows=2 width=12)
                    ->  Seq Scan on publishers pb  (cost=0.00..0.00 rows=1 width=24)


Thanks in advance. I have really been scratching my head over this one.


-- 
Wei Weng
Network Software Engineer
KenCast Inc.




Re: Can you help with this JOIN?

From: Wei Weng
Date:
One question: it appears to be using a sequential scan on the tables no
matter how big they are. I tried setting enable_seqscan='false' and then
running VACUUM ANALYZE, but the result was even worse.

Isn't an index scan supposed to be faster/better than a sequential scan
for large tables? One table (contentsummaries) has 11000 entries and
another (contentattributes) has 33005.

Thanks


On Wed, 2002-05-29 at 12:06, Wei Weng wrote:
> I have a massive join that takes something like 10 seconds to execute in
> Postgresql 7.2. Can any of gurus here help me improve it?
> 
> It is as follows:
> 
> SELECT DISTINCT cs.ContentID AS cscid, ct.Name AS ctname, pb.Name AS
> pbname, cs.Author AS author, cs.CreationDate AS fromtime, cs.ExpireDate
> AS totime, ct.Name AS media 
> FROM 
> ContentSummaries AS cs JOIN ContentTypes AS ct ON (ct.ContentTypeID =
> cs.ContentTypeID) 
> JOIN ContentAttributes AS cab ON (cab.ContentID = cs.ContentID) 
> JOIN Attributes AS ab ON (ab.AttributeID = cab.AttributeID) 
> JOIN Categories AS cat ON (cat.CategoryID = cs.CategoryID) 
> LEFT OUTER JOIN Publishers AS pb ON (pb.PublisherID = cs.PublisherID)
> WHERE cs.CreationDate IS NOT NULL
> 
> I already created indexes on all possible fields in those tables.
> 
> the query plan:
> 
> 
> Limit  (cost=3170.75..3173.00 rows=15 width=172)
>   ->  Unique  (cost=3170.75..3203.77 rows=220 width=172)
>         ->  Sort  (cost=3170.75..3170.75 rows=2201 width=172)
>               ->  Nested Loop  (cost=611.64..3048.54 rows=2201
> width=172)
>                     ->  Hash Join  (cost=611.64..3021.02 rows=2201
> width=148)
>                           ->  Hash Join  (cost=610.61..2931.93 rows=2201
> width=136)
>                                 ->  Hash Join  (cost=609.44..2842.53
> rows=2201 width=124)
>                                       ->  Seq Scan on contentattributes
> cab  (cost=0.00..867.05 rows=33005 width=24)
>                                       ->  Hash  (cost=600.27..600.27
> rows=3667 width=100)
>                                             ->  Hash Join 
> (cost=1.02..600.27 rows=3667 width=100)
>                                                   ->  Seq Scan on
> contentsummaries cs  (cost=0.00..452.52 rows=3667 width=76)
>                                                   ->  Hash 
> (cost=1.02..1.02 rows=2 width=24)
>                                                         ->  Seq Scan on
> contenttypes ct  (cost=0.00..1.02 rows=2 width=24)
>                                 ->  Hash  (cost=1.14..1.14 rows=14
> width=12)
>                                       ->  Seq Scan on attributes ab 
> (cost=0.00..1.14 rows=14 width=12)
>                           ->  Hash  (cost=1.02..1.02 rows=2 width=12)
>                                 ->  Seq Scan on categories cat 
> (cost=0.00..1.02 rows=2 width=12)
>                     ->  Seq Scan on publishers pb  (cost=0.00..0.00
> rows=1 width=24)
> 
> 
> Thanks in advance. I had been really scratching my head for this one.
> 
> 
> -- 
> Wei Weng
> Network Software Engineer
> KenCast Inc.
> 
> 
> 
-- 
Wei Weng
Network Software Engineer
KenCast Inc.




Re: Can you help with this JOIN?

From: Tom Lane
Date:
Wei Weng <wweng@kencast.com> writes:
> I have a massive join that takes something like 10 seconds to execute in
> Postgresql 7.2. Can any of gurus here help me improve it?

1. Have you vacuum analyzed these tables lately?  The row count
estimates look suspiciously low.

2. Please provide EXPLAIN ANALYZE output, not just EXPLAIN.
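
(For reference, a sketch of both steps, using the tables from the
original query, would look something like this:)

VACUUM ANALYZE ContentSummaries;
VACUUM ANALYZE ContentAttributes;
-- ...and likewise for ContentTypes, Attributes, Categories, Publishers.
-- Then rerun the query under EXPLAIN ANALYZE to get actual row counts
-- and timings alongside the estimates:
EXPLAIN ANALYZE
SELECT DISTINCT cs.ContentID AS cscid, ct.Name AS ctname, pb.Name AS pbname,
       cs.Author AS author, cs.CreationDate AS fromtime,
       cs.ExpireDate AS totime, ct.Name AS media
FROM ContentSummaries AS cs
     JOIN ContentTypes AS ct ON (ct.ContentTypeID = cs.ContentTypeID)
     JOIN ContentAttributes AS cab ON (cab.ContentID = cs.ContentID)
     JOIN Attributes AS ab ON (ab.AttributeID = cab.AttributeID)
     JOIN Categories AS cat ON (cat.CategoryID = cs.CategoryID)
     LEFT OUTER JOIN Publishers AS pb ON (pb.PublisherID = cs.PublisherID)
WHERE cs.CreationDate IS NOT NULL;
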
        regards, tom lane


Re: Can you help with this JOIN?

From: Stephan Szabo
Date:
On 29 May 2002, Wei Weng wrote:

> One question is: It appears to be using Sequential scan on tables no
> matter how big it is. I tried to set enable_seqscan='false' and then
> vacuum analyze, but the result is even worse.
>
> Isn't index scan supposed to be faster/better than sequential scan for
> large tables? One table (contentsummaries) has 11000 entries and another

Not necessarily.  If the matching rows are scattered across random pages
(i.e., there is no real correlation between the index key value and the
row's position in the table), then past a certain point you end up
reading most of the table's pages anyway, plus you're reading the index.
(And the smaller the individual rows, the more likely this is to occur,
I'd guess.)

You might want to try other join orders: since you're using explicit
JOIN syntax, PostgreSQL takes the order in which you write the joins as
the order in which to perform them.  And are you sure you need the
DISTINCT? That is probably what's producing the Sort and Unique steps.
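
Purely as an illustration (not a recommendation for one particular
order), a rewrite along those lines might look like the sketch below,
with DISTINCT dropped and the joins written in a different order:

SELECT cs.ContentID AS cscid, ct.Name AS ctname, pb.Name AS pbname,
       cs.Author AS author, cs.CreationDate AS fromtime,
       cs.ExpireDate AS totime, ct.Name AS media
FROM ContentSummaries AS cs
     JOIN Categories AS cat ON (cat.CategoryID = cs.CategoryID)
     JOIN ContentTypes AS ct ON (ct.ContentTypeID = cs.ContentTypeID)
     JOIN ContentAttributes AS cab ON (cab.ContentID = cs.ContentID)
     JOIN Attributes AS ab ON (ab.AttributeID = cab.AttributeID)
     LEFT OUTER JOIN Publishers AS pb ON (pb.PublisherID = cs.PublisherID)
WHERE cs.CreationDate IS NOT NULL;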

For other things, what are your postgresql.conf settings for shared memory
and sort memory?
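
(In 7.2 those parameters are shared_buffers and sort_mem; a quick way
to check them from psql, and to try a larger sort_mem for the current
session only, would be something like:)

SHOW shared_buffers;
SHOW sort_mem;
-- sort_mem is in kilobytes; raising it for the session while
-- experimenting might look like this (pick a value that fits your RAM):
SET sort_mem = 8192;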



Re: Can you help with this JOIN?

From: Wei Weng
Date:
I was not running 7.2 when I posted this, so I upgraded my system. Here
is the new query plan from EXPLAIN ANALYZE:

NOTICE:  QUERY PLAN:

Limit  (cost=30448.71..30450.96 rows=15 width=713) (actual time=7591.70..7592.67 rows=15 loops=1)
  ->  Unique  (cost=30448.71..30898.79 rows=3000 width=713) (actual time=7591.69..7592.64 rows=16 loops=1)
        ->  Sort  (cost=30448.71..30448.71 rows=30005 width=713) (actual time=7591.69..7591.78 rows=46 loops=1)
              ->  Hash Join  (cost=2310.20..5107.41 rows=30005 width=713) (actual time=1064.93..5288.91 rows=30005 loops=1)
                    ->  Hash Join  (cost=2310.19..4955.50 rows=30005 width=529) (actual time=1064.87..5053.43 rows=30005 loops=1)
                          ->  Hash Join  (cost=2309.17..4429.39 rows=30005 width=489) (actual time=1064.63..3944.85 rows=30005 loops=1)
                                ->  Merge Join  (cost=2307.99..3903.13 rows=30005 width=449) (actual time=1064.36..2842.60 rows=30005 loops=1)
                                      ->  Index Scan using contentattributes_contentid_ind on contentattributes cab  (cost=0.00..1120.06 rows=30005 width=80) (actual time=20.71..459.11 rows=30005 loops=1)
                                      ->  Sort  (cost=2307.99..2307.99 rows=10002 width=369) (actual time=1043.57..1162.89 rows=30003 loops=1)
                                            ->  Hash Join  (cost=1.02..562.08 rows=10002 width=369) (actual time=0.33..462.06 rows=10002 loops=1)
                                                  ->  Seq Scan on contentsummaries cs  (cost=0.00..386.02 rows=10002 width=320) (actual time=0.09..126.02 rows=10002 loops=1)
                                                  ->  Hash  (cost=1.02..1.02 rows=2 width=49) (actual time=0.11..0.11 rows=0 loops=1)
                                                        ->  Seq Scan on contenttypes ct  (cost=0.00..1.02 rows=2 width=49) (actual time=0.10..0.10 rows=2 loops=1)
                                ->  Hash  (cost=1.14..1.14 rows=14 width=40) (actual time=0.17..0.17 rows=0 loops=1)
                                      ->  Seq Scan on attributes ab  (cost=0.00..1.14 rows=14 width=40) (actual time=0.09..0.14 rows=14 loops=1)
                          ->  Hash  (cost=1.02..1.02 rows=2 width=40) (actual time=0.13..0.13 rows=0 loops=1)
                                ->  Seq Scan on categories cat  (cost=0.00..1.02 rows=2 width=40) (actual time=0.11..0.12 rows=2 loops=1)
                    ->  Hash  (cost=0.00..0.00 rows=1 width=184) (actual time=0.01..0.01 rows=0 loops=1)
                          ->  Seq Scan on publishers pb  (cost=0.00..0.00 rows=1 width=184) (actual time=0.00..0.00 rows=0 loops=1)

Total runtime: 7687.20 msec

Hope this makes sense. Thanks

On Wed, 2002-05-29 at 12:37, Tom Lane wrote:
> Wei Weng <wweng@kencast.com> writes:
> > I have a massive join that takes something like 10 seconds to execute in
> > Postgresql 7.2. Can any of gurus here help me improve it?
> 
> 1. Have you vacuum analyzed these tables lately?  The row count
> estimates look suspiciously low.
> 
> 2. Please provide EXPLAIN ANALYZE output not just EXPLAIN.
> 
>             regards, tom lane
> 
-- 
Wei Weng
Network Software Engineer
KenCast Inc.




Re: Can you help with this JOIN?

From: Wei Weng
Date:
But how do I write the LEFT OUTER JOIN if I don't do explicit JOINs on
the other tables? Can I just do

SELECT x FROM a, b, c, LEFT OUTER JOIN d ON (d.id = a.id)?


On Wed, 2002-05-29 at 13:18, Stephan Szabo wrote:
> 
> On 29 May 2002, Wei Weng wrote:
> 
> > One question is: It appears to be using Sequential scan on tables no
> > matter how big it is. I tried to set enable_seqscan='false' and then
> > vacuum analyze, but the result is even worse.
> >
> > Isn't index scan supposed to be faster/better than sequential scan for
> > large tables? One table (contentsummaries) has 11000 entries and another
> 
> Not necessarily.  If the values in the table are in random pages (ie, no
> real coordination between index key value and position in table) after a
> point you end up reading most of the pages of the table anyway, plus
> you're reading the index (and the smaller the individual row, the more
> likely this is to occur as well I'd guess)
> 
> You might want to consider trying other join orders since you're using
> explicit join syntax since postgres will take the order you're doing your
> joins as the order you want the joins to be done in.  And are you sure
> you need the distinct, that's probably resulting in the sort and unique
> step.
> 
> For other things, what are your postgresql.conf settings for shared memory
> and sort memory?
> 
> 
-- 
Wei Weng
Network Software Engineer
KenCast Inc.




Re: Can you help with this JOIN?

From: Stephan Szabo
Date:
On 30 May 2002, Wei Weng wrote:

> But how do I write the LEFT OUTER JOIN if I don't do explicit JOINs on
> the other tables? Can I just do
>
> SELECT x FROM a, b, c, LEFT OUTER JOIN d ON (d.id = a.id)?

I think you'd need to write
SELECT x FROM b,c,a LEFT OUTER JOIN d ON (d.id = a.id)

That should force the outer join first, but give flexibility for the rest
of the joins.
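
Applied to the original query, that would come out roughly as the
sketch below (keeping the original aliases): the LEFT OUTER JOIN stays
attached to ContentSummaries, and the remaining join conditions move
into the WHERE clause.

SELECT DISTINCT cs.ContentID AS cscid, ct.Name AS ctname, pb.Name AS pbname,
       cs.Author AS author, cs.CreationDate AS fromtime,
       cs.ExpireDate AS totime, ct.Name AS media
FROM ContentTypes AS ct,
     ContentAttributes AS cab,
     Attributes AS ab,
     Categories AS cat,
     ContentSummaries AS cs
     LEFT OUTER JOIN Publishers AS pb ON (pb.PublisherID = cs.PublisherID)
WHERE ct.ContentTypeID = cs.ContentTypeID
  AND cab.ContentID = cs.ContentID
  AND ab.AttributeID = cab.AttributeID
  AND cat.CategoryID = cs.CategoryID
  AND cs.CreationDate IS NOT NULL;

That leaves the planner free to pick the order for the comma-separated
tables while still forcing the outer join against Publishers.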




Re: Can you help with this JOIN?

From: Tom Lane
Date:
Wei Weng <wweng@kencast.com> writes:
> But how do I write the LEFT OUTER JOIN if I don't do explicit JOINs on
> the other tables?

See the manual: there's a relevant example near the bottom of
http://www.postgresql.org/users-lounge/docs/7.2/postgres/explicit-joins.html
but you'd do well to study that whole page...
        regards, tom lane