Thread: JOINs ... how I hate them ...

JOINs ... how I hate them ...

From
"Marc G. Fournier"
Date:
Okay, went back through teh archives, as I know that Tom provided a
solution for this before, and found it at:

http://archives.postgresql.org/pgsql-sql/2001-06/msg00329.php

Plain and simple ... makes perfect sense ... doesn't work in v7.2, or, at
least, not as I'm expecting it to ...

I've broken what I'm trying to do down the the *basest* component I can:

explain SELECT p.uid, p.handle   FROM  gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ;

Which explains out as:

Hash Join  (cost=1354.80..45297.83 rows=75664 width=27) ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969
width=19)->  Hash  (cost=1165.64..1165.64 rows=75664 width=8)       ->  Seq Scan on gender_f pgf  (cost=0.00..1165.64
rows=75664width=8)
 

Now, profiles has uid as its primary KEY, and there are no
duplicates in gender_f ... so, as my HashJoin points out, I should have 75664
results returned ... that is expected ... and the SeqScan on gender_f is
expected ... but the SeqScan on profiles is what I would hope to get rid
of ... get uid from gender_f, find corresponding entry in profiles ... its only
ever goign to pull out 75664 out of 485969 records from profiles, so why
would it seqscan *through* profiles for each and every UID?

Now, if I go to the next level that I'm trying to pull together:

explain SELECT p.uid, p.handle   FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid
=p.uid) ;
 

It still explains, what I think, is wrong:

Hash Join  (cost=6023.92..47537.10 rows=75664 width=35) ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969
width=19)->  Hash  (cost=5834.76..5834.76 rows=75664 width=16)       ->  Merge Join  (cost=0.00..5834.76 rows=75664
width=16)            ->  Index Scan using poc_uid on orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)
    ->  Index Scan using pgf_uid on gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8)
 


The MergeJoin between poc/pgf will only return 12000 records, and since it
is a 1:1 relationship between each of those tables, there will *only* be
12000 records pulled from profiles ... yet its doing a SeqScan through all
485k records for each of those UIDs?

This is after I've performed a VACUUM ANALYZE ...

The final query itself is:

SELECT p.uid, p.profiles_handle   FROM  ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) ) JOIN
iwantu_profilesp USING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid );
 

Which explains as:

Hash Join  (cost=31636.40..78239.34 rows=75664 width=43) ->  Hash Join  (cost=6023.92..47537.10 rows=75664 width=35)
  ->  Seq Scan on iwantu_profiles p  (cost=0.00..35707.69 rows=485969 width=19)       ->  Hash  (cost=5834.76..5834.76
rows=75664width=16)             ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)                   ->  Index
Scanusing poc_uid on profiles_orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)                   ->  Index
Scanusing pgf_uid on profiles_gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8) ->  Hash  (cost=7955.64..7955.64
rows=485964width=8)       ->  Seq Scan on iwantu_last_login ll  (cost=0.00..7955.64 rows=485964 width=8)
 

EXPLAIN

So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then
there is the SeqScan/HashJoin wiht profiles, which will leave me with
12000 records, but with more information ... but, again, for each of
*those* 12000 records, its doing a SeqScan on last_login's 485k records,
instead of using the index ... again, like pgf and poc, there is only one
record for every uid, so we aren't dealing with duplicates ...

Now, if I 'set enable_seqscan=false;' and do the exact same explain, it
definitely comes more in line with what I'd like to see, as far as index
usage is concerned:

Nested Loop  (cost=0.00..546759.46 rows=75664 width=43) ->  Nested Loop  (cost=0.00..272274.75 rows=75664 width=35)
 ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)             ->  Index Scan using poc_uid on
profiles_orientation_cpoc  (cost=0.00..2807.82 rows=126477 width=8)             ->  Index Scan using pgf_uid on
profiles_gender_fpgf  (cost=0.00..1575.79 rows=75664 width=8)       ->  Index Scan using iwantu_profiles_uid on
iwantu_profilesp  (cost=0.00..3.51 rows=1 width=19) ->  Index Scan using ill_uid on iwantu_last_login ll
(cost=0.00..3.62rows=1 width=8)
 





Re: JOINs ... how I hate them ...

From
Hannu Krosing
Date:
On Thu, 2002-02-07 at 16:23, Marc G. Fournier wrote:
> 
> 
> 
> The MergeJoin between poc/pgf will only return 12000 records, and since it
> is a 1:1 relationship between each of those tables, there will *only* be
> 12000 records pulled from profiles ... yet its doing a SeqScan through all
> 485k records for each of those UIDs?
> 
> This is after I've performed a VACUUM ANALYZE ...
> 
> The final query itself is:
> 
> SELECT p.uid, p.profiles_handle
>     FROM  ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) )>             JOIN iwantu_profiles
pUSING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid );
 
> 
> Which explains as:
> 
> Hash Join  (cost=31636.40..78239.34 rows=75664 width=43)
>   ->  Hash Join  (cost=6023.92..47537.10 rows=75664 width=35)
>         ->  Seq Scan on iwantu_profiles p  (cost=0.00..35707.69 rows=485969 width=19)
>         ->  Hash  (cost=5834.76..5834.76 rows=75664 width=16)
>               ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)
>                     ->  Index Scan using poc_uid on profiles_orientation_c poc  (cost=0.00..2807.82 rows=126477
width=8)
>                     ->  Index Scan using pgf_uid on profiles_gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8)
>   ->  Hash  (cost=7955.64..7955.64 rows=485964 width=8)
>         ->  Seq Scan on iwantu_last_login ll  (cost=0.00..7955.64 rows=485964 width=8)
> 
> EXPLAIN
> 
> So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then
> there is the SeqScan/HashJoin wiht profiles, which will leave me with
> 12000 records, but with more information ... but, again, for each of
> *those* 12000 records, its doing a SeqScan on last_login's 485k records,
> instead of using the index ... again, like pgf and poc, there is only one
> record for every uid, so we aren't dealing with duplicates ...

I recently sped up a somewhat similar query from 15 sec to < 1 sec by
rewriting it to use a subselect:

SELECT p.uid, p.profiles_handle FROM profiles_orientation_c poc,      profiles_gender_f pgf      (select uid,
profiles_handle        from iwantu_profiles ip        where ip.uid = pgf.uid      ) p WHERE poc.uid = pgf.uid
 

If you need something from iwantu_last_login it should go into that
subselect as well

That tricked my case to do the small join first.

-----------------
Hannu





Re: JOINs ... how I hate them ...

From
"Dann Corbit"
Date:
In Rdb (for instance) you can edit the plan if you want.  (Oracle too,
IIRC -- but I never have edited a plan in Oracle)

Sure, it opens a big can of worms, but it would be nice for someone
technically inclined to be able to fix a plan if they know better than
the SQL compiler did.


Re: JOINs ... how I hate them ...

From
Tom Lane
Date:
"Marc G. Fournier" <scrappy@hub.org> writes:
> explain SELECT p.uid, p.handle
>     FROM  gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ;

> Which explains out as:

> Hash Join  (cost=1354.80..45297.83 rows=75664 width=27)
>   ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969 width=19)
>   ->  Hash  (cost=1165.64..1165.64 rows=75664 width=8)
>         ->  Seq Scan on gender_f pgf  (cost=0.00..1165.64 rows=75664 width=8)

> Now, profiles has uid as its primary KEY, and there are no
> duplicates in gender_f ... so, as my HashJoin points out, I should have 75664
> results returned ... that is expected ... and the SeqScan on gender_f is
> expected ... but the SeqScan on profiles is what I would hope to get rid
> of ...

Um, why?  Looks like a perfectly reasonable plan to me.

> get uid from gender_f, find corresponding entry in profiles ...

I'm not convinced that 75000 indexscan probes would be faster than a
sequential scan across that table.  You could probably force the issue
with "set enable_hashjoin to off" (and maybe also "set enable_mergejoin
to off") and then see what the plan is and what the actual timing is.
(EXPLAIN ANALYZE should be real helpful here.)

> explain SELECT p.uid, p.handle
>     FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid = p.uid) ;

> Hash Join  (cost=6023.92..47537.10 rows=75664 width=35)
>   ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969 width=19)
>   ->  Hash  (cost=5834.76..5834.76 rows=75664 width=16)
>         ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)
>               ->  Index Scan using poc_uid on orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)
>               ->  Index Scan using pgf_uid on gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8)

> The MergeJoin between poc/pgf will only return 12000 records, and since it
> is a 1:1 relationship between each of those tables, there will *only* be
> 12000 records pulled from profiles ...

Hmm, it thinks that there will be 75664 not 12000 records out of that
join.  Why the discrepancy?  Could we see the pg_stats data for these
tables?

> ... but, again, for each of
> *those* 12000 records, its doing a SeqScan on last_login's 485k records,
> instead of using the index

No, certainly *not* "for each record".  It's a hash join, so it only
reads each table once.

> Now, if I 'set enable_seqscan=false;' and do the exact same explain, it
> definitely comes more in line with what I'd like to see, as far as index
> usage is concerned:

And what's the actual runtime come out to be?

We definitely should standardize on asking for EXPLAIN ANALYZE results
in bad-plan discussions, now that we have that capability.
        regards, tom lane