JOINs ... how I hate them ... - Mailing list pgsql-hackers

From Marc G. Fournier
Subject JOINs ... how I hate them ...
Date
Msg-id 20020207100506.Y57607-100000@earth.hub.org
Whole thread Raw
Responses Re: JOINs ... how I hate them ...  (Hannu Krosing <hannu@tm.ee>)
Re: JOINs ... how I hate them ...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Okay, went back through teh archives, as I know that Tom provided a
solution for this before, and found it at:

http://archives.postgresql.org/pgsql-sql/2001-06/msg00329.php

Plain and simple ... makes perfect sense ... doesn't work in v7.2, or, at
least, not as I'm expecting it to ...

I've broken what I'm trying to do down the the *basest* component I can:

explain SELECT p.uid, p.handle   FROM  gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ;

Which explains out as:

Hash Join  (cost=1354.80..45297.83 rows=75664 width=27) ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969
width=19)->  Hash  (cost=1165.64..1165.64 rows=75664 width=8)       ->  Seq Scan on gender_f pgf  (cost=0.00..1165.64
rows=75664width=8)
 

Now, profiles has uid as its primary KEY, and there are no
duplicates in gender_f ... so, as my HashJoin points out, I should have 75664
results returned ... that is expected ... and the SeqScan on gender_f is
expected ... but the SeqScan on profiles is what I would hope to get rid
of ... get uid from gender_f, find corresponding entry in profiles ... its only
ever goign to pull out 75664 out of 485969 records from profiles, so why
would it seqscan *through* profiles for each and every UID?

Now, if I go to the next level that I'm trying to pull together:

explain SELECT p.uid, p.handle   FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid
=p.uid) ;
 

It still explains, what I think, is wrong:

Hash Join  (cost=6023.92..47537.10 rows=75664 width=35) ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969
width=19)->  Hash  (cost=5834.76..5834.76 rows=75664 width=16)       ->  Merge Join  (cost=0.00..5834.76 rows=75664
width=16)            ->  Index Scan using poc_uid on orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)
    ->  Index Scan using pgf_uid on gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8)
 


The MergeJoin between poc/pgf will only return 12000 records, and since it
is a 1:1 relationship between each of those tables, there will *only* be
12000 records pulled from profiles ... yet its doing a SeqScan through all
485k records for each of those UIDs?

This is after I've performed a VACUUM ANALYZE ...

The final query itself is:

SELECT p.uid, p.profiles_handle   FROM  ( ( profiles_orientation_c poc JOIN profiles_gender_f pgf USING ( uid ) ) JOIN
iwantu_profilesp USING (uid ) ) LEFT JOIN iwantu_last_login ll USING ( uid );
 

Which explains as:

Hash Join  (cost=31636.40..78239.34 rows=75664 width=43) ->  Hash Join  (cost=6023.92..47537.10 rows=75664 width=35)
  ->  Seq Scan on iwantu_profiles p  (cost=0.00..35707.69 rows=485969 width=19)       ->  Hash  (cost=5834.76..5834.76
rows=75664width=16)             ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)                   ->  Index
Scanusing poc_uid on profiles_orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)                   ->  Index
Scanusing pgf_uid on profiles_gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8) ->  Hash  (cost=7955.64..7955.64
rows=485964width=8)       ->  Seq Scan on iwantu_last_login ll  (cost=0.00..7955.64 rows=485964 width=8)
 

EXPLAIN

So, poc&pgf are MergeJoin's, leaving me with 12000 records again ... then
there is the SeqScan/HashJoin wiht profiles, which will leave me with
12000 records, but with more information ... but, again, for each of
*those* 12000 records, its doing a SeqScan on last_login's 485k records,
instead of using the index ... again, like pgf and poc, there is only one
record for every uid, so we aren't dealing with duplicates ...

Now, if I 'set enable_seqscan=false;' and do the exact same explain, it
definitely comes more in line with what I'd like to see, as far as index
usage is concerned:

Nested Loop  (cost=0.00..546759.46 rows=75664 width=43) ->  Nested Loop  (cost=0.00..272274.75 rows=75664 width=35)
 ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)             ->  Index Scan using poc_uid on
profiles_orientation_cpoc  (cost=0.00..2807.82 rows=126477 width=8)             ->  Index Scan using pgf_uid on
profiles_gender_fpgf  (cost=0.00..1575.79 rows=75664 width=8)       ->  Index Scan using iwantu_profiles_uid on
iwantu_profilesp  (cost=0.00..3.51 rows=1 width=19) ->  Index Scan using ill_uid on iwantu_last_login ll
(cost=0.00..3.62rows=1 width=8)
 





pgsql-hackers by date:

Previous
From: "Marc G. Fournier"
Date:
Subject: Re: Threaded PosgreSQL server
Next
From: Justin Clift
Date:
Subject: Re: Threaded PosgreSQL server