Re: JOINs ... how I hate them ... - Mailing list pgsql-hackers

From Tom Lane
Subject Re: JOINs ... how I hate them ...
Date
Msg-id 20120.1013124770@sss.pgh.pa.us
Whole thread Raw
In response to JOINs ... how I hate them ...  ("Marc G. Fournier" <scrappy@hub.org>)
List pgsql-hackers
"Marc G. Fournier" <scrappy@hub.org> writes:
> explain SELECT p.uid, p.handle
>     FROM  gender_f pgf JOIN profiles p ON (pgf.uid = p.uid) ;

> Which explains out as:

> Hash Join  (cost=1354.80..45297.83 rows=75664 width=27)
>   ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969 width=19)
>   ->  Hash  (cost=1165.64..1165.64 rows=75664 width=8)
>         ->  Seq Scan on gender_f pgf  (cost=0.00..1165.64 rows=75664 width=8)

> Now, profiles has uid as its primary KEY, and there are no
> duplicates in gender_f ... so, as my HashJoin points out, I should have 75664
> results returned ... that is expected ... and the SeqScan on gender_f is
> expected ... but the SeqScan on profiles is what I would hope to get rid
> of ...

Um, why?  Looks like a perfectly reasonable plan to me.

> get uid from gender_f, find corresponding entry in profiles ...

I'm not convinced that 75000 indexscan probes would be faster than a
sequential scan across that table.  You could probably force the issue
with "set enable_hashjoin to off" (and maybe also "set enable_mergejoin
to off") and then see what the plan is and what the actual timing is.
(EXPLAIN ANALYZE should be real helpful here.)

> explain SELECT p.uid, p.handle
>     FROM ( orientation_c poc JOIN gender_f pgf USING ( uid ) ) JOIN profiles p ON (pgf.uid = p.uid) ;

> Hash Join  (cost=6023.92..47537.10 rows=75664 width=35)
>   ->  Seq Scan on profiles p  (cost=0.00..35707.69 rows=485969 width=19)
>   ->  Hash  (cost=5834.76..5834.76 rows=75664 width=16)
>         ->  Merge Join  (cost=0.00..5834.76 rows=75664 width=16)
>               ->  Index Scan using poc_uid on orientation_c poc  (cost=0.00..2807.82 rows=126477 width=8)
>               ->  Index Scan using pgf_uid on gender_f pgf  (cost=0.00..1575.79 rows=75664 width=8)

> The MergeJoin between poc/pgf will only return 12000 records, and since it
> is a 1:1 relationship between each of those tables, there will *only* be
> 12000 records pulled from profiles ...

Hmm, it thinks that there will be 75664 not 12000 records out of that
join.  Why the discrepancy?  Could we see the pg_stats data for these
tables?

> ... but, again, for each of
> *those* 12000 records, its doing a SeqScan on last_login's 485k records,
> instead of using the index

No, certainly *not* "for each record".  It's a hash join, so it only
reads each table once.

> Now, if I 'set enable_seqscan=false;' and do the exact same explain, it
> definitely comes more in line with what I'd like to see, as far as index
> usage is concerned:

And what's the actual runtime come out to be?

We definitely should standardize on asking for EXPLAIN ANALYZE results
in bad-plan discussions, now that we have that capability.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "D. Hageman"
Date:
Subject: Re: Threaded PosgreSQL server
Next
From: "D. Hageman"
Date:
Subject: Re: Threaded PosgreSQL server