question about executing JOINs - Mailing list pgsql-general

From Josh Burdick
Subject question about executing JOINs
Date
Msg-id 3DA5D037.5050000@gradient.cis.upenn.edu
Whole thread Raw
Responses Re: question about executing JOINs  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
      We're using Postgres 7.2.1 in a biology lab.  Some of our joins
involving selecting from two tables, but the fields we're selecting
aren't equal, they're just nearby.

select "tName" as "chrom",
    [...stuff deleted...]
       "tStart" - "snp_start" as "distance_from_start",
       "snp_start" - "tEnd" as "distance_from_end"
from   ucsc_ref_seq_ali_hg12 join ucsc_snp_tsc_hg12
       on ucsc_snp_tsc_hg12."snp_chrom" = ucsc_ref_seq_ali_hg12."tName"
where "snp_start" >= "tStart" - 1000000
      and "snp_start" <= "tEnd" + 1000000;

    The problem is that the planner comes up with:

Merge Join  (cost=201234.34..17463319.54 rows=85230539 width=51)
  ->  Sort  (cost=1870.74..1870.74 rows=15165 width=29)
        ->  Seq Scan on ucsc_ref_seq_ali_hg12  (cost=0.00..817.65
rows=15165 width=29)
  ->  Sort  (cost=199363.59..199363.59 rows=1145280 width=22)
        ->  Seq Scan on ucsc_snp_tsc_hg12  (cost=0.00..20996.80
rows=1145280 width=22)

    which doesn't finish after 45 minutes, and presumably would take a
while to finish.  However, by temporarily doing

set enable_mergejoin=off;
set enable_hashjoin=off;

    the planner says

Nested Loop  (cost=0.00..198183643.06 rows=85230539 width=51)
  ->  Seq Scan on ucsc_ref_seq_ali_hg12  (cost=0.00..817.65 rows=15165
width=29)
  ->  Index Scan using ucsc_snp_tsc_hg12_chrom_start on
ucsc_snp_tsc_hg12  (cost=0.00..12962.39 rows=4713 width=22)

    which takes four minutes.
    So, in practice, we just save the output of that in a table (using
CREATE TABLE AS), which works.  It's just a bit awkward to have to
fiddle with those switches; it would be nice if the planner were a bit
cleverer in this case.
    Is there a runtime parameter that seems likely to fix this?
 (perhaps CPU_INDEX_TUPLE_COST).
    The planner is assuming that it's a cross join: that we need all
pairs of records.  It's not taking into account the WHERE clause which
restricts to a tiny fraction of the records.  Perhaps the planner should
assume that a nested loop over an index scan only looks at 1% of its
records?
    This would make cross joins more expensive.  But most of my cross
joins have been accidental, when I've left out a join condition, so I
don't really mind them taking longer, because I'll just stop them after
20 minutes anyway :)
    Note that this isn't a question of join order (I think); it's just a
question of which way to execute the join.
    Thanks,
    Josh

--
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick



pgsql-general by date:

Previous
From: "Magnus Naeslund(f)"
Date:
Subject: Re: connection watchdog
Next
From: Patrick Nelson
Date:
Subject: RE in WHERE