question about executing JOINs - Mailing list pgsql-general
From | Josh Burdick |
---|---|
Subject | question about executing JOINs |
Date | |
Msg-id | 3DA5D037.5050000@gradient.cis.upenn.edu Whole thread Raw |
Responses |
Re: question about executing JOINs
(Tom Lane <tgl@sss.pgh.pa.us>)
|
List | pgsql-general |
We're using Postgres 7.2.1 in a biology lab. Some of our joins involving selecting from two tables, but the fields we're selecting aren't equal, they're just nearby. select "tName" as "chrom", [...stuff deleted...] "tStart" - "snp_start" as "distance_from_start", "snp_start" - "tEnd" as "distance_from_end" from ucsc_ref_seq_ali_hg12 join ucsc_snp_tsc_hg12 on ucsc_snp_tsc_hg12."snp_chrom" = ucsc_ref_seq_ali_hg12."tName" where "snp_start" >= "tStart" - 1000000 and "snp_start" <= "tEnd" + 1000000; The problem is that the planner comes up with: Merge Join (cost=201234.34..17463319.54 rows=85230539 width=51) -> Sort (cost=1870.74..1870.74 rows=15165 width=29) -> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65 rows=15165 width=29) -> Sort (cost=199363.59..199363.59 rows=1145280 width=22) -> Seq Scan on ucsc_snp_tsc_hg12 (cost=0.00..20996.80 rows=1145280 width=22) which doesn't finish after 45 minutes, and presumably would take a while to finish. However, by temporarily doing set enable_mergejoin=off; set enable_hashjoin=off; the planner says Nested Loop (cost=0.00..198183643.06 rows=85230539 width=51) -> Seq Scan on ucsc_ref_seq_ali_hg12 (cost=0.00..817.65 rows=15165 width=29) -> Index Scan using ucsc_snp_tsc_hg12_chrom_start on ucsc_snp_tsc_hg12 (cost=0.00..12962.39 rows=4713 width=22) which takes four minutes. So, in practice, we just save the output of that in a table (using CREATE TABLE AS), which works. It's just a bit awkward to have to fiddle with those switches; it would be nice if the planner were a bit cleverer in this case. Is there a runtime parameter that seems likely to fix this? (perhaps CPU_INDEX_TUPLE_COST). The planner is assuming that it's a cross join: that we need all pairs of records. It's not taking into account the WHERE clause which restricts to a tiny fraction of the records. Perhaps the planner should assume that a nested loop over an index scan only looks at 1% of its records? This would make cross joins more expensive. But most of my cross joins have been accidental, when I've left out a join condition, so I don't really mind them taking longer, because I'll just stop them after 20 minutes anyway :) Note that this isn't a question of join order (I think); it's just a question of which way to execute the join. Thanks, Josh -- Josh Burdick jburdick@gradient.cis.upenn.edu http://www.cis.upenn.edu/~jburdick
pgsql-general by date: