Thread: full outer performance problem
Hi,
I'm having problems with the query optimizer and FULL OUTER JOIN on PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. I might be naive, but I think that it should be possible?
I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN.
When I specify the query as:
If further info like EXPLAIN VERBOSE is useful please say so and I will provide it.
Thanks in advance!
Kim Bisgaard.
I'm having problems with the query optimizer and FULL OUTER JOIN on PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. I might be naive, but I think that it should be possible?
I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN.
When I specify the query as:
select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate;I get the correct results, BUT LOUSY performance, and the following explain:
Nested Loop Left Join (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1) Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone) -> Hash Join (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1) Hash Cond: ("outer".station_id = "inner".station_id) -> Seq Scan on temp_dry_at_2m a (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 loops=1) -> Hash (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1) -> Index Scan using wmo_idx on station (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 loops=1) Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) -> Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826) Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))Total runtime: 119005.499 ms (11 rows)If I change the query to (and thus negates the full outer join):
select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate;I get wrong results (In the case where one of the records is missing in one of the tables), BUT GOOD performance, and this query plan:
Nested Loop (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1) -> Nested Loop (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1) -> Index Scan using wmo_idx on station (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1) Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) -> Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1) Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) -> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1) Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))Total runtime: 79.340 ms (9 rows)
If further info like EXPLAIN VERBOSE is useful please say so and I will provide it.
Thanks in advance!
Kim Bisgaard.
Kim Bisgaard wrote: > Hi, > > I'm having problems with the query optimizer and FULL OUTER JOIN on > PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. > I might be naive, but I think that it should be possible? > > I have two BIG tables (virtually identical) with 3 NOT NULL columns > Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) > and valid ANALYSE (set statistics=100). I want to join the two tables > with a FULL OUTER JOIN. > > When I specify the query as: > > select temp_max_60min,temp_dry_at_2m > from station s natural join > temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) > where s.wmo_id=6065 > and timeobs='2004-1-1 0:0:0' > and '2004-1-1 0:0:0' between s.startdate and s.enddate; > > I get the correct results, BUT LOUSY performance, and the following explain: > > Nested Loop Left Join (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1) > Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone) > -> Hash Join (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1) Well, the estimate here is quite a bit off. It thinks you will be getting 1349 (which is probably why it picked a nested loop plan), but then it is getting 109826 rows. I'm guessing it is misunderstanding the selectivity of the timeobs column. > Hash Cond: ("outer".station_id = "inner".station_id) > -> Seq Scan on temp_dry_at_2m a (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515rows=6956994 loops=1) > -> Hash (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1) > -> Index Scan using wmo_idx on station (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1loops=1) > Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate)AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) > -> Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071rows=0 loops=109826) > Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs)) > Total runtime: 119005.499 ms > (11 rows) I think the bigger problem is that a full outer join says grab all rows, even if they are null. What about this query: SELECT temp_max_60min,temp_dry_at_2m FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs) LEFT JOIN temp_max_60min b USING (station_id, timeobs) where s.wmo_id=6065 and timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate; After that, you should probably have a multi-column index on (station_id, timeobs), which lets postgres use just that index for the lookup, rather than using an index and then a filter. (Looking at your next query you might already have that index). > > If I change the query to (and thus negates the full outer join): This is the same query, I think you messed up your copy and paste. > > select temp_max_60min,temp_dry_at_2m > from station s natural join > temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) > where s.wmo_id=6065 > and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' > and '2004-1-1 0:0:0' between s.startdate and s.enddate; > > > I get wrong results (In the case where one of the records is missing in > one of the tables), BUT GOOD performance, and this query plan: > > Nested Loop (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1) > -> Nested Loop (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1) > -> Index Scan using wmo_idx on station (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1) > Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-0100:00:00'::timestamp without time zone <= enddate)) > -> Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486rows=1 loops=1) > Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp withouttime zone)) > -> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698rows=1 loops=1) > Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without timezone)) > Total runtime: 79.340 ms > (9 rows) > > > If further info like EXPLAIN VERBOSE is useful please say so and I will > provide it. > > Thanks in advance! > Kim Bisgaard. I still feel like you will have a problem with an outer join in this circumstance, because it will have to scan all of both tables. I think what you are wanting is "give me everything where station_id = X, and there is a row in either a or b". I think my LEFT JOIN example does that, but I also think there would be a subselect form which would work, which might do better. Something like: SELECT temp_max_60min,temp_dry_at_2m FROM (SELECT station_id, timeobs FROM station s WHERE s.wmo_id=6065 AND timeobs = '2004-1-1 0:0:0' AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate ) AS s JOIN (SELECT temp_max_60min, temp_dry_at_2m FROM temp_dry_at_2m a FULL OUTER JOIN temp_max_60min b USING (station_id, timeobs) WHERE station_id = s.station_id AND timeobs = '2004-1-1 0:0:0' ) ; If I did this correctly, you should have a very restrictive scan done on station, which only returns a few rows based on timeobs & station_id. But it might be better to turn that final FULL OUTER JOIN into 2 LEFT JOINs like I did the first time: SELECT temp_max_60min,temp_dry_at_2m FROM (SELECT station_id, timeobs FROM station s WHERE s.wmo_id=6065 AND timeobs = '2004-1-1 0:0:0' AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate ) AS s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs) LEFT JOIN temp_max_60min b USING (station_id, timeobs) ; I would hope postgres could do this from just my earlier plan. And I hope I understand what you want, such that 2 LEFT JOINS work better than your FULL OUTER JOIN. If you only want rows where one of both temp_dry or temp_max exist, you probably could just add the line: WHERE (temp_max_60_min IS NOT NULL OR temp_dry_at_2m IS NOT NULL) John =:->
Attachment
Kim Bisgaard <kib+pg@dmi.dk> writes: > I have two BIG tables (virtually identical) with 3 NOT NULL columns > Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) > and valid ANALYSE (set statistics=100). I want to join the two tables > with a FULL OUTER JOIN. I'm confused. If the columns are NOT NULL, why isn't this a valid transformation of your original query? > select temp_max_60min,temp_dry_at_2m > from station s natural join > temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) > where s.wmo_id=6065 > and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' > and '2004-1-1 0:0:0' between s.startdate and s.enddate; Seems like it's not eliminating any rows that would otherwise succeed. regards, tom lane
Hi, Look for my comments further down... John A Meinel wrote: > Kim Bisgaard wrote: > >> Hi, >> >> I'm having problems with the query optimizer and FULL OUTER JOIN on >> PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. >> I might be naive, but I think that it should be possible? >> >> I have two BIG tables (virtually identical) with 3 NOT NULL columns >> Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) >> and valid ANALYSE (set statistics=100). I want to join the two tables >> with a FULL OUTER JOIN. >> >> When I specify the query as: >> >> select temp_max_60min,temp_dry_at_2m >> from station s natural join >> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, >> timeobs) >> where s.wmo_id=6065 >> and timeobs='2004-1-1 0:0:0' >> and '2004-1-1 0:0:0' between s.startdate and s.enddate; >> >> I get the correct results, BUT LOUSY performance, and the following >> explain: >> >> Nested Loop Left Join (cost=5.84..163484.08 rows=1349 width=12) >> (actual time=66146.815..119005.381 rows=1 loops=1) >> Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 >> 00:00:00'::timestamp without time zone) >> -> Hash Join (cost=5.84..155420.24 rows=1349 width=16) (actual >> time=8644.449..110836.038 rows=109826 loops=1) > > > Well, the estimate here is quite a bit off. It thinks you will be > getting 1349 (which is probably why it picked a nested loop plan), but > then it is getting 109826 rows. > I'm guessing it is misunderstanding the selectivity of the timeobs > column. I think you are right.. > >> Hash Cond: ("outer".station_id = "inner".station_id) >> -> Seq Scan on temp_dry_at_2m a (cost=0.00..120615.94 >> rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 >> loops=1) >> -> Hash (cost=5.84..5.84 rows=1 width=4) (actual >> time=0.114..0.114 rows=0 loops=1) >> -> Index Scan using wmo_idx on station >> (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 >> loops=1) >> Index Cond: ((wmo_id = 6065) AND ('2004-01-01 >> 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 >> 00:00:00'::timestamp without time zone <= enddate)) >> -> Index Scan using temp_max_60min_idx on temp_max_60min b >> (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 >> loops=109826) >> Index Cond: (("outer".station_id = b.station_id) AND >> ("outer".timeobs = b.timeobs)) >> Total runtime: 119005.499 ms >> (11 rows) > > > I think the bigger problem is that a full outer join says grab all rows, > even if they are null. > > What about this query: > SELECT temp_max_60min,temp_dry_at_2m > FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs) > LEFT JOIN temp_max_60min b USING (station_id, timeobs) > where s.wmo_id=6065 > and timeobs='2004-1-1 0:0:0' > and '2004-1-1 0:0:0' between s.startdate and s.enddate; This works very well, and gives the correct result - thanks!! > > After that, you should probably have a multi-column index on > (station_id, timeobs), which lets postgres use just that index for the > lookup, rather than using an index and then a filter. (Looking at your > next query you might already have that index). Yes I have. > >> >> If I change the query to (and thus negates the full outer join): > > > This is the same query, I think you messed up your copy and paste. Nope. Changed "and timeobs='2004-1-1 0:0:0' " to "and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' " > >> >> select temp_max_60min,temp_dry_at_2m >> from station s natural join >> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, >> timeobs) >> where s.wmo_id=6065 >> and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' >> and '2004-1-1 0:0:0' between s.startdate and s.enddate; >> >> >> I get wrong results (In the case where one of the records is missing in >> one of the tables), BUT GOOD performance, and this query plan: >> >> Nested Loop (cost=0.00..17.83 rows=1 width=12) (actual >> time=79.221..79.236 rows=1 loops=1) >> -> Nested Loop (cost=0.00..11.82 rows=1 width=24) (actual >> time=65.517..65.526 rows=1 loops=1) >> -> Index Scan using wmo_idx on station (cost=0.00..5.83 >> rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1) >> Index Cond: ((wmo_id = 6065) AND ('2004-01-01 >> 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 >> 00:00:00'::timestamp without time zone <= enddate)) >> -> Index Scan using temp_max_60min_idx on temp_max_60min b >> (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 >> loops=1) >> Index Cond: (("outer".station_id = b.station_id) AND >> (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) >> -> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a >> (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 >> loops=1) >> Index Cond: (("outer".station_id = a.station_id) AND >> (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) >> Total runtime: 79.340 ms >> (9 rows) >> >> >> If further info like EXPLAIN VERBOSE is useful please say so and I will >> provide it. >> >> Thanks in advance! >> Kim Bisgaard. > > > I still feel like you will have a problem with an outer join in this > circumstance, because it will have to scan all of both tables. Maybe I misunderstand outer joins but since there are no rows with NULLs, I think it is a matter of finding the rows that are there or makeing one up one if there are no rows? > > I think what you are wanting is "give me everything where station_id = > X, and there is a row in either a or b". > I think my LEFT JOIN example does that, but I also think there would be > a subselect form which would work, which might do better. Something like: > > SELECT temp_max_60min,temp_dry_at_2m > FROM (SELECT station_id, timeobs FROM station s > WHERE s.wmo_id=6065 > AND timeobs = '2004-1-1 0:0:0' > AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate > ) AS s > JOIN (SELECT temp_max_60min, temp_dry_at_2m > FROM temp_dry_at_2m a > FULL OUTER JOIN temp_max_60min b > USING (station_id, timeobs) > WHERE station_id = s.station_id > AND timeobs = '2004-1-1 0:0:0' > ) > ; > > If I did this correctly, you should have a very restrictive scan done on > station, which only returns a few rows based on timeobs & station_id. > But it might be better to turn that final FULL OUTER JOIN into 2 LEFT > JOINs like I did the first time: > > SELECT temp_max_60min,temp_dry_at_2m > FROM (SELECT station_id, timeobs FROM station s > WHERE s.wmo_id=6065 > AND timeobs = '2004-1-1 0:0:0' > AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate > ) AS s > LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs) > LEFT JOIN temp_max_60min b USING (station_id, timeobs) > ; > > I would hope postgres could do this from just my earlier plan. And I > hope I understand what you want, such that 2 LEFT JOINS work better than > your FULL OUTER JOIN. If you only want rows where one of both temp_dry > or temp_max exist, you probably could just add the line: > > WHERE (temp_max_60_min IS NOT NULL OR temp_dry_at_2m IS NOT NULL) Thanks John. You have opened my eyes for a new way to formulate my queries! > > > John > =:-> -- Kim Bisgaard Computer Department Phone: +45 3915 7562 (direct) Danish Meteorological Institute Fax: +45 3915 7460 (division)
Sorry for not listing the exact layout of temp_XXXX:
Regards,
Kim Bisgaard.
Tom Lane wrote:
obsdb=> \d temp_dry_at_2m Table "public.temp_dry_at_2m" Column | Type | Modifiers ----------------+-----------------------------+----------- obshist_id | integer | not null station_id | integer | not null timeobs | timestamp without time zone | not null temp_dry_at_2m | real | not null Indexes: "temp_dry_at_2m_pkey" primary key, btree (obshist_id) "temp_dry_at_2m_idx" btree (station_id, timeobs)The difference between the two queries is if a (station_id,timeobs) row is missing in one table, then the first returns one record(null,9.3) while the second return no records.
Regards,
Kim Bisgaard.
Tom Lane wrote:
Kim Bisgaard <kib+pg@dmi.dk> writes:I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN.I'm confused. If the columns are NOT NULL, why isn't this a valid transformation of your original query?select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate;Seems like it's not eliminating any rows that would otherwise succeed. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
-- Kim Bisgaard Computer Department Phone: +45 3915 7562 (direct) Danish Meteorological Institute Fax: +45 3915 7460 (division)