Thread: full outer performance problem
Hi, I'm having problems with the query optimizer and FULL OUTER JOIN on PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. I might be naive, but I think that it should be possible? I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN. When I specify the query as: SELECT station_id, timeobs,temp_grass, temp_dry_at_2m FROM temp_dry_at_2m a FULL OUTER JOIN temp_grass b USING (station_id, timeobs) WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0' I get the correct results station_id | timeobs | temp_grass | temp_dry_at_2m ------------+---------------------+------------+---------------- 52944 | 2004-01-01 00:10:00 | | -1.1 (1 row) BUT LOUSY performance, and the following EXPLAIN: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32) (actual time=187176.408..201436.264 rows=1 loops=1) Merge Cond: (("outer".station_id = "inner".station_id) AND ("outer".timeobs = "inner".timeobs)) Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND (COALESCE("outer".timeobs, "inner".timeobs) ='2004-01-01 00:00:00'::timestamp without time zone)) -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual time=145748.253..153851.607 rows=6956994 loops=1) Sort Key: a.station_id, a.timeobs -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94 rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994loops=1) -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual time=31668.876..34491.123 rows=2406292 loops=1) Sort Key: b.station_id, b.timeobs -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292 width=16) (actual time=0.052..5484.489 rows=2406292loops=1) Total runtime: 201795.989 ms (10 rows) If I change the query (note the "b."s) explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m FROM temp_dry_at_2m a FULL OUTER JOIN temp_grass b USING (station_id, timeobs) WHERE b.station_id = 52981 AND b.timeobs = '2004-1-1 0:0:0' I seem to destroy the FULL OUTER JOIN and get wrong results (nothing) If I had happend to use "a.", and not "b.", I would have gotten correct results (by accident). The "a." variant gives this EXPLAIN: QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..11.97 rows=1 width=20) (actual time=0.060..0.067 rows=1 loops=1) -> Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a (cost=0.00..5.99 rows=1 width=16) (actual time=0.033..0.036rows=1 loops=1) Index Cond: ((station_id = 52981) AND (timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) -> Index Scan using temp_grass_idx on temp_grass b (cost=0.00..5.96 rows=1 width=16) (actual time=0.018..0.021 rows=1loops=1) Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs)) Total runtime: 0.140 ms (6 rows) Why will PostgreSQL not use the same plan for both these queries - they are virtually identical?? I have tried to formulate the problem with left joins, but this demands from me that I know which table has all the values (and thus has to go first), and in practice no such table excists. TIA, Kim Bisgaard.
On Wed, Jun 08, 2005 at 11:37:40 +0200, Kim Bisgaard <kib+pg@dmi.dk> wrote: > Hi, > > I'm having problems with the query optimizer and FULL OUTER JOIN on > PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. > I might be naive, but I think that it should be possible? > > I have two BIG tables (virtually identical) with 3 NOT NULL columns > Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, > TimeObs) and valid ANALYSE (set statistics=100). I want to join the two > tables with a FULL OUTER JOIN. > > When I specify the query as: > > SELECT station_id, timeobs,temp_grass, temp_dry_at_2m > FROM temp_dry_at_2m a > FULL OUTER JOIN temp_grass b > USING (station_id, timeobs) > WHERE station_id = 52981 > AND timeobs = '2004-1-1 0:0:0' > > I get the correct results > > station_id | timeobs | temp_grass | temp_dry_at_2m > ------------+---------------------+------------+---------------- > 52944 | 2004-01-01 00:10:00 | | -1.1 > (1 row) > > BUT LOUSY performance, and the following EXPLAIN: > > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32) > (actual time=187176.408..201436.264 rows=1 loops=1) > Merge Cond: (("outer".station_id = "inner".station_id) AND > ("outer".timeobs = "inner".timeobs)) > Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND > (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 > 00:00:00'::timestamp without time zone)) > -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual > time=145748.253..153851.607 rows=6956994 loops=1) > Sort Key: a.station_id, a.timeobs > -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94 > rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994 > loops=1) > -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual > time=31668.876..34491.123 rows=2406292 loops=1) > Sort Key: b.station_id, b.timeobs > -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292 > width=16) (actual time=0.052..5484.489 rows=2406292 loops=1) > Total runtime: 201795.989 ms > (10 rows) Someone else will need to comment on why Postgres can't use a more efficient plan. What I think will work for you is to restrict the station_id and timeobs on each side and then do a full join. You can try something like the sample query below (which hasn't been tested): SELECT station_id, timeobs, temp_grass, temp_dry_at_2m FROM (SELECT station_id, timeobs, temp_dry_at_2m FROM temp_dry_at_2m WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') a FULL OUTER JOIN (SELECT station_id, timeobs, temp_grass FROM temp_grass WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') b USING (station_id, timeobs)
Hi Bruno, Thanks for the moral support! I feel so too - but I am confident it will show up soon. W.r.t. your rewrite of the query, I get this "ERROR: could not devise a query plan for the given query" but no further details - I will try google Regards, Kim. Bruno Wolff III wrote: >On Wed, Jun 08, 2005 at 11:37:40 +0200, > Kim Bisgaard <kib+pg@dmi.dk> wrote: > > >>Hi, >> >>I'm having problems with the query optimizer and FULL OUTER JOIN on >>PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. >>I might be naive, but I think that it should be possible? >> >>I have two BIG tables (virtually identical) with 3 NOT NULL columns >>Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, >>TimeObs) and valid ANALYSE (set statistics=100). I want to join the two >>tables with a FULL OUTER JOIN. >> >>When I specify the query as: >> >>SELECT station_id, timeobs,temp_grass, temp_dry_at_2m >> FROM temp_dry_at_2m a >> FULL OUTER JOIN temp_grass b >> USING (station_id, timeobs) >> WHERE station_id = 52981 >> AND timeobs = '2004-1-1 0:0:0' >> >>I get the correct results >> >>station_id | timeobs | temp_grass | temp_dry_at_2m >>------------+---------------------+------------+---------------- >> 52944 | 2004-01-01 00:10:00 | | -1.1 >>(1 row) >> >>BUT LOUSY performance, and the following EXPLAIN: >> >> QUERY PLAN >>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ >>Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32) >>(actual time=187176.408..201436.264 rows=1 loops=1) >> Merge Cond: (("outer".station_id = "inner".station_id) AND >> ("outer".timeobs = "inner".timeobs)) >> Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND >> (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 >> 00:00:00'::timestamp without time zone)) >> -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual >> time=145748.253..153851.607 rows=6956994 loops=1) >> Sort Key: a.station_id, a.timeobs >> -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94 >> rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994 >> loops=1) >> -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual >> time=31668.876..34491.123 rows=2406292 loops=1) >> Sort Key: b.station_id, b.timeobs >> -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292 >> width=16) (actual time=0.052..5484.489 rows=2406292 loops=1) >>Total runtime: 201795.989 ms >>(10 rows) >> >> > >Someone else will need to comment on why Postgres can't use a more >efficient plan. What I think will work for you is to restrict >the station_id and timeobs on each side and then do a full join. >You can try something like the sample query below (which hasn't been tested): >SELECT station_id, timeobs, temp_grass, temp_dry_at_2m > FROM > (SELECT station_id, timeobs, temp_dry_at_2m > FROM temp_dry_at_2m > WHERE > station_id = 52981 > AND > timeobs = '2004-1-1 0:0:0') a > FULL OUTER JOIN > (SELECT station_id, timeobs, temp_grass > FROM temp_grass > WHERE > station_id = 52981 > AND > timeobs = '2004-1-1 0:0:0') b > USING (station_id, timeobs) > >---------------------------(end of broadcast)--------------------------- >TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > > >
Kim Bisgaard <kib+pg@dmi.dk> writes: > SELECT station_id, timeobs,temp_grass, temp_dry_at_2m > FROM temp_dry_at_2m a > FULL OUTER JOIN temp_grass b > USING (station_id, timeobs) > WHERE station_id = 52981 > AND timeobs = '2004-1-1 0:0:0' > explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m > FROM temp_dry_at_2m a > FULL OUTER JOIN temp_grass b > USING (station_id, timeobs) > WHERE b.station_id = 52981 > AND b.timeobs = '2004-1-1 0:0:0' > Why will PostgreSQL not use the same plan for both these queries - they > are virtually identical?? Because they're semantically completely different. The second query is effectively a RIGHT JOIN, because join rows in which b is all-null will be thrown away by the WHERE. The optimizer sees this (note your second plan doesn't use a Full Join step anywhere) and is able to produce a much better plan. Full outer join is difficult to optimize, in part because we have no choice but to use a merge join for it --- the other join types don't support full join. regards, tom lane
Kim Bisgaard <kib+pg@dmi.dk> writes: > W.r.t. your rewrite of the query, I get this "ERROR: could not devise a > query plan for the given query" but no further details - I will try google Which PG version are you using again? That should be fixed in 7.4.3 and later. regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > Kim Bisgaard <kib+pg@dmi.dk> writes: > > SELECT station_id, timeobs,temp_grass, temp_dry_at_2m > > FROM temp_dry_at_2m a > > FULL OUTER JOIN temp_grass b > > USING (station_id, timeobs) > > WHERE station_id = 52981 > > AND timeobs = '2004-1-1 0:0:0' > > > explain analyse SELECT b.station_id, b.timeobs,temp_grass, temp_dry_at_2m > > FROM temp_dry_at_2m a > > FULL OUTER JOIN temp_grass b > > USING (station_id, timeobs) > > WHERE b.station_id = 52981 > > AND b.timeobs = '2004-1-1 0:0:0' > > > Why will PostgreSQL not use the same plan for both these queries - they > > are virtually identical?? > > Because they're semantically completely different. The second query is > effectively a RIGHT JOIN, because join rows in which b is all-null will > be thrown away by the WHERE. The optimizer sees this (note your second > plan doesn't use a Full Join step anywhere) and is able to produce a > much better plan. Full outer join is difficult to optimize, in part > because we have no choice but to use a merge join for it --- the other > join types don't support full join. > > regards, tom lane > Yes I am aware that they are not "identical", they also give different results, but the data nessesary to compute the results is (0-2 rows, 0-1 row from each table), and thus ideally have the potential to have similar performance - to my head anyway, but I may not have grasped the complete picture yet :-) Regards, Kim.
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > Kim Bisgaard <kib+pg@dmi.dk> writes: > > W.r.t. your rewrite of the query, I get this "ERROR: could not devise a > > query plan for the given query" but no further details - I will try google > > Which PG version are you using again? That should be fixed in 7.4.3 > and later. > > regards, tom lane > Its 7.4.1. I am in the process (may take a while yet) of installing 8.0.3 on the same hardware in order to have a parallel system. Time is a finite meassure :-) I must admit I would rather have the first query perform, that have this workaround function ;-) Regards, Kim.