Re: full outer performance problem - Mailing list pgsql-performance
From | Bruno Wolff III |
---|---|
Subject | Re: full outer performance problem |
Date | |
Msg-id | 20050608121755.GB11464@wolff.to Whole thread Raw |
In response to | full outer performance problem (Kim Bisgaard <kib+pg@dmi.dk>) |
Responses |
Re: full outer performance problem
|
List | pgsql-performance |
On Wed, Jun 08, 2005 at 11:37:40 +0200, Kim Bisgaard <kib+pg@dmi.dk> wrote: > Hi, > > I'm having problems with the query optimizer and FULL OUTER JOIN on > PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. > I might be naive, but I think that it should be possible? > > I have two BIG tables (virtually identical) with 3 NOT NULL columns > Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, > TimeObs) and valid ANALYSE (set statistics=100). I want to join the two > tables with a FULL OUTER JOIN. > > When I specify the query as: > > SELECT station_id, timeobs,temp_grass, temp_dry_at_2m > FROM temp_dry_at_2m a > FULL OUTER JOIN temp_grass b > USING (station_id, timeobs) > WHERE station_id = 52981 > AND timeobs = '2004-1-1 0:0:0' > > I get the correct results > > station_id | timeobs | temp_grass | temp_dry_at_2m > ------------+---------------------+------------+---------------- > 52944 | 2004-01-01 00:10:00 | | -1.1 > (1 row) > > BUT LOUSY performance, and the following EXPLAIN: > > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > Merge Full Join (cost=1542369.83..1618958.58 rows=6956994 width=32) > (actual time=187176.408..201436.264 rows=1 loops=1) > Merge Cond: (("outer".station_id = "inner".station_id) AND > ("outer".timeobs = "inner".timeobs)) > Filter: ((COALESCE("outer".station_id, "inner".station_id) = 52981) AND > (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 > 00:00:00'::timestamp without time zone)) > -> Sort (cost=1207913.44..1225305.93 rows=6956994 width=16) (actual > time=145748.253..153851.607 rows=6956994 loops=1) > Sort Key: a.station_id, a.timeobs > -> Seq Scan on temp_dry_at_2m a (cost=0.00..117549.94 > rows=6956994 width=16) (actual time=0.049..54226.770 rows=6956994 > loops=1) > -> Sort (cost=334456.38..340472.11 rows=2406292 width=16) (actual > time=31668.876..34491.123 rows=2406292 loops=1) > Sort Key: b.station_id, b.timeobs > -> Seq Scan on temp_grass b (cost=0.00..40658.92 rows=2406292 > width=16) (actual time=0.052..5484.489 rows=2406292 loops=1) > Total runtime: 201795.989 ms > (10 rows) Someone else will need to comment on why Postgres can't use a more efficient plan. What I think will work for you is to restrict the station_id and timeobs on each side and then do a full join. You can try something like the sample query below (which hasn't been tested): SELECT station_id, timeobs, temp_grass, temp_dry_at_2m FROM (SELECT station_id, timeobs, temp_dry_at_2m FROM temp_dry_at_2m WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') a FULL OUTER JOIN (SELECT station_id, timeobs, temp_grass FROM temp_grass WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') b USING (station_id, timeobs)
pgsql-performance by date: