Thread: full outer performance problem

full outer performance problem

From
Kim Bisgaard
Date:
Hi,

I'm having problems with the query optimizer and FULL OUTER JOIN on PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. I might be naive, but I think that it should be possible?

I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN.

When I specify the query as:
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065 
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;
I get the correct results, BUT LOUSY performance, and the following explain:
 Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1)  Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone)  ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1)        Hash Cond: ("outer".station_id = "inner".station_id)        ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 loops=1)        ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1)              ->  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 loops=1)                    Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))  ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826)        Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))Total runtime: 119005.499 ms
(11 rows)
If I change the query to (and thus negates the full outer join):
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065 
and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' 
and '2004-1-1 0:0:0' between s.startdate and s.enddate;

I get wrong results (In the case where one of the records is missing in one of the tables), BUT GOOD performance, and this query plan:
 Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1)  ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1)        ->  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1)              Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate))        ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1)              Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))  ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1)        Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))Total runtime: 79.340 ms
(9 rows)

If further info like EXPLAIN VERBOSE is useful please say so and I will provide it.

Thanks in advance!
Kim Bisgaard.

Re: full outer performance problem

From
John A Meinel
Date:
Kim Bisgaard wrote:
> Hi,
>
> I'm having problems with the query optimizer and FULL OUTER JOIN on
> PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
> I might be naive, but I think that it should be possible?
>
> I have two BIG tables (virtually identical) with 3 NOT NULL columns
> Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs)
> and valid ANALYSE (set statistics=100). I want to join the two tables
> with a FULL OUTER JOIN.
>
> When I specify the query as:
>
> select temp_max_60min,temp_dry_at_2m
> from station s natural join
> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
> where s.wmo_id=6065
> and timeobs='2004-1-1 0:0:0'
> and '2004-1-1 0:0:0' between s.startdate and s.enddate;
>
> I get the correct results, BUT LOUSY performance, and the following explain:
>
>  Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1)
>    Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone)
>    ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1)

Well, the estimate here is quite a bit off. It thinks you will be
getting 1349 (which is probably why it picked a nested loop plan), but
then it is getting 109826 rows.
I'm guessing it is misunderstanding the selectivity of the timeobs column.

>          Hash Cond: ("outer".station_id = "inner".station_id)
>          ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 width=16) (actual
time=0.024..104548.515rows=6956994 loops=1) 
>          ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1)
>                ->  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108
rows=1loops=1) 
>                      Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >=
startdate)AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) 
>    ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.96 rows=1 width=20) (actual
time=0.071..0.071rows=0 loops=109826) 
>          Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs = b.timeobs))
>  Total runtime: 119005.499 ms
> (11 rows)

I think the bigger problem is that a full outer join says grab all rows,
even if they are null.

What about this query:
SELECT temp_max_60min,temp_dry_at_2m
   FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
        LEFT JOIN temp_max_60min b USING (station_id, timeobs)
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;

After that, you should probably have a multi-column index on
(station_id, timeobs), which lets postgres use just that index for the
lookup, rather than using an index and then a filter. (Looking at your
next query you might already have that index).

>
> If I change the query to (and thus negates the full outer join):

This is the same query, I think you messed up your copy and paste.

>
> select temp_max_60min,temp_dry_at_2m
> from station s natural join
> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
> where s.wmo_id=6065
> and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0'
> and '2004-1-1 0:0:0' between s.startdate and s.enddate;
>
>
> I get wrong results (In the case where one of the records is missing in
> one of the tables), BUT GOOD performance, and this query plan:
>
>  Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1)
>    ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1)
>          ->  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1
loops=1)
>                Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND
('2004-01-0100:00:00'::timestamp without time zone <= enddate)) 
>          ->  Index Scan using temp_max_60min_idx on temp_max_60min b  (cost=0.00..5.97 rows=1 width=20) (actual
time=65.483..65.486rows=1 loops=1) 
>                Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp
withouttime zone)) 
>    ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  (cost=0.00..6.00 rows=1 width=16) (actual
time=13.694..13.698rows=1 loops=1) 
>          Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without
timezone)) 
>  Total runtime: 79.340 ms
> (9 rows)
>
>
> If further info like EXPLAIN VERBOSE is useful please say so and I will
> provide it.
>
> Thanks in advance!
> Kim Bisgaard.

I still feel like you will have a problem with an outer join in this
circumstance, because it will have to scan all of both tables.

I think what you are wanting is "give me everything where station_id =
X, and there is a row in either a or b".
I think my LEFT JOIN example does that, but I also think there would be
a subselect form which would work, which might do better. Something like:

SELECT temp_max_60min,temp_dry_at_2m
    FROM (SELECT station_id, timeobs FROM station s
           WHERE s.wmo_id=6065
             AND timeobs = '2004-1-1 0:0:0'
             AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
          ) AS s
    JOIN (SELECT temp_max_60min, temp_dry_at_2m
        FROM temp_dry_at_2m a
        FULL OUTER JOIN temp_max_60min b
        USING (station_id, timeobs)
        WHERE station_id = s.station_id
          AND timeobs = '2004-1-1 0:0:0'
         )
;

If I did this correctly, you should have a very restrictive scan done on
station, which only returns a few rows based on timeobs & station_id.
But it might be better to turn that final FULL OUTER JOIN into 2 LEFT
JOINs like I did the first time:

SELECT temp_max_60min,temp_dry_at_2m
    FROM (SELECT station_id, timeobs FROM station s
           WHERE s.wmo_id=6065
             AND timeobs = '2004-1-1 0:0:0'
             AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
          ) AS s
    LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
    LEFT JOIN temp_max_60min b USING (station_id, timeobs)
;

I would hope postgres could do this from just my earlier plan. And I
hope I understand what you want, such that 2 LEFT JOINS work better than
your FULL OUTER JOIN. If you only want rows where one of both temp_dry
or temp_max exist, you probably could just add the line:

WHERE (temp_max_60_min IS NOT NULL OR temp_dry_at_2m IS NOT NULL)


John
=:->

Attachment

Re: full outer performance problem

From
Tom Lane
Date:
Kim Bisgaard <kib+pg@dmi.dk> writes:
> I have two BIG tables (virtually identical) with 3 NOT NULL columns
> Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs)
> and valid ANALYSE (set statistics=100). I want to join the two tables
> with a FULL OUTER JOIN.

I'm confused.  If the columns are NOT NULL, why isn't this a valid
transformation of your original query?

> select temp_max_60min,temp_dry_at_2m
> from station s natural join
> temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
> where s.wmo_id=6065
> and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0'
> and '2004-1-1 0:0:0' between s.startdate and s.enddate;

Seems like it's not eliminating any rows that would otherwise succeed.

            regards, tom lane

Re: full outer performance problem

From
Kim Bisgaard
Date:
Hi,

Look for my comments further down...

John A Meinel wrote:

> Kim Bisgaard wrote:
>
>> Hi,
>>
>> I'm having problems with the query optimizer and FULL OUTER JOIN on
>> PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
>> I might be naive, but I think that it should be possible?
>>
>> I have two BIG tables (virtually identical) with 3 NOT NULL columns
>> Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs)
>> and valid ANALYSE (set statistics=100). I want to join the two tables
>> with a FULL OUTER JOIN.
>>
>> When I specify the query as:
>>
>> select temp_max_60min,temp_dry_at_2m
>> from station s natural join
>> temp_dry_at_2m a full outer join temp_max_60min b using (station_id,
>> timeobs)
>> where s.wmo_id=6065
>> and timeobs='2004-1-1 0:0:0'
>> and '2004-1-1 0:0:0' between s.startdate and s.enddate;
>>
>> I get the correct results, BUT LOUSY performance, and the following
>> explain:
>>
>>  Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12)
>> (actual time=66146.815..119005.381 rows=1 loops=1)
>>    Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01
>> 00:00:00'::timestamp without time zone)
>>    ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual
>> time=8644.449..110836.038 rows=109826 loops=1)
>
>
> Well, the estimate here is quite a bit off. It thinks you will be
> getting 1349 (which is probably why it picked a nested loop plan), but
> then it is getting 109826 rows.
> I'm guessing it is misunderstanding the selectivity of the timeobs
> column.

I think you are right..

>
>>          Hash Cond: ("outer".station_id = "inner".station_id)
>>          ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94
>> rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994
>> loops=1)
>>          ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual
>> time=0.114..0.114 rows=0 loops=1)
>>                ->  Index Scan using wmo_idx on station
>> (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1
>> loops=1)
>>                      Index Cond: ((wmo_id = 6065) AND ('2004-01-01
>> 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01
>> 00:00:00'::timestamp without time zone <= enddate))
>>    ->  Index Scan using temp_max_60min_idx on temp_max_60min b
>> (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0
>> loops=109826)
>>          Index Cond: (("outer".station_id = b.station_id) AND
>> ("outer".timeobs = b.timeobs))
>>  Total runtime: 119005.499 ms
>> (11 rows)
>
>
> I think the bigger problem is that a full outer join says grab all rows,
> even if they are null.
>
> What about this query:
> SELECT temp_max_60min,temp_dry_at_2m
>   FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
>        LEFT JOIN temp_max_60min b USING (station_id, timeobs)
> where s.wmo_id=6065
> and timeobs='2004-1-1 0:0:0'
> and '2004-1-1 0:0:0' between s.startdate and s.enddate;

This works very well, and gives the correct result - thanks!!

>
> After that, you should probably have a multi-column index on
> (station_id, timeobs), which lets postgres use just that index for the
> lookup, rather than using an index and then a filter. (Looking at your
> next query you might already have that index).

Yes I have.

>
>>
>> If I change the query to (and thus negates the full outer join):
>
>
> This is the same query, I think you messed up your copy and paste.

Nope. Changed "and timeobs='2004-1-1 0:0:0' " to "and
a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' "

>
>>
>> select temp_max_60min,temp_dry_at_2m
>> from station s natural join
>> temp_dry_at_2m a full outer join temp_max_60min b using (station_id,
>> timeobs)
>> where s.wmo_id=6065
>> and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0'
>> and '2004-1-1 0:0:0' between s.startdate and s.enddate;
>>
>>
>> I get wrong results (In the case where one of the records is missing in
>> one of the tables), BUT GOOD performance, and this query plan:
>>
>>  Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual
>> time=79.221..79.236 rows=1 loops=1)
>>    ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual
>> time=65.517..65.526 rows=1 loops=1)
>>          ->  Index Scan using wmo_idx on station  (cost=0.00..5.83
>> rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1)
>>                Index Cond: ((wmo_id = 6065) AND ('2004-01-01
>> 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01
>> 00:00:00'::timestamp without time zone <= enddate))
>>          ->  Index Scan using temp_max_60min_idx on temp_max_60min b
>> (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1
>> loops=1)
>>                Index Cond: (("outer".station_id = b.station_id) AND
>> (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
>>    ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a
>> (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1
>> loops=1)
>>          Index Cond: (("outer".station_id = a.station_id) AND
>> (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone))
>>  Total runtime: 79.340 ms
>> (9 rows)
>>
>>
>> If further info like EXPLAIN VERBOSE is useful please say so and I will
>> provide it.
>>
>> Thanks in advance!
>> Kim Bisgaard.
>
>
> I still feel like you will have a problem with an outer join in this
> circumstance, because it will have to scan all of both tables.

Maybe I misunderstand outer joins but since there are no rows with
NULLs, I think it is a matter of finding the rows that are there or
makeing one up one if there are no rows?

>
> I think what you are wanting is "give me everything where station_id =
> X, and there is a row in either a or b".
> I think my LEFT JOIN example does that, but I also think there would be
> a subselect form which would work, which might do better. Something like:
>
> SELECT temp_max_60min,temp_dry_at_2m
>     FROM (SELECT station_id, timeobs FROM station s
>            WHERE s.wmo_id=6065
>              AND timeobs = '2004-1-1 0:0:0'
>              AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
>           ) AS s
>     JOIN (SELECT temp_max_60min, temp_dry_at_2m
>         FROM temp_dry_at_2m a
>         FULL OUTER JOIN temp_max_60min b
>         USING (station_id, timeobs)
>         WHERE station_id = s.station_id
>           AND timeobs = '2004-1-1 0:0:0'
>          )
> ;
>
> If I did this correctly, you should have a very restrictive scan done on
> station, which only returns a few rows based on timeobs & station_id.
> But it might be better to turn that final FULL OUTER JOIN into 2 LEFT
> JOINs like I did the first time:
>
> SELECT temp_max_60min,temp_dry_at_2m
>     FROM (SELECT station_id, timeobs FROM station s
>            WHERE s.wmo_id=6065
>              AND timeobs = '2004-1-1 0:0:0'
>              AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
>           ) AS s
>     LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
>     LEFT JOIN temp_max_60min b USING (station_id, timeobs)
> ;
>
> I would hope postgres could do this from just my earlier plan. And I
> hope I understand what you want, such that 2 LEFT JOINS work better than
> your FULL OUTER JOIN. If you only want rows where one of both temp_dry
> or temp_max exist, you probably could just add the line:
>
> WHERE (temp_max_60_min IS NOT NULL OR temp_dry_at_2m IS NOT NULL)


Thanks John. You have opened my eyes for a new way to formulate my queries!

>
>
> John
> =:->


--
Kim Bisgaard

Computer Department                  Phone: +45 3915 7562 (direct)
Danish Meteorological Institute      Fax: +45 3915 7460 (division)


Re: full outer performance problem

From
Kim Bisgaard
Date:
Sorry for not listing the exact layout of temp_XXXX:
obsdb=> \d temp_dry_at_2m
              Table "public.temp_dry_at_2m"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 obshist_id     | integer                     | not null
 station_id     | integer                     | not null
 timeobs        | timestamp without time zone | not null
 temp_dry_at_2m | real                        | not null
Indexes:
    "temp_dry_at_2m_pkey" primary key, btree (obshist_id)
    "temp_dry_at_2m_idx" btree (station_id, timeobs)
The difference between the two queries is if a (station_id,timeobs) row is missing in one table, then the first returns one record(null,9.3) while the second return no records.

Regards,
Kim Bisgaard.

Tom Lane wrote:
Kim Bisgaard <kib+pg@dmi.dk> writes: 
I have two BIG tables (virtually identical) with 3 NOT NULL columns 
Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs) 
and valid ANALYSE (set statistics=100). I want to join the two tables 
with a FULL OUTER JOIN.   
I'm confused.  If the columns are NOT NULL, why isn't this a valid
transformation of your original query?
 
select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065 
and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' 
and '2004-1-1 0:0:0' between s.startdate and s.enddate;   
Seems like it's not eliminating any rows that would otherwise succeed.
		regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate     subscribe-nomail command to majordomo@postgresql.org so that your     message can get through to the mailing list cleanly
 

-- 
Kim Bisgaard

Computer Department                  Phone: +45 3915 7562 (direct)
Danish Meteorological Institute      Fax: +45 3915 7460 (division)