Thread: Left Outer Join much faster than non-outer Join?

Left Outer Join much faster than non-outer Join?

From
rm_pg@cheapcomplexdevices.com
Date:
Can anyone please help me make my JOIN find the right index to use?

It seems strange to me that in the two queries listed below, the
LEFT OUTER JOIN can find the most efficient index to use, while
the unadorned JOIN can not.   The result is that my query is
orders of magnitude slower than it seems it should be.



The table "tlid_smaller" (\d and explain analyze shown below) is a
large table contining integer IDs just like the fact table of any
traditional star-schema warehouse.

The tables *_lookup are simply tables that map strings to IDs, with
unique IDs associating strings to the IDs.

The table "tlid_smaller" has an index on (streetname_id, city_id) that
is extremely efficient at finding the desired row.  When I use a "LEFT
OUTER JOIN", the optimizer happily sees that it can use this index.
This is shown in the first explain analyze below.  However when I
simply do a "JOIN" the optimizer does not use this index and rather
does a hash join comparing thousands of rows.

Note that the cost estimate using the good index is much better
(16.94 vs 29209.16 thousands of times better).  Any ideas why
the non-outer join didn't use it?






fli=# explain analyze
 select *
     from streetname_lookup as sl
     join city_lookup as cl on (true)
     left outer join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and
cl.geo_city_id=ts.geo_city_id)
     where  str_name='alamo' and  city='san antonio' and state='TX'
;
fli-# fli-# fli-# fli-# fli-# fli-#                                                                           QUERY
PLAN                                                                     \ 


---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..16.94 rows=1 width=74) (actual time=0.115..0.539 rows=78 loops=1)
   ->  Nested Loop  (cost=0.00..9.03 rows=1 width=42) (actual time=0.077..0.084 rows=1 loops=1)
         ->  Index Scan using streetname_lookup__str_name on streetname_lookup sl  (cost=0.00..3.01 rows=1 width=19)
(actualtime=0.042..0.044 rows=1 loops=1) 
               Index Cond: (str_name = 'alamo'::text)
         ->  Index Scan using city_lookup__name on city_lookup cl  (cost=0.00..6.01 rows=1 width=23) (actual
time=0.026..0.028rows=1 loops=1) 
               Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
   ->  Index Scan using tlid_smaller__street_city on tlid_smaller ts  (cost=0.00..7.86 rows=3 width=32) (actual
time=0.029..0.176rows=78 loops=1) 
         Index Cond: (("outer".geo_streetname_id = ts.geo_streetname_id) AND ("outer".geo_city_id = ts.geo_city_id))
 Total runtime: 0.788 ms
(9 rows)


fli=#
fli=# explain analyze
 select *
     from streetname_lookup as sl
     join city_lookup as cl on (true)
     join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
     where  str_name='alamo' and  city='san antonio' and state='TX'
;
fli-# fli-# fli-# fli-# fli-# fli-#                                                                              QUERY
PLAN                                                                  \ 


---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=6.01..29209.16 rows=1 width=74) (actual time=9.421..28.154 rows=78 loops=1)
   Hash Cond: ("outer".geo_city_id = "inner".geo_city_id)
   ->  Nested Loop  (cost=0.00..29202.88 rows=52 width=51) (actual time=0.064..23.296 rows=4151 loops=1)
         ->  Index Scan using streetname_lookup__str_name on streetname_lookup sl  (cost=0.00..3.01 rows=1 width=19)
(actualtime=0.025..0.032 rows=1 loops=1) 
               Index Cond: (str_name = 'alamo'::text)
         ->  Index Scan using tlid_smaller__street_zipint on tlid_smaller ts  (cost=0.00..28994.70 rows=16413 width=32)
(actualtime=0.028..8.153 rows=4151 loops=1) 
               Index Cond: ("outer".geo_streetname_id = ts.geo_streetname_id)
   ->  Hash  (cost=6.01..6.01 rows=1 width=23) (actual time=0.073..0.073 rows=0 loops=1)
         ->  Index Scan using city_lookup__name on city_lookup cl  (cost=0.00..6.01 rows=1 width=23) (actual
time=0.065..0.067rows=1 loops=1) 
               Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
 Total runtime: 28.367 ms
(11 rows)

fli=#

fli=#



fli=# \d tlid_smaller
        Table "geo.tlid_smaller"
      Column       |  Type   | Modifiers
-------------------+---------+-----------
 tlid              | integer |
 geo_streetname_id | integer |
 geo_streettype_id | integer |
 geo_city_id       | integer |
 zipint            | integer |
 tigerfile         | integer |
 low               | integer |
 high              | integer |
Indexes:
    "tlid_smaller__city" btree (geo_city_id)
    "tlid_smaller__street_city" btree (geo_streetname_id, geo_city_id)
    "tlid_smaller__street_zipint" btree (geo_streetname_id, zipint)
    "tlid_smaller__tlid" btree (tlid)


Re: Left Outer Join much faster than non-outer Join?

From
Ron Mayer
Date:
Setting join_collapse_limit=1 improves my performance dramatically.

Even on a query with only 3 tables.

This surprised me, since there are only 3 tables being joined, I would
have assumed that the optimizer would have done the exhaustive search
and not used geqo stuff - and that this exhaustive search would have
found the good plan.

Any reason it didn't?   Explain analyze results shown below.



On Wed, 30 Mar 2005 rm_pg@cheapcomplexdevices.com wrote:
>
> Can anyone please help me make my JOIN find the right index to use?
>

fli=# set join_collapse_limit=1;
SET
fli=# explain analyze
 select *
     from streetname_lookup as sl
     join city_lookup as cl on (true)
     join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
     where  str_name='alamo' and  city='san antonio' and state='TX'
;
fli-# fli-# fli-# fli-# fli-# fli-#                                                                           QUERY
PLAN                                                                     \ 


---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16.94 rows=1 width=74) (actual time=0.116..0.528 rows=78 loops=1)
   ->  Nested Loop  (cost=0.00..9.03 rows=1 width=42) (actual time=0.079..0.086 rows=1 loops=1)
         ->  Index Scan using streetname_lookup__str_name on streetname_lookup sl  (cost=0.00..3.01 rows=1 width=19)
(actualtime=0.042..0.044 rows=1 loops=1) 
               Index Cond: (str_name = 'alamo'::text)
         ->  Index Scan using city_lookup__name on city_lookup cl  (cost=0.00..6.01 rows=1 width=23) (actual
time=0.026..0.028rows=1 loops=1) 
               Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
   ->  Index Scan using tlid_smaller__street_city on tlid_smaller ts  (cost=0.00..7.86 rows=3 width=32) (actual
time=0.031..0.181rows=78 loops=1) 
         Index Cond: (("outer".geo_streetname_id = ts.geo_streetname_id) AND ("outer".geo_city_id = ts.geo_city_id))
 Total runtime: 0.709 ms
(9 rows)


--------[with the default join_collapse_limit]-----------
> fli=# explain analyze
>  select *
>      from streetname_lookup as sl
>      join city_lookup as cl on (true)
>      join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
>      where  str_name='alamo' and  city='san antonio' and state='TX'
> ;
> fli-# fli-# fli-# fli-# fli-# fli-#
QUERYPLAN                                                                   \ 
>
>
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=6.01..29209.16 rows=1 width=74) (actual time=9.421..28.154 rows=78 loops=1)
>    Hash Cond: ("outer".geo_city_id = "inner".geo_city_id)
>    ->  Nested Loop  (cost=0.00..29202.88 rows=52 width=51) (actual time=0.064..23.296 rows=4151 loops=1)
>          ->  Index Scan using streetname_lookup__str_name on streetname_lookup sl  (cost=0.00..3.01 rows=1 width=19)
(actualtime=0.025..0.032 rows=1 loops=1) 
>                Index Cond: (str_name = 'alamo'::text)
>          ->  Index Scan using tlid_smaller__street_zipint on tlid_smaller ts  (cost=0.00..28994.70 rows=16413
width=32)(actual time=0.028..8.153 rows=4151 loops=1) 
>                Index Cond: ("outer".geo_streetname_id = ts.geo_streetname_id)
>    ->  Hash  (cost=6.01..6.01 rows=1 width=23) (actual time=0.073..0.073 rows=0 loops=1)
>          ->  Index Scan using city_lookup__name on city_lookup cl  (cost=0.00..6.01 rows=1 width=23) (actual
time=0.065..0.067rows=1 loops=1) 
>                Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
>  Total runtime: 28.367 ms
> (11 rows)
>

Re: Left Outer Join much faster than non-outer Join?

From
Tom Lane
Date:
rm_pg@cheapcomplexdevices.com writes:
>  select *
>      from streetname_lookup as sl
>      join city_lookup as cl on (true)
>      left outer join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and
cl.geo_city_id=ts.geo_city_id)
>      where  str_name='alamo' and  city='san antonio' and state='TX'
> ;

That's a fairly odd query; why don't you have any join condition between
streetname_lookup and city_lookup?

The planner won't consider Cartesian joins unless forced to, which is
why it fails to consider the join order "((sl join cl) join ts)" unless
you have an outer join in the mix.  I think that's generally a good
heuristic, and am disinclined to remove it ...

            regards, tom lane

Re: Left Outer Join much faster than non-outer Join?

From
Ron Mayer
Date:
Tom Lane wrote:
> rm_pg@cheapcomplexdevices.com writes:
>
>> select *
>>     from streetname_lookup as sl
>>     join city_lookup as cl on (true)
>>     left outer join tlid_smaller as ts on (sl.geo_streetname_id = ts.geo_streetname_id and
cl.geo_city_id=ts.geo_city_id)
>>     where  str_name='alamo' and  city='san antonio' and state='TX'
>>;
>
> That's a fairly odd query;

I think it's a very common type of query in data warehousing.

It's reasonably typical of a traditional star schema where
"streetname_lookup" and "city_lookup" are dimension tables
and "tlid_smaller" is the central fact table.

> why don't you have any join condition between
> streetname_lookup and city_lookup?

Those two tables shared no data.   They merely get the "id"s
for looking things up in the much larger central table.

Unique indexes on the city_lookup and street_lookup make the
cartesian join harmless (they each return only 1 value); and
the huge fact table has a multi-column index that takes both
of the ids from those lookups.


With the tables I have (shown below), how else could one
efficiently fetch the data for "Main St" "San Francisco"?

   streetname_lookup
   (for every street name used in the country)
       streetid  |  name  | type
       ----------+--------+------
       1         |  Main  | St
       2         |  1st   | St

   city_lookup
   (for every city name used in the country)
       cityid  |  name   | state
       --------+---------+------
       1       |  Boston | MA
       2       |  Alameda| CA


   tlid_smaller
   (containing a record for every city block in the country)
    city_id | street_id  | addresses | demographics, etc.
    --------+------------+-----------+----------------------
         1  |    1       | 100 block | [lots of columns]
         1  |    1       | 200 block | [lots of columns]
         1  |    1       | 300 block | [lots of columns]
         1  |    2       | 100 block | [lots of columns]
         1  |    2       | 100 block | [lots of columns]

> The planner won't consider Cartesian joins unless forced to, which is
> why it fails to consider the join order "((sl join cl) join ts)" unless
> you have an outer join in the mix.  I think that's generally a good
> heuristic, and am disinclined to remove it ...

IMHO it's a shame it doesn't even consider it when the estimated
results are very small.   I think often joins that merely look up
IDs would be useful to consider for the purpose of making potential
multi-column indexes (as shown in the previous email's explain
analyze result where the cartesian join was 30X faster than the
other approach since it could use the multi-column index on the
very large table).

     Ron

Re: Left Outer Join much faster than non-outer Join?

From
Ron Mayer
Date:
Ron Mayer wrote:
> Tom Lane wrote:
>> rm_pg@cheapcomplexdevices.com writes:
>>> select *
>>>     from streetname_lookup as sl
>>>     join city_lookup as cl on (true)
>>>     left outer join tlid_smaller as ts on (sl.geo_streetname_id =
>>>          ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
>>>     where  str_name='alamo' and  city='san antonio' and state='TX'
>>> ;
>> That's a fairly odd query;
>
>
> I think it's a very common type of query in data warehousing.
>
> It's reasonably typical of a traditional star schema where
> "streetname_lookup" and "city_lookup" are dimension tables
> and "tlid_smaller" is the central fact table.

Although looking again I must admit the query was
written unconventionally.  Perhaps those queries are
remnants dating back to a version when you could
force join orders this way?

Perhaps a more common way of writing it would have been:

   select * from tlid_smaller
    where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='$str_name')
      and geo_city_id       in (select geo_city_id from city_lookup where city='$city' and state='$state');

However this query also fails to use the multi-column
index on (geo_streetname_id,geo_city_id).  Explain
analyze shown below.


In cases where I can be sure only one result will come
from each of the lookup queries I guess I can do this:

   select * from tlid_smaller
    where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='$str_name')
      and geo_city_id       = (select geo_city_id from city_lookup where city='$city' and state='$state');

which has the nicest plan of them all (explain analyze
also shown below).

 > With the tables I have (shown below), how else could one
 > efficiently fetch the data for "Main St" "San Francisco"?

I guess I just answered that question myself.  Where possible,
I'll write my queries this way.

      Thanks,
      Ron


fli=# fli=# explain analyze  select * from tlid_smaller
         where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='alamo')
           and geo_city_id       in (select geo_city_id from city_lookup where city='san antonio' and state='TX');
fli-# fli-#                                                                              QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Hash IN Join  (cost=9.03..29209.16 rows=1 width=32) (actual time=76.576..96.605 rows=78 loops=1)
    Hash Cond: ("outer".geo_city_id = "inner".geo_city_id)
    ->  Nested Loop  (cost=3.02..29202.88 rows=52 width=32) (actual time=65.877..91.789 rows=4151 loops=1)
          ->  HashAggregate  (cost=3.02..3.02 rows=1 width=4) (actual time=0.039..0.042 rows=1 loops=1)
                ->  Index Scan using streetname_lookup__str_name on streetname_lookup  (cost=0.00..3.01 rows=1 width=4)
(actualtime=0.025..0.028 rows=1 loops=1) 
                      Index Cond: (str_name = 'alamo'::text)
          ->  Index Scan using tlid_smaller__street_zipint on tlid_smaller  (cost=0.00..28994.70 rows=16413 width=32)
(actualtime=65.820..81.309 rows=4151 loops=1) 
                Index Cond: (tlid_smaller.geo_streetname_id = "outer".geo_streetname_id)
    ->  Hash  (cost=6.01..6.01 rows=1 width=4) (actual time=0.054..0.054 rows=0 loops=1)
          ->  Index Scan using city_lookup__name on city_lookup  (cost=0.00..6.01 rows=1 width=4) (actual
time=0.039..0.041rows=1 loops=1) 
                Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
  Total runtime: 97.577 ms
(12 rows)

fli=#

fli=# explain analyze  select * from tlid_smaller
         where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='alamo')
           and geo_city_id       = (select geo_city_id from city_lookup where city='san antonio' and state='TX');

fli-# fli-#                                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
  Index Scan using tlid_smaller__street_city on tlid_smaller  (cost=9.02..16.88 rows=3 width=32) (actual
time=0.115..0.255rows=78 loops=1) 
    Index Cond: ((geo_streetname_id = $0) AND (geo_city_id = $1))
    InitPlan
      ->  Index Scan using streetname_lookup__str_name on streetname_lookup  (cost=0.00..3.01 rows=1 width=4) (actual
time=0.044..0.047rows=1 loops=1) 
            Index Cond: (str_name = 'alamo'::text)
      ->  Index Scan using city_lookup__name on city_lookup  (cost=0.00..6.01 rows=1 width=4) (actual time=0.028..0.030
rows=1loops=1) 
            Index Cond: ((city = 'san antonio'::text) AND (state = 'TX'::text))
  Total runtime: 0.474 ms
(8 rows)

Re: Left Outer Join much faster than non-outer Join?

From
"Patrick Vedrines"
Date:
> > rm_pg@cheapcomplexdevices.com writes:
>    streetname_lookup
>    (for every street name used in the country)
>        streetid  |  name  | type
>        ----------+--------+------
>        1         |  Main  | St
>        2         |  1st   | St
>
Afa I'm concerned, I would add the column "city_id" since 2 different
streets in 2 different cities may have the same name.

Amicalement

Patrick


Re: Left Outer Join much faster than non-outer Join?

From
Simon Riggs
Date:
On Thu, 2005-03-31 at 00:15 -0800, Ron Mayer wrote:
> Ron Mayer wrote:
> > Tom Lane wrote:
> >> rm_pg@cheapcomplexdevices.com writes:
> >>> select *
> >>>     from streetname_lookup as sl
> >>>     join city_lookup as cl on (true)
> >>>     left outer join tlid_smaller as ts on (sl.geo_streetname_id =
> >>>          ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
> >>>     where  str_name='alamo' and  city='san antonio' and state='TX'
> >>> ;
> >> That's a fairly odd query;
> >
> >
> > I think it's a very common type of query in data warehousing.
> >
> > It's reasonably typical of a traditional star schema where
> > "streetname_lookup" and "city_lookup" are dimension tables
> > and "tlid_smaller" is the central fact table.
>

Yes, agreed.

> Although looking again I must admit the query was
> written unconventionally.  Perhaps those queries are
> remnants dating back to a version when you could
> force join orders this way?
>
> Perhaps a more common way of writing it would have been:
>
>    select * from tlid_smaller
>     where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='$str_name')
>       and geo_city_id       in (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> However this query also fails to use the multi-column
> index on (geo_streetname_id,geo_city_id).  Explain
> analyze shown below.

...which is my understanding too.

> In cases where I can be sure only one result will come
> from each of the lookup queries I guess I can do this:
>
>    select * from tlid_smaller
>     where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='$str_name')
>       and geo_city_id       = (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> which has the nicest plan of them all (explain analyze
> also shown below).

Which is not the case for the generalised star join.

The general case query here is:
    SELECT (whatever)
    FROM FACT, DIMENSION1 D1, DIMENSION2 D2, DIMENSION3 D3etc..
    WHERE
        FACT.dimension1_pk = D1.dimension1_pk
    AND    FACT.dimension2_pk = D2.dimension2_pk
    AND     FACT.dimension3_pk = D3.dimension3_pk
    AND    D1.dimdescription = 'X'
    AND    D2.dimdescription = 'Y'
    AND    D3.dimdescription = 'Z'
    ...
with FACT PK=(dimension1_pk, dimension2_pk, dimension3_pk)

with a more specific example of
    SELECT sum(item_price)
    FROM Sales, Store, Item, TTime
    WHERE
        Sales.store_pk = Store.store_pk
    AND    Store.region = 'UK'
    AND    Sales.item_pk = Item.item_pk
    AND    Item.category = 'Cameras'
    AND    Sales.time_pk = TTime.time_pk
    AND    TTime.month = 3
    AND    TTime.year = 2005

A very good plan for solving this, under specific conditions is...
    CartesianProduct(Store, Item, TTime) -> Sales.PK

which accesses the largest table only once.

As Tom says, the current optimizer won't go near that plan, for good
reason, without specifically tweaking collapse limits. I know full well
that any changes in that direction will need to be strong because that
execution plan is very sensitive to even minor changes in data
distribution.

The plan requires some fairly extensive checking to be put into place.
The selectivity of requests against the smaller tables needs to be very
well known, so that the upper bound estimate of cardinality of the
cartesian product is feasible AND still low enough to use the index on
Sales.

This is probably going to need information to be captured on multi-
column index selectivity, to ensure that last part.

It is likely that the statistics targets on the dimension tables would
need to be higher enough to identify MFVs or at least reduce the upper
bound of selectivity. It is also requires the table sizes to be
examined, to ensure this type of plan is considered pointlessly.
Some other systems that support this join type, turn off checking for it
by default. We could do the same with enable_starjoin = off.

Anyway, seems like a fair amount of work there... yes?

Best Regards, Simon Riggs