Re: Left Outer Join much faster than non-outer Join? - Mailing list pgsql-performance

From Simon Riggs
Subject Re: Left Outer Join much faster than non-outer Join?
Date
Msg-id 1112292073.16721.327.camel@localhost.localdomain
Whole thread Raw
In response to Re: Left Outer Join much faster than non-outer Join?  (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
List pgsql-performance
On Thu, 2005-03-31 at 00:15 -0800, Ron Mayer wrote:
> Ron Mayer wrote:
> > Tom Lane wrote:
> >> rm_pg@cheapcomplexdevices.com writes:
> >>> select *
> >>>     from streetname_lookup as sl
> >>>     join city_lookup as cl on (true)
> >>>     left outer join tlid_smaller as ts on (sl.geo_streetname_id =
> >>>          ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
> >>>     where  str_name='alamo' and  city='san antonio' and state='TX'
> >>> ;
> >> That's a fairly odd query;
> >
> >
> > I think it's a very common type of query in data warehousing.
> >
> > It's reasonably typical of a traditional star schema where
> > "streetname_lookup" and "city_lookup" are dimension tables
> > and "tlid_smaller" is the central fact table.
>

Yes, agreed.

> Although looking again I must admit the query was
> written unconventionally.  Perhaps those queries are
> remnants dating back to a version when you could
> force join orders this way?
>
> Perhaps a more common way of writing it would have been:
>
>    select * from tlid_smaller
>     where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='$str_name')
>       and geo_city_id       in (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> However this query also fails to use the multi-column
> index on (geo_streetname_id,geo_city_id).  Explain
> analyze shown below.

...which is my understanding too.

> In cases where I can be sure only one result will come
> from each of the lookup queries I guess I can do this:
>
>    select * from tlid_smaller
>     where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='$str_name')
>       and geo_city_id       = (select geo_city_id from city_lookup where city='$city' and state='$state');
>
> which has the nicest plan of them all (explain analyze
> also shown below).

Which is not the case for the generalised star join.

The general case query here is:
    SELECT (whatever)
    FROM FACT, DIMENSION1 D1, DIMENSION2 D2, DIMENSION3 D3etc..
    WHERE
        FACT.dimension1_pk = D1.dimension1_pk
    AND    FACT.dimension2_pk = D2.dimension2_pk
    AND     FACT.dimension3_pk = D3.dimension3_pk
    AND    D1.dimdescription = 'X'
    AND    D2.dimdescription = 'Y'
    AND    D3.dimdescription = 'Z'
    ...
with FACT PK=(dimension1_pk, dimension2_pk, dimension3_pk)

with a more specific example of
    SELECT sum(item_price)
    FROM Sales, Store, Item, TTime
    WHERE
        Sales.store_pk = Store.store_pk
    AND    Store.region = 'UK'
    AND    Sales.item_pk = Item.item_pk
    AND    Item.category = 'Cameras'
    AND    Sales.time_pk = TTime.time_pk
    AND    TTime.month = 3
    AND    TTime.year = 2005

A very good plan for solving this, under specific conditions is...
    CartesianProduct(Store, Item, TTime) -> Sales.PK

which accesses the largest table only once.

As Tom says, the current optimizer won't go near that plan, for good
reason, without specifically tweaking collapse limits. I know full well
that any changes in that direction will need to be strong because that
execution plan is very sensitive to even minor changes in data
distribution.

The plan requires some fairly extensive checking to be put into place.
The selectivity of requests against the smaller tables needs to be very
well known, so that the upper bound estimate of cardinality of the
cartesian product is feasible AND still low enough to use the index on
Sales.

This is probably going to need information to be captured on multi-
column index selectivity, to ensure that last part.

It is likely that the statistics targets on the dimension tables would
need to be higher enough to identify MFVs or at least reduce the upper
bound of selectivity. It is also requires the table sizes to be
examined, to ensure this type of plan is considered pointlessly.
Some other systems that support this join type, turn off checking for it
by default. We could do the same with enable_starjoin = off.

Anyway, seems like a fair amount of work there... yes?

Best Regards, Simon Riggs



pgsql-performance by date:

Previous
From: Steve Wampler
Date:
Subject: Re: Reading recommendations
Next
From: Richard_D_Levine@raytheon.com
Date:
Subject: Re: Reading recommendations