Re: Left Outer Join much faster than non-outer Join? - Mailing list pgsql-performance
From | Simon Riggs |
---|---|
Subject | Re: Left Outer Join much faster than non-outer Join? |
Date | |
Msg-id | 1112292073.16721.327.camel@localhost.localdomain Whole thread Raw |
In response to | Re: Left Outer Join much faster than non-outer Join? (Ron Mayer <rm_pg@cheapcomplexdevices.com>) |
List | pgsql-performance |
On Thu, 2005-03-31 at 00:15 -0800, Ron Mayer wrote: > Ron Mayer wrote: > > Tom Lane wrote: > >> rm_pg@cheapcomplexdevices.com writes: > >>> select * > >>> from streetname_lookup as sl > >>> join city_lookup as cl on (true) > >>> left outer join tlid_smaller as ts on (sl.geo_streetname_id = > >>> ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id) > >>> where str_name='alamo' and city='san antonio' and state='TX' > >>> ; > >> That's a fairly odd query; > > > > > > I think it's a very common type of query in data warehousing. > > > > It's reasonably typical of a traditional star schema where > > "streetname_lookup" and "city_lookup" are dimension tables > > and "tlid_smaller" is the central fact table. > Yes, agreed. > Although looking again I must admit the query was > written unconventionally. Perhaps those queries are > remnants dating back to a version when you could > force join orders this way? > > Perhaps a more common way of writing it would have been: > > select * from tlid_smaller > where geo_streetname_id in (select geo_streetname_id from streetname_lookup where str_name='$str_name') > and geo_city_id in (select geo_city_id from city_lookup where city='$city' and state='$state'); > > However this query also fails to use the multi-column > index on (geo_streetname_id,geo_city_id). Explain > analyze shown below. ...which is my understanding too. > In cases where I can be sure only one result will come > from each of the lookup queries I guess I can do this: > > select * from tlid_smaller > where geo_streetname_id = (select geo_streetname_id from streetname_lookup where str_name='$str_name') > and geo_city_id = (select geo_city_id from city_lookup where city='$city' and state='$state'); > > which has the nicest plan of them all (explain analyze > also shown below). Which is not the case for the generalised star join. The general case query here is: SELECT (whatever) FROM FACT, DIMENSION1 D1, DIMENSION2 D2, DIMENSION3 D3etc.. WHERE FACT.dimension1_pk = D1.dimension1_pk AND FACT.dimension2_pk = D2.dimension2_pk AND FACT.dimension3_pk = D3.dimension3_pk AND D1.dimdescription = 'X' AND D2.dimdescription = 'Y' AND D3.dimdescription = 'Z' ... with FACT PK=(dimension1_pk, dimension2_pk, dimension3_pk) with a more specific example of SELECT sum(item_price) FROM Sales, Store, Item, TTime WHERE Sales.store_pk = Store.store_pk AND Store.region = 'UK' AND Sales.item_pk = Item.item_pk AND Item.category = 'Cameras' AND Sales.time_pk = TTime.time_pk AND TTime.month = 3 AND TTime.year = 2005 A very good plan for solving this, under specific conditions is... CartesianProduct(Store, Item, TTime) -> Sales.PK which accesses the largest table only once. As Tom says, the current optimizer won't go near that plan, for good reason, without specifically tweaking collapse limits. I know full well that any changes in that direction will need to be strong because that execution plan is very sensitive to even minor changes in data distribution. The plan requires some fairly extensive checking to be put into place. The selectivity of requests against the smaller tables needs to be very well known, so that the upper bound estimate of cardinality of the cartesian product is feasible AND still low enough to use the index on Sales. This is probably going to need information to be captured on multi- column index selectivity, to ensure that last part. It is likely that the statistics targets on the dimension tables would need to be higher enough to identify MFVs or at least reduce the upper bound of selectivity. It is also requires the table sizes to be examined, to ensure this type of plan is considered pointlessly. Some other systems that support this join type, turn off checking for it by default. We could do the same with enable_starjoin = off. Anyway, seems like a fair amount of work there... yes? Best Regards, Simon Riggs
pgsql-performance by date: