Thread: Performance of outer joins?
I have a situation that can be summarized to the following: -- day in 20061215 format Create table calendar ( day integer unique not null ); Create table customers ( id serial unique not null, name varchar, address varchar, ); Create table deliveries ( customers_id integer not null references customers(id), calendar_day integer not null references calendar(day), delivered bool not null default false, unique(customers_id, calendar_id) ); Imagine tens of thousands of customers, a few million deliveries. A query that's structurally similar to the following query is rather slow. It's taking over 30 seconds, climbing fast on reasonable HW. (SMP Opteron, 10k SCSI, 4 GB RAM) If I remove the outer join, performance is < 1 second. SELECT customers.id as customers_id, customers.name AS customers_name, calendar.day AS calendar_day, CASE WHEN (deliveries.delivered IS NULL) THEN 'n/a' WHEN (deliveries.delivered=TRUE) THEN 'yes' ELSE 'no' END AS delivered FROM customers JOIN calendars ON ( -- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES calendar.day < 20061201 AND calendar.day >= 20060101 ) LEFT OUTER JOIN deliveries ON ( customers.id=deliveries.customers_id AND deliveries.calendar_day=calendar.day ) ; What can I do to improve the performance of this oft-used query? Is there a better way to do this, or am I doomed to looping thru results and parsing the results in code? Thanks, -Ben -- "I kept looking around for somebody to solve the problem. Then I realized I am somebody" -Anonymous
Benjamin Smith <ben@schoolpathways.com> writes: > FROM customers > JOIN calendars ON > ( > -- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES > calendar.day < 20061201 > AND calendar.day >= 20060101 > ) Haven't you written a cross-product join here? Seems like a bad idea if you'd like it to be fast. Especially since you then left-joined it to something else, meaning that every row in the cross product must in fact generate output. Do you really want to be deluged with a table showing all the customers you DIDN'T deliver to on each day, for every day in the year? regards, tom lane
ben would something like this work in your situation? SELECT customer.id, customer.name, deliveries.calendar_day, deliveries.delivered FROM ben_customers as customer, ben_deliveries as deliveries WHERE customer.id = deliveries.customers_id and deliveries.calendar_day in (Select day from ben_calendar where day < 20061201 and day >= 20060101) I think that would cut down the deliveries table fairly quickly, thus making the customer_id join nice and quick. You would also only be returning data from the tables in your from clause. I suppose it depends on what you are trying to display. If you want a list of all the possible days, and deliveries on those days then I'd approach it a bit differently. On Dec 15, 7:59 pm, b...@schoolpathways.com (Benjamin Smith) wrote: > > What can I do to improve the performance of this oft-used query? Is there a > better way to do this, or am I doomed to looping thru results and parsing the > results in code? >