Thread: Performance of outer joins?

Performance of outer joins?

From
Benjamin Smith
Date:
I have a situation that can be summarized to the following:

-- day in 20061215 format
Create table calendar (
    day integer unique not null
    );

Create table customers (
    id serial unique not null,
    name varchar,
    address varchar,
    );

Create table deliveries (
    customers_id integer not null references customers(id),
    calendar_day integer not null references calendar(day),
    delivered bool not null default false,
    unique(customers_id, calendar_id)
    );

Imagine tens of thousands of customers, a few million deliveries. A query
that's structurally similar to the following query is rather slow. It's
taking over 30 seconds, climbing fast on reasonable HW. (SMP Opteron, 10k
SCSI, 4 GB RAM) If I remove the outer join, performance is < 1 second.

SELECT customers.id as customers_id,
    customers.name AS customers_name,
    calendar.day AS calendar_day,
    CASE WHEN (deliveries.delivered IS NULL) THEN 'n/a'
        WHEN (deliveries.delivered=TRUE) THEN 'yes'
        ELSE 'no' END AS delivered
    FROM customers
    JOIN calendars ON
        (
        -- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
        calendar.day < 20061201
        AND calendar.day >= 20060101
        )
    LEFT OUTER JOIN deliveries ON
        (
        customers.id=deliveries.customers_id
        AND deliveries.calendar_day=calendar.day
        )
    ;

What can I do to improve the performance of this oft-used query? Is there a
better way to do this, or am I doomed to looping thru results and parsing the
results in code?

Thanks,

-Ben

--
"I kept looking around for somebody to solve the problem.
Then I realized I am somebody"
   -Anonymous

Re: Performance of outer joins?

From
Tom Lane
Date:
Benjamin Smith <ben@schoolpathways.com> writes:
>     FROM customers
>     JOIN calendars ON
>         (
>         -- GIVE A CALENDAR OF POSSIBLE DAYS FOR DELIVERIES
>         calendar.day < 20061201
>         AND calendar.day >= 20060101
>         )

Haven't you written a cross-product join here?  Seems like a bad
idea if you'd like it to be fast.  Especially since you then left-joined
it to something else, meaning that every row in the cross product
must in fact generate output.  Do you really want to be deluged with
a table showing all the customers you DIDN'T deliver to on each day,
for every day in the year?

            regards, tom lane

Re: Performance of outer joins?

From
"macgillivary"
Date:
ben would something like this work in your situation?

SELECT customer.id, customer.name, deliveries.calendar_day,
deliveries.delivered
FROM ben_customers as customer, ben_deliveries as deliveries

WHERE customer.id = deliveries.customers_id
and deliveries.calendar_day in (Select day
                from ben_calendar
                where day < 20061201 and day >= 20060101)

I think that would cut down the deliveries table fairly quickly, thus
making the customer_id join nice and quick.  You would also only be
returning data from the tables in your from clause.  I suppose it
depends on what you are trying to display.  If you want a list of all
the possible days, and deliveries on those days then I'd approach it a
bit differently.


On Dec 15, 7:59 pm, b...@schoolpathways.com (Benjamin Smith) wrote:

>
> What can I do to improve the performance of this oft-used query? Is there a
> better way to do this, or am I doomed to looping thru results and parsing the
> results in code?
>