Hi All,
I've noticed in the past that doing aggregates while joining to more
than one table can sometimes give you unintended results. For example,
suppose I have three tables: products, sales, and resupplies. In sales I
track what I sell, and in resupplies I track my own purchases to
increase inventory. Both have a foreign key to products. Now I want to
run a report showing the total dollars sold for each product versus the
total dollars spent for each product. I could try this:
SELECT p.id,
SUM(s.price * s.qty) AS total_sold,
SUM(r.price * r.qty) AS total_spent
FROM products p
LEFT OUTER JOIN sales s
ON s.product_id = p.id
LEFT OUTER JOIN resupplies r
ON r.product_id = p.id
GROUP BY p.id
;
That seems pretty safe, but actually I get bad answers,
for example if some product has this data:
sales
-----
sold 1 @ $2/ea
resupplies
----------
bought 1 @ $1/eq
bought 2 @ $1/ea
Then pre-grouping I have this:
p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+--------
1 | 1 | $2 | 1 | $1
1 | 1 | $2 | 2 | $1
You can see the problem is that I'm going to double-count my sales.
What I really want is this:
p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+--------
1 | 1 | $2 | 1 | $1
1 | | | 2 | $1
In the past I've always handled these situations by aggregating each
table separately
and only then joining things together:
WITH
s AS (
SELECT product_id,
SUM(price * qty) AS total_sold
FROM sales
GROUP BY product_id) s
),
r AS (
SELECT product_id,
SUM(price * qty) AS total_spent
FROM resupplies
GROUP BY product_id) r
)
SELECT p.id,
COALESCE(s.total_sold, 0),
COALESCE(r.total_spent, 0)
FROM products p
LEFT OUTER JOIN s
ON s.product_id = p.id
LEFT OUTER JOIN r
ON r.product_id = p.id
;
Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.
But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?
Also I wonder about the performance merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?
It seems like this pattern comes up a lot;
what have others done about it?
Thanks,
Paul