Avoiding double-counting in aggregates with more than one join? - Mailing list pgsql-general

From Paul Jungwirth
Subject Avoiding double-counting in aggregates with more than one join?
Date
Msg-id 86b9ec78-925c-1935-bc9d-6bad4ceb1f40@illuminatedcomputing.com
Whole thread Raw
Responses Re: Avoiding double-counting in aggregates with more than one join?  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-general
Hi All,

I've noticed in the past that doing aggregates while joining to more
than one table can sometimes give you unintended results. For example,
suppose I have three tables: products, sales, and resupplies. In sales I
track what I sell, and in resupplies I track my own purchases to
increase inventory. Both have a foreign key to products. Now I want to
run a report showing the total dollars sold for each product versus the
total dollars spent for each product. I could try this:

     SELECT  p.id,
             SUM(s.price * s.qty) AS total_sold,
             SUM(r.price * r.qty) AS total_spent
     FROM    products p
     LEFT OUTER JOIN sales s
     ON      s.product_id = p.id
     LEFT OUTER JOIN resupplies r
     ON      r.product_id = p.id
     GROUP BY p.id
     ;

That seems pretty safe, but actually I get bad answers,
for example if some product has this data:

     sales
     -----
     sold 1 @ $2/ea

     resupplies
     ----------
     bought 1 @ $1/eq
     bought 2 @ $1/ea

Then pre-grouping I have this:

     p.id | s.qty | s.price | r.qty | r.price
     -----+-------+---------+-------+--------
        1 |     1 |      $2 |     1 |      $1
        1 |     1 |      $2 |     2 |      $1

You can see the problem is that I'm going to double-count my sales.
What I really want is this:

     p.id | s.qty | s.price | r.qty | r.price
     -----+-------+---------+-------+--------
        1 |     1 |      $2 |     1 |      $1
        1 |       |         |     2 |      $1

In the past I've always handled these situations by aggregating each
table separately
and only then joining things together:

     WITH
     s AS (
       SELECT  product_id,
               SUM(price * qty) AS total_sold
       FROM    sales
       GROUP BY product_id) s
     ),
     r AS (
       SELECT  product_id,
               SUM(price * qty) AS total_spent
       FROM    resupplies
       GROUP BY product_id) r
     )
     SELECT  p.id,
             COALESCE(s.total_sold, 0),
             COALESCE(r.total_spent, 0)
     FROM    products p
     LEFT OUTER JOIN s
     ON      s.product_id = p.id
     LEFT OUTER JOIN r
     ON      r.product_id = p.id
     ;

Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.

But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?

Also I wonder about the performance merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?

It seems like this pattern comes up a lot;
what have others done about it?

Thanks,
Paul


pgsql-general by date:

Previous
From: William Ivanski
Date:
Subject: Re: Trim performance on 9.5
Next
From: "David G. Johnston"
Date:
Subject: Re: Avoiding double-counting in aggregates with more than one join?