Home > mailing lists

Thread: Aggregated join vs. aggregate in column?

Aggregated join vs. aggregate in column?

From

Durumdara

Date:

12 July 2019, 10:32:04

Dear Members!

I have a query which I extended with an extra calculated column.

I need to list the request, and the last date when they scheduled.

Example:

select

request.*,

(

select max(s_date) as s_date from schedule
where schedule.product_id = request.product_id and schedule.ok = True
) as max_s_date,

...

from request

...

This can find the last scheduled timestamp for that product.

It's working, but now they wanted to search for this max_s_date column).

I have two ways:

1.) With query / subquery:

select * from

(

select

request.*,

(

select max(s_date) as s_date from schedule
where schedule.product_id = request.product_id and schedule.ok = True
) as max_s_date,

...

from request

...

) t where t.max_s_date between ...

2.) I may relocate this section as join...

select

request.*, s.max_s_date

from request

left join

(

select schedule.product_id, max(s_date) as max_s_date from schedule

where schedule.ok = True

group by schedule.product_id
) s on (s.product_id = request.product_id)

...

But I really don't know what the hell will happen in query optimizer with this method.

a.)
Optimizer is clever, and it calculates the aggregates only in the needed rows.
So it find the request.product_id-s, and execute the "s" query only in these rows.

b.)

Or it isn't enough wise, it executes the "s" subquery on whole schedule, and later joins to main table.

The schedule table is big in customers' database, so this is worst case. :-(

---

I asked this because sometimes I need to get more result columns in the select, but I couldn't retreive more...

(

select max(s_date) as s_date from schedule
where schedule.product_id = request.product_id and schedule.ok = True
) as max_s_date, <=== only one column

So sometimes the join is better.

But if the optimizer isn't enough wise, I can get these values only "WITH" queries (select the main rows to temp, run subselects with only these records, return the mix of main and subselects in one query).

What is your experience with these kind of problems?

Can I use this join or I need to avoid because of very slow (and slower-slower) running time.

Thank your for any help!

Best regards

Re: Aggregated join vs. aggregate in column?

From

David Rowley

Date:

12 July 2019, 11:13:27

On Fri, 12 Jul 2019 at 19:32, Durumdara <durumdara@gmail.com> wrote:
> 2.) I may relocate this section as join...
>
> select
>     request.*, s.max_s_date
> from request
> left join
>     (
>         select schedule.product_id, max(s_date) as max_s_date from schedule
>         where schedule.ok = True
>         group by  schedule.product_id
>      ) s on (s.product_id = request.product_id)
>     ...
>
> But I really don't know what the hell will happen in query optimizer with this method.
>
> a.)
> Optimizer is clever, and it calculates the aggregates only in the needed rows.
> So it find the request.product_id-s, and execute the "s" query only in these rows.
>
> b.)
> Or it isn't enough wise, it executes the "s" subquery on whole schedule, and later joins to main table.
> The schedule table is big in customers' database, so this is worst case. :-(

I'm afraid for the particular query above, the answer is closer to b)
However, that's only going to be a problem if there are many more
distinct product_id records in "schedule". If you were to add a WHERE
clause to the outer query that did WHERE request.product_id = X, then
that qual would be pushed down into the subquery.  This qual pushing
only works for equality. So if you changed out WHERE
request.product_id = X to WHERE request.product_id IN(X,Y); then that
wouldn't push the qual to the subquery.

> I asked this because sometimes I need to get more result columns in the select, but I couldn't retreive more...
>
>     (
>         select max(s_date) as s_date from schedule
>         where schedule.product_id = request.product_id and schedule.ok = True
>      ) as max_s_date,   <=== only one column
>
> So sometimes the join is better.
>
> But if the optimizer isn't enough wise, I can get these values only "WITH" queries (select the main rows to temp, run
subselectswith only these records, return the mix of main and subselects in one query).
 

If you don't want to repeat the same subquery in the SELECT list then
you could perform a CROSS JOIN LATERAL. For example:

select
    request.*, s.max_s_date, s.max_s1_date
from request
cross join lateral (
        select max(s_date) as max_s_date, max(s1_date) as max_s1_date
        from schedule
        where schedule.ok = True
       and s.product_id = request.product_id) s;

In this case, the subquery will be executed once per output row, so if
you have some restrictive WHERE clause on the outer query then the
subquery will be executed fewer times.

With a bit of training, you should be able to see what the query
planner has done for yourself by using the EXPLAIN command:
https://www.postgresql.org/docs/current/sql-explain.html

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services