Thread: Sort is generating rows

Sort is generating rows

From
Nicolas Seinlet
Date:
Hi,

I have a query with a strange query plan.

This query is roughly searching for sales, and convert them with a currency rate. As currency rate changes from time to time, table contains the currency, the company, the rate, the start date of availability of this rate and the end date of availability.

The join is done using :
    left join currency_rate cr on (cr.currency_id = pp.currency_id and
          cr.company_id = s.company_id and
          cr.date_start <= coalesce(s.date_order, now()) and
         (cr.date_end is null or cr.date_end > coalesce(s.date_order, now())))

The tricky part is the date range on the currency rate, which is not an equality.

the query plan shows:
->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual time=14.300..72084.758 rows=308054684 loops=1)
                          Sort Key: cr.currency_id, cr.company_id
                          Sort Method: quicksort  Memory: 172kB
                          ->  CTE Scan on currency_rate cr  (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576 loops=1)

There's 2 challenging things :
- planner estimates 1637 rows, and get 300 million lines
- sorting is generating lines

later in the query plan, you find the join:
->  Merge Left Join  (cost=341056.75..351344.76 rows=1165112 width=224) (actual time=9792.635..269120.409 rows=1170055 loops=1)
                    Merge Cond: ((pp.currency_id = cr.currency_id) AND (s.company_id = cr.company_id))
                    Join Filter: ((cr.date_start <= COALESCE((s.date_order)::timestamp with time zone, now())) AND ((cr.date_end IS NULL) OR (cr.date_end > COALESCE((s.date_order)::timestamp with time zone, now()))))
                    Rows Removed by Join Filter: 307266434

It seems the join deletes all the generated million lines, which is correct.

My question is then , is there a better way to join a table to another using a date range, knowing that there's no overlap between date ranges?
Should we generate a virtual table with rates for all dates, and joining using an equality?

For now, the more currency rates, the slowest the query. There's not that much currency rates (1k in this case), as you can only have one rate per day per currency.

Have a nice day,

Nicolas.

Re: Sort is generating rows

From
Jeff Janes
Date:
On Thu, May 31, 2018 at 7:22 AM, Nicolas Seinlet <nicolas@seinlet.com> wrote:
Hi,

I have a query with a strange query plan.

This query is roughly searching for sales, and convert them with a currency rate. As currency rate changes from time to time, table contains the currency, the company, the rate, the start date of availability of this rate and the end date of availability.

The join is done using :
    left join currency_rate cr on (cr.currency_id = pp.currency_id and
          cr.company_id = s.company_id and
          cr.date_start <= coalesce(s.date_order, now()) and
         (cr.date_end is null or cr.date_end > coalesce(s.date_order, now())))

The tricky part is the date range on the currency rate, which is not an equality.

the query plan shows:
->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual time=14.300..72084.758 rows=308054684 loops=1)
                          Sort Key: cr.currency_id, cr.company_id
                          Sort Method: quicksort  Memory: 172kB
                          ->  CTE Scan on currency_rate cr  (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576 loops=1)

There's 2 challenging things :
- planner estimates 1637 rows, and get 300 million lines
- sorting is generating lines

These are both explained by the same thing.  The sort is feeding into a merge join.  For every row in the other node which have the same value of the scan keys, the entire section of this sort with those same keys gets scanned again.  The repeated scanning gets counted in the actual row count, but isn't counted in the expected row count, or the actual row count of the thing feeding into the sort (the CTE)
 
 
For now, the more currency rates, the slowest the query. There's not that much currency rates (1k in this case), as you can only have one rate per day per currency.

If it is only per currency per day, then why is company_id present? In any case, you might be better off listing the rates per day, rather than as a range, and then doing an equality join.

Cheers,

Jeff

Re: Sort is generating rows

From
Nicolas Seinlet
Date:


2018-05-31 15:10 GMT+02:00 Jeff Janes <jeff.janes@gmail.com>:
On Thu, May 31, 2018 at 7:22 AM, Nicolas Seinlet <nicolas@seinlet.com> wrote:
Hi,

I have a query with a strange query plan.

This query is roughly searching for sales, and convert them with a currency rate. As currency rate changes from time to time, table contains the currency, the company, the rate, the start date of availability of this rate and the end date of availability.

The join is done using :
    left join currency_rate cr on (cr.currency_id = pp.currency_id and
          cr.company_id = s.company_id and
          cr.date_start <= coalesce(s.date_order, now()) and
         (cr.date_end is null or cr.date_end > coalesce(s.date_order, now())))

The tricky part is the date range on the currency rate, which is not an equality.

the query plan shows:
->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual time=14.300..72084.758 rows=308054684 loops=1)
                          Sort Key: cr.currency_id, cr.company_id
                          Sort Method: quicksort  Memory: 172kB
                          ->  CTE Scan on currency_rate cr  (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576 loops=1)

There's 2 challenging things :
- planner estimates 1637 rows, and get 300 million lines
- sorting is generating lines

These are both explained by the same thing.  The sort is feeding into a merge join.  For every row in the other node which have the same value of the scan keys, the entire section of this sort with those same keys gets scanned again.  The repeated scanning gets counted in the actual row count, but isn't counted in the expected row count, or the actual row count of the thing feeding into the sort (the CTE)
 
 
For now, the more currency rates, the slowest the query. There's not that much currency rates (1k in this case), as you can only have one rate per day per currency.

If it is only per currency per day, then why is company_id present? In any case, you might be better off listing the rates per day, rather than as a range, and then doing an equality join.

Cheers,

Jeff

Hi,

Thanks for the answer. 

You're right, company_id is present, so you can have one rate per day per currency per company. I've tried to simplify the question without modifying the query plan, so I didn't talk about it even if it's present. I will now try to generate a virtual table of rates per dates.