Thread: Why is there a Sort after an Index Only Scan?

Why is there a Sort after an Index Only Scan?

From
André Hänsel
Date:
Quick(?) question... why is there a Sort node after an Index Only Scan?
Shouldn't the index already spit out sorted tuples?

CREATE INDEX ON orders_test(shipping_date, order_id);

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE TRUE
AND shipping_date >= '2022-05-01'
AND shipping_date <= '2022-05-01'
ORDER BY order_id
LIMIT 50;

Limit  (cost=8.46..8.46 rows=1 width=4) (actual time=0.031..0.032 rows=0
loops=1)
  ->  Sort  (cost=8.46..8.46 rows=1 width=4) (actual time=0.025..0.025
rows=0 loops=1)
        Sort Key: order_id
        Sort Method: quicksort  Memory: 25kB
        ->  Index Only Scan using orders_test_shipping_date_order_id_idx on
orders_test  (cost=0.43..8.45 rows=1 width=4) (actual time=0.017..0.018
rows=0 loops=1)
              Index Cond: ((shipping_date >= '2022-05-01'::date) AND
(shipping_date <= '2022-05-01'::date))
              Heap Fetches: 0

Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7a3bc2421b5de5a2a377bd39b78d1c
d5

I am actually asking because when I skew the distribution a little and
repeat the query I get a rather unfortunate plan:

INSERT INTO orders_test SELECT generate_series(2000001, 2100000),
'2022-05-01';
ANALYZE orders_test;

Limit  (cost=0.43..37.05 rows=50 width=4) (actual time=1186.565..1186.593
rows=50 loops=1)
  ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..74336.43
rows=101502 width=4) (actual time=1186.562..1186.584 rows=50 loops=1)
        Filter: ((shipping_date >= '2022-05-01'::date) AND (shipping_date <=
'2022-05-01'::date))
        Rows Removed by Filter: 2000000

Postgres here uses the primary key to get the sort order, so I'm wondering
if there is anything about my index that precludes its use for ORDER BY.





Re: Why is there a Sort after an Index Only Scan?

From
Jeff Janes
Date:
On Wed, May 4, 2022 at 7:15 PM André Hänsel <andre@webkr.de> wrote:
Quick(?) question... why is there a Sort node after an Index Only Scan?
Shouldn't the index already spit out sorted tuples?

CREATE INDEX ON orders_test(shipping_date, order_id);

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE TRUE
AND shipping_date >= '2022-05-01'
AND shipping_date <= '2022-05-01'
ORDER BY order_id
LIMIT 50;

They are sorted by order_id only within sets of the same shipping_date, which is not good enough.  (It would be good enough if it were smart enough to know that there is only one possible shipping date to satisfy your weird range condition.)
 
Cheers,

Jeff

Re: Why is there a Sort after an Index Only Scan?

From
David Rowley
Date:
On Thu, 5 May 2022 at 11:15, André Hänsel <andre@webkr.de> wrote:
>
> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;

Unfortunately, the query planner is not quite smart enough to realise
that your shipping_date clauses can only match a single value.
There's quite a bit more we could do with the planner's
EquivalanceClasses. There is a patch around to help improve things in
this area but it requires some more infrastructure to make it more
practical to do from a performance standpoint in the planner.

You'll get the plan you want if you requite the query and replace your
date range with shipping_date = '2022-05-01'.  Your use of WHERE TRUE
indicates to me that you might be building this query in an
application already, so maybe you can just tweak that application to
test if the start and end dates are the same and use equality when
they are.

David

[1] https://commitfest.postgresql.org/38/3524/



RE: Why is there a Sort after an Index Only Scan?

From
André Hänsel
Date:
> They are sorted by order_id only within sets of the same shipping_date, which is not good enough.

Ah yes, that totally makes sense for the general case.

> so maybe you can just tweak that application to test if the start and end dates are the same and use equality when
theyare. 

I definitely can.

But now I have a followup question, which probably should have been a separate question all along. I have modified the
examplea bit to have a more natural date distribution and I got rid of the weird shipping_date condition and actually
madeit different dates, so the index order is out of the picture. I also added some statistics so Postgres knows about
therelationship between the columns. 

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=54c7774432e896e3c0e89d8084c4b194

After inserting more rows, Postgres still chooses a scan on the primary key instead of using the index.

Limit  (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 rows=50 loops=1)
  ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..71149.43 rows=12010 width=4) (actual
time=1052.690..1052.728rows=50 loops=1) 
        Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date))
        Rows Removed by Filter: 1998734

By setting the CPU costs to 0 (last block in the fiddle) I can force the use of the previous plan and as I already
suspectedit is much better: 

Limit  (cost=101.00..101.00 rows=50 width=4) (actual time=4.835..4.843 rows=50 loops=1)
  ->  Sort  (cost=101.00..101.00 rows=12010 width=4) (actual time=4.833..4.837 rows=50 loops=1)
        Sort Key: order_id
        Sort Method: top-N heapsort  Memory: 27kB
        ->  Index Scan using orders_test_shipping_date_idx on orders_test  (cost=0.00..101.00 rows=12010 width=4)
(actualtime=0.026..3.339 rows=11266 loops=1) 
              Index Cond: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date))

Is it overestimating the cost of the sorting?





Re: Why is there a Sort after an Index Only Scan?

From
Tom Lane
Date:
=?UTF-8?Q?Andr=C3=A9_H=C3=A4nsel?= <andre@webkr.de> writes:
> Limit  (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 rows=50 loops=1)
>   ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..71149.43 rows=12010 width=4) (actual
time=1052.690..1052.728rows=50 loops=1) 
>         Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date))
>         Rows Removed by Filter: 1998734

> Is it overestimating the cost of the sorting?

No, but it's guessing it will hit 50 rows that satisfy the filter
before it's gone very far in this index.  If the shipping date and
pkey are correlated in the wrong direction, that could be a very
overoptimistic guess.  I don't think we have adequate stats yet
to detect this sort of problem.

            regards, tom lane