Thread: Why is there a Sort after an Index Only Scan?
Quick(?) question... why is there a Sort node after an Index Only Scan? Shouldn't the index already spit out sorted tuples? CREATE INDEX ON orders_test(shipping_date, order_id); EXPLAIN ANALYZE SELECT FROM orders_test WHERE TRUE AND shipping_date >= '2022-05-01' AND shipping_date <= '2022-05-01' ORDER BY order_id LIMIT 50; Limit (cost=8.46..8.46 rows=1 width=4) (actual time=0.031..0.032 rows=0 loops=1) -> Sort (cost=8.46..8.46 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=1) Sort Key: order_id Sort Method: quicksort Memory: 25kB -> Index Only Scan using orders_test_shipping_date_order_id_idx on orders_test (cost=0.43..8.45 rows=1 width=4) (actual time=0.017..0.018 rows=0 loops=1) Index Cond: ((shipping_date >= '2022-05-01'::date) AND (shipping_date <= '2022-05-01'::date)) Heap Fetches: 0 Fiddle: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7a3bc2421b5de5a2a377bd39b78d1c d5 I am actually asking because when I skew the distribution a little and repeat the query I get a rather unfortunate plan: INSERT INTO orders_test SELECT generate_series(2000001, 2100000), '2022-05-01'; ANALYZE orders_test; Limit (cost=0.43..37.05 rows=50 width=4) (actual time=1186.565..1186.593 rows=50 loops=1) -> Index Scan using orders_test_pkey on orders_test (cost=0.43..74336.43 rows=101502 width=4) (actual time=1186.562..1186.584 rows=50 loops=1) Filter: ((shipping_date >= '2022-05-01'::date) AND (shipping_date <= '2022-05-01'::date)) Rows Removed by Filter: 2000000 Postgres here uses the primary key to get the sort order, so I'm wondering if there is anything about my index that precludes its use for ORDER BY.
On Wed, May 4, 2022 at 7:15 PM André Hänsel <andre@webkr.de> wrote:
Quick(?) question... why is there a Sort node after an Index Only Scan?
Shouldn't the index already spit out sorted tuples?
CREATE INDEX ON orders_test(shipping_date, order_id);
EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE TRUE
AND shipping_date >= '2022-05-01'
AND shipping_date <= '2022-05-01'
ORDER BY order_id
LIMIT 50;
They are sorted by order_id only within sets of the same shipping_date, which is not good enough. (It would be good enough if it were smart enough to know that there is only one possible shipping date to satisfy your weird range condition.)
Cheers,
Jeff
On Thu, 5 May 2022 at 11:15, André Hänsel <andre@webkr.de> wrote: > > Quick(?) question... why is there a Sort node after an Index Only Scan? > Shouldn't the index already spit out sorted tuples? > > CREATE INDEX ON orders_test(shipping_date, order_id); > > EXPLAIN ANALYZE SELECT > FROM orders_test > WHERE TRUE > AND shipping_date >= '2022-05-01' > AND shipping_date <= '2022-05-01' > ORDER BY order_id > LIMIT 50; Unfortunately, the query planner is not quite smart enough to realise that your shipping_date clauses can only match a single value. There's quite a bit more we could do with the planner's EquivalanceClasses. There is a patch around to help improve things in this area but it requires some more infrastructure to make it more practical to do from a performance standpoint in the planner. You'll get the plan you want if you requite the query and replace your date range with shipping_date = '2022-05-01'. Your use of WHERE TRUE indicates to me that you might be building this query in an application already, so maybe you can just tweak that application to test if the start and end dates are the same and use equality when they are. David [1] https://commitfest.postgresql.org/38/3524/
> They are sorted by order_id only within sets of the same shipping_date, which is not good enough. Ah yes, that totally makes sense for the general case. > so maybe you can just tweak that application to test if the start and end dates are the same and use equality when theyare. I definitely can. But now I have a followup question, which probably should have been a separate question all along. I have modified the examplea bit to have a more natural date distribution and I got rid of the weird shipping_date condition and actually madeit different dates, so the index order is out of the picture. I also added some statistics so Postgres knows about therelationship between the columns. https://dbfiddle.uk/?rdbms=postgres_14&fiddle=54c7774432e896e3c0e89d8084c4b194 After inserting more rows, Postgres still chooses a scan on the primary key instead of using the index. Limit (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 rows=50 loops=1) -> Index Scan using orders_test_pkey on orders_test (cost=0.43..71149.43 rows=12010 width=4) (actual time=1052.690..1052.728rows=50 loops=1) Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date)) Rows Removed by Filter: 1998734 By setting the CPU costs to 0 (last block in the fiddle) I can force the use of the previous plan and as I already suspectedit is much better: Limit (cost=101.00..101.00 rows=50 width=4) (actual time=4.835..4.843 rows=50 loops=1) -> Sort (cost=101.00..101.00 rows=12010 width=4) (actual time=4.833..4.837 rows=50 loops=1) Sort Key: order_id Sort Method: top-N heapsort Memory: 27kB -> Index Scan using orders_test_shipping_date_idx on orders_test (cost=0.00..101.00 rows=12010 width=4) (actualtime=0.026..3.339 rows=11266 loops=1) Index Cond: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date)) Is it overestimating the cost of the sorting?
=?UTF-8?Q?Andr=C3=A9_H=C3=A4nsel?= <andre@webkr.de> writes: > Limit (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 rows=50 loops=1) > -> Index Scan using orders_test_pkey on orders_test (cost=0.43..71149.43 rows=12010 width=4) (actual time=1052.690..1052.728rows=50 loops=1) > Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= '2022-05-01'::date)) > Rows Removed by Filter: 1998734 > Is it overestimating the cost of the sorting? No, but it's guessing it will hit 50 rows that satisfy the filter before it's gone very far in this index. If the shipping date and pkey are correlated in the wrong direction, that could be a very overoptimistic guess. I don't think we have adequate stats yet to detect this sort of problem. regards, tom lane