Thread: A cost issue in ORDER BY + LIMIT


From: Paul Guo
Date: Aug 6, 2022, 23:38 +0800
Hello,

Postgres seems to always optimize ORDER BY + LIMIT as a top-k sort.
Recently I happened to notice that in this scenario the estimated
output row count of the sort node is not the same as the LIMIT row count.

See below,

postgres=# explain analyze verbose select * from t1 order by a limit 10;
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..69446.20 rows=10 width=4) (actual time=282.896..282.923 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=991862 width=4) (actual time=282.894..282.896 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=0.026..195.438 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.553 ms
 Execution Time: 282.961 ms
(10 rows)

We can see from the output that the LIMIT node's cost is also wrong
because of this, since it assumes the input_rows from the sort node
is 991862 (per gdb debugging).

This could easily be fixed by the patch below:

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fb28e6411a..800cf0b256 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2429,7 +2429,11 @@ cost_sort(Path *path, PlannerInfo *root,

    startup_cost += input_cost;

-   path->rows = tuples;
+   if (limit_tuples > 0 && limit_tuples < tuples)
+       path->rows = limit_tuples;
+   else
+       path->rows = tuples;
+
    path->startup_cost = startup_cost;
    path->total_cost = startup_cost + run_cost;
 }

With the patch, the explain output looks like this:

postgres=# explain analyze verbose select * from t1 order by a limit 10;
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.204..256.207 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.202..256.203 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=1.014..169.509 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.076 ms
 Execution Time: 256.232 ms
(10 rows)

Regards,
Paul



Re: A cost issue in ORDER BY + LIMIT

From: Zhang Mingli
Hi,

What if t1 has fewer rows than the LIMIT count (e.g. t1 has 5 rows, LIMIT 10)?
Does it matter?


Regards,
Zhang Mingli
On Aug 6, 2022, 23:38 +0800, Paul Guo <paulguo@gmail.com> wrote:

> limit_tuples

Re: A cost issue in ORDER BY + LIMIT

From: Tom Lane
Paul Guo <paulguo@gmail.com> writes:
> Postgres seems to always optimize ORDER BY + LIMIT as top-k sort.
> Recently I happened to notice
> that in this scenario the output tuple number of the sort node is not
> the same as the LIMIT tuple number.

No, it isn't, and your proposed patch is completely misguided.
The cost and rowcount estimates for a plan node are always written
on the assumption that the node is run to completion.

            regards, tom lane