Thread: BUG #18834: Query planer is choosing the sub-optimal plan when limit is present

BUG #18834: Query planer is choosing the sub-optimal plan when limit is present

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18834
Logged by:          Michael Stoeckel
Email address:      michael@joincandidhealth.com
PostgreSQL version: 15.10
Operating system:   x86_64-pc-linux-gnu
Description:

We have a table, `fn_ledger`, containing approximately **80 million
records**. The table includes a **primary key (`id`)** of type **UUID** and
a **`provenance`** column of type **JSONB**, which is **GIN indexed**, along
with other fields.  
We observed **poor query performance** when using an **`ORDER BY`** clause
combined with **`LIMIT`**, which led us to investigate further.  
Consider the following simple query:
```
select *
from fn_ledger
where provenance @> '{"locator": "52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'
order by id
```
In the query above, the `WHERE` clause is **highly selective**, meaning it
will return at most a handful of records.  
Through our investigation, we found that:  
- **When `LIMIT` is not present**, the planner correctly chooses to use the
**bitmap index scan** on `ix_fn_ledger_provenance`, resulting in efficient
query execution.  
- **When `LIMIT` is introduced**, the planner **incorrectly opts** for a
sequential scan on `fn_ledger_pkey` while filtering on the `provenance`
value, leading to poor performance.  

See the query plans below:
```
postgres=> explain analyze
select *
from fn_ledger
where provenance @> '{"locator": "52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'
order by id;
                                                                 QUERY PLAN
                                                               

--------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=32625.36..32645.41 rows=8020 width=333) (actual
time=2.943..2.944 rows=1 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 25kB
   ->  Bitmap Heap Scan on fn_ledger  (cost=1230.16..32105.29 rows=8020
width=333) (actual time=2.939..2.939 rows=1 loops=1)
         Recheck Cond: (provenance @> '{"locator":
"52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'::jsonb)
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on ix_fn_ledger_provenance
(cost=0.00..1228.15 rows=8020 width=0) (actual time=2.923..2.924 rows=1
loops=1)
               Index Cond: (provenance @> '{"locator":
"52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'::jsonb)
 Planning Time: 0.115 ms
 Execution Time: 2.978 ms
(10 rows)

postgres=> explain analyze
select *
from fn_ledger
where provenance @> '{"locator": "52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'
order by id
limit 1;
                                                                      QUERY
PLAN                                                                      

------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1816.54 rows=1 width=333) (actual
time=317508.674..317508.675 rows=1 loops=1)
   ->  Index Scan using fn_ledger_pkey on fn_ledger  (cost=0.57..14564073.97
rows=8020 width=333) (actual time=317508.673..317508.673 rows=1 loops=1)
         Filter: (provenance @> '{"locator":
"52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'::jsonb)
         Rows Removed by Filter: 49197422
 Planning Time: 0.148 ms
 Execution Time: 317508.742 ms
(6 rows)
```
Furthermore, since the planner incorrectly chooses to perform an **index
scan** on `fn_ledger_pkey`, the query performance varies significantly
depending on **where the candidate record's `id` appears** within the
`fn_ledger_pkey` index.  
To illustrate this, I selected the **first and last records** based on the
ordering of the `id` column.
```
select id, provenance->'locator'
from fn_ledger
where id in (
'00000025-fb46-46cd-b7eb-19e111fa1e87',
'ffffffdb-23f6-4a6e-9385-86e5e32a9309'
)
===================
00000025-fb46-46cd-b7eb-19e111fa1e87
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"
ffffffdb-23f6-4a6e-9385-86e5e32a9309
"f2d20507-a543-4c38-b3e3-c977ac407b42"
```
then, I executed the following queries, both with `LIMIT`
```
postgres=> explain analyze
select *
from fn_ledger
where fn_ledger.provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'
order by id
limit 1;
                                                                 QUERY PLAN
                                                               

--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1817.68 rows=1 width=333) (actual time=0.016..0.017
rows=1 loops=1)
   ->  Index Scan using fn_ledger_pkey on fn_ledger  (cost=0.57..14478744.69
rows=7968 width=333) (actual time=0.015..0.016 rows=1 loops=1)
         Filter: (provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'::jsonb)
 Planning Time: 0.184 ms
 Execution Time: 0.031 ms
(5 rows)

postgres=> explain analyze
select *
from fn_ledger
where fn_ledger.provenance @> '{"locator":
"f2d20507-a543-4c38-b3e3-c977ac407b42"}'
order by id
limit 1;
                                                                     QUERY
PLAN                                                                     

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1817.75 rows=1 width=333) (actual
time=57515.333..57515.334 rows=1 loops=1)
   ->  Index Scan using fn_ledger_pkey on fn_ledger  (cost=0.57..14479340.89
rows=7968 width=333) (actual time=57515.331..57515.332 rows=1 loops=1)
         Filter: (provenance @> '{"locator":
"f2d20507-a543-4c38-b3e3-c977ac407b42"}'::jsonb)
         Rows Removed by Filter: 30847879
 Planning Time: 0.147 ms
 Execution Time: 57515.418 ms
(6 rows)
```
As you can see, both queries are **identical**, yet the first query is
**orders of magnitude faster** than the second. This difference occurs
because the **candidate record** for the first query appears **towards the
beginning** of the `fn_ledger_pkey` index, allowing it to be found quickly.
In contrast, the candidate record for the second query appears **towards the
end** of the index, requiring significantly more scanning.

Beyond this inconsistency, if the **`LIMIT` value exceeds the actual number
of records found by the query**, the index scan will have to **traverse the
entire index**, regardless of the candidate record’s position. This further
degrades performance.
```
postgres=> explain analyze
select *
from fn_ledger
where fn_ledger.provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'
order by id
limit 1;
                                                                 QUERY PLAN
                                                               

--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1816.41 rows=1 width=333) (actual time=0.013..0.014
rows=1 loops=1)
   ->  Index Scan using fn_ledger_pkey on fn_ledger  (cost=0.57..14564849.49
rows=8021 width=333) (actual time=0.013..0.013 rows=1 loops=1)
         Filter: (provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'::jsonb)
 Planning Time: 0.145 ms
 Execution Time: 0.027 ms
(5 rows)

postgres=> explain analyze
select *
from fn_ledger
where fn_ledger.provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'
order by id
limit 2;
                                                                   QUERY
PLAN                                                                    

-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..3632.25 rows=2 width=333) (actual time=0.017..126786.189
rows=1 loops=1)
   ->  Index Scan using fn_ledger_pkey on fn_ledger  (cost=0.57..14564849.49
rows=8021 width=333) (actual time=0.016..126786.187 rows=1 loops=1)
         Filter: (provenance @> '{"locator":
"9cdf7902-ba7d-4c07-9f92-451aeebe6e2d"}'::jsonb)
         Rows Removed by Filter: 80472490
 Planning Time: 0.148 ms
 Execution Time: 126786.246 ms
(6 rows)
```
As you can see from the above queries, the only difference is that the
second query's `LIMIT` is 2 instead of 1.

In conclusion, it seems that the query planner is **placing too much
emphasis on the `LIMIT` clause**, prioritizing it over selecting the more
efficient `ix_fn_ledger_provenance` index. As a result, it **ignores the
optimal indexing strategy** and instead opts for a less efficient scan on
`fn_ledger_pkey`, leading to inconsistent and suboptimal query performance.


PG Bug reporting form <noreply@postgresql.org> writes:
> We observed **poor query performance** when using an **`ORDER BY`** clause
> combined with **`LIMIT`**, which led us to investigate further.  

This is a well-known syndrome.  The fundamental problem is the poor
selectivity estimate:

>          ->  Bitmap Index Scan on ix_fn_ledger_provenance 
> (cost=0.00..1228.15 rows=8020 width=0) (actual time=2.923..2.924 rows=1
> loops=1)
>                Index Cond: (provenance @> '{"locator":
> "52ca2f03-8184-448f-9b9e-7a0ed99e6922"}'::jsonb)

If the thing thinks there are 8020 matches for the @> condition when
there's only one, it's really pure luck if you get an optimal plan.
All its cost estimates will be off by a factor of more than 1000,
and these decisions are by no means linear.  They're particularly
not linear in the presence of LIMIT, but query plans can go far
astray even without that.

So what you need to do about this is get a better selectivity
estimate.  You might be able to get somewhere with custom statistics
(see CREATE STATISTICS), or just by cranking up the statistics target
for the "provenance" column to the maximum.  But I'm not sure how
much that will help.  Fundamentally, putting stuff into unstructured
JSON storage and expecting to get efficient searches of it is an
antipattern.  A more reliable answer is to change your query.
You could, for example, make an expression index on
(provenance ->> 'locator') and then write

WHERE (provenance ->> 'locator') = '52ca2f03-8184-448f-9b9e-7a0ed99e6922'

That will provide both fast searches and reasonably trustworthy stats
(once ANALYZE has seen the index).

            regards, tom lane