SELECT over partitioned table with LIMIT 1 performance regression issue in PostgreSQL 17 and 18 - Mailing list pgsql-hackers
| From | Илья Сербин |
|---|---|
| Subject | SELECT over partitioned table with LIMIT 1 performance regression issue in PostgreSQL 17 and 18 |
| Date | |
| Msg-id | CAKS1CiSrRPDcR3YQkf95g-SY6y3TbhO2K4tgdGWbN-YYr0xivg@mail.gmail.com Whole thread |
| List | pgsql-hackers |
Hello,
I think I may have found a planner regression, or at least a planner corner case, in PostgreSQL 17 and 18 involving a partitioned table, LIMIT 1, and OR-ed timestamp ranges.
I could not find an existing report that matches this pattern closely, so I am sending a reproducible example and the observations below to discuss whether or not this behaviour is appropriate and expected.
Environment
-----------
Observed on:
- PostgreSQL 17.9 on Ubuntu 24.04
- PostgreSQL 15.x and 16.x do not show the same bad parent-level plan for the same query pattern
Workload pattern
----------------
- A table partitioned by RANGE(ts_col), with weekly partitions
- B-tree indexes on ts_col and kind_col
- ts_col is highly correlated with heap order
- kind_col = 'k_a' is very common (about 81%)
- the query uses LIMIT 1 without ORDER BY
- the filter is:
kind_col IN (...)
AND (ts_col in range1 OR ts_col in range2)
Problem summary
---------------
On PostgreSQL 17, the query on the partitioned parent table may choose a Seq Scan on the child partition instead of a ts_col-based bitmap/index path, even though the ts_col-based path is much faster when forced.
What looks suspicious to me is this:
1. PostgreSQL 17 can use the ts_col index efficiently for a very similar query when the OR condition is removed and only one timestamp range is used.
2. PostgreSQL 17 can also execute a forced bitmap path on ts_col very quickly for the same child partition.
3. But for the parent-table query with OR-ed ranges and LIMIT 1, PostgreSQL 17 may choose a Seq Scan with much worse runtime.
There is also an important difference between PostgreSQL 16 and 17 here:
- On PostgreSQL 16, the child partition by itself may still choose Seq Scan for this LIMIT 1 pattern, but the query on the partitioned parent chooses a Bitmap-based Append plan and runs fast.
- On PostgreSQL 17, the parent-level plan no longer does that in this case and regresses to the poor plan shape.
Representative observations
---------------------------
For the parent query on PostgreSQL 17:
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM fact_events
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
Observed bad plan:
- Seq Scan on the child partition under Append
- runtime around 12 seconds on our production-sized dataset
- Rows Removed by Filter: about 33 million
On the same PostgreSQL 17 instance:
A) Parent query with only one timestamp range:
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM fact_events
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND ts_col >= '2026-03-28 07:42:30.405987'
AND ts_col < '2026-03-28 08:00:00'
LIMIT 1;
This uses the ts_col index and returns very quickly.
B) Forced bitmap path on the child partition:
SET enable_seqscan = off;
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT 1
FROM fact_events_p2026w13
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
This uses BitmapOr + Bitmap Heap Scan via ts_col_idx and also returns quickly.
C) Default plan for the child partition alone:
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT 1
FROM fact_events_p2026w13
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
This can still choose Seq Scan on the child relation itself, but on PostgreSQL 16 the parent-level Append plan is still good, while on PostgreSQL 17 the parent-level plan may also fall back to the poor plan shape.
Why this looks suspicious
-------------------------
- The ts_col-based path is clearly valid and fast.
- This is not just a matter of missing statistics or tuning. I tried extended statistics, random_page_cost, effective_cache_size, and combinations of these, and they did not fix the bad parent-level plan.
- I can reproduce the issue not only on a tuned production-like setup, but also on plain PostgreSQL containers with no special tuning.
- The issue looks related to parent/Append planning for LIMIT 1 on a partitioned table with OR-ed ranges.
- PostgreSQL 17 release notes mention planner changes in this area: LIMIT optimization on partitioned tables, inheritance parents, and UNION ALL. Commit that looks potentially relevant is a8a968a82 (Consider cheap startup paths in add_paths_to_append_rel, PG17), which introduced startup-path selection for Append nodes. I have not confirmed this is the root cause, but the behavior and timing match: if Seq Scan wins as cheapest_startup_path because BitmapOr has higher startup cost, the parent-level Append would pick the wrong path for LIMIT queries.
My guess is that the planner is underestimating how bad the Seq Scan really is for LIMIT 1 in this case. Since ts_col is very highly correlated with heap order, the matching rows are physically located close to the end of the partition, so the Seq Scan ends up reading a very large part of the table before it finds the first match.
Testing
------------
I am attaching two SQL files:
- reproduce_preparation_env.sql
This file creates the schema, partitions, indexes, and loads the test data.
- reproduce.sql
This file runs the actual query tests and prints short headings and comments before each step, so the output is easier to read.
The tests were performed on four clean Docker containers running PostgreSQL 15, 16, 17, 18
Notes
-----
- Exact runtimes will depend on hardware and cache state, but the plan change itself is reproducible. I'm using SET synchronize_seqscans=off; for more deterministic behaviour of the tests
- In the attached reproduction scripts, the generated row counts are already large enough to show the difference.
- In my production case, correlation(ts_col) is about 0.999 and kind_col = 'k_a' is about 81%.
Thanks.
I think I may have found a planner regression, or at least a planner corner case, in PostgreSQL 17 and 18 involving a partitioned table, LIMIT 1, and OR-ed timestamp ranges.
I could not find an existing report that matches this pattern closely, so I am sending a reproducible example and the observations below to discuss whether or not this behaviour is appropriate and expected.
Environment
-----------
Observed on:
- PostgreSQL 17.9 on Ubuntu 24.04
- PostgreSQL 15.x and 16.x do not show the same bad parent-level plan for the same query pattern
Workload pattern
----------------
- A table partitioned by RANGE(ts_col), with weekly partitions
- B-tree indexes on ts_col and kind_col
- ts_col is highly correlated with heap order
- kind_col = 'k_a' is very common (about 81%)
- the query uses LIMIT 1 without ORDER BY
- the filter is:
kind_col IN (...)
AND (ts_col in range1 OR ts_col in range2)
Problem summary
---------------
On PostgreSQL 17, the query on the partitioned parent table may choose a Seq Scan on the child partition instead of a ts_col-based bitmap/index path, even though the ts_col-based path is much faster when forced.
What looks suspicious to me is this:
1. PostgreSQL 17 can use the ts_col index efficiently for a very similar query when the OR condition is removed and only one timestamp range is used.
2. PostgreSQL 17 can also execute a forced bitmap path on ts_col very quickly for the same child partition.
3. But for the parent-table query with OR-ed ranges and LIMIT 1, PostgreSQL 17 may choose a Seq Scan with much worse runtime.
There is also an important difference between PostgreSQL 16 and 17 here:
- On PostgreSQL 16, the child partition by itself may still choose Seq Scan for this LIMIT 1 pattern, but the query on the partitioned parent chooses a Bitmap-based Append plan and runs fast.
- On PostgreSQL 17, the parent-level plan no longer does that in this case and regresses to the poor plan shape.
Representative observations
---------------------------
For the parent query on PostgreSQL 17:
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM fact_events
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
Observed bad plan:
- Seq Scan on the child partition under Append
- runtime around 12 seconds on our production-sized dataset
- Rows Removed by Filter: about 33 million
On the same PostgreSQL 17 instance:
A) Parent query with only one timestamp range:
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM fact_events
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND ts_col >= '2026-03-28 07:42:30.405987'
AND ts_col < '2026-03-28 08:00:00'
LIMIT 1;
This uses the ts_col index and returns very quickly.
B) Forced bitmap path on the child partition:
SET enable_seqscan = off;
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT 1
FROM fact_events_p2026w13
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
This uses BitmapOr + Bitmap Heap Scan via ts_col_idx and also returns quickly.
C) Default plan for the child partition alone:
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT 1
FROM fact_events_p2026w13
WHERE kind_col IN (
'k_a', 'k_b', 'k_c', 'k_d', 'k_e'
)
AND (
(ts_col >= '2026-03-28 07:42:30.405987' AND ts_col < '2026-03-28 08:00:00')
OR (ts_col >= '2026-03-31 12:00:00' AND ts_col < '2026-03-31 12:10:25.420965')
)
LIMIT 1;
This can still choose Seq Scan on the child relation itself, but on PostgreSQL 16 the parent-level Append plan is still good, while on PostgreSQL 17 the parent-level plan may also fall back to the poor plan shape.
Why this looks suspicious
-------------------------
- The ts_col-based path is clearly valid and fast.
- This is not just a matter of missing statistics or tuning. I tried extended statistics, random_page_cost, effective_cache_size, and combinations of these, and they did not fix the bad parent-level plan.
- I can reproduce the issue not only on a tuned production-like setup, but also on plain PostgreSQL containers with no special tuning.
- The issue looks related to parent/Append planning for LIMIT 1 on a partitioned table with OR-ed ranges.
- PostgreSQL 17 release notes mention planner changes in this area: LIMIT optimization on partitioned tables, inheritance parents, and UNION ALL. Commit that looks potentially relevant is a8a968a82 (Consider cheap startup paths in add_paths_to_append_rel, PG17), which introduced startup-path selection for Append nodes. I have not confirmed this is the root cause, but the behavior and timing match: if Seq Scan wins as cheapest_startup_path because BitmapOr has higher startup cost, the parent-level Append would pick the wrong path for LIMIT queries.
My guess is that the planner is underestimating how bad the Seq Scan really is for LIMIT 1 in this case. Since ts_col is very highly correlated with heap order, the matching rows are physically located close to the end of the partition, so the Seq Scan ends up reading a very large part of the table before it finds the first match.
Testing
------------
I am attaching two SQL files:
- reproduce_preparation_env.sql
This file creates the schema, partitions, indexes, and loads the test data.
- reproduce.sql
This file runs the actual query tests and prints short headings and comments before each step, so the output is easier to read.
The tests were performed on four clean Docker containers running PostgreSQL 15, 16, 17, 18
Notes
-----
- Exact runtimes will depend on hardware and cache state, but the plan change itself is reproducible. I'm using SET synchronize_seqscans=off; for more deterministic behaviour of the tests
- In the attached reproduction scripts, the generated row counts are already large enough to show the difference.
- In my production case, correlation(ts_col) is about 0.999 and kind_col = 'k_a' is about 81%.
Thanks.
Attachment
pgsql-hackers by date: