Thread: Query rewrite (optimization) using constraints
Hi hackers,
I noticed that Postgres uses constraints to optimize/rewrite queries in limited cases, as described at https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION.
Here I propose two new types of query rewrites using constraints:
1) Remove DISTINCT
A simple example is SELECT DISTINCT(name) FROM R. If there is a unique constraint on the name column, the DISTINCT keyword can be removed safely. Query plans without DISTINCT might be much cheaper, since DISTINCT is expensive.
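For concreteness, here is a minimal sketch of the intended rewrite (the table and column names are just illustrative):

  CREATE TABLE r (name text UNIQUE NOT NULL);
  -- NOT NULL matters here: a unique constraint alone still permits
  -- multiple NULLs, which DISTINCT would collapse into one row but a
  -- plain scan would not.

  SELECT DISTINCT name FROM r;   -- original query
  SELECT name FROM r;            -- rewrite: already duplicate-free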
2) Add LIMIT 1
An example of this optimization is SELECT name FROM R WHERE name = 'foo'. If there is a unique constraint on the name column, the selection result has at most one record. Therefore, we can add LIMIT 1 safely. If the original query plan performs a sequential scan on R, adding LIMIT 1 might speed up the query because of the early return.
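Again a minimal sketch, using the same illustrative table:

  -- original query: the unique constraint on name guarantees that at
  -- most one row can match
  SELECT name FROM r WHERE name = 'foo';

  -- rewrite: LIMIT 1 lets the executor stop after the first match,
  -- which helps most when the plan is a sequential scan
  SELECT name FROM r WHERE name = 'foo' LIMIT 1;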
We designed an algorithm to decide whether rewrites 1) and 2) can be performed safely. Rewriting the queries manually and experimenting on a table with 10K records shows a 2x to 3x improvement for both rewrites. We have some other rewrite rules, but these two are the most obvious ones. With this feature, the optimizer could consider the query plans both before and after the rewrite and choose the one with the minimum cost.
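For reference, the kind of setup we experimented with looks roughly like this (illustrative; not necessarily our exact schema or data):

  CREATE TABLE r AS
    SELECT i::text AS name FROM generate_series(1, 10000) AS s(i);
  ALTER TABLE r ALTER COLUMN name SET NOT NULL;
  ALTER TABLE r ADD CONSTRAINT r_name_key UNIQUE (name);

  EXPLAIN ANALYZE SELECT DISTINCT name FROM r;  -- before the rewrite
  EXPLAIN ANALYZE SELECT name FROM r;           -- after the rewrite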
Would this feature be useful? How hard would it be to implement in the current system? Any thoughts or comments are highly appreciated!
Best,
Lily
On Fri, Oct 08, 2021 at 10:24:33AM -0700, Lily Liu wrote:
> 1) Remove DISTINCT
>
> A simple example is SELECT DISTINCT(name) FROM R. If there is a unique
> constraint on the name column, the DISTINCT keyword can be removed safely.
> Query plans without DISTINCT might be much cheaper, since DISTINCT is
> expensive.

There's an ongoing discussion and patches for this here:

  Erase the distinctClause if the result is unique by definition
  https://commitfest.postgresql.org/35/2433/

Perhaps you could help to test or review the patch?

--
Justin
Lily Liu <lilyliupku@gmail.com> writes:
> Here I propose two new types of query rewrites using constraints:
>
> 1) Remove DISTINCT
>
> A simple example is SELECT DISTINCT(name) FROM R. If there is a unique
> constraint on the name column, the DISTINCT keyword can be removed safely.
> Query plans without DISTINCT might be much cheaper, since DISTINCT is
> expensive.

There's already been a fair amount of effort in that direction, cf [1].
However, I can't avoid the impression that that patch series is adding
way too much overcomplicated infrastructure.  If you have an idea for
an easier way, let's hear it.

> 2) Add LIMIT 1
>
> An example of this optimization is SELECT name FROM R WHERE name =
> 'foo'. If there is a unique constraint on the name column, the selection
> result has at most one record. Therefore, we can add LIMIT 1 safely. If
> the original query plan performs a sequential scan on R, adding LIMIT 1
> might speed up the query because of the early return.

I strongly suspect that this idea isn't actually useful.  If there is a
matching unique constraint, the planner will almost surely choose an
indexscan on the unique index, and adding an additional plan node to
that is likely to add more overhead than it removes.  For example,

regression=# explain analyze select * from tenk1 where unique1 = 42;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.013 rows=1 loops=1)
   Index Cond: (unique1 = 42)
 Planning Time: 0.059 ms
 Execution Time: 0.030 ms
(4 rows)

regression=# explain analyze select * from tenk1 where unique1 = 42 limit 1;
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..8.30 rows=1 width=244) (actual time=0.013..0.013 rows=1 loops=1)
   ->  Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.012 rows=1 loops=1)
         Index Cond: (unique1 = 42)
 Planning Time: 0.067 ms
 Execution Time: 0.034 ms
(5 rows)

This test case shouldn't be taken too seriously, because it was just a
quick one-off check with little attempt to control for noise, besides
which it's a debug-enabled build.  Nonetheless, if you want to pursue
this idea I think you first need to prove that it actually is a win.
"Might speed up" won't cut it.

Another concern here is that the planner is famously bad about making
good decisions for queries involving LIMIT.  The cost estimates for
that require a bunch of assumptions that we don't need elsewhere, and
some of them aren't very trustworthy.  Now maybe for the restricted
cases where you want to add LIMIT, this isn't really a problem.  But
you haven't shown that to be true, so I'm afraid that this
transformation will sometimes lead us into worse plan choices than
we'd make otherwise.  I'm pretty leery of adding LIMIT when we don't
have to.

			regards, tom lane

[1] https://commitfest.postgresql.org/35/2433/