Thread: Query rewrite (optimization) using constraints

Query rewrite (optimization) using constraints

From: Lily Liu
Date: Fri, 8 Oct 2021 10:24:33 -0700

Hi hackers,


I noticed that Postgres uses constraints to optimize/rewrite queries in limited cases, as described at https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION


I propose two new types of query rewrites using constraints here:

1) Remove DISTINCT 

A simple example is SELECT DISTINCT(name) FROM R. If there is a unique constraint on the name column, the DISTINCT keyword can be removed safely. Query plans without DISTINCT might be much cheaper, since DISTINCT typically requires a sort or a hash aggregate.
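
For concreteness, a before/after sketch on an illustrative schema. One subtlety: in Postgres a plain UNIQUE constraint still allows multiple NULLs, so the column must also be NOT NULL for the rewrite to be safe.

    CREATE TABLE R (name text UNIQUE NOT NULL);

    -- original query: DISTINCT forces a sort or hash step
    SELECT DISTINCT(name) FROM R;

    -- rewritten query: the unique, non-null constraint already
    -- guarantees that no duplicates can exist
    SELECT name FROM R;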


2) Add LIMIT 1

An example of this optimization is SELECT name FROM R WHERE name = 'foo'. If there is a unique constraint on the name column, the result contains at most one record, so we can add LIMIT 1 safely. If the original query plan performs a sequential scan on R, adding LIMIT 1 might speed up the query because the scan can stop at the first match.
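
A before/after sketch, using the same illustrative table as above:

    -- original query: a plain sequential scan keeps reading pages even
    -- after the single possible match has been found
    SELECT name FROM R WHERE name = 'foo';

    -- rewritten query: the unique constraint guarantees at most one
    -- row, so LIMIT 1 lets the executor stop at the first match
    SELECT name FROM R WHERE name = 'foo' LIMIT 1;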


We designed an algorithm to decide whether rewrites 1) and 2) can be performed safely. Rewriting the queries manually and experimenting on a table with 10K records showed a 2x~3x improvement for both rewrites. We have some other rewrite rules, but these two are the most obvious ones. With this feature, the optimizer could consider the query plans both before and after the rewrite and choose the one with the minimum cost.
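
The algorithm itself is not spelled out here, but as a rough illustration of the metadata it has to consult: both rewrites hinge on finding a valid, non-deferrable, non-partial unique index whose key columns are covered by the DISTINCT or equality columns. A catalog query along these lines lists the candidate indexes (illustrative only; an in-planner implementation would walk the relation's index list instead of running SQL):

    SELECT i.indexrelid::regclass AS candidate_index
    FROM pg_index i
    WHERE i.indrelid = 'R'::regclass
      AND i.indisunique          -- index enforces uniqueness
      AND i.indisvalid           -- fully built and usable
      AND i.indimmediate         -- uniqueness checked immediately, not deferred
      AND i.indpred IS NULL;     -- not a partial index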


Would such a feature be useful? How hard would it be to implement in the current system? Any thoughts or comments are highly appreciated!


Best,

Lily

Re: Query rewrite (optimization) using constraints

From: Justin Pryzby
Date:
On Fri, Oct 08, 2021 at 10:24:33AM -0700, Lily Liu wrote:
> 1) Remove DISTINCT
> 
> A simple example is SELECT DISTINCT(name) FROM R. If there is a unique
> constraint on the name column, the DISTINCT keyword can be removed
> safely. Query plans without DISTINCT might be much cheaper, since
> DISTINCT typically requires a sort or a hash aggregate.

There's an ongoing discussion and patches for this here:

Erase the distinctClause if the result is unique by definition
https://commitfest.postgresql.org/35/2433/

Perhaps you could help to test or review the patch?

-- 
Justin



Re: Query rewrite (optimization) using constraints

From: Tom Lane
Date:
Lily Liu <lilyliupku@gmail.com> writes:
> I propose two new types of query rewrites using constraints here:

> 1) Remove DISTINCT
> A simple example is SELECT DISTINCT(name) FROM R. If there is a unique
> constraint on the name column, the DISTINCT keyword can be removed
> safely. Query plans without DISTINCT might be much cheaper, since
> DISTINCT typically requires a sort or a hash aggregate.

There's already been a fair amount of effort in that direction, cf [1].
However, I can't avoid the impression that that patch series is adding
way too much overcomplicated infrastructure.  If you have an idea for
an easier way, let's hear it.

> 2) Add LIMIT 1
> An example of this optimization is SELECT name FROM R WHERE name =
> 'foo'. If there is a unique constraint on the name column, the result
> contains at most one record, so we can add LIMIT 1 safely. If the
> original query plan performs a sequential scan on R, adding LIMIT 1
> might speed up the query because the scan can stop at the first match.

I strongly suspect that this idea isn't actually useful.  If there is a
matching unique constraint, the planner will almost surely choose an
indexscan on the unique index, and adding an additional plan node to that
is likely to add more overhead than it removes.  For example,

regression=# explain analyze select * from tenk1 where unique1 = 42;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.013 rows=1 loops=1)
   Index Cond: (unique1 = 42)
 Planning Time: 0.059 ms
 Execution Time: 0.030 ms
(4 rows)

regression=# explain analyze select * from tenk1 where unique1 = 42 limit 1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..8.30 rows=1 width=244) (actual time=0.013..0.013 rows=1 loops=1)
   ->  Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.012 rows=1 loops=1)
         Index Cond: (unique1 = 42)
 Planning Time: 0.067 ms
 Execution Time: 0.034 ms
(5 rows)

This test case shouldn't be taken too seriously, because it was just a
quick one-off check with little attempt to control for noise, besides
which it's a debug-enabled build.  Nonetheless, if you want to pursue
this idea I think you first need to prove that it actually is a win.
"Might speed up" won't cut it.

Another concern here is that the planner is famously bad about making
good decisions for queries involving LIMIT.  The cost estimates for
that require a bunch of assumptions that we don't need elsewhere, and
some of them aren't very trustworthy.  Now maybe for the restricted
cases where you want to add LIMIT, this isn't really a problem.  But
you haven't shown that to be true, so I'm afraid that this transformation
will sometimes lead us into worse plan choices than we'd make otherwise.
I'm pretty leery of adding LIMIT when we don't have to.
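
For illustration, the classic shape of that hazard (table and column names are hypothetical):

    -- The planner may choose an ordered scan of an index on created_at,
    -- expecting to find a qualifying row quickly.  If the rare_flag rows
    -- happen to sit at the far end of the index, the scan reads most of
    -- the table before the LIMIT is satisfied.
    SELECT * FROM t WHERE rare_flag ORDER BY created_at LIMIT 1;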

            regards, tom lane

[1] https://commitfest.postgresql.org/35/2433/