Re: Query rewrite(optimization) using constraints - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Query rewrite(optimization) using constraints
Msg-id: 252481.1633717024@sss.pgh.pa.us
In response to: Query rewrite(optimization) using constraints (Lily Liu <lilyliupku@gmail.com>)
List: pgsql-hackers

Lily Liu <lilyliupku@gmail.com> writes:
> I propose two new types of query rewrites using constraints here:

> 1) Remove DISTINCT
> A simple example is SELECT DISTINCT(name) FROM R. If there is a unique
> constraint on the name column, the DISTINCT keyword can be removed safely.
> Query plans without the DISTINCT keyword might be much cheaper, since
> DISTINCT is expensive.

There's already been a fair amount of effort in that direction, cf [1].
However, I can't avoid the impression that that patch series is adding
way too much overcomplicated infrastructure.  If you have an idea for
an easier way, let's hear it.
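
For illustration, here is a minimal sketch of the rewrite being proposed
(table and column names are hypothetical, not taken from that patch series):

    -- Note that UNIQUE alone is not sufficient: a unique column can hold
    -- multiple NULLs, which DISTINCT would collapse into a single row.
    -- The rewrite is only safe if the column is also NOT NULL.
    CREATE TABLE r (id int, name text NOT NULL UNIQUE);

    -- Given that constraint, these two queries return the same rows,
    -- and the second form skips the de-duplication step entirely:
    SELECT DISTINCT name FROM r;
    SELECT name FROM r;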

> 2) Add LIMIT 1
> An example of this optimization would be SELECT name FROM R WHERE name =
> 'foo'. If there is a unique constraint on the name column, the selection
> result has at most one record. Therefore, we can add LIMIT 1 safely. If the
> original query plan performs a sequential scan on R, adding LIMIT 1
> might speed up the query because of the early return.

I strongly suspect that this idea isn't actually useful.  If there is a
matching unique constraint, the planner will almost surely choose an
indexscan on the unique index, and adding an additional plan node to that
is likely to add more overhead than it removes.  For example,

regression=# explain analyze select * from tenk1 where unique1 = 42;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.013 rows=1 loops=1)
   Index Cond: (unique1 = 42)
 Planning Time: 0.059 ms
 Execution Time: 0.030 ms
(4 rows)

regression=# explain analyze select * from tenk1 where unique1 = 42 limit 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..8.30 rows=1 width=244) (actual time=0.013..0.013 rows=1 loops=1)
   ->  Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244) (actual time=0.012..0.012 rows=1 loops=1)
         Index Cond: (unique1 = 42)
 Planning Time: 0.067 ms
 Execution Time: 0.034 ms
(5 rows)

This test case shouldn't be taken too seriously, because it was just a
quick one-off check with little attempt to control for noise, besides
which it's a debug-enabled build.  Nonetheless, if you want to pursue
this idea I think you first need to prove that it actually is a win.
"Might speed up" won't cut it.

Another concern here is that the planner is famously bad about making
good decisions for queries involving LIMIT.  The cost estimates for
that require a bunch of assumptions that we don't need elsewhere, and
some of them aren't very trustworthy.  Now maybe for the restricted
cases where you want to add LIMIT, this isn't really a problem.  But
you haven't shown that to be true, so I'm afraid that this transformation
will sometimes lead us into worse plan choices than we'd make otherwise.
I'm pretty leery of adding LIMIT when we don't have to.
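
For reference, the classic failure mode looks like this (hypothetical table,
no unique constraint anywhere near it):

    SELECT * FROM orders
    WHERE status = 'rare'
    ORDER BY created_at
    LIMIT 1;

Seeing the LIMIT, the planner may choose to walk an index on created_at in
order, betting that a matching row turns up early.  If 'rare' rows are
scarce, or clustered toward the end of that ordering, the scan reads most of
the table.  That bet rests on assuming matches are uniformly distributed in
index order, which is exactly one of the estimates that isn't very
trustworthy.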

            regards, tom lane

[1] https://commitfest.postgresql.org/35/2433/


