Thread: Replace constants in subquery

Replace constants in subquery

From

Dominik Moritz

Date:

02 May 2014, 11:07:37

Hi,

We were looking into optimizing a very long running query and very very surprised by the query plan for a query. To
illustratethe question, I created a sqlfiddle [0]. 

(1) The query we are running looks like this one:

SELECT *
FROM Foo o
WHERE
  o.b = (SELECT b
         FROM Foo i
         WHERE i.a = o.a
         ORDER BY b
         LIMIT 1)
  AND o.a = 2


(2) We made a minor change to this query by pushing the constant value (2) into the subquery:

SELECT *
FROM Foo o
WHERE
  o.b = (SELECT b
         FROM Foo i
         WHERE i.a = 2
         ORDER BY b
         LIMIT 1)
  AND o.a = 2

The difference in execution time was 5 orders of magnitude. For the former (1) query, Postgres executes the subquery
forevery row, while for the latter query (2) Postgres executes the subquery once and then uses its result. 


(3) Finally, we made another modification and removed the subquery. Also, we have an index on Foo(a,b).

SELECT *
FROM Foo o
WHERE
  o.b = 2
  AND o.a = 2

Is about 2x faster than (2) because it will run as an index scan whereas we couldn't make (2) use the index -- it looks
upthe value of b once, then does a bitmap heap scan followed by a sort.  


You might not be able to easily reproduce the difference between (2) and (3) because the decision whether to use an
indexor not depends on the data. However, we would be very interested in learning why (2) is so much faster than (1)
eventhough they seem to be equivalent. Are we missing a corder case in which (1) and (2) have different semantics?  

Thank you,
Dominik


[0] http://sqlfiddle.com/#!15/dec73/1

Re: Replace constants in subquery

From

Albe Laurenz

Date:

02 May 2014, 12:58:19

Dominik Moritz wrote:
> We were looking into optimizing a very long running query and very very surprised by the query plan
> for a query. To illustrate the question, I created a sqlfiddle [0].
> 
> (1) The query we are running looks like this one:
> 
> SELECT *
> FROM Foo o
> WHERE
>   o.b = (SELECT b
>          FROM Foo i
>          WHERE i.a = o.a
>          ORDER BY b
>          LIMIT 1)
>   AND o.a = 2
> 
> 
> (2) We made a minor change to this query by pushing the constant value (2) into the subquery:
> 
> SELECT *
> FROM Foo o
> WHERE
>   o.b = (SELECT b
>          FROM Foo i
>          WHERE i.a = 2
>          ORDER BY b
>          LIMIT 1)
>   AND o.a = 2
> 
> The difference in execution time was 5 orders of magnitude. For the former (1) query, Postgres
> executes the subquery for every row, while for the latter query (2) Postgres executes the subquery
> once and then uses its result.

[...]

> However, we would be very interested in learning
> why (2) is so much faster than (1) even though they seem to be equivalent. Are we missing a corder
> case in which (1) and (2) have different semantics?

Because of the LIMIT clause the optimizer cannot convert the subquery
into a join, and in the first case the subquery depends on o.a, which
can take different values, so the subquery has to be executed for each value
of o.a.  In the second case, the subquery does not depend on the outer query,
so ist result can be reused.

The optimizer isn't smart enough to figure out by itself that o.a will
always be the same value in the first case.

Yours,
Laurenz Albe