Thread: Optimization idea: merging multiple EXISTS(...) with constraint-based join removal

Optimization idea: merging multiple EXISTS(...) with constraint-based join removal

From

Frédéric TERRAZZONI

Date:

28 July 2015, 00:38:16

I've frequently encountered queries where the WHERE clause contains a lot of "EXISTS(...)" subqueries linked by logical ANDs. They are typically generated queries, and the "EXISTS(...)" fragments are composable pieces of SQL code acting as filters. Moreover, all those filters ends up being very similar: the joins used to navigate in the objects graph often involves the same tables.

Unfortunately, PostgreSQL doesn't seem to be smart enough to simplify them. In fact, according to my experiments, even dead-simple uncorrelated subqueries such as "SELECT * FROM t1 WHERE EXISTS(SELECT * FROM t2) AND EXISTS(SELECT * FROM t2)" are not simplified. (the execution plan is rather naive: t2 is checked twice for non-emptiness).

I've not seen anything related to that topic on the TODO list or in the mailing list. Consequently, I would like to propose a method which can be used to merge multiple EXISTS(...) semi-joins linked by logical ANDs.

For the sake of this example, the database schema is:

create table t1(id int primary key, t2_id int, t6_id int)

create table t2(id int primary key, t3_id int)

create table t3(id int primary key, t4_id int, t5_id int)

create table t4(id int primary key, val text)

create table t5(id int primary key, val text)

create table t6(id int primary key, val text)

Here is a machine-generated query. Notice that the path "t1->t2->t3" is used twice.

SELECT * FROM t1

WHERE EXISTS(

SELECT 1 FROM t2, t3, t4

WHERE t2.id = t1.t2_id

AND t3.id = t2.t3_id

AND t4.id = t3.t4_id

AND t4.val = 'XYZ'

) AND EXISTS(

SELECT 1 FROM t2, t3, t5

WHERE t2.id = t1.t2_id

AND t3.id = t2.t3_id

AND t5.id = t3.t5_id

AND t5.val = 'Blablabla'

) AND EXISTS(

SELECT 1 FROM t6

WHERE t6.id = t1.t6_id

AND t6.val = 'Hello'

)

The following transformation can be applied:

SELECT * FROM [...]

WHERE

EXISTS(SELECT 1 FROM [...tables_1...] WHERE [...conditions_1...])

AND EXISTS(SELECT 1 FROM [...tables_2...] WHERE [...conditions_2...])

<=>

SELECT * FROM [...]

WHERE

EXISTS(

SELECT 1 FROM

(SELECT 1 FROM [...tables_1...] WHERE [...conditions_1...]) q1,

(SELECT 1 FROM [...tables_2...] WHERE [...conditions_2...]) q2

)

The resulting query is:

SELECT * FROM t1

WHERE EXISTS(

SELECT 1 FROM t2 t2_a, t3 t3_a, t4 t4_a, t2 t2_b, t3 t3_b, t5, t6

WHERE t2_a.id = t1.t2_id

AND t3_a.id = t2_a.t3_id

AND t4_a.id = t3_a.t4_id

AND t4_a.val = 'XYZ'

AND t2_b.id = t1.t2_id

AND t3_b.id = t2_b.t3_id

AND t5.id = t3_b.t5_id

AND t5.val = 'Blablabla'

AND t6.id = t1.t6_id

AND t6.val = 'Hello'

)

Given the constraints, it can be shown that t2_a = t2_b AND t3_a = t3_b :

* t2_a.id = t1.t2_id AND t2_b.id = t1.t2_id => t2_a.id = t2_b.id

* t2_a.id = t2_b.id AND t2(id) is a pkey => t2_a = t2_b

* t2_a = t2_b => t2_a.t3_id = t2_b.t3_id

* t2_a.t3_id = t2_b.t3_id AND t3_b.id = t2_b.t3_id AND t3_a.id = t2_a.t3_id => t3_b.id = t3_a.id

* t3_b.id = t3_a.id AND t3(id) is a pkey => t3_a = t3_b

It is then safe to remove the joins "t2_b" and "t3_b" in the subquery, because they have been proved to be duplicates of "t2_a" and "t3_a".

SELECT * FROM t1

WHERE EXISTS(

SELECT 1 FROM t2, t3, t4, t2, t3, t5, t6

WHERE t2.id = t1.t2_id

AND t3.id = t2.t3_id

AND t4.id = t3.t4_id

AND t5.id = t3.t5_id

AND t4.val = 'XYZ'

AND t5.val = 'Blablabla'

AND t6.id = t1.t6_id

AND t6.val = 'Hello'

)

Using a connected component analysis on join predicates, joined tables can be separated and we get back our original query with the first two EXISTS(...) semi-joins merged.

SELECT * FROM t1

WHERE EXISTS(

SELECT 1 FROM t2, t3, t4, t2, t3, t5

WHERE t2.id = t1.t2_id

AND t3.id = t2.t3_id

AND t4.id = t3.t4_id

AND t5.id = t3.t5_id

AND t4.val = 'XYZ'

AND t5.val = 'Blablabla'

) AND EXISTS(

SELECT 1 FROM t6

WHERE t6.id = t1.t6_id

AND t6.val = 'Hello'

)

My questions are:

- Does PostgreSQL already supports this optimization ? Maybe it's not enabled in my case only?

- If yes, is my reasoning incorrect ? Can you point me my mistake ?

- Otherwise is there any plan to add this optimization to PostgreSQL ?

Thank you !

Frédéric Terrazzoni

Re: Optimization idea: merging multiple EXISTS(...) with constraint-based join removal

From

David Rowley

Date:

28 July 2015, 02:22:52

On 28 July 2015 at 09:37, Frédéric TERRAZZONI <frederic.terrazzoni@gmail.com> wrote:

SELECT * FROM t1
WHERE EXISTS(
SELECT 1 FROM t2, t3, t4
WHERE t2.id = t1.t2_id
AND t3.id = t2.t3_id
AND t4.id = t3.t4_id
AND t4.val = 'XYZ'
) AND EXISTS(
SELECT 1 FROM t2, t3, t5
WHERE t2.id = t1.t2_id
AND t3.id = t2.t3_id
AND t5.id = t3.t5_id
AND t5.val = 'Blablabla'
) AND EXISTS(
SELECT 1 FROM t6
WHERE t6.id = t1.t6_id
AND t6.val = 'Hello'
)

...

The resulting query is:

SELECT * FROM t1
WHERE EXISTS(
SELECT 1 FROM t2 t2_a, t3 t3_a, t4 t4_a, t2 t2_b, t3 t3_b, t5, t6
WHERE t2_a.id = t1.t2_id
AND t3_a.id = t2_a.t3_id
AND t4_a.id = t3_a.t4_id
AND t4_a.val = 'XYZ'
AND t2_b.id = t1.t2_id
AND t3_b.id = t2_b.t3_id
AND t5.id = t3_b.t5_id
AND t5.val = 'Blablabla'
AND t6.id = t1.t6_id
AND t6.val = 'Hello'
)

My questions are:
- Does PostgreSQL already supports this optimization ? Maybe it's not enabled in my case only?

No, there's nothing which supports that currently.

- If yes, is my reasoning incorrect ? Can you point me my mistake ?

It sounds reasonable to me.

- Otherwise is there any plan to add this optimization to PostgreSQL ?

I did propose the idea here http://www.postgresql.org/message-id/CAApHDvopmWq4i2BCf0VqU4mYmxzHCYwfnUMat9TWuKzdvo7=8w@mail.gmail.com but I didn't focus just with semi-joins. Without re-reading, I think I was talking about any join that could be proved to not duplicate rows could be "consolidated".

I do believe that this optimisation would be useful in more cases than most people might think, for example:

UPDATE t1 SET col1 = (SELECT col1 FROM t2 WHERE t1.id=t2.id), col2 = (SELECT col2 FROM t2 WHERE t1.id=t2.id), ...;

Of course, this query could have been written using UPDATE/FROM, (non-standard), or UPDATE t1 SET (col1,col2) = (SELECT ...), which was only added recently.

There's also the case of the view which just has 1 column missing, so the consumer joins a table that's already been joined to in the view.

I think it would be quite nice to have this, and I don't think it would be all that expensive for the planner to detect this.

I think the planner would have to do something like:

1. Scan simple_rte_array looking for relids which are the same as another entry's.

2. If found, is the join condition the same as the other one?

3. Is there a unique index to prove that joining to this does not duplicate rows, or is it a semi-join?

4. Remove relation and replace all Vars from the removed relation with the one from the other table, mark relation as REL_DEAD.

I think 1 is quite cheap to perform, so normal queries wouldn't suffer much of a slow-down from these extra checks, as most queries won't have self joins.

Are you thinking of working on it?

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services