Missed query planner optimization - Mailing list pgsql-hackers

From Mathias Kunter
Subject Missed query planner optimization
Date
Msg-id c43bf236-ba37-3978-2845-cf8fcbc024c8@gmail.com
Whole thread Raw
List pgsql-hackers
Hi all,

this is about a limitation of the current query planner implementation 
which causes big performance declines for certain types of queries. 
Affected queries will typically execute about 1000 times slower than 
they could. Examples are given below.

After talking about this with David Rowley on the pgsql-bugs mailing 
list (bug #17964), it turns out that the reason for the problem 
apparently is that eligible IN clauses are always converted into 
semi-joins. This is done even in situations where such a conversion 
prevents further optimizations to be made.

Hence, it would be desirable that the planner would intelligently decide 
based on estimated costs whether or not an IN clause should be converted 
into a semi-join. The planner obviously can already correctly estimate 
which variant will be faster, as shown in the query plans below.

The tricky part is that it's unfortunately not guaranteed that we'll 
find the BEST possible solution if we decide independently for each IN 
clause, because estimated total costs will depend on the other IN 
clauses of the query as well. However, a simple heuristic solution would 
be to restrain from converting an IN clause into a semi-join if the 
estimated number of rows returned by the subselect is below a certain 
threshold. Then, the planner should make its final decision based on the 
estimated total cost of the two possible query variants (i.e. between 
applying the heuristic vs. not applying the heuristic).

Example queries follow. Full query plans are provided within the linked 
database fiddles.



Example 1: Combining an IN clause with OR.

SELECT * FROM book WHERE
   author_id IS NULL OR
   author_id IN (SELECT id FROM author WHERE name = 'some_name');

Execution time:                       159.227 ms
Execution time (optimized variant):     0.084 ms (1896 times faster)

Estimated cost:                     16933.31
Estimated cost (optimized variant):  2467.85     (6.86 times lower)

Full query plans here: https://dbfiddle.uk/SOOJBMwI



Example 2: Combining two IN clauses with OR.

SELECT * FROM book WHERE
   author_id IN (SELECT id FROM author WHERE name = 'some_name') OR
   publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name');

Execution time:                       227.822 ms
Execution time (optimized variant):     0.088 ms (2589 times faster)

Estimated cost:                     20422.61
Estimated cost (optimized variant):  4113.39     (4.96 times lower)

Full query plans here: https://dbfiddle.uk/q6_4NuDX



Example 3: Combining an IN clause with UNION.

SELECT * FROM
   (SELECT * FROM table1 UNION SELECT * FROM table2) AS q
   WHERE id IN (SELECT id FROM table3);

Execution time:                        932.412 ms
Execution time (optimized variant):      0.728 ms (1281 times faster)

Estimated cost:                     207933.98
Estimated cost (optimized variant):     97.40     (2135 times lower)

Full query plans here: https://dbfiddle.uk/TXASgMZf



Example 4: Complex real-life query from our project.

The full query is linked below.

Execution time:                       72436.509 ms
Execution time (optimized variant):       0.201 ms (360381 times faster)

Estimated cost:                     3941783.92
Estimated cost (optimized variant):    1515.62     (2601 times lower)

Original query here: https://pastebin.com/raw/JsY1PzG3
Optimized query here: https://pastebin.com/raw/Xvq7zUY2



Now, I'm not familiar with the current planner implementation, but 
wanted to know whether there is anybody on this list who would be 
willing to work on this. Having the planner consider the costs of 
converting IN clauses into semi-joins obviously seems like a worthy 
goal. As shown, the performance improvements are gigantic for certain 
types of queries.

Thank you very much!

Mathias



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: Jelte Fennema
Date:
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel