Missed query planner optimization - Mailing list pgsql-hackers
From | Mathias Kunter |
---|---|
Subject | Missed query planner optimization |
Date | |
Msg-id | c43bf236-ba37-3978-2845-cf8fcbc024c8@gmail.com Whole thread Raw |
List | pgsql-hackers |
Hi all, this is about a limitation of the current query planner implementation which causes big performance declines for certain types of queries. Affected queries will typically execute about 1000 times slower than they could. Examples are given below. After talking about this with David Rowley on the pgsql-bugs mailing list (bug #17964), it turns out that the reason for the problem apparently is that eligible IN clauses are always converted into semi-joins. This is done even in situations where such a conversion prevents further optimizations to be made. Hence, it would be desirable that the planner would intelligently decide based on estimated costs whether or not an IN clause should be converted into a semi-join. The planner obviously can already correctly estimate which variant will be faster, as shown in the query plans below. The tricky part is that it's unfortunately not guaranteed that we'll find the BEST possible solution if we decide independently for each IN clause, because estimated total costs will depend on the other IN clauses of the query as well. However, a simple heuristic solution would be to restrain from converting an IN clause into a semi-join if the estimated number of rows returned by the subselect is below a certain threshold. Then, the planner should make its final decision based on the estimated total cost of the two possible query variants (i.e. between applying the heuristic vs. not applying the heuristic). Example queries follow. Full query plans are provided within the linked database fiddles. Example 1: Combining an IN clause with OR. SELECT * FROM book WHERE author_id IS NULL OR author_id IN (SELECT id FROM author WHERE name = 'some_name'); Execution time: 159.227 ms Execution time (optimized variant): 0.084 ms (1896 times faster) Estimated cost: 16933.31 Estimated cost (optimized variant): 2467.85 (6.86 times lower) Full query plans here: https://dbfiddle.uk/SOOJBMwI Example 2: Combining two IN clauses with OR. SELECT * FROM book WHERE author_id IN (SELECT id FROM author WHERE name = 'some_name') OR publisher_id IN (SELECT id FROM publisher WHERE name = 'some_name'); Execution time: 227.822 ms Execution time (optimized variant): 0.088 ms (2589 times faster) Estimated cost: 20422.61 Estimated cost (optimized variant): 4113.39 (4.96 times lower) Full query plans here: https://dbfiddle.uk/q6_4NuDX Example 3: Combining an IN clause with UNION. SELECT * FROM (SELECT * FROM table1 UNION SELECT * FROM table2) AS q WHERE id IN (SELECT id FROM table3); Execution time: 932.412 ms Execution time (optimized variant): 0.728 ms (1281 times faster) Estimated cost: 207933.98 Estimated cost (optimized variant): 97.40 (2135 times lower) Full query plans here: https://dbfiddle.uk/TXASgMZf Example 4: Complex real-life query from our project. The full query is linked below. Execution time: 72436.509 ms Execution time (optimized variant): 0.201 ms (360381 times faster) Estimated cost: 3941783.92 Estimated cost (optimized variant): 1515.62 (2601 times lower) Original query here: https://pastebin.com/raw/JsY1PzG3 Optimized query here: https://pastebin.com/raw/Xvq7zUY2 Now, I'm not familiar with the current planner implementation, but wanted to know whether there is anybody on this list who would be willing to work on this. Having the planner consider the costs of converting IN clauses into semi-joins obviously seems like a worthy goal. As shown, the performance improvements are gigantic for certain types of queries. Thank you very much! Mathias
pgsql-hackers by date: