Hi,
We recently had an issue in production. We have queries that are procedurally generated by an Object/Relational Mapping framework. Some of these queries are huge, involving over 120 tables.
With the following parameters the planner was producing very bad plans for some of these queries (the times are from a single execution, but they are representative of the orders of magnitude involved):
----
from_collapse_limit = 14
join_collapse_limit = 14
geqo_threshold = 14
geqo_effort = 5
(cost=14691360.79..81261293.30 rows=6 width=15934)
Planning time: 3859.928 ms
Execution time: 6883365.973 ms
----
If we raise the join_collapse_limit to a really high value the plans are much better, but (of course) planning time gets worse:
----
from_collapse_limit = 150
join_collapse_limit = 150
geqo_threshold = 14
geqo_effort = 5
(cost=379719.44..562997.32 rows=7 width=15934)
Planning time: 7112.416 ms
Execution time: 7.741 ms
----
After some testing aimed at lowering the planning time, we ended up bringing down the GEQO values, and we got the best results with:
----
from_collapse_limit = 150
join_collapse_limit = 150
geqo_threshold = 2
geqo_effort = 2
(cost=406427.86..589667.55 rows=6 width=15934)
Planning time: 2721.099 ms
Execution time: 22.728 ms
----
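In case it is useful for anyone trying this, the comparisons can be done per transaction with SET LOCAL, so nothing leaks into the rest of the session. A minimal sketch (the three-way catalog join is just a stand-in for one of the ORM-generated queries, which are far too large to paste here):
----
BEGIN;
SET LOCAL from_collapse_limit = 150;
SET LOCAL join_collapse_limit = 150;
SET LOCAL geqo_threshold = 2;
SET LOCAL geqo_effort = 2;
-- stand-in query; replace with the statement under test
EXPLAIN ANALYZE
SELECT count(*)
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
JOIN pg_namespace n ON n.oid = c.relnamespace;
ROLLBACK;
----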
Issues with the join_collapse_limit have been discussed before [1], but lowering the GEQO values seems counterintuitive given the documentation for join_collapse_limit [2]: "Setting this value [join_collapse_limit] to geqo_threshold or more may trigger use of the GEQO planner, resulting in non-optimal plans."
What we want to know is whether these mechanisms are working as intended, so that we can follow a similar approach in the future (lowering the GEQO values), or whether this is just a fluke for a corner case.
I have been able to reproduce similar behaviour, on a much smaller scale, with the attached scripts on Postgres 10.
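The attached scripts are not inlined here, but just to give an idea of the general approach, a psql sketch along the following lines builds a chain of small identical tables, generates the join, and explains it under each set of parameters (table count, row counts and join shape are made up for the illustration, so plans and timings will differ from the attachment):
----
-- Build 20 small tables t1..t20; \gexec runs each generated statement.
SELECT format('CREATE TABLE t%s (id int PRIMARY KEY, ref int)', g)
FROM generate_series(1, 20) AS g
\gexec
SELECT format('INSERT INTO t%s SELECT i, i FROM generate_series(1, 1000) AS i', g)
FROM generate_series(1, 20) AS g
\gexec
VACUUM ANALYZE;

-- First configuration under test.
SET from_collapse_limit = 14;
SET join_collapse_limit = 14;
SET geqo_threshold = 14;
SET geqo_effort = 5;

-- Generate a 20-way join chain and EXPLAIN ANALYZE it.
SELECT 'EXPLAIN ANALYZE SELECT count(*) FROM t1 ' ||
       string_agg(format('JOIN t%s ON t%s.ref = t%s.id', g, g, g - 1), ' ' ORDER BY g)
FROM generate_series(2, 20) AS g
\gexec

-- Switch to the second configuration and repeat the generated EXPLAIN above.
SET from_collapse_limit = 150;
SET join_collapse_limit = 150;
SET geqo_threshold = 2;
SET geqo_effort = 2;
----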
Regards,
Juan José Santamaría