Thread: surprisingly expensive join planning query
Hi, while evaluating one of the CF patches (the incremental sort one, but that's mostly irrelevant), I ran into a strange issue with join planning for a fairly simple query. I needed to assess how the patch affects query planning for different GUCs, so I ran a group of queries and stashed the results into a table with this structure:

    CREATE TABLE plans (
        query    text,
        index    text,
        option   text,
        plan     text,
        type     text,
        force    boolean,
        parallel boolean
    );

Essentially all the columns are 'dimensions', with the exception of the 'plan' column, which stores the generated explain plan. The results (~60k rows / 30MB) are available here:

    https://drive.google.com/open?id=1Q4oR1KtaAil87lbMo-xUvvw_0wf_zDx-

    copy plans from '/tmp/results-100M.data';

To evaluate the results, I needed to see which GUCs result in a different plan compared to the master, so I ran a query like this:

    with master AS
        (select * from plans where option = ''),
    create_ordered_paths_parallel AS
        (select * from plans where option = 'devel_create_ordered_paths_parallel'),
    create_partial_grouping_paths_2 AS
        (select * from plans where option = 'devel_create_partial_grouping_paths_2'),
    create_partial_grouping_paths AS
        (select * from plans where option = 'devel_create_partial_grouping_paths'),
    standard_join_search AS
        (select * from plans where option = 'devel_standard_join_search'),
    add_paths_to_grouping_rel AS
        (select * from plans where option = 'devel_add_paths_to_grouping_rel'),
    gather_grouping_paths AS
        (select * from plans where option = 'devel_gather_grouping_paths'),
    create_ordered_paths AS
        (select * from plans where option = 'devel_create_ordered_paths'),
    add_paths_to_grouping_rel_parallel AS
        (select * from plans where option = 'devel_add_paths_to_grouping_rel_parallel'),
    set_rel_pathlist AS
        (select * from plans where option = 'devel_set_rel_pathlist'),
    apply_scanjoin_target_to_paths AS
        (select * from plans where option = 'devel_apply_scanjoin_target_to_paths')
    select
        master.query, master.index, master.type, master.force, master.parallel,
        md5(master.plan),
        (CASE WHEN (master.plan = r1.plan)  THEN NULL ELSE 'DIFF' END) guc1,
        (CASE WHEN (master.plan = r2.plan)  THEN NULL ELSE 'DIFF' END) guc2,
        (CASE WHEN (master.plan = r3.plan)  THEN NULL ELSE 'DIFF' END) guc3,
        (CASE WHEN (master.plan = r4.plan)  THEN NULL ELSE 'DIFF' END) guc4,
        (CASE WHEN (master.plan = r5.plan)  THEN NULL ELSE 'DIFF' END) guc5,
        (CASE WHEN (master.plan = r6.plan)  THEN NULL ELSE 'DIFF' END) guc6,
        (CASE WHEN (master.plan = r7.plan)  THEN NULL ELSE 'DIFF' END) guc7,
        (CASE WHEN (master.plan = r8.plan)  THEN NULL ELSE 'DIFF' END) guc8,
        (CASE WHEN (master.plan = r9.plan)  THEN NULL ELSE 'DIFF' END) guc9,
        (CASE WHEN (master.plan = r10.plan) THEN NULL ELSE 'DIFF' END) guc10
    from master
        join create_ordered_paths_parallel r1 using (query, index, type, force, parallel)
        join create_partial_grouping_paths r2 using (query, index, type, force, parallel)
        join create_partial_grouping_paths_2 r3 using (query, index, type, force, parallel)
        join standard_join_search r4 using (query, index, type, force, parallel)
        join add_paths_to_grouping_rel r5 using (query, index, type, force, parallel)
        join gather_grouping_paths r6 using (query, index, type, force, parallel)
        join create_ordered_paths r7 using (query, index, type, force, parallel)
        join add_paths_to_grouping_rel_parallel r8 using (query, index, type, force, parallel)
        join set_rel_pathlist r9 using (query, index, type, force, parallel)
        join apply_scanjoin_target_to_paths r10 using (query, index, type, force, parallel);

This however causes pretty serious issues during planning. Firstly, it consumes insane amounts of memory, to the extent that on my machine it crashes due to OOM. If I lower the join_collapse_limit to 1, it works just fine, but once I increase it, the memory consumption and planning time go through the roof and the query eventually crashes.

I did a bit of investigation, and after instrumenting aset.c a bit I got a statistic like this:

     size  | alloc count | alloc sum | free count | free sum |   diff
    -------+-------------+-----------+------------+----------+-----------
        64 |     5606157 | 358794048 |        118 |     7552 | 358786496

i.e. there's a lot of 64B chunks allocated, but almost none of them are freed, resulting in a ~350MB leak. There are various other chunk sizes with a lot of allocations, but nowhere close to this.

It seems most of this comes from find_mergeclauses_for_outer_pathkeys(), which builds matched_restrictinfos and then just leaves it allocated. After pfreeing this (see attached patch), the memory usage gets way down and the query completes. I'm sure there are other things we could pfree to reduce the memory usage even more.

That being said, I wonder if this is worth it - the query is a bit strange and probably could be rewritten, and we generally don't free stuff in the planner very aggressively. OTOH consuming gigabytes of memory and crashing with OOM is no fun.

The other issue is planning time, which looks like this:

    join_collapse_limit = 1        3.698 ms
    join_collapse_limit = 2        3.109 ms
    join_collapse_limit = 3        5.244 ms
    join_collapse_limit = 4        9.025 ms
    join_collapse_limit = 5       29.371 ms
    join_collapse_limit = 6       83.190 ms
    join_collapse_limit = 7       93.693 ms
    join_collapse_limit = 8      253.369 ms
    join_collapse_limit = 9      760.415 ms
    join_collapse_limit = 10    2385.519 ms
    join_collapse_limit = 11    7880.276 ms

Compared to the execution time (consistently ~800ms) this is pretty high, and it would get considerably worse with more tables (it would start crashing because of OOM too).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
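For reference, the fix being discussed amounts to freeing the per-pathkey list once its contents have been concatenated into the result - roughly like this. This is a simplified, from-memory sketch of the loop in find_mergeclauses_for_outer_pathkeys() (src/backend/optimizer/path/pathkeys.c) with the extra free added at the end, not the literal attached diff:

    foreach(i, pathkeys)
    {
        PathKey    *pathkey = (PathKey *) lfirst(i);
        EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
        List       *matched_restrictinfos = NIL;
        ListCell   *j;

        /* collect the restrictinfos whose outer eclass matches this pathkey */
        foreach(j, restrictinfos)
        {
            RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
            EquivalenceClass *clause_ec;

            clause_ec = rinfo->outer_is_left ? rinfo->left_ec : rinfo->right_ec;
            if (clause_ec == pathkey_ec)
                matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
        }

        /* if no mergeclause matches this pathkey, we're done */
        if (matched_restrictinfos == NIL)
            break;

        mergeclauses = list_concat(mergeclauses, matched_restrictinfos);

        /*
         * list_concat() copies the cells since 1cff1b95a, so the per-pathkey
         * list would otherwise be leaked on every call - free it explicitly.
         */
        list_free(matched_restrictinfos);
    }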
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> It seems most of this comes from find_mergeclauses_for_outer_pathkeys(),
> which builds matched_restrictinfos and then just leaves it allocated.
> After pfreeing this (see attached patch), the memory usage gets way down
> and the query completes.

Interesting. The memory leak was probably much less bad before commit 1cff1b95a, since in the old List implementation this code would have leaked only a list header. It makes sense to me to add the list_free.

Alternatively, it'd be possible to get rid of the separate List altogether, and just add the rinfo's to "mergeclauses" immediately. The functionality of the separate list could be replaced by a bool variable remembering whether we found any matches in this pass through the loop. I think the code would be a little less clear that way, but this report makes it clear that it's a performance bottleneck, so maybe we should just change it.

            regards, tom lane
On Sun, Dec 01, 2019 at 01:27:04PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> It seems most of this comes from find_mergeclauses_for_outer_pathkeys(),
>> which builds matched_restrictinfos and then just leaves it allocated.
>> After pfreeing this (see attached patch), the memory usage gets way down
>> and the query completes.
>
> Interesting. The memory leak was probably much less bad before
> commit 1cff1b95a, since in the old List implementation this code
> would have leaked only a list header. It makes sense to me to
> add the list_free.

I forgot to mention that I also tried older releases, back to 9.5 (I suspected it might be related to parallel queries), and I get OOM crashes there too. I can't say whether the memory is leaking slower or faster, though. I tried fixing 9.5 - a simple pfree(matched_restrictinfos) triggers some sort of list_concat error for me, which seemed a bit weird TBH.

> Alternatively, it'd be possible to get rid of the separate List
> altogether, and just add the rinfo's to "mergeclauses" immediately.
> The functionality of the separate list could be replaced by a
> bool variable remembering whether we found any matches in this
> pass through the loop. I think the code would be a little less
> clear that way, but this report makes it clear that it's a
> performance bottleneck, so maybe we should just change it.

Yes, that might be an option. And it works even on 9.5 for me (per the attached patch). I don't think it's much less clear compared to just doing an explicit free at the end.

It does fix cases with up to join_collapse_limit = 10, but with 11 it still crashes with OOM. That definitely depends on available memory, of course.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
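For reference, this second patch follows Tom's suggestion and drops the per-pathkey list entirely; roughly like this (again a simplified sketch of the loop, not the exact attached diff):

    foreach(i, pathkeys)
    {
        PathKey    *pathkey = (PathKey *) lfirst(i);
        EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
        bool        matched = false;
        ListCell   *j;

        /*
         * Add matching restrictinfos straight to the result list instead of
         * collecting them in a temporary per-pathkey list; just remember
         * whether this pathkey matched anything at all.
         */
        foreach(j, restrictinfos)
        {
            RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
            EquivalenceClass *clause_ec;

            clause_ec = rinfo->outer_is_left ? rinfo->left_ec : rinfo->right_ec;
            if (clause_ec == pathkey_ec)
            {
                mergeclauses = lappend(mergeclauses, rinfo);
                matched = true;
            }
        }

        /* if no mergeclause matches this pathkey, we're done */
        if (!matched)
            break;
    }

Since no separate list is allocated at all, this approach behaves the same with both the old and new List implementations, which is why it also applies to 9.5.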
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Dec 01, 2019 at 01:27:04PM -0500, Tom Lane wrote:
>> Alternatively, it'd be possible to get rid of the separate List
>> altogether, and just add the rinfo's to "mergeclauses" immediately.
>> The functionality of the separate list could be replaced by a
>> bool variable remembering whether we found any matches in this
>> pass through the loop. I think the code would be a little less
>> clear that way, but this report makes it clear that it's a
>> performance bottleneck, so maybe we should just change it.

> Yes, that might be an option. And it works even on 9.5 for me (per the
> attached patch). I don't think it's much less clear compared to just
> doing an explicit free at the end.

LGTM.

            regards, tom lane
On Sun, Dec 01, 2019 at 02:17:15PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sun, Dec 01, 2019 at 01:27:04PM -0500, Tom Lane wrote:
>>> Alternatively, it'd be possible to get rid of the separate List
>>> altogether, and just add the rinfo's to "mergeclauses" immediately.
>>> The functionality of the separate list could be replaced by a
>>> bool variable remembering whether we found any matches in this
>>> pass through the loop. I think the code would be a little less
>>> clear that way, but this report makes it clear that it's a
>>> performance bottleneck, so maybe we should just change it.
>
>> Yes, that might be an option. And it works even on 9.5 for me (per the
>> attached patch). I don't think it's much less clear compared to just
>> doing an explicit free at the end.
>
> LGTM.

Thanks. Do you think this is backpatch-worthy? I'm leaning to yes, but maybe tweaking this just in master is fine. The query is somewhat artificial and there are probably ways to rewrite it.

The thing that annoys me a bit is that this fix is only partial. It gets rid of maybe 80% of the allocations, but there's plenty of unnecessary stuff left allocated ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Dec 01, 2019 at 02:17:15PM -0500, Tom Lane wrote:
>> LGTM.

> Thanks. Do you think this is backpatch-worthy? I'm leaning to yes, but
> maybe tweaking this just in master is fine. The query is somewhat
> artificial and there are probably ways to rewrite it.

I don't object to back-patching.

> The thing that annoys me a bit is that this fix is only partial. It gets
> rid of maybe 80% of the allocations, but there's plenty of unnecessary
> stuff left allocated ...

Meh. I'm not that excited about getting rid of retail space wastage, unless there are single dominant points such as you found here. For small stuff it's far better to worry about memory context management. (Speaking of which, I don't quite see why this would have been a problem once you got past geqo_threshold; the context resets that GEQO does should've kept things under control.)

            regards, tom lane
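For context, the GEQO behavior Tom refers to: geqo_eval() plans each candidate join order inside a short-lived memory context and throws it away afterwards, so per-candidate planner allocations (including leaks like the one discussed here) are reclaimed wholesale. A simplified sketch of that pattern (the real code also saves and restores parts of the planner state, omitted here):

    MemoryContext mycontext;
    MemoryContext oldcxt;

    /* evaluate one candidate join order in a throwaway context */
    mycontext = AllocSetContextCreate(CurrentMemoryContext,
                                      "GEQO",
                                      ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(mycontext);

    /* ... construct and cost the join tree for this candidate tour ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(mycontext);    /* frees everything allocated above */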
On Mon, Dec 02, 2019 at 03:21:47PM -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sun, Dec 01, 2019 at 02:17:15PM -0500, Tom Lane wrote:
>>> LGTM.
>
>> Thanks. Do you think this is backpatch-worthy? I'm leaning to yes, but
>> maybe tweaking this just in master is fine. The query is somewhat
>> artificial and there are probably ways to rewrite it.
>
> I don't object to back-patching.

OK

>> The thing that annoys me a bit is that this fix is only partial. It gets
>> rid of maybe 80% of the allocations, but there's plenty of unnecessary
>> stuff left allocated ...
>
> Meh. I'm not that excited about getting rid of retail space wastage,
> unless there are single dominant points such as you found here. For
> small stuff it's far better to worry about memory context management.

I agree.

> (Speaking of which, I don't quite see why this would have been a problem
> once you got past geqo_threshold; the context resets that GEQO does
> should've kept things under control.)

Not sure I follow. geqo_threshold is 12 by default, and the OOM issues are happening way before that. It might be that one reason why this example is so bad is that the CTEs have *exactly* the same size, so the different join orders are bound to be costed exactly the same, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> (Speaking of which, I don't quite see why this would have been a problem
>> once you got past geqo_threshold; the context resets that GEQO does
>> should've kept things under control.)

> Not sure I follow. geqo_threshold is 12 by default, and the OOM issues
> are happening way before that.

Ah, right. But would the peak memory usage keep growing with more than 12 rels?

> It might be that one reason why this example is so bad is that the CTEs
> have *exactly* the same size, so the different join orders are bound to
> be costed exactly the same, I think.

Hmm. I didn't really look into exactly why this example is so awful.

            regards, tom lane