Thread: Inheritance planner CPU and memory usage change since 9.3.2
Hi We saw a rather extreme performance problem in a cluster upgraded from 9.1 to 9.3. It uses a largish number of child tables (partitions) and many columns. Planning a simple UPDATE via the base table started using a very large amount of memory and CPU time. My colleague Rushabh Lathia tracked the performance change down to http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782 . The call to copyObject in the loop introduced here seems to be problematic (copying append_rel_list for every element in append_rel_list unconditionally), though we're still trying to figure it out. Attached is a simple repro script, with variables to tweak. Quite a few others have posted about this sort of thing and been politely reminded of the 100 table caveat [1][2] which is fair enough, but the situations seems to have got dramatically worse for some users after an upgrade. [1] http://www.postgresql.org/message-id/8c9acaa.1f453.14c0da0402f.Coremail.chjischj@163.com [2] http://www.postgresql.org/message-id/flat/20141107185824.2513.53433@wrigleys.postgresql.org#20141107185824.2513.53433@wrigleys.postgresql.org -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > We saw a rather extreme performance problem in a cluster upgraded from > 9.1 to 9.3. It uses a largish number of child tables (partitions) and > many columns. Planning a simple UPDATE via the base table started > using a very large amount of memory and CPU time. > > My colleague Rushabh Lathia tracked the performance change down to > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782 > . > > The call to copyObject in the loop introduced here seems to be > problematic (copying append_rel_list for every element in > append_rel_list unconditionally), though we're still trying to figure > it out. Attached is a simple repro script, with variables to tweak. > > Quite a few others have posted about this sort of thing and been > politely reminded of the 100 table caveat [1][2] which is fair enough, > but the situations seems to have got dramatically worse for some users > after an upgrade. Yes. The copyObject() call introduced by this commit seems to have complexity O(T^2*C) where T is the number of child tables and C is the number of columns per child. That's because the List that is being copied is a list of AppendRelInfo nodes, each of which has a translated_vars member that is a list of every Var in one table, and we copy that list once per child. It appears that in a lot of cases this work is unnecessary. The second (long) for loop in inheritance_planner copies root->rowMarks and root->append_rel_list so as to be able to apply ChangeVarNodes() to the result, but we only actually do that if the rte is of type RTE_SUBQUERY or if it has security quals. In the common case where we reach inheritance_planner not because of UNION ALL but just because the table has a bunch of inheritance children (none of which have RLS policies applied), we copy everything and then modify none of it, using up startling large amounts of memory in ways that pre-9.2 versions did not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 17, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> We saw a rather extreme performance problem in a cluster upgraded from >> 9.1 to 9.3. It uses a largish number of child tables (partitions) and >> many columns. Planning a simple UPDATE via the base table started >> using a very large amount of memory and CPU time. >> >> My colleague Rushabh Lathia tracked the performance change down to >> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782 >> . >> >> The call to copyObject in the loop introduced here seems to be >> problematic (copying append_rel_list for every element in >> append_rel_list unconditionally), though we're still trying to figure >> it out. Attached is a simple repro script, with variables to tweak. >> >> Quite a few others have posted about this sort of thing and been >> politely reminded of the 100 table caveat [1][2] which is fair enough, >> but the situations seems to have got dramatically worse for some users >> after an upgrade. > > Yes. The copyObject() call introduced by this commit seems to have > complexity O(T^2*C) where T is the number of child tables and C is the > number of columns per child. That's because the List that is being > copied is a list of AppendRelInfo nodes, each of which has a > translated_vars member that is a list of every Var in one table, and > we copy that list once per child. > > It appears that in a lot of cases this work is unnecessary. The > second (long) for loop in inheritance_planner copies root->rowMarks > and root->append_rel_list so as to be able to apply ChangeVarNodes() > to the result, but we only actually do that if the rte is of type > RTE_SUBQUERY or if it has security quals. In the common case where we > reach inheritance_planner not because of UNION ALL but just because > the table has a bunch of inheritance children (none of which have RLS > policies applied), we copy everything and then modify none of it, > using up startling large amounts of memory in ways that pre-9.2 > versions did not. The attached patch helps. It does two things: 1. It arranges for inheritance_planner to throw away the memory consumed by the subroot's rowMarks and append_rel_list after the call to grouping_planner for that subroot returns. This prevents the explosive growth of memory usage in all cases I've tested so far, but planning is still really slow. 2. It arranges not to deep-copy append_rel_list when the root's append_rel_list doesn't need to be modified for the subroot. This makes planning much much faster in simple cases, like a simple update on a table with many partitions. But if you do attach a FROM clause containing a subquery to such an update, then this optimization doesn't kick in any more and things are still very slow (though still memory-bounded, due to part 1). I feel I might be missing a trick here. It seems unlikely to me that we actually need the entire append_rel_list for every subquery; and we almost certainly don't need to modify every element of the append_rel_list for every subquery. Even the ones that no ChangeVarNodes() call mutates still get deep-copied. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 17, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> We saw a rather extreme performance problem in a cluster upgraded from >>> 9.1 to 9.3. It uses a largish number of child tables (partitions) and >>> many columns. Planning a simple UPDATE via the base table started >>> using a very large amount of memory and CPU time. >>> >>> My colleague Rushabh Lathia tracked the performance change down to >>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782 >>> . >>> >>> The call to copyObject in the loop introduced here seems to be >>> problematic (copying append_rel_list for every element in >>> append_rel_list unconditionally), though we're still trying to figure >>> it out. Attached is a simple repro script, with variables to tweak. >>> >>> Quite a few others have posted about this sort of thing and been >>> politely reminded of the 100 table caveat [1][2] which is fair enough, >>> but the situations seems to have got dramatically worse for some users >>> after an upgrade. >> >> Yes. The copyObject() call introduced by this commit seems to have >> complexity O(T^2*C) where T is the number of child tables and C is the >> number of columns per child. That's because the List that is being >> copied is a list of AppendRelInfo nodes, each of which has a >> translated_vars member that is a list of every Var in one table, and >> we copy that list once per child. >> >> It appears that in a lot of cases this work is unnecessary. The >> second (long) for loop in inheritance_planner copies root->rowMarks >> and root->append_rel_list so as to be able to apply ChangeVarNodes() >> to the result, but we only actually do that if the rte is of type >> RTE_SUBQUERY or if it has security quals. In the common case where we >> reach inheritance_planner not because of UNION ALL but just because >> the table has a bunch of inheritance children (none of which have RLS >> policies applied), we copy everything and then modify none of it, >> using up startling large amounts of memory in ways that pre-9.2 >> versions did not. > > The attached patch helps. It does two things: > > 1. It arranges for inheritance_planner to throw away the memory > consumed by the subroot's rowMarks and append_rel_list after the call > to grouping_planner for that subroot returns. This prevents the > explosive growth of memory usage in all cases I've tested so far, but > planning is still really slow. > > 2. It arranges not to deep-copy append_rel_list when the root's > append_rel_list doesn't need to be modified for the subroot. This > makes planning much much faster in simple cases, like a simple update > on a table with many partitions. But if you do attach a FROM clause > containing a subquery to such an update, then this optimization > doesn't kick in any more and things are still very slow (though still > memory-bounded, due to part 1). > > I feel I might be missing a trick here. It seems unlikely to me that > we actually need the entire append_rel_list for every subquery; and we > almost certainly don't need to modify every element of the > append_rel_list for every subquery. Even the ones that no > ChangeVarNodes() call mutates still get deep-copied. > Yeah, you could probably pre-compute the indexes of the RTEs that need to copied, outside of the big loop, and store them in a bitmapset. Then, instead of copying the entire list of rowmarks/append_rel_infos each time, you could just copy the ones that referred to those RTE indexes (and only if the bitmapset was non-empty, which is the equivalent of your second optimisation). However, for AppendRelInfos, ChangeVarNodes() descends into the Vars in the translated_vars list, so short-cutting the copying of the AppendRelInfo isn't obviously safe. But, looking more closely, does ChangeVarNodes actually need to examine translated_vars (the fall-through case) when child_relid isn't the old rt_index? If not, that could be a big saving in cases like this. Regards, Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote: >> I feel I might be missing a trick here. It seems unlikely to me that >> we actually need the entire append_rel_list for every subquery; and we >> almost certainly don't need to modify every element of the >> append_rel_list for every subquery. Even the ones that no >> ChangeVarNodes() call mutates still get deep-copied. > Yeah, you could probably pre-compute the indexes of the RTEs that need > to copied, outside of the big loop, and store them in a bitmapset. > Then, instead of copying the entire list of rowmarks/append_rel_infos > each time, you could just copy the ones that referred to those RTE > indexes (and only if the bitmapset was non-empty, which is the > equivalent of your second optimisation). However, for AppendRelInfos, > ChangeVarNodes() descends into the Vars in the translated_vars list, > so short-cutting the copying of the AppendRelInfo isn't obviously > safe. But, looking more closely, does ChangeVarNodes actually need to > examine translated_vars (the fall-through case) when child_relid isn't > the old rt_index? If not, that could be a big saving in cases like > this. I'm a bit surprised that duplicating the append_rel_list is a noticeable performance problem. It ought to be far smaller than the Query tree that we've always duplicated in this loop --- in particular, it's really a subset of what we have in the RTE list, no? regards, tom lane
On Thu, Jun 18, 2015 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Dean Rasheed <dean.a.rasheed@gmail.com> writes: >> On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote: >>> I feel I might be missing a trick here. It seems unlikely to me that >>> we actually need the entire append_rel_list for every subquery; and we >>> almost certainly don't need to modify every element of the >>> append_rel_list for every subquery. Even the ones that no >>> ChangeVarNodes() call mutates still get deep-copied. > >> Yeah, you could probably pre-compute the indexes of the RTEs that need >> to copied, outside of the big loop, and store them in a bitmapset. >> Then, instead of copying the entire list of rowmarks/append_rel_infos >> each time, you could just copy the ones that referred to those RTE >> indexes (and only if the bitmapset was non-empty, which is the >> equivalent of your second optimisation). However, for AppendRelInfos, >> ChangeVarNodes() descends into the Vars in the translated_vars list, >> so short-cutting the copying of the AppendRelInfo isn't obviously >> safe. But, looking more closely, does ChangeVarNodes actually need to >> examine translated_vars (the fall-through case) when child_relid isn't >> the old rt_index? If not, that could be a big saving in cases like >> this. > > I'm a bit surprised that duplicating the append_rel_list is a noticeable > performance problem. It ought to be far smaller than the Query tree that > we've always duplicated in this loop --- in particular, it's really a > subset of what we have in the RTE list, no? Well, append_rel_list has an AppendRelInfo for every RTE and that contains a List (translated_vars) which in turn contains a Var node for every column. I'm not sure how that compares to the RTE itself. I think it's the cost of copying the translated_vars list that is really the problem here - you can have 200 or 500 columns in a table, so that's a Var and a ListCell for each one. In the problem cases, the number of AppendRelInfo elements is a small percentage of the number of Var nodes under them. That having been said, I don't know how the size of all that compares to the size of the Query. But I think the Query must be smaller, because arranging to discard the AppendRelInfo and its translated_vars list, and the per-subroot rowMarks list, after every call to grouping_planner stops the memory blowup. The whole translated_vars representation seems needless inefficient. For a subquery, you probably need something like that. But for an inheritance child, you just need to swap the varno and maybe remap some varattnos. Indexing into a C array to grab the varattno seems like it would be a heck of a lot more efficient than calling list_nth(). It might be worth making AppendRelInfo support a choice of representations so that we can use something more optimized for the simple case while not losing the full generality when we need it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 18, 2015 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm a bit surprised that duplicating the append_rel_list is a noticeable >> performance problem. It ought to be far smaller than the Query tree that >> we've always duplicated in this loop --- in particular, it's really a >> subset of what we have in the RTE list, no? > Well, append_rel_list has an AppendRelInfo for every RTE and that > contains a List (translated_vars) which in turn contains a Var node > for every column. I'm not sure how that compares to the RTE itself. RTEs also have a per-column component, namely the lists of column alias names. So there's something odd going on here. I'll dig into it when I get a chance (possibly not during PGCon). regards, tom lane
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Well, append_rel_list has an AppendRelInfo for every RTE and that >> contains a List (translated_vars) which in turn contains a Var node >> for every column. I'm not sure how that compares to the RTE itself. > RTEs also have a per-column component, namely the lists of column alias > names. ... although I see that range_table_mutator doesn't bother to copy/change the column alias substructure. (Wonder if that gives rise to any observable EXPLAIN bugs...) But it still seems like the append_rel_list shouldn't be all that much bulkier than all the other crap that gets generated inside this loop. We're not doing anything at all to reclaim space consumed inside subquery_planner, and you'd think that would be a lot. By the by, the tablesample additions to range_table_mutator are obviously broken. regards, tom lane
On Thu, Jun 18, 2015 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ... although I see that range_table_mutator doesn't bother to copy/change > the column alias substructure. (Wonder if that gives rise to any > observable EXPLAIN bugs...) But it still seems like the append_rel_list > shouldn't be all that much bulkier than all the other crap that gets > generated inside this loop. We're not doing anything at all to reclaim > space consumed inside subquery_planner, and you'd think that would be > a lot. > > By the by, the tablesample additions to range_table_mutator are obviously > broken. Whee. Meanwhile, here is an updated patch. The attached script (a modified version of something Thomas Munro sent me privately) contains a bunch of test queries. With the original patch I sent earlier, here are the timings I got: Q1 Time: 16215.887 ms Q2 Time: 18674.139 ms Q3 Time: 1029.093 ms Q4 Time: 86497.781 ms Q5 Time: 1143.851 ms This version is about the same for the last three, but the first two get much faster: Q1 Time: 2951.231 ms Q2 Time: 1251.809 ms Q3 Time: 1049.235 ms Q4 Time: 88477.803 ms Q5 Time: 1172.965 ms The speedup comes from the following trick: the first time we hit a query that might requite a ChangeVarNodes() on the append_rel_list, we compute a bitmapset of varnos that appear in that list. Then, every time we're thinking about doing a ChangeVarNodes from rti to new_rti, we check whether rti appears in the Bitmapset. If not, we can skip ChangeVarNodes(). That seems to reduce the amount of object-copying and object-walking attributable to this loop to something negligible in all of these test cases. The extraordinarily planning time for query 4 is caused by a completely different problem: SearchCatCache eats up huge amounts of CPU; its callers are get_attavgwidth and get_typlen. It's not clear to me why doubling the number of relations causes such an enormous explosion in calls to those functions; I would expect the number of calls to double, but presumably the actual increase is much more. That's a separate problem, though, unconnected to c03ad5602f529787968fa3201b35c119bbc6d782 and not necessarily a regression. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 2015-06-18 22:04, Tom Lane wrote: > > By the by, the tablesample additions to range_table_mutator are obviously > broken. > Bah, typos. Attached patch corrects them. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2015-06-19 00:38, Petr Jelinek wrote: > On 2015-06-18 22:04, Tom Lane wrote: >> >> By the by, the tablesample additions to range_table_mutator are obviously >> broken. >> > > Bah, typos. Attached patch corrects them. > Actually it should probably look more like this, sorry. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2015-06-19 01:04, Petr Jelinek wrote: > On 2015-06-19 00:38, Petr Jelinek wrote: >> On 2015-06-18 22:04, Tom Lane wrote: >>> >>> By the by, the tablesample additions to range_table_mutator are >>> obviously >>> broken. >>> >> >> Bah, typos. Attached patch corrects them. >> > > Actually it should probably look more like this, sorry. > Apparently it's not a good idea to do this at 1AM after long day :/ The previous diff included some garbage in tests from my local experimentations. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Petr Jelinek <petr@2ndquadrant.com> writes: > On 2015-06-19 01:04, Petr Jelinek wrote: >> On 2015-06-19 00:38, Petr Jelinek wrote: >>> On 2015-06-18 22:04, Tom Lane wrote: >>>> By the by, the tablesample additions to range_table_mutator are >>>> obviously broken. > Apparently it's not a good idea to do this at 1AM after long day :/ > The previous diff included some garbage in tests from my local > experimentations. Pushed with minor adjustments. regards, tom lane
On Fri, Jun 19, 2015 at 9:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > The extraordinarily planning time for query 4 is caused by a > completely different problem: SearchCatCache eats up huge amounts of > CPU; its callers are get_attavgwidth and get_typlen. It's not clear > to me why doubling the number of relations causes such an enormous > explosion in calls to those functions; I would expect the number of > calls to double, but presumably the actual increase is much more. > That's a separate problem, though, unconnected to > c03ad5602f529787968fa3201b35c119bbc6d782 and not necessarily a > regression. I don't have a great high level understanding of the planner, and Q4 may be somehow asking for trouble or unrepresentative of anything useful, but I did some profiling and instrumenting, and I noticed that we spend tables^2 * columns time in get_attavgwidth. I wonder if estimate_rel_size (or some other function in that stack, or some new function wrapper) should remember the result for each relation for the scope of this planner invocation. That should bring the calls to get_attavgwidth down to the same order as Q3 (tables * columns). Here is some profiler output from a 500 table, 500 column Q4 run: 160295.0ms 60.2% 95 inheritance_planner 120064.0ms 45.1% 0 grouping_planner 119826.0ms 45.0% 2 query_planner 114204.0ms 42.9% 0 add_base_rels_to_query 114204.0ms 42.9% 0 add_base_rels_to_query 114204.0ms 42.9% 151 build_simple_rel 113817.0ms 42.8% 57 build_simple_rel 113600.0ms 42.7% 19 get_relation_info 112123.0ms 42.1% 27 estimate_rel_size 111557.0ms 41.9% 14139 get_rel_data_width 80152.0ms 30.1% 362 get_attavgwidth 79788.0ms 30.0% 282 SearchSysCache 79368.0ms 29.8% 52373 SearchCatCache 13182.0ms 4.9% 2125 CatalogCacheComputeHashValue Here are some tables showing function call counts. The columns are ordered like this: 1: Query number 2: Number of child tables 3: Number of columns 4: Number of calls to add_base_rels_to_query 5: Number of calls to build_simple_rel 6: Number of calls to get_relation_info 7: Number of calls to estimate_rel_size 8: Number of calls to get_attavgwidth Q3 10 10 22 11 11 11 131 Q3 10 20 22 11 11 11 241 Q3 10 30 22 11 11 11 351 Q3 20 10 42 21 21 21 251 Q3 20 20 42 21 21 21 461 Q3 20 30 42 21 21 21 671 Q3 30 10 62 31 31 31 371 Q3 30 20 62 31 31 31 681 Q3 30 30 62 31 31 31 991 Q3 500 500 1002 501 501 501 251501 Q4 10 10 33 143 143 132 1451 Q4 10 20 33 143 143 132 2661 Q4 10 30 33 143 143 132 3871 Q4 20 10 63 483 483 462 5291 Q4 20 20 63 483 483 462 9701 Q4 20 30 63 483 483 462 14111 Q4 30 10 93 1023 1023 992 11531 Q4 30 20 93 1023 1023 992 21141 Q4 30 30 93 1023 1023 992 30751 Q4 500 500 1503 252003 252003 251502 126002501 -- Thomas Munro http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Meanwhile, here is an updated patch. I don't care for that patch too much: it seems a bit brute-force, and I'm quite worried by the assumption that it's okay to destroy each child's append_rel_list after processing the child. That would fail if any of the Vars/subexpressions in the lists get incorporated into the resulting child Plan, which does not seem impossible. (I think in many cases we'd do a copyObject() when extracting an append_rel_list expression, but this hardly seems guaranteed.) I propose instead the attached patch, which operates by identifying which of the append_rel_list entries actually contain subquery references, and copying only those; the other ones are just linked into the child's append_rel_list by reference, which is okay because they won't get modified. On my laptop, I get the following timings for your test queries from unmodified HEAD (--cassert build): # Q1: 41260.239 ms # Q2: 45225.768 ms # Q3: 43066.958 ms # Q4: 193360.726 ms # Q5: 40746.503 ms and these with my patch: # Q1: 1767.753 ms # Q2: 3662.131 ms # Q3: 814.293 ms # Q4: 64468.914 ms # Q5: 881.295 ms which seems to be generally a better result. > The extraordinarily planning time for query 4 is caused by a > completely different problem: SearchCatCache eats up huge amounts of > CPU; its callers are get_attavgwidth and get_typlen. It's not clear > to me why doubling the number of relations causes such an enormous > explosion in calls to those functions; I would expect the number of > calls to double, but presumably the actual increase is much more. Actually, Q4 necessarily involves O(N^2) planning time, because for each of N target relations you're considering a join to an N-member inheritance tree. A lot of those ultimately get thrown away by constraint exclusion, but not before we've expended significant cycles on them. I do not think we are going to get much traction on that --- even if we do something to knock off whatever the leading term is, there'll still be more O(N^2) behavior right behind it. regards, tom lane diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 8afde2b7d5069707e346901f819bed888a2333ee..d7fee96ba01272efdf7231d5985ab688912bcf58 100644 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *************** inheritance_planner(PlannerInfo *root) *** 834,840 **** { Query *parse = root->parse; int parentRTindex = parse->resultRelation; ! Bitmapset *resultRTindexes = NULL; int nominalRelation = -1; List *final_rtable = NIL; int save_rel_array_size = 0; --- 834,842 ---- { Query *parse = root->parse; int parentRTindex = parse->resultRelation; ! Bitmapset *resultRTindexes; ! Bitmapset *subqueryRTindexes; ! Bitmapset *modifiableARIindexes; int nominalRelation = -1; List *final_rtable = NIL; int save_rel_array_size = 0; *************** inheritance_planner(PlannerInfo *root) *** 845,850 **** --- 847,853 ---- List *returningLists = NIL; List *rowMarks; ListCell *lc; + Index rti; Assert(parse->commandType != CMD_INSERT); *************** inheritance_planner(PlannerInfo *root) *** 867,874 **** * subqueries during planning, and so we must create copies of them too, * except where they are target relations, which will each only be used in * a single plan. */ ! resultRTindexes = bms_add_member(resultRTindexes, parentRTindex); foreach(lc, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); --- 870,879 ---- * subqueries during planning, and so we must create copies of them too, * except where they are target relations, which will each only be used in * a single plan. + * + * To begin with, we'll need a bitmapset of the target relation relids. */ ! resultRTindexes = bms_make_singleton(parentRTindex); foreach(lc, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); *************** inheritance_planner(PlannerInfo *root) *** 878,889 **** appinfo->child_relid); } foreach(lc, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); PlannerInfo subroot; Plan *subplan; ! Index rti; /* append_rel_list contains all append rels; ignore others */ if (appinfo->parent_relid != parentRTindex) --- 883,937 ---- appinfo->child_relid); } + /* + * Now, generate a bitmapset of the relids of the subquery RTEs, including + * security-barrier RTEs that will become subqueries, as just explained. + */ + subqueryRTindexes = NULL; + rti = 1; + foreach(lc, parse->rtable) + { + RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc); + + if (rte->rtekind == RTE_SUBQUERY || + (rte->securityQuals != NIL && + !bms_is_member(rti, resultRTindexes))) + subqueryRTindexes = bms_add_member(subqueryRTindexes, rti); + rti++; + } + + /* + * Next, we want to identify which AppendRelInfo items contain references + * to any of the aforesaid subquery RTEs. These items will need to be + * copied and modified to adjust their subquery references; whereas the + * other ones need not be touched. It's worth being tense over this + * because we can usually avoid processing most of the AppendRelInfo + * items, thereby saving O(N^2) space and time when the target is a large + * inheritance tree. We can identify AppendRelInfo items by their + * child_relid, since that should be unique within the list. + */ + modifiableARIindexes = NULL; + foreach(lc, root->append_rel_list) + { + AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); + + if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) || + bms_is_member(appinfo->child_relid, subqueryRTindexes) || + bms_overlap(pull_varnos((Node *) appinfo->translated_vars), + subqueryRTindexes)) + modifiableARIindexes = bms_add_member(modifiableARIindexes, + appinfo->child_relid); + } + + /* + * And now we can get on with generating a plan for each child table. + */ foreach(lc, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); PlannerInfo subroot; Plan *subplan; ! ListCell *lc2; /* append_rel_list contains all append rels; ignore others */ if (appinfo->parent_relid != parentRTindex) *************** inheritance_planner(PlannerInfo *root) *** 917,925 **** /* * The append_rel_list likewise might contain references to subquery * RTEs (if any subqueries were flattenable UNION ALLs). So prepare ! * to apply ChangeVarNodes to that, too. */ ! subroot.append_rel_list = (List *) copyObject(root->append_rel_list); /* * Add placeholders to the child Query's rangetable list to fill the --- 965,985 ---- /* * The append_rel_list likewise might contain references to subquery * RTEs (if any subqueries were flattenable UNION ALLs). So prepare ! * to apply ChangeVarNodes to that, too. As explained above, we only ! * want to copy items that actually contain such references; the ! * rest can just get linked into the subroot's append_rel_list. */ ! subroot.append_rel_list = NIL; ! foreach(lc2, root->append_rel_list) ! { ! AppendRelInfo *appinfo2 = (AppendRelInfo *) lfirst(lc2); ! ! if (bms_is_member(appinfo2->child_relid, modifiableARIindexes)) ! appinfo2 = (AppendRelInfo *) copyObject(appinfo2); ! ! subroot.append_rel_list = lappend(subroot.append_rel_list, ! appinfo2); ! } /* * Add placeholders to the child Query's rangetable list to fill the *************** inheritance_planner(PlannerInfo *root) *** 933,945 **** /* * If this isn't the first child Query, generate duplicates of all ! * subquery RTEs, and adjust Var numbering to reference the ! * duplicates. To simplify the loop logic, we scan the original rtable ! * not the copy just made by adjust_appendrel_attrs; that should be OK ! * since subquery RTEs couldn't contain any references to the target ! * rel. */ ! if (final_rtable != NIL) { ListCell *lr; --- 993,1005 ---- /* * If this isn't the first child Query, generate duplicates of all ! * subquery (or subquery-to-be) RTEs, and adjust Var numbering to ! * reference the duplicates. To simplify the loop logic, we scan the ! * original rtable not the copy just made by adjust_appendrel_attrs; ! * that should be OK since subquery RTEs couldn't contain any ! * references to the target rel. */ ! if (final_rtable != NIL && !bms_is_empty(subqueryRTindexes)) { ListCell *lr; *************** inheritance_planner(PlannerInfo *root) *** 948,961 **** { RangeTblEntry *rte = (RangeTblEntry *) lfirst(lr); ! /* ! * Copy subquery RTEs and RTEs with security barrier quals ! * that will be turned into subqueries, except those that are ! * target relations. ! */ ! if (rte->rtekind == RTE_SUBQUERY || ! (rte->securityQuals != NIL && ! !bms_is_member(rti, resultRTindexes))) { Index newrti; --- 1008,1014 ---- { RangeTblEntry *rte = (RangeTblEntry *) lfirst(lr); ! if (bms_is_member(rti, subqueryRTindexes)) { Index newrti;
On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Meanwhile, here is an updated patch. > > I don't care for that patch too much: it seems a bit brute-force, and I'm > quite worried by the assumption that it's okay to destroy each child's > append_rel_list after processing the child. That would fail if any of > the Vars/subexpressions in the lists get incorporated into the resulting > child Plan, which does not seem impossible. (I think in many cases we'd > do a copyObject() when extracting an append_rel_list expression, but this > hardly seems guaranteed.) Well, if such a thing is possible, the regression tests don't catch it - can we add one that would? > I propose instead the attached patch, which operates by identifying which > of the append_rel_list entries actually contain subquery references, and > copying only those; the other ones are just linked into the child's > append_rel_list by reference, which is okay because they won't get > modified. I thought about that approach, but wasn't sure if I could make it simple enough to pass muster. Note that I generally erred on the side of deferring all work as long as possible and to the greatest extent possible in a way that your patch does not. We don't need to compute modifiableARIindexes if subqueryRTindexes ends up empty, and we definitely don't need to generate O(n^2) list cells in that case. I think that latter point, at least, is quite likely to be worth optimizing. Granted, spewing out extra ListCells is far less harmful than doing the same thing with AppendRelInfos and their entire list of translated_vars, but it's still not good. Can't we move the loop that copies root.append_rel_list inside if (final_rtable != NIL && !bms_is_empty(subqueryRTindexes))? If we don't take that branch, root.append_rel_list isn't getting modified at all, so a shallow copy is good enough. > On my laptop, I get the following timings for your test queries from > unmodified HEAD (--cassert build): > > # Q1: 41260.239 ms > # Q2: 45225.768 ms > # Q3: 43066.958 ms > # Q4: 193360.726 ms > # Q5: 40746.503 ms > > and these with my patch: > > # Q1: 1767.753 ms > # Q2: 3662.131 ms > # Q3: 814.293 ms > # Q4: 64468.914 ms > # Q5: 881.295 ms > > which seems to be generally a better result. Better than unpatched, definitely! Not sure how it compares to my patch. >> The extraordinarily planning time for query 4 is caused by a >> completely different problem: SearchCatCache eats up huge amounts of >> CPU; its callers are get_attavgwidth and get_typlen. It's not clear >> to me why doubling the number of relations causes such an enormous >> explosion in calls to those functions; I would expect the number of >> calls to double, but presumably the actual increase is much more. > > Actually, Q4 necessarily involves O(N^2) planning time, because for > each of N target relations you're considering a join to an N-member > inheritance tree. A lot of those ultimately get thrown away by > constraint exclusion, but not before we've expended significant > cycles on them. I do not think we are going to get much traction > on that --- even if we do something to knock off whatever the leading > term is, there'll still be more O(N^2) behavior right behind it. Hmm, maybe so. On the other hand, if there's a way to significantly shrink the constant factor on that O(N^2) stuff, it could bring a lot of people some much-needed relief. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 June 2015 at 05:27, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I propose instead the attached patch, which operates by identifying which >> of the append_rel_list entries actually contain subquery references, and >> copying only those; the other ones are just linked into the child's >> append_rel_list by reference, which is okay because they won't get >> modified. > > Better than unpatched, definitely! Not sure how it compares to my patch. > I tested on my machine (optimised build, asserts off). With HEAD I got: Q1: 8076ms Q2: 7165ms Q3: 4027ms Q4: OOM (killed by kernel, used > 16GB RAM) Q5: 4131ms The machine only has 16GB of RAM and almost no swap, so it wasn't able to do Q4. With Robert's patch: Q1: 1121ms Q2: 542ms Q3: 498ms Q4: 50763ms (used 3GB RAM) Q5: 556ms and with Tom's patch: Q1: 2264ms Q2: 3785ms Q3: 507ms Q4: 50851ms (used 3GB RAM) Q5: 558ms However, there's an obvious improvement that can be made to Tom's patch -- having computed modifiableARIindexes, you may as well use it in the innermost loop to only apply ChangeVarNodes() to those AppendRelInfo's that can actually change, rather than having it trawl through all the other ones that we know won't be touched. With that improvement (attached), the timings become: Q1: 1148ms Q2: 547ms Q3: 505ms Q4: 51325ms Q5: 544ms i.e., basically the same as Robert's patch. Regards, Dean
Attachment
On Sun, Jun 21, 2015 at 5:45 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On 21 June 2015 at 05:27, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I propose instead the attached patch, which operates by identifying which >>> of the append_rel_list entries actually contain subquery references, and >>> copying only those; the other ones are just linked into the child's >>> append_rel_list by reference, which is okay because they won't get >>> modified. >> >> Better than unpatched, definitely! Not sure how it compares to my patch. >> > > I tested on my machine (optimised build, asserts off). With HEAD I got: > > Q1: 8076ms > Q2: 7165ms > Q3: 4027ms > Q4: OOM (killed by kernel, used > 16GB RAM) > Q5: 4131ms > > The machine only has 16GB of RAM and almost no swap, so it wasn't able to do Q4. > > With Robert's patch: > > Q1: 1121ms > Q2: 542ms > Q3: 498ms > Q4: 50763ms (used 3GB RAM) > Q5: 556ms > > and with Tom's patch: > > Q1: 2264ms > Q2: 3785ms > Q3: 507ms > Q4: 50851ms (used 3GB RAM) > Q5: 558ms > > However, there's an obvious improvement that can be made to Tom's > patch -- having computed modifiableARIindexes, you may as well use it > in the innermost loop to only apply ChangeVarNodes() to those > AppendRelInfo's that can actually change, rather than having it trawl > through all the other ones that we know won't be touched. > > With that improvement (attached), the timings become: > > Q1: 1148ms > Q2: 547ms > Q3: 505ms > Q4: 51325ms > Q5: 544ms > > i.e., basically the same as Robert's patch. Cool. That sounds good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > However, there's an obvious improvement that can be made to Tom's > patch -- having computed modifiableARIindexes, you may as well use it > in the innermost loop to only apply ChangeVarNodes() to those > AppendRelInfo's that can actually change, rather than having it trawl > through all the other ones that we know won't be touched. Right. Also, as Robert noted, we can short-circuit a few more things when there are no subquery RTEs. I'll combine these ideas and push something, but probably not till tomorrow. regards, tom lane