Thread: Inheritance planner CPU and memory usage change since 9.3.2

Inheritance planner CPU and memory usage change since 9.3.2

From
Thomas Munro
Date:
Hi

We saw a rather extreme performance problem in a cluster upgraded from
9.1 to 9.3.  It uses a largish number of child tables (partitions) and
many columns.  Planning a simple UPDATE via the base table started
using a very large amount of memory and CPU time.

My colleague Rushabh Lathia tracked the performance change down to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
.

The call to copyObject in the loop introduced here seems to be
problematic (copying append_rel_list for every element in
append_rel_list unconditionally), though we're still trying to figure
it out.  Attached is a simple repro script, with variables to tweak.

Quite a few others have posted about this sort of thing and been
politely reminded of the 100 table caveat [1][2] which is fair enough,
but the situations seems to have got dramatically worse for some users
after an upgrade.

[1] http://www.postgresql.org/message-id/8c9acaa.1f453.14c0da0402f.Coremail.chjischj@163.com
[2]
http://www.postgresql.org/message-id/flat/20141107185824.2513.53433@wrigleys.postgresql.org#20141107185824.2513.53433@wrigleys.postgresql.org

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> We saw a rather extreme performance problem in a cluster upgraded from
> 9.1 to 9.3.  It uses a largish number of child tables (partitions) and
> many columns.  Planning a simple UPDATE via the base table started
> using a very large amount of memory and CPU time.
>
> My colleague Rushabh Lathia tracked the performance change down to
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
> .
>
> The call to copyObject in the loop introduced here seems to be
> problematic (copying append_rel_list for every element in
> append_rel_list unconditionally), though we're still trying to figure
> it out.  Attached is a simple repro script, with variables to tweak.
>
> Quite a few others have posted about this sort of thing and been
> politely reminded of the 100 table caveat [1][2] which is fair enough,
> but the situations seems to have got dramatically worse for some users
> after an upgrade.

Yes.  The copyObject() call introduced by this commit seems to have
complexity O(T^2*C) where T is the number of child tables and C is the
number of columns per child.  That's because the List that is being
copied is a list of AppendRelInfo nodes, each of which has a
translated_vars member that is a list of every Var in one table, and
we copy that list once per child.

It appears that in a lot of cases this work is unnecessary.  The
second (long) for loop in inheritance_planner copies root->rowMarks
and root->append_rel_list so as to be able to apply ChangeVarNodes()
to the result, but we only actually do that if the rte is of type
RTE_SUBQUERY or if it has security quals.  In the common case where we
reach inheritance_planner not because of UNION ALL but just because
the table has a bunch of inheritance children (none of which have RLS
policies applied), we copy everything and then modify none of it,
using up startling large amounts of memory in ways that pre-9.2
versions did not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Wed, Jun 17, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> We saw a rather extreme performance problem in a cluster upgraded from
>> 9.1 to 9.3.  It uses a largish number of child tables (partitions) and
>> many columns.  Planning a simple UPDATE via the base table started
>> using a very large amount of memory and CPU time.
>>
>> My colleague Rushabh Lathia tracked the performance change down to
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
>> .
>>
>> The call to copyObject in the loop introduced here seems to be
>> problematic (copying append_rel_list for every element in
>> append_rel_list unconditionally), though we're still trying to figure
>> it out.  Attached is a simple repro script, with variables to tweak.
>>
>> Quite a few others have posted about this sort of thing and been
>> politely reminded of the 100 table caveat [1][2] which is fair enough,
>> but the situations seems to have got dramatically worse for some users
>> after an upgrade.
>
> Yes.  The copyObject() call introduced by this commit seems to have
> complexity O(T^2*C) where T is the number of child tables and C is the
> number of columns per child.  That's because the List that is being
> copied is a list of AppendRelInfo nodes, each of which has a
> translated_vars member that is a list of every Var in one table, and
> we copy that list once per child.
>
> It appears that in a lot of cases this work is unnecessary.  The
> second (long) for loop in inheritance_planner copies root->rowMarks
> and root->append_rel_list so as to be able to apply ChangeVarNodes()
> to the result, but we only actually do that if the rte is of type
> RTE_SUBQUERY or if it has security quals.  In the common case where we
> reach inheritance_planner not because of UNION ALL but just because
> the table has a bunch of inheritance children (none of which have RLS
> policies applied), we copy everything and then modify none of it,
> using up startling large amounts of memory in ways that pre-9.2
> versions did not.

The attached patch helps.  It does two things:

1. It arranges for inheritance_planner to throw away the memory
consumed by the subroot's rowMarks and append_rel_list after the call
to grouping_planner for that subroot returns.  This prevents the
explosive growth of memory usage in all cases I've tested so far, but
planning is still really slow.

2. It arranges not to deep-copy append_rel_list when the root's
append_rel_list doesn't need to be modified for the subroot.  This
makes planning much much faster in simple cases, like a simple update
on a table with many partitions.  But if you do attach a FROM clause
containing a subquery to such an update, then this optimization
doesn't kick in any more and things are still very slow (though still
memory-bounded, due to part 1).

I feel I might be missing a trick here.  It seems unlikely to me that
we actually need the entire append_rel_list for every subquery; and we
almost certainly don't need to modify every element of the
append_rel_list for every subquery.  Even the ones that no
ChangeVarNodes() call mutates still get deep-copied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Dean Rasheed
Date:
On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 17, 2015 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> We saw a rather extreme performance problem in a cluster upgraded from
>>> 9.1 to 9.3.  It uses a largish number of child tables (partitions) and
>>> many columns.  Planning a simple UPDATE via the base table started
>>> using a very large amount of memory and CPU time.
>>>
>>> My colleague Rushabh Lathia tracked the performance change down to
>>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
>>> .
>>>
>>> The call to copyObject in the loop introduced here seems to be
>>> problematic (copying append_rel_list for every element in
>>> append_rel_list unconditionally), though we're still trying to figure
>>> it out.  Attached is a simple repro script, with variables to tweak.
>>>
>>> Quite a few others have posted about this sort of thing and been
>>> politely reminded of the 100 table caveat [1][2] which is fair enough,
>>> but the situations seems to have got dramatically worse for some users
>>> after an upgrade.
>>
>> Yes.  The copyObject() call introduced by this commit seems to have
>> complexity O(T^2*C) where T is the number of child tables and C is the
>> number of columns per child.  That's because the List that is being
>> copied is a list of AppendRelInfo nodes, each of which has a
>> translated_vars member that is a list of every Var in one table, and
>> we copy that list once per child.
>>
>> It appears that in a lot of cases this work is unnecessary.  The
>> second (long) for loop in inheritance_planner copies root->rowMarks
>> and root->append_rel_list so as to be able to apply ChangeVarNodes()
>> to the result, but we only actually do that if the rte is of type
>> RTE_SUBQUERY or if it has security quals.  In the common case where we
>> reach inheritance_planner not because of UNION ALL but just because
>> the table has a bunch of inheritance children (none of which have RLS
>> policies applied), we copy everything and then modify none of it,
>> using up startling large amounts of memory in ways that pre-9.2
>> versions did not.
>
> The attached patch helps.  It does two things:
>
> 1. It arranges for inheritance_planner to throw away the memory
> consumed by the subroot's rowMarks and append_rel_list after the call
> to grouping_planner for that subroot returns.  This prevents the
> explosive growth of memory usage in all cases I've tested so far, but
> planning is still really slow.
>
> 2. It arranges not to deep-copy append_rel_list when the root's
> append_rel_list doesn't need to be modified for the subroot.  This
> makes planning much much faster in simple cases, like a simple update
> on a table with many partitions.  But if you do attach a FROM clause
> containing a subquery to such an update, then this optimization
> doesn't kick in any more and things are still very slow (though still
> memory-bounded, due to part 1).
>
> I feel I might be missing a trick here.  It seems unlikely to me that
> we actually need the entire append_rel_list for every subquery; and we
> almost certainly don't need to modify every element of the
> append_rel_list for every subquery.  Even the ones that no
> ChangeVarNodes() call mutates still get deep-copied.
>

Yeah, you could probably pre-compute the indexes of the RTEs that need
to copied, outside of the big loop, and store them in a bitmapset.
Then, instead of copying the entire list of rowmarks/append_rel_infos
each time, you could just copy the ones that referred to those RTE
indexes (and only if the bitmapset was non-empty, which is the
equivalent of your second optimisation). However, for AppendRelInfos,
ChangeVarNodes() descends into the Vars in the translated_vars list,
so short-cutting the copying of the AppendRelInfo isn't obviously
safe. But, looking more closely, does ChangeVarNodes actually need to
examine translated_vars (the fall-through case) when child_relid isn't
the old rt_index? If not, that could be a big saving in cases like
this.

Regards,
Dean



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote:
>> I feel I might be missing a trick here.  It seems unlikely to me that
>> we actually need the entire append_rel_list for every subquery; and we
>> almost certainly don't need to modify every element of the
>> append_rel_list for every subquery.  Even the ones that no
>> ChangeVarNodes() call mutates still get deep-copied.

> Yeah, you could probably pre-compute the indexes of the RTEs that need
> to copied, outside of the big loop, and store them in a bitmapset.
> Then, instead of copying the entire list of rowmarks/append_rel_infos
> each time, you could just copy the ones that referred to those RTE
> indexes (and only if the bitmapset was non-empty, which is the
> equivalent of your second optimisation). However, for AppendRelInfos,
> ChangeVarNodes() descends into the Vars in the translated_vars list,
> so short-cutting the copying of the AppendRelInfo isn't obviously
> safe. But, looking more closely, does ChangeVarNodes actually need to
> examine translated_vars (the fall-through case) when child_relid isn't
> the old rt_index? If not, that could be a big saving in cases like
> this.

I'm a bit surprised that duplicating the append_rel_list is a noticeable
performance problem.  It ought to be far smaller than the Query tree that
we've always duplicated in this loop --- in particular, it's really a
subset of what we have in the RTE list, no?
        regards, tom lane



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Thu, Jun 18, 2015 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> On 18 June 2015 at 14:48, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I feel I might be missing a trick here.  It seems unlikely to me that
>>> we actually need the entire append_rel_list for every subquery; and we
>>> almost certainly don't need to modify every element of the
>>> append_rel_list for every subquery.  Even the ones that no
>>> ChangeVarNodes() call mutates still get deep-copied.
>
>> Yeah, you could probably pre-compute the indexes of the RTEs that need
>> to copied, outside of the big loop, and store them in a bitmapset.
>> Then, instead of copying the entire list of rowmarks/append_rel_infos
>> each time, you could just copy the ones that referred to those RTE
>> indexes (and only if the bitmapset was non-empty, which is the
>> equivalent of your second optimisation). However, for AppendRelInfos,
>> ChangeVarNodes() descends into the Vars in the translated_vars list,
>> so short-cutting the copying of the AppendRelInfo isn't obviously
>> safe. But, looking more closely, does ChangeVarNodes actually need to
>> examine translated_vars (the fall-through case) when child_relid isn't
>> the old rt_index? If not, that could be a big saving in cases like
>> this.
>
> I'm a bit surprised that duplicating the append_rel_list is a noticeable
> performance problem.  It ought to be far smaller than the Query tree that
> we've always duplicated in this loop --- in particular, it's really a
> subset of what we have in the RTE list, no?

Well, append_rel_list has an AppendRelInfo for every RTE and that
contains a List (translated_vars) which in turn contains a Var node
for every column.  I'm not sure how that compares to the RTE itself.
I think it's the cost of copying the translated_vars list that is
really the problem here - you can have 200 or 500 columns in a table,
so that's a Var and a ListCell for each one.  In the problem cases,
the number of AppendRelInfo elements is a small percentage of the
number of Var nodes under them.

That having been said, I don't know how the size of all that compares
to the size of the Query.  But I think the Query must be smaller,
because arranging to discard the AppendRelInfo and its translated_vars
list, and the per-subroot rowMarks list, after every call to
grouping_planner stops the memory blowup.

The whole translated_vars representation seems needless inefficient.
For a subquery, you probably need something like that.  But for an
inheritance child, you just need to swap the varno and maybe remap
some varattnos.  Indexing into a C array to grab the varattno seems
like it would be a heck of a lot more efficient than calling
list_nth().  It might be worth making AppendRelInfo support a choice
of representations so that we can use something more optimized for the
simple case while not losing the full generality when we need it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 18, 2015 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm a bit surprised that duplicating the append_rel_list is a noticeable
>> performance problem.  It ought to be far smaller than the Query tree that
>> we've always duplicated in this loop --- in particular, it's really a
>> subset of what we have in the RTE list, no?

> Well, append_rel_list has an AppendRelInfo for every RTE and that
> contains a List (translated_vars) which in turn contains a Var node
> for every column.  I'm not sure how that compares to the RTE itself.

RTEs also have a per-column component, namely the lists of column alias
names.  So there's something odd going on here.  I'll dig into it when
I get a chance (possibly not during PGCon).
        regards, tom lane



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, append_rel_list has an AppendRelInfo for every RTE and that
>> contains a List (translated_vars) which in turn contains a Var node
>> for every column.  I'm not sure how that compares to the RTE itself.

> RTEs also have a per-column component, namely the lists of column alias
> names.

... although I see that range_table_mutator doesn't bother to copy/change
the column alias substructure.  (Wonder if that gives rise to any
observable EXPLAIN bugs...)  But it still seems like the append_rel_list
shouldn't be all that much bulkier than all the other crap that gets
generated inside this loop.  We're not doing anything at all to reclaim
space consumed inside subquery_planner, and you'd think that would be
a lot.

By the by, the tablesample additions to range_table_mutator are obviously
broken.
        regards, tom lane



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Thu, Jun 18, 2015 at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... although I see that range_table_mutator doesn't bother to copy/change
> the column alias substructure.  (Wonder if that gives rise to any
> observable EXPLAIN bugs...)  But it still seems like the append_rel_list
> shouldn't be all that much bulkier than all the other crap that gets
> generated inside this loop.  We're not doing anything at all to reclaim
> space consumed inside subquery_planner, and you'd think that would be
> a lot.
>
> By the by, the tablesample additions to range_table_mutator are obviously
> broken.

Whee.

Meanwhile, here is an updated patch.  The attached script (a modified
version of something Thomas Munro sent me privately) contains a bunch
of test queries.  With the original patch I sent earlier, here are the
timings I got:

Q1 Time: 16215.887 ms
Q2 Time: 18674.139 ms
Q3 Time: 1029.093 ms
Q4 Time: 86497.781 ms
Q5 Time: 1143.851 ms

This version is about the same for the last three, but the first two
get much faster:

Q1 Time: 2951.231 ms
Q2 Time: 1251.809 ms
Q3 Time: 1049.235 ms
Q4 Time: 88477.803 ms
Q5 Time: 1172.965 ms

The speedup comes from the following trick: the first time we hit a
query that might requite a ChangeVarNodes() on the append_rel_list, we
compute a bitmapset of varnos that appear in that list.  Then, every
time we're thinking about doing a ChangeVarNodes from rti to new_rti,
we check whether rti appears in the Bitmapset.  If not, we can skip
ChangeVarNodes().  That seems to reduce the amount of object-copying
and object-walking attributable to this loop to something negligible
in all of these test cases.

The extraordinarily planning time for query 4 is caused by a
completely different problem: SearchCatCache eats up huge amounts of
CPU; its callers are get_attavgwidth and get_typlen.  It's not clear
to me why doubling the number of relations causes such an enormous
explosion in calls to those functions; I would expect the number of
calls to double, but presumably the actual increase is much more.
That's a separate problem, though, unconnected to
c03ad5602f529787968fa3201b35c119bbc6d782 and not necessarily a
regression.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Petr Jelinek
Date:
On 2015-06-18 22:04, Tom Lane wrote:
>
> By the by, the tablesample additions to range_table_mutator are obviously
> broken.
>

Bah, typos. Attached patch corrects them.


--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Petr Jelinek
Date:
On 2015-06-19 00:38, Petr Jelinek wrote:
> On 2015-06-18 22:04, Tom Lane wrote:
>>
>> By the by, the tablesample additions to range_table_mutator are obviously
>> broken.
>>
>
> Bah, typos. Attached patch corrects them.
>

Actually it should probably look more like this, sorry.


--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Petr Jelinek
Date:
On 2015-06-19 01:04, Petr Jelinek wrote:
> On 2015-06-19 00:38, Petr Jelinek wrote:
>> On 2015-06-18 22:04, Tom Lane wrote:
>>>
>>> By the by, the tablesample additions to range_table_mutator are
>>> obviously
>>> broken.
>>>
>>
>> Bah, typos. Attached patch corrects them.
>>
>
> Actually it should probably look more like this, sorry.
>

Apparently it's not a good idea to do this at 1AM after long day :/

The previous diff included some garbage in tests from my local
experimentations.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
Petr Jelinek <petr@2ndquadrant.com> writes:
> On 2015-06-19 01:04, Petr Jelinek wrote:
>> On 2015-06-19 00:38, Petr Jelinek wrote:
>>> On 2015-06-18 22:04, Tom Lane wrote:
>>>> By the by, the tablesample additions to range_table_mutator are
>>>> obviously broken.

> Apparently it's not a good idea to do this at 1AM after long day :/

> The previous diff included some garbage in tests from my local 
> experimentations.

Pushed with minor adjustments.
        regards, tom lane



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Thomas Munro
Date:
On Fri, Jun 19, 2015 at 9:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The extraordinarily planning time for query 4 is caused by a
> completely different problem: SearchCatCache eats up huge amounts of
> CPU; its callers are get_attavgwidth and get_typlen.  It's not clear
> to me why doubling the number of relations causes such an enormous
> explosion in calls to those functions; I would expect the number of
> calls to double, but presumably the actual increase is much more.
> That's a separate problem, though, unconnected to
> c03ad5602f529787968fa3201b35c119bbc6d782 and not necessarily a
> regression.

I don't have a great high level understanding of the planner, and Q4
may be somehow asking for trouble or unrepresentative of anything
useful, but I did some profiling and instrumenting, and I noticed that
we spend tables^2 * columns time in get_attavgwidth.  I wonder if
estimate_rel_size (or some other function in that stack, or some new
function wrapper) should remember the result for each relation for the
scope of this planner invocation.  That should bring the calls to
get_attavgwidth down to the same order as Q3 (tables * columns).

Here is some profiler output from a 500 table, 500 column Q4 run:

160295.0ms   60.2% 95                    inheritance_planner
120064.0ms   45.1% 0                     grouping_planner
119826.0ms   45.0% 2                      query_planner
114204.0ms   42.9% 0                       add_base_rels_to_query
114204.0ms   42.9% 0                        add_base_rels_to_query
114204.0ms   42.9% 151                         build_simple_rel
113817.0ms   42.8% 57                          build_simple_rel
113600.0ms   42.7% 19                           get_relation_info
112123.0ms   42.1% 27                            estimate_rel_size
111557.0ms   41.9% 14139                             get_rel_data_width
80152.0ms   30.1% 362                              get_attavgwidth
79788.0ms   30.0% 282                               SearchSysCache
79368.0ms   29.8% 52373                                SearchCatCache
13182.0ms    4.9% 2125
CatalogCacheComputeHashValue

Here are some tables showing function call counts.  The columns are
ordered like this:

1: Query number
2: Number of child tables
3: Number of columns
4: Number of calls to add_base_rels_to_query
5: Number of calls to build_simple_rel
6: Number of calls to get_relation_info
7: Number of calls to estimate_rel_size
8: Number of calls to get_attavgwidth

Q3 10 10 22 11 11 11 131
Q3 10 20 22 11 11 11 241
Q3 10 30 22 11 11 11 351
Q3 20 10 42 21 21 21 251
Q3 20 20 42 21 21 21 461
Q3 20 30 42 21 21 21 671
Q3 30 10 62 31 31 31 371
Q3 30 20 62 31 31 31 681
Q3 30 30 62 31 31 31 991
Q3 500 500 1002 501 501 501 251501

Q4 10 10 33 143 143 132 1451
Q4 10 20 33 143 143 132 2661
Q4 10 30 33 143 143 132 3871
Q4 20 10 63 483 483 462 5291
Q4 20 20 63 483 483 462 9701
Q4 20 30 63 483 483 462 14111
Q4 30 10 93 1023 1023 992 11531
Q4 30 20 93 1023 1023 992 21141
Q4 30 30 93 1023 1023 992 30751
Q4 500 500 1503 252003 252003 251502 126002501

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Meanwhile, here is an updated patch.

I don't care for that patch too much: it seems a bit brute-force, and I'm
quite worried by the assumption that it's okay to destroy each child's
append_rel_list after processing the child.  That would fail if any of
the Vars/subexpressions in the lists get incorporated into the resulting
child Plan, which does not seem impossible.  (I think in many cases we'd
do a copyObject() when extracting an append_rel_list expression, but this
hardly seems guaranteed.)

I propose instead the attached patch, which operates by identifying which
of the append_rel_list entries actually contain subquery references, and
copying only those; the other ones are just linked into the child's
append_rel_list by reference, which is okay because they won't get
modified.

On my laptop, I get the following timings for your test queries from
unmodified HEAD (--cassert build):

# Q1: 41260.239 ms
# Q2: 45225.768 ms
# Q3: 43066.958 ms
# Q4: 193360.726 ms
# Q5: 40746.503 ms

and these with my patch:

# Q1: 1767.753 ms
# Q2: 3662.131 ms
# Q3: 814.293 ms
# Q4: 64468.914 ms
# Q5: 881.295 ms

which seems to be generally a better result.

> The extraordinarily planning time for query 4 is caused by a
> completely different problem: SearchCatCache eats up huge amounts of
> CPU; its callers are get_attavgwidth and get_typlen.  It's not clear
> to me why doubling the number of relations causes such an enormous
> explosion in calls to those functions; I would expect the number of
> calls to double, but presumably the actual increase is much more.

Actually, Q4 necessarily involves O(N^2) planning time, because for
each of N target relations you're considering a join to an N-member
inheritance tree.  A lot of those ultimately get thrown away by
constraint exclusion, but not before we've expended significant
cycles on them.  I do not think we are going to get much traction
on that --- even if we do something to knock off whatever the leading
term is, there'll still be more O(N^2) behavior right behind it.

            regards, tom lane

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 8afde2b7d5069707e346901f819bed888a2333ee..d7fee96ba01272efdf7231d5985ab688912bcf58 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** inheritance_planner(PlannerInfo *root)
*** 834,840 ****
  {
      Query       *parse = root->parse;
      int            parentRTindex = parse->resultRelation;
!     Bitmapset  *resultRTindexes = NULL;
      int            nominalRelation = -1;
      List       *final_rtable = NIL;
      int            save_rel_array_size = 0;
--- 834,842 ----
  {
      Query       *parse = root->parse;
      int            parentRTindex = parse->resultRelation;
!     Bitmapset  *resultRTindexes;
!     Bitmapset  *subqueryRTindexes;
!     Bitmapset  *modifiableARIindexes;
      int            nominalRelation = -1;
      List       *final_rtable = NIL;
      int            save_rel_array_size = 0;
*************** inheritance_planner(PlannerInfo *root)
*** 845,850 ****
--- 847,853 ----
      List       *returningLists = NIL;
      List       *rowMarks;
      ListCell   *lc;
+     Index        rti;

      Assert(parse->commandType != CMD_INSERT);

*************** inheritance_planner(PlannerInfo *root)
*** 867,874 ****
       * subqueries during planning, and so we must create copies of them too,
       * except where they are target relations, which will each only be used in
       * a single plan.
       */
!     resultRTindexes = bms_add_member(resultRTindexes, parentRTindex);
      foreach(lc, root->append_rel_list)
      {
          AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
--- 870,879 ----
       * subqueries during planning, and so we must create copies of them too,
       * except where they are target relations, which will each only be used in
       * a single plan.
+      *
+      * To begin with, we'll need a bitmapset of the target relation relids.
       */
!     resultRTindexes = bms_make_singleton(parentRTindex);
      foreach(lc, root->append_rel_list)
      {
          AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
*************** inheritance_planner(PlannerInfo *root)
*** 878,889 ****
                                               appinfo->child_relid);
      }

      foreach(lc, root->append_rel_list)
      {
          AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
          PlannerInfo subroot;
          Plan       *subplan;
!         Index        rti;

          /* append_rel_list contains all append rels; ignore others */
          if (appinfo->parent_relid != parentRTindex)
--- 883,937 ----
                                               appinfo->child_relid);
      }

+     /*
+      * Now, generate a bitmapset of the relids of the subquery RTEs, including
+      * security-barrier RTEs that will become subqueries, as just explained.
+      */
+     subqueryRTindexes = NULL;
+     rti = 1;
+     foreach(lc, parse->rtable)
+     {
+         RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+
+         if (rte->rtekind == RTE_SUBQUERY ||
+             (rte->securityQuals != NIL &&
+              !bms_is_member(rti, resultRTindexes)))
+             subqueryRTindexes = bms_add_member(subqueryRTindexes, rti);
+         rti++;
+     }
+
+     /*
+      * Next, we want to identify which AppendRelInfo items contain references
+      * to any of the aforesaid subquery RTEs.  These items will need to be
+      * copied and modified to adjust their subquery references; whereas the
+      * other ones need not be touched.  It's worth being tense over this
+      * because we can usually avoid processing most of the AppendRelInfo
+      * items, thereby saving O(N^2) space and time when the target is a large
+      * inheritance tree.  We can identify AppendRelInfo items by their
+      * child_relid, since that should be unique within the list.
+      */
+     modifiableARIindexes = NULL;
+     foreach(lc, root->append_rel_list)
+     {
+         AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
+
+         if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) ||
+             bms_is_member(appinfo->child_relid, subqueryRTindexes) ||
+             bms_overlap(pull_varnos((Node *) appinfo->translated_vars),
+                         subqueryRTindexes))
+             modifiableARIindexes = bms_add_member(modifiableARIindexes,
+                                                   appinfo->child_relid);
+     }
+
+     /*
+      * And now we can get on with generating a plan for each child table.
+      */
      foreach(lc, root->append_rel_list)
      {
          AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
          PlannerInfo subroot;
          Plan       *subplan;
!         ListCell   *lc2;

          /* append_rel_list contains all append rels; ignore others */
          if (appinfo->parent_relid != parentRTindex)
*************** inheritance_planner(PlannerInfo *root)
*** 917,925 ****
          /*
           * The append_rel_list likewise might contain references to subquery
           * RTEs (if any subqueries were flattenable UNION ALLs).  So prepare
!          * to apply ChangeVarNodes to that, too.
           */
!         subroot.append_rel_list = (List *) copyObject(root->append_rel_list);

          /*
           * Add placeholders to the child Query's rangetable list to fill the
--- 965,985 ----
          /*
           * The append_rel_list likewise might contain references to subquery
           * RTEs (if any subqueries were flattenable UNION ALLs).  So prepare
!          * to apply ChangeVarNodes to that, too.  As explained above, we only
!          * want to copy items that actually contain such references; the
!          * rest can just get linked into the subroot's append_rel_list.
           */
!         subroot.append_rel_list = NIL;
!         foreach(lc2, root->append_rel_list)
!         {
!             AppendRelInfo *appinfo2 = (AppendRelInfo *) lfirst(lc2);
!
!             if (bms_is_member(appinfo2->child_relid, modifiableARIindexes))
!                 appinfo2 = (AppendRelInfo *) copyObject(appinfo2);
!
!             subroot.append_rel_list = lappend(subroot.append_rel_list,
!                                               appinfo2);
!         }

          /*
           * Add placeholders to the child Query's rangetable list to fill the
*************** inheritance_planner(PlannerInfo *root)
*** 933,945 ****

          /*
           * If this isn't the first child Query, generate duplicates of all
!          * subquery RTEs, and adjust Var numbering to reference the
!          * duplicates. To simplify the loop logic, we scan the original rtable
!          * not the copy just made by adjust_appendrel_attrs; that should be OK
!          * since subquery RTEs couldn't contain any references to the target
!          * rel.
           */
!         if (final_rtable != NIL)
          {
              ListCell   *lr;

--- 993,1005 ----

          /*
           * If this isn't the first child Query, generate duplicates of all
!          * subquery (or subquery-to-be) RTEs, and adjust Var numbering to
!          * reference the duplicates.  To simplify the loop logic, we scan the
!          * original rtable not the copy just made by adjust_appendrel_attrs;
!          * that should be OK since subquery RTEs couldn't contain any
!          * references to the target rel.
           */
!         if (final_rtable != NIL && !bms_is_empty(subqueryRTindexes))
          {
              ListCell   *lr;

*************** inheritance_planner(PlannerInfo *root)
*** 948,961 ****
              {
                  RangeTblEntry *rte = (RangeTblEntry *) lfirst(lr);

!                 /*
!                  * Copy subquery RTEs and RTEs with security barrier quals
!                  * that will be turned into subqueries, except those that are
!                  * target relations.
!                  */
!                 if (rte->rtekind == RTE_SUBQUERY ||
!                     (rte->securityQuals != NIL &&
!                      !bms_is_member(rti, resultRTindexes)))
                  {
                      Index        newrti;

--- 1008,1014 ----
              {
                  RangeTblEntry *rte = (RangeTblEntry *) lfirst(lr);

!                 if (bms_is_member(rti, subqueryRTindexes))
                  {
                      Index        newrti;


Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Meanwhile, here is an updated patch.
>
> I don't care for that patch too much: it seems a bit brute-force, and I'm
> quite worried by the assumption that it's okay to destroy each child's
> append_rel_list after processing the child.  That would fail if any of
> the Vars/subexpressions in the lists get incorporated into the resulting
> child Plan, which does not seem impossible.  (I think in many cases we'd
> do a copyObject() when extracting an append_rel_list expression, but this
> hardly seems guaranteed.)

Well, if such a thing is possible, the regression tests don't catch it
- can we add one that would?

> I propose instead the attached patch, which operates by identifying which
> of the append_rel_list entries actually contain subquery references, and
> copying only those; the other ones are just linked into the child's
> append_rel_list by reference, which is okay because they won't get
> modified.

I thought about that approach, but wasn't sure if I could make it
simple enough to pass muster.  Note that I generally erred on the side
of deferring all work as long as possible and to the greatest extent
possible in a way that your patch does not.  We don't need to compute
modifiableARIindexes if subqueryRTindexes ends up empty, and we
definitely don't need to generate O(n^2) list cells in that case.  I
think that latter point, at least, is quite likely to be worth
optimizing.  Granted, spewing out extra ListCells is far less harmful
than doing the same thing with AppendRelInfos and their entire list of
translated_vars, but it's still not good.  Can't we move the loop that
copies root.append_rel_list inside if (final_rtable != NIL &&
!bms_is_empty(subqueryRTindexes))?  If we don't take that branch,
root.append_rel_list isn't getting modified at all, so a shallow copy
is good enough.

> On my laptop, I get the following timings for your test queries from
> unmodified HEAD (--cassert build):
>
> # Q1: 41260.239 ms
> # Q2: 45225.768 ms
> # Q3: 43066.958 ms
> # Q4: 193360.726 ms
> # Q5: 40746.503 ms
>
> and these with my patch:
>
> # Q1: 1767.753 ms
> # Q2: 3662.131 ms
> # Q3: 814.293 ms
> # Q4: 64468.914 ms
> # Q5: 881.295 ms
>
> which seems to be generally a better result.

Better than unpatched, definitely!  Not sure how it compares to my patch.

>> The extraordinarily planning time for query 4 is caused by a
>> completely different problem: SearchCatCache eats up huge amounts of
>> CPU; its callers are get_attavgwidth and get_typlen.  It's not clear
>> to me why doubling the number of relations causes such an enormous
>> explosion in calls to those functions; I would expect the number of
>> calls to double, but presumably the actual increase is much more.
>
> Actually, Q4 necessarily involves O(N^2) planning time, because for
> each of N target relations you're considering a join to an N-member
> inheritance tree.  A lot of those ultimately get thrown away by
> constraint exclusion, but not before we've expended significant
> cycles on them.  I do not think we are going to get much traction
> on that --- even if we do something to knock off whatever the leading
> term is, there'll still be more O(N^2) behavior right behind it.

Hmm, maybe so.  On the other hand, if there's a way to significantly
shrink the constant factor on that O(N^2) stuff, it could bring a lot
of people some much-needed relief.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Dean Rasheed
Date:
On 21 June 2015 at 05:27, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I propose instead the attached patch, which operates by identifying which
>> of the append_rel_list entries actually contain subquery references, and
>> copying only those; the other ones are just linked into the child's
>> append_rel_list by reference, which is okay because they won't get
>> modified.
>
> Better than unpatched, definitely!  Not sure how it compares to my patch.
>

I tested on my machine (optimised build, asserts off). With HEAD I got:

Q1: 8076ms
Q2: 7165ms
Q3: 4027ms
Q4: OOM (killed by kernel, used > 16GB RAM)
Q5: 4131ms

The machine only has 16GB of RAM and almost no swap, so it wasn't able to do Q4.

With Robert's patch:

Q1: 1121ms
Q2: 542ms
Q3: 498ms
Q4: 50763ms (used 3GB RAM)
Q5: 556ms

and with Tom's patch:

Q1: 2264ms
Q2: 3785ms
Q3: 507ms
Q4: 50851ms (used 3GB RAM)
Q5: 558ms

However, there's an obvious improvement that can be made to Tom's
patch -- having computed modifiableARIindexes, you may as well use it
in the innermost loop to only apply ChangeVarNodes() to those
AppendRelInfo's that can actually change, rather than having it trawl
through all the other ones that we know won't be touched.

With that improvement (attached), the timings become:

Q1: 1148ms
Q2: 547ms
Q3: 505ms
Q4: 51325ms
Q5: 544ms

i.e., basically the same as Robert's patch.

Regards,
Dean

Attachment

Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Robert Haas
Date:
On Sun, Jun 21, 2015 at 5:45 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 21 June 2015 at 05:27, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jun 20, 2015 at 6:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I propose instead the attached patch, which operates by identifying which
>>> of the append_rel_list entries actually contain subquery references, and
>>> copying only those; the other ones are just linked into the child's
>>> append_rel_list by reference, which is okay because they won't get
>>> modified.
>>
>> Better than unpatched, definitely!  Not sure how it compares to my patch.
>>
>
> I tested on my machine (optimised build, asserts off). With HEAD I got:
>
> Q1: 8076ms
> Q2: 7165ms
> Q3: 4027ms
> Q4: OOM (killed by kernel, used > 16GB RAM)
> Q5: 4131ms
>
> The machine only has 16GB of RAM and almost no swap, so it wasn't able to do Q4.
>
> With Robert's patch:
>
> Q1: 1121ms
> Q2: 542ms
> Q3: 498ms
> Q4: 50763ms (used 3GB RAM)
> Q5: 556ms
>
> and with Tom's patch:
>
> Q1: 2264ms
> Q2: 3785ms
> Q3: 507ms
> Q4: 50851ms (used 3GB RAM)
> Q5: 558ms
>
> However, there's an obvious improvement that can be made to Tom's
> patch -- having computed modifiableARIindexes, you may as well use it
> in the innermost loop to only apply ChangeVarNodes() to those
> AppendRelInfo's that can actually change, rather than having it trawl
> through all the other ones that we know won't be touched.
>
> With that improvement (attached), the timings become:
>
> Q1: 1148ms
> Q2: 547ms
> Q3: 505ms
> Q4: 51325ms
> Q5: 544ms
>
> i.e., basically the same as Robert's patch.

Cool.  That sounds good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inheritance planner CPU and memory usage change since 9.3.2

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> However, there's an obvious improvement that can be made to Tom's
> patch -- having computed modifiableARIindexes, you may as well use it
> in the innermost loop to only apply ChangeVarNodes() to those
> AppendRelInfo's that can actually change, rather than having it trawl
> through all the other ones that we know won't be touched.

Right.  Also, as Robert noted, we can short-circuit a few more things when
there are no subquery RTEs.  I'll combine these ideas and push something,
but probably not till tomorrow.
        regards, tom lane