Thread: Performance regression with PostgreSQL 11 and partitioning

Performance regression with PostgreSQL 11 and partitioning

From
Thomas Reiss
Date:
Hello,

I spent some time to test the new features on partitioning with the
beta1. I noticed a potentially huge performance regression with
plan-time partition pruning.

To show the issue, I used this DO statement to generate some partitions,
one per day :
DO $$
DECLARE
  part_date date;
  ddl text;
BEGIN
  CREATE TABLE t1 (
    num INTEGER NOT NULL,
    dt  DATE NOT NULL
  ) PARTITION BY LIST (dt);

  FOR part_date IN SELECT d FROM generate_series(date '2010-01-01',
'2020-12-31', interval '1 day') d LOOP
    ddl := 'CREATE TABLE t1_' || to_char(part_date, 'YYYY_MM_DD') || E'
PARTITION OF t1 FOR VALUES IN (\'' || part_date || E'\')';
    EXECUTE ddl;
  END LOOP;
END;
$$;

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that thes functions find_appinfos_by_relids and
bms_is_member consumes most of the CPU time with v11. With v10, this
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Regards,
Thomas


Re: Performance regression with PostgreSQL 11 and partitioning

From
Robert Haas
Date:
On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
> Then I used the following to compare the planning time :
> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>
> With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
> 143ms. I also did a little test with more than 20k partitions, and while
> the planning time was reasonable with PG10 (287.453 ms), it exploded
> with v11 with 4578.054 ms.
>
> Perf showed that thes functions find_appinfos_by_relids and
> bms_is_member consumes most of the CPU time with v11. With v10, this
> functions don't appear. It seems that find_appinfos_by_relids was
> introduced by commit 480f1f4329f.

Hmm.  Have you verified whether that commit is actually the one that
caused the regression?  It's certainly possible, but I wouldn't expect
calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
more expensive than calling find_childrel_appendrelinfo() as the
previous code did.  I wonder if some later change, perhaps related to
pruning, just caused this code path to be hit more often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Thomas Reiss
Date:

Le 25/05/2018 à 16:49, Robert Haas a écrit :
> On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
>> Then I used the following to compare the planning time :
>> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>>
>> With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
>> 143ms. I also did a little test with more than 20k partitions, and while
>> the planning time was reasonable with PG10 (287.453 ms), it exploded
>> with v11 with 4578.054 ms.
>>
>> Perf showed that thes functions find_appinfos_by_relids and
>> bms_is_member consumes most of the CPU time with v11. With v10, this
>> functions don't appear. It seems that find_appinfos_by_relids was
>> introduced by commit 480f1f4329f.
> 
> Hmm.  Have you verified whether that commit is actually the one that
> caused the regression?  It's certainly possible, but I wouldn't expect
> calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
> more expensive than calling find_childrel_appendrelinfo() as the
> previous code did.  I wonder if some later change, perhaps related to
> pruning, just caused this code path to be hit more often.

I didn't because I didn't enough time. I'll take another look next week.


Re: Performance regression with PostgreSQL 11 and partitioning

From
Amit Langote
Date:
On Fri, May 25, 2018 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
>> Then I used the following to compare the planning time :
>> explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';
>>
>> With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
>> 143ms. I also did a little test with more than 20k partitions, and while
>> the planning time was reasonable with PG10 (287.453 ms), it exploded
>> with v11 with 4578.054 ms.
>>
>> Perf showed that thes functions find_appinfos_by_relids and
>> bms_is_member consumes most of the CPU time with v11. With v10, this
>> functions don't appear. It seems that find_appinfos_by_relids was
>> introduced by commit 480f1f4329f.
>
> Hmm.  Have you verified whether that commit is actually the one that
> caused the regression?  It's certainly possible, but I wouldn't expect
> calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
> more expensive than calling find_childrel_appendrelinfo() as the
> previous code did.  I wonder if some later change, perhaps related to
> pruning, just caused this code path to be hit more often.

One more possibility is that find_appinfos_by_relids being shown high
up in profiles is called from apply_scanjoin_target_to_paths that's
new in PG 11, which in turn is called (unconditionally) from
grouping_planner.  Especially, the following bit in it:

    /*
     * If the relation is partitioned, recurseively apply the same changes to
     * all partitions and generate new Append paths. Since Append is not
     * projection-capable, that might save a separate Result node, and it also
     * is important for partitionwise aggregate.
     */
    if (rel->part_scheme && rel->part_rels)
    {
        int         partition_idx;
        List       *live_children = NIL;

        /* Adjust each partition. */
        for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
        {
            RelOptInfo *child_rel = rel->part_rels[partition_idx];
            ListCell   *lc;
            AppendRelInfo **appinfos;
            int         nappinfos;
            List       *child_scanjoin_targets = NIL;

            /* Translate scan/join targets for this child. */
            appinfos = find_appinfos_by_relids(root, child_rel->relids,
                                               &nappinfos);


Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned.  I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Fwiw, I'm not sure why the new pruning code would call here, at least
git grep find_appinfos_by_relids doesn't turn up anything interesting
in that regard.

Thanks,
Amit


Re: Performance regression with PostgreSQL 11 and partitioning

From
Robert Haas
Date:
On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> Seems here that we call find_appinfos_by_relids here for *all*
> partitions, even if all but one partition may have been pruned.  I
> haven't studied this code in detail, but I suspect it might be
> unnecessary, although I might be wrong.

Uggh.  It might be possible to skip it for dummy children.  That would
leave the dummy child rels generating a different pathtarget than the
non-dummy children, but I guess if we never use them again maybe it
wouldn't matter.

> Fwiw, I'm not sure why the new pruning code would call here, at least
> git grep find_appinfos_by_relids doesn't turn up anything interesting
> in that regard.

Hmm, OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> Seems here that we call find_appinfos_by_relids here for *all*
>> partitions, even if all but one partition may have been pruned.  I
>> haven't studied this code in detail, but I suspect it might be
>> unnecessary, although I might be wrong.

> Uggh.  It might be possible to skip it for dummy children.  That would
> leave the dummy child rels generating a different pathtarget than the
> non-dummy children, but I guess if we never use them again maybe it
> wouldn't matter.

I think this line of thought is shooting the messenger.  The core of the
problem here is that the appinfo list, which is a data structure designed
with the expectation of having a few dozen entries at most, has got four
thousand entries (or twenty thousand if you try Thomas' bigger case).
What we need, if we're going to cope with that, is something smarter than
linear searches.

I'm not exactly impressed with the design concept of
find_appinfos_by_relids in itself, either, as what that does is to
linearly search the appinfo list to produce ... an array, which also
has to be linearly searched.  In this particular example, it seems
that all the output arrays have length 1; if they did not then we'd
be seeing the search loops in adjust_appendrel_attrs et al taking
up unreasonable amounts of time as well.  (Not to mention the palloc
cycles and unreclaimed space eaten by said arrays.)

I'm inclined to think that we should flush find_appinfos_by_relids
altogether, along with these inefficient intermediate arrays, and instead
have the relevant places in adjust_appendrel_attrs call some function
defined as "gimme the AppendRelInfo for child rel A and parent rel B,
working directly from the PlannerInfo root data".  That could use a hash
lookup when dealing with more than some-small-number of AppendRelInfos,
comparable to what we do with join_rel_list/join_rel_hash.

I also wonder whether it's really necessary to do a fresh search
for each individual Var, as I see is currently happening in
adjust_appendrel_attrs_mutator.  At the very least we could expect
that there would be runs of requests for the same parent/child pair,
and avoid fresh list searches/hash probes when the answer is the
same as last time.  But do we really have to leave that for the bottom
level of the recursion, rather than doing it once per expr at some higher
call level?

Maybe it's all right to decide that this rejiggering can be left
for v12 ... did we promise anyone that it's now sane to use thousands
of partitions?

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
"Jonathan S. Katz"
Date:
> On May 25, 2018, at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Maybe it's all right to decide that this rejiggering can be left
> for v12 ... did we promise anyone that it's now sane to use thousands
> of partitions?

Per beta release, we’ve only said “improved SELECT query performance due
to enhanced partition elimination during query processing and execution as
well as parallelized partition scans” with the disclaimers of “subject to change
due to beta testing” and “please test and report back.”

In other words, no promises on the above.

However, the question is how common is the above scenario? If you’re doing
partitions by day, it would take awhile to get to 20K. But if you have something
where you have so much inbound data that you need to partition by hour, it
can occur a little bit more quickly (though will still take a couple of years, if you
were to start partitioning today).

Jonathan

Re: Performance regression with PostgreSQL 11 and partitioning

From
Justin Pryzby
Date:
On Fri, May 25, 2018 at 05:17:12PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> >> Seems here that we call find_appinfos_by_relids here for *all*
> >> partitions, even if all but one partition may have been pruned.  I
> >> haven't studied this code in detail, but I suspect it might be
> >> unnecessary, although I might be wrong.
> 
> > Uggh.  It might be possible to skip it for dummy children.  That would
> > leave the dummy child rels generating a different pathtarget than the
> > non-dummy children, but I guess if we never use them again maybe it
> > wouldn't matter.

> Maybe it's all right to decide that this rejiggering can be left
> for v12 ... did we promise anyone that it's now sane to use thousands
> of partitions?

https://www.postgresql.org/docs/devel/static/ddl-partitioning.html
|The following caveats apply to CONSTRAINT EXCLUSION:
|[...]
|All constraints on all partitions of the master table are examined during
|constraint exclusion, so large numbers of partitions are likely to increase
|query planning time considerably. So the legacy inheritance based partitioning
|will work well with up to perhaps a hundred partitions; DON'T TRY TO USE MANY
|THOUSANDS OF PARTITIONS.

It doesn't matter for us, as we're already using tooo many partitions, but I
would interpret that to mean that thousands of partitions are a job exclusively
for "partition pruning" of declarative partitions.

Justin


Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 26 May 2018 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm inclined to think that we should flush find_appinfos_by_relids
> altogether, along with these inefficient intermediate arrays, and instead
> have the relevant places in adjust_appendrel_attrs call some function
> defined as "gimme the AppendRelInfo for child rel A and parent rel B,
> working directly from the PlannerInfo root data".  That could use a hash
> lookup when dealing with more than some-small-number of AppendRelInfos,
> comparable to what we do with join_rel_list/join_rel_hash.

For partitioned tables, a child just has a single parent, so to speed
up find_appinfos_by_relids we'd simply just need a faster way to find
the AppendRelInfo by child relid.

I've been working on a patch series which I plan to submit to PG12
aimed to speed up partitioning. Ideally, I'd have liked to just tag
the AppendRelInfo onto the child RelOptInfo, but that falls down
during inheritance planning.

In the end I just made an array to store AppendRelInfo's by their
child_relid which is created and populated during
setup_simple_rel_arrays.

Probably the patch could go a bit further and skip array allocation
when the append_rel_list is empty, but I've been busy working on
another bunch of stuff to improve planning time. I'd planned to give
this another look before submitting in the PG12 cycle, so state ==
WIP.

A quick test of the attached on Thomas' 4k part test, I get:

Unpatched
tps = 5.957508 (excluding connections establishing)

Patched:
tps = 15.368806 (excluding connections establishing)

Using that, if Thomas sees the same speedup then that puts PG11 at
55.4ms vs his measured 66ms in PG10.

The attached should apply with some fuzz to master.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 26 May 2018 at 02:30, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
> I spent some time to test the new features on partitioning with the
> beta1. I noticed a potentially huge performance regression with
> plan-time partition pruning.

Thanks for reporting. I've added this item to the PG11 open items list in:

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

I've placed it there so that it does not get forgotten about. It's
still to be decided if we can come up with something low-risk enough
that will resolve the issue.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Thomas Reiss
Date:

Le 30/05/2018 à 03:57, David Rowley a écrit :
> On 26 May 2018 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm inclined to think that we should flush find_appinfos_by_relids
>> altogether, along with these inefficient intermediate arrays, and instead
>> have the relevant places in adjust_appendrel_attrs call some function
>> defined as "gimme the AppendRelInfo for child rel A and parent rel B,
>> working directly from the PlannerInfo root data".  That could use a hash
>> lookup when dealing with more than some-small-number of AppendRelInfos,
>> comparable to what we do with join_rel_list/join_rel_hash.
> 
> For partitioned tables, a child just has a single parent, so to speed
> up find_appinfos_by_relids we'd simply just need a faster way to find
> the AppendRelInfo by child relid.
> 
> I've been working on a patch series which I plan to submit to PG12
> aimed to speed up partitioning. Ideally, I'd have liked to just tag
> the AppendRelInfo onto the child RelOptInfo, but that falls down
> during inheritance planning.
> 
> In the end I just made an array to store AppendRelInfo's by their
> child_relid which is created and populated during
> setup_simple_rel_arrays.
> 
> Probably the patch could go a bit further and skip array allocation
> when the append_rel_list is empty, but I've been busy working on
> another bunch of stuff to improve planning time. I'd planned to give
> this another look before submitting in the PG12 cycle, so state ==
> WIP.
> 
> A quick test of the attached on Thomas' 4k part test, I get:
> 
> Unpatched
> tps = 5.957508 (excluding connections establishing)
> 
> Patched:
> tps = 15.368806 (excluding connections establishing)
> 
> Using that, if Thomas sees the same speedup then that puts PG11 at
> 55.4ms vs his measured 66ms in PG10.
> 
> The attached should apply with some fuzz to master.


Thanks for this quick patch David, unfortunately I didn't have time to
play around with it until now.

On the 4k part test, the planning time falls down to 55 ms instead of
the 4 seconds from the current head.
I also have better results with pgbench. Your patch solves this planning
time issue.

Thanks David !




Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Wed, May 30, 2018 at 7:27 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

>
> In the end I just made an array to store AppendRelInfo's by their
> child_relid which is created and populated during
> setup_simple_rel_arrays.

May be we want to change the name of the function or plural "arrays" conveys it?

>
> Probably the patch could go a bit further and skip array allocation
> when the append_rel_list is empty, but I've been busy working on
> another bunch of stuff to improve planning time. I'd planned to give
> this another look before submitting in the PG12 cycle, so state ==
> WIP.

I was wondering if we can get rid of append_rel_list altogether. In
your patch, you have saved AppendRelInfos by child_relid. So all the
slots indexed by parent relid are empty. We could place AppendRelInfos
by parent relid. Thus a given AppendRelInfo would be places at two
places, one at the index of child relid and second at the index
pointed by parent relid. That's fine even in case of multilevel
inheritance since an intermediate parent has two relids one as a
parent and other as a child.

One problem with that we do not know how long the array would be to
start with. But we could keep on repallocing the array to increase its
size.

It then helps us in places where we iterate over append_rel_list
looking for a parent id. It does not help in places like the loop
below in inheritance_planner().

        foreach(lc, root->append_rel_list)
        {
            AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);

            if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) ||
                bms_is_member(appinfo->child_relid, subqueryRTindexes) ||
                bms_overlap(pull_varnos((Node *) appinfo->translated_vars),
                            subqueryRTindexes))
                modifiableARIindexes = bms_add_member(modifiableARIindexes,
                                                      appinfo->child_relid);
        }


>
> A quick test of the attached on Thomas' 4k part test, I get:
>
> Unpatched
> tps = 5.957508 (excluding connections establishing)
>
> Patched:
> tps = 15.368806 (excluding connections establishing)

That's really good.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> I was wondering if we can get rid of append_rel_list altogether. In
> your patch, you have saved AppendRelInfos by child_relid. So all the
> slots indexed by parent relid are empty. We could place AppendRelInfos
> by parent relid. Thus a given AppendRelInfo would be places at two
> places, one at the index of child relid and second at the index
> pointed by parent relid. That's fine even in case of multilevel
> inheritance since an intermediate parent has two relids one as a
> parent and other as a child.
>
> One problem with that we do not know how long the array would be to
> start with. But we could keep on repallocing the array to increase its
> size.

To be honest, the patch was meant as a discussion topic for ways to
improve the reported regression in PG11. The patch was put together
fairly quickly.

I've not really considered what happens in the case where an inherited
table has multiple parents.  The patch happens to pass regression
tests, but I've not checked to see if the existing tests create any
tables this way.

Perhaps it's okay to change things this way for just partitioned
tables and leave inheritance hierarchies alone.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Tue, Jun 5, 2018 at 5:50 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> I was wondering if we can get rid of append_rel_list altogether. In
>> your patch, you have saved AppendRelInfos by child_relid. So all the
>> slots indexed by parent relid are empty. We could place AppendRelInfos
>> by parent relid. Thus a given AppendRelInfo would be places at two
>> places, one at the index of child relid and second at the index
>> pointed by parent relid. That's fine even in case of multilevel
>> inheritance since an intermediate parent has two relids one as a
>> parent and other as a child.
>>
>> One problem with that we do not know how long the array would be to
>> start with. But we could keep on repallocing the array to increase its
>> size.
>
> To be honest, the patch was meant as a discussion topic for ways to
> improve the reported regression in PG11. The patch was put together
> fairly quickly.

I think the idea is brilliant. I do not have objections for trying
something in that direction. I am suggesting that we take this a bit
forward and try to eliminate append_rel_list altogether.

>
> I've not really considered what happens in the case where an inherited
> table has multiple parents.  The patch happens to pass regression
> tests, but I've not checked to see if the existing tests create any
> tables this way.

AFAIK, that case doesn't arise while processing a query. Examining the
way we create AppendRelInfos during expand_inherited_tables(), it's
clear that we create only one AppendRelInfo for a given child and also
for a given child and parent pair. If there are two tables from which
a child inherits, the child's RTE/RelOptInfo is associated only with
the top-most parent that appears in the query. See following comment
from find_all_inheritors()
        /*
         * Add to the queue only those children not already seen. This avoids
         * making duplicate entries in case of multiple inheritance paths from
         * the same parent.  (It'll also keep us from getting into an infinite
         * loop, though theoretically there can't be any cycles in the
         * inheritance graph anyway.)
         */

>
> Perhaps it's okay to change things this way for just partitioned
> tables and leave inheritance hierarchies alone.
>

There's no point having two separate code paths when internally we are
treating them as same. The only difference between partitioning and
inheritance is that for multi-level partitioned table, we preserve the
inheritance hierarchy whereas for multi-level inheritance, we flatten
it.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Amit Langote
Date:
On 2018/06/05 13:44, Ashutosh Bapat wrote:
> On Tue, Jun 5, 2018 at 5:50 AM, David Rowley wrote:
>> On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>> I was wondering if we can get rid of append_rel_list altogether. In
>>> your patch, you have saved AppendRelInfos by child_relid. So all the
>>> slots indexed by parent relid are empty. We could place AppendRelInfos
>>> by parent relid. Thus a given AppendRelInfo would be places at two
>>> places, one at the index of child relid and second at the index
>>> pointed by parent relid. That's fine even in case of multilevel
>>> inheritance since an intermediate parent has two relids one as a
>>> parent and other as a child.
>>>
>>> One problem with that we do not know how long the array would be to
>>> start with. But we could keep on repallocing the array to increase its
>>> size.
>>
>> To be honest, the patch was meant as a discussion topic for ways to
>> improve the reported regression in PG11. The patch was put together
>> fairly quickly.
> 
> I think the idea is brilliant. I do not have objections for trying
> something in that direction. I am suggesting that we take this a bit
> forward and try to eliminate append_rel_list altogether.

Imho, we should try to refrain from implementing something that needs
repalloc'ing the array, as long as we're doing it as a fix for PG 11.
Also, because getting rid of append_rel_list altogether seems a bit involved.

Let the AppendRelInfo's be added to append_rel_list when created, as is
done now (expand_inherited_tables), and transfer them to the array when
setting up various arrays (setup_simple_rel_arrays) as the David's patch does.

>> I've not really considered what happens in the case where an inherited
>> table has multiple parents.  The patch happens to pass regression
>> tests, but I've not checked to see if the existing tests create any
>> tables this way.
> 
> AFAIK, that case doesn't arise while processing a query. Examining the
> way we create AppendRelInfos during expand_inherited_tables(), it's
> clear that we create only one AppendRelInfo for a given child and also
> for a given child and parent pair. If there are two tables from which
> a child inherits, the child's RTE/RelOptInfo is associated only with
> the top-most parent that appears in the query. See following comment
> from find_all_inheritors()
>         /*
>          * Add to the queue only those children not already seen. This avoids
>          * making duplicate entries in case of multiple inheritance paths from
>          * the same parent.  (It'll also keep us from getting into an infinite
>          * loop, though theoretically there can't be any cycles in the
>          * inheritance graph anyway.)
>          */

That's right.

create table p1 (a int);
create table p2 (b int);
create table c () inherits (p1, p2);

Child table 'c' has two parents but if a query specifies only p1, then
just one AppendRelInfo corresponding to p1-c pair will be created.  If p2
was also specified in the query, there will be an AppendRelInfo for p2-c
pair too, but its child_relid will refer to another RT entry that's
created for 'c', that is, the one created when expanding p2.

>> Perhaps it's okay to change things this way for just partitioned
>> tables and leave inheritance hierarchies alone.
> 
> There's no point having two separate code paths when internally we are
> treating them as same.

+1

> The only difference between partitioning and
> inheritance is that for multi-level partitioned table, we preserve the
> inheritance hierarchy whereas for multi-level inheritance, we flatten
> it.

Yeah, the proposed patch doesn't change anything about how many
AppendRelInfo's are created, nor about what's contained in them, so it
seems safe to say that it would work unchanged for both regular
inheritance and partitioning.

Thanks,
Amit



Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 5 June 2018 at 16:44, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> I think the idea is brilliant. I do not have objections for trying
> something in that direction. I am suggesting that we take this a bit
> forward and try to eliminate append_rel_list altogether.

I was trying to be realistic for something we can do to fix v11. It's
probably better to minimise the risky surgery on this code while in
beta. What I proposed was intended to fix a performance regression new
in v11. I'm not sure what you're proposing has the same intentions.

Probably, if you do want an efficient way to store the children
belonging to a parent we could just have another array of Bitmapsets
which would just serve as a set of indexes into the array I've added.

I'd prefer to get a committers thoughts on this before doing any further work.

I've attached a cleaned up patch from the last one. This just adds
some sanity checks to make sure we get an ERROR if we do ever see two
AppendRelInfos with the same child relation id.  I've also made it so
the append_rel_array is only allocated when there are append rels.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Wed, Jun 6, 2018 at 11:27 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
> I was trying to be realistic for something we can do to fix v11. It's
> probably better to minimise the risky surgery on this code while in
> beta. What I proposed was intended to fix a performance regression new
> in v11. I'm not sure what you're proposing has the same intentions.

Agreed that we want a less risky fix at this stage. What I am worried
is with your implementation there are two ways to get AppendRelInfo
corresponding to a child, append_rel_list and append_rel_array. Right
now we will change all the code to use append_rel_array but there is
no guarantee that some code in future will use append_rel_list. Bugs
would rise if these two get out of sync esp. considering
append_rel_list is a list which can be easily modified. I think we
should avoid that. See what we do to rt_fetch() and
planner_rt_fetch(), but we still have two ways to get an RTE. That's a
source of future bugs.

>
> Probably, if you do want an efficient way to store the children
> belonging to a parent we could just have another array of Bitmapsets
> which would just serve as a set of indexes into the array I've added.
>

I was actually wrong when I proposed that we store AppendRelInfos
indexed by parent_id in the same array. You are right; there can be
multiple children with same parent id. I like your solution of having
an array of bitmapsets, members within which are indexes into
append_rel_array.

> I'd prefer to get a committers thoughts on this before doing any further work.

Yes. I think so too.

>
> I've attached a cleaned up patch from the last one. This just adds
> some sanity checks to make sure we get an ERROR if we do ever see two
> AppendRelInfos with the same child relation id.  I've also made it so
> the append_rel_array is only allocated when there are append rels.
>

In find_appinfos_by_relids(), we should Assert that the child_id in
the AppendRelInfo obtained from the array is same as its index. My
guess is that we could actually get rid of child_relid member of
AppendRelInfo altogether if we directly store AppendRelInfos in the
array without creating a list first. But that may be V12 material.

Also in the same function we want to Assert that cnt never exceeds *nappinfo.

+    Assert(root->append_rel_array);
root->append_rel_array != NULL seems to be a preferred style in the code.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> On Wed, Jun 6, 2018 at 11:27 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I was trying to be realistic for something we can do to fix v11. It's
>> probably better to minimise the risky surgery on this code while in
>> beta. What I proposed was intended to fix a performance regression new
>> in v11. I'm not sure what you're proposing has the same intentions.

> Agreed that we want a less risky fix at this stage. What I am worried
> is with your implementation there are two ways to get AppendRelInfo
> corresponding to a child, append_rel_list and append_rel_array. Right
> now we will change all the code to use append_rel_array but there is
> no guarantee that some code in future will use append_rel_list. Bugs
> would rise if these two get out of sync esp. considering
> append_rel_list is a list which can be easily modified. I think we
> should avoid that. See what we do to rt_fetch() and
> planner_rt_fetch(), but we still have two ways to get an RTE. That's a
> source of future bugs.

>> I'd prefer to get a committers thoughts on this before doing any further work.

> Yes. I think so too.

As the original author of the append_rel_list stuff, I do have a few
thoughts here.

The reason why append_rel_list is just a list, and not any more
complicated data structure, is alluded to in the comments for
find_childrel_appendrelinfo (which David's patch should have
changed, and did not):

 * This search could be eliminated by storing a link in child RelOptInfos,
 * but for now it doesn't seem performance-critical.  (Also, it might be
 * difficult to maintain such a link during mutation of the append_rel_list.)

There are a lot of places in prepjointree.c and planner.c that whack
around the append_rel_list contents more or less extensively.  It's
a lot easier in those places if the data is referenced by just one
list pointer.  I do not think we want to replace append_rel_list
with something that would make those functions' lives much harder.

However, so far as I can see, the append_rel_list contents don't change
anymore once we enter query_planner().  Therefore, it's safe to build
secondary indexing structure(s) to allow fast access beyond that point.
This is not much different from what we do with, eg, the rangetable
and simple_rte_array[].

So that's basically what David's patch does, and it seems fine as far
as it goes, although I disapprove of shoving the responsibility into
setup_simple_rel_arrays() without so much as a comment change.
I'd make a separate function for that, I think.  (BTW, perhaps instead
we should do what the comment quoted above contemplates, and set up a
link in the child's RelOptInfo?)

I'm still of the opinion that find_appinfos_by_relids() needs to be
nuked from orbit.  It has nothing to recommend it either from the
standpoint of performance or that of intellectual coherency (or maybe
that problem is just inadequate documentation).  The places consuming
its results are no better.

I was also pretty unhappy to discover, as I poked around in the code, that
recent partitioning patches have introduced various assumptions about the
ordering of the append_rel_list.  It's bad enough that those exist at all;
it's worse that they're documented, if at all, only beside the code that
will fail (usually silently) if they're violated.  I do not find this
acceptable.  If we cannot avoid these assumptions, they need to be
documented more centrally, like in the relation.h block comment for struct
AppendRelInfo.

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 8 June 2018 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So that's basically what David's patch does, and it seems fine as far
> as it goes, although I disapprove of shoving the responsibility into
> setup_simple_rel_arrays() without so much as a comment change.
> I'd make a separate function for that, I think.  (BTW, perhaps instead
> we should do what the comment quoted above contemplates, and set up a
> link in the child's RelOptInfo?)

I had originally wanted to stuff these into RelOptInfo, but I
discovered it was not really possible because of how the inheritance
planner works. It replaces the parent with the child RelOptInfo for
each child and replans the query over and over, meaning we'd never
have a complete list of AppendRelInfos to work with.

> I'm still of the opinion that find_appinfos_by_relids() needs to be
> nuked from orbit.  It has nothing to recommend it either from the
> standpoint of performance or that of intellectual coherency (or maybe
> that problem is just inadequate documentation).  The places consuming
> its results are no better.

Yeah, I agree it's not nice that it pallocs an array then pfrees it
again. adjust_appendrel_attrs and adjust_child_relids could probably
just accept a RelIds of childrels and find the AppendRelInfos itself,
similar to how I coded find_appinfos_by_relids.

Do you see that as something for PG11?  Keep in mind, I did code the
patch in a way to minimise invasiveness having considered that PG11 is
now in beta.  I'm willing to write a patch that gets rid of
find_appinfos_by_relids completely. There are a few places, e.g.
create_partitionwise_grouping_paths that make use of the appinfos
multiple times, but the direct lookup is probably fast enough that
this would not matter.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 8 June 2018 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm still of the opinion that find_appinfos_by_relids() needs to be
>> nuked from orbit.

> Yeah, I agree it's not nice that it pallocs an array then pfrees it
> again. adjust_appendrel_attrs and adjust_child_relids could probably
> just accept a RelIds of childrels and find the AppendRelInfos itself,
> similar to how I coded find_appinfos_by_relids.

Why do they need to accept any additional parameters at all?

The pre-v11 incarnation of those functions took a single AppendRelInfo,
specifying an exact translation from one parent relid to one child
relid.  The fundamental problem I've got with the current code, entirely
independently of any performance issues, is that it's completely unclear
-- or at least undocumented -- which translation(s) are supposed to occur.
If we assume that the code isn't 100% broken (which I'm hardly convinced
of, but we'll take it as read for the moment) then it must be that the
translations are constrained to be well-determined by the query structure
as represented by the totality of the AppendRelInfo data.  So I'm thinking
maybe we don't need those parameters at all.  In the pre-v11 situation,
we were saving lookups by passing the only applicable AppendRelInfo to
the translation functions.  But if the translation functions have to
look up the right translation anyway, let's just make them do that,
and create whatever infrastructure we have to have to make it fast.

> Do you see that as something for PG11?

We already broke the API of these functions for v11.  I don't much
want to break them again in v12.  Let's get it right the first time.

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Fri, Jun 8, 2018 at 10:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 8 June 2018 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'm still of the opinion that find_appinfos_by_relids() needs to be
>>> nuked from orbit.
>
>> Yeah, I agree it's not nice that it pallocs an array then pfrees it
>> again. adjust_appendrel_attrs and adjust_child_relids could probably
>> just accept a RelIds of childrels and find the AppendRelInfos itself,
>> similar to how I coded find_appinfos_by_relids.
>
> Why do they need to accept any additional parameters at all?
>
> The pre-v11 incarnation of those functions took a single AppendRelInfo,
> specifying an exact translation from one parent relid to one child
> relid.  The fundamental problem I've got with the current code, entirely
> independently of any performance issues, is that it's completely unclear
> -- or at least undocumented -- which translation(s) are supposed to occur.
> If we assume that the code isn't 100% broken (which I'm hardly convinced
> of, but we'll take it as read for the moment) then it must be that the
> translations are constrained to be well-determined by the query structure
> as represented by the totality of the AppendRelInfo data.  So I'm thinking
> maybe we don't need those parameters at all.  In the pre-v11 situation,
> we were saving lookups by passing the only applicable AppendRelInfo to
> the translation functions.  But if the translation functions have to
> look up the right translation anyway, let's just make them do that,
> and create whatever infrastructure we have to have to make it fast.
>

Pre-v11 we required to translate an expressions only for one
parent-child pair using adjust_appendrel_attrs() since only base
relations could have "other" relations. With partition-wise join and
aggregates, we have to do that for multiple parent-child pairs. If we
were to use the same pre-v11 interfaces, we have to do those
adjustments one pair at a time. But the way adjust_appendrel_attrs()
works, it creates a copy of whole expression tree replacing only the
nodes that adjust_appendrel_attrs() cares about. Doing adjustments one
pair at at time meant that all translated expression trees leaked
memory except the last one, which will be used. It was natural to pass
multiple AppendRelInfos, one per parent-child pair, to
adjust_appendrel_attrs() so that it carries out the translations in
one go. This was discussed at length over an year ago. I can see some
references starting [1] and following mails. The original mail thread
started at [2]. You will also find discussion about why we chose to
pass an array instead of list. We didn't do inside
adjust_appendrel_attrs() or adjust_child_relids() because in some
functions those were called multiple times and scanning list those
many times for same set of AppendRelInfos would have been waste of CPU
cycles.

[1] https://postgrespro.com/list/id/CAFjFpRcMWwepj-Do1otxQ-GApGPSZ1FmH7YQvQTwzQOGczq_sw@mail.gmail.com
[2] https://postgrespro.com/list/id/CAFjFpRcMWwepj-Do1otxQ-GApGPSZ1FmH7YQvQTwzQOGczq_sw@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Fri, Jun 8, 2018 at 1:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote
>
> I'm still of the opinion that find_appinfos_by_relids() needs to be
> nuked from orbit.  It has nothing to recommend it either from the
> standpoint of performance or that of intellectual coherency (or maybe
> that problem is just inadequate documentation).  The places consuming
> its results are no better.

Here's patch with some comments added to find_appinfos_by_relid(),
adjust_appendrel_attrs and ajust_appendrel_attrs_context. There was no
explanation about AppendRelInfo argument to adjust_appendrel_attrs or
its context in pre-v11 code. May be that was implicit in the first
paragraph. But then that implicit-ness holds true for AppendRelInfo
array as well. So, it was not changed when we changed the signature of
ajdust_appendrel_attrs().

>
> I was also pretty unhappy to discover, as I poked around in the code, that
> recent partitioning patches have introduced various assumptions about the
> ordering of the append_rel_list.  It's bad enough that those exist at all;
> it's worse that they're documented, if at all, only beside the code that
> will fail (usually silently) if they're violated.

Not silently exactly; a build with assert would trip the assertion in
inheritance_planner() at line 1295. So any changes to that assumption
would be caught by our regression first. I agree that is not so useful
in production, but it wouldn't go, thanks to our regression.

> I do not find this
> acceptable.  If we cannot avoid these assumptions, they need to be
> documented more centrally, like in the relation.h block comment for struct
> AppendRelInfo.
>

Attached patch adds the assumption to the block you mention above. Please check.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
Robert Haas
Date:
On Fri, Jun 8, 2018 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The pre-v11 incarnation of those functions took a single AppendRelInfo,
> specifying an exact translation from one parent relid to one child
> relid.  The fundamental problem I've got with the current code, entirely
> independently of any performance issues, is that it's completely unclear
> -- or at least undocumented -- which translation(s) are supposed to occur.

I don't understand this complaint.  Before, the code took one
AppendRelInfo, and according to you, it was clear what was supposed to
happen.  Now it takes an array of AppendRelInfos and, according to
you, it's completely unclear.  Yet that seems, to me at least, to be a
straightforward generalization.  If 1 AppendRelInfo is an adequate
specification of one translations, why are N AppendRelInfos not an
adequate specification of N translations?

(On another note, I don't have a strong what should be done about the
fact that all AppendRelInfos are stored in a flat list or whether it
should be done in v11 or v12, but I completely agree with the idea
that something should be done about it at some point.  I'm sure it's
possible to design a data structure that is less efficient to search
than a singly-linked list, but you'd probably have to be explicitly
aiming for that goal.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 8, 2018 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The pre-v11 incarnation of those functions took a single AppendRelInfo,
>> specifying an exact translation from one parent relid to one child
>> relid.  The fundamental problem I've got with the current code, entirely
>> independently of any performance issues, is that it's completely unclear
>> -- or at least undocumented -- which translation(s) are supposed to occur.

> I don't understand this complaint.  Before, the code took one
> AppendRelInfo, and according to you, it was clear what was supposed to
> happen.  Now it takes an array of AppendRelInfos and, according to
> you, it's completely unclear.  Yet that seems, to me at least, to be a
> straightforward generalization.  If 1 AppendRelInfo is an adequate
> specification of one translations, why are N AppendRelInfos not an
> adequate specification of N translations?

Because the relationships between the transforms are unclear.  Are we
supposed to apply those N transformations to the expression in sequence?
It doesn't look to me like that's what the code does.  I think --- I might
be wrong --- that the code is relying on the transformations to be
non-overlapping, that is a change made by any one of them cannot be
further affected by another one.  This is, however, undocumented.

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
Robert Haas
Date:
On Fri, Jun 8, 2018 at 2:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't understand this complaint.  Before, the code took one
>> AppendRelInfo, and according to you, it was clear what was supposed to
>> happen.  Now it takes an array of AppendRelInfos and, according to
>> you, it's completely unclear.  Yet that seems, to me at least, to be a
>> straightforward generalization.  If 1 AppendRelInfo is an adequate
>> specification of one translations, why are N AppendRelInfos not an
>> adequate specification of N translations?
>
> Because the relationships between the transforms are unclear.  Are we
> supposed to apply those N transformations to the expression in sequence?
> It doesn't look to me like that's what the code does.  I think --- I might
> be wrong --- that the code is relying on the transformations to be
> non-overlapping, that is a change made by any one of them cannot be
> further affected by another one.  This is, however, undocumented.

OK, I see.  I guess that possible confusion didn't occur to me both
because I was looking at the code, which I knew wouldn't handle
overlapping transformations, and also because the intended use was for
partition-wise join, where the problem can't come up because we're
translating from the toplevel RTIs for the join to the set of RTIs
appropriate for one child-join.  But, it's certainly fine to add a
comment.

Regarding your original complaint that find_appinfos_by_relids will
"linearly search the appinfo list to produce ... an array, which also
has to be linearly searched", let me just note that there is a big
difference between the probable lengths of those arrays.  Given an
N-way partition-wise join between tables with P partitions each, the
entire appinfo list will contain N * P entries, but the entries
produced by find_appinfos_by_relids() will at most of length P.  There
is a lot of user interest in having P grow into the thousands, or at
the very least the hundreds, but the planner is probably going to be
in a tough spot anyway if there are more than a few tens of tables
and, in practice, I suspect that the number of compatibly partitioned
tables in a query is not likely to grow even that large.  Moreover,
searching an array should be noticeably more cache-efficient than
searching a list.  So the overall efficiency of searching the array is
probably 2 to 4 orders of magnitude better than searching the list.

That being said, I don't mind a bit if you want to look for further
speedups here, but if you do, keep in mind that a lot of queries won't
even use partition-wise join, so all of the arrays will be of length
1.  Even when partition-wise join is used, it is quite likely not to
be used for every table in the query, in which case it will still be
of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
is probably a regression overall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That being said, I don't mind a bit if you want to look for further
> speedups here, but if you do, keep in mind that a lot of queries won't
> even use partition-wise join, so all of the arrays will be of length
> 1.  Even when partition-wise join is used, it is quite likely not to
> be used for every table in the query, in which case it will still be
> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
> is probably a regression overall.

TBH, I am way more concerned about the pessimization introduced for
every pre-existing usage of these functions by putting search loops
into them at all.  I'd like very much to revert that.  If we can
replace those with something along the line of root->index_array[varno]
we'll be better off across the board.

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
Ashutosh Bapat
Date:
On Sat, Jun 9, 2018 at 12:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 8, 2018 at 2:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't understand this complaint.  Before, the code took one
>>> AppendRelInfo, and according to you, it was clear what was supposed to
>>> happen.  Now it takes an array of AppendRelInfos and, according to
>>> you, it's completely unclear.  Yet that seems, to me at least, to be a
>>> straightforward generalization.  If 1 AppendRelInfo is an adequate
>>> specification of one translations, why are N AppendRelInfos not an
>>> adequate specification of N translations?
>>
>> Because the relationships between the transforms are unclear.  Are we
>> supposed to apply those N transformations to the expression in sequence?
>> It doesn't look to me like that's what the code does.  I think --- I might
>> be wrong --- that the code is relying on the transformations to be
>> non-overlapping, that is a change made by any one of them cannot be
>> further affected by another one.  This is, however, undocumented.
>
> OK, I see.  I guess that possible confusion didn't occur to me both
> because I was looking at the code, which I knew wouldn't handle
> overlapping transformations, and also because the intended use was for
> partition-wise join, where the problem can't come up because we're
> translating from the toplevel RTIs for the join to the set of RTIs
> appropriate for one child-join.  But, it's certainly fine to add a
> comment.

That didn't occurred to me as well.

Here's patch with comments and Assertions added to check the
non-overlapping AppendRelInfos. The assertions check that same parent
or child doesn't appear more than once in any of the AppendRelInfos,
neither as a parent nor as a child.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 9 June 2018 at 07:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If we can
> replace those with something along the line of root->index_array[varno]
> we'll be better off across the board.

Are you talking about the loop over the appinfos in functions such as
adjust_appendrel_attrs_mutator?

If so, please note that it's looking for the appinfo with the parent
relid matching the appinfo, not a child relid. There may be many
appinfos in this "index_array" with that parent, so what you mention
above, if I understand you correctly, is not possible. We'd still need
a list of childrelids and we'd need to still loop over each
AppendRelInfo which belongs to that child and check if the parent
matches the Var.

I started coding a patch which instead of the direct lookup you've
mentioned just does a bms_next_member() loop using the given
childrelids. It only part way through writing the patch when I hit a
snag around the various places that call adjust_appendrel_attrs() with
a pre-determined AppendRelInfo, for example, the one in
inheritance_planner() at line 1312.  I'd thought that maybe I could
just pass the childrelids in with
bms_make_singleton(appinfo->child_relid), but the manufactured
"parent_root" inside the inheritance_planner does not have the
append_rel_array set. We could work around this by having another
version of adjust_appendrel_attrs() which accepts an AppendRelInfo
instead of a Relids, but that looks like it means making another
adjust_appendrel_attrs_mutator like function and duplicating quite a
bit of code.  Another option is perhaps to add a callback function to
adjust_appendrel_attrs_context to have that search for the
AppendRelInfo, or just return the known one, but unsure if that's
going to look very good.  I've now stopped as I'm starting to think
that this is not improving the code.

I've attached the half done broken, regression test failing patch just
so you can see what I'm talking about.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
Robert Haas
Date:
On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That being said, I don't mind a bit if you want to look for further
>> speedups here, but if you do, keep in mind that a lot of queries won't
>> even use partition-wise join, so all of the arrays will be of length
>> 1.  Even when partition-wise join is used, it is quite likely not to
>> be used for every table in the query, in which case it will still be
>> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
>> is probably a regression overall.
>
> TBH, I am way more concerned about the pessimization introduced for
> every pre-existing usage of these functions by putting search loops
> into them at all.  I'd like very much to revert that.  If we can
> replace those with something along the line of root->index_array[varno]
> we'll be better off across the board.

I think David's analysis is correct -- this doesn't quite work.  We're
trying to identify whether a given varno is one of the ones to be
translated, and it's hard to come up with a faster solution than
iterating over a (very short) array of those values.  One thing we
could do is have two versions of each function, or else an optimized
path for the very common nappinfos=1 case.  I'm just not sure it would
be worthwhile.  Traversing a short array isn't free, but it's pretty
cheap.

An early version of the patch that made these changes used a List here
rather than a C array, and I asked for that to be changed on
efficiency grounds, and also because constructing 1-element lists
would have a cost of its own.  I think in general we have way too much
code that uses Lists for convenience even though C arrays would be
faster.  For the most part, the performance consequences of any
individual place where we do this are probably beneath the noise
floor, but in the aggregate I think it has nasty consequences for both
performance and memory utilization.  I think if we are going to look
at optimizing we are likely to buy more by worrying about cases where
we traverse lists, especially ones that may be long, rather than
worrying about looping over short C arrays.  Of course I'm open to
being proved wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Performance regression with PostgreSQL 11 and partitioning

From
David Rowley
Date:
On 12 June 2018 at 01:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> That being said, I don't mind a bit if you want to look for further
>>> speedups here, but if you do, keep in mind that a lot of queries won't
>>> even use partition-wise join, so all of the arrays will be of length
>>> 1.  Even when partition-wise join is used, it is quite likely not to
>>> be used for every table in the query, in which case it will still be
>>> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
>>> is probably a regression overall.
>>
>> TBH, I am way more concerned about the pessimization introduced for
>> every pre-existing usage of these functions by putting search loops
>> into them at all.  I'd like very much to revert that.  If we can
>> replace those with something along the line of root->index_array[varno]
>> we'll be better off across the board.
>
> I think David's analysis is correct -- this doesn't quite work.  We're
> trying to identify whether a given varno is one of the ones to be
> translated, and it's hard to come up with a faster solution than
> iterating over a (very short) array of those values.  One thing we
> could do is have two versions of each function, or else an optimized
> path for the very common nappinfos=1 case.  I'm just not sure it would
> be worthwhile.  Traversing a short array isn't free, but it's pretty
> cheap.

So this is the current situation as far as I see it:

We could go and add a new version of adjust_appendrel_attrs() and
adjust_appendrel_attrs_mutator() that accept a Relids for the child
rels rather than an array of AppendRelInfos. However, that's quite a
lot of code duplication. We could perhaps cut down on duplication by
having a callback function stored inside of
adjust_appendrel_attrs_context which searches for the correct
AppendRelInfo to use. However, it does not seem to be inline with
simplifying the code.

We've not yet heard back from Tom with more details about his
root->index_array[varno] idea. I can't quite see how this is possible
and for the moment I assume Tom misunderstood that it's the parent
varno that's known, not the varno of the child.

I've attached a patch which cleans up my earlier version and moves the
setup of the append_rel_array into its own function instead of
sneaking code into setup_simple_rel_arrays(). I've also now updated
the comment above find_childrel_appendrelinfo(), which is now an
unused function.

I tried the following test case:

CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,9999) x;
\gexec
\o

SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';

Patched

Time: 190.989 ms
Time: 187.313 ms
Time: 190.239 ms

Unpatched

Time: 775.771 ms
Time: 781.631 ms
Time: 762.777 ms

Is this good enough for v11?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
Thomas Reiss
Date:

Le 18/06/2018 à 10:46, David Rowley a écrit :
> On 12 June 2018 at 01:49, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> That being said, I don't mind a bit if you want to look for further
>>>> speedups here, but if you do, keep in mind that a lot of queries won't
>>>> even use partition-wise join, so all of the arrays will be of length
>>>> 1.  Even when partition-wise join is used, it is quite likely not to
>>>> be used for every table in the query, in which case it will still be
>>>> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
>>>> is probably a regression overall.
>>>
>>> TBH, I am way more concerned about the pessimization introduced for
>>> every pre-existing usage of these functions by putting search loops
>>> into them at all.  I'd like very much to revert that.  If we can
>>> replace those with something along the line of root->index_array[varno]
>>> we'll be better off across the board.
>>
>> I think David's analysis is correct -- this doesn't quite work.  We're
>> trying to identify whether a given varno is one of the ones to be
>> translated, and it's hard to come up with a faster solution than
>> iterating over a (very short) array of those values.  One thing we
>> could do is have two versions of each function, or else an optimized
>> path for the very common nappinfos=1 case.  I'm just not sure it would
>> be worthwhile.  Traversing a short array isn't free, but it's pretty
>> cheap.
> 
> So this is the current situation as far as I see it:
> 
> We could go and add a new version of adjust_appendrel_attrs() and
> adjust_appendrel_attrs_mutator() that accept a Relids for the child
> rels rather than an array of AppendRelInfos. However, that's quite a
> lot of code duplication. We could perhaps cut down on duplication by
> having a callback function stored inside of
> adjust_appendrel_attrs_context which searches for the correct
> AppendRelInfo to use. However, it does not seem to be inline with
> simplifying the code.
> 
> We've not yet heard back from Tom with more details about his
> root->index_array[varno] idea. I can't quite see how this is possible
> and for the moment I assume Tom misunderstood that it's the parent
> varno that's known, not the varno of the child.
> 
> I've attached a patch which cleans up my earlier version and moves the
> setup of the append_rel_array into its own function instead of
> sneaking code into setup_simple_rel_arrays(). I've also now updated
> the comment above find_childrel_appendrelinfo(), which is now an
> unused function.
> 
> I tried the following test case:
> 
> CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
> INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
> PARTITION BY RANGE (date);
> \o /dev/null
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
> FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
> hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
> || ' hours')::interval || ''');'
> from generate_Series(0,9999) x;
> \gexec
> \o
> 
> SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';
> 
> Patched
> 
> Time: 190.989 ms
> Time: 187.313 ms
> Time: 190.239 ms
> 
> Unpatched
> 
> Time: 775.771 ms
> Time: 781.631 ms
> Time: 762.777 ms
> 
> Is this good enough for v11?

It works pretty well with your last patch. IMHO, this issue should be
addressed in v11. I used a pretty unrealistic test-case to show this
regression but it appear with a reasonnable count of partitions, v11 is
slower than v10 even with 10 partitions.

Thanks a lot !


Re: Performance regression with PostgreSQL 11 and partitioning

From
Alvaro Herrera
Date:
On 2018-Jun-18, David Rowley wrote:

> I've attached a patch which cleans up my earlier version and moves the
> setup of the append_rel_array into its own function instead of
> sneaking code into setup_simple_rel_arrays(). I've also now updated
> the comment above find_childrel_appendrelinfo(), which is now an
> unused function.

I checked that this patch fixes the originally reported performance
regression.

Unless there are objections, I intend to push this patch tomorrow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Jun-18, David Rowley wrote:
>> I've attached a patch which cleans up my earlier version and moves the
>> setup of the append_rel_array into its own function instead of
>> sneaking code into setup_simple_rel_arrays(). I've also now updated
>> the comment above find_childrel_appendrelinfo(), which is now an
>> unused function.

> I checked that this patch fixes the originally reported performance
> regression.
> Unless there are objections, I intend to push this patch tomorrow.

If find_childrel_appendrelinfo is now unused, we should remove it.

            regards, tom lane


Re: Performance regression with PostgreSQL 11 and partitioning

From
Alvaro Herrera
Date:
On 2018-Jun-25, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2018-Jun-18, David Rowley wrote:
> >> I've attached a patch which cleans up my earlier version and moves the
> >> setup of the append_rel_array into its own function instead of
> >> sneaking code into setup_simple_rel_arrays(). I've also now updated
> >> the comment above find_childrel_appendrelinfo(), which is now an
> >> unused function.
> 
> > I checked that this patch fixes the originally reported performance
> > regression.
> > Unless there are objections, I intend to push this patch tomorrow.
> 
> If find_childrel_appendrelinfo is now unused, we should remove it.

Agreed -- thanks for following up.  Pushed that way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Thomas Reiss
Date:

Le 26/06/2018 à 16:43, Alvaro Herrera a écrit :
> On 2018-Jun-25, Tom Lane wrote:
> 
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> On 2018-Jun-18, David Rowley wrote:
>>>> I've attached a patch which cleans up my earlier version and moves the
>>>> setup of the append_rel_array into its own function instead of
>>>> sneaking code into setup_simple_rel_arrays(). I've also now updated
>>>> the comment above find_childrel_appendrelinfo(), which is now an
>>>> unused function.
>>
>>> I checked that this patch fixes the originally reported performance
>>> regression.
>>> Unless there are objections, I intend to push this patch tomorrow.
>>
>> If find_childrel_appendrelinfo is now unused, we should remove it.
> 
> Agreed -- thanks for following up.  Pushed that way.

Thanks Alvaro, Ashutosh and David.



Re: Performance regression with PostgreSQL 11 and partitioning

From
Amit Langote
Date:
On 2018/06/26 23:43, Alvaro Herrera wrote:
> On 2018-Jun-25, Tom Lane wrote:
> 
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> On 2018-Jun-18, David Rowley wrote:
>>>> I've attached a patch which cleans up my earlier version and moves the
>>>> setup of the append_rel_array into its own function instead of
>>>> sneaking code into setup_simple_rel_arrays(). I've also now updated
>>>> the comment above find_childrel_appendrelinfo(), which is now an
>>>> unused function.
>>
>>> I checked that this patch fixes the originally reported performance
>>> regression.
>>> Unless there are objections, I intend to push this patch tomorrow.
>>
>> If find_childrel_appendrelinfo is now unused, we should remove it.
> 
> Agreed -- thanks for following up.  Pushed that way.

I noticed that there is a typo in a comment, fixed as follows in the
attached patch.

     /*
-     * append_rel_list is the same length as the above arrays, and holds
+     * append_rel_array is the same length as the above arrays, and holds
      * pointers to the corresponding AppendRelInfo entry indexed by
      * child_relid, or NULL if none.  The array itself is not allocated if
      * append_rel_list is empty.

Thanks,
Amit

Attachment

Re: Performance regression with PostgreSQL 11 and partitioning

From
Alvaro Herrera
Date:
On 2018-Jun-27, Amit Langote wrote:

> I noticed that there is a typo in a comment, fixed as follows in the
> attached patch.
> 
>      /*
> -     * append_rel_list is the same length as the above arrays, and holds
> +     * append_rel_array is the same length as the above arrays, and holds
>       * pointers to the corresponding AppendRelInfo entry indexed by
>       * child_relid, or NULL if none.  The array itself is not allocated if
>       * append_rel_list is empty.

Yikes.  I looked for that kind of mistake *specifically* :-(
Thanks, pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Un peu décu : Re: Performance regression with PostgreSQL 11 and partitioning

From
Christophe Courtois
Date:
Hello,

J'ai eu peur, 7d872c91a3f9d49b56117557cdbb0c3d4c620687 n'est pas en bêta
2 mais bien dans REL_11_STABLE.

J'ai relancé mes scripts avec 10000 partitions vides, un peu à l'arrache
et sans rigueur : la dernière version incluant ce patch est
effectivement souvent moitié plus rapide qu'en 10, mais si je joins deux
tables à 10000 partitions entre elles, ça devient 2 fois plus
catastrophique.

Bref, un sacré progrès quand même. Merci d'avoir porté ça sur hackers.

Le 26/06/2018 à 16:46, Thomas Reiss a écrit :
> 
> 
> Le 26/06/2018 à 16:43, Alvaro Herrera a écrit :
>> On 2018-Jun-25, Tom Lane wrote:
>>
>>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>>> On 2018-Jun-18, David Rowley wrote:
>>>>> I've attached a patch which cleans up my earlier version and moves the
>>>>> setup of the append_rel_array into its own function instead of
>>>>> sneaking code into setup_simple_rel_arrays(). I've also now updated
>>>>> the comment above find_childrel_appendrelinfo(), which is now an
>>>>> unused function.
>>>
>>>> I checked that this patch fixes the originally reported performance
>>>> regression.
>>>> Unless there are objections, I intend to push this patch tomorrow.
>>>
>>> If find_childrel_appendrelinfo is now unused, we should remove it.
>>
>> Agreed -- thanks for following up.  Pushed that way.
> 
> Thanks Alvaro, Ashutosh and David.
> 
> 
> 

-- 
Christophe Courtois
Consultant Dalibo
http://dalibo.com/  -  http://dalibo.org/


Oops, that message was supposed to be in private.

Sorry for the noise.

-- 
Christophe Courtois



On 10 July 2018 at 00:47, Christophe Courtois
<christophe.courtois@dalibo.com> wrote:
> J'ai eu peur, 7d872c91a3f9d49b56117557cdbb0c3d4c620687 n'est pas en bêta
> 2 mais bien dans REL_11_STABLE.
>
> J'ai relancé mes scripts avec 10000 partitions vides, un peu à l'arrache
> et sans rigueur : la dernière version incluant ce patch est
> effectivement souvent moitié plus rapide qu'en 10, mais si je joins deux
> tables à 10000 partitions entre elles, ça devient 2 fois plus
> catastrophique.
>
> Bref, un sacré progrès quand même. Merci d'avoir porté ça sur hackers.

(Christophe reports 2x performance regression with PG11 when using
10000 partitions)

Hi Christophe,

Thanks for the report. Can you supply your test case when shows this regression?

Please, can you also post in English for the future.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Performance regression with PostgreSQL 11 and partitioning

From
Christophe Courtois
Date:
Hi,

Le 09/07/2018 à 22:10, David Rowley a écrit :
> On 10 July 2018 at 00:47, Christophe Courtois
> <christophe.courtois@dalibo.com> wrote:
> (Christophe reports 2x performance regression with PG11 when using
> 10000 partitions)
> Thanks for the report. Can you supply your test case when shows this regression?
> Please, can you also post in English for the future.

Sorry, that was supposed to be a private mail to Thomas. I lack time to
reproduce cleanly my tests after the latest patch and dig a bit myself.
Basically I do an inner join between a 10'000 partitions table and
itself with a simple WHERE. I'm not sure that this happens in the real
world.

Yours,

-- 
Christophe Courtois
Consultant Dalibo
http://dalibo.com/  -  http://dalibo.org/