Thread: making update/delete of inheritance trees scale better

making update/delete of inheritance trees scale better

From

Amit Langote

Date:

08 May 2020, 13:32:50

Here is a sketch for implementing the design that Tom described here:
https://www.postgresql.org/message-id/flat/357.1550612935%40sss.pgh.pa.us

In short, we would like to have only one plan for ModifyTable to get
tuples out of to update/delete, not N for N child result relations as
is done currently.

I suppose things are the way they are because creating a separate plan
for each result relation makes the job of ModifyTable node very
simple, which is currently this:

1. Take the plan's output tuple, extract the tupleid of the tuple to
update/delete in the currently active result relation,
2. If delete, go to 3, else if update, filter out the junk columns
from the above tuple
3. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any

If we make ModifyTable do a little more work for the inheritance case,
we can create only one plan but without "expanding" the targetlist.
That is, it will contain entries only for attributes that are assigned
values in the SET clause. This makes the plan reusable across result
relations, because all child relations must have those attributes,
even though the attribute numbers might be different. Anyway, the
work that ModifyTable will now have to do is this:

1. Take the plan's output tuple, extract tupleid of the tuple to
update/delete and "tableoid"
2. Select the result relation to operate on using the tableoid
3. If delete, go to 4, else if update, fetch the tuple identified by
tupleid from the result relation and fill in the unassigned columns
using that "old" tuple, also filtering out the junk columns
4. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any
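
The four steps above can be sketched as a toy model. This is Python rather than executor C, and every name in it (ResultRel, modify_table, the OID values) is hypothetical; it only illustrates the control flow, not how the patch implements it.

```python
# Toy model of the proposed per-row work in ModifyTable; not PostgreSQL code.
# All names and OID values here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ResultRel:
    name: str
    heap: dict = field(default_factory=dict)  # tupleid -> {column: value}


JUNK = ("tableoid", "ctid")


def modify_table(plan_rows, result_rels, operation):
    """plan_rows: dicts carrying the junk keys 'tableoid' and 'ctid' plus,
    for UPDATE, only the columns assigned in the SET clause."""
    for row in plan_rows:
        # 1. Extract the junk columns from the plan's output tuple.
        tableoid, tupleid = row["tableoid"], row["ctid"]
        # 2. Select the result relation using tableoid.
        rel = result_rels[tableoid]
        if operation == "delete":
            # 4. A delete needs no new tuple.
            del rel.heap[tupleid]
            continue
        # 3. Fetch the old tuple, fill unassigned columns from it,
        #    and leave the junk columns out of the new tuple.
        old = rel.heap[tupleid]
        assigned = {k: v for k, v in row.items() if k not in JUNK}
        new = {**old, **assigned}
        # 4. Store the new tuple in the selected result relation.
        rel.heap[tupleid] = new


rels = {
    16385: ResultRel("foo_1", {(0, 1): {"a": 1, "b": "x"}}),
    16386: ResultRel("foo_2", {(0, 1): {"a": 2, "b": "y"}}),
}
modify_table([{"tableoid": 16386, "ctid": (0, 1), "b": "z"}], rels, "update")
```

Only foo_2's row changes here, and column "a", which the SET clause did not assign, is carried over from the old tuple.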

I do think that doing this would be worthwhile even if we may be
increasing ModifyTable's per-row overhead slightly, because planning
overhead of the current approach is very significant, especially for
partition trees with more than a couple of thousand partitions. As to
how bad the problem is, trying to create a generic plan for `update
foo set ... where key = $1`, where foo has over 2000 partitions,
causes OOM even on a machine with 6GB of memory.

The one plan shared by all result relations will be same as the one we
would get if the query were SELECT, except it will contain junk
attributes such as ctid needed to identify tuples and a new "tableoid"
junk attribute if multiple result relations will be present due to
inheritance. One major way in which this targetlist differs from the
current per-result-relation plans is that it won't be passed through
expand_targetlist(), because the set of unassigned attributes may not
be unique among children. As mentioned above, those missing
attributes will be filled by ModifyTable doing some extra work,
whereas previously they would have come with the plan's output tuple.

For child result relations that are foreign tables, their FDW adds
junk attribute(s) to the query’s targetlist by updating it in-place
(AddForeignUpdateTargets). However, as the child tables will no
longer get their own parsetree, we must use some hack around this
interface to obtain the foreign table specific junk attributes and add
them to the original/parent query’s targetlist. Assuming that all or
most of the children will belong to the same FDW, we will end up with
only a handful of such junk columns in the final targetlist. I am not
sure if it's worthwhile to change the API of AddForeignUpdateTargets
to require FDWs to not scribble on the passed-in parsetree as part of
this patch.

As for how ModifyTable will create the new tuple for updates, I have
decided to use a ProjectionInfo for each result relation, which
projects a full, *clean* tuple ready to be put back into the relation.
When projecting, the plan’s output tuple serves as OUTER tuple and the old
tuple fetched to fill unassigned attributes serves as SCAN tuple. By
having this ProjectionInfo also serve as the “junk filter”, we don't
need JunkFilters. The targetlist that this projection computes is
same as that of the result-relation-specific plan. Initially, I
thought to generate this "expanded" targetlist in
ExecInitModifyTable(). But as it can be somewhat expensive, doing it
only once in the planner seemed like a good idea. These
per-result-relations targetlists are carried in the ModifyTable node.
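
A rough illustration of the projection idea, again as a Python toy rather than ProjectionInfo code. The target list representation and the helper names (build_targetlist, project) are assumptions made for this sketch: each entry says whether a column of the new tuple comes from the plan's output (OUTER) or from the old tuple (SCAN).

```python
# Toy model of a per-result-relation projection; not executor code.
# A target list entry is ("outer", i) or ("scan", attno): take the value from
# position i of the plan's output tuple, or from attribute attno of the old
# tuple. Both helper names are hypothetical.


def build_targetlist(child_columns, set_columns):
    """child_columns: column names in the child's own attribute order.
    set_columns: names assigned in SET, in plan-output order."""
    tlist = []
    for attno, col in enumerate(child_columns):
        if col in set_columns:
            tlist.append(("outer", set_columns.index(col)))
        else:
            tlist.append(("scan", attno))
    return tlist


def project(tlist, outer_tuple, scan_tuple):
    """Build a full, clean tuple; junk columns are never referenced."""
    source = {"outer": outer_tuple, "scan": scan_tuple}
    return tuple(source[kind][idx] for kind, idx in tlist)


# Two children with the same columns in a different attribute order.
tlist_1 = build_targetlist(["a", "b", "c"], ["b"])
tlist_2 = build_targetlist(["c", "b", "a"], ["b"])

# The plan emits (new value of b, ctid, tableoid); the last two are junk.
plan_output = ("new", (0, 7), 16385)
new_1 = project(tlist_1, plan_output, (1, "old", 3.0))
new_2 = project(tlist_2, plan_output, (3.0, "old", 1))
```

Because the projection's target list only ever points at the assigned columns and the old tuple, it doubles as the junk filter: ctid and tableoid simply never make it into the result.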

To identify the result relation from the tuple produced by the plan,
a “tableoid” junk column will be used. As the tuples for different
result relations won’t necessarily come out in the order in which
result relations are laid out in the ModifyTable node, we need a way
to map the tableoid value to result relation indexes. I have decided
to use a hash table here.

A couple of things that I haven't thought very hard about yet, but may
revisit later.

* We will no longer be able to use DirectModify APIs to push updates to
remote servers for foreign child result relations

* Over in [1], I have said that we get run-time pruning for free for
ModifyTable because the plan we are using is same as that for SELECT,
although now I think that I hadn't thought that through. With the PoC
patch that I have:

prepare q as update foo set a = 250001 where a = $1;
set plan_cache_mode to 'force_generic_plan';
explain execute q(1);
                             QUERY PLAN
--------------------------------------------------------------------
 Update on foo  (cost=0.00..142.20 rows=40 width=14)
   Update on foo_1
   Update on foo_2 foo
   Update on foo_3 foo
   Update on foo_4 foo
   ->  Append  (cost=0.00..142.20 rows=40 width=14)
         Subplans Removed: 3
         ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
               Filter: (a = $1)
(9 rows)

While it's true that we will never have to actually update foo_2,
foo_3, and foo_4, ModifyTable still sets up its ResultRelInfos, which
ideally it shouldn't. Maybe we'll need to do something about that
after all.

I will post the patch shortly.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqGXmP3-S9y%3DOQHyJyeWnZSOmcxBGdgAMWcLUOsnPTL88w%40mail.gmail.com

Re: making update/delete of inheritance trees scale better

From

Ashutosh Bapat

Date:

11 May 2020, 12:58:15

On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Here is a sketch for implementing the design that Tom described here:
> https://www.postgresql.org/message-id/flat/357.1550612935%40sss.pgh.pa.us
>
> In short, we would like to have only one plan for ModifyTable to get
> tuples out of to update/delete, not N for N child result relations as
> is done currently.
>
> I suppose things are the way they are because creating a separate plan
> for each result relation makes the job of ModifyTable node very
> simple, which is currently this:
>
> 1. Take the plan's output tuple, extract the tupleid of the tuple to
> update/delete in the currently active result relation,
> 2. If delete, go to 3, else if update, filter out the junk columns
> from the above tuple
> 3. Call ExecUpdate()/ExecDelete() on the result relation with the new
> tuple, if any
>
> If we make ModifyTable do a little more work for the inheritance case,
> we can create only one plan but without "expanding" the targetlist.
> That is, it will contain entries only for attributes that are assigned
> values in the SET clause.  This makes the plan reusable across result
> relations, because all child relations must have those attributes,
> even though the attribute numbers might be different.  Anyway, the
> work that ModifyTable will now have to do is this:
>
> 1. Take the plan's output tuple, extract tupleid of the tuple to
> update/delete and "tableoid"
> 2. Select the result relation to operate on using the tableoid
> 3. If delete, go to 4, else if update, fetch the tuple identified by
> tupleid from the result relation and fill in the unassigned columns
> using that "old" tuple, also filtering out the junk columns
> 4. Call ExecUpdate()/ExecDelete() on the result relation with the new
> tuple, if any
>
> I do think that doing this would be worthwhile even if we may be
> increasing ModifyTable's per-row overhead slightly, because planning
> overhead of the current approach is very significant, especially for
> partition trees with beyond a couple of thousand partitions.  As to
> how bad the problem is, trying to create a generic plan for `update
> foo set ... where key = $1`, where foo has over 2000 partitions,
> causes OOM even on a machine with 6GB of memory.

Per-row overhead would be incurred for every row, whereas the plan-time
overhead is one-time or, in the case of a prepared statement, almost free.
So we need to compare them, especially when there are 2000 partitions and all
of them are being updated. But generally I agree that this would be a
better approach. It might help in using PWJ when the result relation
joins with another partitioned table. I am not sure whether that
effectively happens today by partition pruning. More on this later.

>
> The one plan shared by all result relations will be same as the one we
> would get if the query were SELECT, except it will contain junk
> attributes such as ctid needed to identify tuples and a new "tableoid"
> junk attribute if multiple result relations will be present due to
> inheritance.  One major way in which this targetlist differs from the
> current per-result-relation plans is that it won't be passed through
> expand_targetlist(), because the set of unassigned attributes may not
> be unique among children.  As mentioned above, those missing
> attributes will be filled by ModifyTable doing some extra work,
> whereas previously they would have come with the plan's output tuple.
>
> For child result relations that are foreign tables, their FDW adds
> junk attribute(s) to the query’s targetlist by updating it in-place
> (AddForeignUpdateTargets).  However, as the child tables will no
> longer get their own parsetree, we must use some hack around this
> interface to obtain the foreign table specific junk attributes and add
> them to the original/parent query’s targetlist.  Assuming that all or
> most of the children will belong to the same FDW, we will end up with
> only a handful such junk columns in the final targetlist.  I am not
> sure if it's worthwhile to change the API of AddForeignUpdateTargets
> to require FDWs to not scribble on the passed-in parsetree as part of
> this patch.

What happens if there's a mixture of foreign and local partitions or a
mixture of FDWs? Injecting junk columns from all FDWs in the top level
target list will cause an error because those attributes won't be
available everywhere.

>
> As for how ModifyTable will create the new tuple for updates, I have
> decided to use a ProjectionInfo for each result relation, which
> projects a full, *clean* tuple ready to be put back into the relation.
> When projecting, plan’s output tuple serves as OUTER tuple and the old
> tuple fetched to fill unassigned attributes serves as SCAN tuple.  By
> having this ProjectionInfo also serve as the “junk filter”, we don't
> need JunkFilters.  The targetlist that this projection computes is
> same as that of the result-relation-specific plan.  Initially, I
> thought to generate this "expanded" targetlist in
> ExecInitModifyTable().  But as it can be somewhat expensive, doing it
> only once in the planner seemed like a good idea.  These
> per-result-relations targetlists are carried in the ModifyTable node.
>
> To identify the result relation from the tuple produced by the plan,
> “tableoid” junk column will be used.  As the tuples for different
> result relations won’t necessarily come out in the order in which
> result relations are laid out in the ModifyTable node, we need a way
> to map the tableoid value to result relation indexes.  I have decided
> to use a hash table here.

Can we plan the scan query to add a sort node to order the rows by tableoid?

>
> A couple of things that I didn't think very hard what to do about now,
> but may revisit later.
>
> * We will no longer be able use DirectModify APIs to push updates to
> remote servers for foreign child result relations

If we convert a whole DML into partitionwise DML (just as it happens
today unintentionally), we should be able to use DirectModify. PWJ
will help there. But even if we can detect that the scan underlying a
particular partition can be evaluated completely on the same node
where the partition resides, we should be able to use DirectModify.
But if we are not able to support this optimization, the queries which
benefit from it today won't perform well. I think we need to think
about this now instead of leaving it for later. Otherwise, make it so that
we use the old way when there are foreign partitions and the new way
otherwise.

>
> * Over in [1], I have said that we get run-time pruning for free for
> ModifyTable because the plan we are using is same as that for SELECT,
> although now I think that I hadn't thought that through.  With the PoC
> patch that I have:
>
> prepare q as update foo set a = 250001 where a = $1;
> set plan_cache_mode to 'force_generic_plan';
> explain execute q(1);
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Update on foo  (cost=0.00..142.20 rows=40 width=14)
>    Update on foo_1
>    Update on foo_2 foo
>    Update on foo_3 foo
>    Update on foo_4 foo
>    ->  Append  (cost=0.00..142.20 rows=40 width=14)
>          Subplans Removed: 3
>          ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
>                Filter: (a = $1)
> (9 rows)
>
> While it's true that we will never have to actually update foo_2,
> foo_3, and foo_4, ModifyTable still sets up its ResultRelInfos, which
> ideally it shouldn't.  Maybe we'll need to do something about that
> after all.

* Tuple re-routing during UPDATE. For now it's disabled so your design
should work. But we shouldn't design this feature in such a way that
it gets in the way of enabling tuple re-routing in the future :).

--
Best Wishes,
Ashutosh Bapat

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

11 May 2020, 14:41:30

Hi Ashutosh,

Thanks for chiming in.

On Mon, May 11, 2020 at 9:58 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I do think that doing this would be worthwhile even if we may be
> > increasing ModifyTable's per-row overhead slightly, because planning
> > overhead of the current approach is very significant, especially for
> > partition trees with beyond a couple of thousand partitions.  As to
> > how bad the problem is, trying to create a generic plan for `update
> > foo set ... where key = $1`, where foo has over 2000 partitions,
> > causes OOM even on a machine with 6GB of memory.
>
> Per row overhead would be incurred for every row whereas the plan time
> overhead is one-time or in case of a prepared statement almost free.
> So we need to compare it esp. when there are 2000 partitions and all
> of them are being updated.

I assume that such UPDATEs would be uncommon.

> But generally I agree that this would be a
> better approach. It might help using PWJ when the result relation
> joins with other partitioned table.

It does, because the plan below ModifyTable is same as if the query
were SELECT instead of UPDATE; with my PoC:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
                    QUERY PLAN
--------------------------------------------------
 Update on foo
   Update on foo_1
   Update on foo_2
   ->  Append
         ->  Merge Join
               Merge Cond: (foo_1.a = foo2_1.a)
               ->  Sort
                     Sort Key: foo_1.a
                     ->  Seq Scan on foo_1
               ->  Sort
                     Sort Key: foo2_1.a
                     ->  Seq Scan on foo_1 foo2_1
         ->  Merge Join
               Merge Cond: (foo_2.a = foo2_2.a)
               ->  Sort
                     Sort Key: foo_2.a
                     ->  Seq Scan on foo_2
               ->  Sort
                     Sort Key: foo2_2.a
                     ->  Seq Scan on foo_2 foo2_2
(20 rows)

as opposed to what you get today:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
                    QUERY PLAN
--------------------------------------------------
 Update on foo
   Update on foo_1
   Update on foo_2
   ->  Merge Join
         Merge Cond: (foo_1.a = foo2.a)
         ->  Sort
               Sort Key: foo_1.a
               ->  Seq Scan on foo_1
         ->  Sort
               Sort Key: foo2.a
               ->  Append
                     ->  Seq Scan on foo_1 foo2
                     ->  Seq Scan on foo_2 foo2_1
   ->  Merge Join
         Merge Cond: (foo_2.a = foo2.a)
         ->  Sort
               Sort Key: foo_2.a
               ->  Seq Scan on foo_2
         ->  Sort
               Sort Key: foo2.a
               ->  Append
                     ->  Seq Scan on foo_1 foo2
                     ->  Seq Scan on foo_2 foo2_1
(23 rows)

> > For child result relations that are foreign tables, their FDW adds
> > junk attribute(s) to the query’s targetlist by updating it in-place
> > (AddForeignUpdateTargets).  However, as the child tables will no
> > longer get their own parsetree, we must use some hack around this
> > interface to obtain the foreign table specific junk attributes and add
> > them to the original/parent query’s targetlist.  Assuming that all or
> > most of the children will belong to the same FDW, we will end up with
> > only a handful such junk columns in the final targetlist.  I am not
> > sure if it's worthwhile to change the API of AddForeignUpdateTargets
> > to require FDWs to not scribble on the passed-in parsetree as part of
> > this patch.
>
> What happens if there's a mixture of foreign and local partitions or
> mixture of FDWs? Injecting junk columns from all FDWs in the top level
> target list will cause error because those attributes won't be
> available everywhere.

That is a good question and something I have struggled with ever since I
started thinking about implementing this.

For the problem that FDWs may inject junk columns that could neither
be present in local tables (root parent and other local children) nor
other FDWs, I couldn't think of any solution other than to restrict
what those junk columns can be -- to require them to be either "ctid",
"wholerow", or a set of only *inherited* user columns.  I think that's
what Tom was getting at when he said the following in the email I
cited in my first email:

"...It gets  a bit harder if the tree contains some foreign tables,
because they might have different concepts of row identity, but I'd
think in most cases you could still combine those into a small number
of output columns."

Maybe I misunderstood what Tom said, but I can't imagine how to let
these junk columns be anything that *all* tables contained in an
inheritance tree, especially the root parent, cannot emit, if they are
to be emitted out of a single plan.

> > As for how ModifyTable will create the new tuple for updates, I have
> > decided to use a ProjectionInfo for each result relation, which
> > projects a full, *clean* tuple ready to be put back into the relation.
> > When projecting, plan’s output tuple serves as OUTER tuple and the old
> > tuple fetched to fill unassigned attributes serves as SCAN tuple.  By
> > having this ProjectionInfo also serve as the “junk filter”, we don't
> > need JunkFilters.  The targetlist that this projection computes is
> > same as that of the result-relation-specific plan.  Initially, I
> > thought to generate this "expanded" targetlist in
> > ExecInitModifyTable().  But as it can be somewhat expensive, doing it
> > only once in the planner seemed like a good idea.  These
> > per-result-relations targetlists are carried in the ModifyTable node.
> >
> > To identify the result relation from the tuple produced by the plan,
> > “tableoid” junk column will be used.  As the tuples for different
> > result relations won’t necessarily come out in the order in which
> > result relations are laid out in the ModifyTable node, we need a way
> > to map the tableoid value to result relation indexes.  I have decided
> > to use a hash table here.
>
> Can we plan the scan query to add a sort node to order the rows by tableoid?

Hmm, I am afraid that some piece of partitioning code assumes a
certain order of result relations, and that order is not based on
sorting tableoids.

> > A couple of things that I didn't think very hard what to do about now,
> > but may revisit later.
> >
> > * We will no longer be able use DirectModify APIs to push updates to
> > remote servers for foreign child result relations
>
> If we convert a whole DML into partitionwise DML (just as it happens
> today unintentionally), we should be able to use DirectModify. PWJ
> will help there. But even we can detect that the scan underlying a
> particular partition can be evaluated completely on the node same as
> where the partition resides, we should be able to use DirectModify.

I remember Fujita-san mentioned something like this, but I haven't
looked into how feasible it would be given the current DirectModify
interface.

> But if we are not able to support this optimization, the queries which
> benefit from it for today won't perform well. I think we need to think
> about this now instead of leave for later. Otherwise, make it so that
> we use the old way when there are foreign partitions and new way
> otherwise.

I would very much like to find a solution for this, which hopefully isn't
to fall back to using the old way.

> * Tuple re-routing during UPDATE. For now it's disabled so your design
> should work. But we shouldn't design this feature in such a way that
> it comes in the way to enable tuple re-routing in future :).

Sorry, what is tuple re-routing and why does this new approach get in its way?

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

11 May 2020, 18:34:51

On Mon, May 11, 2020 at 8:58 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> What happens if there's a mixture of foreign and local partitions or
> mixture of FDWs? Injecting junk columns from all FDWs in the top level
> target list will cause error because those attributes won't be
> available everywhere.

I think that we're talking about a plan like this:

Update
-> Append
  -> a bunch of children

I believe that what you'd want to have happen here is for each child to
emit the row identity columns that it knows about, and emit NULL for
the others. Then when you do the Append you end up with a row format
that includes all the individual identity columns, but for any
particular tuple, only one set of such columns is populated and the
others are all NULL. There doesn't seem to be any execution-time
problem with such a representation, but there might be a planning-time
problem with building it, because when you're writing a tlist for the
Append node, what varattno are you going to use for the columns that
exist only in one particular child and not the others? The fact that
setrefs processing happens so late seems like an annoyance in this
case.
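
To make the row format described above concrete, here is a toy Python sketch. It is not planner or executor code; the helper names and the "remote_id" identity column of an imagined FDW are hypothetical.

```python
# Toy model: each child emits the row identity columns it knows about, and
# the Append output pads the others with None (standing in for NULL).
# All names and OID values are hypothetical.


def union_identity_columns(children):
    """Collect every identity column name, in first-seen order."""
    cols = []
    for child in children:
        for col in child["identity_cols"]:
            if col not in cols:
                cols.append(col)
    return cols


def append_rows(children):
    """Yield rows in the combined format, one child after another."""
    cols = union_identity_columns(children)
    for child in children:
        for row in child["rows"]:
            yield {"tableoid": child["oid"], **{c: row.get(c) for c in cols}}


children = [
    {"oid": 1001, "identity_cols": ["ctid"], "rows": [{"ctid": (0, 1)}]},
    {"oid": 1002, "identity_cols": ["remote_id"], "rows": [{"remote_id": 42}]},
]
rows = list(append_rows(children))
```

For any particular row, only the identity columns of the child that produced it are populated. The sketch sidesteps the planning-time question raised above, since dictionary keys need no varattno.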

Maybe it would be easier to have one Update node per kind of row
identity, i.e. if there's more than one such notion then...

Placeholder
-> Update
 -> Append
  -> all children with one notion of row identity
-> Update
 -> Append
  -> all children with another notion of row identity

...and so forth.

But I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: making update/delete of inheritance trees scale better

From

Tom Lane

Date:

11 May 2020, 18:48:41

Robert Haas <robertmhaas@gmail.com> writes:
> I believe that you'd want to have happen here is for each child to
> emit the row identity columns that it knows about, and emit NULL for
> the others. Then when you do the Append you end up with a row format
> that includes all the individual identity columns, but for any
> particular tuple, only one set of such columns is populated and the
> others are all NULL.

Yeah, that was what I'd imagined in my earlier thinking about this.

> There doesn't seem to be any execution-time
> problem with such a representation, but there might be a planning-time
> problem with building it,

Possibly.  We manage to cope with not-all-alike children now, of course,
but I think it might be true that no one plan node has Vars from
dissimilar children.  Even so, the Vars are self-identifying, so it
seems like this ought to be soluble.

            regards, tom lane

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

11 May 2020, 19:10:52

On Mon, May 11, 2020 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There doesn't seem to be any execution-time
> > problem with such a representation, but there might be a planning-time
> > problem with building it,
>
> Possibly.  We manage to cope with not-all-alike children now, of course,
> but I think it might be true that no one plan node has Vars from
> dissimilar children.  Even so, the Vars are self-identifying, so it
> seems like this ought to be soluble.

If the parent is RTI 1, and the children are RTIs 2..6, what
varno/varattno will we use in RTI 1's tlist to represent a column that
exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?

I suppose the answer is 2 - or 3, but I guess we'd pick the first
child as the representative of the class. We surely can't use varno 1,
because then there's no varattno that makes any sense. But if we use
2, now we have the tlist for RTI 1 containing expressions with a
child's RTI as the varno. I could be wrong, but I think that's going
to make setrefs.c throw up and die, and I wouldn't be very surprised
if there were a bunch of other things that crashed and burned, too. I
think we have quite a bit of code that expects to be able to translate
between parent-rel expressions and child-rel expressions, and that's
going to be pretty problematic here.

Maybe your answer is - let's just fix all that stuff. That could well
be right, but my first reaction is to think that it sounds hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: making update/delete of inheritance trees scale better

From

Tom Lane

Date:

11 May 2020, 20:22:40

Robert Haas <robertmhaas@gmail.com> writes:
> If the parent is RTI 1, and the children are RTIs 2..6, what
> varno/varattno will we use in RTI 1's tlist to represent a column that
> exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?

Fair question.  We don't have any problem representing the column
as it exists in any one of those children, but we lack a notation
for the "union" or whatever you want to call it, except in the case
where the parent relation has a corresponding column.  Still, this
doesn't seem that hard to fix.  My inclination would be to invent
dummy parent-rel columns (possibly with negative attnums? not sure if
that'd be easier or harder than adding them in the positive direction)
to represent such "union" columns.  This concept would only need to
exist within the planner I think, since after setrefs.c there'd be no
trace of those dummy columns.
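
One way to picture the dummy-column idea, using the RTI 1..6 example from the question, is the toy Python sketch below. It is not planner code, and invent_dummy_columns and its data layout are hypothetical. It also glosses over the fact that PostgreSQL already uses negative attribute numbers for system columns such as ctid and tableoid, so a real numbering scheme would have to avoid those.

```python
# Toy model of planner-only "dummy" parent columns; not planner code.
# The function name and data layout are hypothetical.


def invent_dummy_columns(parent_cols, child_cols_by_rti):
    """Return (dummy_attnos, child_map).

    dummy_attnos: column name -> invented negative attno on the parent.
    child_map: (dummy attno, child RTI) -> that child's real attno.
    A missing key means the child lacks the column and would emit NULL.
    """
    dummy_attnos = {}
    child_map = {}
    next_attno = -1
    for rti in sorted(child_cols_by_rti):
        cols = child_cols_by_rti[rti]
        for child_attno, col in enumerate(cols, start=1):
            if col in parent_cols:
                continue  # ordinary inherited column, already mappable
            if col not in dummy_attnos:
                dummy_attnos[col] = next_attno
                next_attno -= 1
            # Remember the mapping while inventing the column.
            child_map[(dummy_attnos[col], rti)] = child_attno
    return dummy_attnos, child_map


parent_cols = ["a", "b"]
child_cols_by_rti = {
    2: ["a", "b", "rowid"],
    3: ["b", "rowid", "a"],
    4: ["a", "b"],
    5: ["a", "b"],
    6: ["a", "b"],
}
dummy_attnos, child_map = invent_dummy_columns(parent_cols, child_cols_by_rti)
```

The point of the sketch is only that the parent-to-child mapping falls out of the same pass that invents the dummy columns, so it can be kept around for later translation.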

> I think we have quite a bit of code that expects to be able to translate
> between parent-rel expressions and child-rel expressions, and that's
> going to be pretty problematic here.

... shrug.  Sure, we'll need to be able to do that mapping.  Why will
it be any harder than any other parent <-> child mapping?  The planner
would know darn well what the mapping is while it's inventing the
dummy columns, so it just has to keep that info around for use later.

> Maybe your answer is - let's just fix all that stuff. That could well
> be right, but my first reaction is to think that it sounds hard.

I have to think that it'll net out as less code, and certainly less
complicated code, than trying to extend inheritance_planner in its
current form to do what we wish it'd do.

            regards, tom lane

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

11 May 2020, 20:25:38

On Mon, May 11, 2020 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > If the parent is RTI 1, and the children are RTIs 2..6, what
> > varno/varattno will we use in RTI 1's tlist to represent a column that
> > exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?
>
> Fair question.  We don't have any problem representing the column
> as it exists in any one of those children, but we lack a notation
> for the "union" or whatever you want to call it, except in the case
> where the parent relation has a corresponding column.  Still, this
> doesn't seem that hard to fix.  My inclination would be to invent
> dummy parent-rel columns (possibly with negative attnums? not sure if
> that'd be easier or harder than adding them in the positive direction)
> to represent such "union" columns.

Ah, that makes sense. If we can invent dummy columns on the parent
rel, then most of what I was worrying about no longer seems very
worrying.

I'm not sure what's involved in inventing such dummy columns, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: making update/delete of inheritance trees scale better

From

Ashutosh Bapat

Date:

12 May 2020, 12:54:17

On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Per row overhead would be incurred for every row whereas the plan time
> > overhead is one-time or in case of a prepared statement almost free.
> > So we need to compare it esp. when there are 2000 partitions and all
> > of them are being updated.
>
> I assume that such UPDATEs would be uncommon.

Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that common. We have to
know how large that per-row overhead is and how many rows must be
updated for it to exceed the planning-time overhead. If the number of
rows is very large, we are good.
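
The comparison can be written down as simple break-even arithmetic. The Python sketch below uses a hypothetical helper, and the input numbers are invented purely for illustration; they are not measurements from the patch.

```python
# Back-of-the-envelope model; the numbers used below are made up.


def break_even_rows(planning_saved_us: int, per_row_overhead_us: int) -> int:
    """Smallest number of updated rows at which the added per-row cost
    equals or exceeds the planning time saved (integer microseconds)."""
    if per_row_overhead_us <= 0:
        raise ValueError("per-row overhead must be positive")
    # Ceiling division without floating point.
    return -(-planning_saved_us // per_row_overhead_us)


# Hypothetical inputs: 0.5 s of planning saved, 2 us extra per row.
rows_needed = break_even_rows(planning_saved_us=500_000, per_row_overhead_us=2)
```

With these invented inputs the new approach would only lose once a single statement updates 250000 rows; the real figures would have to come from benchmarking.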

>
> > But generally I agree that this would be a
> > better approach. It might help using PWJ when the result relation
> > joins with other partitioned table.
>
> It does, because the plan below ModifyTable is same as if the query
> were SELECT instead of UPDATE; with my PoC:
>
> explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
> foo.a = foo2.a;
>                     QUERY PLAN
> --------------------------------------------------
>  Update on foo
>    Update on foo_1
>    Update on foo_2
>    ->  Append
>          ->  Merge Join
>                Merge Cond: (foo_1.a = foo2_1.a)
>                ->  Sort
>                      Sort Key: foo_1.a
>                      ->  Seq Scan on foo_1
>                ->  Sort
>                      Sort Key: foo2_1.a
>                      ->  Seq Scan on foo_1 foo2_1
>          ->  Merge Join
>                Merge Cond: (foo_2.a = foo2_2.a)
>                ->  Sort
>                      Sort Key: foo_2.a
>                      ->  Seq Scan on foo_2
>                ->  Sort
>                      Sort Key: foo2_2.a
>                      ->  Seq Scan on foo_2 foo2_2
> (20 rows)

Wonderful. That looks good.


> > Can we plan the scan query to add a sort node to order the rows by tableoid?
>
> Hmm, I am afraid that some piece of partitioning code that assumes a
> certain order of result relations, and that order is not based on
> sorting tableoids.

I am suggesting that we override that order (if any) in
create_modifytable_path() or create_modifytable_plan() by explicitly
ordering the incoming paths on tableoid. Maybe by using MergeAppend.


>
> > * Tuple re-routing during UPDATE. For now it's disabled so your design
> > should work. But we shouldn't design this feature in such a way that
> > it comes in the way to enable tuple re-routing in future :).
>
> Sorry, what is tuple re-routing and why does this new approach get in its way?

An UPDATE causing a tuple to move to a different partition. It would
get in its way since the tuple will be located based on tableoid,
which will be the OID of the old partition. But I think this approach
has a higher chance of eventually solving that problem than the
current approach.
-- 
Best Wishes,
Ashutosh Bapat

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

12 May 2020, 12:55:36

On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 11, 2020 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > If the parent is RTI 1, and the children are RTIs 2..6, what
> > > varno/varattno will we use in RTI 1's tlist to represent a column that
> > > exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?
> >
> > Fair question.  We don't have any problem representing the column
> > as it exists in any one of those children, but we lack a notation
> > for the "union" or whatever you want to call it, except in the case
> > where the parent relation has a corresponding column.  Still, this
> > doesn't seem that hard to fix.  My inclination would be to invent
> > dummy parent-rel columns (possibly with negative attnums? not sure if
> > that'd be easier or harder than adding them in the positive direction)
> > to represent such "union" columns.
>
> Ah, that makes sense. If we can invent dummy columns on the parent
> rel, then most of what I was worrying about no longer seems very
> worrying.

IIUC, the idea is to have "dummy" columns in the top parent's
reltarget for every junk TLE added to the top-level targetlist by
child tables' FDWs that the top parent itself can't emit. But we allow
these FDW junk TLEs to contain any arbitrary expression, not just
plain Vars [1], so what node type are these dummy parent columns?  I
can see from add_vars_to_targetlist() that we allow only Vars and
PlaceHolderVars to be present in a relation's reltarget->exprs, but
neither of those seems suitable for the task.

Once we get something in the parent's reltarget->exprs representing
these child expressions, from there they go back into child
reltargets, so it would appear that our appendrel transformation code
must somehow be taught to deal with these dummy columns.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/docs/current/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

"...If the extra expressions are more complex than simple Vars, they
must be run through eval_const_expressions before adding them to the
targetlist."

Re: making update/delete of inheritance trees scale better

From

Tom Lane

Date:

12 May 2020, 13:57:44

Amit Langote <amitlangote09@gmail.com> writes:
> On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Ah, that makes sense. If we can invent dummy columns on the parent
>> rel, then most of what I was worrying about no longer seems very
>> worrying.

> IIUC, the idea is to have "dummy" columns in the top parent's
> reltarget for every junk TLE added to the top-level targetlist by
> child tables' FDWs that the top parent itself can't emit. But we allow
> these FDW junk TLEs to contain any arbitrary expression, not just
> plain Vars [1], so what node type are these dummy parent columns?

We'd have to group the children into groups that share the same
row-identity column type.  This is why I noted way-back-when that
it'd be a good idea to discourage FDWs from being too wild about
what they use for row identity.

(Also, just to be totally clear: I am *not* envisioning this as a
mechanism for FDWs to inject whatever computations they darn please
into query trees.  It's for the row identity needed by UPDATE/DELETE,
and nothing else.  That being the case, it's hard to understand why
the bottom-level Vars wouldn't be just plain Vars --- maybe "system
column" Vars or something like that, but still just Vars, not
expressions.)

            regards, tom lane

Re: making update/delete of inheritance trees scale better

From

David Rowley

Date:

12 May 2020, 23:51:54

On Wed, 13 May 2020 at 00:54, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > Per row overhead would be incurred for every row whereas the plan time
> > > overhead is one-time or in case of a prepared statement almost free.
> > > So we need to compare it esp. when there are 2000 partitions and all
> > > of them are being updated.
> >
> > I assume that such UPDATEs would be uncommon.
>
> Yes, 2000 partitions being updated would be rare. But many rows from
> the same partition being updated may not be that common. We have to
> know how much is that per row overhead and updating how many rows it
> takes to beat the planning time overhead. If the number of rows is
> very large, we are good.

Rows from a non-parallel Append should arrive in order. If you were
worried about the performance of finding the correct ResultRelInfo for
the tuple that we just got, then we could just cache the tableOid and
ResultRelInfo for the last row, and if that tableoid matches on this
row, just use the same ResultRelInfo as last time.   That'll save
doing the hash table lookup in all cases, apart from when the Append
changes to the next child subplan.  Not sure exactly how that'll fit
in with the foreign table discussion that's going on here though.
Another option would be to not use tableoid and instead inject an INT4
Const (0 to nsubplans) into each subplan's targetlist that serves as
the index into an array of ResultRelInfos.
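
The last-row cache described above could be sketched roughly as follows.
This is a Python illustration of the idea only, not executor code:
`ResultRelLookup` and its fields are hypothetical names, a dict stands in
for the hash table keyed by tableoid, and the OIDs are invented.

```python
class ResultRelLookup:
    """Map a tuple's tableoid to its result relation, caching the last hit."""

    def __init__(self, result_rels):
        # result_rels: tableoid -> result relation info (a stand-in for
        # the executor's hash table of ResultRelInfos).
        self._by_oid = dict(result_rels)
        self._last_oid = None
        self._last_rel = None
        self.hash_lookups = 0  # how many times the cache was bypassed

    def find(self, tableoid):
        # Fast path: same table as the previous row.
        if tableoid == self._last_oid:
            return self._last_rel
        # Slow path: hash lookup, then remember the result.
        self.hash_lookups += 1
        rel = self._by_oid[tableoid]
        self._last_oid, self._last_rel = tableoid, rel
        return rel


lookup = ResultRelLookup({16384: "foo_1", 16385: "foo_2"})
# Rows from a non-parallel Append arrive grouped by child subplan.
rels = [lookup.find(oid) for oid in (16384, 16384, 16384, 16385, 16385)]
```

With rows grouped by child, the hash table is consulted once per child
subplan rather than once per row. If a join interleaves rows from different
children, the cache simply misses more often and the lookup degrades to the
plain hash-table cost.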

As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions. Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront.  That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.
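
Lazy construction of the kind suggested above might look like this in
outline. It is a sketch under assumptions: `LazyResultRels` and the `build`
callback are hypothetical, locking is taken as already done, and the real
ResultRelInfo setup involves far more than one function call.

```python
class LazyResultRels:
    """Build per-relation state only when a row for that relation shows up."""

    def __init__(self, candidate_oids, build):
        # candidate_oids: every relation the planner could not prune.
        self._candidates = set(candidate_oids)
        self._build = build  # hypothetical, expensive setup function
        self._built = {}

    def get(self, tableoid):
        if tableoid not in self._candidates:
            raise KeyError(f"relation {tableoid} is not a result relation")
        if tableoid not in self._built:
            self._built[tableoid] = self._build(tableoid)
        return self._built[tableoid]


built_log = []


def build(oid):
    built_log.append(oid)  # record which relations were actually set up
    return f"ResultRelInfo({oid})"


rels = LazyResultRels(range(16384, 16384 + 2000), build)
# Only two of the 2000 candidates ever produce a row.
for oid in (16390, 16390, 17000, 16390):
    rels.get(oid)
```

Here 2000 relations survive pruning but only two are ever initialized, which
is the case the paragraph above is concerned with.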

David

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

13 May 2020, 03:50:54

On Tue, May 12, 2020 at 9:54 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > Per row overhead would be incurred for every row whereas the plan time
> > > overhead is one-time or in case of a prepared statement almost free.
> > > So we need to compare it esp. when there are 2000 partitions and all
> > > of them are being updated.
> >
> > I assume that such UPDATEs would be uncommon.
>
> Yes, 2000 partitions being updated would be rare. But many rows from
> the same partition being updated may not be that common. We have to
> know how much is that per row overhead and updating how many rows it
> takes to beat the planning time overhead. If the number of rows is
> very large, we are good.

Maybe I am misunderstanding you, but the more the rows to update, the
more overhead we will be paying with the new approach.

> > > Can we plan the scan query to add a sort node to order the rows by tableoid?
> >
> > Hmm, I am afraid that some piece of partitioning code that assumes a
> > certain order of result relations, and that order is not based on
> > sorting tableoids.
>
> I am suggesting that we override that order (if any) in
> create_modifytable_path() or create_modifytable_plan() by explicitly
> ordering the incoming paths on tableoid. May be using MergeAppend.

So, we will need to do 2 things:

1. Implicitly apply an ORDER BY tableoid clause
2. Add result relation RTIs to ModifyTable.resultRelations in the
order of their RTE's relid.

Maybe we can do that as a separate patch.  Also, I am not sure if it
will get in the way of someone wanting to have ORDER BY LIMIT for
updates.
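
The two steps above can be pictured with a small Python analogy, in which
`heapq.merge` plays the role of a MergeAppend on tableoid. The function name
and the OIDs are invented for the illustration; nothing here reflects the
actual planner data structures.

```python
import heapq


def merge_by_tableoid(child_streams):
    """Merge per-child row streams into one stream ordered by tableoid."""
    # Each child stream carries a single tableoid, so it is trivially sorted
    # on that key, which is what heapq.merge requires of its inputs.
    return list(heapq.merge(*child_streams, key=lambda row: row["tableoid"]))


# Hypothetical OIDs; the children are listed in some arbitrary planner order.
streams = [
    [{"tableoid": 16390, "ctid": (0, 1)}, {"tableoid": 16390, "ctid": (0, 2)}],
    [{"tableoid": 16384, "ctid": (0, 7)}],
    [{"tableoid": 16387, "ctid": (3, 4)}],
]
ordered = merge_by_tableoid(streams)
# Result relations listed in the same (relid) order as the rows will arrive.
result_relations = sorted({row["tableoid"] for row in ordered})
```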

> > > * Tuple re-routing during UPDATE. For now it's disabled so your design
> > > should work. But we shouldn't design this feature in such a way that
> > > it comes in the way to enable tuple re-routing in future :).
> >
> > Sorry, what is tuple re-routing and why does this new approach get in its way?
>
> An UPDATE causing a tuple to move to a different partition. It would
> get in its way since the tuple will be located based on tableoid,
> which will be the oid of the old partition. But I think this approach
> has higher chance of being able to solve that problem eventually
> rather than the current approach.

Again, I don't think I understand.  We do currently (as of v11)
re-route tuples when an UPDATE causes them to move to a different
partition, which, happily, continues to work with my patch.

Here is how it works: for a given "new" tuple, ExecUpdate()
checks if the tuple would violate the partition constraint of the
result relation that was passed along with the tuple.  If it does, the
tuple is moved, by calling ExecDelete() to delete the old version from
the current relation, followed by ExecInsert() to find the new home for
the new tuple.  The only thing that changes with the new approach is how
ExecModifyTable() chooses a result relation to pass to ExecUpdate()
for a given "new" tuple it has fetched from the plan, which is quite
independent of the tuple re-routing mechanism proper.
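
As a rough model of that flow, here is a toy version in Python. It assumes
range partitions on a single column "a" and keeps rows in plain lists; the
names `exec_update` and `exec_insert` only echo the executor functions and
are not their real signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    name: str
    lo: int  # inclusive lower bound on column "a"
    hi: int  # exclusive upper bound on column "a"
    rows: list = field(default_factory=list)

    def accepts(self, row):
        # Stand-in for the partition constraint check.
        return self.lo <= row["a"] < self.hi


def exec_insert(partitions, row):
    # Tuple routing: find the partition whose constraint the row satisfies.
    for part in partitions:
        if part.accepts(row):
            part.rows.append(row)
            return part
    raise ValueError("no partition found for row")


def exec_update(partitions, result_rel, old_row, new_row):
    if result_rel.accepts(new_row):
        # Ordinary update within the same partition.
        result_rel.rows[result_rel.rows.index(old_row)] = new_row
        return result_rel
    # Constraint violated: delete the old version here, then route the
    # new tuple to whichever partition accepts it.
    result_rel.rows.remove(old_row)
    return exec_insert(partitions, new_row)


foo_1 = Partition("foo_1", 0, 100, [{"a": 5}])
foo_2 = Partition("foo_2", 100, 200)
dest = exec_update([foo_1, foo_2], foo_1, foo_1.rows[0], {"a": 150})
```

Note that `exec_update` receives `result_rel` from its caller. How the caller
picked it (per-child plan or tableoid lookup) does not affect the routing
logic, which is the point made above.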

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

13 May 2020, 07:02:35

On Wed, May 13, 2020 at 8:52 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 13 May 2020 at 00:54, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > Per row overhead would be incurred for every row whereas the plan time
> > > > overhead is one-time or in case of a prepared statement almost free.
> > > > So we need to compare it esp. when there are 2000 partitions and all
> > > > of them are being updated.
> > >
> > > I assume that such UPDATEs would be uncommon.
> >
> > Yes, 2000 partitions being updated would be rare. But many rows from
> > the same partition being updated may not be that common. We have to
> > know how much is that per row overhead and updating how many rows it
> > takes to beat the planning time overhead. If the number of rows is
> > very large, we are good.
>
> Rows from a non-parallel Append should arrive in order. If you were
> worried about the performance of finding the correct ResultRelInfo for
> the tuple that we just got, then we could just cache the tableOid and
> ResultRelInfo for the last row, and if that tableoid matches on this
> row, just use the same ResultRelInfo as last time.   That'll save
> doing the hash table lookup in all cases, apart from when the Append
> changes to the next child subplan.

That would be a more common case, yes.  Not when a join is involved though.

>  Not sure exactly how that'll fit
> in with the foreign table discussion that's going on here though.

The foreign table discussion is concerned with what the single top-level
targetlist should look like, given that different result relations may
require different row-identifying junk columns, due to possibly
belonging to different FDWs.  Currently that's not a thing to worry
about, because each result relation has its own plan and hence its own
targetlist.

> Another option would be to not use tableoid and instead inject an INT4
> Const (0 to nsubplans) into each subplan's targetlist that serves as
> the index into an array of ResultRelInfos.

That may be a bit fragile, considering how volatile that number
(result relation index) can be if you figure in run-time pruning, but
maybe worth considering.

> As for which ResultRelInfos to initialize, couldn't we just have the
> planner generate an OidList of all the ones that we could need.
> Basically, all the non-pruned partitions.

Why would replacing the list of RT indexes with OIDs be better?

> Perhaps we could even be
> pretty lazy about building those ResultRelInfos during execution too.
> We'd need to grab the locks first, but, without staring at the code, I
> doubt there's a reason we'd need to build them all upfront.  That
> would help in cases where pruning didn't prune much, but due to
> something else in the WHERE clause, the results only come from some

Late ResultRelInfo initialization is worth considering, given that
doing it for tuple-routing target relations works.  I don't know why
we are still initializing them all in InitPlan(), because the only
justification given for doing so that I know of is that it prevents
lock-upgrade.  I think we discussed somewhat recently that that is not
really a hazard.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

13 May 2020, 13:15:45

On Tue, May 12, 2020 at 10:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >> Ah, that makes sense. If we can invent dummy columns on the parent
> >> rel, then most of what I was worrying about no longer seems very
> >> worrying.
>
> > IIUC, the idea is to have "dummy" columns in the top parent's
> > reltarget for every junk TLE added to the top-level targetlist by
> > child tables' FDWs that the top parent itself can't emit. But we allow
> > these FDW junk TLEs to contain any arbitrary expression, not just
> > plain Vars [1], so what node type are these dummy parent columns?
>
> We'd have to group the children into groups that share the same
> row-identity column type.  This is why I noted way-back-when that
> it'd be a good idea to discourage FDWs from being too wild about
> what they use for row identity.

I understood the part about having a dummy parent column for each
group of children that use the same junk attribute.  I think we must
group them using resname + row-identity Var type though, not just the
latter, because during execution, the FDWs look up the junk columns by
name.  If two FDWs add junk Vars of the same type, say, 'tid', but use
different resname, say, "ctid" and "rowid", respectively, we must add
two dummy parent columns.
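
To make the grouping rule concrete, here is a small Python sketch that keys
the groups on the (resname, type) pair. The function name and the sample
children are hypothetical; only the "ctid"/"rowid" example comes from the
paragraph above.

```python
def dummy_parent_columns(child_junk_attrs):
    """Group children by the (resname, type) of their row-identity junk Var.

    child_junk_attrs: iterable of (child_name, resname, type_name).
    Returns one entry per dummy parent column that would be needed.
    """
    groups = {}
    for child, resname, typname in child_junk_attrs:
        groups.setdefault((resname, typname), []).append(child)
    return groups


junk = [
    ("foo_1", "ctid", "tid"),
    ("foo_2", "ctid", "tid"),
    ("foo_3", "rowid", "tid"),  # same type, different resname
]
cols = dummy_parent_columns(junk)
```

Grouping on type alone would fold all three children into one column, and
the FDW behind foo_3 would then fail to find a junk column named "rowid".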

> (Also, just to be totally clear: I am *not* envisioning this as a
> mechanism for FDWs to inject whatever computations they darn please
> into query trees.  It's for the row identity needed by UPDATE/DELETE,
> and nothing else.  That being the case, it's hard to understand why
> the bottom-level Vars wouldn't be just plain Vars --- maybe "system
> column" Vars or something like that, but still just Vars, not
> expressions.)

I suppose we would need to explicitly check that and cause an error if
the contained expression is not a plain Var.  Neither the interface
we've got nor the documentation discourages them from putting just
about any expression into the junk TLE.

Based on an off-list chat with Robert, I started looking into whether
it would make sense to drop the middleman Append (or MergeAppend)
altogether, if only to avoid having to invent a representation for
parent targetlist that is never actually computed.  However, it's not
hard to imagine that any new book-keeping code to manage child plans,
even though perhaps cheaper in terms of cycles spent than
inheritance_planner(), would add complexity to the main planner.  It
would also be a shame to lose useful functionality that we get by
having an Append present, such as run-time pruning and partitionwise
joins.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

David Rowley

Date:

13 May 2020, 22:55:00

On Wed, 13 May 2020 at 19:02, Amit Langote <amitlangote09@gmail.com> wrote:
> > As for which ResultRelInfos to initialize, couldn't we just have the
> > planner generate an OidList of all the ones that we could need.
> > Basically, all the non-pruned partitions.
>
> Why would replacing list of RT indexes by OIDs be better?

TBH, I didn't refresh my memory of the code before saying that.
However, if we have a list of RT indexes of the rangetable entries for
which we must build ResultRelInfos, then why is it a problem that
plan-time pruning is not allowing you to eliminate the excess
ResultRelInfos, like you mentioned in:

On Sat, 9 May 2020 at 01:33, Amit Langote <amitlangote09@gmail.com> wrote:
> prepare q as update foo set a = 250001 where a = $1;
> set plan_cache_mode to 'force_generic_plan';
> explain execute q(1);
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Update on foo  (cost=0.00..142.20 rows=40 width=14)
>    Update on foo_1
>    Update on foo_2 foo
>    Update on foo_3 foo
>    Update on foo_4 foo
>    ->  Append  (cost=0.00..142.20 rows=40 width=14)
>          Subplans Removed: 3
>          ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
>                Filter: (a = $1)
> (9 rows)

Shouldn't you just be setting the ModifyTablePath.resultRelations to
the non-pruned RT indexes?

> > Perhaps we could even be
> > pretty lazy about building those ResultRelInfos during execution too.
> > We'd need to grab the locks first, but, without staring at the code, I
> > doubt there's a reason we'd need to build them all upfront.  That
> > would help in cases where pruning didn't prune much, but due to
> > something else in the WHERE clause, the results only come from some
>
> Late ResultRelInfo initialization is worth considering, given that
> doing it for tuple-routing target relations works.  I don't know why
> we are still initializing them all in InitPlan(), because the only
> justification given for doing so that I know of is that it prevents
> lock-upgrade.  I think we discussed somewhat recently that that is not
> really a hazard.

Looking more closely at ExecGetRangeTableRelation(), we'll already
have the lock by that time; there's an Assert to verify that too.
It'll have been acquired either during planning or during
AcquireExecutorLocks(). So it seems that delaying the building of
ResultRelInfos wouldn't need to account for taking the lock at a
different time.

David

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

14 May 2020, 05:09:46

On Thu, May 14, 2020 at 7:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 13 May 2020 at 19:02, Amit Langote <amitlangote09@gmail.com> wrote:
> > > As for which ResultRelInfos to initialize, couldn't we just have the
> > > planner generate an OidList of all the ones that we could need.
> > > Basically, all the non-pruned partitions.
> >
> > Why would replacing list of RT indexes by OIDs be better?
>
> TBH, I didn't refresh my memory of the code before saying that.
> However, if we have a list of RT index for which rangetable entries we
> must build ResultRelInfos for, then why is it a problem that plan-time
> pruning is not allowing you to eliminate the excess ResultRelInfos,
> like you mentioned in:
>
> On Sat, 9 May 2020 at 01:33, Amit Langote <amitlangote09@gmail.com> wrote:
> > prepare q as update foo set a = 250001 where a = $1;
> > set plan_cache_mode to 'force_generic_plan';
> > explain execute q(1);
> >                              QUERY PLAN
> > --------------------------------------------------------------------
> >  Update on foo  (cost=0.00..142.20 rows=40 width=14)
> >    Update on foo_1
> >    Update on foo_2 foo
> >    Update on foo_3 foo
> >    Update on foo_4 foo
> >    ->  Append  (cost=0.00..142.20 rows=40 width=14)
> >          Subplans Removed: 3
> >          ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
> >                Filter: (a = $1)
> > (9 rows)
>
> Shouldn't you just be setting the ModifyTablePath.resultRelations to
> the non-pruned RT indexes?

Oh, that example is showing run-time pruning for a generic plan.  If
the planner prunes partitions, of course, their result relation indexes
are not present in ModifyTablePath.resultRelations.

> > > Perhaps we could even be
> > > pretty lazy about building those ResultRelInfos during execution too.
> > > We'd need to grab the locks first, but, without staring at the code, I
> > > doubt there's a reason we'd need to build them all upfront.  That
> > > would help in cases where pruning didn't prune much, but due to
> > > something else in the WHERE clause, the results only come from some
> >
> > Late ResultRelInfo initialization is worth considering, given that
> > doing it for tuple-routing target relations works.  I don't know why
> > we are still initializing them all in InitPlan(), because the only
> > justification given for doing so that I know of is that it prevents
> > lock-upgrade.  I think we discussed somewhat recently that that is not
> > really a hazard.
>
> Looking more closely at ExecGetRangeTableRelation(), we'll already
> have the lock by that time, there's an Assert to verify that too.
> It'll have been acquired either during planning or during
> AcquireExecutorLocks(). So it seems doing anything for delaying the
> building of ResultRelInfos wouldn't need to account for taking the
> lock at a different time.

Yep, I think it might be worthwhile to delay ResultRelInfo building
for UPDATE/DELETE too.  I would like to leave that for another patch
though.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Ashutosh Bapat

Date:

14 May 2020, 12:54:25

On Wed, May 13, 2020 at 9:21 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Maybe I am misunderstanding you, but the more the rows to update, the
> more overhead we will be paying with the new approach.

Yes, that's right. The question is how much that is compared to the
current planning overhead, and how many rows it takes for that overhead
to become comparable to the current planning overhead.

But let's not sweat on that point much right now.

>
> So, we will need to do 2 things:
>
> 1. Implicitly apply an ORDER BY tableoid clause
> 2. Add result relation RTIs to ModifyTable.resultRelations in the
> order of their RTE's relid.
>
> Maybe we can do that as a separate patch.  Also, I am not sure if it
> will get in the way of someone wanting to have ORDER BY LIMIT for
> updates.

It won't. But maybe David's idea is better.

>
> > > > * Tuple re-routing during UPDATE. For now it's disabled so your design
> > > > should work. But we shouldn't design this feature in such a way that
> > > > it comes in the way to enable tuple re-routing in future :).
> > >
> > > Sorry, what is tuple re-routing and why does this new approach get in its way?
> >
> > An UPDATE causing a tuple to move to a different partition. It would
> > get in its way since the tuple will be located based on tableoid,
> > which will be the oid of the old partition. But I think this approach
> > has higher chance of being able to solve that problem eventually
> > rather than the current approach.
>
> Again, I don't think I understand.   We do currently (as of v11)
> re-route tuples when UPDATE causes them to move to a different
> partition, which, gladly, continues to work with my patch.

Ah! Ok. I missed that part then.

>
> So how it works is like this: for a given "new" tuple, ExecUpdate()
> checks if the tuple would violate the partition constraint of the
> result relation that was passed along with the tuple.  If it does, the
> new tuple will be moved, by calling ExecDelete() to delete it from the
> current relation, followed by ExecInsert() to find the new home for
> the tuple.  The only thing that changes with the new approach is how
> ExecModifyTable() chooses a result relation to pass to ExecUpdate()
> for a given "new" tuple it has fetched from the plan, which is quite
> independent from the tuple re-routing mechanism proper.
>

Thanks for the explanation.

-- 
Best Wishes,
Ashutosh Bapat

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

02 June 2020, 04:15:24

So, I think I have a patch that seems to work, but not all the way,
more on which below.

Here is the commit message in the attached patch.

===
Subject: [PATCH] Overhaul UPDATE's targetlist processing

Instead of emitting the full tuple matching the target table's tuple
descriptor, make the plan emit only the attributes that are assigned
values in the SET clause, plus row-identity junk attributes as before.
This allows us to avoid making a separate plan for each target
relation in the inheritance case, because the only reason it is so
currently is to account for the fact that each target relation may
have a set of attributes that is different from others.  Having only
one plan suffices, because the set of assigned attributes must be same
in all the result relations.

While the plan will now produce only the assigned attributes and
row-identity junk attributes, other columns' values are filled by
refetching the old tuple. To that end, there will be a targetlist for
each target relation to compute the full tuple, that is, by combining
the values from the plan tuple and the old tuple, but they are passed
separately in the ModifyTable node.

Implementation notes:

* In the inheritance case, as the same plan produces tuples to be
updated from multiple result relations, the tuples now need to also
identify which table they come from, so an additional junk attribute
"tableoid" is present in that case.

* Considering that the inheritance set may contain foreign tables that
require a different (set of) row-identity junk attribute(s), the plan
needs to emit multiple distinct junk attributes.  When transposed to a
child scan node, this targetlist emits a non-NULL value for the junk
attribute that's valid for the child relation and NULL for others.

* Executor and FDW execution APIs can no longer assume any specific
order in which the result relations will be processed. For each
tuple to be updated/deleted, result relation is selected by looking it
up in a hash table using the "tableoid" value as the key.

* Since the plan does not emit values for all the attributes, FDW APIs
may not assume that the individual column values in the TupleTableSlot
containing the plan tuple are accessible by their attribute numbers.

TODO:

* Reconsider having only one plan!
* Update FDW handler docs to reflect the API changes
===
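
The refetch-and-combine step from the commit message can be pictured with a
short Python sketch. It assumes rows are dicts keyed by column name and that
the junk attributes have already been stripped from the plan tuple;
`build_new_tuple` is a hypothetical helper, not a function in the patch.

```python
def build_new_tuple(child_columns, assigned, old_tuple):
    """Produce the full new tuple in the child's own column order.

    child_columns: the child's column names, in its attribute-number order.
    assigned:      values emitted by the plan for the SET columns only.
    old_tuple:     the refetched old row, keyed by column name.
    """
    return [
        assigned[col] if col in assigned else old_tuple[col]
        for col in child_columns
    ]


# UPDATE foo SET a = a + 1: the plan emits only "a" (plus junk columns).
assigned = {"a": 2}

# Two children with the same columns at different attribute numbers.
new_foo_1 = build_new_tuple(["a", "b"], assigned, {"a": 1, "b": "x"})
new_foo_2 = build_new_tuple(["b", "c", "a"], assigned, {"b": "x", "c": 9, "a": 1})
```

The same plan output serves both children because the assigned column is
matched by name, while each child supplies its own layout for everything
else.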

Regarding the first TODO, it is to address the limitation that FDWs
will no longer be able to push the *whole* child UPDATE/DELETE query down
to the remote server, including any joins, which is allowed at the
moment via PlanDirectModify API.  The API seems to have been designed
with an assumption that the child scan/join node is the top-level
plan, but that's no longer the case.  If we consider bypassing the
Append and allow ModifyTable to access the child scan/join nodes
directly, maybe we can allow that.  I haven't updated the expected
output of postgres_fdw regression tests for now pending this.

A couple of things in the patch that I feel slightly uneasy about:

* Result relations are now appendrel children in the planner.
Normally, any wholerow Vars in the child relation's reltarget->exprs
get a ConvertRowType added on top to convert it back to the parent's
reltype, because that's what the client expects in the SELECT case.
In the result relation case, the executor expects to see child
wholerow Vars themselves, not their parent versions.

* FDW's ExecForeignUpdate() API expects that the NEW tuple passed to it
matches the target foreign table reltype, so that it can access the
target attributes in the tuple by attribute numbers.  Considering that
the plan no longer builds the full tuple itself, I made the executor
satisfy that expectation by filling the missing attributes' values
using the target table's wholerow attribute.  That is, we now *always*
fetch the wholerow attributes for UPDATE, not just when there are
row-level triggers that need it.  I think that's unfortunate.  Maybe,
the correct way is asking the FDWs to translate (setrefs.c style) the
target attribute numbers appropriately to access the plan's output
tuple.

I will add the patch to the next CF.  I haven't yet fully checked the
performance considerations of the new approach, but will do so in the
coming days.


--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachment

v1-0001-Overhaul-UPDATE-s-targetlist-processing.patch

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

12 June 2020, 06:46:41

On Tue, Jun 2, 2020 at 1:15 PM Amit Langote <amitlangote09@gmail.com> wrote:
> So, I think I have a patch that seems to work, but not all the way,
> more on which below.
>
> Here is the commit message in the attached patch.
>
> ===
> Subject: [PATCH] Overhaul UPDATE's targetlist processing
>
> Instead of emitting the full tuple matching the target table's tuple
> descriptor, make the plan emit only the attributes that are assigned
> values in the SET clause, plus row-identity junk attributes as before.
> This allows us to avoid making a separate plan for each target
> relation in the inheritance case, because the only reason it is so
> currently is to account for the fact that each target relation may
> have a set of attributes that is different from others.  Having only
> one plan suffices, because the set of assigned attributes must be same
> in all the result relations.
>
> While the plan will now produce only the assigned attributes and
> row-identity junk attributes, other columns' values are filled by
> refetching the old tuple. To that end, there will be a targetlist for
> each target relation to compute the full tuple, that is, by combining
> the values from the plan tuple and the old tuple, but they are passed
> separately in the ModifyTable node.
>
> Implementation notes:
>
> * In the inheritance case, as the same plan produces tuples to be
> updated from multiple result relations, the tuples now need to also
> identify which table they come from, so an additional junk attribute
> "tableoid" is present in that case.
>
> * Considering that the inheritance set may contain foreign tables that
> require a different (set of) row-identity junk attribute(s), the plan
> needs to emit multiple distinct junk attributes.  When transposed to a
> child scan node, this targetlist emits a non-NULL value for the junk
> attribute that's valid for the child relation and NULL for others.
>
> * Executor and FDW execution APIs can no longer assume any specific
> order in which the result relations will be processed. For each
> tuple to be updated/deleted, result relation is selected by looking it
> up in a hash table using the "tableoid" value as the key.
>
> * Since the plan does not emit values for all the attributes, FDW APIs
> may not assume that the individual column values in the TupleTableSlot
> containing the plan tuple are accessible by their attribute numbers.
>
> TODO:
>
> * Reconsider having only one plan!
> * Update FDW handler docs to reflect the API changes
> ===

I divided that into two patches:

1. Make the plan producing tuples to be updated emit only the columns
that are actually updated.  postgres_fdw test fails unless you also
apply the patch I posted at [1], because there is an unrelated bug in
UPDATE tuple routing code that manifests due to some changes of this
patch.

2. Due to 1, inheritance_planner() is no longer needed, that is,
inherited update/delete can be handled by pulling the rows to
update/delete from only one plan, not one per child result relation.
This one makes that so.

There are some unsolved problems having to do with foreign tables in
both 1 and 2:

In 1, FDW update APIs still assume that the plan produces a "full" tuple
for update.  That needs to be fixed so that FDWs deal with getting
only the updated columns in the plan's output targetlist.

In 2, I still haven't figured out a way to call PlanDirectModify() on
child foreign tables.  Lacking that, inherited updates on foreign
tables are now slower, because they are not pushed down.  I'd like to
figure something out to fix that situation.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqE_UK1jTSNrjb8mpTdivzd3dum6mK--xqKq0Y9VmfwWQA%40mail.gmail.com

Attachment

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

11 September 2020, 10:20:56

Hello,

I have been working away at this and have updated the patches for many
cosmetic and some functional improvements.

On Fri, Jun 12, 2020 at 3:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I divided that into two patches:
>
> 1. Make the plan producing tuples to be updated emit only the columns
> that are actually updated.  postgres_fdw test fails unless you also
> apply the patch I posted at [1], because there is an unrelated bug in
> UPDATE tuple routing code that manifests due to some changes of this
> patch.
>
> 2. Due to 1, inheritance_planner() is no longer needed, that is,
> inherited update/delete can be handled by pulling the rows to
> update/delete from only one plan, not one per child result relation.
> This one makes that so.
>
> There are some unsolved problems having to do with foreign tables in
> both 1 and 2:
>
> In 1, FDW update APIs still assume that the plan produces "full" tuple
> for update.  That needs to be fixed so that FDWs deal with getting
> only the updated columns in the plan's output targetlist.
>
> In 2, still haven't figured out a way to call PlanDirectModify() on
> child foreign tables.  Lacking that, inherited updates on foreign
> tables are now slower, because they are not pushed down.  I'd like to
> figure something out to fix that situation.

In the updated patch, I have implemented a partial solution to this,
but I think it should be enough in most practically useful situations.
With the updated patch, PlanDirectModify is now called for child
result relations, but the FDWs will need to be revised to do useful
work in that call (as the patch does for postgres_fdw), because a
potentially pushable ForeignScan involving a given child result
relation will now be at the bottom of the source plan tree, whereas
before it would be the top-level plan.  Another disadvantage of this
new situation is that inherited update/delete involving joins that
were previously pushable cannot be pushed anymore.  If update/delete
would have been able to use partition-wise join, a child join
involving a given child result relation could in principle be pushed,
but some semi-related issues prevent the use of partition-wise joins
for update/delete, especially when there are foreign table partitions.

Another major change is that instead of a "tableoid" junk attribute to
identify the target result relation for a given tuple to be
updated/deleted, the patch now makes the tuples to be updated/deleted
contain a junk attribute that gives the index of the result relation
in the query's list of result relations which can be used to look up
the target result relation directly.  With "tableoid", we would need
to build a hash table to map the result relation OIDs to result
relation indexes, a step that could be seen to become a bottleneck
with large partition counts (I am talking about executing generic
plans here and have mentioned this problem on the thread to make
generic plan execution for update/delete faster [1]).
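
To illustrate the difference between the two lookup schemes, here is a minimal Python sketch (not the patch's C code). The names `ResultRel`, `build_oid_map`, `lookup_by_tableoid`, `lookup_by_index`, and the OIDs are made up for illustration; the only point is that an index-valued junk attribute lets the executor skip building and probing a hash table.

```python
from dataclasses import dataclass


@dataclass
class ResultRel:
    """Hypothetical stand-in for a per-result-relation executor struct."""
    oid: int
    name: str


# The query's list of result relations, in planner order (made-up OIDs).
result_rels = [
    ResultRel(16401, "parent"),
    ResultRel(16407, "child1"),
    ResultRel(16413, "child2"),
]


def build_oid_map(rels):
    """Earlier approach: one-time O(N) setup of an OID -> index hash table."""
    return {rel.oid: i for i, rel in enumerate(rels)}


def lookup_by_tableoid(rels, oid_map, tableoid):
    """Probe the hash table with the tuple's "tableoid" junk attribute."""
    return rels[oid_map[tableoid]]


def lookup_by_index(rels, result_rel_index):
    """Newer approach: the junk attribute already is the list index."""
    return rels[result_rel_index]


oid_map = build_oid_map(result_rels)
print(lookup_by_tableoid(result_rels, oid_map, 16413).name)
print(lookup_by_index(result_rels, 2).name)
```

Both calls resolve to the same relation; the second needs no per-execution setup, which is what matters when a generic plan with many partitions is executed repeatedly.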

Here are the commit messages of the attached patches:

[PATCH v3 1/3] Overhaul how updates compute a new tuple

Currently, the planner rewrites the top-level targetlist of an update
statement's parsetree so that it contains entries for all attributes
of the target relation, including for those columns that have not
been changed.  This arrangement means that the executor can take a
tuple that the plan produces, remove any junk attributes in it and
pass it down to the table AM or FDW update API as the new tuple.
It also means that in an inherited update, where there are multiple
target relations, the planner must produce that many plans, because
the targetlists for different target relations may not all look the
same considering that child relations may have different sets of
columns with varying attribute numbers.

This commit revises things so that the planner no longer expands
the parsetree targetlist to include unchanged columns so that the
plan only produces values of the changed columns.  To make the new
tuple to pass to table AM and FDW update API, executor now evaluates
another targetlist matching the target table's TupleDesc which refers
to the plan's output tuple to get values of the changed columns and
to the old tuple that is refetched for values of unchanged columns.

To get values for unchanged columns to use when forming the new tuple
to pass to ExecForeignUpdate(), we now require foreign scans to
always include the wholerow Var corresponding to the old tuple being
updated, because the unchanged columns are not present in the
plan's targetlist.

As a note to FDW authors, any FDW update planning APIs that look at
the plan's targetlist to check whether it can be pushed to the remote
side (e.g. PlanDirectModify) should now instead look at the "update
targetlist" that the planner sets in PlannerInfo.update_tlist, because
the plan's targetlist can no longer be indexed by the target columns'
attribute numbers.

Note that even though the main goal of doing this is to avoid having
to make multiple plans in the inherited update case, this commit does
not touch that subject.  A subsequent commit will change things that
are necessary to make inherited updates work with a single plan.
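
As a rough illustration of the new-tuple computation described in this commit message, the following Python sketch models tuples as lists. The function name `form_new_tuple` and the `update_colnos` argument (1-based attribute numbers of the SET columns, in the order the plan emits them) are assumptions made for the example, not names taken from the patch.

```python
def form_new_tuple(old_tuple, plan_tuple, update_colnos):
    """Combine the refetched old tuple with the plan's changed-column values.

    old_tuple     -- all column values of the row being updated
    plan_tuple    -- values of the changed columns only (junk already removed)
    update_colnos -- 1-based target attribute number for each plan_tuple entry
    """
    if len(plan_tuple) != len(update_colnos):
        raise ValueError("one attribute number is needed per changed value")
    new_tuple = list(old_tuple)          # unchanged columns come from the old row
    for value, attno in zip(plan_tuple, update_colnos):
        new_tuple[attno - 1] = value     # changed columns come from the plan
    return new_tuple


# UPDATE ... SET b = 42 on a row (a, b, c, d) where b is attribute 2.
old_row = [7, None, 0, "ddd"]
print(form_new_tuple(old_row, [42], [2]))
```

In this model the plan tuple carries no knowledge of the table's full column layout; only `update_colnos` ties it to a particular relation.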

[PATCH v3 2/3] Include result relation index if any in ForeignScan

FDWs that can perform an UPDATE/DELETE remotely using the "direct
modify" set of APIs need in some cases to access the result relation
properties for which they can currently look at
EState.es_result_relation_info.  However that means the executor must
ensure that es_result_relation_info points to the correct result
relation at all times, especially during inherited updates.  This
requirement gets in the way of a number of projects related to changing
how ModifyTable operates.  For example, an upcoming patch will change
things such that there will be one source plan for all result
relations whereas currently there is one per result relation, an
arrangement which makes it convenient to switch the result relation
when the source plan changes.

This commit installs a new field 'resultRelIndex' in ForeignScan node
which must be set by an FDW if the node will be used to carry out an
UPDATE/DELETE operation on a given foreign table, which is the case
if the FDW manages to push that operation to the remote side.  This
commit also modifies postgres_fdw to implement that.

[PATCH v3 3/3] Revise how inherited update/delete are handled

Now that we have the ability to maintain and evaluate the targetlist
needed to generate an update's new tuples independently of the plan
which fetches the tuples to be updated, there is no need to make
separate plans for child result relations as inheritance_planner()
currently does.  We generated separate plans before such capability
was present, because that was the only way to generate new tuples of
child relations where each may have its own unique set of columns
(albeit all sharing the set of columns present in the root parent).

With this commit, an inherited update/delete query will now be planned
just as a non-inherited one, generating a single plan that goes under
ModifyTable.  The plan for the inherited case is essentially the one
that we get for a select query, although the targetlist additionally
contains junk attributes needed by update/delete.

By going from one plan per result relation to only one shared across
all result relations, the executor now needs a new way to identify the
result relation to direct a given tuple's update/delete to, whereas
before, it could tell that from the plan it is executing.  To that
end, the planner now adds a new junk attribute to the query's
targetlist that for each tuple gives the index of the result relation
in the query's list of result relations.  That is in addition to the
junk attribute that the planner already adds to identify the tuple's
position in a given relation (such as "ctid").

Given the way query planning with inherited tables works, where child
relations are not part of the query's jointree and only the root
parent is, there are some challenges that arise in the update/delete
case:

* The junk attributes needed by child result relations need to be
represented as root parent Vars, which is a non-issue for a given
child if what the child needs and what is added for the root parent
are one and the same column.  But considering that that may not
always be the case, more parent Vars might get added to the top-level
targetlist as children are added to the query as result relations.
In some cases, a child relation may use a column that is not present
in the parent (allowed by traditional inheritance) or a non-column
expression, which must be represented using what this patch calls
"fake" parent vars.  These fake parent vars are really only
placeholders for the underlying child relation's column or expression
and don't reach the executor's expression evaluation machinery.

* FDWs that are able to push update/delete fully to the remote side
using the DirectModify set of APIs now have to jump through hoops to
identify the subplan and the UPDATE targetlist to push for child
result relations, because the subplans for individual result
relations are no longer top-level plans.  In fact, if the result
relation is joined to another relation, update/delete cannot be
pushed down at all anymore, whereas before since the child relations
would be present in the main jointree, they could be in the case
where the relation being joined to was present on the same server as
the child result relation.
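
The executor-side dispatch this commit message describes can be modelled roughly as below. This is a simplified Python sketch, not the patch's implementation: the `Relation` class, the `resultrelindex`, `ctid`, and `set` dictionary keys, and the by-name column resolution are all hypothetical, and foreign tables and "fake" parent Vars are ignored.

```python
class Relation:
    """Hypothetical child table: rows keyed by a ctid-like id."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = columns            # column names in this child's own order
        self.rows = {}                    # ctid -> list of column values

    def update(self, ctid, assignments):
        row = self.rows[ctid]
        for colname, value in assignments.items():
            # Resolve the column by name: attribute numbers differ per child.
            row[self.columns.index(colname)] = value


def run_modify_table(plan_output, result_rels):
    """Consume tuples from the single shared plan and route each update."""
    for tup in plan_output:
        rel = result_rels[tup["resultrelindex"]]   # junk: which relation
        rel.update(tup["ctid"], tup["set"])        # junk: which row


child1 = Relation("child1", ["a", "b"])
child2 = Relation("child2", ["b", "extra", "a"])   # same columns, other order
child1.rows[(0, 1)] = [1, None]
child2.rows[(0, 1)] = [None, "x", 2]

plan_output = [
    {"resultrelindex": 0, "ctid": (0, 1), "set": {"b": 10}},
    {"resultrelindex": 1, "ctid": (0, 1), "set": {"b": 20}},
]
run_modify_table(plan_output, [child1, child2])
print(child1.rows[(0, 1)], child2.rows[(0, 1)])
```

The same plan tuple shape serves both children even though column "b" sits at a different position in each, which is the property that makes one shared plan sufficient.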

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA%40mail.gmail.com

Attachment

Re: making update/delete of inheritance trees scale better

From

Michael Paquier

Date:

01 October 2020, 04:32:20

On Fri, Sep 11, 2020 at 07:20:56PM +0900, Amit Langote wrote:
> I have been working away at this and have updated the patches for many
> cosmetic and some functional improvements.

Please note that this patch set fails to apply.  Could you provide a
rebase please?
--
Michael

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

01 October 2020, 06:24:03

Hi,

On Thu, Oct 1, 2020 at 1:32 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 11, 2020 at 07:20:56PM +0900, Amit Langote wrote:
> > I have been working away at this and have updated the patches for many
> > cosmetic and some functional improvements.
>
> Please note that this patch set fails to apply.  Could you provide a
> rebase please?

Yeah, I'm working on posting an updated patch.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

04 October 2020, 02:44:03

On Fri, Sep 11, 2020 at 7:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Here are the commit messages of the attached patches:
>
> [PATCH v3 1/3] Overhaul how updates compute a new tuple
>
> Currently, the planner rewrites the top-level targetlist of an update
> statement's parsetree so that it contains entries for all attributes
> of the target relation, including for those columns that have not
> been changed.  This arrangement means that the executor can take a
> tuple that the plan produces, remove any junk attributes in it and
> pass it down to the table AM or FDW update API as the new tuple.
> It also means that in an inherited update, where there are multiple
> target relations, the planner must produce that many plans, because
> the targetlists for different target relations may not all look the
> same considering that child relations may have different sets of
> columns with varying attribute numbers.
>
> This commit revises things so that the planner no longer expands
> the parsetree targetlist to include unchanged columns so that the
> plan only produces values of the changed columns.  To make the new
> tuple to pass to table AM and FDW update API, executor now evaluates
> another targetlist matching the target table's TupleDesc which refers
> to the plan's output tuple to get values of the changed columns and
> to the old tuple that is refetched for values of unchanged columns.
>
> To get values for unchanged columns to use when forming the new tuple
> to pass to ExecForeignUpdate(), we now require foreign scans to
> always include the wholerow Var corresponding to the old tuple being
> updated, because the unchanged columns are not present in the
> plan's targetlist.
>
> As a note to FDW authors, any FDW update planning APIs that look at
> the plan's targetlist to check whether it can be pushed to the remote
> side (e.g. PlanDirectModify) should now instead look at the "update
> targetlist" that the planner sets in PlannerInfo.update_tlist, because
> the plan's targetlist can no longer be indexed by the target columns'
> attribute numbers.
>
> Note that even though the main goal of doing this is to avoid having
> to make multiple plans in the inherited update case, this commit does
> not touch that subject.  A subsequent commit will change things that
> are necessary to make inherited updates work with a single plan.

I tried to assess the performance impact of this rejiggering of how
updates are performed.  As to why one may think there may be a
negative impact, consider that ExecModifyTable() now has to perform an
extra fetch of the tuple being updated for filling in the unchanged
values of the update's NEW tuple, because the plan itself will only
produce the values of changed columns.

* Setup: a 10-column target table with a million rows

create table test_update_10 (
        a       int,
        b       int             default NULL,
        c       int             default 0,
        d       text    default 'ddd',
        e       text    default 'eee',
        f       text    default 'fff',
        g       text    default 'ggg',
        h       text    default 'hhh',
        i       text    default 'iii',
        j       text    default 'jjj'
);
insert into test_update_10 (a) select generate_series(1, 1000000);

* pgbench test script (test_update_10.sql):

\set a random(1, 1000000)
update test_update_10 set b = :a where a = :a;

* TPS of `pgbench -n -T 120 -f test_update_10.sql`

HEAD:

tps = 10964.391120 (excluding connections establishing)
tps = 12142.456638 (excluding connections establishing)
tps = 11746.345270 (excluding connections establishing)
tps = 11959.602001 (excluding connections establishing)
tps = 12267.249378 (excluding connections establishing)

median: 11959.60

Patched:

tps = 11565.916170 (excluding connections establishing)
tps = 11952.491663 (excluding connections establishing)
tps = 11959.789308 (excluding connections establishing)
tps = 11699.611281 (excluding connections establishing)
tps = 11799.220930 (excluding connections establishing)

median: 11799.22

There is a slight impact, but the difference seems to be within the margin of error.
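
One quick way to sanity-check that reading is to compare the shift in medians with the run-to-run spread of a single configuration. The Python sketch below does that using the ten TPS figures listed above; the max-minus-min spread is only a crude noise estimate chosen for this example, not a proper significance test.

```python
from statistics import median

# TPS of the simple 10-column update, copied from the runs listed above.
head = [10964.391120, 12142.456638, 11746.345270, 11959.602001, 12267.249378]
patched = [11565.916170, 11952.491663, 11959.789308, 11699.611281, 11799.220930]

head_median = median(head)
patched_median = median(patched)

# Relative change of the patched median with respect to HEAD.
change = (patched_median - head_median) / head_median

# Run-to-run spread of HEAD alone, relative to its median.
head_spread = (max(head) - min(head)) / head_median

print(f"median change: {change:+.2%}")
print(f"HEAD spread:   {head_spread:.2%}")
```

The median moves by roughly 1.3% while HEAD's own runs span about 11% of its median, so the difference is much smaller than the variation between runs.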

On the more optimistic side, I imagined that the trimming down of the
plan's targetlist to include only changed columns would boost
performance, especially with tables containing more columns, which is
not uncommon.  With 20 columns (additional columns are all filler ones
as shown in the 10-column example), the same benchmark gives the
following numbers:

HEAD:

tps = 11401.691219 (excluding connections establishing)
tps = 11620.855088 (excluding connections establishing)
tps = 11285.469430 (excluding connections establishing)
tps = 10991.890904 (excluding connections establishing)
tps = 10847.433093 (excluding connections establishing)

median: 11285.46

Patched:

tps = 10958.443325 (excluding connections establishing)
tps = 11613.783817 (excluding connections establishing)
tps = 10940.129336 (excluding connections establishing)
tps = 10717.405272 (excluding connections establishing)
tps = 11691.330537 (excluding connections establishing)

median: 10958.44

Hmm, not so much.

With 40 columns:

HEAD:

tps = 9778.362149 (excluding connections establishing)
tps = 10004.792176 (excluding connections establishing)
tps = 9473.849373 (excluding connections establishing)
tps = 9776.931393 (excluding connections establishing)
tps = 9737.891870 (excluding connections establishing)

median: 9776.93

Patched:

tps = 10709.949043 (excluding connections establishing)
tps = 10754.160718 (excluding connections establishing)
tps = 10175.841480 (excluding connections establishing)
tps = 9973.729774 (excluding connections establishing)
tps = 10467.109679 (excluding connections establishing)

median: 10467.10

There you go.

Perhaps the plan's bigger target list with HEAD does not cause
significant overhead in a *simple* update like the one above, because
most of the work during execution goes into fetching the tuple to
update and actually updating it.  So, I also checked with a slightly
more complicated query containing a join:

\set a random(1, 1000000)
update test_update_10 t set b = foo.b from foo where t.a = foo.a and foo.b = :a;

where `foo` is defined as:

create table foo (a int, b int);
insert into foo select generate_series(1, 1000000);
create index on foo (b);

Looking at the EXPLAIN output of the query, one can see that the
target list is smaller after patching which can save some work:

HEAD:

explain (costs off, verbose) update test_update_10 t set b = foo.b
from foo where t.a = foo.a and foo.b = 1;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Update on public.test_update_10 t
   ->  Nested Loop
         Output: t.a, foo.b, t.c, t.d, t.e, t.f, t.g, t.h, t.i, t.j, t.ctid, foo.ctid
         ->  Index Scan using foo_b_idx on public.foo
               Output: foo.b, foo.ctid, foo.a
               Index Cond: (foo.b = 1)
         ->  Index Scan using test_update_10_a_idx on public.test_update_10 t
               Output: t.a, t.c, t.d, t.e, t.f, t.g, t.h, t.i, t.j, t.ctid
               Index Cond: (t.a = foo.a)
(9 rows)

Patched:

explain (costs off, verbose) update test_update_10 t set b = foo.b
from foo where t.a = foo.a and foo.b = 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.test_update_10 t
   ->  Nested Loop
         Output: foo.b, t.ctid, foo.ctid
         ->  Index Scan using foo_b_idx on public.foo
               Output: foo.b, foo.ctid, foo.a
               Index Cond: (foo.b = 1)
         ->  Index Scan using test_update_10_a_idx on public.test_update_10 t
               Output: t.ctid, t.a
               Index Cond: (t.a = foo.a)
(9 rows)

And here are the TPS numbers for that query for the 10-, 20-, and
40-column table cases.  Note that the more columns the target table
has, the bigger the target list to compute is with HEAD.

10 columns:

HEAD:

tps = 7594.881268 (excluding connections establishing)
tps = 7660.451217 (excluding connections establishing)
tps = 7598.899951 (excluding connections establishing)
tps = 7413.397046 (excluding connections establishing)
tps = 7484.978635 (excluding connections establishing)

median: 7594.88

Patched:

tps = 7402.409104 (excluding connections establishing)
tps = 7532.776214 (excluding connections establishing)
tps = 7549.397016 (excluding connections establishing)
tps = 7512.321466 (excluding connections establishing)
tps = 7448.255418 (excluding connections establishing)

median: 7512.32

20 columns:

HEAD:

tps = 6842.674366 (excluding connections establishing)
tps = 7151.724481 (excluding connections establishing)
tps = 7093.727976 (excluding connections establishing)
tps = 7072.273547 (excluding connections establishing)
tps = 7040.350004 (excluding connections establishing)

median: 7072.27

Patched:

tps = 7362.941398 (excluding connections establishing)
tps = 7106.826433 (excluding connections establishing)
tps = 7353.507317 (excluding connections establishing)
tps = 7361.944770 (excluding connections establishing)
tps = 7072.027684 (excluding connections establishing)

median: 7353.50

40 columns:

HEAD:

tps = 6396.845818 (excluding connections establishing)
tps = 6383.105593 (excluding connections establishing)
tps = 6370.143763 (excluding connections establishing)
tps = 6370.455213 (excluding connections establishing)
tps = 6380.993666 (excluding connections establishing)

median: 6380.99

Patched:

tps = 7091.581813 (excluding connections establishing)
tps = 7036.805326 (excluding connections establishing)
tps = 7019.120007 (excluding connections establishing)
tps = 7025.704379 (excluding connections establishing)
tps = 6848.846667 (excluding connections establishing)

median: 7025.70

It seems clear that the saving on the target list computation overhead
that we get from the patch is hard to ignore in this case.
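
To make the trend across table widths easier to see, the relative change in median TPS can be computed per column count. This is a small Python sketch over the join-query runs listed above; the `runs` layout and the `relative_change` helper are just conveniences for the example.

```python
from statistics import median

# TPS runs for the join update, copied from the results above: (HEAD, patched).
runs = {
    10: ([7594.881268, 7660.451217, 7598.899951, 7413.397046, 7484.978635],
         [7402.409104, 7532.776214, 7549.397016, 7512.321466, 7448.255418]),
    20: ([6842.674366, 7151.724481, 7093.727976, 7072.273547, 7040.350004],
         [7362.941398, 7106.826433, 7353.507317, 7361.944770, 7072.027684]),
    40: ([6396.845818, 6383.105593, 6370.143763, 6370.455213, 6380.993666],
         [7091.581813, 7036.805326, 7019.120007, 7025.704379, 6848.846667]),
}


def relative_change(head, patched):
    """Change of the patched median relative to the HEAD median."""
    head_median = median(head)
    return (median(patched) - head_median) / head_median


changes = {ncols: relative_change(h, p) for ncols, (h, p) in runs.items()}
for ncols, change in changes.items():
    print(f"{ncols} columns: {change:+.1%}")
```

The change goes from slightly negative at 10 columns to roughly +10% at 40 columns, growing with the width of the target table.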

I've attached updated patches, because as Michael pointed out, the
previous version no longer applies.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

29 October 2020, 13:03:33

On Sun, Oct 4, 2020 at 11:44 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Sep 11, 2020 at 7:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Here are the commit messages of the attached patches:
> >
> > [PATCH v3 1/3] Overhaul how updates compute a new tuple
>
> I tried to assess the performance impact of this rejiggering of how
> updates are performed.  As to why one may think there may be a
> negative impact, consider that ExecModifyTable() now has to perform an
> extra fetch of the tuple being updated for filling in the unchanged
> values of the update's NEW tuple, because the plan itself will only
> produce the values of changed columns.
>
...
> It seems clear that the saving on the target list computation overhead
> that we get from the patch is hard to ignore in this case.
>
> I've attached updated patches, because as Michael pointed out, the
> previous version no longer applies.

Rebased over the recent executor result relation related commits.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

On 29/10/2020 15:03, Amit Langote wrote:
> Rebased over the recent executor result relation related commits.

ModifyTablePath didn't get the memo that a ModifyTable can only have one 
subpath after these patches. Attached patch, on top of your v5 patches, 
cleans that up.

- Heikki

Attachment

cleanup-modifytablepath.patch

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

13 November 2020, 09:52:34

On Wed, Nov 11, 2020 at 9:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 29/10/2020 15:03, Amit Langote wrote:
> > Rebased over the recent executor result relation related commits.
>
> ModifyTablePath didn't get the memo that a ModifyTable can only have one
> subpath after these patches. Attached patch, on top of your v5 patches,
> cleans that up.

Ah, thought I'd taken care of that, thanks.  Attached v6.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

26 January 2021, 11:54:09

On Fri, Nov 13, 2020 at 6:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Nov 11, 2020 at 9:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 29/10/2020 15:03, Amit Langote wrote:
> > > Rebased over the recent executor result relation related commits.
> >
> > ModifyTablePath didn't get the memo that a ModifyTable can only have one
> > subpath after these patches. Attached patch, on top of your v5 patches,
> > cleans that up.
>
> Ah, thought I'd taken care of that, thanks.  Attached v6.

This got slightly broken due to the recent batch insert related
changes, so here is the rebased version.  I also made a few cosmetic
changes.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

26 January 2021, 19:41:58

On Fri, Oct 30, 2020 at 6:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah, you need to access the old tuple to update its t_ctid, but
> accessing it twice is still more expensive than accessing it once. Maybe
> you could optimize it somewhat by keeping the buffer pinned or
> something. Or push the responsibility down to the table AM, passing the
> AM only the modified columns, and let the AM figure out how to deal with
> the columns that were not modified, hoping that it can do something smart.

Just as a point of possible interest, back when I was working on
zheap, I sort of wanted to take this in the opposite direction. In
effect, a zheap tuple has system columns that don't exist for a heap
tuple, and you can't do an update or delete without knowing what the
values for those columns are, so zheap had to just refetch the tuple,
but that sucked in comparisons with the existing heap, which didn't
have to do the refetch. At the time, I thought maybe the right idea
would be to extend things so that a table AM could specify an
arbitrary set of system columns that needed to be bubbled up to the
point where the update or delete happens, but that seemed really
complicated to implement and I never tried. Here it seems like we're
thinking of going the other way, and just always doing the refetch.
That is of course fine for zheap comparative benchmarks: instead of
making zheap faster, we just make the heap slower!

Well, sort of. I didn't think about the benefits of the refetch
approach when the tuples are wide. That does cast a somewhat different
light on things. I suppose we could have both methods and choose the
one that seems likely to be faster in particular cases, but that seems
like way too much machinery. Maybe there's some way to further
optimize accessing the same tuple multiple times in rapid succession
to claw back some of the lost performance in the slow cases, but I
don't have a specific idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

04 February 2021, 06:22:29

On Tue, Jan 26, 2021 at 8:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Nov 13, 2020 at 6:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Nov 11, 2020 at 9:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > On 29/10/2020 15:03, Amit Langote wrote:
> > > > Rebased over the recent executor result relation related commits.
> > >
> > > ModifyTablePath didn't get the memo that a ModifyTable can only have one
> > > subpath after these patches. Attached patch, on top of your v5 patches,
> > > cleans that up.
> >
> > Ah, thought I'd taken care of that, thanks.  Attached v6.
>
> This got slightly broken due to the recent batch insert related
> changes, so here is the rebased version.  I also made a few cosmetic
> changes.

Broken again, so rebased.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

On Thu, Feb 4, 2021 at 3:22 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jan 26, 2021 at 8:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Fri, Nov 13, 2020 at 6:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Wed, Nov 11, 2020 at 9:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > > On 29/10/2020 15:03, Amit Langote wrote:
> > > > > Rebased over the recent executor result relation related commits.
> > > >
> > > > ModifyTablePath didn't get the memo that a ModifyTable can only have one
> > > > subpath after these patches. Attached patch, on top of your v5 patches,
> > > > cleans that up.
> > >
> > > Ah, thought I'd taken care of that, thanks.  Attached v6.
> >
> > This got slightly broken due to the recent batch insert related
> > changes, so here is the rebased version.  I also made a few cosmetic
> > changes.
>
> Broken again, so rebased.

Rebased.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

On Wed, Mar 24, 2021 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 3, 2021 at 9:39 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Just noticed that a test added by the recent 927f453a941 fails due to
> > 0002.  We no longer allow moving a row into a postgres_fdw partition
> > if it is among the UPDATE's result relations, whereas previously we
> > would if the UPDATE on that partition is already finished.
> >
> > To fix, I've adjusted the test case.  Attached updated version.
>
> I spent some time studying this patch this morning.

Thanks a lot for your time on this.

> As far as I can
> see, 0001 is a relatively faithful implementation of the design Tom
> proposed back in early 2019. I think it would be nice to either get
> this committed or else decide that we don't want it and what we're
> going to try to do instead, because we can't make UPDATEs and DELETEs
> stop sucking with partitioned tables until we settle on some solution
> to the problem that is inheritance_planner(), and that strikes me as
> an *extremely* important problem. Lots of people are trying to use
> partitioning in PostgreSQL, and they don't like finding out that, in
> many cases, it makes things slower rather than faster. Neither this
> nor any other patch is going to solve that problem in general, because
> there are *lots* of things that haven't been well-optimized for
> partitioning yet. But, this is a pretty important case that we should
> really try to do something about.
>
> So, that said, here are some random comments:
>
> - I think it would be interesting to repeat your earlier benchmarks
> using -Mprepared. One question I have is whether we're saving overhead
> here during planning at the price of execution-time overhead, or
> whether we're saving during both planning and execution.

Please see at the bottom of this reply.

> - Until I went back and found your earlier remarks on this thread, I
> was confused as to why you were replacing a JunkFilter with a
> ProjectionInfo. I think it would be good to try to add some more
> explicit comments about that design choice, perhaps as header comments
> for ExecGetUpdateNewTuple, or maybe there's a better place.

I think the comments around ri_projectNew that holds the
ProjectionInfo node explain this to some degree, especially the
comment in ExecInitModifyTable() that sets it.  I don't particularly
see a need to go into detail why JunkFilter is not suitable for the
task if we're no longer using it at all in nodeModifyTable.c.

> I'm still
> not sure why we need to do the same thing for the insert case, which
> seems to just be about removing junk columns.

I think I was hesitant to have both a ri_junkFilter and ri_projectNew
catering for inserts and update/delete respectively.

> At least in the non-JIT
> case, it seems to me that ExecJunkFilter() should be cheaper than
> ExecProject(). Is it different enough to matter? Does the fact that we
> can JIT the ExecProject() work make it actually faster? These are
> things I don't know.

ExecJunkFilter() indeed looks cheaper at first glance for simple junk
filtering, but as Tom also found out, there's actually no test case
involving INSERT to do the actual performance comparison with.
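
For readers who want a picture of what is being compared, here is a toy Python model of the two mechanisms. It is only a conceptual sketch: `junk_filter` and `project` are hypothetical helpers, not the executor's ExecJunkFilter() or ExecProject(), and they say nothing about actual costs.

```python
def junk_filter(tup, clean_map):
    """Copy the non-junk columns; clean_map lists their 0-based source positions."""
    return [tup[src] for src in clean_map]


def project(tup, exprs):
    """Evaluate one expression per output column against the input tuple."""
    return [expr(tup) for expr in exprs]


plan_tuple = [1, "ddd", "(0,1)"]               # two real columns, then junk ctid

filtered = junk_filter(plan_tuple, clean_map=[0, 1])
projected = project(plan_tuple, exprs=[lambda t: t[0], lambda t: t[1]])
print(filtered == projected)
```

For plain junk removal both produce the same tuple; the filter only copies values by position, whereas the projection goes through per-column expression evaluation, which is where any cost difference would come from.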

> - There's a comment which you didn't write but just moved which I find
> to be quite confusing. It says "For UPDATE/DELETE, find the
> appropriate junk attr now. Typically, this will be a 'ctid' or
> 'wholerow' attribute, but in the case of a foreign data wrapper it
> might be a set of junk attributes sufficient to identify the remote
> row." But, the block of code which follows caters only to the 'ctid'
> and 'wholerow' cases, not anything else. Perhaps that's explained by
> the comment a bit further down which says "When there is a row-level
> trigger, there should be a wholerow attribute," but if the point is
> that this code is only reached when there's a row-level trigger,
> that's far from obvious. It *looks* like something we'd reach for ANY
> insert or UPDATE case. Maybe it's not your job to do anything about
> this confusion, but I thought it was worth highlighting.

I do remember being confused by that note regarding the junk
attributes required by FDWs for their result relations when I first
saw it, but eventually found out that it's talking about the
information about junk attributes that FDWs track in their *private*
data structures.  For example, postgres_fdw uses
PgFdwModifyState.ctidAttno to record the index of the "ctid" TLE in
the source plan's targetlist.  It is used, for example, by
postgresExecForeignUpdate() to extract the ctid from the plan tuple
passed to it and pass the value as parameter for the remote query:
update remote_tab set ... where ctid = $1.

I've reworded the comment to make that a bit clearer.

> - The comment for filter_junk_tlist_entries(), needless to say, is of
> the highest quality,

Sorry, it was a copy-paste job.

> but would it make any sense to think about having
> an option for expand_tlist() to omit the junk entries itself, to avoid
> extra work? I'm unclear whether we want that behavior in all UPDATE
> cases or only some of them, because preprocess_targetlist() has a call
> to expand_tlist() to set parse->onConflict->onConflictSet that does
> not call filter_junk_tlist_entries() on the result.

I added an exclude_junk parameter to expand_targetlist() and passed
false for it in all sites except make_update_tlist(), including where
it's called on parse->onConflict->onConflictSet.

Although, make_update_tlist() and related code may have been
superseded by Tom's WIP patch.

> Does this patch
> need to make any changes to the handling of ON CONFLICT .. UPDATE? It
> looks to me like right now it doesn't, but I don't know whether that's
> an oversight or intentional.

I intentionally didn't bother with changing any part of the ON
CONFLICT UPDATE case, mainly because INSERTs don't have an
inheritance_planner() problem.  We may want to revisit that in the
future if we decide to revise the ExecUpdate() API to not pass the
fully-reconstructed new tuple, which this patch doesn't do.

> - The output changes in the main regression test suite are pretty easy
> to understand: we're just seeing columns that no longer need to get
> fed through the execution get dropped. The changes in the postgres_fdw
> results are harder to understand. In general, it appears that what's
> happening is that we're no longer outputting the non-updated columns
> individually -- which makes sense -- but now we're outputting a
> whole-row var that wasn't there before, e.g.:
>
> -         Output: foreign_tbl.a, (foreign_tbl.b + 15), foreign_tbl.ctid
> +         Output: (foreign_tbl.b + 15), foreign_tbl.ctid, foreign_tbl.*
>
> Since this is postgres_fdw, we can re-fetch the row using CTID, so
> it's not clear to me why we need foreign_tbl.* when we didn't before.
> Maybe the comments need improvement.

The ExecForeignUpdate FDW API expects to be passed a fully-formed new
tuple, even though it will typically only access the changed columns
from that tuple to pass in the remote update query.  There is a
comment in rewriteTargetListUD() to explain this, which I have updated
somewhat to read as follows:

        /*
         * ExecUpdate() needs to pass a full new tuple to be assigned to the
         * result relation to ExecForeignUpdate(), although the plan will have
         * produced values for only the changed columns.  Here we ask the FDW
         * to fetch wholerow to serve as the side channel for getting the
         * values of the unchanged columns when constructing the full tuple to
         * be passed to ExecForeignUpdate().  Actually, we only really need
         * this for UPDATEs that are not pushed to the remote side, but whether
         * or not the pushdown will occur is not clear when this function is
         * called, so we ask for wholerow anyway.
         *
         * We will also need the "old" tuple if there are any row triggers.
         */
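
The idea in the comment above, that the plan yields only the changed
columns while the unchanged ones come from the "old" tuple, can be
sketched in standalone C roughly as follows.  This is only an
illustration under simplifying assumptions: SimpleTuple, Assignment and
build_new_tuple() are hypothetical names, columns are plain integers,
and nothing here is the actual executor code.

```c
#include <assert.h>

#define MAX_ATTRS 8

/* Hypothetical, simplified tuple: a fixed array of integer columns. */
typedef struct SimpleTuple
{
    int natts;
    int values[MAX_ATTRS];
} SimpleTuple;

/* One SET-clause assignment: 1-based target column and its new value. */
typedef struct Assignment
{
    int attno;
    int value;
} Assignment;

/*
 * Build the full new tuple: start from the old tuple (the "side channel"
 * for unchanged columns) and overwrite only the assigned columns.
 */
static SimpleTuple
build_new_tuple(const SimpleTuple *old_tuple,
                const Assignment *assigned, int nassigned)
{
    SimpleTuple new_tuple = *old_tuple;

    for (int i = 0; i < nassigned; i++)
    {
        assert(assigned[i].attno >= 1 &&
               assigned[i].attno <= old_tuple->natts);
        new_tuple.values[assigned[i].attno - 1] = assigned[i].value;
    }
    return new_tuple;
}
```

For a three-column row and "set b = 35", only the second value changes;
the first and third are copied from the old tuple.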

> - Specifically, I think the comments in preptlist.c need some work.
> You've edited the top-of-file comment, but I don't think it's really
> accurate. The first sentence says "For INSERT and UPDATE, the
> targetlist must contain an entry for each attribute of the target
> relation in the correct order," but I don't think that's really true
> any more. It's certainly not what you see in the EXPLAIN output. The
> paragraph goes on to explain that UPDATE has a second target list, but
> (a) that seems to contradict the first sentence and (b) that second
> target list isn't what you see when you run EXPLAIN. Also, there's no
> mention of what happens for FDWs here, but it's evidently different,
> as per the previous review comment.

It seems Tom has other things in mind for what I've implemented as
update_tlist, so I will leave this alone.

> - The comments atop fix_join_expr() should be updated. Maybe you can
> just adjust the wording for case #2.

Apparently the changes in setrefs.c are being thrown out as well in
Tom's patch, so likewise I will leave this alone.


Attached is an updated version of the patch.  I have forgotten to mention
in my recent posts on this thread one thing about 0001 that I had
mentioned upthread back in June: it currently fails a test in
postgres_fdw's suite due to a bug in cross-partition updates that I
decided at the time to pursue in another thread:
https://www.postgresql.org/message-id/CA%2BHiwqE_UK1jTSNrjb8mpTdivzd3dum6mK--xqKq0Y9VmfwWQA%40mail.gmail.com

That bug is revealed due to some changes that 0001 makes.  However, it
does not matter after applying 0002, because the current way of having
one plan per result relation is a precondition for that bug to
manifest.  So, if we are to apply only 0001 first, then I'm afraid we
would have to take care of that bug before applying 0001.

Finally, here are the detailed results of the benchmarks I redid to
check the performance implications of doing UPDATEs the new way,
comparing master and 0001.

Repeated 2 custom pgbench tests against the UPDATE target tables
containing 10, 20, 40, and 80 columns.  The 2 custom tests are as
follows:

nojoin:

\set a random(1, 1000000)
update test_table t set b = :a where a = :a;

join:

\set a random(1, 1000000)
update test_table t set b = foo.b from foo where t.a = foo.a and foo.b = :a;

foo has just 2 integer columns a, b, with an index on b.

Checked using both -Msimple and -Mprepared this time, whereas I had
only checked the former the last time.

I'd summarize the results I see as follows:

In -Msimple mode, patched wins by a tiny margin for both nojoin and
join cases at 10, 20 columns, and by slightly larger margin at 40, 80
columns with the join case showing bigger margin than nojoin.

In -Mprepared mode, where the numbers are a bit noisy, I can only tell
clearly that the patched wins by a very wide margin for the join case
at 40, 80 columns, without a clear winner in other cases.

To answer Robert's questions in this regard:

> One question I have is whether we're saving overhead
> here during planning at the price of execution-time overhead, or
> whether we're saving during both planning and execution.

Smaller targetlists due to the patch at least help the patched end up
on the better side of tps comparison.  Maybe this aspect helps reduce
both the planning and execution time.  As for whether the results
reflect negatively on the fact that we now fetch the tuple one more
time to construct the new tuple, I don't quite see that to be the
case.

Raw tps figures (each case repeated 3 times) follow.  I'm also
attaching (a hopefully self-contained) shell script file
(test_update.sh) that you can run to reproduce the numbers for the
various cases.

10 columns

nojoin simple master
tps = 12278.749205 (without initial connection time)
tps = 11537.051718 (without initial connection time)
tps = 12312.717990 (without initial connection time)
nojoin simple patched
tps = 12160.125784 (without initial connection time)
tps = 12170.271905 (without initial connection time)
tps = 12212.037774 (without initial connection time)

nojoin prepared master
tps = 12228.149183 (without initial connection time)
tps = 12509.135100 (without initial connection time)
tps = 11698.161145 (without initial connection time)
nojoin prepared patched
tps = 13033.005860 (without initial connection time)
tps = 14690.203013 (without initial connection time)
tps = 15083.096511 (without initial connection time)

join simple master
tps = 9112.059568 (without initial connection time)
tps = 10730.739559 (without initial connection time)
tps = 10663.677821 (without initial connection time)
join simple patched
tps = 10980.139631 (without initial connection time)
tps = 10887.743691 (without initial connection time)
tps = 10929.663379 (without initial connection time)

join prepared master
tps = 21333.421825 (without initial connection time)
tps = 23895.538826 (without initial connection time)
tps = 24761.384786 (without initial connection time)
join prepared patched
tps = 25665.062858 (without initial connection time)
tps = 25037.391119 (without initial connection time)
tps = 25421.839842 (without initial connection time)

20 columns

nojoin simple master
tps = 11215.161620 (without initial connection time)
tps = 11306.536537 (without initial connection time)
tps = 11310.776393 (without initial connection time)
nojoin simple patched
tps = 11791.107767 (without initial connection time)
tps = 11757.933141 (without initial connection time)
tps = 11743.983647 (without initial connection time)

nojoin prepared master
tps = 17144.510719 (without initial connection time)
tps = 14032.133587 (without initial connection time)
tps = 15678.801224 (without initial connection time)
nojoin prepared patched
tps = 16603.131255 (without initial connection time)
tps = 14703.564675 (without initial connection time)
tps = 13652.827905 (without initial connection time)

join simple master
tps = 9637.904229 (without initial connection time)
tps = 9869.163480 (without initial connection time)
tps = 9865.673335 (without initial connection time)
join simple patched
tps = 10779.705826 (without initial connection time)
tps = 10790.961520 (without initial connection time)
tps = 10917.759963 (without initial connection time)

join prepared master
tps = 23030.120609 (without initial connection time)
tps = 22347.620338 (without initial connection time)
tps = 24227.376933 (without initial connection time)
join prepared patched
tps = 22303.689184 (without initial connection time)
tps = 24507.395745 (without initial connection time)
tps = 25219.535413 (without initial connection time)

40 columns

nojoin simple master
tps = 10348.352638 (without initial connection time)
tps = 9978.449528 (without initial connection time)
tps = 10024.132430 (without initial connection time)
nojoin simple patched
tps = 10169.485989 (without initial connection time)
tps = 10239.297780 (without initial connection time)
tps = 10643.076675 (without initial connection time)

nojoin prepared master
tps = 13606.361325 (without initial connection time)
tps = 15815.149553 (without initial connection time)
tps = 15940.675165 (without initial connection time)
nojoin prepared patched
tps = 13889.450942 (without initial connection time)
tps = 13406.879350 (without initial connection time)
tps = 15640.326344 (without initial connection time)

join simple master
tps = 9235.503480 (without initial connection time)
tps = 9244.756832 (without initial connection time)
tps = 8785.542317 (without initial connection time)
join simple patched
tps = 10106.285796 (without initial connection time)
tps = 10375.248536 (without initial connection time)
tps = 10357.087162 (without initial connection time)

join prepared master
tps = 18795.665779 (without initial connection time)
tps = 17650.815736 (without initial connection time)
tps = 20903.206602 (without initial connection time)
join prepared patched
tps = 24706.505207 (without initial connection time)
tps = 22867.751793 (without initial connection time)
tps = 23589.244380 (without initial connection time)

80 columns

nojoin simple master
tps = 8281.679334 (without initial connection time)
tps = 7517.657106 (without initial connection time)
tps = 8509.366647 (without initial connection time)
nojoin simple patched
tps = 9200.437258 (without initial connection time)
tps = 9349.939671 (without initial connection time)
tps = 9128.197101 (without initial connection time)

nojoin prepared master
tps = 12975.410783 (without initial connection time)
tps = 13486.858443 (without initial connection time)
tps = 10994.355244 (without initial connection time)
nojoin prepared patched
tps = 14266.725696 (without initial connection time)
tps = 15250.258418 (without initial connection time)
tps = 13356.236075 (without initial connection time)

join simple master
tps = 7678.440018 (without initial connection time)
tps = 7699.796166 (without initial connection time)
tps = 7880.407359 (without initial connection time)
join simple patched
tps = 9552.413096 (without initial connection time)
tps = 9469.579290 (without initial connection time)
tps = 9584.026033 (without initial connection time)

join prepared master
tps = 18390.262404 (without initial connection time)
tps = 18754.121500 (without initial connection time)
tps = 20355.875827 (without initial connection time)
join prepared patched
tps = 24041.648927 (without initial connection time)
tps = 22510.192030 (without initial connection time)
tps = 21825.870402 (without initial connection time)
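
One simple way to compare the repeated runs above is to average the
three tps figures for a case and take the patched/master ratio.  The
helper names below (mean_tps, speedup) are made up for illustration;
this is only a sketch of the arithmetic, not part of test_update.sh.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of the tps figures from repeated runs of one case. */
static double
mean_tps(const double *runs, size_t nruns)
{
    double sum = 0.0;

    assert(nruns > 0);
    for (size_t i = 0; i < nruns; i++)
        sum += runs[i];
    return sum / (double) nruns;
}

/* Ratio of mean patched tps to mean master tps; above 1.0 favors patched. */
static double
speedup(const double *master, const double *patched, size_t nruns)
{
    return mean_tps(patched, nruns) / mean_tps(master, nruns);
}
```

With only three runs per case and visibly noisy -Mprepared numbers, such
a ratio is a rough indicator at best.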

--
Amit Langote
EDB: http://www.enterprisedb.com

On Tue, Mar 30, 2021 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Here's a v13 patchset that I feel pretty good about.

Thanks.  After staring at this for a day now, I do too.

> My original thought for replacing the "fake variable" design was to
> add another RTE holding the extra variables, and then have setrefs.c
> translate the placeholder variables to the real thing at the last
> possible moment.  I soon realized that instead of an actual RTE,
> it'd be better to invent a special varno value akin to INDEX_VAR
> (I called it ROWID_VAR, though I'm not wedded to that name).  Info
> about the associated variables is kept in a list of RowIdentityVarInfo
> structs, which are more suitable than a regular RTE would be.
>
> I got that and the translate-in-setrefs approach more or less working,
> but it was fairly messy, because the need to know about these special
> variables spilled into FDWs and a lot of other places; for example
> indxpath.c needed a special check for them when deciding if an
> index-only scan is possible.  What turns out to be a lot cleaner is
> to handle the translation in adjust_appendrel_attrs_mutator(), so that
> we have converted to real variables by the time we reach any
> relation-scan-level logic.
>
> I did end up having to break the API for FDW AddForeignUpdateTargets
> functions: they need to do things differently when adding junk columns,
> and they need different parameters.  This seems all to the good though,
> because the old API has been a backwards-compatibility hack for some
> time (e.g., in not passing the "root" pointer).

This all looks really neat.

I couldn't help but think that the RowIdentityVarInfo management code
looks a bit like SpecialJunkVarInfo stuff in my earliest patches, but
of course without all the fragility of assigning "fake" attribute
numbers to a "real" base relation(s).

> Some other random notes:
>
> * I was unimpressed with the idea of distinguishing different target
> relations by embedding integer constants in the plan.  In the first
> place, the implementation was extremely fragile --- there was
> absolutely NOTHING tying the counter you used to the subplans' eventual
> indexes in the ModifyTable lists.  Plus I don't have a lot of faith
> that setrefs.c will reliably do what you want in terms of bubbling the
> things up.  Maybe that could be made more robust, but the other problem
> is that the EXPLAIN output is just about unreadable; nobody will
> understand what "(0)" means.  So I went back to the idea of emitting
> tableoid, and installed a hashtable plus a one-entry lookup cache
> to make the run-time mapping as fast as I could.  I'm not necessarily
> saying that this is how it has to be indefinitely, but I think we
> need more work on planner and EXPLAIN infrastructure before we can
> get the idea of directly providing a list index to work nicely.

Okay.

> * I didn't agree with your decision to remove the now-failing test
> cases from postgres_fdw.sql.  I think it's better to leave them there,
> especially in the cases where we were checking the plan as well as
> the execution.  Hopefully we'll be able to un-break those soon.

Okay.

> * I updated a lot of hereby-obsoleted comments, which makes the patch
> a bit longer than v12; but actually the code is a good bit smaller.
> There's a noticeable net code savings in src/backend/optimizer/,
> which there was not before.

Agreed.  (I had evidently missed a bunch of comments referring to the
old ways of how inherited updates are performed.)

> I've not made any attempt to do performance testing on this,
> but I think that's about the only thing standing between us
> and committing this thing.

I reran some of the performance tests I did earlier (I've attached the
modified test running script for reference):

pgbench -n -T60 -M{simple|prepared} -f nojoin.sql

nojoin.sql:

\set a random(1, 1000000)
update test_table t set b = :a where a = :a;

...and here are the tps figures:

-Msimple

nparts  10cols      20cols      40cols

master:
64      10112       9878        10920
128     9662        10691       10604
256     9642        9691        10626
1024    8589        9675        9521

patched:
64      13493       13463       13313
128     13305       13447       12705
256     13190       13161       12954
1024    11791       11408       11786

No variation across various column counts, but the patched improves
the tps for each case by quite a bit.

-Mprepared (plan_cache_mode=force_generic_plan)

master:
64      2286        2285        2266
128     1163        1127        1091
256     531         519         544
1024    77          71          69

patched:
64      6522        6612        6275
128     3568        3625        3372
256     1847        1710        1823
1024    433         427         386

Again, no variation across column counts.  tps drops as partition
count increases both before and after applying the patches, although
patched performs way better, which is mainly attributable to the
ability of UPDATE to now utilize runtime pruning (actually of the
Append under ModifyTable).  The drop as partition count increases can
be attributed to the fact that with a generic plan, there are a bunch
of steps that must be done across all partitions, such as
AcquireExecutorLocks(), ExecCheckRTPerms(), per-result-rel
initializations in ExecInitModifyTable(), etc., even with the patched.
As mentioned upthread, [1] can help with the last bit.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/32/2621/

Attachment

test_update_inh.sh

Re: making update/delete of inheritance trees scale better

From

Tom Lane

Date:

31 March 2021, 15:58:38

Amit Langote <amitlangote09@gmail.com> writes:
> On Tue, Mar 30, 2021 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a v13 patchset that I feel pretty good about.

> Thanks.  After staring at this for a day now, I do too.

Thanks for looking!  Pushed after some more docs-fiddling and a final
read-through.  I think the only code change from v13 is that I decided
to convert ExecGetJunkAttribute into a "static inline", since it's
just a thin wrapper around slot_getattr().  Doesn't really help
performance much, but it shouldn't hurt.

> ... The drop as partition count increases can
> be attributed to the fact that with a generic plan, there are a bunch
> of steps that must be done across all partitions, such as
> AcquireExecutorLocks(), ExecCheckRTPerms(), per-result-rel
> initializations in ExecInitModifyTable(), etc., even with the patched.
> As mentioned upthread, [1] can help with the last bit.

I'll try to find some time to look at that one.

I'd previously been thinking that we couldn't be lazy about applying
most of those steps at executor startup, but on second thought,
ExecCheckRTPerms should be a no-op anyway for child tables.  So
maybe it would be okay to not take a lock, much less do the other
stuff, until the particular child table is stored into.

            regards, tom lane

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

31 March 2021, 17:01:38

On Tue, Mar 30, 2021 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe that could be made more robust, but the other problem
> is that the EXPLAIN output is just about unreadable; nobody will
> understand what "(0)" means.

I think this was an idea that originally came from me, prompted by
what we already do for:

rhaas=# explain verbose select 1 except select 2;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..0.06 rows=1 width=8)
   Output: (1), (0)
   ->  Append  (cost=0.00..0.05 rows=2 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=8)
               Output: 1, 0
               ->  Result  (cost=0.00..0.01 rows=1 width=4)
                     Output: 1
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=8)
               Output: 2, 1
               ->  Result  (cost=0.00..0.01 rows=1 width=4)
                     Output: 2
(11 rows)

That is admittedly pretty magical, but it's a precedent. If you think
the relation OID to subplan index lookup is fast enough that it
doesn't matter, then I guess it's OK, but I guess my opinion is that
the subplan index feels like the thing we really want, and if we're
passing anything else up the plan tree, that seems to be a decision
made out of embarrassment rather than conviction. I think the real
problem here is that the deparsing code isn't in on the secret. If in
the above example, or in this patch, it deparsed as (Subplan Index) at
the parent level, and 0, 1, 2, ... in the children, it wouldn't
confuse anyone, or at least not much more than EXPLAIN output does in
general.

Or even if we just output (Constant-Value) it wouldn't be that bad.
The whole convention of deparsing target lists by recursing into the
children, or one of them, in some ways obscures what's really going
on. I did a talk a few years ago in which I made those target lists
deparse as $OUTER.0, $OUTER.1, $INNER.0, etc. and I think people found
that pretty enlightening, because it's sort of non-obvious in what way
table foo is present when a target list 8 levels up in the join tree
claims to have a value for foo.x. Now, such notation can't really be
recommended in general, because it'd be too hard to understand what
was happening in a lot of cases, but the recursive stuff is clearly
not without its own attendant confusions.

Thanks to both of you for working on this! As I said before, this
seems like really important work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Tom Lane

Date:

31 March 2021, 17:24:01

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 30, 2021 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe that could be made more robust, but the other problem
>> is that the EXPLAIN output is just about unreadable; nobody will
>> understand what "(0)" means.

> I think this was an idea that originally came from me, prompted by
> what we already do for:

I agree that we have some existing behavior that's related to this, but
it's still messy, and I couldn't find any evidence that suggested that the
runtime lookup costs anything.  Typical subplans are going to deliver
long runs of tuples from the same target relation, so as long as we
maintain a one-element cache of the last lookup result, it's only about
one comparison per tuple most of the time.
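
A rough standalone sketch of that one-element cache, for illustration
only: the names ResultRelLookup, lookup_init and lookup_result_rel are
hypothetical, and a linear search stands in for the hash table probe, so
this is not the committed executor code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical stand-in for ModifyTable's tableoid lookup state. */
typedef struct ResultRelLookup
{
    const Oid  *rel_oids;       /* result relation OIDs, by list index */
    int         nrels;
    Oid         last_oid;       /* one-element cache: last OID seen ... */
    int         last_index;     /* ... and the index it mapped to */
    long        slow_lookups;   /* cache misses, counted for illustration */
} ResultRelLookup;

static void
lookup_init(ResultRelLookup *lk, const Oid *rel_oids, int nrels)
{
    assert(nrels >= 0);
    lk->rel_oids = rel_oids;
    lk->nrels = nrels;
    lk->last_oid = 0;           /* 0 is never a valid table OID */
    lk->last_index = -1;
    lk->slow_lookups = 0;
}

/* Map a tuple's tableoid to its result relation index, or -1 if unknown. */
static int
lookup_result_rel(ResultRelLookup *lk, Oid tableoid)
{
    /* Fast path: same relation as the previous tuple, one comparison. */
    if (lk->last_index >= 0 && tableoid == lk->last_oid)
        return lk->last_index;

    /* Slow path: a linear search stands in for the hash table here. */
    lk->slow_lookups++;
    for (int i = 0; i < lk->nrels; i++)
    {
        if (lk->rel_oids[i] == tableoid)
        {
            lk->last_oid = tableoid;
            lk->last_index = i;
            return i;
        }
    }
    return -1;
}
```

When the subplan delivers a long run of tuples from one child, only the
first tuple of the run takes the slow path.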

> I think the real
> problem here is that the deparsing code isn't in on the secret.

Agreed; if we spent some more effort on that end of it, maybe we
could do something different here.  I'm not very sure what good
output would look like though.  A key advantage of tableoid is
that that's already a thing people know about.

            regards, tom lane

Re: making update/delete of inheritance trees scale better

From

Robert Haas

Date:

31 March 2021, 17:37:27

On Wed, Mar 31, 2021 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that we have some existing behavior that's related to this, but
> it's still messy, and I couldn't find any evidence that suggested that the
> runtime lookup costs anything.  Typical subplans are going to deliver
> long runs of tuples from the same target relation, so as long as we
> maintain a one-element cache of the last lookup result, it's only about
> one comparison per tuple most of the time.

OK, that's pretty fair.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

01 April 2021, 02:09:09

On Thu, Apr 1, 2021 at 12:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Tue, Mar 30, 2021 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Here's a v13 patchset that I feel pretty good about.
>
> > Thanks.  After staring at this for a day now, I do too.
>
> Thanks for looking!  Pushed after some more docs-fiddling and a final
> read-through.  I think the only code change from v13 is that I decided
> to convert ExecGetJunkAttribute into a "static inline", since it's
> just a thin wrapper around slot_getattr().  Doesn't really help
> performance much, but it shouldn't hurt.

Thanks a lot.

> > ... The drop as partition count increases can
> > be attributed to the fact that with a generic plan, there are a bunch
> > of steps that must be done across all partitions, such as
> > AcquireExecutorLocks(), ExecCheckRTPerms(), per-result-rel
> > initializations in ExecInitModifyTable(), etc., even with the patched.
> > As mentioned upthread, [1] can help with the last bit.
>
> I'll try to find some time to look at that one.
>
> I'd previously been thinking that we couldn't be lazy about applying
> most of those steps at executor startup, but on second thought,
> ExecCheckRTPerms should be a no-op anyway for child tables.

Yeah, David did say that in that thread:

https://www.postgresql.org/message-id/CAApHDvqPzsMcKLRpmNpUW97PmaQDTmD7b2BayEPS5AN4LY-0bA%40mail.gmail.com

>  So
> maybe it would be okay to not take a lock, much less do the other
> stuff, until the particular child table is stored into.

Note that the patch over there doesn't do anything about
AcquireExecutorLocks() bottleneck, as there are some yet-unsolved race
conditions that were previously discussed here:

https://www.postgresql.org/message-id/flat/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com

Anyway, I'll post the rebased version of the patch that we do have.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

From

David Rowley

Date:

01 April 2021, 03:06:29

On Thu, 1 Apr 2021 at 15:09, Amit Langote <amitlangote09@gmail.com> wrote:
> Note that the patch over there doesn't do anything about
> AcquireExecutorLocks() bottleneck, as there are some yet-unsolved race
> conditions that were previously discussed here:
>
> https://www.postgresql.org/message-id/flat/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com

The only way I can think of so far to get around having to lock all
child partitions is pretty drastic and likely it's too late to change
anyway.  The idea is that when you attach an existing table as a
partition that you can no longer access it directly. We'd likely have
to invent a new relkind for partitions for that to work.  This would
mean that we shouldn't ever need to lock individual partitions as all
things which access them must do so via the parent. I imagined that we
might still be able to truncate partitions with an ALTER TABLE ...
TRUNCATE PARTITION ...; or something.   It feels a bit late for all
that now though, especially so with all the CONCURRENTLY work Alvaro
has done to make ATTACH/DETACH not take an AEL.

Additionally, I imagine doing this would upset a lot of people who do
direct accesses to partitions.

Robert also mentioned some ideas in [1]. However, it seems that might
have a performance impact on locking in general.

I think some other DBMSes might not allow direct access to partitions.
Perhaps the locking issue is the reason why.

David

[1] https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

02 April 2021, 07:41:13

On Wed, Mar 31, 2021 at 9:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I reran some of the performance tests I did earlier (I've attached the
> modified test running script for reference):

For archives' sake, noticing a mistake in my benchmarking script, I
repeated the tests. Apparently, all pgbench runs were performed with
40 column tables, not 10, 20, and 40 as shown in the results.

> pgbench -n -T60 -M{simple|prepared} -f nojoin.sql
>
> nojoin.sql:
>
> \set a random(1, 1000000)
> update test_table t set b = :a where a = :a;
>
> ...and here are the tps figures:
>
> -Msimple
>
> nparts  10cols      20cols      40cols
>
> master:
> 64      10112       9878        10920
> 128     9662        10691       10604
> 256     9642        9691        10626
> 1024    8589        9675        9521
>
> patched:
> 64      13493       13463       13313
> 128     13305       13447       12705
> 256     13190       13161       12954
> 1024    11791       11408       11786
>
> No variation across various column counts, but the patched improves
> the tps for each case by quite a bit.

-Msimple

pre-86dc90056:
nparts  10cols      20cols      40cols

64      11345       10650       10327
128     11014       11005       10069
256     10759       10827       10133
1024    9518        10314       8418

post-86dc90056:
        10cols      20cols      40cols

64      13829       13677       13207
128     13521       12843       12418
256     13071       13006       12926
1024    12351       12036       11739

My previous assertion that the tps does not vary much across different
column counts seems to hold in this case, that is, -Msimple mode.

> -Mprepared (plan_cache_mode=force_generic_plan)
>
> master:
> 64      2286        2285        2266
> 128     1163        1127        1091
> 256     531         519         544
> 1024    77          71          69
>
> patched:
> 64      6522        6612        6275
> 128     3568        3625        3372
> 256     1847        1710        1823
> 1024    433         427         386
>
> Again, no variation across column counts.

-Mprepared

pre-86dc90056:
        10cols      20cols      40cols

64      3059        2851        2154
128     1675        1366        1100
256     685         658         544
1024    126         85          76

post-86dc90056:
nparts  10cols      20cols      40cols

64      7665        6966        6444
128     4211        3968        3389
256     2205        2020        1783
1024    545         499         389

In the -Mprepared case however, it does vary, both before and after
86dc90056.  For the post-86dc90056 case, I suspect it's because
ExecBuildUpdateProjection(), whose complexity is O(number-of-columns),
is performed for *all* partitions in ExecInitModifyTable().  In the
-Msimple case, it would always be for only one partition, so it
doesn't make that much of a difference to ExecInitModifyTable() time.
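
To make that suspicion concrete, here is a toy cost model, not
PostgreSQL code.  It assumes one unit of work per column for each result
relation whose update projection gets built; the function name
init_projection_work is made up for this illustration.

```python
def init_projection_work(nparts_initialized: int, ncols: int) -> int:
    """Toy cost: one unit per column per result relation initialized.

    This only mirrors the O(number-of-columns) per-partition argument;
    it is an assumption for illustration, not a measurement.
    """
    return nparts_initialized * ncols


if __name__ == "__main__":
    for nparts in (64, 128, 256, 1024):
        for ncols in (10, 20, 40):
            # Generic plan: every partition's projection is built at init.
            generic = init_projection_work(nparts, ncols)
            # Custom plan after pruning: only one partition remains.
            simple = init_projection_work(1, ncols)
            print(f"nparts={nparts:5d} ncols={ncols:2d} "
                  f"generic={generic:6d} simple={simple:3d}")
```

Under this model the column count is multiplied by the partition count
in the generic-plan case, while it stays a small constant when only one
partition is initialized, which is the shape of the difference
described above.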

>  tps drops as partition
> count increases both before and after applying the patches, although
> patched performs way better, which is mainly attributable to the
> ability of UPDATE to now utilize runtime pruning (actually of the
> Append under ModifyTable).  The drop as partition count increases can
> be attributed to the fact that with a generic plan, there are a bunch
> of steps that must be done across all partitions, such as
> AcquireExecutorLocks(), ExecCheckRTPerms(), per-result-rel
> initializations in ExecInitModifyTable(), etc., even with the patched.
> As mentioned upthread, [1] can help with the last bit.

Here are the numbers after applying that patch:

nparts  10cols      20cols      40cols

64      17185       17064       16625
128     12261       11648       11968
256     7662        7564        7439
1024    2252        2185        2101

With the patch, ExecBuildUpdateProjection() will be called only once
irrespective of the number of partitions, almost like the -Msimple
case, so the tps across column counts does not vary by much.
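
One rough way to quantify "does not vary by much" is the relative drop
from the best to the worst tps within each row.  The sketch below uses
only the figures from the tables in this message; the helper name
column_spread is made up here, and it assumes the numbers taken with
the extra patch are comparable to the post-86dc90056 -Mprepared runs.

```python
def column_spread(row):
    """Relative drop from the best to the worst tps in a row, in percent."""
    return 100.0 * (max(row) - min(row)) / max(row)


# tps for 10, 20 and 40 columns, keyed by partition count (from above).
post_86dc90056 = {
    64: (7665, 6966, 6444),
    128: (4211, 3968, 3389),
    256: (2205, 2020, 1783),
    1024: (545, 499, 389),
}
with_patch = {
    64: (17185, 17064, 16625),
    128: (12261, 11648, 11968),
    256: (7662, 7564, 7439),
    1024: (2252, 2185, 2101),
}

if __name__ == "__main__":
    for nparts in post_86dc90056:
        before = column_spread(post_86dc90056[nparts])
        after = column_spread(with_patch[nparts])
        print(f"nparts={nparts:5d} spread before={before:5.1f}% "
              f"after={after:5.1f}%")
```

With these figures the spread comes out smaller in every row once the
patch is applied, though single 60-second runs leave room for noise.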

-- 
Amit Langote
EDB: http://www.enterprisedb.com

RE: making update/delete of inheritance trees scale better

From

"houzj.fnst@fujitsu.com"

Date:

17 May 2021, 06:07:39

Hi

After 86dc900, in "src/include/nodes/pathnodes.h", I noticed that the
comment above struct RowIdentityVarInfo uses the phrase "partitioned UPDATE".

But "inherited UPDATE" seems to be used everywhere else.
Would it be better to keep them consistent by using "inherited UPDATE"?

Best regards,
houzj

Re: making update/delete of inheritance trees scale better

From

Amit Langote

Date:

17 May 2021, 06:32:51

Hi,

On Mon, May 17, 2021 at 3:07 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Hi
>
> After 86dc900, In " src/include/nodes/pathnodes.h ",
> I noticed that it uses the word " partitioned UPDATE " in the comment above struct RowIdentityVarInfo.
>
> But, it seems " inherited UPDATE " is used in the rest of places.
> Is it better to keep them consistent by using " inherited UPDATE " ?

Yeah, I would not be opposed to fixing that.  Like this maybe (patch attached)?

- * In partitioned UPDATE/DELETE it's important for child partitions to share
+ * In an inherited UPDATE/DELETE it's important for child tables to share

While at it, I also noticed that the comment refers to "the
row_identity_vars", which leaves it unclear which variable is
meant.  So I fixed that too.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

RowIdentityVarInfo-comment.patch

RE: making update/delete of inheritance trees scale better

From

"houzj.fnst@fujitsu.com"

Date:

17 May 2021, 09:18:26

> On Mon, May 17, 2021 at 3:07 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Hi
> >
> > After 86dc900, In " src/include/nodes/pathnodes.h ", I noticed that it
> > uses the word " partitioned UPDATE " in the comment above struct
> > RowIdentityVarInfo.
> >
> > But, it seems " inherited UPDATE " is used in the rest of places.
> > Is it better to keep them consistent by using " inherited UPDATE " ?
> 
> Yeah, I would not be opposed to fixing that.  Like this maybe (patch attached)?

> - * In partitioned UPDATE/DELETE it's important for child partitions to share
> + * In an inherited UPDATE/DELETE it's important for child tables to share

Thanks for the change, it looks good to me.

Best regards,
houzj