From Amit Langote
Subject making update/delete of inheritance trees scale better
Msg-id CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
Here is a sketch for implementing the design that Tom described here:
https://www.postgresql.org/message-id/flat/357.1550612935%40sss.pgh.pa.us

In short, we would like ModifyTable to get the tuples to update/delete
from a single plan, rather than from N plans for N child result
relations as is done currently.

I suppose things are the way they are because creating a separate plan
for each result relation makes the job of the ModifyTable node very
simple.  Currently, that job is this (sketched in code after the list):

1. Take the plan's output tuple, extract the tupleid of the tuple to
update/delete in the currently active result relation,
2. If delete, go to 3, else if update, filter out the junk columns
from the above tuple
3. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any
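
To illustrate, here is a simplified sketch of that per-result-relation
loop, loosely based on nodeModifyTable.c (variable names are
approximate, and triggers, EvalPlanQual, foreign tables, etc. are all
omitted):

for (;;)
{
    TupleTableSlot *planSlot;
    TupleTableSlot *slot;
    Datum       datum;
    bool        isNull;
    ItemPointer tupleid;

    /* get the next tuple from the currently active subplan */
    planSlot = ExecProcNode(subplanstate);
    if (TupIsNull(planSlot))
        break;              /* switch to the next subplan/result rel */

    /* 1. extract the "ctid" junk attribute identifying the tuple */
    datum = ExecGetJunkAttribute(planSlot, junkfilter->jf_junkAttNo,
                                 &isNull);
    tupleid = (ItemPointer) DatumGetPointer(datum);

    /* 2. for UPDATE, strip the junk columns from the plan's tuple */
    if (operation == CMD_UPDATE)
        slot = ExecFilterJunk(junkfilter, planSlot);

    /* 3. call ExecUpdate() or ExecDelete() on the currently active
     * result relation with tupleid and (for UPDATE) the filtered
     * tuple; arguments elided here */
}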

If we make ModifyTable do a little more work for the inheritance case,
we can create only one plan, without "expanding" the targetlist.
That is, it will contain entries only for attributes that are assigned
values in the SET clause.  This makes the plan reusable across result
relations, because all child relations must have those attributes,
even though the attribute numbers might differ.  Anyway, the work
that ModifyTable will now have to do is this (again sketched in code
after the list):

1. Take the plan's output tuple, extract the tupleid of the tuple to
update/delete and the "tableoid"
2. Select the result relation to operate on using the tableoid
3. If delete, go to 4, else if update, fetch the tuple identified by
tupleid from the result relation and fill in the unassigned columns
using that "old" tuple, also filtering out the junk columns
4. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any
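
A rough sketch of what this could look like, assuming the single
subplan is the node's outer plan (the executor functions are existing
ones; LookupResultRelByOid() and the other bare names are
hypothetical):

TupleTableSlot *planSlot;
TupleTableSlot *oldSlot;    /* slot in the result relation's format */
Datum       datum;
bool        isNull;
ItemPointer tupleid;
Oid         tableoid;
ResultRelInfo *resultRelInfo;

/* get the next tuple from the single, shared plan */
planSlot = ExecProcNode(outerPlanState(node));

/* 1. extract the "ctid" and "tableoid" junk attributes */
datum = ExecGetJunkAttribute(planSlot, ctid_attno, &isNull);
tupleid = (ItemPointer) DatumGetPointer(datum);
datum = ExecGetJunkAttribute(planSlot, tableoid_attno, &isNull);
tableoid = DatumGetObjectId(datum);

/* 2. select the result relation to operate on using tableoid,
 * e.g. via the hash table discussed later in this mail */
resultRelInfo = LookupResultRelByOid(node, tableoid);  /* hypothetical */

/* 3. for UPDATE, fetch the old tuple; the full new tuple is then
 * formed with the assigned columns taken from planSlot and the
 * rest from oldSlot, dropping the junk columns (see the
 * ProjectionInfo discussion further below) */
if (operation == CMD_UPDATE)
    table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
                                  tupleid, SnapshotAny, oldSlot);

/* 4. call ExecUpdate()/ExecDelete() on resultRelInfo; args elided */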

I do think that doing this would be worthwhile even if it increases
ModifyTable's per-row overhead slightly, because the planning overhead
of the current approach is very significant, especially for partition
trees with more than a couple of thousand partitions.  To give an idea
of how bad the problem is, trying to create a generic plan for `update
foo set ... where key = $1`, where foo has over 2000 partitions,
causes OOM even on a machine with 6GB of memory.

The one plan shared by all result relations will be the same as the
one we would get if the query were a SELECT, except that it will
contain junk attributes such as "ctid" needed to identify tuples, and
a new "tableoid" junk attribute if multiple result relations will be
present due to inheritance (a sketch of adding these junk columns
follows this paragraph).  One major way in which this targetlist
differs from that of the current per-result-relation plans is that it
won't be passed through expand_targetlist(), because the set of
unassigned attributes may differ from child to child.  As mentioned
above, those missing attributes will be filled in by ModifyTable doing
some extra work, whereas previously they would have come with the
plan's output tuple.
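
For instance, the junk columns could be added to the SELECT-like
targetlist along these lines (modeled on how such junk columns are
added today, e.g. in preprocess_targetlist(); result_relation and
tlist come from the surrounding planner context):

Var        *var;
TargetEntry *tle;

/* "ctid": identifies the target tuple within its result relation */
var = makeVar(result_relation, SelfItemPointerAttributeNumber,
              TIDOID, -1, InvalidOid, 0);
tle = makeTargetEntry((Expr *) var,
                      list_length(tlist) + 1,
                      pstrdup("ctid"),
                      true);        /* resjunk */
tlist = lappend(tlist, tle);

/* "tableoid": identifies which child relation the tuple came from */
var = makeVar(result_relation, TableOidAttributeNumber,
              OIDOID, -1, InvalidOid, 0);
tle = makeTargetEntry((Expr *) var,
                      list_length(tlist) + 1,
                      pstrdup("tableoid"),
                      true);        /* resjunk */
tlist = lappend(tlist, tle);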

For child result relations that are foreign tables, their FDW adds
junk attribute(s) to the query's targetlist by updating it in place
(AddForeignUpdateTargets); a typical implementation is sketched after
this paragraph.  However, as the child tables will no longer get their
own parsetree, we must use some hack around this interface to obtain
the foreign-table-specific junk attributes and add them to the
original/parent query's targetlist.  Assuming that all or most of the
children will belong to the same FDW, we will end up with only a
handful of such junk columns in the final targetlist.  I am not sure
if it's worthwhile to change the API of AddForeignUpdateTargets to
require FDWs to not scribble on the passed-in parsetree as part of
this patch.
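
For reference, a typical implementation of this callback today looks
roughly like the following (modeled on postgres_fdw, which adds a
"ctid" junk column; the function name here is made up).  Note that it
scribbles directly on the passed-in parsetree's targetlist:

static void
exampleAddForeignUpdateTargets(Query *parsetree,
                               RangeTblEntry *target_rte,
                               Relation target_relation)
{
    Var        *var;
    TargetEntry *tle;

    /* make a Var for the junk column this FDW needs */
    var = makeVar(parsetree->resultRelation,
                  SelfItemPointerAttributeNumber,
                  TIDOID, -1, InvalidOid, 0);

    /* append it to the query's targetlist as a resjunk column */
    tle = makeTargetEntry((Expr *) var,
                          list_length(parsetree->targetList) + 1,
                          pstrdup("ctid"),
                          true);
    parsetree->targetList = lappend(parsetree->targetList, tle);
}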

As for how ModifyTable will create the new tuple for updates, I have
decided to use a ProjectionInfo for each result relation, which
projects a full, *clean* tuple ready to be put back into the relation.
When projecting, the plan's output tuple serves as the OUTER tuple and
the old tuple fetched to fill in unassigned attributes serves as the
SCAN tuple (sketched below).  By having this ProjectionInfo also serve
as the "junk filter", we don't need JunkFilters.  The targetlist that
this projection computes is the same as that of the current
result-relation-specific plan.  Initially, I thought of generating
this "expanded" targetlist in ExecInitModifyTable().  But as that can
be somewhat expensive, doing it only once in the planner seemed like a
good idea.  These per-result-relation targetlists are carried in the
ModifyTable node.
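
In executor terms, forming the new tuple for an UPDATE would then look
roughly like this (the ExprContext fields and ExecProject() are
existing interfaces; newTupleProj, planSlot and oldSlot are
illustrative names):

ExprContext *econtext = newTupleProj->pi_exprContext;

/* the plan's output tuple supplies the columns assigned in SET;
 * its junk columns are simply not copied by the projection */
econtext->ecxt_outertuple = planSlot;

/* the old tuple fetched from the result relation supplies the
 * unassigned columns */
econtext->ecxt_scantuple = oldSlot;

/* produces a full, clean tuple in the result relation's layout */
newSlot = ExecProject(newTupleProj);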

To identify the result relation from the tuple produced by the plan,
the "tableoid" junk column will be used.  As the tuples for different
result relations won't necessarily come out in the order in which the
result relations are laid out in the ModifyTable node, we need a way
to map a tableoid value to a result relation index.  I have decided to
use a hash table here (sketched below).
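
For instance, a dynahash table keyed by the relation OID could do the
job (struct and variable names here are illustrative):

typedef struct ResultRelOidEntry
{
    Oid     relid;          /* hash key: result relation's OID */
    int     rri_index;      /* index into the result relation array */
} ResultRelOidEntry;

HASHCTL     ctl;
HTAB       *tab;
ResultRelOidEntry *entry;
bool        found;

memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(ResultRelOidEntry);
tab = hash_create("result rels by OID", nrels, &ctl,
                  HASH_ELEM | HASH_BLOBS);

/* build phase: one entry per result relation */
for (int i = 0; i < nrels; i++)
{
    Oid     relid = RelationGetRelid(resultRelInfos[i].ri_RelationDesc);

    entry = (ResultRelOidEntry *)
        hash_search(tab, &relid, HASH_ENTER, NULL);
    entry->rri_index = i;
}

/* per-row lookup using the tuple's "tableoid" junk value */
entry = (ResultRelOidEntry *)
    hash_search(tab, &tableoid, HASH_FIND, &found);
if (entry == NULL)
    elog(ERROR, "no result relation found for OID %u", tableoid);
resultRelInfo = &resultRelInfos[entry->rri_index];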

A couple of things that I haven't thought very hard about yet, but
may revisit later:

* We will no longer be able to use the DirectModify APIs to push
updates to remote servers for foreign child result relations

* Over in [1], I said that we get run-time pruning for free for
ModifyTable because the plan we are using is the same as that for a
SELECT, although now I think I hadn't thought that through.  With the
PoC patch that I have:

prepare q as update foo set a = 250001 where a = $1;
set plan_cache_mode to 'force_generic_plan';
explain execute q(1);
                             QUERY PLAN
--------------------------------------------------------------------
 Update on foo  (cost=0.00..142.20 rows=40 width=14)
   Update on foo_1
   Update on foo_2 foo
   Update on foo_3 foo
   Update on foo_4 foo
   ->  Append  (cost=0.00..142.20 rows=40 width=14)
         Subplans Removed: 3
         ->  Seq Scan on foo_1  (cost=0.00..35.50 rows=10 width=14)
               Filter: (a = $1)
(9 rows)

While it's true that we will never have to actually update foo_2,
foo_3, and foo_4, ModifyTable still sets up their ResultRelInfos,
which ideally it shouldn't.  Maybe we'll need to do something about
that after all.

I will post the patch shortly.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqGXmP3-S9y%3DOQHyJyeWnZSOmcxBGdgAMWcLUOsnPTL88w%40mail.gmail.com


