Thread: ON CONFLICT DO UPDATE for partitioned tables

ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

28 February 2018, 03:46:02

I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
Following the lead of edd44738bc88 ("Be lazier about partition tuple
routing.") this incarnation only does the necessary push-ups for the
specific partition that needs it, at execution time.  As far as I can
tell, it works as intended.

I chose to refuse the case where the DO UPDATE clause causes the tuple
to move to another partition (i.e. you're updating the partition key of
the tuple).  While it's probably possible to implement that, it doesn't
seem a very productive use of time.

However, there is a shortcoming in the design: it fails if there are
multiple levels of partitioning, because there is no easy (efficient)
way to map the index OID more than one level.  I had already mentioned
this shortcoming to Amit's patch.  So this case (which I added in the
regression tests) fails unexpectedly:

-- multiple-layered partitioning
create table parted_conflict_test (a int primary key, b text) partition by range (a);
create table parted_conflict_test_1 partition of parted_conflict_test
  for values from (0) to (10000) partition by range (a);
create table parted_conflict_test_1_1 partition of parted_conflict_test_1
  for values from (0) to (100);
insert into parted_conflict_test values ('10', 'ten');
insert into parted_conflict_test values ('10', 'ten two')
  on conflict (a) do update set b = excluded.b;
ERROR:  invalid ON CONFLICT DO UPDATE specification
DETAIL:  An inferred index was not found in partition "parted_conflict_test_1_1".

So the problem here is that my MapPartitionIndexList() implementation is
too stupid.  I think it would be smarter to use the ResultRelInfo
instead of bare Relation, for one.  But that still doesn't answer how to
find a "path" from root to leaf partition, which is what I'd need to
verify that there are valid pg_inherits relationships for the partition
indexes.  I'm probably missing something about the partitionDesc or
maybe the partitioned_rels lists that helps me do it efficiently, but I
hope figure out soon.

One idea I was toying with is to add RelationData->rd_children as a list
of OIDs of children relations.  So it should be possible to walk the
list from the root to the desired descendant, without having to scan
pg_inherits repeatedly ... although that would probably require doing
relation_open() for every index, which sounds undesirable.

(ISTM that having RelationData->rd_children might be a good idea in
general anyway -- I mean to speed up some code that currently scans
pg_inherits via find_inheritance_children.  However, since the partition
descriptor itself is already in relcache, maybe this doesn't matter too
much.)

Another idea is to abandon the notion that we need to find a path from
parent index to descendant index, and just run the inference algorithm
again on the partition.  I'm not sure how much I like this idea, yet.

Anyway, while this is still WIP, I think it works correctly for the case
where there is a single partition level.


[1] https://postgr.es/m/c1651d5b-7bd6-b7e7-e1cc-16ecfe2c0da5@lab.ntt.co.jp

-- 
Álvaro Herrer

Attachment

v1-0001-fix-ON-CONFLICT-DO-UPDATE-for-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

28 February 2018, 12:24:27

On 2018/02/28 9:46, Alvaro Herrera wrote:
> I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
> Following the lead of edd44738bc88 ("Be lazier about partition tuple
> routing.") this incarnation only does the necessary push-ups for the
> specific partition that needs it, at execution time.  As far as I can
> tell, it works as intended.

Thanks.  I too have been meaning to send an updated (nay, significantly
rewritten) version of the patch, but you beat me to it.

> I chose to refuse the case where the DO UPDATE clause causes the tuple
> to move to another partition (i.e. you're updating the partition key of
> the tuple).  While it's probably possible to implement that, it doesn't
> seem a very productive use of time.

Probably a good idea to save it for later.

> However, there is a shortcoming in the design: it fails if there are
> multiple levels of partitioning, because there is no easy (efficient)
> way to map the index OID more than one level.  I had already mentioned
> this shortcoming to Amit's patch.  So this case (which I added in the
> regression tests) fails unexpectedly:
> 
> -- multiple-layered partitioning
> create table parted_conflict_test (a int primary key, b text) partition by range (a);
> create table parted_conflict_test_1 partition of parted_conflict_test
>   for values from (0) to (10000) partition by range (a);
> create table parted_conflict_test_1_1 partition of parted_conflict_test_1
>   for values from (0) to (100);
> insert into parted_conflict_test values ('10', 'ten');
> insert into parted_conflict_test values ('10', 'ten two')
>   on conflict (a) do update set b = excluded.b;
> ERROR:  invalid ON CONFLICT DO UPDATE specification
> DETAIL:  An inferred index was not found in partition "parted_conflict_test_1_1".
> 
> So the problem here is that my MapPartitionIndexList() implementation is
> too stupid.  I think it would be smarter to use the ResultRelInfo
> instead of bare Relation, for one.  But that still doesn't answer how to
> find a "path" from root to leaf partition, which is what I'd need to
> verify that there are valid pg_inherits relationships for the partition
> indexes.  I'm probably missing something about the partitionDesc or
> maybe the partitioned_rels lists that helps me do it efficiently, but I
> hope figure out soon.
> 
> One idea I was toying with is to add RelationData->rd_children as a list
> of OIDs of children relations.  So it should be possible to walk the
> list from the root to the desired descendant, without having to scan
> pg_inherits repeatedly ... although that would probably require doing
> relation_open() for every index, which sounds undesirable.
> 
> (ISTM that having RelationData->rd_children might be a good idea in
> general anyway -- I mean to speed up some code that currently scans
> pg_inherits via find_inheritance_children.  However, since the partition
> descriptor itself is already in relcache, maybe this doesn't matter too
> much.)
> 
> Another idea is to abandon the notion that we need to find a path from
> parent index to descendant index, and just run the inference algorithm
> again on the partition.  I'm not sure how much I like this idea, yet.
> 
> Anyway, while this is still WIP, I think it works correctly for the case
> where there is a single partition level.

Actually, after your comment on my original patch [1], I did make it work
for multiple levels by teaching the partition initialization code to find
a given partition's indexes that are inherited from the root table (that
is the table mentioned in command).  So, after a tuple is routed to a
partition, we switch from the original arbiterIndexes list to the one we
created for the partition, which must contain OIDs corresponding to those
in the original list.  After all, for each of the parent's indexes that
the planner put into the original arbiterIndexes list, there must exist an
index in each of the leaf partitions.

I had also observed when working on the patch that various TupleTableSlots
used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
inheritance-translated target list (DO UPDATE SET target list).  In fact,
that has to take into account also the dropped columns; we may have
dropped columns either in parent or in a partition or in both at same or
different attnum positions.  That means, simple map_partition_varattnos()
translation doesn't help in this case.

For example, with your patch (sorry, I know you said it's a WIP patch), I
see either a crash or errors when dealing with such differing attribute
numbers:

drop table p;
create table p (a int, b text) partition by list (a);
create table p12 (b text, a int);
alter table p attach partition p12 for values in (1, 2);
alter table p drop b, add b text;
create table p4 partition of p for values in (4);
create unique index on p (a);

insert into p values (1, 'a') on conflict (a) do update set b = excluded.b;

insert into p values (1, 'b') on conflict (a) do update set b = excluded.b;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

insert into p values (4, 'a') on conflict (a) do update set b = excluded.b;

postgres=# insert into p values (4, 'b') on conflict (a) do update set b =
excluded.b;
ERROR:  attribute number 3 exceeds number of columns 2

I attach my patch here for your reference, which I polished this morning
after seeing your email and the patch.  It works for most of the cases, as
you can see in the updated tests in insert_conflict.sql.  Since I agree
with you that we should, for now, error out if DO UPDATE causes row
movement, I adopted the code from your patch for that.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/20171227225031.osh6vunpuhsath25%40alvherre.pgsql

Attachment

v1-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Robert Haas

Date:

28 February 2018, 19:03:57

On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
> Following the lead of edd44738bc88 ("Be lazier about partition tuple
> routing.") this incarnation only does the necessary push-ups for the
> specific partition that needs it, at execution time.  As far as I can
> tell, it works as intended.
>
> I chose to refuse the case where the DO UPDATE clause causes the tuple
> to move to another partition (i.e. you're updating the partition key of
> the tuple).  While it's probably possible to implement that, it doesn't
> seem a very productive use of time.

I would have thought that to be the only case we could support with
the current infrastructure.  Doesn't a correct implementation for any
other case require a global index?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

01 March 2018, 03:57:47

On 2018/03/01 1:03, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
>> Following the lead of edd44738bc88 ("Be lazier about partition tuple
>> routing.") this incarnation only does the necessary push-ups for the
>> specific partition that needs it, at execution time.  As far as I can
>> tell, it works as intended.
>>
>> I chose to refuse the case where the DO UPDATE clause causes the tuple
>> to move to another partition (i.e. you're updating the partition key of
>> the tuple).  While it's probably possible to implement that, it doesn't
>> seem a very productive use of time.
> 
> I would have thought that to be the only case we could support with
> the current infrastructure.  Doesn't a correct implementation for any
> other case require a global index?

I'm thinking that Alvaro is talking here about the DO UPDATE action part,
not the conflict checking part.  The latter will definitely require global
indexes if conflict were to be checked on columns not containing the
partition key.

The case Alvaro mentions arises after checking the conflict, presumably
using an inherited unique index whose keys must include the partition
keys. If the conflict action is DO UPDATE and its SET clause changes
partition key columns such that the row will have to change the partition,
then the current patch will result in an error.  I think that's because
making update row movement work in this case will require some adjustments
to what 2f178441044 (Allow UPDATE to move rows between partitions)
implemented.  We wouldn't have things set up in the ModifyTableState that
the current row-movement code depends on being set up; for example, there
wouldn't be per-subplan ResultRelInfo's in the ON CONFLICT DO UPDATE case.
 The following Assert in ExecUpdate() will fail for instance:

            map_index = resultRelInfo - mtstate->resultRelInfo;
            Assert(map_index >= 0 && map_index < mtstate->mt_nplans);

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

02 March 2018, 18:35:58

Amit Langote wrote:

> Actually, after your comment on my original patch [1], I did make it work
> for multiple levels by teaching the partition initialization code to find
> a given partition's indexes that are inherited from the root table (that
> is the table mentioned in command).  So, after a tuple is routed to a
> partition, we switch from the original arbiterIndexes list to the one we
> created for the partition, which must contain OIDs corresponding to those
> in the original list.  After all, for each of the parent's indexes that
> the planner put into the original arbiterIndexes list, there must exist an
> index in each of the leaf partitions.

Oh, your solution for this seems simple enough.  Silly me, I was trying
to implement it in a quite roundabout way.  Thanks.  (I do wonder if we
should save the "root" reloid in the relcache).

> I had also observed when working on the patch that various TupleTableSlots
> used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
> inheritance-translated target list (DO UPDATE SET target list).  In fact,
> that has to take into account also the dropped columns; we may have
> dropped columns either in parent or in a partition or in both at same or
> different attnum positions.  That means, simple map_partition_varattnos()
> translation doesn't help in this case.

Yeah, I was aware these corner cases could become a problem though I
hadn't gotten around to testing them yet.  Thanks for all your work on
this.

The usage of the few optimizer/prep/ functions that are currently static
doesn't fill me with joy.  These functions have weird APIs because
they're static so we don't rally care, but once we export them we should
strive to be more careful.  I'd rather stay away from just exporting
them all, so I chose to encapsulate those calls in a single function and
export only expand_targetlist from preptlist.c, keeping the others
static in prepunion.c.  In the attached patch set, I put an API change
(work on tupdescs rather than full-blown relations) for a couple of
those functions as 0001, then your patch as 0002, then a few fixups of
my own.  (0002 is not bit-by-bit identical to yours; I think I had to
fix some merge conflict with 0001, but should be pretty much the same).

But looking further, I think there is much cruft that has accumulated in
those functions (because they've gotten simplified over time), and we
could do some additional cleanup surgery.  For example, there is no
reason to pass a list pointer to make_inh_translation_list(); we could
just return it.  And then we don't have to cons up a fake AppendRelInfo
with all dummy values that adjust_inherited_tlist doesn't even care
about.  I think there was a point for all these contortions back at some
point (visible by checking git history of this code), but it all seems
useless now.

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?  I'm obviously confused about what
expand_targetlist call this comment is talking about.  Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday.  I'll get back to this
soon, but in the meantime, here's what I have.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

02 March 2018, 18:36:10

Amit Langote wrote:

> Actually, after your comment on my original patch [1], I did make it work
> for multiple levels by teaching the partition initialization code to find
> a given partition's indexes that are inherited from the root table (that
> is the table mentioned in command).  So, after a tuple is routed to a
> partition, we switch from the original arbiterIndexes list to the one we
> created for the partition, which must contain OIDs corresponding to those
> in the original list.  After all, for each of the parent's indexes that
> the planner put into the original arbiterIndexes list, there must exist an
> index in each of the leaf partitions.

Oh, your solution for this seems simple enough.  Silly me, I was trying
to implement it in a quite roundabout way.  Thanks.  (I do wonder if we
should save the "root" reloid in the relcache).

> I had also observed when working on the patch that various TupleTableSlots
> used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
> inheritance-translated target list (DO UPDATE SET target list).  In fact,
> that has to take into account also the dropped columns; we may have
> dropped columns either in parent or in a partition or in both at same or
> different attnum positions.  That means, simple map_partition_varattnos()
> translation doesn't help in this case.

Yeah, I was aware these corner cases could become a problem though I
hadn't gotten around to testing them yet.  Thanks for all your work on
this.

The usage of the few optimizer/prep/ functions that are currently static
doesn't fill me with joy.  These functions have weird APIs because
they're static so we don't rally care, but once we export them we should
strive to be more careful.  I'd rather stay away from just exporting
them all, so I chose to encapsulate those calls in a single function and
export only expand_targetlist from preptlist.c, keeping the others
static in prepunion.c.  In the attached patch set, I put an API change
(work on tupdescs rather than full-blown relations) for a couple of
those functions as 0001, then your patch as 0002, then a few fixups of
my own.  (0002 is not bit-by-bit identical to yours; I think I had to
fix some merge conflict with 0001, but should be pretty much the same).

But looking further, I think there is much cruft that has accumulated in
those functions (because they've gotten simplified over time), and we
could do some additional cleanup surgery.  For example, there is no
reason to pass a list pointer to make_inh_translation_list(); we could
just return it.  And then we don't have to cons up a fake AppendRelInfo
with all dummy values that adjust_inherited_tlist doesn't even care
about.  I think there was a point for all these contortions back at some
point (visible by checking git history of this code), but it all seems
useless now.

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?  I'm obviously confused about what
expand_targetlist call this comment is talking about.  Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday.  I'll get back to this
soon, but in the meantime, here's what I have.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

05 March 2018, 08:28:50

On 2018/03/03 0:36, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> Actually, after your comment on my original patch [1], I did make it work
>> for multiple levels by teaching the partition initialization code to find
>> a given partition's indexes that are inherited from the root table (that
>> is the table mentioned in command).  So, after a tuple is routed to a
>> partition, we switch from the original arbiterIndexes list to the one we
>> created for the partition, which must contain OIDs corresponding to those
>> in the original list.  After all, for each of the parent's indexes that
>> the planner put into the original arbiterIndexes list, there must exist an
>> index in each of the leaf partitions.
> 
> Oh, your solution for this seems simple enough.  Silly me, I was trying
> to implement it in a quite roundabout way.  Thanks.  (I do wonder if we
> should save the "root" reloid in the relcache).

Do you mean store the root reloid for "any" partition (table or index)?

>> I had also observed when working on the patch that various TupleTableSlots
>> used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
>> inheritance-translated target list (DO UPDATE SET target list).  In fact,
>> that has to take into account also the dropped columns; we may have
>> dropped columns either in parent or in a partition or in both at same or
>> different attnum positions.  That means, simple map_partition_varattnos()
>> translation doesn't help in this case.
> 
> Yeah, I was aware these corner cases could become a problem though I
> hadn't gotten around to testing them yet.  Thanks for all your work on
> this.
> 
> The usage of the few optimizer/prep/ functions that are currently static
> doesn't fill me with joy.  These functions have weird APIs because
> they're static so we don't rally care, but once we export them we should
> strive to be more careful.  I'd rather stay away from just exporting
> them all, so I chose to encapsulate those calls in a single function and
> export only expand_targetlist from preptlist.c, keeping the others
> static in prepunion.c.  In the attached patch set, I put an API change
> (work on tupdescs rather than full-blown relations) for a couple of
> those functions as 0001, then your patch as 0002, then a few fixups of
> my own.  (0002 is not bit-by-bit identical to yours; I think I had to
> fix some merge conflict with 0001, but should be pretty much the same).
> 
> But looking further, I think there is much cruft that has accumulated in
> those functions (because they've gotten simplified over time), and we
> could do some additional cleanup surgery.  For example, there is no
> reason to pass a list pointer to make_inh_translation_list(); we could
> just return it.  And then we don't have to cons up a fake AppendRelInfo
> with all dummy values that adjust_inherited_tlist doesn't even care
> about.  I think there was a point for all these contortions back at some
> point (visible by checking git history of this code), but it all seems
> useless now.

Yeah, I think it might as well work to fix up these interfaces the way
you've done.

> Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
> expand_targetlist() runs *after*, not before, so how could it have
> affected the result?  I'm obviously confused about what
> expand_targetlist call this comment is talking about.  Anyway I wanted
> to make it use resjunk entries instead, but that broke some other case
> that I didn't have time to research yesterday.  I'll get back to this
> soon, but in the meantime, here's what I have.

Hmm.  I can imagine how the newly added comments in
adjust_inherited_tlist() may be confusing.  With this patch, we're now
calling adjust_inherited_tlist() from the executor, whereas it was
designed only to be run in the planner prep phase.  What we're passing to
it from the executor is the DO UPDATE SET's targetlist that has undergone
the expand_targetlist() treatment by the planner.  Maybe, we need to
update the adjust_inherited_tlist() comments to reflect the expansion of
its scope due to this patch.

Some comments on 0003:

+     * Fist, fix the target entries' resnos, by using inheritance
translation.

First

+    appinfo.parent_reltype = InvalidOid; // parentRel->rd_rel->reltype;

I guess you won't retain that comment. :)

+    result_tl = expand_targetlist(result_tl, CMD_UPDATE,

Should we add a CmdType argument to adjust_and_expand_partition_tlist()
and use it instead of hard-coding CMD_UPDATE here?

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Pavan Deolasee

Date:

05 March 2018, 12:04:02

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?

If I understand correctly, planner must have called expand_targetlist() once for the parent relation's descriptor and added any dropped columns from the parent relation. So we may not find mapped attributes for those dropped columns in the parent. I haven't actually tested this case though.

I wonder if it makes sense to actually avoid expanding on-conflict-set targetlist in case the target is a partition table and deal with it during execution, like we are doing now.

I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

+ conv_setproj =

+ adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),

+ RelationGetDescr(partrel),

+ RelationGetRelationName(partrel),

+ firstVarno,

+ conv_setproj);

Aren't we are adding back Vars referring to the parent relation, after having converted the existing ones to use partition relation's varno? May be that works because missing attributes are already added during planning and expand_targetlist() here only adds dropped columns, which are just NULL constants.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Pavan Deolasee

Date:

16 March 2018, 13:43:56

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting

int num_subplan_partition_offsets;

TupleTableSlot *partition_tuple_slot;

TupleTableSlot *root_tuple_slot;

+ List **partition_arbiter_indexes;

+ TupleTableSlot **partition_conflproj_slots;

+ TupleTableSlot **partition_existing_slots;

} PartitionTupleRouting;

I am curious why you decided to add these members to PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better place to track these or is there some rule that we follow?

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Etsuro Fujita

Date:

16 March 2018, 14:43:30

(2018/03/16 19:43), Pavan Deolasee wrote:
> On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com
> <mailto:alvherre@2ndquadrant.com>> wrote:

> @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
>      int         num_subplan_partition_offsets;
>      TupleTableSlot *partition_tuple_slot;
>      TupleTableSlot *root_tuple_slot;
> +   List      **partition_arbiter_indexes;
> +   TupleTableSlot **partition_conflproj_slots;
> +   TupleTableSlot **partition_existing_slots;
>   } PartitionTupleRouting;

> I am curious why you decided to add these members to
> PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
> place to track these or is there some rule that we follow?

I just started reviewing the patch, so maybe I'm missing something, but 
I think it would be a good idea to have these in that structure, not in 
ResultRelInfo, because these would be required only for partitions 
chosen via tuple routing.

Best regards,
Etsuro Fujita

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Pavan Deolasee

Date:

16 March 2018, 14:59:36

On Fri, Mar 16, 2018 at 5:13 PM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:

(2018/03/16 19:43), Pavan Deolasee wrote:
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com
<mailto:alvherre@2ndquadrant.com>> wrote:

@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
int num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+ List **partition_arbiter_indexes;
+ TupleTableSlot **partition_conflproj_slots;
+ TupleTableSlot **partition_existing_slots;
} PartitionTupleRouting;

I am curious why you decided to add these members to
PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
place to track these or is there some rule that we follow?

I just started reviewing the patch, so maybe I'm missing something, but I think it would be a good idea to have these in that structure, not in ResultRelInfo, because these would be required only for partitions chosen via tuple routing.

Hmm, yeah, probably you're right. The reason I got confused is because the patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the converted projection info/where clause for a given partition in its ResultRelationInfo. So I was wondering if we can move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and then just use that per-partition structures too.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

16 March 2018, 18:13:03

Pavan Deolasee wrote:

> Hmm, yeah, probably you're right. The reason I got confused is because the
> patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the
> converted projection info/where clause for a given partition in its
> ResultRelationInfo. So I was wondering if we can
> move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and
> then just use that per-partition structures too.

I wonder if the proposed structure is good actually.

Some notes as I go along.

1. ModifyTableState->mt_arbiterindexes is just a copy of
ModifyTable->arbiterIndexes.  So why do we need it?  For an
unpartitioned relation we can just use
ModifyTableState.ps->arbiterIndexes.  Now, for each partition we need to
map these indexes onto the partition indexes.  Not sure where to put
these; I'm tempted to say ResultRelInfo is the place.  Maybe the list
should always be in ResultRelInfo instead of the state/plan node?  Not
sure.

2. We don't need mt_existing to be per-relation; a single tuple slot can
do for all partitions; we just need to ExecSlotSetDescriptor to the
partition's descriptor whenever the slot is going to be used.  (This
means the slot no longer has a fixed tupdesc.  That seems OK).

3. ModifyTableState->mt_conflproj is more or less the same thing; the
same TTS can be reused by all the different projections, as long as we
set the descriptor before projecting.  So we don't
need PartitionTupleRouting->partition_conflproj_slots, but we do need a
descriptor to be used when needed.  Let's say
  PartitionTupleRouting->partition_confl_tupdesc 

I'll experiment with these ideas and see where that leads me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

16 March 2018, 21:21:06

So ExecInsert receives the ModifyTableState, and separately it receives
arbiterIndexes and the OnConflictAction, both of which are members of
the passed ModifyTableState.  I wonder why does it do that; wouldn't it
be simpler to extract those members from the node?

With the patch proposed upthread, we receive arbiterIndexes as a
parameter and if the table is a partition we summarily ignore those and
use the list as extracted from the PartitionRoutingInfo.  This is
confusing and pointless.  It seems to me that the logic ought to be "if
partition then use the list in PartitionRoutingInfo; if not partition
use it from ModifyTableState".  This requires changing as per above,
i.e. make the arbiter index list not part of the ExecInsert's API.

The OnConflictAction doesn't matter much; not passing it is merely a
matter of cleanliness.

Or is there another reason to pass the index list?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Peter Geoghegan

Date:

16 March 2018, 21:33:56

On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> So ExecInsert receives the ModifyTableState, and separately it receives
> arbiterIndexes and the OnConflictAction, both of which are members of
> the passed ModifyTableState.  I wonder why does it do that; wouldn't it
> be simpler to extract those members from the node?

> Or is there another reason to pass the index list?

It works that way pretty much by accident, as far as I can tell.
Removing the two extra arguments sounds like a good idea.

-- 
Peter Geoghegan

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

16 March 2018, 23:56:10

Peter Geoghegan wrote:
> On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera
> <alvherre@alvh.no-ip.org> wrote:
> > So ExecInsert receives the ModifyTableState, and separately it receives
> > arbiterIndexes and the OnConflictAction, both of which are members of
> > the passed ModifyTableState.  I wonder why does it do that; wouldn't it
> > be simpler to extract those members from the node?
> 
> > Or is there another reason to pass the index list?
> 
> It works that way pretty much by accident, as far as I can tell.
> Removing the two extra arguments sounds like a good idea.

Great, thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

17 March 2018, 00:23:44

Another thing I noticed is that the split of the ON CONFLICT slot and
its corresponding projection is pretty weird.  The projection is in
ResultRelInfo, but the slot itself is in ModifyTableState.  You can't
make the projection work without a corresponding slot initialized with
the correct descriptor, so splitting it this way doesn't make a lot of
sense to me.

(Now, TBH the split between resultRelInfo->ri_projectReturning and
ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
returning project uses, doesn't make a lot of sense to me either; so
maybe there some reason that I'm just not seeing.  But I digress.)

So I want to propose that we move the slot to be together with the
projection node that it serves, ie. we put the slot in ResultRelInfo:

typedef struct ResultRelInfo
{
...
    /* for computing ON CONFLICT DO UPDATE SET */
    TupleTableSlot *ri_onConflictProjSlot;
    ProjectionInfo *ri_onConflictSetProj;

and with this the structure makes more sense.  So ExecInitModifyTable
does this

        /* create target slot for UPDATE SET projection */
        tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
                                 relationDesc->tdhasoid);
        resultRelInfo->ri_onConflictProjSlot =
            ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);

        /* build UPDATE SET projection state */
        resultRelInfo->ri_onConflictSetProj =
            ExecBuildProjectionInfo(node->onConflictSet, econtext,
                                    resultRelInfo->ri_onConflictProjSlot,
                                    &mtstate->ps, relationDesc);

and then ExecOnConflictUpdate can simply do this:

    /* Project the new tuple version */
    ExecProject(resultRelInfo->ri_onConflictSetProj);

    /* Execute UPDATE with projection */
    *returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
                            resultRelInfo->ri_onConflictProjSlot, planSlot,
                            &mtstate->mt_epqstate, mtstate->ps.state,
                            canSetTag);

Now, maybe there is some reason I'm missing for the on conflict slot for
the projection to be in ModifyTableState rather than resultRelInfo.  But
this code passes all current tests, so I don't know what that reason
would be.


Overall, the resulting code looks simpler to me than the previous
arrangements.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Andres Freund

Date:

17 March 2018, 00:28:51

Hi,

On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote:
> Another thing I noticed is that the split of the ON CONFLICT slot and
> its corresponding projection is pretty weird.  The projection is in
> ResultRelInfo, but the slot itself is in ModifyTableState.  You can't
> make the projection work without a corresponding slot initialized with
> the correct descriptor, so splitting it this way doesn't make a lot of
> sense to me.
> 
> (Now, TBH the split between resultRelInfo->ri_projectReturning and
> ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
> returning project uses, doesn't make a lot of sense to me either; so
> maybe there some reason that I'm just not seeing.  But I digress.)

The projections for different child tables / child plans can look
different, therefore it can't be stored in ModifyTableState itself. No?

The slot's descriptor is changed to be "appropriate" when necessary:
            if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
                ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));


> So I want to propose that we move the slot to be together with the
> projection node that it serves, ie. we put the slot in ResultRelInfo:

That'll mean we have a good number of additional slots in some cases. I
don't think the overhead of that is going to break the bank, but it's
worth considering.

Greetings,

Andres Freund

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

17 March 2018, 00:46:17

Andres Freund wrote:
> Hi,
> 
> On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote:
> > Another thing I noticed is that the split of the ON CONFLICT slot and
> > its corresponding projection is pretty weird.  The projection is in
> > ResultRelInfo, but the slot itself is in ModifyTableState.  You can't
> > make the projection work without a corresponding slot initialized with
> > the correct descriptor, so splitting it this way doesn't make a lot of
> > sense to me.
> > 
> > (Now, TBH the split between resultRelInfo->ri_projectReturning and
> > ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
> > returning project uses, doesn't make a lot of sense to me either; so
> > maybe there some reason that I'm just not seeing.  But I digress.)
> 
> The projections for different child tables / child plans can look
> different, therefore it can't be stored in ModifyTableState itself. No?
> 
> The slot's descriptor is changed to be "appropriate" when necessary:
>             if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
>                 ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));

Grumble.  This stuff looks like full of cheats to me, but I won't touch
it anyway.

> > So I want to propose that we move the slot to be together with the
> > projection node that it serves, ie. we put the slot in ResultRelInfo:
> 
> That'll mean we have a good number of additional slots in some cases. I
> don't think the overhead of that is going to break the bank, but it's
> worth considering.

Good point.

I think what I should be doing is the same as the returning stuff: keep
a tupdesc around, and use a single slot, whose descriptor is changed
just before the projection.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

18 March 2018, 07:17:15

Alvaro Herrera wrote:

> I think what I should be doing is the same as the returning stuff: keep
> a tupdesc around, and use a single slot, whose descriptor is changed
> just before the projection.

Yes, this works, though it's ugly.  Not any uglier than what's already
there, though, so I think it's okay.

The only thing that I remain unhappy about this patch is the whole
adjust_and_expand_partition_tlist() thing.  I fear we may be doing
redundant and/or misplaced work.  I'll look into it next week.

0001 should be pretty much ready to push -- adjustments to ExecInsert
  and ModifyTableState I already mentioned.

0002 is stuff I would like to get rid of completely -- changes to
  planner code so that it better supports functionality we need for
  0003.

0003 is the main patch.  Compared to the previous version, this one
  reuses slots by switching them to different tupdescs as needed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

19 March 2018, 09:04:18

On 2018/03/05 18:04, Pavan Deolasee wrote:
> On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>> Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
>> expand_targetlist() runs *after*, not before, so how could it have
>> affected the result?
>>
> 
> If I understand correctly, planner must have called expand_targetlist()
> once for the parent relation's descriptor and added any dropped columns
> from the parent relation. So we may not find mapped attributes for those
> dropped columns in the parent. I haven't actually tested this case though.
> 
> I wonder if it makes sense to actually avoid expanding on-conflict-set
> targetlist in case the target is a partition table and deal with it during
> execution, like we are doing now.
>> I'm obviously confused about what
>> expand_targetlist call this comment is talking about.  Anyway I wanted
>> to make it use resjunk entries instead, but that broke some other case
>> that I didn't have time to research yesterday.  I'll get back to this
>> soon, but in the meantime, here's what I have.
>>
> 
> +           conv_setproj =
> +
>  adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),
> +                                                 RelationGetDescr(partrel),
> +
>  RelationGetRelationName(partrel),
> +                                                 firstVarno,
> +                                                 conv_setproj);
> 
> Aren't we are adding back Vars referring to the parent relation, after
> having converted the existing ones to use partition relation's varno? May
> be that works because missing attributes are already added during planning
> and expand_targetlist() here only adds dropped columns, which are just NULL
> constants.

I think this suggestion to defer onConflictSet target list expansion  to
execution time is a good one.  So, in preprocess_targetlist(), we'd only
perform exapand_targetlist on the onConflictSet list if the table is not a
partitioned table.  For partitioned tables, we don't know which
partition(s) will be affected, so it's useless to do the expansion there.
Instead it's better to expand in ExecInitPartitionInfo(), possibly after
changing original TargetEntry nodes to have correct resnos due to
attribute number differences (calling adjust_inherited_tlist(), etc.).

And then since we're not expanding the target list in the planner anymore,
we don't need to install those hacks in adjust_inherited_tlist() like the
patch currently does to ignore entries for dropped columns in the parent
the plan-time expansion adds.

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

19 March 2018, 10:45:11

Thanks for the updated patches.

On 2018/03/18 13:17, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> 
>> I think what I should be doing is the same as the returning stuff: keep
>> a tupdesc around, and use a single slot, whose descriptor is changed
>> just before the projection.
> 
> Yes, this works, though it's ugly.  Not any uglier than what's already
> there, though, so I think it's okay.
> 
> The only thing that I remain unhappy about this patch is the whole
> adjust_and_expand_partition_tlist() thing.  I fear we may be doing
> redundant and/or misplaced work.  I'll look into it next week.
> 
> 0001 should be pretty much ready to push -- adjustments to ExecInsert
>   and ModifyTableState I already mentioned.

This seems like good cleanup.

While at it, why not also get rid of mt_onconflict in favor of always just
using its counterpart in ModifyTable -- onConflictAction?

> 0002 is stuff I would like to get rid of completely -- changes to
>   planner code so that it better supports functionality we need for
>   0003.

Hmm.  I'm not sure if we can completely get rid of this, because we do
need the adjust_inherited_tlist() facility to translate TargetEntry resnos
in any case.

But as I just said in reply to Pavan's email suggesting deferring
onConflistSet expansion to execution time, we don't need the hack in
adjust_inherited_tlist() if we go with the suggestion.

> 0003 is the main patch.  Compared to the previous version, this one
>   reuses slots by switching them to different tupdescs as needed.

Your proposed change to use just one slot (the existing mt_conflproj slot)
sounds good.  Instead, it seems now we have an array to hold tupleDescs
for the onConflistSet target lists for each partition.

Some comments:

1. I noticed a bug that crashes a test in insert_conflit.sql that uses DO
NOTHING instead of DO UPDATE SET.  It's illegal for ExecInitPartitionInfo
to expect mt_conflproj_tupdesc to be valid in the DO NOTHING case, because
ExecInitModifyTable would only set it if (onConflictAction == DO_UPDATE).

2. It seems better to name the new array field in PartitionTupleRouting
partition_conflproj_tupdescs rather than partition_onconfl_tupdescs to be
consistent with the new field in ModifyTableState.

3. I think it was an oversight in my original patch, but it seems we
should allocate the partition_onconfl_tdescs array only if DO UPDATE
action is used.  Also, ri_onConflictSetProj, ri_onConflictSetWhere should
be only set in that case.  OTOH, we always need to set
partition_arbiter_indexes, that is, for both DO NOTHING and DO UPDATE SET
actions.

4. Need to remove the comments for partition_conflproj_slots and
partition_existing_slots, fields of PartitionTupleRouting that no longer
exist.  Instead one for partition_conflproj_tupdescs should be added.

5. I know the following is so as not to break the Assert in
adjust_inherited_tlist(), so why not have a parentOid argument for
adjust_and_expand_partition_tlist()?

+    appinfo.parent_reloid = 1; // dummy  parentRel->rd_id;

6. There is a sentence in the comment above adjust_inherited_tlist():

  Note that this is not needed for INSERT because INSERT isn't
  inheritable.

Maybe, we need to delete that and mention that we do need it in the case
of INSERT ON CONFLICT DO UPDATE on partitioned tables for translating DO
UPDATE SET target list.

7. In ExecInsert, it'd be better to have a partArbiterIndexes, just like
partConflTupdesc in the outermost scope and then do:

+            /* Use the appropriate list of arbiter indexes. */
+            if (mtstate->mt_partition_tuple_routing != NULL)
+                arbiterIndexes = partArbiterIndexes;
+            else
+                arbiterIndexes = node->arbiterIndexes;

and

+        /* Use the appropriate tuple descriptor. */
+        if (mtstate->mt_partition_tuple_routing != NULL)
+            onconfl_tupdesc = partConflTupdesc;
+        else
+            onconfl_tupdesc = mtstate->mt_conflproj_tupdesc;

using arbiterIndexes and onconfl_tupdesc declared in the appropriate scopes.

I have tried to make these changes and attached are the updated patches
containing those, including the change I suggested for 0001 (that is,
getting rid of mt_onconflict).  I also expanded some comments in 0003
while making those changes.

Thanks,
Amit

Attachment

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

19 March 2018, 11:50:10

On 2018/03/19 16:45, Amit Langote wrote:
> I have tried to make these changes and attached are the updated patches
> containing those, including the change I suggested for 0001 (that is,
> getting rid of mt_onconflict).  I also expanded some comments in 0003
> while making those changes.

I realized that there should be a test where transition table is involved
for an ON CONFLICT DO UPDATE on a partitioned table due to relevant
statement trigger on the table; something like the attached.  But due to a
bug being discussed over at [1], we won't get the correct expected output
for the test until the latest patch submitted for that bug [2] is
committed as a bug fix.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/ba19eff9-2120-680e-4671-55a9bea9454f%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/df921671-32df-45ea-c0e4-9b51ee86ba3b%40lab.ntt.co.jp

Attachment

0001-More-tests-for-ON-CONFLICT-DO-UPDATE-on-partitioned-.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Etsuro Fujita

Date:

19 March 2018, 15:59:01

(2018/03/18 13:17), Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> The only thing that I remain unhappy about this patch is the whole
> adjust_and_expand_partition_tlist() thing.  I fear we may be doing
> redundant and/or misplaced work.  I'll look into it next week.

I'm still reviewing the patches, but I really agree on that point.  As 
Pavan mentioned upthread, the onConflictSet tlist for the root parent, 
from which we create a translated onConflictSet tlist for a partition, 
would have already been processed by expand_targetlist() to contain all 
missing columns as well, so I think we could create the tlist for the 
partition by simply re-ordering the expression-converted tlist (ie, 
conv_setproj) based on the conversion map for the partition.  The 
Attached defines a function for that, which could be called, instead of 
calling adjust_and_expand_partition_tlist().  This would allow us to get 
rid of planner changes from the patches.  Maybe I'm missing something, 
though.

Best regards,
Etsuro Fujita

Attachment

simplify-tlist-adjustment.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

20 March 2018, 07:30:26

Fujita-san,

On 2018/03/19 21:59, Etsuro Fujita wrote:
> (2018/03/18 13:17), Alvaro Herrera wrote:
>> Alvaro Herrera wrote:
>> The only thing that I remain unhappy about this patch is the whole
>> adjust_and_expand_partition_tlist() thing.  I fear we may be doing
>> redundant and/or misplaced work.  I'll look into it next week.
> 
> I'm still reviewing the patches, but I really agree on that point.  As
> Pavan mentioned upthread, the onConflictSet tlist for the root parent,
> from which we create a translated onConflictSet tlist for a partition,
> would have already been processed by expand_targetlist() to contain all
> missing columns as well, so I think we could create the tlist for the
> partition by simply re-ordering the expression-converted tlist (ie,
> conv_setproj) based on the conversion map for the partition.  The Attached
> defines a function for that, which could be called, instead of calling
> adjust_and_expand_partition_tlist().  This would allow us to get rid of
> planner changes from the patches.  Maybe I'm missing something, though.

Thanks for the patch.  I can confirm your proposed
adjust_onconflictset_tlist() is enough to replace adjust_inherited_tlist()
+ expand_targetlist() combination (aka
adjust_and_expand_partition_tlist()), thus rendering the planner changes
in this patch unnecessary.  I tested it with a partition tree involving
partitions of varying attribute numbers (dropped columns included) and it
seems to work as expected (as also exercised in regression tests) as shown
below.

Partitioned table p has partitions p1, p2, p3, p4, and p5 whose attributes
look like this; shown as (colname: attnum, ...).

p:  (a: 1, b: 4)
p1: (a: 1, b: 4)
p2: (a: 2, b: 4)
p3: (a: 1, b: 3)
p4: (a: 3, b: 8)
p5: (a: 1, b: 5)

You may notice that all partitions but p1 will have a tuple conversion map
and hence will undergo adjust_onconflictset_tlist() treatment.

insert into p values (1, 'a') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (1, 'b') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (1, 'c') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 0

insert into p values (1) on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 0

insert into p values (2, 'a') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (2, 'b') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (2, 'c') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 0

insert into p values (5, 'a') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (5, 'b') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 1

insert into p values (5, 'c') on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 0

insert into p values (5) on conflict (a) do update set b = excluded.b
where excluded.b = 'b';
INSERT 0 0

select tableoid::regclass, * from p;
 tableoid | a | b
----------+---+---
 p1       | 1 | b
 p2       | 2 | b
 p5       | 5 | b
(3 rows)

I have incorporated your patch in the main patch after updating the
comments a bit.  Also, now that 6666ee49f49 is in [1], the transition
table related tests I proposed yesterday pass nicely.  Instead of posting
as a separate patch, I have merged it with the main patch.  So now that
planner refactoring is unnecessary, attached is just one patch.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6666ee49f49

Attachment

v6-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

20 March 2018, 07:39:26

On 2018/03/20 13:30, Amit Langote wrote:
> I have incorporated your patch in the main patch after updating the
> comments a bit.  Also, now that 6666ee49f49 is in [1], the transition
> table related tests I proposed yesterday pass nicely.  Instead of posting
> as a separate patch, I have merged it with the main patch.  So now that
> planner refactoring is unnecessary, attached is just one patch.

Sorry, I forgot to remove a hunk in the patch affecting
src/include/optimizer/prep.h.  Fixed in the attached updated version.

Thanks,
Amit

Attachment

v7-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Etsuro Fujita

Date:

20 March 2018, 14:53:23

(2018/03/20 13:30), Amit Langote wrote:
> On 2018/03/19 21:59, Etsuro Fujita wrote:
>> (2018/03/18 13:17), Alvaro Herrera wrote:
>>> Alvaro Herrera wrote:
>>> The only thing that I remain unhappy about this patch is the whole
>>> adjust_and_expand_partition_tlist() thing.  I fear we may be doing
>>> redundant and/or misplaced work.  I'll look into it next week.
>>
>> I'm still reviewing the patches, but I really agree on that point.  As
>> Pavan mentioned upthread, the onConflictSet tlist for the root parent,
>> from which we create a translated onConflictSet tlist for a partition,
>> would have already been processed by expand_targetlist() to contain all
>> missing columns as well, so I think we could create the tlist for the
>> partition by simply re-ordering the expression-converted tlist (ie,
>> conv_setproj) based on the conversion map for the partition.  The Attached
>> defines a function for that, which could be called, instead of calling
>> adjust_and_expand_partition_tlist().  This would allow us to get rid of
>> planner changes from the patches.  Maybe I'm missing something, though.
>
> Thanks for the patch.  I can confirm your proposed
> adjust_onconflictset_tlist() is enough to replace adjust_inherited_tlist()
> + expand_targetlist() combination (aka
> adjust_and_expand_partition_tlist()), thus rendering the planner changes
> in this patch unnecessary.  I tested it with a partition tree involving
> partitions of varying attribute numbers (dropped columns included) and it
> seems to work as expected (as also exercised in regression tests) as shown
> below.

Thanks for testing!

> I have incorporated your patch in the main patch after updating the
> comments a bit.  Also, now that 6666ee49f49 is in [1], the transition
> table related tests I proposed yesterday pass nicely.  Instead of posting
> as a separate patch, I have merged it with the main patch.  So now that
> planner refactoring is unnecessary, attached is just one patch.

Here are comments on executor changes in (the latest version of) the patch:

@@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate,
              ItemPointerData conflictTid;
              bool        specConflict;
              List       *arbiterIndexes;
+            PartitionTupleRouting *proute =
+                                        mtstate->mt_partition_tuple_routing;

-            arbiterIndexes = node->arbiterIndexes;
+            /* Use the appropriate list of arbiter indexes. */
+            if (mtstate->mt_partition_tuple_routing != NULL)
+            {
+                Assert(partition_index >= 0 && proute != NULL);
+                arbiterIndexes =
+                        proute->partition_arbiter_indexes[partition_index];
+            }
+            else
+                arbiterIndexes = node->arbiterIndexes;

To handle both cases the same way, I wonder if it would be better to 
have the arbiterindexes list in ResultRelInfo as well, as mentioned by 
Alvaro upthread, or to re-add mt_arbiterindexes as before and set it to 
proute->partition_arbiter_indexes[partition_index] before we get here, 
maybe in ExecPrepareTupleRouting, in the case of tuple routing.

  ExecOnConflictUpdate(ModifyTableState *mtstate,
                       ResultRelInfo *resultRelInfo,
+                     TupleDesc onConflictSetTupdesc,
                       ItemPointer conflictTid,
                       TupleTableSlot *planSlot,
                       TupleTableSlot *excludedSlot,
@@ -1419,6 +1459,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
      ExecCheckHeapTupleVisible(estate, &tuple, buffer);

      /* Store target's existing tuple in the state's dedicated slot */
+    ExecSetSlotDescriptor(mtstate->mt_existing, RelationGetDescr(relation));
      ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);

      /*
@@ -1462,6 +1503,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
      }

      /* Project the new tuple version */
+    ExecSetSlotDescriptor(mtstate->mt_conflproj, onConflictSetTupdesc);
      ExecProject(resultRelInfo->ri_onConflictSetProj);

Can we do ExecSetSlotDescriptor for mtstate->mt_existing and 
mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple 
routing?  That would make the API changes to ExecOnConflictUpdate 
unnecessary.

@@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState 
*estate, int eflags)
          econtext = mtstate->ps.ps_ExprContext;
          relationDesc = resultRelInfo->ri_RelationDesc->rd_att;

-        /* initialize slot for the existing tuple */
-        mtstate->mt_existing =
-            ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc);
+        /*
+         * Initialize slot for the existing tuple.  We determine which
+         * tupleDesc to use for this after we have determined which relation
+         * the insert/update will be applied to, possibly after performing
+         * tuple routing.
+         */
+        mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state, NULL);

          /* carried forward solely for the benefit of explain */
          mtstate->mt_excludedtlist = node->exclRelTlist;
@@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState 
*estate, int eflags)
          /* create target slot for UPDATE SET projection */
          tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
                                   relationDesc->tdhasoid);
+        PinTupleDesc(tupDesc);
+        mtstate->mt_conflproj_tupdesc = tupDesc;
+
+        /*
+         * Just like the "existing tuple" slot, we'll defer deciding which
+         * tupleDesc to use for this slot to a point where tuple routing has
+         * been performed.
+         */
          mtstate->mt_conflproj =
-            ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);
+            ExecInitExtraTupleSlot(mtstate->ps.state, NULL);

If we do ExecInitExtraTupleSlot for mtstate->mt_existing and 
mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple 
routing, as said above, we wouldn't need this changes.  I think doing 
that only in the case of tuple routing and keeping this as-is would be 
better because that would save cycles in the normal case.

I'll look at other parts of the patch next.

Best regards,
Etsuro Fujita

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Pavan Deolasee

Date:

22 March 2018, 07:34:51

On Tue, Mar 20, 2018 at 10:09 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/03/20 13:30, Amit Langote wrote:
> I have incorporated your patch in the main patch after updating the
> comments a bit. Also, now that 6666ee49f49 is in [1], the transition
> table related tests I proposed yesterday pass nicely. Instead of posting
> as a separate patch, I have merged it with the main patch. So now that
> planner refactoring is unnecessary, attached is just one patch.

Sorry, I forgot to remove a hunk in the patch affecting
src/include/optimizer/prep.h. Fixed in the attached updated version.

Thanks for writing a new version. A few comments:

<para>

- Using the <literal>ON CONFLICT</literal> clause with partitioned tables

- will cause an error if the conflict target is specified (see

- <xref linkend="sql-on-conflict" /> for more details on how the clause

- works). Therefore, it is not possible to specify

- <literal>DO UPDATE</literal> as the alternative action, because

- specifying the conflict target is mandatory in that case. On the other

- hand, specifying <literal>DO NOTHING</literal> as the alternative action

- works fine provided the conflict target is not specified. In that case,

- unique constraints (or exclusion constraints) of the individual leaf

- partitions are considered.

- </para>

- </listitem>

We should document it somewhere that partition key update is not supported by

ON CONFLICT DO UPDATE

* get_partition_parent

+ * Obtain direct parent or topmost ancestor of given relation

- * Returns inheritance parent of a partition by scanning pg_inherits

+ * Returns direct inheritance parent of a partition by scanning pg_inherits;

+ * or, if 'getroot' is true, the topmost parent in the inheritance hierarchy.

* Note: Because this function assumes that the relation whose OID is passed

* as an argument will have precisely one parent, it should only be called

* when it is known that the relation is a partition.

Given that most callers only look for immediate parent, I wonder if it makes

sense to have a new function, get_partition_root(), instead of changing

signature of the current function. That will reduce foot-print of this patch

quite a lot.

@@ -36,6 +38,7 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,

Datum *values,

bool *isnull,

int maxfieldlen);

+static List *adjust_onconflictset_tlist(List *tlist, TupleConversionMap *map);

We should name this function in a more generic way, given that it's going to be

used for other things too. What about adjust_partition_tlist?

+ /*

+ * If partition's tuple descriptor differs from the root parent,

+ * we need to adjust the onConflictSet target list to account for

+ * differences in attribute numbers.

+ */

+ if (map != NULL)

+ {

+ /*

+ * First convert Vars to contain partition's atttribute

+ * numbers.

+ */

+ /* Convert Vars referencing EXCLUDED pseudo-relation. */

+ onconflset = map_partition_varattnos(node->onConflictSet,

+ INNER_VAR,

+ partrel,

+ firstResultRel, NULL);

Are we not modifying node->onConflictSet in place? Or does

map_partition_varattnos() create a fresh copy before scribbling on the input?

If it's former then I guess that's a problem. If it's latter then we ought to

have comments explaining that.

+ tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid);

+ PinTupleDesc(tupDesc);

Why do we need to pin the descriptor? If we do need, why don't we need

corresponding ReleaseTupDesc() call?

I am still confused if the partition_conflproj_tdescs really belongs to

PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for the

MERGE patch, I moved everything to a new struct and made it part of the

ResultRelInfo. If no re-mapping is necessary, we can just point to the struct

in the root relation's ResultRelInfo. Otherwise create and populate a new one

in the partition relation's ResultRelInfo.

+ leaf_part_rri->ri_onConflictSetWhere =

+ ExecInitQual(onconflwhere, &mtstate->ps);

+ }

So ri_onConflictSetWhere and ri_onConflictSetProj are part of the

ResultRelInfo. Why not move mt_conflproj_tupdesc, partition_conflproj_tdescs,

partition_arbiter_indexes etc to the ResultRelInfo as well?

+/*

+ * Adjust the targetlist entries of an inherited ON CONFLICT DO UPDATE

+ * operation for a given partition

+ *

As I said above, we should disassociate this function from ON CONFLICT DO

UPDATE and rather have it as a general purpose facility.

+ * The expressions have already been fixed, but we have to make sure that the

+ * target resnos match the partition. In some cases, this can force us to

+ * re-order the tlist to preserve resno ordering.

+ *

Can we have some explanation regarding how the targetlist is reordered? I know

the function does that by updating the resno in place, but some explanation

would help. Also, should we add an assertion-build check to confirm that the

resultant list is actually ordered?

@@ -66,7 +67,8 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,

EState *estate,

PartitionTupleRouting *proute,

ResultRelInfo *targetRelInfo,

- TupleTableSlot *slot);

+ TupleTableSlot *slot,

+ int *partition_index);

static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);

static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);

static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);

@@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,

TupleTableSlot *slot,

TupleTableSlot *planSlot,

EState *estate,

+ int partition_index,

bool canSetTag)

{

HeapTuple tuple;

If we move the list of arbiter indexes and the tuple desc to ResultRelInfo, as

suggested above, I think we can avoid making any API changes to

ExecPrepareTupleRouting and ExecInsert.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

22 March 2018, 12:31:59

Fujita-san, Pavan,

Thank you both for reviewing.  Replying to both emails here.

On 2018/03/20 20:53, Etsuro Fujita wrote:
> Here are comments on executor changes in (the latest version of) the patch:
> 
> @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate,
>              ItemPointerData conflictTid;
>              bool        specConflict;
>              List       *arbiterIndexes;
> +            PartitionTupleRouting *proute =
> +                                        mtstate->mt_partition_tuple_routing;
> 
> -            arbiterIndexes = node->arbiterIndexes;
> +            /* Use the appropriate list of arbiter indexes. */
> +            if (mtstate->mt_partition_tuple_routing != NULL)
> +            {
> +                Assert(partition_index >= 0 && proute != NULL);
> +                arbiterIndexes =
> +                        proute->partition_arbiter_indexes[partition_index];
> +            }
> +            else
> +                arbiterIndexes = node->arbiterIndexes;
> 
> To handle both cases the same way, I wonder if it would be better to have
> the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro
> upthread, or to re-add mt_arbiterindexes as before and set it to
> proute->partition_arbiter_indexes[partition_index] before we get here,
> maybe in ExecPrepareTupleRouting, in the case of tuple routing.

It's a good idea.  I somehow missed that Alvaro had already mentioned it.

In HEAD, we now have ri_onConflictSetProj and ri_onConflictSetWhere.  I
propose we name the field ri_onConflictArbiterIndexes as done in the
updated patch.

>  ExecOnConflictUpdate(ModifyTableState *mtstate,
>                       ResultRelInfo *resultRelInfo,
> +                     TupleDesc onConflictSetTupdesc,
>                       ItemPointer conflictTid,
>                       TupleTableSlot *planSlot,
>                       TupleTableSlot *excludedSlot,
> @@ -1419,6 +1459,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>      ExecCheckHeapTupleVisible(estate, &tuple, buffer);
> 
>      /* Store target's existing tuple in the state's dedicated slot */
> +    ExecSetSlotDescriptor(mtstate->mt_existing, RelationGetDescr(relation));
>      ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);
> 
>      /*
> @@ -1462,6 +1503,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>      }
> 
>      /* Project the new tuple version */
> +    ExecSetSlotDescriptor(mtstate->mt_conflproj, onConflictSetTupdesc);
>      ExecProject(resultRelInfo->ri_onConflictSetProj);
> 
> Can we do ExecSetSlotDescriptor for mtstate->mt_existing and
> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple
> routing?  That would make the API changes to ExecOnConflictUpdate
> unnecessary.

That's a good idea too, so done.

> 
> @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState
> *estate, int eflags)
>          econtext = mtstate->ps.ps_ExprContext;
>          relationDesc = resultRelInfo->ri_RelationDesc->rd_att;
> 
> -        /* initialize slot for the existing tuple */
> -        mtstate->mt_existing =
> -            ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc);
> +        /*
> +         * Initialize slot for the existing tuple.  We determine which
> +         * tupleDesc to use for this after we have determined which relation
> +         * the insert/update will be applied to, possibly after performing
> +         * tuple routing.
> +         */
> +        mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state,
> NULL);
> 
>          /* carried forward solely for the benefit of explain */
>          mtstate->mt_excludedtlist = node->exclRelTlist;
> @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState
> *estate, int eflags)
>          /* create target slot for UPDATE SET projection */
>          tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
>                                   relationDesc->tdhasoid);
> +        PinTupleDesc(tupDesc);
> +        mtstate->mt_conflproj_tupdesc = tupDesc;
> +
> +        /*
> +         * Just like the "existing tuple" slot, we'll defer deciding which
> +         * tupleDesc to use for this slot to a point where tuple routing has
> +         * been performed.
> +         */
>          mtstate->mt_conflproj =
> -            ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);
> +            ExecInitExtraTupleSlot(mtstate->ps.state, NULL);
> 
> If we do ExecInitExtraTupleSlot for mtstate->mt_existing and
> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple
> routing, as said above, we wouldn't need this changes.  I think doing that
> only in the case of tuple routing and keeping this as-is would be better
> because that would save cycles in the normal case.

Hmm, I think we shouldn't be doing ExecInitExtraTupleSlot in
ExecPrepareTupleRouting, because we shouldn't have more than one instance
of mtstate->mt_existing and mtstate->mt_conflproj slots.

As you also said above, I think you meant to say here that we do
ExecInitExtraTupleSlot only once for both mtstate->mt_existing and
mtstate->mt_conflproj in ExecInitModifyTable and only do
ExecSetSlotDescriptor in ExecPrepareTupleRouting.  I have changed it so
that ExecInitModifyTable now both creates the slot and sets the descriptor
for non-tuple-routing cases and only creates but doesn't set the
descriptor in the tuple-routing case.

For ExecPrepareTupleRouting to be able to access the tupDesc of the
onConflictSet target list, I've added ri_onConflictSetProjTupDesc which is
set by ExecInitPartitionInfo on first call for a give partition.  This is
also suggested by Pavan in his review.

Considering all of that, both mt_conflproj_tupdesc and
partition_conflproj_tdescs (the array in PartitionTupleRouting) no longer
exist in the patch.  And since arbiterIndexes has been moved into
ResultRelInfo too, partition_arbiter_indexes (the array in
PartitionTupleRouting) is gone too.

On 2018/03/22 13:34, Pavan Deolasee wrote:
> Thanks for writing a new version. A few comments:
>
>
>       <listitem>
>        <para>
> -       Using the <literal>ON CONFLICT</literal> clause with partitioned
> tables
> -       will cause an error if the conflict target is specified (see
> -       <xref linkend="sql-on-conflict" /> for more details on how the
> clause
> -       works).  Therefore, it is not possible to specify
> -       <literal>DO UPDATE</literal> as the alternative action, because
> -       specifying the conflict target is mandatory in that case.  On the
> other
> -       hand, specifying <literal>DO NOTHING</literal> as the alternative
> action
> -       works fine provided the conflict target is not specified.  In that
> case,
> -       unique constraints (or exclusion constraints) of the individual leaf
> -       partitions are considered.
> -      </para>
> -     </listitem>
>
> We should document it somewhere that partition key update is not supported
> by
> ON CONFLICT DO UPDATE

Agreed.  I have added a line on INSERT reference page to mention this
limitation.

>  /*
>   * get_partition_parent
> + * Obtain direct parent or topmost ancestor of given relation
>   *
> - * Returns inheritance parent of a partition by scanning pg_inherits
> + * Returns direct inheritance parent of a partition by scanning
> pg_inherits;
> + * or, if 'getroot' is true, the topmost parent in the inheritance
> hierarchy.
>   *
>   * Note: Because this function assumes that the relation whose OID is
> passed
>   * as an argument will have precisely one parent, it should only be called
>   * when it is known that the relation is a partition.
>   */
>
> Given that most callers only look for immediate parent, I wonder if it makes
> sense to have a new function, get_partition_root(), instead of changing
> signature of the current function. That will reduce foot-print of this patch
> quite a lot.

It seems like a good idea, so done that way.

> @@ -36,6 +38,7 @@ static char
> *ExecBuildSlotPartitionKeyDescription(Relation rel,
>   Datum *values,
>   bool *isnull,
>   int maxfieldlen);
> +static List *adjust_onconflictset_tlist(List *tlist, TupleConversionMap
> *map);
>
> We should name this function in a more generic way, given that it's going
> to be
> used for other things too. What about adjust_partition_tlist?

I think that makes sense.  We were trying to use adjust_inherited_tlist in
the earlier versions of this patch, so adjust_partition_tlist sounds like
a good name for this piece of code.

> +
> + /*
> + * If partition's tuple descriptor differs from the root parent,
> + * we need to adjust the onConflictSet target list to account for
> + * differences in attribute numbers.
> + */
> + if (map != NULL)
> + {
> + /*
> + * First convert Vars to contain partition's atttribute
> + * numbers.
> + */
> +
> + /* Convert Vars referencing EXCLUDED pseudo-relation. */
> + onconflset = map_partition_varattnos(node->onConflictSet,
> + INNER_VAR,
> + partrel,
> + firstResultRel, NULL);
>
> Are we not modifying node->onConflictSet in place? Or does
> map_partition_varattnos() create a fresh copy before scribbling on the
> input?
> If it's former then I guess that's a problem. If it's latter then we ought
> to
> have comments explaining that.

A copy is made before scribbling.  Clarified that in the nearby comment.

> + tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid);
> + PinTupleDesc(tupDesc);
>
> Why do we need to pin the descriptor? If we do need, why don't we need
> corresponding ReleaseTupDesc() call?

PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
which it wasn't clear to me either why it was needed.  Looking at it
closely, it seems we can get rid of it if for the lack of corresponding
ReleaseTupleDesc().  ExecSetSlotDescriptor() seems to take care of pinning
and releasing tuple descriptors that are passed to it.  If some
partition's tupDesc remains pinned because it was the last one that was
passed to it, the final ExecResetTupleTable will take care of releasing it.

I have removed the instances of PinTupleDesc in the updated patch, but I'm
not sure why the loose PinTupleDesc() in the previous version of the patch
didn't cause reference leak warnings or some such.

> I am still confused if the partition_conflproj_tdescs really belongs to
> PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for
> the
> MERGE patch, I moved everything to a new struct and made it part of the
> ResultRelInfo. If no re-mapping is necessary, we can just point to the
> struct
> in the root relation's ResultRelInfo. Otherwise create and populate a new
> one
> in the partition relation's ResultRelInfo.
>
> + leaf_part_rri->ri_onConflictSetWhere =
> + ExecInitQual(onconflwhere, &mtstate->ps);
> + }
>
> So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> ResultRelInfo. Why not move mt_conflproj_tupdesc,
> partition_conflproj_tdescs,
> partition_arbiter_indexes etc to the ResultRelInfo as well?

I have moved both the projection tupdesc and the arbiter indexes list into
ResultRelInfo as I wrote above.

> +
> +/*
> + * Adjust the targetlist entries of an inherited ON CONFLICT DO UPDATE
> + * operation for a given partition
> + *
>
> As I said above, we should disassociate this function from ON CONFLICT DO
> UPDATE and rather have it as a general purpose facility.

OK, have fixed the comment and the name as mentioned above.

> + * The expressions have already been fixed, but we have to make sure that
> the
> + * target resnos match the partition.  In some cases, this can force us to
> + * re-order the tlist to preserve resno ordering.
> + *
>
> Can we have some explanation regarding how the targetlist is reordered? I
> know
> the function does that by updating the resno in place, but some explanation
> would help. Also, should we add an assertion-build check to confirm that the
> resultant list is actually ordered?

OK, added a comment and also the assertion-build check on the order of
entries.

>
> @@ -66,7 +67,8 @@ static TupleTableSlot
> *ExecPrepareTupleRouting(ModifyTableState *mtstate,
>   EState *estate,
>   PartitionTupleRouting *proute,
>   ResultRelInfo *targetRelInfo,
> - TupleTableSlot *slot);
> + TupleTableSlot *slot,
> + int *partition_index);
>  static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
>  static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
>  static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
> @@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,
>      TupleTableSlot *slot,
>      TupleTableSlot *planSlot,
>      EState *estate,
> +    int partition_index,
>      bool canSetTag)
>  {
>   HeapTuple tuple;
>
> If we move the list of arbiter indexes and the tuple desc to ResultRelInfo,
> as
> suggested above, I think we can avoid making any API changes to
> ExecPrepareTupleRouting and ExecInsert.

Those API changes are no longer part of the patch.


Attached please find an updated version.

Thanks,
Amit

Attachment

v8-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Pavan Deolasee

Date:

22 March 2018, 14:48:19

On Thu, Mar 22, 2018 at 3:01 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

>
> Why do we need to pin the descriptor? If we do need, why don't we need
> corresponding ReleaseTupDesc() call?

PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
which it wasn't clear to me either why it was needed. Looking at it
closely, it seems we can get rid of it if for the lack of corresponding
ReleaseTupleDesc(). ExecSetSlotDescriptor() seems to take care of pinning
and releasing tuple descriptors that are passed to it. If some
partition's tupDesc remains pinned because it was the last one that was
passed to it, the final ExecResetTupleTable will take care of releasing it.

I have removed the instances of PinTupleDesc in the updated patch, but I'm
not sure why the loose PinTupleDesc() in the previous version of the patch
didn't cause reference leak warnings or some such.

Yeah, it wasn't clear to me as well. But I did not investigate. May be Alvaro knows better.

> I am still confused if the partition_conflproj_tdescs really belongs to
> PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for
> the
> MERGE patch, I moved everything to a new struct and made it part of the
> ResultRelInfo. If no re-mapping is necessary, we can just point to the
> struct
> in the root relation's ResultRelInfo. Otherwise create and populate a new
> one
> in the partition relation's ResultRelInfo.
>
> + leaf_part_rri->ri_onConflictSetWhere =
> + ExecInitQual(onconflwhere, &mtstate->ps);
> + }
>
> So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> ResultRelInfo. Why not move mt_conflproj_tupdesc,
> partition_conflproj_tdescs,
> partition_arbiter_indexes etc to the ResultRelInfo as well?

I have moved both the projection tupdesc and the arbiter indexes list into
ResultRelInfo as I wrote above.

Thanks. It's looking much better now. I think we can possibly move all ON CONFLICT related members to a separate structure and just copy the pointer to the structure if (map == NULL). That might make the code a bit more tidy.

Is there anything that needs to be done for transition tables? I checked and didn't see anything, but please check.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

23 March 2018, 05:22:26

On 2018/03/22 20:48, Pavan Deolasee wrote:
> Thanks. It's looking much better now.

Thanks.

> I think we can possibly move all ON
> CONFLICT related members to a separate structure and just copy the pointer
> to the structure if (map == NULL). That might make the code a bit more tidy.

OK, I tried that in the attached updated patch.

> Is there anything that needs to be done for transition tables? I checked
> and didn't see anything, but please check.

There doesn't seem to be anything that this patch has to do for transition
tables.  If you look at the tests I added in triggers.sql which exercise
INSERT ON CONFLICT's interaction with transition tables, you can see that
we get the same output for a partitioned table as we get for a normal table.

Thanks,
Amit

Attachment

v9-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Etsuro Fujita

Date:

23 March 2018, 14:02:14

(2018/03/22 18:31), Amit Langote wrote:
> On 2018/03/20 20:53, Etsuro Fujita wrote:
>> Here are comments on executor changes in (the latest version of) the patch:
>>
>> @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate,
>>               ItemPointerData conflictTid;
>>               bool        specConflict;
>>               List       *arbiterIndexes;
>> +            PartitionTupleRouting *proute =
>> +                                        mtstate->mt_partition_tuple_routing;
>>
>> -            arbiterIndexes = node->arbiterIndexes;
>> +            /* Use the appropriate list of arbiter indexes. */
>> +            if (mtstate->mt_partition_tuple_routing != NULL)
>> +            {
>> +                Assert(partition_index>= 0&&  proute != NULL);
>> +                arbiterIndexes =
>> +                        proute->partition_arbiter_indexes[partition_index];
>> +            }
>> +            else
>> +                arbiterIndexes = node->arbiterIndexes;
>>
>> To handle both cases the same way, I wonder if it would be better to have
>> the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro
>> upthread, or to re-add mt_arbiterindexes as before and set it to
>> proute->partition_arbiter_indexes[partition_index] before we get here,
>> maybe in ExecPrepareTupleRouting, in the case of tuple routing.
>
> It's a good idea.  I somehow missed that Alvaro had already mentioned it.
>
> In HEAD, we now have ri_onConflictSetProj and ri_onConflictSetWhere.  I
> propose we name the field ri_onConflictArbiterIndexes as done in the
> updated patch.

I like that naming.

>> @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState
>> *estate, int eflags)
>>           econtext = mtstate->ps.ps_ExprContext;
>>           relationDesc = resultRelInfo->ri_RelationDesc->rd_att;
>>
>> -        /* initialize slot for the existing tuple */
>> -        mtstate->mt_existing =
>> -            ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc);
>> +        /*
>> +         * Initialize slot for the existing tuple.  We determine which
>> +         * tupleDesc to use for this after we have determined which relation
>> +         * the insert/update will be applied to, possibly after performing
>> +         * tuple routing.
>> +         */
>> +        mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state,
>> NULL);
>>
>>           /* carried forward solely for the benefit of explain */
>>           mtstate->mt_excludedtlist = node->exclRelTlist;
>> @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState
>> *estate, int eflags)
>>           /* create target slot for UPDATE SET projection */
>>           tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
>>                                    relationDesc->tdhasoid);
>> +        PinTupleDesc(tupDesc);
>> +        mtstate->mt_conflproj_tupdesc = tupDesc;
>> +
>> +        /*
>> +         * Just like the "existing tuple" slot, we'll defer deciding which
>> +         * tupleDesc to use for this slot to a point where tuple routing has
>> +         * been performed.
>> +         */
>>           mtstate->mt_conflproj =
>> -            ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);
>> +            ExecInitExtraTupleSlot(mtstate->ps.state, NULL);
>>
>> If we do ExecInitExtraTupleSlot for mtstate->mt_existing and
>> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple
>> routing, as said above, we wouldn't need this changes.  I think doing that
>> only in the case of tuple routing and keeping this as-is would be better
>> because that would save cycles in the normal case.
>
> Hmm, I think we shouldn't be doing ExecInitExtraTupleSlot in
> ExecPrepareTupleRouting, because we shouldn't have more than one instance
> of mtstate->mt_existing and mtstate->mt_conflproj slots.

Yeah, I think so too.  What I was going to say here is 
ExecSetSlotDescriptor, not ExecInitExtraTupleSlot, as you said below. 
Sorry about the incorrectness.  I guess I was too tired when writing 
that comments.

> As you also said above, I think you meant to say here that we do
> ExecInitExtraTupleSlot only once for both mtstate->mt_existing and
> mtstate->mt_conflproj in ExecInitModifyTable and only do
> ExecSetSlotDescriptor in ExecPrepareTupleRouting.

That's right.

> I have changed it so
> that ExecInitModifyTable now both creates the slot and sets the descriptor
> for non-tuple-routing cases and only creates but doesn't set the
> descriptor in the tuple-routing case.

IMHO I don't see much value in modifying code as such, because we do 
ExecSetSlotDescriptor for mt_existing and mt_conflproj in 
ExecPrepareTupleRouting for every inserted tuple.  So, I would leave 
that as-is, to keep that simple.

> For ExecPrepareTupleRouting to be able to access the tupDesc of the
> onConflictSet target list, I've added ri_onConflictSetProjTupDesc which is
> set by ExecInitPartitionInfo on first call for a give partition.  This is
> also suggested by Pavan in his review.

Seems like a good idea.

Here are some comments on the latest version of the patch:

+               /*
+                * Caller must set mtstate->mt_conflproj's tuple 
descriptor to
+                * this one before trying to use it for projection.
+                */
+               tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid);
+               leaf_part_rri->ri_onConflictSet->proj =
+                       ExecBuildProjectionInfo(onconflset, econtext,
+                                               mtstate->mt_conflproj,
+                                               &mtstate->ps, partrelDesc);

ExecBuildProjectionInfo is called without setting the tuple descriptor 
of mtstate->mt_conflproj to tupDesc.  That might work at least for now, 
but I think it's a good thing to set it appropriately to make that 
future proof.

+            * This corresponds to a dropped attribute in the partition, for
+            * which we enerate a dummy entry with resno matching the
+            * partition's attno.

s/enerate/generate/

+ * OnConflictSetState
+ *
+ * Contains execution time state of a ON CONFLICT DO UPDATE operation, 
which
+ * includes the state of projection, tuple descriptor of the 
projection, and
+ * WHERE quals if any.

s/a ON/an ON/

+typedef struct OnConflictSetState
+{  /* for computing ON CONFLICT DO UPDATE SET */

This is nitpicking, but this wouldn't follow the project style, so I 
think that needs re-formatting.

I'll look at the patch a little bit more early next week.

Thanks for updating the patch!

Best regards,
Etsuro Fujita

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

23 March 2018, 17:34:17

Thanks for these changes.  I'm going over this now, with intention to
push it shortly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

24 March 2018, 03:23:40

I made a bunch of further edits and I think this v10 is ready to push.
Before doing so I'll give it a final look, particularly because of the
new elog(ERROR) I added.  Post-commit review is of course always
appreciated.

Most notable change is because I noticed that if you mention an
intermediate partition level in the INSERT command, and the index is on
the top level, arbiter index selection fails to find the correct index
because it walks all the way to the top instead of stopping in the
middle, as it should (the command was still working because it ended up
with an empty arbiter index list).

To fix this, I had to completely rework the "get partition parent root"
stuff into "get list of ancestors of this partition".

Because of this, I added a new check that the partition's arbiter index
list is same length as parent's; if not, throw an error.  I couldn't get
it to fire (so it's just an elog not ereport), but maybe I just didn't
try any weird enough scenarios.

Other changes:

* I added a copyObject() call for nodes we're operating upon.  Maybe
  this is unnecessary but the comments claimed "we're working on a copy"
  and I couldn't find any place where we were actually making one.
  Anyway it seems sane to make a copy, because we're scribbling on those
  nodes ... I hope I didn't introduce any serious memory leaks.

* I made the new OnConflictSetState thing into a proper node.  Like
  ResultRelInfo, it doesn't have any niceties like nodeToString support,
  but it seems saner this way (palloc -> makeNode).  I reworked the
  formatting of that struct definition too, and renamed members.

* I removed an assertion block at the bottom of adjust_partition_tlist.
  It seemed quite pointless, since it was just about checking that the
  resno values were sorted, but by construction we already know that
  they are indeed sorted ...

* General code style improvements, comment rewording, etc.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v10-0001-on-conflict.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

26 March 2018, 05:30:51

On 2018/03/24 9:23, Alvaro Herrera wrote:
> I made a bunch of further edits and I think this v10 is ready to push.
> Before doing so I'll give it a final look, particularly because of the
> new elog(ERROR) I added.  Post-commit review is of course always
> appreciated.
> 
> Most notable change is because I noticed that if you mention an
> intermediate partition level in the INSERT command, and the index is on
> the top level, arbiter index selection fails to find the correct index
> because it walks all the way to the top instead of stopping in the
> middle, as it should (the command was still working because it ended up
> with an empty arbiter index list).

Good catch!

> To fix this, I had to completely rework the "get partition parent root"
> stuff into "get list of ancestors of this partition".

I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough
instead of creating a list of ancestors and then looping over it as you've
done, but maybe what you have here is fine.

> Because of this, I added a new check that the partition's arbiter index
> list is same length as parent's; if not, throw an error.  I couldn't get
> it to fire (so it's just an elog not ereport), but maybe I just didn't
> try any weird enough scenarios.
> 
> Other changes:
> 
> * I added a copyObject() call for nodes we're operating upon.  Maybe
>   this is unnecessary but the comments claimed "we're working on a copy"
>   and I couldn't find any place where we were actually making one.
>   Anyway it seems sane to make a copy, because we're scribbling on those
>   nodes ... I hope I didn't introduce any serious memory leaks.

That seems fine as ExecInitPartitionInfo allocates in the query context
(es_query_cxt).

> * I made the new OnConflictSetState thing into a proper node.  Like
>   ResultRelInfo, it doesn't have any niceties like nodeToString support,
>   but it seems saner this way (palloc -> makeNode).  I reworked the
>   formatting of that struct definition too, and renamed members.

Looks good, thanks.

> * I removed an assertion block at the bottom of adjust_partition_tlist.
>   It seemed quite pointless, since it was just about checking that the
>   resno values were sorted, but by construction we already know that
>   they are indeed sorted ...

Hmm yes.

> * General code style improvements, comment rewording, etc.

There was one comment in Fujita-san's review he posted on Friday [1] that
doesn't seem to be addressed in v10, which I think we probably should.  It
was this comment:

"ExecBuildProjectionInfo is called without setting the tuple descriptor of
mtstate->mt_conflproj to tupDesc.  That might work at least for now, but I
think it's a good thing to set it appropriately to make that future proof."

All of his other comments seem to have been taken care of in v10.  I have
fixed the above one in the attached updated version.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/5AB4DEB6.2020901%40lab.ntt.co.jp

Attachment

v11-0001-on-conflict.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

26 March 2018, 17:20:16

Pushed now.

Amit Langote wrote:
> On 2018/03/24 9:23, Alvaro Herrera wrote:

> > To fix this, I had to completely rework the "get partition parent root"
> > stuff into "get list of ancestors of this partition".
> 
> I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough
> instead of creating a list of ancestors and then looping over it as you've
> done, but maybe what you have here is fine.

Yeah, I wondered about doing it that way too (since you can stop looking
early), but decided that I didn't like repeatedly opening pg_inherits
for each level.  Anyway the most common case is a single level, and in
rare cases two levels ... I don't think we're going to see much more
than that.  So it doesn't matter too much.  We can refine later anyway,
if this becomes a hot spot (I doubt it TBH).

> > * General code style improvements, comment rewording, etc.
> 
> There was one comment in Fujita-san's review he posted on Friday [1] that
> doesn't seem to be addressed in v10, which I think we probably should.  It
> was this comment:
> 
> "ExecBuildProjectionInfo is called without setting the tuple descriptor of
> mtstate->mt_conflproj to tupDesc.  That might work at least for now, but I
> think it's a good thing to set it appropriately to make that future proof."
> 
> All of his other comments seem to have been taken care of in v10.  I have
> fixed the above one in the attached updated version.

I was of two minds about this item myself; we don't use the tupdesc for
anything at that point and I expect more things would break if we
required that.  But I don't think it hurts, so I kept it.

The one thing I wasn't terribly in love with is the four calls to
map_partition_varattnos(), creating the attribute map four times ... but
we already have it in the TupleConversionMap, no?  Looks like we could
save a bunch of work there.

And a final item is: can we have a whole-row expression in the clauses?
We currently don't handle those either, not even to throw an error.
[figures a test case] ... and now that I test it, it does crash!

create table part (a int primary key, b text) partition by range (a);
create table part1 (b text, a int not null);
alter table part attach partition part1 for values from (1) to (1000);
insert into part values (1, 'two') on conflict (a)
  do update set b = format('%s (was %s)', excluded.b, part.b)
  where part.* *<> (1, text 'two');

I think this means we should definitely handle found_whole_row.  (If you
create part1 in the normal way, it works as you'd expect.)

I'm going to close a few other things first, then come back to this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

27 March 2018, 07:27:29

On 2018/03/26 23:20, Alvaro Herrera wrote:
> Pushed now.

Thank you!

> Amit Langote wrote:
>> On 2018/03/24 9:23, Alvaro Herrera wrote:
> 
>>> To fix this, I had to completely rework the "get partition parent root"
>>> stuff into "get list of ancestors of this partition".
>>
>> I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough
>> instead of creating a list of ancestors and then looping over it as you've
>> done, but maybe what you have here is fine.
> 
> Yeah, I wondered about doing it that way too (since you can stop looking
> early), but decided that I didn't like repeatedly opening pg_inherits
> for each level.  Anyway the most common case is a single level, and in
> rare cases two levels ... I don't think we're going to see much more
> than that.  So it doesn't matter too much.  We can refine later anyway,
> if this becomes a hot spot (I doubt it TBH).

Yes, I suppose.

>>> * General code style improvements, comment rewording, etc.
>>
>> There was one comment in Fujita-san's review he posted on Friday [1] that
>> doesn't seem to be addressed in v10, which I think we probably should.  It
>> was this comment:
>>
>> "ExecBuildProjectionInfo is called without setting the tuple descriptor of
>> mtstate->mt_conflproj to tupDesc.  That might work at least for now, but I
>> think it's a good thing to set it appropriately to make that future proof."
>>
>> All of his other comments seem to have been taken care of in v10.  I have
>> fixed the above one in the attached updated version.
> 
> I was of two minds about this item myself; we don't use the tupdesc for
> anything at that point and I expect more things would break if we
> required that.  But I don't think it hurts, so I kept it.
> 
> The one thing I wasn't terribly in love with is the four calls to
> map_partition_varattnos(), creating the attribute map four times ... but
> we already have it in the TupleConversionMap, no?  Looks like we could
> save a bunch of work there.

Hmm, actually we can't use that map, assuming you're talking about the
following map:

  TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];

We can only use that to tell if we need converting expressions (as we
currently do), but it cannot be used to actually convert the expressions.
The map in question is for use by do_convert_tuple(), not to map varattnos
in Vars using map_variable_attnos().

But it's definitely a bit undesirable to have various
map_partition_varattnos() calls within ExecInitPartitionInfo() to
initialize the same information (the map) multiple times.

> And a final item is: can we have a whole-row expression in the clauses?
> We currently don't handle those either, not even to throw an error.
> [figures a test case] ... and now that I test it, it does crash!
> 
> create table part (a int primary key, b text) partition by range (a);
> create table part1 (b text, a int not null);
> alter table part attach partition part1 for values from (1) to (1000);
> insert into part values (1, 'two') on conflict (a)
>   do update set b = format('%s (was %s)', excluded.b, part.b)
>   where part.* *<> (1, text 'two');
> 
> I think this means we should definitely handle found_whole_row.  (If you
> create part1 in the normal way, it works as you'd expect.)

I agree.  That means we simply remove the Assert after the
map_partition_varattnos call.

> I'm going to close a few other things first, then come back to this.

Attached find a patch to fix the whole-row expression issue.  I added your
test to insert_conflict.sql.

Thanks,
Amit

Attachment

partition-on-conflict-wholerow-fix.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

10 April 2018, 05:56:13

On 2018/03/27 13:27, Amit Langote wrote:
> On 2018/03/26 23:20, Alvaro Herrera wrote:
>> The one thing I wasn't terribly in love with is the four calls to
>> map_partition_varattnos(), creating the attribute map four times ... but
>> we already have it in the TupleConversionMap, no?  Looks like we could
>> save a bunch of work there.
> 
> Hmm, actually we can't use that map, assuming you're talking about the
> following map:
> 
>   TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
> 
> We can only use that to tell if we need converting expressions (as we
> currently do), but it cannot be used to actually convert the expressions.
> The map in question is for use by do_convert_tuple(), not to map varattnos
> in Vars using map_variable_attnos().
> 
> But it's definitely a bit undesirable to have various
> map_partition_varattnos() calls within ExecInitPartitionInfo() to
> initialize the same information (the map) multiple times.

I will try to think of doing something about this later this week.

>> And a final item is: can we have a whole-row expression in the clauses?
>> We currently don't handle those either, not even to throw an error.
>> [figures a test case] ... and now that I test it, it does crash!
>>
>> create table part (a int primary key, b text) partition by range (a);
>> create table part1 (b text, a int not null);
>> alter table part attach partition part1 for values from (1) to (1000);
>> insert into part values (1, 'two') on conflict (a)
>>   do update set b = format('%s (was %s)', excluded.b, part.b)
>>   where part.* *<> (1, text 'two');
>>
>> I think this means we should definitely handle found_whole_row.  (If you
>> create part1 in the normal way, it works as you'd expect.)
> 
> I agree.  That means we simply remove the Assert after the
> map_partition_varattnos call.
> 
>> I'm going to close a few other things first, then come back to this.
> 
> Attached find a patch to fix the whole-row expression issue.  I added your
> test to insert_conflict.sql.

Adding this to the open items list.

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

16 April 2018, 11:47:26

On 2018/04/10 11:56, Amit Langote wrote:
> On 2018/03/27 13:27, Amit Langote wrote:
>> On 2018/03/26 23:20, Alvaro Herrera wrote:
>>> The one thing I wasn't terribly in love with is the four calls to
>>> map_partition_varattnos(), creating the attribute map four times ... but
>>> we already have it in the TupleConversionMap, no?  Looks like we could
>>> save a bunch of work there.
>>
>> Hmm, actually we can't use that map, assuming you're talking about the
>> following map:
>>
>>   TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
>>
>> We can only use that to tell if we need converting expressions (as we
>> currently do), but it cannot be used to actually convert the expressions.
>> The map in question is for use by do_convert_tuple(), not to map varattnos
>> in Vars using map_variable_attnos().
>>
>> But it's definitely a bit undesirable to have various
>> map_partition_varattnos() calls within ExecInitPartitionInfo() to
>> initialize the same information (the map) multiple times.
> 
> I will try to think of doing something about this later this week.

The solution I came up with is to call map_variable_attnos() directly,
instead of going through map_partition_varattnos() every time, after first
creating the attribute map ourselves.

>>> And a final item is: can we have a whole-row expression in the clauses?
>>> We currently don't handle those either, not even to throw an error.
>>> [figures a test case] ... and now that I test it, it does crash!
>>>
>>> create table part (a int primary key, b text) partition by range (a);
>>> create table part1 (b text, a int not null);
>>> alter table part attach partition part1 for values from (1) to (1000);
>>> insert into part values (1, 'two') on conflict (a)
>>>   do update set b = format('%s (was %s)', excluded.b, part.b)
>>>   where part.* *<> (1, text 'two');
>>>
>>> I think this means we should definitely handle found_whole_row.  (If you
>>> create part1 in the normal way, it works as you'd expect.)
>>
>> I agree.  That means we simply remove the Assert after the
>> map_partition_varattnos call.
>>
>>> I'm going to close a few other things first, then come back to this.
>>
>> Attached find a patch to fix the whole-row expression issue.  I added your
>> test to insert_conflict.sql.

Combined the above patch into the attached patch.

Thanks,
Amit

Attachment

v2-0001-Couple-of-fixes-for-ExecInitPartitionInfo.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

16 April 2018, 22:10:03

Amit Langote wrote:

> The solution I came up with is to call map_variable_attnos() directly,
> instead of going through map_partition_varattnos() every time, after first
> creating the attribute map ourselves.

Yeah, sounds good.  I added a tweak: if the tupledescs are equal, there
should be no need to do any mapping.

(Minor adjustment to the test: "cuarenta" means forty, so I changed the
new test to say "cincuenta" instead, which means fifty).

Pushed now, thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Tom Lane

Date:

16 April 2018, 22:20:57

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Pushed now, thanks.

Buildfarm doesn't like this even a little bit.

            regards, tom lane

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

17 April 2018, 10:45:45

On 2018/04/17 4:10, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> The solution I came up with is to call map_variable_attnos() directly,
>> instead of going through map_partition_varattnos() every time, after first
>> creating the attribute map ourselves.
> 
> Yeah, sounds good.  I added a tweak: if the tupledescs are equal, there
> should be no need to do any mapping.

Thanks for the commit!

About the equalTupleDescs()-based optimization you added -- It seems to me
that that *always* returns false if you pass it tupledescs of two
different tables, which in this case, we do.  That's because
equalTupleDescs has this:

    if (tupdesc1->tdtypeid != tupdesc2->tdtypeid)
        return false;

So, it fails to optimize as you were hoping it would.

Instead of doing this, I think we should try to make
convert_tuples_by_name_map() a bit smarter by integrating the logic in
convert_tuples_by_name() that's used conclude if no tuple conversion is
necessary.  So, if it turns that the tuples descriptors passed to
convert_tuples_by_name_map() contain the same number of attributes and the
individual attributes are at the same positions, we signal to the caller
that no conversion is necessary by returning NULL.

Attached find a patch that does that.  When working on this, I noticed
that when recursing for inheritance children, ATPrepAlterColumnType()
would use a AlterTableCmd (cmd) that's already scribbled on as if it were
the original.

Thanks,
Amit

Attachment

v1-0001-Optimize-convert_tuples_by_name_map-usage-a-bit.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

17 April 2018, 12:00:36

On 2018/04/17 16:45, Amit Langote wrote:
> Instead of doing this, I think we should try to make
> convert_tuples_by_name_map() a bit smarter by integrating the logic in
> convert_tuples_by_name() that's used conclude if no tuple conversion is
> necessary.  So, if it turns that the tuples descriptors passed to
> convert_tuples_by_name_map() contain the same number of attributes and the
> individual attributes are at the same positions, we signal to the caller
> that no conversion is necessary by returning NULL.
> 
> Attached find a patch that does that.
I just confirmed my hunch that this wouldn't somehow do the right thing
when the OID system column is involved.  Like this case:

create table parent (a int);
create table child () inherits (parent) with oids;
insert into parent values (1);
insert into child values (1);
analyze parent;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

That's because, convert_tuples_by_name() that's called by
acquire_inherited_sample_rows() gets a TupleConversionMap whose attrMap is
set to NULL.  do_convert_tuple() may then try to access a member of such
NULL attrMap.  In this case, even if parent and child tables have same
user attributes, patched convert_tuples_by_name_map would return NULL, but
since their hasoids setting doesn't match, a TupleConversionMap is still
returned but has its attrMap set to NULL.  To fix that, I taught
do_convert_tuple() to ignore the map if NULL.  Also, free_conversion_map()
shouldn't try to free attrMap if it's NULL.

Attached updated patch.

Thanks,
Amit

Attachment

v2-0001-Optimize-convert_tuples_by_name_map-usage-a-bit.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

17 April 2018, 18:02:07

Amit Langote wrote:

> Attached find a patch that does that.  When working on this, I noticed
> that when recursing for inheritance children, ATPrepAlterColumnType()
> would use a AlterTableCmd (cmd) that's already scribbled on as if it were
> the original.

While I agree that the code here is in poor style, there is no live bug
here, because the only thing that is changed each time is the copy's
cmd->def, and its value is not obtained from the scribbled 'cmd' -- it's
obtained from the passed-in cmd->def, which is unmodified.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

17 April 2018, 18:04:23

Amit Langote wrote:

> I just confirmed my hunch that this wouldn't somehow do the right thing
> when the OID system column is involved.  Like this case:

This looks too big a patch to pursue now.  I'm inclined to just remove
the equalTupdesc changes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

18 April 2018, 07:13:38

On 2018/04/18 0:02, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> Attached find a patch that does that.  When working on this, I noticed
>> that when recursing for inheritance children, ATPrepAlterColumnType()
>> would use a AlterTableCmd (cmd) that's already scribbled on as if it were
>> the original.
> 
> While I agree that the code here is in poor style, there is no live bug
> here, because the only thing that is changed each time is the copy's
> cmd->def, and its value is not obtained from the scribbled 'cmd' -- it's
> obtained from the passed-in cmd->def, which is unmodified.

Ah, you're right.  The original cmd->def itself remains intact.

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

18 April 2018, 07:35:02

On 2018/04/18 0:04, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> I just confirmed my hunch that this wouldn't somehow do the right thing
>> when the OID system column is involved.  Like this case:
> 
> This looks too big a patch to pursue now.  I'm inclined to just remove
> the equalTupdesc changes.

OK.  Here is the patch that removes equalTupdesc optimization.

I will add the rest of the patch to the next CF after starting a new
thread for it sometime later.

Thanks,
Amit

Attachment

v1-0001-Remove-equalTupDescs-based-optimization-in-ExecIn.patch

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

18 April 2018, 16:40:07

Amit Langote wrote:
> On 2018/04/18 0:04, Alvaro Herrera wrote:
> > Amit Langote wrote:
> > 
> >> I just confirmed my hunch that this wouldn't somehow do the right thing
> >> when the OID system column is involved.  Like this case:
> > 
> > This looks too big a patch to pursue now.  I'm inclined to just remove
> > the equalTupdesc changes.
> 
> OK.  Here is the patch that removes equalTupdesc optimization.

Hmm.  If we modify (during pg12, of course -- not now) partition tables
that are created identical to their parent table so that they share the
pg_type row, this would become useful.  Unless there a reason why that
change is completely unworkable, I'd just leave it there.  (I claim that
it works like that only because it used to work like that, not because
it's impossible to make work the other way.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

19 April 2018, 04:51:02

On 2018/04/18 22:40, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/04/18 0:04, Alvaro Herrera wrote:
>>> Amit Langote wrote:
>>>
>>>> I just confirmed my hunch that this wouldn't somehow do the right thing
>>>> when the OID system column is involved.  Like this case:
>>>
>>> This looks too big a patch to pursue now.  I'm inclined to just remove
>>> the equalTupdesc changes.
>>
>> OK.  Here is the patch that removes equalTupdesc optimization.
> 
> Hmm.  If we modify (during pg12, of course -- not now) partition tables
> that are created identical to their parent table so that they share the
> pg_type row, this would become useful.  Unless there a reason why that
> change is completely unworkable, I'd just leave it there.  (I claim that
> it works like that only because it used to work like that, not because
> it's impossible to make work the other way.)

Yeah, I too have wondered in the past what it would take to make
equalTupDescs() return true for parent and partitions.  Maybe we can make
it work by looking a bit harder than I did then.

Although, just leaving it there now would mean we're adding a few cycles
needlessly in the PG 11 code.  Why not add that optimization when we
surely know it can work?

Thanks,
Amit

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

19 April 2018, 20:20:54

Amit Langote wrote:

> Yeah, I too have wondered in the past what it would take to make
> equalTupDescs() return true for parent and partitions.  Maybe we can make
> it work by looking a bit harder than I did then.

How about simply relaxing the tdtypeid test from equalTupleDescs?  I
haven't looked deeply but I think just checking whether or not both are
RECORDOID might be sufficient, for typecache purposes.

If we just remove the tdtypeid test, check-world passes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Robert Haas

Date:

19 April 2018, 22:00:45

On Thu, Apr 19, 2018 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Amit Langote wrote:
>> Yeah, I too have wondered in the past what it would take to make
>> equalTupDescs() return true for parent and partitions.  Maybe we can make
>> it work by looking a bit harder than I did then.
>
> How about simply relaxing the tdtypeid test from equalTupleDescs?  I
> haven't looked deeply but I think just checking whether or not both are
> RECORDOID might be sufficient, for typecache purposes.

That strike me as a very scary thing to do.  There's code all over the
system that may have non-obvious assumptions about the behavior of
equalTupleDescs(), and I don't think we can have any confidence that
nothing will break unless we do a detailed audit of all that code.

> If we just remove the tdtypeid test, check-world passes.

That does not reassure me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Peter Geoghegan

Date:

19 April 2018, 22:21:40

On Thu, Apr 19, 2018 at 12:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 19, 2018 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> How about simply relaxing the tdtypeid test from equalTupleDescs?  I
>> haven't looked deeply but I think just checking whether or not both are
>> RECORDOID might be sufficient, for typecache purposes.
>
> That strike me as a very scary thing to do.  There's code all over the
> system that may have non-obvious assumptions about the behavior of
> equalTupleDescs(), and I don't think we can have any confidence that
> nothing will break unless we do a detailed audit of all that code.

+1. I think that it is plainly a bad idea to do something like that at
this point in the cycle.

-- 
Peter Geoghegan

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Alvaro Herrera

Date:

19 April 2018, 22:40:17

Alvaro Herrera wrote:
> Amit Langote wrote:
> 
> > Yeah, I too have wondered in the past what it would take to make
> > equalTupDescs() return true for parent and partitions.  Maybe we can make
> > it work by looking a bit harder than I did then.
> 
> How about simply relaxing the tdtypeid test from equalTupleDescs?  I
> haven't looked deeply but I think just checking whether or not both are
> RECORDOID might be sufficient, for typecache purposes.

After looking at the code, I'm a bit nervous about doing this, because I
don't fully understand what is going on in typcache, and what is the
HeapTupleHeaderGetTypeId macro really doing.  I'm afraid that if we
confuse a table's tupdesc with one of its partition's , something
entirely random might end up happening.

Maybe this is completely off-base, but if so I'd like to have to proof.
So I'm thinking of reverting that patch instead per your patch.

While composing this we got emails from Robert and Peter G suggesting
the same too, so consider it done.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ON CONFLICT DO UPDATE for partitioned tables

From

Amit Langote

Date:

20 April 2018, 04:08:56

On 2018/04/20 4:40, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Amit Langote wrote:
>>
>>> Yeah, I too have wondered in the past what it would take to make
>>> equalTupDescs() return true for parent and partitions.  Maybe we can make
>>> it work by looking a bit harder than I did then.
>>
>> How about simply relaxing the tdtypeid test from equalTupleDescs?  I
>> haven't looked deeply but I think just checking whether or not both are
>> RECORDOID might be sufficient, for typecache purposes.
> 
> After looking at the code, I'm a bit nervous about doing this, because I
> don't fully understand what is going on in typcache, and what is the
> HeapTupleHeaderGetTypeId macro really doing.  I'm afraid that if we
> confuse a table's tupdesc with one of its partition's , something
> entirely random might end up happening.
> 
> Maybe this is completely off-base, but if so I'd like to have to proof.
> So I'm thinking of reverting that patch instead per your patch.
> 
> While composing this we got emails from Robert and Peter G suggesting
> the same too, so consider it done.

Thank you for committing the patch.

Regards,
Amit