Thread: ON CONFLICT DO UPDATE for partitioned tables
I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1]. Following the lead of edd44738bc88 ("Be lazier about partition tuple routing.") this incarnation only does the necessary push-ups for the specific partition that needs it, at execution time. As far as I can tell, it works as intended. I chose to refuse the case where the DO UPDATE clause causes the tuple to move to another partition (i.e. you're updating the partition key of the tuple). While it's probably possible to implement that, it doesn't seem a very productive use of time. However, there is a shortcoming in the design: it fails if there are multiple levels of partitioning, because there is no easy (efficient) way to map the index OID more than one level. I had already mentioned this shortcoming to Amit's patch. So this case (which I added in the regression tests) fails unexpectedly: -- multiple-layered partitioning create table parted_conflict_test (a int primary key, b text) partition by range (a); create table parted_conflict_test_1 partition of parted_conflict_test for values from (0) to (10000) partition by range (a); create table parted_conflict_test_1_1 partition of parted_conflict_test_1 for values from (0) to (100); insert into parted_conflict_test values ('10', 'ten'); insert into parted_conflict_test values ('10', 'ten two') on conflict (a) do update set b = excluded.b; ERROR: invalid ON CONFLICT DO UPDATE specification DETAIL: An inferred index was not found in partition "parted_conflict_test_1_1". So the problem here is that my MapPartitionIndexList() implementation is too stupid. I think it would be smarter to use the ResultRelInfo instead of bare Relation, for one. But that still doesn't answer how to find a "path" from root to leaf partition, which is what I'd need to verify that there are valid pg_inherits relationships for the partition indexes. I'm probably missing something about the partitionDesc or maybe the partitioned_rels lists that helps me do it efficiently, but I hope figure out soon. One idea I was toying with is to add RelationData->rd_children as a list of OIDs of children relations. So it should be possible to walk the list from the root to the desired descendant, without having to scan pg_inherits repeatedly ... although that would probably require doing relation_open() for every index, which sounds undesirable. (ISTM that having RelationData->rd_children might be a good idea in general anyway -- I mean to speed up some code that currently scans pg_inherits via find_inheritance_children. However, since the partition descriptor itself is already in relcache, maybe this doesn't matter too much.) Another idea is to abandon the notion that we need to find a path from parent index to descendant index, and just run the inference algorithm again on the partition. I'm not sure how much I like this idea, yet. Anyway, while this is still WIP, I think it works correctly for the case where there is a single partition level. [1] https://postgr.es/m/c1651d5b-7bd6-b7e7-e1cc-16ecfe2c0da5@lab.ntt.co.jp -- Álvaro Herrer
Attachment
On 2018/02/28 9:46, Alvaro Herrera wrote: > I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1]. > Following the lead of edd44738bc88 ("Be lazier about partition tuple > routing.") this incarnation only does the necessary push-ups for the > specific partition that needs it, at execution time. As far as I can > tell, it works as intended. Thanks. I too have been meaning to send an updated (nay, significantly rewritten) version of the patch, but you beat me to it. > I chose to refuse the case where the DO UPDATE clause causes the tuple > to move to another partition (i.e. you're updating the partition key of > the tuple). While it's probably possible to implement that, it doesn't > seem a very productive use of time. Probably a good idea to save it for later. > However, there is a shortcoming in the design: it fails if there are > multiple levels of partitioning, because there is no easy (efficient) > way to map the index OID more than one level. I had already mentioned > this shortcoming to Amit's patch. So this case (which I added in the > regression tests) fails unexpectedly: > > -- multiple-layered partitioning > create table parted_conflict_test (a int primary key, b text) partition by range (a); > create table parted_conflict_test_1 partition of parted_conflict_test > for values from (0) to (10000) partition by range (a); > create table parted_conflict_test_1_1 partition of parted_conflict_test_1 > for values from (0) to (100); > insert into parted_conflict_test values ('10', 'ten'); > insert into parted_conflict_test values ('10', 'ten two') > on conflict (a) do update set b = excluded.b; > ERROR: invalid ON CONFLICT DO UPDATE specification > DETAIL: An inferred index was not found in partition "parted_conflict_test_1_1". > > So the problem here is that my MapPartitionIndexList() implementation is > too stupid. I think it would be smarter to use the ResultRelInfo > instead of bare Relation, for one. But that still doesn't answer how to > find a "path" from root to leaf partition, which is what I'd need to > verify that there are valid pg_inherits relationships for the partition > indexes. I'm probably missing something about the partitionDesc or > maybe the partitioned_rels lists that helps me do it efficiently, but I > hope figure out soon. > > One idea I was toying with is to add RelationData->rd_children as a list > of OIDs of children relations. So it should be possible to walk the > list from the root to the desired descendant, without having to scan > pg_inherits repeatedly ... although that would probably require doing > relation_open() for every index, which sounds undesirable. > > (ISTM that having RelationData->rd_children might be a good idea in > general anyway -- I mean to speed up some code that currently scans > pg_inherits via find_inheritance_children. However, since the partition > descriptor itself is already in relcache, maybe this doesn't matter too > much.) > > Another idea is to abandon the notion that we need to find a path from > parent index to descendant index, and just run the inference algorithm > again on the partition. I'm not sure how much I like this idea, yet. > > Anyway, while this is still WIP, I think it works correctly for the case > where there is a single partition level. Actually, after your comment on my original patch [1], I did make it work for multiple levels by teaching the partition initialization code to find a given partition's indexes that are inherited from the root table (that is the table mentioned in command). So, after a tuple is routed to a partition, we switch from the original arbiterIndexes list to the one we created for the partition, which must contain OIDs corresponding to those in the original list. After all, for each of the parent's indexes that the planner put into the original arbiterIndexes list, there must exist an index in each of the leaf partitions. I had also observed when working on the patch that various TupleTableSlots used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the inheritance-translated target list (DO UPDATE SET target list). In fact, that has to take into account also the dropped columns; we may have dropped columns either in parent or in a partition or in both at same or different attnum positions. That means, simple map_partition_varattnos() translation doesn't help in this case. For example, with your patch (sorry, I know you said it's a WIP patch), I see either a crash or errors when dealing with such differing attribute numbers: drop table p; create table p (a int, b text) partition by list (a); create table p12 (b text, a int); alter table p attach partition p12 for values in (1, 2); alter table p drop b, add b text; create table p4 partition of p for values in (4); create unique index on p (a); insert into p values (1, 'a') on conflict (a) do update set b = excluded.b; insert into p values (1, 'b') on conflict (a) do update set b = excluded.b; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. insert into p values (4, 'a') on conflict (a) do update set b = excluded.b; postgres=# insert into p values (4, 'b') on conflict (a) do update set b = excluded.b; ERROR: attribute number 3 exceeds number of columns 2 I attach my patch here for your reference, which I polished this morning after seeing your email and the patch. It works for most of the cases, as you can see in the updated tests in insert_conflict.sql. Since I agree with you that we should, for now, error out if DO UPDATE causes row movement, I adopted the code from your patch for that. Thanks, Amit [1] https://www.postgresql.org/message-id/20171227225031.osh6vunpuhsath25%40alvherre.pgsql
Attachment
On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1]. > Following the lead of edd44738bc88 ("Be lazier about partition tuple > routing.") this incarnation only does the necessary push-ups for the > specific partition that needs it, at execution time. As far as I can > tell, it works as intended. > > I chose to refuse the case where the DO UPDATE clause causes the tuple > to move to another partition (i.e. you're updating the partition key of > the tuple). While it's probably possible to implement that, it doesn't > seem a very productive use of time. I would have thought that to be the only case we could support with the current infrastructure. Doesn't a correct implementation for any other case require a global index? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/03/01 1:03, Robert Haas wrote: > On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1]. >> Following the lead of edd44738bc88 ("Be lazier about partition tuple >> routing.") this incarnation only does the necessary push-ups for the >> specific partition that needs it, at execution time. As far as I can >> tell, it works as intended. >> >> I chose to refuse the case where the DO UPDATE clause causes the tuple >> to move to another partition (i.e. you're updating the partition key of >> the tuple). While it's probably possible to implement that, it doesn't >> seem a very productive use of time. > > I would have thought that to be the only case we could support with > the current infrastructure. Doesn't a correct implementation for any > other case require a global index? I'm thinking that Alvaro is talking here about the DO UPDATE action part, not the conflict checking part. The latter will definitely require global indexes if conflict were to be checked on columns not containing the partition key. The case Alvaro mentions arises after checking the conflict, presumably using an inherited unique index whose keys must include the partition keys. If the conflict action is DO UPDATE and its SET clause changes partition key columns such that the row will have to change the partition, then the current patch will result in an error. I think that's because making update row movement work in this case will require some adjustments to what 2f178441044 (Allow UPDATE to move rows between partitions) implemented. We wouldn't have things set up in the ModifyTableState that the current row-movement code depends on being set up; for example, there wouldn't be per-subplan ResultRelInfo's in the ON CONFLICT DO UPDATE case. The following Assert in ExecUpdate() will fail for instance: map_index = resultRelInfo - mtstate->resultRelInfo; Assert(map_index >= 0 && map_index < mtstate->mt_nplans); Thanks, Amit
Amit Langote wrote: > Actually, after your comment on my original patch [1], I did make it work > for multiple levels by teaching the partition initialization code to find > a given partition's indexes that are inherited from the root table (that > is the table mentioned in command). So, after a tuple is routed to a > partition, we switch from the original arbiterIndexes list to the one we > created for the partition, which must contain OIDs corresponding to those > in the original list. After all, for each of the parent's indexes that > the planner put into the original arbiterIndexes list, there must exist an > index in each of the leaf partitions. Oh, your solution for this seems simple enough. Silly me, I was trying to implement it in a quite roundabout way. Thanks. (I do wonder if we should save the "root" reloid in the relcache). > I had also observed when working on the patch that various TupleTableSlots > used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the > inheritance-translated target list (DO UPDATE SET target list). In fact, > that has to take into account also the dropped columns; we may have > dropped columns either in parent or in a partition or in both at same or > different attnum positions. That means, simple map_partition_varattnos() > translation doesn't help in this case. Yeah, I was aware these corner cases could become a problem though I hadn't gotten around to testing them yet. Thanks for all your work on this. The usage of the few optimizer/prep/ functions that are currently static doesn't fill me with joy. These functions have weird APIs because they're static so we don't rally care, but once we export them we should strive to be more careful. I'd rather stay away from just exporting them all, so I chose to encapsulate those calls in a single function and export only expand_targetlist from preptlist.c, keeping the others static in prepunion.c. In the attached patch set, I put an API change (work on tupdescs rather than full-blown relations) for a couple of those functions as 0001, then your patch as 0002, then a few fixups of my own. (0002 is not bit-by-bit identical to yours; I think I had to fix some merge conflict with 0001, but should be pretty much the same). But looking further, I think there is much cruft that has accumulated in those functions (because they've gotten simplified over time), and we could do some additional cleanup surgery. For example, there is no reason to pass a list pointer to make_inh_translation_list(); we could just return it. And then we don't have to cons up a fake AppendRelInfo with all dummy values that adjust_inherited_tlist doesn't even care about. I think there was a point for all these contortions back at some point (visible by checking git history of this code), but it all seems useless now. Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused: expand_targetlist() runs *after*, not before, so how could it have affected the result? I'm obviously confused about what expand_targetlist call this comment is talking about. Anyway I wanted to make it use resjunk entries instead, but that broke some other case that I didn't have time to research yesterday. I'll get back to this soon, but in the meantime, here's what I have. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Langote wrote: > Actually, after your comment on my original patch [1], I did make it work > for multiple levels by teaching the partition initialization code to find > a given partition's indexes that are inherited from the root table (that > is the table mentioned in command). So, after a tuple is routed to a > partition, we switch from the original arbiterIndexes list to the one we > created for the partition, which must contain OIDs corresponding to those > in the original list. After all, for each of the parent's indexes that > the planner put into the original arbiterIndexes list, there must exist an > index in each of the leaf partitions. Oh, your solution for this seems simple enough. Silly me, I was trying to implement it in a quite roundabout way. Thanks. (I do wonder if we should save the "root" reloid in the relcache). > I had also observed when working on the patch that various TupleTableSlots > used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the > inheritance-translated target list (DO UPDATE SET target list). In fact, > that has to take into account also the dropped columns; we may have > dropped columns either in parent or in a partition or in both at same or > different attnum positions. That means, simple map_partition_varattnos() > translation doesn't help in this case. Yeah, I was aware these corner cases could become a problem though I hadn't gotten around to testing them yet. Thanks for all your work on this. The usage of the few optimizer/prep/ functions that are currently static doesn't fill me with joy. These functions have weird APIs because they're static so we don't rally care, but once we export them we should strive to be more careful. I'd rather stay away from just exporting them all, so I chose to encapsulate those calls in a single function and export only expand_targetlist from preptlist.c, keeping the others static in prepunion.c. In the attached patch set, I put an API change (work on tupdescs rather than full-blown relations) for a couple of those functions as 0001, then your patch as 0002, then a few fixups of my own. (0002 is not bit-by-bit identical to yours; I think I had to fix some merge conflict with 0001, but should be pretty much the same). But looking further, I think there is much cruft that has accumulated in those functions (because they've gotten simplified over time), and we could do some additional cleanup surgery. For example, there is no reason to pass a list pointer to make_inh_translation_list(); we could just return it. And then we don't have to cons up a fake AppendRelInfo with all dummy values that adjust_inherited_tlist doesn't even care about. I think there was a point for all these contortions back at some point (visible by checking git history of this code), but it all seems useless now. Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused: expand_targetlist() runs *after*, not before, so how could it have affected the result? I'm obviously confused about what expand_targetlist call this comment is talking about. Anyway I wanted to make it use resjunk entries instead, but that broke some other case that I didn't have time to research yesterday. I'll get back to this soon, but in the meantime, here's what I have. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2018/03/03 0:36, Alvaro Herrera wrote: > Amit Langote wrote: > >> Actually, after your comment on my original patch [1], I did make it work >> for multiple levels by teaching the partition initialization code to find >> a given partition's indexes that are inherited from the root table (that >> is the table mentioned in command). So, after a tuple is routed to a >> partition, we switch from the original arbiterIndexes list to the one we >> created for the partition, which must contain OIDs corresponding to those >> in the original list. After all, for each of the parent's indexes that >> the planner put into the original arbiterIndexes list, there must exist an >> index in each of the leaf partitions. > > Oh, your solution for this seems simple enough. Silly me, I was trying > to implement it in a quite roundabout way. Thanks. (I do wonder if we > should save the "root" reloid in the relcache). Do you mean store the root reloid for "any" partition (table or index)? >> I had also observed when working on the patch that various TupleTableSlots >> used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the >> inheritance-translated target list (DO UPDATE SET target list). In fact, >> that has to take into account also the dropped columns; we may have >> dropped columns either in parent or in a partition or in both at same or >> different attnum positions. That means, simple map_partition_varattnos() >> translation doesn't help in this case. > > Yeah, I was aware these corner cases could become a problem though I > hadn't gotten around to testing them yet. Thanks for all your work on > this. > > The usage of the few optimizer/prep/ functions that are currently static > doesn't fill me with joy. These functions have weird APIs because > they're static so we don't rally care, but once we export them we should > strive to be more careful. I'd rather stay away from just exporting > them all, so I chose to encapsulate those calls in a single function and > export only expand_targetlist from preptlist.c, keeping the others > static in prepunion.c. In the attached patch set, I put an API change > (work on tupdescs rather than full-blown relations) for a couple of > those functions as 0001, then your patch as 0002, then a few fixups of > my own. (0002 is not bit-by-bit identical to yours; I think I had to > fix some merge conflict with 0001, but should be pretty much the same). > > But looking further, I think there is much cruft that has accumulated in > those functions (because they've gotten simplified over time), and we > could do some additional cleanup surgery. For example, there is no > reason to pass a list pointer to make_inh_translation_list(); we could > just return it. And then we don't have to cons up a fake AppendRelInfo > with all dummy values that adjust_inherited_tlist doesn't even care > about. I think there was a point for all these contortions back at some > point (visible by checking git history of this code), but it all seems > useless now. Yeah, I think it might as well work to fix up these interfaces the way you've done. > Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused: > expand_targetlist() runs *after*, not before, so how could it have > affected the result? I'm obviously confused about what > expand_targetlist call this comment is talking about. Anyway I wanted > to make it use resjunk entries instead, but that broke some other case > that I didn't have time to research yesterday. I'll get back to this > soon, but in the meantime, here's what I have. Hmm. I can imagine how the newly added comments in adjust_inherited_tlist() may be confusing. With this patch, we're now calling adjust_inherited_tlist() from the executor, whereas it was designed only to be run in the planner prep phase. What we're passing to it from the executor is the DO UPDATE SET's targetlist that has undergone the expand_targetlist() treatment by the planner. Maybe, we need to update the adjust_inherited_tlist() comments to reflect the expansion of its scope due to this patch. Some comments on 0003: + * Fist, fix the target entries' resnos, by using inheritance translation. First + appinfo.parent_reltype = InvalidOid; // parentRel->rd_rel->reltype; I guess you won't retain that comment. :) + result_tl = expand_targetlist(result_tl, CMD_UPDATE, Should we add a CmdType argument to adjust_and_expand_partition_tlist() and use it instead of hard-coding CMD_UPDATE here? Thanks, Amit
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?
If I understand correctly, planner must have called expand_targetlist() once for the parent relation's descriptor and added any dropped columns from the parent relation. So we may not find mapped attributes for those dropped columns in the parent. I haven't actually tested this case though.
I wonder if it makes sense to actually avoid expanding on-conflict-set targetlist in case the target is a partition table and deal with it during execution, like we are doing now.
I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.
+ conv_setproj =
+ adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),
+ RelationGetDescr(partrel),
+ RelationGetRelationName(partrel),
+ firstVarno,
+ conv_setproj);
Aren't we are adding back Vars referring to the parent relation, after having converted the existing ones to use partition relation's varno? May be that works because missing attributes are already added during planning and expand_targetlist() here only adds dropped columns, which are just NULL constants.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
PostgreSQL Development, 24x7 Support, Training & Services
@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
int num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+ List **partition_arbiter_indexes;
+ TupleTableSlot **partition_conflproj_slots;
+ TupleTableSlot **partition_existing_slots;
} PartitionTupleRouting;
I am curious why you decided to add these members to PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better place to track these or is there some rule that we follow?
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
PostgreSQL Development, 24x7 Support, Training & Services
(2018/03/16 19:43), Pavan Deolasee wrote: > On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com > <mailto:alvherre@2ndquadrant.com>> wrote: > @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting > int num_subplan_partition_offsets; > TupleTableSlot *partition_tuple_slot; > TupleTableSlot *root_tuple_slot; > + List **partition_arbiter_indexes; > + TupleTableSlot **partition_conflproj_slots; > + TupleTableSlot **partition_existing_slots; > } PartitionTupleRouting; > I am curious why you decided to add these members to > PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better > place to track these or is there some rule that we follow? I just started reviewing the patch, so maybe I'm missing something, but I think it would be a good idea to have these in that structure, not in ResultRelInfo, because these would be required only for partitions chosen via tuple routing. Best regards, Etsuro Fujita
On Fri, Mar 16, 2018 at 5:13 PM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
(2018/03/16 19:43), Pavan Deolasee wrote:On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com
<mailto:alvherre@2ndquadrant.com>> wrote: @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
int num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+ List **partition_arbiter_indexes;
+ TupleTableSlot **partition_conflproj_slots;
+ TupleTableSlot **partition_existing_slots;
} PartitionTupleRouting;I am curious why you decided to add these members to
PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
place to track these or is there some rule that we follow?
I just started reviewing the patch, so maybe I'm missing something, but I think it would be a good idea to have these in that structure, not in ResultRelInfo, because these would be required only for partitions chosen via tuple routing.
Hmm, yeah, probably you're right. The reason I got confused is because the patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the converted projection info/where clause for a given partition in its ResultRelationInfo. So I was wondering if we can move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and then just use that per-partition structures too.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee wrote: > Hmm, yeah, probably you're right. The reason I got confused is because the > patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the > converted projection info/where clause for a given partition in its > ResultRelationInfo. So I was wondering if we can > move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and > then just use that per-partition structures too. I wonder if the proposed structure is good actually. Some notes as I go along. 1. ModifyTableState->mt_arbiterindexes is just a copy of ModifyTable->arbiterIndexes. So why do we need it? For an unpartitioned relation we can just use ModifyTableState.ps->arbiterIndexes. Now, for each partition we need to map these indexes onto the partition indexes. Not sure where to put these; I'm tempted to say ResultRelInfo is the place. Maybe the list should always be in ResultRelInfo instead of the state/plan node? Not sure. 2. We don't need mt_existing to be per-relation; a single tuple slot can do for all partitions; we just need to ExecSlotSetDescriptor to the partition's descriptor whenever the slot is going to be used. (This means the slot no longer has a fixed tupdesc. That seems OK). 3. ModifyTableState->mt_conflproj is more or less the same thing; the same TTS can be reused by all the different projections, as long as we set the descriptor before projecting. So we don't need PartitionTupleRouting->partition_conflproj_slots, but we do need a descriptor to be used when needed. Let's say PartitionTupleRouting->partition_confl_tupdesc I'll experiment with these ideas and see where that leads me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So ExecInsert receives the ModifyTableState, and separately it receives arbiterIndexes and the OnConflictAction, both of which are members of the passed ModifyTableState. I wonder why does it do that; wouldn't it be simpler to extract those members from the node? With the patch proposed upthread, we receive arbiterIndexes as a parameter and if the table is a partition we summarily ignore those and use the list as extracted from the PartitionRoutingInfo. This is confusing and pointless. It seems to me that the logic ought to be "if partition then use the list in PartitionRoutingInfo; if not partition use it from ModifyTableState". This requires changing as per above, i.e. make the arbiter index list not part of the ExecInsert's API. The OnConflictAction doesn't matter much; not passing it is merely a matter of cleanliness. Or is there another reason to pass the index list? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So ExecInsert receives the ModifyTableState, and separately it receives > arbiterIndexes and the OnConflictAction, both of which are members of > the passed ModifyTableState. I wonder why does it do that; wouldn't it > be simpler to extract those members from the node? > Or is there another reason to pass the index list? It works that way pretty much by accident, as far as I can tell. Removing the two extra arguments sounds like a good idea. -- Peter Geoghegan
Peter Geoghegan wrote: > On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera > <alvherre@alvh.no-ip.org> wrote: > > So ExecInsert receives the ModifyTableState, and separately it receives > > arbiterIndexes and the OnConflictAction, both of which are members of > > the passed ModifyTableState. I wonder why does it do that; wouldn't it > > be simpler to extract those members from the node? > > > Or is there another reason to pass the index list? > > It works that way pretty much by accident, as far as I can tell. > Removing the two extra arguments sounds like a good idea. Great, thanks. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Another thing I noticed is that the split of the ON CONFLICT slot and its corresponding projection is pretty weird. The projection is in ResultRelInfo, but the slot itself is in ModifyTableState. You can't make the projection work without a corresponding slot initialized with the correct descriptor, so splitting it this way doesn't make a lot of sense to me. (Now, TBH the split between resultRelInfo->ri_projectReturning and ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the returning project uses, doesn't make a lot of sense to me either; so maybe there some reason that I'm just not seeing. But I digress.) So I want to propose that we move the slot to be together with the projection node that it serves, ie. we put the slot in ResultRelInfo: typedef struct ResultRelInfo { ... /* for computing ON CONFLICT DO UPDATE SET */ TupleTableSlot *ri_onConflictProjSlot; ProjectionInfo *ri_onConflictSetProj; and with this the structure makes more sense. So ExecInitModifyTable does this /* create target slot for UPDATE SET projection */ tupDesc = ExecTypeFromTL((List *) node->onConflictSet, relationDesc->tdhasoid); resultRelInfo->ri_onConflictProjSlot = ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc); /* build UPDATE SET projection state */ resultRelInfo->ri_onConflictSetProj = ExecBuildProjectionInfo(node->onConflictSet, econtext, resultRelInfo->ri_onConflictProjSlot, &mtstate->ps, relationDesc); and then ExecOnConflictUpdate can simply do this: /* Project the new tuple version */ ExecProject(resultRelInfo->ri_onConflictSetProj); /* Execute UPDATE with projection */ *returning = ExecUpdate(mtstate, &tuple.t_self, NULL, resultRelInfo->ri_onConflictProjSlot, planSlot, &mtstate->mt_epqstate, mtstate->ps.state, canSetTag); Now, maybe there is some reason I'm missing for the on conflict slot for the projection to be in ModifyTableState rather than resultRelInfo. But this code passes all current tests, so I don't know what that reason would be. Overall, the resulting code looks simpler to me than the previous arrangements. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote: > Another thing I noticed is that the split of the ON CONFLICT slot and > its corresponding projection is pretty weird. The projection is in > ResultRelInfo, but the slot itself is in ModifyTableState. You can't > make the projection work without a corresponding slot initialized with > the correct descriptor, so splitting it this way doesn't make a lot of > sense to me. > > (Now, TBH the split between resultRelInfo->ri_projectReturning and > ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the > returning project uses, doesn't make a lot of sense to me either; so > maybe there some reason that I'm just not seeing. But I digress.) The projections for different child tables / child plans can look different, therefore it can't be stored in ModifyTableState itself. No? The slot's descriptor is changed to be "appropriate" when necessary: if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc)) ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc)); > So I want to propose that we move the slot to be together with the > projection node that it serves, ie. we put the slot in ResultRelInfo: That'll mean we have a good number of additional slots in some cases. I don't think the overhead of that is going to break the bank, but it's worth considering. Greetings, Andres Freund
Andres Freund wrote: > Hi, > > On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote: > > Another thing I noticed is that the split of the ON CONFLICT slot and > > its corresponding projection is pretty weird. The projection is in > > ResultRelInfo, but the slot itself is in ModifyTableState. You can't > > make the projection work without a corresponding slot initialized with > > the correct descriptor, so splitting it this way doesn't make a lot of > > sense to me. > > > > (Now, TBH the split between resultRelInfo->ri_projectReturning and > > ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the > > returning project uses, doesn't make a lot of sense to me either; so > > maybe there some reason that I'm just not seeing. But I digress.) > > The projections for different child tables / child plans can look > different, therefore it can't be stored in ModifyTableState itself. No? > > The slot's descriptor is changed to be "appropriate" when necessary: > if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc)) > ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc)); Grumble. This stuff looks like full of cheats to me, but I won't touch it anyway. > > So I want to propose that we move the slot to be together with the > > projection node that it serves, ie. we put the slot in ResultRelInfo: > > That'll mean we have a good number of additional slots in some cases. I > don't think the overhead of that is going to break the bank, but it's > worth considering. Good point. I think what I should be doing is the same as the returning stuff: keep a tupdesc around, and use a single slot, whose descriptor is changed just before the projection. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote: > I think what I should be doing is the same as the returning stuff: keep > a tupdesc around, and use a single slot, whose descriptor is changed > just before the projection. Yes, this works, though it's ugly. Not any uglier than what's already there, though, so I think it's okay. The only thing that I remain unhappy about this patch is the whole adjust_and_expand_partition_tlist() thing. I fear we may be doing redundant and/or misplaced work. I'll look into it next week. 0001 should be pretty much ready to push -- adjustments to ExecInsert and ModifyTableState I already mentioned. 0002 is stuff I would like to get rid of completely -- changes to planner code so that it better supports functionality we need for 0003. 0003 is the main patch. Compared to the previous version, this one reuses slots by switching them to different tupdescs as needed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2018/03/05 18:04, Pavan Deolasee wrote: > On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: >> Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused: >> expand_targetlist() runs *after*, not before, so how could it have >> affected the result? >> > > If I understand correctly, planner must have called expand_targetlist() > once for the parent relation's descriptor and added any dropped columns > from the parent relation. So we may not find mapped attributes for those > dropped columns in the parent. I haven't actually tested this case though. > > I wonder if it makes sense to actually avoid expanding on-conflict-set > targetlist in case the target is a partition table and deal with it during > execution, like we are doing now. >> I'm obviously confused about what >> expand_targetlist call this comment is talking about. Anyway I wanted >> to make it use resjunk entries instead, but that broke some other case >> that I didn't have time to research yesterday. I'll get back to this >> soon, but in the meantime, here's what I have. >> > > + conv_setproj = > + > adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel), > + RelationGetDescr(partrel), > + > RelationGetRelationName(partrel), > + firstVarno, > + conv_setproj); > > Aren't we are adding back Vars referring to the parent relation, after > having converted the existing ones to use partition relation's varno? May > be that works because missing attributes are already added during planning > and expand_targetlist() here only adds dropped columns, which are just NULL > constants. I think this suggestion to defer onConflictSet target list expansion to execution time is a good one. So, in preprocess_targetlist(), we'd only perform exapand_targetlist on the onConflictSet list if the table is not a partitioned table. For partitioned tables, we don't know which partition(s) will be affected, so it's useless to do the expansion there. Instead it's better to expand in ExecInitPartitionInfo(), possibly after changing original TargetEntry nodes to have correct resnos due to attribute number differences (calling adjust_inherited_tlist(), etc.). And then since we're not expanding the target list in the planner anymore, we don't need to install those hacks in adjust_inherited_tlist() like the patch currently does to ignore entries for dropped columns in the parent the plan-time expansion adds. Thanks, Amit
Thanks for the updated patches. On 2018/03/18 13:17, Alvaro Herrera wrote: > Alvaro Herrera wrote: > >> I think what I should be doing is the same as the returning stuff: keep >> a tupdesc around, and use a single slot, whose descriptor is changed >> just before the projection. > > Yes, this works, though it's ugly. Not any uglier than what's already > there, though, so I think it's okay. > > The only thing that I remain unhappy about this patch is the whole > adjust_and_expand_partition_tlist() thing. I fear we may be doing > redundant and/or misplaced work. I'll look into it next week. > > 0001 should be pretty much ready to push -- adjustments to ExecInsert > and ModifyTableState I already mentioned. This seems like good cleanup. While at it, why not also get rid of mt_onconflict in favor of always just using its counterpart in ModifyTable -- onConflictAction? > 0002 is stuff I would like to get rid of completely -- changes to > planner code so that it better supports functionality we need for > 0003. Hmm. I'm not sure if we can completely get rid of this, because we do need the adjust_inherited_tlist() facility to translate TargetEntry resnos in any case. But as I just said in reply to Pavan's email suggesting deferring onConflistSet expansion to execution time, we don't need the hack in adjust_inherited_tlist() if we go with the suggestion. > 0003 is the main patch. Compared to the previous version, this one > reuses slots by switching them to different tupdescs as needed. Your proposed change to use just one slot (the existing mt_conflproj slot) sounds good. Instead, it seems now we have an array to hold tupleDescs for the onConflistSet target lists for each partition. Some comments: 1. I noticed a bug that crashes a test in insert_conflit.sql that uses DO NOTHING instead of DO UPDATE SET. It's illegal for ExecInitPartitionInfo to expect mt_conflproj_tupdesc to be valid in the DO NOTHING case, because ExecInitModifyTable would only set it if (onConflictAction == DO_UPDATE). 2. It seems better to name the new array field in PartitionTupleRouting partition_conflproj_tupdescs rather than partition_onconfl_tupdescs to be consistent with the new field in ModifyTableState. 3. I think it was an oversight in my original patch, but it seems we should allocate the partition_onconfl_tdescs array only if DO UPDATE action is used. Also, ri_onConflictSetProj, ri_onConflictSetWhere should be only set in that case. OTOH, we always need to set partition_arbiter_indexes, that is, for both DO NOTHING and DO UPDATE SET actions. 4. Need to remove the comments for partition_conflproj_slots and partition_existing_slots, fields of PartitionTupleRouting that no longer exist. Instead one for partition_conflproj_tupdescs should be added. 5. I know the following is so as not to break the Assert in adjust_inherited_tlist(), so why not have a parentOid argument for adjust_and_expand_partition_tlist()? + appinfo.parent_reloid = 1; // dummy parentRel->rd_id; 6. There is a sentence in the comment above adjust_inherited_tlist(): Note that this is not needed for INSERT because INSERT isn't inheritable. Maybe, we need to delete that and mention that we do need it in the case of INSERT ON CONFLICT DO UPDATE on partitioned tables for translating DO UPDATE SET target list. 7. In ExecInsert, it'd be better to have a partArbiterIndexes, just like partConflTupdesc in the outermost scope and then do: + /* Use the appropriate list of arbiter indexes. */ + if (mtstate->mt_partition_tuple_routing != NULL) + arbiterIndexes = partArbiterIndexes; + else + arbiterIndexes = node->arbiterIndexes; and + /* Use the appropriate tuple descriptor. */ + if (mtstate->mt_partition_tuple_routing != NULL) + onconfl_tupdesc = partConflTupdesc; + else + onconfl_tupdesc = mtstate->mt_conflproj_tupdesc; using arbiterIndexes and onconfl_tupdesc declared in the appropriate scopes. I have tried to make these changes and attached are the updated patches containing those, including the change I suggested for 0001 (that is, getting rid of mt_onconflict). I also expanded some comments in 0003 while making those changes. Thanks, Amit
Attachment
On 2018/03/19 16:45, Amit Langote wrote: > I have tried to make these changes and attached are the updated patches > containing those, including the change I suggested for 0001 (that is, > getting rid of mt_onconflict). I also expanded some comments in 0003 > while making those changes. I realized that there should be a test where transition table is involved for an ON CONFLICT DO UPDATE on a partitioned table due to relevant statement trigger on the table; something like the attached. But due to a bug being discussed over at [1], we won't get the correct expected output for the test until the latest patch submitted for that bug [2] is committed as a bug fix. Thanks, Amit [1] https://www.postgresql.org/message-id/ba19eff9-2120-680e-4671-55a9bea9454f%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/df921671-32df-45ea-c0e4-9b51ee86ba3b%40lab.ntt.co.jp
Attachment
(2018/03/18 13:17), Alvaro Herrera wrote: > Alvaro Herrera wrote: > The only thing that I remain unhappy about this patch is the whole > adjust_and_expand_partition_tlist() thing. I fear we may be doing > redundant and/or misplaced work. I'll look into it next week. I'm still reviewing the patches, but I really agree on that point. As Pavan mentioned upthread, the onConflictSet tlist for the root parent, from which we create a translated onConflictSet tlist for a partition, would have already been processed by expand_targetlist() to contain all missing columns as well, so I think we could create the tlist for the partition by simply re-ordering the expression-converted tlist (ie, conv_setproj) based on the conversion map for the partition. The Attached defines a function for that, which could be called, instead of calling adjust_and_expand_partition_tlist(). This would allow us to get rid of planner changes from the patches. Maybe I'm missing something, though. Best regards, Etsuro Fujita
Attachment
Fujita-san, On 2018/03/19 21:59, Etsuro Fujita wrote: > (2018/03/18 13:17), Alvaro Herrera wrote: >> Alvaro Herrera wrote: >> The only thing that I remain unhappy about this patch is the whole >> adjust_and_expand_partition_tlist() thing. I fear we may be doing >> redundant and/or misplaced work. I'll look into it next week. > > I'm still reviewing the patches, but I really agree on that point. As > Pavan mentioned upthread, the onConflictSet tlist for the root parent, > from which we create a translated onConflictSet tlist for a partition, > would have already been processed by expand_targetlist() to contain all > missing columns as well, so I think we could create the tlist for the > partition by simply re-ordering the expression-converted tlist (ie, > conv_setproj) based on the conversion map for the partition. The Attached > defines a function for that, which could be called, instead of calling > adjust_and_expand_partition_tlist(). This would allow us to get rid of > planner changes from the patches. Maybe I'm missing something, though. Thanks for the patch. I can confirm your proposed adjust_onconflictset_tlist() is enough to replace adjust_inherited_tlist() + expand_targetlist() combination (aka adjust_and_expand_partition_tlist()), thus rendering the planner changes in this patch unnecessary. I tested it with a partition tree involving partitions of varying attribute numbers (dropped columns included) and it seems to work as expected (as also exercised in regression tests) as shown below. Partitioned table p has partitions p1, p2, p3, p4, and p5 whose attributes look like this; shown as (colname: attnum, ...). p: (a: 1, b: 4) p1: (a: 1, b: 4) p2: (a: 2, b: 4) p3: (a: 1, b: 3) p4: (a: 3, b: 8) p5: (a: 1, b: 5) You may notice that all partitions but p1 will have a tuple conversion map and hence will undergo adjust_onconflictset_tlist() treatment. insert into p values (1, 'a') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (1, 'b') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (1, 'c') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 0 insert into p values (1) on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 0 insert into p values (2, 'a') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (2, 'b') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (2, 'c') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 0 insert into p values (5, 'a') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (5, 'b') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 1 insert into p values (5, 'c') on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 0 insert into p values (5) on conflict (a) do update set b = excluded.b where excluded.b = 'b'; INSERT 0 0 select tableoid::regclass, * from p; tableoid | a | b ----------+---+--- p1 | 1 | b p2 | 2 | b p5 | 5 | b (3 rows) I have incorporated your patch in the main patch after updating the comments a bit. Also, now that 6666ee49f49 is in [1], the transition table related tests I proposed yesterday pass nicely. Instead of posting as a separate patch, I have merged it with the main patch. So now that planner refactoring is unnecessary, attached is just one patch. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6666ee49f49
Attachment
On 2018/03/20 13:30, Amit Langote wrote: > I have incorporated your patch in the main patch after updating the > comments a bit. Also, now that 6666ee49f49 is in [1], the transition > table related tests I proposed yesterday pass nicely. Instead of posting > as a separate patch, I have merged it with the main patch. So now that > planner refactoring is unnecessary, attached is just one patch. Sorry, I forgot to remove a hunk in the patch affecting src/include/optimizer/prep.h. Fixed in the attached updated version. Thanks, Amit
Attachment
(2018/03/20 13:30), Amit Langote wrote: > On 2018/03/19 21:59, Etsuro Fujita wrote: >> (2018/03/18 13:17), Alvaro Herrera wrote: >>> Alvaro Herrera wrote: >>> The only thing that I remain unhappy about this patch is the whole >>> adjust_and_expand_partition_tlist() thing. I fear we may be doing >>> redundant and/or misplaced work. I'll look into it next week. >> >> I'm still reviewing the patches, but I really agree on that point. As >> Pavan mentioned upthread, the onConflictSet tlist for the root parent, >> from which we create a translated onConflictSet tlist for a partition, >> would have already been processed by expand_targetlist() to contain all >> missing columns as well, so I think we could create the tlist for the >> partition by simply re-ordering the expression-converted tlist (ie, >> conv_setproj) based on the conversion map for the partition. The Attached >> defines a function for that, which could be called, instead of calling >> adjust_and_expand_partition_tlist(). This would allow us to get rid of >> planner changes from the patches. Maybe I'm missing something, though. > > Thanks for the patch. I can confirm your proposed > adjust_onconflictset_tlist() is enough to replace adjust_inherited_tlist() > + expand_targetlist() combination (aka > adjust_and_expand_partition_tlist()), thus rendering the planner changes > in this patch unnecessary. I tested it with a partition tree involving > partitions of varying attribute numbers (dropped columns included) and it > seems to work as expected (as also exercised in regression tests) as shown > below. Thanks for testing! > I have incorporated your patch in the main patch after updating the > comments a bit. Also, now that 6666ee49f49 is in [1], the transition > table related tests I proposed yesterday pass nicely. Instead of posting > as a separate patch, I have merged it with the main patch. So now that > planner refactoring is unnecessary, attached is just one patch. Here are comments on executor changes in (the latest version of) the patch: @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate, ItemPointerData conflictTid; bool specConflict; List *arbiterIndexes; + PartitionTupleRouting *proute = + mtstate->mt_partition_tuple_routing; - arbiterIndexes = node->arbiterIndexes; + /* Use the appropriate list of arbiter indexes. */ + if (mtstate->mt_partition_tuple_routing != NULL) + { + Assert(partition_index >= 0 && proute != NULL); + arbiterIndexes = + proute->partition_arbiter_indexes[partition_index]; + } + else + arbiterIndexes = node->arbiterIndexes; To handle both cases the same way, I wonder if it would be better to have the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro upthread, or to re-add mt_arbiterindexes as before and set it to proute->partition_arbiter_indexes[partition_index] before we get here, maybe in ExecPrepareTupleRouting, in the case of tuple routing. ExecOnConflictUpdate(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo, + TupleDesc onConflictSetTupdesc, ItemPointer conflictTid, TupleTableSlot *planSlot, TupleTableSlot *excludedSlot, @@ -1419,6 +1459,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, ExecCheckHeapTupleVisible(estate, &tuple, buffer); /* Store target's existing tuple in the state's dedicated slot */ + ExecSetSlotDescriptor(mtstate->mt_existing, RelationGetDescr(relation)); ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false); /* @@ -1462,6 +1503,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, } /* Project the new tuple version */ + ExecSetSlotDescriptor(mtstate->mt_conflproj, onConflictSetTupdesc); ExecProject(resultRelInfo->ri_onConflictSetProj); Can we do ExecSetSlotDescriptor for mtstate->mt_existing and mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple routing? That would make the API changes to ExecOnConflictUpdate unnecessary. @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) econtext = mtstate->ps.ps_ExprContext; relationDesc = resultRelInfo->ri_RelationDesc->rd_att; - /* initialize slot for the existing tuple */ - mtstate->mt_existing = - ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc); + /* + * Initialize slot for the existing tuple. We determine which + * tupleDesc to use for this after we have determined which relation + * the insert/update will be applied to, possibly after performing + * tuple routing. + */ + mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state, NULL); /* carried forward solely for the benefit of explain */ mtstate->mt_excludedtlist = node->exclRelTlist; @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) /* create target slot for UPDATE SET projection */ tupDesc = ExecTypeFromTL((List *) node->onConflictSet, relationDesc->tdhasoid); + PinTupleDesc(tupDesc); + mtstate->mt_conflproj_tupdesc = tupDesc; + + /* + * Just like the "existing tuple" slot, we'll defer deciding which + * tupleDesc to use for this slot to a point where tuple routing has + * been performed. + */ mtstate->mt_conflproj = - ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc); + ExecInitExtraTupleSlot(mtstate->ps.state, NULL); If we do ExecInitExtraTupleSlot for mtstate->mt_existing and mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple routing, as said above, we wouldn't need this changes. I think doing that only in the case of tuple routing and keeping this as-is would be better because that would save cycles in the normal case. I'll look at other parts of the patch next. Best regards, Etsuro Fujita
On Tue, Mar 20, 2018 at 10:09 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2018/03/20 13:30, Amit Langote wrote:
> I have incorporated your patch in the main patch after updating the
> comments a bit. Also, now that 6666ee49f49 is in [1], the transition
> table related tests I proposed yesterday pass nicely. Instead of posting
> as a separate patch, I have merged it with the main patch. So now that
> planner refactoring is unnecessary, attached is just one patch.
Sorry, I forgot to remove a hunk in the patch affecting
src/include/optimizer/prep.h. Fixed in the attached updated version.
Thanks for writing a new version. A few comments:
<listitem>
<para>
- Using the <literal>ON CONFLICT</literal> clause with partitioned tables
- will cause an error if the conflict target is specified (see
- <xref linkend="sql-on-conflict" /> for more details on how the clause
- works). Therefore, it is not possible to specify
- <literal>DO UPDATE</literal> as the alternative action, because
- specifying the conflict target is mandatory in that case. On the other
- hand, specifying <literal>DO NOTHING</literal> as the alternative action
- works fine provided the conflict target is not specified. In that case,
- unique constraints (or exclusion constraints) of the individual leaf
- partitions are considered.
- </para>
- </listitem>
We should document it somewhere that partition key update is not supported by
ON CONFLICT DO UPDATE
/*
* get_partition_parent
+ * Obtain direct parent or topmost ancestor of given relation
*
- * Returns inheritance parent of a partition by scanning pg_inherits
+ * Returns direct inheritance parent of a partition by scanning pg_inherits;
+ * or, if 'getroot' is true, the topmost parent in the inheritance hierarchy.
*
* Note: Because this function assumes that the relation whose OID is passed
* as an argument will have precisely one parent, it should only be called
* when it is known that the relation is a partition.
*/
Given that most callers only look for immediate parent, I wonder if it makes
sense to have a new function, get_partition_root(), instead of changing
signature of the current function. That will reduce foot-print of this patch
quite a lot.
@@ -36,6 +38,7 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
Datum *values,
bool *isnull,
int maxfieldlen);
+static List *adjust_onconflictset_tlist(List *tlist, TupleConversionMap *map);
We should name this function in a more generic way, given that it's going to be
used for other things too. What about adjust_partition_tlist?
+
+ /*
+ * If partition's tuple descriptor differs from the root parent,
+ * we need to adjust the onConflictSet target list to account for
+ * differences in attribute numbers.
+ */
+ if (map != NULL)
+ {
+ /*
+ * First convert Vars to contain partition's atttribute
+ * numbers.
+ */
+
+ /* Convert Vars referencing EXCLUDED pseudo-relation. */
+ onconflset = map_partition_varattnos(node->onConflictSet,
+ INNER_VAR,
+ partrel,
+ firstResultRel, NULL);
Are we not modifying node->onConflictSet in place? Or does
map_partition_varattnos() create a fresh copy before scribbling on the input?
If it's former then I guess that's a problem. If it's latter then we ought to
have comments explaining that.
+ tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid);
+ PinTupleDesc(tupDesc);
Why do we need to pin the descriptor? If we do need, why don't we need
corresponding ReleaseTupDesc() call?
I am still confused if the partition_conflproj_tdescs really belongs to
PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for the
MERGE patch, I moved everything to a new struct and made it part of the
ResultRelInfo. If no re-mapping is necessary, we can just point to the struct
in the root relation's ResultRelInfo. Otherwise create and populate a new one
in the partition relation's ResultRelInfo.
+ leaf_part_rri->ri_onConflictSetWhere =
+ ExecInitQual(onconflwhere, &mtstate->ps);
+ }
So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
ResultRelInfo. Why not move mt_conflproj_tupdesc, partition_conflproj_tdescs,
partition_arbiter_indexes etc to the ResultRelInfo as well?
+
+/*
+ * Adjust the targetlist entries of an inherited ON CONFLICT DO UPDATE
+ * operation for a given partition
+ *
As I said above, we should disassociate this function from ON CONFLICT DO
UPDATE and rather have it as a general purpose facility.
+ * The expressions have already been fixed, but we have to make sure that the
+ * target resnos match the partition. In some cases, this can force us to
+ * re-order the tlist to preserve resno ordering.
+ *
Can we have some explanation regarding how the targetlist is reordered? I know
the function does that by updating the resno in place, but some explanation
would help. Also, should we add an assertion-build check to confirm that the
resultant list is actually ordered?
@@ -66,7 +67,8 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
EState *estate,
PartitionTupleRouting *proute,
ResultRelInfo *targetRelInfo,
- TupleTableSlot *slot);
+ TupleTableSlot *slot,
+ int *partition_index);
static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
@@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,
TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
+ int partition_index,
bool canSetTag)
{
HeapTuple tuple;
If we move the list of arbiter indexes and the tuple desc to ResultRelInfo, as
suggested above, I think we can avoid making any API changes to
ExecPrepareTupleRouting and ExecInsert.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
PostgreSQL Development, 24x7 Support, Training & Services
Fujita-san, Pavan, Thank you both for reviewing. Replying to both emails here. On 2018/03/20 20:53, Etsuro Fujita wrote: > Here are comments on executor changes in (the latest version of) the patch: > > @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate, > ItemPointerData conflictTid; > bool specConflict; > List *arbiterIndexes; > + PartitionTupleRouting *proute = > + mtstate->mt_partition_tuple_routing; > > - arbiterIndexes = node->arbiterIndexes; > + /* Use the appropriate list of arbiter indexes. */ > + if (mtstate->mt_partition_tuple_routing != NULL) > + { > + Assert(partition_index >= 0 && proute != NULL); > + arbiterIndexes = > + proute->partition_arbiter_indexes[partition_index]; > + } > + else > + arbiterIndexes = node->arbiterIndexes; > > To handle both cases the same way, I wonder if it would be better to have > the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro > upthread, or to re-add mt_arbiterindexes as before and set it to > proute->partition_arbiter_indexes[partition_index] before we get here, > maybe in ExecPrepareTupleRouting, in the case of tuple routing. It's a good idea. I somehow missed that Alvaro had already mentioned it. In HEAD, we now have ri_onConflictSetProj and ri_onConflictSetWhere. I propose we name the field ri_onConflictArbiterIndexes as done in the updated patch. > ExecOnConflictUpdate(ModifyTableState *mtstate, > ResultRelInfo *resultRelInfo, > + TupleDesc onConflictSetTupdesc, > ItemPointer conflictTid, > TupleTableSlot *planSlot, > TupleTableSlot *excludedSlot, > @@ -1419,6 +1459,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, > ExecCheckHeapTupleVisible(estate, &tuple, buffer); > > /* Store target's existing tuple in the state's dedicated slot */ > + ExecSetSlotDescriptor(mtstate->mt_existing, RelationGetDescr(relation)); > ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false); > > /* > @@ -1462,6 +1503,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate, > } > > /* Project the new tuple version */ > + ExecSetSlotDescriptor(mtstate->mt_conflproj, onConflictSetTupdesc); > ExecProject(resultRelInfo->ri_onConflictSetProj); > > Can we do ExecSetSlotDescriptor for mtstate->mt_existing and > mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple > routing? That would make the API changes to ExecOnConflictUpdate > unnecessary. That's a good idea too, so done. > > @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState > *estate, int eflags) > econtext = mtstate->ps.ps_ExprContext; > relationDesc = resultRelInfo->ri_RelationDesc->rd_att; > > - /* initialize slot for the existing tuple */ > - mtstate->mt_existing = > - ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc); > + /* > + * Initialize slot for the existing tuple. We determine which > + * tupleDesc to use for this after we have determined which relation > + * the insert/update will be applied to, possibly after performing > + * tuple routing. > + */ > + mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state, > NULL); > > /* carried forward solely for the benefit of explain */ > mtstate->mt_excludedtlist = node->exclRelTlist; > @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState > *estate, int eflags) > /* create target slot for UPDATE SET projection */ > tupDesc = ExecTypeFromTL((List *) node->onConflictSet, > relationDesc->tdhasoid); > + PinTupleDesc(tupDesc); > + mtstate->mt_conflproj_tupdesc = tupDesc; > + > + /* > + * Just like the "existing tuple" slot, we'll defer deciding which > + * tupleDesc to use for this slot to a point where tuple routing has > + * been performed. > + */ > mtstate->mt_conflproj = > - ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc); > + ExecInitExtraTupleSlot(mtstate->ps.state, NULL); > > If we do ExecInitExtraTupleSlot for mtstate->mt_existing and > mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple > routing, as said above, we wouldn't need this changes. I think doing that > only in the case of tuple routing and keeping this as-is would be better > because that would save cycles in the normal case. Hmm, I think we shouldn't be doing ExecInitExtraTupleSlot in ExecPrepareTupleRouting, because we shouldn't have more than one instance of mtstate->mt_existing and mtstate->mt_conflproj slots. As you also said above, I think you meant to say here that we do ExecInitExtraTupleSlot only once for both mtstate->mt_existing and mtstate->mt_conflproj in ExecInitModifyTable and only do ExecSetSlotDescriptor in ExecPrepareTupleRouting. I have changed it so that ExecInitModifyTable now both creates the slot and sets the descriptor for non-tuple-routing cases and only creates but doesn't set the descriptor in the tuple-routing case. For ExecPrepareTupleRouting to be able to access the tupDesc of the onConflictSet target list, I've added ri_onConflictSetProjTupDesc which is set by ExecInitPartitionInfo on first call for a give partition. This is also suggested by Pavan in his review. Considering all of that, both mt_conflproj_tupdesc and partition_conflproj_tdescs (the array in PartitionTupleRouting) no longer exist in the patch. And since arbiterIndexes has been moved into ResultRelInfo too, partition_arbiter_indexes (the array in PartitionTupleRouting) is gone too. On 2018/03/22 13:34, Pavan Deolasee wrote: > Thanks for writing a new version. A few comments: > > > <listitem> > <para> > - Using the <literal>ON CONFLICT</literal> clause with partitioned > tables > - will cause an error if the conflict target is specified (see > - <xref linkend="sql-on-conflict" /> for more details on how the > clause > - works). Therefore, it is not possible to specify > - <literal>DO UPDATE</literal> as the alternative action, because > - specifying the conflict target is mandatory in that case. On the > other > - hand, specifying <literal>DO NOTHING</literal> as the alternative > action > - works fine provided the conflict target is not specified. In that > case, > - unique constraints (or exclusion constraints) of the individual leaf > - partitions are considered. > - </para> > - </listitem> > > We should document it somewhere that partition key update is not supported > by > ON CONFLICT DO UPDATE Agreed. I have added a line on INSERT reference page to mention this limitation. > /* > * get_partition_parent > + * Obtain direct parent or topmost ancestor of given relation > * > - * Returns inheritance parent of a partition by scanning pg_inherits > + * Returns direct inheritance parent of a partition by scanning > pg_inherits; > + * or, if 'getroot' is true, the topmost parent in the inheritance > hierarchy. > * > * Note: Because this function assumes that the relation whose OID is > passed > * as an argument will have precisely one parent, it should only be called > * when it is known that the relation is a partition. > */ > > Given that most callers only look for immediate parent, I wonder if it makes > sense to have a new function, get_partition_root(), instead of changing > signature of the current function. That will reduce foot-print of this patch > quite a lot. It seems like a good idea, so done that way. > @@ -36,6 +38,7 @@ static char > *ExecBuildSlotPartitionKeyDescription(Relation rel, > Datum *values, > bool *isnull, > int maxfieldlen); > +static List *adjust_onconflictset_tlist(List *tlist, TupleConversionMap > *map); > > We should name this function in a more generic way, given that it's going > to be > used for other things too. What about adjust_partition_tlist? I think that makes sense. We were trying to use adjust_inherited_tlist in the earlier versions of this patch, so adjust_partition_tlist sounds like a good name for this piece of code. > + > + /* > + * If partition's tuple descriptor differs from the root parent, > + * we need to adjust the onConflictSet target list to account for > + * differences in attribute numbers. > + */ > + if (map != NULL) > + { > + /* > + * First convert Vars to contain partition's atttribute > + * numbers. > + */ > + > + /* Convert Vars referencing EXCLUDED pseudo-relation. */ > + onconflset = map_partition_varattnos(node->onConflictSet, > + INNER_VAR, > + partrel, > + firstResultRel, NULL); > > Are we not modifying node->onConflictSet in place? Or does > map_partition_varattnos() create a fresh copy before scribbling on the > input? > If it's former then I guess that's a problem. If it's latter then we ought > to > have comments explaining that. A copy is made before scribbling. Clarified that in the nearby comment. > + tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid); > + PinTupleDesc(tupDesc); > > Why do we need to pin the descriptor? If we do need, why don't we need > corresponding ReleaseTupDesc() call? PinTupleDesc was added in the patch as Alvaro had submitted it upthread, which it wasn't clear to me either why it was needed. Looking at it closely, it seems we can get rid of it if for the lack of corresponding ReleaseTupleDesc(). ExecSetSlotDescriptor() seems to take care of pinning and releasing tuple descriptors that are passed to it. If some partition's tupDesc remains pinned because it was the last one that was passed to it, the final ExecResetTupleTable will take care of releasing it. I have removed the instances of PinTupleDesc in the updated patch, but I'm not sure why the loose PinTupleDesc() in the previous version of the patch didn't cause reference leak warnings or some such. > I am still confused if the partition_conflproj_tdescs really belongs to > PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for > the > MERGE patch, I moved everything to a new struct and made it part of the > ResultRelInfo. If no re-mapping is necessary, we can just point to the > struct > in the root relation's ResultRelInfo. Otherwise create and populate a new > one > in the partition relation's ResultRelInfo. > > + leaf_part_rri->ri_onConflictSetWhere = > + ExecInitQual(onconflwhere, &mtstate->ps); > + } > > So ri_onConflictSetWhere and ri_onConflictSetProj are part of the > ResultRelInfo. Why not move mt_conflproj_tupdesc, > partition_conflproj_tdescs, > partition_arbiter_indexes etc to the ResultRelInfo as well? I have moved both the projection tupdesc and the arbiter indexes list into ResultRelInfo as I wrote above. > + > +/* > + * Adjust the targetlist entries of an inherited ON CONFLICT DO UPDATE > + * operation for a given partition > + * > > As I said above, we should disassociate this function from ON CONFLICT DO > UPDATE and rather have it as a general purpose facility. OK, have fixed the comment and the name as mentioned above. > + * The expressions have already been fixed, but we have to make sure that > the > + * target resnos match the partition. In some cases, this can force us to > + * re-order the tlist to preserve resno ordering. > + * > > Can we have some explanation regarding how the targetlist is reordered? I > know > the function does that by updating the resno in place, but some explanation > would help. Also, should we add an assertion-build check to confirm that the > resultant list is actually ordered? OK, added a comment and also the assertion-build check on the order of entries. > > @@ -66,7 +67,8 @@ static TupleTableSlot > *ExecPrepareTupleRouting(ModifyTableState *mtstate, > EState *estate, > PartitionTupleRouting *proute, > ResultRelInfo *targetRelInfo, > - TupleTableSlot *slot); > + TupleTableSlot *slot, > + int *partition_index); > static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node); > static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate); > static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate); > @@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate, > TupleTableSlot *slot, > TupleTableSlot *planSlot, > EState *estate, > + int partition_index, > bool canSetTag) > { > HeapTuple tuple; > > If we move the list of arbiter indexes and the tuple desc to ResultRelInfo, > as > suggested above, I think we can avoid making any API changes to > ExecPrepareTupleRouting and ExecInsert. Those API changes are no longer part of the patch. Attached please find an updated version. Thanks, Amit
Attachment
On Thu, Mar 22, 2018 at 3:01 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Why do we need to pin the descriptor? If we do need, why don't we need
> corresponding ReleaseTupDesc() call?
PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
which it wasn't clear to me either why it was needed. Looking at it
closely, it seems we can get rid of it if for the lack of corresponding
ReleaseTupleDesc(). ExecSetSlotDescriptor() seems to take care of pinning
and releasing tuple descriptors that are passed to it. If some
partition's tupDesc remains pinned because it was the last one that was
passed to it, the final ExecResetTupleTable will take care of releasing it.
I have removed the instances of PinTupleDesc in the updated patch, but I'm
not sure why the loose PinTupleDesc() in the previous version of the patch
didn't cause reference leak warnings or some such.
Yeah, it wasn't clear to me as well. But I did not investigate. May be Alvaro knows better.
> I am still confused if the partition_conflproj_tdescs really belongs to
> PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for
> the
> MERGE patch, I moved everything to a new struct and made it part of the
> ResultRelInfo. If no re-mapping is necessary, we can just point to the
> struct
> in the root relation's ResultRelInfo. Otherwise create and populate a new
> one
> in the partition relation's ResultRelInfo.
>
> + leaf_part_rri->ri_onConflictSetWhere =
> + ExecInitQual(onconflwhere, &mtstate->ps);
> + }
>
> So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> ResultRelInfo. Why not move mt_conflproj_tupdesc,
> partition_conflproj_tdescs,
> partition_arbiter_indexes etc to the ResultRelInfo as well?
I have moved both the projection tupdesc and the arbiter indexes list into
ResultRelInfo as I wrote above.
Is there anything that needs to be done for transition tables? I checked and didn't see anything, but please check.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/03/22 20:48, Pavan Deolasee wrote: > Thanks. It's looking much better now. Thanks. > I think we can possibly move all ON > CONFLICT related members to a separate structure and just copy the pointer > to the structure if (map == NULL). That might make the code a bit more tidy. OK, I tried that in the attached updated patch. > Is there anything that needs to be done for transition tables? I checked > and didn't see anything, but please check. There doesn't seem to be anything that this patch has to do for transition tables. If you look at the tests I added in triggers.sql which exercise INSERT ON CONFLICT's interaction with transition tables, you can see that we get the same output for a partitioned table as we get for a normal table. Thanks, Amit
Attachment
(2018/03/22 18:31), Amit Langote wrote: > On 2018/03/20 20:53, Etsuro Fujita wrote: >> Here are comments on executor changes in (the latest version of) the patch: >> >> @@ -421,8 +424,18 @@ ExecInsert(ModifyTableState *mtstate, >> ItemPointerData conflictTid; >> bool specConflict; >> List *arbiterIndexes; >> + PartitionTupleRouting *proute = >> + mtstate->mt_partition_tuple_routing; >> >> - arbiterIndexes = node->arbiterIndexes; >> + /* Use the appropriate list of arbiter indexes. */ >> + if (mtstate->mt_partition_tuple_routing != NULL) >> + { >> + Assert(partition_index>= 0&& proute != NULL); >> + arbiterIndexes = >> + proute->partition_arbiter_indexes[partition_index]; >> + } >> + else >> + arbiterIndexes = node->arbiterIndexes; >> >> To handle both cases the same way, I wonder if it would be better to have >> the arbiterindexes list in ResultRelInfo as well, as mentioned by Alvaro >> upthread, or to re-add mt_arbiterindexes as before and set it to >> proute->partition_arbiter_indexes[partition_index] before we get here, >> maybe in ExecPrepareTupleRouting, in the case of tuple routing. > > It's a good idea. I somehow missed that Alvaro had already mentioned it. > > In HEAD, we now have ri_onConflictSetProj and ri_onConflictSetWhere. I > propose we name the field ri_onConflictArbiterIndexes as done in the > updated patch. I like that naming. >> @@ -2368,9 +2419,13 @@ ExecInitModifyTable(ModifyTable *node, EState >> *estate, int eflags) >> econtext = mtstate->ps.ps_ExprContext; >> relationDesc = resultRelInfo->ri_RelationDesc->rd_att; >> >> - /* initialize slot for the existing tuple */ >> - mtstate->mt_existing = >> - ExecInitExtraTupleSlot(mtstate->ps.state, relationDesc); >> + /* >> + * Initialize slot for the existing tuple. We determine which >> + * tupleDesc to use for this after we have determined which relation >> + * the insert/update will be applied to, possibly after performing >> + * tuple routing. >> + */ >> + mtstate->mt_existing = ExecInitExtraTupleSlot(mtstate->ps.state, >> NULL); >> >> /* carried forward solely for the benefit of explain */ >> mtstate->mt_excludedtlist = node->exclRelTlist; >> @@ -2378,8 +2433,16 @@ ExecInitModifyTable(ModifyTable *node, EState >> *estate, int eflags) >> /* create target slot for UPDATE SET projection */ >> tupDesc = ExecTypeFromTL((List *) node->onConflictSet, >> relationDesc->tdhasoid); >> + PinTupleDesc(tupDesc); >> + mtstate->mt_conflproj_tupdesc = tupDesc; >> + >> + /* >> + * Just like the "existing tuple" slot, we'll defer deciding which >> + * tupleDesc to use for this slot to a point where tuple routing has >> + * been performed. >> + */ >> mtstate->mt_conflproj = >> - ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc); >> + ExecInitExtraTupleSlot(mtstate->ps.state, NULL); >> >> If we do ExecInitExtraTupleSlot for mtstate->mt_existing and >> mtstate->mt_conflproj in ExecPrepareTupleRouting in the case of tuple >> routing, as said above, we wouldn't need this changes. I think doing that >> only in the case of tuple routing and keeping this as-is would be better >> because that would save cycles in the normal case. > > Hmm, I think we shouldn't be doing ExecInitExtraTupleSlot in > ExecPrepareTupleRouting, because we shouldn't have more than one instance > of mtstate->mt_existing and mtstate->mt_conflproj slots. Yeah, I think so too. What I was going to say here is ExecSetSlotDescriptor, not ExecInitExtraTupleSlot, as you said below. Sorry about the incorrectness. I guess I was too tired when writing that comments. > As you also said above, I think you meant to say here that we do > ExecInitExtraTupleSlot only once for both mtstate->mt_existing and > mtstate->mt_conflproj in ExecInitModifyTable and only do > ExecSetSlotDescriptor in ExecPrepareTupleRouting. That's right. > I have changed it so > that ExecInitModifyTable now both creates the slot and sets the descriptor > for non-tuple-routing cases and only creates but doesn't set the > descriptor in the tuple-routing case. IMHO I don't see much value in modifying code as such, because we do ExecSetSlotDescriptor for mt_existing and mt_conflproj in ExecPrepareTupleRouting for every inserted tuple. So, I would leave that as-is, to keep that simple. > For ExecPrepareTupleRouting to be able to access the tupDesc of the > onConflictSet target list, I've added ri_onConflictSetProjTupDesc which is > set by ExecInitPartitionInfo on first call for a give partition. This is > also suggested by Pavan in his review. Seems like a good idea. Here are some comments on the latest version of the patch: + /* + * Caller must set mtstate->mt_conflproj's tuple descriptor to + * this one before trying to use it for projection. + */ + tupDesc = ExecTypeFromTL(onconflset, partrelDesc->tdhasoid); + leaf_part_rri->ri_onConflictSet->proj = + ExecBuildProjectionInfo(onconflset, econtext, + mtstate->mt_conflproj, + &mtstate->ps, partrelDesc); ExecBuildProjectionInfo is called without setting the tuple descriptor of mtstate->mt_conflproj to tupDesc. That might work at least for now, but I think it's a good thing to set it appropriately to make that future proof. + * This corresponds to a dropped attribute in the partition, for + * which we enerate a dummy entry with resno matching the + * partition's attno. s/enerate/generate/ + * OnConflictSetState + * + * Contains execution time state of a ON CONFLICT DO UPDATE operation, which + * includes the state of projection, tuple descriptor of the projection, and + * WHERE quals if any. s/a ON/an ON/ +typedef struct OnConflictSetState +{ /* for computing ON CONFLICT DO UPDATE SET */ This is nitpicking, but this wouldn't follow the project style, so I think that needs re-formatting. I'll look at the patch a little bit more early next week. Thanks for updating the patch! Best regards, Etsuro Fujita
Thanks for these changes. I'm going over this now, with intention to push it shortly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I made a bunch of further edits and I think this v10 is ready to push. Before doing so I'll give it a final look, particularly because of the new elog(ERROR) I added. Post-commit review is of course always appreciated. Most notable change is because I noticed that if you mention an intermediate partition level in the INSERT command, and the index is on the top level, arbiter index selection fails to find the correct index because it walks all the way to the top instead of stopping in the middle, as it should (the command was still working because it ended up with an empty arbiter index list). To fix this, I had to completely rework the "get partition parent root" stuff into "get list of ancestors of this partition". Because of this, I added a new check that the partition's arbiter index list is same length as parent's; if not, throw an error. I couldn't get it to fire (so it's just an elog not ereport), but maybe I just didn't try any weird enough scenarios. Other changes: * I added a copyObject() call for nodes we're operating upon. Maybe this is unnecessary but the comments claimed "we're working on a copy" and I couldn't find any place where we were actually making one. Anyway it seems sane to make a copy, because we're scribbling on those nodes ... I hope I didn't introduce any serious memory leaks. * I made the new OnConflictSetState thing into a proper node. Like ResultRelInfo, it doesn't have any niceties like nodeToString support, but it seems saner this way (palloc -> makeNode). I reworked the formatting of that struct definition too, and renamed members. * I removed an assertion block at the bottom of adjust_partition_tlist. It seemed quite pointless, since it was just about checking that the resno values were sorted, but by construction we already know that they are indeed sorted ... * General code style improvements, comment rewording, etc. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2018/03/24 9:23, Alvaro Herrera wrote: > I made a bunch of further edits and I think this v10 is ready to push. > Before doing so I'll give it a final look, particularly because of the > new elog(ERROR) I added. Post-commit review is of course always > appreciated. > > Most notable change is because I noticed that if you mention an > intermediate partition level in the INSERT command, and the index is on > the top level, arbiter index selection fails to find the correct index > because it walks all the way to the top instead of stopping in the > middle, as it should (the command was still working because it ended up > with an empty arbiter index list). Good catch! > To fix this, I had to completely rework the "get partition parent root" > stuff into "get list of ancestors of this partition". I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough instead of creating a list of ancestors and then looping over it as you've done, but maybe what you have here is fine. > Because of this, I added a new check that the partition's arbiter index > list is same length as parent's; if not, throw an error. I couldn't get > it to fire (so it's just an elog not ereport), but maybe I just didn't > try any weird enough scenarios. > > Other changes: > > * I added a copyObject() call for nodes we're operating upon. Maybe > this is unnecessary but the comments claimed "we're working on a copy" > and I couldn't find any place where we were actually making one. > Anyway it seems sane to make a copy, because we're scribbling on those > nodes ... I hope I didn't introduce any serious memory leaks. That seems fine as ExecInitPartitionInfo allocates in the query context (es_query_cxt). > * I made the new OnConflictSetState thing into a proper node. Like > ResultRelInfo, it doesn't have any niceties like nodeToString support, > but it seems saner this way (palloc -> makeNode). I reworked the > formatting of that struct definition too, and renamed members. Looks good, thanks. > * I removed an assertion block at the bottom of adjust_partition_tlist. > It seemed quite pointless, since it was just about checking that the > resno values were sorted, but by construction we already know that > they are indeed sorted ... Hmm yes. > * General code style improvements, comment rewording, etc. There was one comment in Fujita-san's review he posted on Friday [1] that doesn't seem to be addressed in v10, which I think we probably should. It was this comment: "ExecBuildProjectionInfo is called without setting the tuple descriptor of mtstate->mt_conflproj to tupDesc. That might work at least for now, but I think it's a good thing to set it appropriately to make that future proof." All of his other comments seem to have been taken care of in v10. I have fixed the above one in the attached updated version. Thanks, Amit [1] https://www.postgresql.org/message-id/5AB4DEB6.2020901%40lab.ntt.co.jp
Attachment
Pushed now. Amit Langote wrote: > On 2018/03/24 9:23, Alvaro Herrera wrote: > > To fix this, I had to completely rework the "get partition parent root" > > stuff into "get list of ancestors of this partition". > > I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough > instead of creating a list of ancestors and then looping over it as you've > done, but maybe what you have here is fine. Yeah, I wondered about doing it that way too (since you can stop looking early), but decided that I didn't like repeatedly opening pg_inherits for each level. Anyway the most common case is a single level, and in rare cases two levels ... I don't think we're going to see much more than that. So it doesn't matter too much. We can refine later anyway, if this becomes a hot spot (I doubt it TBH). > > * General code style improvements, comment rewording, etc. > > There was one comment in Fujita-san's review he posted on Friday [1] that > doesn't seem to be addressed in v10, which I think we probably should. It > was this comment: > > "ExecBuildProjectionInfo is called without setting the tuple descriptor of > mtstate->mt_conflproj to tupDesc. That might work at least for now, but I > think it's a good thing to set it appropriately to make that future proof." > > All of his other comments seem to have been taken care of in v10. I have > fixed the above one in the attached updated version. I was of two minds about this item myself; we don't use the tupdesc for anything at that point and I expect more things would break if we required that. But I don't think it hurts, so I kept it. The one thing I wasn't terribly in love with is the four calls to map_partition_varattnos(), creating the attribute map four times ... but we already have it in the TupleConversionMap, no? Looks like we could save a bunch of work there. And a final item is: can we have a whole-row expression in the clauses? We currently don't handle those either, not even to throw an error. [figures a test case] ... and now that I test it, it does crash! create table part (a int primary key, b text) partition by range (a); create table part1 (b text, a int not null); alter table part attach partition part1 for values from (1) to (1000); insert into part values (1, 'two') on conflict (a) do update set b = format('%s (was %s)', excluded.b, part.b) where part.* *<> (1, text 'two'); I think this means we should definitely handle found_whole_row. (If you create part1 in the normal way, it works as you'd expect.) I'm going to close a few other things first, then come back to this. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/03/26 23:20, Alvaro Herrera wrote: > Pushed now. Thank you! > Amit Langote wrote: >> On 2018/03/24 9:23, Alvaro Herrera wrote: > >>> To fix this, I had to completely rework the "get partition parent root" >>> stuff into "get list of ancestors of this partition". >> >> I wondered if a is_partition_ancestor(partrelid, ancestorid) isn't enough >> instead of creating a list of ancestors and then looping over it as you've >> done, but maybe what you have here is fine. > > Yeah, I wondered about doing it that way too (since you can stop looking > early), but decided that I didn't like repeatedly opening pg_inherits > for each level. Anyway the most common case is a single level, and in > rare cases two levels ... I don't think we're going to see much more > than that. So it doesn't matter too much. We can refine later anyway, > if this becomes a hot spot (I doubt it TBH). Yes, I suppose. >>> * General code style improvements, comment rewording, etc. >> >> There was one comment in Fujita-san's review he posted on Friday [1] that >> doesn't seem to be addressed in v10, which I think we probably should. It >> was this comment: >> >> "ExecBuildProjectionInfo is called without setting the tuple descriptor of >> mtstate->mt_conflproj to tupDesc. That might work at least for now, but I >> think it's a good thing to set it appropriately to make that future proof." >> >> All of his other comments seem to have been taken care of in v10. I have >> fixed the above one in the attached updated version. > > I was of two minds about this item myself; we don't use the tupdesc for > anything at that point and I expect more things would break if we > required that. But I don't think it hurts, so I kept it. > > The one thing I wasn't terribly in love with is the four calls to > map_partition_varattnos(), creating the attribute map four times ... but > we already have it in the TupleConversionMap, no? Looks like we could > save a bunch of work there. Hmm, actually we can't use that map, assuming you're talking about the following map: TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx]; We can only use that to tell if we need converting expressions (as we currently do), but it cannot be used to actually convert the expressions. The map in question is for use by do_convert_tuple(), not to map varattnos in Vars using map_variable_attnos(). But it's definitely a bit undesirable to have various map_partition_varattnos() calls within ExecInitPartitionInfo() to initialize the same information (the map) multiple times. > And a final item is: can we have a whole-row expression in the clauses? > We currently don't handle those either, not even to throw an error. > [figures a test case] ... and now that I test it, it does crash! > > create table part (a int primary key, b text) partition by range (a); > create table part1 (b text, a int not null); > alter table part attach partition part1 for values from (1) to (1000); > insert into part values (1, 'two') on conflict (a) > do update set b = format('%s (was %s)', excluded.b, part.b) > where part.* *<> (1, text 'two'); > > I think this means we should definitely handle found_whole_row. (If you > create part1 in the normal way, it works as you'd expect.) I agree. That means we simply remove the Assert after the map_partition_varattnos call. > I'm going to close a few other things first, then come back to this. Attached find a patch to fix the whole-row expression issue. I added your test to insert_conflict.sql. Thanks, Amit
Attachment
On 2018/03/27 13:27, Amit Langote wrote: > On 2018/03/26 23:20, Alvaro Herrera wrote: >> The one thing I wasn't terribly in love with is the four calls to >> map_partition_varattnos(), creating the attribute map four times ... but >> we already have it in the TupleConversionMap, no? Looks like we could >> save a bunch of work there. > > Hmm, actually we can't use that map, assuming you're talking about the > following map: > > TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx]; > > We can only use that to tell if we need converting expressions (as we > currently do), but it cannot be used to actually convert the expressions. > The map in question is for use by do_convert_tuple(), not to map varattnos > in Vars using map_variable_attnos(). > > But it's definitely a bit undesirable to have various > map_partition_varattnos() calls within ExecInitPartitionInfo() to > initialize the same information (the map) multiple times. I will try to think of doing something about this later this week. >> And a final item is: can we have a whole-row expression in the clauses? >> We currently don't handle those either, not even to throw an error. >> [figures a test case] ... and now that I test it, it does crash! >> >> create table part (a int primary key, b text) partition by range (a); >> create table part1 (b text, a int not null); >> alter table part attach partition part1 for values from (1) to (1000); >> insert into part values (1, 'two') on conflict (a) >> do update set b = format('%s (was %s)', excluded.b, part.b) >> where part.* *<> (1, text 'two'); >> >> I think this means we should definitely handle found_whole_row. (If you >> create part1 in the normal way, it works as you'd expect.) > > I agree. That means we simply remove the Assert after the > map_partition_varattnos call. > >> I'm going to close a few other things first, then come back to this. > > Attached find a patch to fix the whole-row expression issue. I added your > test to insert_conflict.sql. Adding this to the open items list. Thanks, Amit
On 2018/04/10 11:56, Amit Langote wrote: > On 2018/03/27 13:27, Amit Langote wrote: >> On 2018/03/26 23:20, Alvaro Herrera wrote: >>> The one thing I wasn't terribly in love with is the four calls to >>> map_partition_varattnos(), creating the attribute map four times ... but >>> we already have it in the TupleConversionMap, no? Looks like we could >>> save a bunch of work there. >> >> Hmm, actually we can't use that map, assuming you're talking about the >> following map: >> >> TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx]; >> >> We can only use that to tell if we need converting expressions (as we >> currently do), but it cannot be used to actually convert the expressions. >> The map in question is for use by do_convert_tuple(), not to map varattnos >> in Vars using map_variable_attnos(). >> >> But it's definitely a bit undesirable to have various >> map_partition_varattnos() calls within ExecInitPartitionInfo() to >> initialize the same information (the map) multiple times. > > I will try to think of doing something about this later this week. The solution I came up with is to call map_variable_attnos() directly, instead of going through map_partition_varattnos() every time, after first creating the attribute map ourselves. >>> And a final item is: can we have a whole-row expression in the clauses? >>> We currently don't handle those either, not even to throw an error. >>> [figures a test case] ... and now that I test it, it does crash! >>> >>> create table part (a int primary key, b text) partition by range (a); >>> create table part1 (b text, a int not null); >>> alter table part attach partition part1 for values from (1) to (1000); >>> insert into part values (1, 'two') on conflict (a) >>> do update set b = format('%s (was %s)', excluded.b, part.b) >>> where part.* *<> (1, text 'two'); >>> >>> I think this means we should definitely handle found_whole_row. (If you >>> create part1 in the normal way, it works as you'd expect.) >> >> I agree. That means we simply remove the Assert after the >> map_partition_varattnos call. >> >>> I'm going to close a few other things first, then come back to this. >> >> Attached find a patch to fix the whole-row expression issue. I added your >> test to insert_conflict.sql. Combined the above patch into the attached patch. Thanks, Amit
Attachment
Amit Langote wrote: > The solution I came up with is to call map_variable_attnos() directly, > instead of going through map_partition_varattnos() every time, after first > creating the attribute map ourselves. Yeah, sounds good. I added a tweak: if the tupledescs are equal, there should be no need to do any mapping. (Minor adjustment to the test: "cuarenta" means forty, so I changed the new test to say "cincuenta" instead, which means fifty). Pushed now, thanks. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Pushed now, thanks. Buildfarm doesn't like this even a little bit. regards, tom lane
On 2018/04/17 4:10, Alvaro Herrera wrote: > Amit Langote wrote: > >> The solution I came up with is to call map_variable_attnos() directly, >> instead of going through map_partition_varattnos() every time, after first >> creating the attribute map ourselves. > > Yeah, sounds good. I added a tweak: if the tupledescs are equal, there > should be no need to do any mapping. Thanks for the commit! About the equalTupleDescs()-based optimization you added -- It seems to me that that *always* returns false if you pass it tupledescs of two different tables, which in this case, we do. That's because equalTupleDescs has this: if (tupdesc1->tdtypeid != tupdesc2->tdtypeid) return false; So, it fails to optimize as you were hoping it would. Instead of doing this, I think we should try to make convert_tuples_by_name_map() a bit smarter by integrating the logic in convert_tuples_by_name() that's used conclude if no tuple conversion is necessary. So, if it turns that the tuples descriptors passed to convert_tuples_by_name_map() contain the same number of attributes and the individual attributes are at the same positions, we signal to the caller that no conversion is necessary by returning NULL. Attached find a patch that does that. When working on this, I noticed that when recursing for inheritance children, ATPrepAlterColumnType() would use a AlterTableCmd (cmd) that's already scribbled on as if it were the original. Thanks, Amit
Attachment
On 2018/04/17 16:45, Amit Langote wrote: > Instead of doing this, I think we should try to make > convert_tuples_by_name_map() a bit smarter by integrating the logic in > convert_tuples_by_name() that's used conclude if no tuple conversion is > necessary. So, if it turns that the tuples descriptors passed to > convert_tuples_by_name_map() contain the same number of attributes and the > individual attributes are at the same positions, we signal to the caller > that no conversion is necessary by returning NULL. > > Attached find a patch that does that. I just confirmed my hunch that this wouldn't somehow do the right thing when the OID system column is involved. Like this case: create table parent (a int); create table child () inherits (parent) with oids; insert into parent values (1); insert into child values (1); analyze parent; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. That's because, convert_tuples_by_name() that's called by acquire_inherited_sample_rows() gets a TupleConversionMap whose attrMap is set to NULL. do_convert_tuple() may then try to access a member of such NULL attrMap. In this case, even if parent and child tables have same user attributes, patched convert_tuples_by_name_map would return NULL, but since their hasoids setting doesn't match, a TupleConversionMap is still returned but has its attrMap set to NULL. To fix that, I taught do_convert_tuple() to ignore the map if NULL. Also, free_conversion_map() shouldn't try to free attrMap if it's NULL. Attached updated patch. Thanks, Amit
Attachment
Amit Langote wrote: > Attached find a patch that does that. When working on this, I noticed > that when recursing for inheritance children, ATPrepAlterColumnType() > would use a AlterTableCmd (cmd) that's already scribbled on as if it were > the original. While I agree that the code here is in poor style, there is no live bug here, because the only thing that is changed each time is the copy's cmd->def, and its value is not obtained from the scribbled 'cmd' -- it's obtained from the passed-in cmd->def, which is unmodified. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Langote wrote: > I just confirmed my hunch that this wouldn't somehow do the right thing > when the OID system column is involved. Like this case: This looks too big a patch to pursue now. I'm inclined to just remove the equalTupdesc changes. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/18 0:02, Alvaro Herrera wrote: > Amit Langote wrote: > >> Attached find a patch that does that. When working on this, I noticed >> that when recursing for inheritance children, ATPrepAlterColumnType() >> would use a AlterTableCmd (cmd) that's already scribbled on as if it were >> the original. > > While I agree that the code here is in poor style, there is no live bug > here, because the only thing that is changed each time is the copy's > cmd->def, and its value is not obtained from the scribbled 'cmd' -- it's > obtained from the passed-in cmd->def, which is unmodified. Ah, you're right. The original cmd->def itself remains intact. Thanks, Amit
On 2018/04/18 0:04, Alvaro Herrera wrote: > Amit Langote wrote: > >> I just confirmed my hunch that this wouldn't somehow do the right thing >> when the OID system column is involved. Like this case: > > This looks too big a patch to pursue now. I'm inclined to just remove > the equalTupdesc changes. OK. Here is the patch that removes equalTupdesc optimization. I will add the rest of the patch to the next CF after starting a new thread for it sometime later. Thanks, Amit
Attachment
Amit Langote wrote: > On 2018/04/18 0:04, Alvaro Herrera wrote: > > Amit Langote wrote: > > > >> I just confirmed my hunch that this wouldn't somehow do the right thing > >> when the OID system column is involved. Like this case: > > > > This looks too big a patch to pursue now. I'm inclined to just remove > > the equalTupdesc changes. > > OK. Here is the patch that removes equalTupdesc optimization. Hmm. If we modify (during pg12, of course -- not now) partition tables that are created identical to their parent table so that they share the pg_type row, this would become useful. Unless there a reason why that change is completely unworkable, I'd just leave it there. (I claim that it works like that only because it used to work like that, not because it's impossible to make work the other way.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/18 22:40, Alvaro Herrera wrote: > Amit Langote wrote: >> On 2018/04/18 0:04, Alvaro Herrera wrote: >>> Amit Langote wrote: >>> >>>> I just confirmed my hunch that this wouldn't somehow do the right thing >>>> when the OID system column is involved. Like this case: >>> >>> This looks too big a patch to pursue now. I'm inclined to just remove >>> the equalTupdesc changes. >> >> OK. Here is the patch that removes equalTupdesc optimization. > > Hmm. If we modify (during pg12, of course -- not now) partition tables > that are created identical to their parent table so that they share the > pg_type row, this would become useful. Unless there a reason why that > change is completely unworkable, I'd just leave it there. (I claim that > it works like that only because it used to work like that, not because > it's impossible to make work the other way.) Yeah, I too have wondered in the past what it would take to make equalTupDescs() return true for parent and partitions. Maybe we can make it work by looking a bit harder than I did then. Although, just leaving it there now would mean we're adding a few cycles needlessly in the PG 11 code. Why not add that optimization when we surely know it can work? Thanks, Amit
Amit Langote wrote: > Yeah, I too have wondered in the past what it would take to make > equalTupDescs() return true for parent and partitions. Maybe we can make > it work by looking a bit harder than I did then. How about simply relaxing the tdtypeid test from equalTupleDescs? I haven't looked deeply but I think just checking whether or not both are RECORDOID might be sufficient, for typecache purposes. If we just remove the tdtypeid test, check-world passes. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 19, 2018 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Amit Langote wrote: >> Yeah, I too have wondered in the past what it would take to make >> equalTupDescs() return true for parent and partitions. Maybe we can make >> it work by looking a bit harder than I did then. > > How about simply relaxing the tdtypeid test from equalTupleDescs? I > haven't looked deeply but I think just checking whether or not both are > RECORDOID might be sufficient, for typecache purposes. That strike me as a very scary thing to do. There's code all over the system that may have non-obvious assumptions about the behavior of equalTupleDescs(), and I don't think we can have any confidence that nothing will break unless we do a detailed audit of all that code. > If we just remove the tdtypeid test, check-world passes. That does not reassure me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 19, 2018 at 12:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Apr 19, 2018 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> How about simply relaxing the tdtypeid test from equalTupleDescs? I >> haven't looked deeply but I think just checking whether or not both are >> RECORDOID might be sufficient, for typecache purposes. > > That strike me as a very scary thing to do. There's code all over the > system that may have non-obvious assumptions about the behavior of > equalTupleDescs(), and I don't think we can have any confidence that > nothing will break unless we do a detailed audit of all that code. +1. I think that it is plainly a bad idea to do something like that at this point in the cycle. -- Peter Geoghegan
Alvaro Herrera wrote: > Amit Langote wrote: > > > Yeah, I too have wondered in the past what it would take to make > > equalTupDescs() return true for parent and partitions. Maybe we can make > > it work by looking a bit harder than I did then. > > How about simply relaxing the tdtypeid test from equalTupleDescs? I > haven't looked deeply but I think just checking whether or not both are > RECORDOID might be sufficient, for typecache purposes. After looking at the code, I'm a bit nervous about doing this, because I don't fully understand what is going on in typcache, and what is the HeapTupleHeaderGetTypeId macro really doing. I'm afraid that if we confuse a table's tupdesc with one of its partition's , something entirely random might end up happening. Maybe this is completely off-base, but if so I'd like to have to proof. So I'm thinking of reverting that patch instead per your patch. While composing this we got emails from Robert and Peter G suggesting the same too, so consider it done. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/20 4:40, Alvaro Herrera wrote: > Alvaro Herrera wrote: >> Amit Langote wrote: >> >>> Yeah, I too have wondered in the past what it would take to make >>> equalTupDescs() return true for parent and partitions. Maybe we can make >>> it work by looking a bit harder than I did then. >> >> How about simply relaxing the tdtypeid test from equalTupleDescs? I >> haven't looked deeply but I think just checking whether or not both are >> RECORDOID might be sufficient, for typecache purposes. > > After looking at the code, I'm a bit nervous about doing this, because I > don't fully understand what is going on in typcache, and what is the > HeapTupleHeaderGetTypeId macro really doing. I'm afraid that if we > confuse a table's tupdesc with one of its partition's , something > entirely random might end up happening. > > Maybe this is completely off-base, but if so I'd like to have to proof. > So I'm thinking of reverting that patch instead per your patch. > > While composing this we got emails from Robert and Peter G suggesting > the same too, so consider it done. Thank you for committing the patch. Regards, Amit